Reverberation is a common problem for many speech technologies, such as automatic speech recognition (ASR) systems. This paper investigates the novel combination of precedence, binaural and statistical independence cues for enhancing reverberant speech, prior to ASR, under these adverse acoustical conditions when two microphone signals are available. Results of the enhancement are evaluated in terms of relevant signal measures and accuracy for both English and Polish ASR tasks. These show inconsistencies between the signal and recognition measures, although in recognition the proposed method consistently outperforms all other combinations and the spectral-subtraction baseline.
Speech enhancement is fundamental to many real-time speech applications, and it is particularly challenging in the single-channel case because only one data channel is available. In this paper, we propose a supervised single-channel speech enhancement algorithm based on a deep neural network (DNN) with less aggressive Wiener filtering as an additional DNN layer. During the training stage, the network learns to predict the magnitude spectra of the clean and noise signals from the acoustic features of the noisy input speech. Relative spectral transform-perceptual linear prediction (RASTA-PLP) is used to extract the acoustic features at the frame level, and an autoregressive moving average (ARMA) filter is applied to smooth the temporal trajectories of the extracted features. The trained network predicts the coefficients used to construct a ratio mask, with mean square error (MSE) as the objective cost function. The less aggressive Wiener filter is placed as an additional layer on top of the DNN to produce an enhanced magnitude spectrum. Finally, the noisy speech phase is used to reconstruct the enhanced speech. The experimental results demonstrate that the proposed DNN framework with less aggressive Wiener filtering outperforms the competing speech enhancement methods in terms of speech quality and intelligibility.
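The masking step described above can be illustrated with a minimal sketch. It assumes the network has already produced per-bin estimates of the speech and noise magnitudes. The abstract does not define the "less aggressive" Wiener filter, so the softened gain below (a Wiener gain raised to a hypothetical exponent of 0.5) is only one plausible reading, and all function names are illustrative.

```python
def wiener_gain(speech_mag, noise_mag, eps=1e-12):
    """Classical Wiener-type ratio mask for one time-frequency bin."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return s2 / (s2 + n2 + eps)

def softened_gain(speech_mag, noise_mag, exponent=0.5):
    """Assumed 'less aggressive' variant: an exponent below 1 raises the gain."""
    return wiener_gain(speech_mag, noise_mag) ** exponent

def enhance_frame(noisy_mag, speech_est, noise_est):
    """Apply the softened gain bin by bin to a noisy magnitude spectrum."""
    return [y * softened_gain(s, n)
            for y, s, n in zip(noisy_mag, speech_est, noise_est)]

# Hypothetical magnitudes for a three-bin frame.
noisy = [1.0, 2.0, 0.5]
enhanced = enhance_frame(noisy, speech_est=[1.0, 2.0, 0.1], noise_est=[1.0, 0.5, 0.5])
```

Because the gain never exceeds 1, each enhanced bin is at most as large as the noisy bin; the enhanced magnitude would then be combined with the noisy phase for reconstruction.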
The paper presents the results of sentence and logatome speech intelligibility tests conducted in rooms equipped with an induction loop for hearing aid users. Two rooms with different acoustic parameters were chosen. Twenty-two subjects with mild, moderate and severe hearing impairment who use hearing aids took part in the experiment. The intelligibility tests, composed of sentences or logatomes, were presented to the subjects at fixed measurement points in each room. It was shown that a sentence test is a more useful tool for measuring speech intelligibility in a room than a logatome test. It was also shown that an induction loop is a very efficient system for improving speech intelligibility. Additionally, the questionnaire data showed that the induction loop, apart from improving speech intelligibility, increased the subjects' general satisfaction with speech perception.
Speakers' emotional states are recognized from speech signals corrupted by additive white Gaussian noise (AWGN). The influence of white noise on a typical emotion recognition system is studied. The emotion classifier is implemented with a Gaussian mixture model (GMM). A Chinese speech emotion database is used for training and testing, which includes nine emotion classes (happiness, sadness, anger, surprise, fear, anxiety, hesitation, confidence and the neutral state). Two speech enhancement algorithms are introduced to improve emotion classification. In the experiments, the Gaussian mixture model is trained on clean speech data and tested under AWGN at various signal-to-noise ratios (SNRs). The emotion class model and the dimension space model are both adopted for the evaluation of the emotion recognition system. For the emotion class model, the nine emotion classes are classified. For the dimension space model, the arousal dimension and the valence dimension are each classified into positive or negative regions. The experimental results show that the speech enhancement algorithms consistently improve the performance of our emotion recognition system under various SNRs, and that positive emotions are more likely to be misclassified as negative emotions in a white noise environment.
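Testing under AWGN at a chosen SNR, as described above, amounts to scaling white Gaussian noise relative to the signal power. The following is a minimal sketch under that assumption; the sine wave is only a stand-in for a speech utterance, and the function names are illustrative rather than taken from the paper.

```python
import math
import random

def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that the expected SNR equals snr_db."""
    signal_power = sum(v * v for v in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]

def measured_snr_db(clean, noisy):
    """SNR actually realized, computed from the clean and noisy signals."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_energy / noise_energy)

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(8000)]  # stand-in signal
noisy = add_awgn(clean, snr_db=5.0, rng=rng)
realized = measured_snr_db(clean, noisy)
```

The realized SNR fluctuates slightly around the target because the noise is a finite random sample.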
LABLITA-Suite. Resources for the acquisition of Italian as a second language – LABLITA-Suite provides technology-enhanced learning resources for the acquisition of Italian L2. IMAGACT allows for mastering the semantic properties of action verbs in the early phases of language acquisition. The LABLITA corpus of Spoken Italian can be used for training learners for face-to-face conversations. RIDIRE and CORDIC provide corpus linguistic tools for accessing Italian phraseology, which is useful for enhancing writing capabilities in the various domains of language usage.
The most challenging task in speech enhancement is tracking non-stationary noise over long speech segments and at low signal-to-noise ratio (SNR). Various speech enhancement techniques have been proposed, but they are inaccurate in tracking highly non-stationary noise. For this reason, an Empirical Mode Decomposition and Hurst-based (EMDH) approach was proposed to enhance signals corrupted by non-stationary acoustic noise. Hurst exponent statistics were adopted to identify and select the set of Intrinsic Mode Functions (IMFs) most affected by the noise components, and the speech signal was reconstructed from the least corrupted IMFs. Although this approach increases the SNR, its time and resource consumption are high, and it still requires significant improvement under non-stationary noise. Hence, in this article, the EMDH approach is enhanced with a Sliding Window (SW) technique. In this SWEMDH approach, the EMD is computed over a small window that slides along the time axis. The sliding window depends on the signal frequency band. Possible discontinuities in the IMFs between windows are prevented by fixing the total number of modes and the number of sifting iterations a priori. For each mode, the number of sifting iterations is determined by decomposing many signal windows with the standard algorithm and averaging the number of sifting steps for that mode. With this approach, the time complexity is reduced significantly while maintaining a suitable quality of decomposition. Finally, the experimental results show considerable improvements in speech enhancement under non-stationary noise environments.
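To make the role of the Hurst exponent concrete, here is a minimal sketch of one simple estimator: it measures how the spread of partial sums grows with the window length and takes the log-log slope as H. This is an illustrative assumption, not necessarily the estimator used in the EMDH work, and the test signals are synthetic stand-ins for IMFs. Under this convention, uncorrelated noise gives H near 0.5 and a smooth, persistent component gives H near 1.

```python
import math
import random

def hurst_exponent(x, max_lag=20):
    """Estimate H from the growth of the spread of partial sums with lag."""
    mean_x = sum(x) / len(x)
    # Cumulative sum of the mean-removed signal.
    path, total = [0.0], 0.0
    for v in x:
        total += v - mean_x
        path.append(total)
    log_lag, log_std = [], []
    for lag in range(2, max_lag + 1):
        diffs = [path[i + lag] - path[i] for i in range(len(path) - lag)]
        m = sum(diffs) / len(diffs)
        std = math.sqrt(sum((d - m) ** 2 for d in diffs) / len(diffs))
        log_lag.append(math.log(lag))
        log_std.append(math.log(std))
    # Least-squares slope of log(std) against log(lag) is the estimate of H.
    n = len(log_lag)
    mx, my = sum(log_lag) / n, sum(log_std) / n
    num = sum((a - mx) * (b - my) for a, b in zip(log_lag, log_std))
    den = sum((a - mx) ** 2 for a in log_lag)
    return num / den

rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(4000)]              # noise-like component
slow = [math.sin(2 * math.pi * n / 2000) for n in range(4000)]  # smooth component
h_white = hurst_exponent(white)
h_slow = hurst_exponent(slow)
```

A selection rule could then compare each component's estimate against a threshold; any such threshold would have to be chosen for the noise types at hand.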
This paper proposes a speech enhancement method that uses multiple scales and multiple thresholds of the auditory perception wavelet transform and is suitable for low signal-to-noise ratio (SNR) environments. The method reduces noise by thresholding the auditory perception wavelet transform parameters of a speech signal according to the auditory masking effect of the human ear. To prevent high-frequency loss during noise suppression, we first make a voicing decision on the speech signal. We then process the unvoiced and voiced segments with different thresholds and different decision rules. Lastly, we perform objective and subjective tests on the enhanced speech. The results show that, compared with spectral subtraction methods, our method keeps the components of unvoiced sounds intact while suppressing the residual noise and the background noise. Thus, the enhanced speech has better clarity and intelligibility.
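The idea of treating voiced and unvoiced segments with different thresholds can be sketched with ordinary soft thresholding of transform coefficients. The abstract does not give the thresholding rule or the threshold values, so the soft-threshold function and both numeric thresholds below are assumptions chosen purely for illustration.

```python
import math

def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero; zero out those below the threshold."""
    out = []
    for c in coeffs:
        shrunk = abs(c) - threshold
        out.append(math.copysign(shrunk, c) if shrunk > 0 else 0.0)
    return out

def denoise_segment(coeffs, voiced, voiced_thr=1.0, unvoiced_thr=0.25):
    """Use a gentler (hypothetical) threshold for unvoiced segments so that
    weak high-frequency speech components are less likely to be removed."""
    return soft_threshold(coeffs, voiced_thr if voiced else unvoiced_thr)

coeffs = [3.0, -0.5, -2.0, 0.2]
voiced_out = denoise_segment(coeffs, voiced=True)
unvoiced_out = denoise_segment(coeffs, voiced=False)
```

With these values the unvoiced branch retains the small coefficient -0.5 (shrunk to -0.25) that the voiced branch removes.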
Although various speech enhancement techniques have been developed for different applications, existing methods are limited in environments with high ambient noise levels. Speech presence probability (SPP) estimation is a speech enhancement technique that reduces speech distortion, especially in low signal-to-noise ratio (SNR) scenarios. In this paper, we propose a new SPP estimator, improved with two-dimensional (2D) Teager energy operators (TEOs), for speech enhancement in the time-frequency (T-F) domain. The wavelet packet transform (WPT), a multiband decomposition technique, is used to concentrate the energy distribution of the speech components. A minimum mean-square error (MMSE) estimator is derived from the generalized gamma distribution speech model in the WPT domain. Speech samples corrupted by environmental and occupational noises (i.e., machine shop, factory and station) at different input SNRs are used to validate the proposed algorithm. The results suggest that the proposed method achieves a significant improvement in perceptual quality compared with four conventional speech enhancement algorithms (i.e., MMSE-84, MMSE-04, Wiener-96, and BTW).
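For orientation, the standard one-dimensional discrete Teager energy operator is psi[x(n)] = x(n)^2 - x(n-1)x(n+1). The sketch below implements only this 1D form; the 2D time-frequency variant proposed in the paper is not specified in the abstract and is not reproduced here. For a sampled sinusoid A cos(wn), the operator returns the constant A^2 sin^2(w), which the example uses as a check.

```python
import math

def teager_energy(x):
    """Discrete Teager energy operator for the interior samples of x."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(200)]
energy = teager_energy(tone)
expected = (amplitude * math.sin(omega)) ** 2
```

The output is two samples shorter than the input because the operator needs one neighbour on each side.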
Subspace-based methods have been used effectively to estimate enhanced speech from noisy speech samples. In traditional subspace approaches, a critical step is separating the two invariant subspaces associated with the signal and the noise via subspace decomposition, which is often performed by singular-value decomposition or eigenvalue decomposition. However, these decomposition algorithms are highly sensitive to the presence of large corruptions, resulting in a large amount of residual noise in the enhanced speech in low signal-to-noise ratio (SNR) situations. In this paper, a subspace method based on joint low-rank and sparse matrix decomposition (JLSMD) is proposed for speech enhancement. In the proposed method, we first structure the corrupted data as a Toeplitz matrix and estimate the effective rank of the underlying clean speech matrix. The subspace decomposition is then performed by means of JLSMD, where the decomposed low-rank part corresponds to the enhanced speech and the sparse part corresponds to the noise signal. An extensive set of experiments has been carried out for both white Gaussian noise and real-world noise. Experimental results show that the proposed method performs better than conventional methods in many types of strong noise conditions, yielding less residual noise and lower speech distortion.
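The first step, arranging a frame of samples as a Toeplitz matrix, can be sketched as follows. The abstract does not state the matrix dimensions or the indexing convention, so the layout below (each row a one-sample shift of the frame, with constant values along every diagonal) and the function name are assumptions for illustration. The low-rank and sparse decomposition itself is not shown.

```python
def toeplitz_from_frame(frame, rows):
    """Arrange a frame of N samples as a rows x (N - rows + 1) Toeplitz matrix."""
    cols = len(frame) - rows + 1
    if rows < 1 or cols < 1:
        raise ValueError("rows must be between 1 and len(frame)")
    # The entry depends only on (j - i), which is what makes the matrix Toeplitz.
    return [[frame[rows - 1 - i + j] for j in range(cols)] for i in range(rows)]

matrix = toeplitz_from_frame([1, 2, 3, 4, 5], rows=3)
```

For this toy frame the rows are [3, 4, 5], [2, 3, 4] and [1, 2, 3].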
Many industrial environments are exposed to high-level noise, sometimes much higher than the level of speech, and verbal communication is then practically infeasible. To increase speech intelligibility, appropriate speech enhancement algorithms can be used. It is impossible to filter the noise completely out of the acquired signal with a conventional filter, for two reasons. First, the frequency contents of the speech and the noise overlap. Second, the noise properties are subject to change. An adaptive realisation of the Wiener-based approach can, however, be applied. Two structures are possible. One is the line enhancer, in which the predictive realisation of the Wiener approach is used. The benefit of this structure is that it does not require additional apparatus. The second structure takes advantage of the high level of noise. Under such conditions, placing another microphone, even close to the primary one, can provide a reference signal that is well correlated with the noise disturbing the speech and carries little information about the speech. The classical Wiener filter can then be used to produce an estimate of the noise from the reference signal, and that estimate can be subtracted from the disturbed speech. Both algorithms are verified on data obtained from a real industrial environment. For laboratory experiments, a G.R.A.S. artificial head and two microphones are used, one at the back side of an earplug and the other at the mouth.
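The second structure, a reference microphone feeding an adaptive filter whose output is subtracted from the primary signal, can be sketched with a normalised LMS (NLMS) update. NLMS is used here as one common adaptive realisation of the Wiener solution; the abstract does not say which adaptation rule the authors use. The synthetic signals, the two-tap noise path, the filter length and the step size are all assumptions.

```python
import math
import random

def nlms_noise_canceller(primary, reference, taps=4, mu=0.1, eps=1e-8):
    """Estimate the noise in `primary` from `reference` and subtract it."""
    weights = [0.0] * taps
    buffer = [0.0] * taps          # most recent reference samples, newest first
    output = []
    for d, r in zip(primary, reference):
        buffer = [r] + buffer[:-1]
        noise_estimate = sum(w * b for w, b in zip(weights, buffer))
        error = d - noise_estimate  # error signal doubles as the enhanced speech
        norm = sum(b * b for b in buffer) + eps
        weights = [w + mu * error * b / norm for w, b in zip(weights, buffer)]
        output.append(error)
    return output

rng = random.Random(1)
n_samples = 4000
speech = [0.5 * math.sin(2 * math.pi * 0.01 * n) for n in range(n_samples)]
reference = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
# Hypothetical acoustic path from the reference microphone to the primary one.
noise = [0.8 * reference[n] + (0.3 * reference[n - 1] if n > 0 else 0.0)
         for n in range(n_samples)]
primary = [s + v for s, v in zip(speech, noise)]
enhanced = nlms_noise_canceller(primary, reference)

half = n_samples // 2              # evaluate after the filter has converged
noise_power = sum(v * v for v in noise[half:]) / half
residual_power = sum((e - s) ** 2
                     for e, s in zip(enhanced[half:], speech[half:])) / half
```

In this toy setting the reference contains no speech at all, which is the idealised version of the "lacking the information about the speech" condition; any speech leakage into the reference would cause some cancellation of the speech itself.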