Search for: [Keywords = "speech recognition"]

Enhancing Speech Recognition in Adverse Listening Environments: The Impact of Brief Musical Training on Older Adults

Akhila R. Nandakumar, Haralakatta Shivananjappa Somashekara, Vibha Kanagokar, Arivudai Nambi Pitchaimuthu

Archives of Acoustics | 2024 | vol. 49 | No 1 | 3-9 | DOI: 10.24425/aoa.2023.146825

Keywords: musical training; Carnatic music; speech recognition in noise; speech recognition in reverberation

Abstract

The present research investigated the effects of short-term musical training on speech recognition in adverse listening conditions in older adults. A total of 30 Kannada-speaking participants with no history of gross otologic, neurologic, or cognitive problems were divided equally into experimental (M = 63 years) and control (M = 65 years) groups. Baseline and follow-up assessments of speech recognition in noise (SNR50) and in reverberation were carried out for both groups. The participants in the experimental group received Carnatic classical music training, which lasted for seven days. The Bayesian likelihood estimates revealed no difference in SNR50 or in speech recognition scores in reverberation between the baseline and follow-up assessments for the control group. In the experimental group, by contrast, the SNR50 decreased and speech recognition scores improved following musical training, suggesting a positive impact of music training. The improved speech recognition performance suggests that short-term musical training using Carnatic music can be used as a potential tool to improve speech recognition abilities in adverse listening conditions in older adults.

Authors and Affiliations

Akhila R. Nandakumar (1)

Haralakatta Shivananjappa Somashekara (1)

Vibha Kanagokar (1)

Arivudai Nambi Pitchaimuthu (1)

(1) Department of Audiology and Speech-Language Pathology, Kasturba Medical College, Mangalore, Manipal Academy of Higher Education

Sentence Recognition in the Presence of Competing Speech Messages Presented in Audiometric Booths with Reverberation Times of 0.4 and 0.6 Seconds

Kim Abouchacra, Janet Koehnke, Joan Besing, Tomasz Letowski

Archives of Acoustics | 2011 | vol. 36 | No 1 | 3-14 | DOI: 10.2478/v10168-011-0001-4

Keywords: sound field testing; reverberation; speech recognition

Abstract

This study examined whether differences in reverberation time (RT) between typical sound field test rooms used in audiology clinics have an effect on speech recognition in multi-talker environments. Separate groups of participants listened to target speech sentences presented simultaneously with 0 to 3 competing sentences through four spatially separated loudspeakers in two sound field test rooms having RT = 0.6 sec (Site 1: N = 16) and RT = 0.4 sec (Site 2: N = 12). Speech recognition scores (SRSs) for the Synchronized Sentence Set (S3) test and subjective estimates of perceived task difficulty were recorded. The results indicate that the change in room RT from 0.4 to 0.6 sec did not significantly influence SRSs in quiet or in the presence of one competing sentence. However, this small change in RT affected SRSs when 2 and 3 competing sentences were present, resulting in mean SRSs that were about 8-10% better in the room with RT = 0.4 sec. Perceived task difficulty ratings increased as the complexity of the task increased, with average ratings similar across test sites for each level of sentence competition. These results suggest that site-specific normative data must be collected for sound field rooms if clinicians would like to use two or more directional speech maskers during routine sound field testing.

Phoneme Segmentation Based on Wavelet Spectra Analysis

Bartosz Ziółko, Mariusz Ziółko, Suresh Manandhar, Richard Wilson

Archives of Acoustics | 2011 | vol. 36 | No 1 | 29-47 | DOI: 10.2478/v10168-011-0003-2

Keywords: speech recognition; speech segmentation; discrete wavelet transform

Abstract

A phoneme segmentation method based on the analysis of discrete wavelet transform spectra is described. The localization of phoneme boundaries is particularly useful in speech recognition. It enables one to use more accurate acoustic models, since the length of phonemes provides more information for parametrization. Our method relies on the values of power envelopes and their first derivatives for six frequency subbands. Specific scenarios that are typical for phoneme boundaries are searched for. Discrete times with such events are noted and graded using a distribution-like event function, which represents the change of the energy distribution in the frequency domain. The exact definition of this method is described in the paper. The final decision on the localization of boundaries is taken by analysis of the event function. Boundaries are, therefore, extracted using information from all subbands. The method was developed on a small set of hand-segmented Polish words and tested on another, large corpus containing 16 425 utterances. A recall and precision measure specifically designed to measure the quality of speech segmentation was adapted by using fuzzy sets. With this measure, an F-score of 72.49% was obtained.

Frequency Selection Based Separation of Speech Signals with Reduced Computational Time Using Sparse NMF

Yash Vardhan Varshney, Omar Farooq, Zia Ahmad Abbasi, Musiur Raza Abidi

Archives of Acoustics | 2017 | vol. 42 | No 2 | DOI: 10.1515/aoa-2017-0031

Keywords: sparse NMF; mixed speech recognition; machine learning

Phase Autocorrelation Bark Wavelet Transform (PACWT) Features for Robust Speech Recognition

Sayf A. Majeed, Hafizah Husain, Salina A. Samad

Archives of Acoustics | 2015 | vol. 40 | No 1 | 25-31 | DOI: 10.1515/aoa-2015-0004

Keywords: speech recognition; feature extraction; phase autocorrelation; wavelet transform

Abstract

In this paper, a new feature-extraction method is proposed to improve the robustness of speech recognition systems. This method combines the benefits of phase autocorrelation (PAC) with the Bark wavelet transform. PAC uses the angle to measure correlation instead of the traditional autocorrelation measure, whereas the Bark wavelet transform is a special type of wavelet transform that is designed specifically for speech signals. The features extracted with this combined method are called phase autocorrelation Bark wavelet transform (PACWT) features. The speech recognition performance of the PACWT features is evaluated and compared to the conventional feature extraction method, mel frequency cepstrum coefficients (MFCC), using the TI-Digits database under different noise types and noise levels. This database has been divided into male and female data. The results show that the word recognition rate using the PACWT features for noisy male data (white noise at 0 dB SNR) is 60%, whereas it is 41.35% for the MFCC features under identical conditions.
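
The idea of measuring correlation by an angle can be illustrated with a minimal sketch. It assumes the common formulation in which a frame is compared with circularly shifted copies of itself and the angle between the two vectors replaces the dot product; the function name `phase_autocorrelation` and the handling of an all-zero frame are illustrative assumptions, not the authors' implementation, and the Bark wavelet stage is not shown.

```python
import math

def phase_autocorrelation(frame):
    """Angle (radians) between a frame and each circular shift of itself.

    Assumed formulation: R(k) = sum_i x[i] * x[(i + k) mod N] equals
    |x|^2 * cos(theta_k), so theta_k = acos(R(k) / |x|^2).
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        # The angle is undefined for an all-zero frame; returning zeros
        # is an arbitrary convention chosen for this sketch.
        return [0.0] * n
    angles = []
    for k in range(n):
        r = sum(frame[i] * frame[(i + k) % n] for i in range(n))
        # Clamp to [-1, 1] to guard against floating-point round-off.
        cosine = max(-1.0, min(1.0, r / energy))
        angles.append(math.acos(cosine))
    return angles

# Toy frame (one period of a sampled sinusoid) and a louder copy of it.
frame = [0.0, 1.0, 0.0, -1.0]
pac = phase_autocorrelation(frame)
pac_scaled = phase_autocorrelation([3.0 * x for x in frame])
```

Because the angle does not depend on the frame's energy, `pac` and `pac_scaled` agree, which is one intuition for why an angle-based measure may be less sensitive to level changes than the raw autocorrelation.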

System for Automatic Transcription of Sessions of the Polish Senate

Krzysztof Marasek, Danijel Koržinek, Łukasz Brocki

Archives of Acoustics | 2014 | vol. 39 | No 4 | 501-509 | DOI: 10.2478/aoa-2014-0054

Keywords: large vocabulary speech recognition; language modelling; transcription; transliteration; subtitles

Abstract

This paper describes the research behind a Large-Vocabulary Continuous Speech Recognition (LVCSR) system for the transcription of Senate speeches in the Polish language. The system utilizes several components: a phonetic transcription system, language and acoustic model training systems, a Voice Activity Detector (VAD), an LVCSR decoder, and a subtitle generator and presentation system. Some of the modules relied on already available tools and some had to be built from scratch, but the authors ensured that they used the most advanced techniques available to them at the time. Finally, several experiments were performed to compare the performance of more modern and more conventional technologies.

Deep Belief Neural Networks and Bidirectional Long-Short Term Memory Hybrid for Speech Recognition

Łukasz Brocki, Krzysztof Marasek

Archives of Acoustics | 2015 | vol. 40 | No 2 | 191-195 | DOI: 10.1515/aoa-2015-0021

Keywords: deep belief neural networks; long-short term memory; bidirectional recurrent neural networks; speech recognition; large vocabulary continuous speech recognition

Abstract

This paper describes a Deep Belief Neural Network (DBNN) and Bidirectional Long-Short Term Memory (LSTM) hybrid used as an acoustic model for Speech Recognition. It was demonstrated by many independent researchers that DBNNs exhibit superior performance to other known machine learning frameworks in terms of speech recognition accuracy. Their superiority comes from the fact that these are deep learning networks. However, a trained DBNN is simply a feed-forward network with no internal memory, unlike Recurrent Neural Networks (RNNs), which are Turing complete and do possess internal memory, thus allowing them to make use of longer context. In this paper, an experiment is performed to make a hybrid of a DBNN with an advanced bidirectional RNN used to process its output. Results show that the use of the new DBNN-BLSTM hybrid as the acoustic model for Large Vocabulary Continuous Speech Recognition (LVCSR) increases word recognition accuracy. However, the new model has many parameters, and in some cases it may suffer from performance issues in real-time applications.

Laughter Classification Using Deep Rectifier Neural Networks with a Minimal Feature Subset

Gábor Gosztolya, András Beke, Tilda Neuberger, László Tóth

Archives of Acoustics | 2016 | vol. 41 | No 4 | 669-682 | DOI: 10.1515/aoa-2016-0064

Keywords: speech recognition; speech technology; computational paralinguistics; laughter detection; deep neural networks

Abstract

Laughter is one of the most important paralinguistic events, and it has specific roles in human conversation. The automatic detection of laughter occurrences in human speech can aid automatic speech recognition systems as well as some paralinguistic tasks such as emotion detection. In this study we apply Deep Neural Networks (DNN) for laughter detection, as this technology is nowadays considered state-of-the-art in similar tasks like phoneme identification. We carry out our experiments using two corpora containing spontaneous speech in two languages (Hungarian and English). Also, as we find it reasonable that not all frequency regions are required for efficient laughter detection, we will perform feature selection to find the sufficient feature subset.

Speech Recognition in an Enclosure with a Long Reverberation Time

Jędrzej Kociński, Edward Ozimek

Archives of Acoustics | 2016 | vol. 41 | No 2 | 255-264 | DOI: 10.1515/aoa-2016-0025

Keywords: speech intelligibility; speech recognition; sentence test; reverberation time; clarity; speech transmission index

Abstract

The aim of this work was to measure subjective speech intelligibility in an enclosure with a long reverberation time and to compare the results with objective parameters. Impulse Responses (IRs) were first determined with a dummy head at different measurement points in the enclosure. The following objective parameters were calculated with Dirac 4.1 software: Reverberation Time (RT), Early Decay Time (EDT), weighted Clarity (C50) and Speech Transmission Index (STI). For the chosen measurement points, the IRs were convolved with the Polish Sentence Test (PST) and logatome tests. The PST was presented against a background of babble noise, and the speech reception threshold (SRT, i.e. the SNR yielding 50% speech intelligibility) was evaluated for those points. The relationship between sentence and logatome recognition and the STI was determined. It was found that the final SRT data are well correlated with the STI and can be expressed by a psychometric function. The difference between the SRT determined without reverberation and the SRT determined in reverberant conditions appeared to be a good measure of the effect of reverberation on speech intelligibility in a room. In addition, speech intelligibility with and without the sound amplification system installed in the enclosure was compared.

An Effective Speaker Clustering Method using UBM and Ultra-Short Training Utterances

Robert Hossa, Ryszard Makowski

Archives of Acoustics | 2016 | vol. 41 | No 1 | 107-118 | DOI: 10.1515/aoa-2016-0011

Keywords: automatic speech recognition; interindividual difference compensation; speaker clustering; universal background model; GMM weighting factor adaptation

Abstract

The same speech sounds (phones) produced by different speakers can sometimes exhibit significant differences. Therefore, it is essential to use algorithms that compensate for these differences in ASR systems. Speaker clustering is an attractive solution to the compensation problem, as it does not require long utterances or high computational effort at the recognition stage. The report proposes a clustering method based solely on adaptation of UBM model weights. This solution has turned out to be effective even when using a very short utterance. The obtained improvement in frame recognition quality, measured by means of the frame error rate, is over 5%. It is noteworthy that this improvement concerns all vowels, even though the clustering discussed in this report was based only on the phoneme a. This indicates a strong correlation between the articulation of different vowels, which is probably related to the size of the vocal tract.

Application of Teager Energy Operator on Linear and Mel Scales for Whispered Speech Recognition

Branko R. Marković, Miomir Mijić, Jovan Galić

Archives of Acoustics | 2018 | vol. 43 | No 1 | DOI: 10.24425/118075

Keywords: Teager energy operator; cepstral mean subtraction; whispered speech recognition; linear scale; mel scale; dynamic time warping; hidden Markov models

Effect of Time-domain Windowing on Isolated Speech Recognition System Performance

Ananthakrishna Thalengala, H. Anitha, T. Girisha

International Journal of Electronics and Telecommunications | 2022 | vol. 68 | No 1 | 161-166 | DOI: 10.24425/ijet.2022.139856

Keywords: Hidden Markov model (HMM); Isolated speech recognition (ISR) system; Kannada language; Mono-phone model; Mel frequency cepstral coefficients (MFCC)

Abstract

Speech recognition systems extract textual data from the speech signal. Research in the speech recognition domain is challenging due to the large variability of the speech signal. A variety of signal processing and machine learning techniques have been explored to achieve better recognition accuracy. Speech is highly non-stationary in nature, and analysis is therefore carried out over a short time-domain window or frame. In the speech recognition task, cepstral features (Mel frequency cepstral coefficients, MFCC) are commonly used and are extracted for each short time frame. The effectiveness of the features depends on the duration of the time window chosen. The present study investigates the optimal time-window duration for extracting cepstral features in the context of the speech recognition task. A speaker-independent speech recognition system for the Kannada language has been considered for the analysis. In the current work, speech utterances from a Kannada news corpus recorded from different speakers have been used to create the speech database. The Hidden Markov Model Toolkit (HTK) has been used to implement the speech recognition system. The MFCCs along with their first and second derivative coefficients are used as feature vectors. The pronunciation dictionary required for the study has been built manually for a mono-phone system. Experiments have been carried out and the results analyzed for different time-window lengths. An overlapping Hamming window has been used in this study. The best average word recognition accuracy of 61.58% has been obtained for a window length of 110 msec. This recognition accuracy is comparable with similar work found in the literature. The experiments have shown that the best word recognition performance can be achieved by tuning the window length to its optimum value.
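
The framing step that the abstract refers to can be sketched as follows. This is a minimal illustration of cutting a signal into overlapping Hamming-windowed frames, not the HTK front end used in the study. The 110 msec window length comes from the abstract; the 16 kHz sample rate, the 55 msec hop, the test tone, and the function names are assumptions made for the example.

```python
import math

def hamming(n_samples):
    """Symmetric Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def frame_signal(signal, sample_rate, window_ms, hop_ms):
    """Split a signal into overlapping frames and apply a Hamming window."""
    win_len = int(round(sample_rate * window_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = hamming(win_len)
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - win_len + 1, hop_len):
        segment = signal[start:start + win_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames

# Assumed values for illustration only (not taken from the paper):
sample_rate = 16000                      # Hz
signal = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate)
          for t in range(sample_rate)]   # one second of a 440 Hz tone

# 110 msec window as reported in the abstract; the 55 msec hop is hypothetical.
frames = frame_signal(signal, sample_rate, window_ms=110, hop_ms=55)
```

In a full front end, each windowed frame would then go through the MFCC computation. Changing `window_ms` is the kind of tuning the abstract describes: longer windows give finer frequency resolution but blur rapid changes in the signal.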

Authors and Affiliations

Ananthakrishna Thalengala (1)

H. Anitha (1)

T. Girisha (1)

(1) Department of Electronics and Communication Engineering, Manipal Institute of Technology (MIT), Manipal Academy of Higher Education (MAHE), Manipal, Karnataka State, India

Development of Speaker Voice Identification Using Main Tone Boundary Statistics for Applying To Robot-Verbal Systems

Yedilkhan Amirgaliyev, Timur Musabayev, Didar Yedilkhan, Waldemar Wójcik, Zhazira Amirgaliyeva

International Journal of Electronics and Telecommunications | 2020 | vol. 66 | No 3 | 583-588 | DOI: 10.24425/ijet.2020.134015

Keywords: speaker voice identification; voice interface (FXO); human being and robot interrelation (HRI); speech recognition; statistics of voice fundamental tone; computer-aided learning; neural network

Abstract

This paper presents a basic speaker identification system. It discusses the application and use of voice interfaces, in particular speaker voice identification in communication between robots and humans. An information system for automatic speaker identification by voice, intended for robot-verbal systems, is described. Algorithms and computer-aided learning libraries were reviewed, and ALGLIB was selected as the most appropriate according to the required criteria. The performance of the identification model was assessed for different sets of the voice fundamental tone. The percentage of incorrectly classified speaker identification cases was used as the accuracy criterion.
