Acta Acustica
Volume 6, 2022
Topical Issue – Auditory models: from binaural processing to multimodal cognition
Article Number 25
Number of pages: 14
Published online 27 June 2022
  1. F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller: Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR, in: International Conference on Latent Variable Analysis and Signal Separation, pp. 91–99.
  2. I. Fedorov, M. Stamenovic, C. Jensen, L.-C. Yang, A. Mandell, Y. Gan, M. Mattina, P.N. Whatmough: TinyLSTMs: Efficient neural speech enhancement for hearing aids, in: Proc. Interspeech 2020, 2020, pp. 4054–4058.
  3. J. Chen, D. Wang: Long short-term memory for speaker generalization in supervised speech separation. The Journal of the Acoustical Society of America 141, 6 (2017) 4705–4714.
  4. C. Xu, W. Rao, X. Xiao, E.S. Chng, H. Li: Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6–10.
  5. S. Ghorbani, A.E. Bulut, J.H. Hansen: Advancing multi-accented LSTM-CTC speech recognition using a domain specific student-teacher learning paradigm, in: IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 29–35.
  6. H. Kayser, J. Anemüller: A discriminative learning approach to probabilistic acoustic source localization, in: Proc. IWAENC 2014 – International Workshop on Acoustic Echo and Noise Control, 2014, pp. 100–104.
  7. C. Volker, A. Warzybok, S.M.A. Ernst: Comparing binaural pre-processing strategies III. Trends in Hearing 19 (2015) 1–18.
  8. S.R.S. Bissmeyer, R.L. Goldsworthy: Adaptive spatial filtering improves speech reception in noise while preserving binaural cues. The Journal of the Acoustical Society of America 142, 3 (2017) 1441–1453.
  9. K. Adiloğlu, H. Kayser, R.M. Baumgärtel, S. Rennebeck, M. Dietz, V. Hohmann: A binaural steering beamformer system for enhancing a moving speech source. Trends in Hearing 19 (2015) 1–13.
  10. D. Marquardt, S. Doclo: Performance comparison of bilateral and binaural MVDR-based noise reduction algorithms in the presence of DOA estimation errors, in: Speech Communication; 12. ITG Symposium, 2016, pp. 1–5.
  11. D. Marquardt, S. Doclo: Noise power spectral density estimation for binaural noise reduction exploiting direction of arrival estimates, in: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 234–238.
  12. J. Xiao, Z.-Q. Luo, I. Merks, T. Zhang: A robust adaptive binaural beamformer for hearing devices, in: 51st Asilomar Conference on Signals, Systems, and Computers, 2017, pp. 1885–1889.
  13. H. Hermansky, E. Variani, V. Peddinti: Mean temporal distance: Predicting ASR error from temporal properties of speech signal, in: Proc. ICASSP, 2013, pp. 7423–7426.
  14. S.H. Mallidi, T. Ogawa, H. Hermansky: Uncertainty estimation of DNN classifiers, in: Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2016, pp. 283–288.
  15. K. Kintzley, A. Jansen, H. Hermansky: Event selection from phone posteriorgrams using matched filters, in: Proc. Interspeech, 2011, pp. 1905–1908.
  16. B.T. Meyer, S.H. Mallidi, H. Kayser, H. Hermansky: Predicting error rates for unknown data in automatic speech recognition, in: Proc. ICASSP, 2017, pp. 5330–5334.
  17. B.T. Meyer, S.H. Mallidi, A.M. Castro Martinez, G. Payá-Vayá, H. Kayser, H. Hermansky: Performance monitoring for automatic speech recognition in noisy multi-channel environments, in: IEEE Workshop on Spoken Language Technology, 2016, pp. 50–56.
  18. A.M. Castro Martinez, L. Gerlach, G. Payá-Vayá, H. Hermansky, J. Ooster, B.T. Meyer: DNN-based performance measures for predicting error rates in automatic speech recognition and optimizing hearing aid parameters. Speech Communication 106 (2019) 44–56.
  19. J. Barker, M. Cooke: Modelling speaker intelligibility in noise. Speech Communication 49 (2007) 402–417.
  20. C. Spille, S.D. Ewert, B. Kollmeier, B.T. Meyer: Predicting speech intelligibility with deep neural networks. Computer Speech & Language 48 (2018) 51–66.
  21. N. Parihar, J. Picone, D. Pearce, H. Hirsch: Performance analysis of the Aurora large vocabulary baseline system, in: Proc. of Eurospeech'03, 2003, pp. 10–13.
  22. H. Kayser, S.D. Ewert, J. Anemüller, T. Rohdenburg, V. Hohmann, B. Kollmeier: Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses. EURASIP Journal on Advances in Signal Processing 2009 (2009) 298605.
  23. K. Wagener, T. Brand, B. Kollmeier: Development and evaluation of a German sentence test I: Design of the Oldenburg sentence test. Zeitschrift für Audiologie/Audiological Acoustics 38 (1999) 4–15.
  24. BBC: BBC Sound Effects Library, 1991.
  25. C. Knapp, G. Carter: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing 24, 4 (1976) 320–327.
  26. B.E. Boser, I.M. Guyon, V.N. Vapnik: A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, New York, NY, USA: ACM, 1992, pp. 144–152.
  27. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9 (2008) 1871–1874.
  28. J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, N.L. Dahlgren, V. Zue: TIMIT Acoustic-Phonetic Continuous Speech Corpus, CD-ROM, 1993.
  29. W.A. Dreschler, H. Verschuure, C. Ludvigsen, S. Westermann: Artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment. Audiology 40, 3 (2001) 148–157.
  30. H. Cox, R. Zeskind, M. Owen: Robust adaptive beamforming. IEEE Transactions on Acoustics, Speech, and Signal Processing 35, 10 (1987) 1365–1376.
  31. D. Marquardt, E. Hadad, S. Gannot, S. Doclo: Theoretical analysis of linearly constrained multi-channel Wiener filtering algorithms for combined noise reduction and binaural cue preservation in binaural hearing aids. IEEE Transactions on Audio, Speech and Language Processing 23, 12 (2015) 2384–2397.
  32. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely: The Kaldi speech recognition toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011, pp. 1–4.
  33. D. Pearce, H.-G. Hirsch: The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in: ISCA ITRW ASR2000, 2000, pp. 29–32.
  34. A. Jansen, P. Niyogi: Point process models for spotting keywords in continuous speech. IEEE Transactions on Audio, Speech and Language Processing 17, 8 (2009) 1457–1470.
  35. S. Okawa, E. Bocchieri, A. Potamianos: Multi-band speech recognition in noisy environments, in: Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2, 1998, pp. 641–644.
  36. C. Spille, H. Kayser, H. Hermansky, B.T. Meyer: Assessing speech quality in speech-aware hearing aids based on phoneme posteriorgrams, in: Proc. Interspeech, 2016, pp. 1755–1759.
  37. L. Sari, N. Moritz, T. Hori, J. Le Roux: Unsupervised speaker adaptation using attention-based speaker memory for end-to-end ASR, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7384–7388.
  38. G. Saon, H. Soltau, D. Nahamoo, M. Picheny: Speaker adaptation of neural network acoustic models using i-vectors, in: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 55–59.
  39. Z. Huang, S.M. Siniscalchi, C.-H. Lee: A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition. Neurocomputing 218 (2016) 448–459.
  40. B. Tessendorf, A. Bulling, D. Roggen, T. Stiefmeier, M. Feilner, P. Derleth, G. Tröster: Recognition of hearing needs from body and eye movements to improve hearing instruments. 2011, pp. 314–331.
  41. A. Favre-Felix, R. Hietkamp, C. Graversen, T. Dau, T. Lunner: Steering of audio input in hearing aids by eye gaze through electrooculography, in: Proceedings of the International Symposium on Auditory and Audiological Research, Vol. 6, 2017, pp. 135–142.
  42. G. Grimm, H. Kayser, M. Hendrikse, V. Hohmann: A gaze-based attention model for spatially-aware hearing aids, in: 13th ITG Conference on Speech Communication, 2018, pp. 231–235.
  43. K.E. Silverman, J.R. Bellegarda: Using a sigmoid transformation for improved modeling of phoneme duration, in: Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, 1999, pp. 385–388.
  44. C. Spille, B. Kollmeier, B.T. Meyer: Combining binaural and cortical features for robust speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 25, 4 (2017) 756–767.
  45. V. Gokhale, J. Jin, A. Dundar, B. Martini, E. Culurciello: A 240 G-ops/s mobile coprocessor for deep neural networks, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 696–701.
