Effect of number and position of frames in speaker age estimation

Mohammed Muntaz Osman; Osman Büyük; Ali Tangel

Research Article

Year 2023, Volume: 41 Issue: 2, 243 - 255, 30.04.2023

Abstract

With the powerful processing devices and capabilities that emerged in the first two decades of the 21st century, machine learning algorithms are expected to predict speaker age with higher accuracy, that is, with a much lower error rate. Profiling individuals remotely, which includes estimating their age, is a long-standing goal. Speaker age estimation has been studied from several perspectives. However, most of these approaches do not show the effect of utterance length, i.e., the number of frames, on speaker age estimation. We present a detailed analysis of the effect of the number and position of frames on speaker age estimation using four magnitude-based spectral feature sets and one phase-based spectral feature set. The optimal speech duration for this task is investigated. In addition, the mismatch between training and test utterance durations is explored. The magnitude-based features are mainly derived from filter bank analysis. After the filter bank analysis, an i-vector is generated for each utterance. Least squares support vector regression (LSSVR) is employed for speaker age estimation. In the experiments, the aGender database, which consists of utterances from four age groups of German speakers, is used. Increasing the number of frames in the training and test sets increases the age estimation accuracy. This can be attributed to the notion that more data helps the estimation process. Concerning position, the frames located at the centre of utterances tend to offer better results for both genders. The backend algorithms offer the best performance when the training and test utterance lengths are equal for longer speech segments; otherwise, training with medium-length utterances and testing with longer ones offers better estimation performance, especially for the female dataset.
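The two quantities varied in the study, number of frames and frame position, together with the mean absolute error listed in the keywords, can be illustrated with a minimal Python sketch. The abstract does not describe how frames were actually selected, so the function names `select_frames` and `mean_absolute_error`, the three position labels, and the rule of taking a contiguous block of frames are all assumptions made for illustration only.

```python
def select_frames(frames, n_frames, position="centre"):
    """Return a contiguous block of n_frames taken from one utterance.

    frames   : list of per-frame feature vectors (any objects), in time order
    n_frames : requested number of frames (hypothetical experiment parameter)
    position : "start", "centre", or "end" (hypothetical labels)
    """
    total = len(frames)
    n = min(n_frames, total)  # short utterances simply return every frame
    if position == "start":
        begin = 0
    elif position == "centre":
        begin = (total - n) // 2  # block is centred on the utterance midpoint
    elif position == "end":
        begin = total - n
    else:
        raise ValueError("position must be 'start', 'centre', or 'end'")
    return frames[begin:begin + n]


def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    total_error = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total_error / len(true_ages)


# Toy usage: integers stand in for per-frame feature vectors.
utterance = list(range(10))
centre_block = select_frames(utterance, 4, "centre")
error = mean_absolute_error([20, 40], [25, 30])
```

In such a setup, each selected block would be passed on to feature extraction and regression, and the error would be compared across values of `n_frames` and `position`; the regression backend itself (i-vectors and LSSVR in the paper) is not reproduced here.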

Keywords

Filter Banks, Frame Position, Mean Absolute Error, Regression, Speaker Age, Utterance Length

References

[1] Barkana BD, Zhou J. A new pitch-range based feature set for a speaker's age and gender classification. Appl Acoust 2015;98:52−61.
[2] Schötz S. Perception, Analysis and Synthesis of Speaker Age. Doctoral thesis. Lund University; 2006.
[3] Chauhan PM, Desai NP. Mel Frequency Cepstral Coefficients (MFCC) based speaker identification in noisy environment using wiener filter. In: 2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE); Mar 2014; pp. 1−5.
[4] Murthy HA, Gadde V. The modified group delay function and its application to phoneme recognition. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03); Apr 2003; Vol. 1, p. I−68.
[5] Bahari MH, McLaren M, Van hamme H, van Leeuwen DA. Speaker age estimation using i-vectors. Eng Appl Artif Intell 2014;34:99−108.
[6] Burkhardt F, Eckert M, Johannsen W, Stegmann J. A database of age and gender annotated telephone speech. Proceedings of the Language and Resources Conference (LREC); 2010.
[7] Ajmera J, Burkhardt F. Age and gender classification using modulation cepstrum. In: Odyssey; 2008; pp. 25.
[8] Muller C, Wittig F, Baus J. Exploiting speech for recognizing elderly users to respond to their special needs. In: Eighth European conference on speech communication and technology; Geneva, Switzerland; 1-4 Sep. 2003.
[9] Braun A, Cerrato L. Estimating speaker age across languages. In: Proceedings of ICPhS; 1999; Vol. 99; pp. 1369−1372.
[10] Ghahremani P, Khorrami P, Lajevardi SM, et al. End-to-end Deep Neural Network Age Estimation. Interspeech 2018;2018:277−281.
[11] Zazo R, Nidadavolu PS, Chen N, Gonzalez-Rodriguez J, Dehak N. Age estimation in short speech utterances based on LSTM recurrent neural networks. IEEE Access 2018;6:22524−22530.
[12] Büyük O, Arslan LM. Combination of long-term and short-term features for age identification from voice. Adv Electr Comput Eng 2018;18:101−108.
[13] Büyük O, Arslan LM. Age identification from voice using feed-forward deep neural networks. In 2018 26th Signal Processing and Communications Applications Conference (SIU) (pp. 1−4).
[14] Büyük O, Arslan LM. An Investigation of Multi-Language Age Classification from Voice. In BIOSIGNALS (pp. 85-92), 2019.
[15] Kitagishi Y, Kamiyama H, Ando A, Tawara N, Mori T, Kobashikawa S. Speaker Age Estimation Using Age-Dependent Insensitive Loss. In 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 319−324).
[16] Kalluri SB, Vijayasenan D, Ganapathy S. Automatic speaker profiling from short duration speech data. Speech Commun 2020;121:16−28.
[17] Jacobs JP, Koziel S. Variable-fidelity modeling of antenna input characteristics using domain confinement and two-stage Gaussian process regression surrogates. Int J Numer Model 2020;33:e2758.
[18] Calik N, Belen MA, Mahouti P, Koziel S. Accurate modeling of frequency selective surfaces using fully-connected regression model with automated architecture determination and parameter selection based on Bayesian optimization. IEEE Access 2021;9:38396−38410.
[19] Koziel S, Mahouti P, Calik N, Belen MA, Szczepanski S. Improved modeling of microwave structures using performance-driven fully-connected regression surrogate. IEEE Access 2021;9:71470−71481.
[20] Přibil J, Přibilová A, Matoušek J. GMM-based speaker age and gender classification in Czech and Slovak. J Electrical Eng 2017;68:3−12.
[21] Reynolds DA, Quatieri TF, Dunn RB. Speaker verification using adapted Gaussian mixture models. Dig Sign Proces 2000;10:19−41.
[22] Mak MW. Lecture Notes on Factor Analysis and i-Vectors, Technical Report and Lecture Note Series, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Feb. 2016.
[23] Hegde RM, Murthy HA, Gadde VRR. Significance of the modified group delay feature in speech recognition. IEEE Trans Audio Speech Lang Process 2006;15:190−202.
[24] Vergin R, O'Shaughnessy D. Pre-emphasis and speech recognition. In: Proceedings 1995 Canadian Conference on Electrical and Computer Engineering 1995;2:1062−1065.
[25] Harris FJ. On the use of windows for harmonic analysis with the discrete Fourier transform. Proc IEEE 1978;66:51−83.
[26] Osman MM, Büyük O. Parabolic filter mel frequency cepstral coefficient and fusion of features for speaker age classification. Sigma J Eng Nat Sci 2020;38:2177−2191.
[27] Hanilci C. Features and classifiers for replay spoofing attack detection. In: 10th International Conference on Electrical and Electronics Engineering (ELECO). IEEE; 2017 Nov 30 - Dec 2; Bursa, Turkey.
[28] Sadjadi SO. MSR Identity Toolbox. Seattle, WA, USA: Microsoft; 2013.
[29] Moon TK. The expectation-maximization algorithm. IEEE Signal Process Mag 1996;13:47−60.
[30] Li W, Fu T, Zhu J. An improved i-vector extraction algorithm for speaker verification. J Audio Speech Music Proc 2015;2015:18.
[31] Alpaydın E. Introduction to machine learning. 2nd ed. Cambridge, Mass.: MIT Press; 2010.
[32] Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C. Text classification using string kernels. J Mach Learn Res 2002;2:419−444.

Details

Primary Language	English
Subjects	Computer Software
Journal Section	Research Articles
Authors	Mohammed Muntaz Osman (ORCID 0000-0001-6932-4159); Osman Büyük (ORCID 0000-0003-1039-3234); Ali Tangel (ORCID 0000-0002-0569-6399)
Publication Date	April 30, 2023
Submission Date	April 20, 2021
Published in Issue	Year 2023 Volume: 41 Issue: 2

Cite

Vancouver	Osman MM, Büyük O, Tangel A. Effect of number and position of frames in speaker age estimation. SIGMA. 2023;41(2):243-55.