TEDxSK and JumpSK: A New Slovak Speech Recognition Dedicated Corpus

Ján Staš; Daniel Hládek; Peter Viszlay; Tomáš Koctúr

doi:10.1515/jazcas-2017-0044



Journal of Linguistics/Jazykovedný časopis

Volume 68 (2017): Issue 2 (December 2017)


Open Access


Abstract

This paper describes a new Slovak corpus dedicated to speech recognition, built from TEDx talks and Jump Slovakia lectures. The proposed speech database consists of 220 talks and lectures with a total duration of about 58 hours. The annotated speech database was generated automatically, in an unsupervised manner, using acoustic speech segmentation based on principal component analysis and automatic speech transcription with two complementary speech recognition systems. An evaluation set of 50 manually annotated talks and lectures, with a total duration of about 12 hours, was created for evaluating the quality of Slovak speech recognition. Unsupervised automatic annotation of the TEDx talks and Jump Slovakia lectures yielded 21.26% of new speech segments with a word error rate of approximately 9.44%, suitable for retraining or adapting previously trained acoustic models.
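The word error rate quoted in the abstract is conventionally defined as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation tooling; the function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()   # assumption: whitespace tokenization, no normalization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a hypothesis containing many inserted words can score above 100%, so the value is an error ratio rather than a bounded accuracy measure.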



DOI: https://doi.org/10.1515/jazcas-2017-0044 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597


Language: English

Page range: 346 - 354

Published on: Jan 24, 2018

Published by: Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Keywords: automatic annotation, speech recognition, speech corpus

Related subjects: Linguistics and semiotics; Theoretical frameworks and disciplines; Linguistics, other

© 2018 Ján Staš, Daniel Hládek, Peter Viszlay, Tomáš Koctúr, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
