Arabic Algerian Oranee Dialectal Language Modelling Oriented Topic
Year 2019,
Volume: 2 Issue: 2, 1 - 14, 30.12.2019
Freha Mezzoudj, Mourad Loukam, Fatma Zohra Belkredim
Abstract
Modern Standard Arabic (MSA) is the formal language used in the Arab world. In Algeria, MSA and several informal Arabic dialects are used in everyday communication. These dialects are themselves subject to further regional variation: eastern, western, central, or southern. The Oranee dialect is the most important and most widely used dialect in the west of Algeria. However, it is an under-resourced language that lacks both audio and textual corpora. In this paper, we present the main particularities of this western Algerian dialect and apply natural language processing to an Oranee textual corpus. A transcribed MSA discourse may contain some dialectal vocabulary, and vice versa. Therefore, we propose to interpolate dialectal language models with MSA ones with respect to some topics. The best interpolation weights obtained are related to the Religion topic data.
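As an illustration of the interpolation idea, the sketch below linearly combines a dialectal model and an MSA model, P(w) = lam * P_dialect(w) + (1 - lam) * P_MSA(w), and picks the weight lam that minimises perplexity on held-out text. It is only a minimal sketch under stated assumptions: it uses add-alpha smoothed unigram models, placeholder tokens (d1, m1, shared) instead of real corpus data, and hypothetical function names. It does not reproduce the authors' models, toolkit, or settings.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_dialect, p_msa, lam):
    """Linear interpolation: lam * P_dialect(w) + (1 - lam) * P_msa(w)."""
    return {w: lam * p_dialect[w] + (1.0 - lam) * p_msa[w] for w in p_dialect}


def perplexity(model, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


def best_weight(p_dialect, p_msa, heldout, grid):
    """Return the grid value of lam with the lowest held-out perplexity."""
    return min(
        grid,
        key=lambda lam: perplexity(interpolate(p_dialect, p_msa, lam), heldout),
    )


# Placeholder tokens (hypothetical, not taken from any corpus).
dialect_tokens = ["d1", "d2", "d1", "shared"]
msa_tokens = ["m1", "m2", "m1", "shared"]
heldout_tokens = ["d1", "m1", "shared", "d2"]  # mixed discourse for one topic

vocab = sorted(set(dialect_tokens) | set(msa_tokens))
p_dialect = unigram_model(dialect_tokens, vocab)
p_msa = unigram_model(msa_tokens, vocab)

grid = [i / 10 for i in range(11)]  # candidate weights 0.0, 0.1, ..., 1.0
lam_star = best_weight(p_dialect, p_msa, heldout_tokens, grid)
mixed = interpolate(p_dialect, p_msa, lam_star)
print("selected weight:", lam_star)
```

In a topic-oriented setting, the same search would be repeated with a separate held-out set per topic, giving one weight per topic.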