Associative Measures and Multi-word Unit Extraction in Turkish

Year 2015, Volume: 12 Issue: 1, 43 - 61, 01.06.2015

Abstract

Associative measures are “mathematical formulas determining the strength of association between two or more words based on their occurrences and cooccurrences in a text corpus” (Pecina, 2010, p. 138). The purpose of this paper is to test the 12 associative measures contained in Text-NSP (Banerjee & Pedersen, 2003) on a 10-million-word subcorpus of the Turkish National Corpus (TNC) (Aksan et al., 2012). A statistical comparison of those measures is outside the scope of the study; the measures are instead evaluated according to the linguistic relevance of the rankings they provide. The study focuses on optimizing the corpus data before applying the measures and then on evaluating the rankings produced by these measures as a whole, not on the linguistic relevance of individual n-grams. The findings include intra-linguistically relevant associative measures for a comma-delimited, sentence-split, lower-cased, well-balanced, representative, 10-million-word corpus of Turkish.

References

  • Aksan, M. & Aksan, Y. (2013). Multi-word units and pragmatic functions in genre specification. Paper presented at the 13th IPrA Conference, 8-13 September 2013, New Delhi, India.
  • Aksan, Y. et al. (2012). Construction of the Turkish National Corpus (TNC). Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). (pp. 3223-3227). İstanbul, Turkiye.
  • Aksan, Y., Mersinli, Ü. & Altunay, S. (2015). Colligational analysis of Turkish multi-word units. Paper presented at CCS-2015, Corpus-Based Word Frequency: Methods and Applications. 19-20 February 2015. Mersin University. Turkiye.
  • Banerjee, S. & Pedersen, T. (2003). The design, implementation and use of the (N)gram (S)tatistics (P)ackage. Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, (pp. 370-381).
  • Blanken, H.M., de Vries, A.P., Blok, H.E. & Feng, L. (Eds.). (2007). Multimedia retrieval. Berlin: Springer.
  • Calzolari, N. et al. (2002). Towards best practice for multiword expressions in computational lexicons. Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002). (pp. 1934-1940). Las Palmas, Canary Islands.
  • Durrant, P. & Mathews-Aydınlı, J. (2011). A function first approach to identifying formulaic language in academic writing. English for Specific Purposes, 30, 58-72.
  • Evert, S. (2004). The statistics of word cooccurrences (word pairs and collocations). (Doctoral Dissertation). Universität Stuttgart.
  • Haspelmath, M. (2011). The indeterminacy of word segmentation and the nature of morphology and syntax. Folia Linguistica, 45(1), 31-80.
  • Jackendoff, R. (1997). The architecture of the language faculty. Cambridge, MA: MIT Press.
  • Hiemstra, D. & Kraaij, W. (2007). Evaluation of multimedia retrieval systems. In H.M. Blanken, A.P. de Vries, H. E. Blok, L. Feng (Eds.), Multimedia Retrieval (pp. 347-366). Berlin: Springer.
  • Kumova-Metin, S. & Karaoğlan, B. (2010). Collocation extraction in Turkish texts using statistical methods. Paper presented at 7th International Conference on Natural Language Processing (LNCS-ISI) IceTAL 2010, Reykjavik, Iceland.
  • Manning, C. D., & Schütze, H. (2001). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
  • Mel’čuk, I. A. (1995). Phrasemes in language and phraseology in linguistics. In M. Everaert, E.-J. van der Linden, A. Schenk & R. Schreuder (Eds.), Idioms: Structural and psychological perspectives (pp. 167–232). Hillsdale, NJ: Lawrence Erlbaum.
  • Mersinli, Ü. & Demirhan, U. (2012). Çok sözcüklü kullanımlar ve ilköğretim Türkçe ders kitapları. In M. Aksan & Y. Aksan. (Eds.), Türkçe Öğretiminde Güncel Çalışmalar. (pp. 113-122). Mersin: Mersin Üniversitesi.
  • Nissim, M. & Zaninello, A. (2013). Modeling the internal variability of multiword expressions through a pattern-based method. ACM Transactions on Speech and Language Processing, 10(2), 1-26.
  • O’Donnell, M.B. (2011). The adjusted frequency list: A method to produce cluster sensitive frequency lists. ICAME Journal: Computers in English Linguistics, 35, 135-169.
  • Oflazer, K., Çetinoğlu, Ö. & Say, B. (2004). Integrating morphology with multi-word expression processing in Turkish. Proceedings of the Workshop on Multiword Expressions: Integrating Processing (MWE '04). Association for Computational Linguistics, (pp. 64-71). Stroudsburg, PA, USA.
  • Pecina, P. (2010). Lexical association measures and collocation extraction. Language Resources and Evaluation, 44(1-2), 137-158.
  • Ramisch, C., Villavicencio, A., & Kordoni, V. (2013). Introduction to the special issue on multiword expressions: From theory to practice and use. ACM Transactions on Speech and Language Processing, 10(2), 1-3.
  • Rayson, P., Piao, S., Sharoff, S., Evert, S., & Moiron, B.V. (2010). Multiword expressions: Hard going or plain sailing? Language Resources and Evaluation, 44(1-2), 1-5.
  • Sinclair, J.M., & Daley, J.S. (2004). English collocation studies: The OSTI Report. London, New York: Continuum.
  • Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
There are 23 citations in total.

Details

Primary Language Turkish
Journal Section Articles
Authors

Ümit Mersinli

Publication Date June 1, 2015
Published in Issue Year 2015 Volume: 12 Issue: 1

Cite

APA Mersinli, Ü. (2015). Associative Measures and Multi-word Unit Extraction in Turkish. Dil Ve Edebiyat Dergisi, 12(1), 43-61.