data sci. j.

Veri Bilimi

2667-582X

Murat GÖK

Engineering

Mühendislik

Investigating Word Association Mining Techniques

Kelime Birliktelik Madenciliği Tekniklerinin İncelenmesi

Bağcı Daş

Duygu

EGE UNIVERSITY

https://orcid.org/0000-0003-4774-9265

Genç

Sevdanur

KASTAMONU UNIVERSITY

Published: 25 December 2022

Volume 5, Issue 2, pp. 106-114. Received: 7 October 2022. Accepted: 21 November 2022.

2018

Veri Bilimi

This study investigates the effect of conditional entropy, mutual information (MI), the log-likelihood ratio (LLR), and simple co-occurrence counts on extracting strong syntagmatic relationships. Experiments are conducted on 10,000 restaurant reviews extracted from the Yelp Academic Dataset. The mutual information values of word pairs are used to extract the top syntagmatically related words from the corpus. For this purpose, Spyder 3.3.6 and the Python Natural Language Toolkit (NLTK) library are used. The mutual information values are then compared with simple co-occurrence counts. The results indicate that the three word collocation techniques give similar results and that, therefore, all of them can be employed effectively for word collocation extraction.
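As an illustration of the comparison described in the abstract, the following is a minimal sketch of ranking word pairs by mutual information alongside their raw co-occurrence counts. It is not the authors' implementation: it assumes that each review is treated as one segment, that a word is represented by a binary presence indicator per review, and that tokens are obtained by plain whitespace splitting (the study itself uses NLTK). The function names `pair_statistics`, `mutual_information`, and `rank_pairs` and the toy reviews are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_statistics(reviews):
    """Count how many reviews contain each word and each word pair."""
    word_df = Counter()
    pair_df = Counter()
    for text in reviews:
        # Assumption: whitespace tokenization, binary presence per review.
        vocab = sorted(set(text.lower().split()))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return word_df, pair_df


def mutual_information(n, df_a, df_b, df_ab):
    """MI in bits between the presence indicators of two words.

    n is the number of reviews, df_a and df_b are the numbers of reviews
    containing each word, and df_ab is the number containing both.
    """
    p_a, p_b = df_a / n, df_b / n
    # Joint probabilities of the four presence/absence combinations.
    joint = {
        (1, 1): df_ab / n,
        (1, 0): (df_a - df_ab) / n,
        (0, 1): (df_b - df_ab) / n,
        (0, 0): (n - df_a - df_b + df_ab) / n,
    }
    marg_a = {1: p_a, 0: 1 - p_a}
    marg_b = {1: p_b, 0: 1 - p_b}
    mi = 0.0
    for (a, b), p in joint.items():
        if p > 0:  # a zero-probability cell contributes nothing
            mi += p * math.log2(p / (marg_a[a] * marg_b[b]))
    return mi


def rank_pairs(reviews):
    """Return (pair, co-occurrence count, MI) tuples sorted by MI, highest first."""
    n = len(reviews)
    word_df, pair_df = pair_statistics(reviews)
    rows = [
        (pair, count, mutual_information(n, word_df[pair[0]], word_df[pair[1]], count))
        for pair, count in pair_df.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy data for illustration only; not taken from the Yelp dataset.
toy_reviews = [
    "great pizza great service",
    "pizza crust was great",
    "slow service",
    "friendly staff",
]

for pair, count, mi in rank_pairs(toy_reviews)[:5]:
    print(pair, count, round(mi, 3))
```

Printing the co-occurrence count next to the MI value makes it easy to see where the two rankings agree and where MI promotes pairs that are rare but strongly associated.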

Bu çalışma, koşullu entropi, ortak bilgi (MI) değerleri, log-birliktelik oranı (LLR) ve basit ortak oluşumların güçlü sözdizimsel ilişkilerin çıkarılması üzerindeki etkisinin araştırılmasını sunmaktadır. Deneyler, 10.000 restoran yorumunu içeren Yelp Akademik Veri Kümesi kullanılarak gerçekleştirilmiştir. Ortak bilgi değeri en yüksek sözcük çiftlerinin, söz dizimsel olarak ilişkili en üstteki sözcükleri derlemden çıkardığı kabul edilir. Bu amaçla Spyder 3.3.6 ve Python Natural Language Toolkit (NLTK) Library kullanılmıştır. Ortak bilgi değerleri daha sonra basit ortak oluşum sayısı ile karşılaştırılır. Analiz sonuçları, üç farklı kelime eşdizimleme tekniğinin benzer sonuçlar verdiğini ve bu nedenle, bunların hepsinin Kelime eşdizimleri için etkili bir şekilde kullanılabileceğini göstermiştir.

Word collocation; Collocation mining; Collocation extraction; Mutual information; Text mining

Kelime birlikteliği; Birliktelik madenciliği; Eşdizim çıkarma; Ortak bilgi; Metin madenciliği
