kefad

Ahi Evran Üniversitesi Kırşehir Eğitim Fakültesi Dergisi

2147-1037

Kırşehir Ahi Evran Üniversitesi

10.29299/kefad.1732570

Measurement Theories and Applications in Education and Psychology

Eğitimde ve Psikolojide Ölçme Teorileri ve Uygulamaları

Yapısal Konu Modellemesi Yoluyla Eğitimde Ölçme Alanındaki Eğilimler ve İçgörüler: Dil Değerlendirmesi Üzerine Bir İnceleme

Trends and Insights in Educational Measurement through Structural Topic Modeling: A Study in Language Assessment

https://orcid.org/0000-0002-3580-5568

Atalay Kabasakal

Kübra

Hacettepe Üniversitesi, Eğitim Fakültesi

https://orcid.org/0000-0003-3211-0426

Koçak

Duygu

Alanya Alaaddin Keykubat Üniversitesi

https://orcid.org/0000-0003-3025-774X

Akcan

Rabia

Milli Eğitim Bakanlığı

01 31 2026

27 1 290 317 07 02 2025 09 16 2025

2000

Bu araştırmada, eğitimde ölçme alanındaki tematik eğilimleri ve araştırma yönelimlerini ortaya koymak amacıyla Yapısal Konu Modellemesi (STM) kullanılmıştır. Bu doğrultuda, örnek bir alt alan uygulaması olarak Language Testing ve Language Assessment Quarterly dergilerinde son 16 yılda yayımlanan toplam 778 makale analiz edilmiştir. STM analizi, en belirgin konuların “Dil Testinin Sosyal, Politik ve Etik Boyutları”, “Dil Değerlendirme Okuryazarlığının Geliştirilmesi” ve “Okuma ve Dinleme Değerlendirmelerinde Psikometrik Yaklaşımlar” olduğu on farklı tema ortaya koymuştur. Çalışmada ayrıca, değerlendirici güvenirliğine ilişkin kritik sorunlar vurgulanmakta ve bu konunun dil değerlendirme araştırmalarındaki merkezi rolüne dikkat çekilmektedir. Ayrıca, işaret dili ve iki dillilik bağlamlarında özellikle sözcük bilgisinin dil yeterliğindeki rolüne ilişkin iki birbiriyle bağlantılı tema öne çıkmaktadır. Dil testinin sosyal, politik ve etik boyutlarına artan vurgu, bu alanın yalnızca yeterlilik ölçümünü aşarak eğitim politikalarını ve uygulamalarını şekillendirme gücünü göstermektedir. Psikometrik yöntemlerin ve dil değerlendirme okuryazarlığının öne çıkması ise alandaki süregelen kuramsal ve yöntemsel gelişmelere işaret etmektedir. Bu bulgular, dil değerlendirme araştırmalarındaki önceliklerin ve yönelimlerin nasıl değiştiğine ilişkin araştırmacılar, politika yapıcılar ve uygulayıcılar için önemli içgörüler sunmaktadır.

In this study, Structural Topic Modeling (STM) was employed to identify thematic trends and research orientations within the field of educational measurement. Accordingly, as a representative subfield application, a total of 778 articles published over the past 16 years in the journals Language Testing and Language Assessment Quarterly were analyzed. The STM analysis identified ten distinct themes, with the most prominent topics being “Social, Political, and Ethical Dimensions of Language Testing,” “Advancing Language Assessment Literacy,” and “Psychometric Approaches to Reading and Listening Assessment.” The study also highlights critical issues related to rater reliability, emphasizing its centrality in language assessment research. Furthermore, two interconnected themes emerge concerning the role of vocabulary in language proficiency, particularly in the contexts of sign language and bilingualism. The increasing emphasis on social, political, and ethical dimensions underscores the expanding impact of language testing beyond proficiency measurement, shaping policies and educational practices. Additionally, the prominence of psychometric methodologies and language assessment literacy reflects the field’s ongoing methodological and theoretical advancements. These findings offer valuable insights into emerging priorities and shifts in language assessment research for scholars, policymakers, and practitioners.

Metin madenciliği; Yapısal konu modellemesi; Dil testi ve değerlendirmesi

Text mining; Structural topic modelling; Language testing and assessment

Aryadoust, V., Eckes, T., & In’nami, Y. (2021). Editorial: Frontiers in Language Assessment and Testing. Frontiers in Psychology, 12. https://doi.org/10.3389/fpsyg.2021.691614

Aryadoust, V., Goh, C. C. M., & Kim, L. O. (2011). An investigation of differential item functioning in the MELAB listening test. Language Assessment Quarterly, 8(4), 361–385. https://doi.org/10.1080/15434303.2011.628632

Aryadoust, V., Zakaria, A., Lim, M. H., & Chen, C. (2020). An extensive knowledge mapping review of measurement and validity in language assessment and SLA research. Frontiers in Psychology, 11. https://doi.org/10.3389/fpsyg.2020.01941

Bachman, L. F., & Clark, J. L. D. (1987). The measurement of foreign/second language proficiency. The Annals of the American Academy of Political and Social Science, 490(1), 20–33. https://doi.org/10.1177/0002716287490001003

Bae, J., Bentler, P. M., & Lee, Y. (2016). On the role of content in writing assessment. Language Assessment Quarterly, 13(4), 302–328. https://doi.org/10.1080/15434303.2016.1246552

Baker, B. A., & Riches, C. (2017). The development of EFL examinations in Haiti: Collaboration and language assessment literacy development. Language Testing, 35(4), 557–581. https://doi.org/10.1177/0265532217716732

Banks, G. C., Woznyj, H. M., Wesslen, R. S., & Ross, R. L. (2018). A review of best practice recommendations for text analysis in R (and a user-friendly app). Journal of Business and Psychology, 33(4), 445–459. https://doi.org/10.1007/s10869-017-9528-3

Barkaoui, K. (2010a). Explaining ESL essay holistic scores: A multilevel modeling approach. Language Testing, 27(4), 515–535. https://doi.org/10.1177/0265532210368717

Barkaoui, K. (2010b). Think-aloud protocols in research on essay rating: An empirical study of their veridicality and reactivity. Language Testing, 28(1), 51–75. https://doi.org/10.1177/0265532210376379

Barkaoui, K. (2010c). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54–74. https://doi.org/10.1080/15434300903464418

Barkaoui, K. (2024). The academic achievement of undergraduate students with different English language proficiency profiles. Language Assessment Quarterly, 21(3), 224–244. https://doi.org/10.1080/15434303.2024.2346089

Barkaoui, K. (2025). The relationship between English language proficiency test scores and academic achievement: A longitudinal study of two tests. Language Testing, 0(0). https://doi.org/10.1177/02655322251319284

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. https://doi.org/10.5555/944919.944937

Bochner, J. H., Samar, V. J., Hauser, P. C., Garrison, W. M., Searls, J. M., & Sanders, C. A. (2015). Validity of the American Sign Language Discrimination Test. Language Testing, 33(4), 473–495. https://doi.org/10.1177/0265532215590849

Carlsen, C. H., & Rocca, L. (2021). Language test misuse. Language Assessment Quarterly, 18(5), 477–491. https://doi.org/10.1080/15434303.2021.1947288

Cho, Y., & Bridgeman, B. (2012). Relationship of TOEFL iBT® scores to academic performance: Some evidence from American universities. Language Testing, 29(3), 421–442. https://doi.org/10.1177/0265532211430368

Choi, H., & Woo, J. (2022). Investigating emerging hydrogen technology topics and comparing national level technological focus: Patent analysis using a structural topic model. Applied Energy, 313, 118898. https://doi.org/10.1016/j.apenergy.2022.118898

Coghlan, S., Miller, T., & Paterson, J. (2021). Good proctor or “big brother”? Ethics of online exam supervision technologies. Philosophy & Technology, 34(4), 1581–1606. https://doi.org/10.1007/s13347-021-00476-1

Eckes, T. (2012). Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly, 9(3), 270–292. https://doi.org/10.1080/15434303.2011.64938

Elder, C., & McNamara, T. (2015). The hunt for “indigenous criteria” in assessing communication in the physiotherapy workplace. Language Testing, 33(2), 153–174. https://doi.org/10.1177/0265532215607398

Fan, J., & Yan, X. (2020). Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Frontiers in Psychology, 11. https://doi.org/10.3389/fpsyg.2020.00330

Fraenkel, J. R., Wallen, N. E., & Hyun, H. H. (2012). How to design and evaluate research in education. McGraw-Hill.

Gamaroff, R. (2000). Rater reliability in language assessment: The bug of all bears. System, 28(1), 31–53. https://doi.org/10.1016/S0346-251X(99)00059-7

Gardner, R. C., & MacIntyre, P. D. (1992). A student’s contributions to second language learning. Part I: Cognitive variables. Language Teaching, 25(4), 211–220. https://doi.org/10.1017/S026144480000700X

Gokturk, N., & Chukharev, E. (2024). Exploring the potential of a spoken dialog system-delivered paired discussion task for assessing interactional competence. Language Assessment Quarterly, 21(1), 60–99. https://doi.org/10.1080/15434303.2023.2289173

Hamdani, S., Chan, A., Kan, R., Chiat, S., Gagarina, N., Haman, E., … Armon-Lotem, S. (2024). Identifying developmental language disorder (DLD) in multilingual children: A case study tutorial. International Journal of Speech-Language Pathology, 1–15. https://doi.org/10.1080/17549507.2024.2326095

Hauck, M. C., Wolf, M. K., & Mislevy, R. (2016). Creating a next-generation system of K-12 English learner language proficiency assessments. ETS Research Report Series, 2016(1), 1–10. https://doi.org/10.1002/ets2.12092

Huang, F. L., & Konold, T. R. (2013). A latent variable investigation of the Phonological Awareness Literacy Screening-Kindergarten assessment: Construct identification and multigroup comparisons between Spanish-speaking English-language learners (ELLs) and non-ELL students. Language Testing, 31(2), 205–221. https://doi.org/10.1177/0265532213496773

Isaacs, T., Hu, R., Trenkic, D., & Varga, J. (2023). Examining the predictive validity of the Duolingo English Test: Evidence from a major UK university. Language Testing, 40(3), 748–770. https://doi.org/10.1177/02655322231158550

Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135–159. https://doi.org/10.1080/15434303.2013.769545

Isbell, D. R., Kremmel, B., & Kim, J. (2023). Remote proctoring in language testing: Implications for fairness and justice. Language Assessment Quarterly, 20(4–5), 469–487. https://doi.org/10.1080/15434303.2023.2288251

Jang, E. E., Cummins, J., Wagner, M., Stille, S., & Dunlop, M. (2015). Investigating the homogeneity and distinguishability of STEP proficiency descriptors in assessing English language learners in Ontario schools. Language Assessment Quarterly, 12(1), 87–109. https://doi.org/10.1080/15434303.2014.936602

Javidanmehr, Z., & Sarab, M. R. A. (2019). Retrofitting non-diagnostic reading comprehension assessment: Application of the G-DINA model to a high-stake reading comprehension test. Language Assessment Quarterly, 16(3), 294–311. https://doi.org/10.1080/15434303.2019.1654479

Kessler, G. (2018). Technology and the future of language teaching. Foreign Language Annals, 51(1), 205–218.

Kokhan, K. (2012). Investigating the possibility of using TOEFL scores for university ESL decision-making: Placement trends and effect of time lag. Language Testing, 29(2), 291–308. https://doi.org/10.1177/0265532211429403

Kotowicz, J., Woll, B., & Herman, R. (2020). Adaptation of the British Sign Language Receptive Skills Test into Polish Sign Language. Language Testing, 38(1), 132–153. https://doi.org/10.1177/0265532220924598

Kozaki, Y. (2010). An alternative decision-making procedure for performance assessments: Using the multifaceted Rasch model to generate cut estimates. Language Assessment Quarterly, 7(1), 75–95. https://doi.org/10.1080/15434300903464400

Kremmel, B., & Schmitt, N. (2016). Interpreting vocabulary test scores: What do various item formats tell us about learners’ ability to employ words? Language Assessment Quarterly, 13(4), 377–392. https://doi.org/10.1080/15434303.2016.1237516

Kuhn, K. D. (2018). Using structural topic modeling to identify latent topics and trends in aviation incident reports. Transportation Research Part C: Emerging Technologies, 87, 105–122. https://doi.org/10.1016/j.trc.2017.12.018

Kunnan, A. J. (2009). Testing for citizenship: The U.S. naturalization test. Language Assessment Quarterly, 6(1), 89–97. https://doi.org/10.1080/15434300802606630

Kyle, K., & Crossley, S. (2017). Assessing syntactic sophistication in L2 writing: A usage-based approach. Language Testing, 34(4), 513–535. https://doi.org/10.1177/0265532217712554

Kyle, K., Crossley, S. A., & Jarvis, S. (2021). Assessing the validity of lexical diversity indices using direct judgements. Language Assessment Quarterly, 18(2), 154–170. https://doi.org/10.1080/15434303.2020.1844205

Lam, R. (2014). Language assessment training in Hong Kong: Implications for language assessment literacy. Language Testing, 32(2), 169–197. https://doi.org/10.1177/0265532214554321

Lam, D. M. K. (2019). Interactional competence with and without extended planning time in a group oral assessment. Language Assessment Quarterly, 16(1), 1–20. https://doi.org/10.1080/15434303.2019.1602627

Laufer, B., & McLean, S. (2016). Loanwords and vocabulary size test scores: A case of different estimates for different L1 learners. Language Assessment Quarterly, 13(3), 202–217. https://doi.org/10.1080/15434303.2016.1210611

Li, X., Dai, A., Tran, R., & Wang, J. (2023). Text mining-based identification of promising miRNA biomarkers for diabetes mellitus. Frontiers in Endocrinology, 14. https://doi.org/10.3389/fendo.2023.1195145

Liu, H. Y., You, X. F., Wang, W. Y., Ding, S. L., & Chang, H. H. (2013). The development of computerized adaptive testing with cognitive diagnosis for an English achievement test in China. Journal of Classification, 30(2), 152–172. https://doi.org/10.1007/s00357-013-9128-5

Liu, T., Aryadoust, V., & Foo, S. (2021). Examining the factor structure and its replicability across multiple listening test forms: Validity evidence for the Michigan English Test. Language Testing, 39(1), 142–171. https://doi.org/10.1177/02655322211018139

Manias, E., & McNamara, T. (2016). Standard setting in specific-purpose language testing: What can a qualitative study add? Language Testing, 33(2), 235–249. https://doi.org/10.1177/0265532215608411

May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters. Language Assessment Quarterly, 8(2), 127–145. https://doi.org/10.1080/15434303.2011.565845

McNamara, T. (2009). Australia: The dictation tests redux? Language Assessment Quarterly, 6(1), 106–111. https://doi.org/10.1080/15434300802606663

McNamara, T., & Ryan, K. (2011). Fairness versus justice in language testing: The place of English literacy in the Australian citizenship test. Language Assessment Quarterly, 8(2), 161–178. https://doi.org/10.1080/15434303.2011.565438

Min, S., & He, L. (2014). Applying unidimensional and multidimensional item response theory models in testlet-based reading assessment. Language Testing, 31(4), 453–477. https://doi.org/10.1177/0265532214527277

Min, S., Cai, H., & He, L. (2021). Application of bi-factor MIRT and higher-order CDM models to an in-house EFL listening test for diagnostic purposes. Language Assessment Quarterly, 19(2), 189–213. https://doi.org/10.1080/15434303.2021.1980571

Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.

O’Hagan, S., Pill, J., & Zhang, Y. (2015). Extending the scope of speaking assessment criteria in a specific-purpose language test: Operationalizing a health professional perspective. Language Testing, 33(2), 195–216. https://doi.org/10.1177/0265532215607920

Olson, D. J. (2023). Measuring bilingual language dominance: An examination of the reliability of the Bilingual Language Profile. Language Testing, 40(3), 521–547. https://doi.org/10.1177/02655322221139162

Peña, E. D., Bedore, L. M., Lugo-Neris, M. J., & Albudoor, N. (2020). Identifying developmental language disorder in school-age bilinguals: Semantics, grammar, and narratives. Language Assessment Quarterly, 17(5), 541–558. https://doi.org/10.1080/15434303.2020.1827258

Pill, J. (2015). Drawing on indigenous criteria for more authentic assessment in a specific-purpose language test: Health professionals interacting with patients. Language Testing, 33(2), 175–193. https://doi.org/10.1177/0265532215607400

Plough, I. C., & Bogart, P. S. H. (2008). Perceptions of examiner behavior modulate power relations in oral performance testing. Language Assessment Quarterly, 5(3), 195–217. https://doi.org/10.1080/15434300802229375

Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4), 1064–1082. https://doi.org/10.1111/ajps.12103

Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. Journal of Statistical Software, 91(2). https://doi.org/10.18637/jss.v091.i02

Robles-García, P., McLean, S., Stewart, J., Shin, J. young, & Sánchez-Gutiérrez, C. H. (2024). The development and initial validation of O-WSVLT, a meaning-recall online L2 Spanish vocabulary levels test. Language Assessment Quarterly, 21(2), 181–205. https://doi.org/10.1080/15434303.2024.2311724

Scarino, A. (2013). Language assessment literacy as self-awareness: Understanding the role of interpretation in assessment and in teacher learning. Language Testing, 30(3), 309–327. https://doi.org/10.1177/0265532213480128

Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465–493. https://doi.org/10.1177/0265532208094273

Schissel, J. L., López-Gopar, M., Leung, C., Morales, J., & Davis, J. R. (2019). Classroom-based assessments in linguistically diverse communities: A case for collaborative research methodologies. Language Assessment Quarterly, 16(4–5), 393–407. https://doi.org/10.1080/15434303.2019.1678041

Segbers, J., & Schroeder, S. (2017). How many words do children know? A corpus-based estimation of children’s total vocabulary size. Language Testing, 34(3), 297–320. https://doi.org/10.1177/0265532216641152

Shi, B., Huang, L., & Lu, X. (2020). Effect of prompt type on test-takers’ writing performance and writing strategy use in the continuation task. Language Testing, 37(3), 361–388. https://doi.org/10.1177/0265532220911626

Silge, J., & Robinson, D. (2016). tidytext: Text mining and analysis using tidy data principles in R. Journal of Open Source Software, 1(3), 37. https://doi.org/10.21105/joss.00037

Stewart, J., Vitta, J. P., Nicklin, C., McLean, S., Pinchbeck, G. G., & Kramer, B. (2021). The relationship between word difficulty and frequency: A response to Hashimoto. Language Assessment Quarterly, 19(1), 90–101. https://doi.org/10.1080/15434303.2021.1992629

Tonidandel, S., Summerville, K. M., Gentry, W. A., & Young, S. F. (2021). Using structural topic modeling to gain insight into challenges faced by leaders. The Leadership Quarterly, 33(5), 101576. https://doi.org/10.1016/j.leaqua.2021.101576

Usman, N., Hendrik, H., & Madehang, M. (2024). Difficulties in understanding the TOEFL reading test of English language education study program at university. IDEAS: Journal on English Language Teaching and Learning, Linguistics and Literature, 12(1), 755–773. https://doi.org/10.24256/ideas.v12i1.5179

Vogt, K., Tsagari, D., & Spanoudis, G. (2020). What do teachers think they want? A comparative study of in-service language teachers’ beliefs on LAL training needs. Language Assessment Quarterly, 17(4), 386–409. https://doi.org/10.1080/15434303.2020.1781128

Wang, P. A., & Hsieh, S. (2023). Incorporating structural topic modeling into short text analysis. Concentric Studies in Linguistics, 49(1), 96–138. https://doi.org/10.1075/consl.22026.wan

Wolfersberger, M. (2013). Refining the construct of classroom-based writing-from-readings assessment: The role of task representation. Language Assessment Quarterly, 10(1), 49–72. https://doi.org/10.1080/15434303.2012.750661

Youn, S. J. (2019). Managing proposal sequences in role-play assessment: Validity evidence of interactional competence across levels. Language Testing, 37(1), 76–106. https://doi.org/10.1177/0265532219860077