<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN"
        "https://jats.nlm.nih.gov/publishing/1.4/JATS-journalpublishing1-4.dtd">
<article article-type="research-article" dtd-version="1.4">
    <front>

        <journal-meta>
            <journal-id>jmeep</journal-id>
            <journal-title-group>
                <journal-title>Journal of Measurement and Evaluation in Education and Psychology</journal-title>
            </journal-title-group>
            <issn pub-type="epub">1309-6575</issn>
            <publisher>
                <publisher-name>Association for Measurement and Evaluation in Education and Psychology</publisher-name>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id/>
            <article-categories>
                <subj-group xml:lang="en">
                    <subject>Item Response Theory</subject>
                </subj-group>
                <subj-group xml:lang="tr">
                    <subject>Madde-Cevap Kuramı</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Assessing Pre-Service Teachers’ Competencies in Open-Ended Item Development: Self-, Peer, Instructor Assessment</article-title>
            </title-group>
            
            <contrib-group content-type="authors">
                <contrib contrib-type="author">
                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1991-1416</contrib-id>
                    <name>
                        <surname>Yavuz</surname>
                        <given-names>Emine</given-names>
                    </name>
                    <aff>Erciyes Üniversitesi</aff>
                </contrib>
                <contrib contrib-type="author">
                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-2683-4997</contrib-id>
                    <name>
                        <surname>Şata</surname>
                        <given-names>Mehmet</given-names>
                    </name>
                    <aff>Van Yüzüncü Yıl Üniversitesi, Van Eğitim Yüksekokulu</aff>
                </contrib>
            </contrib-group>
                        
            <pub-date pub-type="pub" iso-8601-date="2026-04-01">
                <day>01</day>
                <month>04</month>
                <year>2026</year>
            </pub-date>
            <volume>17</volume>
            <issue>1</issue>
            <fpage>24</fpage>
            <lpage>41</lpage>
                        
            <history>
                <date date-type="received" iso-8601-date="2025-08-06">
                    <day>06</day>
                    <month>08</month>
                    <year>2025</year>
                </date>
                <date date-type="accepted" iso-8601-date="2026-03-09">
                    <day>09</day>
                    <month>03</month>
                    <year>2026</year>
                </date>
            </history>
                                        <permissions>
                    <copyright-statement>Copyright © 2010, Journal of Measurement and Evaluation in Education and Psychology</copyright-statement>
                    <copyright-year>2010</copyright-year>
                    <copyright-holder>Journal of Measurement and Evaluation in Education and Psychology</copyright-holder>
                </permissions>
            
            <abstract><p>This research focused on the role of self-, peer, and instructor assessments in evaluating pre-service teachers’ competencies in open-ended item development and aimed to determine whether there were rater biases in the scorings. It also examined differences in pre-service teachers’ scores according to gender and education type. The participants, 116 pre-service teachers, were asked to prepare one open-ended item as a performance task in the measurement and evaluation course. The items were scored by the pre-service teachers themselves, by their peers, and by the instructor using a holistic rubric. The many-facet Rasch model was used to analyze the data. The analysis showed that self-assessment was the most lenient rater type, instructor assessment had the most severe ratings, and there was no rater bias in the scoring. In addition, female pre-service teachers were more lenient than male pre-service teachers, and pre-service teachers studying in daytime classes were more lenient than those in evening classes.</p></abstract>
                                                            
            
            <kwd-group>
                <kwd>rater severity</kwd>
                <kwd>rater leniency</kwd>
                <kwd>rater bias</kwd>
                <kwd>open-ended items</kwd>
                <kwd>validity</kwd>
            </kwd-group>
                            
                                                                                                                        </article-meta>
    </front>
    <back>
                            <ref-list>
                                    <ref id="ref1">
                        <label>1</label>
                        <mixed-citation publication-type="journal">Adesiji, K. M., Agbonifo, O. C., Adesuyi, A. T., &amp; Olabode, O. (2016). Development of an automated descriptive text-based scoring system. British Journal of Mathematics &amp; Computer Science, 19(4), 1-14. https://doi.org/10.9734/BJMCS/2016/27558</mixed-citation>
                    </ref>
                                    <ref id="ref2">
                        <label>2</label>
                        <mixed-citation publication-type="journal">Almazroa, H., &amp; Alotaibi, W. (2023). Teaching 21st century skills: Understanding the depth and width of the challenges to shape proactive teacher education programmes. Sustainability, 15, 7365. https://doi.org/10.3390/su15097365</mixed-citation>
                    </ref>
                                    <ref id="ref3">
                        <label>3</label>
                        <mixed-citation publication-type="journal">Alver, B. (2005). The emphatic skills and decision-making strategies of the students of the department of guidance and psychological counseling, faculty of education. Journal of Social Science and Humanities Researches, (14), 19-34.</mixed-citation>
                    </ref>
                                    <ref id="ref4">
                        <label>4</label>
                        <mixed-citation publication-type="book">Anderson, L. W., &amp; Krathwohl, D. R. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom&#039;s taxonomy of educational objectives. Pearson.</mixed-citation>
                    </ref>
                                    <ref id="ref5">
                        <label>5</label>
                        <mixed-citation publication-type="journal">Ateş, G. Ç., &amp; Köse, M. F. (2024). An analysis of university students’ academic achievement in relation to accommodation and various dependent variables. Journal of University Research, 7(3), 212-223. https://doi.org/10.32329/uad.1500037</mixed-citation>
                    </ref>
                                    <ref id="ref6">
                        <label>6</label>
                        <mixed-citation publication-type="book">Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Prentice-Hall.</mixed-citation>
                    </ref>
                                    <ref id="ref7">
                        <label>7</label>
                        <mixed-citation publication-type="journal">Barrette, C. (2004). An analysis of foreign language achievement test drafts. Foreign Language Annals, 37(1), 58–70. https://doi.org/10.1111/j.1944-9720.2004.tb02173.x.</mixed-citation>
                    </ref>
                                    <ref id="ref8">
                        <label>8</label>
                        <mixed-citation publication-type="confproc">Bastarrica, M. C., &amp; Simmonds, J. (2019). Gender differences in self and peer assessment in a software engineering capstone course. IEEE/ACM 2nd International Workshop on Gender Equality in Software Engineering, Montreal, CA, May 2019. https://doi.org/10.1109/GE.2019.00014</mixed-citation>
                    </ref>
                                    <ref id="ref9">
                        <label>9</label>
                        <mixed-citation publication-type="journal">Birenbaum, M., Tatsuoka, K., &amp; Gutvirtz, Y. (1992). Effects of response format on diagnostic assessment of scholastic achievement. Applied Psychological Measurement, 16(4), 353-363. https://doi.org/10.1177/0146621692016004</mixed-citation>
                    </ref>
                                    <ref id="ref10">
                        <label>10</label>
                        <mixed-citation publication-type="book">Bond, T. G., &amp; Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). Routledge.</mixed-citation>
                    </ref>
                                    <ref id="ref11">
                        <label>11</label>
                        <mixed-citation publication-type="book">Brookhart, S. M. (2014). How to design questions and tasks to assess student thinking. ASCD.</mixed-citation>
                    </ref>
                                    <ref id="ref12">
                        <label>12</label>
                        <mixed-citation publication-type="journal">Cabello, V. M., &amp; Topping, K. J. (2020). Peer assessment of teacher performance. What works in teacher education? International Journal of Cognitive Research in Science, Engineering and Education (IJCRSEE), 8(2), 121-132. https://doi.org/10.5937/IJCRSEE2002121C</mixed-citation>
                    </ref>
                                    <ref id="ref13">
                        <label>13</label>
                        <mixed-citation publication-type="journal">De Marsico, M., Sciarrone, F., Sterbini, A., &amp; Temperini, M. (2017). Supporting mediated peer-evaluation to grade answers to open-ended questions. EURASIA Journal of Mathematics Science and Technology Education, 13(4), 1085-1106. https://doi.org/10.12973/eurasia.2017.00660a</mixed-citation>
                    </ref>
                                    <ref id="ref14">
                        <label>14</label>
                        <mixed-citation publication-type="journal">Demir, M. K. (2012). Analyzing empathy skills of primary school teacher candidates. Buca Faculty of Education Journal, (33), 107-121.</mixed-citation>
                    </ref>
                                    <ref id="ref15">
                        <label>15</label>
                        <mixed-citation publication-type="journal">Dochy, F., Segers, M., &amp; Sluijsmans, D. (1999). The use of self-, peer and co-assessment in higher education. Studies in Higher Education, 24(3), 331-350. https://doi.org/10.1080/03075079912331379935</mixed-citation>
                    </ref>
                                    <ref id="ref16">
                        <label>16</label>
                        <mixed-citation publication-type="book">Ebel, R. L., &amp; Frisbie, D. A. (1991). Essentials of educational measurement. Prentice Hall.</mixed-citation>
                    </ref>
                                    <ref id="ref17">
                        <label>17</label>
                        <mixed-citation publication-type="book">Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Peter Lang.</mixed-citation>
                    </ref>
                                    <ref id="ref18">
                        <label>18</label>
                        <mixed-citation publication-type="journal">Eke, C. (2018). Analysis of objectives of high school physics curriculum according to the revised Bloom&#039;s taxonomy. Journal of Social Research and Behavioral Sciences, 4(6), 69-84.</mixed-citation>
                    </ref>
                                    <ref id="ref19">
                        <label>19</label>
                        <mixed-citation publication-type="journal">Ercoşkun, M. H., &amp; Nalçacı, A. (2008). The investigation of the empathic skills and democratic attitudes of the primary school teacher candidates. Milli Eğitim Dergisi, 37(180), 204-215.</mixed-citation>
                    </ref>
                                    <ref id="ref20">
                        <label>20</label>
                        <mixed-citation publication-type="journal">Ercoşkun, M. H., Dilekmen, M., Ada, Ş., &amp; Nalçacı, A. (2006). The investigation of empathic skills of the department of primary school teaching students as regards individual variations. Educational Academic Research, (13), 207–217.</mixed-citation>
                    </ref>
                                    <ref id="ref21">
                        <label>21</label>
                        <mixed-citation publication-type="journal">Erkayıran, O., Şenocak, S. Ü., &amp; Demirkıran, F. (2018). Investigation of empathic skill levels of nursing students in terms of some variables: A cross-sectional study. Journal of Nursing Science, 1(2), 01–04.</mixed-citation>
                    </ref>
                                    <ref id="ref22">
                        <label>22</label>
                        <mixed-citation publication-type="journal">Erman-Aslanoglu, A., Karakaya, İ., &amp; Sata, M. (2020). Evaluation of university students’ rating behaviors in self and peer rating process via many-facet Rasch model. Eurasian Journal of Educational Research, 20(89), 25-46. https://izlik.org/JA58KH69DL</mixed-citation>
                    </ref>
                                    <ref id="ref23">
                        <label>23</label>
                        <mixed-citation publication-type="journal">Falchikov, N., &amp; D. Boud. (1989). Student self-assessment in higher education: A meta-analysis. Review of Educational Research, 59(4), 395–430. https://doi.org/10.2307/1170205</mixed-citation>
                    </ref>
                                    <ref id="ref24">
                        <label>24</label>
                        <mixed-citation publication-type="journal">Falchikov, N., &amp; Goldfinch, J. (2000). Student peer assessment in higher education. A meta-analysis comparing peer and teacher marks. Review of Educational Research, 70(3), 287-322. https://doi.org/10.2307/1170785</mixed-citation>
                    </ref>
                                    <ref id="ref25">
                        <label>25</label>
                        <mixed-citation publication-type="journal">Fang, J.-W., Chang, S.-C., Hwang, G.-J., &amp; Yang, G. (2021). An online collaborative peer‑assessment approach to strengthening pre-service teachers&#039; digital content development competence and higher‑order thinking tendency. Education Tech Research Dev, 69, 1155–1181. https://doi.org/10.1007/s11423-021-09990-7</mixed-citation>
                    </ref>
                                    <ref id="ref26">
                        <label>26</label>
                        <mixed-citation publication-type="journal">Farrokhi, F., Esfandiari, R., &amp; Schaefer, E. (2012). A many-facet Rasch measurement of differential rater severity/leniency in three types of assessment. JALT Journal, 34(1), 79-101.</mixed-citation>
                    </ref>
                                    <ref id="ref27">
                        <label>27</label>
                        <mixed-citation publication-type="journal">Farrokhi, F., Esfandiari, R., &amp; Vaez Dalili, M. (2011). Applying the many-facet Rasch model to detect centrality in self-assessment, peer-assessment and teacher assessment. World Applied Sciences Journal, 15(11), 76-83.</mixed-citation>
                    </ref>
                                    <ref id="ref28">
                        <label>28</label>
                        <mixed-citation publication-type="book">Fraenkel, J. R., &amp; Wallen, N. E. (2005). How to design and evaluate research in education. McGraw-Hill.</mixed-citation>
                    </ref>
                                    <ref id="ref29">
                        <label>29</label>
                        <mixed-citation publication-type="journal">Genç, S. Z., &amp; Kalafat, T. (2010). Prospective teachers’ problem solving skills and emphatic skills. Journal of Theoretical Educational Science, 3(2), 135-147.</mixed-citation>
                    </ref>
                                    <ref id="ref30">
                        <label>30</label>
                        <mixed-citation publication-type="journal">Gielen, S., Dochy, F., &amp; Onghena, P. (2010). An inventory of peer assessment diversity. Assessment and Evaluation in Higher Education, 36(2), 137-155. https://doi.org/10.1080/02602930903221444</mixed-citation>
                    </ref>
                                    <ref id="ref31">
                        <label>31</label>
                        <mixed-citation publication-type="journal">Gierl, M. J., Latifi, S., Lai, H., Boulais, A. P., &amp; Champlain, A. D. (2014). Automated essay scoring and the future of educational assessment in medical education. Medical Education, 48, 950-962. https://doi.org/10.1111/medu.12517</mixed-citation>
                    </ref>
                                    <ref id="ref32">
                        <label>32</label>
                        <mixed-citation publication-type="journal">Goodrich, H. (1997). Understanding Rubrics: The dictionary may define &quot;rubric&quot;, but these models provide more clarity. Educational Leadership, 54(4), 14-17.</mixed-citation>
                    </ref>
                                    <ref id="ref33">
                        <label>33</label>
                        <mixed-citation publication-type="journal">Heritage, M. (2007). Formative assessment: What do teachers need to know and do? Phi Delta Kappan.  https://kappanonline.org/formative-assessment-heritage/</mixed-citation>
                    </ref>
                                    <ref id="ref34">
                        <label>34</label>
                        <mixed-citation publication-type="journal">Kane, J., Bernardin, H., Villanueva, J., &amp; Peyrefitte, J. (1995). Stability of rater leniency: Three studies. Academy of Management Journal, 38(4), 1036-1051. https://doi.org/10.2307/256619</mixed-citation>
                    </ref>
                                    <ref id="ref35">
                        <label>35</label>
                        <mixed-citation publication-type="journal">Kane, J. S., &amp; Lawler, E. E. (1978). Methods of peer assessment. Psychological Bulletin, 85(3), 555-586. https://doi.org/10.1037/0033-2909.85.3.555</mixed-citation>
                    </ref>
                                    <ref id="ref36">
                        <label>36</label>
                        <mixed-citation publication-type="journal">Karakaya, İ. (2015). Comparison of self, peer, and instructor assessments in the portfolio assessment by using the many-facet Rasch model. Journal of Education and Human Development, 4(2), 182-192. https://doi.org/10.15640/jehd.v4n2a22</mixed-citation>
                    </ref>
                                    <ref id="ref37">
                        <label>37</label>
                        <mixed-citation publication-type="journal">Kim, Y., Park, I., &amp; Kang, M. (2012). Examining rater effects of the TGMD-2 on children with intellectual disability. Adapted Physical Activity Quarterly, 29(4), 346-365. https://doi.org/10.1123/apaq.29.4.346</mixed-citation>
                    </ref>
                                    <ref id="ref38">
                        <label>38</label>
                        <mixed-citation publication-type="journal">Knoch, U., Read, J., &amp; von Randow, T. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(2), 26-43. https://doi.org/10.1016/j.asw.2007.04.001</mixed-citation>
                    </ref>
                                    <ref id="ref39">
                        <label>39</label>
                        <mixed-citation publication-type="confproc">Kyllonen, P. C. (2012). Measurement of 21st century skills within the common core state standards. Paper presented at the Invitational Research Symposium on Technology Enhanced Assessments, USA, May 2012.</mixed-citation>
                    </ref>
                                    <ref id="ref40">
                        <label>40</label>
                        <mixed-citation publication-type="journal">La Velle, L. (2019). The theory–practice nexus in teacher education: New evidence for effective approaches. Journal of Education for Teaching, 45(4), 369-372. https://doi.org/10.1080/02607476.2019.1639267</mixed-citation>
                    </ref>
                                    <ref id="ref41">
                        <label>41</label>
                        <mixed-citation publication-type="journal">Lejk, M., &amp; Wyvill, M. (2001). The effect of the inclusion of self-assessment with peer-assessment of contributions to a group project: A quantitative study of secret and agreed assessments. Assessment and Evaluation in Higher Education, 26(6), 551-561. https://doi.org/10.1080/02602930120093887</mixed-citation>
                    </ref>
                                    <ref id="ref42">
                        <label>42</label>
                        <mixed-citation publication-type="journal">Leonard, D. K., &amp; Jiang, J. (1999). Gender bias and the college predictors of the SATs: A cry of despair. Research in Higher Education, 40(4), 375-407.</mixed-citation>
                    </ref>
                                    <ref id="ref43">
                        <label>43</label>
                        <mixed-citation publication-type="journal">Li, H., Xiong, Y., Hunter, C. V., Guo, X., &amp; Tywoniw, R. (2020). Does peer assessment promote student learning? A meta-analysis. Assessment and Evaluation in Higher Education, 45(2), 193-211. https://doi.org/10.1080/02602938.2019.1620679</mixed-citation>
                    </ref>
                                    <ref id="ref44">
                        <label>44</label>
                        <mixed-citation publication-type="book">Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.</mixed-citation>
                    </ref>
                                    <ref id="ref45">
                        <label>45</label>
                        <mixed-citation publication-type="software">Linacre, J. M. (2012). FACETS (Version 3.70.1) [Computer software]. https://www.winsteps.com/facgood.htm</mixed-citation>
                    </ref>
                                    <ref id="ref46">
                        <label>46</label>
                        <mixed-citation publication-type="software">Linacre, J. M. (2017). FACETS (Version 3.80.0) [Computer software]. https://www.winsteps.com/facgood.htm</mixed-citation>
                    </ref>
                                    <ref id="ref47">
                        <label>47</label>
                        <mixed-citation publication-type="confproc">Main, J. B., &amp; Sánchez-Peña, M. (2015). Student evaluations of team members: Is there gender bias? Paper presented at IEEE Frontiers in Education Conference (FIE), TX, USA, October 2015. https://doi.org/10.1109/FIE.2015.7344177</mixed-citation>
                    </ref>
                                    <ref id="ref48">
                        <label>48</label>
                        <mixed-citation publication-type="journal">Maslach, C., &amp; Jackson, S. E. (1981). The measurement of experienced burnout. Journal of Occupational Behavior, 2(2), 99–113. https://doi.org/10.1002/job.4030020205</mixed-citation>
                    </ref>
                                    <ref id="ref49">
                        <label>49</label>
                        <mixed-citation publication-type="book">Mertler, C. A. (2016). Classroom assessment: A practical guide for educators. Routledge.</mixed-citation>
                    </ref>
                                    <ref id="ref50">
                        <label>50</label>
                        <mixed-citation publication-type="report">Ministry of National Education (2017). Fen bilimleri dersi öğretim programı (ilkokul ve ortaokul 3, 4, 5, 6, 7 ve 8. sınıflar) [Science course curriculum (primary and secondary school 3rd, 4th, 5th, 6th, 7th and 8th grades)]. Ankara, Turkey.</mixed-citation>
                    </ref>
                                    <ref id="ref51">
                        <label>51</label>
                        <mixed-citation publication-type="journal">Mumpuni, K. E., Priyayi, D. F., &amp; Widoretno, S. (2022). How do students perform a peer assessment? International Journal of Instruction, 15(3), 751-766. https://doi.org/10.29333/iji.2022.15341a</mixed-citation>
                    </ref>
                                    <ref id="ref52">
                        <label>52</label>
                        <mixed-citation publication-type="journal">Myford, C. M., &amp; Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.</mixed-citation>
                    </ref>
                                    <ref id="ref53">
                        <label>53</label>
                        <mixed-citation publication-type="journal">Myford, C.M., &amp; Wolfe, E.W. (2000). Strengthening the ties that bind: Improving the linking network in sparsely connected rating designs. Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2000.tb01832.x</mixed-citation>
                    </ref>
                                    <ref id="ref54">
                        <label>54</label>
                        <mixed-citation publication-type="journal">Myford, C.M., &amp; Wolfe, E.W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.</mixed-citation>
                    </ref>
                                    <ref id="ref55">
                        <label>55</label>
                        <mixed-citation publication-type="journal">Nilsson, P. (2013). What do we know and where do we go? Formative assessment in developing student teachers’ professional learning of teaching science. Teachers and Teaching, 19(2), 188-201. https://doi.org/10.1080/13540602.2013.741838</mixed-citation>
                    </ref>
                                    <ref id="ref56">
                        <label>56</label>
                        <mixed-citation publication-type="journal">Oluwatayo, J. A., &amp; Adebule, S. O. (2012). Assessment of teaching performance of student-teachers on teaching practice. International Education Studies, 5(5), 109-115. https://doi.org/10.5539/ies.v5n5p109</mixed-citation>
                    </ref>
                                    <ref id="ref57">
                        <label>57</label>
                        <mixed-citation publication-type="journal">Osman, S. (2021). Basic school teachers’ assessment practices in the sissala east municipality, Ghana. European Journal of Education Studies, 8(7), 44-74. https://doi.org/10.46827/ejes.v8i7.3801</mixed-citation>
                    </ref>
                                    <ref id="ref58">
                        <label>58</label>
                        <mixed-citation publication-type="confproc">Palmer, K., &amp; Richardson, P. (2003). On-line assessment and free-response input: A pedagogic and technical model for squaring the circle. Paper presented at Proc. 7th CAA Conference, Loughborough, UK, December 2003.</mixed-citation>
                    </ref>
                                    <ref id="ref59">
                        <label>59</label>
                        <mixed-citation publication-type="journal">Popham, W. J. (2009). Assessment literacy for teachers: Faddish or fundamental? Theory Into Practice, 48(1), 4–11. https://doi.org/10.1080/00405840802577536</mixed-citation>
                    </ref>
                                    <ref id="ref60">
                        <label>60</label>
                        <mixed-citation publication-type="journal">Rahmawati, Y., Ridwan, A., Hadinugrahaningsih, T., &amp; Soeprijanto. (2019, January). Developing critical and creative thinking skills through STEAM integration in chemistry learning. In Journal of Physics: Conference Series (Vol. 1156, p. 012033). IOP Publishing.</mixed-citation>
                    </ref>
                                    <ref id="ref61">
                        <label>61</label>
                        <mixed-citation publication-type="journal">Sadler, P. M., &amp; Good, E. (2006). The impact of self- and peer-grading on student learning. Educational Assessment, 11(1), 1-31.</mixed-citation>
                    </ref>
                                    <ref id="ref62">
                        <label>62</label>
                        <mixed-citation publication-type="journal">Sari, D. K., Dinata, P. A. C., &amp; Uspayanti, R. (2022). The teachers’ competencies to develop assessments for high school students in Merauke. Ishlah: Jurnal Pendidikan, 14(3), 3199-3206. https://doi.org/10.35445/alishlah.v14i</mixed-citation>
                    </ref>
                                    <ref id="ref63">
                        <label>63</label>
                        <mixed-citation publication-type="journal">Sasmaz-Oren, F. (2012). The effects of gender and previous experience on the approach of self and peer assessment: A case from Turkey. Innovations in Education and Teaching International, 49(2), 123–133. https://doi.org/10.1080/14703297.2012.677598</mixed-citation>
                    </ref>
                                    <ref id="ref64">
                        <label>64</label>
                        <mixed-citation publication-type="book">Şata, M. (2022). Açık uçlu sorular [Open-ended questions]. In İ. Karakaya (Ed.), Açık uçlu soruların hazırlanması, uygulanması ve değerlendirilmesi [Preparation, administration, and evaluation of open-ended questions] (pp. 1–11). Pegem.</mixed-citation>
                    </ref>
                                    <ref id="ref65">
                        <label>65</label>
                        <mixed-citation publication-type="journal">Şata, M., &amp; Karakaya, İ. (2021). Investigating the effect of rater training on differential rater function in assessing academic writing skills of higher education students. Journal of Measurement and Evaluation in Education and Psychology, 12(2), 163–181. https://doi.org/10.21031/epod.842094</mixed-citation>
                    </ref>
                                    <ref id="ref66">
                        <label>66</label>
                        <mixed-citation publication-type="journal">Shen, B., &amp; Bai, B. (2019). Facilitating university teachers’ continuing professional development through peer-assisted research and implementation teamwork in China. Journal of Education for Teaching, 45(4), 476–480. https://doi.org/10.1080/02607476.2019.1639265</mixed-citation>
                    </ref>
                                    <ref id="ref67">
                        <label>67</label>
                        <mixed-citation publication-type="book">Soland, J., Hamilton, L. S., &amp; Stecher, B. M. (2013). Measuring 21st century competencies: Guidance for educators. RAND Corporation.</mixed-citation>
                    </ref>
                                    <ref id="ref68">
                        <label>68</label>
                        <mixed-citation publication-type="journal">Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4</mixed-citation>
                    </ref>
                                    <ref id="ref69">
                        <label>69</label>
                        <mixed-citation publication-type="journal">Takeda, S., &amp; Homberg, F. (2014). The effects of gender on group work process and achievement: An analysis through self- and peer-assessment. British Educational Research Journal, 40(2), 373–396. https://doi.org/10.1002/berj.3088</mixed-citation>
                    </ref>
                                    <ref id="ref70">
                        <label>70</label>
                        <mixed-citation publication-type="journal">Taş, U. E., Arıcı, Ö., Ozarkan, H. B., &amp; Özgürlük, B. (2016). PISA 2015 ulusal raporu [PISA 2015 national report]. Ankara, Turkey: Milli Eğitim Bakanlığı Yayınları.</mixed-citation>
                    </ref>
                                    <ref id="ref71">
                        <label>71</label>
                        <mixed-citation publication-type="journal">Torres-Guijarro, S., &amp; Bengoechea, M. (2017). Gender differential in self-assessment: A fact neglected in higher education peer and self-assessment techniques. Higher Education Research and Development, 36(5), 1072–1084. https://doi.org/10.1080/07294360.2016.1264372</mixed-citation>
                    </ref>
                                    <ref id="ref72">
                        <label>72</label>
                        <mixed-citation publication-type="journal">van Zundert, M., Sluijsmans, D., &amp; van Merriënboer, J. J. G. (2010). Effective peer assessment processes: Research findings and future directions. Learning and Instruction, 20(4), 270–279. https://doi.org/10.1016/j.learninstruc.2009.08.004</mixed-citation>
                    </ref>
                                    <ref id="ref73">
                        <label>73</label>
                        <mixed-citation publication-type="journal">Van Trieste, R. F. (1990). The relation between Puerto Rican university students’ attitudes toward Americans and the students’ achievement in English as a second language. Homines, (13–14), 94–112.</mixed-citation>
                    </ref>
                                    <ref id="ref74">
                        <label>74</label>
                        <mixed-citation publication-type="book">Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Harvard University Press.</mixed-citation>
                    </ref>
                                    <ref id="ref75">
                        <label>75</label>
                        <mixed-citation publication-type="journal">Wainer, H., &amp; Steinberg, L. S. (1992). Sex differences in performance on the mathematics section of the Scholastic Aptitude Test: A bidirectional validity study. Harvard Educational Review, 62(3), 323–336. https://doi.org/10.17763/haer.62.3.1p1555011301r133</mixed-citation>
                    </ref>
                                    <ref id="ref76">
                        <label>76</label>
                        <mixed-citation publication-type="journal">Wen, M. L., &amp; Tsai, C. (2008). Online peer assessment in an in-service science and mathematics teacher education course. Teaching in Higher Education, 13(1), 55–67. https://doi.org/10.1080/13562510701794050</mixed-citation>
                    </ref>
                                    <ref id="ref77">
                        <label>77</label>
                        <mixed-citation publication-type="journal">Wilson, F. R., Pan, W., &amp; Schumsky, D. A. (2012). Recalculation of the critical values for Lawshe’s content validity ratio. Measurement and Evaluation in Counseling and Development, 45(3), 197–210. https://doi.org/10.1177/07481756124402</mixed-citation>
                    </ref>
                                    <ref id="ref78">
                        <label>78</label>
                        <mixed-citation publication-type="journal">Winke, P., Gass, S., &amp; Myford, C. (2013). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231–252. https://doi.org/10.1177/0265532212456968</mixed-citation>
                    </ref>
                                    <ref id="ref79">
                        <label>79</label>
                        <mixed-citation publication-type="journal">Wolfe, E. W., &amp; McVay, A. (2012). Application of latent trait models to identify raters exhibiting score scale usage problems. Applied Measurement in Education, 25(2), 125–143.</mixed-citation>
                    </ref>
                                    <ref id="ref80">
                        <label>80</label>
                        <mixed-citation publication-type="journal">Yaz, Ö., &amp; Kurnaz, M. (2017). The examination of 2013 science curricula. International Journal of Turkish Education Science, 2017(8), 173–184.</mixed-citation>
                    </ref>
                                    <ref id="ref81">
                        <label>81</label>
                        <mixed-citation publication-type="journal">Yenen, E. T. (2021). Prospective teachers’ professional skill needs: A Q method analysis. Teacher Development, 25(2), 196–214. https://doi.org/10.1080/13664530.2021.1877188</mixed-citation>
                    </ref>
                                    <ref id="ref82">
                        <label>82</label>
                        <mixed-citation publication-type="journal">Yeşilçınar, S., &amp; Şata, M. (2021). Examining rater biases of peer assessors in different assessment environments. International Journal of Psychology and Educational Studies, 8(4), 136–151. https://izlik.org/JA45TW35KZ</mixed-citation>
                    </ref>
                            </ref-list>
                    </back>
    </article>
