Research Article
Year: 2023, Volume: 10, Issue: 1, Pages: 251-274, 30.01.2023




Examining the Achievement Test Development Process in the Educational Studies



A review of the literature shows that the achievement test development process has been examined mainly in dissertations. A form that sheds light on the steps of developing an achievement test is also expected to guide those who will administer the test. Accordingly, the current study aims to create an “Achievement Test Development Process Control Form” and to examine mathematics achievement tests using this form. Document analysis, a qualitative research method, was used, and the data were analyzed descriptively. Within the scope of the research, 1683 articles published in designated journals between 2015 and 2020 were reviewed. A mathematics achievement test was developed in 39 of these articles, and these 39 articles were coded on the control form. The articles were examined in terms of the item types used in the tests, the theory or practice on which test development was based, the use of rubrics for open-ended items, the number of items in the pilot and final forms, the features of the test form and of the table of specifications, the features of the item pool, the evaluation of the pilot testing, the evaluation of the actual study, test validity and reliability, and the setting in which the tests were administered. The findings show that, in most articles, an item pool was not prepared; the pilot application was either not conducted or not reported; item analysis was not performed even when a pilot was conducted; test forms or sample items were not included; and there were some deficiencies regarding validity. On the other hand, most articles specified the purpose of the test and reported a reliability coefficient. In light of these findings, suggestions are offered for test developers and for those who will administer these tests.


  • Acar-Güvendir, M., & Özer-Özkan, Y. (2015). The examination of scale development and scale adaptation articles published in Turkish academic journals on education. Electronic Journal of Social Sciences, 14(52), 23-33. doi: 10.17755/esosder.54872
  • AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  • Boyraz, C. (2018). Investigation of achievement tests used in doctoral dissertations department of primary education (2012-2017). Inonu University Journal of the Faculty of Education, 19(3), 14-28. doi: 10.17679/inuefd.327321
  • Boztunç-Öztürk, N. B., Eroğlu, M. G., & Kelecioğlu, H. (2015). A review of articles concerning scale adaptation in the field of education. Education and Science, 40(178), 123-137. doi: 10.15390/EB.2015.4091
  • Brookhart, S. M. (2018). Appropriate criteria: Key to effective rubrics. Frontiers in Education, 3(22), 1-12. doi: 10.3389/feduc.2018.00022
  • Büyükkıdık, S. (2012). Comparison of interrater reliability based on the classical test theory and generalizability theory in problem solving skills assessment. (Published master thesis). Hacettepe University, Ankara.
  • Crocker, L., & Algina, J. (2006). Introduction to classical and modern test theory. Mason, OH: Cengage Learning.
  • Cronbach, L. J. (1990). Essentials of psychological testing (5. ed.). New York, NY: Harper & Row Publishers Inc.
  • Çelen, Ü. (2008). Comparison of validity and reliability of two tests developed by classical test theory and item response theory. Elementary Education Online, 7(3), 758-768. Retrieved from
  • Çelen, Ü., & Aybek, E. C. (2013). Öğrenci başarısının öğretmen yapımı bir testle klasik test kuramı ve madde tepki kuramı yöntemleriyle elde edilen puanlara göre karşılaştırılması. Journal of Measurement and Evaluation in Education and Psychology, 4(2), 64-75. Retrieved from
  • Çetin, B. (2019). Test geliştirme. B. Çetin (Ed.). In Eğitimde ölçme ve değerlendirme [Measurement and assessment in education] (p. 105-126). Ankara: Anı Publishing.
  • Çüm, S., & Koç, N. (2013). The review of scale development and adaptation studies which have been published in psychology and education journals in Turkey. Journal of Educational Sciences & Practices, 12(24), 115-135. Retrieved from
  • de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: The Guilford Press.
  • Delice, A., & Ergene, Ö. (2015). Investigation of scale development and adaptation studies: An example of mathematics education articles. Karaelmas Journal of Educational Sciences, 3(1), 60-75. Retrieved from
  • DeMars, C. (2010). Item response theory. New York: Oxford University Press.
  • Doğan, N., & Kılıç, A. F. (2017). Madde tepki kuramı yetenek ve madde parametre kestirimlerinin değişmezliğinin incelenmesi. Ö. Demirel and S. Dinçer (Eds.). In Küreselleşen dünyada eğitim [Education in a globalizing world] (p. 298-314). Ankara: Pegem Academy. doi: 10.14527/9786053188407.21
  • Downing, S. M., & Haladyna, T. M. (2011). Handbook of test development. New Jersey, NJ: Lawrence Erlbaum Associates Publishers.
  • Enago (2021). Why is a pilot study important in research?. Retrieved from
  • Ergene, Ö. (2020). Scale development and adaptation articles in the field of mathematics education: Descriptive content analysis. Journal of Education for Life, 34(2), 360-383. doi:10.33308/26674874.2020342207
  • Evrekli, E., İnel, D. , Deniş, H., & Balım, A. G. (2011). Methodological and statistical problems in graduate theses in the field of science education. Elementary Education Online, 10(1), 206-218. Retrieved from
  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3. ed.). New Jersey, NJ: Lawrence Erlbaum Associates Publishers.
  • Goodrich Andrade, H. (2000). Using rubrics to promote thinking and learning. Educational Leadership, 57(5), 13-18. Retrieved from
  • Goodrich Andrade, H. (2001). The effects of instructional rubrics on learning to write. Current Issues in Education, 4(4), 1-22. Retrieved from
  • Goodrich Andrade, H. (2005). Teaching with rubrics: The good, the bad, and the ugly. College Teaching, 53(1), 27-31. doi: 10.3200/CTCH.53.1.27-31
  • Hambleton, R. K., & Swaminathan, H. (1985). Item response theory. Principles and Applications. Dordrecht, The Netherlands: Kluwer-Nijhoff Publishing Co.
  • Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). California, CA: Sage.
  • Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The use of holistic versus analytic scoring for large-scale assessment of writing. The Canadian Journal of Program Evaluation, 11(2), 61-85. Retrieved from
  • Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130-144. doi: 10.1016/j.edurev.2007.05.002
  • Karadağ, E. (2011). Instruments used in doctoral dissertations in educational sciences in Turkey: Quality of research and analytical errors. Educational Sciences: Theory & Practice, 11(1), 311-334. Retrieved from
  • Lane, S., Raymond, M. R., & Haladyna, T. M. (2016). Handbook of test development (2. ed.). New York, NY: Routledge.
  • Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Menlo Park, CA: Addison-Wesley.
  • Mertler, C.A. (2000). Designing scoring rubrics for your classroom. Practical Assessment, Research, and Evaluation, 7(25), 1-8. doi: 10.7275/gcy8-0w24
  • Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749. doi:10.1037/0003-066x.50.9.741
  • Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook (2. ed.). Thousand Oaks, CA: Sage.
  • Mor-Dirlik, E. (2014). Ölçek geliştirme konulu doktora tezlerinin test ve ölçek geliştirme standartlarına uygunluğunun incelenmesi. Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi, 5(2), 62-78. doi: 10.21031/epod.63138
  • Mor Dirlik, E. (2021). Farklı test kuramlarından hesaplanan madde ayırt edicilik parametrelerinin karşılaştırılması. Trakya Eğitim Dergisi. 11(2), 732-744. doi: 10.24315/tred.700445
  • Moskal, B. M. (2000). Scoring rubrics: What, when and how? Practical Assessment, Research, and Evaluation, 7(3), 1-5. doi: 10.7275/a5vq-7q66
  • Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: validity and reliability. Practical Assessment, Research, and Evaluation, 7(4), 1-22. doi: 10.7275/q7rm-gg74
  • Mutluer, C., & Yandı, A. (2012, September). Türkiye’deki üniversitelerde 2010-2012 yılları arasında yayımlanan tezlerdeki başarı testlerin incelenmesi. Paper presented at the Eğitimde ve Psikolojide Ölçme ve Değerlendirme III. Ulusal Kongresi, Turkey: Bolu. Abstract retrieved from
  • Olgun, G., & Alatlı, B. (2021). The review of scale development and adaptation studies published for adolescents in Turkey. The Journal of Turkish Educational Sciences, 19(1), 568-592. doi: 10.37217/tebd.849954
  • Öksüzoğlu, M. (2022). The investigation of items measuring high-level thinking skills in terms of student score and score reliability. (Unpublished master thesis). Hacettepe University, Ankara.
  • Özçelik, D. A. (1992). Ölçme ve değerlendirme [Measurement and assessment]. Ankara: ÖSYM Publ.
  • Reznitskaya, A., Kuo, L., Glina, M., & Anderson, R. C. (2009). Measuring argumentative reasoning: What’s behind the numbers? Learning and Individual Differences, 19(2), 219-224. doi: 10.1016/j.lindif.2008.11.001
  • Şanlı, E. (2010). Comparing reliability levels of scoring of the holistic and analytic rubrics in evaluating the scientific process skills. (Unpublished master thesis). Ankara University, Ankara.
  • Şahin, M. G. (2019). Performansa dayalı değerlendirme. B. Çetin (Ed.). In Eğitimde ölçme ve değerlendirme [Measurement and assessment in education] (p. 213-264). Ankara: Anı Publ.
  • Şahin, M. G., & Boztunç-Öztürk, N. (2018). Scale development process in educational field: A content analysis research. Kastamonu Education Journal, 26(1), 191-199. doi: 10.24106/kefdergi.375863
  • Tindal, G., & Haladyna, T. M. (2012). Large-scale assessment programs for all students: Validity, technical adequacy, and implementation. Mahwah, New Jersey: Lawrence Erlbaum.
  • Turgut, F. (1992). Eğitimde ölçme ve değerlendirme [Measurement and assessment in education] (8. ed.). Ankara: Saydam Publ.
  • Yıldırım, A., & Şimşek, H. (2013). Sosyal Bilimlerde Nitel Araştırma Yöntemleri [Qualitative Research Methods in Social Sciences] (9. ed.). Ankara: Seçkin Publ.
  • Yıldıztekin, B. (2014). The comparison of interrater reliability by using estimating techniques in classical test theory and generalizability theory. (Unpublished master thesis). Hacettepe University, Ankara.
There are 50 citations in total.


Primary Language: English
Subjects: Other Fields of Education
Journal Section: Research Articles

Melek Gülşah Şahin (ORCID: 0000-0001-5139-9777)

Yıldız Yıldırım (ORCID: 0000-0001-8434-5062)

Nagihan Boztunc Öztürk (ORCID: 0000-0002-2777-5311)

Publication Date: January 30, 2023
Acceptance Date: December 18, 2022
Published in Issue: Year 2023, Volume: 10, Issue: 1


APA: Şahin, M. G., Yıldırım, Y., & Boztunc Öztürk, N. (2023). Examining the Achievement Test Development Process in the Educational Studies. Participatory Educational Research, 10(1), 251-274.