Research Article

Expert judgment vs. psychometric analysis: Enhancing high school exam development in Albania

Year 2026, Volume: 13 Issue: 1, 108 - 122, 02.01.2026
https://doi.org/10.21449/ijate.1570567

Abstract

In Albania, high school graduates take State Matura exams that determine university entrance through a merit-based system. Since 2006, these pencil-and-paper exams have comprised three mandatory subjects: Albanian language and literature, Mathematics, and a foreign language, plus one elective exam chosen from a list of eight. Each exam totals 60 points: 20 points for multiple-choice items and 40 points for open-response items (structured and essay-type) covering Albanian literacy and a foreign language. Historically, item difficulty has been set primarily by expert judgment, with limited psychometric validation, and student proficiency has been computed via Classical Test Theory (CTT) on a 4-10 decimal scale. This study highlights the lack of psychometric analysis in Matura exams, emphasizing the need for improved assessment methods. Focusing on the limitations of relying solely on expert judgment in an era of technological innovation, we address the challenges posed by insufficient historical item parameters. To support expert judgment, we present a ShinyApp that integrates exam data to provide fast, transparent, and replicable psychometric insights. The tool demonstrates how technology can support evidence-based decision-making and contribute to modernizing Albania's e-assessment framework.

References

  • Bjorner, J.B., Chang, C.H., Thissen, D., & Reeve, B.B. (2007). Developing tailored instruments: Item banking and computerized adaptive assessment. Quality of Life Research, 16(Suppl 1), 95-108. https://doi.org/10.1007/s11136-007-9168-6
  • Brown, G.T.L. (2022). The past, present, and future of educational assessment: A transdisciplinary perspective. Frontiers in Education, 7. https://doi.org/10.3389/feduc.2022.1060633
  • Brownstein, N.C., Louis, T.A., O’Hagan, A., & Pendergast, J. (2019). The role of expert judgment in statistical inference and evidence-based decision making. The American Statistician, 73(1), 56-68. https://doi.org/10.1080/00031305.2018.1529623
  • Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
  • Ebel, R.L. (1972). Essentials of educational measurement (1st ed.). Prentice Hall.
  • Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement (5th ed.). Prentice Hall.
  • Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357-381. https://www2.hawaii.edu/~daniel/irtctt.pdf
  • Gjika, E., Basha, L., & Alizoti, A. (2025). TIA: A smart psychometric analysis app for Albanian high school exams framework. Smart Cities and Regional Development (SCRD) Preprints, 2(1). https://scrd.eu/index.php/scrd-pp/article/view/641
  • Firescu, A.M., Necula, M., Băsu, C.E., Ardelean, A., Roşu, M.M., Milea, E.C., & Păun, M. (2022). Investigation of the 2019-2021 National Evaluation exam: A case study of high school admissions in Bucharest. Proceedings of the International Conference on Business Excellence, 16(1), 59–81. https://doi.org/10.2478/picbe-2022-0008
  • Ke, C., Yingwei, W., Yajun, Y., & Runhua, L. (2010). Item-bank in language E-Assessment: Issues and perspectives. 5th International Conference on Computer Science & Education. Hefei, China. pp. 1657-1660. https://doi.org/10.1109/ICCSE.2010.5593611
  • Kreitchmann, R.S., Abad, F.J., Ponsoda, V., Nieto, M.D., & Morillo, D. (2019). Controlling for Response Biases in Self-Report Scales: Forced-Choice vs. Psychometric Modeling of Likert Items. Frontiers in Psychology, 10. https://doi.org/10.3389/fpsyg.2019.02309
  • Maghnouj, S., et al. (2020). OECD Reviews of Evaluation and Assessment in Education: Albania. OECD Publishing. https://doi.org/10.1787/d267dc93-en
  • McAlpine, M., & Hesketh, I. (2003). Multiple response questions: Allowing for chance in authentic assessments. In J. Christie (Ed.), 7th International CAA Conference. Loughborough University.
  • Metsämuuronen, J. (2018). Essentials of visual diagnosis of test items – Logical and pathological patterns in items to be detected. https://doi.org/10.13140/RG.2.2.13950.23364
  • Metsämuuronen, J. (2022). Essentials of visual diagnosis of test items: Logical, illogical, and anomalous patterns in test items to be detected. Practical Assessment, Research, and Evaluation, 27(1), 5. https://doi.org/10.7275/n0kf-ah40
  • MoES. (2019). Matura shtetërore: Gjuha shqipe dhe letërsia [State Matura: Albanian language and literature; exam guideline]. https://maturashteterore.com/wp-content/uploads/2020/03/gjuhe_shqipe.pdf
  • MoES. (2023). Urdhër i përbashkët për Maturën Shtetërore [Joint order on the State Matura; regulatory document]. https://arsimi.gov.al/wp-content/uploads/2025/05/URDHER-I-PERBASHKET.pdf
  • OECD (2024a, June 18). PISA 2022 Results (Volume III): Creative minds, creative schools. OECD Publishing. https://doi.org/10.1787/765ee8c2-en
  • OECD (2024b, March 1). PISA 2022 technical report. OECD Publishing. https://doi.org/10.1787/01820d6d-en
  • Oermann, M.H., & Gaberson, K. (2021). Evaluation and testing in nursing education. Springer Publishing.
  • Revelle, W. (2024). psych: Procedures for psychological, psychometric, and personality research (Version 2.3.12) [R package manual]. https://cran.r-project.org/web/packages/psych/psych.pdf
  • Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
  • Schwarz, R., Bulut, H.C., & Anifowose, C. (2023). A data pipeline for e-large-scale assessments: Better automation, quality assurance, and efficiency. International Journal of Assessment Tools in Education, 10(Special Issue). https://doi.org/10.21449/ijate.1321061
  • Schwarz, R., & Gjika, E. (2023). Modernized psychometric analysis for digital assessments: An open-source approach for automation, quality assurance, and efficiency. The 48th International Association for Educational Assessment Annual Conference. https://iaea2023.org/
  • Shahini, A. (2021). Inequalities in Albanian education: Evidence from large-scale assessment studies. Kultura i Edukacja, 134(4), 40–70. https://doi.org/10.15804/kie.2021.04.03
  • Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107–120. https://doi.org/10.1007/s11336-008-9101-0
  • Suto, I., Crisp, V., & Greatorex, J. (2008). Investigating the judgemental marking process: an overview of our recent research. https://doi.org/10.17863/CAM.100464
  • Ten Berge, J.M.F., & Sočan, G. (2004). The greatest lower bound to the reliability of a test and the hypothesis of unidimensionality. Psychometrika, 69(4), 613-625. https://doi.org/10.1007/bf02289858
  • Upadhyah, A.A., Maheria, P.B., & Patel, J.R. (2019). Analysis of one best MCQs in five pre-university physiology examinations. International Journal of Physiology, 7(4), 10-15. https://doi.org/10.37506/ijop.v7i4.47
  • Warburton, W.I., & Conole, G.C. (2003). Key findings from recent literature on computer-aided assessment. ALT-C 2003, Sheffield, 08-10 Sep 2003. https://eprints.soton.ac.uk/14113/
  • Wonde, S.G., Tadesse, T., Moges, B., & Schauber, S.K. (2024). Experts’ prediction of item difficulty of multiple-choice questions in the Ethiopian undergraduate medicine licensure examination. BMC Medical Education, 24(1). https://doi.org/10.1186/s12909-024-06012-x
  • Wright, B.D., & Bell, S.R. (1984). Item Banks: What, Why, How. Journal of Educational Measurement, 21(4), 331–345. http://www.jstor.org/stable/1434585

There are 32 citations in total.

Details

Primary Language English
Subjects National and International Success Comparisons
Journal Section Research Article
Authors

Eralda Gjika 0000-0003-2662-4316

Lule Basha 0000-0003-3790-601X

Afërdita Alizoti 0009-0005-2952-8460

Joao Paulo Lessa 0000-0003-0751-7662

Submission Date October 21, 2024
Acceptance Date September 23, 2025
Publication Date January 2, 2026
Published in Issue Year 2026 Volume: 13 Issue: 1

Cite

APA Gjika, E., Basha, L., Alizoti, A., & Lessa, J. P. (2026). Expert judgment vs. psychometric analysis: Enhancing high school exam development in Albania. International Journal of Assessment Tools in Education, 13(1), 108-122. https://doi.org/10.21449/ijate.1570567
