Research Article

Expert judgment vs. psychometric analysis: Enhancing high school exam development in Albania

Year 2026, Volume: 13 Issue: 1, 108 - 122, 02.01.2026
https://doi.org/10.21449/ijate.1570567

Abstract

In Albania, high school graduates take State Matura exams that determine university entrance through a merit-based system. Since 2006, these pencil-and-paper exams have comprised three mandatory subjects: Albanian language and literature, Mathematics, and a foreign language, plus one elective exam chosen from a list of eight. Each exam totals 60 points: 20 points for multiple-choice items and 40 points for open-response items (structured and essay-type) covering Albanian literacy and a foreign language. Historically, item difficulty has been set primarily by expert judgment, with limited psychometric validation, and student proficiency has been computed via Classical Test Theory (CTT) on a 4-10 decimal scale. This study highlights the lack of psychometric analysis in Matura exams, emphasizing the need for improved assessment methods. Focusing on the limitations of relying solely on expert judgment in an era of technological innovation, we address the challenges posed by insufficient historical item parameters. To support expert judgment, we present a ShinyApp that integrates exam data to provide fast, transparent, and replicable psychometric insights. The tool demonstrates how technology can support evidence-based decision-making and contribute to modernizing Albania's e-assessment framework.

References

  • Bjorner, J.B., Chang, C.H., Thissen, D., & Reeve, B.B. (2007). Developing tailored instruments: Item banking and computerized adaptive assessment. Quality of Life Research, 16(Suppl 1), 95-108. https://doi.org/10.1007/s11136-007-9168-6
  • Brown, G.T.L. (2022). The past, present, and future of educational assessment: A transdisciplinary perspective. Frontiers in Education, 7. https://doi.org/10.3389/feduc.2022.1060633
  • Brownstein, N.C., Louis, T.A., O’Hagan, A., & Pendergast, J. (2019). The role of expert judgment in statistical inference and evidence-based decision making. The American Statistician, 73(1), 56-68. https://doi.org/10.1080/00031305.2018.1529623
  • Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
  • Ebel, R.L. (1972). Essentials of educational measurement (1st ed.). Prentice Hall.
  • Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement (5th ed.). Prentice Hall.
  • Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357-381. https://www2.hawaii.edu/~daniel/irtctt.pdf
  • Gjika, E., Basha, L., & Alizoti, A. (2025). TIA: A smart psychometric analysis app for Albanian high school exams framework. Smart Cities and Regional Development (SCRD) Preprints, 2(1). https://scrd.eu/index.php/scrd-pp/article/view/641
  • Firescu, A.M., Necula, M., Băsu, C.E., Ardelean, A., Roşu, M.M., Milea, E.C., & Păun, M. (2022). Investigation of the 2019-2021 National Evaluation exam: A case study of high school admissions in Bucharest. Proceedings of the International Conference on Business Excellence, 16(1), 59–81. https://doi.org/10.2478/picbe-2022-0008
  • Ke, C., Yingwei, W., Yajun, Y., & Runhua, L. (2010). Item-bank in language E-Assessment: Issues and perspectives. 5th International Conference on Computer Science & Education. Hefei, China. pp. 1657-1660. https://doi.org/10.1109/ICCSE.2010.5593611
  • Kreitchmann, R.S., Abad, F.J., Ponsoda, V., Nieto, M.D., & Morillo, D. (2019). Controlling for Response Biases in Self-Report Scales: Forced-Choice vs. Psychometric Modeling of Likert Items. Frontiers in Psychology, 10. https://doi.org/10.3389/fpsyg.2019.02309
  • Maghnouj, S., et al. (2020). OECD Reviews of Evaluation and Assessment in Education: Albania. OECD Publishing. https://doi.org/10.1787/d267dc93-en
  • McAlpine, M., & Hesketh, I. (2003). Multiple response questions: Allowing for chance in authentic assessments. In J. Christie (Ed.), 7th International CAA Conference. Loughborough University.
  • Metsämuuronen, J. (2018). Essentials of visual diagnosis of test items – Logical and pathological patterns in items to be detected. https://doi.org/10.13140/RG.2.2.13950.23364
  • Metsämuuronen, J. (2022). Essentials of visual diagnosis of test items: Logical, illogical, and anomalous patterns in test items to be detected. Practical Assessment, Research, and Evaluation, 27(1), 5. https://doi.org/10.7275/n0kf-ah40
  • MoES. (2019). Matura shtetërore: Gjuha shqipe dhe letërsia [State Matura: Albanian language and literature; exam guideline]. https://maturashteterore.com/wp-content/uploads/2020/03/gjuhe_shqipe.pdf
  • MoES. (2023). Urdhër i përbashkët për Maturën Shtetërore [Joint order on the State Matura; regulatory document]. https://arsimi.gov.al/wp-content/uploads/2025/05/URDHER-I-PERBASHKET.pdf
  • OECD (2024a, June 18). PISA 2022 Results (Volume III): Creative minds, creative schools. OECD Publishing. https://doi.org/10.1787/765ee8c2-en
  • OECD (2024b, March 1). PISA 2022 technical report. OECD Publishing. https://doi.org/10.1787/01820d6d-en
  • Oermann, M.H., & Gaberson, K. (2021). Evaluation and testing in nursing education. Springer Publishing.
  • Revelle, W. (2024). psych: Procedures for psychological, psychometric, and personality research (Version 2.3.12) [R package manual]. https://cran.r-project.org/web/packages/psych/psych.pdf
  • Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
  • Schwarz, R., Bulut, H.C., & Anifowose, C. (2023). A data pipeline for e-large-scale assessments: Better automation, quality assurance, and efficiency. International Journal of Assessment Tools in Education, 10(Special Issue). https://doi.org/10.21449/ijate.1321061
  • Schwarz, R., & Gjika, E. (2023). Modernized psychometric analysis for digital assessments: An open-source approach for automation, quality assurance, and efficiency. The 48th International Association for Educational Assessment Annual Conference. https://iaea2023.org/
  • Shahini, A. (2021). Inequalities in Albanian education: Evidence from large-scale assessment studies. Kultura i Edukacja, 134(4), 40–70. https://doi.org/10.15804/kie.2021.04.03
  • Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107–120. https://doi.org/10.1007/s11336-008-9101-0
  • Suto, I., Crisp, V., & Greatorex, J. (2008). Investigating the judgemental marking process: an overview of our recent research. https://doi.org/10.17863/CAM.100464
  • Ten Berge, J.M.F., & Sočan, G. (2004). The greatest lower bound to the reliability of a test and the hypothesis of unidimensionality. Psychometrika, 69(4), 613-625. https://doi.org/10.1007/bf02289858
  • Upadhyah, A.A., Maheria, P.B., & Patel, J.R. (2019). Analysis of one best MCQs in five pre-university physiology examinations. International Journal of Physiology, 7(4), 10-15. https://doi.org/10.37506/ijop.v7i4.47
  • Warburton, W.I., & Conole, G.C. (2003). Key findings from recent literature on computer-aided assessment. ALT-C 2003, Sheffield, 08-10 Sep 2003. https://eprints.soton.ac.uk/14113/
  • Wonde, S.G., Tadesse, T., Moges, B., & Schauber, S.K. (2024). Experts’ prediction of item difficulty of multiple-choice questions in the Ethiopian undergraduate medicine licensure examination. BMC Medical Education, 24(1). https://doi.org/10.1186/s12909-024-06012-x
  • Wright, B.D., & Bell, S.R. (1984). Item Banks: What, Why, How. Journal of Educational Measurement, 21(4), 331–345. http://www.jstor.org/stable/1434585

There are 32 citations in total.

Details

Primary Language English
Subjects National and International Success Comparisons
Journal Section Research Article
Authors

Eralda Gjika 0000-0003-2662-4316

Lule Basha 0000-0003-3790-601X

Afërdita Alizoti 0009-0005-2952-8460

Joao Paulo Lessa 0000-0003-0751-7662

Submission Date October 21, 2024
Acceptance Date September 23, 2025
Publication Date January 2, 2026
Published in Issue Year 2026 Volume: 13 Issue: 1

Cite

APA Gjika, E., Basha, L., Alizoti, A., & Lessa, J. P. (2026). Expert judgment vs. psychometric analysis: Enhancing high school exam development in Albania. International Journal of Assessment Tools in Education, 13(1), 108-122. https://doi.org/10.21449/ijate.1570567
