Research Article

Preserving Rare Tasks in the KDD Cup 2010 Educational Dataset: A Task-Aware Coverage-Based Sampling Method

Volume: 15 Number: 1 February 1, 2026

Abstract

Large-scale educational datasets inevitably require data reduction due to the sheer volume of student–task interactions they contain. However, most existing data reduction and sampling strategies focus on preserving global data distributions or label-level class balance, while largely overlooking the structural representation of rare yet pedagogically critical tasks. This limitation often leads to unreliable predictions and poor generalization performance, particularly for sparsely observed learning objectives. In this study, we propose a Task-Aware Coverage Sampling framework designed to explicitly preserve the structural coverage of rare tasks under aggressive data reduction. The proposed approach identifies rare tasks using task-specific statistics and constructs compact yet representative training subsets by enforcing structural coverage independently within each task. Unlike random and stratified sampling methods, the framework prioritizes task-level representativeness rather than global class balance. We further compare the proposed method against a geometry-based farthest-first sampling strategy, which promotes global diversity in the feature space but does not explicitly account for task structure. The method is evaluated on the Algebra I 2008–2009 dataset from the KDD Cup 2010 Educational Data Mining Challenge, which contains over 20 million student–task interactions. From this large-scale corpus, we identify 5,171 interactions associated with rare knowledge components. Notably, the proposed approach models rare tasks effectively in representative rare-task configurations using as few as 22 samples, corresponding to less than 0.5% of the task-specific data, without compromising predictive stability. Experiments are conducted using logistic regression, random forests, and linear support vector machines, with the area under the precision–recall curve as the primary evaluation metric.
The results show that, even when less than one percent of the original task-level training data is retained, the proposed method achieves competitive average performance and exhibits substantially greater stability in worst-case task scenarios compared to random, stratified, and geometry-based sampling baselines. These findings demonstrate that reliable learning on rare educational tasks can be achieved in large-scale educational datasets without requiring exhaustive access to the full training data.
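
The idea of enforcing coverage independently within each rare task can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not specify the rarity statistic or the coverage rule, so the sketch assumes that a task is rare when it has at most `rare_threshold` rows, and it swaps in greedy farthest-first selection, applied separately inside each task, as a stand-in for structural coverage. The function names, the parameters, and the row layout are all hypothetical.

```python
import math
from collections import defaultdict


def farthest_first(points, k, start=0):
    """Greedy farthest-first selection; returns indices into `points`."""
    k = min(k, len(points))
    if k == 0:
        return []
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        if nearest[nxt] == 0.0:
            break  # every remaining point duplicates a chosen one
        chosen.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return chosen


def task_aware_sample(rows, budget_per_task, rare_threshold):
    """Select up to `budget_per_task` rows from each rare task.

    rows: list of (task_id, feature_vector) pairs.
    Assumption: a task counts as rare when it has at most
    `rare_threshold` rows; the source does not state its criterion.
    """
    by_task = defaultdict(list)
    for idx, (task, _) in enumerate(rows):
        by_task[task].append(idx)

    selected = []
    for idxs in by_task.values():
        if len(idxs) > rare_threshold:
            continue  # frequent tasks are out of scope for this sketch
        feats = [rows[i][1] for i in idxs]
        # Coverage is enforced inside the task, not over the whole dataset.
        selected.extend(idxs[j] for j in farthest_first(feats, budget_per_task))
    return sorted(selected)
```

Because selection runs inside each task, a small task cannot be crowded out by points from larger tasks, which is the contrast the abstract draws with a globally applied farthest-first baseline.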

Details

Primary Language

English

Subjects

Educational Technology and Computing

Journal Section

Research Article

Publication Date

February 1, 2026

Submission Date

January 2, 2026

Acceptance Date

January 17, 2026

Published in Issue

Year 2026 Volume: 15 Number: 1

APA
Güldoğan Dericioğlu, Y. (2026). Preserving Rare Tasks in the KDD Cup 2010 Educational Dataset: A Task-Aware Coverage-Based Sampling Method. Bartın University Journal of Faculty of Education, 15(1), 262-276. https://doi.org/10.14686/buefad.1854967

All articles published in the journal are open access and distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.

Bartın University Journal of Faculty of Education