Integrating Metadiscourse Analysis with Transformer-Based Models for Enhancing Construct Representation and Discourse Competence Assessment in L2 Writing: A Systemic Multidisciplinary Approach
Year: 2024, Volume: 15, Issue: Special Issue, pp. 318–347, 30.12.2024
Sathena Chan, Manoranjan Sathyamurthy, Chihiro Inoue, Michael Bax, Johnathan Jones, John Oyekan
Abstract
In recent years, large-scale language test providers have developed or adapted automated essay scoring systems (AESS) to score L2 writing essays. While the benefits of using AESS are clear, they are not without limitations, such as an over-reliance on frequency counts of vocabulary and grammar variables. Discourse competence is one important aspect of L2 writing yet to be fully explored in automated essay evaluation (AEE) applications. Evidence of discourse competence can be seen in the use of metadiscourse markers (MDMs) to produce reader-friendly texts. This article presents a multidisciplinary study exploring the feasibility of expanding the construct representation of automated scoring models to assess discourse competence in L2 writing. Combining machine learning, automated textual analysis and corpus-linguistic methods to examine 2,000 scripts across two tasks and five proficiency levels, the study investigates (1) whether, in addition to frequency and range, the accuracy of MDM use is worth pursuing as a predictive feature in L2 writing, and (2) how the identification and classification of MDM use might feed into an automated scoring model built with machine learning techniques. The contributions of this study are three-fold. First, it offers valuable insights within the context of Explainable AI: by integrating MDM usage and accuracy into the scoring framework, the research moves beyond frequency-based evaluation. Second, it contributes to the current understanding of L2 writing development by showing that even lower-proficiency learners exhibit evidence of discourse competence, both in their accurate use of MDMs and in their choice of MDMs in response to genre. Third, from the perspective of expanding construct representation in automated scoring systems, it provides a critical examination of the limitations of many AEE models, which have relied heavily on vocabulary and grammar features; by exploring the feasibility of incorporating MDMs as predictive features, it demonstrates the potential for construct expansion in L2 AEE. The results would support test providers in developing competence tests in a range of contexts and domains, including manufacturing and medicine.
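The abstract outlines a pipeline in which the identification and classification of MDM use feed features such as frequency, range and accuracy into a machine-learning scoring model. The sketch below is a minimal illustration of that idea only, not the study's actual method: it identifies markers with a tiny hypothetical marker list, derives frequency and range features, and fits a standard off-the-shelf classifier (a random forest) to toy CEFR-like labels. The accuracy dimension, which would require appropriateness judgements, is omitted here.

```python
# Hypothetical sketch: marker-based features feeding a scoring classifier.
# The marker list, scripts, labels and model choice are illustrative assumptions,
# not the study's actual inventory, data or pipeline.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny illustrative marker inventory (real inventories are far larger and
# categorised by discourse function).
MDM_LIST = ["however", "therefore", "in addition", "for example", "in my opinion"]

def mdm_features(text: str) -> list:
    """Return [MDM frequency per 100 words, MDM range (distinct marker types)]."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    counts = [len(re.findall(r"\b" + re.escape(m) + r"\b", lowered)) for m in MDM_LIST]
    freq = 100.0 * sum(counts) / max(len(words), 1)
    marker_range = float(sum(1 for c in counts if c > 0))
    return [freq, marker_range]

# Toy (script, CEFR-like band) pairs, purely for demonstration.
scripts = [
    ("I like my town. It is big.", "A2"),
    ("However, the city is crowded. In addition, housing is expensive.", "B1"),
    ("In my opinion, remote work helps; for example, commuting time falls. "
     "Therefore, firms should allow it.", "B2"),
    ("The weather is nice. I go to school every day.", "A2"),
]
X = np.array([mdm_features(text) for text, _ in scripts])
y = [level for _, level in scripts]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.predict(np.array([mdm_features("However, in my opinion the plan is risky.")])))
```

The study itself, as the title indicates, also draws on transformer-based models and analyses 2,000 scripts across two tasks and five proficiency levels; the sketch above only shows how marker-derived features can be plugged into a conventional classifier.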