Integrating Metadiscourse Analysis with Transformer-Based Models for Enhancing Construct Representation and Discourse Competence Assessment in L2 Writing: A Systemic Multidisciplinary Approach

Sathena Chan; Manoranjan Sathyamurthy; Chihiro Inoue; Michael Bax; Johnathan Jones; John Oyekan

doi:10.21031/epod.1531269

Research Article

Year 2024, Volume: 15 Issue: Special Issue, 318 - 347, 30.12.2024

Abstract

In recent years, large-scale language test providers have developed or adapted automated essay scoring systems (AESS) to score L2 writing. While the benefits of AESS are clear, these systems have limitations, such as over-reliance on frequency counts of vocabulary and grammar variables. Discourse competence is an important aspect of L2 writing that has yet to be fully explored in automated essay evaluation (AEE) applications. Evidence of discourse competence can be seen in the use of metadiscourse markers (MDMs) to produce reader-friendly texts. This article presents a multidisciplinary study exploring the feasibility of expanding the construct representation of automated scoring models to assess discourse competence in L2 writing. Combining machine learning, automated textual analysis and corpus-linguistic methods to examine 2,000 scripts across two tasks and five proficiency levels, the study investigates (1) whether, in addition to frequency and range, the accuracy of MDM use is worth pursuing as a predictive feature in L2 writing, and (2) how the identification and classification of MDM use might feed into the development of an automated scoring model using machine learning techniques. The contributions of this study are threefold. First, it offers insights within the context of Explainable AI: by integrating MDM usage and accuracy into the scoring framework, it moves beyond frequency-based evaluation. Second, it advances the current understanding of L2 writing development by showing that even lower-proficiency learners exhibit evidence of discourse competence through their accurate use of MDMs and their choice of MDMs in response to genre. Third, from the perspective of expanding construct representation in automated scoring systems, it critically examines the limitations of many AEE models, which have relied heavily on vocabulary and grammar features. By exploring the feasibility of incorporating MDMs as predictive features, this research demonstrates the potential for construct expansion in L2 AEE. The results could support test providers in developing competence tests in various contexts and domains, including manufacturing and medicine.

Keywords

L2 Writing, Metadiscourse Markers, Automated Essay Scoring, Large Language Models

Details

Primary Language	English
Subjects	Testing, Assessment and Psychometrics (Other)
Journal Section	Articles
Authors	Sathena Chan (ORCID 0000-0002-7852-6737); Manoranjan Sathyamurthy (ORCID 0000-0001-8928-2689); Chihiro Inoue (ORCID 0000-0003-1927-6923); Michael Bax (ORCID 0000-0002-2753-1990); Johnathan Jones (ORCID 0000-0003-4158-7971); John Oyekan (ORCID 0000-0001-6578-9928)
Publication Date	December 30, 2024
Submission Date	August 13, 2024
Acceptance Date	December 16, 2024
Published in Issue	Year 2024 Volume: 15 Issue: Special Issue

Cite

APA	Chan, S., Sathyamurthy, M., Inoue, C., Bax, M., et al. (2024). Integrating Metadiscourse Analysis with Transformer-Based Models for Enhancing Construct Representation and Discourse Competence Assessment in L2 Writing: A Systemic Multidisciplinary Approach. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 318-347. https://doi.org/10.21031/epod.1531269
