Research Article

Creating a Parallel Corpora for Turkish-English Academic Translations

Volume: IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium, Issue: Special, 20 October 2021

Abstract

Parallel corpora are data sets that pair sentences with the same meaning in different languages. The size and quality of the parallel corpora used for training are among the most important factors determining the quality of a machine translation system. Such data are generally insufficient for the Turkish-English language pair. In this study, a large parallel corpus was created for academic translation between Turkish and English. The data set was built from the abstracts of postgraduate theses. The best-matching sentence pairs were obtained with sentence alignment algorithms such as Vecalign and Hunalign. As a result, 1M parallel sentence pairs were obtained. In addition, a Bi-LSTM-based translation system was built to measure the quality of the resulting data. Evaluated zero-shot on the TED (Tr-En) test set, the model achieved a BLEU score of 15.8.
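
To make the reported metric concrete, the following is a minimal sketch of corpus-level BLEU in plain Python. It is an illustration only, not the authors' evaluation script: it assumes a single reference per sentence, whitespace tokenization, n-grams up to length 4, and no smoothing, and the function names `ngram_counts` and `corpus_bleu` are hypothetical. Scores from standard evaluation tools may differ because of tokenization and smoothing choices.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis.

    Assumptions of this sketch: whitespace tokenization, no smoothing.
    """
    matches = [0] * max_n   # clipped n-gram matches, per n-gram order
    totals = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any zero precision makes the geometric mean zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on"]
    refs = ["the cat sat on the mat"]
    print(round(corpus_bleu(hyps, refs), 2))
```

In the usage example every n-gram of the hypothesis appears in the reference, so the score is determined entirely by the brevity penalty for the shorter output.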

Details

Primary Language

English

Subjects

Artificial Intelligence

Section

Research Article

Publication Date

20 October 2021

Submission Date

3 September 2021

Acceptance Date

16 September 2021

Published in Issue

Year 2021 Volume: IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium, Issue: Special

Cite

APA
Sel, İ., Üzen, H., & Hanbay, D. (2021). Creating a Parallel Corpora for Turkish-English Academic Translations. Computer Science, IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium(Special), 335-340. https://doi.org/10.53070/bbd.990959
AMA
1. Sel İ, Üzen H, Hanbay D. Creating a Parallel Corpora for Turkish-English Academic Translations. JCS. 2021;IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium(Special):335-340. doi:10.53070/bbd.990959
Chicago
Sel, İlhami, Hüseyin Üzen, and Davut Hanbay. 2021. “Creating a Parallel Corpora for Turkish-English Academic Translations.” Computer Science IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium (Special): 335-40. https://doi.org/10.53070/bbd.990959.
EndNote
Sel İ, Üzen H, Hanbay D (1 October 2021) Creating a Parallel Corpora for Turkish-English Academic Translations. Computer Science IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium Special 335–340.
IEEE
[1] İ. Sel, H. Üzen, and D. Hanbay, “Creating a Parallel Corpora for Turkish-English Academic Translations,” JCS, vol. IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium, no. Special, pp. 335–340, Oct. 2021, doi: 10.53070/bbd.990959.
ISNAD
Sel, İlhami - Üzen, Hüseyin - Hanbay, Davut. “Creating a Parallel Corpora for Turkish-English Academic Translations”. Computer Science IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium/Special (1 October 2021): 335-340. https://doi.org/10.53070/bbd.990959.
JAMA
1. Sel İ, Üzen H, Hanbay D. Creating a Parallel Corpora for Turkish-English Academic Translations. JCS. 2021;IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium:335–340.
MLA
Sel, İlhami, et al. “Creating a Parallel Corpora for Turkish-English Academic Translations.” Computer Science, vol. IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium, no. Special, Oct. 2021, pp. 335-40, doi:10.53070/bbd.990959.
Vancouver
1. İlhami Sel, Hüseyin Üzen, Davut Hanbay. Creating a Parallel Corpora for Turkish-English Academic Translations. JCS. 1 October 2021;IDAP-2021: 5th International Artificial Intelligence and Data Processing Symposium(Special):335-40. doi:10.53070/bbd.990959

The Creative Commons Attribution 4.0 International License is applied to all research papers published by JCS, and a Digital Object Identifier (DOI) is assigned to each published paper.