Basılı Türkçe’nin Önemli Bazı İstatistiksel Özellikleri

Mehmet Emin Dalkılıç; Gökhan Dalkılıç

Research Article

Year 2002, Volume: 1 Issue: 1, 113 - 130, 15.04.2002

Abstract

The aim of this study is to determine some statistical values of printed Turkish. The compiled statistics include the frequency distributions of letter groups of length one through five, first/last letter analyses, per-letter uncertainty (entropy) and redundancy, the index of coincidence (rastgelelik endeksi), the word length distribution, and the vowel/consonant ratio. These values were obtained from a Turkish corpus (külliyat) built from the Internet archive of the newspaper Hürriyet. In addition, results from other studies on Turkish were combined with these as a weighted composite, yielding the most comprehensive results to date, based on the largest Turkish corpus base and the widest text variety obtained so far. To determine how well the results of the different studies agree with one another, a similarity measure was developed and applied to the results of the existing studies.

Keywords

Statistical Properties of Turkish, N-Gram Frequency Distributions, Entropy (Uncertainty), First/Last Letter Analysis, Word Lengths, Similarity Measure for Sorted Lists

References

COVER, T. and KING, R. (1978), A Convergent Gambling Estimate of the Entropy of English, IEEE Transactions on Information Theory, IT-24, n.4, 413-421.
DALKILIÇ, G. (2001), Günümüz Türkçesi’nin İstatistiksel Özellikleri ve Bir Metin Sıkıştırma Uygulaması, Master's Thesis, International Computer Institute, Ege Üniversitesi.
DALKILIÇ, M.E. and DALKILIÇ, G. (2000), On the Entropy, Redundancy and Compression of Contemporary Printed Turkish, Proc. of the XV International Symposium on Computer and Information Sciences, 60-67.
DİRİ, B. (2000), A Text Compression System Based on the Morphology of Turkish Language, Proc. of the XV International Symposium on Computer and Information Sciences, 12-23.
GÖKSU, T. and ERTAUL, L. (1998), Yer Değiştirmeli ve Dizi Şifreleyiciler için Türkçe’nin Yapısal Özelliklerini Kullanan Bir Kriptoanaliz, BAS’98, 184-194.
GÖNENÇ, G. (1980), Türkçe Abece İçin ‘En İyi’ Kodlar, 3. Ulusal Bilişim Kurultayı, Bilişim’80 Bildiriler Kitabı, 73-75.
JURAFSKY, D. and MARTIN, J.H. (2000), Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice Hall.
KOLTUKSUZ, A. (1995), Simetrik Kriptosistemler için Türkiye Türkçesinin Kriptanalitik Ölçütleri, PhD Thesis, Computer Engineering, Ege Üniversitesi.
SHANNON, C.E. (1951), Prediction and Entropy of Printed English, Bell System Technical Journal, 30(1), 50-64.
SIEGEL, S. (1956), Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill.
STINSON, D.R. (1995), Cryptography: Theory and Practice, New York: CRC Press.
TÖRECİ, E. (1975), Statistical Investigations on the Turkish Language Using Digital Computers, Master's Thesis, ODTÜ (as cited in Gönenç, 1980).

Some Important Statistical Properties of Printed Turkish

Abstract

The goal of this study is to determine some statistical properties of printed Turkish. The compiled statistics include n-gram frequency distributions (monogram, digram, ..., pentagram), first/last letter analyses, per-letter entropy and redundancy, the index of coincidence, the word length distribution, and the vowel/consonant ratio. These values were obtained from a corpus compiled from the Internet archive of the daily newspaper Hürriyet. Furthermore, results from existing studies on Turkish were combined with these, yielding the most comprehensive results to date, based on the largest Turkish corpus base and the widest text variety so far. To determine the degree of agreement among the results of the different studies, a similarity measure was developed and applied to the existing studies' results.
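Several of the statistics listed in the abstract have standard textbook definitions, and a minimal Python sketch may help make them concrete. It is not the authors' implementation: the 29-letter alphabet string, the choice to drop spaces and punctuation before counting, and all function names are assumptions made for illustration. Entropy is computed as the first-order value H = -Σ p·log2(p) over n-gram relative frequencies, redundancy as 1 - H / log2(alphabet size), and the index of coincidence as Σ c(c-1) / (N(N-1)).

```python
import math
from collections import Counter

# Assumption: the 29-letter modern Turkish alphabet, lowercase.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"


def turkish_lower(text):
    """Lowercase using Turkish casing: 'I' -> 'ı' and 'İ' -> 'i'."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def letters_only(text):
    """Keep only letters of the assumed alphabet (spaces, digits, punctuation dropped)."""
    return [ch for ch in turkish_lower(text) if ch in TURKISH_ALPHABET]


def ngram_counts(letters, n):
    """Count overlapping n-grams: n=1 gives monograms, n=2 digrams, and so on.

    Simplification: because spaces were dropped, n-grams may span word boundaries.
    """
    return Counter("".join(letters[i:i + n]) for i in range(len(letters) - n + 1))


def entropy_bits(counts):
    """Shannon entropy, in bits per symbol, of the empirical distribution in counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(entropy, alphabet_size=len(TURKISH_ALPHABET)):
    """Redundancy relative to a uniform distribution over the alphabet."""
    return 1.0 - entropy / math.log2(alphabet_size)


def index_of_coincidence(counts):
    """Probability that two distinct positions drawn at random hold the same symbol."""
    total = sum(counts.values())
    return sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))


if __name__ == "__main__":
    sample = "Basılı Türkçe’nin Önemli Bazı İstatistiksel Özellikleri"
    letters = letters_only(sample)
    monograms = ngram_counts(letters, 1)
    h1 = entropy_bits(monograms)
    print("most frequent letters:", monograms.most_common(3))
    print("entropy (bits/letter):", round(h1, 3))
    print("redundancy:", round(redundancy(h1), 3))
    print("index of coincidence:", round(index_of_coincidence(monograms), 4))
```

A title-length sample like the one above is far too short to give meaningful estimates; the point is only to show how the quantities relate. Note the explicit handling of dotted and dotless I before lowercasing, since the default `str.lower()` does not follow Turkish casing.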

Keywords

Statistical Properties of Turkish, N-Gram Frequency Distributions, Entropy, First/Last Letter Analysis, Word Lengths, Similarity Assessment of Sorted Lists
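The abstract does not define the similarity measure the authors developed for sorted lists, so the sketch below does not attempt to reproduce it. As a stand-in, it shows Spearman's rank correlation, a standard nonparametric way to compare two orderings of the same items, such as two frequency-ranked letter lists from different studies. The function name and the example orderings are hypothetical and are not data from the paper.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same distinct items.

    Returns 1.0 for identical orderings and -1.0 for exactly reversed ones.
    Uses the tie-free formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    if len(set(ranking_a)) != len(ranking_a) or sorted(ranking_a) != sorted(ranking_b):
        raise ValueError("rankings must be permutations of the same distinct items")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items to compare")
    # Position of each item in the second ranking.
    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    # Sum of squared rank differences.
    d_squared = sum(
        (rank_a - position_b[item]) ** 2 for rank_a, item in enumerate(ranking_a)
    )
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical orderings for illustration only, not results from any study.
    study_one = ["a", "e", "i", "n"]
    study_two = ["a", "e", "n", "i"]
    print(spearman_rho(study_one, study_two))
```

This formula requires both lists to contain exactly the same items with no ties; comparing top-k lists that only partly overlap would need a different measure.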

There are 12 citations in total.

Details

Primary Language	Turkish
Subjects	Applied Statistics
Journal Section	Research Articles
Authors	Mehmet Emin Dalkılıç, Gökhan Dalkılıç
Publication Date	April 15, 2002
Published in Issue	Year 2002 Volume: 1 Issue: 1

Cite

APA	Dalkılıç, M. E., & Dalkılıç, G. (2002). Basılı Türkçe’nin Önemli Bazı İstatistiksel Özellikleri. İstatistik Araştırma Dergisi, 1(1), 113-130.
