Research Article

A New Hybrid Regression Model for Undersized Sample Problem

Year 2017, Volume: 13 Issue: 3, 803 - 813, 30.09.2017

Abstract

Traditional statistics assumes that the number of samples available for study exceeds the number of well-selected variables. Nowadays, in many fields, the number of samples is in the tens or hundreds, while a single observation may have thousands or even millions of dimensions. Classical statistical techniques are not designed to cope with this kind of data set. Many multivariate statistical techniques, such as principal component analysis, factor analysis, classification and cluster analysis, and the estimation of regression coefficients, require an estimate of the sample variance-covariance matrix or its inverse. When the number of observations is much smaller than the number of features (or variables), the usual sample covariance matrix degenerates and cannot be inverted. This is one of the biggest obstacles that classical statistical methods encounter. To remedy the singular covariance matrices that arise in high-dimensional data, Hybrid Covariance Estimators (HCE) were developed by Pamukçu et al. (2015). HCE overcomes the singularity problem of the covariance matrix and thus makes multivariate statistical analysis of high-dimensional data sets possible. One of the most important steps in statistical analysis using HCE is selecting the appropriate covariance structure for the data set, since HCE can in fact be obtained with many different covariance structures. The structure can be selected using information criteria such as the Akaike Information Criterion and the Information Complexity Criterion, which are well known as model selection criteria. In this study, we introduce a new regression model with HCE and information criteria for undersized high-dimensional data with n<<p. We demonstrate our approach in simulation studies with different scenarios for p/n ratios. We use the AIC, CAIC and ICOMP criteria to select the appropriate HCE structure and compare the results with classical regression analysis.
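
The singularity problem described in the abstract can be illustrated with a short sketch. It is not the HCE method of the paper, whose construction is not given on this page. As a stand-in it assumes a generic linear shrinkage toward a scaled identity matrix, and the names `shrink_covariance` and `alpha`, the sizes `n` and `p`, and the simulated data are all hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100  # n << p; sizes chosen only for illustration

# Simulated data: only the first five variables affect the response.
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Center the data and form the usual p x p sample covariance matrix.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
S = Xc.T @ Xc / (n - 1)

# After centering, the rank of S is at most n - 1, so S is singular
# whenever p >= n and the classical estimator cannot be computed.
rank_S = np.linalg.matrix_rank(S)


def shrink_covariance(S, alpha):
    """Assumed stand-in for a regularized estimator (NOT the paper's HCE):
    a convex combination of S and a scaled identity target."""
    dim = S.shape[0]
    target = (np.trace(S) / dim) * np.eye(dim)
    return (1.0 - alpha) * S + alpha * target


# alpha = 0.2 is an arbitrary illustrative value, not a tuned one.
S_reg = shrink_covariance(S, alpha=0.2)

# Regression coefficients from the regularized covariance:
# solve S_reg @ beta = s_xy, where s_xy is the covariance of X with y.
s_xy = Xc.T @ yc / (n - 1)
beta_hat = np.linalg.solve(S_reg, s_xy)

print("rank of S:", rank_S, "out of", p)
print("smallest eigenvalue of S_reg:", np.linalg.eigvalsh(S_reg).min())
```

Because the shrunken matrix has all eigenvalues bounded below by `alpha` times the average variance, the linear system has a unique solution even though `S` itself is rank deficient. Choosing among several such regularized structures is the step the abstract assigns to the information criteria.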

References

  • 1. Donoho, D.L.; High dimensional data analysis: The curses and blessings of dimensionality. statweb.stanford.edu/~donoho/Lectures/AMS2000/Curses.pdf. 2000
  • 2. Cunningham, P.; Dimension Reduction. Technical Report UCD-CSI-2007-7. University College Dublin. 2007
  • 3. Fiebig, D.G.; On the maximum entropy approach to undersized samples. Applied Mathematics and Computation. 1984; 14, 301-312
  • 4. Stein, C.; Estimation of covariance matrix. Rietz Lecture. 39th Annual Meeting IMS. Atlanta, Georgia. 1975.
  • 5. Chen, Y.; Robust shrinkage estimation of high dimensional covariance matrices. IEEE Workshop on Sensor Array and Multichannel Signal Processing (SAM). 2010
  • 6. Ledoit, O.; Wolf, M. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis. 2004; 88, 365-411
  • 7. Pamukçu, E.; Bozdogan, H.; Çalık, S. A Novel Hybrid Dimension Reduction Technique for Undersized High Dimensional Gene Expression Data Sets Using Information Complexity Criterion for Cancer Classification. Computational and Mathematical Methods in Medicine. Volume 2015 (2015), Article ID 370640, 14 pages
  • 8. Erbaş, Ü.; Entropi İlkelerinin Boyut İndirgeme Uygulamaları [Dimension Reduction Applications of Entropy Principles]. PhD thesis. Marmara Üniversitesi Sosyal Bilimler Enstitüsü. İstanbul. 2010
  • 9. Bozdogan, H.; Information Complexity and Multivariate Learning in High Dimensions with Applications in Data Mining. Forthcoming book. 2017
  • 10. Bozdogan, H.; Howe, J.A. Misspecified multivariate regression models using the genetic algorithm and information complexity as the fitness function. European Journal of Pure and Applied Mathematics. 2012; 5(2), 211-249
  • 11. Haff, L.R.; Empirical Bayes estimation of the multivariate normal covariance matrix. The Annals of Statistics. 1980; 8(3), 586-597
  • 12. Shurygin, A.; The linear combination of the simplest discriminator and Fisher’s one. Nauka (ed). Applied Statistics. Moscow. Russia. 1983; 144-158
  • 13. Press, S.; Estimation of a normal covariance matrix. Technical Report. University of British Columbia. 1975.
  • 14. Chen, M.; Estimation of covariance matrices under a quadratic loss function. Research Report S-46. Department of Mathematics. SUNY at Albany. 1976.
  • 15. Bozdogan, H.; A new class of information complexity (ICOMP) criteria with an application to customer profiling and segmentation. Invited paper. In Istanbul University Journal of the School Business Administration. 2010; 39(2), 370-398
  • 16. Thomaz, C.E.; Maximum Entropy Covariance Estimate for Statistical Pattern Recognition. PhD thesis. Department of Computing, Imperial College. University of London. UK. 2004
  • 17. Akaike, H.; Information theory and an extension of the maximum likelihood principle. 2nd International Symposium on Information Theory. Budapest: Academiai Kiado. 1973, 267-281
  • 18. Akaike, H.; A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974, AC-19:719-723
  • 19. Schwarz, G.; Estimating the dimension of a model. Annals of Statistics. 1978; 6, 461-464
  • 20. Bozdogan, H.; Model selection and Akaike’s Information Criterion (AIC): the general theory and its analytical extensions. Psychometrika. 1987; 52(3), 345-370
  • 21. Bhansali, R.J.; Downham, D.Y. Some properties of the order of an autoregressive model selected by a generalization of Akaike’s FPE criterion. Biometrika. 1977; 64, 547-551
  • 22. Rissanen, J.; Modeling by shortest data description. Automatica. 1978; 14, 465-471.
  • 23. Bozdogan, H.; ICOMP: A new model selection criterion. Classification and Related Methods of Data Analysis. 1988; 599-608
  • 24. Bozdogan, H.; On the information based measure of covariance complexity and its application to the evaluation of multivariate linear models. Communications in Statistics: Theory and Methods. 1990; 1, 221-278
  • 25. Bozdogan, H.; Choosing the number of clusters, subset selection of variables and outlier detection in the standard mixture model cluster analysis. Invited paper in New Approaches in Classification and Data Analysis. Springer Verlag. New York, 1994.
  • 26. Bozdogan, H.; Haughton, D.M.A. Information complexity criteria for regression models. Computational Statistics and Data Analysis. 1998; 28, 51-76
  • 27. Bozdogan, H.; Akaike’s information criterion and recent developments in information complexity. Journal of Mathematical Psychology. 2000; 44, 62-91
  • 28. Bozdogan, H.; Intelligent Statistical Data Mining with Information Complexity and Genetic Algorithm. In Statistical Data Mining and Knowledge Discovery. H. Bozdogan (ed). Chapman and Hall/CRC. Florida, 2004.

A Hybrid Regression Model for the Extremely Undersized Sample Problem


Abstract

In traditional statistical methodology, it is assumed that there are only a few well-selected variables and a larger number of samples. Today, however, in many fields the samples available for study number in the tens or hundreds, while a single observation may have thousands or even millions of dimensions. Classical methods are not designed to cope with such data. Many classical multivariate statistical techniques, such as principal component analysis, factor analysis, classification and cluster analysis, and the inference and estimation of regression coefficients, require an estimate of the covariance matrix of the data and/or its inverse. When the number of variables p exceeds the number of samples n, the sample variance-covariance matrix degenerates and its inverse cannot be computed. This is one of the most important difficulties that classical statistical methods can encounter. To overcome the covariance problem in high-dimensional data sets, the Hybrid Covariance Estimator (HCE) method was developed by Pamukçu et al. (2015). HCE prevents this degeneration of the covariance structure and makes the statistical analysis of high-dimensional data sets with the n<<p problem possible. Because HCE can in fact be obtained with many different covariance structures, one of the important steps in analyses using HCE is determining the covariance structure appropriate for the data set. At this stage, the appropriate covariance structure can be selected with information criteria such as AIC, CAIC and ICOMP, which are also known as model selection criteria. In this study, the Hybrid Regression Model (HRM), proposed with HCE and information criteria for high-dimensional data sets with n<<p, is introduced and its computational steps are given. In a simulation study, data sets with different p/n ratios under different scenarios were analyzed with HRM, the appropriate covariance structure was selected with the AIC, CAIC and ICOMP information criteria, and the results were compared with the classical regression analysis method.
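
The selection step named in the abstract, picking a covariance structure by an information criterion, can be sketched as follows. The code uses the standard textbook definitions: AIC = -2 log L + 2k and CAIC = -2 log L + k(ln n + 1). The function `c1_complexity` implements the first-order maximal entropic complexity C1 of a covariance matrix that underlies ICOMP-type criteria, here written for a full-rank matrix. How the paper applies these to HCE is not stated on this page, so the candidate names and the log-likelihood values below are made-up placeholders, and `select_min` is a hypothetical helper.

```python
import numpy as np


def aic(loglik, k):
    """Akaike Information Criterion: -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * k


def caic(loglik, k, n):
    """Consistent AIC: -2 log L + k (ln n + 1)."""
    return -2.0 * loglik + k * (np.log(n) + 1.0)


def c1_complexity(cov):
    """C1(cov) = (s/2) log(tr(cov)/s) - (1/2) log det(cov), s = rank(cov).
    Assumes cov is symmetric positive definite; it is zero exactly when
    all eigenvalues are equal and grows as they spread apart."""
    s = np.linalg.matrix_rank(cov)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet


def select_min(scores):
    """Hypothetical helper: return the candidate with the smallest score."""
    return min(scores, key=scores.get)


# Made-up (log-likelihood, parameter count) pairs, for illustration only.
candidates = {"structure_A": (-120.0, 10), "structure_B": (-118.0, 25)}
n = 20

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
caic_scores = {name: caic(ll, k, n) for name, (ll, k) in candidates.items()}

best_by_aic = select_min(aic_scores)
best_by_caic = select_min(caic_scores)
print(best_by_aic, best_by_caic)
```

In this toy input the second candidate fits slightly better but uses many more parameters, so both criteria prefer the first. The same pattern extends to any number of candidate structures once a likelihood and a parameter count are available for each.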

There are 28 citations in total.

Details

Subjects: Engineering
Journal Section: Articles
Authors: Esra Pamukçu
Publication Date: September 30, 2017
Published in Issue: Year 2017, Volume: 13, Issue: 3

Cite

APA Pamukçu, E. (2017). A New Hybrid Regression Model for Undersized Sample Problem. Celal Bayar University Journal of Science, 13(3), 803-813. https://doi.org/10.18466/cbayarfbe.339536