A New Hybrid Regression Model for Undersized Sample Problem

Esra Pamukçu

doi:10.18466/cbayarfbe.339536

Research Article

Year 2017, Volume: 13, Issue: 3, 803-813, 30.09.2017

https://izlik.org/JA56EF22YH

Abstract

Traditional statistics assumes that the number of samples available for study exceeds the number of well-selected variables. Nowadays, in many fields, the number of samples is expressed in tens or hundreds, while a single observation may have thousands or even millions of dimensions. Classical statistical techniques are not designed to cope with this kind of data set. Many multivariate statistical techniques, such as principal component analysis, factor analysis, classification and cluster analysis, and the estimation of regression coefficients, require an estimate of the sample variance-covariance matrix or its inverse. When the number of observations is much smaller than the number of features (or variables), the usual sample covariance matrix degenerates and cannot be inverted. This is one of the biggest obstacles encountered by classical statistical methods. To remedy the singular covariance matrices that arise in high-dimensional data, Hybrid Covariance Estimators (HCE) were developed by Pamukçu et al. (2015). HCE overcomes the singularity problem of the covariance matrix and thus makes multivariate statistical analysis of high-dimensional data sets possible. One of the most important steps in statistical analysis using HCE is selecting the appropriate covariance structure for the data set, since HCE can be obtained with many different covariance structures. The structure can be selected using information criteria such as the Akaike Information Criterion and the Information Complexity Criterion, which are well known as model selection criteria. In this study, we introduce a new regression model with HCE and information criteria for undersized high-dimensional data with n << p. We demonstrate our approach in simulation studies with different scenarios for p/n ratios. We use the AIC, CAIC and ICOMP criteria to select the appropriate HCE structure and compare the results with classical regression analysis.
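
The singularity described above can be illustrated with a short sketch. It is not the HCE of Pamukçu et al. (2015); as a stand-in it uses a generic convex-combination shrinkage toward a scaled identity matrix with a fixed, hypothetical weight `rho`. The function names, the simulated data and the value of `rho` are all assumptions made for illustration only.

```python
import numpy as np

def shrunk_covariance(X, rho=0.1):
    """Convex combination of the sample covariance and a scaled identity.

    Generic shrinkage stand-in (assumption), NOT the paper's HCE.
    """
    S = np.cov(X, rowvar=False)             # p x p sample covariance
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)  # scaled identity target
    return (1.0 - rho) * S + rho * target

def covariance_based_regression(X, y, rho=0.1):
    """Solve Sigma * beta = s_xy using the regularized covariance."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    s_xy = Xc.T @ yc / (n - 1)              # covariance between X and y
    sigma = shrunk_covariance(X, rho)
    beta = np.linalg.solve(sigma, s_xy)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return beta, intercept

# Simulated undersized data (hypothetical): n = 20 samples, p = 100 variables.
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(n)

S = np.cov(X, rowvar=False)
rank_S = np.linalg.matrix_rank(S)           # at most n - 1, so S is singular
sigma = shrunk_covariance(X, rho=0.1)
min_eig = np.linalg.eigvalsh(sigma).min()   # strictly positive after shrinkage
beta, intercept = covariance_based_regression(X, y, rho=0.1)

print("rank of sample covariance:", rank_S, "out of", p)
print("regularized covariance is positive definite:", bool(min_eig > 0))
```

Because the rank of the sample covariance is at most n - 1, the p x p matrix cannot be inverted when n << p, whereas the regularized matrix is positive definite and the linear system for the coefficients has a unique solution.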

Keywords

Curse of dimensionality, Hybrid covariance estimator (HCE), Hybrid regression model (HRM), Information complexity criterion (ICOMP), Undersized sample problem

References

1. Donoho, D.L.; High dimensional data analysis: The curses and blessings of dimensionality. statweb.stanford.edu/~donoho/Lectures/AMS2000/Curses.pdf. 2000
2. Cunningham, P.; Dimension Reduction. Technical Report UCD-CSI-2007-7. University College Dublin. 2007
3. Fiebig, D.G.; On the maximum entropy approach to undersized samples. Applied Mathematics and Computation. 1984; 14, 301-312
4. Stein, C.; Estimation of a covariance matrix. Rietz Lecture. 39th Annual Meeting IMS. Atlanta, Georgia. 1975
5. Chen, Y.; Robust shrinkage estimation of high dimensional covariance matrices. IEEE Workshop on Sensor Array and Multichannel Signal Processing (SAM). 2010
6. Ledoit, O.; Wolf, M. A well conditioned estimator for large dimensional covariance matrices. Journal of Multivariate Analysis. 2004; 88, 365-411
7. Pamukçu, E.; Bozdogan, H.; Çalık, S. A Novel Hybrid Dimension Reduction Technique for Undersized High Dimensional Gene Expression Data Sets Using Information Complexity Criterion for Cancer Classification. Computational and Mathematical Methods in Medicine. Volume 2015 (2015), Article ID 370640, 14 pages
8. Erbaş, Ü.; Entropi İlkelerinin Boyut İndirgeme Uygulamaları. PhD thesis. Marmara Üniversitesi Sosyal Bilimler Enstitüsü. İstanbul. 2010
9. Bozdogan, H.; Information Complexity and Multivariate Learning in High Dimensions with Applications in Data Mining. Forthcoming book. 2017
10. Bozdogan, H.; Howe, J.A. Misspecified multivariate regression models using the genetic algorithm and information complexity as the fitness function. European Journal of Pure and Applied Mathematics. 2012; 5(2), 211-249
11. Haff, L.R.; Empirical Bayes estimation of the multivariate normal covariance matrix. The Annals of Statistics. 1980; 8(3), 586-597
12. Shurygin, A.; The linear combination of the simplest discriminator and Fisher’s one. Nauka (ed). Applied Statistics. Moscow. Russia. 1983; 144-158
13. Press, S.; Estimation of a normal covariance matrix. Technical Report. University of British Columbia. 1975
14. Chen, M.; Estimation of covariance matrices under a quadratic loss function. Research Report S-46. Department of Mathematics. SUNY at Albany. 1976
15. Bozdogan, H.; A new class of information complexity (ICOMP) criteria with an application to customer profiling and segmentation. Invited paper. In Istanbul University Journal of the School Business Administration. 2010; 39(2), 370-398
16. Thomaz, C.E.; Maximum Entropy Covariance Estimate for Statistical Pattern Recognition. PhD thesis. Department of Computing Imperial College. University of London. UK. 2004
17. Akaike, H.; Information theory and an extension of the maximum likelihood principle. 2nd International Symposium on Information Theory. Budapest: Academiai Kiado. 1973, 267-281
18. Akaike, H.; A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974, AC-19:719-723
19. Schwarz, G.; Estimating the dimension of a model. Annals of Statistics. 1978; 6, 461-464
20. Bozdogan, H.; Model selection and Akaike’s Information Criterion (AIC): the general theory and its analytical extensions. Psychometrika. 1987; 52(3), 345-370
21. Bhansali, R.J.; Downham, D.Y. Some properties of the order of an autoregressive model selected by a generalization of Akaike’s FPE criterion. Biometrika. 1977; 64, 547-551
22. Rissanen, J.; Modeling by shortest data description. Automatica. 1978; 14, 465-471
23. Bozdogan, H.; ICOMP: A new model selection criterion. Classification and Related Methods of Data Analysis. 1988; 599-608
24. Bozdogan, H.; On the information based measure of covariance complexity and its application to the evaluation of multivariate linear models. Communications in Statistics: Theory and Methods. 1990; 1, 221-278
25. Bozdogan, H.; Choosing the number of clusters, subset selection of variables and outlier detection in the standard mixture model cluster analysis. Invited paper in New Approaches in Classification and Data Analysis. Springer Verlag. New York. 1994
26. Bozdogan, H.; Haughton, D.M.A. Information complexity criteria for regression models. Computational Statistics and Data Analysis. 1998; 28, 51-76
27. Bozdogan, H.; Akaike’s information criterion and recent developments in information complexity. Journal of Mathematical Psychology. 2000; 44, 62-91
28. Bozdogan, H.; Intelligent Statistical Data Mining with Information Complexity and Genetic Algorithm. In Statistical Data Mining and Knowledge Discovery. H. Bozdogan (ed). Chapman and Hall/CRC. Florida. 2004

A Hybrid Regression Model for the Extremely Small Sample Problem

Abstract

Traditional statistical methodology assumes that there are only a few well-selected variables and a larger number of samples. Today, however, in many fields the samples available for study number in the tens or hundreds, while a single observation may have thousands or even millions of dimensions. Classical methods are not designed to cope with data of this kind. Many classical multivariate statistical techniques, such as principal component analysis, factor analysis, classification and cluster analysis, and the inference and estimation of regression coefficients, require an estimate of the covariance matrix of the data and/or its inverse. When the number of variables p exceeds the number of samples n, the sample variance-covariance matrix degenerates and its inverse cannot be computed. This is one of the most important difficulties for classical statistical methods. To overcome the covariance problem in high-dimensional data sets, the Hybrid Covariance Estimator (HCE) method was developed by Pamukçu et al. (2015). HCE prevents this degeneration of the covariance structure and makes statistical analysis of high-dimensional data sets with the n << p problem possible. Because HCE can in fact be obtained with many different covariance structures, one of the important stages in analyses using HCE is determining the covariance structure appropriate for the data set. At this stage, the appropriate covariance structure can be selected with information criteria such as AIC, CAIC and ICOMP, which are also known as model selection criteria. In this study, the proposed Hybrid Regression Model (HRM), based on HCE and information criteria, is introduced for high-dimensional data sets with n << p, and its computational steps are given. In a simulation study, data sets with different p/n ratios under different scenarios were analyzed with HRM, the appropriate covariance structure was selected with the AIC, CAIC and ICOMP information criteria, and the results were compared with the classical regression analysis method.
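
The three information criteria named in the abstract can be sketched in their standard textbook forms: AIC = -2 log L + 2k, CAIC = -2 log L + k(log n + 1), and ICOMP = -2 log L + 2 C1(F^-1), where C1 is the maximal information complexity of the estimated inverse Fisher information matrix. The sketch below assumes a full-rank matrix in `c1_complexity`; the candidate names, log-likelihoods and parameter counts are hypothetical numbers chosen only to show the selection step, and the exact variants used in the article may differ.

```python
import numpy as np

def aic(loglik, k):
    """Akaike Information Criterion for k free parameters."""
    return -2.0 * loglik + 2.0 * k

def caic(loglik, k, n):
    """Consistent AIC, which penalizes each parameter by log(n) + 1."""
    return -2.0 * loglik + k * (np.log(n) + 1.0)

def c1_complexity(cov):
    """C1 complexity: (s/2) log(tr(cov)/s) - (1/2) log det(cov).

    Assumes cov is symmetric positive definite, so s equals its dimension.
    """
    s = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("matrix must be positive definite")
    return 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet

def icomp(loglik, inv_fisher):
    """ICOMP with the complexity of the estimated inverse Fisher information."""
    return -2.0 * loglik + 2.0 * c1_complexity(inv_fisher)

# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {"structure_A": (-120.0, 5), "structure_B": (-118.5, 9)}
n = 30  # hypothetical sample size

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
caic_scores = {name: caic(ll, k, n) for name, (ll, k) in candidates.items()}
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_caic = min(caic_scores, key=caic_scores.get)

# The complexity term is zero for an identity matrix and positive otherwise.
c1_identity = c1_complexity(np.eye(3))
c1_unequal = c1_complexity(np.diag([1.0, 4.0]))
```

In each case the candidate with the smallest criterion value is preferred. The C1 term grows as the eigenvalues of the matrix become more unequal, so ICOMP penalizes ill-conditioned or strongly correlated parameter estimates rather than only the parameter count.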

Keywords

Information complexity criterion (ICOMP), Dimensionality problem, Hybrid covariance estimator (HCE), Hybrid regression model (HRM), Small sample problem

There are 28 citations in total.

Details

Subjects	Engineering
Journal Section	Research Article
Authors	Esra Pamukçu
Publication Date	September 30, 2017
DOI	https://doi.org/10.18466/cbayarfbe.339536
IZ	https://izlik.org/JA56EF22YH
Published in Issue	Year 2017 Volume: 13 Issue: 3

Cite

APA	Pamukçu, E. (2017). A New Hybrid Regression Model for Undersized Sample Problem. Celal Bayar University Journal of Science, 13(3), 803-813. https://doi.org/10.18466/cbayarfbe.339536
AMA	1.Pamukçu E. A New Hybrid Regression Model for Undersized Sample Problem. CBUJOS. 2017;13(3):803-813. doi:10.18466/cbayarfbe.339536
Chicago	Pamukçu, Esra. 2017. “A New Hybrid Regression Model for Undersized Sample Problem”. Celal Bayar University Journal of Science 13 (3): 803-13. https://doi.org/10.18466/cbayarfbe.339536.
EndNote	Pamukçu E (September 1, 2017) A New Hybrid Regression Model for Undersized Sample Problem. Celal Bayar University Journal of Science 13 3 803–813.
IEEE	[1]E. Pamukçu, “A New Hybrid Regression Model for Undersized Sample Problem”, CBUJOS, vol. 13, no. 3, pp. 803–813, Sept. 2017, doi: 10.18466/cbayarfbe.339536.
ISNAD	Pamukçu, Esra. “A New Hybrid Regression Model for Undersized Sample Problem”. Celal Bayar University Journal of Science 13/3 (September 1, 2017): 803-813. https://doi.org/10.18466/cbayarfbe.339536.
JAMA	1.Pamukçu E. A New Hybrid Regression Model for Undersized Sample Problem. CBUJOS. 2017;13:803–813.
MLA	Pamukçu, Esra. “A New Hybrid Regression Model for Undersized Sample Problem”. Celal Bayar University Journal of Science, vol. 13, no. 3, Sept. 2017, pp. 803-13, doi:10.18466/cbayarfbe.339536.
Vancouver	1.Esra Pamukçu. A New Hybrid Regression Model for Undersized Sample Problem. CBUJOS. 2017 Sep. 1;13(3):803-13. doi:10.18466/cbayarfbe.339536
