Organization of Variation-Based Personal Genetic Data with Document-Based No-Sql Database

Onur Çakırgöz; Süleyman Sevinç

Research Article

Year 2021, Volume: 14 Issue: 4, 391 - 402, 31.10.2021

Abstract

Variation-based personal genetic data are central to many clinical practices and to many studies in bioinformatics. Unfortunately, almost none of the existing methods for organizing personal genetic data are variation-based, and these methods have not been tested on large amounts of real data. When they are used in applications that require variation-based data, an intensive data conversion problem arises. The few variation-based solutions that do exist are not fully structured and do not meet the needs of daily practice. In this study, a document-based No-SQL database and related designs are proposed for organizing variation-based personal genetic data. Our structured solution contains many classes, collections, and indexes, and it supports all types of variations (both structural and non-structural). The variation data of 2504 individuals published by the 1000 Genomes Project were stored in this database smoothly and efficiently. The space occupied by personal genetic data in main memory and on hard disk was analyzed. In addition, several queries that clinical applications might use frequently were run, and the response times of the database were measured. The results of the analyses show that the proposed method provides substantial gains.
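
To make the idea of variation-based, document-style storage concrete, the following is a minimal sketch only and is not the schema proposed in the article. It assumes that each document records one difference between a person's genome and the reference genome, and that an index keyed on person and chromosome supports region queries. All field names, sample identifiers, positions, and the helper functions `build_index` and `find_in_region` are hypothetical, and a plain Python list stands in for a database collection.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical variation documents with made-up values. Each document stores
# only a difference from the reference genome, not the full sequence.
variations = [
    {"person": "P1", "chrom": "1", "pos": 1000, "type": "SNP", "ref": "G", "alt": "A"},
    {"person": "P1", "chrom": "1", "pos": 2500, "type": "INS", "ref": "T", "alt": "TA"},
    {"person": "P1", "chrom": "2", "pos": 5000, "type": "DEL", "end": 7500},
    {"person": "P2", "chrom": "1", "pos": 1000, "type": "SNP", "ref": "G", "alt": "A"},
]


def build_index(docs):
    """Map (person, chrom) to a list of (pos, document number) sorted by pos."""
    index = defaultdict(list)
    for i, doc in enumerate(docs):
        index[(doc["person"], doc["chrom"])].append((doc["pos"], i))
    for entries in index.values():
        entries.sort()
    return index


def find_in_region(docs, index, person, chrom, start, end):
    """Return documents of one person whose start position lies in [start, end].

    Only the start position is checked; a structural variation that begins
    before `start` but spans into the region is not returned by this sketch.
    """
    entries = index.get((person, chrom), [])
    lo = bisect_left(entries, (start, -1))
    hi = bisect_right(entries, (end, len(docs)))
    return [docs[i] for _, i in entries[lo:hi]]


index = build_index(variations)
hits = find_in_region(variations, index, "P1", "1", 900, 2000)
print([(d["pos"], d["type"]) for d in hits])
```

The sorted per-chromosome index lets a region lookup use binary search instead of scanning every document, which is the kind of access pattern a clinical query over one person's variations would need. A real document database would maintain such an index itself.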

Keywords

no-sql database, personal genome database, personal genetic data, human genome variations, 1000 genomes project

There are 44 citations in total.

Details

Primary Language	English
Subjects	Computer Software
Journal Section	Articles
Authors	Onur Çakırgöz (ORCID: 0000-0002-9347-1105); Süleyman Sevinç (ORCID: 0000-0001-9052-5836)
Publication Date	October 31, 2021
Submission Date	April 7, 2021
Published in Issue	Year 2021 Volume: 14 Issue: 4

Cite

APA	Çakırgöz, O., & Sevinç, S. (2021). Organization of Variation-Based Personal Genetic Data with Document-Based No-Sql Database. Bilişim Teknolojileri Dergisi, 14(4), 391-402.
