In silico comparative analysis of SARS-CoV-2 nucleocapsid (N) protein using bioinformatics tools

The world has been encountered to one of the biggest pandemics that causing by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). SARS-CoV-2 is placed in the Beta-CoV genus in the Coronaviridae family. N protein is one of the crucial structural proteins of SARS-CoV-2 that binds to the genome thereby generating helical ribonucleoprotein core. It is involved in viral transcription/replication, translation, and viral assembly after entering the host cell through interacting with host proteins. N protein sequences of SARS-CoV-2 and taxonomically related CoVs are examined using bioinformatics tools and approaches including sequence alignment, sequence and phylogenetic analyzes, and predicting of putative N-Glycosylation and phosphorylation positions and also predictions and comparative analyzes are performed on 3D structures of N proteins from SARS-CoV-2 related CoVs through using of some of applied bioinformatics analyzes. Results of mega BLAST search revealed that the most similar N protein sequence to SARS-CoV-2 is Bat-CoV RaTG13 N protein sequence in the taxonomically related CoVs. SARS-CoV-2 is grouped with SARS, pangolin, civet and bat CoVs (RATG13, SL ZC45 and SL ZXC21) in N protein, nucleotide and protein based ML phylogenetic trees. Some of SARS-CoV-2 N proteins were showed divergence from other SARS-CoV-2 N proteins analyzed due to amino acid substitutions detected in SARS-CoV-2 N proteins samples in phylogenetic trees. The highest amino acid substitutions were detected in Richmont/USA (QJA42209.1) and Greece (QIZ16579.1) samples, with 2 and 3 place substitutions, respectively. By domain analyzes, three domains were detected as Corona_nucleocora (Pfam), N terminal CoV RNA-binding domain (HAMAP) and C terminal N protein dimerization domain (HAMAP). Possible N-glycosylation positions of SARS-CoV-2 N protein were predicted at two positions. Assessments of possible serine, threonine and tyrosine phosphorylations were found to be at 100 positions, 34 of them were higher than 80% possibility. 3D structure analysis based on TM scores revealed that although the results of 3D structure analysis were shown consistency with the taxonomy of the CoVs, the 3D structures of SARS-CoV-2 N protein and taxonomically related CoVs were not at the same fold.


Introduction
By the end of 2019, the world has been encountered to one of the biggest pandemics that caused by severe acute respiratory syndrome coronavirus 2 (2019-nCoV or SARS-CoV-2).In December 2019, WHO authorities were informed by Chinese authorities for a new pneumonia infection, mainly resembling viral pneumonia, in Wuhan/China (Genc, 2020;Wu et al., 2020).
family Coronaviridae, like severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV) (Ye et al., 2020).Genome structure of SARS-CoV and MERS-CoV were extensively studied after the pandemics occurred during the periods of 2002-2003and 2012-2015, respectively (De Wit, et al., 2016).The first genome sequence of SARS-CoV-2, sampled from a patient from Wuhan/China, were submitted to NBCI GenBank at the beginning of January 2020 with accession number of MN908947 (Wu et al., 2020).
The genome analyzes of the Beta-CoVs revealed that they have a large single-stranded, approximately ~30 kb-bp long positive-sense RNA genome (+ssRNA).The strand has a poly-A tail at 3′ and cap at 5′.The two thirds of the genome encodes a non-structural protein named 1a/1b (ORF1a/b) polyprotein having a role in construction of replication/transcription complex (Chen, et al., 2020).The remaining part of genome encodes structural proteins including spike glycoprotein (S protein), envelope glycoprotein (E protein), and membrane (M protein) and nucleocapsid (N protein) proteins in a 5´ to 3´ order.Additionally, some proteins are also encoded by the structural proteins which are involved in immune response of host (De Wit et al., 2016) The S protein of CoVs is responsible for recognition and entry into the host.It binds to the receptor protein of the host, initiating membrane fusion.Angiotensin-converting enzyme 2 (ACE2) bound by SARS-CoV and SARS-CoV-2, and dipeptidyl peptidase 4 (DPP4 or CD26) bound by MERS-CoV are found as receptors on the hosts (De Wit et al., 2016).E protein is an integral membrane protein and is crucial for virus assembly and budding.It enhances virus budding (Neuman et al., 2011;Nieto-Torres et al., 2011).M protein is also an integral membrane protein and is vital in producing new viruses.It binds to the membrane allowing newly produced viruses to scatter by budding (Neuman et al., 2011).
N protein of CoVs is 45-50 KDa phosphoprotein that is involved in: i) virus assembly and also viral genome +ssRNA replication and/or transcription, and ii) viral mRNA translation.N protein interacts with +ssRNA, M protein and N protein itself (Hogue & Machamer, 2008).It binds to viral genome and other N proteins in order to generate helical ribonucleoprotein core.During the process, it also acts as RNA chaperone.N protein enters into the host cell with the viral genome and interacts with host cell proteins.The structural organization analyzes of SARS-CoVs revealed that N protein has two non-interacting structural domains, of are as N-terminal RNA-binding domain between 45-181 residues and C-terminal dimerization domain between 248-365 residues (Chang et al., 2006).Additionally, SARS-CoV M protein binds to N protein between amino acids located at 211-254 and/or 168-208 positions and these regions play important roles in N protein-protein interactions (He et al., 2004;Fang et al., 2006).N protein may undergo some posttranslational modifications, including sumoylation, proteolytic cleavage, ADP-ribosylation and phosphorylation (Fung & Liu, 2018).Phosphorylation can alter function of N proteins during viral life-cycle.Phosphorylation of N protein usually occurs in serine-arginine rich region or domain 3 in the protein.Phosphorylation has some effects on N protein including subcellular localization and antigenic specificity (Chang et al., 2006;Huang et al., 2015).Also, N protein has some additional functions such as nucleocytoplasmic shuttling, inhibition of S phase progress of the host and development of viral infection (Satija & Lal, 2007;Fung & Liu, 2018).
Most of CoVs N protein have epitopic sites that can be used to diagnose the infection.For instance, in SARS-CoV, the site between 371-407th amino acids in C terminus is identified as most antigenic region (Li et al., 2003).The N protein is a major structural component of CoVs and one of the most abundant viral protein produced in the host cell; therefore, can be used as antigenic protein for diagnostic purposes, and efforts in developing medicine and/or vaccine and preventing the infection.Additionally, high level of antibodies against N protein is reported in SARS-CoV patients by several researchers (Satija & Lal, 2007).
This study is aimed to perform comparative bioinformatics analysis of N protein of CoVs in order to determine various properties of SARS-CoV-2 N protein.Additionally, 3D structures of SARS-CoV-2 N protein are also generated and analyzed.

Sequence retrieving
40 nucleotide RNA and polypeptide sequences of SARS-CoV-2 N protein were retrieved from NCBI GenBank.Coding sequences (CDS) of N protein in the viral genome were retrieved from features of CDS option except only CDS of N protein from pangolin (Zhang et al., 2020), retrieved manually from viral genome MT084071.The accession numbers of protein sequences and viral genomes, the CDS positions of N proteins, the lengths of proteins and CDS, and the origins of the countries of the viral genomes were shown in Supplementary Table 1 (S-Table1).When selecting sequences, the countries considered as epicenters of the pandemic are preferred.The distribution of the sequences by countries is of as following; 13 from USA, 9 from China, 3 from Japan, 2 from Vietnam, 2 from Italy, 2 from Colombia, 1 from each of Israel, Iran, Greece, Brazil, Spain, India, Australia and Turkey.
Additionally, 42 coding nucleotide and protein sequences from the other CoVs were also retrieved from NCBI GenBank using the same method.The sources of the sequences, the accession numbers of the N protein sequences and the viral genomes, the CDS positions of N proteins, the lengths of the proteins and CDS were shown S-Table 2. The selected other CoV sequences are of as following; 16 from bat CoVs, 6 from pangolin CoVs (except for the protein sequence of pangolin CoV isolate MP789), 2 from civet CoV, 1 from each of rabbit and camel, 2 from human beta CoV, 1 from human enteric CoV, 3 from human MERS-CoV, 7 from human SARS and finally, 3 from avian infectious bronchitis virus (Avian IBV) Gama-CoV.

Sequence alignment and analyses
All the 81 nucleotide and protein sequences retrieved were aligned as two data sets by using BioEdit software v7.2.5 (Hall, 1999) with Clustal W multiple alignment application (Thompson, Higgins, & Gibson, 1994).Separately, 19 Sarbecoviruses selected were analyzed for sequence variations.The SARS-CoV-2 N protein percentage identities of NCBI database (NCBI, 2020) were evaluated by using BLASTP suite (Basic Local Alignment Search Tool, protein-protein Blast) with blastp option.To compare SARS-CoV-2 N proteins, the percentage of identities and cover analysis were conducted by using SARS-CoV-2 isolate N protein sequences (YP_009724397.2Wuhan-Hu-1 and QHD43423.2Shanghai/

Table 1
Nucleotide blast results of N proteins of novel SARS-CoV-2 and taxonomically related CoVs.Cover/Identity values shown in percentage.
China).Domain analyzes of SARS-CoV-2 N proteins were performed using Pfam 32.0 database (El-Gebali et al., 2019) and HAMAP database (Pedruzzi et al., 2015).Tajima's test of neutrality (Tajima, 1989) was used in MEGA X software (Kumar et al., 2018) for the calculating nucleotide diversities (π), the numbers of segregating sites (S), and the Tajima's test statistic (D) values.19 Sarbecovirus sequences, previously used for alignment were also used to perform the Tajima's test of neutrality

Phylogenetic analyzes
Phylogenetic analyzes were inferred by using both of the nucleotide and protein sequences.The nucleotide sequence based phylogenetic tree was constructed using Maximum Likelihood (ML) method and Tamura-Nei model.The nucleotide sequence based phylogenetic tree were included 82 nucleotide sequences and 1586 positions in the final data set.The two N-protein amino acid sequence based trees were also online server used for the putative phosphorylation positions (Blom et al., 2004) (http://www.cbs.dtu.dk).The putative phosphorylation positions were predicted for serine, threonine and tyrosine amino acids.

Prediction of putative N-glycosylation and phosphorylation positions
The putative N-glycosylation and phosphorylation sites of SARS-CoV-2 N protein were predicted employing two online servers, one of which was NetNGlyc 1.0 online server used for N-glycosylation positions (Gupta and Brunak 2004) (http://www.cbs.dtu.dk) and the other of which was NetPhos 3.1 online server used for the putative phosphorylation positions (Blom et al., 2004) (http://www.cbs.dtu.dk).The putative phosphorylation positions were predicted for serine, threonine and tyrosine amino acids.

Phylogeny and relative search for the N protein
The nucleocapsid protein (N protein) sequences were retrieved and compared using nucleotide mega BLAST for the estimations of the similarity levels existed in the taxonomically related CoV N proteins.The results of nucleotide blast search were shown in Table 1.
According to the results, the highest coverage and identity values were calculated between the SARS-CoV-2 samples from Kayseri/Turkey, Shanghai/China, Risaralda/ Colombia and Bat-CoV RaTG13 (100/99.05)whereas the lowest coverage and identity values were found to be between most of SARS-CoV-2 samples and Camel CoV (74/37.16).By the relevance of these data, the most similar N protein nucleotide sequence to SARS-CoV-2 N protein nucleotide sequence was found in Bat-CoV (Bat-CoV RaTG13) among all of the taxonomically related CoVs, in agreement with previous results (Cui et al., 2019;Zhang et al., 2020).

Phylogenetic analyzes
Phylogeny constructed based on the SARS-CoV-2 N protein was tested with nucleotide and protein sequences.Four main groups were identified in both of the nucleotide and protein based joining phylogenetic trees shown in Fig. 1.The main groups consist of individuals from Igacovirus (orange), Embecovirus (green), Merbecovirus (turquoise) and Sarbecovirus (yellow).All the SARS-CoV-2 N protein sequences were grouped in Sarbecovirus group with SARS, bat, pangolin and civet CoVs by 93% and 99% bootstrap values in the nucleotide sequence and protein sequence based phylogenetic trees, respectively.There were two Sarbecovirus subgroups in both phylogenetic trees: named as A and B in the nucleotide sequence tree, and C and D in the protein based phylogenetic tree.The A and C subgroups included bat, SARS and civet CoVs while the B and D subgroups consisted of SARS-CoV-2, and bat and pangolin CoVs.In consistency with the previous studies, pangolin CoV, Bat-CoV RATG13, BAT-SL ZC45 and BAT-SL ZXC21 from bat CoVs grouped with SARS-CoV-2 in both trees (Chen et al., 2020;Tilocca et al., 2020;Wu et al., 2020;Zhang et al., 2020).Additionally, pangolin CoVs (the nucleotide and protein sequences of GX-5PE, GX-2PV, GX-5PL, GX-P1E, GX-P4L and the nucleotide sequence from  Protein based phylogenetic tree of SARS-CoV-2 was shown in Fig. 2. According to the results, among 40 SARS-CoV-2 sequences, Tokyo/Japan (BCB15098), Alexandroupolis/ Greece (QIZ16579), Richmont/USA (QJA42209), Kanagawa/ Japan (BCB97908), Valencia/Spain (QIM47474) and Bogota/ Colombia (QIS30062) were diverged from the others.Especially, the Alexandroupolis/Greece (QIZ16579) and Richmont/USA (QJA42209) samples were clustered together forming a small group with 86% bootstrap value.Also there were a SARS-CoV-2 sequence from Richmont/USA (QIX14043.1)and nine other sequences from USA, Richmont/USA (QJA42209) diverged from the rest of USA samples.

Domain, sequence variation analysis and amino acid substitutions
The amino acid sequences of SARS-CoV-2 N proteins were aligned with some the other members of Sarbecovirus to reveal sequence variation and amino acid substitutions.The results of alignment were shown in Fig. 3.The results of domain analyzes conducted using Pfam and HAMAP databases showed that the sequence between 14 and 377 positions displayed matches with family Corona_nucleocora domain in Pfam database and the two sequences displayed matches in HAMAP database (the first match in the sequence was between 41 and 186 on N terminal with CoV RNA-binding domain and the second match in the sequence was between 258 and 361 on C terminal with N protein dimerization domain).Serine x"on position 176 was phosphorylated.Tajima's D, segregating sites and nucleotide diversities (π) were calculated for the selected 19 Sarbecovirus members and they were also used for sequence variation analyzes as -0.337920, 0.053546 and 86, respectively.
Amino acid substitutions can alter and/or inhibit protein function (Ng & Henikoff, 2006;Teng et al., 2010).There are some studies showing the effects of amino acid substitutions in viruses (Schrauwen et al., 2016;Perera et al., 2019).Thus, the further investigations must be done for the detection of the effects of substitutions occurred in the N protein or other functional and/or structural proteins of SARS-CoV-2.

N-glycosylation and phosphorylation positions of SARS-CoV-2 N protein
Phosphorylation is one of the important posttranscriptional modifications of viral proteins.Phosphorylation regulates vital processes, including replication, transcription, RNA binding and viral assembly (Huang et al., 2015).Although it is not known exactly how N proteins fulfill their functions, it is estimated that the phosphorylated N protein serves as unifying of the RNA genome into the virion and forming replicationtranscription complex (Carlson et al., 2020).N-glycosylation affects viral protein functions by altering protein immunogenicity and facilitating viral protein interactions with host receptors (Mossenta et al., 2017).Therefore, the detection of putative N-glycosylation and phosphorylation positions may provide useful information for future studies especially related with vaccine and/or antiviral drug development.Two online servers used for the detection of putative N-glycosylation and phosphorylation positions were NetNGlyc 1.0 and NetPhos 3.1 servers.The results of the predictions were shown in Fig. 4.
The possible serine, threonine and tyrosine phosphorylation of the N protein was predicted to be at the 100 position (with 50% possibility threshold) for all the SARS-CoV-2 samples.The extensive analysis of SARS-CoV-2 sample from Kayseri/Turkey revealed that the distribution of possibilities occurred as following: 41 at 50-60%, 14 at 60-70%, 11 at 70-80%, 10 at 80-90% and 24 at 90-100%.

Comparative analysis of 3D structure of CoV N protein
The analysis of the 3D structures of the N proteins gives us information for understanding of viral packaging, taxonomical relatedness and how it functions.For that, the putative 3D structures of N proteins from the two SARS-CoV-2 samples and 11 taxonomically related previous CoVs were generated using Phyre 2 server and the assessments of topological similarities of the 3D structures of the N proteins were done by analyzing TM Scores.The results of the assessments were given in Table 3.
Xu and Zhang (2010) stated that protein pairs with a TM score >0.5 are usually not in the same fold and if TM score falls below 0.17, the protein pairs are assumed not in the same fold.
In this study, the highest TM score was detected between Shanghai/China SARS-CoV-2 and Beta-CoV SX2013 (0.4067)In the light of this data, although the TM scores between SARS-CoV-2 and other taxonomically related CoVs were shown consistency with the taxonomical data, the 3D structures of the selected CoV samples were not at the same fold.Also, it can be said that the 3D structures of SARS-CoV-2 N proteins were shown significant divergences.

Conclusion
The novel coronavirus SARS-CoV-2, the cause of Covid 19 infection, created one of the biggest outbreak in the world history.SARS-CoV-2 has structural proteins, including S, E, M and N proteins.N protein involves in viral assembly, replication, transcription and translation (Chen et al., 2020).The sequence analysis of SARS-CoV-2 N protein RaTG13 using cover and identity values revealed that the most similar N protein sequence to SARS-CoV-2 belongs to Bat-CoV.However, the lowest cover and identity values were detected for Camel CoV.The phylogenetic analysis of SARS-CoV-2 was conducted using both of the nucleotide and protein sequences of the N protein.
Four main (Igacovirus, Embecovirus, Merbecovirus and Sarbecovirus) groups identified in both nucleotide and protein trees and SARS-CoV-2 were placed in Sarbecovirus with SARS-CoV, and bat, pangolin and civet CoVs.Bat-CoV RATG13, BAT-SL ZC45 and BAT-SL ZXC21 were grouped with SARS-CoV-2 in both trees in agreement with the related literature.Additionally, a third ML tree was constructed based on the protein sequences of 40 SARS-CoV-2 N protein samples.By the phylogenetic analysis, it was shown that some of the SARS-CoV-2 N protein samples showed divergences from the other selected samples.The diverged SARS-CoV-2 N protein samples were Tokyo/Japan (BCB15098), Alexandroupolis/ Greece (QIZ16579), Richmont/USA (QJA42209), Kanagawa/ Japan (BCB97908), Valencia/Spain (QIM47474) and Bogota/ Colombia (QIS30062).
The domain search was conducted using of both Pfam and HAMAP databases.Three domains were detected as Corona_nucleocora (Pfam), N terminal CoV RNA-binding domain (HAMAP) and C terminal N protein dimerization domain (HAMAP).The sequence variation analysis revealed seven amino acid substations within the selected SARS-CoV-2 samples.The Richmont/USA (QJA42209.1)and Greece (QIZ16579.1)SARS-CoV-2 samples were shown as having higher substitution numbers, 2 and 3 substitutions, respectively.The possible N-glycosylation positions of SARS-CoV-2 N protein were predicted as 47 NNTA with 68% and 269 NVTQ with 82%.The possible serine, threonine and tyrosine phosphorylations were predicted for 100 positions with above 50% possibility (34 of them were having higher than 80% possibility).The TM score analysis revealed that the 3D structures of SARS-CoV-2 N protein and taxonomically related CoVs were not at the same fold.Also, the TM score of N protein pairs of SARS-CoV-2 samples was calculated as 0.3946.In the light of this data, although the analysis of 3D structure data was shown to have consistency with the taxonomy of the coronavirus, SARS-CoV-2 and taxonomically related CoV N proteins, they showed significant divergences.
In last three decades, the world has encountered with some severe outbreaks caused by CoV family, including SARS (2002/03), MERS (2012/15) and Covid19 (2019-still continues).The developments of vaccines and therapeutics for fighting with the current and potential future outbreaks are crucial.Analysis, identification and interpretation of all components of pathogenic CoVs will contribute in fighting with the disease.In hope, the information gained in this study will make contributions in fighting with the current and future CoV outbreaks.

Fig. 2 .
Fig. 2. ML phylogenetic tree of SARS-CoV-2 based on amino acid sequences of the N protein.

Fig. 3 .
Fig. 3.The amino acid sequences of SARS-CoV-2 and the other members of Sarbecovirus N proteins based on multiple sequence alignment.

Fig. 4 .
Fig. 4. The results of predicted putative N-glycosylation and phosphorylation positions on three SARS-CoV-2 N proteins.

Table 3 .
The results of the similarity assessments (TM score) between SARS-CoV-2 and taxonomically related previous CoVs N protein structures.