<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JAMP</journal-id><journal-title-group><journal-title>Journal of Applied Mathematics and Physics</journal-title></journal-title-group><issn pub-type="epub">2327-4352</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jamp.2019.712204</article-id><article-id pub-id-type="publisher-id">JAMP-96855</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  A New Numerical Method for DNA Sequence Analysis Based on 8-Dimensional Vector Representation
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Dandan</surname><given-names>Zhang</given-names></name><xref ref-type="aff" rid="aff1"><sub>1</sub></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><label>1</label><addr-line>Department of Mathematics, Jinan University, Guangzhou, China</addr-line></aff><pub-date pub-type="epub"><day>02</day><month>12</month><year>2019</year></pub-date><volume>07</volume><issue>12</issue><fpage>2941</fpage><lpage>2949</lpage><history><date date-type="received"><day>20,</day>	<month>August</month>	<year>2019</year></date><date date-type="rev-recd"><day>30,</day>	<month>November</month>	<year>2019</year>	</date><date date-type="accepted"><day>3,</day>	<month>December</month>	<year>2019</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Background:  Multiple sequence alignment (MSA) algorithms are the traditional way to compare and analyze DNA sequences. However, for large DNA sequences, these algorithms require long computation times. 
  Objective:  We propose a new numerical method to characterize and compare DNA sequences quickly. 
  Method:  Based on a new 2-dimensional (2D) graphical representation of DNA sequences, we obtain an 8-dimensional vector using two basic concepts of probability, the mean and the variance. 
  Results:  We perform similarity/dissimilarity analyses on two real DNA data sets: the coding sequences of the first exon of the beta-globin gene of 11 species, and 31 mammalian mitochondrial genomes. 
  Conclusion:  Our results agree with existing analyses in the literature. We also compare our approach with other methods and find that ours is more effective.
 
</p></abstract><kwd-group><kwd>DNA Map</kwd><kwd> Zigzag Curve</kwd><kwd> Numerical Characterization</kwd><kwd> Similarity Analysis</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>With the rapid growth of biological data, extracting more information from these large data sets is a challenge for scientists. An important step is to find a suitable way to digitize DNA sequences so that sequence comparison can be applied. To reduce computation time, many alignment-free sequence comparison methods have been introduced beyond the traditional multiple sequence alignment (MSA); for more details, see [<xref ref-type="bibr" rid="scirp.96855-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref3">3</xref>] and the references therein.</p><p>One approach is to use a graphical representation of DNA sequences, so that the sequences can be compared through a suitably defined feature. The pioneering work is due to Hamori and Ruskin [<xref ref-type="bibr" rid="scirp.96855-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref5">5</xref>], who used the so-called H-curve representation of a DNA sequence. Following this work, many multi-dimensional representations were considered [<xref ref-type="bibr" rid="scirp.96855-ref6">6</xref>] - [<xref ref-type="bibr" rid="scirp.96855-ref10">10</xref>]. However, these representation curves may degenerate, or may fail to be one-to-one maps from DNA sequences. To overcome these defects, many new curves were introduced [<xref ref-type="bibr" rid="scirp.96855-ref11">11</xref>] - [<xref ref-type="bibr" rid="scirp.96855-ref19">19</xref>], and some new clustering methods were considered [<xref ref-type="bibr" rid="scirp.96855-ref20">20</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref21">21</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref22">22</xref>]. 
Other representations were applied to protein sequences [<xref ref-type="bibr" rid="scirp.96855-ref23">23</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref24">24</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref25">25</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref26">26</xref>].</p><p>In [<xref ref-type="bibr" rid="scirp.96855-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref27">27</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref28">28</xref>], new methods based on a probabilistic framework were proposed. In particular, in [<xref ref-type="bibr" rid="scirp.96855-ref27">27</xref>], obtaining the eigenvector representing the zigzag curve requires the maximum eigenvalue of a related matrix, which takes a long time to compute for a very long DNA sequence. In [<xref ref-type="bibr" rid="scirp.96855-ref28">28</xref>], a polynomial curve of order 3 was used to fit the representation curve, but the choice of the order depended on the data sets. To improve on these methods, we characterize the representation curve by the mean and the variance. Following some observations in [<xref ref-type="bibr" rid="scirp.96855-ref27">27</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref28">28</xref>], we provide a map from the space of DNA sequences to the 8-dimensional Euclidean space based on a 2D graphical representation of the sequence. Using this map, we study the similarity/dissimilarity of the first exon of the beta-globin gene of eleven species and of 31 mammalian mitochondrial genomes, and obtain promising results.</p><p>The remainder of this paper is organized as follows. Section 2 presents the graphical representation of a DNA sequence and explains the procedure for similarity analysis among sequences. 
Section 3 presents the similarity results for the coding sequences of the first exon of the beta-globin gene of 11 species and for 31 mammalian mitochondrial genomes. Section 4 compares our results with those in the literature and shows the effectiveness of our method.</p></sec><sec id="s2"><title>2. Methods</title><p>Using the fact that A, T and C, G form the two base pairs, Liu [<xref ref-type="bibr" rid="scirp.96855-ref27">27</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref28">28</xref>] introduced two representations of a DNA sequence by setting A, T and C, G to the same probability, respectively. Following this idea, each nucleotide is assigned a vector as follows.</p><p>(1, 0.2) → A, (1, −0.2) → T, (1, 0.3) → C, (1, −0.3) → G.</p><p>Here the y-coordinates of A and T are assigned the same magnitude with opposite signs so that they differ on the curve, and likewise for C and G.</p><p>For a DNA sequence, we obtain a zigzag curve by joining all the vectors one by one, so that the i-th point is the sum of the first i vectors. For example, the representation of the sequence ATGCCTT is given in <xref ref-type="table" rid="table1">Table 1</xref>.</p><p>The representation curve corresponding to the sequence is shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Representation of sequence ATGCCTT</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Nucleotide</th><th align="center" valign="middle" >x-coordinate</th><th align="center" valign="middle" >y-coordinate</th></tr></thead><tr><td align="center" valign="middle" >A</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.2</td></tr><tr><td align="center" valign="middle" >T</td><td align="center" valign="middle" >2</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >G</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" 
>−0.3</td></tr><tr><td align="center" valign="middle" >C</td><td align="center" valign="middle" >4</td><td align="center" valign="middle" >0</td></tr><tr><td align="center" valign="middle" >C</td><td align="center" valign="middle" >5</td><td align="center" valign="middle" >0.3</td></tr><tr><td align="center" valign="middle" >T</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >0.1</td></tr><tr><td align="center" valign="middle" >T</td><td align="center" valign="middle" >7</td><td align="center" valign="middle" >−0.1</td></tr></tbody></table></table-wrap><p>The x-coordinate of the curve is strictly increasing, and different nucleotides have different y-steps, so this representation is a one-to-one map between DNA sequences and curves, without loss of information or degeneracy [<xref ref-type="bibr" rid="scirp.96855-ref11">11</xref>].</p><p>Based on such assignments of the four nucleotides, Liu [<xref ref-type="bibr" rid="scirp.96855-ref27">27</xref>] introduced a representation of a DNA sequence based on four horizontal lines, and then gave a map from the curve to a vector in R<sup>4</sup> using the maximal eigenvalue of a related symmetric matrix. In the rest of this section, we present a map from a DNA sequence to an 8D vector. For two DNA sequences, we compute the Euclidean distance between the two corresponding vectors, which can be regarded as a measure of the similarity/dissimilarity between the two sequences. Our method is tested on two data sets ranging from small to medium size, and from exons to genomes.</p><p>Given a DNA sequence of length n, we have a zigzag curve based on the map between bases and numbers assigned above. Let (x<sub>i</sub>, y<sub>i</sub>) be the coordinates corresponding to the i-th nucleotide of the sequence, and let z<sub>i</sub> = y<sub>i</sub>/i be the slope of the line joining the origin to the point (x<sub>i</sub>, y<sub>i</sub>); note that x<sub>i</sub> = i. 
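The construction up to this point can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the author's implementation: the names Y_STEP, zigzag_points and slopes are hypothetical, the y-steps are the values assigned to A, T, C and G above, and the curve is assumed to start at the origin so that each point is the running sum of the preceding vectors. <preformat>
```python
from itertools import accumulate

# Hypothetical names for illustration only.
# y-steps assigned in the text: every base advances x by 1 and moves y by a signed amount.
Y_STEP = {"A": 0.2, "T": -0.2, "C": 0.3, "G": -0.3}


def zigzag_points(sequence):
    """Return the curve points (x_i, y_i): x_i = i, y_i = running sum of the y-steps.

    A character outside A, T, C, G raises KeyError (an assumption of this sketch).
    """
    ys = accumulate(Y_STEP[base] for base in sequence.upper())
    return [(i, y) for i, y in enumerate(ys, start=1)]


def slopes(points):
    """Return z_i = y_i / x_i, the slope of the line from the origin to (x_i, y_i)."""
    return [y / x for x, y in points]


# Example sequence used in the text.
points = zigzag_points("ATGCCTT")
z = slopes(points)
```
</preformat> For the sequence ATGCCTT, the ordinates produced by this sketch are 0.2, 0, −0.3, 0, 0.3, 0.1 and −0.1, up to floating-point rounding. 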
Then the mean and the variance of the slopes are, respectively,</p><p><disp-formula><label>(1)</label><tex-math>m_z = \frac{1}{n}\sum_{i=1}^{n} z_i, \qquad v_z = \frac{1}{n}\sum_{i=1}^{n} (z_i - m_z)^2,</tex-math></disp-formula></p><p>which give a vector V = (m<sub>z</sub>, v<sub>z</sub>).</p><p>On the other hand, similar to [<xref ref-type="bibr" rid="scirp.96855-ref27">27</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref29">29</xref>], we could also assign A to −0.2 and T to 0.2 to get another curve, and likewise swap the signs for the bases C and G, so that there are four curves for a fixed DNA sequence, one for each of the 2 × 2 sign choices. Since every curve yields a vector V, we get four vectors V<sub>1</sub>, V<sub>2</sub>, V<sub>3</sub> and V<sub>4</sub>. Putting them together, we finally get an 8D vector E for a DNA sequence, defined by</p><p><disp-formula><label>(2)</label><tex-math>E = (V_1, V_2, V_3, V_4).</tex-math></disp-formula></p><p>Thus, given a DNA sequence, we obtain an 8D vector. That is, we have a DNA map from the space of DNA sequences to the 8-dimensional Euclidean space. Note that the term “DNA map” is used here slightly differently from [<xref ref-type="bibr" rid="scirp.96855-ref30">30</xref>], where the map goes from a DNA sequence to its zigzag representation curve.</p><p>Once the feature vector is determined, one can compare two sequences. Given two DNA sequences, we get two corresponding vectors E<sub>1</sub> and E<sub>2</sub>. The Euclidean distance d between them can then be regarded as a similarity/dissimilarity measure of the two sequences, where</p><p><disp-formula><tex-math>d = \| E_1 - E_2 \|.</tex-math></disp-formula></p><p>If two DNA sequences are identical, then d = 0. Accordingly, a smaller value of d is taken to indicate that the two sequences are more similar.</p></sec><sec id="s3"><title>3. 
Results</title><p>In this section, we study the similarities among the coding sequences of the first exon of the beta-globin gene of 11 species and among 31 mammalian mitochondrial genomes, using the similarity/dissimilarity measure d.</p><p>We first consider the beta-globin gene sequences, which are listed in <xref ref-type="table" rid="table2">Table 2</xref>; they are taken from GenBank and update the information of <xref ref-type="table" rid="table3">Table 3</xref> in [<xref ref-type="bibr" rid="scirp.96855-ref1">1</xref>]. The result is shown in <xref ref-type="table" rid="table3">Table 3</xref>. The table shows that the values of d for Human-Gorilla, Goat-Bovine and Gorilla-Chimpanzee are relatively small, which indicates that these pairs are relatively close. To examine whether our method is effective, we compare our results with those of other methods. To this end, we list some highly cited similarity results between human and other species in <xref ref-type="table" rid="table4">Table 4</xref>. Following [<xref ref-type="bibr" rid="scirp.96855-ref27">27</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref28">28</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref31">31</xref>], for convenience, we normalize each index by the Human-Goat value. In <xref ref-type="table" rid="table4">Table 4</xref>, most methods give small normalized values for Human-Gorilla and Human-Chimpanzee, which is consistent with ours.</p><p>Next we analyze 31 mammalian mitochondrial genomes and construct a phylogenetic tree. The GenBank information of these genomes can be found in [<xref ref-type="bibr" rid="scirp.96855-ref32">32</xref>], and the results obtained with UPGMA are shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>. 
In this figure, we can see that the groups Primates, Perissodactyla and Rodentia include the same species as in the results of <xref ref-type="fig" rid="fig3">Figure 3</xref> in [<xref ref-type="bibr" rid="scirp.96855-ref33">33</xref>] and <xref ref-type="fig" rid="fig2">Figure 2</xref> in [<xref ref-type="bibr" rid="scirp.96855-ref32">32</xref>], while Sheep-Goat, Dog-Wolf, Brown Bear-Polar Bear and Tiger, Cat and Leopard are</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> The coding sequences of the first exon of beta-globin gene of eleven species</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Species</th><th align="center" valign="middle" >Coding sequence</th></tr></thead><tr><td align="center" valign="middle" >Human</td><td align="center" valign="middle" >ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGT GGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG</td></tr><tr><td align="center" valign="middle" >Goat</td><td align="center" valign="middle" >ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGG CAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG</td></tr><tr><td align="center" valign="middle" >Opossum</td><td align="center" valign="middle" >ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATC TGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG</td></tr><tr><td align="center" valign="middle" >Gallus</td><td align="center" valign="middle" >ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCT GGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG</td></tr><tr><td align="center" valign="middle" >Lemur</td><td align="center" valign="middle" >ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGT GGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG</td></tr><tr><td align="center" valign="middle" >Mouse</td><td align="center" valign="middle" >ATGGTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGT GGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG</td></tr><tr><td align="center" valign="middle" >Rabbit</td><td align="center" valign="middle" 
>ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGT GGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGCAG</td></tr><tr><td align="center" valign="middle" >Rat</td><td align="center" valign="middle" >ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGT GGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG</td></tr><tr><td align="center" valign="middle" >Gorilla</td><td align="center" valign="middle" >ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTG GGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG</td></tr><tr><td align="center" valign="middle" >Bovine</td><td align="center" valign="middle" >ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGC AAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG</td></tr><tr><td align="center" valign="middle" >Chimpanzee</td><td align="center" valign="middle" >ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTG TGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCT GGGCAGGTTGGTATCAAGG</td></tr></tbody></table></table-wrap><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> The similarity result (1.0e − 2) for the coding sequences of the first exon of beta-globin gene of 11 species</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Species</th><th align="center" valign="middle" >Human</th><th align="center" valign="middle" >Goat</th><th align="center" valign="middle" >Opossum</th><th align="center" valign="middle" >Gallus</th><th align="center" valign="middle" >Lemur</th><th align="center" valign="middle" >Mouse</th><th align="center" valign="middle" >Rabbit</th><th align="center" valign="middle" >Rat</th><th align="center" valign="middle" >Gorilla</th><th align="center" valign="middle" >Bovine</th><th align="center" valign="middle" >Chimpanzee</th></tr></thead><tr><td align="center" valign="middle" >Human</td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >3.253</td><td align="center" valign="middle" >0.941</td><td align="center" valign="middle" >3.232</td><td align="center" 
valign="middle" >3.183</td><td align="center" valign="middle" >1.393</td><td align="center" valign="middle" >4.836</td><td align="center" valign="middle" >2.137</td><td align="center" valign="middle" >0.059</td><td align="center" valign="middle" >2.713</td><td align="center" valign="middle" >0.698</td></tr><tr><td align="center" valign="middle" >Goat</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >3.674</td><td align="center" valign="middle" >1.245</td><td align="center" valign="middle" >2.936</td><td align="center" valign="middle" >3.755</td><td align="center" valign="middle" >2.649</td><td align="center" valign="middle" >1.189</td><td align="center" valign="middle" >3.200</td><td align="center" valign="middle" >0.574</td><td align="center" valign="middle" >2.689</td></tr><tr><td align="center" valign="middle" >Opossum</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >3.324</td><td align="center" valign="middle" >4.094</td><td align="center" valign="middle" >2.258</td><td align="center" valign="middle" >5.593</td><td align="center" valign="middle" >2.489</td><td align="center" valign="middle" >0.983</td><td align="center" valign="middle" >3.202</td><td align="center" valign="middle" >1.550</td></tr><tr><td align="center" valign="middle" >Gallus</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >3.965</td><td align="center" valign="middle" >4.144</td><td align="center" valign="middle" >3.882</td><td align="center" valign="middle" >1.254</td><td align="center" valign="middle" >3.193</td><td align="center" valign="middle" >1.423</td><td align="center" valign="middle" >2.882</td></tr><tr><td align="center" valign="middle" 
>Lemur</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >2.427</td><td align="center" valign="middle" >2.288</td><td align="center" valign="middle" >2.920</td><td align="center" valign="middle" >3.131</td><td align="center" valign="middle" >2.552</td><td align="center" valign="middle" >2.549</td></tr><tr><td align="center" valign="middle" >Mouse</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >4.529</td><td align="center" valign="middle" >2.902</td><td align="center" valign="middle" >1.378</td><td align="center" valign="middle" >3.183</td><td align="center" valign="middle" >1.301</td></tr><tr><td align="center" valign="middle" >Rabbit</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >3.474</td><td align="center" valign="middle" >4.777</td><td align="center" valign="middle" >2.727</td><td align="center" valign="middle" >4.138</td></tr><tr><td align="center" valign="middle" >Rat</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >2.089</td><td align="center" 
valign="middle" >0.782</td><td align="center" valign="middle" >1.675</td></tr><tr><td align="center" valign="middle" >Gorilla</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >2.659</td><td align="center" valign="middle" >0.640</td></tr><tr><td align="center" valign="middle" >Bovine</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >2.126</td></tr><tr><td align="center" valign="middle" >Chimpanzee</td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >0</td></tr></tbody></table></table-wrap><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> The similarity indexes between human and other species. 
All indexes are normalized to Human-Goat ratio</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Methods</th><th align="center" valign="middle" >Goat</th><th align="center" valign="middle" >Opossum</th><th align="center" valign="middle" >Gallus</th><th align="center" valign="middle" >Lemur</th><th align="center" valign="middle" >Mouse</th><th align="center" valign="middle" >Rabbit</th><th align="center" valign="middle" >Rat</th><th align="center" valign="middle" >Gorilla</th><th align="center" valign="middle" >Bovine</th><th align="center" valign="middle" >Chimpanzee</th></tr></thead><tr><td align="center" valign="middle" >Our work</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >0.29</td><td align="center" valign="middle" >0.99</td><td align="center" valign="middle" >0.98</td><td align="center" valign="middle" >0.43</td><td align="center" valign="middle" >1.49</td><td align="center" valign="middle" >0.66</td><td align="center" valign="middle" >0.02</td><td align="center" valign="middle" >0.83</td><td align="center" valign="middle" >0.21</td></tr><tr><td align="center" valign="middle" >Chi &amp; Ding [<xref ref-type="bibr" rid="scirp.96855-ref34">34</xref>]</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >3.71</td><td align="center" valign="middle" >0.82</td><td align="center" valign="middle" >2.73</td><td align="center" valign="middle" >0.69</td><td align="center" valign="middle" >0.50</td><td align="center" valign="middle" >0.48</td><td align="center" valign="middle" >0.07</td><td align="center" valign="middle" >3.59</td><td align="center" valign="middle" >0.58</td></tr><tr><td align="center" valign="middle" >Randic et al. 
[<xref ref-type="bibr" rid="scirp.96855-ref29">29</xref>]</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >2.43</td><td align="center" valign="middle" >1.79</td><td align="center" valign="middle" >1.43</td><td align="center" valign="middle" >1.37</td><td align="center" valign="middle" >0.69</td><td align="center" valign="middle" >0.70</td><td align="center" valign="middle" >0.34</td><td align="center" valign="middle" >1.38</td><td align="center" valign="middle" >0.28</td></tr><tr><td align="center" valign="middle" >Zhang [<xref ref-type="bibr" rid="scirp.96855-ref12">12</xref>]</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >2.49</td><td align="center" valign="middle" >2.42</td><td align="center" valign="middle" >1.05</td><td align="center" valign="middle" >0.93</td><td align="center" valign="middle" >1.12</td><td align="center" valign="middle" >1.11</td><td align="center" valign="middle" >0.55</td><td align="center" valign="middle" >0.76</td><td align="center" valign="middle" >2.01</td></tr></tbody></table></table-wrap><p>also closely grouped. Our results are also consistent with those in [<xref ref-type="bibr" rid="scirp.96855-ref28">28</xref>], where 11 of these species were considered.</p></sec><sec id="s4"><title>4. Discussions</title><p>Our method provides a map from the space of DNA sequences to the 8-dimensional Euclidean space. We focus on the slope of the line joining the origin and the representation point of each nucleotide, which reflects the average rate of change of the y-coordinate along the sequence.</p><p>Unlike other probabilistic methods [<xref ref-type="bibr" rid="scirp.96855-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref35">35</xref>], which regard the sequence as a sample space, we treat a DNA sequence as a random outcome. 
Compared with the method in [<xref ref-type="bibr" rid="scirp.96855-ref27">27</xref>], our method relies only on the mean and variance of the slopes of the corresponding lines, not on eigenvalues. The resulting features are purely statistical, and computing them saves time.</p><p>As applications, we study the similarities among the beta-globin genes of eleven species and among 31 mammalian mitochondrial genomes. In <xref ref-type="table" rid="table4">Table 4</xref>, Human-Gorilla is among the most similar pairs for every method listed. In addition, our method and that in [<xref ref-type="bibr" rid="scirp.96855-ref29">29</xref>] both place Human-Chimpanzee among the two most similar pairs, which is consistent with many existing results. In contrast, the results in [<xref ref-type="bibr" rid="scirp.96855-ref12">12</xref>] [<xref ref-type="bibr" rid="scirp.96855-ref34">34</xref>] indicate that Human-Rabbit and Human-Rat are closer than Human-Chimpanzee. Moreover, <xref ref-type="fig" rid="fig2">Figure 2</xref> is consistent with the corresponding results in [<xref ref-type="bibr" rid="scirp.96855-ref28">28</xref>]. These comparisons support the usefulness of our method.</p><p>In this work, we provide an alternative map from a DNA sequence to a vector in R<sup>8</sup> based on two basic statistical quantities. The idea of our method could also be applied to the analysis of protein sequences. Although the zigzag curve representation of a DNA sequence is one-to-one, the map from curves to R<sup>8</sup> is not. That is, two different DNA sequences may have the same feature vector. In future research, we will try to extend our method to more biological data, for example by finding more suitable vectors that retain more information about the DNA sequence.</p></sec><sec id="s5"><title>Conflicts of Interest</title><p>The author declares no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s6"><title>Cite this paper</title><p>Zhang, D.D. 
(2019) A New Numerical Method for DNA Sequence Analysis Based on 8-Dimensional Vector Representation. Journal of Applied Mathematics and Physics, 7, 2941-2949. https://doi.org/10.4236/jamp.2019.712204</p></sec></body><back><ref-list><title>References</title><ref id="scirp.96855-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Jin, X., Jiang, Q., Chen, Y., et al. (2017) Similarity/Dissimilarity Calculation Methods of DNA Sequences: A Survey. Journal of Molecular Graphics and Modelling, 76, 342-355. https://doi.org/10.1016/j.jmgm.2017.07.019</mixed-citation></ref><ref id="scirp.96855-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Zielezinski, A., Vinga, S., Almeida, J. and Karlowski, W.M. (2017) Alignment-Free Sequence Comparison: Benefits, Applications, and Tools. Genome Biology, 18, Article No. 186. https://doi.org/10.1186/s13059-017-1319-7</mixed-citation></ref><ref id="scirp.96855-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Ren, J., Bai, X., Lu, Y.Y., et al. (2018) Alignment-Free Sequence Analysis and Applications. Annual Review of Biomedical Data Science, 1, 93-114. https://doi.org/10.1146/annurev-biodatasci-080917-013431</mixed-citation></ref><ref id="scirp.96855-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Hamori, E. and Ruskin, J. (1983) H Curves, a Novel Method of Representation of Nucleotide Series Especially Suited for Long DNA Sequences. The Journal of Biological Chemistry, 258, 1318-1327.</mixed-citation></ref><ref id="scirp.96855-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Hamori, E. (1985) Novel DNA Sequence Representations. Nature, 314, 585-586. https://doi.org/10.1038/314585a0</mixed-citation></ref><ref id="scirp.96855-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Gates, M.A. (1985) Simpler DNA Sequence Representations. 
Nature, 316, 219. https://doi.org/10.1038/316219a0</mixed-citation></ref><ref id="scirp.96855-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, R. and Zhang, C.T. (1994) Z Curves, an Intuitive Tool for Visualizing and Analyzing the DNA Sequences. Journal of Biomolecular Structure &amp; Dynamics, 11, 767-782. https://doi.org/10.1080/07391102.1994.10508031</mixed-citation></ref><ref id="scirp.96855-ref8"><label>8</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Nandy</surname><given-names> A. </given-names></name>,<etal>et al</etal>. (<year>1994</year>)<article-title>A New Graphical Representation and Analysis of DNA Sequence Structure: I. Methodology and Application to Globin Genes</article-title><source> Current Science</source><volume> 66</volume>,<fpage> 309</fpage>-<lpage>314</lpage>.<pub-id pub-id-type="doi"></pub-id></mixed-citation></ref><ref id="scirp.96855-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Leong, P.M. and Morgenthaler, S. (1995) Random Walk and Gap Plots of DNA Sequences. Computer Applications in the Biosciences Cabios, 11, 503-507. https://doi.org/10.1093/bioinformatics/11.5.503</mixed-citation></ref><ref id="scirp.96855-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Tang, X.C., Zhou, P.P. and Qiu, W.Y. (2010) On the Similarity/Dissimilarity of DNA Sequences Based on 4D Graphical Representation. Chinese Science Bulletin, 55, 701-704. https://doi.org/10.1007/s11434-010-0045-2</mixed-citation></ref><ref id="scirp.96855-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Yau, S.S.T., Wang, J.S., Niknejad, A., Lu, C., Jin, N. and Ho, Y.K. (2003) DNA Sequence Representation without Degeneracy. Nucleic Acids Research, 31, 3078-3080. 
https://doi.org/10.1093/nar/gkg432</mixed-citation></ref><ref id="scirp.96855-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, Z.J. (2009) DV-Curve: A Novel Intuitive Tool for Visualizing and Analyzing DNA Sequences. Bioinformatics, 25, 1112-1117. https://doi.org/10.1093/bioinformatics/btp130</mixed-citation></ref><ref id="scirp.96855-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Yu, C.L., Liang, Q.A., Yin, C.C., He, R.L. and Yau, S.S.T. (2010) A Novel Construction of Genome Space with Biological Geometry. DNA Research, 17, 155-168. https://doi.org/10.1093/dnares/dsq008</mixed-citation></ref><ref id="scirp.96855-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Yu, C.L., Deng, M. and Yau, S.S.T. (2011) DNA Sequence Comparison by a Novel Probabilistic Method. Information Sciences, 181, 1484-1492. https://doi.org/10.1016/j.ins.2010.12.010</mixed-citation></ref><ref id="scirp.96855-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Zou, S., Wang, L. and Wang, J. (2014) A 2D Graphical Representation of the Sequences of DNA Based on Triplets and Its Application. EURASIP Journal on Bioinformatics and Systems Biology, 2014, Article No. 1. https://doi.org/10.1186/1687-4153-2014-1</mixed-citation></ref><ref id="scirp.96855-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, Z.J., Li, J.Y., Pan, L.Q., et al. (2014) A Novel Visualization of DNA Sequences, Reflecting GC-Content. MATCH Communications in Mathematical and in Computer Chemistry, 72, 533-550.</mixed-citation></ref><ref id="scirp.96855-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Li, Y.S., Liu, Q. and Zheng, X.Q. (2016) DUC-Curve, a Highly Compact 2D Graphical Representation of DNA Sequences and Its Application in Sequence Alignment. Physica A, 456, 256-270. 
https://doi.org/10.1016/j.physa.2016.03.061</mixed-citation></ref><ref id="scirp.96855-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Yu, J.F., Sun, X. and Wang, J.H. (2009) TN Curve: A Novel 3D Graphical Representation of DNA Sequence Based on Trinucleotides and Its Applications. Journal of Theoretical Biology, 261, 459-468. https://doi.org/10.1016/j.jtbi.2009.08.005</mixed-citation></ref><ref id="scirp.96855-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Liao, B., Xiang, Q.L., Cai, L.J. and Cao, Z. (2013) A New Graphical Coding of DNA Sequence and Its Similarity Calculation. Physica A, 392, 4663-4667. https://doi.org/10.1016/j.physa.2013.05.015</mixed-citation></ref><ref id="scirp.96855-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Yu, C.L., Deng, M., Zheng, L., He, R.L., Yang, J. and Yau, S.S.T. (2014) DFA7, a New Method to Distinguish between Intron-Containing and Intronless Genes. PLoS ONE, 9, e101363. https://doi.org/10.1371/journal.pone.0101363</mixed-citation></ref><ref id="scirp.96855-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Yu, C.L., He, R.L. and Yau, S.S.T. (2014) Viral Genome Phylogeny Based on Lempel-Ziv Complexity and Hausdorff Distance. Journal of Theoretical Biology, 348, 12-20. https://doi.org/10.1016/j.jtbi.2014.01.022</mixed-citation></ref><ref id="scirp.96855-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Siegel, K., Altenburger, K., Hon, Y.-S., Lin, J. and Yu, C. (2015) PuzzleCluster: A Novel Unsupervised Clustering Algorithm for Binning DNA Fragments in Metagenomics. Current Bioinformatics, 10, 225-231. https://doi.org/10.2174/157489361002150518150716</mixed-citation></ref><ref id="scirp.96855-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Yau, S.S.T., Yu, C.L. and He, R. (2008) A Protein Map and Its Application. 
DNA and Cell Biology, 27, 241-250. https://doi.org/10.1089/dna.2007.0676</mixed-citation></ref><ref id="scirp.96855-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">Wu, Z.C., Xiao, X.A. and Chou, K.C. (2010) 2D-MH: A Web-Server for Generating Graphic Representation of Protein Sequences Based on the Physicochemical Properties of Their Constituent Amino Acids. Journal of Theoretical Biology, 267, 29-34. https://doi.org/10.1016/j.jtbi.2010.08.007</mixed-citation></ref><ref id="scirp.96855-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Yu, C.L., Cheng, S.Y., He, R.L. and Yau, S.S.T. (2011) Protein Map: An Alignment-Free Sequence Comparison Method Based on Various Properties of Amino Acids. Gene, 486, 110-118. https://doi.org/10.1016/j.gene.2011.07.002</mixed-citation></ref><ref id="scirp.96855-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Randic, M., Zupan, J., Balaban, A.T., Vikic-Topic, D. and Plavsic, D. (2011) Graphical Representation of Proteins. Chemical Reviews, 111, 790-862. https://doi.org/10.1021/cr800198j</mixed-citation></ref><ref id="scirp.96855-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Liu, H.L. (2018) 2D Graphical Representation of DNA Sequence Based on Horizon Lines from a Probabilistic View. Bioscience Journal, 34, 1344-1350. https://doi.org/10.14393/BJ-v34n3a2018-39932</mixed-citation></ref><ref id="scirp.96855-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">Liu, H.L. (2018) A Joint Probabilistic Model in DNA Sequences. Current Bioinformatics, 13, 234-240. https://doi.org/10.2174/1574893613666180305161928</mixed-citation></ref><ref id="scirp.96855-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">Randic, M., Vracko, M., Lers, N. and Plavsic, D. 
(2003) Analysis of Similarity/Dissimilarity of DNA Sequences Based on Novel 2-D Graphical Representation. Chemical Physics Letters, 371, 202-207. https://doi.org/10.1016/S0009-2614(03)00244-6</mixed-citation></ref><ref id="scirp.96855-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">Randic, M. (2004) Graphical Representations of DNA as 2-D Map. Chemical Physics Letters, 386, 468-471. https://doi.org/10.1016/j.cplett.2004.01.088</mixed-citation></ref><ref id="scirp.96855-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">Peng, Y. and Liu, Y.W. (2015) An Improved Mathematical Object for Graphical Representation of DNA Sequences. Current Bioinformatics, 10, 332-336. https://doi.org/10.2174/157489361003150723135559</mixed-citation></ref><ref id="scirp.96855-ref32"><label>32</label><mixed-citation publication-type="other" xlink:type="simple">Hoang, T., Yin, C.C., Zheng, H., Yu, C.L., He, R.L. and Yau, S.S.T. (2015) A New Method to Cluster DNA Sequences Using Fourier Power Spectrum. Journal of Theoretical Biology, 372, 135-145. https://doi.org/10.1016/j.jtbi.2015.02.026</mixed-citation></ref><ref id="scirp.96855-ref33"><label>33</label><mixed-citation publication-type="other" xlink:type="simple">Deng, M., Yu, C., Liang, Q., He, R.L. and Yau, S.S. (2011) A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications. PLoS ONE, 6, e17293. https://doi.org/10.1371/journal.pone.0017293</mixed-citation></ref><ref id="scirp.96855-ref34"><label>34</label><mixed-citation publication-type="other" xlink:type="simple">Chi, R. and Ding, K.Q. (2005) Novel 4D Numerical Representation of DNA Sequences. Chemical Physics Letters, 407, 63-67. https://doi.org/10.1016/j.cplett.2005.03.056</mixed-citation></ref><ref id="scirp.96855-ref35"><label>35</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, Y.S. and Chen, W. 
(2011) A New Measure for Similarity Searching in DNA Sequences. MATCH Communications in Mathematical and in Computer Chemistry, 65, 477-488.</mixed-citation></ref></ref-list></back></article>