<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JBM</journal-id><journal-title-group><journal-title>Journal of Biosciences and Medicines</journal-title></journal-title-group><issn pub-type="epub">2327-5081</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jbm.2016.412018</article-id><article-id pub-id-type="publisher-id">JBM-72720</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Biomedical&amp;Life Sciences</subject></subj-group></article-categories><title-group><article-title>
 
 
  A Complete and Accurate Short Sequence Alignment Algorithm for Repeats
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Shuaibin</surname><given-names>Lian</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Tianliang</surname><given-names>Liu</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Ke</surname><given-names>Gong</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Xinwu</surname><given-names>Chen</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Gang</surname><given-names>Zheng</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>School of Physics and Electronic Engineering, Xinyang Normal University, Xinyang City, China</addr-line></aff><pub-date pub-type="epub"><day>01</day><month>12</month><year>2016</year></pub-date><volume>04</volume><issue>12</issue><fpage>144</fpage><lpage>151</lpage><history><date date-type="received"><day>November</day>	<month>10,</month>	<year>2016</year></date><date date-type="rev-recd"><day>Accepted:</day>	<month>December</month>	<year>11,</year>	</date><date date-type="accepted"><day>December</day>	<month>14,</month>	<year>2016</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). 
http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Eukaryotic genomes contain a significant fraction of repeats, many of which have important biomedical functions. Aligning reads from repeat regions back to the reference genome is therefore a key step in downstream genome analysis. Unfortunately, current alignment algorithms perform poorly at distinguishing repeats from non-repeats. To address this problem, we propose a new algorithm named HashRepAligner. A cross comparison with other algorithms indicated that HashRepAligner outperformed the other aligners in detecting repeats.
 
</p></abstract><kwd-group><kwd>Sequence Alignment</kwd><kwd> Next Generation Sequencing</kwd><kwd> Hash Index</kwd><kwd> Repeats Detection</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>During the past twenty years, new DNA sequencing technologies have significantly improved throughput and dramatically reduced cost [<xref ref-type="bibr" rid="scirp.72720-ref1">1</xref>]. Currently available commercial next generation sequencing (NGS) platforms include MiSeq and HiSeq from Illumina [<xref ref-type="bibr" rid="scirp.72720-ref2">2</xref>], SOLiD and Ion Torrent from Life Technologies [<xref ref-type="bibr" rid="scirp.72720-ref3">3</xref>], the RS system from Pacific Biosciences, and HeliScope from Helicos Biosciences [<xref ref-type="bibr" rid="scirp.72720-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.72720-ref5">5</xref>]. These machines can sequence a whole genome in a short time, which has inspired scientists to sequence a wide variety of animals and plants [<xref ref-type="bibr" rid="scirp.72720-ref6">6</xref>] [<xref ref-type="bibr" rid="scirp.72720-ref7">7</xref>]. NGS platforms are characterized by parallel operation, higher throughput and much lower cost [<xref ref-type="bibr" rid="scirp.72720-ref8">8</xref>], but they share the common disadvantage of producing very short reads.</p><p>Furthermore, many studies have indicated that repeats comprise a significant fraction of genomes; for example, ~20% of the Caenorhabditis elegans and Caenorhabditis briggsae genomes [<xref ref-type="bibr" rid="scirp.72720-ref9">9</xref>] and ~50% of the human genome [<xref ref-type="bibr" rid="scirp.72720-ref10">10</xref>] have been identified as repeats. Most of them have important biomedical functions and are closely related to some complex diseases [<xref ref-type="bibr" rid="scirp.72720-ref11">11</xref>] [<xref ref-type="bibr" rid="scirp.72720-ref12">12</xref>]. 
Therefore, aligning the sequencing data back to the reference genome is an important step in analyzing genome functions from NGS data. Two widely used alignment tools are Bowtie [<xref ref-type="bibr" rid="scirp.72720-ref13">13</xref>] and Soap [<xref ref-type="bibr" rid="scirp.72720-ref14">14</xref>]. Although each of them can align millions of reads in one hour, their performance in detecting repeats is poor.</p><p>To improve the completeness of repeat alignment, we propose a new algorithm named HashRepAligner, which aims to distinguish repeats from non-repeats and is based on a combination of a hash index and a sliding site match strategy. HashRepAligner has the following properties: 1) it estimates the copy number of detected repeats; 2) in terms of the completeness of repeat alignment, it outperforms other aligners. Simulated data are used to assess the feasibility of HashRepAligner, while real sequencing data are used for a cross comparison with two other aligners, Soap and Bowtie. The results indicated that HashRepAligner outperformed the others in aligning repeats. Consequently, HashRepAligner is a complete and accurate repeat alignment tool.</p></sec><sec id="s2"><title>2. Results</title><p>HashRepAligner is based on a hash index and sliding site match. The hash index speeds up the alignment process, while sliding site match uses the number of matches at every site to locate repeats, which can decrease the coverage bias and increase the confidence of repeat alignment.</p><p>HashRepAligner runs in four key steps: hash index construction, sliding site match, coverage depth estimation and boundary detection. The steps are detailed as follows.</p><p>1) Constructing hash index (<xref ref-type="fig" rid="fig1">Figure 1</xref>(a)). To improve computing speed, an indirect hash structure was designed and adopted in this step. 
Firstly, the index keywords are transformed into quaternary integers instead of being stored as strings. Secondly, the identifiers of the unique reads are recorded in a decimal list. Thirdly, the mapping relations between the unique reads and the decimal list are constructed.</p><p>2) Sliding site match (<xref ref-type="fig" rid="fig1">Figure 1</xref>(b)). Based on the hash index, the short sequences are aligned back to the reference genome. For a repetitive seed, HashRepAligner aligns all of its occurrences as long as the keywords are matched according to the hash index.</p><p>3) Coverage depth estimation (<xref ref-type="fig" rid="fig1">Figure 1</xref>(c)). After the short sequences are aligned back to the reference genome, a sliding window function is used to smooth the bias of the data, and then the coverage depth of each position in the reference genome is computed.</p><p>4) Boundary detection (<xref ref-type="fig" rid="fig1">Figure 1</xref>(d)). According to the estimated coverage depth, read counts are merged into continuous intervals. After the merging process, the mean read count <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x2.png" xlink:type="simple"/></inline-formula> of each interval is compared with the mean sequencing depth <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x3.png" xlink:type="simple"/></inline-formula>. If <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x4.png" xlink:type="simple"/></inline-formula>, the region is considered a repeat, while if <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x5.png" xlink:type="simple"/></inline-formula>, the region is considered a non-repeat.</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title> The graphic illustration of the four steps of HashRepAligner. (a) Hash index construction. 
The first column is the identifier of each read. The second column contains the unique reads. The third column contains the index words of each read. The next column is the identifier of each index word. (b) Sliding site match. The violet line represents the reference genome; the green, blue and red short lines represent the reads. (c) Coverage depth estimation. The coverage information was estimated by using a sliding window function. (d) Boundary detection. For example, if S<sub>d</sub> = 2, the mean coverage depth of a repeat region should meet M<sub>n</sub> &gt; 3, while the mean coverage depth of a non-repeat region should meet M<sub>n</sub> &lt; 2</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/5-2150315x6.png"/></fig></sec><sec id="s3"><title>3. Assessments</title><sec id="s3_1"><title>3.1. Metrics</title><p>In this part, we evaluate the performance of HashRepAligner on simulated datasets and compare it with other aligners on real NGS datasets. We use the following metrics to evaluate performance: Family, Total size, Family-accuracy, Size-accuracy, Repeat-accuracy, Copy-accuracy and Location-error. Some of them are widely recognized and used in reference [<xref ref-type="bibr" rid="scirp.72720-ref15">15</xref>]. 
Their definitions and purposes are as follows:</p><p>1) Family: Total size (T-size): the total size of the detected repeats, which is used to evaluate the completeness of the length of the detected repeats and is defined as follows.</p><disp-formula id="scirp.72720-formula6"><label>(3.1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/5-2150315x7.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x8.png" xlink:type="simple"/></inline-formula> is the total length of all detected repeats, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x9.png" xlink:type="simple"/></inline-formula> is the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x10.png" xlink:type="simple"/></inline-formula> family of repeats, and N is the number of families.</p><p>2) Family-accuracy (F-acc): the accuracy of the detected repeats, which is defined as follows:</p><disp-formula id="scirp.72720-formula7"><label>(3.2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/5-2150315x11.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x12.png" xlink:type="simple"/></inline-formula> is the number of detected families and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x13.png" xlink:type="simple"/></inline-formula> is the number of families.</p><p>3) Size-accuracy (S-acc): the length accuracy of the detected repeat, which is defined as follows:</p><disp-formula id="scirp.72720-formula8"><label>(3.3)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/5-2150315x14.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x15.png" xlink:type="simple"/></inline-formula> is 
the total length of the detected repeat and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x16.png" xlink:type="simple"/></inline-formula> is the total length of the actual repeat.</p><p>4) Repeat-accuracy (R-Acc): the global matching of the detected repetitive sequence and the actual repetitive sequence, which is defined as follows:</p><disp-formula id="scirp.72720-formula9"><label>(3.4)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/5-2150315x17.png"  xlink:type="simple"/></disp-formula><p>where R-Acc is the global matching value of the repetitive sequence, t is the total copy number, nwalign is the global alignment function of the MATLAB software, A<sub>i</sub> is the actual repetitive sequence, and B<sub>i</sub> is the detected repetitive sequence.</p><p>5) Copy Accuracy (C-Acc): the accuracy of the copy numbers of detected repeats, which is defined as follows.</p><disp-formula id="scirp.72720-formula10"><label>(3.5)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/5-2150315x18.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x19.png" xlink:type="simple"/></inline-formula> is the total number of copies of repeats. 
<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x20.png" xlink:type="simple"/></inline-formula> is the real copy number of the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x21.png" xlink:type="simple"/></inline-formula> family of repeats, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/5-2150315x22.png" xlink:type="simple"/></inline-formula> is the estimated copy number of the corresponding repeat.</p><p>6) Copy-accuracy (C-Acc): the accuracy of the detected copy number, which is defined as follows:</p><disp-formula id="scirp.72720-formula11"><label>(3.6)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/5-2150315x23.png"  xlink:type="simple"/></disp-formula><p>where C<sub>d</sub> is the detected total copy number and C<sub>a</sub> is the actual total copy number.</p><p>7) Location-error (L-Err): the location error of the repeat, which is defined as follows:</p><disp-formula id="scirp.72720-formula12"><label>(3.7)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/5-2150315x24.png"  xlink:type="simple"/></disp-formula><p>where D<sub>si</sub> is 
the starting location of the i<sub>th</sub> detected repeat, D<sub>ei</sub> is the ending location of the i<sub>th</sub> detected repeat, A<sub>si</sub> is the starting location of the i<sub>th</sub> real repeat, A<sub>ei</sub> is the ending location of the i<sub>th</sub> real repeat, and p is the total number of repeats. All the repeats are detected by using the repeat finding tool HashRepeatFinder [<xref ref-type="bibr" rid="scirp.72720-ref16">16</xref>]. To compute this metric, sequence similarity with the real repeats is computed using the swalign function in MATLAB201b.</p><p>To evaluate accuracy, metrics such as Repeat-accuracy, Copy-accuracy and Location-error are computed by aligning the corresponding items back to the reference genome.</p></sec><sec id="s3_2"><title>3.2. Simulation Study</title><p>We validated the performance of HashRepAligner on three kinds of simulated datasets containing interspersed repeats, tandem repeats and compound repeats, respectively. The effects of read depth, read length and the threshold value on HashRepAligner were then evaluated. The detailed results are shown in <xref ref-type="table" rid="table1">Table 1</xref>.</p><p>Three sequences with length L = 500 kb, 300 kb, and 500 kb contain different types of repeats. The locations of repeats and non-repeats are generated independently by HashRepAligner with basic parameters: read length L<sub>r</sub> = 50, read depth = 2, threshold value = 160 and step size = 10. Repeats shorter than 200 are removed.</p><p>As shown in <xref ref-type="table" rid="table1">Table 1</xref>, three kinds of simulated datasets containing interspersed repeats, tandem repeats, and compound repeats were used to validate the performance of HashRepAligner. The repetitive content in these three sequences represents a wide range of repeats with different copy numbers and lengths. 
Family-accuracy reached 100% and Repeat-accuracy exceeded 99%, which indicates that all families were detected correctly, and the error tolerance and Location-error of the repetitive sequences were lower than 2% and 15%, respectively. All of these results indicate that HashRepAligner can not only find different kinds of repeats and non-repeats independently but also locate the starting and ending positions of the repetitive sequence.</p></sec><sec id="s3_3"><title>3.3. Cross Comparison</title><p>In this part, we use a real NGS dataset. The genome of the bacterium Rhodobacter sphaeroides (R.s), with a genome size of 4.6 Mb, was downloaded from http://gage.cbcb.umd.edu/data/. All reads were error-corrected.</p><p>The Rhodobacter genome has two chromosomes and five plasmids; thus, even this bacterium has multiple chromosomes. Its repetitive structures were detected by the HashRepeatFinder tool [<xref ref-type="bibr" rid="scirp.72720-ref16">16</xref>]. In total, 23 families of repeats with a total size of 8.1 kb were detected. The following results can be concluded from <xref ref-type="table" rid="table2">Table 2</xref>. 
Firstly, there are 24, 20 and 26 families of</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> The performance of finding different kinds of repeats</title></caption><table><thead><tr><th align="center" valign="middle" >Sequence (Containing)</th><th align="center" valign="middle" >Family</th><th align="center" valign="middle" >Total Size (Kb)</th><th align="center" valign="middle" >Family Accuracy</th><th align="center" valign="middle" >Size Accuracy</th><th align="center" valign="middle" >Repeat Accuracy</th><th align="center" valign="middle" >Copy Accuracy</th><th align="center" valign="middle" >Location Error</th></tr></thead><tbody><tr><td align="center" valign="middle" >Interspersed repeats</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >83.321</td><td align="center" valign="middle" >100%</td><td align="center" valign="middle" >100.38%</td><td align="center" valign="middle" >99.20%</td><td align="center" valign="middle" >95.55%</td><td align="center" valign="middle" >14.85%</td></tr><tr><td align="center" valign="middle" >Tandem repeats</td><td align="center" valign="middle" >3</td><td align="center" valign="middle" >90.095</td><td align="center" valign="middle" >100%</td><td align="center" valign="middle" >100.11%</td><td align="center" valign="middle" >99.86%</td><td align="center" valign="middle" >99.08%</td><td align="center" valign="middle" >0.97%</td></tr><tr><td align="center" valign="middle" >Compound repeats</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >86.148</td><td align="center" valign="middle" >100%</td><td align="center" valign="middle" >100.17%</td><td align="center" valign="middle" >99.59%</td><td align="center" valign="middle" >98.25%</td><td align="center" valign="middle" >4.37%</td></tr></tbody></table></table-wrap><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 
2</xref></label><caption><title> The performance of three tools on R.s</title></caption><table><thead><tr><th align="center" valign="middle" >Tools</th><th align="center" valign="middle" >Family</th><th align="center" valign="middle" >Total Size (Kb)</th><th align="center" valign="middle" >Family Accuracy</th><th align="center" valign="middle" >Size Accuracy</th><th align="center" valign="middle" >Repeat Accuracy</th><th align="center" valign="middle" >Copy Accuracy</th><th align="center" valign="middle" >Location Error</th></tr></thead><tbody><tr><td align="center" valign="middle" >HashRepAligner</td><td align="center" valign="middle" >24</td><td align="center" valign="middle" >8.56</td><td align="center" valign="middle" >95.8%</td><td align="center" valign="middle" >93.21%</td><td align="center" valign="middle" >95.45%</td><td align="center" valign="middle" >92.1%</td><td align="center" valign="middle" >3.21%</td></tr><tr><td align="center" valign="middle" >Bowtie</td><td align="center" valign="middle" >20</td><td align="center" valign="middle" >7.32</td><td align="center" valign="middle" >86.9%</td><td align="center" valign="middle" >91.51%</td><td align="center" valign="middle" >86.31%</td><td align="center" valign="middle" >NA</td><td align="center" valign="middle" >6.78%</td></tr><tr><td align="center" valign="middle" >Soap</td><td align="center" valign="middle" >26</td><td align="center" valign="middle" >12.89</td><td align="center" valign="middle" >86.7%</td><td align="center" valign="middle" >83.87%</td><td align="center" valign="middle" >87.64%</td><td align="center" valign="middle" >NA</td><td align="center" valign="middle" >8.96%</td></tr></tbody></table></table-wrap><p>repeats detected by HashRepAligner, Bowtie and Soap, respectively. Their corresponding accuracies are 95.8%, 86.9% and 86.7%. Therefore, in terms of completeness, HashRepAligner outperformed the others. 
Secondly, the total sizes of the repeats aligned by the three tools are 8.56 kb, 7.32 kb and 12.89 kb, with corresponding accuracies of 93.21%, 91.51% and 83.87%, respectively. Therefore, in terms of size accuracy, HashRepAligner also outperformed the others. Thirdly, HashRepAligner can estimate the copy number of each aligned repeat with an accuracy of 92.1%, whereas Bowtie and Soap cannot estimate the copy numbers of aligned items. Lastly, HashRepAligner has the lowest location error for aligned repeats.</p></sec></sec><sec id="s4"><title>4. Conclusions and Discussions</title><p>Repeats occupy a significant fraction of eukaryotic genomes. Most of them have played and are continuing to play critical roles in genome evolution. To align these repeats more completely and accurately, we proposed a short sequence alignment algorithm for repeats, named HashRepAligner, which is based on a hash index and sliding site match. To evaluate its performance, a simulation study and a cross comparison were conducted. The results indicated that 1) HashRepAligner can align repeats more completely; 2) HashRepAligner can also estimate the copy number of each aligned item; 3) HashRepAligner can find the starting and ending locations of the repetitive sequence. In a word, HashRepAligner is a complete and accurate repeat alignment tool.</p><p>The alignment of repeats from sequencing data is a difficult task for genome analysis and still challenges many aligners, due to complex repetitive structures and large datasets. Although a large number of algorithms, including Soap and Bowtie, have been proposed to address this problem, the work is still not finished for the following reasons. 1) Similarity: repeats can be classified as identical repeats and highly similar repeats. Identical repeats are relatively easy to detect as long as the length of the repeat is determined. 
For similar repeats, however, it is difficult to unify the consensus sequences and detect them due to the uncertainty of similarity. Different researchers define repeat similarity differently according to their research tasks. 2) Types: interspersed repeats, tandem repeats and compound repeats. The complexity of repeat types is also a challenge in finding repeats. Eukaryotic genomes always contain different types of repeats. Notably, compound repeats are almost everywhere. Different aligners have different advantages and specific applications. For whole genome alignment, Soap or Bowtie would be preferred. For repeat alignment, however, HashRepAligner should be preferred.</p></sec><sec id="s5"><title>Conflict of Interests</title><p>The authors declare that there is no conflict of interests regarding the publication of this paper.</p></sec><sec id="s6"><title>Acknowledgements</title><p>This work was financially supported by the National Natural Science Foundation of China (Grant: 61501392) and the doctoral scientific research start-up funds of XYNU (No: 0201447). In addition, this study was financed in part by the Nanhu Scholars Program for Young Scholars of XYNU.</p></sec><sec id="s7"><title>Cite this paper</title><p>Lian, S.B., Liu, T.L., Gong, K., Chen, X.W. and Zheng, G. (2016) A Complete and Accurate Short Sequence Alignment Algorithm for Repeats. Journal of Biosciences and Medicines, 4, 144-151. http://dx.doi.org/10.4236/jbm.2016.412018</p></sec></body><back><ref-list><title>References</title><ref id="scirp.72720-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Shendure, J., et al. (2004) Advanced Sequencing Technologies: Methods and Goals. Nature Reviews Genetics, 5, 335-344. https://doi.org/10.1038/nrg1325</mixed-citation></ref><ref id="scirp.72720-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Bentley, D.R. (2006) Whole-Genome Re-Sequencing. Current Opinion in Genetics and Development, 16, 545-552. https://doi.org/10.1016/j.gde.2006.10.009</mixed-citation></ref><ref id="scirp.72720-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Harris, T.D., Buzby, P.R., Babcock, H., et al. (2008) Single-Molecule DNA Sequencing of a Viral Genome. Science, 320, 106-109. https://doi.org/10.1126/science.1150427</mixed-citation></ref><ref id="scirp.72720-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Metzker, M.L. (2010) Sequencing Technologies the Next Generation. Nature Reviews Genetics, 11, 31-46. https://doi.org/10.1038/nrg2626</mixed-citation></ref><ref id="scirp.72720-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Mardis, E.R. (2008) The Impact of Next-Generation Sequencing Technology on Genetics. Trends in Genetics, 24, 133-141. https://doi.org/10.1016/j.tig.2007.12.007</mixed-citation></ref><ref id="scirp.72720-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">The 1000 Genomes Project Consortium (2010) A Map of Human Genome Variation from Population-Scale Sequencing. Nature, 467, 1061-1073.</mixed-citation></ref><ref id="scirp.72720-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Genome 10K Community of Scientists (2009) Genome 10K: A Proposal to Obtain Whole-Genome Sequence for 10,000 Vertebrate Species. Journal of Heredity, 100, 659-674. https://doi.org/10.1093/jhered/esp086</mixed-citation></ref><ref id="scirp.72720-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Treangen, T.J. and Salzberg, S.L. (2012) Repetitive DNA and Next Generation Sequencing: Computational Challenges and Solutions. Nature Reviews Genetics, 13, 36-46.</mixed-citation></ref><ref id="scirp.72720-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Stein, L.D., Bao, Z., Blasiar, D., et al. (2003) The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics. PLoS Biology, 1, Article E45. https://doi.org/10.1371/journal.pbio.0000045</mixed-citation></ref><ref id="scirp.72720-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">International Human Genome Consortium (2001) Initial Sequencing and Analysis of the Human Genome. Nature, 409, 860-921. https://doi.org/10.1038/35057062</mixed-citation></ref><ref id="scirp.72720-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Iafrate, A.J., Feuk, L., Rivera, M.N., et al. (2004) Detection of Large-Scale Variation in the Human Genome. Nature Genetics, 36, 949-951. https://doi.org/10.1038/ng1416</mixed-citation></ref><ref id="scirp.72720-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Feuk, L., Carson, A.R. and Scherer, S.W. (2006) Structural Variation in the Human Genome. Nature Reviews Genetics, 7, 85-97. https://doi.org/10.1038/nrg1767</mixed-citation></ref><ref id="scirp.72720-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L. (2009) Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome. Genome Biology, 10, R25.</mixed-citation></ref><ref id="scirp.72720-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Li, R.Q., Li, Y.R., Kristiansen, K. and Wang, J. (2008) SOAP: Short Oligonucleotide Alignment Program. Bioinformatics, 24, 713-714. https://doi.org/10.1093/bioinformatics/btn025</mixed-citation></ref><ref id="scirp.72720-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Saha, S., Bridges, S., Magbanua, Z.V. and Peterson, D.G. (2008) Empirical Comparison of Ab Initio Repeat Finding Programs. Nucleic Acids Research, 36, 2284-2294. https://doi.org/10.1093/nar/gkn064</mixed-citation></ref><ref id="scirp.72720-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Lian, S.B., Chen, X.W., Wang, P., Zhang, X.L. and Dai, X.H. (2016) A Complete and Accurate Ab Initio Repeat Finding Algorithm. Interdisciplinary Sciences-Computational Life Sciences, 8, 75-83. https://doi.org/10.1007/s12539-015-0119-6</mixed-citation></ref></ref-list></back></article>