<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JAMP</journal-id><journal-title-group><journal-title>Journal of Applied Mathematics and Physics</journal-title></journal-title-group><issn pub-type="epub">2327-4352</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jamp.2015.38131</article-id><article-id pub-id-type="publisher-id">JAMP-59121</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  Cheetah: A Library for Parallel Ultrasound Beamforming in Multi-Core Systems
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Romero-Laorden</surname><given-names>David</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Martín-Arguedas</surname><given-names>Carlos Julián</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Villazón-Terrazas</surname><given-names>Javier</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Martinez-Graullera</surname><given-names>Oscar</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Santos Peñas</surname><given-names>Matilde</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Gutierrez-Fernandez</surname><given-names>César</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Jiménez Martín</surname><given-names>Ana</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib></contrib-group><aff id="aff3"><addr-line>Department of Computer Architecture and Automation, Complutense University of Madrid, Madrid, Spain</addr-line></aff><aff id="aff1"><addr-line>ITEFI, Spanish National Research Council, Madrid, Spain</addr-line></aff><aff id="aff2"><addr-line>Department of Electronics, University of Alcalá, Alcalá de Henares, Spain</addr-line></aff><pub-date 
pub-type="epub"><day>26</day><month>08</month><year>2015</year></pub-date><volume>03</volume><issue>08</issue><fpage>1056</fpage><lpage>1061</lpage><history><date date-type="received"><day>15</day><month>August</month><year>2015</year></date><date date-type="rev-recd"><day>19</day><month>August</month><year>2015</year></date><date date-type="accepted"><day>26</day><month>August</month><year>2015</year></date></history><permissions><copyright-statement>&#169; Copyright 2014 by authors and Scientific Research Publishing Inc.</copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
   Developing new imaging methods requires proofs of concept before they are implemented in real-time scenarios. The high computational power now reached by multi-core CPUs and GPUs has driven the development of software-based beamformers. With this in mind, a library for the fast generation of ultrasound images is presented. It is based on Synthetic Aperture Focusing Techniques (SAFT) and owes its speed to parallel computing techniques. Any kind of transducer and any SAFT technique can be defined, and some pre-built SAFT methods such as 2R-SAFT and TFM are included. Furthermore, 2D and 3D imaging (slice-based or full-volume computation) is supported, along with the ability to generate both rectangular and angular images. For interpolation, linear and polynomial schemes can be chosen. The versatility of the library is ensured by interfacing it to Matlab, Python and any other programming language on different operating systems. On a standard PC equipped with a single NVIDIA Quadro 4000 (256 cores), the library calculates 262,144 pixels in ≈105 ms using a linear transducer with 64 elements, and 2,097,152 voxels in ≈5 seconds using a matrix transducer with 121 elements when TFM is applied.
 
</p></abstract><kwd-group><kwd>GPGPU</kwd><kwd>Ultrasound Image</kwd><kwd>Beamforming</kwd><kwd>Array Transducer</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In recent years, the computing industry has moved towards parallel computing. Dual-core processors (CPUs) were introduced in personal systems at the beginning of 2005, and multi-core CPUs are now common in laptops, as are 8- and 16-core workstations, which means that parallel computing is no longer confined to large supercomputers. Graphics Processing Units (GPUs), as their name suggests, came about as accelerators for graphics applications, predominantly those using the OpenGL and DirectX programming interfaces. Although they were originally pure fixed-function devices, the demand for real-time and 3D graphics made them evolve into multithreaded processors with extremely high computational power and very high memory bandwidth that are now available to anyone with a standard PC or laptop.</p><p>Since 2006, GPUs can be programmed directly in C/C++ using CUDA or OpenCL [<xref ref-type="bibr" rid="scirp.59121-ref1">1</xref>], allowing every arithmetic logic unit on the chip to be used by programs intended for general-purpose computations (GPGPU). A CUDA program consists of one or more stages that are executed on either the host (CPU) or an NVIDIA GPU. Stages that exhibit little or no data parallelism are implemented in CPU code, whereas those that exhibit a rich amount of data parallelism are implemented in GPU code. These parallel functions are called kernels, and they typically generate a large number of threads to exploit data parallelism.</p><p>From an architectural viewpoint, beamforming techniques are particularly interesting because they can be seen as a data-parallel process, which makes it possible to implement them on machines with diverse computational and I/O capabilities. 
Previous works such as [<xref ref-type="bibr" rid="scirp.59121-ref2">2</xref>] present toolboxes for beamforming computation on single-core and multi-core CPUs with very good timing results. Our research group has been working on GPUs for field modeling acceleration [<xref ref-type="bibr" rid="scirp.59121-ref3">3</xref>] [<xref ref-type="bibr" rid="scirp.59121-ref4">4</xref>] as well as for fast beamforming [<xref ref-type="bibr" rid="scirp.59121-ref5">5</xref>]-[<xref ref-type="bibr" rid="scirp.59121-ref7">7</xref>], achieving speed-ups of 150&#215; over conventional CPU beamforming. GPU beamformers are now a reality, and many research groups are working on solutions for NDT and medical applications [<xref ref-type="bibr" rid="scirp.59121-ref8">8</xref>].</p><p>The aim of this work is to present CHEETAH, a fast ultrasonic imaging library that assists the rapid development of new ultrasound beamforming strategies (currently, only SAFT methods are supported). It makes it possible to generate images on a standard PC or laptop in tens to hundreds of milliseconds for 2D images and in a few seconds for 3D volumes on the GPUs tested (Section 4). The input data can originate from either a simulation program or an experimental setup. The library is composed of several routines written in CUDA for fast execution, so an NVIDIA GPU is currently required. Version 1.0 (for Windows and Linux) will be available soon.</p></sec><sec id="s2"><title>2. CHEETAH Core Features</title><p>CHEETAH has been designed as a free multi-platform library written in C++ and CUDA that can handle a multitude of focusing methods, interpolation schemes and apodizations to generate images from real RF signals obtained from any application and any acquisition process. The main features currently supported are:</p><p>Custom transducers. Commands for defining linear and matrix arrays are provided. Arbitrary-geometry transducers such as sparse arrays can also be specified. 
Different transducers can thus be used depending on the specific application.</p><p>Custom SAFT techniques. Commands for easily defining specific SAFT emission/reception sequences are provided. The library also comes with some predefined techniques, such as 2R-SAFT [<xref ref-type="bibr" rid="scirp.59121-ref6">6</xref>] and TFM [<xref ref-type="bibr" rid="scirp.59121-ref9">9</xref>].</p><p>2D/3D imaging. Commands for composing two-dimensional and volumetric images (as well as C-scan images) are provided, which makes it possible to cover a wide range of applications.</p><p>Coherence factor. Commands for applying coherence factors are provided [<xref ref-type="bibr" rid="scirp.59121-ref10">10</xref>].</p><p>Matlab/Python bindings. As the library is written in C++ and CUDA, its functionality is available on Windows and Linux. In addition, we have developed specific bindings to connect it to Matlab, which makes it possible to reuse existing code or to use specific toolbox functionality.</p></sec><sec id="s3"><title>3. TFM Design on Cheetah Library</title><p>In this section, the TFM method is used as a case study to analyze the main implementation aspects of the beamforming algorithm in the library.</p><sec id="s3_1"><title>3.1. TFM Imaging Principles</title><p>Synthetic Aperture Focusing Techniques (SAFT) are based on the sequential activation of the array elements in emission and reception, and on the separate acquisition of all the signals involved in the process. A beamforming algorithm is then applied to focus the image dynamically in emission and reception, obtaining the maximum quality at each image point. 
One of the most common beamforming methods is the Total Focusing Method (TFM) [<xref ref-type="bibr" rid="scirp.59121-ref9">9</xref>] [<xref ref-type="bibr" rid="scirp.59121-ref11">11</xref>], which is based on the Full Matrix Array (FMA), the complete data matrix X(t) formed by every transmitter-receiver combination:</p><disp-formula id="scirp.59121-formula139"><label>(1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/59121x3.png"  xlink:type="simple"/></disp-formula><p>where X<sub>tx,rx</sub>(t) is the signal corresponding to transmitter tx and receiver rx, and N is the number of array elements. To ease the envelope computation, the acquired signals are converted into their analytic form (in-phase component I and quadrature component Q) by applying the Hilbert transform, and are then expressed as:</p><disp-formula id="scirp.59121-formula140"><label>(2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/59121x4.png"  xlink:type="simple"/></disp-formula><p>Applying the Hilbert transform in Equation (2) yields a complex data matrix. 
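</p><p>As an illustration of Equation (2), the following minimal Python sketch builds the analytic signal of a single A-scan by zeroing the negative frequencies of its spectrum, so that the real part is the in-phase component I and the imaginary part is the quadrature component Q. It is not CHEETAH code (the library performs this step with CUFFT on the GPU); the function name analytic_signal and the test tone are assumptions made for this example, and a direct DFT is used only to keep the sketch free of dependencies.</p><preformat preformat-type="code"><![CDATA[
```python
import cmath
import math


def analytic_signal(x):
    """Return the analytic signal I + jQ of a real sequence x.

    A direct O(n^2) DFT is used for clarity; a real implementation
    would use an FFT.
    """
    n = len(x)
    # Forward DFT of the real input.
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist when n is even), double the positive
    # frequencies and zero the negative ones.
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1.0
        elif k < (n + 1) // 2:
            gain = 2.0
        else:
            gain = 0.0
        spectrum[k] *= gain
    # Inverse DFT gives the complex (I, Q) samples.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


# Hypothetical test tone: 4 whole cycles in 32 samples.
tone = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
iq = analytic_signal(tone)
envelope = [abs(z) for z in iq]
```
]]></preformat><p>For this test tone, which contains a whole number of cycles, the in-phase part reproduces the cosine, the quadrature part is the corresponding sine, and the envelope equals 1 at every sample up to rounding error.</p><p>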
Then, for a 2D scenario, a Delay-and-Sum (DAS) beamforming process is used to calculate two images in the following way:</p><disp-formula id="scirp.59121-formula141"><label>(3)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/59121x5.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.59121-formula142"><label>(4)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/59121x6.png"  xlink:type="simple"/></disp-formula><p>where A<sub>I</sub>(x, z) and A<sub>Q</sub>(x, z) are the in-phase and quadrature images, respectively, and D(x, z) is the delay corresponding to the focal point (x<sub>fp</sub>, z<sub>fp</sub>), which is calculated as follows:</p><disp-formula id="scirp.59121-formula143"><label>(5)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/59121x7.png"  xlink:type="simple"/></disp-formula><p>where x<sub>tx</sub> and x<sub>rx</sub> are the coordinates of the transducer elements tx and rx, respectively. Any type of apodization can be introduced into this general scheme.</p><disp-formula id="scirp.59121-formula144"><label>(6)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/59121x8.png"  xlink:type="simple"/></disp-formula><p>Finally, the envelope is calculated according to Equation (6) to obtain the final image.</p></sec><sec id="s3_2"><title>3.2. Parallel GPU Implementation</title><p>The CHEETAH library builds on our group's previous work on beamforming acceleration using GPUs [<xref ref-type="bibr" rid="scirp.59121-ref5">5</xref>]-[<xref ref-type="bibr" rid="scirp.59121-ref7">7</xref>]. Nevertheless, we briefly review the main concepts involved in the parallel implementation of the TFM process in the library.</p><p>During processing, the input raw data and the output image pixels are stored as single-precision floating-point numbers (4 bytes). 
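</p><p>Before turning to the GPU mapping, the per-pixel computation of Section 3.1 can be written as a plain CPU reference. The following Python sketch is not CHEETAH code: it assumes the usual TFM form, in which the delay is the transmit path plus the receive path divided by the propagation velocity c, with all elements on the line z = 0, no apodization, and linear interpolation between samples. The names tfm_pixel, iq, elem_x and fs, as well as the synthetic point-scatterer data, are hypothetical.</p><preformat preformat-type="code"><![CDATA[
```python
import math


def tfm_pixel(iq, elem_x, x_fp, z_fp, c, fs):
    """Envelope at the focal point (x_fp, z_fp) from analytic FMA data.

    iq[tx][rx] is a list of complex samples (I + jQ) sampled at fs [Hz];
    elem_x holds the element x-coordinates [m]; c is the velocity [m/s].
    """
    n = len(elem_x)
    # One-way distances element -> focal point, reused for tx and rx.
    dist = [math.hypot(x_fp - xe, z_fp) for xe in elem_x]
    acc = 0j
    for tx in range(n):
        for rx in range(n):
            # Round-trip delay expressed in (fractional) samples.
            s = (dist[tx] + dist[rx]) / c * fs
            i0 = int(math.floor(s))
            signal = iq[tx][rx]
            if i0 < 0 or i0 + 1 >= len(signal):
                continue  # delay falls outside the recorded window
            frac = s - i0
            # Linear interpolation of real and imaginary parts together.
            acc += (1.0 - frac) * signal[i0] + frac * signal[i0 + 1]
    # acc.real plays the role of A_I and acc.imag of A_Q;
    # the envelope is their magnitude.
    return abs(acc)


# Synthetic check: 4 elements and one ideal point scatterer at (0, 10 mm).
C, FS = 1540.0, 40e6
ELEM_X = [-1.5e-3, -0.5e-3, 0.5e-3, 1.5e-3]
N_SAMPLES = 1200


def synthetic_fma(x_s, z_s):
    """Unit spike at the round-trip delay of a scatterer at (x_s, z_s)."""
    dist = [math.hypot(x_s - xe, z_s) for xe in ELEM_X]
    data = []
    for d_tx in dist:
        row = []
        for d_rx in dist:
            signal = [0j] * N_SAMPLES
            signal[round((d_tx + d_rx) / C * FS)] = 1 + 0j
            row.append(signal)
        data.append(row)
    return data


fma = synthetic_fma(0.0, 10e-3)
on_target = tfm_pixel(fma, ELEM_X, 0.0, 10e-3, C, FS)
off_target = tfm_pixel(fma, ELEM_X, 0.0, 20e-3, C, FS)
```
]]></preformat><p>In this synthetic check, the pixel at the scatterer position accumulates a contribution from each of the 16 transmit-receive pairs, whereas the pixel at twice the depth receives none. On the GPU, the two nested loops are what each thread executes for its own pixel.</p><p>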
We have also used fast intrinsic math routines, which provide better performance at the price of strict IEEE compliance (the precision is slightly reduced). In our case, this causes a minimal numerical difference that has little influence on the quality of the final ultrasound image.</p><p>The parallel strategies designed fit the SIMD (Single Instruction Multiple Data) model of the GPU architecture well [<xref ref-type="bibr" rid="scirp.59121-ref1">1</xref>]. The general parallelization scheme can be summarized as <xref ref-type="fig" rid="fig1">Figure 1</xref> suggests:</p><p>a) The process starts when the Full Matrix Array X(t) defined in Equation (1) is transferred from CPU memory to GPU global memory via the PCI Express bus.</p><p>b) According to Equation (2), the Hilbert transform is applied to every signal using the CUFFT library [<xref ref-type="bibr" rid="scirp.59121-ref1">1</xref>]. The parallelism strategy is signal-oriented: the algorithm is split by signal and the FFT of each one is computed, so that N &#215; N threads work concurrently. The result of these operations is a complex data matrix, I(t) and Q(t), which is stored in GPU texture memory.</p><p>c) The DAS kernel is applied to the complex data matrix to calculate N low-resolution images (LRI<sub>I</sub> and LRI<sub>Q</sub>), each one corresponding to one emitter and all receivers [<xref ref-type="bibr" rid="scirp.59121-ref8">8</xref>], as Equations (3) and (4) suggest. The parallelization is performed by launching one thread per pixel of each low-resolution image, which, for image dimensions of S<sub>X</sub> &#215; S<sub>Z</sub>, means N &#215; S<sub>X</sub> &#215; S<sub>Z</sub> threads. Several optimizations (reuse of computed values, symmetries, shared/constant memories [<xref ref-type="bibr" rid="scirp.59121-ref7">7</xref>]) have been used to maximize performance. The focusing delay is calculated on the fly and used to index the I(t) and Q(t) complex signals, interpolating the real and imaginary parts. 
Finally, the complex samples are multiplied by the corresponding apodization gains and added together to beamform each of the images.</p><p>d) A new kernel calculates the final A<sub>I</sub> and A<sub>Q</sub> images by summing the LRI<sub>I</sub> and LRI<sub>Q</sub> images, respectively. Once the final values for both images are computed, the envelope calculation is carried out. This strategy follows a pixel-oriented parallelism, launching S<sub>X</sub> &#215; S<sub>Z</sub> threads.</p><p>This scheme has been further optimized for SAFT strategies based on the coarray model and their particular characteristics, as well as for 2D and 3D imaging, in order to maximize performance and speed [<xref ref-type="bibr" rid="scirp.59121-ref6">6</xref>] [<xref ref-type="bibr" rid="scirp.59121-ref7">7</xref>].</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title>Parallel scheme for beamforming acceleration</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/59121x9.png"/></fig></sec></sec><sec id="s4"><title>4. Library Performance Evaluation</title><p>To evaluate the performance of the library we have chosen two experimental scenarios: a multi-tissue ultrasound phantom (model 040 GSE by CIRS Inc.) to evaluate 2D imaging, and a methacrylate piece with five drilled holes for 3D imaging. TFM has been chosen as the beamforming method in both cases. Details are given in <xref ref-type="table" rid="table1">Table 1</xref>.</p><p>As a 2D imaging example, we use a linear transducer with 64 elements, which results in an FMA of 64 &#215; 64 = 4096 signals. 
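</p><p>For this configuration, the amount of parallel work at each stage follows directly from the thread counts quoted in Section 3.2. The short Python sketch below only performs that arithmetic for N = 64 and an image of 512 × 512 pixels (the size used in the timing results); the function name and dictionary keys are hypothetical, and the last entry assumes that every pixel sums all transmitter-receiver pairs, before any symmetry optimization.</p><preformat preformat-type="code"><![CDATA[
```python
def tfm_thread_counts(n_elements, size_x, size_z):
    """Thread counts per stage of the parallel TFM scheme (steps b, c, d)."""
    pixels = size_x * size_z
    return {
        # Step b: one thread per acquired signal (N x N).
        "hilbert": n_elements * n_elements,
        # Step c: one thread per pixel of each of the N low-resolution images.
        "das": n_elements * pixels,
        # Step d: one thread per pixel of the final image.
        "sum_envelope": pixels,
        # Delay-and-interpolate operations if each pixel visits all N x N pairs.
        "focusing_operations": n_elements * n_elements * pixels,
    }


counts = tfm_thread_counts(64, 512, 512)
```
]]></preformat><p>With these inputs the sketch gives 4096 threads for the Hilbert stage, 16,777,216 threads for the DAS stage and 262,144 threads for the final sum and envelope, which shows that the DAS kernel dominates the workload.</p><p>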
The result of the processing is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>(a).</p><p><xref ref-type="table" rid="table2">Table 2</xref> and <xref ref-type="table" rid="table3">Table 3</xref> show the computing times for several CPUs (tested in single-core and multi-core mode) and for several GPUs with different numbers of cores, respectively.</p><p>Given that a standard PC cannot use an NVIDIA Tesla, we highlight the results of the NVIDIA Quadro 4000, which has 256 cores and 2 GB of RAM and took around 105 ms for an image of 512 &#215; 512 pixels, as well as those of the NVIDIA Quadro K5000, which took less than 50 ms.</p><p>As a 3D imaging example, we use a matrix array of 11 &#215; 11 elements. The 3D-TFM method has been applied, using an FMA of 121 &#215; 121 = 14,641 signals and the same GPUs, to generate a volume of 128 &#215; 128 &#215; 128 voxels, which is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>(b). Image generation took less than 5 seconds on the NVIDIA Quadro 4000 and less than 2 seconds on the NVIDIA Quadro K5000; more details are given in <xref ref-type="table" rid="table4">Table 4</xref>.</p></sec><sec id="s5"><title>5. Conclusions and Future Work</title><p>A fast and versatile library for the generation of ultrasound images has been created, which exploits GPU technology to implement the beamformer in software. We have described the library features in detail and quantified the benefits of using the GPU as a processing tool. 
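</p><p>As a rough quantification, the timings reported in Tables 2 and 3 can be turned into speed-up and throughput figures. The Python sketch below is only arithmetic on those published values; it assumes that both tables refer to the same 2D scenario of 512 × 512 pixels, and the helper names are hypothetical.</p><preformat preformat-type="code"><![CDATA[
```python
def speedup(cpu_seconds, gpu_milliseconds):
    """Ratio between a CPU time in seconds and a GPU time in milliseconds."""
    return cpu_seconds / (gpu_milliseconds / 1000.0)


def throughput_mpix_per_s(pixels, milliseconds):
    """Beamformed pixels per second, in millions."""
    return pixels / (milliseconds / 1000.0) / 1e6


# Timings copied from Table 2 (CPU, seconds) and Table 3 (GPU, milliseconds).
xeon_12_cores_s = 1.64
core2_single_core_s = 24.13
quadro_4000_ms = 105.07
quadro_k5000_ms = 45.11

xeon_vs_quadro_4000 = speedup(xeon_12_cores_s, quadro_4000_ms)
xeon_vs_quadro_k5000 = speedup(xeon_12_cores_s, quadro_k5000_ms)
single_core_vs_k5000 = speedup(core2_single_core_s, quadro_k5000_ms)
quadro_4000_rate = throughput_mpix_per_s(512 * 512, quadro_4000_ms)
```
]]></preformat><p>Under these assumptions, the Quadro 4000 and the Quadro K5000 are roughly 16 and 36 times faster than the best multi-core CPU result, and the Quadro 4000 beamforms about 2.5 million pixels per second.</p><p>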
Thus, with a simple graphics card equipped with NVIDIA CUDA technology, it is now possible to accelerate the development of new imaging techniques.</p><fig-group id="fig2"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title>(a) TFM image of the multi-tissue ultrasound phantom; (b) volumetric image of the five drilled holes obtained with the 3D-TFM method.</title></caption><fig id ="fig2_1"><label> (b)</label><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/59121x10.png"/></fig><fig id ="fig2_2"><label></label><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/59121x11.png"/></fig></fig-group><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title>Beamforming scenarios tested for 2D and 3D imaging</title></caption><table><thead><tr><th align="center" valign="middle" rowspan="2">Parameters</th><th align="center" valign="middle" colspan="2">Image</th></tr><tr><th align="center" valign="middle">2D Imaging</th><th align="center" valign="middle">3D Imaging</th></tr></thead><tbody><tr><td align="center" valign="middle">Scenario</td><td align="center" valign="middle">Multi-Tissue Phantom</td><td align="center" valign="middle">Methacrylate piece</td></tr><tr><td align="center" valign="middle">Medium velocity</td><td align="center" valign="middle">1540 [m/s]</td><td align="center" valign="middle">2690 [m/s]</td></tr><tr><td align="center" valign="middle">Array size</td><td align="center" valign="middle">64 elements</td><td align="center" valign="middle">121 elements</td></tr><tr><td align="center" valign="middle">Array pitch</td><td align="center" valign="middle">0.28 [mm]</td><td align="center" valign="middle">1.0 [mm] in both directions</td></tr><tr><td align="center" valign="middle">Imaging frequency</td><td align="center" valign="middle">2.6 [MHz]</td><td align="center" valign="middle">3.16 [MHz]</td></tr><tr><td align="center" valign="middle">Sample frequency</td><td align="center" valign="middle">40 [MHz]</td><td align="center" valign="middle">40 [MHz]</td></tr></tbody></table></table-wrap>
<table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title>Performance of CPU-based TFM algorithm</title></caption><table><thead><tr><th align="center" valign="middle">CPU processor</th><th align="center" valign="middle">RAM</th><th align="center" valign="middle"># Cores</th><th align="center" valign="middle">TFM time [s]</th></tr></thead><tbody><tr><td align="center" valign="middle" rowspan="2">Intel Core 2 Quad Q9450</td><td align="center" valign="middle" rowspan="2">4 GB</td><td align="center" valign="middle">1</td><td align="center" valign="middle">24.13</td></tr><tr><td align="center" valign="middle">4</td><td align="center" valign="middle">6.96</td></tr><tr><td align="center" valign="middle" rowspan="2">Intel Core i7-3632QM</td><td align="center" valign="middle" rowspan="2">8 GB</td><td align="center" valign="middle">1</td><td align="center" valign="middle">6.91</td></tr><tr><td align="center" valign="middle">8</td><td align="center" valign="middle">2.08</td></tr><tr><td align="center" valign="middle" rowspan="4">Intel Xeon E5-1650</td><td align="center" valign="middle" rowspan="4">30 GB</td><td align="center" valign="middle">1</td><td align="center" valign="middle">6.84</td></tr><tr><td align="center" valign="middle">4</td><td align="center" valign="middle">2.23</td></tr><tr><td align="center" valign="middle">8</td><td align="center" valign="middle">1.80</td></tr><tr><td align="center" valign="middle">12</td><td align="center" valign="middle">1.64</td></tr></tbody></table></table-wrap>
<table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title>Performance of GPU-based TFM algorithm using CHEETAH</title></caption><table><thead><tr><th align="center" valign="middle">NVIDIA GPU</th><th align="center" valign="middle">RAM</th><th align="center" valign="middle"># Cores</th><th align="center" valign="middle">TFM time [ms]</th></tr></thead><tbody><tr><td align="center" valign="middle">GeForce 540 M</td><td align="center" valign="middle">1 GB</td><td align="center" valign="middle">96</td><td align="center" valign="middle">265.57</td></tr><tr><td align="center" valign="middle">GeForce 9800 GTX+</td><td align="center" valign="middle">512 MB</td><td align="center" valign="middle">128</td><td align="center" valign="middle">215.01</td></tr><tr><td align="center" valign="middle">GeForce 635 M</td><td align="center" valign="middle">1 GB</td><td align="center" valign="middle">144</td><td align="center" valign="middle">239.27</td></tr><tr><td align="center" valign="middle">Quadro 4000</td><td align="center" valign="middle">2 GB</td><td align="center" valign="middle">256</td><td align="center" valign="middle">105.07</td></tr><tr><td align="center" valign="middle">Quadro K2000</td><td align="center" valign="middle">2 GB</td><td align="center" valign="middle">384</td><td align="center" valign="middle">119.23</td></tr><tr><td align="center" valign="middle">Quadro K5000</td><td align="center" valign="middle">6 GB</td><td align="center" valign="middle">1536</td><td align="center" valign="middle">45.11</td></tr></tbody></table></table-wrap>
<table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title>Performance of GPU-based 3D-TFM algorithm using CHEETAH</title></caption><table><thead><tr><th align="center" valign="middle">NVIDIA GPU</th><th align="center" valign="middle">RAM</th><th align="center" valign="middle"># Cores</th><th align="center" valign="middle">TFM time [ms]</th></tr></thead><tbody><tr><td align="center" valign="middle">GeForce 540 M</td><td align="center" valign="middle">1 GB</td><td align="center" valign="middle">96</td><td align="center" valign="middle">7972.70</td></tr><tr><td align="center" valign="middle">GeForce 9800 GTX+</td><td align="center" valign="middle">512 MB</td><td align="center" valign="middle">128</td><td align="center" valign="middle">7056.01</td></tr><tr><td align="center" valign="middle">GeForce 635 M</td><td align="center" valign="middle">1 GB</td><td align="center" valign="middle">144</td><td align="center" valign="middle">7587.25</td></tr><tr><td align="center" valign="middle">Quadro 
4000</td><td align="center" valign="middle">2 GB</td><td align="center" valign="middle">256</td><td align="center" valign="middle">4874.28</td></tr><tr><td align="center" valign="middle">Quadro K2000</td><td align="center" valign="middle">2 GB</td><td align="center" valign="middle">384</td><td align="center" valign="middle">6697.05</td></tr><tr><td align="center" valign="middle">Quadro K5000</td><td align="center" valign="middle">6 GB</td><td align="center" valign="middle">1536</td><td align="center" valign="middle">1929.13</td></tr></tbody></table></table-wrap><p>We are currently working on the implementation of more beamforming algorithms (e.g. adaptive beamforming and PA), on supporting CPU capabilities for parallel computing (OpenACC, OpenMP, MPI) and multi-GPU processing, and on improving the overall performance. We welcome collaboration and feedback from any group interested in using our library in their own research.</p></sec><sec id="s6"><title>Acknowledgements</title><p>This work has been supported by the Spanish Government and the University of Alcal&#225; under projects DPI2010-19376 and CCG2014/EXP-084, respectively.</p></sec><sec id="s7"><title>Cite this paper</title><p>David Romero-Laorden, Carlos Juli&#225;n Mart&#237;n-Arguedas, Javier Villaz&#243;n-Terrazas, Oscar Martinez-Graullera, Matilde Santos Pe&#241;as, C&#233;sar Gutierrez-Fernandez, Ana Jim&#233;nez Mart&#237;n (2015) Cheetah: A Library for Parallel Ultrasound Beamforming in Multi-Core Systems. Journal of Applied Mathematics and Physics, 03, 1056-1061. doi: 10.4236/jamp.2015.38131</p></sec></body><back><ref-list><title>References</title><ref id="scirp.59121-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Hwu, W.-M.W. and Kirk, D.B. (2010) Programming Massively Parallel Processors: A Hands-On Approach.</mixed-citation></ref><ref id="scirp.59121-ref2"><label>2</label><mixed-citation publication-type="book" xlink:type="simple">Hansen, J.M., et al. (2011) An Object-Oriented Multi-Threaded Software Beamformation Toolbox. In: D’hooge, J. and Doyley, M.M., Eds., SPIE Medical Imaging: Ultrasonic Imaging, Tomography, and Therapy, 79680Y.</mixed-citation></ref><ref id="scirp.59121-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Romero-Laorden, D., et al. (2011) Field Modelling Acceleration on Ultrasonic Systems Using Graphic Hardware. Computer Physics Communications, 182, 590-599. http://dx.doi.org/10.1016/j.cpc.2010.10.032</mixed-citation></ref><ref id="scirp.59121-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Villazón-Terrazas, J., et al. (2012) A Fast Acoustic Field Simulator. 43º Congreso Español de Acústica (TECNIACUSTICA), Evora, 1-9.</mixed-citation></ref><ref id="scirp.59121-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Romero-Laorden, D., et al. (2009) Using GPUs for Beamforming Acceleration on SAFT Imaging. IEEE International Ultrasonics Symposium, Rome, 1334-1337.</mixed-citation></ref><ref id="scirp.59121-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Martín-Arguedas, C.J., et al. (2012) An Ultrasonic Imaging System Based on a New SAFT Approach and a GPU Beamformer. IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control, 59, 1402-1412. http://dx.doi.org/10.1109/TUFFC.2012.2341</mixed-citation></ref><ref id="scirp.59121-ref7"><label>7</label><mixed-citation publication-type="book" xlink:type="simple">Romero-Laorden, D., et al. (2013) Strategies for Hardware Reduction on the Design of Portable Ultrasound Imaging Systems. In: Gunarathne, G.P.P., Ed., Advancements and Breakthroughs in Ultrasound Imaging. http://dx.doi.org/10.5772/55910</mixed-citation></ref><ref id="scirp.59121-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">So, H.K.H., et al. (2011) Medical Ultrasound Imaging: To GPU or Not to GPU. IEEE Micro, 31, 54-65. http://dx.doi.org/10.1109/MM.2011.65</mixed-citation></ref><ref id="scirp.59121-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Holmes, C., et al. (2008) Advanced Postprocessing for Scanned Ultrasonic Arrays: Application to Defect Detection and Classification in Non-Destructive Evaluation. Ultrasonics, 48, 636-642. http://dx.doi.org/10.1016/j.ultras.2008.07.019</mixed-citation></ref><ref id="scirp.59121-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Martínez-Graullera, O., et al. (2011) A New Beamforming Process Based on the Phase Dispersion Analysis. The International Congress on Ultrasonics, Gdansk, 185-188.</mixed-citation></ref><ref id="scirp.59121-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Hunter, A.J., et al. (2008) The Wavenumber Algorithm for Full-Matrix Imaging Using an Ultrasonic Array. IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control, 55, 2450-2462. http://dx.doi.org/10.1109/TUFFC.952</mixed-citation></ref></ref-list></back></article>