<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2015.311008</article-id><article-id pub-id-type="publisher-id">JCC-61278</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  An Acoustic Events Recognition for Robotic Systems Based on a Deep Learning Method
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Tadaaki</surname><given-names>Niwa</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Takashi</surname><given-names>Kawakami</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Ryosuke</surname><given-names>Ooe</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Tamotsu</surname><given-names>Mitamura</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Masahiro</surname><given-names>Kinoshita</given-names></name><xref ref-type="aff" rid="aff3"><sup>3</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Masaaki</surname><given-names>Wajima</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Graduate School of Engineering, Hokkaido University of Science, Sapporo, Japan</addr-line></aff><aff id="aff3"><addr-line>Faculty of Future Design, Hokkaido University of Science, Sapporo, Japan</addr-line></aff><aff id="aff2"><addr-line>Faculty of Engineering, Hokkaido University of Science, Sapporo, Japan</addr-line></aff><pub-date pub-type="epub"><day>19</day><month>11</month><year>2015</year></pub-date><volume>03</volume><issue>11</issue><fpage>46</fpage><lpage>51</lpage><history><date date-type="received"><day>August</day>	<month>2015</month>	</date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. 
</copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
   In this paper, we present a new approach to classifying and recognizing acoustic events for multiple autonomous robot systems based on deep learning. For disaster response robotic systems, recognizing certain acoustic events in a noisy environment is very useful for carrying out a given operation. In our approach, trained deep networks constructed from Restricted Boltzmann Machines (RBMs) classify acoustic events from input waveform signals. The usefulness of the approach is discussed and evaluated on the basis of the experimental results. 
 
</p></abstract><kwd-group><kwd>Acoustic Events Recognition</kwd><kwd> Deep Learning</kwd><kwd> Restricted Boltzmann Machine</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>Social insects and other social animals can accomplish more than any individual could alone by acting in concert with other individuals. They usually communicate with each other through sounds and vibrations. In addition, certain animals recognize events in their surroundings using acoustic information obtained from environmental sounds.</p><p>On the other hand, multiple autonomous robot systems, or swarm robot systems, need to be developed for disaster response and search-and-rescue missions. The severe nuclear power plant disaster in Japan is a well-known reminder of the need for robotic response systems. Such systems are expected to accomplish difficult missions through cooperation among relatively simple robots. In this setting, detecting and recognizing environmental information is a very important function of the whole system. Autonomous robotic systems usually adopt vision-based recognition mechanisms. In swarm robot systems, however, each robot has a comparatively simple structure without a camera and acts on the basis of local information that it can easily acquire. Moreover, sound from beyond a wall cannot be recognized by a vision-based system alone. For example, it is very useful to detect and recognize explosion sounds or human voices from the other side of a wall in a noisy environment. Therefore, we focus on developing classification and recognition mechanisms for acoustic events.</p><p>Recognizing acoustic events is becoming a key component of multimedia computational systems of all types, including robotic systems. 
Until now, several methodologies have been tried for identifying real-world acoustic events, e.g., a layered Hidden Markov Model (HMM).</p><p>In real environments, it is necessary to consider that an observed sound contains multiple sound sources mixed together. For example, the sound of the environment surrounding a living space is a mixture of voices, music, car engine sounds, and other everyday sounds. Therefore, it is important to separate the sound sources or to detect a typical sound at a certain time. In this work, we focus on detecting a typical sound at a certain time.</p><p>In this paper, we discuss acoustic event classification and recognition mechanisms based on a deep learning structure. This structure is constructed from Restricted Boltzmann Machines (RBMs). For the experiments, we configured a deep network based on the convolutional RBM and convolutional Deep Belief Nets. The learning and classification results of the models are compared and discussed.</p></sec><sec id="s2"><title>2. Restricted Boltzmann Machine</title><sec id="s2_1"><title>2.1. Binary Visible Units and Binary Hidden Units</title><p>An RBM [<xref ref-type="bibr" rid="scirp.61278-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.61278-ref2">2</xref>] is an undirected graphical model that is used to describe the dependency among a set of random variables over a set of observed data. In this model, the stochastic visible units v are connected to the stochastic hidden units h. 
The joint distribution p(v, h) over the visible units and hidden units is defined through the energy function E(v, h):</p><disp-formula id="scirp.61278-formula46"><label>(1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x5.png"  xlink:type="simple"/></disp-formula><p>and the probability density p(v) over the visible units is defined as:</p><disp-formula id="scirp.61278-formula47"><label>(2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x6.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.61278-formula48"><label>(3)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x7.png"  xlink:type="simple"/></disp-formula><p>where Z is the normalization factor (or partition function), which can be estimated by the annealed importance sampling (AIS) method.</p><p>In the common case (where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x8.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x9.png" xlink:type="simple"/></inline-formula>), the energy function E(v, h) of an RBM is defined as:</p><disp-formula id="scirp.61278-formula49"><label>(4)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x10.png"  xlink:type="simple"/></disp-formula><p>where v<sub>i</sub> and h<sub>j</sub> are the states of visible unit i and hidden unit j, b<sub>i</sub> and c<sub>j</sub> are their biases, and W<sub>ij</sub> is the weight between them. Since an RBM has no intra-layer connections, the visible unit activations are conditionally independent given the hidden units, and the hidden unit activations are conditionally independent given the visible units. 
Therefore, the conditional probabilities p(v<sub>i</sub>|h) and p(h<sub>j</sub>|v) that each unit is active are given by simple functions:</p><disp-formula id="scirp.61278-formula50"><label>(5)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x11.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.61278-formula51"><label>(6)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x12.png"  xlink:type="simple"/></disp-formula><p>where sigmoid(x) = 1/(1 + e<sup>−x</sup>) is the standard sigmoid function.</p></sec><sec id="s2_2"><title>2.2. Gaussian Visible Units</title><p>For real-valued data such as natural images or Mel-Frequency Cepstral Coefficients, the Bernoulli-Bernoulli (or binary-binary) form is a poor representation. However, an RBM can be applied to model the distribution of real-valued data by adopting its Gaussian-Bernoulli (or Gaussian-binary) form [<xref ref-type="bibr" rid="scirp.61278-ref3">3</xref>] [<xref ref-type="bibr" rid="scirp.61278-ref4">4</xref>], where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x13.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x14.png" xlink:type="simple"/></inline-formula>. 
In this case, the energy function E(v, h) of an RBM is defined as:</p><disp-formula id="scirp.61278-formula52"><label>(7)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x15.png"  xlink:type="simple"/></disp-formula><p>and the conditional probability p(v<sub>i</sub>|h) is defined as:</p><disp-formula id="scirp.61278-formula53"><label>(8)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x16.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x17.png" xlink:type="simple"/></inline-formula> is the Gaussian probability density with mean <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x18.png" xlink:type="simple"/></inline-formula> and variance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x19.png" xlink:type="simple"/></inline-formula>, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x20.png" xlink:type="simple"/></inline-formula> is the variance parameter of the Gaussian noise on visible unit i.</p></sec><sec id="s2_3"><title>2.3. Contrastive Divergence Learning Algorithm</title><p>The CD-k algorithm [<xref ref-type="bibr" rid="scirp.61278-ref5">5</xref>] is a fast algorithm for approximating the gradient of the log-likelihood. Given a set of training data, the model parameters <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/61278x21.png" xlink:type="simple"/></inline-formula> of an RBM are estimated by maximum likelihood learning of p(v). The model parameters that maximize the log-likelihood are generally determined with a stochastic gradient method. 
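</p><p>As a minimal sketch of one such stochastic gradient step, the following Python code applies a single CD-1 update to a binary RBM for one training vector. It assumes the standard sigmoid forms of the conditional probabilities for binary units, with b as the visible biases and c as the hidden biases. The function names, the learning rate argument lr, and the use of hidden probabilities rather than sampled hidden states in the update are illustrative choices and not the authors' implementation.</p><preformat>
```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    """p(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    """p(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample(probs, rng):
    """Draw independent binary states with the given activation probabilities."""
    return [1 if p > rng.random() else 0 for p in probs]


def cd1_update(v0, W, b, c, lr, rng):
    """Apply one CD-1 step in place for a single binary training vector v0."""
    ph0 = hidden_probs(v0, W, c)               # statistics driven by the data
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)  # one Gibbs step: h0 to v1
    ph1 = hidden_probs(v1, W, c)               # statistics of the reconstruction
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])
    return v1


# Toy usage with hypothetical sizes: 4 visible units, 2 hidden units.
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(4)]
b = [0.0] * 4
c = [0.0] * 2
reconstruction = cd1_update([1, 0, 1, 1], W, b, c, lr=0.0001, rng=rng)
```
</preformat><p>In practice the update would be averaged over a mini-batch rather than applied to one vector at a time; the single-vector form is used here only to keep the sketch short.</p><p>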
The gradient of this log-likelihood is given through the energy function E(v, h) of an RBM:</p><disp-formula id="scirp.61278-formula54"><label>(9)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x22.png"  xlink:type="simple"/></disp-formula><p>However, this gradient is difficult to calculate exactly because the computational cost increases exponentially with the number of units. The CD algorithm approximates the gradient of the log-likelihood using k-step Gibbs sampling with the conditional probabilities p(v|h) and p(h|v). The approximate gradient is given as:</p><disp-formula id="scirp.61278-formula55"><label>(10)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x23.png"  xlink:type="simple"/></disp-formula><p>Therefore, the gradients with respect to each parameter are given as:</p><disp-formula id="scirp.61278-formula56"><label>(11)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x24.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.61278-formula57"><label>(12)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x25.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.61278-formula58"><label>(13)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/61278x26.png"  xlink:type="simple"/></disp-formula></sec></sec><sec id="s3"><title>3. Experiments and Results</title><sec id="s3_1"><title>3.1. Experiment Condition</title><p>We configured a deep neural network based on the convolutional RBM [<xref ref-type="bibr" rid="scirp.61278-ref6">6</xref>] and convolutional Deep Belief Nets (CDBN) [<xref ref-type="bibr" rid="scirp.61278-ref7">7</xref>], as shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>. The network has three convolution layers, two max pooling layers, and one fully connected layer. The settings of each layer are listed in <xref ref-type="table" rid="table1">Table 1</xref>. 
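</p><p>The output map sizes in Table 1 can be checked with a short calculation. The Python sketch below assumes unpadded convolutions and pooling that rounds up at the border; neither is stated in the paper, but together they are consistent with the listed sizes. The function names are illustrative, and the 257 &#215; 10 input is read here as 512 // 2 + 1 frequency bins by 100 ms / 10 ms frames, which is also an assumption.</p><preformat>
```python
import math


def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Output length of pooling along one axis, rounding up at the border."""
    return math.ceil((size - kernel) / stride) + 1


def trace_shapes(width, height):
    """Return (layer name, width, height) after each layer listed in Table 1."""
    # (name, size function, (kernel w, kernel h), (stride w, stride h))
    layers = [
        ("conv 1", conv_out, (5, 2), (1, 1)),
        ("pool 1", pool_out, (2, 2), (2, 2)),
        ("conv 2", conv_out, (5, 2), (1, 1)),
        ("pool 2", pool_out, (2, 2), (2, 2)),
        ("conv 3", conv_out, (5, 2), (1, 1)),
    ]
    shapes = []
    for name, size_fn, kernel, stride in layers:
        width = size_fn(width, kernel[0], stride[0])
        height = size_fn(height, kernel[1], stride[1])
        shapes.append((name, width, height))
    return shapes


# Assumed input: 512 // 2 + 1 = 257 frequency bins, 100 ms / 10 ms = 10 frames.
for name, w, h in trace_shapes(512 // 2 + 1, 100 // 10):
    print(name, w, h)
```
</preformat><p>Under these assumptions the printed widths and heights agree with the first two dimensions of the Output map size column of Table 1; the third dimension, 16, is the number of feature maps.</p><p>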
We also configured a Convolutional Neural Network (CNN) [<xref ref-type="bibr" rid="scirp.61278-ref8">8</xref>] [<xref ref-type="bibr" rid="scirp.61278-ref9">9</xref>] with the same parameters for comparison.</p><p>In the pre-training step, each layer is trained as a standard RBM using patches cut out from its inputs, in order to reduce the computational cost. CD learning with 1-step Gibbs sampling (CD1) was adopted for the RBM training, and the learning rate was 0.0001. The batch size was set to 100, and 100 epochs were executed for estimating each RBM. In the fine-tuning step and in training the CNN, we used the Adam learning method [<xref ref-type="bibr" rid="scirp.61278-ref10">10</xref>] and early stopping.</p><p>We used the training and test datasets of the D-CASE challenge [<xref ref-type="bibr" rid="scirp.61278-ref11">11</xref>] for our experiments. This dataset consists of recordings of 16 categories of typical sounds in office environments. In addition, noise has been added to the training data and test data. 256-order spectrograms were derived from the waveform by STFT analysis using a 512-point Hamming window with a frame shift of 10 milliseconds. We also constructed patches of 256 &#215; 100 milliseconds from the spectrograms using a frame shift of 50 milliseconds.</p></sec><sec id="s3_2"><title>3.2. Results and Discussions</title><p>In the experiments, the networks over-fitted (<xref ref-type="fig" rid="fig2">Figure 2</xref> and <xref ref-type="fig" rid="fig3">Figure 3</xref>). In addition, the transitions of each value are similar between the fine-tuning of the CDBN and the training of the CNN. In the classification results after learning, the f-measure of each network is not significantly different from that of random classification (<xref ref-type="fig" rid="fig4">Figure 4</xref>). 
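</p><p>For reference, a per-category f-measure of the kind plotted in Figure 4 can be computed as in the Python sketch below. It assumes the common definition F = 2TP / (2TP + FP + FN) evaluated on one label per patch; the paper does not state which variant of the f-measure was used, and the function name and the category labels in the usage lines are hypothetical.</p><preformat>
```python
def f_measure_per_category(true_labels, predicted_labels):
    """Return {category: F-measure} for two equally long label sequences."""
    pairs = list(zip(true_labels, predicted_labels))
    categories = sorted(set(true_labels) | set(predicted_labels))
    scores = {}
    for cat in categories:
        tp = sum(1 for t, p in pairs if t == cat and p == cat)
        fp = sum(1 for t, p in pairs if t != cat and p == cat)
        fn = sum(1 for t, p in pairs if t == cat and p != cat)
        denom = 2 * tp + fp + fn
        # A category that never occurs and is never predicted scores 0.
        scores[cat] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical labels for four patches.
truth = ["door", "door", "keys", "keys"]
guess = ["door", "keys", "keys", "keys"]
print(f_measure_per_category(truth, guess))
```
</preformat><p>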
Judging from the transition of each value on the training data and from the outcomes of deep learning in recent years, we believe that the cause lies in the representation of the input data.</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title> Structure of the network used in our experiments</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/61278x27.png"/></fig><fig id="fig2"  position="float"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title> Learning curve (mean cross-entropy loss)</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/61278x28.png"/></fig><fig id="fig3"  position="float"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption><title> Learning curve (accuracy)</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/61278x29.png"/></fig><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Settings of each layer</title></caption><table><thead><tr><th align="center" valign="middle"  rowspan="2"  >Layer names</th><th align="center" valign="middle"  colspan="4"  >Layer parameters</th></tr><tr><th align="center" valign="middle" >Filter size (w &#215; h)</th><th align="center" valign="middle" >Output map size</th><th align="center" valign="middle" >Stride</th><th align="center" valign="middle" >Function</th></tr></thead><tbody><tr><td align="center" valign="middle" >data</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >257 &#215; 10 &#215; 1</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >conv 1</td><td align="center" valign="middle" >5 &#215; 2</td><td align="center" valign="middle" >253 &#215; 9 &#215; 16</td><td align="center" valign="middle" >1 &#215; 1</td><td align="center" valign="middle" ></td></tr><tr><td align="center" valign="middle" >pool 1</td><td align="center" valign="middle" >2 &#215; 2</td><td align="center" valign="middle" >127 &#215; 5 &#215; 16</td><td align="center" valign="middle" >2 &#215; 2</td><td align="center" valign="middle" >ReLU</td></tr><tr><td align="center" valign="middle" >conv 2</td><td align="center" valign="middle" >5 &#215; 2</td><td align="center" valign="middle" >123 &#215; 4 &#215; 16</td><td align="center" valign="middle" >1 &#215; 1</td><td align="center" valign="middle" >-</td></tr><tr><td align="center" valign="middle" >pool 2</td><td align="center" valign="middle" >2 &#215; 2</td><td align="center" valign="middle" >62 &#215; 2 &#215; 16</td><td align="center" valign="middle" >2 &#215; 2</td><td align="center" valign="middle" >ReLU</td></tr><tr><td align="center" valign="middle" >conv 3</td><td align="center" valign="middle" >5 &#215; 2</td><td align="center" valign="middle" >58 &#215; 1 &#215; 16</td><td align="center" valign="middle" >1 &#215; 1</td><td align="center" valign="middle" >-</td></tr><tr><td align="center" valign="middle" >full 1</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >1 &#215; 1 &#215; 17</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >Soft-max</td></tr></tbody></table></table-wrap><p>conv, pool, and full denote convolution, pooling, and fully connected layers, respectively.</p><fig id="fig4"  position="float"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption><title> F-measure score of each category</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/61278x30.png"/></fig><p>In recent years, it has become known that better features can be extracted from FFT spectrograms by deep learning. 
However, our results suggest that this may not be effective in all cases. We think that better results could be obtained if the frequency analysis were performed with an auditory filter that takes the hearing mechanism of organisms into account.</p></sec></sec><sec id="s4"><title>4. Conclusions</title><p>In swarm robot systems, each robot has a comparatively simple structure without a camera and acts on the basis of local information that it can easily acquire. In this setting, detecting and recognizing environmental information is a very important function of the whole system. Therefore, we focused on developing classification and recognition mechanisms for acoustic events for swarm robots.</p><p>In this paper, we proposed acoustic event classification and recognition mechanisms based on a deep learning structure. This structure is constructed from RBMs. However, in the experiments in this paper, we could not obtain sufficient recognition accuracy in noisy environments. Judging from the transition of each value on the training data and from the outcomes of deep learning in recent years, we believe that the cause lies in the representation of the input data. We think that better results could be obtained if the frequency analysis were performed with auditory filters that take the hearing mechanism of organisms into account.</p><p>In future work, we plan to incorporate an auditory filter into our approach, and we expect this to improve the recognition accuracy.</p></sec><sec id="s5"><title>Cite this paper</title><p>Tadaaki Niwa, Takashi Kawakami, Ryosuke Ooe, Tamotsu Mitamura, Masahiro Kinoshita, Masaaki Wajima (2015) An Acoustic Events Recognition for Robotic Systems Based on a Deep Learning Method. Journal of Computer and Communications, 03, 46-51. 
doi: 10.4236/jcc.2015.311008</p></sec></body><back><ref-list><title>References</title><ref id="scirp.61278-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Hinton, G.E., Osindero, S. and Teh, Y.W. (2006) A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18, 1527-1554. http://dx.doi.org/10.1162/neco.2006.18.7.1527</mixed-citation></ref><ref id="scirp.61278-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Freund, Y. and Haussler, D. (1994) Unsupervised Learning of Distributions of Binary Vectors Using Two Layer Networks. Computer Research Laboratory [University of California, Santa Cruz].</mixed-citation></ref><ref id="scirp.61278-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Hinton, G.E. and Salakhutdinov, R.R. (2006) Reducing the Dimensionality of Data with Neural Networks. Science, 313, 504-507. http://dx.doi.org/10.1126/science.1127647</mixed-citation></ref><ref id="scirp.61278-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Cho, K., Ilin, A. and Raiko, T. (2011) Improved Learning of Gaussian-Bernoulli Restricted Boltzmann Machines. In: Artificial Neural Networks and Machine Learning – ICANN 2011, Springer Berlin Heidelberg, 10-17. 
http://dx.doi.org/10.1007/978-3-642-21735-7_2</mixed-citation></ref><ref id="scirp.61278-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Hinton, G.E. (2002) Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14, 1771-1800. http://dx.doi.org/10.1162/089976602760128018</mixed-citation></ref><ref id="scirp.61278-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Norouzi, M., Ranjbar, M. and Mori, G. (2009) Stacks of Convolutional Restricted Boltzmann Machines for Shift-Invariant Feature Learning. IEEE Conference on Computer Vision and Pattern Recognition, 2735-2742.</mixed-citation></ref><ref id="scirp.61278-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Lee, H., Grosse, R., Ranganath, R. and Ng, A.Y. (2009) Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. Proceedings of the 26th Annual International Conference on Machine Learning, 609-616. http://dx.doi.org/10.1145/1553374.1553453</mixed-citation></ref><ref id="scirp.61278-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Simard, P.Y., Steinkraus, D. and Platt, J.C. (2003) Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. In: Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), 958. http://dx.doi.org/10.1109/icdar.2003.1227801</mixed-citation></ref><ref id="scirp.61278-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012) Imagenet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1097-1105.</mixed-citation></ref><ref id="scirp.61278-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Kingma, D. and Ba, J. (2015) Adam: A Method for Stochastic Optimization. 
Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).</mixed-citation></ref><ref id="scirp.61278-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events.  
http://c4dm.eecs.qmul.ac.uk/sceneseventschallenge/</mixed-citation></ref></ref-list></back></article>