<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">CS</journal-id><journal-title-group><journal-title>Circuits and Systems</journal-title></journal-title-group><issn pub-type="epub">2153-1285</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/cs.2016.74024</article-id><article-id pub-id-type="publisher-id">CS-65861</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject><subject> Engineering</subject><subject> Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  An Efficient Approach for Segmentation, Feature Extraction and Classification of Audio Signals
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Muthumari</surname><given-names>Arumugam</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Mala</surname><given-names>Kaliappan</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>Department of Computer Science and Engineering, Mepco Schlenk Engineering College, Sivakasi, India</addr-line></aff><aff id="aff1"><addr-line>Department of Computer Science and Engineering, University College of Engineering, Ramanathapuram, India</addr-line></aff><author-notes><corresp id="cor1">* E-mail:<email>muthu_ru@yahoo.com(UA)</email>;</corresp></author-notes><pub-date pub-type="epub"><day>13</day><month>04</month><year>2016</year></pub-date><volume>07</volume><issue>04</issue><fpage>255</fpage><lpage>279</lpage><history><date date-type="received"><day>14</day>	<month>March</month>	<year>2016</year></date><date date-type="rev-recd"><day>23</day>	<month>April</month>	<year>2016</year></date><date date-type="accepted"><day>26</day>	<month>April</month>	<year>2016</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Due to the presence of non-stationarities and discontinuities in the audio signal, segmentation and classification of audio signals is a challenging task. Automatic music classification and annotation also remains challenging because of the difficulty of extracting and selecting the optimal audio features. Hence, this paper proposes an efficient approach for segmentation, feature extraction and classification of audio signals. Feature extraction based on Enhanced Mel Frequency Cepstral Coefficients (EMFCC) and Enhanced Power Normalized Cepstral Coefficients (EPNCC) is applied to the audio signal. Then, multi-level classification is performed to classify the audio signal as a musical or non-musical signal. The proposed approach achieves better performance in terms of precision, Normalized Mutual Information (NMI), F-score and entropy. The Probabilistic Neural Network (PNN) classifier achieves better performance in terms of False Rejection Rate (FRR), False Acceptance Rate (FAR), Genuine Acceptance Rate (GAR), sensitivity, specificity and accuracy with respect to the number of classes.
 
</p></abstract><kwd-group><kwd>Audio Signal</kwd><kwd> Enhanced Mel Frequency Cepstral Coefficient (EMFCC)</kwd><kwd> Enhanced Power Normalized Cepstral Coefficients (EPNCC)</kwd><kwd> Probabilistic Neural Network (PNN) Classifier</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In this paper, an efficient approach for segmentation, EMFCC-EPNCC based feature extraction and PNN-based classification of audio signals is proposed. The background section presents a brief overview of the existing audio segmentation, feature extraction and classification techniques, along with their drawbacks. The proposed work is outlined in the contribution section.</p><sec id="s1_1"><title>1.1. Background</title><p>Segmentation and classification [<xref ref-type="bibr" rid="scirp.65861-ref1">1</xref>] of audio signals play a major role in audio signal processing applications. Audio segmentation is an essential preprocessing step for audio signal processing in applications such as medical and broadcast applications. In recent years, there has been much research on automatic audio segmentation and classification using various features and techniques. Segmentation algorithms are classified into decoder-based [<xref ref-type="bibr" rid="scirp.65861-ref2">2</xref>], model-based [<xref ref-type="bibr" rid="scirp.65861-ref3">3</xref>] and metric-based [<xref ref-type="bibr" rid="scirp.65861-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.65861-ref5">5</xref>] algorithms.</p><p>Feature extraction techniques are classified as temporal and spectral feature extraction techniques [<xref ref-type="bibr" rid="scirp.65861-ref6">6</xref>] [<xref ref-type="bibr" rid="scirp.65861-ref7">7</xref>]. Temporal feature extraction [<xref ref-type="bibr" rid="scirp.65861-ref7">7</xref>] uses the waveform of the audio signal for analysis. 
Spectral feature extraction [<xref ref-type="bibr" rid="scirp.65861-ref7">7</xref>] uses the spectral representation of the audio signal for analysis. Accurate classification of the musical and non-musical segments in the audio signal is the most common problem [<xref ref-type="bibr" rid="scirp.65861-ref8">8</xref>]. Classification of audio signals has become an interesting topic as audio databases have grown and become more widely available. Automatic classification of audio information has gained popularity for organizing the large number of audio files in such databases.</p></sec><sec id="s1_2"><title>1.2. Drawbacks of Existing Techniques and Merits of Proposed Work</title><p>However, the conventional audio segmentation techniques are usually quite simple and do not consider all possible scenarios. The decoder-based segmentation approaches place the boundaries only at the silence locations, which have no connection with the acoustic changes in the audio data. The model-based approaches do not generalize to new data conditions, as the models are not compatible with them. The metric-based approaches generally require a threshold to make decisions. These thresholds are set empirically and require additional development data. Hence, there is a need for an efficient segmentation technique. The computational complexity of the traditional feature extraction approaches increases with the number of audio signals. The traditional classification techniques applied directly on the feature vectors yielded poor results. Therefore, classification of the audio signal is done without depending on the feature vectors. 
However, the existing audio classification systems do not represent the perceptual similarity of audio signals, as they mainly depend on a single similarity measure.</p><p>To overcome the challenges in the existing techniques, an efficient approach for segmentation, feature extraction and classification of audio signals is introduced in this paper. The proposed work combines a new objective function for peak and pitch extraction with EPNCC- and EMFCC-based feature extraction. Silence and irrelevant frequency details in the audio signal are eliminated. Clear features of the speech signal, filtered from other background signals, are obtained. PNN-based classification provides a better prediction of the classified label using probability estimation.</p></sec><sec id="s1_3"><title>1.3. Contribution of the Proposed Work</title><p>This paper proposes an efficient approach for segmentation, feature extraction and classification of audio signals. In the proposed work, mean filtering is used for filtering the audio signal, which reduces the Gaussian noise better than the traditional filtering techniques. Segmentation of the audio signal is performed using peak estimation and pitch extraction. Then, the spectral difference in the audio signal pattern is estimated. Feature extraction is performed by combining EMFCC-EPNCC, peak and pitch feature extraction to collect the testing features of the audio signal. Multi-label and multi-level classification is performed to classify the audio signal as a musical or non-musical signal. The category of the audio signal is extracted from the classification result. Finally, the proposed approach is compared with existing algorithms. The PNN-based classification approach achieves better performance in terms of sensitivity, specificity, accuracy, FAR, FRR and GAR. 
The proposed approach also achieves better performance in terms of precision, NMI, F-score and entropy.</p><p>The remaining sections of the paper are organized as follows: Section 2 describes the conventional works related to the audio segmentation and classification process. Section 3 explains the proposed approach, including the mean filtering, segmentation, feature extraction and PNN-based classification processes. The performance evaluation results of the proposed approach are presented in Section 4. Section 5 presents the conclusion and future work of the proposed approach.</p></sec></sec><sec id="s2"><title>2. Related Work</title><p>This section presents the conventional research works related to the automatic segmentation and classification of audio signals and feature extraction using various techniques and approaches. Haque and Kim proposed a correlation intensive fuzzy c-means (CIFCM) algorithm for the segmentation and classification of audio data. The audio cuts were detected efficiently irrespective of the presence of fading effects in the audio data. The boundaries between different types of sounds were detected and classified into clusters. The conventional FCM approach was outperformed by the proposed CIFCM approach [<xref ref-type="bibr" rid="scirp.65861-ref9">9</xref>]. An automatic segmentation approach combining SVM classification and audio self-similarity segmentation was introduced for separating the sung clips and supplement clips from pop music. Heuristic rules were used for filtering and integrating the classification result to determine the potential boundaries for the audio segment. The segmentation boundaries were determined accurately by the proposed approach [<xref ref-type="bibr" rid="scirp.65861-ref10">10</xref>]. Lef&#232;vre and Vincent proposed a two-level segmentation process by computing numerous features for each audio sequence. Initial classification was performed by using the k-means classifier and segment-related features. 
Final classification was done by using the Multidimensional Hidden Markov Models and frame-related features [<xref ref-type="bibr" rid="scirp.65861-ref11">11</xref>].</p><p>For the identification of bird species, the use of complete audio signals was outperformed by the use of short high-amplitude audio segments called pulses. Training of the Support Vector Machine (SVM) classifiers was performed by using a previously labeled database of bird songs. The best results were obtained by using the automatically obtained pulses and the SVM classifier [<xref ref-type="bibr" rid="scirp.65861-ref12">12</xref>]. Dhanalakshmi et al. proposed effective algorithms for the automatic classification of audio clips into several classes. The Auto Associative Neural Network (AANN) model was used to acquire the acoustic feature vector distribution of the classes. The weights of the network were adjusted by using the back propagation learning algorithm to reduce the mean square error of each feature vector. The Gaussian Mixture Model (GMM) for those classes was trained by using the feature vectors [<xref ref-type="bibr" rid="scirp.65861-ref13">13</xref>]. Haque and Kim proposed an efficient approach for classifying audio signals into broad categories by using a fuzzy c-means (FCM) algorithm. Different characteristic features of the audio signals were analyzed and an optimal feature vector was selected using an analytical scoring technique. The FCM-based classification scheme was applied on the optimal feature vector to achieve efficient classification performance [<xref ref-type="bibr" rid="scirp.65861-ref14">14</xref>].</p><p>Dhanalakshmi et al. proposed effective algorithms for the automatic classification of audio clips into six classes. The audio content was characterized by extracting acoustic features such as Linear Prediction Cepstral Coefficients (LPCC) and Mel Frequency Cepstral Coefficients (MFCC). 
A method for indexing the classified audio was proposed by utilizing the k-means clustering algorithm and LPCC features [<xref ref-type="bibr" rid="scirp.65861-ref15">15</xref>]. The spatial distribution of the microphones in ad-hoc microphone arrays was used for the accurate classification of disturbed signals. The proposed algorithm was evaluated in simulated reverberant scenarios and on multichannel recordings of a microphone setup in the real-time environment. The cluster-based classification accuracy of the proposed algorithm was found to be high [<xref ref-type="bibr" rid="scirp.65861-ref16">16</xref>]. Bhat et al. proposed an automated and efficient method for detecting the mood or emotions of music. The songs were classified according to mood, based on Thayer’s model. Various features of the music were analyzed before the classification. Western and Indian Hindi film music from a database of over 100 songs was classified. The classification efficiency of the proposed method was improved [<xref ref-type="bibr" rid="scirp.65861-ref17">17</xref>].</p><p>Gergen and Martin introduced various data combination strategies for the efficient classification of audio signals. The audio classification performance was analyzed based on simulations and audio recordings. High classification accuracy was achieved [<xref ref-type="bibr" rid="scirp.65861-ref18">18</xref>]. The automatic estimation of audio chords was addressed using stacked generalization of multiple classifiers over Hidden Markov Model (HMM) estimators. A new compositional hierarchical model and standard chroma feature vectors were modelled with the HMMs for estimating the chords in music recordings. A binary decision tree and SVM were proposed for binding the HMM estimations into a new feature vector. The classification efficiency was improved with the additional stacking of the classifiers [<xref ref-type="bibr" rid="scirp.65861-ref19">19</xref>]. 
Murthy and Koolagudi [<xref ref-type="bibr" rid="scirp.65861-ref20">20</xref>] employed machine learning algorithms and signal processing techniques to identify the vocal and non-vocal regions of songs. The characteristics of vocal and non-vocal segments were obtained by using Artificial Neural Networks (ANN). The classification accuracy of the vocal and non-vocal segments was improved. Koolagudi and Krothapalli [<xref ref-type="bibr" rid="scirp.65861-ref21">21</xref>] performed recognition of emotions from speech signals by using spectral features including LPCC and MFCC. Vowel onset points were used to determine the consonant, vowel and transition regions of each syllable. The emotions in the speech signal were identified by exploring the sub-syllabic regions.</p><p>Lude&#241;a-Choez and Gallardo-Antol&#237;n [<xref ref-type="bibr" rid="scirp.65861-ref10">10</xref>] studied the spectral characteristics of various acoustic events along with the speech spectra. A novel parameter for the Acoustic Event Classification (AEC) process was proposed. The performance of the proposed approach in clean and noisy conditions was higher than that of the conventional MFCC in an AEC task. Geiger et al. [<xref ref-type="bibr" rid="scirp.65861-ref22">22</xref>] presented an acoustic scene classification system using audio feature extraction. The spectral, energy, voice-related and cepstral audio features were extracted from the recordings of acoustic scenes. The shorter and longer recordings were classified using SVM and a majority voting scheme. From the feature analysis, Mel spectra were found to be the most relevant features. Higher accuracy was achieved when compared to the existing classification approaches. Oh and Chung [<xref ref-type="bibr" rid="scirp.65861-ref23">23</xref>] used a method for extracting features from the speech signal using a non-parametric correlation coefficient. 
The performance of the proposed method was better than that of selective feature extraction using cross correlation. Gajšek et al. [<xref ref-type="bibr" rid="scirp.65861-ref24">24</xref>] presented an efficient approach for modeling the acoustic features to recognize various paralinguistic phenomena. The Universal Background Model (UBM) was represented by building a monophone-based Hidden Markov Model (HMM). The proposed method achieved better results than the state-of-the-art systems.</p><p>Anguera [<xref ref-type="bibr" rid="scirp.65861-ref25">25</xref>] combined the K-means clustering algorithm and GMM posteriorgrams to obtain highly discriminant features. The evaluation results showed that the standard MFCC features were outperformed by the GMM posteriorgrams. Salamon et al. [<xref ref-type="bibr" rid="scirp.65861-ref26">26</xref>] presented a novel method for the classification of musical genre based on high-level melodic features extracted directly from the audio signal of polyphonic music. The melodic features were used for the classification of excerpts into different musical genres by using machine learning algorithms. The proposed method was compared with a standard approach using low-level timbre features.</p><p>Alam et al. [<xref ref-type="bibr" rid="scirp.65861-ref27">27</xref>] analyzed the performance of the multi-taper MFCC and Perceptual Linear Prediction (PLP) features. The robust PLP features were computed by using multitapers. The recognition accuracy was improved significantly by using the MFCC and PLP features computed through multitapers. Muthumari and Mala [<xref ref-type="bibr" rid="scirp.65861-ref28">28</xref>] presented a study of the existing audio segmentation and classification techniques and a comparison of the performance of the existing approaches. In that article, typical feature extraction techniques used in audio information retrieval for different music elements were reviewed. 
Two main paradigms for audio classification were presented with their advantages and drawbacks. The drawbacks of the existing approaches and the merits of the proposed work are summarized in <xref ref-type="table" rid="table1">Table 1</xref>.</p></sec><sec id="s3"><title>3. Efficient Approach for Segmentation, Feature Extraction and Classification of Audio Signals</title><p>The proposed approach is explained in this section. Smoothing of the audio signal is performed by using a mean filter. Segmentation of the audio signal is performed by using the peak estimation and pitch extraction processes. Peak estimation is applied to identify the variation between the previous and present values of the signal amplitude with respect to the sampling time. Pitch extraction is performed based on the frequency difference of the audio signal. Then, it is determined whether the pitch satisfies the segmentation of the signal sample, based on the pitch frequency deviation.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Drawbacks of existing approaches and merits of proposed work</title></caption><table><thead><tr><th align="center" valign="middle" >Drawbacks of existing segmentation, classification and feature extraction approaches</th><th align="center" valign="middle" >Merits of our proposed work</th></tr></thead><tbody><tr><td align="center" valign="middle" >・ They can be applied only to discrete audio segments. ・ Extra overhead is generated during the computation of MFCC features. ・ Speech is poorly distinguished from music signals. ・ The accuracy of the existing classification techniques is low. ・ High computational complexity and cost.</td><td align="center" valign="middle" >・ Clear features of the speech signal, filtered from other background signals, are obtained. ・ PNN-based classification provides a better prediction of the classified label using probability estimation. ・ Silence and irrelevant frequency details in the audio signal are eliminated. ・ The PNN classification approach achieves efficient classification of musical and non-musical signals. ・ The accuracy of the proposed approach is high.</td></tr></tbody></table></table-wrap><p>The index region of the audio sample is extracted and represented as a projection line over the audio signal. Segmentation of the audio signal is done by extracting the signal amplitude according to the window selection of the sampling time. EMFCC-EPNCC is applied, in combination with the peak-estimated signal features, to extract the testing features for the classification stage. Classification of the audio signal into a musical or non-musical signal is done by using the PNN classifier. From this classification result, the category of the audio signal is specified. This is done to extract the index of the audio input for retrieving the audio signal. The overall flow diagram of the proposed approach is shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>. The main stages of the proposed work are:</p><p>・ Mean filter</p><p>・ Segmentation</p><p>・ Peak Estimation</p><p>・ Peak extraction</p><p>・ Pitch extraction</p><p>・ Feature Extraction</p><p>・ EMFCC-EPNCC based feature extraction</p><p>・ PNN-based classification</p><sec id="s3_1"><title>3.1. Mean Filter</title><p>Filtering of the audio signal is performed by using the mean filter. The mean filter is applied directly to the input audio signal, without requiring any knowledge of the statistical characteristics of the audio signal. This filter slides a small movable window over the samples of the audio signal. The smoothed signal is obtained by computing the mean of the values within the window and replacing the central window element with that mean. The amplitude of the audio signal is normalized and the Gaussian noise present in the audio signal is reduced. The filtered signal is then passed to the segmentation process. 
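A minimal sketch of such a moving-average (mean) filter is given below in Python. The window length of 5 samples, the truncation of the window at the signal edges and the function name mean_filter are illustrative assumptions, not values specified in this work.<preformat><![CDATA[
```python
def mean_filter(signal, window=5):
    """Smooth a 1-D signal by replacing each sample with the mean of the
    samples inside a sliding window centred on it.

    `window` (odd; 5 is an assumed, illustrative value). Near the edges
    the window is truncated to the samples that are available.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        neighbourhood = signal[lo:hi]
        # Replace the central element with the mean of its neighbourhood.
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed
```
]]></preformat>A wider window removes more noise but also blurs genuine amplitude changes, so the window length is a trade-off to be tuned per data set. 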
The plot of the input audio signal is depicted in <xref ref-type="fig" rid="fig2">Figure 2</xref>(a) and the filtered audio signal is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>(b).</p></sec><sec id="s3_2"><title>3.2. Segmentation</title><p>The main purpose of the segmentation process is to divide the input audio signal into homogeneous segments. This is done by evaluating the similarity between two contiguous windows of fixed length in the cepstral domain. The audio segmentation is performed by using three processes:</p><p>・ Peak Estimation</p><p>・ Peak Extraction</p><p>・ Pitch Extraction</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title> Overall flow diagram of the efficient approach for segmentation, feature extraction and classification of audio signals</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x7.png"/></fig><fig id="fig2"  position="float"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title> (a) Plot of input audio signal; (b) Filtered signal</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x8.png"/></fig><p>During peak estimation, peaks are calculated from the amplitude and frequency of the input signal using the parameters α, β and γ. The threshold peak value is calculated based on the average value of the signal. The interpolated peak location is calculated, and the peak condition is checked by comparing the peak magnitude with the threshold peak magnitude estimate. If the peak magnitude is greater than the estimate, then it is noted as a peak within the sampled signal. 
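A minimal sketch of this thresholding and interpolation step is given below in Python. It assumes the standard three-point parabolic interpolation, in which α, β and γ are the samples before, at and after a local maximum, and it uses the mean absolute amplitude as the threshold; these choices and the function names parabolic_peak and find_peaks are illustrative assumptions rather than the exact formulation of this work.<preformat><![CDATA[
```python
def parabolic_peak(alpha, beta, gamma):
    """Three-point parabolic interpolation around a local maximum.

    alpha, beta, gamma: samples before, at and after the peak
    (assumed meaning of the three parameters).
    Returns (offset, height): offset is relative to the centre sample,
    height is the interpolated peak magnitude.
    """
    denom = alpha - 2.0 * beta + gamma
    if denom == 0.0:  # flat top: no curvature to interpolate
        return 0.0, beta
    offset = 0.5 * (alpha - gamma) / denom
    height = beta - 0.25 * (alpha - gamma) * offset
    return offset, height


def find_peaks(signal):
    """Return (location, magnitude) of local maxima above the threshold.

    The threshold is the mean absolute amplitude (an assumed choice).
    """
    threshold = sum(abs(x) for x in signal) / len(signal)
    peaks = []
    for k in range(1, len(signal) - 1):
        a, b, g = signal[k - 1], signal[k], signal[k + 1]
        if b > a and b >= g:  # local maximum at sample k
            offset, height = parabolic_peak(a, b, g)
            if height > threshold:
                peaks.append((k + offset, height))
    return peaks
```
]]></preformat>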
The interpolated peak location is given as,</p><disp-formula id="scirp.65861-formula64"><label>(1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x9.png"  xlink:type="simple"/></disp-formula><p>The peak magnitude estimate is given as,</p><disp-formula id="scirp.65861-formula65"><label>(2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x10.png"  xlink:type="simple"/></disp-formula><p>where “α” is the starting edge of the parabola of the signal, “β” is the peak amplitude edge of the signal and “γ” is the finishing edge of the parabola of the signal. These parameters are calculated from the transformed signal obtained as the result of the MFCC method. <xref ref-type="fig" rid="fig3">Figure 3</xref>(a) shows the input audio signal, <xref ref-type="fig" rid="fig3">Figure 3</xref>(b) shows the audio signal after cancellation of Direct Current (DC) drift and normalization, <xref ref-type="fig" rid="fig3">Figure 3</xref>(c) shows the audio signal after applying the derivative function, <xref ref-type="fig" rid="fig3">Figure 3</xref>(d) shows the integrated signal and <xref ref-type="fig" rid="fig3">Figure 3</xref>(e) shows the audio signal with peak points.</p><fig id="fig3"  position="float"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption><title> (a) Input audio signal; (b) Audio signal after cancellation of DC drift and normalization; (c) Audio signal after applying the derivative function; (d) Integrated signal; (e) Audio signal with peak points</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x11.png"/></fig><p>In this feature extraction stage, R_loc represents the feature at which the wave has a high positive peak, Q_loc represents the feature at a small signal difference on the negative edge of the audio signal, and S_loc represents the feature [<xref ref-type="bibr" rid="scirp.65861-ref29">29</xref>] at the maximum signal difference at the negative point of the input audio signal. For each stage of the convolution process, a low pass filter and a high pass filter are used to calculate the difference in peak extraction, using the transfer function represented by <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x12.png" xlink:type="simple"/></inline-formula>. This is updated by the threshold value extracted from the input audio signal. Here, left and right specify the left and right positions of the sampled input signal. <xref ref-type="fig" rid="fig3">Figure 3</xref> shows the extraction process of the testing features.</p><p>The feature vector is formed as:</p><p>a) Max (Q_loc),</p><p>b) Max (R_loc),</p><p>c) Max (S_loc),</p><p>d) Length (Q_loc &gt; 0),</p><p>e) Length (R_loc &gt; 0),</p><p>f) Length (S_loc &gt; 0),</p><p>g) Sum (Q_loc &gt; 0),</p><p>h) Sum (R_loc &gt; 0),</p><p>i) Sum (S_loc &gt; 0).</p><p>where,</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x13.png" xlink:type="simple"/></inline-formula>; (3)</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x14.png" xlink:type="simple"/></inline-formula>; (4)</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x15.png" xlink:type="simple"/></inline-formula>; (5)</p><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x16.png" xlink:type="simple"/></inline-formula> is the input audio signal.</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x17.png" xlink:type="simple"/></inline-formula>; (6)</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x18.png" xlink:type="simple"/></inline-formula>; (7)</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x19.png" xlink:type="simple"/></inline-formula>; 
(8)</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x20.png" xlink:type="simple"/></inline-formula>; (9)</p><p><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x21.png" xlink:type="simple"/></inline-formula>; (10)</p><disp-formula id="scirp.65861-formula66"><label>(11)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x22.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula67"><label>(12)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x23.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula68"><label>(13)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x24.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula69"><label>(14)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x25.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula70"><label>(15)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x26.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula71"><label>(16)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x27.png"  xlink:type="simple"/></disp-formula><p>where “N” is the sample size of the input audio, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x28.png" xlink:type="simple"/></inline-formula> is the transfer function of the low pass filter, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x28.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x29.png" xlink:type="simple"/></inline-formula> is the transfer function of the high pass filter and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x28.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x29.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x30.png" xlink:type="simple"/></inline-formula> is the transfer function of convolution.</p><p>In pitch extraction, the objective function is first applied to calculate weights from the input audio signal based on the cosine angle difference of the signal amplitude. The pitch angle variation for each pre-allocated time sample, calculated from the length of the input signal (X<sub>i</sub>), is extracted based on the objective function from [<xref ref-type="bibr" rid="scirp.65861-ref29">29</xref>]. Then, the difference in the limitation of the time sequence between the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x31.png" xlink:type="simple"/></inline-formula> calculation and the extracted pitch angle is checked. The pitch of the signal is estimated by using a time-domain detection method. Several methods are used for signal pitch estimation:</p><p>a) Zero Crossing</p><p>b) Autocorrelation</p><p>c) Maximum Likelihood</p><p>d) Adaptive filter using FFT</p><p>e) Super Resolution pitch detection</p><p>In the proposed method, Maximum Likelihood based pitch extraction is implemented. This is represented as,</p><disp-formula id="scirp.65861-formula72"><label>(17)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x32.png"  xlink:type="simple"/></disp-formula><p>where “t” is the frame size of the audio signal, “t” is the sampling time and “N” is the total size of the audio signal. 
This is updated by using the objective function as</p><disp-formula id="scirp.65861-formula73"><label>(18)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x33.png"  xlink:type="simple"/></disp-formula><p>The pitch frequency in each frame of the audio signal is calculated. The threshold value of the amplitude of the segmented audio signal is calculated by using</p><disp-formula id="scirp.65861-formula74"><label>(19)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x34.png"  xlink:type="simple"/></disp-formula><p>Then, the minimum and maximum peak values of the segmented signal are checked against the threshold value. The peak value is estimated from the positive and negative peak values lying on the left and right sides of the segmented signal. The positive small and large pitches and the negative small and large pitches are obtained from the peak values. <xref ref-type="fig" rid="fig4">Figure 4</xref> shows the flow diagram of the pitch feature extraction process.</p><fig id="fig4"  position="float"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption><title> Flow diagram of the pitch feature extraction process</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x35.png"/></fig><disp-formula id="scirp.65861-formula75"><graphic  xlink:href="http://html.scirp.org/file/10-7600519x36.png"  xlink:type="simple"/></disp-formula><p><xref ref-type="fig" rid="fig5">Figure 5</xref> shows the input waveform and pitch track. The segmented audio result is shown in <xref ref-type="fig" rid="fig6">Figure 6</xref>.</p></sec><sec id="s3_3"><title>3.3. Feature Extraction</title><p>EMFCC-EPNCC is applied for the extraction of features from the audio signal. In several feature analysis techniques, the signal intensity is estimated based on spectrum depth variation only. 
In the proposed work, Mel-frequency and power-normalized cepstral coefficients are combined for speech signal analysis. This method filters out other signals present in the speech data by means of Gammatone frequency integration. With this method, the features of the signal are clearer than with other feature extraction techniques. The audio signal is represented by a set of features.</p><p>Feature extraction is performed based on the EMFCC and EPNCC to return the feature values computed from the audio signal sampled at fs (Hz). In the EMFCC process, a frame size of 20 is chosen from the sample size of the input audio signal. The audio signal is subjected to the windowing process, which divides it into frames so that spectrum analysis can be performed on every frame of the signal. Then, the Discrete Fourier Transform (DFT) is applied to the frames. The Mel frequency warping is applied to the DFT output. The logarithm is applied to the filter bank of the Mel frequency warping output. The inverse DFT is applied to obtain the Mel cepstrum coefficients.</p><p>In the EMFCC-based audio feature extraction, the Mel cepstrum is extracted from the transformation output. The input audio is divided into frames by applying the windowing function at fixed intervals. The distribution function for the window is defined as</p><disp-formula id="scirp.65861-formula76"><label>(20)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x37.png"  xlink:type="simple"/></disp-formula><p>Windowing involves multiplying the time record by a finite-length window with a smoothly varying amplitude. This results in continuous waveforms without sharp transitions. The windowing process minimizes the disruptions at the start and end points of each frame. 
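</p><p>As a minimal sketch of the framing and windowing step described above, the following Python code splits a signal into overlapping frames and multiplies each frame by a window. The Hamming window, the 512-sample frame length and the 256-sample hop are assumptions chosen for illustration; the window actually used is the one defined in Equation (20), and the function name frame_and_window is hypothetical.</p><preformat preformat-type="code">
```python
import numpy as np


def frame_and_window(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames and taper each frame."""
    signal = np.asarray(signal, dtype=float)
    # Number of complete frames that fit (assumes len(signal) >= frame_len).
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i of idx holds the sample indices belonging to frame i.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # Hamming window: an assumption, not necessarily the window of Equation (20).
    window = np.hamming(frame_len)
    return frames * window


# Illustrative input: 1 s of a 440 Hz tone sampled at 22,050 Hz.
fs = 22050
t = np.arange(fs) / fs
frames = frame_and_window(np.sin(2 * np.pi * 440 * t), frame_len=512, hop_len=256)
```
</preformat><p>Each row of the returned array is one tapered frame, ready for the DFT. Because the window falls towards small values at both ends, the discontinuities at the frame boundaries are attenuated, which is what reduces spectral leakage.</p><p>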
The output of the window is given as</p><fig id="fig5"  position="float"><label><xref ref-type="fig" rid="fig5">Figure 5</xref></label><caption><title> Input waveform and pitch track</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x38.png"/></fig><fig id="fig6"  position="float"><label><xref ref-type="fig" rid="fig6">Figure 6</xref></label><caption><title> Segmented audio result</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x39.png"/></fig><disp-formula id="scirp.65861-formula77"><label>(21)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x40.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x41.png" xlink:type="simple"/></inline-formula>. Here, “N” denotes the number of samples in each frame and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x42.png" xlink:type="simple"/></inline-formula> represents the output signal obtained after multiplying the input signal <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x43.png" xlink:type="simple"/></inline-formula> by the window <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x44.png" xlink:type="simple"/></inline-formula>. <xref ref-type="fig" rid="fig7">Figure 7</xref>(a) shows the output window plot of the audio signal and <xref ref-type="fig" rid="fig7">Figure 7</xref>(b) shows the reduction in the spectral leakage effect obtained by applying the window.</p><p>A cepstral feature vector is generated for each frame and the DFT is applied to each frame. Mel frequency warping, represented by the cosine transformation, is applied to the DFT output. The cosine transform is described as</p><disp-formula id="scirp.65861-formula78"><label>(22)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x45.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x46.png" xlink:type="simple"/></inline-formula> and “N” is the sample size of the audio input.</p><p>The cosine transform is used to convert the log Mel cepstrum back into the time domain. The FFT is applied to calculate the coefficients from the log Mel cepstrum. The main advantage of the Mel frequency warping is the uniform placement of the triangular filters on the Mel scale between the lower and upper frequency limits of the Mel-warped spectrum.</p><p>The Mel frequency warping is calculated using the formula</p><disp-formula id="scirp.65861-formula79"><label>(23)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x47.png"  xlink:type="simple"/></disp-formula><p>Here “<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x48.png" xlink:type="simple"/></inline-formula>” is the sampling frequency and “ω” is the warping function. 
To be integrated with the cosine transformation, the Mel-warping function must be normalized to satisfy the specific criterion <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x49.png" xlink:type="simple"/></inline-formula>.</p><disp-formula id="scirp.65861-formula80"><label>(24)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x50.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula81"><label>(25)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x51.png"  xlink:type="simple"/></disp-formula><p>The output of the Mel-frequency warping is shown in <xref ref-type="fig" rid="fig8">Figure 8</xref>. The phase information is omitted and only the amplitude of the audio signal is considered. The logarithmic value of the amplitude is taken. Then, the inverse DFT is applied to extract the Mel cepstrum output as the EMFCC feature of the audio signal. The log filter bank energies and the Mel frequency cepstrum are shown in <xref ref-type="fig" rid="fig9">Figure 9</xref>.</p><p>The EMFCC-EPNCC is applied for extracting the audio features. <xref ref-type="fig" rid="fig10">Figure 10</xref> shows the flow diagram of the EMFCC-EPNCC process.</p><p>2) EMFCC-EPNCC Algorithm</p><p>The EPNCC extraction process involves frequency-to-Mel conversion, Mel-to-frequency conversion and the cosine transform. In the EPNCC-EMFCC method, the frequency-to-Mel conversion is performed to extract the spectral data of the signal based on the peak and pitch variation. 
Mel-to-frequency conversion is performed to filter out other frequency signals in the frequency domain.</p><fig id="fig7"  position="float"><label><xref ref-type="fig" rid="fig7">Figure 7</xref></label><caption><title> (a) Output Window Plot of audio signal; (b) Reduction in the spectral leakage effect by applying window</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x52.png"/></fig><fig id="fig8"  position="float"><label><xref ref-type="fig" rid="fig8">Figure 8</xref></label><caption><title> Output plot of Mel-frequency warping</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x53.png"/></fig><fig id="fig9"  position="float"><label><xref ref-type="fig" rid="fig9">Figure 9</xref></label><caption><title> Log filter bank energies and output Mel frequency cepstrum</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x54.png"/></fig><fig id="fig10"  position="float"><label><xref ref-type="fig" rid="fig10">Figure 10</xref></label><caption><title> Flow diagram of the EMFCC-EPNCC process</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x55.png"/></fig><p>
The window size of the filtered signal is initialized by using the equations</p><disp-formula id="scirp.65861-formula82"><label>(26)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x56.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula83"><label>(27)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x57.png"  xlink:type="simple"/></disp-formula><p>where “<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x58.png" xlink:type="simple"/></inline-formula>” denotes the time division of the window, “<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x59.png" xlink:type="simple"/></inline-formula>” represents the frequency of the audio signal and “<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x60.png" xlink:type="simple"/></inline-formula>” indicates the time samples of the audio signal. 
The frequency of the audio signal is converted to Mel frames by using the formula</p><disp-formula id="scirp.65861-formula84"><label>(28)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x61.png"  xlink:type="simple"/></disp-formula><p>Then, the Mel frames are converted into frequency by using the equation</p><disp-formula id="scirp.65861-formula85"><label>(29)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x62.png"  xlink:type="simple"/></disp-formula><p>The cosine transform is applied by using</p><disp-formula id="scirp.65861-formula86"><label>(30)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x63.png"  xlink:type="simple"/></disp-formula><p>The output of the DCT process is shown in <xref ref-type="fig" rid="fig11">Figure 11</xref>. Then, the FFT is applied to decompose the time-domain representation of the audio signal into its frequency-domain representation. This is done by using the exponential of the radian value for each sampling difference in the input signal at the K<sup>th</sup> iteration.</p><disp-formula id="scirp.65861-formula87"><label>(31)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x64.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x65.png" xlink:type="simple"/></inline-formula>, “N” is the sample size of the audio signal and “<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x66.png" xlink:type="simple"/></inline-formula>” is the input signal. 
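</p><p>The frequency-to-Mel and Mel-to-frequency conversions of Equations (28) and (29) can be sketched as follows. The constants 2595 and 700 come from a commonly used form of the Mel scale and are an assumption here, since they may differ from the constants in the equations above. The function names, the 20 filters and the 0 Hz to 11,025 Hz range are likewise illustrative choices.</p><preformat preformat-type="code">
```python
import numpy as np


def hz_to_mel(f_hz):
    """Frequency in Hz to Mel (common 2595 * log10(1 + f / 700) form, assumed)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel: Mel value back to frequency in Hz."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_center_frequencies(f_low, f_high, n_filters):
    """Centre frequencies (Hz) of n_filters filters spaced uniformly on the Mel scale."""
    # n_filters + 2 equally spaced Mel points; the two end points are the band edges.
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    return mel_to_hz(mel_points)[1:-1]


# Illustrative values: 20 filters up to the Nyquist frequency of 22,050 Hz audio.
centers = mel_center_frequencies(0.0, 11025.0, 20)
```
</preformat><p>Spacing the points uniformly in Mel and converting them back to Hz gives centre frequencies that are dense at low frequencies and sparse at high frequencies, which is the uniform placement of triangular filters on the Mel scale referred to earlier.</p><p>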
The magnitude of the spectrum is extracted by applying the FFT to the filtered signal and multiplying the result by the Mel frames</p><disp-formula id="scirp.65861-formula88"><label>(32)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x67.png"  xlink:type="simple"/></disp-formula><p><xref ref-type="fig" rid="fig12">Figure 12</xref> shows the single-sided amplitude spectrum of y(t), the output signal of the FFT process.</p><fig id="fig11"  position="float"><label><xref ref-type="fig" rid="fig11">Figure 11</xref></label><caption><title> DCT output plot</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x68.png"/></fig><fig id="fig12"  position="float"><label><xref ref-type="fig" rid="fig12">Figure 12</xref></label><caption><title> Single-sided amplitude spectrum of y(t)</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x69.png"/></fig><p>The filter coefficient “FL” is extracted by using the equation</p><disp-formula id="scirp.65861-formula89"><label>(33)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x70.png"  xlink:type="simple"/></disp-formula><p>The audio feature output is obtained from the product of the cosine transformation value, the logarithmic value of the magnitude and the filter coefficient.</p><disp-formula id="scirp.65861-formula90"><label>(34)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x71.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula91"><graphic  xlink:href="http://html.scirp.org/file/10-7600519x72.png"  xlink:type="simple"/></disp-formula><p>The index difference plot is shown in <xref ref-type="fig" rid="fig13">Figure 13</xref>.</p></sec><sec id="s3_4"><title>3.4. 
Classification</title><p>The audio features, taken from the selected feature vectors obtained from the segmentation based on peak and pitch estimation, are passed to the classification process. Classification of the audio signal is performed using the PNN classifier, based on the testing features. Multi-label feature analysis is presented in the proposed work. Hence, a multi-class classifier model is implemented. Compared with other types of classifier, the PNN provides a better prediction of the class label by using probability estimation based on the neural network function.</p><p>Neural networks are frequently used for the classification of signals. The PNN learns more quickly than other neural network models. Hence, it is used for the classification of the audio signal. The Probability Density Function (PDF) for a single sample is calculated as the output of the neuron of the pattern layer. This is given as</p><fig id="fig13"  position="float"><label><xref ref-type="fig" rid="fig13">Figure 13</xref></label><caption><title> Index difference plot</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x73.png"/></fig><disp-formula id="scirp.65861-formula92"><label>(35)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x74.png"  xlink:type="simple"/></disp-formula><p>where “Y” denotes the unknown input vector, “<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x75.png" xlink:type="simple"/></inline-formula>” represents the j<sup>th</sup> sample input vector, “k” denotes the smoothing parameter and “d” denotes the dimension of the input vector. 
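</p><p>The pattern-layer kernel above, together with the per-class averaging and arg-max decision of a standard PNN, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: it assumes a Gaussian kernel with a single smoothing parameter (named sigma in the code, corresponding to “k” above), equal class priors and equal misclassification losses. The function name pnn_predict and the two-dimensional toy features are hypothetical.</p><preformat preformat-type="code">
```python
import numpy as np


def pnn_predict(train_x, train_y, query, sigma=0.5):
    """Classify one query vector with a Gaussian-kernel PNN (illustrative sketch)."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    query = np.asarray(query, dtype=float)
    d = train_x.shape[1]
    # Pattern layer: one Gaussian kernel response per training sample.
    sq_dist = np.sum((train_x - query) ** 2, axis=1)
    norm = (2.0 * np.pi) ** (d / 2.0) * sigma ** d
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm
    # Summation layer: average kernel response of the samples in each class.
    classes = np.unique(train_y)
    scores = np.array([kernel[train_y == c].mean() for c in classes])
    # Decision layer: class with the largest estimated density
    # (equal priors and equal losses assumed).
    return classes[np.argmax(scores)], scores


# Hypothetical toy features, two training samples per class.
train_x = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
train_y = ["musical", "musical", "non-musical", "non-musical"]
label, scores = pnn_predict(train_x, train_y, [0.2, 0.1])
```
</preformat><p>Because every training sample is simply stored as a kernel centre, adding or removing a sample changes only the corresponding row of the training matrix and requires no iterative retraining.</p><p>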
The output of the neuron of the summation layer is calculated as the PDF for a single pattern by using the equation</p><disp-formula id="scirp.65861-formula93"><label>(36)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x76.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x77.png" xlink:type="simple"/></inline-formula> is the total number of samples in the k<sup>th</sup> population. The decision layer classifies the pattern according to the Bayes decision rule, based on the outputs of the neurons of the summation layer. This is done when the a priori probabilities of the classes are similar and the losses associated with an incorrect decision are similar for each class.</p><disp-formula id="scirp.65861-formula94"><label>(37)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x78.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x79.png" xlink:type="simple"/></inline-formula>. “E(Y)” denotes the estimated class of the pattern and “N” is the total number of classes in the training samples. The performance of the PNN classifier is more reliable than that of the Back Propagation Neural Network (BPN). The PNN classifier converges faster as the size of the training set increases. Training samples can be added and removed without the need for extensive retraining. By using the PNN classifier, the gradient vector of the proposed kernel function is implemented for the selected optimal testing features. 
This is described as</p><disp-formula id="scirp.65861-formula95"><label>(38)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x80.png"  xlink:type="simple"/></disp-formula><p>where “e” is the feature vector of the input signal and “<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/10-7600519x81.png" xlink:type="simple"/></inline-formula>” represents the feature matrix of the dataset. The category of the audio signal is determined based on the classification result. Initially, it is detected whether the given testing feature is musical or non-musical. If it is detected as a musical signal, the label is classified as Piano, Guitar, etc. Hence, the presence of silence and irrelevant frequency details in the audio signal is eliminated.</p></sec></sec><sec id="s4"><title>4. Performance Analysis</title><p>This section presents the performance evaluation of the proposed approach and its comparative analysis against existing techniques. The datasets obtained from Ffuhrmann [<xref ref-type="bibr" rid="scirp.65861-ref30">30</xref>] and Marsyasweb [<xref ref-type="bibr" rid="scirp.65861-ref31">31</xref>] are used for the performance evaluation of the proposed approach. For the musical classes, Ffuhrmann includes 11 classes of pitched instruments, namely cello (cel), clarinet (cla), flute (flu), acoustic guitar (gac), electric guitar (gel), organ (org), piano (pia), saxophone (sax), trumpet (tru), violin (vio) and band (ban), with one hundred files, and Marsyasweb includes 5 classes with 64 files. The data for the 11 pitched instruments is obtained from pre-selected music tracks, with the objective of extracting excerpts containing a continuous presence of a single predominant target instrument. The dataset contains a total of 220 pieces of Western music covering various musical genres and instrumentations. For the non-musical classes, Marsyasweb includes 64 files. 
The dataset consists of 120 audio tracks, each 30 seconds long. Each has 60 examples. The tracks are all 22,050 Hz mono 16-bit audio files in .wav format. The performance of the proposed approach is evaluated using the following metrics:</p><p>・ Precision</p><p>・ NMI</p><p>・ F-Score</p><p>・ Entropy</p><p>The comparison of the Precision, NMI, F-score and Entropy of the proposed approach and existing features is shown in <xref ref-type="table" rid="table2">Table 2</xref>. The proposed approach for segmentation, feature extraction and classification of audio signals (ASFEC) is compared with the acoustic features of spectral clustering and rotation, the functional Magnetic Resonance Imaging (fMRI)-measured features of Support Vector Regression (SVR) and the Improved Twin Gaussian Process (ITGP), and the integrated features of kernel addition, kernel product, Canonical Correlation Analysis (CCA) and ITGP. The precision, NMI, F-score and Entropy of the proposed approach are found to be higher than those of the acoustic and integrated features [<xref ref-type="bibr" rid="scirp.65861-ref32">32</xref>]. The comparison shows that the proposed approach outperforms the existing acoustic and integrated features.</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Comparative Analysis of Precision, NMI, F-score and Entropy of the proposed approach and existing features</title></caption><table><thead><tr><th align="center" valign="middle" >Methods</th><th align="center" valign="middle" >Precision</th><th align="center" valign="middle" >NMI</th><th align="center" valign="middle" >F-Score</th><th align="center" valign="middle" >Entropy</th></tr></thead><tbody><tr><td align="center" valign="middle" >Acoustic features (spectral clustering)</td><td align="center" valign="middle" >0.373</td><td align="center" valign="middle" >0.129</td><td align="center" valign="middle" >0.502</td><td align="center" valign="middle" >1.405</td></tr><tr><td align="center" valign="middle" >Acoustic features (spectral rotation)</td><td align="center" valign="middle" >0.384</td><td align="center" valign="middle" >0.127</td><td align="center" valign="middle" >0.485</td><td align="center" valign="middle" >1.389</td></tr><tr><td align="center" valign="middle" >fMRI-measured features of SVR</td><td align="center" valign="middle" >0.406</td><td align="center" valign="middle" >0.213</td><td align="center" valign="middle" >0.539</td><td align="center" valign="middle" >1.304</td></tr><tr><td align="center" valign="middle" >fMRI-measured features using high-level features</td><td align="center" valign="middle" >0.423</td><td align="center" valign="middle" >0.21</td><td align="center" valign="middle" >0.543</td><td align="center" valign="middle" >1.262</td></tr><tr><td align="center" valign="middle" >fMRI-measured features of ITGP</td><td align="center" valign="middle" >0.485</td><td align="center" valign="middle" >0.294</td><td align="center" valign="middle" >0.585</td><td align="center" valign="middle" >1.155</td></tr><tr><td align="center" valign="middle" >Integrated features of kernel addition</td><td align="center" valign="middle" >0.52</td><td align="center" valign="middle" >0.323</td><td align="center" valign="middle" >0.61</td><td align="center" valign="middle" >1.083</td></tr><tr><td align="center" valign="middle" >Integrated features of kernel product</td><td align="center" valign="middle" >0.499</td><td align="center" valign="middle" >0.324</td><td align="center" valign="middle" >0.583</td><td align="center" valign="middle" >1.117</td></tr><tr><td align="center" valign="middle" >Integrated features of CCA</td><td align="center" valign="middle" >0.51</td><td align="center" valign="middle" >0.317</td><td align="center" valign="middle" >0.599</td><td align="center" valign="middle" >1.1</td></tr><tr><td align="center" valign="middle" >Integrated features of ITGP</td><td align="center" valign="middle" >0.541</td><td align="center" valign="middle" >0.337</td><td align="center" valign="middle" >0.623</td><td align="center" valign="middle" >1.079</td></tr><tr><td align="center" valign="middle" >Proposed ASFEC approach</td><td align="center" valign="middle" >0.718</td><td align="center" valign="middle" >0.412</td><td align="center" valign="middle" >0.7195</td><td align="center" valign="middle" >1.6875</td></tr></tbody></table></table-wrap>
<p>Acoustic features: Only the acoustic features are used in this experiment for the audio signal clustering. Spectral rotation is used to replace K-means in the spectral algorithm. The performance of the spectral rotation has proven to be better than the spectral clustering approach.</p><p>fMRI-measured features of SVR: First, the SVR model is trained on the fMRI-measured features and acoustic features of the audio selections and is then applied to predict the fMRI features of the audio samples.</p><p>fMRI-measured features of ITGP: The ITGP model is trained on the fMRI-measured features and acoustic features of the audio selections and is then applied to predict the fMRI features of the audio samples.</p><p>Integrated features of kernel addition: The kernel addition method is applied to the fMRI-measured features and acoustic features of the testing audio samples. First, the kernels are integrated by adding them, and the eigenvectors of the Laplacian of the integrated kernel are computed. Then, a matrix is generated by using the eigenvectors as columns. Finally, each row of this matrix is considered as an integrated feature.</p><p>Integrated features of kernel product: The corresponding elements of the kernels of different views are multiplied with each other to form the integrated kernel.</p><p>Integrated features of CCA: The correlated features are extracted from the fMRI-measured features and acoustic features.</p><sec id="s4_1"><title>4.1. 
Precision</title><p>Precision is defined as the ratio of the number of correct results to the number of predicted results.</p><disp-formula id="scirp.65861-formula96"><label>(39)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x82.png"  xlink:type="simple"/></disp-formula></sec><sec id="s4_2"><title>4.2. NMI</title><p>Normalized Mutual Information (NMI) is a widely used measure for evaluating the level of agreement between two affinity matrices formed by the predicted labels and the true labels of the audio samples.</p><disp-formula id="scirp.65861-formula97"><label>(40)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x83.png"  xlink:type="simple"/></disp-formula><p>where p(x) and p(y) are the marginal probabilities and p(x,y) is the joint probability.</p></sec><sec id="s4_3"><title>4.3. F-Score</title><p>The F-score is taken as a weighted average of the precision and recall values. Recall is defined as the ratio of the number of correct results to the number of results that should have been returned. Higher values of Precision, NMI and F-score indicate improved efficiency in the segmentation and classification of the audio signal.</p><disp-formula id="scirp.65861-formula98"><label>(41)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x84.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.65861-formula99"><label>(42)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x85.png"  xlink:type="simple"/></disp-formula></sec><sec id="s4_4"><title>4.4. Entropy</title><p>The entropy is the sum of the individual entropies for the classification process weighted according to the classification quality. 
Higher entropy values indicate better classification results.</p><disp-formula id="scirp.65861-formula100"><label>(43)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x86.png"  xlink:type="simple"/></disp-formula><p>where “H” is the entropy of the discrete random variable “X”, “P” is the probability of X and “I” is the information content of “X”. I(X) is a random variable.</p><p><xref ref-type="fig" rid="fig14">Figure 14</xref> shows the comparative analysis of Precision, NMI, F-Score and Entropy for the proposed approach and the existing acoustic, fMRI and integrated features with respect to the prediction rate. The precision, NMI, F-Score and entropy of the proposed approach with respect to the prediction rate are higher than those of the existing fMRI and integrated features.</p></sec><sec id="s4_5"><title>4.5. ROC Plot for Classification</title><p>The Receiver Operating Characteristic (ROC) curve is a graphical plot that shows the performance of the PNN classifier for the classification of the audio signal. The true positive rate is plotted against the false positive rate at various threshold settings. The ROC curve is generated by plotting the cumulative distribution function of the true detection probability versus the false-alarm probability. Each point on the ROC plot represents a pair of sensitivity/specificity values corresponding to a specific decision threshold value. The closer the ROC plot lies to the upper left corner, the higher the accuracy of the classification process. <xref ref-type="fig" rid="fig15">Figure 15</xref> shows the ROC curve for classification. 
From the figure, it is evident that the proposed approach achieves a high classification result.</p><fig id="fig14"  position="float"><label><xref ref-type="fig" rid="fig14">Figure 14</xref></label><caption><title> Comparative analysis of precision, NMI, F-score and entropy for the existing features and proposed approach</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x87.png"/></fig><fig id="fig15"  position="float"><label><xref ref-type="fig" rid="fig15">Figure 15</xref></label><caption><title> ROC for classification</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x88.png"/></fig></sec><sec id="s4_6"><title>4.6. FRR Graph</title><p>The False Rejection Rate (FRR) is defined as the ratio of the number of false rejections to the number of classified signals.</p><disp-formula id="scirp.65861-formula101"><label>(44)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x89.png"  xlink:type="simple"/></disp-formula><p><xref ref-type="fig" rid="fig16">Figure 16</xref> shows the relationship between the FRR and the number of classes. The FRR reduces with the increase in the number of classes. Reduction in the rejection rate of the incorrectly predicted sample indicates the effective classification of the audio signal.</p></sec><sec id="s4_7"><title>4.7. FAR Graph</title><p>The False Acceptance Rate (FAR) is typically defined as the ratio of the number of false acceptances to the number of classified signals.</p><disp-formula id="scirp.65861-formula102"><label>(45)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x90.png"  xlink:type="simple"/></disp-formula><p><xref ref-type="fig" rid="fig17">Figure 17</xref> compares the FAR with the number of classes. The FAR seems to increase with the increase in the number of classes. 
Hence, the incorrect classification of audio signals is prevented.</p></sec><sec id="s4_8"><title>4.8. GAR Graph</title><p>The Genuine Acceptance Rate (GAR) is the fraction of the genuine scores exceeding the threshold value. The higher the GAR, the higher the classification efficiency. <xref ref-type="fig" rid="fig18">Figure 18</xref> shows the GAR with respect to the number of classes.</p><disp-formula id="scirp.65861-formula103"><label>(46)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x91.png"  xlink:type="simple"/></disp-formula></sec><sec id="s4_9"><title>4.9. Sensitivity</title><p>Sensitivity measures the proportion of the actual members of a class that are correctly identified. It is defined as the ratio of the positive instances predicted correctly by the PNN classifier to all actual positive instances.</p><disp-formula id="scirp.65861-formula104"><label>(47)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x92.png"  xlink:type="simple"/></disp-formula><p>Here, True Positive (TP) is the number of music signals that are correctly classified as music and False Negative (FN) is the number of music signals that are incorrectly classified as non-musical.</p><fig id="fig16"  position="float"><label><xref ref-type="fig" rid="fig16">Figure 16</xref></label><caption><title>FRR graph</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x93.png"/></fig><fig id="fig17"  position="float"><label><xref ref-type="fig" rid="fig17">Figure 17</xref></label><caption><title>FAR graph</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x94.png"/></fig><fig id="fig18"  position="float"><label><xref ref-type="fig" rid="fig18">Figure 18</xref></label><caption><title>GAR graph</title></caption><graphic mimetype="image"   position="float"  
xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x95.png"/></fig></sec><sec id="s4_10"><title>4.10. Specificity</title><p>Specificity is also referred to as the true negative rate. It is defined as the ratio of the negative instances predicted correctly by the PNN classifier to all actual negative instances.</p><disp-formula id="scirp.65861-formula105"><label>(48)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x96.png"  xlink:type="simple"/></disp-formula><p>Here, True Negative (TN) is the number of non-musical signals that are correctly classified as non-musical and False Positive (FP) is the number of non-musical signals that are incorrectly classified as music.</p></sec><sec id="s4_11"><title>4.11. Accuracy</title><p>Accuracy is defined as the ratio of the number of correctly classified results to the total number of classified results. The performance of the classifier is determined from the number of samples that it predicts correctly and incorrectly.</p><disp-formula id="scirp.65861-formula106"><label>(49)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/10-7600519x97.png"  xlink:type="simple"/></disp-formula><p>The comparative analysis of the sensitivity, specificity and accuracy with respect to the prediction rate is shown in <xref ref-type="fig" rid="fig19">Figure 19</xref>. The figure shows that the PNN-based classification approach achieves high sensitivity, specificity and accuracy.</p><p><xref ref-type="table" rid="table3">Table 3</xref> shows the average GAR, FAR, FRR, accuracy and error rate values of the proposed approach. The proposed approach achieves a high average GAR (99.67%) and accuracy (96.50%) and a low FAR (0.33%), FRR (0.33%) and error rate (3.50%). 
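</p><p>For readers who want to reproduce measures of this kind, the following Python sketch computes sensitivity, specificity, accuracy and error rate from confusion counts. It uses the standard textbook definitions, which may differ in detail from Equations (47)-(49); the class name ConfusionCounts, the choice of music as the positive class and the counts themselves are assumptions made for illustration and are not taken from the experiments.</p><preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a two-class problem, with music assumed to be the positive class."""
    tp: int  # music signals classified as music
    tn: int  # non-musical signals classified as non-musical
    fp: int  # non-musical signals classified as music
    fn: int  # music signals classified as non-musical

    @property
    def total(self) -> int:
        return self.tp + self.tn + self.fp + self.fn


def sensitivity(c: ConfusionCounts) -> float:
    """True positive rate: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """True negative rate: TN / (TN + FP)."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all classified signals that were classified correctly."""
    return (c.tp + c.tn) / c.total


def error_rate(c: ConfusionCounts) -> float:
    """Share of all classified signals that were misclassified."""
    return (c.fp + c.fn) / c.total


# Invented counts, not taken from the experiments reported here.
toy = ConfusionCounts(tp=45, tn=40, fp=10, fn=5)
toy_summary = {
    "sensitivity": sensitivity(toy),
    "specificity": specificity(toy),
    "accuracy": accuracy(toy),
    "error_rate": error_rate(toy),
}
```
</preformat><p>By construction the accuracy and the error rate in this sketch sum to one, which matches the relationship between the average accuracy and the average error rate reported in Table 3.</p><p>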
These results indicate that the segmentation and classification efficiency of the proposed approach is improved.</p><p><xref ref-type="fig" rid="fig20">Figure 20</xref> shows the comparative analysis of the classification rate of the musical data for five different classes. The correct rate of the proposed approach for the five classes of musical data is higher than the error rate.</p><p><xref ref-type="fig" rid="fig21">Figure 21</xref> shows the comparative analysis of the classification rate of the musical and non-musical data. The correct rate of the proposed approach for the musical and non-musical data is higher than the error rate. This implies that the PNN classification approach classifies musical and non-musical signals efficiently. <xref ref-type="table" rid="table4">Table 4</xref> shows the overall accuracy analysis for Online Dictionary Learning (ODL), K-means, Exemplar [<xref ref-type="bibr" rid="scirp.65861-ref33">33</xref>] and the proposed EMFCC-EPNCC with PNN. <xref ref-type="fig" rid="fig22">Figure 22</xref> shows the overall accuracy graph for the GTZAN dataset [<xref ref-type="bibr" rid="scirp.65861-ref34">34</xref>] and the Music Technology Group (MTG) dataset [<xref ref-type="bibr" rid="scirp.65861-ref35">35</xref>]. 
The proposed EMFCC-EPNCC with PNN classifier achieves the highest accuracy on both datasets: 96.2% on the GTZAN dataset and 97.3% on the MTG dataset.</p><p>GTZAN dataset: It is composed of 1000 30-second clips covering 10 genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock), with 100 clips per genre.</p><fig id="fig19"  position="float"><label><xref ref-type="fig" rid="fig19">Figure 19</xref></label><caption><title>Comparative analysis of sensitivity, specificity and accuracy</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x98.png"/></fig><fig id="fig20"  position="float"><label><xref ref-type="fig" rid="fig20">Figure 20</xref></label><caption><title>Comparative analysis of classification rate of musical data</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x99.png"/></fig><fig id="fig21"  position="float"><label><xref ref-type="fig" rid="fig21">Figure 21</xref></label><caption><title>Comparative analysis of classification rate of musical and non-musical data</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x100.png"/></fig><fig id="fig22"  position="float"><label><xref ref-type="fig" rid="fig22">Figure 22</xref></label><caption><title>Overall accuracy graph for ODL, K-means, Exemplar and proposed EMFCC-EPNCC with PNN</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/10-7600519x101.png"/></fig>
<table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title>Average GAR, FAR, FRR, accuracy and error rate values of the proposed approach</title></caption><table><thead><tr><th align="center" valign="middle" >Parameters</th><th align="center" valign="middle" >Value</th></tr></thead><tbody><tr><td align="center" valign="middle" >Average GAR</td><td align="center" valign="middle" >99.67%</td></tr><tr><td align="center" valign="middle" >Average FAR</td><td align="center" valign="middle" >0.33%</td></tr><tr><td align="center" valign="middle" >Average FRR</td><td align="center" valign="middle" >0.33%</td></tr><tr><td align="center" valign="middle" >Average Accuracy</td><td align="center" valign="middle" >96.50%</td></tr><tr><td align="center" valign="middle" >Average Error Rate</td><td align="center" valign="middle" >3.50%</td></tr></tbody></table></table-wrap><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title>Overall accuracy analysis for ODL, K-means, Exemplar and proposed EMFCC-EPNCC with PNN</title></caption><table><thead><tr><th align="center" valign="middle"  colspan="5"  >Overall Accuracy (%)</th></tr><tr><th align="center" valign="middle" >Dataset</th><th align="center" valign="middle" >ODL</th><th align="center" valign="middle" >K-means</th><th align="center" valign="middle" >Exemplar</th><th align="center" valign="middle" >EMFCC-EPNCC with PNN</th></tr></thead><tbody><tr><td align="center" valign="middle" >GTZAN Dataset</td><td align="center" valign="middle" >88</td><td align="center" valign="middle" >95.4</td><td align="center" valign="middle" >95.7</td><td align="center" valign="middle" >96.2</td></tr><tr><td align="center" valign="middle" >MTG Dataset</td><td align="center" valign="middle" >87.9</td><td align="center" valign="middle" >91.8</td><td align="center" valign="middle" >94.5</td><td align="center" valign="middle" >97.3</td></tr></tbody></table></table-wrap><p>MTG dataset: It consists of approximately 2500 excerpts of Western music labeled into 11 classes of pitched instruments (cello, clarinet, flute, acoustic guitar, electric guitar, Hammond organ, piano, saxophone, trumpet, violin and singing voice) and two classes of 
drums and no-drums. The class labels are applied to the predominant instrument over a 3-second snippet of polyphonic music.</p></sec></sec><sec id="s5"><title>5. Conclusion and Future Work</title><p>This section concludes the paper and outlines future work. An efficient approach for segmentation, feature extraction and classification of audio signals is presented in this paper. Audio segmentation is performed by extracting the signal amplitude between the lengths of sample time. EMFCC is then applied to the segmented output to extract the testing features for the classification process, in combination with the peak estimated signal features. This yields 41 feature vectors for the audio signal. A PNN classifier is used to classify the audio signal, and the category of the given audio input is determined from the classification result. The audio signal is classified as a musical or non-musical signal based on the testing features. If it is detected as a musical signal, it is further labeled by instrument, for example Piano or Guitar.</p><p>The proposed approach achieves better performance in terms of precision, NMI, F-score and entropy. The PNN classifier achieves high GAR, sensitivity, specificity and accuracy together with low average FRR and FAR. In future work, the audio signal will be segmented from the given input and the various frequencies present in a single audio input will be separated. The separated frequencies will then be retrieved by classifying the features of the segmented signal frequencies.</p></sec><sec id="s6"><title>Cite this paper</title><p>Muthumari Arumugam, Mala Kaliappan (2016) An Efficient Approach for Segmentation, Feature Extraction and Classification of Audio Signals. Circuits and Systems, 7, 255-279. 
doi: 10.4236/cs.2016.74024</p></sec></body><back><ref-list><title>References</title><ref id="scirp.65861-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Castán, D., Tavarez, D., Lopez-Otero, P., Franco-Pedroso, J., Delgado, H., Navas, E., et al. (2015) Albayzín-2014 evaluation: Audio segmentation and classification in broadcast news domains. EURASIP Journal on Audio, Speech, and Music Processing, 2015, 33. http://dx.doi.org/10.1186/s13636-015-0076-3</mixed-citation></ref><ref id="scirp.65861-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Kubala, F., Jin, H., Matsoukas, S., Nguyen, L., Schwartz, R. and Makhoul, J. (1997) The 1996 BBN Byblos HUB-4 Transcription System. Proceedings of the 1997 DARPA Speech Recognition Workshop, Chantilly, VA, 2-5 February 1997, 90-93.</mixed-citation></ref><ref id="scirp.65861-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Bakis, R., Chen, S., Gopalakrishnan, P., Gopinath, R., Maes, S., Polymenakos, L. and Franz, M. (1997) Transcription of Broadcast News Shows with the IBM Large Vocabulary Speech Recognition System. Proceedings of the Speech Recognition Workshop, Chantilly, February 1997, 67-72.</mixed-citation></ref><ref id="scirp.65861-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Beigi, H.S. and Maes, S. (1998) Speaker, Channel and Environment Change Detection. Proceedings of the World Congress on Automation, Anchorage, AK, 18 May 1998, 18-22.</mixed-citation></ref><ref id="scirp.65861-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Siegler, M.A., Jain, U., Raj, B. and Stern, R.M. (1997) Automatic Segmentation, Classification and Clustering of Broadcast News Audio. 
Proceedings of DARPA Speech Recognition Workshop, Chantilly, VA, 2-5 February 1997, 97-99.</mixed-citation></ref><ref id="scirp.65861-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, X., Su, Z., Lin, P., He, Q. and Yang, J. (2014) An Audio Feature Extraction Scheme Based on Spectral Decomposition. International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, 7-9 July 2014, 730-733.</mixed-citation></ref><ref id="scirp.65861-ref7"><label>7</label><mixed-citation publication-type="book" xlink:type="simple">Patil, H.A., Madhavi, M.C., Jain, R. and Jain, A.K. (2012) Combining Evidence from Temporal and Spectral Features for Person Recognition Using Humming. In: Kundu, M.K., Mitra, S., Mazumdar, D. and Pal, S.K., Eds., Perception and Machine Intelligence, Springer, Berlin Heidelberg, 321-328. http://dx.doi.org/10.1007/978-3-642-27387-2_40</mixed-citation></ref><ref id="scirp.65861-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Bhalke, D., Rao, C. and Bormane, D.S. (2014) Musical Instrument Classification Using Higher Order Spectra. International Conference on Signal Processing and Integrated Networks (SPIN), Noida, 20-21 February 2014, 40-45. http://dx.doi.org/10.1109/spin.2014.6776918</mixed-citation></ref><ref id="scirp.65861-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Haque, M.A. and Kim, J.-M. (2013) An Enhanced Fuzzy C-Means Algorithm for Audio Segmentation and Classification. Multimedia Tools and Applications, 63, 485-500. http://dx.doi.org/10.1007/s11042-011-0921-z</mixed-citation></ref><ref id="scirp.65861-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Ludeña-Choez, J. and Gallardo-Antolín, A. (2015) Feature Extraction Based on the High-Pass Filtering of Audio Signals for Acoustic Event Classification. Computer Speech &amp; Language, 30, 32-42. 
http://dx.doi.org/10.1016/j.csl.2014.04.001</mixed-citation></ref><ref id="scirp.65861-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Lefèvre, S. and Vincent, N. (2011) A Two Level Strategy for Audio Segmentation. Digital Signal Processing, 21, 270-277. http://dx.doi.org/10.1016/j.dsp.2010.07.003</mixed-citation></ref><ref id="scirp.65861-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Evangelista, T.L., Priolli, T.M., Silla, C.N., Angelico, B. and Kaestner, C. (2014) Automatic Segmentation of Audio Signals for Bird Species Identification. IEEE International Symposium on Multimedia (ISM), Taichung, 10-12 December 2014, 223-228. http://dx.doi.org/10.1109/ism.2014.46</mixed-citation></ref><ref id="scirp.65861-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Dhanalakshmi, P., Palanivel, S. and Ramalingam, V. (2011) Classification of Audio Signals Using AANN and GMM. Applied Soft Computing, 11, 716-723. http://dx.doi.org/10.1016/j.asoc.2009.12.033</mixed-citation></ref><ref id="scirp.65861-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Haque, M.A. and Kim, J.-M. (2013) An Analysis of Content-Based Classification of Audio Signals Using a Fuzzy C-Means Algorithm. Multimedia Tools and Applications, 63, 77-92. http://dx.doi.org/10.1007/s11042-012-1019-y</mixed-citation></ref><ref id="scirp.65861-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Dhanalakshmi, P., Palanivel, S. and Ramalingam, V. (2011) Pattern Classification Models for Classifying and Indexing Audio Signals. Engineering Applications of Artificial Intelligence, 24, 350-357. http://dx.doi.org/10.1016/j.engappai.2010.10.011</mixed-citation></ref><ref id="scirp.65861-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Gergen, S., Nagathil, A. and Martin, R. 
(2015) Classification of Reverberant Audio Signals Using Clustered Ad Hoc Distributed Microphones. Signal Processing, 107, 21-32. http://dx.doi.org/10.1016/j.sigpro.2014.04.034</mixed-citation></ref><ref id="scirp.65861-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Bhat, A.S., Amith, V., Prasad, N.S. and Mohan, D.M. (2014) An Efficient Classification Algorithm for Music Mood Detection in Western and Hindi Music Using Audio Feature Extraction. 5th International Conference on Signal and Image Processing (ICSIP), Jeju Island, 8-10 January 2014, 359-364. http://dx.doi.org/10.1109/icsip.2014.63</mixed-citation></ref><ref id="scirp.65861-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Gergen, S. and Martin, R. (2014) Linear Combining of Audio Features for Signal Classification in Ad-Hoc Microphone Arrays. 11 ITG Symposium; Proceedings of Speech Communication, Erlangen, 24-26 September 2014, 1-4.</mixed-citation></ref><ref id="scirp.65861-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Pesek, M., Leonardis, A. and Marolt, M. (2014) Boosting Audio Chord Estimation Using Multiple Classifiers. International Conference on Systems, Signals and Image Processing (IWSSIP), Dubrovnik, 12-15 May 2014, 107-110.</mixed-citation></ref><ref id="scirp.65861-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Srinivasa Murthy, Y. and Koolagudi, S.G. (2015) Classification of Vocal and Non-Vocal Regions from Audio Songs Using Spectral Features and Pitch Variations. IEEE 28th Canadian Conference on Electrical and Computer Engineering (CCECE), Halifax, 3-6 May 2015, 1271-1276. http://dx.doi.org/10.1109/ccece.2015.7129461</mixed-citation></ref><ref id="scirp.65861-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Koolagudi, S.G. and Krothapalli, S.R. 
(2012) Emotion Recognition from Speech Using Sub-Syllabic and Pitch Synchronous Spectral Features. International Journal of Speech Technology, 15, 495-511. http://dx.doi.org/10.1007/s10772-012-9150-8</mixed-citation></ref><ref id="scirp.65861-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Geiger, J.T., Schuller, B. and Rigoll, G. (2013) Large-Scale Audio Feature Extraction and SVM for Acoustic Scene Classification. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, 20-23 October 2013, 1-4. http://dx.doi.org/10.1109/waspaa.2013.6701857</mixed-citation></ref><ref id="scirp.65861-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Oh, S.Y. and Chung, K.-Y. (2014) Target Speech Feature Extraction Using Non-Parametric Correlation Coefficient. Cluster Computing, 17, 893-899. http://dx.doi.org/10.1007/s10586-013-0284-5</mixed-citation></ref><ref id="scirp.65861-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">Gajšek, R., Mihelič, F. and Dobrišek, S. (2013) Speaker State Recognition Using an HMM-Based Feature Extraction Method. Computer Speech &amp; Language, 27, 135-150. http://dx.doi.org/10.1016/j.csl.2012.01.007</mixed-citation></ref><ref id="scirp.65861-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Anguera, X. (2012) Speaker Independent Discriminant Feature Extraction for Acoustic Pattern-Matching. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 25-30 March 2012, 485-488. http://dx.doi.org/10.1109/icassp.2012.6287922</mixed-citation></ref><ref id="scirp.65861-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Salamon, J., Rocha, B. and Gómez, E. (2012) Musical Genre Classification Using Melody Features Extracted from Polyphonic Music Signals. 
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 25-30 March 2012, 81-84. http://dx.doi.org/10.1109/icassp.2012.6287822</mixed-citation></ref><ref id="scirp.65861-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Alam, M.J., Kinnunen, T., Kenny, P., Ouellet, P. and O’Shaughnessy, D. (2013) Multitaper MFCC and PLP Features for Speaker Verification Using i-Vectors. Speech Communication, 55, 237-251. http://dx.doi.org/10.1016/j.specom.2012.08.007</mixed-citation></ref><ref id="scirp.65861-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">Muthumari, A. and Mala, K. (2015) Computerized Methods for Audio Segmentation and Classification: Survey. International Journal of Applied Engineering Research, 10, 26857-26870.</mixed-citation></ref><ref id="scirp.65861-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">Rajeswari, K.C. and Uma Maheswari, P. (2015) Feature Extraction and Analysis of Speech Quality for Tamil Text System using Fast Fourier Transform. Australian Journal of Basic and Applied Sciences, 9, 349-356.</mixed-citation></ref><ref id="scirp.65861-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">(2015) 21 August 2015. http://www.dtic.upf.edu/~ffuhrmann/PhD/data/</mixed-citation></ref><ref id="scirp.65861-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">(2015) 21 August 2015. http://marsyasweb.appspot.com/download/data_sets/</mixed-citation></ref><ref id="scirp.65861-ref32"><label>32</label><mixed-citation publication-type="other" xlink:type="simple">Ji, X., Han, J., Jiang, X., Hu, X., Guo, L., Han, J., et al. (2015) Analysis of Music/Speech via Integration of Audio Content and Functional Brain Response. Information Sciences, 297, 271-282. 
http://dx.doi.org/10.1016/j.ins.2014.11.020</mixed-citation></ref><ref id="scirp.65861-ref33"><label>33</label><mixed-citation publication-type="other" xlink:type="simple">Su, L., Yeh, C.-C.M., Liu, J.-Y., Wang, J.-C. and Yang, Y.-H. (2014) A Systematic Evaluation of the Bag-of-Frames Representation for Music Information Retrieval. IEEE Transactions on Multimedia, 16, 1188-1200. http://dx.doi.org/10.1109/TMM.2014.2311016</mixed-citation></ref><ref id="scirp.65861-ref34"><label>34</label><mixed-citation publication-type="other" xlink:type="simple">Fu, Z., Lu, G., Ting, K.M. and Zhang, D. (2011) Music Classification via the Bag-of-Features Approach. Pattern Recognition Letters, 32, 1768-1777. http://dx.doi.org/10.1016/j.patrec.2011.06.026</mixed-citation></ref><ref id="scirp.65861-ref35"><label>35</label><mixed-citation publication-type="other" xlink:type="simple">Fuhrmann, F. (2012) Automatic Musical Instrument Recognition from Polyphonic Music Audio Signals. PhD Thesis, Universitat Pompeu Fabra, Barcelona.</mixed-citation></ref></ref-list></back></article>