<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2017.510006</article-id><article-id pub-id-type="publisher-id">JCC-78666</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  HMM-Based Photo-Realistic Talking Face Synthesis Using Facial Expression Parameter Mapping with Deep Neural Networks
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Sato</surname><given-names>Kazuki</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Nose</surname><given-names>Takashi</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Ito</surname><given-names>Akinori</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Department of Communication Engineering, Graduate School of Engineering, Tohoku University, Sendai, Japan</addr-line></aff><author-notes><corresp id="cor1">* E-mail: <email>tnose@m.tohoku.ac.jp</email> (TN)</corresp></author-notes><pub-date pub-type="epub"><day>30</day><month>07</month><year>2017</year></pub-date><volume>05</volume><issue>10</issue><fpage>50</fpage><lpage>65</lpage><history><date date-type="received"><day>11</day><month>07</month><year>2017</year></date><date date-type="rev-recd"><day>Accepted:</day>	<month>August</month>	<year>20,</year>	</date><date date-type="accepted"><day>23</day><month>08</month><year>2017</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  This paper proposes a technique for synthesizing pixel-based photo-realistic talking face animation using two-step synthesis with HMMs and DNNs. We introduce facial expression parameters as an intermediate representation that corresponds well to both the input contexts and the output pixel data of face images. The sequences of the facial expression parameters are modeled using context-dependent HMMs with static and dynamic features. The mapping from the expression parameters to the target pixel images is trained using DNNs. We examine the amount of training data required for HMMs and DNNs and compare the performance of the proposed technique with that of the conventional PCA-based technique through objective and subjective evaluation experiments.
 
</p></abstract><kwd-group><kwd>Visual-Speech Synthesis</kwd><kwd> Talking Head</kwd><kwd> Hidden Markov Models (HMMs)</kwd><kwd> Deep Neural Networks (DNNs)</kwd><kwd> Facial Expression Parameter</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In our daily life, facial information is important to enrich speech communication. Face-to-face communication gives us not only linguistic information but also facial identity and expressions, which sometimes play an essential role in making a person feel relieved, attracted, or affected. The same can be said of human-computer interaction. A spoken dialogue system with facial information is richer than one with only speech, and it often gives a friendlier impression to users. For example, a virtual agent with the face of a famous person could easily attract consumers in a shop or a public space. Therefore, visual-speech synthesis, i.e., creating a talking head with synthetic speech and facial animation, is an interesting topic for more advanced man-machine interfaces.</p><p>There have been many studies on visual-speech synthesis [<xref ref-type="bibr" rid="scirp.78666-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref2">2</xref>] . While some studies aimed to generate faces of simple [<xref ref-type="bibr" rid="scirp.78666-ref3">3</xref>] and detailed [<xref ref-type="bibr" rid="scirp.78666-ref3">3</xref>] 3DCG characters, the target of most studies was the synthesis of photo-realistic human faces, which is a more challenging task. In the early years of visual-speech synthesis, studies focused only on synthesizing the image of a speaker’s mouth area [<xref ref-type="bibr" rid="scirp.78666-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] because of limited computational power. As the performance of computers improved, researchers started to examine entire-face synthesis with a variety of approaches. 
When a large number of facial video samples is available, a promising approach is to use synthesis techniques based on visual unit selection [<xref ref-type="bibr" rid="scirp.78666-ref6">6</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref7">7</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref8">8</xref>] , which was inspired by unit selection in speech synthesis (e.g., [<xref ref-type="bibr" rid="scirp.78666-ref9">9</xref>] ). Since video snippets of triphones have been used as basic concatenation units, the resulting database can become very large. The use of smaller units, i.e., image samples, has been shown to be effective in improving the coverage of candidate units with a smaller footprint [<xref ref-type="bibr" rid="scirp.78666-ref10">10</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref11">11</xref>] .</p><p>Although the unit-selection-based synthesis has an advantage in the quality of synthetic facial motion, it has the restrictions that the recording cost is high and the face position has to be fixed to keep the continuity between visual units. One approach to overcoming these problems is to use a facial 3DCG model [<xref ref-type="bibr" rid="scirp.78666-ref12">12</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref13">13</xref>] . In this approach, the face model has a 3D mesh and photo-realistic texture information, and a high-quality rendered animation can be produced. However, the rendering has a high computational cost, and hence real-time rendering is not always possible on low-resource devices such as mobile phones and tablets. From the viewpoint of the footprint and computational cost, a 2D image-based modeling approach can be an alternative choice [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref14">14</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref15">15</xref>] . 
In [<xref ref-type="bibr" rid="scirp.78666-ref14">14</xref>] and [<xref ref-type="bibr" rid="scirp.78666-ref16">16</xref>] , a multidimensional morphable model (MMM) [<xref ref-type="bibr" rid="scirp.78666-ref17">17</xref>] and an active appearance model (AAM) [<xref ref-type="bibr" rid="scirp.78666-ref18">18</xref>] were used to model and parametrize the 2D face images, where the face images are represented by shape and texture (appearance) parameters. By parametrizing the 2D face images, the parameter sequences can be statistically modeled using context-independent [<xref ref-type="bibr" rid="scirp.78666-ref14">14</xref>] or context-dependent [<xref ref-type="bibr" rid="scirp.78666-ref16">16</xref>] models. A limitation of this approach is that facial key points must be labeled by hand for the training images of the facial models. In contrast, facial animation generation based on hidden Markov models (HMMs) with non-parametric features has an advantage [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] since no manual labeling of facial parameters is necessary.</p><p>In this paper, we revise the HMM-based visual-speech synthesis to synthesize not only lip images [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] but also entire-face images. The conventional technique used visual features obtained by principal component analysis (PCA) instead of speech features within the framework of HMM-based speech synthesis [<xref ref-type="bibr" rid="scirp.78666-ref19">19</xref>] . The HMM-based speech synthesis, one of the statistical parametric speech synthesis techniques, has been widely studied [<xref ref-type="bibr" rid="scirp.78666-ref20">20</xref>] and offers high flexibility, such as style control of synthetic speech [<xref ref-type="bibr" rid="scirp.78666-ref21">21</xref>] for expressive speech synthesis [<xref ref-type="bibr" rid="scirp.78666-ref22">22</xref>] . 
However, it is difficult to apply the HMM-based visual-speech synthesis to entire-face images because movement outside the lip region affects the PCA coefficients and degrades the synthesis performance. To address this problem, we propose a two-step synthesis that introduces intermediate features, i.e., low-dimensional facial expression parameters [<xref ref-type="bibr" rid="scirp.78666-ref23">23</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref24">24</xref>] , into the modeling process. The expression parameters are mapped to face images through a non-linear mapping using deep neural networks (DNNs). Since the expression parameters are not affected by the face movement but correspond well to the lip movement, the proposed technique is expected to improve the modeling accuracy compared to the conventional PCA-based synthesis.</p><p>The contributions of this paper are summarized as follows. The proposed technique achieves facial animation synthesis using the HMM-based 2D image synthesis framework with facial expression parameters and DNNs. The advantage of the HMM-based system is its small footprint [<xref ref-type="bibr" rid="scirp.78666-ref25">25</xref>] compared to the unit-selection and 3DCG models. In addition, our technique uses no manual labeling in the model training, which is essential to realize a visual-speech synthesizer for arbitrary speakers at a low cost. We investigate the amount of training data required for HMMs and DNNs through experiments and finally show the superiority of the proposed technique over the conventional PCA-based technique through objective and subjective evaluation tests.</p><p>The rest of this paper is organized as follows. In Section 2, we briefly review the conventional HMM-based visual-speech synthesis techniques. Section 3 describes the proposed two-step synthesis technique using facial expression parameters and non-linear mapping using DNNs. 
In Section 4, the performance of the proposed facial animation synthesis technique is evaluated and compared to that of the conventional PCA-based approach from objective and subjective perspectives. In Section 5, we summarize this study and give suggestions for future work.</p></sec><sec id="s2"><title>2. Conventional Photo-Realistic Talking Face Synthesis Based on HMMs</title><p>In this section, we briefly review the conventional techniques for synthesizing photo-realistic talking face animations. As described in the introduction, this study is based on the HMM-based visual-speech synthesis using PCA-based visual features [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] . This technique was inspired by the HMM-based speech synthesis, in which sequences of speech parameters, i.e., spectral and excitation parameters, are modeled using context-dependent HMMs. The previous work [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] focused only on the mouth area and the tip of the nose. Parallel data of audio and video frames are constructed, but they are modeled separately using HMM sets with different numbers of states: five states for audio features and three states for video features. The modeling method for the audio data is the same as that for HMM-based speech synthesis.</p><p>Since the number of dimensions of pixel image data is very high, it is computationally expensive to apply the HMM-based acoustic modeling to the image data directly. Therefore, in the previous work [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] , the dimensionality of the image was reduced using PCA, and each lip image was represented by a linear combination of eigenvectors in a manner similar to the eigenface [<xref ref-type="bibr" rid="scirp.78666-ref26">26</xref>] . Both audio and image features are modeled by context-dependent phone HMMs, and the duration of each phone is determined by the HMMs trained from the speech. 
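As a sketch of this PCA-based representation, in notation that is ours rather than that of [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] , let <inline-formula><tex-math>\mathbf{x}_t</tex-math></inline-formula> be the vectorized lip image of frame <inline-formula><tex-math>t</tex-math></inline-formula>, <inline-formula><tex-math>\bar{\mathbf{x}}</tex-math></inline-formula> the mean of the training images, and <inline-formula><tex-math>\mathbf{e}_1, \ldots, \mathbf{e}_K</tex-math></inline-formula> the <inline-formula><tex-math>K</tex-math></inline-formula> leading eigenvectors, assumed to be orthonormal. The image is then approximated as <disp-formula><tex-math>\mathbf{x}_t \approx \bar{\mathbf{x}} + \sum_{k=1}^{K} c_{t,k}\,\mathbf{e}_k, \qquad c_{t,k} = \mathbf{e}_k^{\top}\left(\mathbf{x}_t - \bar{\mathbf{x}}\right)</tex-math></disp-formula> so that the HMMs model the sequence of the <inline-formula><tex-math>K</tex-math></inline-formula> coefficients <inline-formula><tex-math>c_{t,k}</tex-math></inline-formula> instead of the raw pixel values. 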
In the model training for visual features, phonetic contextual factors are taken into account, and context-dependent HMMs are trained. State-dependent model-parameter tying using context clustering with contextual decision trees is performed because the number of possible combinations of phonetic contextual factors is enormous. The approach of [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] differs from that of the previous studies [<xref ref-type="bibr" rid="scirp.78666-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref27">27</xref>] in that dynamic features are used in addition to the static features, so that both the static and dynamic properties of the training data are reflected in the generated feature sequences.</p><p>There have been other studies on visual-speech synthesis based on HMMs. In [<xref ref-type="bibr" rid="scirp.78666-ref28">28</xref>] , a lip animation synthesis technique was proposed in which lip image samples are concatenated using a trajectory-guided sample selection method. The guide trajectory is generated in a manner similar to [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] . Then, the optimal sequence of image samples is determined by a cost function. The total cost is given by the weighted sum of the target and concatenation costs, as in unit selection for speech synthesis. Since this is a sample-based approach, the required amount of visual data of the target speaker is larger than that in the parametric visual-speech synthesis. In addition, the synthesis of the entire face was not investigated. The same difficulty as in [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] can arise because the guide trajectory is generated from HMMs with PCA-based features. HMMs were also used for modeling the AAM parameters [<xref ref-type="bibr" rid="scirp.78666-ref16">16</xref>] instead of the PCA coefficients. 
In this technique, the face images and emotional expressions are simultaneously modeled and controlled using the framework of cluster adaptive training (CAT) [<xref ref-type="bibr" rid="scirp.78666-ref29">29</xref>] , which has also been applied to HMM-based speech synthesis [<xref ref-type="bibr" rid="scirp.78666-ref30">30</xref>] <sup>1</sup>. However, our approach has advantages over the AAM-based technique in that it requires neither clipping the facial region from an image nor manual labeling of facial key points.</p></sec><sec id="s3"><title>3. Two-Step Photo-Realistic Talking Face Synthesis Using Facial Expression Parameters</title><p>In this section, we present a novel technique for synthesizing a 2D photo-realistic talking face animation. The technique has two steps to model the relation between input context-dependent labels and output pixel images. First, we give an overview of the proposed talking face synthesis system and introduce facial expression parameters as intermediate features in the modeling. Then, the modeling and parameter generation processes are described in detail. Finally, the conversion from the expression parameters to the pixel images is explained, where DNNs are used for the non-linear mapping.</p><sec id="s3_1"><title>3.1. Overview of the Proposed System</title><p><xref ref-type="fig" rid="fig1">Figure 1</xref> illustrates the outline of the proposed talking face synthesis system. As in the conventional PCA-based approach described in Section 2, the speech and visual units for synthesis are phone HMMs, and hence lip movements are easily synchronized with auditory speech by using the same phoneme labels for synthesis even though the two kinds of units are modeled separately. The model training stage has two steps. The first step is the modeling of facial expression parameters. In this study, we use Microsoft Kinect v2 to capture the facial video data. 
The facial expression parameters, called animation units (AUs), are extracted using Microsoft Face Tracking SDK<sup>2</sup>. The details of the expression parameters are explained in Section 3.2. Then, the expression parameter sequences are modeled by HMMs with context-dependent labels. We use only triphone contexts in this study. The second step is to train the mapping from expression parameters to facial pixel images, where DNNs are used to achieve the non-linear mapping. For the training of HMMs, dynamic features are used to capture the dynamic property of the expression parameters between frames.</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title> Overview of the proposed system</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/6-1730667x2.png"/></fig><p>In the synthesis stage, the input text is converted to a context-dependent label sequence using text analysis. The context-dependent HMMs are concatenated according to the label sequence, and an optimal expression parameter sequence is estimated using the parameter generation algorithm based on maximum likelihood [<xref ref-type="bibr" rid="scirp.78666-ref31">31</xref>] , which is described in Section 3.4. In the estimation, both the static and dynamic features are taken into account. Finally, the generated expression parameters are converted to the facial pixel images.</p></sec><sec id="s3_2"><title>3.2. Animation Units for Facial Expression Parameters</title><p>In the conventional PCA-based synthesis, the PCA coefficients obtained from the training pixel images can be viewed as an intermediate representation. Although PCA efficiently reduces the number of dimensions of the images, the obtained coefficients include the characteristics of the whole images. 
This means that the representation is sensitive not only to the lip movement but also to the face movement, even when the latter is small. As a result, it is difficult to accurately model the facial parts using HMMs with context labels, and hence the applicable region is restricted to the area around the lips [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] [<xref ref-type="bibr" rid="scirp.78666-ref28">28</xref>] .</p><p>Instead of the conventional PCA coefficients, we use animation units (AUs) for the facial expression parameters as intermediate features in the modeling. The AUs are seventeen parameters that represent the position and shape of the face, each expressed as a numeric weight, as shown in <xref ref-type="table" rid="table1">Table 1</xref>. Three of the parameters, Jaw Slide Right, Right Eyebrow Lowerer, and Left Eyebrow Lowerer, vary between −1.0 and 1.0, and the others vary between 0.0 and 1.0. Since these parameters are calculated using color and depth information with the Kinect sensor [<xref ref-type="bibr" rid="scirp.78666-ref32">32</xref>] , there is no need to label the face images manually. The advantage of the AUs over PCA coefficients is that the AUs capture the state of the respective facial parts independently and are not affected by each other.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Definition of animation unit parameters</title></caption><table><thead><tr><th align="center" valign="middle" >Number</th><th align="center" valign="middle" >Parameter</th></tr></thead><tbody><tr><td align="center" valign="middle" >AU 0</td><td align="center" valign="middle" >Jaw Open</td></tr><tr><td align="center" valign="middle" >AU 1</td><td align="center" valign="middle" >Lip Pucker</td></tr><tr><td align="center" valign="middle" >AU 2</td><td align="center" valign="middle" >Jaw Slide Right</td></tr><tr><td align="center" valign="middle" >AU 3</td><td align="center" valign="middle" >Lip Stretcher Right</td></tr><tr><td align="center" valign="middle" >AU 4</td><td align="center" valign="middle" >Lip Stretcher Left</td></tr><tr><td align="center" valign="middle" >AU 5</td><td align="center" valign="middle" >Lip Corner Puller Left</td></tr><tr><td align="center" valign="middle" >AU 6</td><td align="center" valign="middle" >Lip Corner Puller Right</td></tr><tr><td align="center" valign="middle" >AU 7</td><td align="center" valign="middle" >Lip Corner Depressor Left</td></tr><tr><td align="center" valign="middle" >AU 8</td><td align="center" valign="middle" >Lip Corner Depressor Right</td></tr><tr><td align="center" valign="middle" >AU 9</td><td align="center" valign="middle" >Left Cheek Puff</td></tr><tr><td align="center" valign="middle" >AU 10</td><td align="center" valign="middle" >Right Cheek Puff</td></tr><tr><td align="center" valign="middle" >AU 11</td><td align="center" valign="middle" >Left Eye Closed</td></tr><tr><td align="center" valign="middle" >AU 12</td><td align="center" valign="middle" >Right Eye Closed</td></tr><tr><td align="center" valign="middle" >AU 13</td><td align="center" valign="middle" >Right Eyebrow Lowerer</td></tr><tr><td align="center" valign="middle" >AU 14</td><td align="center" valign="middle" >Left Eyebrow Lowerer</td></tr><tr><td align="center" valign="middle" >AU 15</td><td align="center" valign="middle" >Lower Lip Depressor Left</td></tr><tr><td align="center" valign="middle" >AU 16</td><td align="center" valign="middle" >Lower Lip Depressor Right</td></tr></tbody></table></table-wrap><p>The idea of our approach is similar to that of a study on emotional speech synthesis based on a three-layered model using a dimensional approach [<xref ref-type="bibr" rid="scirp.78666-ref33">33</xref>] , in contrast to the categorical approach [<xref ref-type="bibr" rid="scirp.78666-ref34">34</xref>] . 
As in the case of this study, the speech features are sometimes difficult to predict directly from the emotion dimensions. That study used seventeen semantic primitives as an intermediate representation and improved the accuracy of acoustic feature estimation, synthesizing affective speech closer to that intended in the dimensional emotion space.</p></sec><sec id="s3_3"><title>3.3. Modeling Facial Expression Parameter Sequences Using HMMs</title><p>Since expression parameter sequences are generally continuous in the time domain, we use HMMs to model the continuity of the parameter sequences in a way similar to the HMM-based speech synthesis [<xref ref-type="bibr" rid="scirp.78666-ref19">19</xref>] . For the model training, we use a phone as the synthesis unit. The phone labels with phone boundary information are the same as those for the speech modeling. The parameter sequences of the respective phone segments are modeled using context-dependent HMMs. Hidden semi-Markov models (HSMMs) [<xref ref-type="bibr" rid="scirp.78666-ref35">35</xref>] are used for explicit modeling of the state duration distributions [<xref ref-type="bibr" rid="scirp.78666-ref36">36</xref>] . State-based decision trees are constructed, and parameter tying using context clustering is performed to reduce the number of model parameters. A stopping criterion based on the minimum description length (MDL) [<xref ref-type="bibr" rid="scirp.78666-ref37">37</xref>] is used for the decision tree construction in this study. Dynamic features are used as well as static features to model the dynamic property among multiple frames [<xref ref-type="bibr" rid="scirp.78666-ref38">38</xref>] ; such features are also used in the very low bit-rate coding of spectral [<xref ref-type="bibr" rid="scirp.78666-ref39">39</xref>] and F0 [<xref ref-type="bibr" rid="scirp.78666-ref40">40</xref>] features of speech.</p></sec><sec id="s3_4"><title>3.4. 
Facial Expression Parameter Generation from HMMs</title><p>In the synthesis stage of a face animation, a given text is converted to a context-dependent label sequence using text analysis. The model parameters of the facial expression parameters for unseen context labels are estimated using the decision trees constructed during the model training. The context-dependent HSMMs are concatenated according to the label sequence, and a single sentence HSMM is created. A sequence of facial expression parameters is generated from the HMMs using a maximum likelihood parameter generation algorithm [<xref ref-type="bibr" rid="scirp.78666-ref41">41</xref>] . In the parameter generation, both static and dynamic features are taken into account, and consequently, a smooth parameter sequence is obtained. <xref ref-type="fig" rid="fig2">Figure 2</xref> shows the effect of the dynamic features in the parameter generation. From <xref ref-type="fig" rid="fig2">Figure 2</xref>(b), we see that the trajectory of the generated parameter sequence is not smooth when only the static features are used. There are undesirable fluctuations between frames compared to the trajectory of the original parameter sequence. On the other hand, a smooth trajectory is obtained when the dynamic features are taken into account. As a result, the trajectory becomes close to that of the original parameters.</p><fig id="fig2"  position="float"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title> Example of facial expression parameter sequences generated with and without dynamic features. (a) Original; (b) without dynamic features; (c) with dynamic features</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/6-1730667x3.png"/></fig></sec><sec id="s3_5"><title>3.5. 
Mapping Facial Expression Parameters to Face Image Using DNNs</title><p>Finally, the facial expression parameters generated from the HMMs are converted to the facial pixel image of the target speaker using DNN-based non-linear mapping. The same idea was used in our previous study on the conversion of speakers’ face images [<xref ref-type="bibr" rid="scirp.78666-ref24">24</xref>] . Since both the 2D face images and the expression parameters, i.e., AUs, are obtained simultaneously using Kinect, there is a good correspondence between them. The variation of the shape of the lips and other parts is smaller than that of speech parameters such as spectral and F0 features. Therefore, it is not necessary to use the whole training data, typically several tens of minutes or more, for the face image. We randomly choose frames from the training data in a manner similar to the PCA of the previous study [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>] . The HSV space is used for the color representation on the basis of the report that the HSV space was better than the RGB space in the extraction of the lip region [<xref ref-type="bibr" rid="scirp.78666-ref42">42</xref>] .</p></sec></sec><sec id="s4"><title>4. Experiments</title><p>In this section, we conducted objective evaluations to examine the appropriate settings for the model training in the proposed talking face synthesis technique. We also compared the proposed technique with the conventional PCA-based synthesis using objective and subjective evaluation tests to show the effectiveness of introducing the intermediate features, i.e., facial expression parameters, into the model training.</p><sec id="s4_1"><title>4.1. Experimental Conditions</title><p>For the model training, we recorded color video samples of a male speaker who uttered 103 sentences using Kinect v2. 
The sentences were selected from the subsets A and J of the 503 phonetically balanced sentences of the ATR Japanese speech database set B [<xref ref-type="bibr" rid="scirp.78666-ref43">43</xref>] . The size of the images was 400 &#215; 400 pixels. The speech and timestamp data were recorded along with the video data. The built-in microphone of Kinect was used for the speech recording. We used all 17 AUs as the facial expression parameters. The head position of the speaker was fixed using a headrest to suppress the face movement during the recording. The frame rate was set to 30 fps in the recording. However, since the AUs could not be captured for some dropped frames, the AU sequences were resampled to 60 fps using cubic spline interpolation based on the recorded timestamps. The facial regions were cut out from the recorded images using template matching and were resized to 200 &#215; 200 pixels. In the template matching, a single face image with a closed mouth, chosen in advance, was used as the template for the first frame of each utterance. The face region was then cut out, and the cut-out image was used as the new template for the next frame to improve the matching accuracy. This template update was performed frame by frame.</p><p>From the 103 sentences, 48 sentences were chosen as candidates for the model training of the HMMs and DNNs, 25 sentences were chosen as validation data to obtain the optimal number of hidden layers and number of units for the DNNs, and 30 sentences were chosen for the evaluation tests. The AUs and their delta and delta-delta parameters were used as the static and dynamic features. The formulations of the dynamic features were the same as those in the HMM-based speech synthesis [<xref ref-type="bibr" rid="scirp.78666-ref19">19</xref>] . As a result, the total number of dimensions of the feature vector for facial expression parameters was 51. Three-state left-to-right triphone HSMMs were used for the modeling of facial expression parameters. 
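For illustration, the 51 dimensions can be written out as follows; the window coefficients shown are a common choice in HMM-based speech synthesis and are an assumption here, since the exact formulation is deferred to [<xref ref-type="bibr" rid="scirp.78666-ref19">19</xref>] . With <inline-formula><tex-math>\mathbf{c}_t</tex-math></inline-formula> denoting the 17-dimensional AU vector at frame <inline-formula><tex-math>t</tex-math></inline-formula>, the feature vector is <disp-formula><tex-math>\mathbf{o}_t = \left[\mathbf{c}_t^{\top},\ \Delta\mathbf{c}_t^{\top},\ \Delta^{2}\mathbf{c}_t^{\top}\right]^{\top}, \qquad \Delta\mathbf{c}_t = \tfrac{1}{2}\left(\mathbf{c}_{t+1} - \mathbf{c}_{t-1}\right), \qquad \Delta^{2}\mathbf{c}_t = \mathbf{c}_{t-1} - 2\,\mathbf{c}_t + \mathbf{c}_{t+1}</tex-math></disp-formula> which gives 17 &#215; 3 = 51 dimensions per frame. 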
We assumed that the probability density functions in all decision-tree leaf nodes were Gaussian with diagonal covariance matrices, which is a typical implementation in HMM-based speech synthesis. We used standard feed-forward DNNs for the parameter mapping. The conditions for the training of the DNNs are listed in <xref ref-type="table" rid="table2">Table 2</xref>; they were the same as those in [<xref ref-type="bibr" rid="scirp.78666-ref24">24</xref>] .</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Structure of DNNs</title></caption><table><thead><tr><th align="center" valign="middle" >Condition</th><th align="center" valign="middle" >Setting</th></tr></thead><tbody><tr><td align="center" valign="middle" >Number of units for input layer</td><td align="center" valign="middle" >17</td></tr><tr><td align="center" valign="middle" >Number of units for output layer</td><td align="center" valign="middle" >120,000</td></tr><tr><td align="center" valign="middle" >Optimizer</td><td align="center" valign="middle" >Adam [<xref ref-type="bibr" rid="scirp.78666-ref44">44</xref>]</td></tr><tr><td align="center" valign="middle" >Activation function</td><td align="center" valign="middle" >tanh</td></tr><tr><td align="center" valign="middle" >Batch size</td><td align="center" valign="middle" >100</td></tr><tr><td align="center" valign="middle" >Number of epochs</td><td align="center" valign="middle" >100</td></tr><tr><td align="center" valign="middle" >Dropout rate</td><td align="center" valign="middle" >0.5</td></tr></tbody></table></table-wrap></sec><sec id="s4_2"><title>4.2. Required Amount of Training Data</title><p>It is important to know the amount of training data that is sufficient for the model training. In this section, we objectively examined the amount of data required for the training of HMMs and DNNs. Root mean square errors (RMSEs) between the original and synthetic features were used as an objective distortion measure. In the evaluation of HMMs, the amount of training data was varied from 4 to 48 sentences 
with an increment of 4 sentences. The sentences were randomly chosen from the all 48 sentences. Since the performance depends on the choice of the sentence set, we made five sets of training data for each target number of sentences. Then, the average value of RMSEs of the five sets was calculated and was used as a final RMSE, which alleviates the dependency to the choice of the sentences. The RMSE was calculated between original and generated AUs where the frames were aligned using the durations of the original speech. <xref ref-type="fig" rid="fig3">Figure 3</xref> shows the result. From the figure, we found that there was not a large variation of the RMSEs when the number of sentences was over ten. The smallest RMSE was given by the condition that the number of sentences was set to 44 in this experiment. This result indicates that the sufficient amount of training data to model the facial expression parameters using HMMs is around 50 sentences when the phonetically balanced ATR sentences are used for the training.</p><p>In the evaluation of DNNs, we randomly chose the frames for training DNNs as is described in Section 3.5. The target number of the frames was doubled from 128 frames up to 4096 frames. As was the case with the evaluation of HMMs, we made five sets of training data and used the average value of RMSEs for the five sets as the final RMSE. For the optimization of DNN structure, the candidate numbers of hidden layers were 1, 2, and 3, and the candidate numbers of hidden units in one layer were 512, 1024, and 2048. Totally, there were nine combinations of the conditions. For each condition, we calculated the RMSEs for the validation set, and finally the best combination, which gave the smallest RMSE, was chosen as the structure in each amount of training data. For the validation and test data, the RMSE was calculated between the original and generated values of pixels in the HSV color space. 
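</p><p>The structure selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the evaluator passed to the hypothetical function select_structure stands in for training a DNN and measuring its validation RMSE, and the placeholder evaluator below does not train anything and does not reproduce any value from the experiments.</p><preformat>
```python
import itertools
import math


def rmse(reference, generated):
    """Root mean square error between two equal-length sequences."""
    total = sum((r - g) ** 2 for r, g in zip(reference, generated))
    return math.sqrt(total / len(reference))


def select_structure(evaluate, layer_options=(1, 2, 3),
                     unit_options=(512, 1024, 2048)):
    """Return the (layers, units) pair with the smallest validation RMSE.

    evaluate is a hypothetical callable that would train a DNN with the
    given structure and return its RMSE on the validation set.
    """
    candidates = list(itertools.product(layer_options, unit_options))
    scores = {c: evaluate(*c) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores


def dummy_evaluate(layers, units):
    # Placeholder score for demonstration only; not a result from the paper.
    return abs(layers - 2) + abs(units - 1024) / 1024.0


best, scores = select_structure(dummy_evaluate)
print(len(scores), best)  # prints 9 (2, 1024)
```
</preformat><p>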
As with the HMMs, the frames were aligned using the durations of the original speech. <xref ref-type="fig" rid="fig4">Figure 4</xref> shows the result. The figure shows that the variation of the RMSE became small when the target number of frames was 512 or more.</p><p>Comparing the results for the DNNs and HMMs, we found that the amount of training data required for the DNNs was much smaller than that for the HMMs. This is because the HMMs model the continuous sequence of facial expression parameters, whereas the DNN-based feature mapping takes no dynamic features into account, so a frame-independent mapping trained on randomly chosen frames is sufficient.</p><fig id="fig3"  position="float"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption><title> Variation of the objective distortions against the different amounts of training data for HMMs</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/6-1730667x4.png"/></fig><fig id="fig4"  position="float"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption><title> Variation of the objective distortions against the different amounts of training data for DNNs</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/6-1730667x5.png"/></fig></sec><sec id="s4_3"><title>4.3. Comparison with the PCA-Based Synthesis</title><p>Next, we objectively compared the proposed two-step synthesis technique with the conventional PCA-based synthesis technique [<xref ref-type="bibr" rid="scirp.78666-ref5">5</xref>]. In the conventional technique, PCA was applied to the training data and 100 PCA coefficients were obtained for each frame. The cumulative contribution ratio of the obtained eigenvectors was about 90% for the training data. 
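</p><p>For reference, the cumulative contribution ratio of the first k components is the sum of the k largest eigenvalues divided by the sum of all eigenvalues. The following minimal sketch computes it for a toy spectrum; the function names are hypothetical and the eigenvalues are illustrative values, not those obtained in the experiment.</p><preformat>
```python
def cumulative_contribution(eigenvalues):
    """Cumulative contribution ratios, largest eigenvalue first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    ratios = []
    running = 0.0
    for value in ordered:
        running += value
        ratios.append(running / total)
    return ratios


def components_for_ratio(eigenvalues, target):
    """Smallest number of components whose cumulative ratio reaches target."""
    for k, ratio in enumerate(cumulative_contribution(eigenvalues), start=1):
        if ratio >= target:
            return k
    return len(eigenvalues)


# Toy spectrum for illustration only (not the eigenvalues of the face images).
toy_eigenvalues = [8.0, 4.0, 2.0, 1.0, 1.0]
print(cumulative_contribution(toy_eigenvalues))
# prints [0.5, 0.75, 0.875, 0.9375, 1.0]
print(components_for_ratio(toy_eigenvalues, 0.9))  # prints 4
```
</preformat><p>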
We conducted a preliminary experiment and confirmed that the perceived degradation of the images reconstructed by the PCA, relative to the original images, was small. The feature vectors consisted of the PCA coefficients and their delta and delta-delta coefficients, and the total number of dimensions was 300. The feature vectors were used for the training of triphone HMMs under the same conditions as in the proposed technique. For the proposed technique, we used 48 sentences and 4096 frames for the training of the HMMs and DNNs, respectively. The structure of the DNNs was determined using the validation data as in Section 4.2, and the optimal numbers of hidden layers and units in this condition were 3 and 512, respectively. The RMSEs of the pixel data were calculated for the conventional and proposed techniques. <xref ref-type="table" rid="table3">Table 3</xref> shows the result. The table shows that the proposed technique with two-step training synthesized face images closer to the originals than the conventional PCA-based technique did.</p><p>Finally, we conducted a subjective preference test on the facial animations synthesized by the conventional and proposed techniques. The same samples as those in the objective evaluation were used for the preference test. In this test, the samples of the conventional and proposed techniques were presented to each participant in random order. Ten sentences were randomly chosen from the 48 sentences for each participant. The participants were asked to choose the sample that was more natural as a facial animation. Since the samples are photo-realistic facial animations, it is natural to present both the animation and the speech to the participants. Therefore, we added the original speech to the animation of each sample. Note that the lip motion and speech were synchronized because the phone durations of the original speech were used in the synthesis of the facial animations. 
The participants were twelve undergraduate and graduate students.</p><p><xref ref-type="fig" rid="fig5">Figure 5</xref> shows the result of the preference test. The 95% confidence interval is also shown in the figure. The figure shows that the proposed technique synthesized substantially better facial animations than the conventional PCA-based technique, which is consistent with the objective evaluation result. In the synthetic samples of the conventional technique, we found that the opening and closing motions of the mouth were almost absent and did not correspond to the phonetic information. A possible reason is that the variation of the whole pixel image affected the PCA coefficients, which reduced the contribution of the mouth shape to the coefficients. As a result, the images of the open and closed mouth were clustered into the same leaf node in the decision-tree-based context clustering, which made the lip motion unclear. In contrast, although the quality of the synthetic image of the entire face with the proposed technique was at the same level as that of the conventional one, the lip motion was synthesized owing to the two-step modeling and synthesis.</p><p><xref ref-type="fig" rid="fig6">Figure 6</xref> shows an example of successive face images extracted from the facial animations of 1) the original samples and the synthetic samples with 2) the conventional and 3) the proposed techniques. 
In the figure of the original animation, the mouth was opened, closed, and opened again. However, we see no variation in the mouth region with the conventional technique. The proposed technique alleviates this problem, and there is a visible difference between the frames with the mouth open and closed.</p><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Comparison of objective distortions (RMSEs) between the conventional and proposed synthesis techniques</title></caption><table><thead><tr><th align="center" valign="middle" >PCA</th><th align="center" valign="middle" >Proposed</th></tr></thead><tbody><tr><td align="center" valign="middle" >42.64</td><td align="center" valign="middle" >41.69</td></tr></tbody></table></table-wrap><fig id="fig5"  position="float"><label><xref ref-type="fig" rid="fig5">Figure 5</xref></label><caption><title> Result of the preference test</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/6-1730667x6.png"/></fig><fig id="fig6"  position="float"><label><xref ref-type="fig" rid="fig6">Figure 6</xref></label><caption><title> Comparison of the captured and synthesized frames. From left: captured, conventional method, and proposed method. (a) Original; (b) PCA; (c) proposed</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/6-1730667x7.png"/></fig></sec></sec><sec id="s5"><title>5. Conclusion</title><p>In this paper, we proposed a technique for synthesizing 2D photo-realistic talking face animation using two-step synthesis with HMMs and DNNs. The key idea of the technique is the introduction of facial expression parameters as an intermediate representation that corresponds well both to the input contexts and to the output pixel data of the face image. 
In the proposed technique, the facial expression parameters and pixel images are modeled using HMMs and DNNs, respectively. In the experiments, we first examined the amount of training data required for the HMMs and DNNs. The objective experimental results showed that about 50 phonetically balanced ATR sentences were sufficient for the modeling of facial expression parameters with HMMs. It was also found that the DNN training needed less training data than the HMM training, which saves computation time in the model preparation. Both the objective and subjective comparisons with the conventional PCA-based synthesis showed the superiority of the proposed technique. Future work includes the synthesis of expressive facial animation and an increase in the number of contextual factors.</p></sec><sec id="s6"><title>Acknowledgements</title><p>Part of this work was supported by JSPS KAKENHI Grant Number JP15H02720.</p></sec><sec id="s7"><title>Cite this paper</title><p>Sato, K., Nose, T. and Ito, A. (2017) HMM-Based Photo-Realistic Talking Face Synthesis Using Facial Expression Parameter Mapping with Deep Neural Networks. Journal of Computer and Communications, 5, 50-65. https://doi.org/10.4236/jcc.2017.510006</p></sec><sec id="s8"><title>NOTES</title></sec></body><back><ref-list><title>References</title><ref id="scirp.78666-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Ostermann, J. and Weissenfeld, A. (2004) Talking Faces - Technologies and Applications. Proc. the 17th International Conference on Pattern Recognition (ICPR), 3, 826-833. https://doi.org/10.1109/ICPR.2004.1334656</mixed-citation></ref><ref id="scirp.78666-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Mattheyses, W. and Verhelst, W. (2015) Audiovisual Speech Synthesis: An Overview of the State-of-the-Art. Speech Communication, 66, 182-217. 
https://doi.org/10.1016/j.specom.2014.11.001</mixed-citation></ref><ref id="scirp.78666-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Savran, A., Arslan, L.M. and Akarun, L. (2006) Speaker-Independent 3D Face Synthesis Driven by Speech and Text. Signal Processing, 86, 2932-2951. https://doi.org/10.1016/j.sigpro.2005.12.007</mixed-citation></ref><ref id="scirp.78666-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Brooke, N.M. and Scott, S.D. (1998) Two- and Three-Dimensional Audio-Visual Speech Synthesis. Proc. AVSP'98 International Conference on Auditory-Visual Speech Processing.</mixed-citation></ref><ref id="scirp.78666-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Sako, S., Tokuda, K., Masuko, T., Kobayashi, T. and Kitamura, T. (2000) HMM-Based Text-to-Audio-Visual Speech Synthesis. Proc. INTERSPEECH, 25-28.</mixed-citation></ref><ref id="scirp.78666-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Huang, F.J., Cosatto, E. and Graf, H.P. (2002) Triphone Based Unit Selection for Concatenative Visual Speech Synthesis. Proc. International Conference on Acoustics, Speech, and Signal Processing, 2, 2037-2040.</mixed-citation></ref><ref id="scirp.78666-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Ezzat, T., Geiger, G. and Poggio, T. (2002) Trainable Videorealistic Speech Animation. Proc. Special Interest Group on Computer GRAPHics and Interactive Techniques, 388-398. https://doi.org/10.1145/566570.566594</mixed-citation></ref><ref id="scirp.78666-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Mattheyses, W., Latacz, L., Verhelst, W. and Sahli, H. (2008) Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis. International Workshop on Machine Learning for Multimodal Interaction, 125-136. 
https://doi.org/10.1007/978-3-540-85853-9_12</mixed-citation></ref><ref id="scirp.78666-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Hunt, A.J. and Black, A.W. (1996) Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database. Proc. International Conference on Acoustics, Speech, and Signal Processing, 1, 373-376.</mixed-citation></ref><ref id="scirp.78666-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Cosatto, E. and Graf, H.P. (2000) Photo-Realistic Talking-Heads from Image Samples. IEEE Transactions on Multimedia, 2, 152-163. https://doi.org/10.1109/6046.865480</mixed-citation></ref><ref id="scirp.78666-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Liu, K. and Ostermann, J. (2008) Realistic Facial Animation System for Interactive Services. Visual Speech Synthesis Challenge, Brisbane, September 2008, 2330-2333.</mixed-citation></ref><ref id="scirp.78666-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Cao, Y., Tien, W.C., Faloutsos, P. and Pighin, F. (2005) Expressive Speech-Driven Facial Animation. ACM Transactions on Graphics, 24, 1283-1302. https://doi.org/10.1145/1095878.1095881</mixed-citation></ref><ref id="scirp.78666-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Wang, L., Han, W., Soong, F.K. and Huo, Q. (2011) Text Driven 3D Photo-Realistic Talking Head. Proc. INTERSPEECH, 3307-3308.</mixed-citation></ref><ref id="scirp.78666-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Chang, Y.-J. and Ezzat, T. (2005) Transferable Video Realistic Speech Animation. ACM SIGGRAPH Eurographics Symposium on Computer Animation, 143-151. 
https://doi.org/10.1145/1073368.1073388</mixed-citation></ref><ref id="scirp.78666-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Wan, V., Anderson, R., Blokland, A., Braunschweiler, N., Chen, L., Kolluru, B., Latorre, J., Maia, R., Stenger, B., Yanagisawa, K., et al. (2013) Photo-Realistic Expressive Text to Talking Head Synthesis. Proc. INTERSPEECH, 2667-2669.</mixed-citation></ref><ref id="scirp.78666-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Anderson, R., Stenger, B., Wan, V. and Cipolla, R. (2013) Expressive Visual Text-to-Speech Using Active Appearance Models. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 3382-3389. https://doi.org/10.1109/CVPR.2013.434</mixed-citation></ref><ref id="scirp.78666-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Jones, M.J. and Poggio, T. (1998) Multidimensional Morphable Models. Proc. 6th International Conference on Computer Vision, 683-688. https://doi.org/10.1109/ICCV.1998.710791</mixed-citation></ref><ref id="scirp.78666-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Cootes, T.F., Edwards, G.J. and Taylor, C.J. (1998) Active Appearance Models. European Conference on Computer Vision, 484-498. https://doi.org/10.1007/BFb0054760</mixed-citation></ref><ref id="scirp.78666-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T. and Kitamura, T. (1999) Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis. Proc. Eurospeech, 2347-2350.</mixed-citation></ref><ref id="scirp.78666-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Zen, H., Tokuda, K. and Black, A. (2009) Statistical Parametric Speech Synthesis. 
Speech Communication, 51, 1039-1064.</mixed-citation></ref><ref id="scirp.78666-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Nose, T., Yamagishi, J., Masuko, T. and Kobayashi, T. (2007) A Style Control Technique for HMM-Based Expressive Speech Synthesis. IEICE Transactions on Information and Systems, E90-D, 1406-1413. https://doi.org/10.1093/ietisy/e90-d.9.1406</mixed-citation></ref><ref id="scirp.78666-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Nose, T. and Kobayashi, T. (2011) Recent Development of HMM-Based Expressive Speech Synthesis and Its Applications. Proc. APSIPA ASC, 1-4.</mixed-citation></ref><ref id="scirp.78666-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Gui, J., Zhang, Y., Li, S., Xu, P. and Lan, S. (2015) Real-Time 3D Facial Subtle Expression Control Based on Blended Normal Maps. 8th International Symposium on Computational Intelligence and Design, Vol. 1, 466-469. https://doi.org/10.1109/ISCID.2015.200</mixed-citation></ref><ref id="scirp.78666-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">Saito, Y., Nose, T., Shinozaki, T. and Ito, A. (2015) Conversion of Speaker’s Face Image Using PCA and Animation Unit for Video Chatting. 2015 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 433-436.</mixed-citation></ref><ref id="scirp.78666-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J. and Oura, K. (2013) Speech Synthesis Based on Hidden Markov Models. Proc. the IEEE, 101, 1234-1252. https://doi.org/10.1109/JPROC.2013.2251852</mixed-citation></ref><ref id="scirp.78666-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Turk, M.A. and Pentland, A.P. (1991) Face Recognition Using Eigenfaces. Proc. 
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 586-591. https://doi.org/10.1109/CVPR.1991.139758</mixed-citation></ref><ref id="scirp.78666-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Williams, J.J., Katsaggelos, A.K. and Randolph, M.A. (2000) A Hidden Markov Model Based Visual Speech Synthesizer. Proc. International Conference on Acoustics, Speech, and Signal Processing, Vol. 6, 2393-2396. https://doi.org/10.1109/ICASSP.2000.859323</mixed-citation></ref><ref id="scirp.78666-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">Wang, L., Qian, X., Han, W. and Soong, F.K. (2010) Synthesizing Photo-Real Talking Head via Trajectory-Guided Sample Selection. Proc. INTERSPEECH, 446-449.</mixed-citation></ref><ref id="scirp.78666-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">Gales, M. (2000) Cluster Adaptive Training of Hidden Markov Models. IEEE Transactions on Speech and Audio Processing, 8, 417-428. https://doi.org/10.1109/89.848223</mixed-citation></ref><ref id="scirp.78666-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">Latorre, J., Wan, V., Gales, M.J., Chen, L., Chin, K., Knill, K., Akamine, M., et al. (2012) Speech Factorization for HMM-TTS Based on Cluster Adaptive Training. Proc. INTERSPEECH, 971-974.</mixed-citation></ref><ref id="scirp.78666-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">Tokuda, K., Masuko, T., Yamada, T., Kobayashi, T. and Imai, S. (1995) An Algorithm for Speech Parameter Generation from Continuous Mixture HMMs with Dynamic Features. Proc. Eurospeech, 757-760.</mixed-citation></ref><ref id="scirp.78666-ref32"><label>32</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, Z. (2012) Microsoft Kinect Sensor and Its Effect. IEEE Multimedia, 19, 4-10. 
https://doi.org/10.1109/MMUL.2012.24</mixed-citation></ref><ref id="scirp.78666-ref33"><label>33</label><mixed-citation publication-type="other" xlink:type="simple">Xue, Y., Hamada, Y. and Akagi, M. (2015) Emotional Speech Synthesis System Based on a Three-Layered Model Using a Dimensional Approach. Proc. APSIPA ASC, 505-514. https://doi.org/10.1109/APSIPA.2015.7415323</mixed-citation></ref><ref id="scirp.78666-ref34"><label>34</label><mixed-citation publication-type="other" xlink:type="simple">Yamagishi, J., Onishi, K., Masuko, T. and Kobayashi, T. (2005) Acoustic Modeling of Speaking Styles and Emotional Expressions in HMM-Based Speech Synthesis. IEICE Transactions on Information and Systems, E88-D, 503-509. https://doi.org/10.1093/ietisy/e88-d.3.502</mixed-citation></ref><ref id="scirp.78666-ref35"><label>35</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Levinson</surname><given-names> S. </given-names></name>,<etal>et al</etal>. (<year>1986</year>)<article-title>Continuously Variable Duration Hidden Markov Models for Automatic Speech Recognition</article-title><source> Computer Speech and Language</source><volume> 1</volume>,<fpage> 29</fpage>-<lpage>45</lpage>.<pub-id pub-id-type="doi"></pub-id></mixed-citation></ref><ref id="scirp.78666-ref36"><label>36</label><mixed-citation publication-type="other" xlink:type="simple">Zen, H., Tokuda, K., Masuko, T., Kobayashi, T. and Kitamura, T. (2007) A Hidden Semi-Markov Model-Based Speech Synthesis System. IEICE Transactions on Information and Systems, E90-D, 825-834. https://doi.org/10.1093/ietisy/e90-d.5.825</mixed-citation></ref><ref id="scirp.78666-ref37"><label>37</label><mixed-citation publication-type="other" xlink:type="simple">Shinoda, K. and Watanabe, T. (2000) MDL-Based Context-Dependent Subword Modeling for Speech Recognition. Journal of the Acoustical Society of Japan (E), 21, 79-86. 
https://doi.org/10.1250/ast.21.79</mixed-citation></ref><ref id="scirp.78666-ref38"><label>38</label><mixed-citation publication-type="other" xlink:type="simple">Masuko, T., Tokuda, K., Kobayashi, T. and Imai, S. (1996) Speech Synthesis Using HMMs with Dynamic Features. Proc. International Conference on Acoustics, Speech, and Signal Processing, 389-392. https://doi.org/10.1109/ICASSP.1996.541114</mixed-citation></ref><ref id="scirp.78666-ref39"><label>39</label><mixed-citation publication-type="other" xlink:type="simple">Tokuda, K., Masuko, T., Hiroi, J., Kobayashi, T. and Kitamura, T. (1998) A Very Low Bit Rate Speech Coder Using HMM-Based Speech Recognition/Synthesis Techniques. Proc. International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, 609-612. https://doi.org/10.1109/ICASSP.1998.675338</mixed-citation></ref><ref id="scirp.78666-ref40"><label>40</label><mixed-citation publication-type="other" xlink:type="simple">Nose, T. and Kobayashi, T. (2012) Very Low Bit-Rate F0 Coding for Phonetic Vocoders Using MSD-HMM with Quantized F0 Symbols. Speech Communication, 54, 384-392.</mixed-citation></ref><ref id="scirp.78666-ref41"><label>41</label><mixed-citation publication-type="other" xlink:type="simple">Tokuda, K., Kobayashi, T. and Imai, S. (1995) Speech Parameter Generation from HMM Using Dynamic Features. Proc. International Conference on Acoustics, Speech, and Signal Processing, 660-663. https://doi.org/10.1109/ICASSP.1995.479684</mixed-citation></ref><ref id="scirp.78666-ref42"><label>42</label><mixed-citation publication-type="other" xlink:type="simple">Kuroda, T. and Watanabe, T. (1995) Method for Lip Extraction from Face Image Using HSV Color Space. Transactions of the Japan Society of Mechanical Engineers Series C, 61, 4724-4729.</mixed-citation></ref><ref id="scirp.78666-ref43"><label>43</label><mixed-citation publication-type="other" xlink:type="simple">Kurematsu, A., Takeda, K., Sagisaka, Y., Katagiri, S., Kuwabara, H. and Shikano, K. 
(1990) ATR Japanese Speech Database as a Tool of Speech Recognition and Synthesis. Speech Communication, 9, 357-363.</mixed-citation></ref><ref id="scirp.78666-ref44"><label>44</label><mixed-citation publication-type="other" xlink:type="simple">Kingma, D. and Ba, J. (2014) Adam: A Method for Stochastic Optimization.</mixed-citation></ref></ref-list></back></article>