<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JAMP</journal-id><journal-title-group><journal-title>Journal of Applied Mathematics and Physics</journal-title></journal-title-group><issn pub-type="epub">2327-4352</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jamp.2020.89141</article-id><article-id pub-id-type="publisher-id">JAMP-103004</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  A Bayesian Regression Model and Applications
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Yijun</surname><given-names>Yu</given-names></name><xref ref-type="aff" rid="aff1"><sub>1</sub></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><label>1</label><addr-line>Department of Mathematics, Tuskegee University, Tuskegee, AL, USA</addr-line></aff><pub-date pub-type="epub"><day>01</day><month>09</month><year>2020</year></pub-date><volume>08</volume><issue>09</issue><fpage>1877</fpage><lpage>1887</lpage><history><date date-type="received"><day>17,</day>	<month>July</month>	<year>2020</year></date><date date-type="rev-recd"><day>19,</day>	<month>September</month>	<year>2020</year>	</date><date date-type="accepted"><day>22,</day>	<month>September</month>	<year>2020</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p><html>
 
  A sparse vector regression model is developed. The model is established by employing a Bayesian formulation and is trained using a set of data 
  <img src="Edit_c3ff10a2-e8b8-4862-bc9d-74ca04016ecb.bmp" alt="" />. The number of parameters that must be determined in the algorithm is reduced by a special setting of the prior hyperparameters, so the algorithm is simpler than similar Bayesian vector regression models. Examples of applications to function approximation and to an inverse scattering problem are presented.
 
</html></p></abstract><kwd-group><kwd>Bayesian</kwd><kwd> Regression</kwd><kwd> Applications</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>There has been considerable interest in Bayesian vector regression and its application to various classification and regression problems [<xref ref-type="bibr" rid="scirp.103004-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.103004-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.103004-ref3">3</xref>] [<xref ref-type="bibr" rid="scirp.103004-ref4">4</xref>]. The Bayesian approach combines prior probability distributions with the observed data: the priors are converted to posterior distributions through Bayes’ theorem. Let $x$ be an input vector and $t$ be a vector of target parameters. In a regression formulation our goal is to define a model $y(x; w)$ that yields an approximation to the true target $t$, with the model defined by the parameters $w$. The model is typically designed using a set of “training” data $D = \{x_n, t_n\}_{n=1}^{N}$. Although we initially consider a finite set $D$, the goal is for the resulting model $y(x; w)$ to be applicable to arbitrary $(x, t) \notin D$ over the anticipated range of $t$. When developing a regression model one must address the bias-variance tradeoff. A bias is introduced by restricting the form that $y(x; w)$ may take, while the variance represents the error between the model $y(x; w)$ and the true target parameters $t$. Models with minimal bias typically have significant flexibility, and therefore the model parameters may vary significantly as a function of the specific training set $D$ employed. To obtain good model generalization, which may be connected to the variation in the model parameters as a function of $D$, one must introduce a bias. The use of a small number of non-zero parameters $w$ often yields a good balance between bias and variance; such models are termed “sparse”. 
This has led to the development of the relevance vector machine [<xref ref-type="bibr" rid="scirp.103004-ref5">5</xref>].</p>
<p>The rest of this paper is organized as follows. The theory of the vector-regression formulation is presented in Section 2, application examples are provided in Section 3, and the work is summarized in Section 4.</p></sec><sec id="s2"><title>2. Sparse Bayesian Vector Regression</title><sec id="s2_1"><title>2.1. Model Specification</title>
<p>Assume we have available a set of training data $D = \{x_n, t_n\}_{n=1}^{N}$, where $x_n = [x_n^{(1)}\ x_n^{(2)}\ \cdots\ x_n^{(L)}]^{\top}$ and $t_n = [t_n^{(1)}\ t_n^{(2)}\ \cdots\ t_n^{(M)}]^{\top}$. Our objective is to develop a function $y(x; w)$ that depends on the parameters $w$. Once $y(x; w)$ has been designed, it may be used to map an arbitrary $x$ to an approximation of the target parameters $t$.</p>
<p>The specific vector-regression function $y(x; w) = [y^{(1)}(x; w)\ y^{(2)}(x; w)\ \cdots\ y^{(M)}(x; w)]^{\top}$ employed here is defined as</p>
<p>\[ y(x; w) = \sum_{i=1}^{N} w_i t_i K(x, x_i) + w_0 \tag{1} \]</p>
<p>where $w_0 = [w_0^{(1)}\ w_0^{(2)}\ \cdots\ w_0^{(M)}]^{\top}$, and $K(x, x_i)$ is a kernel function designed such that $K(x, x_i)$ is large if $x_i \approx x$ and small otherwise. 
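As a minimal illustration of this locality, the sketch below evaluates a Gaussian radial-basis-function kernel, the form adopted in Section 3, at a nearby and at a distant training input. The function name rbf_kernel, the sample points and the width r = 2 are assumptions chosen only for the illustration.
<preformat>
```python
import math


def rbf_kernel(x, xi, r):
    """Gaussian RBF kernel K(x, xi) = exp(-||x - xi||^2 / r^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / r ** 2)


x = (0.0, 0.0)
near = (0.1, -0.1)   # hypothetical training input close to x
far = (6.0, 8.0)     # hypothetical training input far from x
r = 2.0              # kernel width (assumed value)

k_near = rbf_kernel(x, near, r)
k_far = rbf_kernel(x, far, r)
print(k_near, k_far)
```
</preformat>
With these values the kernel is close to one for the nearby input and negligible for the distant one. 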
Hence in (1) only those $x_i \approx x$ are important in defining $y(x; w)$.</p>
<p>Let</p>
<p>\[ w = [w_1\ w_2\ \cdots\ w_N\ w_0^{(1)}\ w_0^{(2)}\ \cdots\ w_0^{(M)}]^{\top}, \]</p>
<p>\[ \psi_i(x) = [\phi_i^{(1)}\ \phi_i^{(2)}\ \cdots\ \phi_i^{(M)}]^{\top}, \qquad i = 1, 2, \cdots, N \]</p>
<p>with</p>
<p>\[ \phi_i^{(k)} = t_i^{(k)} K(x, x_i), \qquad i = 1, 2, \cdots, N;\ k = 1, 2, \cdots, M \tag{2} \]</p>
<p>and the $M \times (N + M)$ matrix</p>
<p>\[ \Psi(x) = [\psi_1(x)\ \psi_2(x)\ \cdots\ \psi_N(x)\ I_M], \tag{3} \]</p>
<p>where $I_M$ is the $M \times M$ identity matrix. Then (1) can be expressed in matrix form as</p>
<p>\[ y(x; w) = \Psi(x) w \tag{4} \]</p>
<p>Assume that the target is generated by the model with additive noise,</p>
<p>\[ t = y(x; w) + \varepsilon = \Psi(x) w + \varepsilon, \tag{5} \]</p>
<p>where the model error is $\varepsilon = [\varepsilon^{(1)}\ \varepsilon^{(2)}\ \cdots\ \varepsilon^{(M)}]^{\top}$ and the $\varepsilon^{(k)}$, $k = 1, 2, \cdots, M$, are independent samples from a zero-mean Gaussian process with variance $\alpha_0^{-1}$:</p>
<p>\[ p(\varepsilon^{(k)}) = \mathcal{N}(\varepsilon^{(k)} \mid 0, \alpha_0^{-1}), \qquad k = 1, 2, \cdots, M \tag{6} \]</p>
<p>We therefore have</p>
<p>\[ p(t \mid x, w, \alpha_0) = \left(\frac{2\pi}{\alpha_0}\right)^{-M/2} \exp\left(-\frac{\alpha_0}{2} \|t - \Psi(x) w\|_2^2\right) = \mathcal{N}(t \mid \Psi(x) w, \alpha_0^{-1} I_M) \tag{7} \]</p>
<p>We wish to constrain the weights $w$ such that a simple model is favored; this is accomplished by invoking a prior distribution on $w$ that favors most of the weights being zero. In this context, only the most relevant members of the training set $D = \{x_n, t_n\}_{n=1}^{N}$, those with nonzero weights $w_n$, are ultimately used in the final regression model. 
This simplicity allows improved regression performance for $(x, t) \notin D$ [<xref ref-type="bibr" rid="scirp.103004-ref5">5</xref>] [<xref ref-type="bibr" rid="scirp.103004-ref6">6</xref>].</p>
<p>We employ a zero-mean Gaussian prior distribution for $w$,</p>
<p>\[ p(w \mid \alpha_0, \alpha) = \mathcal{N}(w \mid 0_{N+M}, \alpha_0^{-1} \alpha^{-1} I_{N+M}), \tag{8} \]</p>
<p>where $0_{N+M}$ is the $(N + M)$-dimensional zero vector, $I_{N+M}$ is the $(N + M) \times (N + M)$ identity matrix, and suitable priors over the hyperparameters $\alpha_0$ and $\alpha$ are Gamma distributions [<xref ref-type="bibr" rid="scirp.103004-ref7">7</xref>]</p>
<p>\[ p(\alpha_0 \mid a, b) = \mathrm{Gamma}(\alpha_0 \mid a, b) \tag{9} \]</p>
<p>\[ p(\alpha \mid c, d) = \mathrm{Gamma}(\alpha \mid c, d) \tag{10} \]</p>
<p>where $\mathrm{Gamma}(\alpha_0 \mid a, b) = \Gamma(a)^{-1} b^{a} \alpha_0^{a-1} e^{-b \alpha_0}$ with $\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\, dt$.</p>
<p>The hierarchical prior over $w$ favors a sparse model, and the prior over $\alpha_0$ is used to favor small model error on the training data $D$.</p></sec><sec id="s2_2"><title>2.2. Inference</title>
<p>For the training data $D = \{x_n, t_n\}_{n=1}^{N}$ we introduce the $LN$-dimensional vector</p>
<p>\[ X = [x_1^{\top}\ x_2^{\top}\ \cdots\ x_N^{\top}]^{\top} \]</p>
<p>and the $MN$-dimensional vector</p>
<p>\[ T = [t_1^{\top}\ t_2^{\top}\ \cdots\ t_N^{\top}]^{\top} \]</p>
<p>and let the $(MN) \times (M + N)$ matrix be</p>
<p>\[ \Phi = [\Phi_1^{\top}\ \Phi_2^{\top}\ \cdots\ \Phi_N^{\top}]^{\top} \quad \text{with} \quad \Phi_i = \Psi(x_i),\ i = 1, 2, \cdots, N. \]</p>
<p>Then, by (7), we have</p>
<p>\[ p(T \mid w, \alpha_0, X) = \left(\frac{2\pi}{\alpha_0}\right)^{-MN/2} \exp\left(-\frac{\alpha_0}{2} \|T - \Phi w\|_2^2\right) = \mathcal{N}(T \mid \Phi w, \alpha_0^{-1} I_{MN}) \tag{11} \]</p>
<p>Noting that $p(T \mid \alpha_0, \alpha, X) = \int p(T \mid w, \alpha_0, X)\, p(w \mid \alpha_0, \alpha)\, dw$ is a convolution of Gaussians, the posterior distribution over the weights $w$ can be derived as</p>
<p>\[ p(w \mid \alpha_0, \alpha, X, T) = \frac{p(T \mid w, \alpha_0, X)\, p(w \mid \alpha_0, \alpha)}{p(T \mid \alpha_0, \alpha, X)} = \mathcal{N}(w \mid \mu, \alpha_0^{-1} \Sigma) \tag{12} \]</p>
<p>where</p>
<p>\[ \Sigma = (\Phi^{\top} \Phi + \alpha I_{M+N})^{-1} = \left(\sum_{i=1}^{N} \Phi_i^{\top} \Phi_i + \alpha I_{M+N}\right)^{-1} \tag{13} \]</p>
<p>\[ \mu = \Sigma \Phi^{\top} T = \Sigma \sum_{i=1}^{N} \Phi_i^{\top} t_i \tag{14} \]</p></sec><sec id="s2_3"><title>2.3. 
Hyperparameter Optimization</title>
<p>We determine $\alpha$ in (13) by maximizing $p(\alpha \mid T, X) \propto p(T \mid \alpha, X)\, p(\alpha)$ with respect to $\alpha$. This is equivalent to maximizing the logarithm of this quantity. In addition, we can choose to maximize with respect to $\ln \alpha$, since we can assume hyperpriors over a logarithmic scale.</p>
<p>Since</p>
<p>\[ \ln p(T \mid \alpha, X) = \ln \int p(T \mid w, \alpha_0, X)\, p(w \mid \alpha_0, \alpha)\, p(\alpha_0 \mid a, b)\, dw\, d\alpha_0 = -\frac{1}{2}\left[\ln |B| + (MN + 2a) \ln\left(T^{\top} B^{-1} T + 2b\right)\right] + \mathrm{const} \]</p>
<p>where $B = I_{MN} + \alpha^{-1} \Phi \Phi^{\top}$, and $p(\ln \alpha) = \alpha\, p(\alpha)$, we obtain the objective function</p>
<p>\[ L(\alpha) = -\frac{1}{2}\left[\ln |B| + (MN + 2a) \ln\left(T^{\top} B^{-1} T + 2b\right)\right] + c \ln \alpha - d \alpha \tag{15} \]</p>
<p>By the determinant identity [<xref ref-type="bibr" rid="scirp.103004-ref8">8</xref>], we have</p>
<p>\[ |B| = |I_{MN} + \alpha^{-1} \Phi \Phi^{\top}| = \alpha^{-(M+N)} |\alpha I_{M+N} + \Phi^{\top} \Phi| = \alpha^{-(M+N)} |\Sigma^{-1}|, \]</p>
<p>and so</p>
<p>\[ \ln |B| = -(M + N) \ln \alpha + \ln |\Sigma^{-1}| \tag{16} \]</p>
<p>Using the Woodbury formula, we obtain</p>
<p>\[ B^{-1} = (I_{MN} + \alpha^{-1} \Phi \Phi^{\top})^{-1} = I_{MN} - \Phi (\alpha I_{M+N} + \Phi^{\top} \Phi)^{-1} \Phi^{\top} = I_{MN} - \Phi \Sigma \Phi^{\top}, \]</p>
<p>thus</p>
<p>\[ T^{\top} B^{-1} T = T^{\top} (T - \Phi \Sigma \Phi^{\top} T) \]</p>
<p>\[ {} = T^{\top} (T - \Phi \mu) \tag{17} \]</p>
<p>\[ {} = \|T\|^2 - T^{\top} \Phi \Sigma \Phi^{\top} T \tag{18} \]</p>
<p>Then by (16) and Jacobi’s formula, we have</p>
<p>\[ \frac{d \ln |B|}{d \ln \alpha} = -(M + N) + \frac{1}{|\Sigma^{-1}|} \frac{d |\Sigma^{-1}|}{d \ln \alpha} = -(M + N) + \mathrm{tr}\left(\Sigma \frac{d \Sigma^{-1}}{d \ln \alpha}\right) = -(M + N) + \alpha \sum_{j=1}^{M+N} \Sigma_{jj} \tag{19} \]</p>
<p>where $\Sigma_{jj}$ is the $j$-th diagonal element of the matrix $\Sigma$.</p>
<p>By (18), and using $d\Sigma / d \ln \alpha = -\Sigma\, (d\Sigma^{-1} / d \ln \alpha)\, \Sigma$ with $d\Sigma^{-1} / d \ln \alpha = \alpha I_{M+N}$,</p>
<p>\[ \frac{d\, T^{\top} B^{-1} T}{d \ln \alpha} = -\frac{d\, T^{\top} \Phi \Sigma \Phi^{\top} T}{d \ln \alpha} = -T^{\top} \Phi \frac{d \Sigma}{d \ln \alpha} \Phi^{\top} T = T^{\top} \Phi \Sigma \frac{d \Sigma^{-1}}{d \ln \alpha} \Sigma \Phi^{\top} T = \alpha \|\mu\|^2 \tag{20} \]</p>
<p>Using (17), (19) and (20), we have</p>
<p>\[ \frac{d L(\alpha)}{d \ln \alpha} = \frac{1}{2}\left(M + N - \alpha \sum_{j=1}^{M+N} \Sigma_{jj}\right) - \frac{MN + 2a}{2\left(T^{\top} B^{-1} T + 2b\right)} \frac{d\, T^{\top} B^{-1} T}{d \ln \alpha} + c - d \alpha = \frac{1}{2}\left(M + N - \alpha \sum_{j=1}^{M+N} \Sigma_{jj}\right) - \frac{(MN + 2a)\, \alpha \|\mu\|^2}{2\left[T^{\top} (T - \Phi \mu) + 2b\right]} + c - d \alpha \tag{21} \]</p>
<p>Setting (21) to zero and rearranging yields</p>
<p>\[ \alpha = \frac{M + N + 2c}{\sum_{j=1}^{M+N} \Sigma_{jj} + 2d + (MN + 2a) \|\mu\|^2 / \left[T^{\top} (T - \Phi \mu) + 2b\right]} \tag{22} \]</p>
<p>The algorithm consists of iterating (13), (14) and (22) to update $\Sigma$, $\mu$ and $\alpha$.</p></sec><sec id="s2_4"><title>2.4. Making Predictions</title>
<p>Assume $\alpha^{\mathrm{MP}}$ and $\alpha_0^{\mathrm{MP}}$ are the maximizing values obtained by maximizing $p(\alpha \mid T, X)$ (Sec. 2.3) and $p(\alpha_0 \mid T, X)$, respectively. Assume</p>
<p>\[ p(\alpha_0, \alpha \mid X, T) \approx \delta(\alpha_0 - \alpha_0^{\mathrm{MP}})\, \delta(\alpha - \alpha^{\mathrm{MP}}) \]</p>
<p>then</p>
<p>\[ \begin{gathered}
p(t \mid x, X, T) = \int p(t \mid x, w, \alpha_0, \alpha)\, p(w, \alpha_0, \alpha \mid X, T)\, dw\, d\alpha_0\, d\alpha \\
= \int p(t \mid x, w, \alpha_0)\, p(w \mid \alpha_0, \alpha, X, T)\, p(\alpha_0, \alpha \mid X, T)\, dw\, d\alpha_0\, d\alpha \\
\approx \int p(t \mid x, w, \alpha_0)\, p(w \mid \alpha_0, \alpha, X, T)\, \delta(\alpha_0 - \alpha_0^{\mathrm{MP}})\, \delta(\alpha - \alpha^{\mathrm{MP}})\, dw\, d\alpha_0\, d\alpha \\
= \int p(t \mid x, w, \alpha_0^{\mathrm{MP}})\, p(w \mid \alpha_0^{\mathrm{MP}}, \alpha^{\mathrm{MP}}, X, T)\, dw \\
= \mathcal{N}\left(t \mid y(x; \mu), (\alpha_0^{\mathrm{MP}})^{-1} \Omega\right)
\end{gathered} \tag{23} \]</p>
<p>with</p>
<p>\[ y(x; \mu) = \Psi(x) \mu \tag{24} \]</p>
<p>\[ \Omega = I_M + \Psi(x) \Sigma \Psi(x)^{\top} \tag{25} \]</p></sec></sec><sec id="s3"><title>3. Applications</title>
<p>In the examples we employ a radial-basis-function kernel $K(x, x_i) = \exp(-\|x - x_i\|^2 / r^2)$ and adjust the parameters $a$, $b$, $c$ and $d$ by training and testing on given training data; we finally take $a = b = c = d = 0.05$ for all examples in this section. In all figures the horizontal axis is the index of the samples and the vertical axis is the output.</p>
<sec id="s3_1"><title>3.1. Regression: Function Approximation</title>
<p>The model can be used to establish the relation between the independent and dependent variables of a function.</p>
<p>Example 1 A 2-dimensional vector function of two variables,</p>
<p>\[ t_1 = \mathrm{sinc}\left(\frac{x_1 + x_2}{4}\right) \]</p>
<p>\[ t_2 = -0.5\, \mathrm{sinc}\left(\frac{x_1 + x_2}{4}\right) \sin\left(\frac{x_1 x_2}{20}\right) - 0.4 \]</p>
<p>in the domain $\{(x_1, x_2) \mid -10 \le x_1 \le 10,\ 0 \le x_2 \le 20\}$, where $\mathrm{sinc}(x) = \sin(x)/x$.</p>
<p><xref ref-type="fig" rid="fig1">Figure 1</xref> and <xref ref-type="fig" rid="fig2">Figure 2</xref> illustrate the results. 
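As a minimal sketch of how such a result could be produced, the NumPy code below iterates (13), (14) and (22) on the functions of Example 1 and predicts with (24). The 10 by 10 training grid, the shifted test grid, the kernel width r = 4, the initial value of alpha, the fixed number of 20 iterations, the reading of the function arguments as (x1 + x2)/4 and x1 x2/20, and all function names are assumptions made for the illustration; they are not taken from the experiments reported here.
<preformat>
```python
import numpy as np


def sinc(z):
    # Unnormalized sinc(z) = sin(z)/z; np.sinc is sin(pi z)/(pi z), hence the rescaling.
    return np.sinc(z / np.pi)


def example1_targets(X):
    # Assumed reading of Example 1: arguments (x1 + x2)/4 and x1*x2/20.
    s = sinc((X[:, 0] + X[:, 1]) / 4.0)
    t2 = -0.5 * s * np.sin(X[:, 0] * X[:, 1] / 20.0) - 0.4
    return np.column_stack([s, t2])


def design(Xq, Xtr, Ttr, r):
    # Stack Psi(x) of (3) for each row x of Xq: N columns t_i K(x, x_i), then I_M.
    d2 = ((Xq[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / r ** 2)                      # (Q, N) kernel values
    Q, M = Xq.shape[0], Ttr.shape[1]
    psi = K[:, None, :] * Ttr.T[None, :, :]       # (Q, M, N)
    eye = np.broadcast_to(np.eye(M), (Q, M, M))
    return np.concatenate([psi, eye], axis=2).reshape(Q * M, -1)


def fit(Xtr, Ttr, r, a=0.05, b=0.05, c=0.05, d=0.05, iters=20):
    N, M = Ttr.shape
    Phi = design(Xtr, Xtr, Ttr, r)
    T = Ttr.reshape(-1)
    alpha = 1.0                                   # assumed initial value
    for _ in range(iters):
        Sigma = np.linalg.inv(Phi.T @ Phi + alpha * np.eye(N + M))  # (13)
        mu = Sigma @ (Phi.T @ T)                                    # (14)
        denom = T @ (T - Phi @ mu) + 2 * b
        alpha = (M + N + 2 * c) / (
            np.trace(Sigma) + 2 * d + (M * N + 2 * a) * (mu @ mu) / denom
        )                                                           # (22)
    # mu belongs to the alpha used in the last pass, before its final update.
    return mu, alpha


# Assumed sampling: a 10 x 10 training grid and a shifted 10 x 10 test grid.
g1, g2 = np.meshgrid(np.linspace(-10, 10, 10), np.linspace(0, 20, 10))
X_train = np.column_stack([g1.ravel(), g2.ravel()])
h1, h2 = np.meshgrid(np.linspace(-9, 9, 10), np.linspace(1, 19, 10))
X_test = np.column_stack([h1.ravel(), h2.ravel()])
T_train, T_test = example1_targets(X_train), example1_targets(X_test)

r = 4.0                                           # assumed kernel width
mu, alpha = fit(X_train, T_train, r)
Y_test = (design(X_test, X_train, T_train, r) @ mu).reshape(-1, 2)  # (24)
rmse = np.sqrt(np.mean((Y_test - T_test) ** 2))
print("alpha:", alpha, "test RMSE:", rmse)
```
</preformat>
The printed values depend on the assumed grid and kernel width and are not expected to reproduce the figures. 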
<xref ref-type="fig" rid="fig1">Figure 1</xref> is obtained by learning from 100 noise-free training samples. <xref ref-type="fig" rid="fig2">Figure 2</xref> is based on 100 noisy training samples. The noise is generated from a zero-mean Gaussian whose standard deviation is 5% of the average $\|t\|$ over the training data. Both are tested on 100 examples that are not in the training data.</p>
<p>Example 2 A 3-dimensional vector function of 200 variables, $(x_1, x_2, \cdots, x_{200}) \to (t_1, t_2, t_3)$.</p>
<p>\[ t_1 = \sum_{k=1}^{200} \sin\left((x_k)^{5/7}\right) + \frac{x_{50}}{100} \]</p>
<p>\[ t_2 = \frac{x_{200}}{800}\, t_1 + \frac{x_{50}}{200} + \cos\left(\frac{x_{100}}{5}\right) - 10 \]</p>
<p>\[ t_3 = \mathrm{atan}\left(\frac{t_1 + t_2}{6}\right) + \frac{t_2 - t_1}{2} - 10 \]</p>
<p>We choose samples at the points $x^n = (x_1^n, x_2^n, \cdots, x_{200}^n)$ with $x_k^n = k + (n - 1)\pi/4$. The 100 samples at the points $x^n$ with $n = 1, 3, 5, \cdots, 199$ are used as training data, and the 100 samples at the points $x^n$ with $n = 2, 4, 6, \cdots, 200$ are used as testing data.</p>
<p><xref ref-type="fig" rid="fig3">Figure 3</xref> is obtained from noise-free training samples. <xref ref-type="fig" rid="fig4">Figure 4</xref> is based on noisy training samples. The noise is generated from a zero-mean Gaussian whose standard deviation is 5% of the average $\|t\|$ over the training data.</p></sec><sec id="s3_2"><title>3.2. Regression: Inverse Scattering</title>
<p>The model can be used to characterize the connection between measured vector scattered-field data $x$ and the underlying target responsible for these fields, characterized by the parameter vector $t$. The scattering data $x$ may be measured at multiple positions. In the examples the measured data are simulated by a forward model.</p>
<p>We consider a homogeneous lossless dielectric target buried in a lossy dielectric half space. The objective is to invert for the parameters of the target. In the examples, the parameter vector $t$ is composed of three real numbers: the depth, the size, and the dielectric constant of the target. For each target there are 100 simulated measurements. 
The training data $D = \{x_n, t_n\}_{n=1}^{N}$ are composed of $N = 180$ examples and the testing data are composed of 125 examples that are not in $D$.</p>
<p>Example 1 We consider a cube target in this example. <xref ref-type="fig" rid="fig5">Figure 5</xref> and <xref ref-type="fig" rid="fig6">Figure 6</xref> illustrate the results. <xref ref-type="fig" rid="fig5">Figure 5</xref> is obtained from noise-free data. <xref ref-type="fig" rid="fig6">Figure 6</xref> is based on noisy data. The noise is generated from a zero-mean Gaussian whose standard deviation is 10% of the average $\|x\|$ over the training data. The “size” is the width of the cube.</p>
<p>Example 2 We consider a sphere target in this example. <xref ref-type="fig" rid="fig7">Figure 7</xref> and <xref ref-type="fig" rid="fig8">Figure 8</xref> illustrate the results. <xref ref-type="fig" rid="fig7">Figure 7</xref> is obtained from noise-free data. <xref ref-type="fig" rid="fig8">Figure 8</xref> is based on noisy data. The noise is generated from a zero-mean Gaussian whose standard deviation is 10% of the average $\|x\|$ over the training data. The “size” is the diameter of the sphere.</p>
<p>We applied the model to two completely different types of problems, and the model works well for both applications. The results show that this regression model can be applied to various types of regression problems.</p></sec></sec><sec id="s4"><title>4. Conclusion</title>
<p>A Bayesian vector-regression algorithm has been developed. The model employs a statistical prior that favors a sparse model, in which most of the weights are zero [<xref ref-type="bibr" rid="scirp.103004-ref5">5</xref>]. This model improves on the algorithm in [<xref ref-type="bibr" rid="scirp.103004-ref9">9</xref>] and reduces the number of hyperparameters that must be calculated in the algorithm from two to one. The model is not tailored to one specific problem, and so can be applied to different regression problems. We have discussed the theoretical development of the model and have presented several example results for two different applications. 
One is for function approximation, and the other is for inverse scattering of dielectric targets buried in a lossy half space. It has been demonstrated that the algorithm works well for different applications.</p></sec><sec id="s5"><title>Conflicts of Interest</title><p>The author declares no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s6"><title>Cite this paper</title><p>Yu, Y.J. (2020) A Bayesian Regression Model and Applications. Journal of Applied Mathematics and Physics, 8, 1877-1887. https://doi.org/10.4236/jamp.2020.89141</p></sec></body><back><ref-list><title>References</title><ref id="scirp.103004-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Law, T. and Shawe-Taylor, J. (2017) Practical Bayesian Support Vector Regression for Financial Time Series Prediction and Market Condition Change Detection. Quantitative Finance, 17, 1403-1416. https://doi.org/10.1080/14697688.2016.1267868</mixed-citation></ref><ref id="scirp.103004-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Yu, J. (2012) A Bayesian Inference Based Two-Stage Support Vector Regression Framework for Soft Sensor Development in Batch Bioprocesses. Computers &amp; Chemical Engineering, 41, 134-144. https://doi.org/10.1016/j.compchemeng.2012.03.004</mixed-citation></ref><ref id="scirp.103004-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Jacobs, J.P. (2012) Bayesian Support Vector Regression with Automatic Relevance Determination Kernel for Modeling of Antenna Input Characteristics. IEEE Transactions on Antennas and Propagation, 60, 2114-2118. https://doi.org/10.1109/TAP.2012.2186252</mixed-citation></ref><ref id="scirp.103004-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Hans, C. (2009) Bayesian Lasso Regression. Biometrika, 96, 835-845. 
https://doi.org/10.1093/biomet/asp047</mixed-citation></ref><ref id="scirp.103004-ref5"><label>5</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Tipping</surname><given-names>M.E.</given-names></name> (<year>2001</year>) <article-title>Sparse Bayesian Learning and the Relevance Vector Machine</article-title>. <source>Journal of Machine Learning Research</source>, <volume>1</volume>, <fpage>211</fpage>-<lpage>244</lpage>.</mixed-citation></ref><ref id="scirp.103004-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Schölkopf, B. and Smola, A.J. (2001) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.</mixed-citation></ref><ref id="scirp.103004-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis. 2nd Edition, Springer, Berlin. https://doi.org/10.1007/978-1-4757-4286-2</mixed-citation></ref><ref id="scirp.103004-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979) Multivariate Analysis. Academic Press, New York.</mixed-citation></ref><ref id="scirp.103004-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Yu, Y., Krishnapuram, B. and Carin, L. (2004) Inverse Scattering with Sparse Bayesian Vector Regression. Inverse Problems, 20, 217-231. https://doi.org/10.1088/0266-5611/20/6/S13</mixed-citation></ref></ref-list></back></article>