<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JILSA</journal-id><journal-title-group><journal-title>Journal of Intelligent Learning Systems and Applications</journal-title></journal-title-group><issn pub-type="epub">2150-8402</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jilsa.2012.43024</article-id><article-id pub-id-type="publisher-id">JILSA-22033</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Regularization by Intrinsic Plasticity and Its Synergies with Recurrence for Random Projection Methods
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Neumann</surname><given-names>Klaus</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Emmerich</surname><given-names>Christian</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Steil</surname><given-names>Jochen J.</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University, Bielefeld, Germany.</addr-line></aff><author-notes><corresp id="cor1">* E-mail: <email>kneumann@cor-lab.uni-bielefeld.de</email> (KN); <email>cemmeric@cor-lab.uni-bielefeld.de</email> (CE); <email>jsteil@cor-lab.uni-bielefeld.de</email> (JJS)</corresp></author-notes><pub-date pub-type="epub"><day>30</day><month>08</month><year>2012</year></pub-date><volume>04</volume><issue>03</issue><fpage>230</fpage><lpage>246</lpage><history><date date-type="received"><day>22</day><month>02</month><year>2012</year></date><date date-type="rev-recd"><day>23</day><month>05</month><year>2012</year></date><date date-type="accepted"><day>01</day><month>06</month><year>2012</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). 
http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Neural networks based on high-dimensional random feature generation have become popular under the names extreme learning machine (ELM) and reservoir computing (RC). We provide an in-depth analysis of such networks with respect to feature selection, model complexity, and regularization. Starting from an ELM, we show how recurrent connections increase the effective complexity, leading to reservoir networks. In contrast, intrinsic plasticity (IP), a biologically inspired, unsupervised learning rule, acts as a task-specific feature regularizer, which tunes the effective model complexity. Combining both mechanisms in the framework of static reservoir computing, we achieve an excellent balance of feature complexity and regularization, which provides robustness to other model selection parameters such as network size, initialization ranges, or the regularization parameter of the output learning. We demonstrate the advantages on several synthetic data sets as well as on benchmark tasks from the UCI repository, providing practical insights into how to use high-dimensional random networks for data processing.
 
</p></abstract><kwd-group><kwd>Extreme Learning Machine; Reservoir Computing; Model Selection; Feature Selection; Model Complexity; Intrinsic Plasticity; Regularization</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>In the last decade, machine learning techniques based on random projections have attracted a lot of attention because in principle they allow for very efficient processing of large and high-dimensional data sets [<xref ref-type="bibr" rid="scirp.22033-ref1">1</xref>]. These approaches randomly initialize the free parameters of the feature generating part of a data processing model and restrict learning to linear methods for obtaining a suitable readout function. As opposed to random projections for dimensionality reduction, which have been considered much earlier [2,3], it is characteristic for such new approaches to use high-dimensional projections. These often actually increase the feature dimensionality.</p><p>A prominent example is the extreme learning machine (ELM) as proposed in [<xref ref-type="bibr" rid="scirp.22033-ref4">4</xref>]. It comprises a single hidden layer feed-forward neural network with fixed random input weights and a trainable linear output layer as depicted in <xref ref-type="fig" rid="fig1">Figure 1</xref>(a). ELMs have become popular, because, compared to traditional backpropagation training, they train much faster since output weights are computed in a single batch regression step. Despite this apparent simplicity, ELMs are universal function approximators with high probability under mild conditions if arbitrarily large networks can be considered [<xref ref-type="bibr" rid="scirp.22033-ref5">5</xref>]. The relation between the ELM approach and earlier proposed feedforward random projection methods is discussed in [<xref ref-type="bibr" rid="scirp.22033-ref6">6</xref>]. 
In practice and for a finite ELM, however, model selection, parameter initialization and regularization are challenges and active topics of research.</p><p>The most prominent other example for random projections is the reservoir computing (RC) approach [<xref ref-type="bibr" rid="scirp.22033-ref7">7</xref>], a paradigm that uses recurrent neural networks with fixed and randomly initialized recurrent weights, see <xref ref-type="fig" rid="fig1">Figure 1</xref>(b). From a machine learning point of view, the reservoir serves as a fixed spatio-temporal kernel projecting the input data nonlinearly into a high-dimensional space of the reservoir network states. In the limit of infinitely many neurons, this is equivalent to a recursive kernel transformation [<xref ref-type="bibr" rid="scirp.22033-ref8">8</xref>]. The subsequent use of a trainable non-recurrent linear readout layer combines the advantages of recurrent networks with the ease, efficiency and optimality of linear regression methods. New applications for processing temporal data have been reported, for instance in speech recognition [9,10], sensori-motor robot control [11-13], detection of diseases [14,15], or flexible central pattern generators in biological modeling [<xref ref-type="bibr" rid="scirp.22033-ref16">16</xref>].</p><p>An intermediate approach that uses dynamic reservoir encodings for processing data in static classification and regression tasks has also been considered under the name of attractor-based reservoir computing [17-19]. The rationale is that a recurrent network can efficiently encode static inputs in its attractors [19,20]. In this contribution, we regard static reservoir computing as a natural extension of the ELM. We point out that recurrent connections significantly enrich the set of possible features for an ELM by introducing non-linear mixtures. They thereby enhance approximation capability and performance under limited resources like a finite network size. 
It is noteworthy that this approach does not affect the output learning, where we will still use standard linear regression.</p><p>A central issue for all learning approaches is model selection, and it is even more severe for random projection networks because large parts of the networks remain fixed after initialization. The neuron model, the network architecture and particularly the network size strongly determine the generalization performance, compare the upper part of <xref ref-type="fig" rid="fig2">Figure 2</xref>. In the state-of-the-art ELM approach, most of these quantities are tuned manually by means of expert knowledge about the specific task.</p><p>Several techniques to automatically adapt the network’s size to a given task have been considered [21-23], where success is always measured after retraining the output layer of the network. Despite these efforts, it remains a challenge to understand the interplay between model complexity, output learning, and performance: controlling the network size affects only the number of features rather than their complexity and ignores effects of regularization both in the output learning and the ratio of data points to number of neurons.</p><p>An essential mechanism to consider in this context is regularization ([24-26], Section 7 in the appendix). In this paper we distinguish two different levels of regularization: output regularization with regard to the linear output learning and input or feature regularization with regard to the feature encoding produced in the hidden layer. Output regularization typically refers to Tikhonov regularization [<xref ref-type="bibr" rid="scirp.22033-ref24">24</xref>] and assumes a Gaussian prior for the learning parameters. It amounts to adding a term to the error function that penalizes large output weights and is also known as weight decay (cf. Section A). 
It is easy to implement without additional computational costs in the batch linear regression and therefore is a standard method used for both ELM [<xref ref-type="bibr" rid="scirp.22033-ref27">27</xref>] and reservoir computing [7,28]. A suitable Tikhonov regularization parameter must be determined by line search, which is computationally costly, and performance can be undesirably sensitive to its value. This contradicts the original simplicity of the random projection method, and we will therefore propose a method to make performance more robust with respect to the choice of the output regularization.</p><p>The designer also has to make choices with respect to the input processing, e.g. on the hyper-parameters governing the distributions of random parameter initialization, on proper pre-scaling of the input data, and on the type of non-linear functions involved.</p><p>It is therefore highly desirable to gain insight into the interaction between parameter or feature selection and output regularization. The goal is to provide constructive tools to robustly reduce the dependency of the network performance on the different parameter choices while keeping peak performance. To this end, we investigate recurrence and intrinsic plasticity, an unsupervised biologically motivated learning rule that adjusts bias and slope of a neuron’s sigmoid activation function [<xref ref-type="bibr" rid="scirp.22033-ref29">29</xref>]. These are two mechanisms to influence the model’s feature complexity, which span a different axis than the usual model selection approaches, see <xref ref-type="fig" rid="fig2">Figure 2</xref> (horizontal axis). We analyze the complex interplay of feature and model selection by assessing properties on three levels: First, the feature complexity, i.e. the feature transformation provided by a single neuron; second, the complexity of the network function, i.e. 
the learned combination of features measured by its mean curvature; and third, the generalization performance. Together, these measures provide a clear picture of the advantages and disadvantages of the different models.</p><p>The remainder of the paper is organized as follows. We introduce the ELM including Tikhonov regularization in the output learning in Section 2. Then we add recurrent connections to increase feature complexity in Section 3, which results in greater capacity of the network and enhanced performance. Not unexpectedly, we observe a trade-off with respect to the risk of overfitting. In Section 4 we investigate the influence of IP pre-training on the mapping properties of ELMs and show that IP results in proper input-specific regularization. Here the trade-off is against the risk of poor approximation when regularizing too much. We proceed in Section 5 to show synergy effects between IP feature regularization and recurrence when applying the IP learning rule to recurrently enhanced ELM networks. Whereas IP simplifies the feature pool and tunes the neurons to a good regime, recurrent connections introduce nonlinear mixtures and thereby avoid ending up with an overly simple feature set. We show experimentally that these two processes balance each other such that we obtain complex but IP-regularized features with reduced overfitting. As a result, we obtain input-tuned reservoir networks that are less dependent on the random initialization and less sensitive to the choice of the output regularization parameter. We confirm this in experiments, where we observe consistently good performance over a wide range of network initialization and learning parameters.</p></sec><sec id="s2"><title>2. Baseline: The Extreme Learning Machine</title><p>In 2004, Huang et al. 
introduced the extreme learning machine (ELM) [<xref ref-type="bibr" rid="scirp.22033-ref4">4</xref>], a three-layer feed-forward neural network with a high-dimensional hidden layer providing a random projection of the input through fixed random weights (see Figures 1(a) and 3). Learning is reduced to computing a simple generalized inverse by linear regression. ELMs thus train much faster than traditionally trained backpropagation networks, and even performed better on most of the tasks reported in [<xref ref-type="bibr" rid="scirp.22033-ref5">5</xref>]. It has also been shown in [<xref ref-type="bibr" rid="scirp.22033-ref5">5</xref>] that a randomly created ELM with hidden layer size R is able to perform any mapping consisting of R observations. ELMs are thus in theory universal function approximators, if permitting an arbitrary number of training samples and any hidden-layer size.</p><p>The activations of the ELM input, hidden and output neurons are denoted by x, h and y, respectively (see <xref ref-type="fig" rid="fig1">Figure 1</xref>(a)). The connection strengths are collected in the matrices W<sup>inp</sup> and W<sup>out</sup> denoting the input and read-out weights. We consider parametrized activation functions</p><p><img src="9-9601144\0ae76da6-0bbd-41f4-a9f7-840429ff8ac0.jpg" />where</p><p>x is the total activation of each hidden neuron h<sub>r</sub> for input x and D is the input dimension. We denote a<sub>r</sub> as the slope and b<sub>r</sub> as the bias of the activation function f<sub>r</sub>(&#183;). The output y of an ELM is</p><disp-formula id="scirp.22033-formula155439"><label>(1)</label><graphic position="anchor" xlink:href="9-9601144\91beee0a-bb5a-48bc-b1f3-df2119209758.jpg"  xlink:type="simple"/></disp-formula><p>The key idea of the ELM approach is to restrict learning to the linear readout layer. All other network parameters, i.e. 
the input weights W<sup>inp</sup> and the activation function parameters a, b stay fixed after initialization of the network.</p><p>The ELM is trained on a set of training examples<img src="9-9601144\88256ad4-e831-4b95-b7bb-beaf440a405d.jpg" />, n = 1, &#183;&#183;&#183;, N<sub>tr</sub> by minimizing the mean squared error</p><disp-formula id="scirp.22033-formula155440"><label>(2)</label><graphic position="anchor" xlink:href="9-9601144\678dc6db-11d4-4171-bb91-242e9c1d71fd.jpg"  xlink:type="simple"/></disp-formula><p>between the target outputs <img src="9-9601144\c3e285a9-481b-419e-8c72-0fec32cdb0e3.jpg" /> and the actual network output y<sub>n</sub> with respect to the read-out weights W<sup>out</sup>. The minimization reduces to a linear regression task given the fixed parameters and hidden activations h as follows. We collect the network’s states h<sub>n</sub> as well as the desired output targets y<sub>n</sub> in a state matrix H = (h<sub>1</sub>, &#183;&#183;&#183;, h<sub>Ntr</sub>) and a target matrix <img src="9-9601144\14bea415-711e-46d9-a95d-dcbbd3eb2afe.jpg" /> for all n = 1, &#183;&#183;&#183;, N<sub>tr</sub>, respectively. The minimizer is the least squares solution</p><disp-formula id="scirp.22033-formula155441"><label>(3)</label><graphic position="anchor" xlink:href="9-9601144\81f853e5-6a67-4b48-8bdf-a035817da3e9.jpg"  xlink:type="simple"/></disp-formula><p>where H<sup>†</sup> is the pseudo-inverse of the matrix H.</p><sec id="s2_1"><title>2.1. Model Selection for the ELM</title><p>The ELM approach is appealing because of its apparently efficient and simple training procedure [<xref ref-type="bibr" rid="scirp.22033-ref5">5</xref>] and it has been claimed that “apart from selecting the number of hidden nodes, no other control parameters have to be manually chosen” ([<xref ref-type="bibr" rid="scirp.22033-ref30">30</xref>] p. 1411). 
However, this claim is based on the assumption that either very large data sets are used as in [5, 30, 31] or the network size is explicitly chosen to be much smaller than the number of training samples (e.g. in [<xref ref-type="bibr" rid="scirp.22033-ref12">12</xref>] pp. 1355-1356). In contrast, in practical applications training data can be very expensive, e.g. in tasks involving robots, and it can also be undesirable to limit the hidden layer size R to a small fraction of the number of training samples N<sub>tr</sub>, because then the network suffers from poor approximation abilities. This is illustrated in <xref ref-type="fig" rid="fig4">Figure 4</xref> (R = 5, N<sub>tr</sub> = 50), where we show the dependency of the ELM’s generalization ability on the random distribution of the input weights W<sup>inp</sup>, the network size R and the biases b on the Mexican hat regression task (cf. Section C.2 for this often employed illustrative task). In such cases, model selection becomes an important issue since the generalization ability depends strongly on the choice of the model’s parameters, e.g. output regularization or network size.</p><sec id="s2_1_1"><title>2.1.1. Output Regularization</title><p>Since the ELM is based on the empirical risk minimization principle [<xref ref-type="bibr" rid="scirp.22033-ref32">32</xref>], it tends to over-fit the data, particularly if the task does not comprise many training samples. In the original ELM approach, over-fitting is prevented by implicit regularization: either by providing a large number of training samples (see <xref ref-type="fig" rid="fig4">Figure 4</xref>, R = 20, N<sub>tr</sub> = 1000) or by using small network sizes. Assuming noise in the data, it is well known that this is equivalent to some level of output regularization [33,34]. It is therefore natural to consider output regularization directly as a more appropriate technique for arbitrary network and training data sizes, e.g. as in [27,35]. 
As a state-of-the-art method, Tikhonov regularization ([<xref ref-type="bibr" rid="scirp.22033-ref24">24</xref>], Section A) can be used as in [<xref ref-type="bibr" rid="scirp.22033-ref27">27</xref>]; it is also a standard method for reservoir networks, which are introduced in Section 3. It introduces a regularization parameter ε in the error function</p><p><img src="9-9601144\e5b5904e-ad41-4912-bffc-736c0241b4eb.jpg" /><img src="9-9601144\c94b7331-39ea-4d44-9ac7-0011c07c5d67.jpg" /> (4)</p><p>and the regularized minimizer then becomes</p><disp-formula id="scirp.22033-formula155442"><label>(5)</label><graphic position="anchor" xlink:href="9-9601144\b276f6b0-9060-442d-a4e0-c07c950ed5d9.jpg"  xlink:type="simple"/></disp-formula><p>which is, as a side effect, also numerically more stable because of the better-conditioned matrix inverse. A suitable regularization parameter ε needs to be chosen carefully. Overly strong regularization, i.e. too large ε, can result in poor performance, because it limits the effective model complexity inappropriately [<xref ref-type="bibr" rid="scirp.22033-ref34">34</xref>]. On the other hand, too small a value of ε does not prevent over-fitting. This is a typical model selection problem also for the ELM. The parameter ε must be determined by line search after definition of a suitable validation set, which is computationally costly.</p></sec><sec id="s2_1_2"><title>2.1.2. Finding the Right Initialization Ranges</title><p>In the ELM paradigm, a typical heuristic is to scale the data to [–1, 1] and to set the activation function parameters a to one [<xref ref-type="bibr" rid="scirp.22033-ref5">5</xref>]. Then, allowing an arbitrarily large number R of hidden neurons, manual tuning of the input weights W<sup>inp</sup> or the activation function parameters a, b is not needed, because a random initialization of these parameters is sufficient to create a rich feature set. 
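</p><p>The setup described so far can be illustrated by a minimal sketch: a hidden layer with fixed random input weights and biases, slopes set to one, and a Tikhonov-regularized linear readout. The sketch rests on assumptions that are not taken from the paper: the Fermi function as activation, uniform initialization ranges (w_range, b_range), a single output neuron, a sum-of-squares error with a penalty of ε times the squared norm of the output weights, a row-wise sample layout of the state matrix, and all function names, which are hypothetical.</p>
<preformat>
```python
import math
import random


def fermi(z):
    """Logistic (Fermi) activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def hidden_state(x, W_inp, a, b):
    """Hidden activations h_r = f(a_r * (W_inp x)_r + b_r) for one input x."""
    return [fermi(a[r] * sum(w * xi for w, xi in zip(W_inp[r], x)) + b[r])
            for r in range(len(W_inp))]


def solve(A, B):
    """Solve A w = B (B is a vector) by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [list(A[i]) + [B[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n] for row in M]


def train_elm(X, Y, R=20, w_range=10.0, b_range=10.0, eps=1e-5, seed=0):
    """Random hidden layer plus ridge-regression readout (scalar output)."""
    rng = random.Random(seed)
    D = len(X[0])
    # Assumed uniform initialization; these stay fixed after this point.
    W_inp = [[rng.uniform(-w_range, w_range) for _ in range(D)] for _ in range(R)]
    a = [1.0] * R                                   # slopes fixed to one
    b = [rng.uniform(-b_range, b_range) for _ in range(R)]
    H = [hidden_state(x, W_inp, a, b) for x in X]   # one row per training sample
    # Normal equations of the regularized problem: (H^T H + eps * I) w = H^T y
    A = [[sum(h[i] * h[j] for h in H) + (eps if i == j else 0.0)
          for j in range(R)] for i in range(R)]
    B = [sum(h[i] * y for h, y in zip(H, Y)) for i in range(R)]
    return W_inp, a, b, solve(A, B)


def predict(x, model):
    """Network output: linear readout of the hidden state."""
    W_inp, a, b, w_out = model
    return sum(w * h for w, h in zip(w_out, hidden_state(x, W_inp, a, b)))
```
</preformat>
<p>Training such a model for different values of R, eps, w_range and b_range on a small data set is one way to observe the sensitivities discussed in this section.</p><p>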
In practice, the hidden layer size is limited and the performance does indeed depend on the hyper-parameters controlling the distributions of the initialization, at least of the input weights and the biases b. Very small weights result in approximately linear neurons with no contribution to the approximation capability, whereas large weights drive the neurons into saturation, resulting in a binary encoding. This is illustrated in <xref ref-type="fig" rid="fig4">Figure 4</xref> (R = 20, N<sub>tr</sub> = 50), where we vary the initialization range of the input weights and of the biases b (the latter in <xref ref-type="fig" rid="fig4">Figure 4</xref>(b)). Apparently, the choice of scaling matters.</p></sec><sec id="s2_1_3"><title>2.1.3. The Network Size Matters</title><p>Finally, the number R of hidden neurons plays a central role and several techniques have been investigated to automatically adapt the hidden layer size. The error-minimized extreme learning machine [<xref ref-type="bibr" rid="scirp.22033-ref21">21</xref>] and the incremental extreme learning machine [<xref ref-type="bibr" rid="scirp.22033-ref22">22</xref>] are methods which add random neurons to the ELM. In contrast, the optimally pruned extreme learning machine [<xref ref-type="bibr" rid="scirp.22033-ref23">23</xref>] pursues the idea of improving ELMs by decreasing the size of the hidden layer. All of these methods introduce considerable computational load.</p><p>In summary, the performance of the ELM on a broader range of tasks depends on a number of choices in model selection: the network size, the output regularization (or the equivalent in choosing a respective task), and the hyper-parameters for initialization. Methods to reduce sensitivity of the performance to these parameters are therefore highly desirable.</p></sec></sec></sec><sec id="s3"><title>3. 
Reservoir Networks as Natural Extension of the ELM</title><p>Adding recurrent connections to the hidden layer of an ELM converts it into a corresponding reservoir network<sup>1</sup> (RN) (see the machine learning view on RNs in <xref ref-type="fig" rid="fig5">Figure 5</xref>). The RN can be used for static mapping tasks by considering the converged attractor state as encoding of the input (for more details see Section B). Output regression with regularization is then applied as described in the previous section. In [<xref ref-type="bibr" rid="scirp.22033-ref18">18</xref>] and [<xref ref-type="bibr" rid="scirp.22033-ref19">19</xref>] this approach has been motivated by showing that for static mappings the important information is represented in the reservoir’s attractor states and in [17,19,36] it has been applied successfully. To gain insight into how and why the respective random projections work in these models, we compare an ELM and the corresponding reservoir network on the same tasks. We argue that the additional mixing effect of the recurrence enhances model complexity. The hypothesized effect can be visualized and evaluated on three levels: for the single feature, the learned function, and with respect to the task performance.</p><sec id="s3_1"><title>3.1. Recurrence Enhances Feature Complexity by Nonlinear Mixtures</title><p>We first consider the level of a single neuron and the feature it computes in a given architecture. We define such a feature F<sub>r</sub> as the response of the r-th reservoir neuron h<sub>r</sub> to the full range of possible inputs <img src="9-9601144\d15d826b-1244-483a-82a9-930e363999d7.jpg" /> from the network’s input space<img src="9-9601144\159653ad-42f2-4ecf-ae2e-62e6c30b94ac.jpg" />:</p><p><img src="9-9601144\a5102dbc-622b-4e4f-833d-0f31fec56d73.jpg" /></p><p>where <img src="9-9601144\df556ff4-3cf9-41e0-bc0b-4dfe30185a5a.jpg" /> denotes the network’s converged attractor state (cf. Section B). 
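</p><p>As a rough illustration of this definition, the following sketch computes a feature F<sub>r</sub> by iterating a small reservoir to convergence for each input of a grid. The state update used here, h = f(a (W<sup>inp</sup> x + W<sup>rec</sup> h) + b) with the input held fixed, is an assumption, since the attractor dynamics are specified in Section B and not in this section. The initialization ranges, the convergence tolerance and all function names are likewise hypothetical. The recurrent weights are kept small so that the iteration settles on a single attractor; for larger weights convergence is not guaranteed.</p>
<preformat>
```python
import math
import random


def fermi(z):
    """Logistic (Fermi) activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def random_reservoir(R, D, w_range=5.0, rec_range=0.2, b_range=2.0, seed=0):
    """Hypothetical initialization: uniform weights and biases, slopes a_r = 1."""
    rng = random.Random(seed)
    W_inp = [[rng.uniform(-w_range, w_range) for _ in range(D)] for _ in range(R)]
    W_rec = [[rng.uniform(-rec_range, rec_range) for _ in range(R)] for _ in range(R)]
    a = [1.0] * R
    b = [rng.uniform(-b_range, b_range) for _ in range(R)]
    return W_inp, W_rec, a, b


def attractor_state(x, W_inp, W_rec, a, b, tol=1e-8, max_iter=1000):
    """Iterate h = f(a * (W_inp x + W_rec h) + b) with the input x held fixed."""
    R = len(W_rec)
    drive = [sum(w * xi for w, xi in zip(W_inp[r], x)) for r in range(R)]
    h = [0.0] * R
    for _ in range(max_iter):
        h_new = [fermi(a[r] * (drive[r] + sum(w * hj for w, hj in zip(W_rec[r], h)))
                       + b[r]) for r in range(R)]
        change = max(abs(u - v) for u, v in zip(h_new, h))
        h = h_new
        if tol >= change:      # state no longer moves: treat it as the attractor
            break
    return h


def feature(r, inputs, W_inp, W_rec, a, b):
    """F_r sampled on a list of input vectors: attractor response of neuron r."""
    return [attractor_state(x, W_inp, W_rec, a, b)[r] for x in inputs]
```
</preformat>
<p>Setting all recurrent weights to zero reduces attractor_state to the feed-forward hidden state of the corresponding ELM, so the same function can be used to compare both feature types on a common input grid.</p><p>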
The feature can easily be visualized, e.g. as in <xref ref-type="fig" rid="fig6">Figure 6</xref>, which shows features of an ELM and a corresponding RN for the reference example of the Mexican hat data set (cf. Section C.2). For the ELM, the features are completely determined by the activation function parameters a and b of the corresponding neuron. Regardless of the specific choice of the activation function parameters, the set of possible features in an ELM (top row) is quite restricted, namely to monotonically increasing or decreasing functions: standard sigmoid functions (left), stretched or compressed shifted sigmoid functions (middle), which can approximate linear or even constant behavior (right) for an appropriate parameter choice. In contrast, recurrent connections in a corresponding reservoir network (bottom row) extend the feature spectrum to more complex functions with possibly several local optima. Even weak recurrence with small weights gives this effect without any tuning. The effect can be seen by visual inspection but is not easily quantified, and we therefore also consider the network level.</p></sec><sec id="s3_2"><title>3.2. Recurrence Increases the Effective Model Complexity</title><sec id="s3_2_1"><title>3.2.1. The Mean Curvature</title><p>To assess the effective model complexity, we consider the mean curvature (MC) of the network’s output function, which directly evaluates a property of the learned model. On the one hand, this measure is closely connected to the output regularization introduced in Section A. Typical choices for regularization functionals in (9) penalize high curvatures such as strong oscillations. The network’s effective model complexity is reduced [<xref ref-type="bibr" rid="scirp.22033-ref33">33</xref>] and the network’s output function becomes smooth through the regularized learning. On the other hand, the number of features available for learning, i.e. 
the network’s hidden layer size, also influences the model complexity. A small number of features decreases the model complexity and implements a kind of input regularization.</p><p>For these reasons, we measure the MC while decreasing the effective model complexity by either increasing the regularization parameter ε of the output regularization or decreasing the network size R, and we expect qualitatively similar trends when varying either model selection parameter. Experiments are performed on the Mexican hat task and the default initialization parameters are shown in Section C.1. Due to the stochastic nature of parameter initialization, we average the MC over 30 networks and test each ELM and the corresponding RN for comparison.</p><p>The results shown in <xref ref-type="fig" rid="fig7">Figure 7</xref> (left) reveal the expected behavior: too small a network size or too strong an output regularization decreases the mean curvature below the necessary baseline level given by the MC of the target function, which is displayed with the dotted line. The target function cannot be approximated in this case. At the other extreme, no regularization or very large network sizes result in an MC that is larger than the MC of the target function. This is an indication of overfitting. We also find that the ELM and the corresponding RN have very similar MCs, except for the unregularized case, where the RN overfits more strongly. This is expected, because the more complex features of the RN provide a larger model complexity, which is favorable if the network size is limited. Note that the results for varying network size use a regularization of ε = 10<sup>–5</sup>, which is close to optimal and as such already prevents overfitting quite well. Conversely, the results for varying ε are given for a network size of N = 100, which is clearly suitable for the task. 
This once more underlines that model selection and regularization are important issues.</p></sec><sec id="s3_2_2"><title>3.2.2. The Task Performance</title><p>From the above, we expect that measuring task performance on training and test data displays a typical overfitting pattern. For small networks or too strong regularization, training and test performance are poor; for decreasing regularization and for larger network size the test error reaches a minimum and then starts increasing, while the training error keeps decreasing. This is exactly the case in <xref ref-type="fig" rid="fig7">Figure 7</xref> (bottom). We observe the same pattern for the RN with increasing network size; the ELM, however, does not overfit even for large networks if properly regularized. That is due to the limited complexity of its features and underlines the increased modeling power of the RN, which is caused by the non-linear mixing of features and also leads to a significantly better test performance. We therefore have to trade model complexity and better performance against the risk of overfitting when moving from ELM to RN.</p></sec></sec><sec id="s3_3"><title>3.3. Recurrence Enhances the Spatial Encoding of Static Inputs</title><p>The results of the last section show the higher complexity of the RN in comparison to the ELM, which is caused by the non-linear mixing of features. While the exact class of features which is thereby produced is unknown, [<xref ref-type="bibr" rid="scirp.22033-ref20">20</xref>] introduced an approach to analyze how the inputs are represented in RNs compared to the corresponding ELMs. 
It is based on considering the hidden state representation <img src="9-9601144\a7ff36e8-1cba-45eb-b633-5f7db24bec86.jpg" /> and measuring the cumulative energy content:</p><p><img src="9-9601144\3fb57834-72b6-4bab-a3d4-02a444ea5e3a.jpg" /></p><p>Here λ<sub>1</sub> ≥ &#183;&#183;&#183; ≥ λ<sub>R</sub> ≥ 0 are the eigenvalues of the covariance matrix <img src="9-9601144\c368341e-dffe-4618-af0e-d4788070f951.jpg" /> corresponding to the principal components (PCs) of the network’s attractor or hidden state distribution<img src="9-9601144\cf42ab4b-6202-42c7-9f7d-ee8cf3100738.jpg" />. In principle, the cumulative energy content measures the increased dimensionality of the hidden data representation <img src="9-9601144\759cf385-05f3-482f-898c-f9c99e87dd82.jpg" /> compared to the dimensionality D of the input data x. The case of g(D) &lt; 1 implies a shift of the input information to additional PCs, because the encoded data then spans a space with more than D latent dimensions. If g(D) = 1, no information content shift occurs, which is true for any linear transformation of data. The experiments conducted with several data sets from the UCI repository [<xref ref-type="bibr" rid="scirp.22033-ref37">37</xref>] showed that the cumulative energy content g(D) of the first D PCs of the attractor distribution is significantly lower for reservoir networks than for ELMs (see <xref ref-type="fig" rid="fig8">Figure 8</xref>). That is, a reservoir network redistributes more information in the input data onto the remaining R-D PCs than the feedforward ELM. This effect, which is only due to the recurrent connections and the respective mixing of features, shows that RNs inherently hold a higher-dimensional hidden data representation, which can be advantageous for the separability of input patterns and can thus increase learning performance, e.g. on classification tasks.</p></sec></sec><sec id="s4"><title>4. 
Feature Regularization with Intrinsic Plasticity</title><p>In the previous section, we have shown that overfitting can occur when using an ELM and is even stronger when a corresponding RN with its richer feature set is used. Output regularization can counteract this effect but requires proper tuning of the regularization parameter. Hence, we propose a different route: directly tuning the features of an ELM and the corresponding RN with respect to the input. A machine learning view on this idea is visualized in <xref ref-type="fig" rid="fig9">Figure 9</xref>. We adapt the parameters of the non-linear functions in the hidden layer by means of an unsupervised learning rule called intrinsic plasticity (IP). IP is biologically motivated and was first introduced in [<xref ref-type="bibr" rid="scirp.22033-ref29">29</xref>]. The idea to use IP for ELM and RN is motivated by previous work [<xref ref-type="bibr" rid="scirp.22033-ref38">38</xref>,<xref ref-type="bibr" rid="scirp.22033-ref39">39</xref>], where IP was shown to provide robustness against variations of both the weights and the learning parameters. We show that IP in our context works as an input regularization mechanism. Again, we analyze the resulting networks on all three levels: with respect to feature complexity, by means of the MC, and by evaluating task performance.</p><sec id="s4_1"><title>4.1. Intrinsic Plasticity Revisited</title><p>Intrinsic Plasticity (IP) was developed by Triesch in 2004 [<xref ref-type="bibr" rid="scirp.22033-ref29">29</xref>] as a model of homeostatic plasticity for analog neurons with a Fermi function as activation. Its goal is to optimize the information transmission of a single neuron strictly locally by adapting the slope a and bias b of the Fermi function such that the neuron’s outputs h become exponentially distributed. 
IP learning can be derived by minimizing the Kullback-Leibler divergence D(f<sub>h</sub>, f<sub>exp</sub>) between the output distribution f<sub>h</sub> and an exponential distribution f<sub>exp</sub>:</p><disp-formula id="scirp.22033-formula155443"><label>(6)</label><graphic position="anchor" xlink:href="9-9601144\df35fe21-6fae-42ad-a074-76b1e7db7e60.jpg"  xlink:type="simple"/></disp-formula><p>where H(h) denotes the entropy and E(h) the expectation value of the output distribution. In fact, minimization of D(f<sub>h</sub>, f<sub>exp</sub>) in Eq. (6) for a fixed E(h) is equivalent to entropy maximization of the output distribution. For small mean values μ of the target exponential distribution, e.g. μ ≈ 0.2, the neuron is forced to respond strongly to only a few input stimuli. The following online update equations for slope and bias, scaled by the step width η<sub>IP</sub>, are obtained:</p><p><img src="9-9601144\2392df23-b739-4d18-b25a-aa08d8ff94aa.jpg" /> (7)<img src="9-9601144\dcfbe243-8a1f-446f-81b0-7324a0b3a492.jpg" /></p><p>The only quantities used to update the neuron’s non-linear transfer function are s, the synaptic sum arriving at the neuron, the firing rate h, and its squared value h<sup>2</sup>. Since IP is an online learning algorithm, training is organized in epochs: for a pre-defined number of training epochs, the network is fed with the entire training data and each hidden neuron is adapted to the network’s current input separately. Within the ELM paradigm, IP is used as a pretraining algorithm to optimize the hidden layer features before output regression is applied.</p></sec><sec id="s4_2"><title>4.2. Regulating ELM Complexity through Intrinsic Plasticity</title><sec id="s4_2_1"><title>4.2.1. IP and Feature Complexity</title><p>Since IP adapts the parameters a and b of the hidden neurons’ activation function, it directly influences the features generated by an ELM. 
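</p>
<p>To make the adaptation of a and b concrete, the following minimal Python sketch applies an online IP rule to a single Fermi neuron. It assumes the commonly cited form of the gradient rule from [<xref ref-type="bibr" rid="scirp.22033-ref29">29</xref>] for an exponential target distribution with mean μ; the notation of Eq. (7) may differ. The function names, the Gaussian toy inputs, the step width and the number of epochs are illustrative assumptions and not the settings of Section C.1.</p>
<preformat>
```python
import math
import random


def fermi(s, a, b):
    """Fermi (logistic) activation with slope a and bias b."""
    return 1.0 / (1.0 + math.exp(-(a * s + b)))


def ip_step(s, a, b, mu=0.2, eta=0.001):
    """One online IP update for a single neuron; returns the new (a, b).

    Assumed rule (gradient form for an exponential target with mean mu):
        delta_b = eta * (1 - (2 + 1/mu) * h + h^2 / mu)
        delta_a = eta / a + s * delta_b
    Only s, h and h^2 enter the update.
    """
    h = fermi(s, a, b)
    delta_b = eta * (1.0 - (2.0 + 1.0 / mu) * h + (h * h) / mu)
    delta_a = eta / a + s * delta_b
    return a + delta_a, b + delta_b


def ip_pretrain(inputs, a=1.0, b=0.0, mu=0.2, eta=0.001, epochs=50):
    """Present the whole data set once per epoch, updating after each sample."""
    for _ in range(epochs):
        for s in inputs:
            a, b = ip_step(s, a, b, mu, eta)
    return a, b


# Toy data: synaptic sums drawn from a standard normal (an assumption).
random.seed(0)
inputs = [random.gauss(0.0, 1.0) for _ in range(500)]

a, b = ip_pretrain(inputs)
mean_before = sum(fermi(s, 1.0, 0.0) for s in inputs) / len(inputs)
mean_after = sum(fermi(s, a, b) for s in inputs) / len(inputs)
print("mean output before IP:", round(mean_before, 3))
print("mean output after IP: ", round(mean_after, 3))
```
</preformat>
<p>In this sketch the mean output moves from about 0.5 towards the target mean μ, which illustrates how IP reshapes each feature with respect to the input distribution before any readout is trained.</p>
<p>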
<xref ref-type="fig" rid="fig10">Figure 10</xref> visualizes the development of the shape of the network’s features during IP training for one-dimensional inputs, as was done in Section 3. The left plot in <xref ref-type="fig" rid="fig10">Figure 10</xref> shows a collection of features for a randomly initialized ELM. The features are distributed over the whole range of inputs. Through IP pretraining, the variety in the set of features is reduced (see <xref ref-type="fig" rid="fig10">Figure 10</xref>(b)), until the extreme case of only two features is reached (see <xref ref-type="fig" rid="fig10">Figure 10</xref>(c)).</p></sec><sec id="s4_2_2"><title>4.2.2. IP and the Effective Model Complexity</title><p>On the network level, we evaluate model complexity again by means of the MC and the network performance on the Mexican hat regression task. We apply readout learning after each epoch to monitor the impact of IP on these measures over epochs. Learning and initialization parameters are collected in Section C.1. For illustration we choose the size of the ELMs’ hidden layer as R = 100 and the number of samples used for training as N<sub>tr</sub> = 50 such that the ELM is prone to overfitting and the effect of regularization can be observed clearly. The results are shown in <xref ref-type="fig" rid="fig11">Figure 11</xref>.</p><p><xref ref-type="fig" rid="fig11">Figure 11</xref>(a) shows that the MC decreases with more IP epochs and thus exhibits a typical regularization behavior, qualitatively similar to the dependency on the output regularization shown before in <xref ref-type="fig" rid="fig7">Figure 7</xref> (bottom left). The optimal MC of the Mexican hat function is reached at about 300 IP epochs; more IP epochs further reduce the curvature such that no proper approximation of the target function is possible. 
Note that, in contrast to networks with output regularization, the MC does not drop all the way to zero.</p><p>The task performance shown in <xref ref-type="fig" rid="fig11">Figure 11</xref>(b) confirms that IP pretraining has the typical characteristics of a regularization mechanism. Low regularization strength (few IP epochs) results in a low training error but a high test error; overfitting occurs. In contrast, too strong regularization (too many IP epochs) results in degenerate behavior, indicated by simultaneously high training and test errors. The optimal regularization strength lies between these two regimes.</p><p>In <xref ref-type="fig" rid="fig11">Figure 11</xref>(c) we further analyze the performance by decomposing the errors into integrated squared bias and integrated variance during IP training (cf. Section A). The variance of the outputs decreases with the number of IP epochs, while the bias is first constant and then increases rapidly when the model complexity starts to degenerate. The observed trade-off between these quantities indicates the similarity to regularization processes [<xref ref-type="bibr" rid="scirp.22033-ref25">25</xref>].</p><p>Finally, we plot 30 trained ELMs each for no IP, a medium number of IP epochs, and too many IP epochs in <xref ref-type="fig" rid="fig12">Figure 12</xref>.</p><p>The ELMs without IP training (a) clearly show the typical oscillations due to overfitting; a suitable number of IP pretraining epochs (b) leads to consistently good results, whereas too long IP pretraining (c) tends to reduce the model complexity inappropriately so that the mapping is no longer approximated accurately. The corresponding sets of features are shown in <xref ref-type="fig" rid="fig10">Figure 10</xref>.</p><p>The experiments in this section clearly reveal the regularizing nature of IP as a task-specific feature regularization for ELMs.</p></sec></sec></sec><sec id="s5"><title>5. 
Intrinsic Plasticity in Combination with Recurrence</title><p>We now show that the combination of recurrence and IP can achieve a balance between task-specific regularization by means of IP and a large modeling capability by means of recurrence. While this is interesting from a theoretical point of view, it turns out that this combination also strongly enhances the robustness of the performance with respect to other model selection parameters and eases the burden of performing grid search or other optimization of those. To obtain results comparable to the experiments performed in the last sections, we add recurrent connections to the hidden layer of the ELMs to obtain the corresponding RN (see <xref ref-type="fig" rid="fig13">Figure 13</xref> for an illustration of the corresponding machine learning viewpoint). Recurrent weights are randomly drawn from a uniform distribution in [–1, 1] with a density of ρ = 0.1. Only attractor states are used for IP learning, i.e. the networks are iterated until convergence for constant input as described by Alg. 1 in Section B before applying the IP learning step given by (7). We again analyze feature complexity, MC, and the performance in turn.</p><sec id="s6_0_1"><title>Feature Complexity</title><p><xref ref-type="fig" rid="fig14">Figure 14</xref> illustrates the development of the features of a reservoir network during IP training. Due to the addition of the recurrent weights, the features are no longer purely sigmoid. As observed in Section 4.2, the features become similar and input specific during IP training, but in contrast to ELMs (compare to <xref ref-type="fig" rid="fig10">Figure 10</xref> in Section 4.2) the recurrent features stay complex even after extensive IP training.</p></sec><sec id="s6_0_2"><title>Network Complexity</title><p>We repeat the experiments from the previous Section 4 with the corresponding reservoir networks instead of ELMs. The network settings are given in Section C.1. 
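</p>
<p>The attractor-based evaluation of the reservoir used in these experiments can be sketched as follows. This is a minimal illustration and not a reproduction of Alg. 1 in Section B: the update h ← σ(W<sup>in</sup>x + W<sup>rec</sup>h), the zero initial state, the stopping tolerance, the network size R = 20 and the fully connected input weights are assumptions made for this example. Only the recurrent weights follow the text (uniform in [–1, 1] with density 0.1).</p>
<preformat>
```python
import math
import random


def fermi(z):
    """Fermi (logistic) function; slope 1 and bias 0 are assumed here."""
    return 1.0 / (1.0 + math.exp(-z))


def sparse_uniform_matrix(rows, cols, density, rng):
    """Entries are uniform in [-1, 1] with probability `density`, else zero."""
    return [[rng.uniform(-1.0, 1.0) if density > rng.random() else 0.0
             for _ in range(cols)]
            for _ in range(rows)]


def attractor_state(x, w_in, w_rec, tol=1e-8, max_steps=1000):
    """Iterate the hidden state for a constant input x until it stops changing.

    Returns (state, converged). With w_rec set to zero this reduces to the
    feedforward ELM feature map.
    """
    size = len(w_rec)
    # The input drive is constant, so it is computed once.
    drive = [sum(w * xi for w, xi in zip(row, x)) for row in w_in]
    h = [0.0] * size
    for _ in range(max_steps):
        h_new = [fermi(drive[i] + sum(w * hj for w, hj in zip(w_rec[i], h)))
                 for i in range(size)]
        change = max(abs(new - old) for new, old in zip(h_new, h))
        h = h_new
        if tol > change:
            return h, True
    return h, False


rng = random.Random(0)
R, D = 20, 1                                    # illustrative sizes
w_in = sparse_uniform_matrix(R, D, 1.0, rng)    # dense input weights (assumed)
w_rec = sparse_uniform_matrix(R, R, 0.1, rng)   # density 0.1 as in the text

state, converged = attractor_state([0.5], w_in, w_rec)
print("converged:", converged)
```
</preformat>
<p>The returned attractor state takes the place of the ELM hidden state, both as the input to the IP learning step and as the feature vector for the readout.</p>
<p>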
The MC development with respect to IP training of the reservoir networks is illustrated in <xref ref-type="fig" rid="fig15">Figure 15</xref> (a). Similar to the ELMs (cf. <xref ref-type="fig" rid="fig11">Figure 11</xref>), the curvature of the RNs’ output function decreases in the first epochs, but then stays close to the curvature of the target function without dropping to small values. This indicates that the regularizing effect of IP and the added complexity from recurrence balance each other well, in contrast to the ELM experiments, where the output curvature falls significantly below the mean task curvature when regularizing too strongly, for both output and feature regularization.</p><p>Figures 15(b) and (c) show the performance and the bias/variance decomposition. The behavior of the networks shows similar characteristics to that of the ELMs under the influence of IP (compare to <xref ref-type="fig" rid="fig11">Figure 11</xref>): stronger regularization, implemented by longer IP pretraining, increases the generalization ability, as indicated by a lower test error and decreasing variance. Hence, reservoir networks still profit from the feature regularization. But in contrast to the results obtained for the ELMs, bias and test error do not increase for many IP epochs, i.e. no degeneration of the networks is observed. Evidently, the recurrent connections maintain the networks’ high mapping capabilities even in the presence of strong regularization through IP.</p></sec><sec id="s6_1"><title>5.1. Increased Model Complexity for More Complex Tasks</title><p>In previous sections, we used synthetic data and a rather simple one-dimensional task to clearly state and illustrate the concepts. We now investigate the enhanced intrinsic model complexity, which is due to the addition of recurrent connections, in a more complex function approximation task where the task complexity can be controlled with a single parameter. The target function is a two-dimensional sine function (cf. 
Section C.3), where the frequency ω is proportional to its mean curvature and thus determines the difficulty of the task.</p><p><xref ref-type="fig" rid="fig16">Figure 16</xref> shows the MSE on the training and test set for ELMs and the corresponding RNs, both pretrained with the same number of IP epochs, with respect to increasing frequency ω. The initialization parameters of the networks are stated in Section C.1. As expected, the errors increase with the frequency, and beyond some frequency the networks cannot approximate the function appropriately. This is indicated by a rapid deterioration of the performance, which occurs for the ELMs at ω ≥ 2, whereas the error for the recurrent networks does not increase strongly until ω ≥ 3. This experiment shows that the enhanced mapping capability due to the addition of recurrent connections is preserved despite the IP training of the networks. As a result, IP-trained reservoir networks are suitable for a wider spectrum of task complexities than IP-trained ELMs.</p></sec></sec></body><back><ref-list><title>References</title><ref id="scirp.22033-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Y. Miche, B. Schrauwen and A. Lendasse, “Machine Learning Techniques Based on Random Projections,” In Proceedings of European Symposium on Artificial Neural Networks, Bruges, April 2010, pp. 295-302.</mixed-citation></ref><ref id="scirp.22033-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Y.-H. Pao, G.-H. Park and D. J. Sobajic, “Learning and Generalization Characteristics of the Random Vector Functional-Link Net,” Neurocomputing, Vol. 6, No. 2, 1994, pp. 163-180. 
doi:10.1016/0925-2312(94)90053-1</mixed-citation></ref><ref id="scirp.22033-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">D. S. Broomhead and D. Lowe, “Multivariable Functional Interpolation and Adaptive Networks,” Complex Systems, Vol. 2, No. 1, 1988, pp. 321-355.</mixed-citation></ref><ref id="scirp.22033-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">G.-B. Huang, Q.-Y. Zhu and C.-K. Siew, “Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks,” Proceedings of International Joint Conferences on Artificial Intelligence, Budapest, July 2004, pp. 489-501.</mixed-citation></ref><ref id="scirp.22033-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">G.-B. Huang, Q.-Y. Zhu and C.-K. Siew, “Extreme Learning Machine: Theory and Applications,” Neurocomputing, Vol. 70, No. 1-3, 2006, pp. 489-501.</mixed-citation></ref><ref id="scirp.22033-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">L. P. Wang and C. R. Wan, “Comments on the Extreme Learning Machine,” IEEE Transactions on Neural Networks, Vol. 19, No. 8, 2008, pp. 1494-1495.</mixed-citation></ref><ref id="scirp.22033-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">M. Lukosevicius and H. Jaeger, “Reservoir Computing Approaches to Recurrent Neural Network Training,” Computer Science Review, Vol. 3, No. 3, 2009, pp. 127-149.</mixed-citation></ref><ref id="scirp.22033-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">M. Hermans and B. Schrauwen, “Recurrent Kernel Machines: Computing with Infinite Echo State Networks,” Neural Computation, Vol. 24, No. 6, 2011, pp. 104-133.</mixed-citation></ref><ref id="scirp.22033-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">D. Verstraeten, B. Schrauwen and D. 
Stroobandt, “Reservoir-Based Techniques for Speech Recognition,” International Joint Conference on Neural Networks, Vancouver, 16-21 July 2006, pp. 1050-1053.</mixed-citation></ref><ref id="scirp.22033-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">M. D. Skowronski and J. G. Harris, “Automatic Speech Recognition Using a Predictive Echo State Network Classifier,” Neural Networks, Vol. 20, No. 3, 2007, pp. 414-423. doi:10.1016/j.neunet.2007.04.006</mixed-citation></ref><ref id="scirp.22033-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">E. A. Antonelo, B. Schrauwen and D. Stroobandt, “Event Detection and Localization for Small Mobile Robots Using Reservoir Computing,” Neural Networks, Vol. 21, No. 6, 2008, pp. 862-871.</mixed-citation></ref><ref id="scirp.22033-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">M. Rolf, J. J. Steil and M. Gienger, “Efficient Exploration and Learning of Full Body Kinematics,” IEEE 8th International Conference on Development and Learning, Shanghai, 5-7 June 2009, pp. 1-7.</mixed-citation></ref><ref id="scirp.22033-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">R. F. Reinhart and J. J. Steil, “Reaching Movement Generation with a Recurrent Neural Network Based on Learning Inverse Kinematics for the Humanoid Robot iCub,” Proceedings of IEEE-RAS International Conference on Humanoid Robots, Paris, 7-10 December 2009, pp. 323-330. 
doi:10.1109/ICHR.2009.5379558</mixed-citation></ref><ref id="scirp.22033-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">P. Buteneers, B. Schrauwen, D. Verstraeten and D. Stroobandt, “Real-Time Epileptic Seizure Detection on Intra-Cranial Rat Data Using Reservoir Computing,” Advances in Neuro-Information Processing, Vol. 5506, 2009, pp. 56-63.</mixed-citation></ref><ref id="scirp.22033-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">B. Noris, M. Nobile, L. Piccinini, M. Berti, M. Molteni, E. Berti, F. Keller, D. Campolo and A. Billard, “Gait Analysis of Autistic Children with Echo State Networks,” Workshop on Echo State Networks and Liquid State Machines, Whistler, December 2006.</mixed-citation></ref><ref id="scirp.22033-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">A. F. Krause, B. Bläsing, V. Dürr and T. Schack, “Direct Control of an Active Tactile Sensor Using Echo State Networks,” Human Centered Robot Systems, Vol. 6, 2009, pp. 11-21.</mixed-citation></ref><ref id="scirp.22033-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">M. J. Embrechts and L. Alexandre, “Reservoir Computing for Static Pattern Recognition,” Proceedings of European Symposium on Artificial Neural Networks, Bruges, April 2009, pp. 245-250.</mixed-citation></ref><ref id="scirp.22033-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">X. Dutoit, B. Schrauwen and H. Van Brussel, “Non-Markovian Processes Modeling with Echo State Networks,” Proceedings of European Symposium on Artificial Neural Networks, Bruges, April 2009, pp. 233-238.</mixed-citation></ref><ref id="scirp.22033-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">R. F. Reinhart and J. J. 
Steil, “Attractor-Based Computation with Reservoirs for Online Learning of Inverse Kinematics,” Proceedings of European Symposium on Artificial Neural Networks, Bruges, April 2009, pp. 257-262.</mixed-citation></ref><ref id="scirp.22033-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">C. Emmerich, R. F. Reinhart and J. J. Steil, “Recurrence Enhances the Spatial Encoding of Static Inputs in Reservoir Networks,” Proceedings of the International Conference on Artificial Neural Networks (ICANN), Thessaloniki, September 2010, Vol. 6353, pp. 148-153. 
doi:10.1007/978-3-642-15822-3_19</mixed-citation></ref><ref id="scirp.22033-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">G. Feng, G.-B. Huang, Q. P. Lin and R. Gay, “Error Minimized Extreme Learning Machine with Growth of Hidden Nodes and Incremental Learning,” IEEE Transactions on Neural Networks, Vol. 20, No. 8, 2009, pp. 1352-1357.</mixed-citation></ref><ref id="scirp.22033-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">G.-B. Huang, L. Chen and C.-K. Siew, “Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes,” IEEE Transactions on Neural Networks, Vol. 17, No. 4, 2006, pp. 879-892. doi:10.1109/TNN.2006.875977</mixed-citation></ref><ref id="scirp.22033-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten and A. Lendasse, “OP-ELM: Optimally Pruned Extreme Learning Machine,” IEEE Transactions on Neural Networks, Vol. 21, No. 1, 2009, pp. 158-162. 
doi:10.1109/TNN.2009.2036259</mixed-citation></ref><ref id="scirp.22033-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">A. N. Tikhonov, “Solution of Incorrectly Formulated Problems and the Regularization Method,” W. H. Winston, Washington DC, 1977.</mixed-citation></ref><ref id="scirp.22033-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">S. Geman, E. Bienenstock and R. Doursat, “Neural Networks and the Bias/Variance Dilemma,” Neural Computation, Vol. 4, No. 1, 1992, pp. 1-58.</mixed-citation></ref><ref id="scirp.22033-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">F. Girosi, M. Jones and T. Poggio, “Regularization Theory and Neural Networks Architectures,” Neural Computation, Vol. 7, No. 2, 1995, pp. 219-269. 
doi:10.1162/neco.1995.7.2.219</mixed-citation></ref><ref id="scirp.22033-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">W. Y. Deng, Q. H. Zheng and L. Chen, “Regularized Extreme Learning Machine,” Proceedings of IEEE Symposium on Computational Intelligence and Data Mining, Nashville, 30 March-2 April 2009, pp. 389-395. 
doi:10.1109/CIDM.2009.4938676</mixed-citation></ref><ref id="scirp.22033-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">B. Schrauwen, D. Verstraeten and J. Van Campenhout, “An Overview of Reservoir Computing: Theory, Applications and Implementations,” Proceedings of European Symposium on Artificial Neural Networks, Bruges, 2005, pp. 471-482.</mixed-citation></ref><ref id="scirp.22033-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">J. Triesch, “A Gradient Rule for the Plasticity of a Neuron’s Intrinsic Excitability,” Proceedings of International Conference on Artificial Neural Networks, Warsaw, September 2005, pp. 65-79.</mixed-citation></ref><ref id="scirp.22033-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">N.-Y. Liang, G.-B. Huang, P. Saratchandran and N. Sundararajan, “A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks,” IEEE Transactions on Neural Networks, Vol. 17, No. 6, 2006, pp. 1411-1423.</mixed-citation></ref><ref id="scirp.22033-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">Q. Zhu, A. Qin, P. Suganthan and G.-B. Huang, “Evolutionary Extreme Learning Machine,” Pattern Recognition, Vol. 38, No. 10, 2005, pp. 1759-1763.</mixed-citation></ref><ref id="scirp.22033-ref32"><label>32</label><mixed-citation publication-type="other" xlink:type="simple">V. N. Vapnik, “The Nature of Statistical Learning Theory,” Springer-Verlag Inc., New York, 1995.</mixed-citation></ref><ref id="scirp.22033-ref33"><label>33</label><mixed-citation publication-type="other" xlink:type="simple">C. M. Bishop, “Training with Noise Is Equivalent to Tikhonov Regularization,” Neural Computation, Vol. 7, 1994, pp. 108-116.</mixed-citation></ref><ref id="scirp.22033-ref34"><label>34</label><mixed-citation publication-type="other" xlink:type="simple">C. M. 
Bishop, “Pattern Recognition and Machine Learning,” Springer, New York, 2007.</mixed-citation></ref><ref id="scirp.22033-ref35"><label>35</label><mixed-citation publication-type="other" xlink:type="simple">K. Neumann and J. J. Steil, “Optimizing Extreme Learning Machines via Ridge Regression and Batch Intrinsic Plasticity,” Neurocomputing, 2012, in Press. 
doi:10.1016/j.neucom.2012.01.041</mixed-citation></ref><ref id="scirp.22033-ref36"><label>36</label><mixed-citation publication-type="other" xlink:type="simple">M. Rolf, J. J. Steil and M. Gienger, “Learning Flexible Full Body Kinematics for Humanoid Tool Use,” International Conference on Emerging Security Technologies, Canterbury, 6-7 September 2010, pp. 171-176.</mixed-citation></ref><ref id="scirp.22033-ref37"><label>37</label><mixed-citation publication-type="other" xlink:type="simple">A. Frank and A. Asuncion, “UCI Machine Learning Repository,” Amherst, 2010.</mixed-citation></ref><ref id="scirp.22033-ref38"><label>38</label><mixed-citation publication-type="other" xlink:type="simple">J. J. Steil, “Online Reservoir Adaptation by Intrinsic Plasticity for Backpropagation—Decorrelation and Echo State Learning,” Neural Networks, Vol. 20, No. 3, 2007, pp. 353-364. doi:10.1016/j.neunet.2007.04.011</mixed-citation></ref><ref id="scirp.22033-ref39"><label>39</label><mixed-citation publication-type="other" xlink:type="simple">B. Schrauwen, M. Wardermann, D. Verstraeten, J. J. Steil and D. Stroobandt, “Improving Reservoirs Using Intrinsic Plasticity,” Neurocomputing, Vol. 71, No. 7-9, 2008, pp. 1159-1171.</mixed-citation></ref><ref id="scirp.22033-ref40"><label>40</label><mixed-citation publication-type="other" xlink:type="simple">J. Triesch, “The Combination of STDP and Intrinsic Plasticity Yields Complex Dynamics in Recurrent Spiking Networks,” Proceedings of International Conference on Artificial Neural Networks, Athens, 2006, pp. 647-652.</mixed-citation></ref><ref id="scirp.22033-ref41"><label>41</label><mixed-citation publication-type="other" xlink:type="simple">A. N. Tikhonov and V. Y. Arsenin, “Solutions of Ill-Posed Problems,” Soviet Mathematics—Doklady, Vol. 4, 1963, pp. 1035-1038.</mixed-citation></ref><ref id="scirp.22033-ref42"><label>42</label><mixed-citation publication-type="other" xlink:type="simple">H. 
Jaeger, “Adaptive Nonlinear System Identification with Echo State Networks,” Proceedings of Neural Information Processing Systems, Vancouver, September 2002, pp. 593-600.</mixed-citation></ref><ref id="scirp.22033-ref43"><label>43</label><mixed-citation publication-type="other" xlink:type="simple">H. Jaeger, “The Echo State Approach to Analysing and Training Recurrent Neural Networks,” 2001.</mixed-citation></ref></ref-list></back></article>