<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">AJOR</journal-id><journal-title-group><journal-title>American Journal of Operations Research</journal-title></journal-title-group><issn pub-type="epub">2160-8830</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/ajor.2013.36050</article-id><article-id pub-id-type="publisher-id">AJOR-38855</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  Adaptive Strategies for Accelerating the Convergence of Average Cost Markov Decision Processes Using a Moving Average Digital Filter
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>dilson</surname><given-names>F. Arruda</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Fabrício</surname><given-names>Ourique</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>Federal University of Santa Catarina (Campus Araranguá), Araranguá, Brazil</addr-line></aff><aff id="aff1"><addr-line>Federal University of Rio de Janeiro, Alberto Luiz Coimbra Institute—Graduate School
and Research in Engineering, Rio de Janeiro, Brazil</addr-line></aff><author-notes><corresp id="cor1">* E-mail:<email>fourique@gmail.com(DFA)</email>;<email>fourique@gmail.com(FO)</email>;</corresp></author-notes><pub-date pub-type="epub"><day>24</day><month>10</month><year>2013</year></pub-date><volume>03</volume><issue>06</issue><fpage>514</fpage><lpage>520</lpage><history><date date-type="received"><day>9</day><month>7</month><year>2013</year></date><date date-type="rev-recd"><day>9</day><month>8</month><year>2013</year></date><date date-type="accepted"><day>16</day><month>8</month><year>2013</year></date></history><permissions><copyright-statement>&#169; Copyright 2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
   This paper proposes a technique to accelerate the convergence of the value iteration algorithm applied to discrete average cost Markov decision processes. An adaptive partial information value iteration algorithm is proposed that updates an increasingly accurate approximate version of the original problem with a view to saving computations at the early iterations, when one is typically far from the optimal solution. The proposed algorithm is compared to classical value iteration for a broad set of adaptive parameters and the results suggest that significant computational savings can be obtained, while also ensuring a robust performance with respect to the parameters.
     
 
</p></abstract><kwd-group><kwd>Average Cost</kwd><kwd>Markov Decision Processes</kwd><kwd>Value Iteration</kwd><kwd>Computational Effort</kwd><kwd>Gradient</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>Discrete-time Markov decision processes (MDPs) aim at controlling the dynamics of a stochastic system by means of suitable control actions taken at each possible configuration of the system. At each period, a control action is selected based on the current configuration (state) of the system, which triggers a probabilistic transition to another state in the next decision period, and so on over an infinite time horizon. The objective is to find the best action to be taken at each possible configuration of the system with respect to some prescribed performance measure. From now on, each configuration of the system is referred to as a state of the system.</p><p>An elegant way to find the optimal control actions for each state is provided by the classical value or policy iteration algorithms [1-11]. The value iteration (VI) algorithm is arguably the most popular algorithm, in part because of its simplicity and ease of implementation. In this paper, we introduce an adaptive scheme that reduces the computational effort the VI algorithm requires to converge. We refer to [<xref ref-type="bibr" rid="scirp.38855-ref12">12</xref>] for a study on variants of the policy iteration algorithm.</p><p>We explore a way to accelerate the convergence of value iteration algorithms for average cost MDPs. The rationale is to apply the value iteration algorithm to a sequence of approximate models, which are simpler than the original model and hence require less computation. These models are refined at each new iteration of the algorithm and converge to the exact model within a finite number of iterations, which enables one to retrieve the solution to the original problem at the end of the procedure. 
The rationale is based on a refinement scheme introduced in [<xref ref-type="bibr" rid="scirp.38855-ref13">13</xref>] for linearly convergent algorithms with convergence rates that are known a priori.</p><p>Classical Markov decision process (MDP) results establish that VI algorithms converge, but the rate of convergence is unknown a priori and depends on the system at hand. For more details on the convergence of VI algorithms for average cost MDPs, we refer to [<xref ref-type="bibr" rid="scirp.38855-ref1">1</xref>]. The unknown rate of convergence renders the results in [<xref ref-type="bibr" rid="scirp.38855-ref13">13</xref>] not directly applicable to the studied problem. Earlier results, however, have shown that a significant reduction in the overall computational effort can be attained by a suitable choice of refinement rate [<xref ref-type="bibr" rid="scirp.38855-ref14">14</xref>]. Unfortunately, such a rate is not known a priori and the parameter tuning turns out to be very difficult. Moreover, consistent performance gains over the classical VI algorithm are difficult to obtain, and a poor parameter selection may render the algorithm slower than classical VI. In this paper, we try to overcome this difficulty by introducing an adaptive algorithm that automatically adjusts the refinement rate at each iteration. It employs the error sequence of the algorithm to iteratively estimate the empirical convergence rate, and a moving average digital filter is appended to mitigate the erratic behavior of that sequence. For more details about moving average filters, we refer to [<xref ref-type="bibr" rid="scirp.38855-ref15">15</xref>]. We show that the adaptive algorithm exhibits robust performance, consistently outperforming value iteration.</p></sec><sec id="s2"><title>2. Average Cost Markov Decision Processes</title><p>Markov decision processes comprise a set of states, each representing a possible configuration of the studied system. 
Let <img src="8-1040253\28510492-4e47-4d13-ad4d-98d5d6578c84.jpg" /> be the set of all possible system configurations. For each state<img src="8-1040253\3ce94678-06e7-423b-ad05-1e42047e0a23.jpg" />, there exists a set of possible control actions<img src="8-1040253\b4bf893d-67bc-4fc2-b0aa-95ae9232928c.jpg" />. Each action <img src="8-1040253\3defb807-9b11-4ed3-a7ee-3d294da9ec50.jpg" /> drives the system from state <img src="8-1040253\101f72d8-4cfe-4b1b-9f29-41e98b980062.jpg" /> to state <img src="8-1040253\2e5865d1-fdf7-48ce-aa91-6777ff7200e0.jpg" /> with probability<img src="8-1040253\0eebbb9d-8861-4785-8229-1eb5c2a9b2b9.jpg" />. Since <img src="8-1040253\af592a1a-744f-40a5-84f5-ef688d74d8a1.jpg" /> is a probability for all <img src="8-1040253\5c9bb7e1-1a82-48b3-b755-4be8cf1408c4.jpg" /> and <img src="8-1040253\8cb7c161-3c28-4acd-bcee-43e2ee61132c.jpg" /> in<img src="8-1040253\419be62b-3e8a-492a-aec9-8b68647214cd.jpg" />, we have</p><p><img src="8-1040253\4607c6da-9eac-4870-ab08-0424f0014101.jpg" /></p><p>Let <img src="8-1040253\e20d146f-3420-4ca6-b223-59f7c1ea8070.jpg" /> be the set of all possible control actions and <img src="8-1040253\07057491-0e08-456c-bc6e-3479863b09cb.jpg" /> represent a cost function in the state/action space. When visiting state <img src="8-1040253\978b9e9b-e7cd-432c-8c1a-e39236a8b0b2.jpg" /> and applying a control action<img src="8-1040253\9bdc2b0a-e446-47d4-af7e-c46426a790c6.jpg" />, the system incurs a cost<img src="8-1040253\21902dd9-fff3-4c47-b1cc-57995b0e42c0.jpg" />. A stationary control policy is a mapping from the state space <img src="8-1040253\91d124e1-1fc4-41cd-a689-1df75ce5205a.jpg" /> to the action space <img src="8-1040253\4f1d4982-e9ae-403f-96e8-ca8ca284583f.jpg" /> that defines a single action in <img src="8-1040253\e63f5ffc-7c80-4a4e-9be9-a64395090d67.jpg" /> to be taken each time the system visits state<img src="8-1040253\98d1930e-2ad6-427c-8513-c91d63acab04.jpg" />. 
Let <img src="8-1040253\4b78d8ac-f932-4529-a9bf-32dca558bf9b.jpg" /> denote any particular stationary policy and <img src="8-1040253\0dd8aca4-e5ca-4240-a14d-158b8057b0b2.jpg" /> denote the set of all feasible stationary control policies.</p><p>Once a stationary control policy <img src="8-1040253\313b0ca9-ec4d-4be7-a26f-e69c46a77801.jpg" /> is chosen and applied, the controlled system can be modeled as a homogeneous Markov chain <img src="8-1040253\5fcb2317-b3c0-4037-b566-2156fe5bf23f.jpg" /> [<xref ref-type="bibr" rid="scirp.38855-ref16">16</xref>]. The long-term cost of the system that operates under policy <img src="8-1040253\d9e02f63-2f2c-4fa3-89c7-4642ffafe37e.jpg" /> is given by</p><disp-formula id="scirp.38855-formula140198"><label>(1)</label><graphic position="anchor" xlink:href="8-1040253\5ef5633d-9ca1-479e-868c-fe8a11d7ac29.jpg"  xlink:type="simple"/></disp-formula><p>Under general conditions [<xref ref-type="bibr" rid="scirp.38855-ref1">1</xref>], each control policy <img src="8-1040253\a9f3a02a-8a97-419e-a07b-322329156e79.jpg" /> implies a finite long-term cost <img src="8-1040253\4c621770-3f89-42b3-a10d-5faf613c8828.jpg" />. 
The task of the decision maker is to identify a policy <img src="8-1040253\ca151cd0-620e-4a59-9f4c-9ec210332c7b.jpg" /> that minimizes the long-term average cost, thus satisfying the expression below:</p><disp-formula id="scirp.38855-formula140199"><label>(2)</label><graphic position="anchor" xlink:href="8-1040253\5fedb23c-2e16-4af4-baf8-35da4a87c909.jpg"  xlink:type="simple"/></disp-formula><p>In order to find the optimal policy, one seeks the solution to the Poisson Equation (Average Cost Optimality Equation):</p><disp-formula id="scirp.38855-formula140200"><label>(3)</label><graphic position="anchor" xlink:href="8-1040253\3bbbbb6e-938e-4449-bb5e-53cef907fbf1.jpg"  xlink:type="simple"/></disp-formula><p>which is satisfied only by the optimal policy, where <img src="8-1040253\ef203ab0-041c-4802-bb2d-1cff0a2d78e3.jpg" /> is a real-valued function, sometimes referred to as the value function or relative cost function.</p></sec><sec id="s3"><title>3. The Value Iteration Algorithm</title><p>A very popular algorithm to find the optimal policy <img src="8-1040253\ce2f045f-5ea6-49f8-871b-9768631e06d3.jpg" /> is the value iteration (VI) algorithm. This algorithm iteratively searches for the solution <img src="8-1040253\69155e58-3509-40ed-b41d-6bc537798ffb.jpg" /> to the Poisson Equation (3).</p><p>Let <img src="8-1040253\a82d0b54-6225-4d90-abca-8b3f8d900617.jpg" /> be the space of real-valued functions in <img src="8-1040253\a6da1f87-fbad-4316-8ec1-0865e1c7d9f3.jpg" />. 
The VI algorithm employs a mapping <img src="8-1040253\9a21af60-1db1-4fae-b65c-d721e9a87a77.jpg" /> defined as</p><disp-formula id="scirp.38855-formula140201"><label>(4)</label><graphic position="anchor" xlink:href="8-1040253\b0803323-7b7b-4f80-97bc-d1960d7ecfec.jpg"  xlink:type="simple"/></disp-formula><p>The VI algorithm consists of applying the recursion</p><disp-formula id="scirp.38855-formula140202"><label>(5)</label><graphic position="anchor" xlink:href="8-1040253\095b000a-5f67-4e49-961d-5e8834edad55.jpg"  xlink:type="simple"/></disp-formula><p>to obtain increasingly refined estimates of the solution to the Poisson Equation (3). Under mild conditions [<xref ref-type="bibr" rid="scirp.38855-ref1">1</xref>], the algorithm can be shown to converge to the solution of Equation (3), thus yielding both the optimal policy <img src="8-1040253\5d391b8b-7909-4488-acd1-49e20bd971d6.jpg" /> and its associated average cost <img src="8-1040253\914f2194-b6b8-414b-88e6-5a33984566ee.jpg" />.</p><p>The convergence of the algorithm is linear, but the rate of convergence is not known a priori [<xref ref-type="bibr" rid="scirp.38855-ref1">1</xref>].</p></sec><sec id="s4"><title>4. The Partial Information Value Iteration Algorithm</title><p>The rationale behind the partial information value iteration (PIVI) algorithm is to iterate on increasingly refined approximate models that converge to the exact model according to a prescribed schedule defined a priori. The purpose of such a refinement is to employ fewer computational resources in the early stages of the algorithm, when the algorithm is typically far from the optimal solution, and hence focus most of the computational resources within a region that is closer to the optimal solution.</p><p>An intuitive way of decreasing the computational effort at the early iterations is to focus on the most probable transitions at the initial stages of the algorithm. 
For any state-action pair <img src="8-1040253\219b6eaa-cd0f-4f8a-a5b2-bb93ebebb58a.jpg" /> let <img src="8-1040253\5ae4ed8d-198f-4276-a20f-9b0c0c277fbd.jpg" /> be an ordering of the states in decreasing order of transition probabilities, that is<img src="8-1040253\86ac1016-fe04-42a4-b6b6-74dbe66e2e04.jpg" />. This leads to the distribution functions</p><p><img src="8-1040253\b04efb57-b957-4d4d-9b55-0dc10a0c9a57.jpg" /></p><p><img src="8-1040253\b5ca615e-97c2-4436-98e7-b038a1ca6976.jpg" /></p><p>Let</p><disp-formula id="scirp.38855-formula140203"><label>(6)</label><graphic position="anchor" xlink:href="8-1040253\5bbde095-3f8d-4982-8b3f-6cb6b446f906.jpg"  xlink:type="simple"/></disp-formula><p>where<img src="8-1040253\db372367-c1ed-4016-bd71-e870dacb083d.jpg" />.</p><p>Consider the following mapping [<xref ref-type="bibr" rid="scirp.38855-ref13">13</xref>]</p><disp-formula id="scirp.38855-formula140204"><label>(7)</label><graphic position="anchor" xlink:href="8-1040253\8de3d34c-48e6-4950-85d2-76477fdb3769.jpg"  xlink:type="simple"/></disp-formula><p>where <img src="8-1040253\fef4c120-d4f3-4395-b744-13c67fee0719.jpg" /> and <img src="8-1040253\da674763-edac-4f52-a42d-d716dba2f278.jpg" /> is a normalizing factor intended to make the truncated transition probability into a normalized probability distribution. 
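The truncation behind mapping (7) can be illustrated with a minimal Python sketch. It assumes that the function in (6) returns the smallest number of most probable transitions whose cumulative probability reaches 1 - ε; the name truncate_distribution, the list-based representation and the small rounding slack are illustrative assumptions and are not taken from the original formulation.
<preformat preformat-type="code">
```python
def truncate_distribution(p, eps):
    """Keep the most probable transitions covering mass 1 - eps, then renormalize.

    p   : list of transition probabilities for one state-action pair
    eps : truncation parameter in [0, 1); eps = 0 keeps the exact distribution
    Returns the renormalized truncated distribution and the number of
    transitions kept (assumed here to play the role of the function in (6)).
    """
    # Order the destination states by decreasing transition probability.
    order = sorted(range(len(p)), key=lambda j: -p[j])
    # Assumed stopping rule: cumulative mass reaches 1 - eps (slack for rounding).
    target = 1.0 - eps - 1e-12
    kept, mass = [], 0.0
    for j in order:
        kept.append(j)
        mass += p[j]
        if mass >= target:
            break
    # Dividing by the retained mass is the normalizing factor.
    truncated = [0.0] * len(p)
    for j in kept:
        truncated[j] = p[j] / mass
    return truncated, len(kept)


if __name__ == "__main__":
    dist, n_kept = truncate_distribution([0.5, 0.1, 0.3, 0.1], eps=0.25)
    print(n_kept, dist)
```
</preformat>
With probabilities (0.5, 0.1, 0.3, 0.1) and ε = 0.25, the sketch keeps the two most probable transitions and rescales them to 0.625 and 0.375; with ε = 0 all four transitions are kept and the exact distribution is recovered. 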
Let <img src="8-1040253\16aa8d88-4cb2-4990-b2b1-550a5ad66d13.jpg" /> be a bounded non-increasing sequence in the interval <img src="8-1040253\25ae91d5-1849-4058-9b6b-6ee321a6dd30.jpg" /> such that</p><disp-formula id="scirp.38855-formula140205"><label>(8)</label><graphic position="anchor" xlink:href="8-1040253\7c866951-8c4c-40db-8caa-20de267e9afb.jpg"  xlink:type="simple"/></disp-formula><p>The PIVI algorithm can be defined by the following recursion</p><disp-formula id="scirp.38855-formula140206"><label>(9)</label><graphic position="anchor" xlink:href="8-1040253\4aca9604-4117-4f3d-95bc-32091e85e9c4.jpg"  xlink:type="simple"/></disp-formula><p>Observe that, since the parameter sequence <img src="8-1040253\627dba28-c38f-462e-a44f-db3a19a93c24.jpg" /> in (8) goes to zero, the algorithm tends to the exact algorithm and, as such, converges to the solution of the original problem. This follows by applying (5) to some iterate <img src="8-1040253\511593d0-21a8-4d64-b0b3-471f0c0ddb7a.jpg" /> in (9), with an arbitrarily high index <img src="8-1040253\34721110-15eb-4fa4-b617-f6b7c890bda4.jpg" /> relabeled as zero.</p><sec id="s4_1"><title>4.1. The Parameter Sequence <img src="8-1040253\70697a43-290f-42fe-823e-b76a67c1009d.jpg" /></title><p>As pointed out in the last section, it suffices that <img src="8-1040253\8d5f855a-96ac-4911-9dd3-1a5e23b52c24.jpg" /> goes to zero within finite time for the PIVI algorithm (9) to converge to the exact solution. Hence, the sequence <img src="8-1040253\de36b6ec-89ce-48d7-a50e-d05b31384235.jpg" /> can be freely selected from the class of convergent sequences in the interval <img src="8-1040253\0ec8bee2-079c-4843-adc0-c821fbc39d7f.jpg" /> whose limit is zero. 
However, it is the way in which the sequence goes to zero that will ultimately determine the behavior and, therefore, the computational effort of the PIVI algorithm [13,14].</p><p>It has been shown that, for linearly convergent algorithms with convergence rate <img src="8-1040253\ec2ad39d-6d47-449c-a928-5f5d7e24e3b6.jpg" />, the optimal sequence with respect to the overall computational effort is geometrically decreasing, with rate <img src="8-1040253\658bb0a6-dfd8-4f32-a27e-a1b4f87cad10.jpg" />, which coincides with the convergence rate of the algorithm [<xref ref-type="bibr" rid="scirp.38855-ref13">13</xref>]. This result applies to discounted MDPs, for which the convergence rate <img src="8-1040253\80324779-a4b3-4357-b66d-308c7457d1cb.jpg" /> is known and coincides with the discount factor.</p><p>Value iteration for average cost MDPs does converge linearly, but the convergence rate is unknown and depends on the topology of the MDP being solved [<xref ref-type="bibr" rid="scirp.38855-ref1">1</xref>]. This renders the direct application of the results in [<xref ref-type="bibr" rid="scirp.38855-ref13">13</xref>] impractical. Indeed, geometrically decreasing sequences <img src="8-1040253\c4629c41-df5d-4278-80e2-1a5b2d548a49.jpg" /> were tried in [<xref ref-type="bibr" rid="scirp.38855-ref14">14</xref>], and promising results were obtained. The difficulty in such an approach lies in the fact that guessing the convergence rate a priori can be quite a daunting task. When a suitable decrease rate is found, it can result in significant computational savings. However, a poor choice of decrease rate may result in an inefficient algorithm, which can even be outperformed by standard value iteration [<xref ref-type="bibr" rid="scirp.38855-ref14">14</xref>]. 
In this paper we address this shortcoming by introducing an algorithm that adaptively decreases the parameter sequence <img src="8-1040253\8e22e49c-e137-40bd-952b-411af53d3a26.jpg" />, and that results in a more robust algorithm, with more stable behavior that consistently outperforms standard value iteration.</p></sec><sec id="s4_2"><title>4.2. Identification of Efficient Parameter Sequences</title><p>In this section we propose an algorithm that adaptively selects the parameter sequence <img src="8-1040253\681166c5-0be1-4f9e-be67-ed13f17d69fe.jpg" />. The selection is based on the span semi-norm of the error sequence obtained by the PIVI algorithm at each iteration, defined as</p><disp-formula id="scirp.38855-formula140207"><label>(10)</label><graphic position="anchor" xlink:href="8-1040253\3d311b32-8015-41fa-be8f-7d44269ae6ab.jpg"  xlink:type="simple"/></disp-formula><p>where <img src="8-1040253\e499cbb4-3385-4249-92ce-049e2c5fe8f0.jpg" /> is the result of the k-th iterate of Algorithm (9).</p><p>The proposed algorithm uses the error defined above to assess the empirical convergence rate at iteration <img src="8-1040253\a6931054-512c-4bfd-ac97-173cfd9c1b43.jpg" />, <img src="8-1040253\ba4b98c7-ea3d-45e5-8556-23d5bfbc278b.jpg" />, defined as:</p><disp-formula id="scirp.38855-formula140208"><label>(11)</label><graphic position="anchor" xlink:href="8-1040253\774321cf-84b5-4c69-9f1f-139c0df6a55b.jpg"  xlink:type="simple"/></disp-formula><p>During the convergence process, the error <img src="8-1040253\a2ea54ae-37b7-4be2-b323-1ee542fb811d.jpg" /> should steadily decay. It is possible, however, that the decrease in parameter <img src="8-1040253\a7203318-b62f-4697-8b66-105a6c53864e.jpg" /> in mapping <img src="8-1040253\7df14afc-005d-4296-aa73-e924b9fd371d.jpg" /> in the PIVI algorithm (9), which is defined in (7), results in an immediate increase in the error. 
This happens because a decrease in <img src="8-1040253\b9edf77f-f346-4012-9f4b-76438d8ed0ac.jpg" /> results in a different, more accurate approximate model in (7), for which the current approximation of the value function <img src="8-1040253\d4508913-36ee-4c5a-b496-d4a995307556.jpg" /> may not be as good. Hence, in order to avoid instability, i.e., a <img src="8-1040253\e3247881-3db7-4453-a120-9c045d030de7.jpg" />, whenever the error increases, <img src="8-1040253\1aa05a56-09dc-4fbc-971e-311b53423395.jpg" /> is set to zero.</p><p>For the adaptive algorithm, we use a varying decrease rate sequence <img src="8-1040253\2e641f9a-6b37-4813-bae4-bba4198d663a.jpg" /> and the objective is to adaptively estimate the convergence rate of the exact algorithm, which is unknown a priori. In order to do that, a gradient, <img src="8-1040253\35815ec2-b5dc-4bdd-b025-0a6d1bcf4ff9.jpg" />, is calculated, which is defined as:</p><disp-formula id="scirp.38855-formula140209"><label>(12)</label><graphic position="anchor" xlink:href="8-1040253\59528ff9-255f-4eee-ac70-37211892556e.jpg"  xlink:type="simple"/></disp-formula><p>The convergence parameter is estimated by the adaptive gradient recursion as follows:</p><disp-formula id="scirp.38855-formula140210"><label>(13)</label><graphic position="anchor" xlink:href="8-1040253\d80bf502-e4d3-4226-a3ba-952f0336e273.jpg"  xlink:type="simple"/></disp-formula><p>where <img src="8-1040253\1387b0cf-b476-41a1-9f93-97220b6952e2.jpg" />, and <img src="8-1040253\c914ab8d-7f34-4cc6-8759-ac13692d6d8f.jpg" /> is an adaptive parameter. 
The parameter sequence <img src="8-1040253\db2c2b28-bd56-468e-9c0a-5263274c3b82.jpg" /> in Algorithm (9) is then refined by the following recursion:</p><disp-formula id="scirp.38855-formula140211"><label>(14)</label><graphic position="anchor" xlink:href="8-1040253\c8da6e0f-100a-4a36-9d90-d8d22b48f489.jpg"  xlink:type="simple"/></disp-formula><p>The proposed parameter sequence, <img src="8-1040253\034d0c14-c948-4ae4-8d9d-44fe63679b09.jpg" />, may result in an intensely varying sequence <img src="8-1040253\d648dd74-b398-451e-a03f-08a847df8965.jpg" />, due to possibly erratic behavior in the error sequence <img src="8-1040253\26af2fec-4df6-40f7-b6de-565820e2a865.jpg" />, and such a variation may limit the computational gains. To mitigate the effect of the error on the estimation process, a refined algorithm is presented. Prior to estimating the empirical convergence rate in Equation (11), the error, <img src="8-1040253\8675cdfb-8ab6-44cd-97c5-83353e230d63.jpg" />, goes through a moving average digital filter of order <img src="8-1040253\1529d768-47e1-43b7-9b0d-7e5db8b946e9.jpg" /> [<xref ref-type="bibr" rid="scirp.38855-ref15">15</xref>]. This filter attenuates the high-frequency components of the error, leading to a better estimate of the convergence rate parameter, <img src="8-1040253\f5194b4f-9778-4956-9299-a81637a9beb4.jpg" />. The difference equation that implements the moving average filter is presented in (15).</p><disp-formula id="scirp.38855-formula140212"><label>(15)</label><graphic position="anchor" xlink:href="8-1040253\4a539006-cecb-4320-bf31-42cb06262f87.jpg"  xlink:type="simple"/></disp-formula><p>where <img src="8-1040253\e711dd40-426a-4cee-b894-34fe9db9ab03.jpg" /> is the filter input, <img src="8-1040253\e0e99787-3b73-4a44-be30-e4a1936c626c.jpg" /> is the filtered error, and <img src="8-1040253\fa4c881d-85c9-4192-9992-b3f3cca11ad1.jpg" /> is the filter order. 
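A moving average filter of this kind can be sketched in a few lines of Python. The sketch assumes the usual causal form, in which the output is the arithmetic mean of the most recent error values, with the window length equal to the filter order and fewer values averaged during the first iterations; the class name MovingAverageFilter and the start-up behavior are assumptions made for illustration, and Equation (15) remains the authoritative definition.
<preformat preformat-type="code">
```python
from collections import deque


class MovingAverageFilter:
    """Causal moving average over the most recent `order` samples (assumed form)."""

    def __init__(self, order):
        if order not in range(1, 10**6):
            raise ValueError("order must be a positive integer")
        # A bounded deque drops the oldest sample automatically.
        self.window = deque(maxlen=order)

    def update(self, error):
        """Push the newest error value and return the filtered error."""
        self.window.append(error)
        # During start-up fewer than `order` samples are available (assumption).
        return sum(self.window) / len(self.window)


if __name__ == "__main__":
    smoother = MovingAverageFilter(order=3)
    raw_errors = [3.0, 1.0, 2.0, 6.0]  # hypothetical erratic error sequence
    print([smoother.update(e) for e in raw_errors])
```
</preformat>
Feeding the hypothetical sequence 3, 1, 2, 6 through a filter of order 3 yields 3, 2, 2, 3, which shows how an isolated jump in the error is damped before the rate estimation. 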
The filter uses <img src="8-1040253\0a995fe6-d12d-4e54-a9a8-7f3195301510.jpg" /> past iterations to estimate the error <img src="8-1040253\b8a96d84-ac5a-44e9-bba4-becc7432e1c9.jpg" />. The estimated error is used in (11) to approximate the empirical convergence rate. The higher the filter order <img src="8-1040253\09924891-5338-48a8-90d0-83db67dc4d4b.jpg" />, the longer the past history that is taken into account.</p></sec><sec id="s4_3"><title>4.3. A Measure for Computational Effort Comparison</title><p>In order to compare the overall computational effort, one needs a measure of the total computational effort applied by each algorithm. Such a measure enables one to directly compare different types of algorithms, which can rely on different updating schemes. In addition, defining this type of measure is more appealing than just measuring the convergence time because it makes it possible to compare different types of algorithms without necessarily running them. Furthermore, one can also define suitable optimization routines that aim at finding the best algorithm in a given class with respect to the overall computational effort. This line of study is explored in [<xref ref-type="bibr" rid="scirp.38855-ref13">13</xref>].</p><p>Examining the updating scheme of mapping <img src="8-1040253\01f5f572-6b52-4598-98cd-9149d06d1783.jpg" /> in the VI algorithm (5), it can be verified that the overall computational effort per iteration, per state, of the VI algorithm is proportional to the total number of transitions examined by mapping <img src="8-1040253\dbcb6f95-849e-4f33-9f80-89d1d2314585.jpg" />. Hence, one can define the computational effort of updating a single state <img src="8-1040253\d8cffff8-d060-4884-8c27-ecbeaf61a724.jpg" /> as the sum of the cardinalities of the transition probability distributions for each feasible action for that state. 
Let <img src="8-1040253\151bdda2-9477-4918-937b-b78c502b288d.jpg" /> denote the computational effort of updating state <img src="8-1040253\0169991c-ab2a-4b49-8ebf-74805ac5f122.jpg" /> under the VI algorithm. The overall effort for a single iteration of the VI algorithm then becomes</p><p><img src="8-1040253\ef2e1a93-76c8-4997-ac38-e3270875016a.jpg" /></p><p>The computational effort for an iteration of the PIVI algorithm, on the other hand, depends on the total number of transitions examined in each iteration of the algorithm. Letting <img src="8-1040253\86e7abfb-deb8-459b-a7ec-331a79b756a5.jpg" /> denote the parameter sequence of the PIVI algorithm and <img src="8-1040253\b0c1fedd-154a-461c-b156-0419649598aa.jpg" /> denote the total number of transitions at state <img src="8-1040253\f1d04249-13c3-49e9-8de3-4a95e6a81963.jpg" /> at the k-th iteration of the PIVI algorithm, we have</p><disp-formula id="scirp.38855-formula140213"><label>(16)</label><graphic position="anchor" xlink:href="8-1040253\6dd3ddaa-853c-4abb-99cc-094fcd3203ce.jpg"  xlink:type="simple"/></disp-formula><p>where <img src="8-1040253\df099f00-0440-49a1-a2ce-22914ed90d62.jpg" /> is the function defined in (6) for state-action pairs <img src="8-1040253\7e169930-50d8-4d35-a7b9-bea67c1b929c.jpg" />. Let <img src="8-1040253\84a5cdc8-77d4-4919-831a-7159ef55d65e.jpg" /> denote the total number of iterations up to the PIVI convergence. 
Consequently, the overall computational effort of the PIVI algorithm becomes</p><p><img src="8-1040253\d671a71a-f2d3-44a9-97e7-8dbad49b4793.jpg" /></p><p>In the next section, we experiment with the parameter sequence <img src="8-1040253\8dabe6fa-7761-4b21-b152-28beabd08029.jpg" />, varying the adaptive parameter <img src="8-1040253\d1ae43c5-583c-41d8-a826-df29e197b190.jpg" /> in (13), and compare the computational effort measures <img src="8-1040253\59383af5-1bd6-4627-bea6-3217fe63261e.jpg" /> and <img src="8-1040253\9294be51-fbe2-4130-86fb-76f3ffaff77d.jpg" /> to quantify the computational savings that can be obtained by the PIVI algorithm when compared to the classical VI algorithm.</p></sec></sec><sec id="s5"><title>5. Numerical Experiments</title><p>In order to compare the proposed method with the classic VI algorithm [<xref ref-type="bibr" rid="scirp.38855-ref17">17</xref>], two sets of experiments were carried out. These experiments are replications of the experiments presented in [<xref ref-type="bibr" rid="scirp.38855-ref14">14</xref>] and thus offer a ground for comparison. In the first experiment we solve a queueing model with two classes of clients. Each client class has a dedicated queue whose length varies in the interval <img src="8-1040253\89d818af-07a2-4fb9-a035-77f0bbe25ff7.jpg" />. Moreover, no new client is admitted to a queue whenever the length of that queue is at the upper limit. For both experiments, a single server is responsible for the service of both queues and serves <img src="8-1040253\fdce5a22-671f-4e61-ab2f-74a57ab93e24.jpg" /> clients at each time period. The decision maker must decide whether to serve Queue 1, Queue 2, or to stay idle. 
The cost function depends on the total number of clients in line and is given by:</p><p><img src="8-1040253\d8f28dd4-b488-4e18-9b5c-c2982d1c0727.jpg" /></p><p>where <img src="8-1040253\92959460-b2d2-4838-864b-c537b71dbbfb.jpg" /> and <img src="8-1040253\98a75398-0d83-49d3-a60b-1878b289e495.jpg" /> denote the sizes of Queue 1 and Queue 2, respectively. Clearly, such a cost function is designed to prioritize clients belonging to the second class. We also note that the cost function does not depend on the selected control actions. The objective is to find the policy which minimizes the average cost and satisfies expression (2).</p><p>For the first experiment, both types of clients arrive according to a Poisson process with mean <img src="8-1040253\a6c6e5f4-8f02-4005-bcb6-31caa8786410.jpg" />. For computational purposes, the transition probability generated by this process was truncated to accommodate a fraction 0.9999 of the transitions and re-normalized afterwards. Such a truncation yields a total of 22 transitions for each queue, thus resulting in a total of 484 possible transitions for each state-action pair. For this experiment <img src="8-1040253\1c35dd1a-9102-4a15-82d9-edf5bae813c5.jpg" /> was fixed at 21. Since we have three possible control actions and since the number of transitions is the same under each action, the total number of transitions per state for the VI algorithm is</p><p><img src="8-1040253\337e2875-f800-469e-8716-046bafe2bd69.jpg" /></p><p>For a tolerance of <img src="8-1040253\bd435a92-ff7a-4065-a467-58453e6f5554.jpg" /> [<xref ref-type="bibr" rid="scirp.38855-ref1">1</xref>], the VI algorithm took 950 iterations to converge, with an overall computational effort of <img src="8-1040253\549067a5-8bb9-4cde-95af-eb5ec7bc2688.jpg" /> per state. 
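The per-state counts follow from simple multiplication, which the short Python check below reproduces using only the figures stated in the text (22 transitions per queue, three control actions, 950 iterations); the variable names are illustrative and the effort is assumed to be counted in examined transitions, as in Section 4.3.
<preformat preformat-type="code">
```python
# Counts quoted in the text for the first experiment.
transitions_per_queue = 22
n_actions = 3
vi_iterations = 950

# Two independent queues: 22 * 22 joint transitions per state-action pair.
transitions_per_pair = transitions_per_queue ** 2           # 484
# Same number of transitions under each of the three actions.
effort_per_iteration = n_actions * transitions_per_pair     # 1452
# VI examines every transition at every iteration.
vi_effort_per_state = vi_iterations * effort_per_iteration  # 1379400

if __name__ == "__main__":
    print(transitions_per_pair, effort_per_iteration, vi_effort_per_state)
```
</preformat>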
The total effort can be obtained by multiplying this value by the cardinality of the state space S, <img src="8-1040253\603c5135-a9a3-4264-a22a-18ee723bdb93.jpg" />.</p><p><xref ref-type="fig" rid="fig1">Figure 1</xref> depicts the overall computational effort <img src="8-1040253\b52a0fb9-d122-4a35-9962-c70ab8706cb4.jpg" /> for parameter sequences of the type in (14), for different values of <img src="8-1040253\4f964ac1-1135-4f27-acb3-1ac4256cf545.jpg" />. The computational effort was normalized with respect to the overall VI effort <img src="8-1040253\8b9dacd4-c6aa-40cd-ae93-09556ebfe9ac.jpg" /> to simplify the comparison.</p><p>The normalized computational effort for the proposed algorithm is plotted in <xref ref-type="fig" rid="fig1">Figure 1</xref> as a function of the adaptive parameter <img src="8-1040253\54fa4360-260a-4d45-b2ab-319d7490720c.jpg" />. One can see that, for the best possible choice of <img src="8-1040253\56d57f3e-59bb-4b0a-a620-edbfe2e520b4.jpg" />, the computational effort is approximately <img src="8-1040253\ee9116fd-fee3-46c8-a9b3-032c6085c6de.jpg" /> of that of the classical VI algorithm. Therefore, the proposed algorithm converges in about <img src="8-1040253\79af88f6-6390-4c4c-bfc7-fee66b1a1063.jpg" /> of the time required for the VI algorithm to converge. Another point to look at is the improvement due to the moving average digital filter. Five curves are shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>(a): the solid blue curve is the overall computational effort without filtering of the empirical rate; the square-marked dotted curve, the circle-marked dashed curve, the diamond-marked dashed curve, and the triangle-marked solid curve present the computational effort with the moving average digital filter defined in (15) of orders <img src="8-1040253\6c20a13d-1439-4414-89ff-c1ae86b0b8e2.jpg" />, respectively. 
These filters are used to process the error sequence <img src="8-1040253\8e5551bb-b8e8-4576-a214-2558ffaebaf2.jpg" /> prior to the empirical rate estimation in (11). While the unfiltered algorithm produces an improvement over the classical VI algorithm, the use of the moving average filter provides even better performance, since the filter smooths fluctuations in the empirical error function. Moreover, one can see that the performance improves as the filter order increases.</p><p><xref ref-type="fig" rid="fig1">Figure 1</xref> also shows that the proposed algorithm is consistently better than the VI algorithm for a broad range of parameters <img src="8-1040253\ce799fcc-6c56-4b49-997c-a59439571955.jpg" />. Naturally, a better choice of parameter results in better savings, but the results suggest that the algorithm is robust with respect to the parameter choice. This is a significant improvement over the class of algorithms proposed in [<xref ref-type="bibr" rid="scirp.38855-ref14">14</xref>], which yields more significant savings in a best-case scenario but can be outperformed by VI under a poor parameter selection. Moreover, the algorithm in [<xref ref-type="bibr" rid="scirp.38855-ref14">14</xref>] seems to be overly sensitive to the parameter selection, and its results tend to vary significantly within a small parameter interval.</p><p>One can expect an improvement as the filter order increases, up to some point where a further increase no longer improves performance.
This behavior can be observed in <xref ref-type="fig" rid="fig1">Figure 1</xref>(b), where the limit is reached and the lowest computational effort is achieved with a filter order of <img src="8-1040253\21533319-a75a-4047-87de-f7e70b1aac7a.jpg" />, when the algorithm needs about <img src="8-1040253\dba637ba-c6d0-433f-b77f-c2fe570294a2.jpg" /> of the total computational effort.</p><p><xref ref-type="fig" rid="fig2">Figure 2</xref> depicts the computational effort for higher filter orders, <img src="8-1040253\d6b4a986-ae8d-429a-a0dc-42c7950d319a.jpg" />. One can see that no additional improvement is obtained for orders higher than 9. The main reason is that the underlying process that generates the error sequence <img src="8-1040253\3bc48cc7-645d-45a2-a121-3a5f89f09bda.jpg" /> has a given internal memory, or process memory. Whenever the memory of an estimator, e.g., a moving average filter, exceeds the process memory, the performance of the estimator decreases [<xref ref-type="bibr" rid="scirp.38855-ref18">18</xref>].</p><p>In order to illustrate the impact of the filter order on the computational effort, <xref ref-type="fig" rid="fig3">Figure 3</xref>(a) shows the computational effort for different filter orders. One can see the impact of the filter order on the performance, and that the performance is always improved with respect to the unfiltered PIVI algorithm.</p><p><xref ref-type="fig" rid="fig3">Figure 3</xref>(b) presents the best adaptive parameter <img src="8-1040253\cc4a601d-811a-4506-9934-763d4db701b9.jpg" /> for different values of the filter order. One can notice in <xref ref-type="fig" rid="fig3">Figure 3</xref>(b) that the best value for the adaptive parameter <img src="8-1040253\65ef270a-f8ff-4212-8501-dafb922f4744.jpg" /> spans a wide range.
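</p><p>The smoothing role of the moving average filter discussed in this section can be illustrated with a minimal sketch. The windowing convention (mean of the last samples, with a shorter window during warm-up), the synthetic error sequence, and the use of the ratio of successive errors as a rate estimate are assumptions made for illustration only; they are not taken from expressions (11) and (15).</p><preformat>
```python
from collections import deque


def moving_average(errors, order):
    """Causal moving average over the last `order` samples.

    While fewer than `order` samples are available, the mean of the
    samples seen so far is returned (warm-up convention assumed here).
    """
    if not order >= 1:
        raise ValueError("order must be at least 1")
    window = deque(maxlen=order)  # the oldest sample is dropped automatically
    running_sum = 0.0
    smoothed = []
    for e in errors:
        if len(window) == order:
            running_sum -= window[0]  # sample about to be evicted
        window.append(e)
        running_sum += e
        smoothed.append(running_sum / len(window))
    return smoothed


def spread(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


# Hypothetical error sequence: geometric decay with an alternating ripple.
errors = [0.9 ** k * (1.0 + 0.3 * (-1) ** k) for k in range(12)]
raw_ratios = [errors[k + 1] / errors[k] for k in range(11)]

smooth = moving_average(errors, order=4)
smooth_ratios = [smooth[k + 1] / smooth[k] for k in range(11)]

print("spread of raw rate estimates:     ", spread(raw_ratios))
print("spread of filtered rate estimates:", spread(smooth_ratios))
```
</preformat><p>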
Whenever the filter order is among <img src="8-1040253\ead65036-5093-4f15-9534-d38d98485e5f.jpg" />, <img src="8-1040253\c25bc272-cb48-4778-9707-759c32747272.jpg" />, the adaptive parameter <img src="8-1040253\9ad541ae-e067-41ee-b6b3-5ad866510f98.jpg" /> has low sensitivity: any value of <img src="8-1040253\15a62ba5-435c-40c0-a32f-710e53a99ab6.jpg" /> in the range from 0.30 to 0.65, i.e., <img src="8-1040253\f9eafd5c-b7ef-4117-a790-b2df9f496c29.jpg" />, would yield a significant improvement in the computational performance.</p><p>The first experiment suggests that the proposed method can provide significant savings in computational effort for problems with exogenous Poisson arrival processes. This is a very interesting result, considering that such a class of problems tends to be very popular in queueing and manufacturing applications [16,19].</p><p>In the second experiment, the clients arrive in the first queue according to a geometric distribution with mean <img src="8-1040253\fd027210-55cc-4598-b9ad-8456a1104731.jpg" />. Clients belonging to the second class arrive uniformly in the integer interval from 0 to 9. The geometric distribution is also truncated to retain a fraction of 0.9999 of the total transitions and then renormalized. Such a normalization is not needed for the uniform distribution.
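</p><p>The truncation and renormalization step described above can be sketched as follows. The geometric parameter p = 0.2 and the convention that arrivals are counted from 0 are hypothetical choices made for illustration, so the number of retained outcomes differs from the one obtained in the experiment; the reported 660 joint transitions are consistent with 66 retained geometric outcomes combined with the 10 uniform outcomes.</p><preformat>
```python
def truncated_pmf(pmf, mass=0.9999, max_support=10000):
    """Keep the shortest prefix 0..n-1 whose total mass reaches `mass`,
    then renormalize the retained probabilities so that they sum to one."""
    probs = []
    total = 0.0
    for k in range(max_support):
        p_k = pmf(k)
        probs.append(p_k)
        total += p_k
        if total >= mass:
            break
    return [p_k / total for p_k in probs]


def geometric_pmf(p):
    """Geometric pmf on 0, 1, 2, ... (support convention assumed here)."""
    return lambda k: p * (1.0 - p) ** k


p = 0.2  # hypothetical parameter, not the value used in the experiment
geo = truncated_pmf(geometric_pmf(p))

# Arrivals uniform on the integers 0..9 need no truncation.
uniform = [1.0 / 10] * 10

# Independent arrivals: the joint probability is the product of the marginals.
joint = {
    (i, j): g_i * u_j
    for i, g_i in enumerate(geo)
    for j, u_j in enumerate(uniform)
}

print("retained geometric outcomes:", len(geo))
print("joint transitions:", len(joint))
```
</preformat><p>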
The joint arrival processes yield 660 possible transitions, hence</p><p><img src="8-1040253\4ae0d021-55b9-4a17-891e-a53f982d92d6.jpg" /></p><p>The classical VI algorithm converged in 104 iterations and required an overall computational effort per state of <img src="8-1040253\a71b90ff-7039-4d65-b6c6-50547c7c272e.jpg" />.</p><p><xref ref-type="fig" rid="fig4">Figure 4</xref> depicts the overall computational effort <img src="8-1040253\e304596a-2727-46fd-bda6-e591a6c1cd1c.jpg" /> for parameter sequences of the type in (14), for different values of <img src="8-1040253\ed865005-ee8d-4dbf-8f82-a1824fb22651.jpg" />, normalized with respect to the overall VI effort <img src="8-1040253\c1803fe1-e5da-4fc2-82af-fd496da2d929.jpg" />. For this experiment, we set <img src="8-1040253\10b52792-dc82-4717-bb22-51e5d97c48f5.jpg" />. The normalized computational effort for the proposed algorithm is plotted in <xref ref-type="fig" rid="fig4">Figure 4</xref> as a function of the adaptive parameter <img src="8-1040253\0577e0df-ff87-4532-bf78-57e3e858de2e.jpg" />. One can see that, for a broad range of <img src="8-1040253\24748df6-45ba-4135-8300-2f1e79ac69b1.jpg" />, <img src="8-1040253\f3c05cb0-863e-42e5-840e-4e1ad74faa2e.jpg" />, the computational effort is less than that of the classical VI algorithm. At the best point, the cost is about <img src="8-1040253\0fb5beb6-8b42-4551-be11-50697fc70dac.jpg" /> of that of the classical VI algorithm.</p><p>Note that the savings of the proposed algorithm for the second setting, though still appealing, are much less significant than those obtained for Poisson exogenous arrival processes. This suggests that more modest savings may be expected for uniform distribution settings.
This is consistent with the results in [<xref ref-type="bibr" rid="scirp.38855-ref13">13</xref>], which suggest that PIVI algorithms tend to perform better for highly concentrated probability distributions, such as the Poisson distribution, while yielding less significant savings for distributions that are widely spread. Moreover, one can also notice that the addition of the moving average filter seems to have no significant effect on the results in this setting.</p></sec><sec id="s6"><title>6. Concluding Remarks</title><p>This paper introduced a gradient adaptive version of the partial information value iteration (PIVI) algorithm proposed in [<xref ref-type="bibr" rid="scirp.38855-ref13">13</xref>] for the average cost MDP framework, with the addition of a moving average filter to smooth the empirical error sequence.</p><p>The proposed algorithm was validated by means of queueing examples and yielded consistent computational savings with respect to classical value iteration. Moreover, the proposed algorithm yielded consistent improvement over value iteration for a wide range of parameters, thus overcoming a shortcoming of a previous approach [<xref ref-type="bibr" rid="scirp.38855-ref14">14</xref>] that was overly sensitive to the parameter choice.</p></sec><sec id="s7"><title>7. Acknowledgements</title><p>This work was partially supported by the Brazilian National Research Council-CNPq, under Grant No. 302716/2011-4.</p></sec><sec id="s8"><title>REFERENCES</title></sec></body><back><ref-list><title>References</title><ref id="scirp.38855-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">M. L. Puterman, “Markov Decision Processes: Discrete Stochastic Dynamic Programming,” John Wiley &amp; Sons, New York, 1994. http://dx.doi.org/10.1002/9780470316887</mixed-citation></ref><ref id="scirp.38855-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">R. 
Bellman, “Dynamic Programming,” Princeton University Press, Princeton, 1957.</mixed-citation></ref><ref id="scirp.38855-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">R. Howard, “Dynamic Probabilistic Systems,” John Wiley &amp; Sons, New York, 1971.</mixed-citation></ref><ref id="scirp.38855-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">A. S. Adeyefa and M. K. Luhandjula, “Multiobjective Stochastic Linear Programming: An Overview,” American Journal of Operations Research, Vol. 1, No. 4, 2011, pp. 203-213. http://dx.doi.org/10.4236/ajor.2011.14023</mixed-citation></ref><ref id="scirp.38855-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">M. He, L. Zhao and W. B. Powell, “Approximate Dynamic Programming Algorithms for Optimal Dosage Decisions in Controlled Ovarian Hyperstimulation,” European Journal of Operational Research, Vol. 222, 2012, pp. 328-340. http://dx.doi.org/10.1016/j.ejor.2012.03.049</mixed-citation></ref><ref id="scirp.38855-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">S. A. Tarim, M. K. Dogru, U. Ozen and R. Rossi, “An Efficient Computational Method for a Stochastic Dynamic Lot-Sizing Problem under Service-Level Constraints,” European Journal of Operational Research, Vol. 215, No. 3, 2011, pp. 563-571. http://dx.doi.org/10.1016/j.ejor.2011.06.034</mixed-citation></ref><ref id="scirp.38855-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">E. F. Arruda, M. Fragoso and J. do Val, “Approximate Dynamic Programming via Direct Search in the Space of Value Function Approximations,” European Journal of Operational Research, Vol. 211, No. 2, 2011, pp. 343-351. http://dx.doi.org/10.1016/j.ejor.2010.11.019</mixed-citation></ref><ref id="scirp.38855-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">A. Saure, J. Patrick, S. Tyldesley and M. L. 
Puterman, “Dynamic Multi-Appointment Patient Scheduling for Radiation Therapy,” European Journal of Operational Research, Vol. 223, No. 2, 2012, pp. 573-584. http://dx.doi.org/10.1016/j.ejor.2012.06.046</mixed-citation></ref><ref id="scirp.38855-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">T. Hao, Z. Lei and A. Tamio, “Optimization of a Special Case of Continuous-Time Markov Decision Processes with Compact Action Set,” European Journal of Operational Research, Vol. 187, No. 1, 2008, pp. 113-119. http://dx.doi.org/10.1016/j.ejor.2007.04.011</mixed-citation></ref><ref id="scirp.38855-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">H. Wang, “Retrospective Optimization of Mixed-Integer Stochastic Systems Using Dynamic Simplex Linear Interpolation,” European Journal of Operational Research, Vol. 217, No. 1, 2012, pp. 141-148. http://dx.doi.org/10.1016/j.ejor.2011.08.020</mixed-citation></ref><ref id="scirp.38855-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">P. Benchimol, G. Desaulniers and J. Desrosiers, “Stabilized Dynamic Constraint Aggregation for Solving Set Partitioning Problems,” European Journal of Operational Research, Vol. 223, No. 2, 2012, pp. 360-371. http://dx.doi.org/10.1016/j.ejor.2012.07.004</mixed-citation></ref><ref id="scirp.38855-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">S. D. Patek, “Policy Iteration Type Algorithms for Recurrent State Markov Decision Processes,” Computers &amp; Operations Research, Vol. 31, No. 14, 2004, pp. 2333-2347. http://dx.doi.org/10.1016/S0305-0548(03)00190-4</mixed-citation></ref><ref id="scirp.38855-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">A. Almudevar and E. F. 
Arruda, “Optimal Approximation Schedules for a Class of Iterative Algorithms, with an Application to Multigrid Value Iteration,” IEEE Transactions on Automatic Control, Vol. 57, No. 12, 2012, pp. 3132-3146. http://dx.doi.org/10.1109/TAC.2012.2203053</mixed-citation></ref><ref id="scirp.38855-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">E. F. Arruda, F. Ourique and A. Almudevar, “Toward an Optimized Value Iteration Algorithm for Average Cost Markov Decision Processes,” Proceedings of the 49th IEEE International Conference on Decision and Control, Atlanta, 15-17 December 2010, pp. 930-934. http://dx.doi.org/10.1109/CDC.2010.5717895</mixed-citation></ref><ref id="scirp.38855-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">J. G. Proakis and D. G. Manolakis, “Digital Signal Processing,” 4th Edition, Prentice Hall, Upper Saddle River, 2006.</mixed-citation></ref><ref id="scirp.38855-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">P. Brémaud, “Gibbs Fields, Monte Carlo Simulation, and Queues,” Springer-Verlag, New York, 1999.</mixed-citation></ref><ref id="scirp.38855-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">D. P. Bertsekas, “Dynamic Programming and Optimal Control,” 2nd Edition, Athena Scientific, Belmont, 1995.</mixed-citation></ref><ref id="scirp.38855-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">P. J. Brockwell and R. A. Davis, “Time Series: Theory and Methods,” 2nd Edition, Springer, New York, 1991.</mixed-citation></ref><ref id="scirp.38855-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">S. M. Ross, “Stochastic Processes,” 2nd Edition, John Wiley &amp; Sons, New York, 1996.</mixed-citation></ref></ref-list></back></article>