QoS routing in MANETs with imprecise information using actor-critic reinforcement learning

Wipawee Usaha
School of Telecommunication Engineering
Suranaree University of Technology
Nakhon Ratchasima, Thailand 30000
Email: [email protected]

Javier A. Barria
Department of Electrical and Electronic Engineering
Imperial College London, London, U.K. SW7 2BT
Email: [email protected]

Abstract— This paper proposes a path discovery scheme which supports delay-constrained least-cost routing in MANETs. The aim of the scheme is to maximise the probability of success in finding feasible paths while keeping communication overhead under control in the presence of information uncertainty. The problem is viewed as a partially observable Markov decision process (POMDP) and is solved using an actor-critic reinforcement learning (RL) method. The scheme relies on approximate belief states of the environment which capture the network state uncertainty. Numerical results obtained under various scenarios of state uncertainty and stringent QoS requirements show that the proposed RL framework can lead to more efficient control of search messages, i.e., a reduction of up to 63% in the average number of search messages with a marginal reduction of up to 3% in success ratio in comparison with a flooding scheme.

I. INTRODUCTION

QoS routing in mobile ad hoc networks (MANETs) is a challenging task since the network resource state information available at each MANET node is imprecise due to the mobility of its nodes. Based on the nature of the information required, QoS routing protocols can be classified as local (precise) state routing and global (imprecise) state routing. Local state routing protocols rely on information at the particular node, such as the propagation delay, residual bandwidth and cost metric of its outgoing links. Most local state routing schemes discover path(s) on a pure on-demand basis. Hence, neither routing table maintenance nor periodic exchange of routing information is required. On the other hand, global (imprecise) state information routing protocols rely on (non-restrictive) flooding, which creates communication overhead. Therefore, they do not scale well and are only performed periodically or whenever a significant change is detected in the network. Consequently, the global information maintained at each node can become imprecise. Furthermore, information imprecision becomes further aggravated in hierarchically-organised MANET structures, where aggregated information is normally used to cope with scalability problems [1], [2].

Alternatively, restrictive flooding algorithms, like the Ticket-Based Probing (TBP) scheme [3], [4], in which a heuristically determined maximum number of probe packets limits the amount of overhead, have been proposed. However, such schemes can also diminish the chances of finding feasible paths if the maximum allowed number of probes is too low. Learning frameworks have also been applied to support routing in MANETs. For example, [5] proposed a multicast routing approach based on the Q-learning concept, [6] suggested a possible application of a multi-agent routing scheme in MANETs, and a power-aware routing algorithm using the cognitive packet network (CPN) routing protocol has been proposed in [7]. Furthermore, to model information imprecision, partially observable Markov decision process (POMDP) learning frameworks have been applied to packet-switched network routing [6], [8]. However, these works deal with connectionless, minimum-delay routing and not explicitly with QoS requirements or message overhead, which is the focus of the work presented in this paper.

Contribution: In this paper, we propose a restricted flooding scheme, based on a POMDP learning framework, which effectively trades off the amount of search messages against the probability of finding the least-cost path that satisfies end-to-end delay QoS constraints. In particular, our proposed scheme is based on the imprecise network state or, rather, a belief state concept, i.e., a distribution function that captures the uncertainty of the network state, embedded in a POMDP learning framework. The algorithm proposed here is distributed and outperforms our previously proposed RL-based algorithm [11] in terms of computational complexity and storage demand, hence saving onboard processing power on the mobile node. The proposed method is based on Konda's actor-critic scheme [9], whose theoretical guarantees of convergence and convergence rates have been proven for completely observable MDPs [9]. This paper extends Konda's actor-critic algorithm from the completely observable domain to the partially observable domain by using the belief state concept.

II. TICKET-BASED PROBING AS A REINFORCEMENT LEARNING PROBLEM

This section presents the Ticket-Based Probing (TBP) scheme [3], [4], which attempts to maximise the probability of success in finding a feasible route in a network with inaccurate information. The algorithm relies on a number of probes (search messages) which is controlled by a number of logical tickets M0 computed at the source node.


More specifically, when a source node s needs to find a route that satisfies a delay (or bandwidth) requirement to a destination node d, an initial number of logical tickets is computed at node s, depending on the contention level of network resources and the degree of imprecision of the available information. Then, probes (each carrying at least one ticket) are sent from s towards d. The end-to-end information is used to guide the distribution of the tickets along the directions of the most probable feasible paths towards the destination d. The number of tickets computed at an intermediate node j is based on the available end-to-end information (i.e., from node j to d) and cannot exceed the number of tickets in the probe that node j has received.

The intuitive reasoning behind this scheme is that more tickets should be issued at the source node for connection requests with stringent mean end-to-end delay requirements, since feasible paths are difficult to find; as the mean end-to-end delay requirement is relaxed (i.e., increased), the number of tickets issued in the search can be reduced, since a feasible path becomes easier to discover. If multiple feasible paths are discovered, the least-cost path is selected.

The number of issued logical tickets naturally affects the ability to discover feasible paths. By issuing an infinite number of tickets (i.e., flooding the entire network), the chance of discovering a feasible path is maximised at the cost of creating message overhead in the network. On the other hand, by limiting the number of issued tickets, the chance of finding a feasible path also diminishes. Therefore, there exists a trade-off between the number of issued tickets and the chance of finding feasible paths. To obtain a good ticket-issuing policy which balances this trade-off, a reinforcement learning (RL) method, namely the actor-critic method based on the belief state concept, can be integrated with the existing TBP scheme. In the proposed scheme, instead of calculating the number of initial tickets (M0) according to a heuristic rule as in [3], [4], M0 is selected from some finite set in a sequential decision-making process in the presence of network state uncertainty, with the objective of maximising some performance criterion.

III. PROBLEM FORMULATION

A. Imprecision model

The imprecision model in [3], [4] is employed, where the variable of interest (i.e., the end-to-end delay from some node n to destination d) is given by D_n(d) ± ΔD_n(d), where ΔD_n(d) is the maximal variation of the mean end-to-end delay before the next information update. Using a distance vector protocol, the parameter ΔD_n(d) is updated periodically at each node along with D_n(d) according to

ΔD_n^new(d) = ρ ΔD_n^old(d) + (1 − ρ) β |D_n^new(d) − D_n^old(d)|.     (1)

The parameter ρ is the forgetting factor, which determines how fast ΔD_n^old(d) is forgotten; (1 − ρ) determines how fast ΔD_n^new(d) converges to β|D_n^new(d) − D_n^old(d)|; and β is a parameter chosen to ensure a large value of ΔD_n^new(d). Note that by increasing β, we increase ΔD_n^new(d) and, consequently, the certainty that the actual delay falls within the imprecise range.
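
For concreteness, the update in (1) can be computed per destination in a few lines. The following Python sketch is our own illustration (the function and variable names are hypothetical, not from the paper); ρ = 0.95 and β = 1 follow the values used in Section V.

def update_imprecision(delta_old, d_new, d_old, rho=0.95, beta=1.0):
    # Eq. (1): new maximal variation of the mean end-to-end delay.
    # rho forgets the old variation; the |d_new - d_old| term widens the
    # range when the announced delay has moved a lot since the last update.
    return rho * delta_old + (1.0 - rho) * beta * abs(d_new - d_old)

# Example: the announced delay moves from 120 ms to 150 ms.
delta = update_imprecision(delta_old=10.0, d_new=150.0, d_old=120.0)
# delta = 0.95*10 + 0.05*1*30 = 11.0, so the delay is advertised as 150 ± 11 ms.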

B. POMDP framework

Let the observation space of the discretised end-to-end delay between node s and node d be denoted as O_sd = {q_D(1), ..., q_D(n)}, where q_D(m) is the m-th quantised interval on [0, ∞). Let X_sd be the state space of the discretised end-to-end delay between node s and node d such that X_sd = O_sd. The action space is defined as A = {0, ..., M_max}, where M_max is the maximum allowable number of tickets. The state b ∈ B = {b : b ∈ [0, 1]^n, Σ_{∀m∈X_sd} b(m) = 1} is the probability that node s believes the actual end-to-end delay, D_s(d), falls in each quantised interval. The distribution b_k captures the degree of uncertainty node s has about D_s(d) at time k. If a new imprecise value D_s^new(d) is available at node s and D_s^new(d) ≈ D_s^old(d), node s becomes more certain that the actual delay, D_s(d), is near D_s^new(d), and therefore updates b_{k+1} following

b_{k+1}(m) = { b_k(m)ρ + (1 − ρ),  if D_s^new(d) ∈ q(m)
             { b_k(m)(1 − ρ),      otherwise                     (2)

where q(m) is the m-th quantised interval of end-to-end delay, ρ ∈ (0, 1) is the forgetting factor which determines how fast b_k(m) is forgotten, and (1 − ρ) determines how fast b_k(m) converges to 1.
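
As an illustration, (2) acts directly on a belief vector. The sketch below is ours (the names are hypothetical, not from the paper); with ρ = 0.5, as used in Section V, the update also happens to preserve the normalisation of b, and repeated observations falling in the same interval drive b(m) towards 1.

def update_belief(belief, m_obs, rho=0.5):
    # Eq. (2): shift (1 - rho) of the confidence onto the interval that the
    # newly announced delay fell into, and decay all other entries.
    return [b * rho + (1.0 - rho) if m == m_obs else b * (1.0 - rho)
            for m, b in enumerate(belief)]

# Example: start from a uniform belief over 26 intervals; a new announced
# delay falls in the interval with (0-based) index 3, i.e. [30, 40) ms.
b = [1.0 / 26] * 26
b = update_belief(b, m_obs=3)
# b[3] grows towards 1 if successive observations keep landing in that interval.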

Suppose now that at time k there is a connection request from node s to node d with mean end-to-end delay requirement D_req(j), where 1 ≤ j ≤ K and K is the number of delay-constrained service types in the network. Depending on b_k, node s takes an action a_k ∈ A, where a_k = M0 refers to the action in which node s issues M0 tickets. If a_k = M0 > 0, the tickets are distributed in the same way as in [3], [4] and [11].

The reward g(·, a_k) generated by selecting a_k is given by

g(·, a_k) = {  ζ_j − log a_k,     if a_k > 0 and δ > 0
            { −(ζ_j − log a_k),   if a_k > 0 and δ = 0          (3)
            {  0,                 if a_k = 0 (connection rejected)

where δ is the number of feasible paths found and ζ_j ∈ R+ is the immediate reward parameter for service type j, 1 ≤ j ≤ K. The logic of the above reward scheme is straightforward: the more tickets issued at the source node, the more likely it is that feasible path(s) can be found, with the trade-off of introducing more message overhead into the network. Therefore, the obtained reward is smaller for large values of a_k; issuing tickets too economically, however, reduces the chances of finding feasible path(s). If no feasible path is found (i.e., δ = 0), the action is penalised.
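
The reward in (3) translates directly into code. This is our sketch with hypothetical names (ζ_j is passed in as zeta); note that if ζ_j > log M_max, the success reward stays positive and the failure penalty stays negative for every admissible a_k.

import math

def reward(a_k, num_feasible, zeta):
    # Eq. (3): a_k tickets issued, num_feasible = delta feasible paths found.
    if a_k == 0:
        return 0.0                        # connection rejected
    if num_feasible > 0:
        return zeta - math.log(a_k)       # success: reward shrinks as more tickets are used
    return -(zeta - math.log(a_k))        # failure: penalty, harshest when few tickets were issued

# Example: 20 tickets issued, one feasible path found, zeta_j = 5:
# reward(20, 1, 5.0) = 5 - log(20) ≈ 2.0.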

Therefore, for every action a_k taken at state b_k, a reward (3) is generated. Suppose that a_k follows a randomised control policy for M0 selection, given by a mapping µ_θ : B × Ω → P_A, where P_A is the set of probability distributions over A and µ_θ belongs to a family of policies {µ_θ : θ ∈ R^M} parameterised by some vector θ. For ω ∈ Ω = {0, 1, ..., K}, where K is the number of delay-constrained service types in the network, ω = j with j > 0 corresponds to a connection request from node s to node d with D_req(j), and ω = 0 signifies the end of an episode. An episode ends when information updating is (periodically) performed in the MANET. Hence, an episode terminates eventually regardless of the actions taken within it. The objective is then to find a randomised policy µ_θ* that maximises the expected accumulated reward until the end of the episode:

V_µθ*(b) = max_µθ E_µθ [ Σ_{k=0}^{N−1} g(b_k, a_k) | b_0 = b ],   ∀ b ∈ B.     (4)

IV. THE PROPOSED METHOD

This section describes the actor-critic algorithm with the belief state concept (ACBS) [10], which is employed to obtain the optimal randomised policy µ_θ* in (4). In this algorithm, γ_t and η_t are small stepsize parameters; ψ_θ(w, a) is the feature vector of the state-action pair (w, a) for the actor, which depends on θ, with w ∈ B × Ω; φ_θ(w, a) is the feature vector of the state-action pair (w, a) for the critic, which also depends on θ; r = [r(0), ..., r(K − 1)]^T is the parametric vector for the critic; Q^θ_r(w, a) = r^T φ_θ(w, a), so that ∇_r Q^θ_r(w, a) = φ_θ(w, a); and Γ(r_t) is a scalar that controls the stepsize of the actor. The algorithm is executed for each source-destination pair as follows.

1) Initialise b_0 ∈ B, θ_0 ∈ R^M and r_0, z_0 ∈ R^K.
2) for episode t = 1 to T do
3) for k = 0 to N_t − 1 do
   a) At time step k, node s sees the network in state b_k ∈ B, ω_k = j.
   b) If D_req(j) < D_s(d) − ΔD_s(d), set M0 = 0 and reject the connection request. Else, select M0 = a according to policy µ_θt(a | w_k).
   c) Get the immediate reward g(w_k, a_k) and the next observation o_{k+1} = D_s^new(d).
   d) Get the next state b_{k+1}: if D_s^new(d) ∈ q_D(m), then b_{k+1}(m) = b_k(m)ρ + (1 − ρ); else b_{k+1}(m) = b_k(m)(1 − ρ), where 0 < ρ < 1.
   e) d_k = g(w_{k−1}, a_{k−1}) + Q^θt_rt(w_k, a_k) − Q^θt_rt(w_{k−1}, a_{k−1})
   f) z_k = λ z_{k−1} + ∇_r Q^θt_rt(w_k, a_k)
4) end for
5) Perform the updates:
6) r_{t+1} = r_t + γ_t Σ_{k=0}^{Nt−1} d_k z_k
7) θ_{t+1} = θ_t + η_t Γ(r_t) Σ_{k=0}^{Nt−1} Q^θt_rt(w_k, a_k) ψ_θt(w_k, a_k)
8) end for
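
To make the control flow above concrete, the following self-contained Python sketch runs the ACBS loop against a toy stand-in for the path-search environment. It is our illustration, not the authors' code: the environment, the episode length, the success model, the L1 norm used for |r_t| in Γ(r_t), and the terminal TD step are assumptions; the belief update, softmax policy, compatible features (with C = 1, so φ = ψ), TD(1) traces and the stepsizes follow (2), (5)-(6) and Sections IV-V.

import math, random

N_INTERVALS = 26                              # quantised delay intervals (Sec. V)
ACTIONS = [1] + list(range(10, 101, 10))      # A = {1, 10, 20, ..., 100} tickets
M = len(ACTIONS)                              # one actor parameter per action

def update_belief(b, m_obs, rho=0.5):         # eq. (2)
    return [x * rho + (1 - rho) if m == m_obs else x * (1 - rho)
            for m, x in enumerate(b)]

def feat(b):                                  # scalar belief feature from (6)
    return 3 * sum(b[0:9]) + sum(b[9:18]) + 0.5

def policy(theta, b):                         # softmax policy (5)
    s = [th * feat(b) for th in theta]
    mx = max(s)                               # subtract max for numerical safety
    e = [math.exp(v - mx) for v in s]
    return [v / sum(e) for v in e]

def psi(theta, b, j_sel):                     # gradient of log policy; phi = psi
    p, f = policy(theta, b), feat(b)
    return [f * ((1.0 if j == j_sel else 0.0) - p[j]) for j in range(M)]

def env_step(m0):                             # toy path search: reward (3), zeta = 5
    found = random.random() < min(1.0, 0.03 * m0)
    g = (5.0 - math.log(m0)) if found else -(5.0 - math.log(m0))
    return g, random.randrange(N_INTERVALS)   # reward and observed interval index

theta, r = [0.0] * M, [0.0] * M               # actor and critic parameters
for t in range(1, 2001):                      # episodes
    b = [1.0 / N_INTERVALS] * N_INTERVALS
    z = [0.0] * M                             # eligibility trace, lambda = 1
    r_acc, th_acc = [0.0] * M, [0.0] * M      # running sums for lines 6) and 7)
    prev = None                               # (Q, g) from the previous step
    for k in range(30):                       # steps until the episode ends
        j = random.choices(range(M), weights=policy(theta, b))[0]
        ps = psi(theta, b, j)
        q = sum(ri * pi for ri, pi in zip(r, ps))       # Q_r(w_k, a_k)
        if prev is not None:
            d = prev[1] + q - prev[0]                   # TD error, line e)
            r_acc = [ra + d * zi for ra, zi in zip(r_acc, z)]
        z = [zi + pi for zi, pi in zip(z, ps)]          # line f), lambda = 1
        th_acc = [ta + q * pi for ta, pi in zip(th_acc, ps)]
        g, m_obs = env_step(ACTIONS[j])
        prev = (q, g)
        b = update_belief(b, m_obs)
    d = prev[1] - prev[0]                     # terminal TD error, Q = 0 assumed
    r_acc = [ra + d * zi for ra, zi in zip(r_acc, z)]
    gamma_t = 0.01 / (1 + t / 10000.0)        # critic stepsize (Sec. V)
    eta_t = gamma_t / 10.0                    # actor stepsize (Sec. V)
    gam = 1.5 / (1 + sum(abs(x) for x in r))  # Gamma(r_t), L1 norm assumed
    r = [ri + gamma_t * ra for ri, ra in zip(r, r_acc)]
    theta = [th + eta_t * gam * ta for th, ta in zip(theta, th_acc)]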

V. NUMERICAL RESULTS

We consider a MANET of 36 nodes placed in a 15 × 15 square metre area. The topology of the MANET is randomly generated by a random waypoint mobility model. We assume that a MAC protocol to resolve contention and support resource reservation is used, and that the battery power remains constant, so that the transmission range is constant and the connectivity between nodes is lost only when a node has wandered too far away, not because of low battery power. The end-to-end information is updated periodically using a distance vector protocol. We also assume that there is a neighbour discovery protocol in which each node periodically identifies itself with its identifier, so that a node j is aware of its neighbours, and that only stable links [4] are considered, limiting the search space for a feasible path and minimising rerouting when the network topology changes. Each actual mean link delay is uniformly distributed in [0, 50] ms. Each announced mean link delay D̃_ij is subject to imprecision, so that it is uniformly distributed in the range D̃_ij ∈ [D_ij − ∆_ij, D_ij + ∆_ij], where ∆_ij = ξ_imp D_ij and ξ_imp is the imprecision rate, i.e., the largest percentage of deviation allowed between the actual and advertised link delay, ξ_imp = max |D̃_ij − D_ij| / D_ij. The parameters ρ and β in (1) are 0.95 and 1, and the parameter ρ in (2) is 0.5. The maximum hop count allowed in a path is 10.

All algorithms are compared using the following metrics: the accumulated reward (AV), the success ratio (SR) and the average number of search messages (ASM):

AV = (Σ_t accumulated reward in episode t) / (total number of episodes)

SR = (total number of accepted connections) / (total number of connection requests)

ASM = (total number of search messages sent) / (total number of connection requests)

Simulations are run for four algorithms, namely: a flooding scheme (FLO); the original TBP scheme based on heuristics (TBP); the proposed TBP scheme based on the modified actor-critic method with the belief state concept (ACBS); and the TBP scheme based on another RL-based method, the on-policy first-visit Monte Carlo method (ONMC) [12]. All algorithms employ an M_max of 100 tickets. Since all four algorithms exchange distance vectors in the same way as in [11], the only overhead measured is the number of search messages generated by the feasible path search. The FLO scheme implemented here is in fact a TBP scheme which constantly issues M_max tickets for all types of delay-constrained services and has the same setting as in [11].

For the ACBS scheme, the mean end-to-end delay (in ms) between each source-destination pair is quantised into the intervals q_D(m) ∈ X_sd = {[0, 10), ..., [240, 250), [250, ∞)} for m = 1, ..., 26. The action set (when the connection request is not rejected) is given by M0 ∈ A = {1, 10, 20, ..., 100}. The actor's parameter vector θ is chosen such that one parameter θ(j), j = 1, ..., 11, is associated with one action a ∈ A. Hence, for an episode t, M0 tickets (i.e., action a) are selected with probability

µ_θt(a | b) = exp(s_θ(a)) / Σ_{∀u∈A} exp(s_θ(u))     (5)

where s_θ(a) is defined for each action as

s_θ(a) = θ(j) ( 3 Σ_{m=1}^{9} b(m) + Σ_{m=10}^{18} b(m) + 0.5 ).     (6)

The parameterised policy structure in (5) is selected as it defines the probability of selecting an action as a continuous, differentiable function of θ [9]. This policy structure also ensures that, for any b_k, every action is selected with positive probability. The terms in the parentheses in (6) map b_k ∈ B → R+ and hence facilitate the computation of the parameterised policy µ_θt as a function of a scalar rather than of the vector b_k.
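
As a quick worked check of (5)-(6) (our illustration, with arbitrary θ values): for the uniform belief, the term in parentheses in (6) is 3(9/26) + 9/26 + 0.5 ≈ 1.885, and the softmax in (5) then yields a strictly positive probability for every action.

import math
b = [1.0 / 26] * 26                           # uniform belief over the 26 intervals
f = 3 * sum(b[:9]) + sum(b[9:18]) + 0.5       # term in parentheses in (6): ~1.885
theta = [0.1 * j for j in range(11)]          # illustrative actor parameters
s = [th * f for th in theta]                  # s_theta(a) for each of the 11 actions
Z = sum(math.exp(v) for v in s)
mu = [math.exp(v) / Z for v in s]             # policy (5)
assert all(p > 0 for p in mu) and abs(sum(mu) - 1.0) < 1e-9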

The stepsizes used in the ACBS scheme are γ_t = 0.01 / (1 + t/10000), η_t = γ_t / 10 and Γ(r_t) = 0.5 (1/(1 + |r_t|) + 2/(1 + |r_t|)), whereas the stepsize used in the ONMC scheme is 1/t, where t refers to the t-th episode. The critic r_t (in line 6 of the ACBS algorithm) is updated using the TD(1) method [12]; TD(1) is used since the considered problem is episodic. From [9], it is shown for the TD(1) method that φ^θ_j = C ψ^θ_j is sufficient to produce an unbiased gradient estimate, where C is some positive scalar and

ψ^θ_j(·, a) = (∂µ_θ(a | ·) / ∂θ(j)) / µ_θ(a | ·),   j = 0, ..., M − 1.

The ACBS and ONMC schemes are trained for 4 × 10^6 connection requests under a 30-second distance vector update interval. Once the parameters are obtained, their performance is evaluated and compared with the TBP and FLO schemes; all schemes are evaluated using a simulation run of 1 × 10^6 connection requests.

A. Accumulated reward per episode

Figures 1 and 2 compare the accumulated reward per episode of the four algorithms as a function of the mean end-to-end delay requirement and the imprecision rate. The pause time is 60 s and the update interval is 30 s. It can be seen that the ACBS and ONMC methods consistently outperform the other two schemes. When the delay requirement is relaxed (interval 210−240 ms), the ACBS scheme outperforms the ONMC and FLO schemes. We also note a degradation of performance when the imprecision rate is changed from 0.1 to 0.5, consistent with the fact that it becomes more difficult to discover feasible paths as the imprecision rate increases.

B. Success ratio

From Figs. 3 and 4 it can be seen that ACBS and ONMC produce similar success ratios, in accordance with the previous results. The success ratio increases as the mean end-to-end delay requirement becomes less stringent, since feasible paths become easier to find. As expected, the success ratio at the 0.1 imprecision rate is higher than that observed at the 0.5 imprecision rate for all algorithms.


Fig. 1. Accumulated reward per episode with 0.1 imprecision rate.


Fig. 2. Accumulated reward per episode with 0.5 imprecision rate.

Note also that the success ratios of the ONMC and ACBS schemes remain close to that of the FLO scheme even when the imprecision rate is as high as 0.5. This happens even though the ONMC and ACBS schemes generate a much lower average number of search messages than the FLO scheme.

C. Average number of search messages

From Figs. 5 and 6 we can see that, even though the TBP scheme generates the smallest average number of search messages, this comes at the expense of a low success ratio and accumulated reward per episode. The ONMC and ACBS schemes produce an average number of search messages between those of the FLO and TBP schemes in both the 0.1 and 0.5 imprecision rate cases. Note also that ACBS produces, on aggregate, fourfold fewer search messages than ONMC.



Fig. 3. Success ratio with 0.1 imprecision rate.


Fig. 4. Success ratio with 0.5 imprecision rate.

The average number of search messages increases as the end-to-end delay requirement becomes less restrictive, due to the increased number of nodes involved in the path search. Note, however, that in the 210−240 ms mean end-to-end delay interval a reduction in the average number of search messages is observed for the ACBS scheme. This means that the propagation of ticket distribution has been contained despite the relaxation of the mean end-to-end delay requirement, implying a reduction in the number of tickets issued by the source node. This is due to the fact that, as feasible paths become easier to discover, less penalty and more reward are received when fewer tickets are issued. Hence, the ACBS scheme can learn better than the ONMC scheme through such reward signals to significantly reduce the number of issued tickets at less stringent mean end-to-end delay requirements.


Fig. 5. Average number of search messages with 0.1 imprecision rate.


Fig. 6. Average number of search messages with 0.5 imprecision rate.

D. Extension to energy-efficient routing

The QoS routing algorithm with imprecise information proposed in this paper attempts to reduce message overhead so as to reduce power utilisation. However, we can extend the algorithm further to consider mechanisms that explicitly handle energy-efficient routing. Since ad hoc networks are often multi-hop in nature and nodes are generally battery-constrained, certain nodes which are critical and centrally located may be overused and their energy reserves quickly exhausted. These nodes will suffer from early failure, which will lead to a reduction of the overall network lifetime. Hence, to avoid draining such nodes, it is important to consider the residual battery capacity as well. However, due to the decentralised operation of mobile ad hoc networks, maintaining up-to-date residual energy information at all the nodes in the network for route calculations is inherently difficult. As a result, node energy information may be inaccurate [13]. The imprecision model employed in this paper may be applied to capture such information uncertainty. Furthermore, it is important not to isolate energy-efficient cost metrics from other aspects of ad hoc routing such as, for instance, QoS considerations for delay-sensitive traffic. In the simplest case, the metric in (3) may be modified to incorporate energy-efficient cost metrics in order to search for energy-aware routes as well as feasible delay-constrained routes in the path discovery phase. Furthermore, apart from maximising the overall network lifetime, energy-efficient routing is also concerned with the objective of minimising the overall energy consumption. Unfortunately, these two objectives contradict each other [16]: to maximise the network lifetime, routes must be distributed equally across several nodes, but this cannot be achieved easily when the overall energy consumption across the network must also be minimised. Routing schemes which attempt to strike a balance between these two contrasting objectives already exist in the literature [14], [15], [16]. Research on these issues and performance comparisons are currently under investigation.

VI. CONCLUSION

In this paper, a reinforcement learning (RL) method, namely the actor-critic method with the belief state concept (ACBS), has been applied to support QoS routing at the network level in a MANET. The focus is on the feasible path discovery task in the ticket-based probing (TBP) scheme, where a trade-off exists between the amount of message overhead and the probability of finding the optimal feasible path.

The TBP scheme based on the ACBS method relies on maintaining approximate belief states of the environment, which are used as an indicator of the network state uncertainty. Training for this approach can be performed on-line; however, off-line training is implemented in this paper for the purpose of policy evaluation under fixed trained parameters.

Simulation results show that the TBP scheme based on the ACBS method can achieve good ticket-issuing policies, in terms of the accumulated reward per episode, when compared to the original heuristic TBP scheme and the flooding-based TBP scheme. The numerical results reported here suggest that the ACBS scheme can efficiently control the amount of flooding in the network, obtaining up to a 63% reduction in the average number of search messages compared to the flooding-based TBP scheme with only a 3% reduction in success ratio. The ACBS scheme can also attain up to a 24% higher success ratio than the original heuristic TBP scheme.

ACKNOWLEDGMENT

This research has been supported by the Thailand Research Fund under contract MRG4880149.

REFERENCES

[1] K. Chen, S. H. Shah, K. Nahrstedt, "Cross-layer design for data accessibility in mobile ad hoc networks," Journal of Wireless Personal Communications, Vol. 21, 2002, pp. 49-76.

[2] R. Ramanathan, M. Steenstrup, "Hierarchically-organized, multihop mobile wireless networks for quality-of-service support," Mobile Networks and Applications, Vol. 3, 1998, pp. 101-119.

[3] S. Chen, Routing Support for Providing Guaranteed End-to-End Quality-of-Service, PhD Thesis, University of Illinois at Urbana-Champaign, IL, 1999.

[4] S. Chen, K. Nahrstedt, "Distributed quality-of-service routing in ad-hoc networks," IEEE Journal on Selected Areas in Communications, Vol. 17, No. 8, August 1999, pp. 1488-1505.

[5] R. Sun, S. Tatsumi, G. Zhao, "Application of multiagent reinforcement learning to multicast routing in wireless ad hoc networks ensuring resource reservation," Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2002.

[6] L. Peshkin, V. Savova, "Reinforcement learning for adaptive routing," Proceedings of the International Joint Conference on Neural Networks, 2002.

[7] E. Gelenbe, "Self-aware network and QoS," Proceedings of ISCIS XVIII, Lecture Notes in Computer Science LNCS 2869, pp. 1-14, Springer Verlag, Berlin, 2003.

[8] N. Tao, J. Baxter, L. Weaver, "A multi-agent, policy-gradient approach to network routing," Proceedings of the 18th International Machine Learning Conference, 2001.

[9] V. R. Konda, Actor-Critic Algorithms, PhD Thesis, Massachusetts Institute of Technology, MA, 2002.

[10] W. Usaha, Resource Allocation in Networks with Dynamic Topology, PhD Thesis, Imperial College London, U.K., 2004.

[11] W. Usaha, J. A. Barria, "A reinforcement learning Ticket-Based Probing path discovery scheme for MANETs," Ad Hoc Networks Journal, Vol. 2, 2004, pp. 319-334.

[12] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, Massachusetts: The MIT Press, 1998.

[13] S. Basagni, M. Conti, S. Giordano, I. Stojmenovic, Mobile Ad Hoc Networking, New Jersey: IEEE Press, 2004.

[14] C.-K. Toh, "Maximum battery life routing to support ubiquitous mobile computing in wireless ad hoc networks," IEEE Communications Magazine, Vol. 39, No. 6, 2001.

[15] A. Mahimkar, R. K. Shyamasundar, "S-MECRA: A secure energy-efficient routing protocol for wireless ad hoc networks," Proceedings of the IEEE Vehicular Technology Conference, 2004.

[16] N. Meghanathan, "On-demand maximum battery life routing with power sensitive power control in ad hoc networks," Proceedings of the International Conference on Networking, International Conference on Systems and International Conference on Mobile Communications and Learning Technologies, 2006.
