Reward schemes are named by reward function and discounting, e.g. APWD = Action Penalty With Discounting, GRWD = Goal Reward With Discounting.

Discounting: γ = 1 (no discounting), 0 ≤ γ < 1 (with discounting)

Reward function     Step    Goal
Goal Reward          0       1
Action Penalty      -1       0
APWD and GRWD:
- Expand the same states
- Find the same answer
- Take the same time
Why?
- The algorithm uses rewards only in the Set Policy step
- The tie breaker chooses the policy with the greatest expected total reward
- APWD's and GRWD's expected total rewards are linear transformations of each other
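The linear-transformation claim can be checked numerically. Under one reasonable convention (an assumption on my part: a policy takes n actions, the n-th enters the goal, and the k-th action's reward is discounted by γ^(k-1)), the goal-reward value is γ^(n-1) and the action-penalty value is (V_GR - 1) / (1 - γ), a positive-slope linear transformation, so the two schemes rank policies identically. A minimal sketch:

```python
# Check: APWD and GRWD values are linear transformations of each other.
# Convention (assumed, not from the poster): a policy takes n actions,
# the n-th enters the goal; the k-th action's reward is discounted by
# gamma**(k-1).

def grwd_value(n, gamma):
    """Goal Reward With Discounting: 0 per step, 1 on the goal action."""
    return gamma ** (n - 1)

def apwd_value(n, gamma):
    """Action Penalty With Discounting: -1 per non-goal action, 0 at goal."""
    return -sum(gamma ** (k - 1) for k in range(1, n))

gamma = 0.9
for n in (1, 2, 5, 10):
    v_gr = grwd_value(n, gamma)
    v_ap = apwd_value(n, gamma)
    # Linear transformation: V_AP = (V_GR - 1) / (1 - gamma)
    assert abs(v_ap - (v_gr - 1) / (1 - gamma)) < 1e-12
```

Because the slope 1 / (1 - γ) is positive, any tie breaker maximizing expected total reward makes the same choice under either scheme.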
GRND yields the worst possible performance: undirected, infinite wandering
APND and GRWD/APWD are not comparable:
- Domains exist where each performs better
- GRWD/APWD makes different decisions depending on the discount factor
[Figures: example domains contrasting GRWD's choice with APND's choice]
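The discount-dependent behavior can be illustrated with a hypothetical pair of policies (the numbers below are my own, not from the poster): policy A reaches the goal in 1 step with probability 0.5 and otherwise wanders forever; policy B reaches it in 3 steps with certainty. GRWD values them at 0.5 and γ², so its preference flips as γ changes, while APND's comparison of expected steps does not depend on γ:

```python
# Hypothetical domain (my own numbers): policy A reaches the goal in 1 step
# with probability 0.5 (value 0 otherwise); policy B reaches it in 3 steps
# with certainty. GRWD's preference flips with the discount factor gamma.

def grwd(p_goal, n, gamma):
    """Expected discounted goal reward: p_goal * gamma**(n-1)."""
    return p_goal * gamma ** (n - 1)

for gamma in (0.9, 0.5):
    v_a = grwd(0.5, 1, gamma)   # always 0.5, independent of gamma
    v_b = grwd(1.0, 3, gamma)   # gamma**2
    choice = "B" if v_b > v_a else "A"
    print(f"gamma={gamma}: A={v_a}, B={v_b:.2f} -> GRWD chooses {choice}")
```

With γ = 0.9, GRWD prefers B (0.81 > 0.5); with γ = 0.5 it prefers A (0.25 < 0.5), so the same domain produces different decisions as the discount factor varies.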
Take the First Right and Go Straight Forever: Novel Planning Algorithms in Stochastic Infinite Domains
Judah Schvimer, Advisor: Prof. Michael Littman
[Figure: example card-domain actions: Draw a 9 of Diamonds; Draw an Ace of Spades; 5 of Hearts on 6 of Spades]
Termination
✓ Optimal policy is finite
✓ Actions transition to a finite number of states
✓ The greatest probability of reaching the goal is 1*
✓ The reward scheme causes states within a finite number of steps of the start to have greater values than states an infinite number of steps away from the start
1 Set Policy: Choose the policy with the greatest probability of reaching the goal using a standard planning algorithm, assuming optimistically that unexplored states are goal states.
   1.1 If there is a tie, choose the policy with the greatest expected total reward.
   1.2 If there is still a tie, choose a policy arbitrarily, though consistently.
2 Short Circuiting (optional): If the policy's pessimistic estimate for the probability of reaching the goal is better than the best optimistic estimate from a different first action, go to Step 6 and return only the optimal first action.
3 Termination: If there are no more fringe states in the current policy, go to Step 6; otherwise return to Step 1.
4 Choose Expansion State: Among all fringe states, choose the one reached with the greatest probability.
   4.1 If there is a tie, choose one state arbitrarily, though consistently.
5 Expand: Expand the chosen fringe state by seeing where its actions transition and adding those states to the MDP; go to Step 1.
6 Policy Choice: Return the last expanded policy.
Modified Breadth First Search
✓ Uses short-circuiting termination
✓ Guaranteed to find the optimal policy, but not guaranteed to terminate when it does unless the greatest probability of reaching the goal equals 1
✓ Neither this nor the Probabilistic Search Algorithm finds the policy with the fewest expected steps
Reward Functions
Probabilistic Heuristic Search Algorithm
[Figures: GRWD Expands Fewer States; APND Expands Fewer States; GRND Doesn't Terminate]