The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems
2/1
Reinforcement Learning Study Group
ๅคง้ฒ ่ช ๅบ
Abstract & 1. Introduction
• How does coordination emerge in situations with multiple players? Model it with reinforcement learning (RL)
• Two cases: agents that are unaware of the other players' existence, and agents that explicitly learn joint-action values and the opponents' strategies
• Investigate how the structure of the game and the exploration strategy affect convergence to Nash equilibria
• Propose "optimistic" exploration strategies that increase the likelihood of converging to an optimal equilibrium
2. Preliminary Concepts and Notation
2.1 Single Stage Games
• An N-player repeated coordination game
• Treated as a distributed bandit problem
Formulation of the model:
cooperative since each agent's reward is drawn from the same distribution
Definitions of strategies and equilibria
Example:
        a0   a1
  b0     x    0
  b1     0    y
  (x > y > 0)

• Nash equilibria: (a0, b0) and (a1, b1); Optimal: (a0, b0)
• Nash equilibrium: no player has an incentive to deviate unilaterally
• Optimal: the joint action with the highest payoff for both players
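The single-stage cooperative game above can be sketched as a tiny Python model (the concrete values X and Y are hypothetical stand-ins for x > y > 0):

```python
# Two-player single-stage coordination game, treated as a distributed
# bandit: each agent picks an action without seeing the other's choice,
# and both receive the SAME reward (this is what makes it cooperative).
X, Y = 10.0, 5.0  # hypothetical payoffs satisfying x > y > 0

PAYOFF = {
    ("a0", "b0"): X, ("a0", "b1"): 0.0,
    ("a1", "b0"): 0.0, ("a1", "b1"): Y,
}

def play(action_a, action_b):
    """Return the common reward for a joint action."""
    return PAYOFF[(action_a, action_b)]

# (a0, b0) and (a1, b1) are both Nash equilibria (neither agent gains by
# deviating alone), but only (a0, b0) is optimal since x > y.
```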
2. Preliminary Concepts and Notation
2.2 Learning in Coordination Games
• When multiple optimal equilibria exist, action selection becomes difficult
• Coordinated behavior can be learned from the outcomes of repeated play between the same players
• A learning model that guarantees convergence: fictitious play
• Belief: the opponent chooses its next action with probability equal to the empirical frequency of its past actions

        a0   a1
  b0     x    0
  b1     0    y
  (x = y > 0)

• Nash equilibria: (a0, b0) and (a1, b1); with x = y, both are optimal
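The fictitious-play belief described above can be sketched as follows (the class and method names are my own, purely illustrative):

```python
from collections import Counter

class FictitiousPlayer:
    """Believes the opponent will play each action with probability equal
    to the empirical frequency of that action in past rounds, and plays a
    best response to that belief."""

    def __init__(self, own_actions, opp_actions, payoff):
        self.own_actions = own_actions
        self.opp_actions = opp_actions
        self.payoff = payoff                 # payoff[(own, opp)] -> reward
        self.opp_counts = Counter()          # observed opponent actions

    def expected_payoff(self, own):
        total = sum(self.opp_counts.values())
        if total == 0:
            return 0.0                       # no observations yet
        return sum(self.payoff[(own, o)] * self.opp_counts[o] / total
                   for o in self.opp_actions)

    def choose(self):
        return max(self.own_actions, key=self.expected_payoff)

    def observe(self, opp_action):
        self.opp_counts[opp_action] += 1
```

Once the opponent has mostly played b0, the belief makes a0 the best response, so repeated play can lock both agents onto one of the equal-payoff equilibria.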
2. Preliminary Concepts and Notation
2.3 Reinforcement Learning
• When the reward attached to each joint action is unknown, reinforcement learning can be used to infer it from past experience
• Stateless Q-learning (Q-learning, or rather a more basic stochastic approximation technique)
Exploitation vs. Exploration:
• To keep some probability of taking nonoptimal actions, use e.g. Boltzmann exploration:
action a is chosen with probability Pr(a) = e^{Q(a)/T} / Σ_{a'} e^{Q(a')/T}
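The Boltzmann rule above can be written directly in a few lines (the temperature T controls how greedy the choice is):

```python
import math
import random

def boltzmann_choice(q_values, temperature):
    """Pick action a with probability exp(Q(a)/T) / sum_a' exp(Q(a')/T).
    High T -> near-uniform exploration; low T -> near-greedy exploitation."""
    actions = list(q_values)
    weights = [math.exp(q_values[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

Annealing the temperature downward over time shifts the agent from exploration toward exploitation.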
2.3 Reinforcement Learning (continued)
• In general, multiagent Q-learning (in a nonstationary environment) is difficult:
convergence of the Q-values is not guaranteed
• Approaches that make Q-learning applicable to the multiagent setting: the MARL (IL) algorithm and JAL
2.3 Reinforcement Learning (continued)
• MARL, i.e. the Independent Learner (IL)
• Each player runs Q-learning on its own action and reward, treating each experience as <a_i, r>
2.3 Reinforcement Learning (continued)
• Joint Action Learner (JAL)
• Learns from its own and the opponent's actions, treating each experience as <a, r> (a is the joint action)
• Assumes the opponent acts according to the agent's current belief
2.3 Reinforcement Learning (continued)
The agent's experience:
• Independent Learner (IL): <a_i, r>
• Joint Action Learner (JAL): <a, r>
Both are special cases of the partially observable model <a_i, a, r>.
When A's action set is {a0, a1} and B's action set is {b0, b1}, A's Q-values are:
• Independent Learner (IL): the 2 values Q(a0), Q(a1)
• Joint Action Learner (JAL): the 4 values Q(<a0, b0>), Q(<a0, b1>), Q(<a1, b0>), Q(<a1, b1>)
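The difference in what the two learners maintain can be made concrete as a sketch of the two Q-table shapes (not the paper's code):

```python
from itertools import product

a_actions = ["a0", "a1"]   # A's action set
b_actions = ["b0", "b1"]   # B's action set

# Independent Learner: one Q-value per OWN action -> 2 entries for A.
il_q = {a: 0.0 for a in a_actions}

# Joint Action Learner: one Q-value per JOINT action -> 4 entries.
jal_q = {(a, b): 0.0 for a, b in product(a_actions, b_actions)}
```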
3. Comparing Independent and Joint-Action Learners

        a0   a1
  b0    10    0
  b1     0   10

• Compare IL and JAL on this game (both with Boltzmann exploration)
• JAL converges slightly faster
• "JAL behaves much as if both agents were running IL, and the outcome depends on the exploration strategy"
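A minimal simulation of two Independent Learners with Boltzmann exploration on this game (the learning rate, temperature, and step count are arbitrary choices of mine, not the paper's settings):

```python
import math
import random

random.seed(0)

PAYOFF = {("a0", "b0"): 10.0, ("a0", "b1"): 0.0,
          ("a1", "b0"): 0.0, ("a1", "b1"): 10.0}

def boltzmann(q, temp):
    acts = list(q)
    weights = [math.exp(q[a] / temp) for a in acts]
    return random.choices(acts, weights=weights)[0]

q_a = {"a0": 0.0, "a1": 0.0}
q_b = {"b0": 0.0, "b1": 0.0}
alpha, temp = 0.1, 0.5

for _ in range(2000):
    act_a = boltzmann(q_a, temp)
    act_b = boltzmann(q_b, temp)
    reward = PAYOFF[(act_a, act_b)]
    # Stateless Q-update; each IL folds the other agent into the
    # environment, which is why that environment looks nonstationary.
    q_a[act_a] += alpha * (reward - q_a[act_a])
    q_b[act_b] += alpha * (reward - q_b[act_b])

# The pair typically locks into one of the two equal-valued equilibria,
# (a0, b0) or (a1, b1); which one depends on the exploration history.
```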
4. Convergence and Game Structure

Penalty Game (k ≤ 0):

        a0   a1   a2
  b0    10    0    k
  b1     0    2    0
  b2     k    0   10

• What happens when the structure of the game is more complex?
• Which equilibrium does learning converge to?
• Nash equilibria: (a0, b0), (a1, b1), and (a2, b2)
• Optimal: (a0, b0) and (a2, b2)
4. Convergence and Game Structure
• What if k = -100?

        a0    a1    a2
  b0    10     0  -100
  b1     0     2     0
  b2  -100     0    10

An equilibrium that is not optimal gets selected.
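The pull toward the safe, non-optimal equilibrium can be seen with a quick expected-payoff calculation: while the other agent is still exploring roughly uniformly, the actions adjacent to the penalty cells look terrible.

```python
K = -100.0
PAYOFF = {
    ("a0", "b0"): 10.0, ("a0", "b1"): 0.0, ("a0", "b2"): K,
    ("a1", "b0"): 0.0,  ("a1", "b1"): 2.0, ("a1", "b2"): 0.0,
    ("a2", "b0"): K,    ("a2", "b1"): 0.0, ("a2", "b2"): 10.0,
}
B_ACTIONS = ["b0", "b1", "b2"]

def ev_vs_uniform(own):
    """Expected payoff of an own action if the opponent mixes uniformly
    (a crude model of an opponent that is still exploring)."""
    return sum(PAYOFF[(own, b)] for b in B_ACTIONS) / len(B_ACTIONS)

# a0 and a2 average (10 + 0 - 100) / 3 = -30, while a1 averages 2/3, so
# early learning drives both agents toward the non-optimal (a1, b1).
```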
4. Convergence and Game Structure
• How convergence responds as k is varied
4. Convergence and Game Structure (continued)

Climbing Game:

        a0   a1   a2
  b0    11  -30    0
  b1   -30    7    6
  b2     0    0    5
4. Convergence and Game Structure (continued)
• Requirements for convergence
5. Biasing Exploration Strategies for Optimality
• MARL (IL) does not guarantee convergence to the optimal equilibrium
• With JAL, adopting "optimistic" exploration strategies (myopic heuristics) can raise the likelihood of converging to the optimal equilibrium
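One optimistic heuristic in the spirit of the paper can be sketched like this: score each own action by the best joint-action Q-value it appears in, instead of by its expected value under the current belief. (The function name and this exact scoring rule are my simplification; the paper studies several such myopic heuristics.)

```python
import math
import random

def optimistic_boltzmann(joint_q, own_actions, temperature):
    """Boltzmann choice over OPTIMISTIC scores: each own action is rated
    by the maximum Q-value of any joint action containing it."""
    score = {a: max(q for (own, _), q in joint_q.items() if own == a)
             for a in own_actions}
    actions = list(score)
    weights = [math.exp(score[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]

# In the penalty game this rates a0 by Q(a0, b0) = 10 rather than by a
# belief-weighted average dragged down by the k penalty, keeping the
# optimal equilibrium attractive during exploration.
```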
6. Concluding Remarks
• With multiple players, Q-learning (i.e., its convergence to an optimal policy) is not as robust as in the single-player case
• In complex game-like situations, new exploration methods that bias exploration using experience are effective
• Future directions:
• Applying Q-learning to sequential problems with multiple states
• Generalizing to larger state and action sets (especially when the number of states grows exponentially in the number of players)
• Applying the approach to other settings where convergence of fictitious play is known (e.g., zero-sum games)