Simulation-based Approximate Policy Iteration for Generalized Semi-Markov Decision Processes

Emmanuel Rachelson (1), Patrick Fabiani (1), Frédérick Garcia (2)
(1) ONERA-DCSD, (2) INRA-BIA, Toulouse, France

EWRL08, July 3rd, 2008


Plan

1  Temporal Markov Problems: motivation and modeling
   Examples
   Problem features
   GSMDP

2  Solving large scale GSMDP: ATPI
   Basic ideas
   Introducing confidence
   The bigger picture


Temporal Markov Problems: motivation and modeling

Examples

Planning under uncertainty with time dependency.
→ planning to coordinate with an uncertain and non-stationary environment.


Should we open more lines?


Airplane taxiing management


Onboard planning for coordination


Adding or removing trains?


Subway problem: toy example
Some figures: 4 trains, 6 stations → 22 state variables, 9 actions; episodes of 12 hours with around 2000 steps.


Problem features

Main idea
Why is writing an MDP for the previous problems such a difficult task?
“Lots of things occur in parallel”:
concurrent phenomena
partially controllable dynamics


Typical features
Continuous time
Hybrid state spaces
Large state spaces
Total reward criteria
Long trajectories


How do we model all this?


GSMDP


GSMDP (Younes et al., 04)
One process conditioned by the choice of the action undertaken.

GSMP (Glynn, 89): several semi-Markov processes affecting the same state space.
→ ⟨S, E, A, P, F, r⟩

[Diagram: each state s has a set of active events Es, e.g. Es1 = {e2, e4, e5, a} and Es2 = {e2, e3, a}; the event that triggers first drives the transition, with kernels such as P(s'|s1, e4) and P(s'|s2, a).]
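To make the ⟨S, E, A, P, F, r⟩ tuple and the event race concrete, here is a minimal C++ sketch of a generative GSMDP model in the spirit of the definition above. It is not the GiSMoP interface: the State and Event types, the std::function-based kernels and the race-based step function are illustrative assumptions.

    // Minimal generative-model sketch of a GSMDP <S, E, A, P, F, r> (illustrative only).
    #include <functional>
    #include <random>
    #include <vector>

    struct State {
        std::vector<double> x;  // state variables (possibly hybrid, flattened here)
        double t = 0.0;         // current time
    };

    struct Event {
        std::function<double(const State&, std::mt19937&)> sampleDelay; // F: duration distribution
        std::function<State(const State&, std::mt19937&)> sampleNext;   // P: transition kernel
        std::function<bool(const State&)> isActive;                     // membership in the active set E_s
    };

    // One GSMDP transition under action a: the action is a controllable event that
    // competes with the exogenous events; the first event to trigger drives the transition.
    // (Simplification: all delays are resampled at every transition; in a true GSMP,
    // non-triggered clocks persist, which is one source of the non-Markov behaviour.)
    State step(const State& s, const Event& action, const std::vector<Event>& exogenous,
               const std::function<double(const State&, const State&)>& reward,
               double& r, std::mt19937& rng) {
        const Event* winner = &action;
        double soonest = action.sampleDelay(s, rng);
        for (const Event& e : exogenous) {
            if (!e.isActive(s)) continue;
            double d = e.sampleDelay(s, rng);
            if (d < soonest) { soonest = d; winner = &e; }
        }
        State next = winner->sampleNext(s, rng);
        next.t = s.t + soonest;   // sojourn time elapses
        r = reward(s, next);      // reward accumulated along the transition
        return next;
    }

Seen from the decision maker, the resulting process over (s, t) is in general not Markov, which is exactly the difficulty addressed next.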


Controlling GSMDP: non-Markov behaviour!
→ no guarantee of an optimal Markov policy.
(Younes et al., 04): approximate your model with phase-type (exponential) distributions.
Supplementary variables technique (Nilsen, 98).
Our approach: no hypothesis, simulation-based API.


Solving large scale GSMDP: ATPI

Contribution overview
General framework: API, simulation-based PI.
Our contribution: API as non-parametric statistical learning: classification (policy), regression (value function), density estimation (“I don't know” situations).

Three extensive uses of simulation:
Monte-Carlo sampling for the evaluation of V^π
Roll-out for the calculation of Q-values
Selection of the subset of states on which we perform policy improvement

Basic ideas

Simulation-based policy evaluation
Our hypothesis: we have a generative model of the process.
→ (Monte-Carlo) simulation-based policy evaluation.

Statistical learning
Simulating the policy ⇔ drawing a set of trajectories ⇔ a finite set of realisations of the random variable R^π(s).
We need to abstract (generalize) information from the samples and compactly store previous knowledge of V^π(s) = E(R^π(s)).
(nearest neighbours, SVR, kLASSO, LWPR)
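As a rough illustration of this Monte-Carlo evaluation step, the sketch below estimates V^π(s0) = E(R^π(s0)) by averaging the total reward of simulated trajectories. It reuses the hypothetical State/Event/step definitions sketched earlier; the Policy alias, the trajectory count and the horizon handling are assumptions, and the regression (SVR, LWPR, ...) that generalizes the samples is left abstract.

    // Monte-Carlo estimate of V^pi(s0) = E(R^pi(s0)) under a total-reward criterion:
    // simulate nTraj trajectories from s0 following pi and average the accumulated reward.
    // Sketch only; reuses the illustrative State/Event/step definitions above.
    #include <functional>
    #include <random>
    #include <vector>

    using Policy = std::function<Event(const State&)>;               // maps a state to the chosen action/event
    using Reward = std::function<double(const State&, const State&)>;

    struct Sample { State s; double ret; };                          // (state, observed return): a realisation of R^pi(s)

    double evaluatePolicy(const State& s0, const Policy& pi, const Reward& reward,
                          const std::vector<Event>& exogenous, double horizon,
                          int nTraj, std::mt19937& rng, std::vector<Sample>& samples) {
        double total = 0.0;
        for (int i = 0; i < nTraj; ++i) {
            State s = s0;
            double ret = 0.0;
            while (s.t < horizon) {
                double r = 0.0;
                s = step(s, pi(s), exogenous, reward, r, rng);       // one simulated GSMDP transition
                ret += r;                                            // total (undiscounted) reward criterion
            }
            samples.push_back({s0, ret});                            // kept for the regression of V^pi
            total += ret;
        }
        return total / nTraj;                                        // empirical mean of R^pi(s0)
    }

In ATPI the same simulated trajectories also provide the visited states on which the policy is later improved.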


Approximate Policy Iteration

Approximate evaluation: V^πn
One-step improvement: πn+1


In each visited state: 1-step rollout in order to find the best Q-value.
→ local improvements guided by the simulation of πn+1.
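Spelled out in standard API notation (consistent with the slides, not quoted from them), the improvement step over each visited state s is:

    Q^πn(s, a) ≈ (1/Na) Σ_{i=1..Na} [ r_i + Ṽ^πn(s'_i) ],   with (r_i, s'_i) drawn from the generative model
    πn+1(s) ∈ argmax_{a ∈ A} Q^πn(s, a)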


online-API, cont'd
Motivation: we don't want to / can't improve the policy everywhere:
too time/resource consuming
not useful with regard to the “relevant” information gathered
Useful? Interesting? Relevant?
→ “Improving the policy in the situations I am likely to encounter today.”
In other words: which subset of states for API? The ones visited by policy simulation!


First results
Initial version of online-ATPI with SVR. The initial policy sets trains to run all day long.
[Plot: initial state value vs. iteration number (0 to 14), “stat” and “SVR” curves.]


Introducing confidence

Is there anybody out there?
[Scatter plot of sample points in the (s, t) plane; a query state s0 is marked.]

Q(s0, a1) = ?

P(s', t'|s0, t0, a1)

Should I trust my regression? → What if it overestimates the true V^π(s)?

→ Define a notion of confidence.


Introducing confidence
“Confidence” ⇔ having enough points around s ⇔ approaching the sufficient statistics for V^π(s)
→ approximate measure: the pdf of the underlying process.

What should we do if we are not confident?
→ generate data – increase the samples' density – simulate.

Storing the policy?
Same problem for policy storage as for the value function: (Lagoudakis et al., 03) RL as classification.

Full statistical learning problem: (local incremental) regression (V^π), classification (π), density estimation (confidence).

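One simple way to realise such a confidence test is sketched below: a kernel density estimate over the stored sample states, with confidence(s) true when enough mass lies around s. This is an assumption for illustration only: the bandwidth and threshold are arbitrary, and the implementation mentioned in the perspectives relies on a one-class SVM (OC-SVM) rather than this plain KDE.

    // Density-based "confidence" test (illustrative sketch, not the OC-SVM used in practice):
    // trust the regression of V^pi at s only if the empirical density of previously
    // simulated sample states around s is high enough.
    #include <cmath>
    #include <vector>

    double kernelDensity(const std::vector<double>& s,
                         const std::vector<std::vector<double>>& samples,
                         double bandwidth) {
        if (samples.empty()) return 0.0;
        double sum = 0.0;
        for (const auto& p : samples) {           // assumes all points share s's dimension
            double d2 = 0.0;
            for (std::size_t k = 0; k < s.size(); ++k) {
                const double diff = s[k] - p[k];
                d2 += diff * diff;
            }
            sum += std::exp(-d2 / (2.0 * bandwidth * bandwidth));   // Gaussian kernel
        }
        return sum / static_cast<double>(samples.size());           // unnormalised KDE: enough for a threshold test
    }

    bool confidence(const std::vector<double>& s,
                    const std::vector<std::vector<double>>& samples,
                    double bandwidth, double threshold) {
        // "Enough points around s": estimated density above a chosen threshold.
        // When this fails, ATPI falls back to simulation to generate data around s.
        return kernelDensity(s, samples, bandwidth) >= threshold;
    }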


The bigger picture

simulation-based API
    samples ← ∅
    for i = 1 to Nsim do
        while t < horizon do
            estimate Q-values
            s' ← apply best action
            store (s, a, r, s') in samples
        end while
    end for
    train Ṽ^π(samples)
    train π̃(samples)


estimate Q(s, a)
    Q̃(s, a) ← 0
    for i = 1 to Na do
        (r, s') ← pick next state
        if confidence(s') = true then
            Q̃(s, a) ← Q̃(s, a) + (r + Ṽ^π(s')) / Na
        else
            data ← simulate(π, s')
            retrain Ṽ^π(data)
            Q̃(s, a) ← Q̃(s, a) + (r + Ṽ^π(s')) / Na
        end if
    end for
    return Q̃(s, a)


Conclusion

GSMDP
Modeling of large-scale temporal problems of decision under uncertainty
+ introduction of a new LSPI-like method, bringing together results from:
discrete event simulation
approximate policy iteration
statistical learning

API
A general method inside API:
partial and incremental state space exploration
guided by simulation / local policy improvement
API as statistical learning

GiSMoP C++ library
→ http://emmanuel.rachelson.free.fr/fr/gismop.html


Perspectives

Ongoing work:
GiSMoP is still under development
benchmark analysis (especially variance in V^π)
interest of regression vs. brute force rollout is still unclear

This work can benefit from:
better tuning of regression / classification / density estimation techniques (currently: LWPR / MC-SVM / OC-SVM)
non-arbitrary stopping bounds for sampling
error bounds
...


Thank you for your attention!
