Temporal Markov Problems: motivation and modeling
Solving large scale GSMDP: ATPI
Simulation-based Approximate Policy Iteration for Generalized Semi-Markov Decision Processes
Emmanuel Rachelson (1), Patrick Fabiani (1), Frédérick Garcia (2)
(1) ONERA-DCSD
(2) INRA-BIA
Toulouse, France
EWRL08, July 3rd, 2008
Conclusion
Plan
1. Temporal Markov Problems: motivation and modeling
   - Examples
   - Problem features
   - GSMDP
2. Solving large scale GSMDP: ATPI
   - Basic ideas
   - Introducing confidence
   - The bigger picture
Examples
Planning under uncertainty with time dependency.
→ planning to coordinate with an uncertain and non-stationary environment.
Should we open more lines?
Airplane taxiing management
Onboard planning for coordination
Adding or removing trains?
Subway problem: toy example
Some figures:
- 4 trains, 6 stations → 22 state variables, 9 actions
- episodes of 12 hours with around 2000 steps
Problem features
Main idea
Why is writing an MDP for the previous problems such a difficult task?
"Lots of things occur in parallel":
- concurrent phenomena
- partially controllable dynamics
Typical features
- Continuous time
- Hybrid state spaces
- Large state spaces
- Total reward criteria
- Long trajectories
How do we model all this?
GSMDP
GSMDP (Younes et al., 04)
One process conditioned by the choice of the action undertaken.
GSMP (Glynn, 89): several semi-Markov processes affecting the same state space.
→ ⟨S, E, A, P, F, r⟩
[Figure: two states s1 and s2 with their active event sets, E_s1 = {e2, e4, e5, a} and E_s2 = {e2, e3, a}, and example transitions P(s′ | s1, e4) and P(s′ | s2, a).]
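To make the event-clock reading of ⟨S, E, A, P, F, r⟩ concrete, here is a minimal sketch of one simulation step of a GSMP-like process. The event sets mirror the figure, but the duration bounds, the toggling transition rule and all function names are assumptions made for illustration; the action a is treated as one more event competing with the others.

```python
import random

# Active events per state (E_s), as in the figure; "a" is the action's event.
ACTIVE = {"s1": ["e2", "e4", "e5", "a"], "s2": ["e2", "e3", "a"]}

# F: one duration distribution per event (hypothetical uniform bounds).
DURATION = {"e2": (1.0, 3.0), "e3": (0.5, 1.0), "e4": (2.0, 4.0),
            "e5": (0.2, 5.0), "a": (1.0, 2.0)}

def next_state(state, event, rng):
    """P(s' | s, e): toy rule, every triggered event toggles the state."""
    return "s2" if state == "s1" else "s1"

def step(state, clocks, rng):
    """Fire the active event with the smallest remaining clock."""
    # Newly enabled events draw a fresh clock; the others keep theirs.
    for e in ACTIVE[state]:
        if e not in clocks:
            clocks[e] = rng.uniform(*DURATION[e])
    event = min(ACTIVE[state], key=lambda e: clocks[e])
    elapsed = clocks[event]
    new_state = next_state(state, event, rng)
    # Events still active carry over their remaining time. This memory is
    # why the state alone is not Markov.
    new_clocks = {e: clocks[e] - elapsed for e in ACTIVE[new_state]
                  if e in clocks and e != event}
    return new_state, new_clocks, event, elapsed

rng = random.Random(0)
state, clocks = "s1", {}
for _ in range(5):
    state, clocks, event, elapsed = step(state, clocks, rng)
    print(f"{event} fired after {elapsed:.2f} -> {state}")
```

The remaining clocks carried from one state to the next are the extra information a policy would need in order to be Markov, which is the issue the next slide raises.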
Controlling GSMDP
Non-Markov behaviour! → no guarantee of an optimal Markov policy.
- (Younes et al., 04): approximate your model with phase-type (exponential) distributions.
- Supplementary variables technique (Nilsen, 98).
- Our approach: no hypothesis, simulation-based API.
Contribution overview
General framework: API, simulation-based PI.
Our contribution: API as non-parametric statistical learning: classification (policy), regression (value function), density estimation ("I don't know" situations).
Three extensive uses of simulation:
- Monte-Carlo sampling for the evaluation of V^π
- Roll-out for the calculation of Q-values
- Selection of the subset of states on which we perform policy improvement
Basic ideas
Simulation-based policy evaluation
Our hypothesis: we have a generative model of the process.
→ (Monte-Carlo) simulation-based policy evaluation.
Statistical learning
Simulating the policy ⇔ drawing a set of trajectories ⇔ finite set of realisations of the r.v. R^π(s).
We need to:
- abstract (generalize) information from samples
- compactly store previous knowledge of V^π(s) = E(R^π(s))
(nearest neighbours, SVR, kLASSO, LWPR)
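As a rough illustration of this evaluation step, the sketch below draws Monte-Carlo returns R^π(s) from a generative model and generalizes them with a nearest-neighbour average, the simplest of the regressors listed above. The one-dimensional model `toy_step`, the policy and all parameter values are hypothetical stand-ins, not the subway problem.

```python
import random

def toy_step(s, a, rng):
    """Hypothetical generative model: (next state, reward, elapsed time)."""
    dt = rng.uniform(0.5, 1.5)
    s_next = s + a * dt + rng.gauss(0.0, 0.1)
    return s_next, -abs(s_next), dt

def simulate_return(s0, policy, step, horizon, rng):
    """One realisation of R^pi(s0): total reward until the time horizon."""
    s, t, total = s0, 0.0, 0.0
    while t < horizon:
        s, r, dt = step(s, policy(s), rng)
        total += r
        t += dt
    return total

def knn_value(samples, s, k=5):
    """Estimate V^pi(s) as the mean return of the k nearest sampled states."""
    nearest = sorted(samples, key=lambda p: abs(p[0] - s))[:k]
    return sum(ret for _, ret in nearest) / len(nearest)

rng = random.Random(0)
policy = lambda s: -1.0 if s > 0 else 1.0   # push the state back towards 0
starts = [rng.uniform(-5.0, 5.0) for _ in range(200)]
samples = [(s0, simulate_return(s0, policy, toy_step, 10.0, rng))
           for s0 in starts]
print(knn_value(samples, 0.0))
```

The stored `samples` play the role of the compact knowledge of V^π; swapping `knn_value` for SVR or LWPR changes only the generalization step, not the sampling.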
Approximate Policy Iteration
Approximate evaluation: V^{π_n}
One-step improvement: π_{n+1}
In each visited state: 1-step rollout in order to find the best Q-value.
→ local improvements guided by the simulation of π_{n+1}.
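The one-step rollout can be sketched as follows, under the total reward criterion mentioned earlier (no discounting). The function names, the deterministic toy model and the stand-in value function are assumptions for illustration only.

```python
import random

def q_estimate(s, a, step, value, n_samples, rng):
    """Average r + V(s') over n_samples draws from the generative model."""
    total = 0.0
    for _ in range(n_samples):
        s_next, r = step(s, a, rng)
        total += r + value(s_next)
    return total / n_samples

def improved_action(s, actions, step, value, n_samples, rng):
    """One-step improvement: the action with the best estimated Q-value."""
    return max(actions,
               key=lambda a: q_estimate(s, a, step, value, n_samples, rng))

# Hypothetical deterministic toy model: the state moves by a, and the
# reward penalises the distance to 0.
def toy_step(s, a, rng):
    s_next = s + a
    return s_next, -abs(s_next)

value = lambda s: -abs(s)   # stand-in for the current estimate of V^{pi_n}
rng = random.Random(0)
print(improved_action(2.0, [-1.0, 0.0, 1.0], toy_step, value, 10, rng))
```

Calling `improved_action` only in the states a simulated trajectory actually visits gives the local improvement described on the slide.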
online-API, cont'd
Motivation: we don't want to, or can't, improve the policy everywhere:
- too time/resource consuming
- not useful with regard to the 'relevant' information gathered
Useful? Interesting? Relevant?
→ "Improving the policy in the situations I am likely to encounter today"
In other words: which subset of states for API? The ones visited by policy simulation!
First results
Initial version of online-ATPI with SVR. Initial policy sets trains to run all day long.
[Figure: initial state value (y-axis, from -3500 to 1500) against iteration number (x-axis, 0 to 14); curves labelled "stat" and "SVR".]
Introducing confidence
Is there anybody out there?
[Figure: scatter of sampled points in the (s, t) plane around the state s0.]
Q(s0, a1) = ? The successor is distributed according to P(s′, t′ | s0, t0, a1).
Should I trust my regression? → What if it overestimates the true V^π(s)?
→ Define a notion of confidence.
Introducing confidence
"confidence" ⇔ having enough points around s ⇔ approaching the sufficient statistics for V^π(s)
→ approximate measure: pdf of the underlying process.
What should we do if we are not confident?
→ generate data: increase the density of samples by simulating.
Storing the policy?
The same problem arises for policy storage as for the value function: (Lagoudakis et al., 03) RL as Classification.
Full statistical learning problem: (local incremental) regression (V^π), classification (π), density estimation (conf).
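One simple way to turn "enough points around s" into a test is to threshold a density estimate of the visited states. The sketch below uses a one-dimensional Gaussian kernel density estimate as a stand-in; it is not the OC-SVM mentioned in the perspectives, and the bandwidth and threshold values are arbitrary assumptions.

```python
import math

def kde(samples, s, bandwidth=0.5):
    """Gaussian kernel density estimate at s from 1-D sampled states."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((s - x) / bandwidth) ** 2)
               for x in samples) / norm

def confident(samples, s, threshold=0.05, bandwidth=0.5):
    """True when the sample density around s reaches the threshold."""
    return bool(samples) and kde(samples, s, bandwidth) >= threshold

visited = [0.0, 0.1, -0.2, 0.3, 0.05]
print(confident(visited, 0.0))    # inside the sampled region
print(confident(visited, 10.0))   # far from every sample
```

A state that fails the test is an "I don't know" situation: the regression of V^π there is not trusted, and more trajectories are simulated from it.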
The bigger picture
simulation-based API
  samples ← ∅
  for i = 1 to N_sim do
    while t < horizon do
      estimate Q-values
      s′ ← apply best action
      store (s, a, r, s′) in samples
    end while
  end for
  train Ṽ^π (samples)
  train π̃ (samples)
estimate Q(s, a)
  Q̃(s, a) ← 0
  for i = 1 to N_a do
    (r, s′) ← pick next state
    if confidence(s′) = true then
      Q̃(s, a) ← Q̃(s, a) + (r + Ṽ^π(s′)) / N_a
    else
      data = simulate(π, s′)
      retrain Ṽ^π (data)
      Q̃(s, a) ← Q̃(s, a) + (r + Ṽ^π(s′)) / N_a
    end if
  end for
  return Q̃(s, a)
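The Q-value estimation with its confidence test can be rendered in Python roughly as below. `LocalValueModel` is a hypothetical neighbour-counting stand-in for both the regressor and the density estimator, the toy model ignores elapsed time for brevity, and none of these names come from GiSMoP.

```python
import random

class LocalValueModel:
    """Hypothetical stand-in: local averaging plus a neighbour-count test."""
    def __init__(self, radius=0.5, min_points=3):
        self.points = []                      # (state, sampled return) pairs
        self.radius, self.min_points = radius, min_points

    def confident(self, s):
        close = sum(1 for x, _ in self.points if abs(x - s) <= self.radius)
        return close >= self.min_points

    def retrain(self, data):
        self.points.extend(data)

    def predict(self, s):
        near = [v for x, v in self.points if abs(x - s) <= self.radius]
        return sum(near) / len(near)

def estimate_q(s, a, model, step, simulate, policy, n_a, rng):
    q = 0.0
    for _ in range(n_a):
        s_next, r = step(s, a, rng)           # (r, s') <- pick next state
        if not model.confident(s_next):
            # Too few samples around s': simulate pi from s' and retrain.
            model.retrain(simulate(policy, s_next, rng))
        q += (r + model.predict(s_next)) / n_a
    return q

def toy_step(s, a, rng):
    s_next = s + a + rng.gauss(0.0, 0.1)
    return s_next, -abs(s_next)

def toy_simulate(policy, s0, rng, n_rollouts=3, n_steps=10):
    """Returns of a few fixed-length rollouts of the policy from s0."""
    data = []
    for _ in range(n_rollouts):
        s, total = s0, 0.0
        for _ in range(n_steps):
            s, r = toy_step(s, policy(s), rng)
            total += r
        data.append((s0, total))
    return data

rng = random.Random(0)
policy = lambda s: -1.0 if s > 0 else 1.0
model = LocalValueModel()
print(estimate_q(0.0, 1.0, model, toy_step, toy_simulate, policy, 4, rng))
```

Both branches of the pseudocode add the same term to the estimate; the only difference is that the non-confident branch first enriches the training set, which is what the `if not model.confident(...)` guard expresses.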
Conclusion
GSMDP: modeling of large scale temporal problems of decision under uncertainty
+ introduction of a new LSPI-like method, bringing together results from:
- discrete event simulation
- approximate policy iteration
- statistical learning
API: a general method inside API
- partial and incremental state space exploration guided by simulation / local policy improvement
- API as statistical learning
GiSMoP C++ library
→ http://emmanuel.rachelson.free.fr/fr/gismop.html
Perspectives
Ongoing work:
- GiSMoP is still under development
- benchmark analysis (especially variance in V^π)
- interest of regression vs. brute force rollout is still unclear
This work can benefit from:
- Better tuning of regression / classification / density estimation techniques (currently: LWPR / MC-SVM / OC-SVM)
- Non-arbitrary stopping bounds for sampling
- Error bounds
- ...
Thank you for your attention!