PAC-learning of Linear Sequential Systems using the Method of Moments

Hadrien Glaude¹, Olivier Pietquin*¹²
¹ University of Lille 1, CRIStAL, UMR 9189, SequeL Team, France
² Institut Universitaire de France (IUF), *now at DeepMind

ICML 2016
Multiplicity Automata and stochastic languages
This talk focuses on learning a stochastic language from drawn sequences. A stochastic language is a distribution over finite-length sequences. This work:
- uses the formalism of Multiplicity Automata (a.k.a. Weighted Finite Automata),
- allows linear algebra to be used to formulate the learning equations,
- can also model stochastic processes (OOMs, HMMs, MCs of any order) or controlled processes (PSRs, POMDPs, MDPs).
Algorithms, conclusions and results naturally extend to these processes.
Contribution
State of the art:
- Iterative algorithms are slow and prone to getting stuck in local optima.
- The Spectral algorithm allows PAC learning... but does not produce proper probabilities (the Negative Probability Problem, NPP).
  → not suitable for computing expectations, e.g. in reinforcement learning (high reward after an unlikely transition).
A new algorithm, CH-PRFA, is obtained by:
- replacing the SVD by NMF,
- adding constraints during the regression.
Multiplicity Automata: Graphical model and linear representation

Graphical model: given an alphabet Σ (e.g. Σ = {a, b}) and a set of states Q (e.g. Q = {q1, q2}).

[Figure: a two-state automaton; q1 carries initial weight 1, self-loops a:0.7, b:0.99, and transitions to q2 with a:0.01, b:0.2; q2 carries termination weight 0.4, self-loops a:0.1, b:0.3, and transitions to q1 with a:0, b:0.2.]

Linear representation:
$$\alpha_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad A_a = \begin{pmatrix} 0.7 & 0.01 \\ 0 & 0.1 \end{pmatrix}, \quad A_b = \begin{pmatrix} 0.99 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}, \quad \alpha_\infty = \begin{pmatrix} 0 \\ 0.4 \end{pmatrix}.$$

It realizes a function mapping any finite sequence to a value, $f : \Sigma^* \to \mathbb{R}$:
$$f(a) = 1 \times 0.01 \times 0.4 = 0.004$$
$$f(ab) = 1 \times 0.7 \times 0.2 \times 0.4 + 1 \times 0.01 \times 0.3 \times 0.4 = 0.0572$$
$$f(u) = f(\sigma_1 \sigma_2 \ldots \sigma_l) = \alpha_0^\top A_{\sigma_1} A_{\sigma_2} \cdots A_{\sigma_l} \alpha_\infty.$$
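As a quick sanity check of the evaluation formula, here is a minimal NumPy sketch (not part of the original slides) that reproduces $f(a)$ and $f(ab)$ for the example automaton:

```python
import numpy as np

# Linear representation of the example automaton above.
alpha0 = np.array([1.0, 0.0])          # initial weights
alpha_inf = np.array([0.0, 0.4])       # termination weights
A = {
    "a": np.array([[0.7, 0.01],
                   [0.0, 0.1]]),
    "b": np.array([[0.99, 0.2],
                   [0.2, 0.3]]),
}

def f(u: str) -> float:
    """Evaluate f(u) = alpha0^T A_{sigma_1} ... A_{sigma_l} alpha_inf."""
    v = alpha0
    for sigma in u:
        v = v @ A[sigma]
    return float(v @ alpha_inf)

print(f("a"))   # 0.004
print(f("ab"))  # 0.0572
```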
Multiplicity Automata: Graphical model and linear representation

Note that all the weights here are non-negative, belong to [0, 1] and have a probabilistic interpretation. This is not always the case!
Multiplicity Automata: Linear representation and basis

For any sequence u, let $\dot{u}$ be a linear operator such that $\dot{u}f(v) = f(uv)$. Hence, think of $\dot{\sigma}$ as a one-step look-ahead operator.

For any Multiplicity Automaton with d states, there exists a basis of functions $f_1, \ldots, f_d$ (represented as vectors here) such that
$$f = \underbrace{\sum_{i=1}^{d} \alpha_0[i]\, f_i}_{\text{repr. of } f \text{ on the basis}}, \qquad \dot{\sigma} f_i = \underbrace{\sum_{j=1}^{d} A_\sigma[i,j]\, f_j}_{\text{repr. of } \dot{\sigma} f_i \text{ on the basis}}, \qquad \alpha_\infty[i] = f_i(\varepsilon).$$
Multiplicity Automata: The Spectral algorithm

The Spectral algorithm:
1. estimates the vectors $\dot{u}\hat{f}$ for many u,
2. identifies the basis by SVD of the Hankel matrix $(\dot{u}\hat{f})_{u \in \Sigma^*}$, "to capture the main axes of variation of the future",
3. recovers $A_\sigma$ by linear regression between $\hat{f}_1, \ldots, \hat{f}_d$ and $\dot{\sigma}\hat{f}_1, \ldots, \dot{\sigma}\hat{f}_d$; recovers $\alpha_0$ by linear regression between $\hat{f}_1, \ldots, \hat{f}_d$ and $\hat{f}$; and sets $\alpha_\infty[i] = \hat{f}_i(\varepsilon)$.

However, during the regression the weights are constrained neither to define probabilities, nor to define an MA that realizes a probability distribution.
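For concreteness, here is a sketch of one common variant of this recipe, following the unified presentation of Balle et al. (ICML 2014, cited in the references); the `fhat` interface, which returns an empirical probability estimate for a string, is a hypothetical assumption:

```python
import numpy as np

def spectral(fhat, prefixes, suffixes, alphabet, d):
    """Sketch of the Spectral algorithm over empirical Hankel matrices."""
    # Empirical Hankel blocks: H[u, v] = fhat(uv), Hs[s][u, v] = fhat(u s v)
    H = np.array([[fhat(u + v) for v in suffixes] for u in prefixes])
    Hs = {s: np.array([[fhat(u + s + v) for v in suffixes] for u in prefixes])
          for s in alphabet}
    # Rank-d SVD captures the main axes of variation of the future
    U, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:d].T
    F = H @ V                                  # forward matrix on the basis
    Fp = np.linalg.pinv(F)
    A = {s: Fp @ Hs[s] @ V for s in alphabet}  # transition operators
    a0 = np.array([fhat(v) for v in suffixes]) @ V    # initial weights
    ainf = Fp @ np.array([fhat(u) for u in prefixes])  # termination weights
    return a0, A, ainf
```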
Multiplicity Automata: Hierarchy

The Spectral algorithm, which does not constrain $\hat{f}$ to output probabilities, learns general ℝ-MA.

[Figure: hierarchy between classes of automata [BD11, DE08]: ℝ-MA contains the SMA (stochastic languages), which overlap the ℝ⁺-MA / PNFA (non-negative weights).]
Multiplicity Automata: Subclasses of Stochastic Multiplicity Automata

A PNFA is an SMA whose weights represent probabilities.

Definition (Probabilistic Non-deterministic Finite Automaton): a PNFA has non-negative weights verifying
$$\mathbf{1}^\top \alpha_0 = 1, \qquad \alpha_\infty + \sum_{\sigma \in \Sigma} A_\sigma \mathbf{1} = \mathbf{1}.$$

PNFA are similar to HMMs.
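As a small illustration (my own sketch, not from the talk), these two conditions are easy to check for a candidate linear representation:

```python
import numpy as np

def is_pnfa(alpha0, A, alpha_inf, tol=1e-8):
    """Check the PNFA conditions: non-negative weights, initial weights
    summing to one, and per-state outgoing plus termination weights
    summing to one."""
    nonneg = ((alpha0 >= 0).all() and (alpha_inf >= 0).all()
              and all((As >= 0).all() for As in A.values()))
    init_ok = abs(alpha0.sum() - 1.0) < tol
    out = alpha_inf + sum(As.sum(axis=1) for As in A.values())
    return bool(nonneg and init_ok and np.allclose(out, 1.0, atol=tol))
```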
Multiplicity Automata: Hierarchy

[Figure: hierarchy between classes of automata [BD11, DE08], now with PDFA (fully observable) inside PNFA (non-negative weights), inside SMA (stochastic languages), inside ℝ-MA.]
How to solve the Negative Probability Problem?

In the Spectral algorithm, can we force the learnt MA to realize a stochastic language?
- Bad news! No! There is no consistent algorithm to learn SMA [Esp04].
Can we bypass this by learning PNFA, which do realize stochastic languages?
- Learning PNFA seems computationally too difficult [ZP14].
Can we find a class of automata between PNFA and PDFA that is PAC-learnable without the NPP? A class that would handle some partial observability?
→ Probabilistic Residual Finite Automata
Probabilistic Residual Finite Automata: Definition

Definition (Residuals): a residual $f_u$ is a stochastic language such that $f_u(v)$ is the probability of observing v after u:
$$f_u(v) = \frac{f(uv)}{f(u\Sigma^*)} = \frac{1}{f(u\Sigma^*)}\, \dot{u}f(v).$$
Similarly, $f_q(u)$ is the probability of observing u starting from the state q.

Definition (Probabilistic Residual Finite Automata [DE08]): a PRFA is a PNFA such that for every state q there exists a prefix $u \in \Sigma^*$ such that $f_q = f_u$. In a PRFA, each state q is characterized by at least one prefix that ends only in q.
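Empirically, residuals can be estimated by conditional prefix counts; a rough sketch (my own, using a simplified prefix-probability version of the residual):

```python
def residual_estimate(data, u, suffixes):
    """Estimate the residual of prefix u over a set of suffixes from
    sampled sequences, via prefix probabilities:
    f_u(v Sigma^*) ~ #{w in data : w starts with uv} / #{w : w starts with u}."""
    n_u = sum(1 for w in data if w.startswith(u))
    if n_u == 0:
        return {v: 0.0 for v in suffixes}
    return {v: sum(1 for w in data if w.startswith(u + v)) / n_u
            for v in suffixes}
```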
Probabilistic Residual Finite Automata: Hierarchy

[Figure: hierarchy between classes of automata: PRFA sits between PDFA (fully observable) and PNFA (non-negative weights), inside SMA (stochastic languages) and ℝ-MA; a PRFA contains a fully observable system.]
Probabilistic Residual Finite Automata: Learning

What makes PRFA so special? In a PRFA with d states, d of its residuals generate a convex hull containing all the others.

[Figure: residuals plotted as points in 3D; dots are residuals, crosses are the supporting ones.]
Hence, identifying the supporting residuals $f_{u_1}, \ldots, f_{u_d}$ ⇔ finding a basis in which $f$ and the $\dot{\sigma}f_{u_i}$ can be represented with convex combinations:
$$f = \underbrace{\sum_{i=1}^{d} \alpha_0[i]\, f_{u_i}}_{\text{repr. of } f \text{ on the basis}} \qquad \text{and} \qquad \dot{\sigma} f_{u_i} = \underbrace{\sum_{j=1}^{d} A_\sigma[i,j]\, f_{u_j}}_{\text{repr. of } \dot{\sigma} f_{u_i} \text{ on the basis}}.$$
Hence, the following constraints are satisfied:
$$\mathbf{1}^\top \alpha_0 = 1, \qquad \alpha_\infty + \sum_{\sigma \in \Sigma} A_\sigma \mathbf{1} = \mathbf{1}.$$
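Once the supporting residuals are identified, each coefficient row can be found by a small quadratic program; here is a sketch using SciPy (the solver choice is mine, the method only requires constrained least squares):

```python
import numpy as np
from scipy.optimize import minimize

def convex_coeffs(B, g):
    """Find convex-combination weights w (w >= 0, sum(w) = 1) such that
    w^T B approximates the target residual g.
    B: (d, m) supporting residuals as rows; g: (m,) residual to express."""
    d = B.shape[0]
    obj = lambda w: np.sum((w @ B - g) ** 2)
    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    res = minimize(obj, np.full(d, 1.0 / d), bounds=[(0, 1)] * d,
                   constraints=cons, method="SLSQP")
    return res.x
```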
CH-PRFA ... vs SPECTRAL

SPECTRAL: estimate joint probabilities → find a basis (SVD) → linear regression (normal equations) → MA.

CH-PRFA: estimate conditional probabilities → find a convex hull (separable NMF) → linear regression with non-negativity and "sum to one" constraints (quadratic programming) → PNFA.
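A sketch of the convex-hull step: the slides name separable NMF, and one standard routine for it is the Successive Projection Algorithm, shown below under that assumption. The rows of each $A_\sigma$ (and $\alpha_0$) are then obtained with the constrained regression sketched earlier.

```python
import numpy as np

def spa(X, d):
    """Successive Projection Algorithm, a standard separable-NMF routine:
    greedily pick d rows of X (here, estimated residuals) whose convex
    hull approximately contains all the other rows."""
    R = X.astype(float).copy()
    chosen = []
    for _ in range(d):
        j = int(np.argmax(np.linalg.norm(R, axis=1)))  # most extreme row
        chosen.append(j)
        r = R[j] / np.linalg.norm(R[j])
        R -= np.outer(R @ r, r)  # project all rows off that direction
    return chosen
```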
CH-PRFA: Finite sample bound

Let
- n be the number of times the least frequent prefix of P appears in D: $n = \min_{u \in P} |\{v \in S \mid uv \in D\}|$,
- $\sigma_d$ be the d-th largest singular value of $(f_u(v))_{u \in R}$,
- K be a numeric constant,
- t be a number of time steps.

Theorem: for all $0 < \delta < 1$, for all $t > 0$, $\varepsilon > 0$, with probability $1 - \delta$, if
$$n \geq K\, \frac{t^4 d^4 |\Sigma|}{\varepsilon^2 \sigma_d^{10}} \log \frac{|P|}{\delta},$$
CH-PRFA returns a PFA realizing a proper distribution $\hat{f}$ such that
$$\sum_{u \in \Sigma^{\leq t}} |\hat{f}(u) - f(u)| \leq \varepsilon.$$
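To get a feel for the bound (as reconstructed above from the extraction, so the exact arrangement of terms is an assumption), a quick numerical plug-in with arbitrary illustrative values:

```python
import numpy as np

# Illustrative values only; K is an unspecified numeric constant.
K, t, d, size_sigma, eps, sigma_d, size_P, delta = 1.0, 10, 5, 4, 0.1, 0.5, 100, 0.05
n = K * t**4 * d**4 * size_sigma / (eps**2 * sigma_d**10) * np.log(size_P / delta)
print(f"n >= {n:.3e}")  # the 1/sigma_d^10 factor dominates the requirement
```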
Experiments: PAutomaC challenge

The Probabilistic Automata learning Competition (PAutomaC) [VEdlH12] is a set of 48 synthetic problems where the goal is to recover the true model from a training set; 12 problems were selected: 4 HMMs, 4 PNFA, 4 PDFA.

Scores (f(u) is the true probability and T the test set):
- Perplexity: $\mathrm{Perplexity}(\hat{f}) = 2^{-\sum_{u \in T} f(u) \log \hat{f}(u)}$, which equals the number of bits to encode the test set.
- Word Error Rate (WER): measures the fraction of incorrectly predicted symbols (one step ahead).

Comparison to Spectral, NNSpectral [GEP15], Convex Optimization (CO) [BQC12], the Tensor method [AGH+12] and Baum-Welch (BW).
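Both scores are straightforward to compute; a sketch follows, assuming base-2 logs (consistent with the bits interpretation) and hypothetical `f_true`, `f_hat` and `predict_next` model interfaces:

```python
import numpy as np

def perplexity(test_set, f_true, f_hat):
    """PAutomaC perplexity: 2^(-sum_u f_true(u) * log2(f_hat(u)))."""
    return 2.0 ** (-sum(f_true(u) * np.log2(f_hat(u)) for u in test_set))

def wer(test_set, predict_next):
    """One-step-ahead word error rate: fraction of positions where the
    model's most likely next symbol differs from the observed one."""
    errors = total = 0
    for u in test_set:
        for i, sigma in enumerate(u):
            errors += predict_next(u[:i]) != sigma
            total += 1
    return errors / total
```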
Experiments: PAutomaC results

Table: Performances on twelve problems of PAutomaC.

WER:                           Perplexity:
1. NNSpectral   64.618         1. NNSpectral   30.364
2. CH-PRFA+BW   65.160         2. CH-PRFA+BW   30.468
3. Spectral     65.529         3. CH-PRFA      30.603
4. BW           66.482         4. CO           35.641
5. CH-PRFA      67.141         5. BW           35.886
6. CO           71.467         6. Spectral     40.210
7. Tensor+BW    73.544         7. Tensor+BW    47.655
8. Tensor       77.433         8. Tensor       54.000

[Figure: mean learning time in seconds for CH-PRFA, Spectral, Tensor, CO, NNSpectral and BW; bar values range from 2.85 s to 126.09 s.]
Experiments: Learning Wikipedia

- Raw text extracted from English Wikipedia pages, divided into sequences of 250 symbols; 86 symbols in the alphabet.
- Small training set: ≈ 500 sequences. Medium training set: ≈ 50000 sequences. Test set: ≈ 5000 sequences.
- Scores (computed after a warm-up on 50 symbols):
  - conditional likelihood on the next symbol (1 − WER),
  - number of Bits Per Character (BPC), the perplexity of the conditional distribution on the next symbol.
- CO and Tensor are not included because of their very poor results on this task; BW only for the small dataset and small models.
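A sketch of the BPC score as described above; `cond_prob(history, symbol)` is a hypothetical model interface returning the conditional probability of the next symbol:

```python
import numpy as np

def bits_per_character(sequences, cond_prob, warmup=50):
    """Average -log2 conditional probability of the next symbol,
    skipping the first `warmup` symbols of each sequence."""
    logs = [-np.log2(cond_prob(u[:i], u[i]))
            for u in sequences
            for i in range(warmup, len(u))]
    return float(np.mean(logs))
```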
Experiments: Wikipedia results I

[Figure: performance on Wikipedia, small training set. Left: likelihood of the next symbol (the higher, the better); right: Bits Per Character (the lower, the better); both as a function of the model dimension (10 to 100), for CH-PRFA, CH-PRFA+BW, Spectral, NNSpectral and BW.]
Experiments: Wikipedia results II

[Figure: performance on Wikipedia, medium training set. Same scores as above (likelihood, the higher the better; BPC, the lower the better) versus model dimension (10 to 100), for CH-PRFA, Spectral and NNSpectral.]
Conclusion

Comparison to Spectral algorithms:
- SVD: principal axes of variation of the future.
- Separable NMF: supporting residuals, i.e. extreme futures.
- Extreme futures correspond to beliefs where all the mass is concentrated on one hidden state:
  - more interpretable,
  - allows incorporating constraints during the linear regression to force the learnt model to be a PNFA.
CH-PRFA:
- experimentally, better fits small probabilities,
- is consistent and has a PAC-style guarantee,
- can be used to initialize an iterative algorithm (local search).
Thank you for your attention. Come see my poster #20 tomorrow morning. Any questions?
References I

Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012.

Raphaël Bailly and François Denis. Absolute convergence of rational series is semi-decidable. Information and Computation, 209(3):280–295, 2011.

Borja Balle, William Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In Proc. of ICML-14, pages 1386–1394, 2014.

Borja Balle, Ariadna Quattoni, and Xavier Carreras. Local loss optimization in operator models: A new insight into spectral learning. In Proc. of ICML-12, 2012.

François Denis and Yann Esposito. On rational stochastic languages. Fundamenta Informaticae, 86(1):41–77, 2008.
References II

Yann Esposito. Contribution à l'inférence d'automates probabilistes. Ph.D. thesis, Université Aix-Marseille, 2004.

Hadrien Glaude, Cyrille Enderli, and Olivier Pietquin. Non-negative spectral learning for linear sequential systems. In Neural Information Processing, pages 143–151. Springer, 2015.

Sicco Verwer, Rémi Eyraud, and Colin de la Higuera. Results of the PAutomaC probabilistic automaton learning competition. Journal of Machine Learning Research - Proceedings Track, 21:243–248, 2012.

Han Zhao and Pascal Poupart. A sober look at spectral learning. arXiv preprint arXiv:1406.4631, 2014.