PAC-learning of Linear Sequential Systems using the Method of Moments

Hadrien Glaude¹ and Olivier Pietquin*¹ ²

¹ University of Lille 1, CRIStAL, UMR 9189, SequeL Team, France
² Institut Universitaire de France (IUF); *now at DeepMind

ICML 2016

Multiplicity Automata and stochastic languages

This talk focuses on learning a stochastic language from drawn sequences.
- A stochastic language is a distribution over finite-length sequences.
- We use the formalism of Multiplicity Automata (a.k.a. Weighted Finite Automata), which lets us formulate the learning equations with linear algebra.
- Multiplicity Automata can also model stochastic processes (OOMs, HMMs, Markov chains of any order) or controlled processes (PSRs, POMDPs, MDPs).

Algorithms, conclusions and results naturally extend to these processes.

Contribution

State of the art:
- Iterative algorithms are slow and prone to getting stuck in local optima.
- The Spectral algorithm allows PAC learning... but does not produce proper probabilities (the Negative Probability Problem, NPP)
  → not suitable for computing expectations, e.g. in reinforcement learning (a high reward after an unlikely transition).

A new algorithm, CH-PRFA, obtained by
- replacing the SVD by an NMF,
- adding constraints during the regression.


Multiplicity Automata: graphical model and linear representation

Graphical model: given an alphabet Σ (e.g. Σ = {a, b}) and a set of states Q (e.g. Q = {q1, q2}), a Multiplicity Automaton is a weighted graph over Q in which q1 carries the initial weight 1, q2 carries the termination weight 0.4, and each edge carries one weight per symbol (e.g. q1 → q2 with a : 0.01, b : 0.2).

Linear representation:
    α_0 = (1, 0)^T
    A_a = [[0.7, 0.01], [0, 0.1]]
    A_b = [[0.99, 0.2], [0.2, 0.3]]
    α_∞ = (0, 0.4)^T

It realizes a function mapping any finite sequence to a value, f : Σ* → IR:
    f(a)  = 1 × 0.01 × 0.4 = 0.004
    f(ab) = 1 × 0.7 × 0.2 × 0.4 + 1 × 0.01 × 0.3 × 0.4 = 0.0572
    f(u)  = f(σ1 σ2 ... σl) = α_0^T A_{σ1} A_{σ2} ... A_{σl} α_∞
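To make the evaluation rule concrete, here is a minimal numpy sketch that evaluates f on the example automaton above; the variable names are illustrative, not notation from the paper.

    import numpy as np

    # Linear representation of the example Multiplicity Automaton above.
    alpha0 = np.array([1.0, 0.0])        # initial weights
    alpha_inf = np.array([0.0, 0.4])     # termination weights
    A = {"a": np.array([[0.7, 0.01],
                        [0.0, 0.1]]),
         "b": np.array([[0.99, 0.2],
                        [0.2, 0.3]])}

    def ma_value(word):
        """f(u) = alpha0^T A_{u_1} ... A_{u_l} alpha_inf."""
        v = alpha0
        for sigma in word:
            v = v @ A[sigma]
        return float(v @ alpha_inf)

    print(ma_value("a"))   # ~0.004
    print(ma_value("ab"))  # ~0.0572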

Multiplicity Automata: graphical model and linear representation (cont.)

Note that in this example all the weights are non-negative, belong to [0, 1] and have a probabilistic interpretation. This is not always the case!

Multiplicity Automata: linear representation and basis

For any sequence u, let u̇ be the linear operator such that (u̇f)(v) = f(uv). Hence, think of σ̇ as a one-step look-ahead operator.

For any Multiplicity Automaton with d states, there exists a basis of functions f_1, ..., f_d (represented as vectors here) such that

    f = Σ_{i=1}^{d} α_0[i] f_i              (representation of f on the basis),
    σ̇ f_i = Σ_{j=1}^{d} A_σ[i, j] f_j       (representation of σ̇ f_i on the basis),
    α_∞[i] = f_i(ε).

Multiplicity Automata: the Spectral algorithm

The Spectral algorithm:
1. estimates the vectors u̇f̂ for many u,
2. identifies the basis by an SVD of the Hankel matrix (u̇f̂)_{u∈Σ*}, "to capture the main axes of variation of the future",
3. recovers A_σ by linear regression between f̂_1, ..., f̂_d and σ̇f̂_1, ..., σ̇f̂_d, recovers α_0 by linear regression between f̂_1, ..., f̂_d and f̂, and sets α_∞[i] = f̂_i(ε).

However, during the regression the weights are constrained neither to define probabilities nor to define an MA that realizes a probability distribution.
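As a rough illustration of steps 2 and 3, here is a numpy sketch of spectral learning from empirical Hankel blocks. The variable names and the prefix/suffix bookkeeping are assumptions made for the example, not the paper's notation, and no statistical details (smoothing, scaling of the blocks) are handled.

    import numpy as np

    def spectral_learn(H, H_sigma, h_prefix, h_suffix, d):
        """Sketch of the Spectral algorithm on empirical Hankel blocks:
          H[u, v]          ~ f(uv)    for prefixes u and suffixes v,
          H_sigma[s][u, v] ~ f(u s v) for each symbol s,
          h_prefix[u]      ~ f(u),    h_suffix[v] ~ f(v).
        Returns a rank-d linear representation (alpha0, {A_s}, alpha_inf)."""
        # Step 2: rank-d basis from a truncated SVD of the Hankel matrix.
        U, s, Vt = np.linalg.svd(H, full_matrices=False)
        U, s, Vt = U[:, :d], s[:d], Vt[:d, :]
        P = U * s                          # forward part of the factorization
        P_pinv = np.linalg.pinv(P)
        # Step 3: unconstrained linear regressions (normal equations / pinv).
        A = {sym: P_pinv @ Hs @ Vt.T for sym, Hs in H_sigma.items()}
        alpha0 = h_suffix @ Vt.T           # representation of f on the basis
        alpha_inf = P_pinv @ h_prefix      # the f_i evaluated at the empty word
        return alpha0, A, alpha_inf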

Multiplicity Automata: hierarchy

What is learnt by the Spectral algorithm, which does not constrain f̂ to output probabilities: IR-MA.

[Figure: hierarchy between classes of automata [BD11, DE08], showing IR-MA, SMA (stochastic languages), IR+-MA (non-negative weights) and PNFA.]

Multiplicity Automata: subclasses of Stochastic Multiplicity Automata

A PNFA is an SMA whose weights represent probabilities.

Definition (Probabilistic Non-deterministic Finite Automaton). A PNFA has non-negative weights verifying

    1^T α_0 = 1    and    α_∞ + Σ_{σ∈Σ} A_σ 1 = 1.

PNFA are similar to HMMs.
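As a quick illustration, a small sketch that checks these two conditions for a candidate linear representation (the function and variable names are illustrative):

    import numpy as np

    def is_pnfa(alpha0, A, alpha_inf, tol=1e-9):
        """Check the PNFA conditions from the definition above: non-negative
        weights, 1^T alpha0 = 1, and alpha_inf + sum_sigma A_sigma 1 = 1."""
        non_negative = (alpha0 >= 0).all() and (alpha_inf >= 0).all() \
            and all((A_s >= 0).all() for A_s in A.values())
        starts_ok = abs(alpha0.sum() - 1.0) < tol
        out_mass = alpha_inf + sum(A_s.sum(axis=1) for A_s in A.values())
        stops_ok = np.allclose(out_mass, 1.0, atol=tol)
        return non_negative and starts_ok and stops_ok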

Multiplicity Automata: hierarchy (cont.)

[Figure: hierarchy between classes of automata [BD11, DE08], now also showing PDFA (fully observable) inside PNFA.]

How to solve the Negative Probability Problem?

In the Spectral algorithm, can we force the learnt MA to realize a stochastic language?
Bad news! No! There is no consistent algorithm to learn SMA [Esp04].

Can we bypass this by learning PNFA, which do realize stochastic languages?
Learning PNFA seems computationally too difficult [ZP14].

Can we find a class of automata between PNFA and PDFA that is PAC-learnable without the NPP? A class that would handle some partial observability?

→ Probabilistic Residual Finite Automata

Probabilistic Residual Finite Automata: definition

Definition (Residuals). A residual f_u is the stochastic language such that f_u(v) is the probability of observing v after u:
    f_u(v) = f(uv) / f(uΣ*),   i.e.   f_u = (1 / f(uΣ*)) u̇f.
Similarly, f_q(u) is the probability of observing u starting from state q.

Definition (Probabilistic Residual Finite Automaton [DE08]). A PRFA is a PNFA such that for every state q there exists a prefix u ∈ Σ* with f_q = f_u. In a PRFA, each state q is characterized by at least one prefix that ends only in q.
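A sketch of how a residual can be evaluated from a linear representation; it assumes the geometric series over suffixes converges (spectral radius of Σ_σ A_σ below 1), and the helper names are illustrative.

    import numpy as np

    def ma_value(alpha0, A, alpha_inf, word):
        v = alpha0
        for sigma in word:
            v = v @ A[sigma]
        return float(v @ alpha_inf)

    def residual_value(alpha0, A, alpha_inf, u, v):
        """f_u(v) = f(uv) / f(u Sigma^*), where the mass of all sequences
        starting with u uses (I - sum_sigma A_sigma)^{-1} alpha_inf."""
        A_sum = sum(A.values())
        suffix_mass = np.linalg.solve(np.eye(len(alpha_inf)) - A_sum, alpha_inf)
        w = alpha0
        for sigma in u:
            w = w @ A[sigma]
        return ma_value(alpha0, A, alpha_inf, u + v) / float(w @ suffix_mass)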

Probabilistic Residual Finite Automata: hierarchy

[Figure: hierarchy between classes of automata, now also showing PRFA ("contains a fully observable system") between PDFA and PNFA.]

Probabilistic Residual Finite Automata: learning

What makes PRFA so special? In a PRFA with d states, d of its residuals generate a convex hull containing all the others.

[Figure: dots are residuals, crosses are the d supporting ones.]

Hence, identifying the supporting residuals f_{u_1}, ..., f_{u_d} ⇔ finding a basis in which f and the σ̇ f_{u_i} can be represented with convex combinations:

    f = Σ_{i=1}^{d} α_0[i] f_{u_i}                  (representation of f on the basis),
    σ̇ f_{u_i} = Σ_{j=1}^{d} A_σ[i, j] f_{u_j}       (representation of σ̇ f_{u_i} on the basis).

So the constraints

    1^T α_0 = 1    and    α_∞ + Σ_{σ∈Σ} A_σ 1 = 1

are satisfied.
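Finding those supporting residuals is a separable-NMF-style problem. Below is a sketch of one common heuristic, the successive projection algorithm, for picking d extreme rows from a matrix whose rows are estimated residuals; it is an illustrative stand-in, not necessarily the exact routine used in the paper.

    import numpy as np

    def successive_projections(X, d):
        """Pick d rows of X that (approximately) span the convex hull of all
        rows: a standard successive-projection heuristic for separable NMF."""
        R = X.astype(float).copy()
        selected = []
        for _ in range(d):
            # Take the row with the largest residual norm as the next extreme point.
            i = int(np.argmax(np.linalg.norm(R, axis=1)))
            selected.append(i)
            r = R[i] / np.linalg.norm(R[i])
            # Project every row onto the orthogonal complement of the chosen direction.
            R = R - np.outer(R @ r, r)
        return selected

Usage would look like `supporting = successive_projections(residual_matrix, d)`, where `residual_matrix` stacks the estimated residuals f_u as rows (both names are hypothetical).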

CH-PRFA ... vs SPECTRAL

SPECTRAL: estimate joint probabilities → find a basis (SVD) → linear regression (normal equations) → MA

CH-PRFA: estimate conditional probabilities → find a convex hull (separable NMF) → linear regression with non-negativity and "sum to one" constraints (quadratic programming) → PNFA
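The constrained-regression step amounts to small quadratic programs. Here is a sketch using SciPy's SLSQP solver for the simplest case, expressing one target function as a convex combination of the selected basis functions (non-negative weights summing to one); the joint constraint α_∞ + Σ_σ A_σ 1 = 1 across symbols is a straightforward extension not shown here, and all names are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    def simplex_regression(F, target):
        """min_w ||F^T w - target||^2  subject to  w >= 0 and sum(w) = 1.
        F has one basis function per row; 'target' is the function to express
        as a convex combination of those rows."""
        d = F.shape[0]
        w0 = np.full(d, 1.0 / d)                      # start from uniform weights
        objective = lambda w: np.sum((F.T @ w - target) ** 2)
        constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
        bounds = [(0.0, 1.0)] * d
        res = minimize(objective, w0, bounds=bounds,
                       constraints=constraints, method="SLSQP")
        return res.x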

CH-PRFA: finite sample bound

Let
- n be the number of times the least occurring prefix of P appears in D: n = min_{u∈P} |{v ∈ S : uv ∈ D}|,
- σ_d be the d-th largest singular value of (f_u(v))_{u∈R},
- K be a numeric constant,
- t be a number of time steps.

Theorem. For all 0 < δ < 1, for all t > 0 and ε > 0, with probability 1 − δ, if

    n ≥ K · (t^4 d^4 |Σ|) / (ε^2 σ_d^10) · log(|P| / δ),

then CH-PRFA returns a PFA realizing a proper distribution f̂ such that

    Σ_{u ∈ Σ^{≤t}} |f̂(u) − f(u)| ≤ ε.

Experiments: the PAutomaC challenge

The Probabilistic Automaton learning Competition (PAutomaC) [VEdlH12] is a set of 48 synthetic problems where the goal is to recover the true model from a training set; 12 problems were selected: 4 HMMs, 4 PNFA, 4 PDFA.

Scores:
- Perplexity (f(u) is the true probability and T the test set):
      Perplexity(f̂) = 2^{− Σ_{u∈T} f(u) log(f̂(u))},
  which equals the number of bits needed to encode the test set.
- Word Error Rate (WER): the fraction of incorrectly predicted symbols (one-step-ahead prediction).

Comparison to Spectral, NNSpectral [GEP15], Convex Optimization (CO) [BQC12], the Tensor method [AGH+12] and Baum-Welch (BW).
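For reference, a small sketch of the perplexity score as written above, assuming (as in PAutomaC) base-2 logarithms and probabilities normalized over the test set; the dictionary-based interface is illustrative.

    import numpy as np

    def perplexity(true_probs, model_probs):
        """Perplexity = 2^{- sum_u f(u) log2 fhat(u)} over the test sequences.
        true_probs and model_probs map each test sequence to its probability."""
        exponent = -sum(true_probs[u] * np.log2(model_probs[u]) for u in true_probs)
        return 2.0 ** exponent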

Experiments: PAutomaC results

Table: performances on twelve problems of PAutomaC.

    Rank  Algorithm    WER       |  Rank  Algorithm    Perplexity
    1.    NNSpectral   64.618    |  1.    NNSpectral   30.364
    2.    CH-PRFA+BW   65.160    |  2.    CH-PRFA+BW   30.468
    3.    Spectral     65.529    |  3.    CH-PRFA      30.603
    4.    BW           66.482    |  4.    CO           35.641
    5.    CH-PRFA      67.141    |  5.    BW           35.886
    6.    CO           71.467    |  6.    Spectral     40.210
    7.    Tensor+BW    73.544    |  7.    Tensor+BW    47.655
    8.    Tensor       77.433    |  8.    Tensor       54.000

[Figure: mean learning time in seconds for BW, CH-PRFA, CO, NNSpectral, Spectral and Tensor; times range from about 2.85 s to 126.09 s.]

Experiments: learning Wikipedia

Data: raw text extracted from English Wikipedia pages, divided into sequences of 250 symbols; 86 symbols in the alphabet.
- Small training set: ≈ 500 sequences
- Medium training set: ≈ 50,000 sequences
- Test set: ≈ 5,000 sequences

Scores (computed after a warm-up on 50 symbols):
- conditional likelihood of the next symbol (1 − WER),
- number of Bits Per Character (BPC), the perplexity of the conditional distribution over the next symbol.

CO and Tensor are not included because of their very poor results on this task. BW is used only for the small dataset and small models.

Experiments: Wikipedia results (I)

[Figure: performance on Wikipedia, small training set — conditional likelihood of the next symbol (higher is better) and BPC (lower is better) as a function of the model dimension (10 to 100), for CH-PRFA, CH-PRFA+BW, Spectral, NNSpectral and BW.]

Experiments: Wikipedia results (II)

[Figure: performance on Wikipedia, medium training set — conditional likelihood of the next symbol (higher is better) and BPC (lower is better) as a function of the model dimension (10 to 100), for CH-PRFA, Spectral and NNSpectral.]

Conclusion

Comparison to Spectral algorithms:
- SVD: principal axes of variation of the future.
- Separable NMF: supporting residuals, i.e. extreme futures.
- Extreme futures correspond to beliefs where all the mass is concentrated on one hidden state:
  - more interpretable,
  - allows incorporating constraints during the linear regression to force the learnt model to be a PNFA.

CH-PRFA:
- experimentally, better fits small probabilities,
- is consistent and has a PAC-style guarantee,
- can be used to initialize an iterative algorithm (local search).

Thank you for your attention. Come to see my poster #20 tomorrow morning. Any questions?


References I

Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky, Tensor decompositions for learning latent variable models, arXiv preprint arXiv:1210.7559 (2012).

Raphaël Bailly and François Denis, Absolute convergence of rational series is semi-decidable, Information and Computation 209 (2011), no. 3, 280–295.

Borja Balle, William Hamilton, and Joelle Pineau, Methods of moments for learning stochastic languages: Unified presentation and empirical comparison, Proc. of ICML-14, 2014, pp. 1386–1394.

Borja Balle, Ariadna Quattoni, and Xavier Carreras, Local loss optimization in operator models: A new insight into spectral learning, Proc. of ICML-12, 2012.

François Denis and Yann Esposito, On rational stochastic languages, Fundamenta Informaticae 86 (2008), no. 1, 41–77.

References II

Yann Esposito, Contribution à l'inférence d'automates probabilistes, Ph.D. thesis, Université Aix-Marseille, 2004.

Hadrien Glaude, Cyrille Enderli, and Olivier Pietquin, Non-negative spectral learning for linear sequential systems, Neural Information Processing, Springer, 2015, pp. 143–151.

Sicco Verwer, Rémi Eyraud, and Colin de la Higuera, Results of the PAutomaC probabilistic automaton learning competition, Journal of Machine Learning Research - Proceedings Track 21 (2012), 243–248.

Han Zhao and Pascal Poupart, A sober look at spectral learning, arXiv preprint arXiv:1406.4631 (2014).