Econometrics: Learning from ‘Statistical Learning’ Techniques

Arthur Charpentier (Université de Rennes 1 & UQàM)

Chief Economists’ workshop: what can central bank policymakers learn from other disciplines?
Centre for Central Banking Studies, Bank of England, London, UK, May 2016
http://freakonometrics.hypotheses.org, @freakonometrics

About the speaker:

- Professor, Economics Department, Université de Rennes 1
- In charge of the Data Science for Actuaries program; IA Research Chair actinfo (Institut Louis Bachelier)
- Previously: Actuarial Sciences at UQàM & ENSAE ParisTech; actuary in Hong Kong; IT & Stats, FFSA
- PhD in Statistics (KU Leuven); Fellow of the Institute of Actuaries
- MSc in Financial Mathematics (Paris Dauphine) & ENSAE
- Editor of the freakonometrics.hypotheses.org blog
- Editor of Computational Actuarial Science, CRC Press

Agenda

“The numbers have no way of speaking for themselves. We speak for them. [...] Before we demand more of our data, we need to demand more of ourselves”, from Silver (2012).

- (big) data
- econometrics & probabilistic modeling
- algorithmics & statistical learning
- different perspectives on classification
- bootstrapping, PCA & variable selection

See Berk (2008), Hastie, Tibshirani & Friedman (2009), but also Breiman (2001).

Data and Models

From a sample {(yᵢ, xᵢ)}, there are different stories behind the data, see Freedman (2005):

• the causal story: x_{j,i} is usually considered independent of the other covariates x_{k,i}. Each possible x is mapped to m(x), and a noise ε is attached. The goal is to recover m(·); the residuals are just the difference between the response value and m(x).

• the conditional distribution story: for a linear model, we usually say that Y given X = x is a N(m(x), σ²) distribution. m(x) is then the conditional mean. Here m(·) is assumed to really exist, but no causal assumption is made, only a conditional one.

• the explanatory data story: there is no model, just data. We simply want to summarize the information contained in the x’s to get an accurate summary, close to the response, i.e. min{ℓ(y, m(x))} for some loss function ℓ. See also Varian (2014).

Data, Models & Causal Inference

We cannot differentiate data and model that easily. After an operation, should I stay at the hospital, or go back home? As in Angrist & Pischke (2008), the observed difference

(health | hospital) − (health | stayed home)   [observed]

should be written

(health | hospital) − (health | had stayed home)   [treatment effect]
+ (health | had stayed home) − (health | stayed home)   [selection bias]

Randomization is needed to remove the selection bias.

Econometric Modeling

Data {(yᵢ, xᵢ)}, for i = 1, ..., n, with xᵢ ∈ X ⊂ ℝᵖ and yᵢ ∈ Y. A model is a mapping m : X → Y.

- regression, Y = ℝ (but also Y = ℕ)
- classification, Y = {0, 1}, {−1, +1} (binary, or more)

Classification models are based on two steps:
• a score function, s(x) = P(Y = 1 | X = x) ∈ [0, 1]
• a classifier, s(x) → ŷ ∈ {0, 1}

[Figure: binary observations and a fitted score, both coordinates in [0, 1].]
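As an illustration (not from the original slides), here is a minimal sketch of the two steps on simulated data, using scikit-learn's LogisticRegression for the score s(x) and a threshold of 1/2 for the classifier; the data-generating process is an arbitrary choice.

```python
# Minimal sketch (illustrative data): score function, then classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, size=(n, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-(4 * x[:, 0] - 2))))  # true score is logistic in x

fit = LogisticRegression().fit(x, y)
score = fit.predict_proba(x)[:, 1]   # step 1: s(x) = P(Y = 1 | X = x)
y_hat = (score > 0.5).astype(int)    # step 2: classifier, threshold s = 1/2
print("in-sample accuracy:", np.mean(y_hat == y))
```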

High Dimensional Data (not to say ‘Big Data’)

See Bühlmann & van de Geer (2011) or Koch (2013): X is an n × p matrix.

Portnoy (1988) proved that maximum likelihood estimators are asymptotically normal when p²/n → 0 as n, p → ∞. With massive data, where p > n, this no longer holds. More interesting is the sparsity concept, based not on p but on the effective number of relevant variables: one can then have p > n and still get convergent estimators.

High dimension might be scary because of the curse of dimensionality, see Bellman (1957): the volume of the unit sphere in ℝᵖ tends to 0 as p → ∞, i.e. space becomes sparse.

Computational & Nonparametric Econometrics

Linear econometrics: estimate g : x ↦ E[Y | X = x] by a linear function.

Nonlinear econometrics: consider an approximation in some functional basis,
$$ g(x) = \sum_{j=0}^{\infty} \omega_j g_j(x) \quad\text{and}\quad \hat g(x) = \sum_{j=0}^{h} \omega_j g_j(x), $$
or consider a local model, on the neighborhood of x,
$$ \hat g(x) = \frac{1}{n_x} \sum_{i \in I_x} y_i, \quad\text{with } I_x = \{ i : \| x_i - x \| \le h \}, $$
see Nadaraya (1964) and Watson (1964). Here h is a tuning parameter: not estimated, but chosen (optimally).
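A minimal sketch of such a local estimator, here the Nadaraya-Watson form with a Gaussian kernel (the kernel, bandwidths, and data are illustrative choices, not from the slides):

```python
# Sketch: Nadaraya-Watson estimator with a Gaussian kernel.
import numpy as np

def nw_estimate(x0, x, y, h):
    """Kernel-weighted average of y in the neighborhood of x0, bandwidth h."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # K_h(x0 - x_i), Gaussian kernel
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)

grid = np.linspace(0, 1, 5)
for h in (0.02, 0.1):                        # h is chosen, not estimated
    m_hat = [nw_estimate(x0, x, y, h) for x0 in grid]
    print(f"h = {h}:", np.round(m_hat, 2))
```

A small h gives a wiggly, low-bias estimate; a large h smooths more, trading variance for bias.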

Econometrics & Probabilistic Model

From Cook & Weisberg (1999), see also Haavelmo (1965):
(Y | X = x) ∼ N(µ(x), σ²), with µ(x) = β₀ + xᵀβ and β ∈ ℝᵖ.

Linear model: E[Y | X = x] = β₀ + xᵀβ.
Homoscedasticity: Var[Y | X = x] = σ².

Conditional Distribution and Likelihood

(Y | X = x) ∼ N(µ(x), σ²), with µ(x) = β₀ + xᵀβ and β ∈ ℝᵖ. The log-likelihood is
$$ \log\mathcal{L}(\beta_0,\beta,\sigma^2\mid y,x) = -\frac{n}{2}\log[2\pi\sigma^2] - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\beta_0-x_i^\top\beta)^2. $$
Set
$$ (\hat\beta_0,\hat\beta,\hat\sigma^2) = \operatorname{argmax}\left\{ \log\mathcal{L}(\beta_0,\beta,\sigma^2\mid y,x) \right\}. $$
The first-order condition is Xᵀ[y − Xβ̂] = 0. If X is a full-rank matrix,
$$ \hat\beta = (X^\top X)^{-1}X^\top y = \beta + (X^\top X)^{-1}X^\top\varepsilon. $$
Asymptotic properties of β̂:
$$ \sqrt n\,(\hat\beta-\beta) \xrightarrow{\mathcal L} \mathcal N(0,\Sigma) \quad\text{as } n\to\infty. $$
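A short numerical sketch of the first-order condition (simulated data, arbitrary coefficients): solving the normal equations gives β̂, and Xᵀ(y − Xβ̂) is numerically zero.

```python
# Sketch: OLS via the first-order condition X'(y - X beta_hat) = 0.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(0, 1, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y, X assumed full rank
print("beta_hat:", np.round(beta_hat, 2))
print("FOC residual:", np.round(X.T @ (y - X @ beta_hat), 8))  # ~ 0
```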

Geometric Perspective

Define the orthogonal projection on X, Π_X = X[XᵀX]⁻¹Xᵀ, so that
$$ \hat y = \underbrace{X[X^\top X]^{-1}X^\top}_{\Pi_X} y = \Pi_X y. $$
Pythagoras’ theorem can be written
$$ \|y\|^2 = \|\Pi_X y\|^2 + \|\Pi_{X^\perp} y\|^2 = \|\Pi_X y\|^2 + \|y - \Pi_X y\|^2, $$
which can be expressed as
$$ \underbrace{\sum_{i=1}^n y_i^2}_{n\times\text{total variance}} = \underbrace{\sum_{i=1}^n \hat y_i^2}_{n\times\text{explained variance}} + \underbrace{\sum_{i=1}^n (y_i-\hat y_i)^2}_{n\times\text{residual variance}}. $$

Geometric Perspective

Define the angle θ between y and Π_X y:
$$ R^2 = \frac{\|\Pi_X y\|^2}{\|y\|^2} = 1 - \frac{\|\Pi_{X^\perp} y\|^2}{\|y\|^2} = \cos^2(\theta), $$
see Davidson & MacKinnon (2003). Consider
$$ y = \beta_0 + X_1\beta_1 + X_2\beta_2 + \varepsilon. $$
If y₂⋆ = Π_{X₁⊥} y and X₂⋆ = Π_{X₁⊥} X₂, then
$$ \hat\beta_2 = [X_2^{\star\top} X_2^\star]^{-1} X_2^{\star\top} y_2^\star, $$
with X₂⋆ = X₂ if X₁ ⊥ X₂: the Frisch-Waugh theorem.

From Linear to Non-Linear

$$ \hat y = X\hat\beta = \underbrace{X[X^\top X]^{-1}X^\top}_{H} y, \quad\text{i.e. } \hat y_i = h_{x_i}^\top y, $$
with, for the linear regression, h_x = X[XᵀX]⁻¹x, and H the associated hat matrix.

One can consider some smoothed regression, see Nadaraya (1964) and Watson (1964), with a smoothing matrix S:
$$ \hat m_h(x) = s_x^\top y = \sum_{i=1}^n s_{x,i}\, y_i, \quad\text{with } s_{x,i} = \frac{K_h(x-x_i)}{K_h(x-x_1)+\cdots+K_h(x-x_n)}, $$
for some kernel K(·) and some bandwidth h > 0.

From Linear to Non-Linear

The quantity ‖Sy − Hy‖², normalized by trace([S − H]ᵀ[S − H]), can be used to test for linearity, Simonoff (1996). trace(S) is the equivalent number of parameters, and n − trace(S) the degrees of freedom, Ruppert et al. (2003).

Nonlinear model, but homoscedastic and Gaussian:
• (Y | X = x) ∼ N(µ(x), σ²)
• E[Y | X = x] = µ(x)

Conditional Expectation

[Figure: from Angrist & Pischke (2008), the conditional expectation x ↦ E[Y | X = x].]

Exponential Distributions and Linear Models

$$ f(y_i\mid\theta_i,\phi) = \exp\left( \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right), \quad\text{with } \theta_i = h(x_i^\top\beta). $$
The log-likelihood is expressed as
$$ \log\mathcal L(\theta,\phi\mid y) = \sum_{i=1}^n \log f(y_i\mid\theta_i,\phi) = \frac{\sum_{i=1}^n y_i\theta_i - \sum_{i=1}^n b(\theta_i)}{a(\phi)} + \sum_{i=1}^n c(y_i,\phi), $$
and the first-order conditions are
$$ \frac{\partial \log\mathcal L(\theta,\phi\mid y)}{\partial\beta} = X^\top W^{-1}[y-\mu] = 0, $$
as in Müller (2001), where W is a weight matrix, function of β. We usually specify the link function g(·), defined by
$$ \hat y = m(x) = E[Y\mid X=x] = g^{-1}(x^\top\beta). $$

Exponential Distributions and Linear Models

Note that W = diag(∇g(ŷ) · Var[y]), and set
$$ z = g(\hat y) + (y-\hat y)\cdot\nabla g(\hat y); $$
then the maximum likelihood estimator is obtained iteratively,
$$ \hat\beta_{k+1} = [X^\top W_k^{-1} X]^{-1} X^\top W_k^{-1} z_k. $$
Set β̂ = β̂_∞, so that
$$ \sqrt n\,(\hat\beta - \beta) \xrightarrow{\mathcal L} \mathcal N(0,\, I(\beta)^{-1}), \quad\text{with } I(\beta) = \phi\cdot[X^\top W_\infty^{-1}X]. $$
Note that [XᵀW_k⁻¹X] is a p × p matrix.
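A minimal sketch of this iterative scheme for a Poisson regression with log link (simulated data; note that weight-matrix conventions vary across references, and the sketch uses the common form with weights diag(µ) rather than the slides' W⁻¹ notation):

```python
# Sketch: iteratively reweighted least squares for Poisson regression (log link).
# Convention here: W_k = diag(mu_k), the usual IRLS weights.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)                        # g^{-1}(eta), log link
    z = eta + (y - mu) / mu                 # working response z_k
    W = mu                                  # IRLS weights
    XtWX = X.T @ (W[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (W * z))
print("beta_hat:", np.round(beta, 3))       # close to beta_true
```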

Exponential Distributions and Linear Models

Generalized Linear Model:
• (Y | X = x) ∼ L(θₓ, ϕ)
• E[Y | X = x] = h⁻¹(θₓ) = g⁻¹(xᵀβ)

e.g. (Y | X = x) ∼ P(exp[xᵀβ]). Use maximum likelihood techniques for inference.

Actually, this is more a moment condition than a distributional assumption.

Goodness of Fit & Model Choice

From the variance decomposition
$$ \underbrace{\frac1n\sum_{i=1}^n (y_i-\bar y)^2}_{\text{total variance}} = \underbrace{\frac1n\sum_{i=1}^n (y_i-\hat y_i)^2}_{\text{residual variance}} + \underbrace{\frac1n\sum_{i=1}^n (\hat y_i-\bar y)^2}_{\text{explained variance}}, $$
define
$$ R^2 = \frac{\sum_{i=1}^n (y_i-\bar y)^2 - \sum_{i=1}^n (y_i-\hat y_i)^2}{\sum_{i=1}^n (y_i-\bar y)^2}. $$
More generally, Deviance(β) = −2 log[ℒ], which in the Gaussian case is, up to a constant, Σᵢ (yᵢ − ŷᵢ)² = Deviance(ŷ). The null deviance is obtained using ŷᵢ = ȳ, so that
$$ R^2 = \frac{\text{Deviance}(\bar y) - \text{Deviance}(\hat y)}{\text{Deviance}(\bar y)} = 1 - \frac{\text{Deviance}(\hat y)}{\text{Deviance}(\bar y)} = 1 - \frac{D}{D_0}. $$

Goodness of Fit & Model Choice

One usually prefers a penalized version,
$$ \bar R^2 = 1 - (1-R^2)\frac{n-1}{n-p} = R^2 - \underbrace{(1-R^2)\frac{p-1}{n-p}}_{\text{penalty}}. $$
See also the Akaike criterion, AIC = Deviance + 2·p, or Schwarz’, BIC = Deviance + log(n)·p. In high dimension, consider a corrected version,
$$ \text{AICc} = \text{Deviance} + 2\,p\cdot\frac{n}{n-p-1}. $$

Stepwise Procedures

Forward algorithm:
1. set j₁⋆ = argmin_{j ∈ {∅,1,...,n}} {AIC({j})}
2. set j₂⋆ = argmin_{j ∈ {∅,1,...,n}\{j₁⋆}} {AIC({j₁⋆, j})}
3. ... until j⋆ = ∅

Backward algorithm:
1. set j₁⋆ = argmin_{j ∈ {∅,1,...,n}} {AIC({1,...,n}\{j})}
2. set j₂⋆ = argmin_{j ∈ {∅,1,...,n}\{j₁⋆}} {AIC({1,...,n}\{j₁⋆, j})}
3. ... until j⋆ = ∅
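A sketch of the forward algorithm, greedily adding the variable that most lowers the AIC and stopping when no candidate improves it (Gaussian OLS via statsmodels; data and coefficients are illustrative):

```python
# Sketch: forward stepwise selection by AIC (Gaussian case).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, p = 300, 8
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=n)   # only variables 0 and 3 matter

def aic(cols):
    design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    return sm.OLS(y, design).fit().aic

selected, remaining = [], list(range(p))
while True:
    best = min(remaining, key=lambda j: aic(selected + [j]), default=None)
    if best is None or aic(selected + [best]) >= aic(selected):
        break                                     # no candidate improves AIC: stop
    selected.append(best)
    remaining.remove(best)
print("selected variables:", selected)
```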

Econometrics & Statistical Testing

The standard test for H₀: βₖ = 0 against H₁: βₖ ≠ 0 is the Student t-test, tₖ = β̂ₖ / se(β̂ₖ). Use the p-value P[|T| > |tₖ|] with T ∼ t_ν (and ν = trace(H)).

In high dimension, consider the FDR (False Discovery Rate). With α = 5%, 5% of the variables are wrongly significant: if p = 100 with only 5 truly significant variables, one should expect about 5 false positives as well, i.e. a 50% FDR; see Benjamini & Hochberg (1995) and Andrew Gelman’s talk.
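A minimal sketch of the Benjamini-Hochberg procedure on that kind of configuration (95 null variables, 5 signals; the p-value distributions are illustrative): reject the k smallest p-values, where k is the largest index with p₍ᵢ₎ ≤ αi/m.

```python
# Sketch: Benjamini-Hochberg procedure on p-values, FDR level alpha = 5%.
import numpy as np

rng = np.random.default_rng(5)
p_null = rng.uniform(size=95)                 # 95 truly null variables
p_alt = rng.uniform(0, 1e-4, size=5)          # 5 truly significant ones
pvals = np.concatenate([p_null, p_alt])

alpha, m = 0.05, len(pvals)
order = np.argsort(pvals)
below = pvals[order] <= alpha * np.arange(1, m + 1) / m
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
rejected = order[:k]                          # reject the k smallest p-values
print("rejections:", np.sort(rejected))
```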

Under- & Over-Identification

Under-identification occurs when the true model is y = β₀ + x₁ᵀβ₁ + x₂ᵀβ₂ + ε, but we estimate y = β₀ + x₁ᵀb₁ + η. The maximum likelihood estimator of b₁ is
$$ \hat b_1 = (X_1^\top X_1)^{-1}X_1^\top y = (X_1^\top X_1)^{-1}X_1^\top [X_1\beta_1 + X_2\beta_2 + \varepsilon] = \beta_1 + \underbrace{(X_1^\top X_1)^{-1}X_1^\top X_2\beta_2}_{\beta_{12}} + \underbrace{(X_1^\top X_1)^{-1}X_1^\top\varepsilon}_{\nu}, $$
so that E[b̂₁] = β₁ + β₁₂; the bias is null when X₁ᵀX₂ = 0, i.e. X₁ ⊥ X₂ (see Frisch-Waugh).

Over-identification occurs when the true model is y = β₀ + x₁ᵀβ₁ + ε, but we fit y = β₀ + x₁ᵀb₁ + x₂ᵀb₂ + η. Inference is unbiased since E[b̂₁] = β₁, but the estimator is not efficient.

Statistical Learning & Loss Function

Here there is no probabilistic model, but a loss function ℓ. For some set M of functions X → Y, define
$$ m^\star = \operatorname*{argmin}_{m\in\mathcal M} \left\{ \sum_{i=1}^n \ell(y_i, m(x_i)) \right\}. $$
Quadratic loss functions are interesting since
$$ \bar y = \operatorname*{argmin}_{m\in\mathbb R} \left\{ \sum_{i=1}^n \frac1n [y_i - m]^2 \right\}, $$
which can be written, with some underlying probabilistic model,
$$ E(Y) = \operatorname*{argmin}_{m\in\mathbb R} \left\{ \|Y-m\|_{\ell_2}^2 \right\} = \operatorname*{argmin}_{m\in\mathbb R} \left\{ E\left([Y-m]^2\right) \right\}. $$
For τ ∈ (0, 1), we obtain the quantile regression (see Koenker (2005)):
$$ m^\star = \operatorname*{argmin}_{m\in\mathcal M_0} \left\{ \sum_{i=1}^n \ell_\tau(y_i, m(x_i)) \right\}, \quad\text{with } \ell_\tau(x,y) = |(x-y)(\tau - \mathbf 1_{x\le y})|. $$
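A quick numerical sketch of the quantile-loss claim (illustrative data): minimizing the loss ℓ_τ over constants recovers the empirical τ-quantile, just as the quadratic loss recovers the mean.

```python
# Sketch: minimizing the pinball loss l_tau over constants gives the tau-quantile.
import numpy as np

def pinball(tau, y, m):
    return np.mean(np.abs((y - m) * (tau - (y <= m))))

rng = np.random.default_rng(6)
y = rng.lognormal(size=10000)
tau = 0.9
grid = np.linspace(0, 10, 2001)
m_star = grid[np.argmin([pinball(tau, y, m) for m in grid])]
print("argmin of pinball loss:", round(m_star, 3))
print("empirical 90% quantile:", round(np.quantile(y, tau), 3))
```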

Boosting & Weak Learning

$$ m^\star = \operatorname*{argmin}_{m\in\mathcal M} \left\{ \sum_{i=1}^n \ell(y_i, m(x_i)) \right\} $$
is hard to solve for some very large and general space M of functions X → Y. Consider some iterative procedure, where we learn from the errors,
$$ m^{(k)}(\cdot) = \underbrace{m_1(\cdot)}_{\sim y} + \underbrace{m_2(\cdot)}_{\sim\varepsilon_1} + \underbrace{m_3(\cdot)}_{\sim\varepsilon_2} + \cdots + \underbrace{m_k(\cdot)}_{\sim\varepsilon_{k-1}} = m^{(k-1)}(\cdot) + m_k(\cdot). $$
Formally, ε can be seen as ∇ℓ, the gradient of the loss.

Boosting & Weak Learning

It is possible to see this algorithm as a gradient descent. Not
$$ \underbrace{f(x_k)}_{\langle f, x_k\rangle} \sim \underbrace{f(x_{k-1})}_{\langle f, x_{k-1}\rangle} + \underbrace{(x_k - x_{k-1})}_{\alpha_k}\, \underbrace{\nabla f(x_{k-1})}_{\langle \nabla f, x_{k-1}\rangle}, $$
but some kind of dual version,
$$ \underbrace{f_k(x)}_{\langle f_k, x\rangle} \sim \underbrace{f_{k-1}(x)}_{\langle f_{k-1}, x\rangle} + \underbrace{(f_k - f_{k-1})}_{a_k}\, \underbrace{\star}_{\langle f_{k-1}, \nabla x\rangle}, $$
where ⋆ is a gradient in some functional space. Then
$$ m^{(k)}(x) = m^{(k-1)}(x) + \operatorname*{argmin}_{f\in\mathcal F} \left\{ \sum_{i=1}^n \ell\big(y_i,\, m^{(k-1)}(x_i) + f(x_i)\big) \right\} $$
for some simple space F, so that we define some weak learner, e.g. step functions (so-called stumps).

Boosting & Weak Learning

A standard set F is the set of stump functions, but one can also consider splines (with non-fixed knots). One might add a shrinkage parameter to learn even more weakly, i.e. set ε₁ = y − α·m₁(x) with α ∈ (0, 1), etc.
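A minimal sketch of this scheme under the squared loss, where the gradient is just the residual: stumps (depth-1 trees from scikit-learn) fitted iteratively to the current residuals, with a shrinkage parameter α (all tuning values are illustrative).

```python
# Sketch: boosting with stumps and shrinkage alpha, squared loss,
# so each weak learner is fitted to the current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=(300, 1))
y = np.sin(4 * x[:, 0]) + rng.normal(0, 0.2, 300)

alpha, K = 0.1, 200                       # shrinkage in (0,1), number of rounds
residual = y.copy()
stumps = []
for _ in range(K):
    stump = DecisionTreeRegressor(max_depth=1).fit(x, residual)
    stumps.append(stump)
    residual -= alpha * stump.predict(x)  # learn (weakly) from the errors

m_k = sum(alpha * s.predict(x) for s in stumps)
print("training MSE:", round(np.mean((y - m_k) ** 2), 4))
```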

Big Data & Linear Model

Consider some linear model yᵢ = xᵢᵀβ + εᵢ, for all i = 1, ..., n. Assume the εᵢ are i.i.d. with E(ε) = 0 (and finite variance). Write
$$ \underbrace{\begin{pmatrix} y_1\\ \vdots\\ y_n \end{pmatrix}}_{y,\ n\times 1} = \underbrace{\begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & \cdots & x_{n,p} \end{pmatrix}}_{X,\ n\times(p+1)} \underbrace{\begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{pmatrix}}_{\beta,\ (p+1)\times 1} + \underbrace{\begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_n \end{pmatrix}}_{\varepsilon,\ n\times 1}. $$
Assuming ε ∼ N(0, σ²I), the maximum likelihood estimator of β is
$$ \hat\beta = \operatorname{argmin}\left\{ \|y - X\beta\|_{\ell_2} \right\} = (X^\top X)^{-1}X^\top y, $$
under the assumption that XᵀX is a full-rank matrix.

What if XᵀX cannot be inverted? Then β̂ = [XᵀX]⁻¹Xᵀy does not exist, but β̂_λ = [XᵀX + λI]⁻¹Xᵀy always exists if λ > 0.

Ridge Regression & Regularization

The estimator β̂_λ = [XᵀX + λI]⁻¹Xᵀy is the ridge estimate, obtained as the solution of
$$ \hat\beta = \operatorname*{argmin}_{\beta} \left\{ \sum_{i=1}^n [y_i - \beta_0 - x_i^\top\beta]^2 + \lambda \underbrace{\|\beta\|_{\ell_2}^2}_{\mathbf 1^\top\beta^2} \right\} $$
for some tuning parameter λ. One can also write
$$ \hat\beta = \operatorname*{argmin}_{\beta;\ \|\beta\|_{\ell_2}\le s} \left\{ \|y - X\beta\|_{\ell_2} \right\}. $$
There is a Bayesian interpretation of that regularization, when β has some prior N(β₀, τI).
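A short sketch of the closed-form ridge estimate in the p > n regime (illustrative dimensions and λ): XᵀX is singular, yet XᵀX + λI is invertible.

```python
# Sketch: ridge makes X'X + lambda*I invertible even when p > n.
import numpy as np

rng = np.random.default_rng(8)
n, p = 50, 100                                     # p > n: X'X is singular
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, n)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print("rank of X'X:", np.linalg.matrix_rank(X.T @ X), "< p =", p)
print("first coefficients:", np.round(beta_ridge[:4], 2))
```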

Over-Fitting & Penalization

Solve here, for some norm ‖·‖,
$$ \min\left\{ \sum_{i=1}^n \ell(y_i, \beta_0 + x_i^\top\beta) + \lambda\|\beta\| \right\} = \min\left\{ \text{objective}(\beta) + \text{penalty}(\beta) \right\}. $$
Estimators are no longer unbiased, but might have a smaller mse. Consider some i.i.d. sample {y₁, ..., yₙ} from N(θ, σ²), and some estimator proportional to ȳ, i.e. θ̂ = αȳ; α = 1 gives the maximum likelihood estimator. Note that
$$ \operatorname{mse}[\hat\theta] = \underbrace{(\alpha-1)^2\mu^2}_{\text{bias}[\hat\theta]^2} + \underbrace{\frac{\alpha^2\sigma^2}{n}}_{\operatorname{Var}[\hat\theta]} $$
and
$$ \alpha^\star = \mu^2\cdot\left( \mu^2 + \frac{\sigma^2}{n} \right)^{-1} < 1. $$

The penalized problem
$$ (\hat\beta_0,\hat\beta) = \operatorname{argmin}\left\{ \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta) + \lambda\|\beta\| \right\} $$
can be seen as a Lagrangian version of the constrained minimization problem
$$ (\hat\beta_0,\hat\beta) = \operatorname*{argmin}_{\beta;\ \|\beta\|\le s} \left\{ \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta) \right\}. $$

LASSO & Sparsity

In several applications, p can be (very) large, but a lot of features are just noise: βⱼ = 0 for many j’s. Let s denote the number of relevant features, with s ≪ p.

[Figure: classification-tree splits on the variables INCAR and INSYS, with nodes of the form NL: {xᵢ,ⱼ ≤ s} and NR: {xᵢ,ⱼ > s}.]
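A minimal sketch of sparse recovery (illustrative dimensions, scikit-learn's Lasso with an arbitrary penalty level): with only s ≪ p relevant features, the ℓ₁ penalty sets most coefficients exactly to zero.

```python
# Sketch: LASSO recovers a sparse beta when only s << p features matter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
n, p, s = 200, 50, 3
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = [3.0, -2.0, 1.5]                    # only s relevant features
y = X @ beta + rng.normal(0, 0.5, n)

fit = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```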

Trees & Forests

Bootstrap can be used to define the concept of margin,
$$ \text{margin}_i = \frac1B\sum_{b=1}^B \mathbf 1(\hat y_i^{(b)} = y_i) - \frac1B\sum_{b=1}^B \mathbf 1(\hat y_i^{(b)} \ne y_i). $$
Subsampling of variables, at each knot (e.g. √k out of k).

Concept of variable importance: given some random forest with M trees, the importance of variable k is
$$ I(X_k) = \frac1M \sum_m \sum_t \frac{N_t}{N}\,\Delta I(t), $$
where the first sum is over all trees, and the second one is over all nodes where the split is done based on variable Xₖ.
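A minimal sketch with scikit-learn's RandomForestClassifier (simulated data), using √k features per split and its built-in impurity-based importance, which aggregates the ΔI(t) terms across nodes and trees:

```python
# Sketch: random forest with variable subsampling at each split,
# plus the impurity-based variable importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(10)
n, p = 500, 6
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, n) > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", oob_score=True  # sqrt(k) out of k
).fit(X, y)
print("variable importance:", np.round(forest.feature_importances_, 3))
print("OOB accuracy:", round(forest.oob_score_, 3))
```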

Trees & Forests

[Figure: classified sample in the (PVENT, REPUL) plane.]

See also discriminant analysis, SVM, neural networks, etc.

Model Selection & ROC Curves

Given a scoring function m(·), with m(x) = E[Y | X = x], and a threshold s ∈ (0, 1), set
$$ \hat Y^{(s)} = \mathbf 1[m(x) > s] = \begin{cases} 1 & \text{if } m(x) > s\\ 0 & \text{if } m(x) \le s \end{cases} $$
Define the confusion matrix as N = [N_{u,v}], with
$$ N_{u,v}^{(s)} = \sum_{i=1}^n \mathbf 1(\hat y_i^{(s)} = u,\ y_i = v) \quad\text{for } (u,v)\in\{0,1\}: $$

          Y = 0       Y = 1
Ŷs = 0    TNs         FNs         TNs + FNs
Ŷs = 1    FPs         TPs         FPs + TPs
          TNs + FPs   FNs + TPs   n

Model Selection & ROC Curves

The ROC curve is
$$ \text{ROC}_s = \left( \frac{\text{FP}_s}{\text{FP}_s + \text{TN}_s},\ \frac{\text{TP}_s}{\text{TP}_s + \text{FN}_s} \right), \quad\text{with } s\in(0,1). $$

Model Selection & ROC Curves

In machine learning, the most popular measure is κ, see Landis & Koch (1977). Define N⊥ from N as in the chi-square independence test. Set
$$ \text{total accuracy} = \frac{\text{TP}+\text{TN}}{n}, $$
$$ \text{random accuracy} = \frac{\text{TP}^\perp + \text{TN}^\perp}{n} = \frac{[\text{TN}+\text{FP}]\cdot[\text{TN}+\text{FN}] + [\text{TP}+\text{FN}]\cdot[\text{TP}+\text{FP}]}{n^2}, $$
and
$$ \kappa = \frac{\text{total accuracy} - \text{random accuracy}}{1 - \text{random accuracy}}. $$
See Kaggle competitions.
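A minimal sketch tying the last three slides together (toy score and outcome, arbitrary threshold s = 0.5): the confusion-matrix counts, the corresponding ROC point, and Cohen's κ.

```python
# Sketch: confusion matrix at threshold s, one ROC point, and Cohen's kappa.
import numpy as np

rng = np.random.default_rng(11)
n = 1000
score = rng.uniform(size=n)                       # m(x), a toy score
y = rng.binomial(1, score)                        # well-calibrated by construction

s = 0.5
y_hat = (score > s).astype(int)
TP = np.sum((y_hat == 1) & (y == 1)); TN = np.sum((y_hat == 0) & (y == 0))
FP = np.sum((y_hat == 1) & (y == 0)); FN = np.sum((y_hat == 0) & (y == 1))

roc_point = (FP / (FP + TN), TP / (TP + FN))      # (false positive rate, true positive rate)
total_acc = (TP + TN) / n
random_acc = ((TN + FP) * (TN + FN) + (TP + FN) * (TP + FP)) / n**2
kappa = (total_acc - random_acc) / (1 - random_acc)
print("ROC point:", np.round(roc_point, 3), "kappa:", round(kappa, 3))
```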

Reducing Dimension with PCA

Use principal components to reduce dimension (on centered and scaled variables): we want d vectors z₁, ..., z_d.

The first component is z₁ = Xω₁, where
$$ \omega_1 = \operatorname*{argmax}_{\|\omega\|=1} \left\{ \|X\cdot\omega\|^2 \right\} = \operatorname*{argmax}_{\|\omega\|=1} \left\{ \omega^\top X^\top X \omega \right\}. $$
The second component is z₂ = Xω₂, where
$$ \omega_2 = \operatorname*{argmax}_{\|\omega\|=1} \left\{ \|\tilde X^{(1)}\cdot\omega\|^2 \right\}, \quad\text{with } \tilde X^{(1)} = X - \underbrace{X\omega_1}_{z_1}\omega_1^\top. $$

[Figure: log mortality rates by age, and the observations on the (PC score 1, PC score 2) plane; war years such as 1914-1919 and 1940-1944 stand out.]

Reducing Dimension with PCA

A regression on the d principal components, y = zᵀb + η, could be an interesting idea; unfortunately, principal components have no reason to be correlated with y. The first component was z₁ = Xω₁, where
$$ \omega_1 = \operatorname*{argmax}_{\|\omega\|=1} \left\{ \|X\cdot\omega\|^2 \right\} = \operatorname*{argmax}_{\|\omega\|=1} \left\{ \omega^\top X^\top X \omega \right\}: $$
it is an unsupervised technique. Instead, use partial least squares, introduced in Wold (1966), where the first component is z₁ = Xω₁ with
$$ \omega_1 = \operatorname*{argmax}_{\|\omega\|=1} \left\{ \langle y, X\cdot\omega \rangle \right\} = \operatorname*{argmax}_{\|\omega\|=1} \left\{ \omega^\top X^\top y\, y^\top X \omega \right\} $$
(etc.)
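A minimal sketch of the contrast (simulated data, deliberately left unscaled so that the leading-variance direction carries no signal): the first principal component ignores y, while the first PLS component tracks it.

```python
# Sketch: principal components ignore y, PLS components use it.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(12)
n = 300
# first (high-variance) column is pure noise; the low-variance one drives y
X = np.column_stack([10 * rng.normal(size=n), rng.normal(size=n)])
y = X[:, 1] + rng.normal(0, 0.1, n)

z1_pca = PCA(n_components=1).fit_transform(X)[:, 0]
pls = PLSRegression(n_components=1).fit(X, y)
z1_pls = pls.transform(X)[:, 0]
print("corr(y, first PC): ", round(np.corrcoef(y, z1_pca)[0, 1], 3))  # near 0
print("corr(y, first PLS):", round(np.corrcoef(y, z1_pls)[0, 1], 3))  # near 1
```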

Instrumental Variables

Consider some instrumental variable model, yᵢ = xᵢᵀβ + εᵢ, such that
$$ E[Y_i\mid Z] = E[X_i\mid Z]^\top\beta + E[\varepsilon_i\mid Z]. $$
The estimator of β is
$$ \hat\beta_{IV} = [Z^\top X]^{-1}Z^\top y. $$
If dim(Z) > dim(X), use the Generalized Method of Moments,
$$ \hat\beta_{GMM} = [X^\top\Pi_Z X]^{-1}X^\top\Pi_Z y, \quad\text{with } \Pi_Z = Z[Z^\top Z]^{-1}Z^\top. $$

Instrumental Variables

Consider the standard two-step procedure:
1) regress the columns of X on Z, X = Zα + η, and derive the predictions X̂ = Π_Z X;
2) regress y on X̂, yᵢ = x̂ᵢᵀβ + εᵢ, i.e. β̂_IV = [ZᵀX]⁻¹Zᵀy.

See Angrist & Krueger (1991), with 3 up to 1530 instruments: 12 instruments seem to contain all the necessary information. Use LASSO to select the necessary instruments, see Belloni, Chernozhukov & Hansen (2010).
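A minimal sketch of the two-step procedure on simulated data with one endogenous regressor and one instrument (the data-generating process is illustrative): OLS is biased by the confounder, 2SLS recovers the true coefficient.

```python
# Sketch: two-step IV (2SLS) versus OLS with an endogenous regressor.
import numpy as np

rng = np.random.default_rng(13)
n = 5000
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # confounder
x = z + u + rng.normal(size=n)              # endogenous regressor
y = 2 * x + u + rng.normal(size=n)          # true beta = 2, but cov(x, eps) != 0

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)               # biased
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)              # step 1: regress X on Z
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)  # step 2: regress y on X_hat
print("OLS:", np.round(beta_ols, 2), "2SLS:", np.round(beta_2sls, 2))
```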

Take-Away Conclusion

Big data mythology:
- n → ∞: 0/1 law, everything is simplified (either true or false)
- p → ∞: higher algorithmic complexity, need for variable selection tools

Econometrics vs. machine learning:
- econometrics relies on a probabilistic interpretation of models (unfortunately sometimes misleading, e.g. the p-value) and can deal with non-i.i.d. data (time series, panels, etc.)
- machine learning is about predictive modeling and generalization: algorithmic tools, based on bootstrap (sampling and sub-sampling), cross-validation, variable selection, nonlinearities, cross effects, etc.

Importance of visualization techniques (forgotten in econometrics publications).