Applied Econometrics

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

2/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

3/83

Topics addressed

- Mostly micro-econometrics (time series: dedicated course)
- Getting acquainted with Stata (version 10)
- Ordinary Least Squares
- Generalized Least Squares
- Instrumental Variables
- Binary outcome models (Probit & Logit)
- Multinomial Logit, Ordered Probit

6/83

Some references

- Getting started: Basic Econometrics, by D. Gujarati, McGraw Hill
- Applied work: A Guide to Modern Econometrics, by M. Verbeek, Wiley
- Comprehensive text exclusively on microeconometrics (theory & applications): Microeconometrics, by A.C. Cameron & P.K. Trivedi, Cambridge University Press
- Comprehensive text on every possible topic: Econometric Analysis, by W. Greene, Pearson International Edition
- Very good book in French (theory only): Introduction à l’Econométrie, by B. Dormont, Montchrestien
- Interesting case studies: Econométrie Appliquée, by I. Cadoret et al., De Boeck

7/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

9/83

Econometrics

- Economic analysis relies on theoretical representations of behaviors and mechanisms (ex: macro-economic models, the Capital Asset Pricing Model, etc.)
- The relevance of these models should be tested on real data (ex: does the CAPM "work"?)
- Parameters of interest should be assessed too (ex: what is the impact of a rise in oil prices or interest rates on car sales?)
- Econometrics uses statistical tools to provide an answer to these questions

10/83

Capital Asset Pricing Model (CAPM)

The capital asset pricing model (CAPM) is a model that describes the relationship between systematic risk and expected return for assets, particularly stocks. CAPM is widely used throughout finance for pricing risky securities, generating expected returns for assets given their risk, and calculating costs of capital. The general idea behind the CAPM is that investors need to be compensated in two ways: time value of money and risk. The time value of money is represented by the risk-free rate (rf) in the formula and compensates investors for placing money in any investment over a period of time. The risk-free rate is customarily the yield on government bonds like U.S. Treasuries.
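For reference, the textbook CAPM relation behind this description (not written out on the original slide) is:

E(Ri) = rf + βi [E(Rm) − rf]

where E(Rm) is the expected return of the market portfolio and βi measures the sensitivity of asset i's return to market movements.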

Example

Consider an individual consumption function: for each individual i, we observe consumption Ci and income Ri. A very simple model explaining Ci would be:

∀i, Ci = a + b Ri

The goal of econometrics would be to test the relevance of this model (is it a good approximation of reality?) and to provide estimates of parameters a and b.

11/83

Types of datasets

- Cross-sectional data: individual data collected at a particular point in time (ex: census data on individuals, households, firms)
- Time series data: variables collected over time (ex: yearly GDP, monthly unemployment rate)
- Panel data: individual data collected over time (ex: Medical Expenditure Panel Survey)

Cross-sectional data and panel data are collected not at the population level, but on a sample, representative of the population. We thus have to infer the characteristics of the population from the information we get from that particular sample.

12/83

A14

Types of data: cross-sectional data (survey)

A15

Types of data: panel data

- The same individual (the unit of observation) is observed over a certain period of time (5-10 years).
- Most often these are random (survey) data.
- Attrition problems!

A16

Types of data: panel data

A17

Types of data: time series of survey data

- Cross-sectional data (surveys, time series) collected at different periods can be "stacked".
- This is useful when there are common variables.
- The resulting file can be treated like classic cross-sectional data, taking the time dimension into account.
- Time series of observational data are often also called panels (international economics).

A18

Types of data: time series of survey data

A22

Types of data: pseudo-panels: same structure as panels, but individuals are grouped.

A19

Types of data: time series

- Time series have a structure of the type: one observation = one time period (year, month, week, day, ...).
- Time series are not random samples – specific problems arise.
- Their specificity is the analysis of trends, seasonal variations, volatility, persistence, and dynamics.

A20

Types of data: time series

Variables

- Quantitative variables are measured on a numeric or quantitative scale: income, age, ...
- Qualitative variables are not: gender, region, ..., but they need to be coded as numbers anyway

Consider the model ∀i, Ci = a + b Ri:

- The endogenous (or dependent, or explained) variable is the one explained by the model (here, Ci)
- Exogenous (or independent, or explanatory) variables determine the endogenous variable (here, Ri)

13/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

16/83

The simple linear model : example

Say we'd like to know about the health care habits of the French population. Assume that for each individual i, health care expenditure yi could be explained by age xi, in a linear way:

∀i, yi = b0 + b1 xi

For a newborn, expenditure would be b0, and each additional year would incur b1 extra cost. This makes sense, since we all know that health expenditure increases with age. This is a simple model because we have only one explanatory variable.

17/83

The simple linear model

Of course, the link between age and expenditure cannot be that perfect, so we add an error term ui that accounts for unobserved factors that make this relation imperfect (frailty, income, etc.):

∀i, yi = b0 + b1 xi + ui

This is a (simple) model of individual health care expenditure. Its relevance cannot be assessed on the whole population, but rather on a sample, say of size N. The model can be estimated using ordinary least squares (OLS), if some hypotheses are verified. Notice that this is a linear model because the parameters (and not necessarily the variables) enter it in a linear way.

18/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

19/83

The simple linear model: hypotheses

Let's call i an individual from the size-N sample.

∀i, yi = b0 + b1 xi + ui

The usual, so-called Gauss-Markov hypotheses for such a model are:

- ∀i, E(ui) = 0
- ∀i, the xi are considered deterministic (1)
- V(x) ≠ 0
- ∀i, V(ui) = σ² and ∀i ≠ j, cov(ui, uj) = 0

It simply means that we want the model to predict E(yi):

∀i, E(yi) = E(b0 + b1 xi + ui) = E(b0 + b1 xi) = b0 + b1 E(xi) = b0 + b1 xi,

... that we want u not to be correlated with x, the xi not to be all the same, the variability of the error term to be roughly the same over the individuals of the sample, and the error terms to be uncorrelated.

(1) This convenient hypothesis will be discussed later.

20/83

Interpretation of parameters

Say we have the following model: ∀i, yi = b0 + b1 xi + ui. Then b1 can be interpreted as the marginal change in y when x increases by 1 unit, because, since u is by definition unrelated to x, we have:

dy/dx = b1

The linear model is in fact very general.

21/83

Example : squared variables

The following model is linear, even if variable x is squared:

yi = b0 + b1 xi² + ui

A way to understand how y moves with x is to compute the derivative:

dy/dx = 2 b1 x

We see it is not constant: it depends on the value of x.

22/83

Example : logs (1)

- Let's take a simple model: ∀i, log(yi) = a + b xi + ui
- How can we interpret b? (let's drop the i's for convenience)
- Taking the exponential: y = exp(a + bx + u), so

dy/dx = b · exp(a + bx + u) = b · y

(dy/y)/dx = b

b is thus interpreted as the percentage variation of y when x increases by 1 unit. This kind of model has a semi-log form.

23/83

Example : logs (2)

- Let's take a simple model: ∀i, log(yi) = a + b · log(xi) + ui
- How can we interpret b?
- Taking the exponential: y = exp(a + b · log(x) + u), so

dy/dx = b · (1/x) · exp(a + b · log(x) + u) = b · y/x

(dy/y)/(dx/x) = b

b is thus interpreted as an elasticity: the % variation of y when x increases by 1%. This kind of model has a log-log form.
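A small simulation sketch (illustrative only; Python rather than Stata, and all numbers are made up) showing that the OLS slope of the log-log regression recovers the elasticity:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.exp(rng.normal(size=500))                                  # positive regressor
    y = np.exp(0.5 + 0.8 * np.log(x) + 0.1 * rng.normal(size=500))   # true elasticity: 0.8

    lx, ly = np.log(x), np.log(y)
    b_hat = np.sum((lx - lx.mean()) * (ly - ly.mean())) / np.sum((lx - lx.mean()) ** 2)
    print(b_hat)  # close to 0.8: a 1% increase in x goes with roughly a 0.8% increase in y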

24/83

How to choose ?

- Using economic theory, common sense, plots, ...
- Example: the link between income and medical expenditures using the World Bank's 1997 Vietnam Living Standards Survey (source: Cameron & Trivedi, chapter 4)
- Model: medical expenditures = a + b · income + u
- Data: 5,999 households
- Total household income is usually not well captured in developing countries, so we use total household expenditures as a proxy

25/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

26/83

Our goal : estimate the parameters of the model

∀i, yi = b0 + b1 xi + ui

- b0 and b1 are unknown.
- We thus want to find the best estimates for them: b̂0 and b̂1.
- One way of doing this is using ordinary least squares: OLS.
- b̂0 + b̂1 xi should be as close as possible to yi.
- We define ŷi as the model prediction of yi: ŷi = b̂0 + b̂1 xi
- And we define ûi as yi − ŷi: the residuals
- Warning: u and û are two different things

27/83

OLS estimators

- We want to minimize the global sum of residuals: this will give us the optimal b̂0 and b̂1
- But the residuals are known only when the regression line is drawn, and for that we need b̂0 and b̂1: what to do?
- We need to express the residuals as a function of b̂0 and b̂1, and then minimize them
- Since residuals can be either positive or negative, we minimize the sum of squared residuals with respect to b̂0 and b̂1:

Min Σ ûi² = Min Σ (yi − ŷi)² = Min Σ (yi − b̂0 − b̂1 xi)²

And the solution is:

b̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

and

b̂0 = ȳ − b̂1 x̄

28/83

Remark

b̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = sxy / sxx

- sxx = (1/N) Σ (xi − x̄)² is the empirical variance of x
- sxy = (1/N) Σ (xi − x̄)(yi − ȳ) is the empirical covariance between x and y
- Warning: correlation ≠ causality

29/83

Example

Compute OLS estimates of the following model: ∀i, yi = a + b xi + ui

Given the following data:

y : 8  2  6  0  4
x : 3  1  3  1  2
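For reference, a minimal Python/numpy sketch (the course itself uses Stata; this is only an illustration) applying the two formulas above to these five observations:

    import numpy as np

    # Data from the example above
    x = np.array([3., 1., 3., 1., 2.])
    y = np.array([8., 2., 6., 0., 4.])

    # OLS formulas for the simple linear model
    b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0_hat = y.mean() - b1_hat * x.mean()

    print(b0_hat, b1_hat)  # -2.0 and 3.0 for these data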

30/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

31/83

The OLS estimator is linear

We have:

b̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

So:

b̂1 = Σ (xi − x̄) yi / Σ (xi − x̄)² = Σ [ (xi − x̄) / Σ (xj − x̄)² ] yi

(the ȳ term drops out because Σ (xi − x̄) ȳ = 0). Estimate b̂1 is thus a linear combination of the yi.

32/83

Are all observations given the same weight?

We have:

b̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

So:

b̂1 = Σ [ (xi − x̄)² · (yi − ȳ)/(xi − x̄) ] / Σ (xi − x̄)² = Σ pi · (yi − ȳ)/(xi − x̄)

calling pi = (xi − x̄)² / Σ (xi − x̄)².

(yi − ȳ)/(xi − x̄) is the slope of the line drawn from the point corresponding to individual i to the sample average, and pi is an increasing function of (xi − x̄)². Estimate b̂1 is thus highly influenced by extreme points (see Anscombe's quartet of identical regressions).

33/83

OLS estimators are "BLUE"

∀i, yi = b0 + b1 xi + ui

- b̂0 and b̂1 are particular outcomes of random variables (their expression comprises y, which in turn comprises u)
- Unbiased: E(b̂0) = b0, E(b̂1) = b1
- V(b̂1) = σ² / Σ (xi − x̄)² and cov(b̂0, b̂1) = −x̄ V(b̂1)
- OLS estimators are linear in yi and are the Best Linear Unbiased Estimators (BLUE): this is the Gauss-Markov theorem, which is valid under the 4 Gauss-Markov assumptions seen earlier
- "Best" means they have a minimal variance among unbiased linear estimators
- All this is true whatever the size of the sample

34/83

The variance of estimators : summary

- V(b̂1) = σ² / Σ (xi − x̄)²
- V(b̂0) = σ²/N + x̄² V(b̂1)
- To get an estimate of these variances, we need an unbiased estimator for σ²: σ̂² = Σ ûi² / (N − 2)
- We'll see a proof of these results in the more general case of the multiple linear model

35/83

Asymptotics (1)

- Do OLS estimates have good properties when sample size goes to infinity, i.e. "asymptotically"?
- Let's call (b̂0,N) and (b̂1,N) the series of estimators corresponding to a sample of size N
- (b̂0,N) and (b̂1,N) converge in probability towards b0 and b1
- Definition of convergence in probability: (X1, X2, ..., Xt) converges in probability towards a if and only if: for every ε > 0, P(|Xt − a| > ε) → 0 when t → +∞
- It means that when sample size goes to infinity, the probability that the estimates lie more than any fixed distance away from the true parameters goes to zero

36/83

Supplementary hypothesis needed

To prove this convergence, we need to assume that variable x follows this rule:

lim (N→∞) (1/N) Σ (xi − x̄)² = σx² ≠ 0

This simply means that the empirical variance of x should have a given limit which is not zero. If it were zero, it would mean that when we increase the sample size, after a while variable x doesn't vary anymore (it sticks to its average x̄), so it doesn't provide any information. In fact, the only thing we need here is that the x keep some variance when sample size goes to infinity.

37/83

Focus on consistency / convergence

- Convergence in quadratic mean ⇒ convergence in probability
- Definition: a series of random variables (X1, X2, ..., Xt) converges in quadratic mean towards variable X iff: E[(Xt − X)²] → 0 when t → ∞
- Definition: (X1, X2, ..., Xt) converges in probability towards a iff: for every ε > 0, P(|Xt − a| > ε) → 0 when t → +∞

It means that we only need to prove that OLS estimators converge in quadratic mean towards their "true counterparts" to prove they are consistent in probability. So, since they are unbiased, we only need to prove that their variance goes to zero when sample size goes to infinity (easier).

38/83

Asymptotics (2)

- Since the estimates depend on the error terms, if we don't know the distribution of the error terms, we won't know the distribution of the estimates
- But we can prove that if the ui are iid (2), b̂0 and b̂1 are asymptotically normal
- So if sample size goes to infinity (= with a sample "large enough"), we know the distribution of the estimates, so we may compute confidence intervals, run tests, etc.
- Note: it is important to know under which conditions results are found (need for iid error terms, sample size, etc.)

39/83

(2) Independent and identically distributed.

Focus on the Central Limit Theorem (CLT) (Lindeberg-Feller version)

- Assume we have a series of random variables (z1, z2, ..., zN) that are independent, with finite expectations and variances, such that ∀i, E(zi) = μi and V(zi) = σi²
- They can have different expectations and variances, and don't need to share the same distribution
- Define z̄N = (1/N) Σ (i=1..N) zi and μ̄ = (1/N) Σ μi
- Define σ̄²N = (1/N) Σ (i=1..N) σi², and assume lim (N→∞) σ̄²N = σ̄²
- Then: √N (z̄N − μ̄) converges in distribution towards N(0, σ̄²) when N → ∞

40/83
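A small simulation sketch (illustrative only; the distributions chosen here are arbitrary) of this statement, with independent but not identically distributed uniform draws:

    import numpy as np

    rng = np.random.default_rng(0)
    N, reps = 500, 2000
    # z_i ~ Uniform(-a_i, a_i): independent, mean zero, different variances a_i^2 / 3
    half_widths = rng.uniform(0.5, 2.0, size=N)
    draws = rng.uniform(-half_widths, half_widths, size=(reps, N))

    sigma2_bar = np.mean(half_widths ** 2 / 3.0)     # average of the V(z_i)
    stat = np.sqrt(N) * draws.mean(axis=1)           # sqrt(N) * (sample mean - 0)

    print(stat.std(), np.sqrt(sigma2_bar))           # the two numbers should be close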

Focus on convergence in distribution / law

- A series of random variables (X1, X2, ..., Xt), each with cdf (3) Ft, converges in distribution towards variable X with cdf F if:
- For every number x at which F is continuous, lim (t→∞) Ft(x) = F(x)
- Remark: convergence in quadratic mean ⇒ convergence in probability ⇒ convergence in distribution

41/83

(3) Cumulative distribution function.

A measure of the "goodness of fit" of the regression

We call R², or coefficient of determination:

R² = Σ (ŷi − ȳ)² / Σ (yi − ȳ)²

- It is the ratio of the variance explained by the model over the total variance, and lies between 0 (very poor fit) and 1 (perfect fit).
- A formal derivation of the R² will be given in the more general case of the multiple linear model, as well as the conditions under which it can be used.
- Warning: if the x variable is a good predictor of y but the link is not linear, the R² will be low because of the lack of fit and not because of a poor choice of variables.
- In a simple linear model (only 1 explanatory variable), the R² is equal to the square of the linear correlation coefficient between x and y.

42/83
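A minimal numpy sketch (simulated data, illustrative only) checking both the definition of R² and this last equality:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 1.0 + 2.0 * x + rng.normal(size=200)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
    r_xy = np.corrcoef(x, y)[0, 1]
    print(r2, r_xy ** 2)   # the two numbers coincide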

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

43/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

44/83

Multiple linear model

Say we get back to the simple model explaining health care expenditures by age, and we want to add to the model various other explanatory variables (income, household size, etc.): x2, x3, ..., xk−1.

∀i, yi = b0 + b1 x1,i + b2 x2,i + b3 x3,i + ... + bk−1 xk−1,i + ui   (1)

This is a (less simple) model of individual health care expenditure. It can be rewritten the following way, for individuals i = 1 to N:

y1 = b0 + b1 x1,1 + b2 x2,1 + b3 x3,1 + ... + bk−1 xk−1,1 + u1
y2 = b0 + b1 x1,2 + b2 x2,2 + b3 x3,2 + ... + bk−1 xk−1,2 + u2
...
yN = b0 + b1 x1,N + b2 x2,N + b3 x3,N + ... + bk−1 xk−1,N + uN

45/83

Multiple linear model

This system of equations can be rewritten using simple vectors:

(y1, y2, ..., yN)' = b0 (1, 1, ..., 1)' + b1 (x1,1, x1,2, ..., x1,N)' + ... + bk−1 (xk−1,1, xk−1,2, ..., xk−1,N)' + (u1, u2, ..., uN)'   (2)

And also in a more compact way, using matrices:

y = X b + u   (3)

where y is the N×1 vector of the yi, X is the N×k matrix whose i-th row is (1, x1,i, x2,i, ..., xk−1,i), b is the k×1 vector (b0, b1, ..., bk−1)', and u is the N×1 vector of error terms. Dimensions: y (N,1), X (N,k), b (k,1), u (N,1).

46/83

Some notations

E(b) is the k×1 vector (E(b0), E(b1), ..., E(bk−1))'   (4)

V(b) = E[(b − E(b))(b − E(b))']   (5)

V(b) is the k×k matrix whose diagonal elements are V(b0), V(b1), ..., V(bk−1) and whose off-diagonal element (j, l) is cov(bj, bl)   (6)

And notice that: E(A·X) = A·E(X), but V(A·X) = A·V(X)·A'

47/83

Usual hypotheses of the model

These are the same as for the simple model, but generalized to any number of explanatory variables:

1. E(u) = 0
2. X deterministic
3. Rank(X) = k
4. E(uu') = σ² IN
5. When N → ∞, lim X'X / N = VX, where VX is a finite non-singular matrix

The rank of a matrix is the number of columns that are linearly independent. A non-singular matrix has an inverse.

48/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

49/83

Estimation process (1)

- OLS is about minimizing the sum of squared residuals, each one called ûi
- Calling û the vector of residuals, this sum can be written as: S = Σ ûi² = û'û
- The derivative of S with respect to the vector of parameters b̂ should be zero, for this vector of parameters to give an optimum for S
- Plus, the second derivative should be positive for this optimum to be a minimum
- How can we take the derivative of a number with respect to a vector? It is simply the vector G of derivatives, called the gradient:

G = ∂S/∂b̂ = (∂S/∂b̂0, ∂S/∂b̂1, ..., ∂S/∂b̂k−1)'   (7)

50/83

Estimation process (2)

We know that û = y − ŷ = y − X b̂, then:

S = û'û
  = (y − X b̂)'(y − X b̂)
  = (y' − b̂'X')(y − X b̂)
  = y'y − y'X b̂ − b̂'X'y + b̂'X'X b̂
  = y'y − 2 y'X b̂ + b̂'X'X b̂

Indeed, y'X b̂ and b̂'X'y are each other's transpose, but they are both scalars (matrices of size one by one), and a scalar and its transpose are the same thing.

51/83

Estimation process (3)

We can now compute the gradient G:

G = ∂S/∂b̂ = ∂(y'y)/∂b̂ − ∂(2y'X b̂)/∂b̂ + ∂(b̂'X'X b̂)/∂b̂ = 0 − 2X'y + 2X'X b̂

since y is not a function of b̂. We look for an optimum, so we want G = 0, which implies that b̂ = (X'X)⁻¹ X'y (given that (X'X)⁻¹ exists, which follows from the fact that Rk(X) = k).
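As a quick illustration of this formula (the course itself works in Stata; this is a small numpy sketch on simulated data, with all names and numbers made up), b̂ = (X'X)⁻¹X'y can be computed and checked against numpy's built-in least-squares routine; the last lines also compute σ̂² and V̂(b̂) = σ̂²(X'X)⁻¹, which appear later in the deck:

    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 200, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])   # constant + 2 regressors
    b_true = np.array([1.0, 0.5, -2.0])
    y = X @ b_true + rng.normal(size=N)

    # b_hat = (X'X)^{-1} X'y; solve() avoids forming the inverse explicitly
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Same numbers via the dedicated least-squares routine
    b_check, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Unbiased estimate of sigma^2 and estimated variance-covariance matrix of b_hat
    resid = y - X @ b_hat
    sigma2_hat = resid @ resid / (N - k)
    V_hat = sigma2_hat * np.linalg.inv(X.T @ X)

    print(b_hat, b_check, np.sqrt(np.diag(V_hat)))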

52/83

Estimation process (4)

- We also need the second derivative of S to be positive, which in the matrix framework amounts to having the matrix of second derivatives positive definite.
- This matrix of second derivatives is equal to ∂(−2X'y + 2X'X b̂)/∂b̂ = 2X'X, which is indeed positive definite.
- Reminder: a matrix A is positive definite if for any non-zero vector y of convenient size, y'Ay > 0. So let's check this with our matrix 2X'X.
- We only need to check that X'X is positive definite. For any vector y of convenient size, y'(X'X)y = (Xy)'(Xy). But Xy is just a column vector, and a transposed vector times itself is a sum of squares, equal to its squared norm (= length), which is always non-negative; it is strictly positive for y ≠ 0 because Rk(X) = k implies that Xy = 0 only when y = 0.

53/83

Geometrical interpretation of OLS (1)

- In the minimization process, we got to a point where: G = −2X'y + 2X'X b̂ = 0
- Rewritten: X'y − X'X b̂ = X'(y − X b̂) = X'(y − ŷ) = X'û = 0
- Which means that û should be perpendicular to every column vector of matrix X, i.e. perpendicular to the vector space spanned by the columns of X
- The condition X'û = 0 is called the system of normal equations (normal also means perpendicular)

54/83

Geometrical interpretation of OLS (2)

- Consider the following matrix: PX = X (X'X)⁻¹ X'
- Matrix PX is the general form of a matrix that projects orthogonally on L(X) (the vector space spanned by the columns of X)
- Notice that ŷ = X b̂ = X (X'X)⁻¹ X' y = PX y
- Prediction ŷ is the orthogonal projection of y on L(X)
- Let's call MX = I − PX, with I the identity matrix of convenient dimension
- MX projects orthogonally on L⊥(X)
- We can see that the residual û = MX y. Notice too that y = ŷ + û = PX y + MX y
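A small numpy sketch (illustrative only, simulated data) of these projection facts: PX and MX are idempotent, the residuals are orthogonal to the columns of X (the normal equations), and y decomposes as PX y + MX y:

    import numpy as np

    rng = np.random.default_rng(1)
    N, k = 50, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
    y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=N)

    P = X @ np.linalg.inv(X.T @ X) @ X.T     # projects onto the column space of X
    M = np.eye(N) - P                        # projects onto its orthogonal complement

    y_hat = P @ y
    u_hat = M @ y

    print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
    print(np.allclose(X.T @ u_hat, 0))                    # residuals perpendicular to X
    print(np.allclose(y, y_hat + u_hat))                  # y = Py + My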

55/83

Consequence: the analysis of variance

- If there is a constant in the model, û ⊥ e (where e is the column of 1's in X corresponding to the constant) and the sum of residuals is zero
- We thus get Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ ûi² (Pythagorean theorem)
- That gives: V(y) = V(ŷ) + V(û)
- The coefficient of determination can be computed as: R² = Σ (ŷi − ȳ)² / Σ (yi − ȳ)²
- R² represents the percentage of variability explained by the model
- But adding a variable, even if it is irrelevant, will artificially increase the R² (proof: see Dormont)
- For models with a different number of explanatory variables, only the adjusted R̄²'s can be compared
- Adjusted R²: R̄² = 1 − (N−1)/(N−k) · (1 − R²) (warning: it can be negative ...)

56/83

Interpretations of the R squared

- R² represents the percentage of variability explained by the model
- It thus represents how well the model fits the data
- If there is a true relationship between the variables but this relationship is in fact non-linear, the R² will be rather low
- The R² is the squared multiple linear correlation coefficient between y and the X's
- The R² can be interpreted as the square of the linear correlation coefficient between y and its prediction ŷ
- If there is no constant in the model, it has no meaning, because the way it is computed requires a constant term
- The R² is not enough to assess the relevance of a regression: we'll need statistical tests (see later)

57/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

59/83

Remarks

- R² and adjusted R² are valid only when comparing models that have the same dependent variable
- So they are inappropriate for comparing 2 models with y and log(y) as the dependent variable
- See the PE test (parametric encompassing) to compare models in logs vs. levels (see Verbeek)

58/83

Strict multicollinearity

- This problem arises if some variable is equal to an exact linear combination of some other variables (e.g. if we have variables like income, income after tax and tax)
- This variable is unnecessary because it contains redundant information
- Worse, it prevents the computation of the OLS estimator, since X is no longer full rank and we cannot compute b̂ = (X'X)⁻¹ X'y
- If the software detects multicollinearity, it arbitrarily removes one redundant variable

60/83

Near multicollinearity

- This problem arises if some variable is almost equal to an exact linear combination of some other variables
- In other words, it means that they are very strongly correlated
- We can still compute b̂ = (X'X)⁻¹ X'y
- But b̂ is very unstable and the estimator is not very reliable (large variance)

61/83

How to detect near multicollinearity

- Near multicollinearity happens very rarely: we should do something only if we find an abnormally high variance for some estimates
- We can compute the coefficient of correlation ri,j between every pair of explanatory variables
- If ri,j > R² of the regression, we suspect multicollinearity (this is the Klein test), but if there is no high-variance problem, we should not remove the variable in question
- See also the VIF (Variance Inflation Factor), as sketched below
- To fix this problem, we can either increase the sample size (not always possible, and anyway multicollinearity could remain) or remove the variable in question from the analysis (for now)
- Other solutions are available (ridge regression), but they are out of the scope of this course

62/83
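As a complement, a minimal Python sketch (illustrative; the helper name vif and the simulated data are my own, not from the slides) of the Variance Inflation Factor computed from auxiliary regressions, VIFj = 1 / (1 − Rj²):

    import numpy as np

    def vif(X):
        # VIF of each column of X (X without the constant), from the auxiliary
        # regression of that column on all the other columns plus a constant
        X = np.asarray(X, dtype=float)
        out = []
        for j in range(X.shape[1]):
            target = X[:, j]
            others = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, target, rcond=None)
            fitted = others @ coef
            r2 = np.sum((fitted - target.mean()) ** 2) / np.sum((target - target.mean()) ** 2)
            out.append(1.0 / (1.0 - r2))
        return np.array(out)

    # Example: x3 is almost a linear combination of x1 and x2, so VIFs explode
    rng = np.random.default_rng(0)
    x1, x2 = rng.normal(size=200), rng.normal(size=200)
    x3 = 0.7 * x1 - 0.4 * x2 + 0.01 * rng.normal(size=200)
    print(vif(np.column_stack([x1, x2, x3])))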

The multiple linear model: comparison of the simple and multiple models; the problem of selecting explanatory variables

[Venn diagrams of y, x1 and x2 omitted.] Selection strategy for the regressors (X): less correlated between themselves than with y.

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

63/83

Dummy variables (1)

- Let's say we have N individual observations providing income, gender and the number of years of education.
- We want to explain individual income y by education x and gender z: yi = a + b·xi + c·zi + ui
- z is a categorical variable, ideally coded with 0/1
- Say z = 1 codes for males
- c represents the extra income associated with being a male rather than a female
- It would be irrelevant to have both a male and a female variable, because it would cause multicollinearity (besides the fact that it is unnecessary)
- Multicollinearity means that Rk(X) < k, so that the OLS estimates cannot be computed

64/83

Dummy variables (2)

- Let's say we have N individual observations providing an index of life satisfaction, income and type of environment (big city, small city, rural).
- We want to explain life satisfaction y by income x and environment z
- z is still a categorical variable, but we now have 3 categories: we have to choose a category with respect to which the parameters will be computed, say rural
- The model is written: yi = a + b·xi + c1·z1i + c2·z2i + ui, where z1 and z2 are dummy variables that code for big city and small city respectively
- Stata command: xi: reg y income i.environment
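For readers working outside Stata, a rough Python/pandas equivalent (the data frame and its values are invented for illustration) of building 0/1 dummies with "rural" as the reference category and running the regression:

    import numpy as np
    import pandas as pd

    # Hypothetical data with a 3-category environment variable
    df = pd.DataFrame({
        "satisfaction": [6.0, 7.5, 5.0, 8.0, 6.5, 7.0],
        "income":       [20.0, 35.0, 15.0, 40.0, 25.0, 30.0],
        "environment":  ["rural", "big city", "small city", "big city", "rural", "small city"],
    })

    # Create 0/1 dummies and drop the reference category ("rural")
    dummies = pd.get_dummies(df["environment"], prefix="env").drop(columns="env_rural")

    X = np.column_stack([np.ones(len(df)), df["income"], dummies.astype(float)])
    y = df["satisfaction"].to_numpy()
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b_hat)   # a, b, c1 (big city), c2 (small city)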

65/83

Dummy variables (3)

- Let's say we have N yearly income observations for one individual.
- We want to explain individual income y by the year x: yt = a + b·xt + ut
- Let's say that in year i this person wins the lottery.
- We'd like to code for this event with a dummy variable:
- Dt = 1 if xt = i, Dt = 0 otherwise.
- The model is now written: yt = a + b·xt + c·Dt + ut
- c represents the extra money that the lottery brings, with respect to the income the individual would be expected to earn in year t.

66/83

Dummy variables (4)

In this particular framework, what does the introduction of D amount to? The model can be written the following way:

(y1, ..., yN)' = X (a, b)' + c·D + (u1, ..., uN)'   (8)

where X is the N×2 matrix whose t-th row is (1, xt), and D is the N×1 vector that is zero everywhere except for a 1 in position i (the lottery year).

It can be shown that introducing D amounts to estimating a and b without observation i, while c captures the lottery effect.

67/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

68/83

Missing variables

- It can be easily understood that estimating the previous model with and without dummy D can lead to changes even in parameters a and b
- Omitting a relevant variable (here D) leads to an omitted-variable bias that affects all estimated parameters
- Proof: with the Frisch-Waugh theorem, and a general proof in later chapters (on endogeneity)
- This is a very important issue in applied work, so try not to forget important variables in models

69/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

70/83

Centered variables

- Estimating the simple model yt = a + b·xt + ut with or without centering the observations (i.e., replacing each value yi by yi − ȳ, and likewise for x) will give the same result for b̂.
- And using the Frisch-Waugh theorem, it can be shown that using centered variables will lead to â = 0.
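A small numpy sketch (simulated data, illustrative only) of both claims: the slope is unchanged after centering, and the intercept of the centered regression is zero:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(5.0, 2.0, size=100)
    y = 3.0 + 1.5 * x + rng.normal(size=100)

    def simple_ols(x, y):
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        return y.mean() - b1 * x.mean(), b1

    a_hat, b_hat = simple_ols(x, y)
    a_c, b_c = simple_ols(x - x.mean(), y - y.mean())
    print(b_hat, b_c)   # identical slopes
    print(a_c)          # intercept is (numerically) zero with centered variables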

71/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

72/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

73/83


OLS estimators are (still) BLUE

- We have b̂ = (X'X)⁻¹ X'y
- E(b̂) = b : unbiased
- V(b̂) = σ² (X'X)⁻¹
- σ̂² = Σ ûi² / (N − k) : unbiased
- So V̂(b̂) = σ̂² (X'X)⁻¹
- Gauss-Markov theorem: under hypotheses 1 to 4, the OLS estimator is the Best Linear Unbiased Estimator (unbiased with a minimal variance) (no assumption on the distribution of u)

74/83

Consistency and asymptotic normality

- When N → ∞, (b̂N) converges in probability towards b
- With the hypothesis E(uu') = σ² IN and iid errors, the estimators are asymptotically normal
- For now, we will reason only at a finite distance and not asymptotically (easier)

75/83

Outline Presentation of the course Introduction The simple linear model Gauss-Markov hypotheses OLS estimation Properties and goodness of fit The multiple linear model A generalization of the simple model OLS estimation Multicollinearity Dummy variables Missing variables Centered variables Properties of OLS estimators Unbiasedness & consistency Normality of the error term

76/83

Considering normality of the error term

- Assume now that u ~ N(0, σ² IN)
- Then b̂ ~ N(b, σ² (X'X)⁻¹), not only asymptotically
- Calling b̂j the j-th element of b̂ and αjj the j-th diagonal element of (X'X)⁻¹, we get: b̂j ~ N(bj, σ² αjj)
- We thus know the distribution of b̂, but we cannot yet find confidence intervals for it since one element is still unknown: σ
- In the remainder: σ̂²(b̂j) = σ̂² αjj

77/83

Usual distributions to know

- Normal
- χ²
- Student (T)
- Fisher

78/83

The Normal distribution

It is the usual "bell-shaped" distribution. It can also be called the Gaussian distribution. X ~ N(µ, σ²) if:

f(x) = 1 / (σ √(2π)) · exp(−(x − µ)² / (2σ²))

Any linear combination of independent (or jointly normal) normal random variables is a normal random variable.

79/83

The Chi-squared distribution

It is defined as the sum of squared independent standard normal variables. Y ~ χ²(n) if:

Y = Σ (i=1..n) Xi²

where Xi ~ N(0, 1) and the Xi are independent. The sum of independent χ² random variables is a χ² random variable (with the degrees of freedom added up).

80/83

The Student distribution

It is defined as the ratio of a standard normal variable and the square root of an independent χ² divided by its degrees of freedom. Y ~ T(n) if:

Y = X / √(Z / n)

where X ~ N(0, 1), Z ~ χ²(n), and X and Z are independent. The Student distribution is bell-shaped, and when n → +∞ it becomes a normal distribution.

81/83

The Fisher distribution

It is defined as a ratio of two independent χ² variables, each divided by its degrees of freedom. Y ~ F(p, q) if:

Y = (Z1 / p) / (Z2 / q)

where Z1 ~ χ²(p), Z2 ~ χ²(q), and Z1 and Z2 are independent.

82/83

A useful result : the variance estimator

We know that σ̂² = Σ ûi² / (N − k). We thus get:

(N − k) σ̂² / σ² ~ χ²(N − k)

Proof: using the fact that u is normal and that û = MX u.
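A short simulation sketch (illustrative, with made-up design and parameter values) of this result: across replications, (N − k) σ̂²/σ² should have mean N − k and variance 2(N − k), as a χ²(N − k) variable does:

    import numpy as np

    rng = np.random.default_rng(0)
    N, k, sigma, reps = 30, 3, 2.0, 5000
    X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])  # fixed design with a constant
    b = np.array([1.0, 0.5, -0.3])

    stats = []
    for _ in range(reps):
        u = rng.normal(0.0, sigma, size=N)
        y = X @ b + u
        b_hat = np.linalg.solve(X.T @ X, X.T @ y)
        resid = y - X @ b_hat
        sigma2_hat = resid @ resid / (N - k)
        stats.append((N - k) * sigma2_hat / sigma ** 2)

    stats = np.array(stats)
    print(stats.mean(), N - k)        # a chi2(N-k) variable has mean N-k
    print(stats.var(), 2 * (N - k))   # and variance 2(N-k)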

83/83