Multivariate statistics 2 and clustering - Emmanuel Rachelson

Statistics and learning
Multivariate statistics 2 and clustering

Emmanuel Rachelson and Matthieu Vignes, ISAE SupAero

Wednesday 2nd and 9th October 2013

E. Rachelson & M. Vignes (ISAE)

SAD

2013

1 / 14

Link to the previous session

Goal of multivariate (exploratory) statistics: understanding high-dimensional data sets, reducing their 'useful' dimensions, representing them, seeking hidden or latent factors...

Today we will:

- review PCA, if needed;
- introduce multidimensional scaling (MDS) as a factor analysis of a distance matrix;
- introduce canonical correlation analysis (CCA), for two groups of p and q quantitative variables;
- introduce correspondence analysis (CA), for 2 qualitative variables with several (many) levels;
- introduce clustering methods such as hierarchical clustering or k-means-like algorithms.


Multidimensional scaling (MDS)

- Now only an index (a dissimilarity) between individuals is known; the variables are no longer observed. The data is an n × n matrix (think of distances).
- Goal: represent the cloud of points in a low-dimensional subspace.
- MDS = PCA on the distance matrix! (More precisely, an eigendecomposition of the doubly centred matrix of squared distances.)

Easy example

Road distances between 47 French cities. Is it Euclidean?
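
A minimal sketch of classical MDS may help fix ideas. It is not the course's own code: the function names, the pure-Python power iteration and the toy distance matrix (the four corners of a 4 × 3 rectangle) are assumptions made for illustration. The sketch also assumes the distances are Euclidean, so that the leading eigenvalues are positive, which is precisely what is in doubt for road distances.

```python
import math

def double_centre(d):
    """Return B = -1/2 * J D^2 J for a symmetric distance matrix d (list of lists)."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]          # row means of the squared distances
    total = sum(row) / n                    # grand mean
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]

def top_eigenpair(b, n_iter=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(b)
    v = [1.0] + [0.0] * (n - 1)             # arbitrary start vector (an assumption)
    for _ in range(n_iter):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                     # nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

def classical_mds(d, k=2):
    """Coordinates of the n individuals on the first k axes."""
    b = double_centre(d)
    n = len(d)
    coords = [[] for _ in range(n)]
    for _ in range(k):
        lam, v = top_eigenpair(b)
        scale = math.sqrt(max(lam, 0.0))    # negative eigenvalues are clipped to 0
        for i in range(n):
            coords[i].append(scale * v[i])
        # deflation: remove the axis just extracted before looking for the next one
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Toy data: pairwise distances between the corners of a 4 x 3 rectangle
d = [
    [0, 4, 5, 3],
    [4, 0, 3, 5],
    [5, 3, 0, 4],
    [3, 5, 4, 0],
]
coords = classical_mds(d, k=2)

def embedded_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

for point in coords:
    print([round(c, 3) for c in point])
```

Because these toy distances are exactly Euclidean in the plane, two axes are enough to reproduce them; the recovered coordinates are only defined up to rotation and reflection.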


Canonical correlation analysis (CCA)

- Uses techniques close to PCA to achieve a kind of multiple-output multivariate regression.
- Goal: linking 2 groups of variables (X and Y) measured on the same individuals.
- Example from yesterday on the study of fatty acids and gene expression levels in mice: are some acids more abundant when some genes are over-expressed? Or conversely? → Practical session!
- Consists in looking for a pair of vectors, one related to X (gene expressions) and one to Y (metabolite levels), which are maximally correlated. The search is then repeated iteratively, each new pair being uncorrelated with the previous ones.
- Variables can be represented in either basis; it does not change the interpretation.
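
The search for maximally correlated pairs can be written as an optimisation problem. The notation below is an assumption made for this sketch, not the slides' own: X is the n × p matrix of the first group, Y the n × q matrix of the second group, both with centred columns, and a, b are the weight vectors defining the two linear combinations.

```latex
% First canonical pair and first canonical correlation
(a_1, b_1) = \arg\max_{a \in \mathbb{R}^p,\ b \in \mathbb{R}^q} \operatorname{corr}(Xa,\, Yb),
\qquad
\rho_1 = \operatorname{corr}(Xa_1,\, Yb_1)

% Later pairs: same criterion, uncorrelated with the pairs already found
(a_s, b_s) = \arg\max_{a,\ b} \operatorname{corr}(Xa,\, Yb)
\quad \text{subject to} \quad
\operatorname{corr}(Xa,\, Xa_t) = \operatorname{corr}(Yb,\, Yb_t) = 0
\quad \text{for all } t < s
```

At most min(p, q) such pairs can be extracted, and the canonical correlations decrease with s.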


CCA (cont'd)

Need to have p, q ≤ n. We kept 10 genes and 11 fatty acids.

More interpretation? → Practical session

Correspondence analysis (CA)

- Known as AFC (analyse factorielle des correspondances) in French.
- Similar concept to PCA: it represents the distribution of the 2 variables and plots the individuals, but it applies to qualitative rather than quantitative data → contingency table $(n_{ij})$.
- This is a double PCA (row and column profiles) on $x_{ij} = \frac{f_{ij}}{f_{i\cdot} f_{\cdot j}} - 1$, with $f_{ij} = n_{ij}/n$.
- Note that the $\chi^2$ statistic can be written $\chi^2 = n \sum_i \sum_j \tilde{f}_{ij}\, x_{ij}^2$, where $\tilde{f}_{ij} = f_{i\cdot} f_{\cdot j}$ is the frequency expected under independence.
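
As a quick numerical check of the χ² identity, the sketch below computes the x_ij and compares the two ways of writing χ². The function name `ca_residuals` is an assumption; the counts are those of the homework table at the end of these slides (rows: bac track, columns: university, prep. classes, other). The sketch assumes that no row or column total is zero.

```python
def ca_residuals(table):
    """From counts n_ij, return n, the expected frequencies f~_ij and the x_ij."""
    n = sum(sum(row) for row in table)
    f = [[count / n for count in row] for row in table]     # f_ij = n_ij / n
    f_row = [sum(row) for row in f]                          # f_i.
    f_col = [sum(col) for col in zip(*f)]                    # f_.j
    f_exp = [[fi * fj for fj in f_col] for fi in f_row]      # f~_ij = f_i. * f_.j
    x = [[f[i][j] / f_exp[i][j] - 1 for j in range(len(f_col))]
         for i in range(len(f_row))]
    return n, f_exp, x

# Counts from the homework table (rows: lit., eco., scient., tech.)
table = [
    [13, 2, 5],
    [20, 2, 8],
    [10, 5, 5],
    [7, 1, 22],
]
n, f_exp, x = ca_residuals(table)
cells = [(i, j) for i in range(len(table)) for j in range(len(table[0]))]

# chi2 written with the x_ij, as on the slide
chi2_from_x = n * sum(f_exp[i][j] * x[i][j] ** 2 for i, j in cells)

# chi2 written the classical way: sum of (observed - expected)^2 / expected
chi2_classic = sum((table[i][j] - n * f_exp[i][j]) ** 2 / (n * f_exp[i][j])
                   for i, j in cells)

print(chi2_from_x, chi2_classic)
```

A positive x_ij marks a cell that is over-represented compared with independence, a negative one a cell that is under-represented.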

CA: an example

Cultivated area in the Midi-Pyrénées region

Simultaneous representation of département and farm size (in 6 bins).

Today

- "Clustering: unsupervised classification". Distance, hierarchical clustering (divisive or agglomerative).
- Keep in mind that this is still exploratory statistics, so the best clustering (including method, options, criterion, etc.) is the most useful one?!
- End of the practical session on the mice data set.
- And a new guided session on multivariate stats: CA on presidential elections, PCA and clustering (k-means and AHC) on the hotel data set, and multiple CA on 2 multiple-factor data sets.

Clustering: grouping into classes

Ever heard of that in your background?


Cluster analysis or clustering

- Task of grouping objects so that objects belonging to the same group are 'more similar' to each other than to those in any other group → multiobjective optimisation task.
- Several algorithms can do the job; they differ mainly in the distance they use.
- Different parameters (initialisation, distance used, stopping criterion...) may lead to different representations.
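
To make the last point concrete, here is a minimal sketch, not taken from the course material, of k-means on six made-up points on a line. The function name `kmeans_1d`, the data and the two starting configurations are illustrative assumptions; the update scheme is the usual alternation of an assignment step and a mean update (often called Lloyd's algorithm).

```python
def kmeans_1d(points, centres, max_iter=100):
    """k-means on 1-D points, starting from the given centres."""
    centres = list(centres)
    for _ in range(max_iter):
        # assignment step: each point joins its nearest centre
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda k: abs(p - centres[k]))
            groups[nearest].append(p)
        # update step: each centre moves to the mean of its group
        new_centres = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centres)]
        if new_centres == centres:      # nothing moved: converged
            break
        centres = new_centres
    return groups, centres

points = [0, 1, 5, 6, 10, 11]           # made-up data

groups_a, centres_a = kmeans_1d(points, [0, 11])
groups_b, centres_b = kmeans_1d(points, [0, 1])

print(groups_a)   # partition reached from the first initialisation
print(groups_b)   # partition reached from the second initialisation
```

Both runs stop at a stable configuration, yet the two partitions differ: the algorithm only reaches a local optimum of the within-cluster variance, which is why several initialisations are usually tried.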


Clustering algorithms

Challenge: build your own clustering algorithm?!

To name only a few widespread families of clustering algorithms:

- hierarchical clustering on a dissimilarity (min → single, max → complete or mean → average linkage)
- centroid models (e.g. K-means clustering)
- distribution models (statistical definition, e.g. multivariate Gaussian distribution)
- graph or density models (e.g. clique)
- ...
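
The first family in the list can be sketched in a few lines. The function below is a naive agglomerative procedure on made-up 1-D points; `agglomerative`, its arguments and the data are assumptions for illustration, and no attempt is made at efficiency. Passing `min` or `max` as the linkage gives single or complete linkage respectively.

```python
from itertools import combinations

def agglomerative(points, n_clusters, linkage):
    """Merge the two closest clusters until n_clusters remain.

    linkage turns the list of pairwise distances between two clusters
    into one dissimilarity: min -> single linkage, max -> complete linkage.
    """
    clusters = [[p] for p in points]            # start with one cluster per point
    while len(clusters) > n_clusters:
        # pair of cluster indices (a < b) with the smallest linkage dissimilarity
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage([abs(x - y)
                                    for x in clusters[ab[0]]
                                    for y in clusters[ab[1]]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

points = [0, 10, 21, 33, 46]                    # made-up data with growing gaps

single = agglomerative(points, 2, min)
complete = agglomerative(points, 2, max)

print(single)
print(complete)
```

On these points single linkage chains the neighbours together and isolates the last point, whereas complete linkage produces two more balanced groups.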


Clustering: some formalism

- Define a similarity (symmetry, self-similarity, boundedness) → dissimilarity.
- A distance needs additional properties: d(i, j) = 0 ⇒ i = j and the triangle inequality (the Euclidean distance derives from a scalar product).

A goodness-of-fit of partitions can be defined:
(i) external: TP, FP... → precision, sensitivity, or Rand/Jaccard index;
(ii) internal: Dunn index $D = \min_i \min_{j \neq i} \frac{d(i,j)}{\max_k d'(k)}$, where d(i, j) is a distance between clusters i and j and d'(k) measures the spread within cluster k.
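
Both kinds of criterion are easy to compute by hand on a small example. In the sketch below the function names, the label vectors and the two partitions are made up for illustration. For the Dunn index, d(i, j) is taken as the smallest distance between a point of cluster i and a point of cluster j, and d'(k) as the diameter of cluster k; these are common choices, not necessarily the ones intended on the slide.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """External criterion: share of pairs of objects on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def dunn_index(clusters, dist):
    """Internal criterion: smallest between-cluster distance over largest diameter.

    Assumes at least two clusters and at least one cluster with two points.
    """
    # min over pairs of clusters of d(i, j), the closest pair of points across them
    between = min(
        dist(x, y)
        for a, b in combinations(clusters, 2)
        for x in a for y in b
    )
    # max over clusters of d'(k), the largest distance inside cluster k
    diameter = max(
        dist(x, y)
        for c in clusters
        for x, y in combinations(c, 2)
    )
    return between / diameter

def absolute_difference(x, y):
    return abs(x - y)

# Two candidate partitions of the made-up points 0, 1, 5, 6, 10, 11
partition_1 = [[0, 1], [5, 6, 10, 11]]
partition_2 = [[0, 1, 5], [6, 10, 11]]
labels_1 = [0, 0, 1, 1, 1, 1]
labels_2 = [0, 0, 0, 1, 1, 1]

dunn_1 = dunn_index(partition_1, absolute_difference)
dunn_2 = dunn_index(partition_2, absolute_difference)
rand_12 = rand_index(labels_1, labels_2)

print(dunn_1, dunn_2, rand_12)
```

A larger Dunn index points to compact, well-separated clusters, so here it favours the first partition. The Rand index needs a reference partition to compare against, which is why it is called external.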


Homework

What do students choose after the French baccalauréat?

First describe and then represent this (simple) data set in some informative way. Hint: CA...

origin \ counselling   university   prep. classes   other   Total
bac lit.                       13               2       5      20
bac éco.                       20               2       8      30
bac scient.                    10               5       5      20
bac tech.                       7               1      22      30
Total                          50              10      40     100

Finished

Next time: tests

But before that: practice with R?!