Outline - Yves Desdevises


Genomes, phylogeny and lateral gene transfer desdevises.free.fr/ISMB2011

Yves Desdevises Observatoire Océanologique de Banyuls Université Pierre et Marie Curie France

Outline

• Molecular phylogenetics

• Phylogenomics

• Lateral gene transfer

• Illustrated uses of evolutionary genomics in a microalga-virus association

A few words on molecular phylogenetics

• Goal: propose a hypothesis of relationships between several taxa

• Phylogeny = tree

• Speciation: binary

• Hypothesis: [figure: tree relating taxa A, B and C]

[Figure: phylogenetic network]

Phylogenetic trees

[Figure: a phylogeny of wrasses (genera Symphodus, Labrus, Ctenolabrus, Cheilinus, Epibulus, Stethojulis, Halichoeres, Labropsis, Anampses, Labroides, Labrichthys, Coris, Hemigymnus, Thalassoma, Pictilabrus, Notolabrus, Bodianus, Clepticus), together with Pagrus major, shown in several tree layouts.]

Molecular data

• Most common source of data for phylogeny

• Nucleotides or amino acids (for ancient divergences)

• Important step: alignment (with the help of alignment software)

• Use of evolutionary models to build trees

• Gene tree ≠ species tree

• Genes: orthologous or paralogous

[Figure: gene tree in which a duplication of the ancestral gene produces gene copies labeled a, b, c and A, B, C, illustrating orthologs versus paralogs.]

Making a molecular phylogeny

[Flowchart, read top to bottom:]

• Data: DNA, AA, ...

• Alignment: software + eye

• Data quality: saturation, homogeneity, ...

• Method (depends on data type and number of taxa)
- Characters: BI, ML (model?), MP (weighting of sites or changes?)
- Distances (model?): with optimality criteria (ME, ...) or without (NJ, ...)

• Tree(s)

• Validation: bootstrap, ...

Optimality criteria

• Different methods to choose the “best tree” from the alignment

• Hypothesis on how evolution works

• Different in different methods
- Number of steps (parsimony)
- Sum of branch lengths (minimum evolution)
- Likelihood (ML, with evolutionary model)

Parsimony

• Method based on individual characters, associated with cladistics

• “Ockham’s razor”: favour the simplest solution

• Assess the fit of characters to trees by mapping them onto each tree via parsimony

• Optimisation is different on different trees
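To illustrate how a character is mapped onto a tree, here is a minimal sketch of Fitch's small-parsimony count for a single character on a rooted binary tree. The nested-tuple tree encoding, the taxon names and the character states are assumptions made for this example; they do not come from the slides.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   nested 2-tuples with leaf names as strings, e.g. (("A", "B"), "C")
    states: dict mapping each leaf name to its observed character state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its state set is the observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed here
            return common
        changes += 1                       # children disagree: one change is required
        return left | right

    visit(tree)
    return changes


# Hypothetical character observed in four taxa
states = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(fitch_score(tree1, states), fitch_score(tree2, states))  # prints: 1 2
```

The same character needs one step on the first tree and two on the second, so parsimony prefers the first tree for this character.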

Distances

• Assessment of the mean number of changes between two taxa

• Based on distances, not individual characters

• Data are sometimes available only as distances (e.g. DNA/DNA hybridization temperature); otherwise, the data are transformed into a distance matrix

• Mainly used for molecular data; molecular distances can be corrected using models of sequence evolution (the same as in ML)

• Main method: neighbor-joining (NJ)

• Very fast
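As an illustration of correcting a molecular distance with a model of sequence evolution, the sketch below computes the observed proportion of differing sites (p-distance) and applies the Jukes-Cantor (JC69) correction. JC69 is chosen here only because it is the simplest model; the slides do not name a model, and the two sequences are invented.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences (gaps ignored)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("sequences are saturated: JC69 distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned sequences (hypothetical data): 2 differences over 10 sites
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jc69_distance(p)
print(round(p, 3), round(d, 3))
```

The corrected distance is larger than the observed one because the model accounts for multiple substitutions at the same site.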

Maximum likelihood

• Maximum Likelihood = ML

• Method based on individual characters

• Uses an explicit evolutionary model (DNA or AA)

• The most computationally demanding method

• The model is very important: only for molecular data

• ML finds (one) tree (and model parameters) maximizing the probability of the data
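The idea of maximizing the probability of the data can be sketched on the smallest possible case: two aligned sequences and a single parameter, their distance, under the Jukes-Cantor model. The model choice, the grid search and the site counts are assumptions for illustration; real ML programs optimise the topology, all branch lengths and the model parameters together.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diffs):
    """Log-likelihood of a pairwise alignment under JC69 for distance d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # 0.25 is the equilibrium frequency of the base in the first sequence
    return ((n_sites - n_diffs) * math.log(0.25 * p_same)
            + n_diffs * math.log(0.25 * p_diff))


def ml_distance(n_sites, n_diffs, step=0.001, max_d=2.0):
    """Grid search for the distance that maximizes the likelihood."""
    grid = [i * step for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diffs))


# Hypothetical alignment: 10 sites, 2 of which differ
best = ml_distance(n_sites=10, n_diffs=2)
print(round(best, 3))
```

For this simple case the ML estimate coincides, up to the grid resolution, with the analytical Jukes-Cantor distance.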

Bayesian inference

• Recent and now widely used method

• Uses Bayes' formula to generate the posterior probability of parameters (among which topology and branch lengths), based on previous knowledge: the prior probability

• Yields a tree (as well as model parameters such as substitution rates) with confidence intervals (support values) for clades
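Bayes' formula as used in this setting can be written as follows. The symbols are notation assumed for this note: \tau is the tree topology, \theta stands for branch lengths and substitution model parameters, and D is the alignment.

```latex
P(\tau, \theta \mid D) = \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)}
```

Here P(D \mid \tau, \theta) is the likelihood also used in ML, P(\tau, \theta) is the prior probability, and the left-hand side is the posterior probability.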

Validation

• Trees can be validated using resampling procedures such as the bootstrap (resampling with replacement)

• Assessment of clade support: add some noise to (alter) the data and rebuild the tree; do this for many altered datasets

• The stronger a clade, the more often it appears among the resulting trees: % = bootstrap support
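A minimal sketch of the two ingredients of a bootstrap analysis: resampling alignment columns with replacement, and counting how often a clade appears among the replicate trees. Tree building itself is left out; the alignment, the taxon names and the replicate clade sets are toy values assumed for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_support(clade, replicate_clades):
    """Percentage of replicate trees (given as sets of clades) containing clade."""
    hits = sum(clade in clades for clades in replicate_clades)
    return 100.0 * hits / len(replicate_clades)


# Toy alignment (hypothetical data)
alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
rng = random.Random(42)
replicate = bootstrap_alignment(alignment, rng)

# Clades recovered from four imaginary replicate trees
toy_replicates = [{frozenset("AB")}, {frozenset("AB")},
                  {frozenset("AC")}, {frozenset("AB")}]
support = bootstrap_support(frozenset("AB"), toy_replicates)
print(support)  # prints: 75.0
```

In a real analysis each replicate alignment would be passed to the tree-building method of choice, and the clades of the resulting trees would replace `toy_replicates`.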

Supertrees

• Combine trees with partially overlapping taxa

• Bigger tree

• Many methods (at least 17)

• Uses of supertrees
- Combining trees from different data/studies
- Phylogenomics: genes are often unequally present in the taxa under study
- Metagenomics: taxa partially and unequally represented in sequences

➡ Many gaps in the matrix:
- Supermatrix (as is)
- Design several complete sub-matrices, compute subtrees, build supertree

• e.g. Sargasso Sea environmental sequences

...

Phylogenomics

• Genomes: more accurate and precise phylogenies? Not so simple...

• Very large datasets: computation is difficult

• Genomes are plastic: duplications (total, partial), fusions, chromosome fissions, LGT, ...

• No good model of genomic evolution

• Stochastic (random) error diminishes only by increasing the number of characters

• The possibility of systematic error remains, for example caused by a wrong choice of method or model

• 3 main biases
- Composition bias: sequences with the same composition tend to cluster. Check from sequences
- Long branch attraction. Good taxon sampling
- Heterotachy: the substitution rate changes through time for fixed positions. Hard to detect and correct

Genomes

• More characters

• New character types: gene order, gene content, nucleotide signature (DNA strings), rare genomic changes

• 2 main approaches
- Classical: sequences (gene concatenation) and phylogeny (supermatrix or supertree)
- Whole genome features: gene order, gene content, DNA string

• + 1: rare genomic changes

Classical methods

• Resolution of difficult phylogenetic problems (e.g. Tree of Life, Eukaryotes, Bilateria)

• Evolution of gene groups (e.g. family): mutations, selective pressure, divergence, duplications, ...

• Identification of lateral gene transfer (which brings noise into the tree-like signal)

• Example: classical tree of Deuterostomians

• Genomic data (Nature, 2006)
- 146 genes
- Classical methods: sequences
- Bias control

• Example: Eukaryote phylogeny (2009, 2010)

• Example: Tree of Life (Science, 2006)

• But is the history of life really tree-like?

Lateral Gene Transfer

• LGT in the tree of life

• Lateral gene transfer is increasingly recognized as an important factor shaping the evolution of life

• The current debate is no longer about the existence of LGT but about its importance: can we still consider that the evolution of life is mainly tree-like?
- No (?) in Prokaryotes
- Yes (?) in Eukaryotes

Methods to unveil LGT

• Compositional methods

• Comparison of evolutionary rates

• Look for similar sequences in databases through BLAST

• Phylogenetic approach

Compositional methods

• Look via bioinformatics in complete genomes for
- atypical nucleotide composition in putatively transferred genes
- atypical codon usage patterns

• Only for recent transfer events (before homogenization)
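One way to look for atypical nucleotide composition is to flag genes whose GC content lies far from the genome-wide mean. The sketch below uses a simple z-score; the threshold of 2 standard deviations, the function names and the gene sequences are assumptions for illustration, not a method described in the slides.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_genes(genes, z_threshold=2.0):
    """Names of genes whose GC content deviates strongly from the mean."""
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    values = list(gc.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []          # all genes have identical composition
    return [name for name, value in gc.items()
            if abs(value - mu) / sigma > z_threshold]


# Hypothetical genome: nine genes at 50 % GC and one at 90 % GC
genes = {f"gene{i}": "ACGT" * 5 for i in range(1, 10)}
genes["candidate"] = "GC" * 9 + "AT"
flagged = atypical_genes(genes)
print(flagged)  # prints: ['candidate']
```

Codon usage could be screened in the same way by replacing `gc_content` with a codon-frequency statistic.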

Evolutionary rates

• Compare pairwise distances between gene orthologs within families vs distances between genomes (from a reference tree): if there is no LGT, these distances should be roughly equal

• Another possibility is to compare instantaneous substitution matrices in genes vs genomes

• In case of LGT, these rates should differ

BLAST and similarity

• Find homologs of a query sequence in databases, genomes, ... via a similarity search (e.g. BLAST)

• Pattern of gene presence/absence in organisms = phyletic pattern

• Identification of LGT for genes with unusual affiliation

• Drawback: similarity does not necessarily mean evolutionary proximity

• Sensitive to taxa/gene representation in databases


Phylogenetic approach

• Look for individual gene trees incongruent with a reference phylogeny

• Reference tree: rDNA, genomes, gene concatenation, consensus tree, supertree, ...

• Need well supported trees

• Test for incongruence between topologies

• Cannot detect LGT between neighbours
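A simple way to compare a gene tree with a reference tree is to list the clades each one contains and look at those they do not share. The sketch below does this for rooted trees written as nested tuples; the encoding and the taxon names are hypothetical, and this clade count is only a descriptive measure, not a statistical test of incongruence.

```python
def clades(tree):
    """Set of clades (frozensets of leaf names) in a rooted nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def conflicting_clades(gene_tree, reference_tree):
    """Clades of the gene tree that are absent from the reference tree."""
    return clades(gene_tree) - clades(reference_tree)


def clade_distance(tree1, tree2):
    """Number of clades found in only one of the two trees."""
    return len(clades(tree1) ^ clades(tree2))


# Hypothetical reference tree and gene tree over the same five taxa
reference = ((("host1", "host2"), "host3"), ("virus1", "virus2"))
gene_tree = ((("host1", "virus1"), "host2"), ("host3", "virus2"))
print(clade_distance(reference, gene_tree))  # prints: 6
```

In this toy case the gene tree groups `virus1` with `host1`, a clade absent from the reference, which is the kind of pattern that would prompt a closer look for LGT, provided the conflicting clade is well supported.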

• Between symbionts with complete genomes available: BLAST symbiont ORFs against the host genome to identify putatively transferred genes

• Cophylogenetic methods (gene tree within species tree) can be used to infer a scenario for the LGT

• Example: frp gene acquired in red algae and green plants from ∂-proteobacteria

• Example: multiple transfers virus-to-host and host-to-virus between Emiliania huxleyi and EhV86

Case study: Prasinophyte microalgae and their viruses

Hosts: Prasinophyceae

• Chlorophyta: green algae (Order Mamiellales, ubiquitous picophytoplankton)

• 3 main genera, 6 complete genomes to date
- Ostreococcus (3 genomes)
- Bathycoccus (1 genome)
- Micromonas (2 genomes)

[Images: Ostreococcus (Chrétiennot-Dinet et al., 1995), Bathycoccus, Micromonas]

Host phylogeny (SSU rDNA)

[Figure: tree with three clades; node support values range from 0.63 to 1.]
- Ostreococcus: Ostreococcus RCC344, Ostreococcus RCC356, O. lucimarinus CCMP2972, Ostreococcus RCC1108, O. tauri RCC745, Ostreococcus RCC1107
- Bathycoccus: Bathycoccus prasinos RCC1105, Bathycoccus prasinos RCC464
- Micromonas: M. pusilla RCC497, M. pusilla CCMP1545, Micromonas RCC1109, Micromonas RCC828, Micromonas RCC451

Viruses

• Phycodnavirus

• Prasinovirus

• Important role in the regulation of phytoplanktonic populations

[Image credit: ML Escande, OOB]

• Giant virus ("Girus"): 100-200 nm

• Large genomes: about 200 kb (closely related Chlorovirus: almost 400 kb!)

Prasinovirus tree from partial DNA polymerase (about 600 bp)

[Figure: tree with groups labeled OxV, MpV and BpV]

Prasinovirus genomes

• 6 genomes (4 correspond to hosts with complete genomes)
- 3 OxV: 1 OlV + 2 OtV
- 2 BpV
- 1 MpV

OtV5 genome

[Figure: OtV5 genome]

Genome comparisons

[Figure]

Phylogeny from genomic data

• Global tree based on ß DNA polymerase (DP): Eukaryotes, Eubacteria, Archaea, and viruses

• Focus on prasinoviruses and their hosts based on the concatenation of 5 genes in common: DP, PCNA, lsu and ssu Ribonucleotide reductase, Thymidine synthase

[Figure: two trees (scale bars 0.1; node labels such as 1/100, 1/81 and 0.88/76). Taxa include the prasinoviruses OtV1, OtV5, OlV1, MpV1, BpV1 and BpV2; other viruses (PbCV strains, ATCV, EhV86, CeV, Mimivirus); the hosts Ostreococcus tauri, Ostreococcus lucimarinus, Bathycoccus, Micromonas CCMP1545 and Micromonas RCC299; other eukaryotes (Chlamydomonas, Chlorella, Physcomitrella, Oryza, Sorghum, Arabidopsis, Polysphondylium, Homo); bacteria (Prochlorococcus, Synechococcus, Ralstonia); and archaea (Thermococcus, Methanosarcina, Methanococcoides, Methanosaeta).]

[Figure: the same two trees, annotated with the following observations]

• Green algal dsDNA viruses monophyletic

• Global coevolution with algal hosts

• Evolutionary divergence: Host > virus

• Recent colonisation of hosts by viruses?

• Higher evolutionary rate in hosts?

Lateral gene transfers in prasinoviruses

• Viruses are known as "bags of genes", or "gene robbers", stealing genes from their hosts: LGT

• Suspected to be vectors of gene transfers between eukaryotes

• Virus strains can recombine within hosts (e.g. H1N1)

General methodology for identifying LGT

• Define candidate genes for transfer via BLAST: present in host and viruses

• Find the same genes in different taxa (using BLAST, GenBank, ...): make a dataset with the most closely related hits (BLAST), reference taxa, and the candidate gene in host and virus

• Align sequences and make a tree

• Look at the tree to identify LGT

Host-virus LGT in OtV5?

• BLAST each viral ORF against the host genome and keep the ORFs meeting specific criteria (AA identity > 45 % over > 50 AA)

• Blastp against GenBank nr, keep all viral ORFs with the host in the 50 best BLAST hits, and get these BBHs

• Keep these sequences if a similar gene function is known in Phycodnaviruses

• 6 candidates for LGT
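The first filtering step (identity > 45 % over more than 50 amino acids) can be sketched as a filter on BLAST tabular output, where the first four tab-separated columns are query ID, subject ID, percent identity and alignment length. The ORF and protein identifiers below are invented for the example, and the use of tabular output is an assumption about how the search results were stored.

```python
def filter_hits(lines, min_identity=45.0, min_length=50):
    """Keep BLAST tabular hits above the identity and alignment-length cutoffs."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if identity > min_identity and length > min_length:
            kept.append((query, subject, identity, length))
    return kept


# Hypothetical hits of viral ORFs against host proteins
# columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
hits = [
    "orf_001\thost_prot_17\t62.5\t120\t45\t0\t1\t120\t3\t122\t1e-40\t150",
    "orf_002\thost_prot_88\t38.0\t200\t124\t2\t5\t204\t1\t198\t2e-20\t95.1",
    "orf_003\thost_prot_05\t71.0\t30\t9\t0\t2\t31\t10\t39\t0.002\t35.0",
]
candidates = filter_hits(hits)
print([query for query, *_ in candidates])  # prints: ['orf_001']
```

Only the first hit passes both cutoffs: the second is too divergent and the third aligns over too short a region.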

Make a phylogenetic tree for each candidate, adding host and virus sequences to the alignment + other BBHs (and reference sequences)

[Figures: gene trees for the candidates, including pyrophosphatase, GDP-mannose, a gene of unknown function, topoisomerase and ribonucleoside-diphosphate reductase. Most trees are annotated "NO"; the others are annotated "?" or "? Maybe...".]

LGT from the new genomic data?

• These virus genomes possess unique pathways for AA synthesis, never seen in any virus before

• These biosynthesis pathways are not shared by all prasinovirus genomes

• An HSP70 gene is found only in the BpV genomes

• Do the genes involved originate from a lateral transfer?

• LGT from the host or from other sources?

Different AA synthesis pathways in related virus genomes

[Figure: pathways annotated "Only in MpV and OtV: LGT?" (twice) and "Only in OtV: LGT?"]

[Figures: gene trees for four genes. Depending on the gene, taxa include prasinoviruses (OtV1, OtV5, OlV1, MpV1, BpV1, BpV2), their hosts (Ostreococcus, Micromonas, Bathycoccus), other eukaryotes such as land plants, and bacteria.]

• Acetolactate synthase

• Asparagine synthase

• Dehydroquinate synthase

• HSP 70

Evolution of a gene family

Example of capsid protein in prasinoviruses

• Phycodnaviruses are icosahedral, with a capsid formed by different proteins (capsomers, or capsid-like proteins (clp)), which have probably evolved via duplications

• What does comparative genomics tell us about clp evolution in prasinoviruses within phycodnaviruses?

Capsomers evolution

[Figure: phylogenetic tree (scale bar 0.2) of capsid-like proteins clp1 to clp8 from OtV1, OtV5, OlV1, MpV1, BpV1 and BpV2, with other phycodnavirus sequences (ATCV1_Z664R_OG, PBCV_NY2A_BO59R, PBCV_NY2A_B617L, PBCV_NY2A_B825L, PpV_01, PoV_01B, HaV_1). Node support values range from 0.55 to 1. No BpV sequence is present for clp1.]

• 8 putative capsid genes in prasinovirus genomes (7 in BpV)

• Many duplications

• Phylogenetic tree including other phycodnaviruses available

[Figure: the same tree, annotated]

• Evolution via duplications

• Loss of clp1 in BpV

• Ancestral copy in prasinoviruses = clp6

To conclude...

• All these questions can only be studied using comparative genomics

• Obtaining genomes is becoming ever easier, faster and cheaper, but analyzing them requires more and more human skills: this is where the bottleneck is now, so get to genomics!