Detection of new protein domains by co-occurrence: application to Plasmodium falciparum

Nicolas Terrapon1,2, Olivier Gascuel2 and Laurent Bréhélin2 - ANR PlasmoExplore project

1 TIMC-IMAG, INP Grenoble, CNRS; Domaine de la Merci, 38710 LA TRONCHE
2 LIRMM, Univ. Montpellier 2, CNRS; 161 rue Ada, 34092 MONTPELLIER

Background

Protein domains
• Definition: domains are structurally independent subunits of proteins.
• Interest: domain composition can help predict protein function. 80% of multidomain proteins sharing two identical domains have similar functions [2].
Figure: The RecR protein in D. radiodurans (Pfam domains: RecR in green and Toprim in red).

Domain detection with HMMs
• One HMM models a single domain.
• Given a new protein and an HMM, one can compute a score reflecting the probability that the domain is present [3] → E-value: expected number of false positives with better scores.
• Pfam database [4] http://pfam.sanger.ac.uk/
  – 10 340 manually curated domain models,
  – stringent E-value threshold for each HMM: below this threshold, domain presence can be asserted.
  – Example: ABR human protein.

Plasmodium falciparum
• Main causal agent of human malaria.
• Atypical genome [5]:
  – 80% of A+T,
  – 6 amino acids account for 50% of residues,
  – presence of long inserts of low complexity.
• Of the 5 400 predicted proteins, ∼60% cannot be annotated by sequence homology.
• Guilt-by-association methods based on postgenomic data have been proposed [6].

Our approach

The problem
• Standard methods for domain detection fail in P. falciparum:
  – 1 421 domain types in 50% of P. falciparum proteins.
  – For comparison, yeast has 2 369 domain types in 76% of its proteins.
• Two hypotheses are discussed:
  – P. falciparum possesses many specific domains.
  – The amino acid bias obstructs domain detection.

Principle
• Increase Pfam E-value thresholds: this allows detecting more potential domains, but with many false positives.
• Use domain co-occurrence to filter out false positives.

Domain co-occurrence
• Studies on domain combinations revealed that a domain generally appears in proteins with a few other favorite domains [1].
• Fewer than 20 000 Pfam domain pairs are observed in UniProt proteins, while ∼12.5 million pairs are possible (i.e. 1.6‰).

Method summary
• List domain pairs showing a strong co-occurrence:
  – The presence of one of the domains must be a strong clue of the presence of the other one.
  – This list of Conditionally Dependent Pairs (CDPs) is built from the domain composition of well-annotated UniProt proteins using a statistical test (one-tailed Fisher's exact test).
• Store known domains: those detected with Pfam E-value thresholds.
• Increase E-value thresholds to get additional potential domains.
• Validate a potential domain by a known domain if the pair belongs to the CDP list.
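The steps of the method summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: hits are represented as (domain, E-value) tuples, CDPs are stored as unordered pairs, a single relaxed E-value is used for all HMMs, and the function names, the thresholds, and the "Kinase" and "Zf" domains are hypothetical. Treating (RecR, Toprim) as a CDP is a toy choice made only for the example.

```python
def split_hits(hits, pfam_thresholds, relaxed_evalue):
    """Separate one protein's HMM hits into known and potential domains.

    hits: iterable of (domain, evalue) tuples (hypothetical representation).
    pfam_thresholds: dict mapping each domain to its stringent E-value threshold.
    relaxed_evalue: single relaxed E-value threshold (an assumption of this sketch).
    """
    known, potential = set(), set()
    for domain, evalue in hits:
        if evalue < pfam_thresholds[domain]:
            known.add(domain)        # passes the stringent Pfam threshold
        elif evalue < relaxed_evalue:
            potential.add(domain)    # only passes the relaxed threshold
    return known, potential - known


def validate(known, potential, cdp_list):
    """Keep potential domains that form a CDP with at least one known domain."""
    return {p for p in potential
            if any(frozenset((p, k)) in cdp_list for k in known)}


# Toy example: one protein with four HMM hits (values are invented).
hits = [("RecR", 1e-30), ("Toprim", 0.5), ("Kinase", 0.8), ("Zf", 50.0)]
pfam_thresholds = {"RecR": 1e-5, "Toprim": 1e-5, "Kinase": 1e-5, "Zf": 1e-5}
cdp_list = {frozenset(("RecR", "Toprim"))}

known, potential = split_hits(hits, pfam_thresholds, relaxed_evalue=1.0)
validated = validate(known, potential, cdp_list)
```

In this toy case RecR is the only known domain, Toprim and Kinase are potential domains, and only Toprim is validated because it is the only one paired with RecR in the CDP list.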

Method

Selecting the CDPs

• Extract all domain pairs from the domain composition of well-annotated UniProt proteins.
• For each domain pair (A,B), build the 2x2 contingency table:

Nb of proteins     Dom. A present   Dom. A absent   Tot.
Dom. B present     x                y               x+y
Dom. B absent      w                z               w+z
Tot.               x+w              y+z             N

• Null hypothesis H0: A and B are statistically independent.
• Probability of observing x or more pairs (A,B) under H0:

\[
\text{P-value} = \sum_{t=0}^{\min(y,w)} P[x+t,\; y-t,\; w-t,\; z+t],
\quad \text{where} \quad
P[x, y, w, z] = \frac{\binom{x+y}{x}\,\binom{w+z}{w}}{\binom{N}{x+w}}
\]

• If the P-value is below a given threshold, the pair is added to the CDP list.
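The test above can be sketched directly from the formula. The following is an illustrative Python version, assuming each protein is given as a set of domain names. The function names are hypothetical, the default threshold of 0.05 is an assumption (a P-value of 5% is the one quoted for the CDP list in the results), and storing CDPs as unordered pairs is a choice of this sketch.

```python
from itertools import combinations
from math import comb


def table_probability(x, y, w, z):
    """P[x, y, w, z]: hypergeometric probability of one 2x2 table."""
    n = x + y + w + z
    return comb(x + y, x) * comb(w + z, w) / comb(n, x + w)


def cooccurrence_pvalue(x, y, w, z):
    """One-tailed P-value: probability of x or more co-occurrences under H0."""
    return sum(table_probability(x + t, y - t, w - t, z + t)
               for t in range(min(y, w) + 1))


def contingency_counts(proteins, a, b):
    """Count proteins by presence/absence of A and B, as in the table above."""
    x = y = w = z = 0
    for domains in proteins:
        has_a, has_b = a in domains, b in domains
        if has_a and has_b:
            x += 1          # A present, B present
        elif has_b:
            y += 1          # A absent, B present
        elif has_a:
            w += 1          # A present, B absent
        else:
            z += 1          # A absent, B absent
    return x, y, w, z


def select_cdps(proteins, threshold=0.05):
    """Return the observed domain pairs whose P-value is below the threshold."""
    observed = set()
    for domains in proteins:
        observed.update(combinations(sorted(domains), 2))
    return {frozenset(pair) for pair in observed
            if cooccurrence_pvalue(*contingency_counts(proteins, *pair)) < threshold}


# Toy data set (invented): A and B nearly always occur together, A and C once.
proteins = [{"A", "B"}] * 5 + [{"C"}] * 5 + [{"A", "C"}]
cdps = select_cdps(proteins)
```

On this toy set, the pair (A,B) has the table x=5, y=0, w=1, z=5 and a P-value of 6/462, so it is kept. The pair (A,C) has a P-value of 1 and is rejected.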

Estimating an error rate
• Let NbVal be the number of new domains validated by our method, given a set of proteins with known and potential domains, and a CDP list.
• How many domains would be validated under the H0 hypothesis that all potential domains were randomly predicted?
• Shuffling algorithm:
  – randomly permute all potential domains among proteins,
  – apply the validation method on the permuted domains,
  – store the number of validated domains,
  – iterate and average the results to compute the expected number of validations under H0: NbErr.
• The FDR of the method is estimated by the quantity: NbErr / NbVal.
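The shuffling procedure can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that the permutation keeps the number of potential domains per protein fixed, that CDPs are stored as unordered pairs, and that 1000 iterations are enough. The function names, the `n_iter` and `seed` parameters, and the toy data are hypothetical.

```python
import random


def count_validated(known_by_protein, potential_by_protein, cdp_list):
    """Number of potential domains forming a CDP with a known domain."""
    total = 0
    for known, potential in zip(known_by_protein, potential_by_protein):
        total += sum(1 for p in potential
                     if any(frozenset((p, k)) in cdp_list for k in known))
    return total


def estimate_fdr(known_by_protein, potential_by_protein, cdp_list,
                 n_iter=1000, seed=0):
    """Return (NbVal, NbErr, NbErr / NbVal) using random permutations."""
    rng = random.Random(seed)
    nb_val = count_validated(known_by_protein, potential_by_protein, cdp_list)

    # Pool all potential domains; per-protein counts stay fixed (assumption).
    pool = [d for potential in potential_by_protein for d in potential]
    sizes = [len(potential) for potential in potential_by_protein]

    total = 0
    for _ in range(n_iter):
        rng.shuffle(pool)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pool[start:start + size])
            start += size
        total += count_validated(known_by_protein, shuffled, cdp_list)

    nb_err = total / n_iter  # expected number of validations under H0
    fdr = nb_err / nb_val if nb_val else float("nan")
    return nb_val, nb_err, fdr


# Toy data (invented): three proteins, one potential domain each.
known_by_protein = [{"A"}, {"A"}, {"C"}]
potential_by_protein = [["B"], ["B"], ["D"]]
cdp_list = {frozenset(("A", "B"))}

nb_val, nb_err, fdr = estimate_fdr(known_by_protein, potential_by_protein, cdp_list)
```

With so few proteins the toy estimate is not meaningful (a random permutation validates 4/3 domains on average against 2 observed). On a real proteome the pool of potential domains is large and diverse, so random assignments rarely match a CDP.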

Experiments on yeast
• Aim: to assess the ability of our approach to retrieve domains in biased proteomes.
• Principle:
  – Find all domains in yeast proteins using Pfam thresholds: → Reference domains.
  – Simulate drift of yeast sequences toward P. falciparum amino acid composition (seqgen software).
  – Detect domains in drifted sequences (with Pfam thresholds): → Known domains. Count the number of lost domains.
  – Increase Pfam E-value thresholds and use the known domains to recover lost domains with the co-occurrence method.
• Results:

Reference domains   Lost domains   May be retrieved‡   Found back   Total validated   Expected errors
2 405               905            645                 480          510               26

‡ Lost domains in drifted sequences where at least one other domain is still detected in the same protein.

Results on Plasmodium falciparum

Number of new validated domains and associated FDR according to the selected E-value threshold (for CDP list computed with P-value 5%).

FDR    New domains   Number of proteins   Dom. types never seen†   New GO annotations∗
12%    290           270                  95                       74
19%    360           321                  79                       109
27%    421           362                  139                      132

† # of new domain types that were never seen in P. falciparum proteins before.
∗ New Gene Ontology [7] annotations brought by new domains (# of proteins).

References
[1] Apic G, Gough J et al. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes, J. of Mol. Biol., 310: 311-25.

[2] Gerstein M, Hegyi H (2001). Annotation transfer for genomics: measuring functional divergence in multidomain proteins, Gen. Res., 11: 1632-40.
[3] Durbin R, Eddy S, Krogh A, et al. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids (Book).
[4] Eddy SR, Sonnhammer ELL, Bateman A, et al. (2008). The Pfam Protein Families Database, NAR, 36: D281-D288.
[5] Gardner MJ, Hall N, Berriman M et al. (2002). Genome sequence of the human malaria parasite P. falciparum, Nat., 419: 498-511.
[6] Bréhélin L, Dufayard JF, Gascuel O (2008). PlasmoDraft: a database of Plasmodium falciparum gene function predictions based on postgenomic data, BMC Bioinformatics, 9(1):440.
[7] The Gene Ontology Consortium (2006). The Gene Ontology (GO) project in 2006, NAR, 34 Database issue: D322-D326.