BIOINFORMATICS APPLICATIONS NOTE


Vol. 21 no. 24 2005, pages 4408–4410 doi:10.1093/bioinformatics/bti710

Genome analysis

Suffix-tree analyser (STAN): looking for nucleotidic and peptidic patterns in chromosomes

Jacques Nicolas*, Patrick Durand, Grégory Ranchy, Sébastien Tempel and Anne-Sophie Valin

IRISA-INRIA, Campus de Beaulieu, 35402 Rennes cedex, France

Received on June 27, 2005; revised on July 13, 2005; accepted on October 6, 2005
Advance Access publication October 13, 2005

ABSTRACT
Summary: We have developed STAN (suffix-tree analyser), a tool to search for nucleotidic and peptidic patterns within whole chromosomes. Pattern syntax uses a string variable grammar-like formalism which allows the description of complex patterns including ambiguities, insertions/deletions, gaps, repeats and palindromes. STAN is based on a reduction to multipart matching on a suffix-tree data structure and can handle large DNA sequences, whether assembled or not.
Availability: STAN is accessible online at symbiose/STAN
Contact: [email protected]

Sequence pattern matching within biological sequences (DNA, RNA and proteins) can be achieved very efficiently using regular expressions. A number of tools have been developed in this context (Kucherov and Rusinowitch, 1995; Altschul et al., 1997; Gattiker et al., 2002). However, when considering high-order sequence organization (such as copies, palindromes or even structural properties), a more expressive formalism is required. During the past decade, definite clause grammars (DCG) (Pereira and Warren, 1980), a particular form of context-free grammars, have been used in various works to model DNA sequence features (Searls, 1989; Helgesen and Sibbald, 1993; Leung et al., 2001), as well as to model gene regulation (Collado-Vides, 1992). Among those formalisms, the pioneering work of David Searls on string variable grammars (SVGs) is of particular interest (Searls, 1995, 2002). SVGs introduce the concept of a variable that can be bound to a string during a pattern search. SVGs can be used to model not only DNA/RNA sequence features but also structural features such as repeats, palindromes, stem–loops or pseudo-knots (Searls, 1993). To our knowledge, the only two tools capable of searching for SVG-based patterns in biological sequences are GenLang (Dong and Searls, 1994) and PatScan (Dsouza et al., 1997). However, GenLang is no longer maintained and, because of its time complexity, was restricted to the analysis of medium-size sequences (several Mbases). PatScan, on the other hand, is not guaranteed to find 

*To whom correspondence should be addressed.

all occurrences of complex patterns (once a hit is found, it does not check overlapping alternative solutions).

This paper describes a new tool, STAN (suffix-tree analyser), which searches for a subset of SVG patterns in fully sequenced chromosomes. STAN can efficiently scan sequences as large as human chromosomes, is guaranteed to find all occurrences of complex patterns and provides three different types of search operations, as explained below.

STAN patterns can be used to model sequence and/or structural features of nucleotidic and peptidic sequences. STAN patterns are made of words, separated by a ‘-’ character, belonging to the following language. Letters are taken from the standard IUPAC alphabets describing DNA or protein sequences. Tokens are made of a single letter or a string of letters. Ambiguous tokens are described using curly braces ({C} for any nucleotide except cytosine in a DNA pattern), brackets ([AC] for the letter A or C) or brackets with pipe characters ([AC|AG] for the token AC or AG). Spacers are described using the syntax ‘x(a,b)’. In this syntax, and in the following, a and b denote positive integers with a ≤ b. Tokens with insertions and/or deletions are described as ‘indel(m,ai,bi,ad,bd)’, where m is a token and ai and bi (respectively ad and bd) give the minimum and maximum numbers of letter insertions (respectively deletions) allowed in m. Tokens with mismatches are described as ‘m:a’, where m is a token and a is the maximum number of letter substitutions allowed in m. String variables are represented by identifiers starting with the uppercase letter X. They may be constrained in size using the syntax ‘X:[a,b]’, where a and b are the lower and upper limits of the size of X. A token or a string variable may be prefixed with the operator ‘~’ to search for its reverse complement. In that way, the pattern ‘X - ~X’ searches for palindromes.
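To make the string-variable idea concrete, the following Python sketch searches for the pattern ‘X:[a,b] - ~X’ by brute force. It only illustrates what the pattern means and is not how STAN searches: the function names, the `max_mismatches` parameter and the restriction to the four-letter DNA alphabet are assumptions made for this example.

```python
# Brute-force illustration of the STAN-style pattern 'X:[a,b] - ~X'.
# This is NOT STAN's suffix-tree algorithm; all names are hypothetical.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(u: str, v: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(u, v) if x != y)


def find_palindromes(seq, min_len, max_len, max_mismatches=0):
    """Yield (start, length of X) for every occurrence of X - ~X.

    X is bound to seq[start:start + n]; the next n letters must equal the
    reverse complement of X, up to max_mismatches substitutions.
    Overlapping occurrences are all reported.
    """
    for n in range(min_len, max_len + 1):
        for start in range(len(seq) - 2 * n + 1):
            x = seq[start:start + n]
            partner = seq[start + n:start + 2 * n]
            if hamming(reverse_complement(x), partner) <= max_mismatches:
                yield start, n


# GAATTC (an EcoRI site) is its own reverse complement.
print(list(find_palindromes("TTGAATTCAA", 3, 3)))  # [(2, 3)]
```

Binding X at each start position and then checking the constraint on the following letters is the simplest reading of a string variable. The scan costs time proportional to the sequence length for every candidate size of X.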
Finally, mismatches are allowed when defining a string variable (‘X - ~X:a’ searches for palindromes with up to a letter mismatches). The current implementation of STAN allows searching for SVG-based patterns in most publicly available genome sequences from Archaea, Bacteria and Eukaryota, as well as in user-provided DNA sequences. To perform a pattern search efficiently, STAN uses suffix trees, one of the most effective computational representations of a sequence (McCreight, 1976; Kurtz and Schleiermacher, 1999; Kurtz et al., 2004). Given a pattern and a sequence, STAN works as follows. The sequence, stored on disk as a binary representation of

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]




Pattern text recovered from Fig. 1:
5′ end: T-[CT]-x(0,1)-TAC:1-x(2)-TAT-[TA]-AT-[TC]-T-GGGAAG:2-T-ACA:1-TT:1-[AT]-x(0,1)-TAA:1-[ATG]-TGT:2
spacer: x(100,3000)
3′ end: AAATCGT:1-X:[7]-x(4)-~X:5-TTAAAATCTAG:2

Fig. 1. (A) AtREP3 model from Kapitonov and Jurka (2001). (B) Refined AtREP3 model deduced by iterative database scanning using STAN. (C) STAN webform interface used to format the result of a search for the AtREP3 model in the entire genome of Arabidopsis thaliana. (D) Search time comparison, looking for the AtREP3 model in A. thaliana chromosome 1 with STAN (S), PatScan (P) and GenLang (G) (see text).

its suffix tree, is fully loaded into the computer memory. The pattern is interpreted by a Prolog program and converted into a series of search operations that are applied directly to the suffix tree through calls to compiled C code. For insertions and deletions, the search exhaustively enumerates all candidates generated from the given pattern and looks up each of them; for mismatches, it exhaustively follows all suffix-tree paths until the allowed number of errors is reached.

Three kinds of searches can be performed with STAN: a single nucleotide-based pattern versus a DNA sequence, a single amino acid-based pattern versus the six-frame translation of a DNA sequence, and multiple amino acid-based patterns versus a DNA sequence. The last type of search can produce all combinations of up to five patterns located anywhere in the six coding frames of a DNA sequence. It can be of particular interest for locating patterns spanning several exons of a coding sequence in eukaryotic genomes.

As an example of a STAN application, we searched the genome of Arabidopsis thaliana for the AtREP3 transposable element. AtREP3 is a novel family of non-autonomous A. thaliana transposons characterized by a TC 5′ terminus, a CTRR 3′ terminus and a 3′ subterminal short hairpin structure (Fig. 1A) (Kapitonov and Jurka, 2001). Because of the strong variability of its internal sequence, AtREP3 cannot be detected by BLAST or other classical software. To identify all family members, we used STAN iteratively to scan the database with refined models, starting with the model from Figure 1A. The final model, presented in Figure 1B, was then used to scan the entire genome of A. thaliana (5 chromosomes for a total of 119 Mb) in both normal and reverse directions. We identified 125 different AtREP3 elements ranging in size from 450 to 2500 nt. Cumulative search time was 20 s on a SunFire 6800 computer.

Figure 1D shows a search time comparison, looking for our AtREP3 model in A. thaliana chromosome 1 (29 Mb) with STAN, PatScan and GenLang. For a fair comparison, all tools were installed and executed from the command line on our SunFire computer, and reported times are averaged over four successive executions. STAN, in contrast to the other tools with which it has been compared, runs on four processors, and the time reported in the graph is the sum of the user times for each processor. More details on STAN’s pattern language, implementation and performance are provided in the STAN manual available online (see Availability).
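The mismatch search described above, which follows every suffix-tree path until the error budget is spent, can be illustrated with a small Python sketch. It rests on several assumptions: it uses a naive suffix trie (quadratic in memory) rather than the compact, disk-backed suffix tree that STAN loads, it handles substitutions only, and the function names and the `"$pos"` bookkeeping key are invented for this example.

```python
# Illustrative suffix trie with bounded-mismatch search.
# A naive O(n^2) trie, not the compact suffix tree used by STAN;
# all names below are hypothetical.


def build_suffix_trie(text):
    """Build a trie of all suffixes of text.

    Each node is a dict mapping a letter to a child node. The special key
    "$pos" lists the start positions of the suffixes passing through it.
    """
    root = {"$pos": []}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {"$pos": []})
            node["$pos"].append(start)
    return root


def search_with_mismatches(root, pattern, max_mismatches):
    """Return the sorted start positions where pattern occurs with at most
    max_mismatches substitutions, exploring every viable trie path."""
    hits = []
    stack = [(root, 0, 0)]  # (node, letters consumed, errors so far)
    while stack:
        node, depth, errors = stack.pop()
        if depth == len(pattern):
            hits.extend(node["$pos"])
            continue
        for letter, child in node.items():
            if letter == "$pos":
                continue
            cost = errors + (letter != pattern[depth])
            if cost <= max_mismatches:  # prune paths over the error budget
                stack.append((child, depth + 1, cost))
    return sorted(hits)


trie = build_suffix_trie("GATTACAGATTTCA")
print(search_with_mismatches(trie, "GATTA", 1))  # [0, 7]
```

Because every branch is abandoned as soon as its error count exceeds the budget, the traversal reports all occurrences, including overlapping ones, without scanning the sequence once per candidate.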

ACKNOWLEDGEMENTS
Funding to pay the Open Access publication charges for this article was provided by Région Bretagne (Plateforme bioinformatique Ouest Genopole).

Conflict of Interest: none declared.

REFERENCES
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Collado-Vides,J. (1992) Grammatical model of the regulation of gene expression. Proc. Natl Acad. Sci. USA, 89, 9405–9409.
Dong,S. and Searls,D. (1994) Gene structure prediction by linguistic methods. Genomics, 23, 540–551.
Dsouza,M. et al. (1997) Searching for patterns in genomic data. Trends Genet., 13, 497–498.
Gattiker,A. et al. (2002) ScanProsite: a reference implementation of a PROSITE scanning tool. Appl. Bioinformatics, 1, 107–108.
Helgesen,C. and Sibbald,P.R. (1993) PALM - a pattern language for molecular biology. In 1st International Conference on Intelligent Systems for Molecular Biology, Bethesda, MD, AAAI Press, pp. 172–180.
Kapitonov,V. and Jurka,J. (2001) Rolling-circle transposons in eukaryotes. Proc. Natl Acad. Sci. USA, 98, 8714–8719.
Kucherov,G. and Rusinowitch,M. (1995) Matching a set of strings with variable length don't cares. LNCS, 937, 230–247.
Kurtz,S. and Schleiermacher,C. (1999) REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics, 15, 426–427.
Kurtz,S. et al. (2004) Versatile and open software for comparing large genomes. Genome Biol., 5, R12.
Leung,S. et al. (2001) Basic gene grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences. Bioinformatics, 17, 226–236.
McCreight,E. (1976) A space-economical suffix tree construction algorithm. J. ACM, 23, 262–272.
Pereira,F. and Warren,D. (1980) Definite clause grammars for language analysis - a survey of the formalism and a comparison with augmented transition networks. Artif. Intell., 13, 231–278.
Searls,D. (1989) Investigating the linguistics of DNA with definite clause grammars. In Lusk,E. and Overbeek,R. (eds), Logic Programming: Proceedings of the North American Conference on Logic Programming. MIT Press, Cambridge, MA, pp. 189–208.
Searls,D. (1993) The computational linguistics of biological sequences. In Hunter,L. (ed.), Artificial Intelligence and Molecular Biology. AAAI Press, Menlo Park, CA, pp. 47–120.
Searls,D. (1995) String variable grammar: a logic grammar formalism for the biological language of DNA. J. Logic Prog., 14, 73–102.
Searls,D. (2002) The language of genes. Nature, 420, 211–217.