
Practical Suffix Tree Construction

Sandeep Tata, Richard A. Hankins, Jignesh M. Patel
University of Michigan

Abstract

Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

1 Introduction

Querying large string datasets is becoming increasingly important in a number of emerging text and life sciences applications. Life science researchers are often interested in explorative querying of large biological sequence databases, such as genomes and large sets of protein sequences. Many of these biological datasets are growing at exponential rates; for example, the sizes of the sequence datasets in GenBank have been doubling every sixteen months [31]. Consequently, methods for efficiently querying large string datasets are critical to the success of these emerging database applications.

Suffix trees are versatile data structures that can help execute such queries very efficiently. In fact, suffix trees are useful for solving a wide variety of string-based problems [17]. For instance, the exact substring matching problem can be solved in time proportional to the length of the query, once the suffix tree is built on the database string. Suffix trees can also be used to solve approximate string matching problems efficiently. Some bioinformatics applications, such as MUMmer [10, 11, 22], REPuter [23], and OASIS [25], exploit suffix trees to efficiently evaluate queries on biological sequence datasets. However, suffix trees are not widely used because of their high cost of construction. As
we show in this paper, building a suffix tree on moderately sized datasets, such as a single chromosome of the human genome, takes over 1.5 hours with the best known existing disk-based construction technique [18]. In contrast, the techniques that we develop in this paper reduce the construction time by a factor of 5 on inputs of the same size.

Even though suffix trees are currently not in widespread use, there is a rich history of algorithms for constructing suffix trees. A large focus of previous research has been on linear-time suffix tree construction algorithms [24, 32, 33]. These algorithms are well suited for small input strings where the tree can be constructed entirely in main memory. The growing size of input datasets, however, requires that we construct suffix trees efficiently on disk. The algorithms proposed in [24, 32, 33] cannot be used for disk-based construction as they have poor locality of reference. This poor locality causes a large amount of random disk I/O once the data structures no longer fit in main memory. If we naively use these main-memory algorithms for on-disk suffix tree construction, the process may take well over a day for a single human chromosome.

The large (and rapidly growing) size of many string datasets underscores the need for fast disk-based suffix tree construction algorithms. A few recent research efforts have also considered this problem [4, 18], though neither of these approaches scales well for large datasets (such as a large chromosome, or an entire eukaryotic genome).

In this paper, we present a new approach to efficiently construct suffix trees on disk. We use a philosophy similar to the one in [18]. We forgo the use of suffix links in return for a much better memory reference pattern, which translates to better scalability and performance for large trees.

The main contributions of this paper are as follows:

1. We introduce the “Top Down Disk-based” (TDD) approach to building suffix trees efficiently for a wide range of sizes and input types. This technique includes a suffix tree construction algorithm called PWOTD, and a sophisticated buffer management strategy.
2. We compare the performance of TDD with the popular Ukkonen's algorithm [32] for the in-memory case, where all the data structures needed for building the suffix trees are memory resident (i.e., the datasets are “small”). Interestingly, we show that even though Ukkonen has a better worst-case theoretical complexity, TDD outperforms Ukkonen on modern cached processors, since TDD incurs significantly fewer processor cache misses.
3. We systematically explore the space of data sizes and types, and highlight the advantages and disadvantages of TDD with respect to other construction algorithms.
4. We experimentally demonstrate that TDD scales gracefully with increasing input size. Using the TDD process, we are able to construct a suffix tree on the entire human genome in 30 hours (on a single processor machine)! To our knowledge, suffix tree construction on an input string of this size (approximately 3 billion symbols) has yet to be reported in the literature.

The remainder of this paper is organized as follows: Section 2 discusses related work. The TDD technique is described in Section 3, and we analyze the behavior of this algorithm in Section 4. Section 5 presents the experimental results, and Section 6 presents our conclusions.

2 Related Work

Linear time algorithms for constructing suffix trees have been described by Weiner [33], McCreight [24], and Ukkonen [32]. Ukkonen's is a popular algorithm because it is easier to implement than the other algorithms. It is an O(n), in-memory construction algorithm based on the clever observation that constructing the suffix tree can be performed by iteratively expanding the leaves of a partially constructed suffix tree. Through the use of suffix links, which provide a mechanism for quickly traversing across sub-trees, the suffix tree can be expanded by simply adding the (i+1)th character to
the leaves of the suffix tree built on the previous i characters. The algorithm thus relies on suffix links to traverse through all of the sub-trees in the main tree, expanding the outer edges for each input character. However, these linear-time algorithms have poor locality of reference since they traverse the suffix tree nodes in a random fashion. This leads to poor performance on cached architectures and when used to construct on-disk suffix trees.

Recently, Bedathur et al. developed a buffering strategy, called TOP-Q, which improves the performance of Ukkonen's algorithm (which uses suffix links) when constructing on-disk suffix trees [4]. A different approach was suggested by Hunt et al. [18], where the authors drop the use of suffix links and use an O(n^2) algorithm with a better locality of reference. In one pass over the string, they index all suffixes with the same prefix by inserting them into an on-disk subtree managed by PJama [3], a Java-based object store. Construction of each independent subtree requires a full pass over the string.

Several O(n^2) and O(n log n) algorithms for constructing suffix trees are described in [17]. A top-down approach has been suggested in [1, 14, 16]. In [15], the authors explore the benefits of using a lazy implementation of suffix trees. In this approach, the authors argue that one can avoid paying the full construction cost by constructing the subtree only when it is accessed for the first time. This approach is useful only when a small number of queries are posed against a string dataset. When executing a large number of queries, most of the tree must be materialized, and in this case, this approach will perform poorly.

Previous research has also produced theoretical results on understanding the average sizes of suffix trees [5, 30], and theoretical complexity of using sorting to build suffix trees for different computational models such as RAM, PRAM, and various other external memory models [12]. Suffix arrays have also been used as an alternative to suffix trees for specific string matching tasks [8, 9, 26]. However, in general, suffix trees are more versatile data structures. The focus of this
paper is only on suffix trees.

Our solution uses a simple partitioning strategy. However, a more sophisticated partitioning method has been proposed recently [6], which can complement our existing partitioning method.

3 The TDD Technique

Most suffix tree construction algorithms do not scale due to the prohibitive disk I/O requirements. The high per-character overhead quickly causes the data structures to outgrow main memory, and the poor locality of reference makes efficient buffer management difficult.

We now present a new disk-based construction technique called the “Top-Down Disk-based” technique, hereafter referred to simply as TDD. TDD scales much more gracefully than existing techniques by reducing the main-memory requirements through strategic buffering of the largest data structures. The TDD technique consists of a suffix tree construction algorithm, called PWOTD, and the related buffer management strategy described in the following sections.

3.1 PWOTD Algorithm

The first component of the TDD technique is our suffix tree construction algorithm, called PWOTD (Partition and Write Only Top Down). This algorithm is based on the wotdeager algorithm suggested by Kurtz [15]. We improve on this algorithm by using a partitioning phase which allows one to immediately build larger, independent sub-trees in memory. Before we explain the details of the algorithm, we briefly discuss the representation of the suffix tree.

The suffix tree is represented by a linear array, as in wotdeager. This is a compact representation using an average of 8.5 bytes per symbol indexed. Figure 1 illustrates a suffix tree on the string ATTAGTACA$ and the tree's corresponding array representation in memory. Shaded entries in the array represent leaf nodes, with all other entries representing non-leaf nodes. An R in the lower right-hand corner of an entry denotes a rightmost child. A branching node is represented by two integers. The first is an index into the input string; the character at that index is the starting character of the incoming edge's label. The length of the label can be deduced by examining the children of the current node. The second entry points to the first child. Note that the leaf nodes do not have a second entry. The leaf node requires only the starting index of the label; the end of the label is the string's terminating character. See [15] for a more detailed explanation.

The PWOTD algorithm consists of two phases. In
phase one, we partition the suffixes of the input string into |A|^prefixlen partitions, where |A| is the alphabet size of the string and prefixlen is the depth of the partitioning. The partitioning step is executed as follows. The input string is scanned from left to right. At each index position i, the prefixlen subsequent characters are used to determine one of the |A|^prefixlen partitions. This index i is then written to the calculated partition's buffer. At the end of the scan, each partition will contain the suffix pointers for suffixes that all have the same prefix of size prefixlen.

To further illustrate the partition step, consider the following example. Partitioning the string ATTAGTACA$ using a prefixlen of 1 would create four partitions of suffixes, one for each symbol in the alphabet. (We ignore the final partition consisting of just the string terminator symbol $.) The suffix partition for the character A would be 0,3,6,8, representing the suffixes ATTAGTACA$, AGTACA$, ACA$, A$. The suffix partition for the character T would be 1,2,5, representing the suffixes TTAGTACA$, TAGTACA$, TACA$. In phase two, we use the wotdeager algorithm to build the suffix tree on each partition using a top-down construction.

Algorithm PWOTD(String, prefixlen)
Phase1: Scan the String and partition Suffixes based on the first prefixlen symbols of each suffix
Phase2: Do for each partition:
  1. START BuildSuffixTree
  2. Populate Suffixes from current partition
  3. Sort Suffixes on first symbol using Temp
  4. Output branching and leaf nodes to the Tree
  5. Push the nodes pointing to an unevaluated range onto the Stack
     While Stack is not empty
  6.   Pop a node
  7.   Find the Longest Common Prefix (LCP) of all the suffixes in this range by checking the String
  8.   Sort the range in Suffixes on the first symbol using Temp
  9.   Write out branching nodes or leaf nodes to Tree
  10.  Push the nodes pointing to an unevaluated range onto the Stack
  11. END
Figure 2: The TDD Algorithm

The pseudo-code for the PWOTD algorithm is shown in Figure 2. While the partitioning in phase one of PWOTD is simple enough, the algorithm for wotdeager in phase two warrants further discussion. We now illustrate the wotdeager algorithm using an example.

3.1.1 Example Illustrating the wotdeager Algorithm

The PWOTD algorithm requires four data structures for constructing suffix trees: an input string array, a suffix array, a temporary array, and the suffix tree. For the discussion that follows, we name each of these structures String, Suffixes, Temp, and Tree, respectively.

The Suffixes array is first populated with suffixes from a partition after discarding the first prefixlen characters. Using the same example string as before, ATTAGTACA$, consider the construction of the Suffixes array for the T partition. The suffixes in this partition are at positions 1, 2, and 5. Since all these suffixes share the same prefix, T, we add one to each offset to produce the new Suffixes array 2,3,6. The next step involves sorting this arr
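
The partitioning phase and the initial population of the Suffixes array can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: an in-memory dictionary stands in for the per-partition buffers, and the function names partition_suffixes and initial_suffixes are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def partition_suffixes(text, prefixlen):
    """Phase one: group suffix start positions by their first prefixlen symbols.

    Suffixes shorter than prefixlen (near the end of the string) simply get a
    shorter key in this sketch.
    """
    partitions = defaultdict(list)
    for i in range(len(text)):  # single left-to-right scan of the input
        partitions[text[i:i + prefixlen]].append(i)
    return dict(partitions)


def initial_suffixes(partition, prefixlen):
    """Discard the shared prefix by advancing every offset by prefixlen."""
    return [i + prefixlen for i in partition]


parts = partition_suffixes("ATTAGTACA$", 1)
print(parts["A"])                       # suffix positions starting with A
print(parts["T"])                       # suffix positions starting with T
print(initial_suffixes(parts["T"], 1))  # Suffixes array for the T partition
```

For the example string this reproduces the partitions 0,3,6,8 for A and 1,2,5 for T, and the adjusted offsets 2,3,6 for the T partition. Unlike the description in the text, the sketch also keeps the partition for the terminator symbol $.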

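
The numbered steps of Figure 2 (group on the first symbol, emit leaf and branching nodes, push unevaluated ranges, then pop a range and find its longest common prefix) can also be sketched in Python. The version below is a rough illustration, not the paper's implementation: it stores the tree as nested dictionaries keyed by edge label instead of the linear array used by wotdeager, groups suffixes with a dictionary rather than sorting in place with Temp, and assumes the string ends with a unique terminator such as $. The names build_subtree and lcp_length are hypothetical.

```python
def lcp_length(text, offsets):
    """Length of the longest common prefix of the suffixes at the given offsets."""
    n = 0
    while (all(o + n < len(text) for o in offsets)
           and len({text[o + n] for o in offsets}) == 1):
        n += 1
    return n


def build_subtree(text, starts, depth=0):
    """Top-down construction over the suffixes beginning at `starts`.

    `depth` is the number of leading symbols already consumed (prefixlen for
    a partition). Leaves map an edge label to the suffix start position.
    """
    root = {}
    stack = [(root, list(starts), depth)]  # unevaluated ranges
    while stack:
        node, group, d = stack.pop()
        # Group the range on the first unconsumed symbol.
        buckets = {}
        for s in group:
            buckets.setdefault(text[s + d], []).append(s)
        for symbol in sorted(buckets):
            bucket = buckets[symbol]
            if len(bucket) == 1:
                # Leaf: the edge label runs to the end of the string.
                node[text[bucket[0] + d:]] = bucket[0]
            else:
                # Branching node: label is the common prefix of the bucket.
                k = lcp_length(text, [s + d for s in bucket])
                child = {}
                node[text[bucket[0] + d:bucket[0] + d + k]] = child
                stack.append((child, bucket, d + k))
    return root


text = "ATTAGTACA$"
t_subtree = build_subtree(text, [1, 2, 5], depth=1)  # the T partition
print(t_subtree)
```

Building each partition separately with depth set to prefixlen mirrors phase two: the subtree returned for the T partition is the same one that hangs below the T edge when the tree is built over all suffixes at once.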