Analysis of DNA Chains by Means of Factorial Moments

arXiv:cond-mat/0109299 v2, 18 Sep 2001

Huijie Yang$^{1,+}$, Yizhong Zhuo$^{2}$

... information associated with these correlations lacks relevant details about the composition of DNA chains. On the other hand, researchers in the field are combining different methods for recognizing coding and non-coding DNA regions (based on techniques such as those mentioned above) in order to improve the prediction accuracy of the various software packages, which currently reaches approximately 90% [20-21]. Moreover, many of the statistical patterns that differ between coding and non-coding DNA have been found to be species dependent [22]. That is to say, traditional coding measures based upon these patterns need to be trained on organism-specific data sets before they can be applied to identify coding DNA. This training-set dependence limits the applicability of traditional measures, as new genomes are currently being sequenced for which training sets do not exist. For these reasons, alternative tools that give different ideas and estimators concerning the structure of DNA chains, especially the statistical patterns that are species-independent, represent an important contribution to the field. Detailed investigations of the statistical behavior of DNA, especially of the differences between coding and non-coding segments, make it possible to find methods that identify coding segments in DNA sequences theoretically [22-30]. Several novel methods have been suggested in the literature, such as entropy segmentation, the NM method, the mutual information function, etc. [22-24].

According to Li [31], a gene is a sequence of genomic DNA or RNA that performs a specific function, a vaguer definition than the traditional one. Performing the function may not require the gene to be translated or even transcribed. Three types of genes are recognized at present, namely protein-coding genes, RNA-specifying genes and regulatory genes. In this letter the coding segments refer to protein-coding genes.

In this letter, we suggest the concept of factorial moments as a coding measure. By means of factorial moments we try to identify coding and non-coding regions of DNA sequences from yeast. This method uses only the known general statistical properties of coding and non-coding segments of DNA. In this way, prior training on known data sets is avoided; furthermore, the search for additional biological information (such as splice sites or termination signals) can also be avoided.

II. Factorial Moments (FM)

More than ten years have witnessed
a remarkably intense experimental and theoretical activity in search of scale invariance and fractality in multihadron production processes, for short also called intermittency [32]. The primary motivation is the expectation that scale invariance or self-similarity, analogous to that often encountered
in complex non-linear systems, might open new avenues ultimately leading to deeper insight into long-distance properties of QCD and the unsolved problem of colour confinement.

Generally, intermittency can be described with the concept of the probability moment (PM). Dividing a region of phase space $\Delta$ into
$M$ bins, the volume of one bin is then $\delta = \Delta/M$. The $q$-order PM can be defined as [33]

$$C_q(\delta) = \sum_{m=1}^{M} p_m^{q},$$

where $p_m$ is the probability for a particle to occur in the $m$-th bin, which satisfies the constraint

$$\sum_{m=1}^{M} p_m = 1.$$

For a self-similar structure, PM obeys a power law,

$$\lim_{\delta \to 0} C_q(\delta) \propto \delta^{(q-1) D_q},$$

and $D_q$ is called the $q$-order fractal dimension or Renyi dimension. Simple discussions show that $D_0$, $D_1$ and $D_q$ ($q \ge 2$) reflect the geometry, information entropy and particle correlation dimensions, respectively.

It is well known that intermittency is related to strong dynamical fluctuations. But measurements of multihadron production yield the distribution of particle numbers directly rather than the probability distribution, and the finite number of cases induces statistical fluctuations. To describe the strong dynamical fluctuations and dismiss the statistical fluctuations effectively, the factorial moment (FM) was suggested for investigating intermittency [34,35]. The generally used form of FM can be written as

$$F_q = M^{q-1} \sum_{m=1}^{M} \frac{n_m (n_m - 1) \cdots (n_m - q + 1)}{n (n - 1) \cdots (n - q + 1)},$$

where $M$ is the number of bins into which the considered interval is divided, $n_m$ the number of particles occurring in the $m$-th bin, and $n$ the total number of particles in all the bins. A measure can then be introduced to indicate the dynamical fluctuations,

$$\phi_q = \lim_{\delta \to 0} \frac{\ln F_q}{\ln (1/\delta)}.$$

Here we present a simple argument for the ability of FM to dismiss statistical fluctuations [33]. The statistical fluctuations obey Bernoulli (multinomial) and Poisson distributions for a system containing a fixed and a non-fixed total number of particles, respectively. For a system with a fixed total number of particles $n$, the distribution of particles in the bins can be expressed as

$$Q(n_1, n_2, \ldots, n_M \mid p_1, p_2, \ldots, p_M) = \frac{n!}{n_1!\, n_2! \cdots n_M!}\, p_1^{n_1} p_2^{n_2} \cdots p_M^{n_M},$$

where $(p_1, p_2, \ldots, p_M)$ are the probabilities for a particle to occur in the bins $1, 2, \ldots, M$, respectively. Writing $P(p_1, \ldots, p_M)$ for the distribution of these bin probabilities generated by the dynamical fluctuations, we have

$$\langle n_m (n_m - 1) \cdots (n_m - q + 1) \rangle = \int dp_1 \cdots dp_M\, P(p_1, \ldots, p_M) \sum_{n_1, \ldots, n_M} Q(n_1, \ldots, n_M \mid p_1, \ldots, p_M)\, n_m (n_m - 1) \cdots (n_m - q + 1)$$

$$= n (n - 1) \cdots (n - q + 1) \int dp_1 \cdots dp_M\, P(p_1, \ldots, p_M)\, p_m^{q} = n (n - 1) \cdots (n - q + 1)\, \langle p_m^{q} \rangle.$$

That is to say, after averaging,

$$\langle F_q(M) \rangle = M^{q-1} \sum_{m=1}^{M} \langle p_m^{q} \rangle = M^{q-1} \langle C_q(M) \rangle.$$

Therefore FM can describe the strong dynamical fluctuations while dismissing the statistical fluctuations effectively.

Besides the statistical fluctuations, there are some trivial dynamical processes that need to be dismissed. These trivial dynamical processes
make the average numbers of particles in different phase-space bins unequal, so that FM should take its original form, which reads

$$F_q = \frac{1}{M} \sum_{m=1}^{M} \frac{\langle n_m (n_m - 1) \cdots (n_m - q + 1) \rangle}{\langle n_m \rangle^{q}}.$$

A typical method to dismiss the fluctuations due to this kind of trivial dynamical process is to transform the original distribution $f(y)$ on the interval $[y_a, y_b]$ into a homogeneous distribution by means of the integral method [36],

$$x(y) = \frac{\int_{y_a}^{y} f(y')\, dy'}{\int_{y_a}^{y_b} f(y')\, dy'}.$$

But in this paper we resolve this problem by constructing a series of delay-register vectors based upon the DNA sequences, as illustrated in the next section.

III. Application to DNA analysis

The concept of FM has been used to deal with many kinds of complex dynamical processes in physics, such as multi-particle production at high energy, DNA melting and denaturation with increasing temperature, etc. [37-38]. Moreover, this concept has also been extended to a new
version called erraticity, to deal with problems where the statistical average cannot be carried out properly [39].

Detailed works predict that in non-coding DNA sequences the elements A, T, C and G are not positioned randomly but exhibit self-similar structure, while in coding DNA sequences the
elements are distributed in a quasi-random way. Therefore, it may be reasonable to distinguish coding and non-coding DNA sequences using the concept of FM.

There are several statistical features that can be employed to distinguish non-coding and coding regions, as listed below [40,41]:
(a) The usage of strongly bonded nucleotide C-G pairs is usually less frequent than that of weakly bonded A-T pairs;
(b) The C-G concentration may differ significantly between organisms, but is generally larger in coding than in non-coding regions;
(c) The C-G concentration makes a strong "background" contribution to any possible differences between non-coding and coding subsequences;
(d) Non-coding regions display long-range power-law correlations and share features with hierarchically structured languages, i.e. a linear Zipf plot and a non-zero redundancy. That is to say, there are deterministic structures in non-coding regions, while for coding regions it seems that random rules dominate the sequences.

Therefore, the coding and non-coding regions behave completely differently; they are sequences obeying different laws. To take these statistical characteristics of DNA sequences into account, we construct a process as follows [42-46]:
(a) $d$ successive nucleotides along a DNA sequence are regarded as a case containing $d$ particles. The state of the case can be described by a $d$-dimensional vector $(x_1, x_2, x_3, \ldots, x_d)$, where $x_i$ is the state value of the $i$-th nucleotide. We can define the
state values according to our counting rules. In this paper $x_i$ is set to 1 when the $i$-th position is occupied by C or G, and to 0 for A or T.
(b) For a segment of length $N$, the $N - d + 1$ possible successive cases form a process. The process covers the entire DNA segment we are interested in,
which can be expressed as the series of $d$-dimensional delay-register vectors:

$$(x_1, x_2, x_3, \ldots, x_d),\quad (x_2, x_3, x_4, \ldots, x_{d+1}),\quad \ldots,\quad (x_{N-d+1}, x_{N-d+2}, x_{N-d+3}, \ldots, x_N).$$

For each case we can count the number of occurrences of the nucleotides C and G. Then the density spectrum (i.e., the distribution of these counts, normalized to unity) is obtained from the numbers of occurrences in all the cases. In this paper $F_q$ with $q = 4$ is calculated. Obviously $F_q$ for other values of $q$, such as 5 or 6, can be obtained easily if necessary. To indicate the differences between the processes constructed above, which reflect the behavior of different regions in the DNA sequence, we introduce the measure

$$\Delta F(t) = \sum_{m} \left( \frac{F_{tm}}{F_{0m}} - 1 \right)^{2},$$

where $m$ is the length of a case (denoted $d$ above), and $F_{0m}$ and $F_{tm}$ are the FM of the initial process (i.e. the reference region) and of the $t$-th process, respectively. If the $t$-th region behaves similarly to the initial one, $\Delta F(t)$ tends to zero, while $\Delta F(t)$ takes a definite non-zero value when the successive processes step into a region obeying different laws from the initial one. Moreover, two regions with similar behavior have almost the same values of $\Delta F(t)$.

In Fig.(1) we show the results for DNA sequences from yeast. The normalized (unitary) values of $\Delta F(t)$ are presented. The initial part, 1-1200 bp, is chosen as the reference segment. The length of a case is set to 10, 20, 30, 40, 50 and 60, respectively. The length of a segment used to construct a process is 1200 bp. We can find that the right borders
for almost all the coding regions occur at the bottoms of valleys. Because the positions of the valleys can be located with considerable precision, the right borders can be determined appropriately with the FM.

In Fig.(2) the left borders are determined. First the considered DNA sequence is arranged in inverse order, i.e., the positions originally numbered 1, 2, 3, ..., N are renumbered N, N-1, ..., 1. Then the normalized (unitary) $\Delta F(t)$ values are calculated. The positions of the valleys fit the left borders very well.

Here we meet an essential problem: how can we find a proper segment of the DNA sequence to be employed as the reference? A bad reference may induce fuzzy results. Investigations of the differences among coding segments, or among non-coding segments, may be helpful, and the FM method may be a useful tool for this purpose. It is interesting to find in the results above that almost all the right borders and the left borders are positioned around two typical values, respectively. Perhaps the borders can be classified, to a certain degree, according to the quantity $\Delta F(t)$.

Fig.(1) Right borders for DNA sequence from yeast.

Fig.(2) Left borders for DNA sequence from yeast.

References

1. B. Lewin, Genes VI (Oxford Univ. Press, Oxford, 1997); H. Lodish et al., Molecular Cell Biology (Freeman, New York, 1995); B. Alberts et al., Molecular Biology of the Cell (Garland Publishing, New York, 1994).
2. J. W. Fickett, Trends Genet. 12, 316 (1996).
3. J.-M. Claverie, Hum. Mol. Genet. 6, 1735 (1997).
4. J. C. W. Shepherd, Proc. Natl. Acad. Sci. USA 78, 1596 (1981).
5. F. S. Collins, Proc. Natl. Acad. Sci. USA 92, 10801 (1995).
6. D. A. Benson et al., Nucleic Acids Res. 24, 1 (1996).
7. W. Li and K. Kaneko, Nature (London) 360, 635 (1992).
8. C. K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, and H. E. Stanley, Nature (London) 356, 168 (1992).
9. R. F. Voss, Phys. Rev. Lett. 68, 3805 (1992).
10. V. R. Chechetkin and A. Y. Turygin, J. Theor. Biol. 178, 205 (1996).
11. C. L. Berthelsen, J. A. Glazier, and M. H. Skolnick, Phys. Rev. A 45, 8902 (1992).
12. M. De Sousa Vieira and H. J. Herrmann, Europhys. Lett. 33, 409 (1996).
13. N. V. Dokholyan, S. V. Buldyrev, S. Havlin, and H. E. Stanley, Phys. Rev. Lett. 79, 5182 (1997).
14. S. Nee, Nature (London) 357, 450 (1992).
15. J. Maddox, Nature (London) 358, 103 (1992).
16. P. J. Munson, R. C. Taylor, and G. S. Michaels, Nature (London) 360, 636 (1992).
17. S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, F. Sciortino, and H. E. Stanley, Phys. Rev. Lett. 71, 1776 (1993).
18. R. F. Voss, Phys. Rev. Lett. 71, 1777 (1993).
19. A. Krogh, in Computational Methods in Molecular Biology, edited by S. Salzberg, D. Searls, and S. Kasif (Elsevier Science B. V., Amsterdam, 1998).
20. M. Burset and R. Guig…
28. Richard F. Voss, Fractals 2, 1-6 (1994).
29. Bai-Lin Hao, Physica A 282, 225 (2000).
30. Zu-Guo Yu, Bai-lin Hao, Hui-min Xie, and Guo-yi Chen, Chaos, Solitons and Fractals 11, 2215 (2000).
31. W. H. Li, Molecular Evolution (Sinauer Associates, Sunderland, MA, 1997); W. H. Li and D. Graur, Fundamentals of Molecular Evolution (Sinauer Associates, Sunderland, MA, 1991).
32. E. A. De Wolf et al., Physics Reports 270, 1 (1996).
33. G. Paladin and A. Vulpiani, Physics Reports 156, 147 (1987).
34. A. Bialas and R. Peschanski, Nuclear Physics B 308, 857 (1988).
35. A. Bialas and R. Peschanski, Nuclear Physics B 273, 703 (1986).
36. W. Ochs, Z. Phys. C 50, 339 (1991); A. Bialas and M. Gazdzicki, Phys. Lett. B 252, 483 (1990).
37. Yang Huijie, Zhuo Yizhong, and Wu Xizhen, Journal of Physics A 27, 6147 (1994).
38. See, for example, the systematic works presented by the group from Huazhong Normal University (Liu Lianshou, Wu Yuanfang, Zhang Yang, Chen Gang, et al.), Ch. Sci. Bull. 1, 21 (1991); Phys. Rev. Lett. 70, 3197 (1993); Phys. Rev. D 51, 6576 (1995); Z. Phys. C 73, 535 (1997); High En. Nuc. Phys. 23, 560 (1999), etc.
39. Z. Cao and R. C. Hwa, Phys. Rev. E 56, 326 (1997), and the references therein.
40. J. D. Watson, M. Gilman, J. Witkowski, and M. Zoller, Recombinant DNA (Scientific American Books, New York, 1992).
41. W. H. Li and D. Graur, Fundamentals of Molecular Evolution (Sinauer Associates, Sunderland, MA, 1995).
42. P. Barral, A. Hasmy, J. Jimenez, and A. Marcano, Phys. Rev. E 61, 1812 (2000).
43. J. D. Farmer and J. J. Sidorowich, Phys. Rev. Lett. 59, 845 (1987).
44. G. Sugihara and R. May, Nature (London) 344, 734 (1990).
45. D. M. Rubin, Chaos 2, 525 (1992).
46. P. Garcia, J. Jimenez, A. Marcano, and F. Moleiro, Phys. Rev. Lett. 76, 1449 (1996).
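
Illustrative sketch of the procedure of Section III

The construction described in Section III (C/G encoding, delay-register cases, factorial moments, and comparison with a reference segment) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the text does not spell out how $F_q$ is evaluated from the density spectrum, so the sketch assumes the normalized factorial moment $\langle n(n-1)\cdots(n-q+1)\rangle / \langle n \rangle^{q}$ of the C+G counts over all cases, and it assumes that the difference measure sums $(F_t/F_0 - 1)^2$ over the case lengths. All function names (`encode_gc`, `window_counts`, `factorial_moment`, `delta_f`) are hypothetical.

```python
from typing import List, Sequence


def encode_gc(seq: str) -> List[int]:
    """Step (a): state value 1 for C or G, 0 for A or T."""
    return [1 if base in "CG" else 0 for base in seq.upper()]


def window_counts(x: Sequence[int], d: int) -> List[int]:
    """C+G count in each of the N - d + 1 delay-register vectors of length d."""
    if d < 1 or d > len(x):
        raise ValueError("case length d must satisfy 1 <= d <= len(x)")
    total = sum(x[:d])
    counts = [total]
    for i in range(d, len(x)):
        total += x[i] - x[i - d]  # slide the window one position to the right
        counts.append(total)
    return counts


def factorial_moment(counts: Sequence[int], q: int = 4) -> float:
    """Assumed form: <n(n-1)...(n-q+1)> / <n>^q, averaged over all cases."""
    def falling(n: int) -> int:
        prod = 1
        for k in range(q):
            prod *= n - k  # vanishes automatically when n < q
        return prod

    mean_n = sum(counts) / len(counts)
    if mean_n == 0:
        raise ValueError("no C or G in the segment; moment is undefined")
    mean_falling = sum(falling(n) for n in counts) / len(counts)
    return mean_falling / mean_n ** q


def delta_f(reference: str, segment: str,
            case_lengths: Sequence[int] = (10, 20, 30, 40, 50, 60),
            q: int = 4) -> float:
    """Assumed measure: sum over case lengths of (F_t / F_0 - 1)^2."""
    ref_x, seg_x = encode_gc(reference), encode_gc(segment)
    total = 0.0
    for d in case_lengths:
        f0 = factorial_moment(window_counts(ref_x, d), q)
        ft = factorial_moment(window_counts(seg_x, d), q)
        if f0 == 0:
            raise ValueError("reference moment is zero; choose another reference")
        total += (ft / f0 - 1.0) ** 2
    return total
```

To scan a sequence in the spirit of Fig.(1), one would call `delta_f(seq[:1200], seg)` for successive 1200 bp segments `seg` of the sequence and look for valleys in the resulting values; for Fig.(2) the same scan would be run on the reversed sequence. The step by which successive segments are shifted is not stated in the text and would have to be chosen by the user.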