
Original Article

Novel strategies to identify relevant molecular signatures for complex human diseases based on data of identical-by-descent profiles and genomic context

Chuan-xing LI1, Lei DU1, Xia LI1,2*, Bin-sheng GONG1, Jie ZHANG1, Shao-qi RAO1,3*

(1. Department of Bioinformatics, Harbin Medical University, Harbin 150086, China; 2. Department of Computer Science, Harbin Institute of Technology; 3. Departments of Cardiovascular Medicine and Molecular Cardiology, Cleveland Clinic Foundation, Cleveland, Ohio 44195, USA)

JOURNAL OF PEKING UNIVERSITY (HEALTH SCIENCES), Vol. 38, No. 1, Feb. 2006

ABSTRACT Objective: To develop novel strategies to identify relevant molecular signatures for complex human diseases based on data of identical-by-descent (IBD) profiles and genomic context. Methods: In the proposed strategies, we define four relevancy criteria for mapping SNP-phenotype relationships: point-wise IBD mean difference, averaged IBD difference for window, Z curve, and averaged slope for window. Results: Application of these criteria and a permutation test to 100 simulated replicates for two hypothetical American populations, to extract the relevant SNPs for alcoholism based on sib-pair IBD profiles of pedigrees, demonstrates that the proposed strategies successfully identify most of the simulated true loci. Conclusion: The data mining practice implies that the IBD statistic and genomic context could be used as the informatics for locating the underlying genes for complex human diseases. Compared with the classical Haseman-Elston sib-pair regression method, the proposed strategies are more efficient for large-scale genomic mining.

KEY WORDS Polymorphism, single nucleotide; Medical informatics; Multifactorial inheritance; Genome

Fund project: Supported by the National Natural Sciences Foundation of China (30170515, 30370798, 30570424 and 30571034); The National High Technology Research and Development Program of China (2003AA2Z2051); 211 Project; The Tenth "Five-year" Plan; Harbin Medical University and Heilongjiang Province Science and Technology Key Project (GB03C602-4 and 1055HG009)
Corresponding author's e-mail: Lixia@ems.hrbmu.edu.cn
* These authors contributed equally to this work

Single-nucleotide polymorphism (SNP) is the most widespread form of DNA polymorphism in the human genome, thus permitting large-scale and high-density genome-wide profiling. SNPs are generally considered to be ideal genetic markers for genetic investigations, as they are common, stable and increasingly amenable to automated mining methods. Searching for disease-relevant SNPs as the landmark(s) to locate disease gene(s) is a critical step for positional cloning of the underlying molecular determinants for complex human traits. Many statistical methods have been developed for identification of disease-relevant SNPs based on either population-based or pedigree-based data, yet no optimal method for analysis of high-dimension SNPs has been found so far [1].

Many complex human diseases, such as the alcoholism behaviors investigated by the Genetic Analysis Workshop 14 (GAW14, http://www.gaworkshop.org/), are not simple Mendelian disorders. Instead, they may have mixed contributions of genes, environments and their interactions. A sophisticated mathematical model (or models) is thus desirable to map the epidemiological complexities, but can be prohibitively complex. Recent advances in IBD linkage analysis, chromosome structure analysis (the Z curve method for computing the G + C content) [2], disease gene mining [3, 4], discovery of adjacent and co-expressed genes along the chromosome (sliding window method) [5], and the permutation test [6] give us insights and alternative methods for large-scale association studies.

In principle, an ideal information measurement or statistic for correlations between molecular signatures and disease phenotypes should capture both the marginal effects of a signature and its interactions with other feature variables such as nearby SNPs and environmental risk factors. In this study, we define and evaluate several information criteria using both the IBD statistic and genomic context. Then, we apply these criteria and a permutation test to extract the relevant SNPs for alcoholism based on the simulated pedigree data for GAW14.

1 Materials and Methods

Virtually all pedigree-based genetic analysis methods rely on the concept of resemblance between relatives. The degree of relation between phenotypic resemblance (e.g., as defined below for alcoholism) and genetic resemblance (e.g., IBD sharing) provides a means of estimating the strength of association of a SNP (or other genetic variant) with the studied trait. We start with definitions of the resemblance measures for a sib pair.

1.1 Defining the phenotypic resemblance attribute of a sib pair

First, we define the phenotypic attribute of a sib pair: its affection status for alcoholism. For the binary trait, there are three possible attributes, of which two are chosen as the phenotypes for learning: concordant affected, in which both sibs in a sib pair are affected; and concordant unaffected, in which neither sib in a sib pair is affected.

1.2 Defining the genetic features (genetic resemblance measure) to be mined

The genetic features are defined to be the estimated proportions of alleles shared IBD by the sib pair at the SNP positions, computed by the GENIBD program of the SAGE package [7]. Because our main interest is to explore the utility of the proposed analysis strategies for extracting useful genetic information from large-scale SNP data, we did not model the second-moment quantities of clinical covariates for the sib pairs.

1.3 Four statistics for association between molecular signatures and phenotypes

The IBD values reflect the proportion of alleles identical by descent at the putative locus for sibling pairs. The larger the difference in SNP IBD values between concordant affected and concordant unaffected sib pairs, the stronger the association between the SNP and the disease. Here, we define four criteria to measure the association of molecular signatures with phenotypes.

1.3.1 IBD difference

The IBD difference (DF) of a single marker $i$ is the difference between its mean IBD value over all concordant affected sib pairs and its mean IBD value over all concordant unaffected sib pairs. It is determined by the equation:

$$DF_i = \mathrm{mean}\left(IBD_i^{\mathrm{disease}}\right) - \mathrm{mean}\left(IBD_i^{\mathrm{normal}}\right) \tag{1}$$

1.3.2 Averaged IBD differences for window

In genetic studies, it is well known that nearby SNP markers are not independent, owing to close linkage or linkage disequilibrium. Furthermore, increasing experimental evidence suggests that adjacent, co-expressed and functionally associated genes tend to cluster along the chromosome. The averaged IBD difference for window (ADF) measures these IBD-based genomic contexts by taking into account both the association of the signature with disease and its interaction effects with adjacent SNPs. The ADF of the $i$th signature ($ADF_i$) is the mean IBD difference of the SNPs within a window that contains $w$ markers and is centered on the $i$th signature. The ADF profile for the signatures is calculated using a sliding window across the genome. Different window sizes are tried to identify the optimal window for subsequent regional analyses. Mathematically, this strategy can be formalized as follows:

$$ADF_i = \frac{\sum_{j \in \mathrm{window}(i)} DF_j}{w}, \quad i = 1, 2, \ldots, N. \tag{2}$$

1.3.3 Z curve

This approach is analogous to the Z curve method for analysis of the G + C content in the human genome. Consider a DNA sequence with $N$ SNPs. Beginning at the first SNP, inspect the sequence one SNP at a time. Denote the current step by $n$ (i.e., at the $n$th SNP), $n = 1, 2, \ldots, N$. In the $n$th step, calculate the cumulative IBD difference ($Z_n$) of the SNPs up to the $n$th SNP. Denote the genomic location of the $n$th SNP by $X_n$, which is measured as a relative distance to the first SNP ($X_1$). The Z curve consists of a series of nodes $P_n$, where $n = 1, 2, \ldots, N$, whose coordinates are denoted by $X_n$ and $Z_n$:

$$X_n = \mathrm{location}(P_n), \qquad Z_n = \sum_{i=1}^{n} DF_i, \quad n = 1, 2, \ldots, N. \tag{3}$$

Therefore, $Z_n$ depicts the cumulative distribution of IBD difference (DF) for a SNP sequence. Usually, for a DF-rich genomic fragment, $Z_n$ is approximately a monotonically increasing linear function of $X_n$, whereas for a DF-poor one, $Z_n$ is approximately a monotonically decreasing linear function of $X_n$. In both cases, it is convenient to fit the $Z_n \sim X_n$ curve by a straight line using least-squares estimation:

$$z = k X_n \tag{4}$$

where $(z, X_n)$ is the coordinate of a point on the fitted straight line (corresponding to the observed node $P_n$) and $k$ is its slope. Instead of directly using the $Z_n \sim X_n$ curve, we use the curve derived from it by removing the fitted line, $Z'_n \sim X_n$, where

$$Z'_n = Z_n - z = Z_n - k X_n \tag{5}$$

Let $\overline{DF}$ denote the average DF within a region $\Delta X_n$ in a SNP sequence; we find from Equations (3) and (5) that

$$\overline{DF} = k + \frac{\Delta Z'_n}{\Delta X_n} = k + k' \tag{6}$$

where $k' = \Delta Z'_n / \Delta X_n$ is the average slope of the $Z'_n \sim X_n$ curve within the region $\Delta X_n$. As seen from Equation (6), an up jump in the curve, i.e., $k' > 0$, indicates an increase of the average DF between concordant affected and concordant unaffected sib pairs within the region, and vice versa.

1.3.4 Averaged slope for window (AS)

The computation method for AS is similar to the one for ADF and can be formulated as follows:

$$AS_i = \frac{\sum_{j \in \mathrm{window}(i)} k'_j}{w}, \quad i = 1, 2, \ldots, N. \tag{7}$$

1.4 Permutation test

To examine the strength of the associations between markers and phenotypes, we resort to permutation tests. Permutation (or randomization) tests have the advantage that no particular data distribution is assumed. Instead, the empirical distributions, obtained by permuting the observed data examples, are used as the basis for statistical inference, thus rendering the approach robust to a variety of test statistics with unknown distributions. The principle of a permutation test is simple: given the labeled data, all permutations of the labels (i.e., combinations between the phenotypic attributes and their corresponding genetic profiles) should be equally likely. Under the null hypothesis of no association between the investigated SNP(s) and the target disease phenotype, an empirical null distribution of the test statistic can be constructed by randomization of the phenotypic attribute of a sib pair. The significance of experimental observations can be determined by comparing the test statistic derived from permuted data with the test statistic of the original data.

The empirical distribution for each test statistic is produced by calculating a total of 917 (the total number of SNPs) × 100 (the population replicates provided by GAW14) mean values, each averaged over 100 randomly permuted samples using a GAW14 simulated replicate as the template. Based on the empirical distribution of each statistic, we evaluate the relevance of a SNP (or SNPs) to the alcoholism phenotype. The four statistics are computed for each chromosome separately, and the mean for each statistic is averaged over the 100 GAW14 replicates. The significance in terms of the empirical P-value for each SNP marker is defined as $P = m / s$, where $m$ is the number of test statistics in the empirical null distribution whose values are larger than the computed statistic derived from the original data, and $s$ (= 91 700) is the total number of test statistics in the empirical distribution. We claim statistical significance if $P \le 0.01$.

2 Results

2.1 Identification of relevant molecular signatures for alcoholism

We use the 100 GAW14 simulated replicates to demonstrate the behaviors and properties of the proposed methods for mining alcoholism-relevant SNPs. The dataset used in this study contains a total of 917 simulated SNPs across ten chromosomes for two hypothetical populations (Aipotu and Karnagar, each having 100 pedigrees). The SNP markers are spaced 3 cM apart on average, and nine disease loci are simulated on eight chromosomes (chromosomes 1, 2, 3, 4, 5, 8, 9 and 10). For computational convenience, we perform the same analysis procedures separately for each investigated chromosome instead of simultaneously analyzing the whole genome. The four statistics for each SNP (or window) are calculated from each simulated replicate. After trying different window sizes (w = 3, 5 and 7), we select w = 3 for computation of the regional statistics ADF and AS. We use the means of each statistic averaged over the 100 simulated replicates to estimate the significance of association (i.e., the P-value). On average, we have successfully identified 70% and 78% of the simulated true disease loci in the Aipotu and Karnagar populations, respectively. It is not surprising that the proposed method has also identified clusters of SNPs near the true trait loci, as the proposed method is designed to extract both the main SNP feature and nearby "redundant" features (here, the reason for the redundancies can be close linkage or linkage disequilibrium between the markers within the cluster).

2.2 Prediction efficiency test

Examining the efficiency of each method is essentially equivalent to measuring the performance of the resulting classifier(s). Here, we choose four measures to assess the efficiency of each method: (1) prediction accuracy, which is the proportion of the total number of predictions that are correct; (2) precision, which is the proportion of the predicted positive cases that are correct; (3) recall, which is the proportion of the total number of positive cases that are correctly identified; and (4) F value, an overall measure derived from precision and recall, which is particularly suitable for scenarios with a severe imbalance between positive and negative samples and is calculated as F value = 2 × precision × recall / (precision + recall). For all four measures, the higher the value, the better the classifier performs.

Comparisons of the proposed approaches with the Haseman-Elston sib-pair regression method (HE) for the two populations are shown in Figure 1. The numerical results demonstrate that the newly proposed approaches significantly improve the recall performance for both populations. Nevertheless, differences in terms of the remaining three measures are slight, which is possibly due to the high imbalance between positive and negative samples, i.e., 9 true trait loci versus 908 SNPs. Comparing the four proposed criteria, IBD mean difference and Z curve appear to be more powerful than the two sliding window approaches, whose merits might become obvious when high-density markers are used. To study the effects of the inclusion criteria for selecting SNPs as prediction features, we further select the top five most important SNPs to build the classifier. This analysis leads to marked improvements in terms of accuracy, precision and F value (average increases of 5%, 15% and 20% respectively, see Figure 2). Now, all the propose
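
The statistics of Sections 1.3 and 1.4 and the F value of Section 2.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the truncation of windows at chromosome ends, the finite-difference estimate of the local slope k', and the pooling of permuted DF values into a single null distribution are all assumptions made for illustration.

```python
import random
from statistics import mean


def ibd_difference(affected, unaffected):
    """Eq. (1): per-SNP mean IBD of concordant affected pairs minus that of
    concordant unaffected pairs. Each argument is a list of per-pair IBD profiles."""
    n_snps = len(affected[0])
    return [
        mean(pair[i] for pair in affected) - mean(pair[i] for pair in unaffected)
        for i in range(n_snps)
    ]


def window_mean(values, w=3):
    """Eqs. (2) and (7): mean over a window of w markers centered on marker i.
    Assumption: windows are truncated at the ends and divided by their actual size."""
    half = w // 2
    out = []
    for i in range(len(values)):
        win = values[max(0, i - half): i + half + 1]
        out.append(sum(win) / len(win))
    return out


def z_curve(df, positions):
    """Eq. (3): X_n is the distance to the first SNP, Z_n the cumulative DF."""
    x = [p - positions[0] for p in positions]
    z, total = [], 0.0
    for d in df:
        total += d
        z.append(total)
    return x, z


def detrended_slopes(x, z):
    """Eqs. (4)-(6): least-squares fit of z = k*X (no intercept), Z' = Z - k*X,
    and k'_n as the slope of Z' between consecutive SNPs (assumption).
    Needs at least two SNPs at distinct positions."""
    k = sum(a * b for a, b in zip(x, z)) / sum(a * a for a in x)
    z_prime = [b - k * a for a, b in zip(x, z)]
    slopes = [0.0]  # assumption: no preceding interval for the first SNP
    for n in range(1, len(x)):
        slopes.append((z_prime[n] - z_prime[n - 1]) / (x[n] - x[n - 1]))
    return k, z_prime, slopes


def permutation_null(profiles, labels, n_perm, rng):
    """Section 1.4, simplified: shuffle the affection labels among sib pairs and
    pool the resulting DF values. labels: 1 = concordant affected, 0 = unaffected."""
    labels = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        aff = [p for p, lab in zip(profiles, labels) if lab == 1]
        una = [p for p, lab in zip(profiles, labels) if lab == 0]
        null.extend(ibd_difference(aff, una))
    return null


def empirical_p(observed, null_values):
    """P = m / s, with m the count of null statistics larger than the observed one."""
    m = sum(1 for v in null_values if v > observed)
    return m / len(null_values)


def f_value(tp, fp, fn):
    """Section 2.2: F value = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For the regional statistics, `window_mean(df, w=3)` corresponds to ADF and `window_mean(slopes, w=3)` to AS, matching the window size selected in Section 2.1. A seeded generator such as `random.Random(0)` can be passed as `rng` to make the permutations reproducible.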
