就拿TCGA的乳腺癌RNA.doc_三一文库31doc.com

资源描述

《就拿TCGA的乳腺癌RNA.doc》由会员分享，可在线阅读，更多相关《就拿TCGA的乳腺癌RNA.doc（13页珍藏版）》请在三一文库上搜索。

1、就拿 TCGA 的乳腺癌 RNAWGCNA （Weighted Correlation Network analysis）是一个基于基因表达数据，构建基因共表达网络的方法。WGCNA 和差异基因分析（ DEG）的差异在于 DEG 主要分析样本和样本之间的差异，而WGCNA 主要分析的是基因和基因之间的关系。 WGCNA 通过分析基因之间的关联关系，将基因区分为多个模块。而最后通过这些模块和样本表型之间的关联性分析，寻找特定表型的分子特征。网上例子千千万，但是大部分都是从文档翻译而来，要用起来还是有些费劲，要深入的可以移步这里：tml下面我将根据 TCGA 乳腺癌基因表达数据以及乳腺癌压型数据，

2、一步一步的使用WGCNA 来进行乳腺癌各个亚型共表达模块的挖掘#数据准备 #首先我们需要下载TCGA 的乳腺癌的 RNA-seq 数据以及临床病理资料，我这里使用我们自己开发的TCGA 简易下载工具进行下载首先下载 RNA-Seq：下载之后共得到1215 个样本表达数据进一步下载临床病理资料进一步点击 ClinicalFull 按钮对病理资料进行提取得到 ClinicalFull_matrix.txt 文件，使用 Excel 打开 ClinicalFull_matrix.txt 文件可以看到共有 301 列信息，包含了各种用药，随访，预后等等信息，我们这里选择乳腺癌 ER、PR、HER2 的信

3、息，去除其他用不上的信息，然后选择了其中有明确 ER、PR、HER2 阳性阴性的样本，随机拿 100 个做例子吧样本筛选完了，现在轮到怎么获取这些样本的 RNA-seq 数据啦，前面下载了一千多个样本的 RNA-seq,从里面找到这一百个样本的表达数据其实也是不需要变成的啦，看清楚咯首先打开 RNA-Seq 数据目录的 fileID.tmp( 用 Excel 打开 )，然后可以看到两列：将第二列复制，并且替换 -01.gz 为空使用 Excel 的 vlookup 命令将临床病理资料的那100 个样本进行映射然后筛选非 N/A 的就得到了这一百个样本对于的 RNA-seq 数据信息进一步删除其

4、他的样本，还原成fileID.tmp 格式保存退出：然后使用 TCGA 简易小工具 “合并文件 ”按钮就得到表达矩阵了，进一步使用 ENSD_ID 转换按钮就得到了基因表达矩阵和 lncRNA 表达矩阵了#R代码实现 WGCNA# setwd(E:/rawData/TCGA_DATA/TCGA-BRCA) samples=read.csv(ClinicalFull_matrix.txt,sep = t,row.names= 1)dim(samples)#1 10031)dim(expro)#1 24991100数据读取完成，从上述结果可以看出100 个样本，有 24991个基因，这么多基因全部

5、用来做WGCNA 很显然没有必要，我们只要选择一些具有代表性的基因就够了，这里我们采取的方式是选择在100 个样本中方差较大的那些基因（意味着在不同样本中变化较大）继续命令：m.vars=apply(expro,1,var)expro.upper=exprowhich(m.vars>quantile(m.vars, probs =seq(0, 1, 0.25)4),#选择方差最大的前25%个基因作为后续 WGCNA 的输入数据集通过上述步骤拿到了6248 个基因的表达谱作为WGCNA 的输入数据集，进一步的我们需要看看样本之间的差异情况gsg = goodSamplesGenes(dat

6、Expr, verbose = 3); gsg$allOKsampleTree = hclust(dist(datExpr), method = average) plot(sampleTree, main = Sample clustering to detect outliers, sub=, xlab=) 从图中可看出大部分样本表现比较相近，而有两个离群样本，对后续的分析可能造成影响，我们需要将其去掉，共得到 98 个样本clust = cutreeStatic(sampleTree, cutHeight = 80000, minSize =10)table(clust)#clust#0

7、1#2 98keepSamples = (clust=1)datExpr = datExprkeepSamples, nGenes = ncol(datExpr)nSamples = nrow(datExpr)save(datExpr, file = FPKM-01-dataInput.RData)得到最终的数据矩阵之后，我们需要确定软阈值，从代码中可以看出 pickSoftThreshold 很简单，就两个参数，其他默认即可powers = c(c(1:10), seq(from = 12, to=20, by=2)sft = pickSoftThreshold(datExpr, power

8、Vector = powers,verbose = 5)#画图 #par(mfrow = c(1,2);cex1 = 0.9;plot(sft$fitIndices,1, -sign(sft$fitIndices,3)*sft$fitIndices,2,xlab=Soft Threshold (power),ylab=Scale FreeTopology Model Fit,signed R2,type=n, main = paste(Scale independence);text(sft$fitIndices,1, -sign(sft$fitIndices,3)*sft$fitIndice

9、s,2, labels=powers,cex=cex1,col=red);abline(h=0.90,col=red)plot(sft$fitIndices,1, sft$fitIndices,5, xlab=Soft Threshold (power),ylab=MeanConnectivity, type=n,main = paste(Mean connectivity)text(sft$fitIndices,1, sft$fitIndices,5, labels=powers,cex=cex1,col=red)从图中可以看出这个软阈值选择7 比较合适 ,选择软阈值 7进行共表达模块挖掘p

10、ow=7net = blockwiseModules(datExpr, power = pow, maxBlockSize = 7000,TOMType = unsigned,minModuleSize = 30,reassignThreshold = 0,mergeCutHeight = 0.25,numericLabels = TRUE,pamRespectsDendro = FALSE,saveTOMs = TRUE,saveTOMFileBase = FPKM-TOM,verbose = 3)table(net$colors)# open a graphics window #size

11、GrWindow(12, 9)# Convert labels to colors for plottingmergedColors = labels2colors(net$colors)# Plot the dendrogram and the module colors underneath plotDendroAndColors(net$dendrograms1, mergedColorsnet$blockGenes1,groupLabels = c(Module colors,GS.weight),dendroLabels = FALSE, hang = 0.03,addGuide =

12、 TRUE, guideHang = 0.05)从图中可以看出大部分基因在灰色区域，灰色部分一般认为是没有模块接受的，从这里也可以看出其实咱们选择的这些基因并不是特别好那么做到这一步了基本上共表达模块做完了，每个颜色代表一个共表达模块，统计看看各个模块下的基因个数：那么得到模块之后下一步该做啥呢，或许很多人到这就不知道如何继续分析了这里就需要咱们利用这些模块搞事情了，举个例子如果你是整合的数据（整合 lnc 与 gene），那么同时在某个模块中的基因和 lncRNA 咱们可以认为是共表达的，这便是 lnc-gene共表达关系的获得途径之一了，进一步你可以根据该模块的基因 -lnc-基因之间的

13、关系绘制出共表达网络今天咱们这里不讲这个，而是跟表型关联，咱们已经拿到了这 98 个样本的 ER、PR、HER2 阳性阴性信息，那么进一步的咱们可以看看哪些共表达模块跟ER、PR、HER2 阴性最相关，代码如下：moduleLabelsAutomatic = net$colorsmoduleColorsAutomatic =labels2colors(moduleLabelsAutomatic)moduleColorsFemale = moduleColorsAutomaticMEs0 = moduleEigengenes(datExpr,moduleColorsFemale)$eigenge

14、nesMEsFemale = orderMEs(MEs0)samples=samplesmatch(row.names(datExpr),paste0(gsub(-,.,row.names(samples),.01),#匹配 98 个样本数据trainDt=as.matrix(cbind(ifelse(samples,1=Positive,0,1),#将阴性的样本标记为1ifelse(samples,2=Positive,0,1),#将阴性的样本标记为 1ifelse(samples,3=Positive,0,1),#将阴性的样本标记为 1ifelse(samples,1=Negative&a

15、mp;samples,2=Negative&samples,3=Negative,1,0)#将三阴性的样本标记为1)#得到一个表型的 0-1 矩阵modTraitCor = cor(MEsFemale, trainDt, use = p)colnames(MEsFemale)modTraitP = corPvalueStudent(modTraitCor, nSamples) textMatrix = paste(signif(modTraitCor, 2), n(, signif(modTraitP, 1), ), sep = )dim(textMatrix) = dim(modTr

16、aitCor) labeledHeatmap(Matrix = modTraitCor, xLabels = colnames(trainDt), yLabels = names(MEsFemale),ySymbols = colnames(modlues), colorLabels = FALSE, colors = greenWhiteRed(50),textMatrix = textMatrix, setStdMargins = FALSE, cex.text = 0.5, zlim = c(-1,1), main = paste(Module-trait relationships)最

17、终找到几个共表达网络与三阴性表型最相关的模块。到了这一步其实还没完，咱们已经拿到了最相关的模块比如上图中的 yellow 模块，那么 yellow 模块中的基因，哪些基因最重要，我们该如何获取首先我们需要计算每个基因与模块的相关关系modTraitCor = cor(MEsFemale, datExpr, use = p)modTraitP = corPvalueStudent(modTraitCor, nSamples)corYellow=modTraitCorwhich(row.names(modTraitCor)=MEyellow),head(corYelloworder(-corYel

18、low)#RAD51AP1HDAC2FOXM1NCAPD2TPI1NOP2#0.9249354 0.9080872 0.8991733 0.8872607 0.87170500.8708449从上诉结果中可以看出RAD51AP1,HDAC2 两个基因与yellow 相关性最好，也就是说这两个基因是yellow 模块的hub-gene进一步的咱们导出yellow 模块的基因共表达关系使用cytoscape进行可视化，代码如下：TOM = TOMsimilarityFromExpr(datExpr, power = pow);probes = names(datExpr)mc=yellowmcIn

19、ds=which(match(moduleColorsAutomatic,gsub(ME,mc)=1)modProbes=probesmcIndsmodTOM = TOMmcInds, mcInds;dimnames(modTOM) = list(modProbes, modProbes)cyt = exportNetworkToCytoscape(modTOM,edgeFile =paste(edges-, mc, .txt, sep=),nodeFile =paste(nodes-, mc, .txt, sep=),weighted = TRUE,threshold =median(modTOM),nodeNames =modProbes,#altNodeNames =modGenes,nodeAttr =moduleColorsAutomaticmcInds); 到此基本结束了，敲这么多字。好累。

展开阅读全文