Published: 2024-08-24 09:04
This article mainly follows the official vignette: https://xlucpu.github.io/MOVICS/MOVICS-VIGNETTE.html
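Introduction
Installation
GET Module
Preparing the data
Filtering features (dimension reduction)
Determining the optimal number of subtypes
Clustering with a single algorithm
Running multiple clustering algorithms at once
Integrating multiple clustering results
Checking the quality of the clustering
Multi-omics heatmap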
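Introduction
Molecular subtyping has long been a popular approach in bioinformatics data mining. Many algorithms are used for it, such as the familiar non-negative matrix factorization (NMF), consensus clustering, and PCA. Consensus clustering was covered in an earlier post, "Molecular subtyping based on immune infiltration results".
Today's post introduces a one-stop R package for molecular subtyping: MOVICS.
The biggest difference between this package and other subtyping packages is that it can use multi-omics data at the same time. A typical subtyping package works on a single omics layer, for example only an mRNA expression matrix. MOVICS can cluster samples using, for example, mRNA, lncRNA, methylation, and mutation data together.
In addition, it supports exploring each subtype after clustering and running analyses within each subtype, which is why it can be called a one-stop package. Its functionality falls into three parts, shown in the schematic below: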
[Figure]
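The first part clusters samples based on different omics data. The second part compares the resulting subtypes. The third part explores each subtype and identifies subtype-specific molecules.
The main functions in each part are listed below and are introduced later: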
GET Module: get subtypes through multi-omics integrative clustering
- getElites(): get elites which are those features that pass the filtering procedure and are used for analyses
- getClustNum(): get optimal cluster number by calculating clustering prediction index (CPI) and Gap-statistics
- get%algorithm_name%() (for example getiClusterBayes()): get results from one specific multi-omics integrative clustering algorithm with detailed parameters
- getMOIC(): get a list of results from multiple multi-omics integrative clustering algorithms with parameters by default
- getConsensusMOIC(): get a consensus matrix that indicates the clustering robustness across different clustering algorithms and generate a consensus heatmap
- getSilhouette(): get quantification of sample similarity using silhouette score approach
- getStdiz(): get a standardized data for generating comprehensive multi-omics heatmap
- getMoHeatmap(): get a comprehensive multi-omics heatmap based on clustering results
COMP Module: compare subtypes from multiple perspectives
- compSurv(): compare survival outcome and generate a Kaplan-Meier curve with pairwise comparison if possible
- compClinvar(): compare and summarize clinical features among different identified subtypes
- compMut(): compare mutational frequency and generate an OncoPrint with significant mutations
- compTMB(): compare total mutation burden among subtypes and generate distribution of Transitions and Transversions
- compFGA(): compare fraction genome altered among subtypes and generate a barplot for distribution comparison
- compDrugsen(): compare estimated half maximal inhibitory concentration (IC50) for drug sensitivity and generate a boxviolin for distribution comparison
- compAgree(): compare agreement of current subtypes with other pre-existing classifications and generate an alluvial diagram and an agreement barplot
RUN Module: run marker identification and verify subtypes
- runDEA(): run differential expression analysis with three popular methods for choosing, including edgeR, DESeq2, and limma
- runMarker(): run biomarker identification to determine uniquely and significantly differential expressed genes for each subtype
- runGSEA(): run gene set enrichment analysis (GSEA), calculate activity of functional pathways and generate a pathway-specific heatmap
- runGSVA(): run gene set variation analysis to calculate enrichment score of each sample based on given gene set list of interest
- runNTP(): run nearest template prediction based on identified biomarkers to evaluate subtypes in external cohorts
- runPAM(): run partition around medoids classifier based on discovery cohort to predict subtypes in external cohorts
- runKappa(): run consistency evaluation using Kappa statistics between two appraisements that identify or predict current subtypes
The package has been published; please remember to cite it when you use it:
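Lu, X., Meng, J., Zhou, Y., Jiang, L., and Yan, F. (2020). MOVICS: an R package for multi-omics integration and visualization in cancer subtyping. bioRxiv, 2020.2009.2015.297820. [doi.org/10.1101/2020.09.15.297820]
Installation
The package is currently hosted on GitHub only and has to be installed as shown below. It is best to install the dependencies first: the package depends on a large number of other packages, and installation fails easily. For beginners, installing this package is not very friendly.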
# install from GitHub
devtools::install_github("xlucpu/MOVICS")
# or download the archive and install it locally
devtools::install_local("E:/R/R包/MOVICS-master.zip")
GET Module
Preparing the data
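Let's look at the example data first.
library(MOVICS)
We use the data bundled with the package for the demonstration; it has already been cleaned. A separate post in a few days will cover how to prepare such data yourself.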
# TCGA breast cancer data
load(system.file("extdata", "brca.tcga.RData", package = "MOVICS", mustWork = TRUE))
load(system.file("extdata", "brca.yau.RData", package = "MOVICS", mustWork = TRUE))
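brca.tcga contains data from several omics layers, such as mRNA, lncRNA, methylation, and mutation data, plus clinical information such as survival time, survival status, and the PAM50 classification of breast cancer.
For demonstration purposes, the data were pre-filtered by MAD to keep a subset:
- 500 mRNAs, 500 lncRNAs, and 1,000 promoter CGI probes/genes with high variation
- 30 genes that are mutated in at least 3% of the entire cohort
Note that the most important point here is that every omics data set must have exactly the same number of samples, the same sample names, and the same sample order. You can inspect the data yourself to see what they look like.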
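The consistency requirement above (same samples, same names, same order in every omics matrix) is easy to check before clustering. The following is a minimal base-R sketch on toy data; `check_sample_consistency()` is a hypothetical helper written for this illustration and is not part of MOVICS.

```r
# Hypothetical helper (not a MOVICS function): TRUE only if every matrix in
# the list has identical column (sample) names in identical order.
check_sample_consistency <- function(mo_list) {
  ref <- colnames(mo_list[[1]])
  all(vapply(mo_list, function(m) identical(colnames(m), ref), logical(1)))
}

# Toy data: 3 samples, two omics layers
samples <- c("S1", "S2", "S3")
mrna <- matrix(rnorm(6), nrow = 2, dimnames = list(c("g1", "g2"), samples))
mut  <- matrix(c(0, 1, 0, 1, 1, 0), nrow = 2,
               dimnames = list(c("m1", "m2"), samples))

ok_list  <- list(mRNA = mrna, mut = mut)
# Same samples but in a different order: this should fail the check
bad_list <- list(mRNA = mrna, mut = mut[, c("S2", "S1", "S3")])

check_sample_consistency(ok_list)
check_sample_consistency(bad_list)

# Reordering every matrix by one reference vector restores consistency
fixed_list <- lapply(bad_list, function(m) m[, samples, drop = FALSE])
check_sample_consistency(fixed_list)
```

With real data the same idea would be applied to a list such as brca.tcga[1:4] before passing it to the clustering functions.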
names(brca.tcga)
## [1] "mRNA.expr"   "lncRNA.expr" "meth.beta"   "mut.status"  "count"
## [6] "fpkm"        "maf"         "segment"     "clin.info"
names(brca.yau)
## [1] "mRNA.expr" "clin.info"
# extract "mRNA.expr", "lncRNA.expr", "meth.beta" and "mut.status"
mo.data <- brca.tcga[1:4]
# extract raw count data
count <- brca.tcga$count
# extract fpkm data
fpkm <- brca.tcga$fpkm
# extract maf
maf <- brca.tcga$maf
# extract segmented copy number
segment <- brca.tcga$segment
# extract survival information
surv.info <- brca.tcga$clin.info
Filtering features (dimension reduction)
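getElites(), as the name suggests, finds the "elites", that is, the most informative features. The function performs preprocessing and filtering and helps with data preparation.
It mainly supports the following preprocessing steps:
- Missing value handling: remove features with missing values, or impute them with knn
- Feature filtering: filter by mad, sd, pca, cox, or freq (for binary data)
Strictly speaking this is not the first step. The first step should be cleaning the data yourself, for example log-transforming the expression matrix.
Below are some demonstrations of what the function can do; it is quite powerful.
Missing value handling: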
# scenario 1: handling missing values
tmp <- brca.tcga$mRNA.expr # get expression data
dim(tmp) # check data dimension
## [1] 500 643
tmp[1,1] <- tmp[2,2] <- NA # introduce a few NAs
tmp[1:3,1:3] # check data
##         BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
## SCGB2A2            NA          1.42          7.24
## SCGB1D2         10.11            NA          5.88
## PIP              4.54          2.59          4.35
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       na.action = "rm", # remove features with NAs
                       elite.pct = 1) # keep 100% of the features
## --2 features with NA values are removed.
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat)
## [1] 498 643
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       na.action = "impute", # impute with knn
                       elite.pct = 1)
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat)
## [1] 500 643
elite.tmp$elite.dat[1:3,1:3] # NA values have been imputed
##         BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
## SCGB2A2         6.867         1.420          7.24
## SCGB1D2        10.110         4.739          5.88
## PIP             4.540         2.590          4.35
Filtering features by MAD:
# scenario 2: filter by MAD (median absolute deviation)
tmp <- brca.tcga$mRNA.expr
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       elite.pct = 0.1) # keep the top 10% of genes by MAD
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat) # 10% of 500 is 50
## [1] 50 643
elite.tmp <- getElites(dat       = tmp,
                       method    = "sd",
                       elite.num = 100, # keep the top 100 genes by SD
                       elite.pct = 0.1) # this argument has no effect here
## elite.num has been provided then discards elite.pct.
dim(elite.tmp$elite.dat)
## [1] 100 643
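To make the idea of variability-based filtering concrete, here is a minimal base-R sketch on simulated data. `filter_by_mad()` is a hypothetical helper for illustration; it shows the general technique (rank features by MAD, keep the top fraction) and is not claimed to reproduce the internals of getElites().

```r
# Hypothetical helper: keep the top `pct` fraction of rows (features)
# ranked by median absolute deviation across samples.
filter_by_mad <- function(mat, pct = 0.1) {
  mads   <- apply(mat, 1, mad, na.rm = TRUE)
  n_keep <- max(1, floor(nrow(mat) * pct))
  keep   <- order(mads, decreasing = TRUE)[seq_len(n_keep)]
  mat[keep, , drop = FALSE]
}

set.seed(1)
# Toy expression matrix: 50 genes x 20 samples
toy <- matrix(rnorm(50 * 20), nrow = 50,
              dimnames = list(paste0("gene", 1:50), paste0("S", 1:20)))
toy[1:5, ] <- toy[1:5, ] * 10  # make the first 5 genes far more variable

top <- filter_by_mad(toy, pct = 0.1)
dim(top)        # 10% of 50 genes are kept
rownames(top)   # the highly variable genes
```

Replacing `mad` with `sd` inside the helper gives the standard-deviation variant of the same filter.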
Filtering by PCA requires some background knowledge of PCA; see the earlier post "Principal component analysis in R".
# scenario 3: filter by PCA
tmp <- brca.tcga$mRNA.expr # get expression data with 500 features
elite.tmp <- getElites(dat       = tmp,
                       method    = "pca",
                       pca.ratio = 0.95) # ratio used to select principal components
## --the ratio used to select principal component is set as 0.95
dim(elite.tmp$elite.dat) # get 204 elite (PCs) left
## [1] 204 643
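The effect of a ratio such as 0.95 can be sketched with base R's prcomp(): keep the smallest number of principal components whose cumulative explained variance reaches the ratio. This is an illustration on simulated data under that assumption, not the exact code behind method = "pca".

```r
set.seed(42)
toy <- matrix(rnorm(30 * 40), nrow = 30)  # 30 features x 40 samples

# prcomp() expects samples in rows, so the matrix is transposed
pca <- prcomp(t(toy), center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC and its running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

# Smallest number of PCs whose cumulative variance reaches 95%
n_pc <- which(cum_var >= 0.95)[1]

# Reduced data: PCs in rows, samples in columns (same layout as the input)
pcs <- t(pca$x[, seq_len(n_pc), drop = FALSE])
dim(pcs)
```

The reduced matrix has principal components instead of genes as rows, which matches the shape of the output shown above (204 PCs by 643 samples).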
Filtering by univariate Cox regression: a univariate Cox analysis is run for each feature and the significant ones are kept. Survival information must be provided:
# scenario 4: filter by Cox regression
tmp <- brca.tcga$mRNA.expr # get expression data
elite.tmp <- getElites(dat       = tmp,
                       method    = "cox",
                       surv.info = surv.info, # survival information; must contain columns 'futime' and 'fustat'
                       p.cutoff  = 0.05,
                       elite.num = 100) # this argument has no effect here either
## --all sample matched between omics matrix and survival data.
## 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
dim(elite.tmp$elite.dat) # get 125 elites
## [1] 125 643
table(elite.tmp$unicox$pvalue < 0.05) # 125 genes have nominal pvalue < 0.05 in univariate Cox regression
##
## FALSE  TRUE
##   375   125
tmp <- brca.tcga$mut.status # get mutation data
elite.tmp <- getElites(dat       = tmp,
                       method    = "cox",
                       surv.info = surv.info,
                       p.cutoff  = 0.05,
                       elite.num = 100)
## --all sample matched between omics matrix and survival data.
## 7% 13% 20% 27% 33% 40% 47% 53% 60% 67% 73% 80% 87% 93% 100%
dim(elite.tmp$elite.dat) # get 3 elites
## [1] 3 643
table(elite.tmp$unicox$pvalue < 0.05) # 3 mutations have nominal pvalue < 0.05
##
## FALSE  TRUE
##    27     3
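The per-feature Cox screening described above can be sketched with the survival package. The data are simulated and `unicox_pvalues()` is a hypothetical helper; only the column names futime and fustat are taken from the text. This illustrates the general technique and is not presented as the implementation used inside getElites().

```r
library(survival)

# Hypothetical helper: fit one Cox model per feature (row of `mat`)
# and return the p-value of that feature's coefficient.
unicox_pvalues <- function(mat, surv_info) {
  vapply(seq_len(nrow(mat)), function(i) {
    fit <- coxph(Surv(futime, fustat) ~ x,
                 data = data.frame(surv_info, x = mat[i, ]))
    summary(fit)$coefficients[1, "Pr(>|z|)"]
  }, numeric(1))
}

set.seed(7)
n <- 100
# Simulated survival data with the column names required in the text
surv_info <- data.frame(futime = rexp(n, rate = 0.1),
                        fustat = rbinom(n, 1, 0.7))
# Simulated expression matrix: 5 genes x 100 samples
toy <- matrix(rnorm(5 * n), nrow = 5,
              dimnames = list(paste0("gene", 1:5), NULL))

pvals <- unicox_pvalues(toy, surv_info)
names(pvals) <- rownames(toy)

# Keep only features with nominal p < 0.05
elite <- toy[pvals < 0.05, , drop = FALSE]
```

Because every feature is tested separately, the p-values are nominal; with many features a multiple-testing correction such as p.adjust() may be worth considering.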
Filtering by mutation frequency, which is intended specifically for binary data such as a 0/1 matrix:
# scenario 5: filter by mutation frequency
tmp <- brca.tcga$mut.status # get mutation data
rowSums(tmp)
## PIK3CA   TP53    TTN   CDH1  GATA3   MLL3  MUC16 MAP3K1  SYNE1  MUC12    DMD
##    208    186    111     83     58     49     48     38     33     32     31
##  NCOR1    FLG   PTEN   RYR2  USH2A  SPTA1 MAP2K4  MUC5B    NEB   SPEN  MACF1
##     31     30     29     27     27     25     25     24     24     23     23
##   RYR3    DST  HUWE1  HMCN1  CSMD1  OBSCN   APOB  SYNE2
##     23     22     22     22     21     21     21     21
elite.tmp <- getElites(dat       = tmp,
                       method    = "freq", # must set as 'freq'
                       elite.num = 80, # here this refers to the mutation frequency (number of mutated samples)
                       elite.pct = 0.1) # this argument has no effect here
## --method of 'freq' only supports binary omics data (e.g., somatic mutation matrix), and in this manner, elite.pct and elite.num are used to cut frequency.
## elite.num has been provided then discards elite.pct.
rowSums(elite.tmp$elite.dat) # only genes mutated in 80 or more samples are kept
## PIK3CA   TP53    TTN   CDH1
##    208    186    111     83
elite.tmp <- getElites(dat       = tmp,
                       method    = "freq",
                       elite.pct = 0.2)
## --method of 'freq' only supports binary omics data (e.g., somatic mutation matrix), and in this manner, elite.pct and elite.num are used to cut frequency.
## missing elite.num then use elite.pct
rowSums(elite.tmp$elite.dat) # only genes that are mutated in more than 0.2*643=128.6 samples are kept
## PIK3CA   TP53
##    208    186
Determining the optimal number of subtypes
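Samples are clustered according to their molecular profiles, where the molecules are the mRNA, lncRNA, miRNA, methylation matrices, and so on obtained in the previous step.
First, use CPI and Gap-statistics to decide how many subtypes to use: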
optk.brca <- getClustNum(data        = mo.data, # 4 omics data sets
                         is.binary   = c(F,F,F,T), # the first 3 are not binary, the last one is
                         try.N.clust = 2:8, # try 2 to 8 subtypes
                         fig.name    = "CLUSTER NUMBER OF TCGA-BRCA") # name of the saved figure file
## calculating Cluster Prediction Index...
## 5% complete
## 5% complete
## 10% complete
## 10% complete
## 15% complete
## 15% complete
## 20% complete
## 25% complete
## 25% complete
## 30% complete
## 30% complete
## 35% complete
## 35% complete
## 40% complete
## 45% complete
## 45% complete
## 50% complete
## 50% complete
## 55% complete
## 55% complete
## 60% complete
## 65% complete
## 65% complete
## 70% complete
## 70% complete
## 75% complete
## 75% complete
## 80% complete
## 85% complete
## 85% complete
## 90% complete
## 90% complete
## 95% complete
## 95% complete
## 100% complete
## calculating Gap-statistics...
## visualization done...
## --the imputed optimal cluster number is 3 arbitrarily, but it would be better referring to other priori knowledge.
[Figure]
A figure in PDF format is generated automatically in the current working directory.
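The function suggests 3, but considering the PAM50 classification of breast cancer, we choose k = 5, that is, five subtypes.
So the optimal number of subtypes ultimately depends on your own needs and can be adjusted flexibly.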
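For readers unfamiliar with the Gap statistic, the following sketch shows the general idea on simulated single-omics data, using clusGap() and maxSE() from the cluster package with k-means. It is an illustration only: getClustNum() works on multi-omics data and also uses CPI, and this block is not claimed to reproduce it.

```r
library(cluster)

set.seed(123)
# Toy data: three well-separated groups of 30 samples in 2 dimensions
toy <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 6),  ncol = 2),
             matrix(rnorm(60, mean = 12), ncol = 2))

# Gap statistic for k = 1..8 with k-means; B is the number of reference
# data sets, nstart is passed through to kmeans()
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 8, B = 50,
               nstart = 10, verbose = FALSE)

# Pick k with the "first SE max" rule
k_suggested <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
                     method = "firstSEmax")
k_suggested
```

As in the text, a statistically suggested k is only a starting point; prior knowledge (here, the five PAM50 classes) can justify a different choice.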
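Clustering with a single algorithm
Once the number of subtypes is decided, the clustering itself can be run. The package provides a large number of methods, including the familiar non-negative matrix factorization, consensus clustering, and others.
For example, clustering with the Bayesian method iClusterBayes: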
# perform iClusterBayes (may take a while)
iClusterBayes.res <- getiClusterBayes(data        = mo.data,
                                      N.clust     = 5,
                                      type        = c("gaussian","gaussian","gaussian","binomial"),
                                      n.burnin    = 1800,
                                      n.draw      = 1200,
                                      prior.gamma = c(0.5, 0.5, 0.5, 0.5),
                                      sdev        = 0.05,
                                      thin        = 3)
## clustering done...
## feature selection done...
Alternatively, use the unified function and simply choose the method; the two approaches give exactly the same result:
iClusterBayes.res <- getMOIC(data        = mo.data,
                             N.clust     = 5,
                             methodslist = "iClusterBayes", # specify the algorithm
                             type        = c("gaussian","gaussian","gaussian","binomial"), # data type corresponding to the list
                             n.burnin    = 1800,
                             n.draw      = 1200,
                             prior.gamma = c(0.5, 0.5, 0.5, 0.5),
                             sdev        = 0.05,
                             thin        = 3)
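The returned result contains a clust.res object with two columns: clust gives the subtype each sample belongs to, and samID records the corresponding sample name. For algorithms that include a feature selection procedure (such as iClusterBayes, CIMLR, and MoCluster), the result also contains a feat.res object that stores the information from that procedure. For algorithms that involve hierarchical clustering (for example COCA and ConsensusClustering), the corresponding sample dendrogram is returned as clust.dend, which is useful if you want to show it in the heatmap.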
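Running multiple clustering algorithms at once
Several algorithms can be run at the same time and their results then integrated into a final result, which is remarkably powerful: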
# perform multi-omics integrative clustering with the rest of 9 algorithms
moic.res.list <- getMOIC(data        = mo.data,
                         methodslist = list("SNF", "PINSPlus", "NEMO", "COCA", "LRAcluster", "ConsensusClustering", "IntNMF", "CIMLR", "MoCluster"), # 9 algorithms
                         N.clust     = 5,
                         type        = c("gaussian", "gaussian", "gaussian", "binomial"))
## --you choose more than 1 algorithm and all of them shall be run with parameters by default.
## SNF done...
## Clustering method: kmeans
## Perturbation method: noise
## PINSPlus done...
## NEMO done...
## COCA done...
## LRAcluster done...
## end fraction
## clustered
## ConsensusClustering done...
## IntNMF done...
## clustering done...
## feature selection done...
## CIMLR done...
## clustering done...
## feature selection done...
## MoCluster done...
Adding the iClusterBayes result as well gives 10 algorithms in total:
moic.res.list <- append(moic.res.list,
                        list("iClusterBayes" = iClusterBayes.res))
# save the results
save(moic.res.list, file = "moic.res.list.rda")
Integrating multiple clustering results
Borrowing the idea of consensus ensembles, the package integrates the results of multiple clustering algorithms.
A consensus heatmap can be drawn:
load(file = "moic.res.list.rda")
cmoic.brca <- getConsensusMOIC(moic.res.list = moic.res.list,
                               fig.name      = "CONSENSUS HEATMAP",
                               distance      = "euclidean",
                               linkage       = "average")
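The consensus-ensemble idea can be sketched in a few lines of base R: for each algorithm build a sample-by-sample matrix that is 1 when two samples share a cluster, average these matrices, and then cluster the averaged (consensus) matrix. The labels below are made up, and applying Euclidean distance with average linkage to the consensus matrix is an assumption of this sketch that simply mirrors the argument names above; it is not claimed to be the exact procedure inside getConsensusMOIC().

```r
# Hypothetical cluster assignments of 6 samples from 3 algorithms
labels <- list(
  algoA = c(1, 1, 1, 2, 2, 2),
  algoB = c(2, 2, 2, 1, 1, 1),  # same partition as algoA, labels swapped
  algoC = c(1, 1, 2, 2, 2, 2)   # disagrees about sample 3
)
samples <- paste0("S", 1:6)

# Per-algorithm co-clustering matrix: 1 if two samples share a cluster
co_list <- lapply(labels, function(cl) outer(cl, cl, "==") * 1)

# Consensus matrix: fraction of algorithms that put each pair together
consensus <- Reduce(`+`, co_list) / length(co_list)
dimnames(consensus) <- list(samples, samples)

# Hierarchical clustering of the consensus matrix (assumed settings)
hc    <- hclust(dist(consensus, method = "euclidean"), method = "average")
final <- cutree(hc, k = 2)
final
```

Note that the approach is insensitive to how each algorithm names its clusters (algoA and algoB agree completely even though their labels are swapped), which is why co-clustering matrices are averaged rather than the labels themselves.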
[Figure]
The result is saved in the current working directory.
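Checking the quality of the clustering
Besides inspecting the heatmap above, the silhouette criterion can be used to judge clustering quality.
The following explanation comes from the internet:
The silhouette criterion is an evaluation method used in cluster analysis. It measures clustering quality by comparing each data point's distance to the other points in its own cluster with its distance to points in other clusters. It can help determine the best number of clusters and so improve the reliability and accuracy of a cluster analysis.
It is computed as follows. For each data point i, compute the average distance a_i between it and the other points in the same cluster, and the average distance b_i between it and the points in the nearest other cluster. The silhouette coefficient of the point is then defined as:
s(i) = (b_i - a_i) / max(a_i, b_i)
The silhouette coefficient ranges from -1 to 1. A negative value suggests the point may have been assigned to the wrong cluster, while a positive value suggests it has been assigned correctly. The mean silhouette coefficient can be used to evaluate the clustering as a whole, so the goal is to maximize the mean coefficient and thereby find the best number of clusters.
As the number of clusters increases, the mean silhouette coefficient usually rises first and then falls, so we look for the number of clusters at which the mean reaches its maximum. This is usually done by plotting the silhouette coefficient (vertical axis) against the number of clusters (horizontal axis), which gives an intuitive view of clustering quality.
When using the silhouette criterion, keep the following points in mind:
- The silhouette coefficient is usually computed with Euclidean distance or related metrics and may be less suitable for other distance measures.
- Computing it takes a long time, so efficiency matters when handling large data sets.
- It is not the only evaluation metric; a particular clustering problem may call for other metrics.
The result is saved in the current working directory: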
getSilhouette(sil      = cmoic.brca$sil, # a sil object returned by getConsensusMOIC()
              fig.path = getwd(),
              fig.name = "SILHOUETTE",
              height   = 5.5,
              width    = 5)
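To make the definition s(i) = (b_i - a_i) / max(a_i, b_i) concrete, here is a small base-R implementation on toy data. `silhouette_width()` is a hypothetical helper written for this illustration (the cluster package offers a full implementation as silhouette()); it assumes Euclidean distance and at least two clusters.

```r
# Hypothetical helper: silhouette width of every point,
# s(i) = (b_i - a_i) / max(a_i, b_i), using Euclidean distance.
silhouette_width <- function(x, cl) {
  d <- as.matrix(dist(x))
  vapply(seq_len(nrow(d)), function(i) {
    same <- cl == cl[i]
    same[i] <- FALSE
    if (!any(same)) return(0)  # singleton cluster: s(i) is set to 0
    a <- mean(d[i, same])      # mean distance within the own cluster
    # mean distance to each other cluster; b is the smallest of these
    b <- min(vapply(setdiff(unique(cl), cl[i]),
                    function(k) mean(d[i, cl == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

# Two tight, well-separated groups of three points each
x <- matrix(c(0, 0,  0, 1,  1, 0,
              10, 10,  10, 11,  11, 10), ncol = 2, byrow = TRUE)
good <- c(1, 1, 1, 2, 2, 2)  # matches the true grouping
bad  <- c(1, 2, 1, 2, 1, 2)  # mixes the two groups

mean(silhouette_width(x, good))  # close to 1
mean(silhouette_width(x, bad))   # much lower
```

A labeling that matches the structure of the data gives a mean silhouette near 1, while a labeling that mixes the groups gives a much lower value, which is the property used to judge clustering quality.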
[Figure]
## png
##   2
Multi-omics heatmap
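After clustering, the natural next step is to draw a heatmap for each omics data set to show how the molecules behave in the different subtypes.
Some preparation is needed first.
- Convert the methylation β-value matrix to an M-value matrix; the package author recommends this because it displays better.
- Standardize the data. This is usually done before drawing a heatmap and is performed with scale. Values can also be truncated to a range such as [-2, 2], so that anything above 2 is shown as 2 and anything below -2 is shown as -2.
# convert the β-value matrix to an M-value matrix
indata <- mo.data
indata$meth.beta <- log2(indata$meth.beta / (1 - indata$meth.beta))
# standardize the data
plotdata <- getStdiz(data       = indata,
                     halfwidth  = c(2,2,2,NA), # no truncation for mutation
                     centerFlag = c(T,T,T,F), # no center for mutation
                     scaleFlag  = c(T,T,T,F)) # no scale for mutation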
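The two preparation steps can be illustrated on a toy matrix in base R. The M-value formula is the one used above; `standardize_clip()` is a hypothetical helper, and standardizing each feature (row) across samples before truncation is an assumption of this sketch rather than a statement about how getStdiz() is written.

```r
set.seed(2)
# Toy methylation beta values: 3 probes x 4 samples, strictly inside (0, 1)
beta <- matrix(runif(12, min = 0.05, max = 0.95), nrow = 3,
               dimnames = list(paste0("cg", 1:3), paste0("S", 1:4)))

# Beta value -> M value (logit on the log2 scale); beta = 0.5 maps to M = 0
m_value <- log2(beta / (1 - beta))

# Hypothetical helper: z-score each row, then clip to [-halfwidth, halfwidth]
standardize_clip <- function(mat, halfwidth = 2) {
  z <- t(scale(t(mat), center = TRUE, scale = TRUE))
  z[z >  halfwidth] <-  halfwidth
  z[z < -halfwidth] <- -halfwidth
  z
}

# halfwidth = 1 is used here only so that clipping is visible on tiny data
z <- standardize_clip(m_value, halfwidth = 1)
range(z)
```

Truncation keeps a few extreme values from dominating the color scale of the heatmap, which is why it is skipped for the binary mutation matrix (halfwidth = NA above).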
Here we use the iClusterBayes result for the demonstration. First extract the feature results for each omics layer, then pick the top 10 features from each layer to label in the heatmap:
feat <- iClusterBayes.res$feat.res
feat1 <- feat[which(feat$dataset == "mRNA.expr"),][1:10,"feature"]
feat2 <- feat[which(feat$dataset == "lncRNA.expr"),][1:10,"feature"]
feat3 <- feat[which(feat$dataset == "meth.beta"),][1:10,"feature"]
feat4 <- feat[which(feat$dataset == "mut.status"),][1:10,"feature"]
annRow <- list(feat1, feat2, feat3, feat4)
Now simply draw the plot. It is built on ComplexHeatmap under the hood, with many steps simplified for you, and the result is saved automatically in the current working directory. The default MOVICS figures look good, possibly better than what you would draw yourself.
# custom colors for the heatmap of each omics layer (optional)
mRNA.col   <- c("#00FF00", "#008000", "#000000", "#800000", "#FF0000")
lncRNA.col <- c("#6699CC", "white"  , "#FF3C38")
meth.col   <- c("#0074FE", "#96EBF9", "#FEE900", "#F00003")
mut.col    <- c("grey90" , "black")
col.list   <- list(mRNA.col, lncRNA.col, meth.col, mut.col)
# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = iClusterBayes.res$clust.res, # cluster results
             clust.dend    = NULL, # no dendrogram
             show.rownames = c(F,F,F,F), # specify for each omics data
             show.colnames = FALSE, # show no sample names
             annRow        = annRow, # mark selected features
             color         = col.list,
             annCol        = NULL, # no annotation for samples
             annColors     = NULL, # no annotation color
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF ICLUSTERBAYES")
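[Figure]
The figure above shows the clustering result from the Bayesian method. You can pick any other algorithm as well; after all, we have results from 10 of them.
For example, to display the COCA result, the usage is exactly the same and the result is saved automatically: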
# comprehensive heatmap (may take a while)
getMoHeatmap(data        = plotdata,
             row.title   = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary   = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res   = moic.res.list$COCA$clust.res, # cluster results
             clust.dend  = moic.res.list$COCA$clust.dend, # show dendrogram for samples
             color       = col.list,
             width       = 10, # width of each subheatmap
             height      = 5, # height of each subheatmap
             fig.name    = "COMPREHENSIVE HEATMAP OF COCA")
[Figure]
If you want to display several clinical variables, just add them directly. Note that custom colors are defined with the help of circlize:
# extract PAM50, pathologic stage and age for sample annotation
annCol <- surv.info[,c("PAM50", "pstage", "age"), drop = FALSE]
# generate corresponding colors for sample annotation
annColors <- list(age    = circlize::colorRamp2(breaks = c(min(annCol$age),
                                                           median(annCol$age),
                                                           max(annCol$age)),
                                                colors = c("#0000AA", "#555555", "#AAAA00")),
                  PAM50  = c("Basal"  = "blue",
                             "Her2"   = "red",
                             "LumA"   = "yellow",
                             "LumB"   = "green",
                             "Normal" = "black"),
                  pstage = c("T1" = "green",
                             "T2" = "blue",
                             "T3" = "red",
                             "T4" = "yellow",
                             "TX" = "black"))
# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = cmoic.brca$clust.res, # consensusMOIC results
             clust.dend    = NULL, # show no dendrogram for samples
             show.rownames = c(F,F,F,F), # specify for each omics data
             show.colnames = FALSE, # show no sample names
             show.row.dend = c(F,F,F,F), # show no dendrogram for features
             annRow        = NULL, # no selected features
             color         = col.list,
             annCol        = annCol, # annotation for samples
             annColors     = annColors, # annotation color
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF CONSENSUSMOIC")
[Figure]
Pretty powerful, isn't it?
That concludes the first part. Next comes exploring and comparing the different subtypes.