此次給繪制的圖來自文獻(xiàn)《Triple DMARD treatment in early rheumatoid arthritis modulates synovial T cell activation and plasmablast/plasma cell differentiation pathways 》,是2017發(fā)表的,使用了他們團(tuán)隊(duì)自己2016發(fā)表的轉(zhuǎn)錄組測序數(shù)據(jù),所以其實(shí)是有兩個(gè)轉(zhuǎn)錄組測序數(shù)據(jù)集。
文章中數(shù)據(jù)情況如下:Data are available publicly through NCBI GEO database
(Healthy samples are under acces sion number GSE89408.
RA samples are under accession number GSE97165)
GSE89408 :有28個(gè)正常樣本,GSE97165 有19個(gè) RA 治療前與 19個(gè)RA治療后的樣本。
復(fù)現(xiàn)的圖:這個(gè)圖主要展示了 A: 治療后 與 治療前的差異火山圖,B: 治療前 與正常對照 差異基因在三組樣本中的表達(dá)熱圖,以及 C&D :一些 marker 基因在三個(gè)組別中的箱線圖+抖動(dòng)散點(diǎn)+顯著性比較 。
Fig 2. Genes selectively expressed by lymphocytes are decreased after 6 months of tDMARD treatment, while other genes remain elevated compared to normal joints 首先,整理差異分析所需要的數(shù)據(jù)這里的兩個(gè)數(shù)據(jù)集中的count數(shù)據(jù)作者使用的同一套流程與參數(shù)進(jìn)行分析,這里直接合并在一起進(jìn)行后續(xù)分析。 (兩個(gè)數(shù)據(jù)集都是同一個(gè)科研團(tuán)隊(duì)的,所以理論上可以先不考慮里面的批次效應(yīng)哈)
如果實(shí)在不放心,可以選擇下載fq數(shù)據(jù)放在一起定量,拿到表達(dá)矩陣走后面的分析。
rm(list = ls())#清空當(dāng)前的工作環(huán)境 options(stringsAsFactors = F)#不以因子變量讀取 options(scipen = 20)#不以科學(xué)計(jì)數(shù)法顯示 library(data.table) library(tinyarray)####################### 讀取 正常樣本:PRJNA352076、GSE89408 acc <- "GSE89408" data1 <- data.table::fread("GSE89408_GEO_count_matrix_rename.txt.gz" , data.table = F) data1 <- data1[!duplicated(data1$V1 ), ]####################### RA樣本:PRJNA380832、GSE97165 acc <- "GSE97165" data2 <- data.table::fread("GSE97165_GEO_count_matrix.txt.gz" , data.table = F) data2 <- data2[!duplicated(data2$V1 ), ] mat <- merge(data1, data2, by="V1" ) colnames(mat) symbol_matrix <- mat[,c(grep("normal" ,colnames(mat)),grep("RA_pre" ,colnames(mat)),grep("RA_post" ,colnames(mat))) ] colnames(symbol_matrix) rownames(symbol_matrix) <- mat[,1] symbol_matrix <- floor(symbol_matrix) symbol_matrix[1:4,1:4]#過濾低表達(dá) keep_feature <- rowSums (symbol_matrix > 0) > 0.5*ncol(symbol_matrix) ;table(keep_feature) symbol_matrix <- symbol_matrix[keep_feature, ] symbol_matrix[1:4,1:4] group_list <- stringr::str_split(colnames(symbol_matrix), pattern = "_[1-9]" , n = 2, simplify = T)[,1] group_list <- gsub("_tissue" , "" , group_list) group_list <- gsub("RA_pre" , "RApre" , group_list) group_list <- gsub("RA_post" , "RApost" , group_list) group_list table(group_list)# group_list # normal RApre RApost # 28 19 19 colnames(symbol_matrix) group_list table(group_list)#dat <- log10(edgeR::cpm(symbol_matrix)+1) dat <- log2(edgeR::cpm(symbol_matrix)+1) save(dat, symbol_matrix,group_list,file = 'step1-output.Rdata' )
然后是兩次差異分析文獻(xiàn)中使用的是 limma 算法,我們也盡量復(fù)現(xiàn)同樣的哈,其中,疾病和對照肯定是差異巨大,但是治療前后就很難說了因?yàn)閺奈墨I(xiàn)里面的pca來看本來就是分組內(nèi)的差異并沒有顯著的小于組間差異!
(1)RA 治療前 vs 正常對照rm(list = ls()) ## 魔幻操作,一鍵清空~ options(stringsAsFactors = F) library(AnnoProbe) library(GEOquery) library(ggplot2) library(ggstatsplot) library(patchwork) library(reshape2) library(stringr) getOption('timeout' ) options(timeout=10000) load('./step1-output.Rdata' ) symbol_matrix[1:4,1:4] group_list table(group_list) gse_number <- 'RA_vs_Normal' gse_number dir.create(gse_number) setwd( gse_number ) getwd() kp <- group_list %in % c('normal' ,'RApre' );table(kp) symbol_matrix <- symbol_matrix[,kp] group_list <- group_list[kp] table(group_list) group_list <- factor(group_list,levels = c('normal' ,'RApre' )) group_list save(symbol_matrix,group_list,file = 'symbol_matrix.Rdata' )# 運(yùn)行 source ('../scripts/step3-deg-limma-voom.R' )
(2)RA 治療后 vs RA 治療前rm(list = ls()) ## 魔幻操作,一鍵清空~ options(stringsAsFactors = F) library(AnnoProbe) library(GEOquery) library(ggplot2) library(ggstatsplot) library(patchwork) library(reshape2) library(stringr) getOption('timeout' ) options(timeout=10000) load('./step1-output.Rdata' ) symbol_matrix[1:4,1:4] group_list table(group_list) gse_number <- 'RApost_vs_RApre' gse_number dir.create(gse_number) setwd( gse_number ) getwd() kp=group_list %in % c('RApre' ,'RApost' );table(kp) symbol_matrix=symbol_matrix[,kp] group_list=group_list[kp] table(group_list) group_list = factor(group_list,levels = c('RApre' ,'RApost' )) group_list save(symbol_matrix,group_list,file = 'symbol_matrix.Rdata' )# 運(yùn)行 source ('../scripts/step3-deg-limma-voom.R' )
source的 step3-deg-limma-voom.R 代碼:rm(list = ls()) options(stringsAsFactors = F) library(ggplot2) library(ggstatsplot) library(ggsci) library(RColorBrewer) library(patchwork) library(ggplotify) library(limma) library(edgeR) library(DESeq2) load(file = 'symbol_matrix.Rdata' ) symbol_matrix[1:4,1:4] exprSet = symbol_matrix design <- model.matrix(~0+factor(group_list)) colnames(design)=levels(factor(group_list)) rownames(design)=colnames(exprSet) design dge <- DGEList(counts=exprSet) dge <- calcNormFactors(dge) logCPM <- cpm(dge, log =TRUE, prior.count=3) v <- voom(dge,design,plot=TRUE, normalize="quantile" ) fit <- lmFit(v, design) group_list g1=levels(group_list)[1] g2=levels(group_list)[2] con=paste0(g2,'-' ,g1) cat(con) cont.matrix=makeContrasts(contrasts=c(con),levels = design) fit2=contrasts.fit(fit,cont.matrix) fit2=eBayes(fit2) tempOutput = topTable(fit2, coef=con, n=Inf) DEG_limma_voom = na.omit(tempOutput) head(DEG_limma_voom) save(DEG_limma_voom, file = 'DEG_limma_voom.Rdata' )
繪制圖2A:治療后 與 治療前的差異火山圖load( file = 'DEG_limma_voom.Rdata' ) colnames(DEG_limma_voom) head(DEG_limma_voom) nrDEG=DEG_limma_voom[,c("logFC" , "P.Value" )] head(nrDEG) df=nrDEG logFC_t <- 1 pvalue_t <- 0.05 df$g =ifelse(df$P .Value > pvalue_t,'stable' , #if 判斷:如果這一基因的P.Value>0.01,則為stable基因 ifelse( df$logFC > logFC_t,'up' , #接上句else 否則:接下來開始判斷那些P.Value<0.01的基因,再if 判斷:如果logFC >1.5,則為up(上調(diào))基因 ifelse( df$logFC < -logFC_t,'down' ,'stable' ) )#接上句else 否則:接下來開始判斷那些logFC <1.5 的基因,再if 判斷:如果logFC <1.5,則為down(下調(diào))基因,否則為stable基因 ) table(df$g ) df$name =rownames(df) head(df) this_tile <- paste0('Threshold of logFC is ' ,round(logFC_t,3), '\nThe number of up gene is ' ,nrow(df[df$g == 'up' ,]) , '\nThe number of down gene is ' ,nrow(df[df$g == 'down' ,]) ) this_tile#~~~火山圖p5~~~ target_gene= c('MS4A1' ) for_label <- df %>% dplyr::filter(df$name %in % target_gene) head(for_label) p5 <- ggplot(data = df, aes(x = logFC, y = -log10(P.Value))) + geom_point(alpha=0.6, size=1.5, aes(color=g)) + geom_point(size = 3, shape = 1, data = for_label) + ggrepel::geom_label_repel(aes(label = name),data = for_label, max.overlaps = getOption("ggrepel.max.overlaps" , default = 20), color="black" ) + ylab("-log10(Pvalue)" )+ scale_color_manual(values=c("#34bfb5" , "#828586" ,"#ff6633" ))+ geom_vline(xintercept=c(-logFC_t,logFC_t),lty=4,col="grey" ,lwd=0.8) + geom_hline(yintercept = -log10(pvalue_t),lty=4,col="grey" ,lwd=0.8) + #coord_flip()+ #xlim(-5, 5)+ theme_bw()+ ggtitle(this_tile )+ theme(panel.grid = element_blank(), plot.title = element_text(size=8,hjust = 0.5), legend.title = element_blank(), legend.text = element_text(size=8)) p5
結(jié)果如下:比文獻(xiàn)中那個(gè)火山圖好看,就用我們這個(gè)進(jìn)行后續(xù)的拼圖好了。
繪制圖2B:治療前 與 正常對照的差異基因熱圖rm(list = ls()) ## 魔幻操作,一鍵清空~ options(stringsAsFactors = F) library(ggplot2) library(ggstatsplot) library(patchwork) library(reshape2) library(stringr) library(ggsignif) getOption('timeout' ) options(timeout=10000) setwd("RA_vs_Normal/" )# 讀取差異結(jié)果 load("DEG_limma_voom.Rdata" ) head(DEG_limma_voom)# 讀取所有樣本表達(dá)矩陣 load("../step1-output.Rdata" ) ls()###################################### Fig2.B 熱圖 # 提取所有差異基因的表達(dá)矩陣 DEG_limma_voom$name <- rownames(DEG_limma_voom) DEG_limma_voom$g <- "stable" DEG_limma_voom$g [ DEG_limma_voom$logFC >1 & DEG_limma_voom$adj .P.Val < 0.05 ] <- "up" DEG_limma_voom$g [ DEG_limma_voom$logFC < -1 & DEG_limma_voom$adj .P.Val < 0.05 ] <- "down" table(DEG_limma_voom$g ) head(DEG_limma_voom) diff <- DEG_limma_voom[DEG_limma_voom$g !="stable" , "name" ] head(diff) length(diff) library(pheatmap) dat_plot <- dat[diff,] head(dat_plot) anno <- data.frame(group=group_list) rownames(anno) <- colnames(dat) head(anno) ann_color <- list(group = c(normal="#ed1a22" ,RApre="#00a651" ,RApost="#652b90" )) library(RColorBrewer) color <- colorRampPalette(c('#1f67ad' ,'white' ,'#bb2f34' ))(100) pheatmap(dat_plot, scale = "row" ,show_rownames = F,show_colnames = F, color = color,annotation_col = anno, annotation_colors = ann_color, cluster_cols = F, border_color = "black" ,gaps_col=c(28,47))
結(jié)果如下:文獻(xiàn)中那個(gè)圖感覺只放了 上調(diào)基因?
繪制圖2C與D:marker 基因 箱線圖首先,讓kimi(https://kimi./) 幫我拿到圖片中的基因并整理成一個(gè)R語言向量,再也不用一個(gè)個(gè)手動(dòng)從文獻(xiàn)里面摳出來了:
圖C 的比較組別為:list(c("RApre", "normal"), c("RApost", "RApre"))
###################################### Fig2.C 熱圖 # 提取所有差異基因的表達(dá)矩陣 genes1 <- c("CD3D" , "MS4A1" , "CTLA4" , "CD19" ) group_list <- factor(group_list,levels = c("normal" ,"RApre" ,"RApost" )) fig2c <- list()for (i in 1:length(genes1)) { #i <- 4 box_dat <- data.frame(exp=dat[genes1[i],], group=group_list) head(box_dat) max_pos <- max(box_dat$exp ) # 繪制小提琴圖和顯著性標(biāo)記 B <- ggplot(data=box_dat,aes(x=group,y=exp,colour = group)) + geom_boxplot(mapping=aes(x=group,y=exp,colour=group), size=0.6, width = 0.5) + # 箱線圖 geom_jitter(mapping=aes(x=group,y=exp,colour = group),size=1.5) + # 散點(diǎn) scale_color_manual(limits=c("normal" ,"RApre" ,"RApost" ), values =c( "#ed1a22" ,"#00a651" ,"#652b90" ) ) + # 顏色 geom_signif(mapping=aes(x=group,y=exp), # 不同組別的顯著性 comparisons = list(c("RApre" , "normal" ), c("RApost" , "RApre" )), map_signif_level=T, # T顯示顯著性,F(xiàn)顯示p value tip_length=c(0,0,0,0,0,0,0,0,0,0,0,0), # 修改顯著性線兩端的長短 y_position = c(max_pos, max_pos*1.02), # 設(shè)置顯著性線的位置高度 size=0.8, # 修改線的粗細(xì) textsize = 4, # 修改顯著性標(biāo)記的大小 test = "t.test" ) + # 檢驗(yàn)的類型,可以更改 ylim(c(0,max_pos*1.12)) + theme_classic() + #設(shè)置白色背景 labs(x="" ,y="" ) + # 添加標(biāo)題,x軸,y軸標(biāo)簽 ggtitle(label = genes1[i]) + theme(plot.title = element_text(hjust = 0.5), axis.line=element_line(linetype=1,color="black" ,size=0.9)) B fig2c[[i]] <- B } p_c <- wrap_plots(fig2c,guides="collect" ) p_c
結(jié)果如下:
圖D的比較組別為:list(c("RApre", "normal"), c("RApost", "normal"))
########################################################################## genes2 <- c("IL10" , "CLEC12A" , "AURKA" , "MMP13" ,"CLEC2B" , "CD58" ) fig2d <- list()for (i in 1:length(genes2)) { #i <- 4 box_dat <- data.frame(exp=dat[genes2[i],], group=group_list) head(box_dat) max_pos <- max(box_dat$exp ) # 繪制小提琴圖和顯著性標(biāo)記 B <- ggplot(data=box_dat,aes(x=group,y=exp,colour = group)) + geom_boxplot(mapping=aes(x=group,y=exp,colour=group), size=0.6, width = 0.5) + # 箱線圖 geom_jitter(mapping=aes(x=group,y=exp,colour = group),size=1.5) + # 散點(diǎn) scale_color_manual(limits=c("normal" ,"RApre" ,"RApost" ), values =c( "#ed1a22" ,"#00a651" ,"#652b90" ) ) + # 顏色 geom_signif(mapping=aes(x=group,y=exp), # 不同組別的顯著性 comparisons = list(c("RApre" , "normal" ), c("RApost" , "normal" )), map_signif_level=T, # T顯示顯著性,F(xiàn)顯示p value tip_length=c(0,0,0,0,0,0,0,0,0,0,0,0), # 修改顯著性線兩端的長短 y_position = c(max_pos*1.04,max_pos), # 設(shè)置顯著性線的位置高度 size=0.8, # 修改線的粗細(xì) textsize = 4, # 修改顯著性標(biāo)記的大小 test = "t.test" ) + # 檢驗(yàn)的類型,可以更改 ylim(c(0,max_pos*1.12)) + theme_classic() + #設(shè)置白色背景 labs(x="" ,y="" ) + # 添加標(biāo)題,x軸,y軸標(biāo)簽 ggtitle(label = genes2[i]) + theme(plot.title = element_text(hjust = 0.5), axis.line=element_line(linetype=1,color="black" ,size=0.9)) B fig2d[[i]] <- B } p_d <- wrap_plots(fig2d,guides="collect" ,ncol=3) p_d
結(jié)果如下:
拼圖環(huán)節(jié)將以上峰圖 保存成pdf,用ai進(jìn)行處理,得到下面完整的fig2:
文末友情宣傳強(qiáng)烈建議你推薦給身邊的博士后以及年輕生物學(xué)PI ,多一點(diǎn)數(shù)據(jù)認(rèn)知,讓他們的科研上一個(gè)臺(tái)階: