一文搞定單細胞基因集評分

cmu小孩 2022-10-17 發(fā)布于江蘇

展開全文

前面我們已經學習過幾種基因集打分/富集方法，但是每個都是獨立講解，在不同的數據集進行演示，難以領會方法的異同點。這里我給技術支持安排了一個小作業(yè)，以同一個數據集、同一個基因集為例，進行AddModuleScore、ssGSEA、GSVA、PercentageFeatureSet、AUCell五種基因集打分方法的橫向比較，一起去學習吧~

下面這五個方法的箱線圖幾乎是一個模子刻出來的，只是PercentageFeatureSet的算法會讓數值更離散些：

1. 準備工作

1.1 測試數據集

下面預處理的步驟看不懂的同學可以先補一下基礎知識:

手把手教你做單細胞測序數據分析（三）——單樣本分析

set.seed(123456)dir.create('Results')

library(Seurat)

## Attaching SeuratObject

library(ggplot2)library(png)sce = Read10X('./pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19/')

library(dplyr)

## ## Attaching package: 'dplyr'## The following objects are masked from 'package:stats':## ##     filter, lag## The following objects are masked from 'package:base':## ##     intersect, setdiff, setequal, unio

sce_final <- sce %>% CreateSeuratObject(min.cells = 3, min.features = 200) %>% PercentageFeatureSet(pattern = '^MT',col.name = 'percent.mt')

select_c <- WhichCells(sce_final,expression = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)sce_final = subset(sce_final,cells = select_c)dim(sce_final)

## [1] 13714  2626

# [1] 13714 2626

sce_final <- sce_final %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA() %>% FindNeighbors(dims = 1:10) %>% FindClusters(resolution = 0.5) %>% RunTSNE(dims = 1:10) %>% RunUMAP(dims = 1:10)

## Centering and scaling data matrix## PC_ 1 ## Positive:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW ##     CD247, GIMAP5, AQP3, CCL5, SELL, GZMA, TRAF3IP3, CST7, MAL, ITM2A ##     HOPX, GIMAP7, MYC, BEX2, GZMK, ETS1, ZAP70, TNFAIP8, RIC3, LYAR ## Negative:  CST3, TYROBP, LST1, AIF1, FTL, LYZ, FTH1, FCN1, S100A9, TYMP ##     FCER1G, CFD, LGALS1, S100A8, LGALS2, CTSS, SERPINA1, SPI1, IFITM3, CFP ##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD ## PC_ 2 ## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, CD74, HLA-DRB1 ##     HLA-DPB1, HLA-DMA, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB ##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 ## Negative:  NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, GNLY, CTSW, B2M, SPON2 ##     GZMH, CCL4, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX ##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, ACTB ## PC_ 3 ## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1, HLA-DPB1, CD74, MS4A1, HLA-DRB1, HLA-DRA ##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, CD37, HLA-DMA, HVCN1, FCRLA, MALAT1 ##     IRF8, PLAC8, BLNK, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B ## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, HIST1H2AC ##     CLU, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 ##     NGFRAP1, F13A1, RUFY1, SEPT5, TSC22D1, MPP1, CMTM5, MYL9, RP11-367G6.3, GP1BA ## PC_ 4 ## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1, CD74, HLA-DPB1, HLA-DPA1, HLA-DRB1, TCL1A ##     HLA-DQA2, HLA-DRA, LINC00926, HLA-DRB5, HIST1H2AC, PF4, SDPR, GZMB, PPBP, GNG11 ##     HLA-DMA, HLA-DMB, HVCN1, SPARC, FCRLA, GP9, FGFBP2, PTCRA, CA2, CD9 ## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL ##     AQP3, FYB, CD2, CD14, GIMAP4, LGALS2, ANXA1, CD27, RBP7, GIMAP5 ##     FCN1, LYZ, S100A11, S100A12, MS4A6A, FOLR3, TMSB4X, TRABD2A, AIF1, NELL2 ## PC_ 5 ## Positive:  GZMB, S100A8, NKG7, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 ##     S100A9, GZMH, LGALS2, CCL3, XCL2, CTSW, CD14, S100A12, CLIC3, RBP7 ##     CCL5, MS4A6A, FOLR3, GSTP1, TYROBP, TTC38, AKR1C3, IGFBP7, XCL1, HOPX ## Negative:  LTB, IL7R, VIM, CKB, AQP3, MS4A7, CYTIP, RP11-290F20.3, HMOX1, PTGES3 ##     CD27, HN1, MAL, LILRB2, GDI2, CORO1B, CD2, ANXA5, TUBA1B, FAM110A ##     TRADD, PPA1, ATP1A1, CCDC109B, ABRACL, CTD-2006K23.1, FYB, WARS, TRAF3IP3, IFITM2## Computing nearest neighbor graph## Computing SNN## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck## ## Number of nodes: 2626## Number of edges: 95233## ## Running Louvain algorithm...## Maximum modularity in 10 random starts: 0.8722## Number of communities: 9## Elapsed time: 0 seconds## 11:34:56 UMAP embedding parameters a = 0.9922 b = 1.112## 11:34:56 Read 2626 rows and found 10 numeric columns## 11:34:56 Using Annoy for neighbor search, n_neighbors = 30## 11:34:56 Building Annoy index with metric = cosine, n_trees = 50## 0%   10   20   30   40   50   60   70   80   90   100%## [----|----|----|----|----|----|----|----|----|----|## **************************************************|## 11:34:56 Writing NN index file to temp file C:\Users\53513\AppData\Local\Temp\RtmpiIxgPz\file39606f2345ad## 11:34:56 Searching Annoy index using 1 thread, search_k = 3000## 11:34:57 Annoy recall = 100%## 11:34:57 Commencing smooth kNN distance calibration using 1 thread## 11:34:58 Initializing from normalized Laplacian   noise## 11:34:58 Commencing optimization for 500 epochs, with 104830 positive edges## 11:35:06 Optimization finished

new.cluster.ids <- c('Naive CD4 T', 'CD14_Mono', 'Memory CD4 T', 'B', 'CD8 T', 'FCGR3A_Mono', 'NK', 'DC', 'Platelet')names(new.cluster.ids) <- levels(sce_final)sce_final <- RenameIdents(sce_final, new.cluster.ids)

sce_final$celltype <- Idents(sce_final)

sce_final$celltype <- factor(sce_final$celltype,levels = c('Naive CD4 T','Memory CD4 T','CD8 T','B','NK','CD14_Mono','FCGR3A_Mono','DC', 'Platelet'))

my9color <- c('#5470c6','#91cc75','#fac858','#ee6666','#73c0de','#3ba272','#fc8452','#9a60b4','#ea7ccc')

p <- DimPlot(sce_final, reduction = 'umap', label = TRUE, pt.size = 0.5,label.size = 5,cols = my9color)   NoLegend()   NoAxes()

ggsave('./Results/umap.png',width = 5,height = 5)

1.2 基因集

選擇的是MSigdb數據庫中C7: immunologic signature gene sets中的GSE10325_CD4_TCELL_VS_BCELL_UP基因集（200個基因）

library(readxl)CD4_TCELL_VS_BCELL_UP <- openxlsx::read.xlsx('./CD4_TCELL_VS_BCELL_UP.xlsx')unique(CD4_TCELL_VS_BCELL_UP$Gene)

## [1] 'AAK1' 'ACVR1' 'AK5' 'AKTIP' 'ANXA1' 'AP3M2' ## [7] 'APBA2' 'AQP3' 'ARL4C' 'ARNTL' 'ARRB1' 'ATP10A' ## [13] 'ATP1A1' 'ATP2B4' 'ATP6V0E2' 'BCL11B' 'BEX3' 'BLMH' ## [19] 'BNIP3' 'C1orf216' 'CALM1' 'CAMK2G' 'CCL5' 'CCND2' ## [25] 'CCND3' 'CCR2' 'CCR5' 'CCSER2' 'CD2' 'CD247' ## [31] 'CD28' 'CD3G' 'CD6' 'CDR2' 'CERK' 'CHST7' ## [37] 'CISH' 'CLUAP1' 'CORO2A' 'CPD' 'CYB561' 'CYLD' ## [43] 'DEXI' 'DNMT1' 'DOCK9' 'DPP4' 'DPYD' 'DUSP7' ## [49] 'DYRK2' 'FAM102A' 'FBXL8' 'FHL1' 'FKBP5' 'FLT3LG' ## [55] 'FYB1' 'FYN' 'GATA3' 'GIMAP4' 'GIMAP5' 'GIMAP6' ## [61] 'GNAQ' 'GOLGA7' 'GPD1L' 'GSTK1' 'GZMA' 'GZMK' ## [67] 'HDAC4' 'HPGD' 'ICOS' 'ID2' 'IFITM1' 'IGF2R' ## [73] 'IL10RA' 'IL11RA' 'IL2RB' 'INPP4A' 'IPCEF1' 'ITGA6' ## [79] 'ITK' 'ITM2A' 'ITPKB' 'KLRB1' 'KLRG1' 'LCK' ## [85] 'LDHB' 'LDLRAP1' 'LEF1' 'LEPROTL1' 'LGALS8' 'LIMA1' ## [91] 'LINC00623' 'LPAR6' 'LRIG1' 'MAF' 'MAL' 'MGAT4A' ## [97] 'MICAL2' 'MLLT3' 'MT1F' 'MT1X' 'MYBL1' 'NACC2' ## [103] 'NCALD' 'NELL2' 'NOSIP' 'OPTN' 'P2RX4' 'PDE3B' ## [109] 'PDE4D' 'PHACTR2' 'PIK3IP1' 'PIK3R1' 'PIM1' 'PIP4K2A' ## [115] 'PLAAT4' 'PLCG1' 'PLCL1' 'PLIN2' 'PLK3' 'PRDX2' ## [121] 'PRF1' 'PRKACB' 'PRKCA' 'PRKCH' 'PRKCQ' 'PRMT2' ## [127] 'PRORP' 'PTGER2' 'PTPN13' 'PTPN4' 'PXN' 'RAB33A' ## [133] 'RASGRP1' 'RCAN3' 'RGCC' 'RGS10' 'RIC8B' 'RNF144A' ## [139] 'RORA' 'RPS6KA3' 'RUNX1' 'S100A4' 'SAMHD1' 'SELPLG' ## [145] 'SEMA4C' 'SEMA4D' 'SERINC5' 'SFI1' 'SH2D1A' 'SH2D2A' ## [151] 'SIRPG' 'SKAP1' 'SLC39A8' 'SLC9A3R1' 'SOCS2' 'SPATS2L' ## [157] 'SRGN' 'STAT4' 'STAT5B' 'STOM' 'STXBP1' 'SVIL' ## [163] 'TBC1D2B' 'TBC1D4' 'TGFBR3' 'TIAM1' 'TIMP1' 'TKTL1' ## [169] 'TMEM123' 'TMEM30B' 'TMEM43' 'TNFAIP3' 'TNFRSF25' 'TNFSF10' ## [175] 'TNFSF14' 'TNFSF8' 'TNIK' 'TOB1' 'TRAT1' 'TRAV8-3' ## [181] 'TRBC1' 'TRBV10-2' 'TRDV2' 'TRPV2' 'TTLL1' 'TXK' ## [187] 'UBASH3A' 'UPP1' 'URI1' 'VAMP5' 'VIM' 'VPS26C' ## [193] 'WDR1' 'WDR82' 'WWP1' 'ZAP70' 'ZMYM6' 'ZNF277'

2. 不同打分方法橫向比較

2.1 PercentageFeatureSet

2.1.1 僅保留基因集中存在于seurat對象中的基因評分

a <- CD4_TCELL_VS_BCELL_UP[CD4_TCELL_VS_BCELL_UP$Gene %in% rownames(sce_final),]sce <- PercentageFeatureSet(sce_final,features = a,col.name = 'CD4_TCELL_VS_BCELL_UP_score')
colnames(sce@meta.data)

## [1] 'orig.ident' 'nCount_RNA' ## [3] 'nFeature_RNA' 'percent.mt' ## [5] 'RNA_snn_res.0.5' 'seurat_clusters' ## [7] 'celltype' 'CD4_TCELL_VS_BCELL_UP_score'

# 可以看到sce對象的meta.data中新增了一列名為CD4_TCELL_VS_BCELL_UP_score的評分

2.1.2 可視化各個細胞類型CD4_TCELL_VS_BCELL_UP_score基因集的評分

my_comparisons <- list(c('Naive CD4 T','B'),c('Memory CD4 T','B'))library(ggpubr)p.percentage <- ggviolin(sce@meta.data, x = 'celltype', y = 'CD4_TCELL_VS_BCELL_UP_score', color = 'celltype',add = 'mean_sd',fill = 'celltype', add.params = list(color = 'black')) stat_compare_means(comparisons = my_comparisons,label = 'p.signif') scale_color_manual(values = my9color) scale_fill_manual(values = my9color) theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1)) NoLegend() labs(x = '')ggsave('./Results/percentagefeatureset_score_vln.png',width = 8,height = 4.5)p2 <- FeaturePlot(sce,'CD4_TCELL_VS_BCELL_UP_score')ggsave('./Results/percentagefeatureset_score_umap.png',width = 5.5,height = 5)

可以看到兩類T細胞的基因集評分評分都顯著高于B細胞。

2.2 AddModuleScore

AddModuleScore函數計算出的結果與PercentageFeatureSet函數會有什么差異呢？ > 注意：AddModuleScore使用的基因集對象需要提前轉變?yōu)榱斜砀袷健?/span>

2.2.1 基因集列表制作及評分

genes <- agenes <- as.data.frame(genes)genes <- as.list(genes)
sce <- AddModuleScore(sce_final,features = genes,name = 'CD4_TCELL_VS_BCELL_UP_score')colnames(sce@meta.data)

## [1] 'orig.ident' 'nCount_RNA' ## [3] 'nFeature_RNA' 'percent.mt' ## [5] 'RNA_snn_res.0.5' 'seurat_clusters' ## [7] 'celltype' 'CD4_TCELL_VS_BCELL_UP_score1'

# 可以看到sce對象的meta.data中新增了一列名為CD4_TCELL_VS_BCELL_UP_score1的評分

2.2.2 可視化

my_comparisons <- list(c('Naive CD4 T','B'),c('Memory CD4 T','B'))library(ggpubr)p.AddModuleScore <- ggviolin(sce@meta.data, x = 'celltype', y = 'CD4_TCELL_VS_BCELL_UP_score1', color = 'celltype',add = 'mean_sd',fill = 'celltype', add.params = list(color = 'black')) stat_compare_means(comparisons = my_comparisons,label = 'p.signif') scale_color_manual(values = my9color) scale_fill_manual(values = my9color) theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1)) NoLegend() labs(x = '')ggsave('./Results/addmodulescore_score_vln.png',width = 8,height = 4.5)p2 <- FeaturePlot(sce,'CD4_TCELL_VS_BCELL_UP_score1')ggsave('./Results/addmodulescore_score_umap.png',width = 5.5,height = 5)

這里得到的結果與PercentageFeatureSet相似。 #### AddModuleScore和PercentageFeatureSet的差異區(qū)別其實很容易發(fā)現(xiàn)，從metadata的數據我們就可以觀察到AddModuleScore得到的評分是有負值的。我們plot看一下兩種方法，基因集評分值的分布情況：

# plot(sce$CD4_TCELL_VS_BCELL_UP_score)# plot(sce$CD4_TCELL_VS_BCELL_UP_score1)

PercentageFeatureSet：其評估得到的分數，均為正值，并且有不少的離散值分布。

AddModuleScore：可以看到，其評估得到的分數，有點類似scale之后的結果。

這個帖子有提到AddModuleScore的原理：Seurat的打分函數AddMouduleScore

2.3 AUCell

按照之前的講解，我們走一遍AUCell的流程。

2.3.1 AUCell_buildRankings

library(AUCell)library(clusterProfiler)

## ## clusterProfiler v3.18.1  For help: https://guangchuangyu./software/clusterProfiler## ## If you use clusterProfiler in published research, please cite:## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.## ## Attaching package: 'clusterProfiler'## The following object is masked from 'package:stats':## ##     filter

library(ggplot2)library(Seurat)library(GSEABase)

## Loading required package: BiocGenerics## Loading required package: parallel## ## Attaching package: 'BiocGenerics'## The following objects are masked from 'package:parallel':## ##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,##     clusterExport, clusterMap, parApply, parCapply, parLapply,##     parLapplyLB, parRapply, parSapply, parSapplyLB## The following objects are masked from 'package:dplyr':## ##     combine, intersect, setdiff, union## The following objects are masked from 'package:stats':## ##     IQR, mad, sd, var, xtabs## The following objects are masked from 'package:base':## ##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,##     union, unique, unsplit, which.max, which.min## Loading required package: Biobase## Welcome to Bioconductor## ##     Vignettes contain introductory material; view with##     'browseVignettes()'. To cite Bioconductor, see##     'citation('Biobase')', and for packages 'citation('pkgname')'.## Loading required package: annotate## Loading required package: AnnotationDbi## Loading required package: stats4## Loading required package: IRanges## Loading required package: S4Vectors## ## Attaching package: 'S4Vectors'## The following object is masked from 'package:clusterProfiler':## ##     rename## The following objects are masked from 'package:dplyr':## ##     first, rename## The following object is masked from 'package:base':## ##     expand.grid## ## Attaching package: 'IRanges'## The following object is masked from 'package:clusterProfiler':## ##     slice## The following objects are masked from 'package:dplyr':## ##     collapse, desc, slice## The following object is masked from 'package:grDevices':## ##     windows## ## Attaching package: 'AnnotationDbi'## The following object is masked from 'package:clusterProfiler':## ##     select## The following object is masked from 'package:dplyr':## ##     select## Loading required package: XML## Loading required package: graph## ## Attaching package: 'graph'## The following object is masked from 'package:XML':## ##     addNode

sce <- sce_final

# AUCell_buildRankingscells_rankings <- AUCell_buildRankings(sce@assays$RNA@counts,nCores=1, plotStats=TRUE)

## Quantiles for the number of genes detected by cell: ## (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).

## min 1% 5% 10% 50% 100% ## 212.00 341.00 460.25 557.50 819.00 2413.00

# Quantiles for the number of genes detected by cell: #   (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).# min      1%      5%     10%     50%    100% # 212.00  341.00  460.25  557.50  819.00 2413.00

2.3.2 制作基因集

## 制作基因集geneSets <- aextraGeneSets <- c(GeneSet(sample(geneSets),setName='CD4_TCELL_VS_BCELL_UP_gene'))

geneSets1 <- GeneSetCollection(extraGeneSets)names(geneSets1)

## [1] 'CD4_TCELL_VS_BCELL_UP_gene'

2.3.3 計算AUC

## 計算AUCcells_AUC <- AUCell_calcAUC(geneSets1, cells_rankings)cells_AUC

## AUC for 1 gene sets (rows) and 2626 cells (columns).## ## Top-left corner of the AUC matrix:##                             cells## gene sets                    AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1##   CD4_TCELL_VS_BCELL_UP_gene       0.05518402       0.03328588       0.08549763##                             cells## gene sets                    AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1##   CD4_TCELL_VS_BCELL_UP_gene       0.04875603       0.06796838

2.3.4 AUC閾值計算及可視化

pdf('./Results/cells_assignment.pdf',width = 6,height = 5)cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE) dev.off()

## png ##   2

##set gene set of interest here for plottinggeneSet <- 'CD4_TCELL_VS_BCELL_UP_gene'aucs <- as.numeric(getAUC(cells_AUC)[geneSet, ])sce$AUC <- aucsdf<- data.frame(sce@meta.data, sce@reductions[['umap']]@cell.embeddings)colnames(df)

##  [1] 'orig.ident'      'nCount_RNA'      'nFeature_RNA'    'percent.mt'     ##  [5] 'RNA_snn_res.0.5' 'seurat_clusters' 'celltype'        'AUC'            ##  [9] 'UMAP_1'          'UMAP_2'

#我們看到每個細胞現(xiàn)在都加上AUC值了，下面做一下可視化。class_avg <- df %>% group_by(celltype) %>% summarise( UMAP_1 = median(UMAP_1), UMAP_2 = median(UMAP_2) )p1 <- ggplot(df, aes(UMAP_1, UMAP_2)) geom_point(aes(colour = AUC)) viridis::scale_color_viridis(option='H') ggrepel::geom_text_repel(aes(label = celltype), data = class_avg, size = 6, segment.color = NA) theme(legend.position = 'none') theme_classic()ggsave('./Results/umap_auc.png',width = 5.5,height = 5)

p.auc <- ggviolin(df, x = 'celltype', y = 'AUC', color = 'celltype',add = 'mean_sd',fill = 'celltype', add.params = list(color = 'black')) stat_compare_means(comparisons = my_comparisons,label = 'p.signif') scale_color_manual(values = my9color) scale_fill_manual(values = my9color) theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1)) NoLegend() labs(x = '')ggsave('./Results/vio_auc.png',width = 8,height = 4.5)

可以看到AUC的閾值為0.098，大于此閾值的有28個細胞

可以看到，T和NK細胞區(qū)域具有相對較高的AUC得分，B細胞的評分最低，與AddModuleScore和PercentageFeatureSet的計算結果是一致的。

2.4 ssGSEA

2.4.1 加載R包

# devtools::install_github('nicolash2/gggsea')library(GSVA)#ssGSEA也可以通過GSVA包實現(xiàn)library(ggplot2)

2.4.2 ssGSEA分析

gene.expr <- as.matrix(sce_final[['RNA']]@data)dim(gene.expr)

gene4ssGSEA <- as.data.frame(CD4_TCELL_VS_BCELL_UP$Gene)colnames(gene4ssGSEA)<-'CD4_TCELL_VS_BCELL_UP'ssGSEA.result <- GSVA::gsva(gene.expr, gene4ssGSEA,method='ssgsea',kcdf='Gaussian')

## Estimating ssGSEA scores for 1 gene sets.

2.4.3 可視化

my_comparisons <- list(c('Naive CD4 T','B'),c('Memory CD4 T','B'))library(ggpubr)data4plot <-cbind(sce_final@meta.data,                  t(ssGSEA.result)[rownames(sce_final@meta.data),])data4plot[1:4,1:4]

## orig.ident nCount_RNA nFeature_RNA percent.mt## AAACATACAACCAC-1 SeuratProject 2419 779 3.100455## AAACATTGAGCTAC-1 SeuratProject 4903 1352 3.875178## AAACATTGATCAGC-1 SeuratProject 3147 1129 1.080394## AAACCGTGCTTCCG-1 SeuratProject 2639 960 1.970443

colnames(data4plot)[ncol(data4plot)] <- 'CD4_TCELL_VS_BCELL_UP_ssgsea'p.ssGSEA <- ggviolin(data4plot,                x = 'celltype', y = 'CD4_TCELL_VS_BCELL_UP_ssgsea',         color = 'celltype',add = 'mean_sd',fill = 'celltype',         add.params = list(color = 'black'))     stat_compare_means(comparisons = my_comparisons,label = 'p.signif')     scale_color_manual(values = my9color)     scale_fill_manual(values = my9color)    theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1))     NoLegend()   labs(x = '')

2.5 GSVA

GSVA方法Biomamba生信基地之前發(fā)過推文的噢！鏈接在單細胞中應該如何做GSVA？這里參照之前的流程進行分析。

2.5.1 加載包

library(GSVA)library(ggplot2)

2.5.2 GSVA分析

gene.expr <-  as.matrix(sce_final[['RNA']]@data)dim(gene.expr)

## [1] 13714 2626

gene4GSVA <- as.data.frame(CD4_TCELL_VS_BCELL_UP$Gene)colnames(gene4GSVA)<-'CD4_TCELL_VS_BCELL_UP'GSVA.result <- GSVA::gsva(gene.expr, gene4GSVA,method='gsva',kcdf='Gaussian')

## Estimating GSVA scores for 1 gene sets.## Estimating ECDFs with Gaussian kernels## | | | 0% | |======================================================================| 100%

2.5.3 可視化

my_comparisons <- list(c('Naive CD4 T','B'),c('Memory CD4 T','B'))library(ggpubr)data4plot <-cbind(sce_final@meta.data,                  t(GSVA.result)[rownames(sce_final@meta.data),])data4plot[1:4,1:4]

colnames(data4plot)[ncol(data4plot)] <- 'CD4_TCELL_VS_BCELL_UP_GSVA'p.GSVA <- ggviolin(data4plot,                x = 'celltype', y = 'CD4_TCELL_VS_BCELL_UP_GSVA',         color = 'celltype',add = 'mean_sd',fill = 'celltype',         add.params = list(color = 'black'))     stat_compare_means(comparisons = my_comparisons,label = 'p.signif')     scale_color_manual(values = my9color)     scale_fill_manual(values = my9color)    theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1))     NoLegend()   labs(x = '')p.GSVA

3. 一起對比一下吧

library(patchwork)(p.percentage|p.AddModuleScore)/(p.auc|p.GSVA|p.ssGSEA)

4. 版本信息

sessionInfo()

## R version 4.0.3 (2020-10-10)## Platform: x86_64-w64-mingw32/x64 (64-bit)## Running under: Windows 10 x64 (build 19044)## ## Matrix products: default## ## locale:## [1] LC_COLLATE=Chinese (Simplified)_China.936 ## [2] LC_CTYPE=Chinese (Simplified)_China.936 ## [3] LC_MONETARY=Chinese (Simplified)_China.936## [4] LC_NUMERIC=C ## [5] LC_TIME=Chinese (Simplified)_China.936 ## ## attached base packages:## [1] stats4 parallel stats graphics grDevices utils datasets ## [8] methods base ## ## other attached packages:## [1] patchwork_1.1.1 GSVA_1.38.2 GSEABase_1.52.1 ## [4] graph_1.68.0 annotate_1.68.0 XML_3.99-0.7 ## [7] AnnotationDbi_1.52.0 IRanges_2.24.1 S4Vectors_0.28.1 ## [10] Biobase_2.50.0 BiocGenerics_0.36.1 clusterProfiler_3.18.1## [13] AUCell_1.12.0 ggpubr_0.4.0 readxl_1.3.1 ## [16] dplyr_1.0.7 png_0.1-7 ggplot2_3.3.5 ## [19] SeuratObject_4.0.2 Seurat_4.0.2 ## ## loaded via a namespace (and not attached):## [1] utf8_1.2.2 reticulate_1.20 ## [3] R.utils_2.10.1 tidyselect_1.1.1 ## [5] RSQLite_2.2.8 htmlwidgets_1.5.4 ## [7] BiocParallel_1.24.1 grid_4.0.3 ## [9] Rtsne_0.15 scatterpie_0.1.7 ## [11] munsell_0.5.0 codetools_0.2-16 ## [13] ragg_1.2.2 ica_1.0-2 ## [15] future_1.22.1 miniUI_0.1.1.1 ## [17] withr_2.4.2 GOSemSim_2.16.1 ## [19] colorspace_2.0-2 highr_0.9 ## [21] knitr_1.34 ROCR_1.0-11 ## [23] ggsignif_0.6.3 tensor_1.5 ## [25] DOSE_3.16.0 listenv_0.8.0 ## [27] MatrixGenerics_1.2.1 labeling_0.4.2 ## [29] GenomeInfoDbData_1.2.4 polyclip_1.10-0 ## [31] bit64_4.0.5 farver_2.1.0 ## [33] downloader_0.4 parallelly_1.28.1 ## [35] vctrs_0.3.8 generics_0.1.0 ## [37] xfun_0.25 R6_2.5.1 ## [39] GenomeInfoDb_1.26.7 graphlayouts_0.7.1 ## [41] openintro_2.2.0 fgsea_1.16.0 ## [43] DelayedArray_0.16.3 bitops_1.0-7 ## [45] spatstat.utils_2.2-0 cachem_1.0.6 ## [47] assertthat_0.2.1 promises_1.2.0.1 ## [49] cherryblossom_0.1.0 scales_1.1.1 ## [51] ggraph_2.0.5 enrichplot_1.10.2 ## [53] gtable_0.3.0 globals_0.14.0 ## [55] goftest_1.2-2 tidygraph_1.2.0 ## [57] rlang_0.4.11 systemfonts_1.0.4 ## [59] splines_4.0.3 rstatix_0.7.0 ## [61] lazyeval_0.2.2 spatstat.geom_2.3-2 ## [63] broom_0.7.9 BiocManager_1.30.16 ## [65] yaml_2.2.1 reshape2_1.4.4 ## [67] abind_1.4-5 backports_1.2.1 ## [69] httpuv_1.6.2 qvalue_2.22.0 ## [71] tools_4.0.3 ellipsis_0.3.2 ## [73] spatstat.core_2.3-0 jquerylib_0.1.4 ## [75] RColorBrewer_1.1-2 ggridges_0.5.3 ## [77] Rcpp_1.0.7 plyr_1.8.6 ## [79] zlibbioc_1.36.0 purrr_0.3.4 ## [81] RCurl_1.98-1.4 rpart_4.1-15 ## [83] deldir_1.0-6 viridis_0.6.1 ## [85] pbapply_1.5-0 cowplot_1.1.1 ## [87] zoo_1.8-9 SummarizedExperiment_1.20.0## [89] haven_2.4.3 ggrepel_0.9.1 ## [91] cluster_2.1.0 magrittr_2.0.1 ## [93] data.table_1.14.2 RSpectra_0.16-0 ## [95] scattermore_0.7 DO.db_2.9 ## [97] openxlsx_4.2.4 lmtest_0.9-38 ## [99] RANN_2.6.1 airports_0.1.0 ## [101] fitdistrplus_1.1-6 matrixStats_0.60.1 ## [103] hms_1.1.0 mime_0.11 ## [105] evaluate_0.14 xtable_1.8-4 ## [107] rio_0.5.27 gridExtra_2.3 ## [109] compiler_4.0.3 tibble_3.1.5 ## [111] shadowtext_0.1.0 KernSmooth_2.23-17 ## [113] crayon_1.4.1 R.oo_1.24.0 ## [115] htmltools_0.5.2 segmented_1.3-4 ## [117] ggfun_0.0.4 mgcv_1.8-33 ## [119] later_1.3.0 tzdb_0.1.2 ## [121] tidyr_1.1.4 DBI_1.1.1 ## [123] tweenr_1.0.2 MASS_7.3-53 ## [125] Matrix_1.3-4 car_3.0-11 ## [127] readr_2.1.1 R.methodsS3_1.8.1 ## [129] igraph_1.2.6 GenomicRanges_1.42.0 ## [131] forcats_0.5.1 pkgconfig_2.0.3 ## [133] rvcheck_0.1.8 foreign_0.8-80 ## [135] plotly_4.10.0 spatstat.sparse_2.0-0 ## [137] bslib_0.3.1 XVector_0.30.0 ## [139] stringr_1.4.0 digest_0.6.27 ## [141] sctransform_0.3.2 RcppAnnoy_0.0.19 ## [143] spatstat.data_2.1-0 fastmatch_1.1-3 ## [145] rmarkdown_2.10 cellranger_1.1.0 ## [147] leiden_0.3.9 usdata_0.2.0 ## [149] uwot_0.1.10 kernlab_0.9-29 ## [151] curl_4.3.2 shiny_1.7.1 ## [153] lifecycle_1.0.1 nlme_3.1-149 ## [155] jsonlite_1.7.2 carData_3.0-4 ## [157] viridisLite_0.4.0 fansi_0.5.0 ## [159] pillar_1.6.4 lattice_0.20-41 ## [161] GO.db_3.12.1 fastmap_1.1.0 ## [163] httr_1.4.2 survival_3.2-7 ## [165] glue_1.4.2 zip_2.2.0 ## [167] bit_4.0.4 mixtools_1.2.0 ## [169] ggforce_0.3.3 stringi_1.7.4 ## [171] sass_0.4.0 blob_1.2.2 ## [173] textshaping_0.3.6 memoise_2.0.0 ## [175] irlba_2.3.3 future.apply_1.8.1

5.總結

本次針對上述物種基因集打分方法做了匯總。不得不說，通過總結，我對這些方法也有了進一步的認識。
我們可以看到，這些方法所得到的結果均具有很好的一致性；在這個前提下，我們應該選擇哪個方法可能是值得商榷的問題。
① PercentageFeatureSet函數并未對結果進行處理，raw結果的比較分析可能會受到離散值的影響，所以個人覺得可能不是很客觀；相比之下，其他方法均不同程度的去除了離散值的影響，相對來說更為客觀。
② 值得一提的是，GSEA使用的情況與其他函數/算法并不相同，其首先需要進行差異基因的計算，然后再去觀察針對特定基因集差異基因所富集的情況。其他算法則不需要進行差異基因計算，但總而言之，評分/富集程度的差異也同時反映了不同細胞/評估對象基因表達的差異。萬變不離其宗。
你覺得那種方法更簡單呢？
如果是小編，我會更喜歡AddModuleScore函數。一行搞定，不耗時！耗時短除了PercentageFeatureSet函數可以相媲美（當然缺點很明顯），其他方法相對AddModuleScore來說都過于復雜了。
不短了，今天就總結到這里吧，希望這個教程能幫到你們這些愛學習生信的小可愛們，一起加油吧~~