一起學(xué)之單細(xì)胞高級分析SCENIC-Python版二項值熱圖

cmu小孩 2024-05-20 發(fā)布于德國

展開全文

作為三大高級分析的SCENIC，由于在R中運行的時間太長了，因此我們這里直接用python運行，時間大概能夠縮減20倍，這里我們用自己的數(shù)據(jù)進(jìn)行跑流程：

官網(wǎng)地址：GitHub - aertslab/pySCENIC: pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.

一部分在Linux命令行中運行，另一部分在jupyter notebook中運行：

首先是Linux中運行：

#!/bin/bash

loomfile='/home/data/sbc057/B.loom'
tfs='/home/data/refdir/lesson_3-4/SCENIC_RESOURCES/allTFs_hg38.txt'
tbl='/home/data/refdir/lesson_3-4/SCENIC_RESOURCES/hgnc_v2/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl'
motifs='/home/data/refdir/lesson_3-4/SCENIC_RESOURCES/hgnc_v2/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather'
outdir='pySCENIC_B_res'
num_workers=1


if [ -e $outdir ] && [ -d $outdir ]
then
    echo $outdir 'already exists.'
else
    mkdir $outdir
fi

cd $outdir

##################################################
# NOTE !!!
# Using v1 feather would raise
#     ValueError: 'hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather' 
#             is a cisTarget Feather database in Feather v1 format, which is not supported anymore. 
#             Convert them with 'convert_cistarget_databases_v1_to_v2.py' (https://github.com/aertslab/create_cisTarget_databases/) to Feather v2 format.
##################################################


# STEP 1: Gene regulatory network inference, and generation of co-expression modules
pyscenic grn $loomfile $tfs -o adj.csv -m grnboost2 --num_workers $num_workers --seed 777

# STEP 2-3: Regulon prediction aka cisTarget from CLI
pyscenic ctx adj.csv $motifs -o reg.csv \
--annotations_fname $tbl \
--rank_threshold 5000 --auc_threshold 0.05 --nes_threshold 3.0 \
--thresholds 0.75 0.90 --top_n_targets 50 --top_n_regulators 5 10 50 --min_genes 20 \
--expression_mtx_fname $loomfile \
--mask_dropouts --num_workers $num_workers 

# STEP 4: Cellular enrichment (aka AUCell) from CLI
pyscenic aucell $loomfile reg.csv \
-o pyscenic_final.loom --num_workers 1

這里由于pandas版本問題，會出現(xiàn)一個經(jīng)典報錯，出現(xiàn)的第4步AUCell，出現(xiàn)：AttributeError: 'Series' object has no attribute 'iteritems'

查閱了資料，最終的解決方案是降低pandas版本，這里我直接新建了一個新環(huán)境，并安裝特定包版本：

conda create -y -n pyscenic_env python=3.10
conda activate pyscenic_env
pip install pyscenic   
pip install numpy==1.23.5 
pip install pandas==1.5.3 
pip install numba==0.56.4

最終我們得到了所需文件pyscenic_final.loom

接下來在python中運行：

import os
import numpy as np
import pandas as pd
import scanpy as sc
import loompy as lp
import json
import base64
import zlib
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.stats import zscore
from adjustText import adjust_text

from pyscenic.export import add_scenic_metadata
from pyscenic.cli.utils import load_signatures
from pyscenic.rss import regulon_specificity_scores
from pyscenic.plotting import plot_rss
from pyscenic.binarization import binarize

sc.settings.verbosity = 3

#########################################
wdir = 'pySCENIC_new'
os.mkdir( wdir )
os.chdir( wdir )
f_final_loom = '/home/data/sbc057/pyscenic_final.loom'
regulonsfile = '/home/data/sbc057/reg.csv'

lf = lp.connect( f_final_loom, mode='r', validate=False )
meta = json.loads(zlib.decompress(base64.b64decode( lf.attrs.MetaData )))
exprMat = pd.DataFrame( lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
auc_mtx = pd.DataFrame( lf.ca.RegulonsAUC, index=lf.ca.CellID)
lf.close()

這段代碼的目的是從 Loom 文件中讀取基因表達(dá)數(shù)據(jù)和 Regulons 的 AUC 值，然后將其轉(zhuǎn)換成 Pandas DataFrame 進(jìn)行進(jìn)一步的分析和處理。

adata = sc.read( f_final_loom, validate=False)
adata.obs.drop( ['RegulonsAUC'], axis=1, inplace=True )
sc._utils.sanitize_anndata( adata )
sig = load_signatures(regulonsfile)
adata = add_scenic_metadata(adata, auc_mtx, sig)

然后新建一個AnnData對象，由于自己數(shù)據(jù)中暫時沒有分組，這里我們隨機(jī)分成2組，'Group1' 和 'Group2'，只是為了復(fù)現(xiàn)流程：

adata.obs
adata.obs['celltype'] = np.random.choice(['Group1', 'Group2'], size=len(adata.obs))
print(adata.obs[['celltype']].head())

               celltype
CellID                 
AAACATTGAGCTAC   Group1
AAACTTGAAAAACG   Group1
AAAGGCCTGTCTAG   Group1
AAAGTTTGATCACG   Group1
AAAGTTTGGGGTGA   Group2

這些代碼的目的是從外部文件中加載細(xì)胞注釋信息，將其整合到 AnnData 對象的觀測中，并為細(xì)胞類型創(chuàng)建一個新的列。接下來，計算分組之間的AUC分?jǐn)?shù)。

# Regulon Specificity Score
rss_cellType = regulon_specificity_scores( auc_mtx, adata.obs['celltype'] )
# binary AUC
binary_mtx, auc_thresholds = binarize( auc_mtx )

# save results
auc_mtx.to_excel('AUC_Score_Matrix.random.xlsx', header=True, index=True)
rss_cellType.to_excel('RSS_Score_by_CellType_Matrix.random.xlsx', header=True, index=True)
binary_mtx.to_excel('Binarized_AUC_Score_Matrix.random.xlsx', header=True, index=True)

可視化：

cats = np.unique(adata.obs.celltype)
#########################################
#### plot
#########################################
# RSS
fig = plt.figure(figsize=(15, 6))
for c,num in zip(cats, range(1,len(cats) 1)):
    x=rss_cellType.T[c]
    ax = fig.add_subplot(3, 5, num)
    plot_rss(rss_cellType, c, top_n=5, max_n=None, ax=ax)
    ax.set_ylim( x.min()-(x.max()-x.min())*0.05 , x.max() (x.max()-x.min())*0.05 )
    for t in ax.texts:
        t.set_fontsize(12)
    ax.set_ylabel('')
    ax.set_xlabel('')
    adjust_text(ax.texts, autoalign='xy', ha='right', va='bottom', arrowprops=dict(arrowstyle='-',color='lightgrey'), precision=0.001)

fig.text(0.5, 0.0, 'Regulon', ha='center', va='center', size='x-large')
fig.text(0.00, 0.5, 'Regulon specificity score (RSS)', ha='center', va='center', rotation='vertical', size='x-large')
plt.tight_layout()
plt.rcParams.update({
    'figure.autolayout': True,
        'figure.titlesize': 'large' ,
        'axes.labelsize': 'medium',
        'axes.titlesize':'large',
        'xtick.labelsize':'medium',
        'ytick.labelsize':'medium'
        })
plt.savefig('All_RssPlot.pdf', dpi=300, bbox_inches = 'tight')

上圖是Regulon的特異性排序圖，橫坐標(biāo)表示排名，縱坐標(biāo)表示RSS特異性得分。RSS越高的調(diào)控子可能與該細(xì)胞類型特異性相關(guān)。

# AUC
topreg = []
for i,c in enumerate(cats):
    topreg.extend(
        list(rss_cellType.T[c].sort_values(ascending=False)[:5].index)
    )

topreg = list(set(topreg))
auc_mtx_Z = zscore(auc_mtx, axis=0, ddof=0)
colors = sns.color_palette('bright',n_colors=len(cats) )
colorsd = dict( zip( cats, colors ))
colormap = [ colorsd[x] for x in adata.obs['celltype'] ]

sns.set(font_scale=1.2)
g_auc = sns.clustermap(auc_mtx_Z[topreg], annot=False, square=False,  linecolor='gray',
    yticklabels=False, xticklabels=True, vmin=-2, vmax=6, row_colors=colormap,
    cmap='YlGnBu', figsize=(12,18) )
g_auc.cax.set_visible(True)
g_auc.ax_heatmap.set_ylabel('')
g_auc.ax_heatmap.set_xlabel('')
g_auc.fig.savefig('RegulonActivity_heatmap.pdf', dpi=300, bbox_inches = 'tight', pad_inches=1)

上圖是行為細(xì)胞分組，列為轉(zhuǎn)錄因子的熱圖，不過需要一些調(diào)整才能更方便看出差異。

# bionarized AUC
g_auc_b = sns.clustermap(binary_mtx[topreg].T, annot=False, square=False,
    col_colors=colormap, cmap=sns.xkcd_palette(['white', 'black']), figsize=(12,18) )
g_auc_b.cax.set_visible(True)
g_auc_b.ax_heatmap.set_xticklabels([])
g_auc_b.ax_heatmap.set_xticks([])
g_auc_b.ax_heatmap.set_xlabel('Cells')
g_auc_b.ax_heatmap.set_ylabel('Regulons')
g_auc_b.ax_col_colors.set_yticks([0.5])
g_auc_b.ax_col_colors.set_yticklabels(['Cell Type'])
g_auc_b.fig.savefig('RegulonActivity_heatmap_binary.pdf', dpi=300, bbox_inches = 'tight', pad_inches=1)

上面是AUC二分類的熱圖。

df_obs = adata.obs
signature_column_names = list(df_obs.select_dtypes('number').columns)
signature_column_names = list(filter(lambda s: s.startswith('Regulon('), signature_column_names))
df_scores = df_obs[signature_column_names   ['celltype']]
df_results = ((df_scores.groupby(by='celltype').mean() - df_obs[signature_column_names].mean())/ df_obs[signature_column_names].std()).stack().reset_index().rename(columns={'level_1': 'regulon', 0:'Z'})
df_results['regulon'] = list(map(lambda s: s[8:-1], df_results.regulon))
df_heatmap = pd.pivot_table(data=df_results[df_results.Z >= 0].sort_values('Z', ascending=False),
                           index='celltype', columns='regulon', values='Z')
df_heatmap = df_heatmap.fillna(0)  # 將 NaN 填充為 0，但根據(jù)實際情況可以使用其他值
# 繪制熱圖
fig, ax1 = plt.subplots(1, 1, figsize=(15, 8))
sns.heatmap(df_heatmap, ax=ax1, annot=True, fmt='.1f', linewidths=.7, cbar=False, square=True, linecolor='gray', 
            cmap='YlGnBu', annot_kws={'size': 6})
ax1.set_ylabel('')
fig.savefig('Average_Regulon_Activity_byCelltype.pdf')