First you need to understand the data analysis workflow. See Lecture 1, "3D genomics study notes"; the workflow distilled from it is as follows:
The hands-on data come from Tung B. K. Le et al., Science 2013: https://www.ncbi.nlm./sra/?term=srr824846
Study: High-resolution mapping of the spatial organization of Caulobacter crescentus chromosome by chromosome conformation capture in conjunction with next-generation sequencing (Hi-C)

After downloading, the data were converted to fq files:

    858M Jul 3 16:21 SRR824846_Q20L10_1.fastq.gz
    857M Jul 3 16:22 SRR824846_Q20L10_2.fastq.gz
If you want to look at the other data from this study, see PRJNA196826 / SRP020913.

Download the reference genome and build the bowtie2 index

The species is Caulobacter crescentus, a bacterium frequently used in laboratory experiments. Its cells typically contain flattened vesicles (green) that enclose storage granules (orange). The genome of this species was published by W. C. Nierman et al. in 2001 (cited 500 times): "The complete genome sequence of Caulobacter crescentus was determined to be 4,016,942 base pairs in a single circular chromosome encoding 3,767 genes."

    mkdir -p ~/project/hic/ref
    cd ~/project/hic/ref
    wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-40/fasta/bacteria_20_collection/caulobacter_crescentus_na1000/dna/Caulobacter_crescentus_na1000.ASM2200v1.dna.toplevel.fa.gz
    gunzip Caulobacter_crescentus_na1000.ASM2200v1.dna.toplevel.fa.gz
    bowtie2-build Caulobacter_crescentus_na1000.ASM2200v1.dna.toplevel.fa bacteria
This produces the index files:

    5.3M Jul 25 19:28 bacteria.1.bt2
    988K Jul 25 19:28 bacteria.2.bt2
      17 Jul 25 19:28 bacteria.3.bt2
    988K Jul 25 19:28 bacteria.4.bt2
    5.3M Jul 25 19:28 bacteria.rev.1.bt2
    988K Jul 25 19:28 bacteria.rev.2.bt2
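Before moving on, it can be worth confirming that all six index files are present and non-empty. The following is a minimal sketch, not part of the original workflow; the helper name `missing_index_files` is made up for illustration, and it only covers the small-index `.bt2` suffixes listed above (large genomes get `.bt2l` files instead).

```python
from pathlib import Path
import tempfile

# The six suffixes that appear in the listing above for a small bowtie2 index.
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def missing_index_files(ref_dir, basename):
    """Return the names of expected index files that are absent or empty.

    `missing_index_files` is a hypothetical helper, not a bowtie2 API.
    """
    ref_dir = Path(ref_dir)
    missing = []
    for suffix in BT2_SUFFIXES:
        path = ref_dir / (basename + suffix)
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    # Demo in a throwaway directory: create five of the six files.
    with tempfile.TemporaryDirectory() as tmp:
        for suffix in BT2_SUFFIXES[:-1]:
            Path(tmp, "bacteria" + suffix).write_bytes(b"x")
        print(missing_index_files(tmp, "bacteria"))
```

For the real index, the call would be something like `missing_index_files(Path.home() / "project/hic/ref", "bacteria")`; an empty list means nothing is missing.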
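An excerpt from the reference genome fa file:

    >Chromosome dna:chromosome chromosome:ASM2200v1:Chromosome:1:4042929:1 REF
    GAATTCTTAACGTCCTGAGACACGACAGCGACCTCTGACCGGACTCGTTCCGCGTCTTTG
    GACAATCGGGATTCAGACTTCGGGGGATGCGGCGCAGGCTTGGGGATGATAGGCGAGCAA
    TGCGACCGTTGATCACAGCGGCGCCGTGTCACGACGCTGTTGGGGCCGTTCGGCGCCCGG

Note that the header gives a length of 4,042,929 bp for this NA1000 assembly, slightly more than the 4,016,942 bp quoted from the 2001 paper, so the two figures evidently refer to different assemblies.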
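The header above declares the chromosome coordinates (1 to 4042929), so a quick sanity check after `gunzip` is to compare that declared length with the number of bases actually in the file. The sketch below assumes the Ensembl-style `...:START:END:STRAND` location field seen in this header; `fasta_lengths` is a hypothetical helper written for this note.

```python
import io
import re


def fasta_lengths(handle):
    """Yield (header, declared_length_or_None, observed_length) per record.

    The declared length is parsed from an Ensembl-style location field
    (NAME:START:END:STRAND) if the header contains one.
    """
    header, declared, observed = None, None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, declared, observed
            header = line[1:]
            match = re.search(r":(\d+):(\d+):-?\d+(?:\s|$)", header)
            declared = int(match.group(2)) - int(match.group(1)) + 1 if match else None
            observed = 0
        elif line:
            observed += len(line)
    if header is not None:
        yield header, declared, observed


if __name__ == "__main__":
    # Toy record standing in for the real file; 12 bases declared and present.
    toy = io.StringIO(">toy dna:chromosome chromosome:TOY:toy:1:12:1 REF\nGAATTC\nTTAACG\n")
    for name, declared, observed in fasta_lengths(toy):
        status = "OK" if declared == observed else "MISMATCH"
        print(name.split()[0], declared, observed, status)
```

On the real genome you would pass `open("Caulobacter_crescentus_na1000.ASM2200v1.dna.toplevel.fa")` instead of the toy `StringIO` object.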
Download the required software

A comprehensive list of software is available at: https:///3c-4c-5c-hi-c-chia-pet-category

If you do not have conda yet, install it first:

    wget https://mirrors.tuna./anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    source ~/.bashrc
    conda config --add channels https://mirrors.tuna./anaconda/pkgs/free
    conda config --add channels https://mirrors.tuna./anaconda/cloud/conda-forge
    conda config --add channels https://mirrors.tuna./anaconda/cloud/bioconda
    conda config --set show_channel_urls yes
Then install a set of tools:

    conda create -n hic python=2 bowtie2
    conda info --envs
    source activate hic
    conda search hiclib
    conda install -y sra-tools samtools
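Some software is not available in conda, so you need to read its documentation yourself; the main ones here are hiclib and HiC-Pro. HiC-Pro is especially recommended because it can process many kinds of Hi-C data, including Hi-C, in situ Hi-C, DNase Hi-C, Micro-C, capture-C, capture Hi-C and HiChip.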
Install hiclib as follows:

    source activate hic
    conda install numpy scipy matplotlib h5py cython numexpr statsmodels scikit-learn pandas
    pip install https:///mirnylab/mirnylib/get/tip.tar.gz
    pip install https:///mirnylab/hiclib/get/tip.tar.gz  ## 17.7MB 44kB/s
Install the HiC-Pro dependencies as follows:

    # conda install numpy scipy matplotlib h5py cython numexpr statsmodels scikit-learn pandas
    ## there are quite a few dependencies
    source activate hic
    conda install -y pysam bx-python numpy scipy
    conda install -y R
    R -e 'install.packages(c("ggplot2","RColorBrewer"), repos="https://mirrors.tuna./CRAN/")'
    R -e 'library(ggplot2)'
    R -e 'library(RColorBrewer)'
    mkdir -p ~/biosoft/hicpro
    cd ~/biosoft/hicpro
    git clone https://github.com/nservant/HiC-Pro.git
    cd HiC-Pro/
    which bowtie2
    which R
    which samtools
    which python
    cat config-install.txt
    mkdir /home/zengjianming/biosoft/hicpro/bin
At this point you must edit the config-install.txt file in this directory to match your own system environment:

    PREFIX =/home/zengjianming/biosoft/hicpro/bin
    BOWTIE2_PATH =/home/zengjianming/miniconda3/envs/hic/bin/bowtie2
    SAMTOOLS_PATH =/home/zengjianming/miniconda3/envs/hic/bin/samtools
    R_PATH =/home/zengjianming/miniconda3/envs/hic/bin/R
    PYTHON_PATH =/home/zengjianming/miniconda3/envs/hic/bin/python
    CLUSTER_SYS =SGE
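Since the values are just the outputs of the `which` commands run earlier, the file can also be drafted programmatically. This is only a sketch under assumptions: `render_install_config` is a hypothetical helper, the key names are copied from the listing above, and the spacing around `=` is a guess that you should compare against the config-install.txt shipped with your HiC-Pro checkout.

```python
import shutil

# Config keys from the listing above, mapped to the executables they point at.
TOOLS = {
    "BOWTIE2_PATH": "bowtie2",
    "SAMTOOLS_PATH": "samtools",
    "R_PATH": "R",
    "PYTHON_PATH": "python",
}


def render_install_config(prefix, cluster_sys="SGE", which=shutil.which):
    """Return (config_text, missing_executables).

    `which` is injectable so the function can be tried without the tools installed.
    """
    lines = ["PREFIX = " + prefix]
    missing = []
    for key, exe in TOOLS.items():
        path = which(exe)
        if path is None:
            missing.append(exe)
            path = ""  # leave the value blank so it is obvious what to fill in
        lines.append("{} = {}".format(key, path))
    lines.append("CLUSTER_SYS = " + cluster_sys)
    return "\n".join(lines) + "\n", missing


if __name__ == "__main__":
    text, missing = render_install_config("/home/zengjianming/biosoft/hicpro/bin")
    print(text)
    if missing:
        print("Not found on PATH (activate the hic environment first?):", missing)
```

Run it inside the activated `hic` environment so that `shutil.which` resolves to the conda copies of the tools, then review the output by hand before saving it.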
Then you can build the software:

    make configure
    make install
There are a lot of dependencies, but with a little care the installation is not much of a problem.

    /home/zengjianming/biosoft/hicpro/bin/HiC-Pro_2.10.0/bin/HiC-Pro -h
If this prints the help text, the installation succeeded.

hiclib tutorial

Start with the official readme, which lists these steps:

    0. Download software and data
    1. Map reads to the genome
    2. Filter the dataset at the restriction fragment level
    3. Filter and iteratively correct heatmaps.
On opening it, I found that it is all Python code rather than a packaged tool that runs the workflow from the command line with arguments. That feels a bit hard to use, so I am setting it aside for now and will update these notes later.

HiC-Pro tutorial

Its manual is every bit as good as hiclib's; see http://nservant./HiC-Pro. Broadly there are six steps: alignment, filtering of the Hi-C alignments, detection of valid Hi-C pairs, merging of results, building the Hi-C contact maps, and normalizing the contact maps. Running any one of them only requires changing a parameter:

    [-s|--step ANALYSIS_STEP] : run only a subset of the HiC-Pro workflow; if not specified the complete workflow is run
        mapping: perform reads alignment - require fast files
        proc_hic: perform Hi-C filtering - require BAM files
        quality_checks: run Hi-C quality control plots
        merge_persample: merge multiple inputs and remove duplicates if specified - require .validPairs files
        build_contact_maps: Build raw inter/intrachromosomal contact maps - require _allValidPairs files
        ice_norm : run ICE normalization on contact maps - require .matrix files
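The step list above reads naturally as an ordered pipeline in which each step consumes the previous step's output. Here is a small sketch of that idea for planning a partial re-run. It assumes the order in the help text is the execution order, and `steps_to_rerun` is a hypothetical helper, not something HiC-Pro provides.

```python
# Step names and input requirements, in the order printed by the help text.
# None means the help text states no input requirement for that step.
WORKFLOW = [
    ("mapping", "fastq files"),
    ("proc_hic", "BAM files"),
    ("quality_checks", None),
    ("merge_persample", ".validPairs files"),
    ("build_contact_maps", "_allValidPairs files"),
    ("ice_norm", ".matrix files"),
]
STEP_NAMES = [name for name, _ in WORKFLOW]


def steps_to_rerun(first_step):
    """Return first_step and every later step (assumed to depend on it)."""
    if first_step not in STEP_NAMES:
        raise ValueError("unknown step: {!r}".format(first_step))
    return STEP_NAMES[STEP_NAMES.index(first_step):]


if __name__ == "__main__":
    # Example: only the contact-map settings changed, so mapping is not repeated.
    requirements = dict(WORKFLOW)
    for step in steps_to_rerun("build_contact_maps"):
        print("-s", step, "| needs:", requirements[step] or "(not stated)")
```

Each printed name is a value you would pass to `-s`; the input for each such run is whatever the "needs" column says, which you have to locate in your own output directory.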
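The workflow only runs step by step when the -s parameter is given. Of the six steps, mapping takes by far the most time, so when a later step needs a parameter change, such as adjusting BIN_SIZE, re-running only that step is much faster.

Also worth mentioning is its signature feature: allele-specific Hi-C analysis.

It is getting late today; the hands-on part continues tomorrow.

Other practice datasets

The dataset above is from a bacterial genome, so the sequencing files are much smaller and well suited to practice. Once you are comfortable, you can try other datasets, for example the raw Hi-C data for the Rose genome, where each file is about 12G, or the data from the paper "An Osteoporosis Risk SNP at 1p36.12 Acts as an Allele-Specific Enhancer to Modulate LINC00339 Expression via Long-Range Loop Formation", and so on.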
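One reason the bacterial dataset is convenient for experimenting with BIN_SIZE is that the resulting contact maps stay tiny. A quick back-of-the-envelope sketch: the genome length comes from the FASTA header shown earlier, while the bin sizes and the helper names `n_bins` and `bin_index` are arbitrary choices made here for illustration.

```python
GENOME_LENGTH = 4042929  # bp, from the reference FASTA header shown earlier


def n_bins(genome_length, bin_size):
    """Number of fixed-size bins needed to cover the genome (ceiling division)."""
    return (genome_length + bin_size - 1) // bin_size


def bin_index(position, bin_size):
    """0-based index of the bin that contains a 1-based genomic position."""
    return (position - 1) // bin_size


if __name__ == "__main__":
    for bin_size in (10000, 1000):  # illustrative resolutions, not recommendations
        bins = n_bins(GENOME_LENGTH, bin_size)
        print(bin_size, "bp bins:", bins, "bins,", bins * bins, "matrix cells")
    # 10000 bp bins: 405 bins, 164025 matrix cells
    # 1000 bp bins: 4043 bins, 16345849 matrix cells
```

The last base of the chromosome falls in bin `bin_index(GENOME_LENGTH, 10000)`, which is 404, the final row of a 405 x 405 matrix. A larger genome at the same resolution would need a correspondingly larger matrix.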