【原】孟德爾隨機化：下載UKbiobank數(shù)據(jù)本地處理

阿越就是我 2024-04-22 發(fā)布于上海

展開全文

今天學習下如何下載UKBiobank的數(shù)據(jù)并進行整理，把它變成TwoSampleMR能夠使用的格式。

數(shù)據(jù)下載

打開ukbiobank的官網(wǎng)：https://www./uk-biobank

打開后往下拉，找到results can be found here，點擊它：

點擊之后即可來到數(shù)據(jù)界面，首先是README這個sheet，這里會有一些介紹，每個文件包含哪些內(nèi)容，每個文件的列名是什么意思等，都有詳細的說明，大家一定要看！

為了方便以后使用，這個文件是可以免費下載的，建議大家直接下來，后面就可以用EXCEL隨時打開查看了。

點擊下面的Manifest 201807就可以來到數(shù)據(jù)詳情和下載界面了，UKbiobank這個免費的數(shù)據(jù)已經(jīng)好久不更新了，可以看到數(shù)據(jù)就留在2018年了。

數(shù)據(jù)詳情界面有每個數(shù)據(jù)的phenotype description、phenotype code、數(shù)據(jù)下載鏈接等信息。

可以看到每個表型（phenotype）都有6個數(shù)據(jù)，分別是inrt和raw各有3種（分為female、male、both_sexes）。

把下面的滾動條往右邊拉到最后，就可以看到每個數(shù)據(jù)的下載鏈接了，你鼠標放上去就會有復(fù)制鏈接的提示，直接復(fù)制在瀏覽器或者迅雷等軟件中打開即可下載數(shù)據(jù)了。

每個表型都有6個數(shù)據(jù)，那么我們應(yīng)該是用哪個呢？官方的github^[1]介紹說選擇irnt作為接下來的分析：

Overall, the results are largely consistent regardless of the choice of IRNT(inverse rank-normal transformed) or raw untransformed phenotypes. Since IRNT does appear to provide a marginal boost to the h2g results, especially in terms of significance, we choose to treat the IRNT version as the primary analysis for continuous phenotypes. (Results for the raw, untransformed versions will still appear in the complete results file though.)

所以我們也不糾結(jié)了，直接用irnt的數(shù)據(jù)分析，如果你的數(shù)據(jù)需要分開性別，那么你就單獨下載某個性別的數(shù)據(jù)，要么就是下載both_sexes的。

數(shù)據(jù)讀取及整理

我這里選擇了fat這個表型的數(shù)據(jù)（上面有展示）。下載好之后直接用vroom讀取即可：

fat_bothsex <- vroom::vroom("./ukbiobank/100004_irnt.gwas.imputed_v3.both_sexes.tsv.bgz")

dim(fat_bothsex)
## [1] 13791467       11
colnames(fat_bothsex)
##  [1] "variant"                "minor_allele"           "minor_AF"              
##  [4] "low_confidence_variant" "n_complete_samples"     "AC"                    
##  [7] "ytx"                    "beta"                   "se"                    
## [10] "tstat"                  "pval"

這個數(shù)據(jù)也是很大的，1000多萬行，11列。

# 看下前6行
head(fat_bothsex)
## # A tibble: 6 × 11
##   variant minor_allele minor_AF low_confidence_variant n_complete_samples     AC
##   <chr>   <chr>           <dbl> <lgl>                               <dbl>  <dbl>
## 1 1:1579… T             0       TRUE                                51453 0     
## 2 1:6948… A             2.04e-5 TRUE                                51453 2.10e0
## 3 1:6956… C             1.80e-4 TRUE                                51453 1.85e1
## 4 1:1398… T             2.03e-5 TRUE                                51453 2.09e0
## 5 1:6927… C             1.12e-1 FALSE                               51453 1.15e4
## 6 1:6937… G             1.17e-1 FALSE                               51453 1.20e4
## # ? 5 more variables: ytx <dbl>, beta <dbl>, se <dbl>, tstat <dbl>, pval <dbl>

這個數(shù)據(jù)的每一列是什么意思，都在前面說過的README部分有詳細的介紹，大家一定要去看！

比如第一列是variant，它的意思是chr:pos:ref:alt，alt是effect allele，需要使用這個和注釋文件合并。這些信息都在README里面：

這個數(shù)據(jù)是沒有rsid的，但是你仔細讀過README之后就會發(fā)現(xiàn)，在variants.tsv.bgz這個文件中是有rsid的，而且這個文件也有chr:pos:ref:alt這些信息：

所以把這個文件下載下來，和我們的fat_bothsex文件合并一下，就有rsid了。variants.tsv.bgz這個文件就在Manifest 201807的第11行，找到下載鏈接直接下載即可：

下載好之后我們把它讀取進來：

variants_anno <- vroom::vroom("./ukbiobank/variants.tsv.bgz")
dim(variants_anno)
## [1] 13791467       25

發(fā)現(xiàn)這個數(shù)據(jù)和fat_bothsex這個數(shù)據(jù)的行數(shù)是一樣的。其實在README的描述中人家就說的很清楚了，這個順序就是一樣的，你可以通過variant這一列進行合并，也可以直接復(fù)制粘貼到一起：

查看前6行：

variants_anno[1:6,]
## # A tibble: 6 × 25
##   variant  chr      pos ref   alt   rsid  varid consequence consequence_category
##   <chr>    <chr>  <dbl> <chr> <chr> <chr> <chr> <chr>       <chr>               
## 1 1:15791… 1      15791 C     T     rs54… 1:15… splice_reg… missense            
## 2 1:69487… 1      69487 G     A     rs56… 1:69… missense_v… missense            
## 3 1:69569… 1      69569 T     C     rs25… 1:69… missense_v… missense            
## 4 1:13985… 1     139853 C     T     rs53… 1:13… splice_reg… missense            
## 5 1:69279… 1     692794 CA    C     1:69… 1:69… upstream_g… non_coding          
## 6 1:69373… 1     693731 A     G     rs12… 1:69… upstream_g… non_coding          
## # ? 16 more variables: info <dbl>, call_rate <dbl>, AC <dbl>, AF <dbl>,
## #   minor_allele <chr>, minor_AF <dbl>, p_hwe <dbl>, n_called <dbl>,
## #   n_not_called <dbl>, n_hom_ref <dbl>, n_het <dbl>, n_hom_var <dbl>,
## #   n_non_ref <dbl>, r_heterozygosity <dbl>, r_het_hom_var <dbl>,
## #   r_expected_het_frequency <dbl>

看看是不是完全一樣：

# 肯定是完全一致
identical(variants_anno$variant, fat_bothsex$variant)
## [1] TRUE

所以我們直接合并一下即可：

fat_bothsex <- cbind.data.frame(variants_anno[,2:6], fat_bothsex)
head(fat_bothsex)
##   chr    pos ref alt          rsid       variant minor_allele    minor_AF
## 1   1  15791   C   T   rs547522712   1:15791:C:T            T 0.00000e+00
## 2   1  69487   G   A   rs568226429   1:69487:G:A            A 2.03879e-05
## 3   1  69569   T   C     rs2531267   1:69569:T:C            C 1.80062e-04
## 4   1 139853   C   T   rs533633326  1:139853:C:T            T 2.03117e-05
## 5   1 692794  CA   C 1:692794_CA_C 1:692794:CA:C            C 1.12102e-01
## 6   1 693731   A   G    rs12238997  1:693731:A:G            G 1.16788e-01
##   low_confidence_variant n_complete_samples          AC        ytx       beta
## 1                   TRUE              51453     0.00000    0.00000        NaN
## 2                   TRUE              51453     2.09804   -1.99658 -0.9567220
## 3                   TRUE              51453    18.52940   -2.86950 -0.1222020
## 4                   TRUE              51453     2.09020   -2.00107 -0.9583010
## 5                  FALSE              51453 11535.90000 -162.96100 -0.0233942
## 6                  FALSE              51453 12018.20000 -105.14000 -0.0126221
##          se     tstat      pval
## 1       NaN       NaN       NaN
## 2 0.6939280 -1.378710 0.1679920
## 3 0.2382830 -0.512845 0.6080620
## 4 0.6939330 -1.380970 0.1672940
## 5 0.0107131 -2.183700 0.0289888
## 6 0.0101812 -1.239750 0.2150740

MR分析

首先是加載R包：

library(TwoSampleMR)

現(xiàn)在fat_bothsex這個數(shù)據(jù)是一個GWAS的原始數(shù)據(jù)，你可以把它用作是結(jié)局相關(guān)的數(shù)據(jù)，也可以把它用作是暴露相關(guān)的數(shù)據(jù)。

如果你要把這個數(shù)據(jù)用作是結(jié)局相關(guān)的數(shù)據(jù)，那就是直接進行以下代碼即可：

# 注意列名一一對應(yīng)
outcome_data <- format_data(dat = fat_bothsex,
                      type = "outcome", # 類型是outcome
                      snp_col = "rsid",
                      beta_col = "beta",
                      se_col = "se",
                      eaf_col = "minor_AF",
                      effect_allele_col = "alt",
                      other_allele_col = "ref",
                      pval_col = "pval",
                      samplesize_col = "n_complete_samples",
                      #ncase_col = "ncase_col",
                      #ncontrol_col = "ncontrol_col",
                      #gene_col = "nearest_genes",
                      chr_col = "chr",
                      pos_col = "pos"
                      )

現(xiàn)在這個outcome_data就是一個結(jié)局數(shù)據(jù)了，如果你已經(jīng)有了暴露數(shù)據(jù)，那么就可以直接進行MR分析了。

如果你要把這個數(shù)據(jù)用作是暴露數(shù)據(jù)，那么首先需要根據(jù)P值篩選一下：

# 先去掉NaN
df1 <- fat_bothsex[!is.nan(fat_bothsex$pval),]
# 根據(jù)P值篩選,我這個數(shù)據(jù)沒有小于5e-8的，所以選擇了5e-6作為演示
df1 <- subset(df1, pval < 5e-6)

然后再格式化數(shù)據(jù)：

# 注意列名一一對應(yīng)
exposure_data <- format_data(dat = df1,
                      type = "exposure", # 類型是exposure
                      snp_col = "rsid",
                      beta_col = "beta",
                      se_col = "se",
                      eaf_col = "minor_AF",
                      effect_allele_col = "alt",
                      other_allele_col = "ref",
                      pval_col = "pval",
                      samplesize_col = "n_complete_samples",
                      #ncase_col = "ncase_col",
                      #ncontrol_col = "ncontrol_col",
                      #gene_col = "nearest_genes",
                      chr_col = "chr",
                      pos_col = "pos"
                      )

然后再運行下clump（這個過程需要聯(lián)網(wǎng)的，如果你想完全本地化運行，可參考下載IEU OPEN GWAS數(shù)據(jù)本地處理）：

# 這一步需要聯(lián)網(wǎng)
exposure_data <- clump_data(exposure_data,
                            clump_kb=10000, clump_r2=0.001)

現(xiàn)在這個exposure_data就是一個暴露數(shù)據(jù)了，如果你已經(jīng)有了結(jié)局數(shù)據(jù)，那么就可以直接進行MR分析了。

其中還有一些問題并沒有說，比如一個ref有多個alt，或者minor_allele有多個等問題，以后再介紹，大家遇到了也可以自己搜索下。

搞定！easy！

參考資料

[1]

github: https://nealelab./UKBB_ldsc/select_topline.html

贊賞

共11人贊賞

小男孩‘自慰网亚洲一区二区,亚洲一级在线播放毛片,亚洲中文字幕av每天更新,黄aⅴ永久免费无码,91成人午夜在线精品,色网站免费在线观看,亚洲欧洲wwwww在线观看

【原】孟德爾隨機化：下載UKbiobank數(shù)據(jù)本地處理

數(shù)據(jù)下載

數(shù)據(jù)讀取及整理

MR分析

參考資料