Last updated: 2024-06-01
Checks: 7 0
Knit directory: VINTAGE-analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20240529)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 1615f2f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Untracked files:
Untracked: data/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt.gz
Untracked: data/4080_irnt.gwas.imputed_v3.both_sexes.tsv.bgz
Untracked: data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
Untracked: data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi
Untracked: data/gencode.v40lift37.annotation.gtf.gz
Untracked: data/integrated_call_samples_v3.20130502.ALL.panel
Untracked: output/1kg_eur_smps.txt
Untracked: output/GENCODE_v40_chr22.txt
Untracked: output/GWAS_chr22_processed.txt.gz
Untracked: output/LD_1kg_chr22_processed.bed
Untracked: output/LD_1kg_chr22_processed.bim
Untracked: output/LD_1kg_chr22_processed.bk
Untracked: output/LD_1kg_chr22_processed.fam
Untracked: output/LD_1kg_chr22_processed.log
Untracked: output/LD_1kg_chr22_processed.rds
Untracked: output/eQTL_chr22_processed.txt.gz
Untracked: output/results/
Unstaged changes:
Deleted: analysis/about.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/index.Rmd
) and HTML
(docs/index.html
) files. If you’ve configured a remote Git
repository (see ?wflow_git_remote
), click on the hyperlinks
in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 1615f2f | zhengli09 | 2024-06-01 | Update code directory |
html | 903d86a | zhengli09 | 2024-05-31 | Build site. |
Rmd | 91e8d0e | zhengli09 | 2024-05-31 | Publish the initial files for myproject |
Rmd | a7bea0c | zhengli09 | 2024-05-29 | Start workflowr project. |
In this tutorial, we illustrate the use of VINTAGE through the integrative analysis of two datasets: a large scale cis-eQTL study from eQTLGen phase I and a genome-wide association study conducted in UK Biobank. The data we use are both in the form of summary statistics and are publicly available. For this tutorial, we will focus on analyzing the genes on chromosome 22 and their relationship to systolic blood pressure. In addition, we obtain the genotype data of 503 European individuals from the 1000 Genomes Project phase 3 to serve as the linkage disequilibrium (LD) reference panel. We also obtain the gene annotation information from GENCODE. The specific files we use and the corresponding links for download are summarized in the following:
In this step, we pre-process the four input files. For eQTL summary
statistics, the minimally required information includes the gene ID,
variant ID, variant chromosome, variant genomic position, effect allele,
other allele, z-score from the single-variant association analysis, and
the sample size. For GWAS summary statistics, the minimally required
information includes the variant ID, effect allele, other allele,
variant chromosome, variant genomic position, z-score from the
single-variant association analysis, and the sample size. We further
filter out variants that have a minor allele frequency (MAF) < 1%.
For the LD reference panel, we extract genotype data of 503 European
individuals and exclude variants that have a MAF < 5%. We focus our
analysis on biallelic SNPs. The LD reference panel is stored in the
plink bim/bed/fam format. Lastly, we extract the chromosome,
transcription start site (TSS), and transcription end site (TES) of each
protein-coding gene from the GENCODE. The detailed processing steps can
be found in the scripts 1_proc_eQTL.R
,
2_proc_GWAS.R
, 3_proc_LD_ref.R
, and
4_proc_gene_annot.R
at the code
directory.
# step 1: specify the input files
eqtl_file <- "2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt.gz"
gwas_file <- "4080_irnt.gwas.imputed_v3.both_sexes.tsv.bgz"
ld_ref_file <- "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
ld_demo <- "integrated_call_samples_v3.20130502.ALL.panel"
gencode_file <- "gencode.v40lift37.annotation.gtf.gz"
# step 2: pre-process input data
chr <- 22
system(paste("Rscript code/1_proc_eQTL.R", eqtl_file, chr)) # eQTL
system(paste("Rscript code/2_proc_GWAS.R", gwas_file, chr)) # GWAS
system(paste("Rscript code/3_proc_LD_ref.R", ld_ref_file)) # LD Reference
system(paste("Rscript code/4_proc_gene_annot.R", gencode_file, chr)) # Gene Annotation
# step 3: load processed data
library(bigreadr)
eqtl <- fread2("output/eQTL_chr22_processed.txt.gz")
gwas <- fread2("output/GWAS_chr22_processed.txt.gz")
ld_ref_path <- "output/LD_1kg_chr22_processed.bed"
gencode <- fread2("output/GENCODE_v40_chr22.txt")
# find the common set of genes between eqtl and gencode
gencode <- subset(gencode, gene_id %in% unique(eqtl$gene))
Let’s take a look at each processed file so that you have a brief idea of what we are dealing with.
head(eqtl)
head(gwas)
head(gencode)
The core function of VINTAGE is run_vintage
. For a
detailed description of all the input parameters and output values,
please refer to its documentation by typing ?run_vintage
in
R. Briefly, run_vintage
starts by loading all necessary
data and filtering out genetic variants that have a MAF below a certain
threshold (default is 5%) in the LD reference panel and that are strand
ambiguous. The function then proceeds to analyze one gene at a time. In
the analysis of each gene, run_vintage
extracts variants in
the genic and adjacent regulatory regions of the gene, merges the eQTL,
GWAS, and LD panel data based on the order of effect allele and other
allele in the LD panel, filters out variants with a potential LD
mismatch between the eQTL/GWAS data and the LD panel, and eventually
performs the genetic variance and correlation tests. We analyze five
genes as an example and the results will be written to an output file
specified by outfile
. ncores
represents the
number of CPUs to use and the number of genes we simultaneously analyze.
In practice, ncores
can be set to a relatively large value
(e.g., 10) depending on your hardware.
library(VINTAGE)
outfile <- "output/results/SBP_chr22.txt"
example_genes <- c("TANGO2", "WBP2NL", "TRIOBP", "ARVCF", "SMDT1")
idx <- match(example_genes, gencode$gene_name)
run_vintage(eqtl = eqtl, gwas = gwas, refpath = ld_ref_path,
gene_info = gencode[idx, ], outfile = outfile, ncores = 1,
debug = FALSE)
***** Load LD reference data in plink format *****
***** Exclude strand ambiguous variants (eQTL, GWAS, LD) *****
(eQTL) 429859/3807663 (11%) ambiguous variants identified
(GWAS) 16542/128732 (13%) ambiguous variants identified
(LD) 12448/85250 (15%) ambiguous variants identified
(1/5) Handling gene: ENSG00000183597 (20004537-20054687)
- Merge three datasets (LD, eQTL, and GWAS)
- Filter out variants that have LD mismatch
(2/5) Handling gene: ENSG00000183066 (42394729-42454460)
- Merge three datasets (LD, eQTL, and GWAS)
- Filter out variants that have LD mismatch
(3/5) Handling gene: ENSG00000100106 (38093055-38172563)
- Merge three datasets (LD, eQTL, and GWAS)
- Filter out variants that have LD mismatch
(4/5) Handling gene: ENSG00000099889 (19957419-20004346)
- Merge three datasets (LD, eQTL, and GWAS)
- Filter out variants that have LD mismatch
(5/5) Handling gene: ENSG00000183172 (42475695-42480288)
- Merge three datasets (LD, eQTL, and GWAS)
- Filter out variants that have LD mismatch
The detailed explanation to all the output information can be found
in the documentation of the run_vintage
function. In
particular, r
represents the local genetic correlation
estimate, vin_var_test
represents the p value from the
gene-wise genetic variance test, and vin_corr_test
represents the p value from the gene-wise genetic correlation test and
would be reported to be NA if the p value from the variance test did not
pass the specified threshold.
results <- read.table(outfile, header = T)
results
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: $( cat /etc/timezone )
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] VINTAGE_0.1.008 bigreadr_0.2.5 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] gtable_0.3.4 xfun_0.41 bslib_0.6.1 ggplot2_3.4.4
[5] SPAtest_3.1.2 processx_3.8.3 lattice_0.22-5 callr_3.7.3
[9] bigassertr_0.1.6 vctrs_0.6.5 tools_4.4.0 ps_1.7.5
[13] generics_0.1.3 parallel_4.4.0 tibble_3.2.1 fansi_1.0.6
[17] pkgconfig_2.0.3 R.oo_1.25.0 Matrix_1.6-4 data.table_1.14.10
[21] bigstatsr_1.5.12 rngtools_1.5.2 lifecycle_1.0.4 compiler_4.4.0
[25] stringr_1.5.1 git2r_0.33.0 munsell_0.5.0 bigparallelr_0.3.2
[29] getPass_0.2-4 codetools_0.2-19 httpuv_1.6.13 htmltools_0.5.7
[33] sass_0.4.8 yaml_2.3.8 crayon_1.5.2 later_1.3.2
[37] pillar_1.9.0 jquerylib_0.1.4 whisker_0.4.1 R.utils_2.12.3
[41] cachem_1.0.8 doRNG_1.8.6 iterators_1.0.14 foreach_1.5.2
[45] RSpectra_0.16-1 parallelly_1.36.0 tidyselect_1.2.0 digest_0.6.33
[49] stringi_1.8.3 dplyr_1.1.4 cowplot_1.1.2 rprojroot_2.0.4
[53] fastmap_1.1.1 grid_4.4.0 colorspace_2.1-0 cli_3.6.2
[57] magrittr_2.0.3 ACAT_0.91 utf8_1.2.4 scales_1.3.0
[61] promises_1.2.1 SKAT_2.2.5 rmarkdown_2.25 httr_1.4.7
[65] bigsparser_0.6.1 matrixStats_1.2.0 rmio_0.4.0 bit_4.0.5
[69] R.methodsS3_1.8.2 evaluate_0.23 ff_4.0.9 knitr_1.45
[73] doParallel_1.0.17 irlba_2.3.5.1 rlang_1.1.2 mixsqp_0.3-54
[77] Rcpp_1.0.12 glue_1.6.2 susieR_0.12.35 reshape_0.8.9
[81] rstudioapi_0.15.0 jsonlite_1.8.8 plyr_1.8.9 R6_2.5.1
[85] bigsnpr_1.12.2 fs_1.6.3 flock_0.7