--- title: "Adaptive Gene- and Pathway- Trait Association testing with GWAS Summary Statistics (aSPUs() and aSPUsPath())" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{aSPU with GWAS Summary Statistics} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r knitr_options, echo=FALSE, results=FALSE} library(knitr) opts_chunk$set(fig.width = 12) ``` This vignette illustrates the use of the aSPU package with GWAS Summary Statistics. ## Data We will consider the analysis of a coronary artery disease (CAD) data from the CARDIoGRAM and C4D consortium. The data set contains P value data for coronary artery disease (CAD). This study comprised 63,746 CAD cases and 130,681 controls. We mapped these SNPs to the 9th KEGG pathway ([KEGG_STEROID_BIOSYNTHESIS](http://software.broadinstitute.org/gsea/msigdb/cards/KEGG_STEROID_BIOSYNTHESIS)) . Let's load this subset. ```{r loading} library(aSPU) data(kegg9) ``` The 9th Kegg pathway contains 16 genes. ```{r kegg1} kegg9$gene.info ``` The `PPs` is a list object contains the SNP information for each genes. ```{r kegg2} ## SNPs mapped on 3rd and 4th genes of 9th Kegg pathway kegg9$PPs[3:4] ``` The `Ps` object contains p-value information for mapped SNPs. Total `r length(kegg9$Ps)`SNPs are mapped on 9th kegg pathway. ```{r kegg3} length(kegg9$Ps) kegg9$Ps[1:10] ``` Using plink, we mapped our SNPs to the reference population ( Hapmap CEU phase 2 ). We dropped the SNPs of minor allele frequency (MAF) less then 5 percent. Among `r length(kegg9$Ps)` SNPs mapped on 9th Kegg pathway, `r length(kegg9$nP)` SNPs are mapped on the reference data. P-values of these SNPs and correlation matirx of SNPs using the reference population, saved on `nP` and `ldmatrix` object. ```{r kegg4} kegg9$nP[1:10] kegg9$ldmatrix[1:10,1:10] ``` The `snp.info` data object have snp information for each mapped SNPs. The 1st column is rsID, 2nd column is Chr, 3rd column is location and 4th column is p-value. ```{r kegg5} kegg9$snp.info[1:10,] ``` ## Use of aSPUs function Using the following code we can use `aSPUs()` and `GATES2()` function to get gene-wise aSPU and GATES p-value for each gene in 9th Kegg pathway. ```{r gene-wise p} Gps<-NULL; gl <- NULL; for( g in kegg9$gene.info[,1]) { snps <- which( ( kegg9$snp.info[,2] == kegg9$gene.info[kegg9$gene.info[,1]==g,2]) & (kegg9$snp.info[,3] > kegg9$gene.info[kegg9$gene.info[,1] == g, 3])& (kegg9$snp.info[,3] < kegg9$gene.info[kegg9$gene.info[,1] == g, 4])) newP <- kegg9$nP[snps] ; ldsub <- kegg9$ldmatrix[snps, snps]; if( length(snps) > 1) { out <- aSPUs(newP, corSNP=ldsub , pow=c(1,2,4,8, Inf), n.perm=10000, Ps=TRUE) o.pvec = order(newP) ldmat <- ldsub[o.pvec, o.pvec] gatesp <- GATES2(ldmat, sort(newP))[1] Gps <- rbind(Gps, c(length(newP),out$pvs, gatesp)) gl <- c(gl, g) } else if (length(snps) == 1) { out <- newP gatesp <- newP Gps <- rbind(Gps, c(length(newP),rep(out,6), gatesp) ) gl <- c(gl, g) } } colnames(Gps)[1] <- "nSNP" rownames(Gps) <- gl Gps ``` The row of `Gps` means each gene, 1st column indicate the number of SNPs for each gene. 2nd to 6th column indicate SPUs p-values for each power ( 1,2,4,8 and Inf), 7th column indicate aSPUs p-value and 8th column indicate p-value of GATES method. We can see that a gene _LIPA_ is very significant. ```{r pathway2} g = "LIPA" snps <- which( ( kegg9$snp.info[,2] == kegg9$gene.info[kegg9$gene.info[,1]==g,2]) & (kegg9$snp.info[,3] > kegg9$gene.info[kegg9$gene.info[,1] == g, 3])& (kegg9$snp.info[,3] < kegg9$gene.info[kegg9$gene.info[,1] == g, 4])) newP <- kegg9$nP[snps] ; newP ``` _LIPA_ have `r length(newP)` SNPs mapped and we can see that there are many significant SNPs. It makes sense that SPUs(1) have less p-value then SPUs(inf) since there are multiple significant SNPs. GATES have more power when there are small number of significant SNPs in the Gene (similar to minP test), so aSPUs have less p-value than GATES. ## Use of aSPUsPath function We can get p-value for pathways as follows. Let's perform pathway based analysis using aSPUsPath, Gates-Simes and HYST. ```{r pathway p} out.g <- GatesSimes(pvec = kegg9$nP, ldmatrix = kegg9$ldmatrix, snp.info=kegg9$snp.info, gene.info = kegg9$gene.info) out.h <- Hyst(pvec = kegg9$nP, ldmatrix = kegg9$ldmatrix, snp.info=kegg9$snp.info, gene.info = kegg9$gene.info) out.a <- aSPUsPath(kegg9$nP, corSNP = kegg9$ldmatrix, pow=c(1,2,4,8, Inf), pow2 = c(1,2,4,8), snp.info=kegg9$snp.info, gene.info = kegg9$gene.info, n.perm=1000, Ps = TRUE) out.g; out.h; out.a ``` As we can see from the gene-wise analysis, there is only one very significant gene `LIPA`. In this situation Gates-Simes works well and Gate-Simes p-value is `r out.g`. On the other hand, HYST works well when there are many significant genes with similar effects. The aSPUsPath adaptively consider all SPUsPath(i,j). We can see that the p-value decrease as our 2nd parameter , `pow2`, increases. It makes sense because larger `pow2` is more effective if there are fewer associated genes.