---
title: "Adaptive Gene- and Pathway- Trait Association testing with GWAS Summary Statistics (aSPUs() and aSPUsPath())"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{aSPU with GWAS Summary Statistics}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r knitr_options, echo=FALSE, results=FALSE}
library(knitr)
opts_chunk$set(fig.width = 12)
```

This vignette illustrates the use of the aSPU package with GWAS Summary Statistics. 


## Data

We will consider the analysis of a coronary artery disease (CAD)
data from the CARDIoGRAM and C4D consortium. The data set contains P value
data for coronary artery disease (CAD). This study comprised 63,746
CAD cases and 130,681 controls. We mapped these SNPs to the 9th KEGG
pathway ([KEGG_STEROID_BIOSYNTHESIS](http://software.broadinstitute.org/gsea/msigdb/cards/KEGG_STEROID_BIOSYNTHESIS))
. Let's load this subset.

```{r loading}
library(aSPU)
data(kegg9)
```
The 9th Kegg pathway contains 16 genes.

```{r kegg1}
kegg9$gene.info
```

The `PPs` is a list object contains the SNP information for each genes.
```{r kegg2}
## SNPs mapped on 3rd and 4th genes of 9th Kegg pathway
kegg9$PPs[3:4]
```

The `Ps` object contains p-value information for mapped
SNPs. Total `r length(kegg9$Ps)`SNPs are mapped on 9th kegg pathway.

```{r kegg3}
length(kegg9$Ps)
kegg9$Ps[1:10]
```

Using plink,
we mapped our SNPs to the reference population ( Hapmap CEU phase 2
). We dropped the SNPs of minor allele frequency (MAF) less then 5
percent. Among `r length(kegg9$Ps)` SNPs mapped on 9th Kegg pathway,
`r length(kegg9$nP)` SNPs are mapped on the reference data. P-values of these
SNPs and correlation matirx of SNPs using the reference population,
saved on `nP` and `ldmatrix` object.

```{r kegg4}
kegg9$nP[1:10]
kegg9$ldmatrix[1:10,1:10]
```
The `snp.info` data object have snp information for each mapped
SNPs. The 1st column is rsID, 2nd column is Chr, 3rd column is
location and 4th column is p-value.

```{r kegg5}
kegg9$snp.info[1:10,]
```

## Use of aSPUs function

Using the following code we can use `aSPUs()` and `GATES2()` function
to get gene-wise aSPU and GATES p-value for each gene in 9th Kegg pathway.
```{r gene-wise p}
Gps<-NULL;
gl <- NULL;
for( g in kegg9$gene.info[,1]) {
    snps <-
    which( ( kegg9$snp.info[,2] == kegg9$gene.info[kegg9$gene.info[,1]==g,2]) &
           (kegg9$snp.info[,3] > kegg9$gene.info[kegg9$gene.info[,1] == g, 3])&
           (kegg9$snp.info[,3] < kegg9$gene.info[kegg9$gene.info[,1] == g, 4]))

    newP <- kegg9$nP[snps] ;
    ldsub <- kegg9$ldmatrix[snps, snps];

    if( length(snps) > 1) {
        out <- aSPUs(newP, corSNP=ldsub , pow=c(1,2,4,8, Inf),
                     n.perm=10000, Ps=TRUE)

        o.pvec = order(newP)
        ldmat <- ldsub[o.pvec, o.pvec]
        gatesp <- GATES2(ldmat, sort(newP))[1]
        Gps <- rbind(Gps, c(length(newP),out$pvs, gatesp))
        gl <- c(gl, g)
    } else if (length(snps) == 1) {
        out <- newP
        gatesp <- newP
        Gps <- rbind(Gps, c(length(newP),rep(out,6), gatesp) )
        gl <- c(gl, g)
    }
}
colnames(Gps)[1] <- "nSNP"
rownames(Gps) <- gl
Gps
```

The row of `Gps` means each gene, 1st column indicate the number of
SNPs for each gene. 2nd to 6th column indicate SPUs p-values for each
power ( 1,2,4,8 and Inf), 7th column indicate aSPUs p-value and 8th
column indicate p-value of GATES method.
We can see that a gene _LIPA_ is very significant.

```{r pathway2}
g = "LIPA"
    snps <-
    which( ( kegg9$snp.info[,2] == kegg9$gene.info[kegg9$gene.info[,1]==g,2]) &
           (kegg9$snp.info[,3] > kegg9$gene.info[kegg9$gene.info[,1] == g, 3])&
           (kegg9$snp.info[,3] < kegg9$gene.info[kegg9$gene.info[,1] == g, 4]))

newP <- kegg9$nP[snps] ;
newP

```
_LIPA_ have `r length(newP)` SNPs mapped and we can see that there are
many significant SNPs. It makes sense that SPUs(1) have less p-value
then SPUs(inf) since there are multiple significant SNPs. GATES have
more power when there are small number of significant SNPs in the
Gene (similar to minP test), so aSPUs have less p-value than GATES.

## Use of aSPUsPath function

We can get p-value for pathways as follows.
Let's perform pathway based analysis using aSPUsPath, Gates-Simes and
HYST.


```{r pathway p}
out.g <- GatesSimes(pvec = kegg9$nP, ldmatrix = kegg9$ldmatrix,
                    snp.info=kegg9$snp.info, gene.info = kegg9$gene.info)

out.h <- Hyst(pvec = kegg9$nP, ldmatrix = kegg9$ldmatrix,
              snp.info=kegg9$snp.info, gene.info = kegg9$gene.info)

out.a <- aSPUsPath(kegg9$nP, corSNP = kegg9$ldmatrix, pow=c(1,2,4,8, Inf),
                   pow2 = c(1,2,4,8),
                   snp.info=kegg9$snp.info, gene.info = kegg9$gene.info,
                   n.perm=1000, Ps = TRUE)

out.g; out.h; out.a

```

As we can see from the gene-wise analysis, there is only one very
significant gene `LIPA`. In this situation Gates-Simes works well and
Gate-Simes p-value is `r out.g`. On the other hand, HYST works well
when there are many significant genes with similar effects.

The aSPUsPath adaptively consider all SPUsPath(i,j). We can see that
the p-value decrease as our 2nd parameter , `pow2`, increases. It makes
sense because larger `pow2` is more effective if there are fewer
associated genes.