Carrie Wang
Department of Computer Science, The University of Hong Kong
April 22, 2025

Introduction

This notebook is a personal note I made when I first started learning GWAS. It’s designed for beginners who are starting to explore GWAS or just getting into related research.

Since real GWAS datasets often are with restricted access because of the inclusion of raw genotypes and individual phenotypes, the hierGWAS package provides an example toy dataset called simGWAS, which is perfect for learning and practicing GWAS analysis.

In traditional GWAS, we test every SNP (single-nucleotide polymorphism) one by one to see if it’s linked to a trait or disease. But when we’re working with thousands or even millions of SNPs, this gets messy: lots of noise, multiple testing issues, and low power to detect true signals.

That’s where the hierGWAS package comes in! Instead of testing every SNP individually, hierGWAS:

In this notebook, we’ll walk through:

  1. Loading a small example dataset (simGWAS) provided by HierGWAS
  2. Clustering the SNPs to find groups with similar patterns
  3. Running multiple LASSO regressions to identify promising SNPs
  4. Testing those SNPs (or SNP groups) using a hierarchical approach to find which ones are significantly linked to the phenotype

This method helps us find genetic signals more reliably, especially when dealing with high-dimensional genetic data. Let’s dive in and explore how it works!

Load data

library(hierGWAS)
data("simGWAS")

simGWAS is the toy dataset provided by the package. It contains 500 human samples (250 cases, 250 controls) and 1000 SNPs. Each SNP can be 0, 1, or 2, representing the number of minor (or alternative) alleles an individual carries at that position.

geno = data.matrix(simGWAS[,1:1000])
dim(geno)
## [1]  500 1000

geno is the genotype data matrix of size 500 by 1000.

pheno = simGWAS$y
pheno
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
summary(pheno)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.5     0.5     1.0     1.0
hist(pheno, main="Distribution of Phenotype", xlab = "Phenotype")

pheno is the phenotype data of size 500 by 1. It can be continuous or descrite. Here by looking at the histogram, we know that the phenotype we are interested in is discrete.

covar = simGWAS[,c("sex", "age")]
covar

covar is the covariate matrix of size 500 by 2. For sex, 0 for men and 1 for women.

Hierarchical Clustering of SNPs

In GWAS, instead of testing every SNP independently (which is noisy and has multiple testing problems), hierGWAS lets you: * Group SNPs based on how similar their patterns are across individuals (e.g., correlation of genotype values). * Then test these groups hierarchically, starting from large clusters down to individual SNPs.

SNPindex.chrom1 = seq(1,500)
dendr.chrom1 = cluster.snp(geno,SNP_index = SNPindex.chrom1)
SNPindex.chrom2 = seq(501,1000)
dendr.chrom2 = cluster.snp(geno,SNP_index = SNPindex.chrom2)

cluster.snp() groups the selected SNPs based on how correlated their genotype patterns are across individuals. Therefore, dendr.chrom1 should be of size 500 by 500, each entry contains a similarity score.

Multiple Sample Splitting and LASSO Regression

Why LASSO?: In GWAS, usually we have hundreds of individuals but millions of SNPs - high-dimensional data. LASSO helps by selecting only a small subset of relevant SNPs, doing feature selection and regression at the same time, forcing many SNP coefficients to zero to improve interpretability.

Doing multiple times of sample splitting + LASSO is because LASSO is very unstable. The result of one LASSO run can depend heavily on which samples are included. A single split might select SNP A today, and SNP B tomorrow.

The procedure is as follows:

  1. Randomly split the sample into two equal parts: I1 and I2
  2. Do LASSO selection on the data from I1 (the reason is that we cannot use the same data for feature engineering as well as testing significance! we need I1 to find significant SNP and then, in the next subsection, use another split I2 to test the significance of those SNPs). Save the n/6 selected SNPs (we only keep the n/6 SNPs to prevent overfitting caused by too many significant SNPs, sorted by significance).
  3. Repeat steps 1-2 B times. The default value for B is 50, but a different number of random splits can be specified.
res.multisplit <- multisplit(geno,pheno,covar = covar)
# the matrix of selected coefficients for each sample split
show(res.multisplit$sel.coeff[1:10,1:10])
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]  991  995  992  430  846  990  996  635  252   998
##  [2,]  991  990  995  989  992  533  276  655  730   908
##  [3,]  995  991  990  989  655  846  996  173  998   106
##  [4,]  991  990  995  989  898  492  275  313  553   218
##  [5,]  570  992  995  991  433  989  996  730  313   471
##  [6,]  991  995  989  218  846  990  415  433  524   106
##  [7,]  991  992  995  996  433  823  888  178  330   701
##  [8,]  991  989  992  178  995  405  768  216  942   990
##  [9,]  991  995  990  997  898  304  260  334  655   662
## [10,]  995  991  989  882  996   94  990  635  606   730
# the samples which will be used for testing
show(res.multisplit$out.sample[1:10,1:10])
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    2    3    5    7    9   10   12   13   14    15
##  [2,]    3    4    5    6   10   11   13   14   16    17
##  [3,]    4    6    9   11   12   13   14   16   23    24
##  [4,]    1    2    3    6   10   11   14   15   18    22
##  [5,]    2    6    7    8   10   11   15   16   17    18
##  [6,]    7    9   11   12   13   14   16   17   18    21
##  [7,]    1    2    3    7    8   13   17   18   19    21
##  [8,]    1    4    6    7    8    9   10   11   13    16
##  [9,]    1    2    3    5    6    7    8    9   14    15
## [10,]    2    3    5    6    7    8   10   11   12    16

Hierarchical Testing of SNPs

result.chrom1 <- test.hierarchy(geno, pheno, dendr.chrom1, res.multisplit,covar = covar, SNP_index = SNPindex.chrom1)
## Testing the global null hypothesis..
## The global null hypothesis was rejected
## Testing a subset..
## The null hypothesis of the subset was rejected
## Testing the hierarchy of the subset..
## Tested a group with 500 SNPs.
## The group is significant.
## Testing the children of the group:
## Child 1 with 224 SNPs and child 2 with 276
## Tested a group with 276 SNPs.
## The group is significant.
## Testing the children of the group:
## Child 1 with 114 SNPs and child 2 with 162
## Tested a group with 114 SNPs.
## The group is significant.
## Testing the children of the group:
## Child 1 with 64 SNPs and child 2 with 50
result.chrom2 <- test.hierarchy(geno, pheno, dendr.chrom2, res.multisplit,covar = covar, SNP_index = SNPindex.chrom2, global.test = FALSE, verbose = FALSE)
show(result.chrom2)
## [[1]]
## [[1]]$label
##  [1] 517 884 879 958 583 638 620 812 553 647 608 833 566 844 641 712 906 995 894
## [20] 656 861 527 709 767 943 737 980 761 951 537 947 735 813 734 805 644 957 621
## [39] 837 515 725 817 648 678 755 604 926 679 587 956 666 705 506 713 627 677 534
## [58] 760
## 
## [[1]]$pval
## [1] 0.0295723
## 
## 
## [[2]]
## [[2]]$label
##  [1] 605 792 636 857 858 911 571 998 708 867 612 932 803 920 643 653 778 808 720
## [20] 714 732 854 876 727 738
## 
## [[2]]$pval
## [1] 0.02184107
## 
## 
## [[3]]
## [[3]]$label
## [1] 992
## 
## [[3]]$pval
## [1] 0.006656329
## 
## 
## [[4]]
## [[4]]$label
## [1] 991
## 
## [[4]]$pval
## [1] 0.005397572
## 
## 
## [[5]]
## [[5]]$label
## [1] 997
## 
## [[5]]$pval
## [1] 0.0008729821

Finally, we can show(result.chrom1) and show(result.chrom2) to see which groups or single SNPs are significant!


Note about R markdown: This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.