Carrie Wang
Department of Computer Science, The University of Hong Kong
April 22, 2025
This notebook is a personal note I made when I first started learning GWAS. It’s designed for beginners who are starting to explore GWAS or just getting into related research.
Since real GWAS datasets often are with restricted access because of
the inclusion of raw genotypes and individual phenotypes, the
hierGWAS
package provides an example
toy dataset called simGWAS
, which is
perfect for learning and practicing GWAS analysis.
In traditional GWAS, we test every SNP (single-nucleotide polymorphism) one by one to see if it’s linked to a trait or disease. But when we’re working with thousands or even millions of SNPs, this gets messy: lots of noise, multiple testing issues, and low power to detect true signals.
That’s where the hierGWAS
package comes
in! Instead of testing every SNP individually,
hierGWAS
:
In this notebook, we’ll walk through:
simGWAS
) provided by
HierGWASThis method helps us find genetic signals more reliably, especially when dealing with high-dimensional genetic data. Let’s dive in and explore how it works!
library(hierGWAS)
data("simGWAS")
simGWAS
is the toy dataset provided by the package. It
contains 500 human samples (250 cases, 250 controls) and 1000 SNPs. Each
SNP can be 0, 1, or 2, representing the number of minor (or alternative)
alleles an individual carries at that position.
geno = data.matrix(simGWAS[,1:1000])
dim(geno)
## [1] 500 1000
geno
is the genotype data matrix of size 500 by
1000.
pheno = simGWAS$y
pheno
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
summary(pheno)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.5 0.5 1.0 1.0
hist(pheno, main="Distribution of Phenotype", xlab = "Phenotype")
pheno
is the phenotype data of size 500 by 1. It can be
continuous or descrite. Here by looking at the histogram, we know that
the phenotype we are interested in is discrete.
covar = simGWAS[,c("sex", "age")]
covar
covar
is the covariate matrix of size 500 by 2. For sex,
0 for men and 1 for women.
In GWAS, instead of testing every SNP independently (which is noisy and has multiple testing problems), hierGWAS lets you: * Group SNPs based on how similar their patterns are across individuals (e.g., correlation of genotype values). * Then test these groups hierarchically, starting from large clusters down to individual SNPs.
SNPindex.chrom1 = seq(1,500)
dendr.chrom1 = cluster.snp(geno,SNP_index = SNPindex.chrom1)
SNPindex.chrom2 = seq(501,1000)
dendr.chrom2 = cluster.snp(geno,SNP_index = SNPindex.chrom2)
cluster.snp()
groups the selected SNPs based on how
correlated their genotype patterns are across individuals. Therefore,
dendr.chrom1
should be of size 500 by 500, each entry
contains a similarity score.
Why LASSO?: In GWAS, usually we have hundreds of individuals but millions of SNPs - high-dimensional data. LASSO helps by selecting only a small subset of relevant SNPs, doing feature selection and regression at the same time, forcing many SNP coefficients to zero to improve interpretability.
Doing multiple times of sample splitting + LASSO is because LASSO is very unstable. The result of one LASSO run can depend heavily on which samples are included. A single split might select SNP A today, and SNP B tomorrow.
The procedure is as follows:
res.multisplit <- multisplit(geno,pheno,covar = covar)
# the matrix of selected coefficients for each sample split
show(res.multisplit$sel.coeff[1:10,1:10])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 991 995 992 430 846 990 996 635 252 998
## [2,] 991 990 995 989 992 533 276 655 730 908
## [3,] 995 991 990 989 655 846 996 173 998 106
## [4,] 991 990 995 989 898 492 275 313 553 218
## [5,] 570 992 995 991 433 989 996 730 313 471
## [6,] 991 995 989 218 846 990 415 433 524 106
## [7,] 991 992 995 996 433 823 888 178 330 701
## [8,] 991 989 992 178 995 405 768 216 942 990
## [9,] 991 995 990 997 898 304 260 334 655 662
## [10,] 995 991 989 882 996 94 990 635 606 730
# the samples which will be used for testing
show(res.multisplit$out.sample[1:10,1:10])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 3 5 7 9 10 12 13 14 15
## [2,] 3 4 5 6 10 11 13 14 16 17
## [3,] 4 6 9 11 12 13 14 16 23 24
## [4,] 1 2 3 6 10 11 14 15 18 22
## [5,] 2 6 7 8 10 11 15 16 17 18
## [6,] 7 9 11 12 13 14 16 17 18 21
## [7,] 1 2 3 7 8 13 17 18 19 21
## [8,] 1 4 6 7 8 9 10 11 13 16
## [9,] 1 2 3 5 6 7 8 9 14 15
## [10,] 2 3 5 6 7 8 10 11 12 16
result.chrom1 <- test.hierarchy(geno, pheno, dendr.chrom1, res.multisplit,covar = covar, SNP_index = SNPindex.chrom1)
## Testing the global null hypothesis..
## The global null hypothesis was rejected
## Testing a subset..
## The null hypothesis of the subset was rejected
## Testing the hierarchy of the subset..
## Tested a group with 500 SNPs.
## The group is significant.
## Testing the children of the group:
## Child 1 with 224 SNPs and child 2 with 276
## Tested a group with 276 SNPs.
## The group is significant.
## Testing the children of the group:
## Child 1 with 114 SNPs and child 2 with 162
## Tested a group with 114 SNPs.
## The group is significant.
## Testing the children of the group:
## Child 1 with 64 SNPs and child 2 with 50
result.chrom2 <- test.hierarchy(geno, pheno, dendr.chrom2, res.multisplit,covar = covar, SNP_index = SNPindex.chrom2, global.test = FALSE, verbose = FALSE)
show(result.chrom2)
## [[1]]
## [[1]]$label
## [1] 517 884 879 958 583 638 620 812 553 647 608 833 566 844 641 712 906 995 894
## [20] 656 861 527 709 767 943 737 980 761 951 537 947 735 813 734 805 644 957 621
## [39] 837 515 725 817 648 678 755 604 926 679 587 956 666 705 506 713 627 677 534
## [58] 760
##
## [[1]]$pval
## [1] 0.0295723
##
##
## [[2]]
## [[2]]$label
## [1] 605 792 636 857 858 911 571 998 708 867 612 932 803 920 643 653 778 808 720
## [20] 714 732 854 876 727 738
##
## [[2]]$pval
## [1] 0.02184107
##
##
## [[3]]
## [[3]]$label
## [1] 992
##
## [[3]]$pval
## [1] 0.006656329
##
##
## [[4]]
## [[4]]$label
## [1] 991
##
## [[4]]$pval
## [1] 0.005397572
##
##
## [[5]]
## [[5]]$label
## [1] 997
##
## [[5]]$pval
## [1] 0.0008729821
Finally, we can show(result.chrom1)
and
show(result.chrom2)
to see which groups or single SNPs are
significant!
Note about R markdown: This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.