Brown University

Interpretable Multi-scale Statistical Methods for Genetic Association Studies


In genetics, one fundamental goal is to understand which mutations in the human genome have significant effects on complex traits and diseases. Achieving this goal requires statistical inference, such as hypothesis testing, that generates interpretable results. Genome-wide association studies (GWAS) often use marginal linear regression to test the association between each mutation and the trait independently. Although GWAS is simple and scalable, it suffers from several problems that stem from the complex nature of genetic data. For example, the strong correlation structure of genotype states along the genome can induce a large number of false positives under the GWAS framework. Further, linear models can only explain additive variation in traits and cannot account for nonlinear effects such as dominance effects and epistatic interactions. This dissertation addresses these problems through multiple projects organized under two conceptual themes. In theme 1, we develop a gene-level association method called gene-ε that uses a reformulated null model for association testing with shrinkage on GWAS summary statistics. Using extensive simulations, we show that gene-ε reduces false positives compared to GWAS and competing gene-level association approaches, and we apply gene-ε to quantitative traits in UK Biobank data to identify novel associated genes. In addition, we explore the potential of gene-ε to replicate findings across multiple ancestries and in case-control studies. In theme 2, we develop two nonlinear methods: Biologically Annotated Neural Networks (BANN) and the Ensemble of Single-effect Neural Network (ESNN). BANN uses coordinate ascent variational inference to perform association tests on SNPs and genes simultaneously. ESNN, on the other hand, uses black-box variational inference and provides credible sets, which quantify uncertainty for associations when genetic data are highly collinear. ESNN can be applied to both continuous and binary traits with any neural network architecture. For BANN and ESNN, we demonstrate their power and interpretability for association studies using extensive simulations. We also apply them to real-world datasets that include both quantitative and binary traits for biological discovery.
Thesis (Ph. D.)--Brown University, 2022


Cheng, Wei, "Interpretable Multi-scale Statistical Methods for Genetic Association Studies" (2022). Center for Computational Molecular Biology Theses and Dissertations. Brown Digital Repository. Brown University Library.