Kamil Jan Cygan
Methods for Identifying Allele-Specific Splicing Aberrations Associated With Human Hereditary Disease

To my parents – Lidia and Jan, my brother – Marek, and my wife – Marta

Methods for Identifying Allele-Specific Splicing Aberrations Associated With Human Hereditary Disease
by Kamil Jan Cygan
B.Sc., Loyola University Chicago, 2010

Thesis
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Molecular Biology, Cell Biology, and Biochemistry and Center for Computational Molecular Biology at Brown University

PROVIDENCE, RHODE ISLAND
May 2018

© Copyright 2018 by Kamil J. Cygan

This dissertation by Kamil J. Cygan is accepted in its present form by the Department of Molecular Biology, Cell Biology, and Biochemistry and Center for Computational Molecular Biology as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date (William G. Fairbrother, Ph.D., Advisor)

Recommended to the Graduate Council
Date (Daniel M. Weinreich, Ph.D., Reader)
Date (Nicola Neretti, Ph.D., Reader)
Date (Stephen L. Helfand, M.D., Reader)
Date (Jeremy R. Sanford, Ph.D., External Reviewer)

Approved by the Graduate Council
Date (Andrew G. Campbell, Ph.D., Dean of the Graduate School)

Acknowledgments

First of all, I would like to thank my parents for constantly encouraging me to pursue my goals and keep learning every day. Without you, I would never have picked the path that I did. Thank you for giving me much useful advice, related not only to my research but, more importantly, to other aspects of my life. I would also like to thank you for your continuous support, without which I would not have been able to complete my studies.

My deepest gratitude goes to my wife Marta. Her endless patience, tolerance, and encouragement allowed me to complete my work and inspired me to always go forward to the next steps.
I am grateful that I have had YOU and your constant support throughout the graduate school journey. I am looking forward to what we can accomplish throughout our lives together.

To my brother Marek, I know that you are not with us anymore, but I want to thank you for the best times of my life. You taught me to always be optimistic and go through life with a big smile on my face. Losing you during my first year of grad school was not easy, but please remember that you will always have a special place in my heart. I always appreciated hearing kind words from you, especially how proud you were of what I had already accomplished.

I am thankful to William Fairbrother, my advisor, for his personal and financial support for many years during my doctoral studies. His constant encouragement and morning visits to the dry lab provided invaluable advice, essential for my progress and for the completion of many projects, most of which are part of this thesis.

Finally, my PhD experience would definitely have been less enjoyable if not for my friends and colleagues from the Fairbrother lab. I would like to thank you all for your friendship, help with research problems, and fun times during many discussions and outings.

Abstract

In eukaryotes, DNA is first transcribed into precursor messenger RNA (pre-mRNA), which then undergoes extensive RNA processing. One such processing event is RNA splicing, in which non-coding sequences (introns) are removed and the flanking protein-coding sequences (exons) are joined to generate mature RNA (mRNA) that can be translated into protein. It has been predicted that about one third of human hereditary mutations affect RNA splicing. Most splicing mutations cause exon skipping, but they may also lead to alternative splice site usage and intron retention. This thesis introduces a novel computational analysis for a massively parallel splicing reporter assay to detect human hereditary disease-associated allele-specific splicing events.
Thousands of disease alleles and their wild-type counterparts were tested in high-throughput in vitro and in vivo assays for evidence of allelic imbalance and allele-specific aberrations in RNA splicing. Due to the plethora of information obtained from the study, we were able to observe that retinoblastomas are predominantly caused by defective splicing of the RB1 transcript. In a subsequent run of the assay, we investigated the splicing disruption potential of de novo variants in autism spectrum disorders. Using the results from the high-throughput experiments as a foundation for supervised statistical learning, we were able to extract relevant features that play significant roles in allele-specific splicing aberrations. We also generated multiple predictive tools for visualizing and identifying mutations that cause splicing defects, and measured the extent of aberrant splicing caused by these mutations.

Contents

Preface
1 Introduction
1.1 Messenger RNA splicing signals
1.1.1 The splicing of pre-mRNA via a lariat intermediate
1.1.2 Conserved consensus sequences at the sites of splicing
1.1.3 Splicing signals outside of the canonical splice sites
1.1.4 Prediction of causal variants affecting splicing
1.2 RNA-binding proteins: splicing factors and disease
1.2.1 The core spliceosome
1.2.2 HnRNPs, SR proteins, and other splicing factors
1.3 Three mechanisms of RBP-related splicing dysregulation
1.3.1 Mechanism I: disruption of a splicing element
1.3.2 Mechanism II: toxic RNA
1.3.3 Mechanism III: mutations that affect splicing factors
1.4 Machine learning
1.4.1 Supervised learning
1.5 Developing tools predicting causal SNPs
1.6 Functionally validating individual variants
1.7 Conclusions and future directions in therapeutic interventions for splicing disorders
2 Identification of Splicing Defects
2.1 Pathogenic variants that alter protein code often disrupt splicing
2.2 Results
2.2.1 Massively parallel splicing assays
2.2.2 Nonuniform distribution of splicing mutations
2.2.3 Random forest classification of exonic splicing mutations
2.2.4 RNA-binding protein motifs in the 5K panel
2.2.5 Mechanistic signatures of splicing mutants
2.3 Discussion
2.4 Methods
2.4.1 Library design and synthesis
2.4.2 MaPSy in vivo assays
2.4.3 MaPSy in vitro assays
2.4.4 Library species alignment and counting
2.4.5 Allelic imbalance analyses
2.4.6 Splicing efficiency analyses
2.4.7 MaPSy validation in patient samples
2.4.8 MaPSy validation in ENCODE data
2.4.9 HGMD mutation analyses
2.4.10 Random forest classification
2.4.11 Random forest predictor variables
2.4.12 Motif analyses
2.4.13 RBP-binding motif validation
2.4.14 Functional SELEX analysis
2.4.15 Data availability
2.4.16 URLs
2.5 Technical aspects of MaPSy – challenges and limitations
2.5.1 Limitations of MaPSy
2.5.2 Crucial step of MaPSy – read counting/filtering
2.6 Critical assessment of genome interpretation (CAGI) challenge
2.6.1 Summary
2.6.2 Background
2.6.3 Experiment
2.6.4 Prediction challenge
2.7 Effects of RNA-binding proteins on intron removal
2.7.1 Global analysis of order of intron removal
2.7.2 Poly-U signals exhibit strong influence on splicing order
3 Splicing Aberrations and Human Hereditary Diseases
3.1 Defective splicing of the RB1 transcript is the dominant cause of retinoblastomas
3.2 Results
3.2.1 Global analysis suggests retinoblastoma belongs to a distinct class of diseases driven by splicing mutations
3.2.2 RB1 has a high fraction of exonic mutations that alter splicing in vitro and in vivo
3.2.3 The transition between A and B complex is the major point of disruption for RB1 coding mutants
3.2.4 Deep sampling of genetic variation in the human population suggests the presence of rare deleterious alleles that may alter splicing
3.2.5 At least 553 RB1 variants exist in the human population
3.2.6 Online visualization tool enables splicing phenotype to be added to annotations of disease alleles
3.3 Discussion
3.4 Methods
3.4.1 Splicing efficiency analyses
3.4.2 RB1 HGMD mutation simulation
3.4.3 Genomic evolutionary rate profiling (GERP) conservation analysis
3.4.4 Calculating whole-exome intronic coverage
3.4.5 Global variant distribution in ExAC dataset
3.4.6 Annotation of RB1 variants
3.5 Genetic variation in autism spectrum disorder (ASD)
3.5.1 Distribution of de novo mutations
3.5.2 Identification of de novo ASD mutations that cause splicing defects
4 Visualization and Inference of Splicing Aberrations
4.1 Spliceman2 – a computational web server that predicts defects in pre-mRNA splicing
4.2 Improvements
4.2.1 Input improvement
4.2.2 Algorithm methodology
4.3 Outputs
4.4 Methods
4.4.1 Definition and calculation of L1 distance
4.4.2 Validation of L1 distance
4.4.3 Creation of valid mutation databases for Spliceman2
4.4.4 Spliceman2 speed test preparation and results
4.5 Inference of splicing mutations using machine learning
4.5.1 Classification approach
4.5.2 Regression approach and proposed Spliceman3 implementation
5 Conclusions and Future Directions
5.1 Usefulness of conservation scores in classification tasks
5.2 Potential improvements with CLIP/KD data
5.3 Potential improvements with better datasets
A Identification of Splicing Defects
B Splicing Aberrations and Human Hereditary Diseases
C Visualization and Inference of Splicing Aberrations
Bibliography

List of Figures

1.1 The chemistry of pre-mRNA splicing.
1.2 Consensus sequences and base pairing of small nuclear RNAs (snRNAs).
1.3 Examples of interactions among factors that recognize splicing signals.
1.4 Different outcomes of the failure to recognize a specific splicing signal.
1.5 Stepwise assembly of the early spliceosome
1.6 Three mechanisms of RBP-mediated splicing dysregulation. Mechanism I: Disruption of a splicing element.
1.7 Three mechanisms of RBP-mediated splicing dysregulation. Mechanism II: Toxic RNA.
1.8 Three mechanisms of RBP-mediated splicing dysregulation. Mechanism III: Mutations that affect splicing factors.
2.1 MaPSy on the 5K panel.
2.2 Robustness of MaPSy.
2.3 Prevalence of splicing mutations in disease-associated genes.
2.4 Random forest classification of exonic mutations that disrupt splicing.
2.5 Detection of RBP motifs that affect splicing.
2.6 Profiles of RBP motifs.
2.7 Validation of RBP motifs that affect splicing.
2.8 Clustering of RBP motifs.
2.9 Isolation of spliceosomal intermediates.
2.10 Clustering of allelic ratios provides ESM mechanistic insights.
2.11 MaPSy is unable to detect all possible alternative events.
2.12 Read counting strategy used during the MaPSy protocol.
2.13 U-rich motifs enriched in always-first splicing introns can increase splicing efficiency.
3.1 RB1 mutations frequently disrupt splicing.
3.2 RB1 mutations that disrupt splicing.
3.3 Spliceosomal assembly results
3.4 RB1 mutations disrupt splicing in various stages of the spliceosomal assembly.
3.5 Low-frequency variants predicted to disrupt splicing in RB1.
3.6 Comparison of the distribution of polymorphisms in RB1.
3.7 Online browser enables navigation through RB1 mutations that disrupt splicing.
3.8 Online browser’s results.
3.9 Distribution of ExAC, HGMD, and denovo-DB variants in canonical transcripts
3.10 Distributions of input ratios between mutant and wild type species in the SFARI pilot study.
4.1 L1 distance correlates with element’s ESR assignment change.
4.2 Spliceman landing page.
4.3 Spliceman results pages.
4.4 Spliceman2 speed test performance.
4.5 Improved performance of the model including wild type splicing efficiency.
4.6 Performance of GBM and RF regression models
A.1 Alternative splicing events in the 5K panel.
A.2 MaPSy performance.
A.3 MaPSy validation in patient samples and ENCODE data.
A.4 Mode of inheritance in the 5K panel.
A.5 Genes intolerant to protein-truncating variants (PTVs) in the ExAC population are predisposed to disease-associated splicing mutations.
A.6 Features of splicing.
A.7 The role of PTBP1 and SRSF1 in ESM phenotypes.
A.8 Overlap of intronic and exonic splicing regulatory motifs.
A.9 In vitro functional SELEX.
A.10 Mutant feature analyses in different clusters revealed distinct ESM mechanistic signatures.
A.11 ESM visualization browser.
A.12 Common sequences of the 5K panel reporters.

List of Tables

3.1 Variants in RB1 analyzed with MaPSy
A.1 SNPs evaluated with MaPSy.
A.2 Summary of MaPSy validation in patient samples
A.3 Genes enriched with SSM
A.4 ENCODE Accession numbers used for the purpose of validating MaPSy results.
B.1 Low frequency variants in RB1
C.1 List of features of the boosted model that included the splicing efficiency of wild type
C.2 List of features of the boosted model that did not include the splicing efficiency of wild type

Preface

Precursor messenger ribonucleic acid (pre-mRNA) splicing is an important pathway in the regulation of gene expression. Misregulation of this process in humans may lead to many different types of disease. Many human hereditary disorders are caused by single nucleotide variants that result in splicing defects. However, most of the splicing mutations identified so far are limited to those that disrupt canonical splice sites, while previous studies have estimated that at least a quarter of both synonymous and non-synonymous mutations are also likely to impact splicing by altering the landscape of various cis-acting elements [1–3].

This thesis attempts to address some of the current limitations in the splicing field. To do so, it combines two areas of research: biology and computer science. Accordingly, in the first part of this dissertation (Chapter 1), we provide a basic background for both fields. For the biology background, we introduce the splicing signals that the cell’s machinery relies on for a successful splicing outcome and the basic biochemistry behind the splicing reaction. We continue our overview of splicing by introducing important RNA-binding proteins that are known to either suppress or activate particular splicing events. For the computational background, we close the introductory chapter by focusing on a single supervised machine learning method, namely decision trees, and multiple extensions of this approach: bagging trees, random forests, and boosting trees.
Portions of Chapter 1 were published as two separate manuscripts:
• A. M. Fredericks, K. J. Cygan, B. A. Brown, and W. G. Fairbrother. RNA-Binding Proteins: Splicing Factors and Disease. Biomolecules, 5: 893–909, 2015. DOI: 10.3390/biom5020893
• K. J. Cygan, A. J. Taggart, W. G. Fairbrother, and S. M. Mount. Messenger RNA splicing signals. eLS, 2017. 1–8 p. DOI: 10.1002/9780470015902.a0000888.pub2

Although the consensus sequences for splicing signals in the human genome are known, they are very degenerate. Interactions of auxiliary factors with the functional sequence elements are therefore not well understood. Despite the recent progress in our understanding of alternative splicing in the human transcriptome [6], very little is known about allele-specific splicing. This is mostly due to the absence of suitable datasets for such analysis. Most computational algorithms that evaluate the effect of a single nucleotide substitution are restricted to addressing the potential damaging effect of single amino acid changes. While some tools have been developed [7–10] to predict the effect of exonic mutations on splicing, they have serious limitations and there is little overlap in the predictive calls that each tool provides [10].

[1] Z. Wang et al. Cell, 119: 831–45, 2004.
[2] W. G. Fairbrother et al. Science, 297: 1007–13, 2002.
[3] R. Soemedi et al. Adv Exp Med Biol, 825: 227–66, 2014.
[6] E. T. Wang et al. Nature, 456: 470–6, 2008.

Chapter 2 presents the collaborative effort in the development of a novel method that can assay splicing of thousands of pooled substrates simultaneously and infer the effects that a variant may have on splicing. The method utilizes oligo libraries that were designed to investigate allele-specific splicing in a high-throughput manner.
Significant mechanistic insight can be obtained by tracking the splicing phenotypes in each allelic pair of the substrate pool and by identifying sequences that promote or inhibit the different stages of spliceosomal assembly. Furthermore, allelic differences in the occupancy of different splicing factors can be detected, and thus mutations that disrupt various cis-elements that impact splicing can be identified. RNA-binding proteins are also important players in splicing reactions. In the same chapter, we categorize multiple RNA-binding proteins as either activators or repressors of splicing, based on whether the mutant allele creates or disrupts an RNA-binding protein binding site and on the functional behavior of the mutant species relative to its wild-type counterpart in the assay. In a separate study, some RNA-binding proteins were identified as important influencers of the order in which a pair of adjacent introns is removed from a transcript. Parts of the results were published in the following publications:
• R. Soemedi, K. J. Cygan, C. L. Rhine, J. Wang, C. Bulacan, J. Yang, P. Bayrak-Toydemir, J. McDonald, and W. G. Fairbrother. Pathogenic variants that alter protein code often disrupt splicing. Nat Genet, 49: 848–855, 2017. DOI: 10.1038/ng.3837
• S. W. Kim, A. J. Taggart, C. Heintzelman, K. J. Cygan, C. G. Hull, J. Wang, B. Shrestha, and W. G. Fairbrother. Widespread intra-dependencies in the removal of introns from human transcripts. Nucleic Acids Res, 45: 9503–9513, 2017. DOI: 10.1093/nar/gkx661

Because splicing mutations are underreported and mutations in exons may also affect splicing, the high-throughput method (described in detail in Chapter 2) allowed us to recategorize many exonic mutations as splicing variants and to estimate the influence of splicing mutations on particular disease phenotypes.
In Chapter 3, we focus on one hereditary disease that we identified as predominantly caused by splicing defects in the RB1 transcript, namely retinoblastoma. The results of the analyses presented in this chapter were published in the following manuscript:
• K. J. Cygan, R. Soemedi, C. L. Rhine, A. Profeta, E. L. Murphy, M. F. Murray, and W. G. Fairbrother. Defective splicing of the RB1 transcript is the dominant cause of retinoblastomas. Hum Genet, 2017. DOI: 10.1007/s00439-017-1833-4

We close the chapter by investigating the splicing disruption potential of de novo variants reported in autistic patients, to further validate the usefulness of our dual reporter assay.

[7] K. H. Lim and W. G. Fairbrother. Bioinformatics, 28: 1031–2, 2012.
[8] M. Mort et al. Genome Biol, 15: R19, 2014.
[9] H. Y. Xiong et al. Science, 347: 1254806, 2015.
[10] A. B. Rosenberg et al. Cell, 163: 698–711, 2015.

Finally, by utilizing visualization techniques and machine learning approaches in conjunction with the results from the high-throughput assay, we were able to identify features that are relevant in allele-specific splicing aberrations and develop a superior pipeline to predict the effects of single nucleotide substitutions on splicing. Chapter 4 describes in detail the modifications of Spliceman – a computational web server that predicts defects in pre-mRNA splicing. The same chapter deals with the implementation of a novel classification/regression machine learning pipeline, Spliceman3, which we expect to release publicly soon. Spliceman2 was published as the following publication:
• K. J. Cygan, C. H. Sanford, and W. G. Fairbrother. Spliceman2: a computational web server that predicts defects in pre-mRNA splicing. Bioinformatics, 33: 2943–2945, 2017.
DOI: 10.1093/bioinformatics/btx343

Chapter 1
Introduction

Summary and contributions

Splicing is a post-transcriptional processing step in which intervening sequences (introns) are excised and coding sequences (exons) are ligated together to create the mature mRNA (messenger ribonucleic acid) molecule. It is a sequential process that is facilitated by the information in the RNA (ribonucleic acid) sequence (splicing regulatory elements/signals) and numerous RNA-binding proteins (trans factors). Splicing occurs through two biochemical steps, and is catalyzed by a large ribonucleoprotein known as the spliceosome. The spliceosomal machinery recognizes the core splicing signals and assembles in a stepwise fashion on the pre-mRNA molecule. These signals include the obligate 5′ splice site, 3′ splice site and branch site sequence, as well as enhancer and silencer sequences which functionally interact with RNA-binding proteins. A current research application in the field uses computational approaches to study splicing signals and predict the effect of single nucleotide variations on message processing.

Section 1.1 was published in the following manuscript:
• K. J. Cygan, A. J. Taggart, W. G. Fairbrother, and S. M. Mount. Messenger RNA splicing signals. eLS, 2017. 1–8 p. DOI: 10.1002/9780470015902.a0000888.pub2
Kamil J. Cygan, Allison J. Taggart, William G. Fairbrother, and Stephen M. Mount performed literature review and prepared the aforementioned manuscript.

Sections 1.2, 1.3, 1.5, 1.6, and 1.7 were published in the following manuscript:
• A. M. Fredericks, K. J. Cygan, B. A. Brown, and W. G. Fairbrother. RNA-Binding Proteins: Splicing Factors and Disease. Biomolecules, 5: 893–909, 2015. DOI: 10.3390/biom5020893
Alger M. Fredericks, Kamil J. Cygan, Brian A. Brown, and William G. Fairbrother performed literature review and prepared the aforementioned manuscript.
Section 1.4 is the result of my own machine learning literature review, includes nothing which is the outcome of work done in collaboration, and remains unpublished.

1.1 Messenger RNA splicing signals

Ribonucleic acid (RNA) splicing is the process by which sections of an RNA, known as introns, are removed, and other sections, known as exons, are joined together [15,16]. Splicing signals are elements in the deoxyribonucleic acid (DNA) sequence of a gene that specify the accurate splicing of its primary RNA transcript to generate its mature RNA product. In the case of genes encoding proteins, splicing signals are found immediately adjacent to splice sites and in the exons and introns flanking splice sites. Although protein-coding information must be present in exons in order to be actually translated into protein, not all exons code for protein. Indeed, many messenger RNAs have one or more exons that are entirely non-coding.

Splicing is a common step in the synthesis of messenger ribonucleic acid (mRNA) in all eukaryotic cells [17,18]. However, the typical size and number of introns per gene vary considerably among taxa [19,20]. A typical mammalian gene has between three and ten exons spread out over a region of between 2,000 and 10,000 nucleotides. In contrast, the yeast Saccharomyces cerevisiae, whose complete genome sequence is known, has introns in only 235 of its roughly 6,000 genes. In species with many introns, there is considerable variation among genes in the number and length of introns [19,20]. While the RNA transcripts from some human genes are not spliced at all, many other genes have dozens of introns [21]. An extreme example, the Duchenne muscular dystrophy gene, stretches for over 2.5 million nucleotides, and includes 79 exons which must be spliced together to form a messenger RNA of about 14,000 nucleotides. Such examples emphasize the need for accurate recognition of splicing signals.
1.1.1 The splicing of pre-mRNA via a lariat intermediate The boundaries between exon nucleotides (which are retained in the messenger RNA) and intron nucleotides (which are not) are referred to as splice sites. The exon/intron junction at the 50 end of the intron is known as the 50 splice site, and the intron/exon boundary at the other end of the intron is known as the 30 splice site. 50 splice sites and 30 splice sites are sometimes referred to as donor and acceptor sites, respectively. The splicing reaction itself consists of two consecutive phosphoryl transfer reactions in which the phosphate bonds that link nucleotides are exchanged (Figure 1.1).22,23 Following the assembly of a mature spliceosome, the first of the two phosphoryl transfer reactions joins the first nucleotide of the intron with a branch site within the intron, generally an A residue between 18 and 35 nucleotides upstream of the 30 splice site.24 In this reaction, the 20 hydroxyl group on the ribose ring of the branch nucleotide carries out a nucleophilic attack on the phosphodiester at the 50 splice site. As a result, a branched, or lariat, intermediate containing an unusual 20 – 50 phosphodiester 15 P. A. Sharp. Cell, 77: 805 – 15, 1994. 16 P. A. Sharp. Trends Biochem Sci, 30: 279 – 81, 2005. 17 E. Kim, A. Magen, and G. Ast. Nucleic Acids Res, 35: 125 – 31, 2007. 18 H. Kim et al. Nat Genet, 36: 915 – 6; author reply 916 – 7, 2004. 19 J. Merkin et al. Science, 338: 1593 – 9, 2012. 20 N. L. Barbosa-Morais et al. Science, 338: 1587 – 93, 2012. 21 T. W. Nilsen and B. R. Graveley. Nature, 463: 457 – 63, 2010. 22 M. C. Wahl, C. L. Will, and R. Luhrmann. Cell, 136: 701 – 18, 2009. 23 C. L. Will and R. Luhrmann. Cold Spring Harb Perspect Biol, 3: , 2011. 24 A. J. Taggart et al. Nat Struct Mol Biol, 19: 719 – 21, 2012. 4 1.1. 
bond is generated (Figure 1.1b).24 The first step of splicing also displaces the 3′ hydroxyl group of the last nucleotide of the 5′ exon, which is then free to carry out a nucleophilic attack on the phosphodiester at the 3′ splice site.23 This second step of splicing thus joins the two exons via a standard 3′–5′ bond and leaves a free branched intron. With the completion of splicing, the intron dissociates from the spliced exons, the 2′–5′ phosphodiester linkage is cleaved by a debranching enzyme, and the intron is degraded by ribonucleases.24

Figure 1.1: The chemistry of pre-mRNA splicing. (a) The two steps of splicing are indicated in cartoon form, (b) with an expanded view of the branch structure, including the unusual 2′–5′ phosphodiester bond.

The spliceosome, snRNPs, and RNA-binding proteins

Splicing is carried out by the spliceosome, a large macromolecular machine which forms around each intron and catalyzes splicing.22,23 The assembly of the spliceosome proceeds through a series of distinct intermediate steps that involve the recognition of splicing signals in pre-mRNA through interactions with specific splicing factors.22,23 These factors include proteins and small nuclear ribonucleoprotein particles (snRNPs), which consist of small nuclear ribonucleic acid (snRNA) complexed with several
proteins.25 Early in the process of spliceosome assembly, specific sites on the pre-mRNA are bound by a number of RNA-binding factors including the U1 and U2 snRNPs, heterogeneous nuclear ribonucleoproteins (hnRNPs), serine-arginine (SR) proteins, U2 auxiliary factor (U2AF), and splicing factor 1 (SF1).22,23 The U1 snRNP recognizes the GU of the 5′ splice site, while early recognition of the branch-point is performed by SF1, which is later displaced by the U2 snRNP. U2AF is a heterodimeric protein (U2AF65 binds the polypyrimidine tract, U2AF35 recognizes the AG of the 3′ splice site) that stabilizes the binding of the U2 snRNP to the branch-point sequence.26 HnRNPs and SR proteins both recognize loose consensus sequences in introns and exons and can either activate or repress splicing depending on where they bind in the pre-mRNA.26 Although there are a number of distinct steps required for the formation of a functional spliceosome (described in Section 1.2.1), the recognition of known splicing signals occurs early, and it is likely that the assembly of these early factors determines the outcome of splicing in most cases.

1.1.2 Conserved consensus sequences at the sites of splicing

The two steps of splicing define three sites at which phosphoryl transfer reactions take place: the 5′ splice site, the branch site and the 3′ splice site.25 Most introns show some similarity to each other at these three sites, and the nucleotide sequences at these sites contribute to the recognition of intron boundaries. For example, it is almost always the case that an intron can be represented by a sequence beginning GT (GU in the RNA) and ending AG.25

5′ splice site consensus

In addition to the conserved GU dinucleotide, 5′ splice sites are generally similar at the last three exon nucleotides and the first seven intron nucleotides adjacent to the site of splicing.25 This similarity can be represented by consensus sequence matrices like those shown in Figure 1.2a.
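One common way to use such a consensus matrix is to score a candidate site by summing per-position log-odds values. The following is a minimal sketch of that idea, using the 5′ splice site percentages reported in Figure 1.2a (positions −3 to +6). The uniform background, the pseudocount, and the function names are assumptions chosen for illustration; this is not the scoring scheme of any particular published tool.

```python
import math

# Percent nucleotide frequencies for the 5' splice site (Figure 1.2a),
# positions -3, -2, -1 (exon) followed by +1 ... +6 (intron).
FREQ_5SS = {
    "A": [33, 60, 8, 0, 0, 49, 71, 6, 15],
    "C": [37, 13, 4, 0, 0, 3, 7, 5, 19],
    "G": [18, 14, 81, 100, 0, 45, 12, 84, 20],
    "T": [12, 13, 7, 0, 100, 3, 9, 5, 46],
}
SITE_LENGTH = 9
BACKGROUND = 0.25   # assumed uniform background frequency
PSEUDOCOUNT = 1.0   # assumed pseudocount (in percentage points) to avoid log(0)


def consensus():
    """Return the most frequent nucleotide at each position."""
    return "".join(
        max("ACGT", key=lambda base: FREQ_5SS[base][i]) for i in range(SITE_LENGTH)
    )


def log_odds_score(site):
    """Sum of log2(observed / background) over the nine positions of a site."""
    site = site.upper().replace("U", "T")  # accept RNA or DNA spelling
    if len(site) != SITE_LENGTH or set(site) - set("ACGT"):
        raise ValueError("expected a 9-nucleotide sequence over A, C, G, T/U")
    score = 0.0
    for i, base in enumerate(site):
        column_total = sum(FREQ_5SS[b][i] for b in "ACGT")
        p = (FREQ_5SS[base][i] + PSEUDOCOUNT) / (column_total + 4 * PSEUDOCOUNT)
        score += math.log2(p / BACKGROUND)
    return score


print(consensus())
print(round(log_odds_score("CAGGTAAGT"), 2))  # consensus-like site
print(round(log_odds_score("CAGGCAAGT"), 2))  # +2 T replaced by C (a GC site)
```

A site that matches the most frequent nucleotide at every position receives the highest score, and a substitution at the nearly invariant +1 or +2 position is penalized far more heavily than one at a weakly conserved position such as −3.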
Interaction between the U1 snRNP and the pre-mRNA is mediated by base pairing between the 5′ end of the U1 snRNA and the 5′ splice site (Figure 1.2b).22,23 A comparison of U1 snRNA sequences from a variety of species shows that the sequence of U1 is highly conserved, and the most conserved region lies near the 5′ end of the RNA. U1 snRNAs from all species known (including plants, fungi and a variety of animals) have the sequence ACUUACCUG at, or very near to, the 5′ end.25 The sequence of many 5′ splice sites is strikingly similar to the exact complement of this sequence: CAG|GUAAGU. Furthermore, the most frequently occurring nucleotide at each position relative to the 5′ splice site is in every case complementary to the corresponding nucleotide in U1 snRNA.22,23

Although recognition of 5′ splice sites during the initial stages of splicing is primarily accomplished by the U1 snRNP, the U1 snRNP does not act in isolation. Not all sites bound by the U1 snRNP are ultimately used as 5′ splice sites. Factors bound to other sites on the pre-mRNA, some of which are discussed below, can promote binding between U1 snRNP and the 5′ splice site, or can facilitate progression along the pathway towards splicing. Furthermore, the 5′ splice site is later ‘examined’ by additional factors in the course of splicing. One such factor is U6 snRNA, which displaces U1 prior to the first catalytic step, and remains associated with the 5′ splice site throughout the remainder of the splicing reaction.22,23 In general, initial selection of the 5′ splice site by the U1 snRNP is influenced by other factors, and

25 Y. Lee and D. C. Rio. Annu Rev Biochem, 84: 291–323, 2015.
26 Rick Russell. Biophysics of RNA folding. New York, NY: Springer, 2013. vi, 236 p.
must be followed by appropriate interactions between the 5′ splice site and other components of the spliceosome (Section 1.2.1).

Figure 1.2a, nucleotide frequencies (%) and consensus at each position:

5′ splice site
Position    −3   −2   −1   +1   +2   +3   +4   +5   +6
A           33   60    8    0    0   49   71    6   15
C           37   13    4    0    0    3    7    5   19
G           18   14   81  100    0   45   12   84   20
T           12   13    7    0  100    3    9    5   46
Consensus    M    A    G    G    T    R    A    G    T

3′ splice site
Position   −16/−20  −10/−15  −7/−9  −5/−6   −4   −3   −2   −1   +1
A             16        9       8      7    22    4  100    0   25
C             32       34      40     41    33   74    0    0   13
G             17       12      11      6    22    0    0  100   52
T             35       45      41     46    22   21    0    0    9
Consensus      Y        Y       Y      Y          Y    A    G    G

Branch site
A           10   32    1   25   32   94    3
C           26   29   64   19   13    3   48
G           16   23    9    1   42    0   23
T           48   16   26   55   13    3   26
Consensus    T    N    C    T    R    A    C

Figure 1.2b, base pairing: the 5′ splice site (5′-CAGGUAAGUA-3′) pairs with the capped 5′ end of U1 snRNA (3′-GUCCAUUCAU-5′); the branch point (5′-UACUAAC-3′) pairs with U2 snRNA (3′-AUGAUG-5′).

Figure 1.2: Consensus sequences and base pairing of small nuclear RNAs (snRNAs). (a) Frequency matrices presenting the 5′ splice site, 3′ splice site, and branch-point consensus sequences. (b) Base pairing between consensus sequences and U snRNAs (U1: 5′ splice site; U2: branch site). The single unpaired nucleotide in the base pairing between U2 snRNA and the branch site is the A at which branch formation occurs. The U snRNA sequences shown are phylogenetically invariant, and the consensus sequences shown are similar in all eukaryotes. However, the sequence of any individual splice site is likely to differ from the sequence shown. As a result, the extent of base pairing is generally less than depicted here. Additional base pairing interactions between snRNAs and the pre-mRNA, or among snRNAs, that occur late in the splicing process are not shown.

Branch site consensus

While the majority of splice sites in sequenced species are easily identified from comparison of complementary deoxyribonucleic acid (cDNA) and genomic sequences, branch sites are more difficult
While this is still an active area of research, thousands of branch sites have been identified in several species and these examples are sufficient to establish significant similarities among branch sites.24 The branch site nucleotide is usually (though not always) an A. In the yeast S. cerevisiae, branch site sequences are very strongly conserved, and branch formation almost always occurs within the exact sequence UACUAAC, at the final A of this sequence. This branch site sequence base pairs with U2 snRNA to form a duplex in which the branch-point A is bulged out (depicted in Figure 1.2b).25 In human introns, there is much more variation in both the sequence around the branch site and the bulged branch site nucleotide itself (sometimes a C nucleotide is used instead of an A).24 Sites without much resemblance to the consensus can be used in the absence of suitable alternative sites. The branch site sequence can be recognized by the RNA-binding protein SF1 (or BBP, branch site-binding protein).22,23 Subsequently, the U2 snRNP binds to the branch site. This latter interaction is facilitated by the base pairing between U2 snRNA and UACUAAC as well as by protein–protein interaction between the U2 snRNP and U2AF.26

3′ splice site consensus

As described, the 3′ splice site usually occurs immediately 3′ of the trinucleotide CAG or UAG (and occasionally AAG, but never GAG).25 In addition, there is some conservation of the first nucleotide of the exon adjacent to the 3′ splice site, bringing the number of conserved 3′ splice site nucleotides to four. How is it possible for the 3′ splice site to be specified by so few nucleotides?
The answer to this question lies in the sequence immediately upstream of the 3′ splice site, which is generally rich in pyrimidines.26 In the introns of multicellular animals, this pyrimidine-rich stretch often consists of eight or more consecutive U or C residues lying between the 3′ splice site and the branch site. This pyrimidine tract is bound by U2AF, a dimeric protein which then recruits the U2 snRNP to the branch-point immediately upstream.26 In addition, the branch site itself can be recognized by SF1. Recognition of the branch site by SF1 and the pyrimidine tract by U2AF is cooperative, so that the branch site–pyrimidine tract region is recognized as a unit.26 The 3′ splice site itself is generally the first AG dinucleotide downstream of the branch site.24

The relative contribution of these three sequence elements (branch site, pyrimidine tract and the 3′ splice site itself) varies among introns and among species. A comparison of consensus sequences in the vicinity of the 3′ splice site in the nematode worm Caenorhabditis elegans, the yeast S. cerevisiae and mammals provides an illustrative contrast. In the case of yeast, the branch site is nearly invariant and provides enough information to specify splicing in most cases.17 In the worm, no branch site consensus can be discerned, yet the 3′ splice site is nearly always UUUYAG (where Y is either C or U).17 In mammals, the pyrimidine tract provides most of the information.17 In summary, the 3′ splice site is recognized as a cluster of three distinct consensus sequences: the branch site, the pyrimidine tract and the 3′ splice site itself. These three splicing signals are recognized in a cooperative manner, and may be thought of as components of a single signal.

Nonconsensus introns

Not all introns follow the GT–AG rule. One class of exception is made up of 5′ splice sites (about 0.5% of human examples) that closely resemble the normal splice site consensus, but deviate at the
GT by having GC.27 These sites match the standard consensus quite well elsewhere in the 5′ consensus region, and it is likely that their recognition occurs by a mechanism which is essentially identical to that by which standard GT 5′ splice sites are recognized.27

AT–AC introns and U12-dependent introns: a parallel spliceosome

However, an altogether distinct class of intron has been observed. Many of these have the consensus |ATATCCTT at the 5′ end, CCTTRACCY at the branch site and YAC| at the 3′ end.28 The study of such ‘AT–AC’ (pronounced ‘attack’) introns led to the discovery that a class of low-abundance snRNPs recognize many of these sites.29 These introns are recognized by the minor (low abundance) snRNPs U11 and U12, which have regions of complementarity to the 5′ splice site and branch site, respectively. The later steps of splicing are quite analogous to splicing by the major U2 spliceosome, and involve dedicated homologues of U4 and U6 known as U4atac and U6atac. U5 plays a role in the function of both spliceosomes.28 Once U12 spliceosomes (including not only U12, but also U11, U4atac and U6atac) were discovered, further work showed that although there is almost always agreement between the dinucleotides at the two ends of an intron (GT going with AG and AT going with AC), the identity of these pairs does not always correlate with the identity of the spliceosome that carries out the splicing.28 A small but significant fraction (about 1 in 600) of GT–AG introns carry telltale nucleotides indicating recognition by the U12 spliceosome.
This prediction has been experimentally verified for some examples.28 In fact, the number of GT–AG introns recognized by the U12 spliceosome appears to be greater than the number of AT–AC introns recognized by the U12 spliceosome.30

1.1.3 Splicing signals outside of the canonical splice sites

The splice site and branch site consensus sequences discussed in the previous section (and depicted in Figure 1.2a) are essential for splicing, and it is well established that they serve as the primary determinants of splice site selection.31 However, these consensus sequences are short and are imperfectly matched in the case of almost all specific examples. As a result, most real splice sites do not contain enough information to be reliably distinguished from many sequences that are not splice sites. Potential splice sites in pre-mRNAs that match the splice site consensus well, but are not used, have been shown to be fully capable of functioning as splice sites by experimental mutation of the natural splice site, or by alteration of their location within a gene.32 Such results confirm the theoretical observation that the information at the splice site consensus sequences is not sufficient to determine the outcome of splicing. This implies that there must be information outside of the splice sites themselves that determines whether or not they are used. This information includes signals known as splicing enhancers that act at a moderate distance to promote splicing. Conversely, splicing silencers are sequences that act to suppress splicing. Additional information is provided by differences in the sequence composition of introns and exons, and by the spatial relationships among potential sites of RNA processing. Splicing signals outside of the splice sites themselves (splicing enhancers

27 T. A. Thanaraj and F. Clark. Nucleic Acids Res, 29: 2581–93, 2001.
28 C. B. Burge, R. A. Padgett, and P. A. Sharp. Mol Cell, 2: 773–85, 1998.
29 S. L. Hall and R. A. Padgett.
Science, 271: 1716–8, 1996.
30 A. Levine and R. Durbin. Nucleic Acids Res, 29: 4006–13, 2001.
31 N. Behzadnia et al. EMBO J, 26: 1737–48, 2007.
32 M. Amit et al. Cell Rep, 1: 543–56, 2012.

and silencers, sequence composition and context) make essential contributions to accurate splice site selection in vivo.9 It should be noted that while the term ‘splicing signal’, as used here, embraces all of the information required to specify accurate splicing, many authors reserve the term ‘splicing signals’ for the consensus sequences at the splice sites and branch site.

Exonic splicing enhancer (ESE) sequences and SR proteins

Splicing enhancers are sequences that stimulate splicing at a distance (typically between 40 and 200 nucleotides) from the splice sites.2,33 Although splicing enhancers have been identified in both exons and introns,33,34 exonic splicing enhancers (ESEs) are generally better characterized, and are probably more common. Many ESEs can be bound by one of several related splicing factors known as SR proteins. SR proteins contain either one or two RNA-binding domains and ‘RS’ domains that are characterized by numerous arginine–serine dipeptide repeats.26 SR proteins are essential not only for the completion of splicing but also for each of the first three recognizable steps of spliceosome assembly.35–38 In vitro, any one of the several SR proteins can restore splicing to a splicing extract lacking SR proteins. Thus, the essential functions of individual SR proteins in splicing are at least partially redundant.35–38 Individual SR proteins differ with respect to the sequence specificity of their RNA-binding domains and with respect to their ability to recognize and activate different ESE sequences.39 SR proteins have been implicated in recruiting early spliceosome components and stabilizing the interactions among them.
Some of the interactions include bridging the U1-70K binding domain to the pre-mRNA transcript at the 5′ splice site40 and recruiting the U2AF auxiliary factor of the 3′ splice site via interactions with the U2 snRNP.41

Splicing silencer sequences and intronic splicing enhancer (ISE) sequences

The converse of splicing enhancer sequences, splicing silencer sequences, also occur. In some cases, these sequences are bound by hnRNP proteins, such as polypyrimidine tract-binding protein (PTB, also known as hnRNP I) and hnRNP A1, which are likely to mediate repression.26 For example, hnRNP A1 requires two tandem repeats of UAGGGA/U in order to efficiently bind the pre-mRNA.26 HnRNP A1 also binds random RNA sequence with weak affinity, which allows it to attach at nearby sites once all high-affinity locations are occupied, spread across the message, and disrupt secondary structure or the binding of other RNA-binding proteins (RBPs).26 Splicing enhancer sequences in introns have also been described.34 In several cases, these enhancer sequences act to prevent the binding of a negatively acting factor. Competition between proteins for binding at these sites can lead to either repression or release from repression. The mechanism of action of splicing enhancers and

9 H. Y. Xiong et al. Science, 347: 1254806, 2015.
2 W. G. Fairbrother et al. Science, 297: 1007–13, 2002.
33 W. G. Fairbrother et al. Nucleic Acids Res, 32: W187–90, 2004.
34 Y. Wang et al. Nat Struct Mol Biol, 19: 1044–52, 2012.
35 J. C. Long and J. F. Caceres. Biochem J, 417: 15–27, 2009.
36 J. L. Manley and A. R. Krainer. Genes Dev, 24: 1073–4, 2010.
37 M. L. Anko. Semin Cell Dev Biol, 32: 11–21, 2014.
38 Z. Zhou and X. D. Fu. Chromosoma, 122: 191–207, 2013.
39 D. Ray et al. Nature, 499: 172–7, 2013.
40 S. Cho et al. Proc Natl Acad Sci U S A, 108: 8233–8, 2011.
41 Y. Zhang et al. Nucleic Acids Res, 41: 1343–54, 2013.
splicing silencers is an active area of research and it is likely that additional and varied mechanisms of action will emerge shortly.

Base or oligonucleotide composition

In some species, the base composition of the exons is quite different from the base composition of the introns that flank them.32 This is especially true in dicotyledonous plants, whose introns have an average AT content in excess of 70%, in contrast to the flanking exons, which have an AT content of approximately 50%. Thus, it is a simple matter to distinguish between intron and exon sequences, and experimental results indicate that base composition (in particular, runs of U residues in the intron) directly contributes to the recognition of introns. In mammals, short runs of consecutive G residues appear to play a similar role in specification of introns. In addition, differential GC content of exons and surrounding introns has been reported as a strong signal of exon recognition by the splicing machinery.32 Presumably, there are proteins that mediate recognition of these elements, and high AT content, abundance of G stretches, or differential GC content can be thought of as a special case of intronic splicing enhancer sequences. Although distinguishing between exons and introns on the basis of pervasive signals is appealing, the factors that recognize such differences have not been identified.

Exon definition

Splice sites are generally recognized by the splicing machinery as appropriately spaced pairs of splice sites (Figure 1.3). Of course, splice sites can pair across an intron. They must, or splicing could not occur!
However, splice sites can also pair across an exon, in which case the pairing serves only to activate the sites, which are then joined to other partners.42 Whether pairing occurs across an intron or across an exon, the existence of an appropriately spaced partner serves to activate a splice site, and this spacing can often be critical for recognition of a splice site. The phenomenon can be seen very clearly in the outcome of experimental and natural mutations in splice sites. A variety of outcomes resulting from the mutation of a 5′ splice site are depicted in Figure 1.4 (similar results are observed when 3′ splice sites are mutated). The outcome that one would expect (based on the straightforward assumption that a consensus 5′ splice site signal contributes to the specification of the intron whose removal defines it as a 5′ splice site) is retention of the flanking intron, which is depicted in Figure 1.4 as alternative A. However, this result is observed much less often than two other outcomes. The most commonly observed result of splice site mutation is exon skipping, in which the exon bearing the mutant 5′ splice site is skipped, and the upstream exon is joined directly to the downstream exon (Figure 1.4, alternative B).43 This result can be understood if the splice sites involved normally pair across the exon (as shown in Figure 1.3b) and such pairing is essential to their use. Mutation of one site interferes with the ability of the other site to be recognized, and the entire exon is skipped. Although the molecular interactions that contribute to exon definition have not been fully defined, it is clear that the U1 snRNP and U2AF (which recruits the U2 snRNP) play some role.42 In addition, a correlation between splice site strength (how well each of the sites matches the consensus sequence) and exon inclusion has been reported.44 The second most frequent result of mutational inactivation of

42 S. M. Berget. J Biol Chem, 270: 2411–4, 1995.
43 D. L. Black.
Annu Rev Biochem, 72: 291–336, 2003.
44 P. J. Shepard et al. Nucleic Acids Res, 39: 8928–37, 2011.

a 5′ splice site is splicing at a cryptic 5′ splice site. Such sites are in the vicinity of the natural site, and can be either upstream or downstream of the authentic splice site (as shown in Figure 1.4, alternative C). This outcome demonstrates the role of splicing signals other than the splice sites themselves.

Figure 1.3: Examples of interactions among factors that recognize splicing signals. Appropriate spacing of splice sites across either (a) an intron or (b) an exon facilitates spliceosome assembly. (c) Recognition of an exonic splicing enhancer by an SR protein. (d) Repression of splicing mediated by heterogeneous nuclear ribonucleoproteins (hnRNPs) bound at sites within introns.

5′ terminal exons: activation of splicing by the 5′ cap

All RNA polymerase II transcripts contain a 7-methylguanosine cap at the 5′ terminus.45 This structure is required for both stability and translation of mRNAs. The cap structure is also required for splicing both in vivo and in vitro.45,46 After export of mRNA from the nucleus, the cap is bound by factors involved in the initiation of translation. Prior to export, the cap is bound by the dimeric complex CBC

45 M. M. Konarska, R. A. Padgett, and P. A. Sharp. Cell, 38: 731–6, 1984.
46 K. Inoue et al. Genes Dev, 3: 1472–9, 1989.

Figure 1.4: Different outcomes of the failure to recognize a specific splicing signal.
When a splicing signal (such as the 5′ splice site at the 3′ end of exon 2) is defective, one might expect the affected intron to be retained (alternative A). However, the most commonly observed result is exon skipping (alternative B). Another result often observed is splicing at a cryptic 5′ splice site (alternative C). These outcomes demonstrate the role of splicing signals other than the splice sites themselves.

(cap-binding complex) of nuclear cap-binding proteins CBP20 (the subunit that specifically binds the cap) and CBP80 (a necessary cofactor of CBP20).47 These factors associate directly with the U1 snRNP and stimulate splicing at the adjacent 5′ splice site in the first intron of the messenger RNA precursor.48 It is likely that this interaction serves to activate 5′ terminal exons in much the same way that internal exons are activated by associations between the two flanking splice sites.

Cooperation between splicing and polyadenylation

The 3′ ends of messenger RNAs are generated by cleavage and polyadenylation, a reaction that is carried out by a complex of factors soon after the appropriate sequences have been synthesized. Recognition of the 3′ splice site of the final intron and recognition of the polyadenylation site are mutually stimulatory, both in vivo and in vitro.49,50 Mutation of the 3′ splice site within the last intron reduces the efficiency of polyadenylation. Conversely, mutation of the polyadenylation signal reduces the efficiency with which the terminal intron is removed.42,50 This reciprocal stimulation is analogous to the stimulation of splice sites across an exon during exon definition. Because many alternatively spliced genes differ at their 3′ termini, it is likely that 3′ terminal exon definition plays an important regulatory role in many genes.

47 C. Mazza et al. EMBO J, 21: 5548–57, 2002.
48 J. D. Lewis et al. Genes Dev, 10: 1683–98, 1996.
49 Y. Li et al. RNA, 7: 920–31, 2001.
50 S. Vagner, C. Vagner, and I. W. Mattaj.
Genes Dev, 14: 403–13, 2000.

Trans-splicing

In some species (not including humans) a significant number of genes are spliced in trans, meaning that two exons from different primary transcripts encoded by different regions of the genome are joined in a bimolecular splicing reaction.51 The 5′ exon in natural trans-splicing reactions is always a snRNA known as the SL (spliced leader) RNA. Because the 5′ splice site in trans-splicing reactions is contributed by a snRNA, it is not necessary for this site to be recognized by the U1 snRNP, and the U1 snRNP is indeed not required for trans-splicing.51

1.1.4 Prediction of causal variants affecting splicing

If splicing signals were completely understood, it would be possible to accurately predict where splicing will occur and which sequence variants will affect gene processing by inspecting a sequence with the help of a computer. In fact, this is precisely the problem addressed by the numerous machine learning algorithms that predict the effects of single nucleotide variants on splicing.7–10 With the advent of new high-throughput methods, even more data is pouring in that needs to be processed through rigorous computational pipelines.39,52 These methods include RNA sequencing experiments that measure gene expression at individual exon resolution, cross-linking immunoprecipitation (HITS-CLIP, iCLIP, eCLIP) studies that allow scientists to discover genome-wide RNA binding sites for splicing factors in vivo, and RNAcompete datasets that indicate the sequence preferences of many RBPs in vitro. Because large amounts of DNA sequence have already been determined, along with large numbers of sequence variants detected through various whole genome sequencing (WGS) and whole exome sequencing (WES) projects, identification of causal splicing variants is an important problem.53 However, two distinct subproblems must be distinguished.
The first problem involves identifying the information in the pre-mRNA sequence (sequence features) that is recognized by the cellular splicing machinery for accurate splicing (splicing code modeling). A complete understanding of splicing signals should allow one to accurately predict splicing outcomes from sequence information. The second problem, on the other hand, is the accurate prediction of sequence variants that may have an effect on splicing and splicing efficiency (building of precise classifiers). This problem does not require all features used by the classifier to be biologically relevant; even biologically irrelevant signals may contribute to a higher accuracy of the predictions. For the first problem, the most challenging step in gaining new insights about the process of splicing is the filtering of features (feature selection) down to those that were or can be validated in vivo. For the second problem, the most difficult task is discovering a great number of new features (feature engineering). Although much is understood about the splicing process, the current state of the art is that the best software predicts the effect of individual variants correctly only about 80% of the time, with only limited agreement among these in silico splicing prediction tools.7–10 In addition, we are still lacking a clear set of ‘important’ features that can teach us about the full splicing code. Clearly, computational methods (some of which will be addressed in Chapter 4) have a long way to go!

51 T. Horiuchi and T. Aigaki. Biol Cell, 98: 135–40, 2006.
7 K. H. Lim and W. G. Fairbrother. Bioinformatics, 28: 1031–2, 2012.
8 M. Mort et al. Genome Biol, 15: R19, 2014.
10 A. B. Rosenberg et al. Cell, 163: 698–711, 2015.
52 J. Boucas. Methods Mol Biol, 1720: 111–129, 2018.
53 M. Lek et al. Nature, 536: 285–91, 2016.
RNA-binding protein (RBP) motifs

One task that is relatively simple computationally is the recognition of RBP motifs created or disrupted by single nucleotide variants, which can be approached using either a matrix method or more sophisticated approaches, such as neural nets.39,54 Generally, variant/splicing prediction software includes a component that is specialized for recognizing the creation or disruption of RBP binding motifs.9 However, binding of those sequences in vivo depends on more than the RBP motifs alone: it also involves other variables, including the concentration of proteins, the contribution of other splicing enhancers and suppressors in close proximity, and contextual information. To make the matter even more challenging, multiple proteins have very similar recognition sequences, and it is extremely difficult to determine which protein is involved at a particular location and contributes to the splicing outcome without further laboratory testing and validation.39 Multiple experiments and computational analyses led to the development of what are known as ‘motif maps’, which characterize the locations of motifs and their relation to splicing for a small subset of RBPs.55

Major limitations: terminal exons and indels

The prediction of variants affecting splicing in terminal exons depends on the recognition of sequence features common to terminal exons that are distinct from sequence features common to other coding sequence. For example, terminal exons are usually longer than other exons.42 In fact, most existing splicing prediction software does not even attempt to identify effects of variants on splicing that fall within non-coding or terminal exons.8–10 Finally, splicing prediction software that incorporates information about insertions, deletions and tandem repeats has so far been non-existent and is an active area of research in the field.
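The matrix method mentioned above for detecting motifs created or disrupted by a single nucleotide variant can be sketched as follows. This is an illustrative example only: the position frequency matrix is a hypothetical toy model loosely resembling a UAGGGA-like element, its numbers are invented for demonstration rather than taken from binding data, and the function names are assumptions. The idea is to score every motif-width window that overlaps the variant in both the reference and the alternative sequence and to compare the best scores.

```python
import math

# Hypothetical toy position frequency matrix for a 6-nt UAGGGA-like motif.
# The probabilities are illustrative only, not measured binding data.
TOY_PFM = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "U": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "U": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "U": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "U": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "U": 0.05},
    {"A": 0.60, "C": 0.05, "G": 0.05, "U": 0.30},
]
BACKGROUND = 0.25  # assumed uniform background frequency


def window_score(kmer, pfm=TOY_PFM):
    """Log-odds score of one window whose length equals the motif width."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pfm, kmer))


def best_overlapping_score(seq, pos, pfm=TOY_PFM):
    """Best score among all windows that cover 0-based position pos."""
    width = len(pfm)
    if len(seq) < width or not 0 <= pos < len(seq):
        raise ValueError("sequence too short or position out of range")
    first = max(0, pos - width + 1)
    last = min(pos, len(seq) - width)
    return max(window_score(seq[s:s + width], pfm) for s in range(first, last + 1))


def variant_effect(ref_seq, pos, alt_base, pfm=TOY_PFM):
    """Change in best motif score (alt minus ref) caused by a substitution."""
    ref_seq = ref_seq.upper().replace("T", "U")
    alt_base = alt_base.upper().replace("T", "U")
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return best_overlapping_score(alt_seq, pos, pfm) - best_overlapping_score(ref_seq, pos, pfm)


# A G-to-C change inside the motif lowers the best score (motif disrupted);
# the reverse change raises it (motif created).
print(round(variant_effect("CCUAGGGACC", 4, "C"), 2))
print(round(variant_effect("CCUACGGACC", 4, "G"), 2))
```

A negative value suggests a disrupted motif and a positive value a created one. As the text notes, such a score captures only the sequence motif itself and ignores protein concentration, neighbouring elements and context.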
In summary, progress in computational tools predicting causal single nucleotide variations represents a mutual feedback between the computational recognition of splicing signals and understanding the biological recognition of splicing signals. Thus, understanding how splicing signals are recognized by the cellular splicing machinery will contribute to their computational recognition, and computational approaches can lead to the discovery of splicing signals that would have otherwise been overlooked.

1.2 RNA-binding proteins: splicing factors and disease

Most genes in higher eukaryotes are composed of introns (non-coding segments) and exons (coding segments). The majority of human intron removals are catalyzed by a large and dynamic ribonucleoprotein (RNP) complex called the spliceosome. This process involves two sequential transesterification reactions which ligate the exons and release the intron as a lariat.

Alternative splicing allows for the production of many proteins from a single genomic locus. This greatly augments the repertoire of proteins that can be produced by a given gene and plays an important role in evolution, development, and disease. It is estimated that 95% of human genes are alternatively spliced.56 The regulation of splice site usage is not well understood, but has been shown to be specific to species, populations within a species, and tissues within individuals.6,56,57 Misregulation of splicing often results in exon skipping (truncation of the translated protein), intron retention (translation of non-coding regions), or alternative splice site usage (alteration of protein composition).

54 B. Alipanahi et al. Nat Biotechnol, 33: 831–8, 2015.
55 J. Ule et al. Nature, 444: 580–6, 2006.
56 Q. Pan et al. Nat Genet, 40: 1413–5, 2008.
Changes in transcript levels or transcript ratios are generally more deleterious than substitutions and have been implicated in several RNA-dependent diseases.58–60 Splicing is a sequential process facilitated by the interaction of cis-sequence elements and trans-acting RNA-binding proteins (RBPs). Splicing can be highly variable as mRNA-RBP interactions are transient and of relatively low specificity. Changes in cis-sequence and levels of trans-factors can alter splicing and cause disease. In fact, approximately one third of all disease alleles are thought to affect splicing.61

1.2.1 The core spliceosome

The major spliceosome is made up of five small nuclear ribonucleoproteins (snRNPs), U1, U2, U4, U5, and U6, and catalyzes ≈99% of splicing in humans.62–65 The spliceosome is one of the more complex macromolecular assemblies in eukaryotic cells, consisting of over 300 different proteins.66,67 The 5′ and 3′ splice sites (ss) are recognized in a coordinated manner. Exon definition is initiated by the U1 snRNP, which binds to the 5′ ss motif, and the splicing factor SF1, which binds to the branch-point sequence just upstream of the 3′ ss. This commits the pre-mRNA transcript to the splicing pathway and forms the commitment (E′) complex.68 Next an additional splicing factor, U2AF65, binds cooperatively with SF1 to recognize the polypyrimidine tract between the branch-point and the 3′ ss, as well as the 3′ ss itself, forming the early (E) complex.41 The U2 snRNP then displaces SF1 by base pairing at the branch-point motif in an ATP-dependent manner, forming the ATP-dependent (A) complex (Figure 1.5). This process is catalyzed by the RNA helicases Prp5 and Sub2. The Prp5 helicase helps the base pairing interaction by binding the U2 subunit and stabilizing the branch-point-interacting stem-loop (BSL) which actually base pairs with the intron.69,70 Sub2 is necessary to stabilize the interaction between the RNA branch-point and the U2 subunit.
The U4, U5 and U6 tri-snRNP is then recruited to form the pre-catalytic spliceosome or the B complex.71 This process is catalyzed by the DEAD-box helicase Prp28. This protein releases the U1 snRNP during the recruitment of the U4, U5, and U6 snRNP.72 The pre-catalytic spliceosome goes through a series of conformational changes catalyzed by RNA helicases Brr2, Snu114, and Prp2, which lead to the release of the U1 and U4 subunits, forming the activated B (B*) spliceosome. Brr2 helps retain U5 and U6 while releasing U1; Snu114 and Brr2 assemble on the U5 snRNA to produce the U5 snRNP; and Prp2 is responsible for destabilizing the RNA core of the spliceosome to catalyze the conformational change from the B complex to the C1 complex.73–75

6 E. T. Wang et al. Nature, 456: 470–6, 2008.
57 J. J. Merkin et al. Cell Rep, 2015.
58 R. J. Osborne and C. A. Thornton. Hum Mol Genet, 15 Spec No 2: R162–9, 2006.
59 J. R. O’Rourke and M. S. Swanson. J Biol Chem, 284: 7419–23, 2009.
60 L. P. Ranum and T. A. Cooper. Annu Rev Neurosci, 29: 259–77, 2006.
61 K. H. Lim et al. Proc Natl Acad Sci U S A, 108: 11093–8, 2011.
62 M. S. Jurica and M. J. Moore. Mol Cell, 12: 5–14, 2003.
63 X. Roca et al. Genes Dev, 26: 1098–109, 2012.
64 K. Hartmuth et al. Proc Natl Acad Sci U S A, 99: 16719–24, 2002.
65 Z. Zhou et al. Nature, 419: 182–5, 2002.
66 M. S. Jurica et al. RNA, 8: 426–39, 2002.
67 T. W. Nilsen. Bioessays, 25: 1147–9, 2003.
68 O. A. Kent, D. B. Ritchie, and A. M. Macmillan. Mol Cell Biol, 25: 233–40, 2005.
69 W. W. Liang and S. C. Cheng. Genes Dev, 29: 81–93, 2015.
70 R. Perriman and Jr. Ares M. Mol Cell, 38: 416–27, 2010.
71 A. G. Matera and Z. Wang. Nat Rev Mol Cell Biol, 15: 108–21, 2014.
72 S. Mohlmann et al. Acta Crystallogr D Biol Crystallogr, 70: 1622–30, 2014.

The subsequent splicing events take place in two steps.
First, the U2-associated protein complexes SF3a and SF3b are released and expose the branch-point, allowing a nucleophilic attack by the branch-point 2′ OH group on the 5′ splice site. This results in a free 5′ exon and a lariat intron intermediate (C1 complex).76 The second step of splicing is promoted by the Prp8 protein, which cross-links the U5 and U6 snRNP.77 In this step the 3′ OH of the 5′ exon attacks the 3′ ss, forming the C2 complex. The remaining snRNPs and associated factors are disassembled, the exons are ligated, and the intron lariat is released and rapidly degraded by the cell.78 Exons can be constitutive (included in all isoforms of the transcript) or alternative (included in only some isoforms of the transcript). The availability and recruitment of the associated splicing factors have been demonstrated to regulate this choice by influencing splice site efficiency (relative strength) and, as a result, splice site usage. The two major classes of splicing factor RBPs are the heterogeneous nuclear ribonucleoproteins (hnRNPs) and serine-arginine (SR) proteins. These two classes have opposing enhancing and repressive activities that often depend upon where they bind.

1.2.2 HnRNPs, SR proteins, and other splicing factors

HnRNP proteins are a well characterized class of RBPs which perform their functions in large homopolymer complexes as opposed to diverse ribonucleoprotein complexes.79 These aggregates are made up of major hnRNP proteins that form the core of the hnRNP aggregates, and minor proteins that are more transiently associated with a subset of hnRNP homopolymer complexes.80–82 Although there remains a class of uncharacterized hnRNP proteins, the majority (over 50%) have been characterized to play a role in splicing.83 Other functions include mRNA export, localization, translation, and stability. HnRNPs bound to exonic motifs function as splicing suppressors. HnRNPA1, for example, binds to a high affinity RBP motif in the third exon of the HIV-1 gene.
Additional hnRNPs are recruited, and the subsequent homopolymer inhibits splicing by disrupting spliceosome assembly.84 HnRNP binding motifs in introns conversely have been shown to enhance splicing. HnRNPH has been shown to enhance splicing in the mouse src gene, resulting in a neuron-specific isoform.85

73 L. Zhang et al. Nucleic Acids Res, 43: 3286–97, 2015.
74 V. Nancollis et al. J Cell Biochem, 114: 2770–84, 2013.
75 A. M. Wlodaver and J. P. Staley. RNA, 20: 282–94, 2014.
76 R. M. Lardelli et al. RNA, 16: 516–28, 2010.
77 W. P. Galej et al. Nature, 493: 638–43, 2013.
78 S. M. Fica et al. Nature, 503: 229–34, 2013.
79 L. L. Piccolo, D. Corona, and M. C. Onorati. Chromosoma, 123: 515–27, 2014.
80 E. L. Matunis, M. J. Matunis, and G. Dreyfuss. J Cell Biol, 116: 257–69, 1992.
81 R. Martinez-Contreras et al. Adv Exp Med Biol, 623: 123–47, 2007.
82 S. P. Han, Y. H. Tang, and R. Smith. Biochem J, 430: 379–92, 2010.
83 N. Han, W. Li, and M. Zhang. J Cancer Res Ther, 9 Suppl: S129–34, 2013.
84 C. Rollins et al. Biochemistry, 53: 2172–84, 2014.
85 N. Rooke et al. Mol Cell Biol, 23: 1874–84, 2003.

Figure 1.5: Stepwise assembly of the early spliceosome highlighting the known splicing factors that bind to the substrate. (Panel labels: pre-mRNA with 5′ splice site, branch-point and 3′ splice site; E′ complex; E complex; A complex. Legend: intron, exon, polypyrimidine tract.)

SR proteins are a large family of RBPs that were first described in the early 1990s by the Gall and Roth laboratories independently.
The Gall laboratory identified four SR proteins (SRp20, SRp40, SRp55 and SRp75)86 using the monoclonal antibody mAb104 against the phosphorylated epitope of the SR protein in Xenopus laevis.87 Concurrently, the observation that the B52 antibody bracketed RNA polymerase II (Pol II) on Hsp70 loci of polytene chromosomes in Drosophila melanogaster provided a link between the B52 splicing factor and SF2/ASF, which was previously implicated in constitutive and alternative splicing.88–90 Ultimately three SR proteins were identified: suppressor-of-white-apricot (SWAP),91 Transformer (Tra),92 and Transformer-2 (Tra-2).93,94 SR proteins are named for their conserved Arg/Ser (RS) binding domain, which distinguishes them from most other RBPs. This domain is found near the C-terminus of the protein and promotes protein-protein interactions between the SR protein and the spliceosome.95 SR proteins have been shown to recruit and stabilize interactions between: U1 snRNP and the 5′ ss, by bridging the U1-70K binding domain to the pre-mRNA transcript; U2AF and the 3′ ss, through U2 snRNP interactions; and the U4/U6.U5 tri-snRNP and the pre-spliceosome complex, by promoting the formation of the cross-intron complex.38,40,96 Improper phosphorylation of SR proteins, however, has been shown to block U2 from binding the 3′ ss, so that the SR protein functions as a splicing inhibitor.97

86 A. M. Zahler et al. Genes Dev, 6: 837–47, 1992.
87 M. B. Roth, C. Murphy, and J. G. Gall. J Cell Biol, 111: 2217–23, 1990.
88 H. Ge, P. Zuo, and J. L. Manley. Cell, 66: 373–82, 1991.
89 A. R. Krainer et al. Cell, 66: 383–94, 1991.
90 H. Ge and J. L. Manley. Cell, 62: 25–34, 1990.

Although hnRNPs and SR proteins are thought to be the major RBPs regulating splicing, other RBPs have recently been implicated in influencing splicing.
RBPs from several other protein families with previously undefined roles in splicing have recently garnered great interest and are now being implicated as key splicing regulators. One example is the splicing factor FUS. FUS is a member of the FET protein family along with EWSR1 and TAF15.98 The functions of FET family proteins have not been well characterized, but recent studies suggest that FUS is involved in transcription, splicing and mRNA transport, microRNA processing, DNA repair, and cell proliferation.99 The C-terminus of FUS contains an RNA binding domain with several RNA binding motifs including three arginine-glycine-glycine boxes, a zinc finger, and an RNA recognition motif, though the exact residues involved in interactions with RNA are yet to be described in the literature. The N-terminus contains an SYGQ-rich domain which binds transcription factors and activates transcription through interactions with Pol II. As approximately 80% of splicing occurs cotranscriptionally,100 the interplay between Pol II activation and RNA binding makes FUS an interesting splicing factor candidate.

1.3 Three mechanisms of RBP-related splicing dysregulation

Here we describe three basic disease mechanisms caused by dysfunctional mRNA-RBP interactions: the disruption of cis-elements, toxicity conferred by mutant mRNA transcripts, and the loss of trans-acting factors. The first mechanism will be described in more detail and supported with data in Chapter 2. The last two mechanisms are not covered in the analytical chapters of this thesis.

1.3.1 Mechanism I: disruption of a splicing element

Non-coding point mutations that cause splicing defects constitute about 13.5% of hereditary disease alleles reported in the Human Gene Mutation Database (HGMD). A wide range of common human diseases, such as Ataxia Telangiectasia, Retinitis Pigmentosa, breast cancer, and Cohen’s Syndrome, are caused by changes in splice site recognition.101–103

91 T. B. Chou, Z. Zachar, and P. M. Bingham. EMBO J, 6: 4095–104, 1987.
92 R. T. Boggs et al. Cell, 50: 739–47, 1987.
93 H. Amrein, M. Gorman, and R. Nothiger. Cell, 55: 1025–35, 1988.
94 P. J. Shepard and K. J. Hertel. Genome Biol, 10: 242, 2009.
95 A. Busch and K. J. Hertel. Wiley Interdiscip Rev RNA, 3: 1–12, 2012.
96 M. Schneider et al. Mol Cell, 38: 223–35, 2010.
97 S. Furuyama and J. P. Bruzik. Mol Cell Biol, 22: 5337–46, 2002.
98 M. Neumann et al. Brain, 134: 2595–609, 2011.
99 H. Deng, K. Gao, and J. Jankovic. Nat Rev Neurol, 10: 337–48, 2014.
100 C. Girard et al. Nat Commun, 3: 994, 2012.

Figure 1.6: Three mechanisms of RBP-mediated splicing dysregulation. Mechanism I: Disruption and/or creation of cis-elements by disease variants. (Panels: enhancer disruption and silencer creation, each comparing normal and mutant splicing of exons 1–3; the mutant transcripts lack exon 2.)

The highly conserved GU/AG motifs mark the beginning and end of 99% of introns. Mutating either motif prevents the interactions between the core spliceosome and the pre-mRNA transcript that occur during the splicing process.63 Most intronic point mutations annotated as splicing mutations fall within two nucleotides of the exon. Beyond the dinucleotide motif, the core cis-splicing elements extend from the −3 position to the +6 position at the 5′ splice site, and from the −20 position to the +3 position at the 3′ splice site, traversing the exon-intron junctions. The remainder of the cis-sequence is significantly divergent, with the probability of a base in any position ranging between 35% and 80%. Less than 5% of splice sites match the consensus motif perfectly.104 This poses the fundamental question of how exons are recognized in large introns. An additional degree of definition could come from a branch-point sequence which is required for splicing.
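The core splice-site windows described above can be made concrete with a short sketch. It assumes the common numbering convention in which there is no position 0: at the 5′ splice site, −3 to −1 are the last three exonic nucleotides and +1 to +6 the first six intronic nucleotides; at the 3′ splice site, −20 to −1 are the last twenty intronic nucleotides and +1 to +3 the first three exonic nucleotides. The function names and the example sequences are hypothetical.

```python
def splice_site_windows(exon_up, intron, exon_down):
    """Extract the core 5' ss (9 nt) and 3' ss (23 nt) windows of one intron.

    Sequences are RNA strings (A, C, G, U) given 5' to 3'.
    """
    if len(exon_up) < 3 or len(exon_down) < 3 or len(intron) < 26:
        raise ValueError("flanking exons need >= 3 nt and the intron >= 26 nt")
    return {
        # Canonical introns begin with GU and end with AG.
        "canonical": intron.startswith("GU") and intron.endswith("AG"),
        # Positions -3..+6 relative to the 5' exon-intron junction.
        "five_ss": exon_up[-3:] + intron[:6],
        # Positions -20..+3 relative to the 3' intron-exon junction.
        "three_ss": intron[-20:] + exon_down[:3],
    }


def in_core_intronic_window(intron_length, intron_pos):
    """True if a 1-based intronic position lies in +1..+6 or -20..-1."""
    near_five = 1 <= intron_pos <= 6
    near_three = intron_length - 19 <= intron_pos <= intron_length
    return near_five or near_three


if __name__ == "__main__":
    intron = "GUAAGU" + "ACUAAC" + "UUUUCCUUUUCUCUUUCCAG"
    windows = splice_site_windows("AUGGCCAG", intron, "GUCCAUGA")
    print(windows["canonical"], windows["five_ss"], windows["three_ss"])
```

Windows of this shape are what a splice-site scoring model would consume; the sketch only extracts them and checks the GU/AG dinucleotides, it does not score splice-site strength.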
Mutations in the branch-point sequence just upstream of the 3′ splice site have been shown to have a similar effect in some heritable disorders;105–107 however, the relatively low number of branch-point sequences that have been identified and the relative degeneracy of the motif in humans restrict the ability to screen for this class of variants in a high throughput manner.24 Auxiliary elements could explain how splice sites are distinguished from the multitude of pseudo splice sites found in introns.108 In the next section we describe how the disruption of auxiliary splicing elements contributes to deleterious variability in splice site usage.

101 L. Eng et al. Hum Mutat, 23: 67–76, 2004.
102 K. Xia et al. Mol Vis, 10: 361–5, 2004.
103 J. M. Hartikainen et al. Hum Mutat, 15: 120, 2000.
104 H. Sun and L. A. Chasin. Mol Cell Biol, 20: 6414–25, 2000.
105 N. P. Burrows et al. Am J Hum Genet, 63: 390–8, 1998.
106 L. B. Crotti and D. S. Horowitz. Proc Natl Acad Sci U S A, 106: 18954–9, 2009.
107 C. Maslen et al. Am J Hum Genet, 60: 1389–98, 1997.

Disease mutations can also alter splicing by the disruption of cis-elements that modulate the recognition of splice sites. These auxiliary elements are often ligands for RBPs. The principal splicing factors that bind these auxiliary enhancers and silencers are the SR and hnRNP protein families. Both protein families are generalized to function in a position-specific manner. In other words, SR proteins bound in the exon are generally regarded as activating splicing, whereas the same protein relocated to the intron can act as a repressor. Conversely, hnRNPs are regarded as repressors when bound to exonic locations and activators when bound to the intron.
The binding specificities of many RBPs have been modeled in vitro and can be used to evaluate the potential of a variant to disrupt a binding site.39 This position dependence seems to be a general property of splicing elements. Exonic splicing enhancer (ESE) motifs functionally repress splicing when found in the intron, becoming intronic splicing silencers (ISSs).109 Likewise, exonic splicing silencer (ESS) motifs have been shown to function as intronic splicing enhancers (ISEs) (Figure 1.6).110 Positional distribution analysis uses this property to predict loss of binding without knowledge of the trans-acting factor61 (see also Section 1.5 for Spliceman overview). Non-coding mutations, and functionally conservative or silent mutations that have little to no effect on the translated protein, have been demonstrated to cause disease by disrupting splicing.111 In a recent mutational survey of HGMD, it was estimated that 25% of reported missense and nonsense mutations disrupt splicing by creating or destroying auxiliary exonic signals.112 It is worth noting that causal alleles with mutations in auxiliary cis-sequence that disrupt splicing have also been identified in each disease previously described.

1.3.2 Mechanism II: toxic RNA

Mutations that increase the stability of interactions between an RNA species and an RBP can cause disease. This has been demonstrated in several well studied diseases, particularly neurological and muscular degenerative disorders. The common feature that defines this class of disorders is repeat expansions that are particularly unstable and often result in further enlargement. Often, the repeated sequence becomes pathogenic after expanding beyond a threshold length. The toxic mRNA transcripts produced cause the dysregulation of alternative splicing of many pre-mRNAs in trans simultaneously. Also known as spliceopathy, this pathogenic mechanism has been observed in several RNA-dominant diseases including Myotonic Dystrophy (DM).
Spliceopathy is observed when repeating motifs specify an RBP ligand. The repeat expansion creates a tandem array of RBP binding sites which recruits and sequesters RBPs to the transcript, resulting in a sponge-like titration of splicing factors that effectively depletes the available pool in the cell (Figure 1.7). In an opposite fashion, the expansion can also lead to the upregulation of RBPs that bind only to the short, endogenous motifs. An increase in the splicing factor CUGBP1, a CELF family protein specific to striated muscle, is also pathogenic in DM. CUGBP1, when unbound to its substrate, becomes hyperphosphorylated, giving it a negative gain of function that contributes to extensive splicing dysfunction.113

108 X. H. Zhang and L. A. Chasin. Genes Dev, 18: 1241–50, 2004.
109 R. Martinez-Contreras et al. PLoS Biol, 4: e21, 2006.
110 A. Kanopka, O. Muhlemann, and G. Akusjarvi. Nature, 381: 535–8, 1996.
111 V. Faa et al. J Mol Diagn, 12: 380–3, 2010.
112 T. Sterne-Weiler et al. Genome Res, 21: 1563–71, 2011.

Figure 1.7: Three mechanisms of RBP-mediated splicing dysregulation. Mechanism II: The RNA becomes toxic as a result of repeat expansion. Misregulation of splicing by the toxic RNA occurs through sponge-like titration of a splicing factor. (Panel labels: normal and mutant; DNA, RNA, splicing; pre-mRNA and mRNA. Legend: intron, exon, repeat.)

In DM, two distinct repeat expansions have been reported as the mechanism of pathogenesis: CUG in DM1 and CCUG in DM2, in non-coding regions of the DMPK and ZNF9 genes respectively.58 In both cases expression of the transcript is repressed, but the more significant pathogenic result is generated by the sequestration of many RBPs involved in mRNA biogenesis, including splicing factors. Although patients share core phenotypes, DM, like many other degenerative disorders such as Alzheimer’s and Spinocerebellar Ataxia, presents with a markedly variable composite phenotype.
This may be explained by the broad yet relatively non-specific impact of toxic mRNAs.

1.3.3 Mechanism III: mutations that affect splicing factors

The other major category of spliceopathy is direct mutation of a splicing factor (Figure 1.8). Mutations in splicing factors have been described in a wide array of common diseases. Two of the better understood spliceopathic RBPs are NOVA (paraneoplastic neurological disorders) and TDP-43 (ALS). Both splicing factors regulate alternative splicing events in neurons, and the loss of either results in severe pathogenesis. NOVA belongs to the K-homology (KH) family of RBPs and is known to interact with hnRNPE1 and hnRNPE2 to promote inclusion of an alternatively spliced transcript.114 The spliceopathy of TDP-43, however, is conferred by its ability to regulate itself. TDP-43 governs the splicing patterns of ≈950 transcripts; an increase or decrease in cellular TDP-43 causes exclusion events in its target transcripts. The global loss of splicing regulation in the neuron is thought to result in aggregates of ubiquitinated inclusions.115

113 N. M. Kuyumcu-Martinez, G. S. Wang, and T. A. Cooper. Mol Cell, 28: 68–78, 2007.
114 B. K. Dredge, A. D. Polydorides, and R. B. Darnell. Nat Rev Neurosci, 2: 43–50, 2001.

Figure 1.8: Three mechanisms of RBP-mediated splicing dysregulation. Mechanism III: Mutation in a splicing factor (e.g., U2AF) prevents it from binding to the pre-mRNA and stabilizing U2 snRNP. This results in unsuccessful transcript recognition.

In Dilated Cardiomyopathy (DCM) the RBP RBM20 has been shown to regulate diastolic function, sarcomere assembly, and ion transport in an enhancer-dependent mechanism. RBM20 is recruited by phosphorylated SR proteins, causing inclusion of differentially expressed (mutually exclusive) exons, which promotes elasticity primarily in the sarcomeric titin protein.
Depletion of RBM20 reduces cardiac elasticity, causing heart disease.116 RBFOX1 is a neuron-specific splicing factor that is associated with several neurodevelopmental disorders including Autism. RBFOX1 plays an important role as a master regulator of splicing in the development of early neurons. Loss of RBFOX1 causes changes in synaptic transmission as well as membrane excitability. Variants that deplete RBFOX1 show a globally negative effect on growth and proliferation in most neurodevelopmental pathways.117 Finally, splicing factor RBPs such as SRSF1 have been strongly correlated with proto-oncogenic transformations. SRSF1 has been shown to regulate the splicing of several oncogenes. SRSF1’s primary target, BIN1, is known to inhibit cMyc. Depletion of SRSF1 leads to an aberrant BIN1 protein with reduced ability to suppress cMyc.118,119 Each of these RBP splicing factors performs different functions in splicing regulation and disease. This demonstrates the multitude of biological processes dependent on the regulation of pre-mRNA splicing.

115 E. S. Arnold et al. Proc Natl Acad Sci U S A, 110: E736–45, 2013.
116 W. Guo et al. Nat Med, 18: 766–73, 2012.
117 B. L. Fogel et al. Hum Mol Genet, 21: 4171–86, 2012.

1.4 Machine learning

Machine learning techniques have been successfully used to address a variety of problems in biological sciences, such as splice site identification,120,121 discovery of translation initiation sites,122 identification of potential functional annotation errors in genes and proteins,123,124 identification of functionally important sites in proteins,125 and analyses of splicing outcome.9,10 Depending on the setting of a particular question, different algorithms are used, ranging from simple linear models to complicated ‘black boxes’ like support vector machines. Generally speaking, machine learning refers to a vast set of algorithms for explaining/clustering data.
Most machine learning problems fall into one of two categories: supervised or unsupervised. In supervised learning, for each observation of the predictor measurement(s) $x_i$, $i = 1, \ldots, n$, there is an associated response measurement $y_i$ (described in more detail below). In contrast, unsupervised learning describes a more challenging situation in which for every observation $i = 1, \ldots, n$ we observe a vector of measurements $x_i$, but there is no associated response variable $y_i$.

1.4.1 Supervised learning

Supervised learning involves deriving a statistical model for predicting a result based on one or more input variables. Broadly speaking, if we observe a quantitative response $Y$ and $p$ different inputs, $X_1, X_2, \ldots, X_p$, we may assume that there is a relationship between the two sets of values that can take the very general form:

$$Y = f(X) + \epsilon \tag{1.1}$$

where $f$ is some function of the input variables and $\epsilon$ is a random error term. The objective of supervised learning is to estimate a mapping from predictors to responses. The accuracy of the prediction depends on two quantities: reducible error and irreducible error. In most cases, the estimate of $f$ will not be perfect and this imperfection will introduce some error. It is possible to minimize reducible error, but perfect estimates cannot be obtained in the presence of irreducible error (e.g. due to uncharacterized noise in the experimental data). Supervised learning can be used for both prediction and inference. In the prediction setting, the aim is to accurately predict the response for future observations. The inference approach, on the other hand, focuses on better understanding the relationship between the responses and the predictors.

118 J. Zhang and J. L. Manley. Cancer Discov, 3: 1228–37, 2013.
119 O. Anczukow et al. Nat Struct Mol Biol, 19: 220–8, 2012.
120 A. Ben-Hur et al. PLoS Comput Biol, 4: e1000173, 2008.
121 S. Degroeve et al. Bioinformatics, 21: 1332–8, 2005.
122 P. Meinicke et al. BMC Bioinformatics, 5: 169, 2004.
123 A. K. Baten et al. BMC Bioinformatics, 7 Suppl 5: S15, 2006.
124 A. K. Baten, S. K. Halgamuge, and B. C. Chang. BMC Bioinformatics, 9 Suppl 12: S8, 2008.
125 R. R. Walia et al. BMC Bioinformatics, 13: 89, 2012.

Decision trees

In this section, we describe tree-based methods for classification and regression. The tree-based methods stratify the predictor space into a number of distinct regions. The segmentation process that creates the regions can be summarized using a tree, and therefore these methods are called decision tree methods. The process of building a regression tree is straightforward. Briefly, there are two steps:

1. We divide the feature space (also called the predictor space) of $X_1, X_2, \ldots, X_p$ into $J$ distinct non-overlapping regions $R_1, R_2, \ldots, R_J$.
2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in the same region $R_j$.

To explain Step 2 above, suppose we have two regions $R_1$ and $R_2$, and that the mean response value for the training observations in region $R_1$ is 50, while the mean response value for the training observations in region $R_2$ is 100. Then for the observation $X = x$, if $x \in R_1$ we would predict a response value of 50, and if $x \in R_2$ we would predict a response value of 100. Let us now elaborate on Step 1 above. In order to divide the feature space into regions $R_1, R_2, \ldots, R_J$, we need to find high-dimensional rectangles or “boxes” that minimize the residual sum of squares (RSS), defined as:

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2 \tag{1.2}$$

where $\hat{y}_{R_j}$ is the mean response value for the training observations within the $j$th “box” or rectangle. To divide the feature space into the rectangles we use recursive binary splitting.
We start at the top of the tree (at this point all training observations belong to a single region) and select the feature $X_j$ and the cut-point $s$ such that splitting the feature space into the regions $\{X \mid X_j < s\}$ and $\{X \mid X_j \ge s\}$ leads to the greatest possible reduction in RSS. Each such split can be visualized as two new branches on the tree that we are creating. At each split, we consider all features $X_1, X_2, \ldots, X_p$ and all possible cut-point values $s$ for each of the predictors, and then choose the predictor and cut-point such that the resulting tree has the lowest RSS. Next, we repeat the process, selecting another “best” predictor and “best” cut-point in order to split the data further so as to minimize the RSS within each of the resulting regions. However, this time instead of splitting the entire predictor space, we split one of the two previously defined regions (defined by the first split). The process continues until a termination criterion is reached. For example, we might limit our tree to a certain number of branches, or keep growing the tree until no region contains more than five training observations. Once the regions $R_1, R_2, \ldots, R_J$ have been created, we predict the response value for a given test observation using the mean of the training observations in the region to which that test observation belongs (as indicated earlier in the list above). Sometimes, the resulting tree might be too complex and perform well on the training data but not on the test data. We then say that the tree ‘over-fits’ the training dataset. A smaller tree with fewer splits might lead to lower variance and better interpretation of the results at the cost of a little bias. It is therefore better to grow a large tree and then prune it back in order to obtain a subtree. Tree pruning can be done using a variety of methods; one widely used method is cost complexity pruning, also known as weakest link pruning.

The above description has been based on the regression tree.
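As a rough illustration of a single step of recursive binary splitting for a regression tree, the sketch below searches every feature and candidate cut-point for the split with the lowest total RSS. Using midpoints between consecutive distinct feature values as candidate cut-points is an assumption made here for simplicity, and the function names and toy data are hypothetical.

```python
def rss(values):
    """Residual sum of squares of values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (feature index j, cut-point s, total RSS) of the best single split.

    X is a list of feature rows and y the list of responses.
    Regions are {X | X_j < s} and {X | X_j >= s}.
    """
    best = None
    for j in range(len(X[0])):
        distinct = sorted({row[j] for row in X})
        # Assumed candidate cut-points: midpoints between neighbouring values.
        for low, high in zip(distinct, distinct[1:]):
            s = (low + high) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


if __name__ == "__main__":
    # Toy data: the first feature separates responses of 50 from responses of 100.
    X = [[1, 10], [2, 20], [3, 10], [4, 20]]
    y = [50, 50, 100, 100]
    print(best_split(X, y))
```

Growing a full tree would apply the same search again inside each resulting region until a stopping rule is met; that recursion and any pruning are omitted here.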
The task of building a classification tree is very similar to that of building a regression tree. Just as in the regression setting, we use recursive binary splitting to grow a classification tree. This time, however, we cannot use RSS as our metric. Since we plan to assign an observation in a given region to the most commonly occurring class of the training observations in that region, we might use the classification error rate as an alternative to RSS. The classification error rate is simply the fraction of training observations in the region that do not belong to the most common class. An alternative measure is the Gini index, which is defined as:

$$G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right) \tag{1.3}$$

where $\hat{p}_{mk}$ is the proportion of training observations in the $m$th region that are from the $k$th class. The Gini index is therefore a measure of total variance across the $K$ classes. It takes on a small value if all of the $\hat{p}_{mk}$ are close to one or close to zero and therefore can be used as a measure of node purity. A small value of the Gini index tells us that a tree node contains predominantly observations from the same class. One alternative to the Gini index is cross-entropy, defined as:

$$D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \tag{1.4}$$

Again, the value of the cross-entropy is close to zero when the $\hat{p}_{mk}$ are all near zero or one. Numerically, the Gini index and the cross-entropy are quite similar. Both of these metrics can be used to evaluate the quality of a particular split when building a classification tree.

We would like to conclude this section by listing advantages and disadvantages of decision trees.

Advantages of decision trees:
1. Trees are very easy to explain.
2. Decision trees closely replicate human decision-making.
3. Decision trees can be visualized easily (especially when small).
4. Trees can handle qualitative predictors without converting them using dummy variables.

Disadvantages of decision trees:
1. A single decision tree does not have the same predictive power as some of the other approaches.
2. A small change in the data can cause a large change in the final single decision tree. Single decision trees are simply not very robust.

We will describe the strategies to improve the performance of decision trees in the next sections. These include: bagging, random forest, and boosting.

Bagging and random forest

As discussed in the previous section, a single decision tree is not robust (it suffers from high variance). Bootstrap aggregation, or bagging, is a procedure for reducing the variance of a model. This procedure takes advantage of the fact that averaging a set of observations reduces the variance. Therefore, it would be useful for a statistical learning method to take many training sets from the population, build a separate predictive model using each training set, and average the resulting predictions. Of course, in reality we do not have access to multiple training sets. Instead, we can use the bootstrap: taking repeated samples from the single training data set and generating $B$ different bootstrapped training data sets. Then, we train our algorithm on the $b$th bootstrapped training set in order to get $\hat{f}^{*b}(x)$, and finally average all the predictions:

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x) \tag{1.5}$$

The above procedure is called bagging. In order to apply bagging to regression trees, we construct $B$ regression trees using $B$ bootstrapped training sets, and average the resulting predictions. Again, each single tree will have high variance, but low bias. Averaging these $B$ trees will reduce the former. To apply bagging to classification trees, for a given test observation, we record the class predicted by each of the $B$ trees, and take what is called a majority vote.
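The bagging procedure of Equation 1.5 can be sketched in a few lines. In this minimal illustration the base learner is a one-split regression tree (a “stump”) on a single predictor, which is an assumption made to keep the example short; any tree-growing routine could take its place. The function names, the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter


def mean(values):
    return sum(values) / len(values)


def fit_stump(sample):
    """Fit a one-split regression tree to (x, y) pairs; return a predictor."""
    distinct = sorted({x for x, _ in sample})
    if len(distinct) < 2:
        constant = mean([y for _, y in sample])
        return lambda x: constant
    best = None
    for low, high in zip(distinct, distinct[1:]):
        s = (low + high) / 2
        left = [y for x, y in sample if x < s]
        right = [y for x, y in sample if x >= s]
        m_left, m_right = mean(left), mean(right)
        total = sum((y - m_left) ** 2 for y in left) + sum((y - m_right) ** 2 for y in right)
        if best is None or total < best[0]:
            best = (total, s, m_left, m_right)
    _, s, m_left, m_right = best
    return lambda x: m_left if x < s else m_right


def bagged_predict(train, x, B=200, seed=0):
    """Average the predictions of B stumps, each fit to a bootstrap sample."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(B):
        # Bootstrap: sample n observations with replacement.
        boot = [rng.choice(train) for _ in train]
        predictions.append(fit_stump(boot)(x))
    return sum(predictions) / B


def majority_vote(labels):
    """Most common class among the B tree predictions (classification)."""
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    train = [(1, 50), (2, 50), (3, 50), (7, 100), (8, 100), (9, 100)]
    print(bagged_predict(train, 1.5), bagged_predict(train, 8.5))
```

Because each bootstrap sample differs, the individual stumps differ slightly; averaging them is what reduces the variance relative to a single tree.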
The majority vote provides the final prediction, which is the most commonly occurring class among the B tree predictions.

Another advantage of bagging trees is out-of-bag error estimation. The out-of-bag error is a quick way to estimate the test error of a bagged model, and it saves time by removing the need for cross-validation. On average, each bagged tree uses only about two-thirds of the observations. The remaining one-third of the observations, which are not used to fit a given bagged tree, are called the out-of-bag (OOB) observations for that tree. We can predict the response for a given observation using each of the trees in which that observation was OOB. This gives about B/3 predictions for each observation. We can then average these predictions (in the regression setting) or take a majority vote (in the classification setting). From the resulting OOB predictions, we can compute the overall OOB mean squared error (MSE) (for regression) or classification error (for classification). The resulting error is a valid estimate of the test error because the response for each observation is predicted using only the trees that were not fit using that observation. With B sufficiently large, the OOB error is almost identical to the leave-one-out cross-validation error.

In order to measure variable importance in the bagged model (and in the random forest described below), we can once again use the RSS (for bagging regression trees) or the Gini index (for bagging classification trees). We can simply record the total amount by which the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value of the RSS decrease indicates a strong (important) predictor. For bagged classification trees, we can record the total amount by which the Gini index is decreased due to splits over a given predictor, averaged over all B trees.
Here, once again, a larger number indicates an important predictor.

Random forests provide an additional improvement over bagging by decorrelating the bagged trees. In random forests, when building each tree, instead of considering all possible predictors for a split (as in bagging), we choose a random sample of m predictors from the full set of p features. The split is then allowed to use only these m predictors. A fresh sample of m features is drawn each time a split in a tree is considered. The default value in many random forest implementations is $m \approx \sqrt{p}$. This prevents a few strong, correlated features from appearing in all of the bagged trees. If all bagged trees looked quite similar (as happens when there are a couple of strong predictors), their predictions would be highly correlated, and averaging them would do little to improve the performance of the model.

Boosting

Boosting is yet another approach that can be used to improve the predictive power of many machine learning algorithms. Here, we will describe boosting in the context of decision trees. Boosting is very similar to bagging, except that the trees in the boosting model are grown sequentially (each tree is grown using information from previously grown trees) instead of independently. Like bagging, boosting involves combining multiple (B) decision trees $\hat{f}^{1}, \ldots, \hat{f}^{B}$. Boosting also limits the possibility of over-fitting each single decision tree by learning slowly. Given the current model, we fit the next decision tree to the residuals of that model rather than to the outcome itself. We then add this new decision tree to the fitted function and update the residuals. Boosting requires an additional parameter d beyond the parameters used by random forests.
The parameter d controls the size of each single tree: it limits the number of splits in each tree and therefore the number of terminal nodes and the tree's depth. Each tree in the boosting model is usually rather small. By fitting small trees to the residuals, we slowly improve $\hat{f}$ in areas where it does not perform well. A similar strategy is used for boosting classification trees. Boosted trees use three parameters: B, the number of trees, which should be limited (especially if λ is large) because, unlike bagging and random forests, which benefit from a large number of trees, boosting can over-fit the data when B is too large; λ, the shrinkage parameter, which is usually a small positive number and controls the rate at which boosting learns; and finally the parameter d described above, which controls the complexity of the boosted ensemble. We will use both random forest and boosting strategies in Chapters 2 and 4.

1.5 Developing tools predicting causal SNPs

Several tools have been developed to predict the effects of variants on splicing. These tools evaluate splice site strength (MaxEntScan),126 predict splice site usage (NetGene2),127 identify exonic splicing enhancer motifs (RESCUE-ESE),33 and predict the effect of mutations in both canonical splicing motifs (ASSEDA)128 and auxiliary motifs (Spliceman).7 Spliceman, for example, uses the positional distribution of hexamer motifs around exon-intron junctions to predict variants outside of canonical splice site signals that disrupt splicing. However, the complex haplotype architecture of genetic variation in humans makes it challenging to functionally assess individual variants in the laboratory. The haplotype identified in an association study requires further analysis to find causal variants.

The cost and time of sequencing and analysis have dramatically decreased since the original genome-wide association studies (GWAS). Data are now produced in tremendous volumes and consolidated in databases. One such database, the database of Genotypes and Phenotypes (dbGaP), hosted by the National Center for Biotechnology Information (NCBI), combines genotype and phenotype data from the literature and the clinic. Some of these data, such as those from the Genotype-Tissue Expression (GTEx) project,129 combine a survey of RNA-seq data from different tissues with genomic sequencing data in a diverse population of individuals. Here, variants discovered in individuals within the cohort can be checked for changes in the individuals' transcript levels or splice isoform usage. Although correlations between variants and processing defects do not necessarily prove causality, this type of data greatly reduces the search space for common variants that affect splicing. Furthermore, planned expansions of this dataset (increasing the population size and diversity, and adding the dimension of RNA deep sequencing data) should reduce false positives and allow for higher confidence in observed associations. Analysis of the GTEx data upon the completion of the project will likely lead to refined predictive tools and an increase in the identification of causal variants.

126 G. Yeo and C. B. Burge. J Comput Biol, 11: 377–94, 2004.
127 S. Brunak, J. Engelbrecht, and S. Knudsen. J Mol Biol, 220: 49–65, 1991.
128 A. Goren et al. Nucleic Acids Res, 38: 3318–27, 2010.

1.6 Functionally validating individual variants

Despite the great promise of public datasets such as GTEx and of predictive tools, experimental approaches offer the most definitive test of causality. For common variants, causality can be determined by testing variants in linkage disequilibrium (LD) with the associated single nucleotide polymorphism (SNP) in a splicing assay. This approach allows the effect of the variant to be measured independently of neighboring SNPs and controls for the genetic background.
Here, we will discuss high throughput strategies that we are developing in our lab to functionally evaluate variants of interest and attribute causality. These approaches can be applied to SNPs, disease alleles, or variants of unknown significance that are returned in exome sequencing studies.

Minigene reporter constructs can be synthesized and used to identify variants in the cis-sequence that demonstrate allele-specific splicing defects.130 Variations of the minigene constructs can be developed with alternate promoters and vectors, and can be tested in a number of cell lines with different expression profiles to account for tissue-specific expression levels and variability in the neighboring environment. Original minigene constructs relied on the generation and insertion of recombinant complementary DNA (cDNA) into the genome, but more recently simple polymerase chain reaction (PCR) strategies have been used to amplify sequences of interest from genomic DNA, which are then ligated to splicing reporters to measure splicing activity.131 The main limitation of constructing minigene reporters from genomic DNA is the inability to separate nearby variants.

Currently, DNA libraries of short oligos can be synthesized to test the splicing efficiency of variants and their wild-type pairs in a neutral background sequence. These minigenes can undergo splicing in nuclear extract or be transfected into cells to assay in vitro or in vivo splicing activity, respectively.130 Results can be directly quantitated by comparing the levels of input RNA and spliced product.3 These mutant/wild-type pairs can then be tested for splicing activity in a massively parallel high throughput assay.3 This approach is limited by the length of oligonucleotide that can be accurately synthesized with current technology. We currently employ a combination of these strategies. Oligo libraries are designed to uncover candidate variants that influence splicing in a high throughput manner. We then validate the candidate variants by assessing the splicing phenotype in patient-derived tissues by RT-PCR assay. A key advantage of high throughput functional assays is that their input can accommodate the typical number of variants called in an exome sequencing run (i.e., 20,000–30,000).

129 GTEx Consortium. Nat Genet, 45: 580–5, 2013.
130 T. A. Cooper. Methods, 37: 331–40, 2005.
131 K. Basler, P. Siegrist, and E. Hafen. EMBO J, 8: 2381–6, 1989.
3 R. Soemedi et al. Adv Exp Med Biol, 825: 227–66, 2014.

Loss of trans-acting splicing factors and toxicity of RNA can be measured using well-characterized binding assays such as immunoprecipitation, fluorescent in situ hybridization, and chromatography. In binding assays, in vitro techniques allow for the direct comparison of intrinsic binding between different RBPs and their substrates at titrated concentrations and in controlled environments. Association kinetics under endogenous conditions likely vary significantly from those observed in in vitro assays. Conditions in vivo are the result of complex interactions between a multitude of factors, and binding assays with rare or low concentrations of spliceopathic transcripts may not produce a discernible signal. Binding assays have also been adapted to high throughput platforms to increase sensitivity and the amount of data produced. For example, low throughput cross-linking immunoprecipitation (CLIP) assays evolved into high throughput sequencing of CLIP products (HITS-CLIP), which was used to map the genome-wide NOVA interactions described previously.132

1.7 Conclusions and future directions in therapeutic interventions for splicing disorders

Ultimately, the larger goal in studying the mechanisms of splicing disruption is to enable further research into therapies that reverse splicing defects. Of the three classes of splicing disorders, mutations that disrupt splicing in cis may be the most amenable to therapy, as their effects are limited to a single gene.
Oligonucleotides and other RNA binding compounds have been used to rescue aberrant splice site choices in vivo. The precise strategy for correcting a cis-mutation depends on the type of aberrant splicing that arises. Many aberrant splicing events are caused by the unwanted binding of a spliceosome component to a pre-mRNA element. For example, a silent variation in SMN2 was hypothesized to create a binding site for the repressor hnRNPA1, which reduced the inclusion of exon 7 of SMN2 and caused spinal muscular atrophy.133,134 The binding of modified oligonucleotides to nearby hnRNPA1 binding sites rescued splicing.134 In a similar manner, cryptic splice sites can also be blocked by complementary oligonucleotides, restoring usage of the appropriate splice site.135 Oligonucleotides delivered into the cell can be modified, usually in the sugar or backbone, to increase nuclease resistance and specificity and to improve delivery to the target. Common modifications include morpholino oligomers, 2′-methoxyethoxy, 2′-O-methyl phosphorothioate and locked nucleic acid (LNA).136 Variations of oligonucleotide therapy have moved beyond simple steric hindrance of binding. Bi-functional oligonucleotides that combine a targeting sequence with a splicing enhancer have been shown to rescue the defective splicing of SMN2 in vivo.137

132 D. D. Licatalosi et al. Nature, 456: 464–9, 2008.
133 Y. Hua et al. Nature, 478: 123–6, 2011.
134 Y. Hua et al. Genes Dev, 24: 1634–44, 2010.
135 S. Svasti et al. Proc Natl Acad Sci U S A, 106: 1205–10, 2009.
136 K. E. Lundin et al. Adv Genet, 82: 47–107, 2013.
137 N. Owen et al. Nucleic Acids Res, 39: 7194–208, 2011.

Oligonucleotides do not necessarily have to target the affected exon. Duchenne muscular dystrophy is caused by Dystrophin gene mutations, many of which induce frameshifting exon skipping events.
Pharmaceutical oligonucleotides (named eteplirsen and drisapersen) were designed to restore the open reading frame by skipping additional exons. The resulting message contains internal deletions but encodes a more functional dystrophin protein. During clinical trials, the drugs were shown to improve some features of the disease (but not mobility).138

Small molecule therapy has been utilized as an alternative strategy for correcting splicing defects. Numerous FDA-approved compounds bind bacterial ribosomal RNA (e.g., aminoglycosides). Screening compounds for their ability to increase exon inclusion in the SMN2 transcript yielded a tetracycline-like compound, PTK-SMA1.139 Not all compounds function by directly binding RNA. There are numerous examples of compounds that change splice isoform ratios by altering the expression of chromatin modifying factors (reviewed in P. Disterer et al. Development of therapeutic splice-switching oligonucleotides. Hum Gene Ther, 25: 587–98, 2014. DOI: 10.1089/hum.2013.234). While certain aberrant splicing events may be altered by these types of changes, it is likely that many other factors will be affected.

Finally, strategies are also being developed to counter gain-of-function toxic RNAs. Here, a repeat expansion titrates a splicing factor from the cell, potentially affecting numerous splicing events. For example, the CUG repeat expansions associated with myotonic dystrophy type 1 (DM1) are being targeted by antisense oligonucleotides that function through a variety of mechanisms.140–146 Other approaches include designed compounds and endonucleases that recognize CUG repeats.147–150 However, significant challenges associated with drug delivery remain, and how effective all of these approaches will be in patients is still a major unanswered question.

With the growing awareness of RNA processing in disease, new efforts to diagnose, characterize and treat splicing defects are underway.
As the cost of sequencing decreases, techniques such as RNA-seq are enjoying more widespread use. These approaches will undoubtedly reveal a significant role for aberrant splicing in human disease. It is difficult to predict what role oligonucleotides will play in future therapies. There are numerous challenges that deter large-scale development of oligonucleotides as drugs (e.g., delivery issues and the small number of patients afflicted with a particular allele). However, oligonucleotide therapies offer some key advantages that may speed their development. Unlike small molecule targeting, the principle of oligonucleotide specificity (nucleotide base pairing) is well understood. Oligonucleotide therapies also appear to be inherently more conservative and will likely not be hampered by the safety issues that halted previous attempts at gene therapy.

138 P. Disterer et al. Hum Gene Ther, 25: 587–98, 2014.
139 M. L. Hastings et al. Sci Transl Med, 1: 5ra12, 2009.
140 T. M. Wheeler et al. Science, 325: 336–9, 2009.
141 S. A. Mulders et al. Proc Natl Acad Sci U S A, 106: 13915–20, 2009.
142 J. E. Lee, C. F. Bennett, and T. A. Cooper. Proc Natl Acad Sci U S A, 109: 4221–6, 2012.
143 A. J. Leger et al. Nucleic Acid Ther, 23: 109–17, 2013.
144 T. M. Wheeler et al. Nature, 488: 111–5, 2012.
145 V. Francois et al. Nat Struct Mol Biol, 18: 85–7, 2011.
146 K. Sobczak et al. Mol Ther, 21: 380–7, 2013.
147 M. B. Warf et al. Proc Natl Acad Sci U S A, 106: 18551–6, 2009.
148 A. Garcia-Lopez et al. Proc Natl Acad Sci U S A, 108: 11866–71, 2011.
149 J. L. Childs-Disney et al. ACS Chem Biol, 7: 1984–93, 2012.
150 W. Zhang et al. Mol Ther, 22: 312–20, 2014.

Chapter 2

Identification of Splicing Defects

Summary and contributions

The lack of tools to identify causative variants from sequencing data greatly limits the promise of precision medicine. Previous studies suggest that one-third of disease-associated alleles alter splicing.
We discovered that the alleles causing splicing defects cluster in disease-associated genes (for example, haploinsufficient genes). We analyzed 4,964 published disease-causing exonic mutations using a massively parallel splicing assay (MaPSy), which showed an 81% concordance rate with splicing in patient tissue. Approximately 10% of exonic mutations altered splicing, mostly by disrupting multiple stages of spliceosome assembly. We present a large-scale characterization of exonic splicing mutations using a new technology that facilitates variant classification and keeps pace with variant discovery.

Sections 2.1, 2.2, 2.3, and 2.4 were published in the following manuscript:

• R. Soemedi, K. J. Cygan, C. L. Rhine, J. Wang, C. Bulacan, J. Yang, P. Bayrak-Toydemir, J. McDonald, and W. G. Fairbrother. Pathogenic variants that alter protein code often disrupt splicing. Nat Genet, 49: 848–855, 2017. DOI: 10.1038/ng.3837

William G. Fairbrother and Rachel Soemedi designed the experiments. Rachel Soemedi performed experimental portions of MaPSy. Kamil J. Cygan performed NGS alignment and counting. Kamil J. Cygan performed in silico RBP motif analyses. Kamil J. Cygan designed and performed clustering of RBP motifs. Rachel Soemedi and Kamil J. Cygan performed ESM analyses (allelic imbalance and splicing efficiency) and machine learning. Rachel Soemedi performed MaPSy SELEX analyses. Christy L. Rhine and Kamil J. Cygan performed HGMD analyses. Charlston Bulacan and John Yang developed the visualization web browser. Rachel Soemedi, Pinar Bayrak-Toydemir, and Jamie McDonald performed validation experiments in patient tissue samples. Kamil J. Cygan performed validation experiments using ENCODE datasets. Kamil J. Cygan and Rachel Soemedi created figures. William G. Fairbrother and Rachel Soemedi wrote the paper with contributions from all authors.

Section 2.6 was published under the following URL:

• https://genomeinterpretation.org/content/MaPSy

Kamil J. Cygan and Gaia Andreoletti wrote the aforementioned challenge. Kamil J. Cygan prepared the training and test sets for the machine learning challenge.

Section 2.7 was published in the following manuscript:

• S. W. Kim, A. J. Taggart, C. Heintzelman, K. J. Cygan, C. G. Hull, J. Wang, B. Shrestha, and W. G. Fairbrother. Widespread intra-dependencies in the removal of introns from human transcripts. Nucleic Acids Res, 45: 9503–9513, 2017. DOI: 10.1093/nar/gkx661

Seong Won Kim, Allison J. Taggart, and Claire Heintzelman performed the global analysis of the order of intron removal. Kamil J. Cygan performed RBP analyses and correlation of RBP data with the order of intron removal data.

Section 2.5 is the result of my own set of analyses, in which I identified the limitations of MaPSy and described the pipeline modifications needed to overcome them. That section remains unpublished.

2.1 Pathogenic variants that alter protein code often disrupt splicing

Human genetic disorders occur in ∼8% of the population.151 Major technological advancements in the past decade have made it possible to detect all sequence variations in individual genomes in a cost-effective manner. In combination with capture technologies, targeted sequencing of all protein-coding regions of the human genome (the exome) has been increasingly used for routine diagnostics in Mendelian disorders.152,153 Unfortunately, the tremendous progress that has been made in variant detection has outpaced the capacity to characterize sequence variations.
Recent deep sequencing of human exomes detected 14,000 single-nucleotide variants (SNVs) per individual, 47% of which were predicted to be deleterious by one or more in silico prediction tools, but there was very little agreement (<1%) between the commonly used methods.154 Large-scale sequencing has identified, in asymptomatic individuals, many loss-of-function variants that are thought to cause severe genetic disorders.53,155 These variants could represent annotation or sequencing errors, partial penetrance, or recessive alleles carried by asymptomatic individuals. This uncertainty illustrates the urgency for better characterization of sequence variation. Although it is difficult to predict the effect of an SNV on protein function, the characterization of splicing mutations is a tractable problem. Splicing mutations are easily detected and quantified. They are deleterious, and one-third of the alleles that cause hereditary disease are predicted to confer some degree of missplicing.61 Some of these mutations disrupt canonical splice sites, whereas others disrupt the multitude of enhancers and silencers that can modulate splice-site usage. Any change in an exonic sequence may therefore disrupt or create cis-acting elements that facilitate exon recognition, resulting in aberrant splicing. Here we present a new parallel splicing reporter system to characterize 4,964 published disease-causing exonic mutations for effects on splicing. The present study identified an allelic splicing imbalance caused by these exonic mutations and provided insights into the determinants and mechanisms of splicing aberrations.

2.2 Results

2.2.1 Massively parallel splicing assays

We developed a massively parallel splicing assay (MaPSy) to screen a panel of 4,964 exonic disease mutations (5K panel) reported in the Human Gene Mutation Database156 (HGMD) for mutations causing splicing defects.
One library was designed to evaluate the effects of the mutations on splicing in vivo via transfection in cells grown in tissue culture. The second library comprised RNA substrates designed to evaluate the mutations' effects on splicing in vitro via incubation in cell nuclear extract. Solid-phase oligonucleotide synthesis technology and PCR were used to manufacture the in vivo library and the template for the in vitro library (Figure 2.1). Each reporter in the library contains a 170-mer genomic fragment of either the mutant or wild-type (reference) sequence, each of which consists of an exon, at least 55 nt of the upstream intron and 15 nt of the downstream intron (Figure 2.1a).24 The allelic ratio for each mutant/wild-type (M/W) pair was determined from the allelic counts obtained from deep sequencing of the input libraries, the output spliced fractions and the RNA pools isolated from different in vitro spliceosomal intermediates (Figure 2.1b,c).

151 P. A. Baird et al. Am J Hum Genet, 42: 677–93, 1988.
152 Y. Yang et al. JAMA, 312: 1870–9, 2014.
153 M. J. Bamshad et al. Nat Rev Genet, 12: 745–55, 2011.
154 J. A. Tennessen et al. Science, 337: 64–9, 2012.
53 M. Lek et al. Nature, 536: 285–91, 2016.
155 Y. Xue et al. Am J Hum Genet, 91: 1022–32, 2012.
61 K. H. Lim et al. Proc Natl Acad Sci U S A, 108: 11093–8, 2011.
156 P. D. Stenson et al. Hum Mutat, 21: 577–81, 2003.

The most common outcome of disrupted splicing in vivo is exon skipping, whereas most pre-mRNAs with splicing mutations in vitro remain unspliced. While changes in transcription or stability may account for an altered allelic ratio in the spliced fraction in vivo, the in vitro assay is a direct measure of splicing. Despite substantial differences in processing and substrate design, general agreement was observed between the allelic splicing ratios obtained from the two assays (Figure 2.2a; Pearson's r = 0.55).
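The way an allelic splicing ratio and its significance can be derived from read counts is sketched below. This is an illustration under stated assumptions rather than the published MaPSy pipeline: the function names, the pseudocount, the toy counts, and the choice of the Benjamini-Hochberg procedure for the FDR adjustment are all assumptions. The two-sided Fisher's exact test, the 1.5-fold threshold and the 5% FDR are the criteria quoted in this chapter.

```python
from math import comb, log2


def log2_mw_ratio(mut_in, wt_in, mut_out, wt_out, pseudo=0.5):
    """log2 of the output M/W ratio normalized by the input M/W ratio.

    The pseudocount (an assumption) avoids division by zero.
    """
    in_ratio = (mut_in + pseudo) / (wt_in + pseudo)
    out_ratio = (mut_out + pseudo) / (wt_out + pseudo)
    return log2(out_ratio / in_ratio)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = comb(n, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(q for q in probs.values() if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvals[i] * n / rank)
        adjusted[i] = running
    return adjusted


def call_splicing_changes(pairs, fold=1.5, fdr=0.05):
    """pairs: list of (mut_in, wt_in, mut_out, wt_out) read counts."""
    ratios = [log2_mw_ratio(*p) for p in pairs]
    # Rows are alleles (mutant, wild type); columns are input and output.
    pvals = [fisher_exact_two_sided(m_in, m_out, w_in, w_out)
             for m_in, w_in, m_out, w_out in pairs]
    qvals = benjamini_hochberg(pvals)
    return [abs(r) > log2(fold) and q < fdr for r, q in zip(ratios, qvals)]


# Hypothetical counts: the first pair loses the mutant allele in the spliced pool.
toy_pairs = [(100, 100, 20, 180), (100, 100, 95, 105)]
print(call_splicing_changes(toy_pairs))
```

The exact test here enumerates every possible table, which is simple but slow for deep-sequencing counts in the thousands; a library routine such as `scipy.stats.fisher_exact` would be the more practical choice on real data.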
Approximately 10% of the exonic mutations in the 5K panel altered splicing in both systems (Figure 2.2c; >1.5-fold change, two-sided Fisher's exact test, adjusted with 5% false discovery rate (FDR)) and thus were regarded as unambiguous splicing changes and were classified as exonic splicing mutations (ESMs). We also performed MaPSy on a control panel of common SNPs, which disrupted splicing at a significantly lower level (8/228 or 3% of common SNPs, $P = 9.94 \times 10^{-5}$, two-sided Fisher's exact test; Table A.1). Additionally, cryptic 3′-splice-site usage was identified in both assays (Figure 2.2b; Pearson's r = 0.8). Although most bona fide cryptic splicing events (74%) were caused by the creation of an AG (i.e., a splice acceptor site), a substantial number of disease-associated alleles caused dramatic shifts in the usage of an existing AG (Figure A.1). MaPSy was found to be robust (Pearson's r = 0.85–0.89 between allelic splicing ratios from experimental replicates; Figure A.2a-d).

In order to assess the validity and relevance of the splicing aberrations detected by MaPSy, we performed RT-PCR validations in RNA extracted from patient samples consisting of lymphoblastoid cell lines, fibroblasts, whole blood and postmortem brain tissues (Figure A.3a-f and Table A.2). The validation samples were chosen solely on the basis of availability. In addition, we searched the literature for follow-up studies involving the mutations in the 5K panel that included RNA splicing analyses in patient tissue samples. A summary of the validations can be found in Table A.2. Overall, 81% (26/32) of MaPSy-detected ESMs were validated in patient tissue samples (Figure 2.2d). Furthermore, we compared the splice-site usage in 19 different cell lines that are part of the Encyclopedia of DNA Elements (ENCODE) data set with wild-type (reference) splicing in our 5K panel.
Exons that spliced most efficiently in the 5K panel also had the highest average splice-site usage in the ENCODE cell lines, whereas exons that spliced least efficiently in the 5K panel also had the lowest average splice-site usage in the ENCODE data (Figure A.3g).

24 A. J. Taggart et al. Nat Struct Mol Biol, 19: 719–21, 2012.

[Figure 2.1 image not reproduced. Panel labels: (a) wild-type and mutant test exon, 4,964 exonic loci; (b) in vivo splicing assay: pooled input DNA, tissue culture cells; loss 14%, null 82%, gain 4%; null hypothesis a/c = b/d, mutant loss a/c < b/d, mutant gain a/c > b/d; (c) in vitro splicing/spliceosome assembly assay: input RNA, HeLa N.E., A complex, B complex, C complex, spliced; loss 17%, null 76%, gain 7%; null hypothesis a/c = b/d = e/f = g/h = i/j.]
Figure 2.1: MaPSy on the 5K panel. (a) The panel consists of 4,964 mutant-wild type pairs. (b) The panel was incorporated into a three-exon in vivo library. The allelic ratios of both the input and output library were determined by deep sequencing. The result from RT-PCR of output RNA (spliced species) is shown (see also Figure A.2f). Splicing aberrations were found in 18% of mutants. (c) Allelic ratios were determined in spliceosomal intermediates; ∼24% of species disrupt splicing in vitro. N.E., nuclear extract.

[Figure 2.2 image not reproduced. Panel labels: (a) in vivo M/W ratio (log2) versus in vitro M/W ratio (log2), r = 0.55; (b) % 3′ss usage in vivo (log2) versus % 3′ss usage in vitro (log2), r = 0.8; (c) loss 8%, gain 2%, null 90%; (d) validation table, reproduced below.]

Splicing assay | Patient sample validation | n  | Agreement
pos            | pos                       | 17 | Y
neg            | neg                       | 9  | Y
pos            | neg                       | 3  | N
neg            | pos                       | 3  | N
Total          |                           | 32 | 81.3%

Figure 2.2: Robustness of MaPSy. (a) Allelic splicing M/W ratios in vivo versus in vitro. (b) Cryptic splice-site usage in vivo versus in vitro. (c) Exonic splicing mutations were identified in ∼10% of the 5K panel. (d) Summary of MaPSy validations in tissue samples from patients with mutations tested with MaPSy.

2.2.2 Nonuniform distribution of splicing mutations

Some exons appeared to have a higher fraction of splicing mutations than others (for example, exon 8 of MLH1 and exon 18 of BRCA1, adjusted $P = 2.26 \times 10^{-3}$ and $4.18 \times 10^{-6}$, respectively, two-sided binomial test). Interestingly, the set of (mostly) intronic splice-site mutations (SSMs) were also not distributed uniformly in disease-associated genes. Analyses of 2,314 disease-causing gene loci identified 64 genes that are predisposed to SSMs (Figure 2.3a, left and Table A.3).156 SSMs often result in exon skipping. Not surprisingly, SSMs and nonsense mutations in human disease-associated transcripts were positively correlated, as they both result in loss of function of the proteins that they encode. This correlation was not observed between missense mutations and SSMs (Figure 2.3a, middle and right). We found that ESMs were more abundant in genes that were also enriched for SSMs ($P = 3 \times 10^{-6}$, Kruskal-Wallis; Figure 2.3b and Section 2.4). This effect was more pronounced at the level of the individual exons ($P = 2.1 \times 10^{-34}$, Kruskal-Wallis; Figure 2.3c and Section 2.4). Moreover, disease-causing mutations with autosomal dominant inheritance showed a twofold ESM enrichment in haploinsufficient genes as compared to haplosufficient genes (P = 0.002, Kruskal-Wallis; Figure 2.3d). This finding is in agreement with splicing mutations acting mainly via a loss-of-function mechanism and further confirms the utility of MaPSy in identifying deleterious ESMs (Figure A.4). The same enrichment was also observed in SSMs reported in the HGMD (P = 0.02, Kruskal-Wallis; Figure 2.3e).157 Recently, the Exome Aggregation Consortium (ExAC) identified 3,230 genes that are depleted of protein-truncating variants (PTVs) in 60,706 humans,53 thus providing evidence for extreme selective constraint.
Because PTVs and splicing mutations often share the same loss-of-function mechanism, we examined disease-associated ESM occurrence in PTV-intolerant genes (probability that a gene is intolerant to a loss-of-function mutation (pLI) ≥ 0.9)53 in comparison to other genes. In the 5K panel, we found a threefold excess of ESMs in PTV-intolerant genes (n = 92) as compared to PTV-tolerant genes (n = 66) that cause dominant disease traits (adjusted P = 0.005, Kruskal-Wallis; Figure A.5a).53 These findings suggest that ESMs and SSMs are enriched in haploinsufficient genes, in which the loss of one functional copy likely leads to a disease phenotype. 157 N. Huang et al. PLoS Genet, 6: e1001154, 2010. Chapter 2. Identification of Splicing Defects 37 a y=1.1+1.2x y=11+2.7x R2 = 0.674 R2 = 0.411 300 100 100 n SSMs in HGMD genes n nonsense mutations n missense mutations in HGMD genes in HGMD genes 75 75 200 50 50 100 25 25 0 0 0 0 500 1000 1500 0 25 50 75 100 0 25 50 75 100 n all mutations n SSM in HGMD genes n SSM in HGMD genes in HGMD genes b c d e 15 25 25 %ESM in 5K panel %SSM in HGMD mean %ESM 20 in 5K panel 20 mean %ESM 20 in 5K panel 10 15 genes 15 5 10 10 10 5 5 0 0 0 0 0 7 13 20 0 80 1− 2 0 1 2 3 4 5 7− − − at I HS at I I HS I er H er H eH 13 eH n SSM in HGMD n SSM in HGMD od od genes exons m m Figure 2.3: Prevalence of splicing mutations in disease-associated genes. (a) Left, SSMs versus all exonic mutations in the HGMD with the 99.9% confidence interval shown in gray. Middle and right, number of SSMs versus nonsense variants (middle) and missense variants (right) in all disease-associated genes. (b) Mean ESM percentage for each gene plotted against roughly equal bins of the percentage of SSMs in HGMD genes (n = 708). (c) Mean ESM percentage for each exon versus the number of SSMs per exon (n = 2, 048). 
(d) Percentage of ESMs in haploinsufficient (HI; n = 174), moderately haploinsufficient (n = 567) and haplosufficient (HS; n = 874) genes in autosomal dominant diseases in the 5K panel.157 (e) Percentage of SSMs in HGMD with autosomal dominant inheritance in haploinsufficient (n = 1,383), moderately haploinsufficient (n = 14,059) and haplosufficient (n = 59,901) genes.157 Error bars, s.e.m. (b,c) and 95% confidence interval (d,e).

2.2.3 Random forest classification of exonic splicing mutations

Various genomic and sequence features have been reported to affect splicing.2,8,32,157,158 Although most of these studies were only done using a few substrates, MaPSy enables direct comparisons of the splicing performance of thousands of exons in vivo and in vitro (Figure A.2e). Many of these features (for example, differential GC content between exons and introns and density of exonic splicing silencers (ESSs)) were confirmed with MaPSy (Figure A.6a).32,158

2 W. G. Fairbrother et al. Science, 297: 1007 – 13, 2002.
8 M. Mort et al. Genome Biol, 15: R19, 2014.
32 M. Amit et al. Cell Rep, 1: 543 – 56, 2012.
158 S. Ke et al. Genome Res, 21: 1360 – 74, 2011.

[Figure 2.4, panels a and b: plot residue removed. Panel a (true positive rate versus false positive rate) legend: MaPSy in vivo, AUC = 0.81; MaPSy in vitro, AUC = 0.754; MaPSy both, AUC = 0.815. Panel b (mean decrease in accuracy) feature labels, in extraction order: SS strength diff, HGMD SS vars, n ESEs, n ESSs, Intron length, Exon length, ∆G (kcal/mol) WT exon, ESRseq diff, Exon SS strength, Distance to SS, Haploinsufficiency, Phastcons exon, PPT score, n introns, Exon pos in gene, ExAC SS vars, Nucleotide change G>T. Property classes: Exon, Mutation, Gene.]

Figure 2.4: Random forest classification of exonic mutations that disrupt splicing. (a) The classification performance of the random forest model was calculated as the AUC in receiver operating characteristic (ROC) analysis. (b) The order of variable importance by mean decrease in accuracy. Error bars, s.d. The directions of changes that promote ESMs are shown; positive directions are colored blue, and negative directions are colored red. Variables include differences in splice-site strength126 and hexamer splicing scores158 (SS strength diff, ESRseq diff), the sum of the effects of splice-site variants in the HGMD and ExAC data sets (HGMD SS vars, ExAC SS vars),53,156 numbers of ESEs and ESSs in the exon (n ESEs, n ESSs), the free-energy estimate in wild-type exon (∆G (kcal/mol) for the wild-type (WT) exon),159 exon conservation (Phastcons exon), number of introns (n introns) and relative exon position in the gene (Exon pos in gene). PPT, polypyrimidine tract.

We used random forest classification (Section 2.4) on the ESM data set generated with MaPSy to further understand the different contributions of the various genomic and sequence features that may lead to ESM.160 Performance of the random forest model was measured by mean area under the curve (AUC = 0.81, 0.755 and 0.816 for the in vivo, in vitro and combined approaches, respectively) (Figure 2.4a). The in vivo assay performed better than the in vitro assay, but combining the two assays resulted in a further increase in sensitivity to ESMs. Measures of feature importance were calculated as the mean decrease in accuracy (MDA). Each feature was categorized as a property of the mutation, the exon or the gene (Figure 2.4b). It was surprising that the majority of the top predictors of ESMs that are not within the splice-site regions (∼76%) were exon-level features, rather than some properties of the nucleotide substitutions (for example, exonic splicing enhancer (ESE) disruption and ESS creation).
In other words, some exon properties (for example, low ESE density and high ESS density) sensitize an exon to ESMs: variants in these exons are more likely to disrupt splicing (adjusted P = 1.8 × 10⁻¹² and 7.8 × 10⁻¹⁸, Kruskal-Wallis, for ESE and ESS density, respectively; Figure A.6b). In addition, the random forest model suggests that ESMs are more likely to occur in genes with many introns. We found that PTV-intolerant genes53 also contained more introns than the average for disease-associated genes (P < 2.2 × 10⁻¹⁶, Mann-Whitney), similar to ESM- and SSM-enriched genes (Figure A.5b).

160 Leo Breiman. Mach. Learn., 45: 5 – 32, 2001.

2.2.4 RNA-binding protein motifs in the 5K panel

Presumably, most mutations that alter splicing act by disrupting the binding site of an activator or by creating a binding site for a repressor. The loss or gain of previously characterized elements (i.e., the mutation being predicted to either promote or inhibit splicing) was compared to loss or gain of splicing in MaPSy1,2,108,161,162 (Figure 2.5).

[Figure 2.5: schematic residue removed. Recoverable labels: input, 5K panel; motif analysis; example profile of mutant matching a motif in vitro against M/W ratio ranking (1 to 5192), with "wild type stronger" and "mutant stronger" marking the two ends of the ranking; function analysis.]

Figure 2.5: Detection of RBP motifs that affect splicing. All mutant-wild type pairs were examined for difference in position weight matrices corresponding to 155 RBP motifs and known exonic cis elements.

A positive correlation was observed between gains of known exonic enhancing elements and relative splicing performance (i.e., mutant/wild type ratio, adjusted P = 7.75 × 10⁻²⁵, linear regression; Figure 2.6 and Section 2.4). In contrast, a negative correlation was observed between gains of known exonic silencing elements and the relative performance of splicing (adjusted P = 0.0001, linear regression; Figure 2.6).
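The comparison described above, in which the change in motif match between the wild-type and mutant sequence is related to the M/W splicing ratio, can be illustrated with a minimal sketch. It assumes a log-odds position weight matrix score, a uniform background and an ordinary least-squares slope; the matrix `TOY_PWM`, the function names and the example sequences are hypothetical and are not the motifs or the code used in this work.

```python
import math

# Hypothetical 4-position motif (per-position base probabilities).
# The numbers are invented for illustration, not a published RBP motif.
TOY_PWM = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
]
BACKGROUND = 0.25  # uniform background frequency (assumption)


def window_score(pwm, window):
    """Log2-odds score of one window against the motif."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, window))


def best_match(pwm, seq):
    """Highest log-odds score over all windows (seq must be at least motif length)."""
    k = len(pwm)
    return max(window_score(pwm, seq[i:i + k]) for i in range(len(seq) - k + 1))


def motif_change(pwm, wild_type, mutant):
    """Mutant minus wild-type best match; positive values mean a gained or stronger match."""
    return best_match(pwm, mutant) - best_match(pwm, wild_type)


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical wild-type and mutant sequences differing by one substitution.
    print(motif_change(TOY_PWM, "CCGACACC", "CCGAGACC"))
```

In such a sketch, a positive slope of the M/W ratio on the motif change would be read as enhancer-like behavior and a negative slope as silencer-like behavior; significance testing and multiple-testing adjustment are left out.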
1 Z. Wang et al. Cell, 119: 831 – 45, 2004.
108 X. H. Zhang and L. A. Chasin. Genes Dev, 18: 1241 – 50, 2004.
161 S. Ke, X. H. Zhang, and L. A. Chasin. Genome Res, 18: 533 – 43, 2008.
162 P. J. Smith et al. Hum Mol Genet, 15: 2490 – 508, 2006.
39 D. Ray et al. Nature, 499: 172 – 7, 2013.

[Figure 2.6: plot residue removed. Panels: RescueESEs in vitro, WangESSs in vitro, SRSF1 motif in vitro, PTBP1 motif in vitro. Axes: mutant matching (y) versus M/W ratio rank (x).]

Figure 2.6: Profiles of RBP motifs. Motif profiles show clear trends of agreement with previously defined functions. Shaded blue regions represent 95% confidence intervals.

To predict which binding events of trans-acting factors were affected by exonic mutations, we compared the splicing effect of thousands of point mutations (using the relative splicing performance of the mutant versus wild-type sequence in MaPSy) with the predicted change of the binding affinity of 155 human RNA-binding proteins (RBPs) (determined bioinformatically using published data).39 Briefly, mutant-wild type pairs were ranked from the lowest to highest degree of exon inclusion for the mutant allele relative to the wild-type allele. The predicted changes in binding affinity were compared to the
observed gain or loss of splicing activity (i.e., the mutant/wild type ratio).163 Levels of SRSF1, a well-characterized exonic splicing activator,35,164 showed a positive correlation with splicing (adjusted P = 3.34 × 10⁻²⁷, linear regression; Figure 2.6), whereas levels of polypyrimidine tract-binding protein 1 (PTBP1), a known exonic splicing repressor, correlated negatively with splicing performance (adjusted P = 3.26 × 10⁻²¹, linear regression; Figure 2.6).112,165

As the presence of an RBP motif does not necessarily result in a binding event,39,166 it is necessary to validate the relationship between the increase or decrease of protein binding with the increase or decrease of splicing. An ESM in exon 20 of COL1A2 (NM_000089.3:c.1045G>T) was predicted to create a PTBP1 motif. If PTBP1 binding were responsible for splicing repression, depletion of PTBP1 would be predicted to relieve the splicing defect. We found that, in the absence of PTBP1, rescue of splicing (i.e., ∼0.5-fold less skipping) was observed in the mutant exon, but not in the wild-type exon (P = 4.19 × 10⁻⁵, two-sided Cochran-Mantel-Haenszel χ² test; Figure 2.7, right and Figure A.7a). An ESM that was predicted to function by disrupting SRSF1 binding in exon 8 of MLH1 (NM_000249.3:c.595G>C) was also selected for similar analysis. In the absence of SRSF1, the wild-type exon had a significant increase in skipping events (P = 0.0002, two-sided Cochran-Mantel-Haenszel χ²; Figure 2.7, left and Figure A.7b), but the mutant exon did not (P = 0.07, two-sided Cochran-Mantel-Haenszel χ²). This result demonstrates how motif prediction can identify mutations where the gain of PTBP1 binding or the loss of SRSF1 binding can lead to the ESM phenotype.

163 D. Ray et al. Nat Biotechnol, 27: 667 – 70, 2009.
35 J. C. Long and J. F. Caceres. Biochem J, 417: 15 – 27, 2009.
164 M. A. Rahman et al. Sci Rep, 5: 13208, 2015.
112 T. Sterne-Weiler et al. Genome Res, 21: 1563 – 71, 2011.
165 H. Shen et al. RNA, 10: 787 – 94, 2004.
166 J. Wang, S. H. Xiao, and J. L. Manley. Genes Dev, 12: 2222 – 33, 1998.

[Figure 2.7: bar-plot residue removed. Y axes: ratio of % skipping (SRSF1 kd/ctrl), left, and ratio of % skipping (PTBP1 kd/ctrl), right; bars for WT and MT.
Left sequences: WT GCAAGGAGAGACAGT; MT GCAAGGACAGACAGT.
Right sequences: WT TGAGCCTGGTCCAGC; MT TGAGCCTTGTCCAGC.]

Figure 2.7: Validation of RBP motifs that affect splicing. Left, in the absence of SRSF1, the mutant (MT) exon in which the SRSF1-binding motif is disrupted shows a modest but nonsignificant increase in exon skipping, whereas the wild-type (WT) exon with the SRSF1 motif has a twofold increase in exon skipping, compared to splicing in the presence of SRSF1. Right, the splicing phenotype of a mutation that creates a PTBP1-binding motif was rescued (∼0.5-fold fewer skipping events) when PTBP1 was knocked down, whereas the wild-type exon was not affected. The sequences represent the exonic sequence around the mutation site: red, sequence that highly matches the motif; blue, sequence that does not match the motif. ***P < 0.001, two-sided Cochran-Mantel-Haenszel test. Error bars, s.d. kd, knockdown; ctrl, control. Experiments were performed in two cell culture replicates.

Clustering the functional profiles of human RBP motifs in the 5K panel (Section 2.4) resulted in 19 clusters, of which the 2 largest matched the profile of exonic splicing enhancers and repressors (Figure 2.8). The method was robust; >90% of all motifs that functioned as silencers or enhancers in vivo segregated into the same category in vitro (P = 1 × 10⁻¹⁶ and 1.5 × 10⁻¹⁰, one-sided Fisher's exact test for Venn diagram overlap of exonic splicing repressors and activators, respectively; Figure 2.8 and Figure A.8e). Overall, 38 motifs corresponding to 35 RBPs consistently behaved as exonic repressors and 24 motifs corresponding to 25 RBPs behaved as exonic activators in both assays.
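The knockdown comparisons in Figure 2.7 use a Cochran-Mantel-Haenszel test across two replicates. As an illustration only, the following sketch computes the textbook statistic for a set of 2 × 2 tables, treating each replicate as one stratum with rows for knockdown and control and columns for skipped and included reads. The table layout, the capped continuity correction, the function name `cmh_test` and the read counts are assumptions of this sketch, not a description of the analysis code used here.

```python
import math


def cmh_test(tables):
    """Cochran-Mantel-Haenszel chi-square test for a list of 2x2 tables.

    Each table is ((a, b), (c, d)) for one stratum (here, one replicate):
    rows are knockdown and control, columns are skipped and included reads.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    delta = 0.0     # sum over strata of observed minus expected count in cell a
    variance = 0.0  # sum over strata of the hypergeometric variance of cell a
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        delta += a - row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    # Continuity correction, capped at |delta| (an assumption of this sketch).
    correction = min(0.5, abs(delta))
    statistic = (abs(delta) - correction) ** 2 / variance
    # Chi-square survival function with 1 d.f.: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical read counts for two replicates.
    replicates = [((60, 40), (30, 70)), ((60, 40), (30, 70))]
    print(cmh_test(replicates))
```

Stratifying by replicate keeps a difference in sequencing depth between replicates from being mistaken for a knockdown effect. The sketch does not guard against degenerate tables in which the summed variance is zero.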
Comparing the degree of predicted intronic binding with splicing performance suggests that most exonic repressors enhance splicing when bound in introns (57%; Figure A.8c) and most exonic activators repress splicing when bound in introns (77%; Figure A.8d). These findings reinforce the notion that splicing factors behave in highly position-dependent manners.7,61

7 K. H. Lim and W. G. Fairbrother. Bioinformatics, 28: 1031 – 2, 2012.

[Figure 2.8: plot residue removed. 19 clusters; the two largest are labeled exonic splicing repressors (1) and exonic splicing activators (2), with example profiles for PTBP1 and SRSF1 in vitro and in vivo. Venn diagram counts: exonic splicing repressors, 11 in vitro only, 38 shared, 3 in vivo only; exonic splicing activators, 33 in vitro only, 24 shared, 2 in vivo only.]

Figure 2.8: Clustering of RBP motifs. Clustering of data shows similar functions for RBP motifs in vivo and in vitro. The plot was generated by sliding window (Section 2.4). The mean values from each bin (sliding window) are colored black.

2.2.5 Mechanistic signatures of splicing mutants

During the development of the in vitro splicing assay in the 1980s, techniques were developed to isolate the biochemical intermediates in the stepwise assembly of the spliceosome.167 A spliceosome is assembled from the A through the B to the C complex on the model adenovirus substrate, as previously described.168,169 In accordance with catalysis occurring in the C complex, chemical intermediates of splicing co-migrated with the C complex during glycerol-gradient centrifugation (Figure 2.9a).

167 R. A. Padgett et al. Annu Rev Biochem, 55: 1119 – 50, 1986.
168 M. M. Konarska and P. A. Sharp. Cell, 46: 845 – 55, 1986.
169 R. Das and R. Reed. RNA, 5: 1504 – 8, 1999.

[Figure 2.9, panels a-d: schematic and gel residue removed. Panel titles: (b) Native complexes; (c) RNA species in gradient fractions; (d) Native complexes, pre-SELEX and post-SELEX. Rows: adenovirus (top) and MaPSy in vitro (bottom); gradient fractions 1 to 16; complexes B/C, A, E, H and free RNA.]

Figure 2.9: Isolation of spliceosomal intermediates. (a) After the in vitro MaPSy assay, the splicing reaction was loaded onto a 10-30% glycerol gradient followed by fractionation. Spliceosomal intermediates from different stages of assembly were retrieved from the different fractions. (b) Spliceosomal complexes (B/C, A, E, H) visualized in native gels for control (top) and heterogeneous library (bottom) substrates. (c) RNA splicing intermediates migrate to the same fractions in control and heterogeneous library substrates (underlined in orange). Total RNA before (T) and after (T′) splicing is indicated. (d) Reassembly of purified B/C and A fractions (middle and bottom) in comparison to assembly of original input (top). Fractions used for SELEX are underlined (cyan).

This same procedure was implemented on the 5K panel of mixed library substrates. Although each library member is the same length, greater heterogeneity in complex mobility was observed (Figure 2.9b). Despite this increased heterogeneity, distinct splicing complexes were effectively partitioned, as the splicing intermediates and final products were found to segregate into the same fractions as seen in the control (Figure 2.9c). Furthermore, each stage of spliceosome assembly had a distinct composition of library species that could be further enriched by a systematic evolution of ligands by exponential enrichment (SELEX) approach (Figure 2.9d and Figure A.9a).
For example, extracting RNA from the B/C fraction and repeating the spliceosome assembly assay returned a clear bias toward the B/C complex (Figure 2.9d, middle), whereas reassembly of the A fraction resulted in a bias toward the A complex (Figure 2.9d, bottom). By using glycerol-gradient centrifugation coupled with next-generation sequencing, the allelic ratio of each locus was determined at the different stages of spliceosome assembly: pre-assembly (t0), A, B/C and spliced. In general, RNA species that were enriched in the early A complex were under-represented in the spliced fraction, suggesting that the species blocked from transitioning to the catalytic B/C complex were accumulating in the A complex. Conversely, RNA species that were enriched in the B/C complex were also enriched in the spliced fraction, suggesting that spliceosomes at the B/C stage were mostly committed to splicing (Figure A.9b). Clustering the 5K panel by allelic ratios in the different spliceosomal fractions showed distinct patterns of disruptions. Most mutations affected multiple transitions of the spliceosome (Figure 2.10 and Figure A.9c). We found that mutations in the same exon were more likely to cluster together (P = 0.008, permutation test). This result suggests that an exon disrupted by splicing mutations will tend to fail at the same stage of spliceosome assembly, a behavior that is consistent with the finding that exon properties are strong predictors of ESMs (Figure 2.4). The allelic ratio profiles in the different assemblies seem to represent mechanistically distinct scenarios of splicing disruption. For example, mutants in cluster 20 are strongly inhibited in each step of spliceosome assembly (Figure 2.10).
Interestingly, cluster 20 comprises mutations that are likely to trigger structural rearrangements (average ∆∆G = 1.95 kcal/mol, adjusted P = 0.014, permutation test).159 They are single substitutions that, on average, were predicted to trigger the formation of four new base pairs that contribute to a more closed RNA secondary structure. Cluster 15 contained mutations in weakly defined exons (low differential GC content and high numbers of ESSs, adjusted P = 0.008 and 0.014, respectively, permutation test) that were flanked by highly conserved introns (adjusted P = 0.006, permutation test). The splicing progression of these mutants was stalled in A and B/C; all of them significantly altered splicing in vitro and ∼80% of them also significantly altered splicing in vivo. Exons in clusters 15 and 20 are also frequent targets of disease-causing SSMs,156 which is consistent with the finding that disease-causing ESMs and SSMs are often co-enriched in the same exons. In contrast, mutations in cluster 14 were associated with strongly defined exons (high differential GC content and low numbers of ESSs, adjusted P = 0.001 and 0.002, respectively, permutation test) and rarely disrupted splicing (Figure 2.10). Mutants in cluster 7 were found in exons with strong splice sites (adjusted P = 0.01, permutation test), and their respective wild-type exons were strong splicers both in vivo and in vitro, having a mean splicing efficiency that was significantly higher than the mean splicing efficiency of wild-type exons from a random sampling of 10,000 (adjusted P = 0.0008 for both assays, permutation test). The splicing progressions of these mutants were mainly inhibited in the A complex. Whereas mutations in clusters 15, 16 and 20 represented ESMs with the most dramatic change in the splicing phenotype of the mutant substrate in comparison to the wild-type substrate, ESMs in clusters 7 and 14 had modest effects on splicing (Figure A.10).
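Several of the cluster-level comparisons above are reported as permutation tests. A minimal sketch of one such test follows, assuming a one-sided comparison of a cluster mean against the means of equally sized random draws from the whole panel. The function name, the resampling scheme, the add-one p-value estimate and the toy data are assumptions for illustration and may differ from the procedure in Section 2.4.

```python
import random


def permutation_p_value(cluster_values, all_values, n_resamples=10_000, seed=0):
    """One-sided permutation p-value for an unusually high cluster mean.

    Draws n_resamples random subsets of all_values, each the size of the
    cluster, and counts how often a random mean reaches the observed mean.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = len(cluster_values)
    observed = sum(cluster_values) / k
    hits = 0
    for _ in range(n_resamples):
        sample = rng.sample(all_values, k)  # sampling without replacement
        if sum(sample) / k >= observed:
            hits += 1
    # Add-one estimate keeps the p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Hypothetical feature values for a 100-exon panel and a 10-exon cluster.
    panel = list(range(100))
    cluster = list(range(90, 100))
    print(permutation_p_value(cluster, panel, n_resamples=2000))
```

With this estimator the smallest attainable p-value is 1 / (n_resamples + 1), so the number of resamples bounds the resolution of the test. Adjustment for multiple comparisons would be applied afterward and is not shown.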
It remains to be determined whether these distinct modes of splicing disruption are associated with the degree of severity or other aspects of disease phenotypes. We predict that a mechanism operating via structural changes (for example, cluster 20) is likely to function independently of tissues and cell types, as they seem more independent of trans-acting factors that may vary across tissues and cell types, whereas mutational mechanisms that involve

159 R. Lorenz et al. Algorithms Mol Biol, 6: 26, 2011.

[Figure 2.10: figure residue removed. Recoverable labels: panels for Cluster 2, Cluster 15 and Cluster 20 plotting allele ratio across the t0, A, BC and Sp fractions; sample-size annotations n = 9, n = 38 and n = 1734; legend entry "Significant in vitro and in vivo"; leaf labels are mutation identifiers of the form CM097963M.]
CM070 092M 800M 291M 674M CM940 CM110 899M 064M 402M 546M 302M CM992 CM101 CM076 123M 767M 778M 024M CM021 CM074 CM030 CM087 266M 008M 018M 163M CM033 CM074 CM025 HM971 CM088 893M 877M 173M 027M 447M 767M CM024 CM072 CM081 014M 968M 933M CM093 CM950 CM086 CM090 340M 149M 220M 350M CM042 CM107 CM011 CM051 822M 763M 335M CM070 CM104 CM051 650M 105M 318M 008M HM971 CM061 CM053 377M 228M 439M 624M CM030 CM083 HM080 CM003 CM962 CM000 CM118 495M 645M 813M 245M CM116 CM002 092M 458M CM024 CM118 512M and minor disruptions are indicated with purple arrows. CM087 CM042 CM080 CM107 513M 572M 455M 867M CM054 CM051 CM011 412M 110M 069M 899M CM090 CM087 CM118 CM930 CM041 277M 109M 201M 213M CM890 CM011 CM034 CM983 CM050 806M 429M 658M 196M 149M CM062 CM972 CM024 207M 641M 279M 476M CM041 CM110 CM080 CM890 083M 970M 428M CM097 CM962 CM053 417M CM920 CM023 537M 079M 254M 031M 743M CM970 CM113 CM970 CM033 753M 231M 922M 153M 210M CM108 CM033 CM041 680M 701M 848M CM950 CM031 CM973 CM025 430M 992M 718M 568M 707M CM0740 CM9601 CM0901 623M 876M 247M 403M CM970 CM101 CM016 CM950 025M 905M 283M CM0950 CM9701 CM0300 CM0223223M CM0960 032M 826M 815M CM081 CM063 CM091 CM073 CM070 476M 31M 974M 95M 34M CM1102 CM0439 CM0987 714M 99M 11M CM068 CM021 CM981 CM082 96M 91M 65M 36M 72M CM9908 CM9709 CM0631 CM9801 28M 35M 77M CM930 CM983 CM990 12M 97M 19M 21M CM0417 CM0748 CM0547 06M 34M 70M CM0914 CM0023 CM0930 CM0023 69M 11M 12M 44M CM1122 CM0531 CM0730 CM0955 74M 85M 52M CM0962 CM0941 CM9311 CM0917 13M 55M 14M 42M 47M CM1143 CM9940 CM0569 CM9903 81M 34M 54M CM9509 CM0434 CM9846 CM9701 CM9101 10M 31M 56M 39M CM0231 CM0222 CM0334 CM1089 CM0113 17M 34M 60M 38M CM9614 CM0502 CM1178 CM0582 32M 57M 05M CM0715 CM9611 CM9818 CM1155 51M 65M 83M 08M 44M CM9928 CM0036 CM1178 CM0856 CM9604 69M 25M 57M 25M CM1175 CM0978 CM9602 94M 63M 52M 41M CM9937 CM0744 CM0716 CM9707 42M 21M 10M CM1015 CM0750 CM0881 07M CM9838 02M 10M 12M 74M 52M CM1040 CM0305 CM9610 CM9505 25M 70M 44M 65M CM0941 CM9301 CM0911 CM0957 60M 
96M 95M 86M CM0913 CM0835 CM0872 CM0780 63M 80M 57M 0M CM10680 CM95037 82M 37M CM0980 122M CM9937 CM1118 16M 4M 8M 6M 4M CM00382 CM04570 HM0703M CM00058 CM00341 12M 73M 06M 92M 218M CM0829 CM0121 CM0911 CM0454 8M 9M 7M 3M CM00444 CM92024 CM10228 CM97038 09M 7M CM0629 CM9841 CM1082 CM06302 7M 8M 9M CM01369 CM04170 CM10837 CM04205 8M 0M CM06502 CM05313 CM10780 CM02529 3M CM07032 CM07285 CM04075 CM98041 5M 1M 6M 2M CM99040 CM10111 0M CM96254 CM07751 CM09294 CM95214 1M 0M 9M 3M CM11369 CM95021 CM01021 CM98023 3M 5M 6M 7M CM06262 CM04204 CM06030 CM98172 7M 6M 3M 1M CM11389 CM99366 1M 8M 7M 0M 4M CM10199 CM10018 CM11911 CM03050 7M 5M 0M CM05223 CM09059 CM06620 5M 6M 1M CM97115 CM07155 CM08561 CM06606 1M 2M 9M 4M 9M CM09508 CM97015 CM109712 CM043002 CM115591 CM084871 6M 8M 5M CM07015 CM06298 HM08010 8M 4M 5M CM076325 CM099958 CM930538 CM000192 8M 0M 1M CM06316 CM03266 CM98023 CM08159 CM05718 3M 3M 1M CM98107 CM041708 CM109243 CM992852 CM990665 CM042690 CM094748 CM061956 CM096537 CM012192 CM022250 CM961135 CM002270 CM032199 CM041691 CM960565 CM098970 CM055933 CM087374 CM076102 M CM982025 CM061864 CM920975 CM024141 CM010523M CM005465M CM951265M HM972177M CM062909M CM973013M CM095705M CM971219M CM001290M CM063883M CM108638M CM025504M CM109531M CM082800M CM061734M CM023354M CM890073M CM106927M CM058226M CM080469M CM110735M CM051392M CM0910708M CM023174M CM111701M M M M M M M M 9M 5M 4M M M M M M M M M CM02066 4M CM94093 CM01292 CM97055 CM10358 CM04375 CM05337 CM06040 CM03489 CM03286 M CM023170 CM014171 CM940448 CM074799 CM940440 CM111552 CM940442 CM102124 CM076497 CM107573 CM094724 CM098378 CM074903 CM042393 CM118655 CM080465 CM032358 CM972963 CM890062 CM920245 CM930137 CM100509 CM002856 CM001971 CM960198M CM061679M CM995125M CM950154M CM042395M CM981636M CM920680M CM940335M CM098244M CM990993M CM920192M CM900062M CM071742M CM016043M 7M 2M CM880044M CM063089M CM111705M CM011499M CM950496M CM104887M M M M MM M MM M M MM A CM107313M CM095013M CM970499M CM111121M CM113722M 
CM054062M CM095776M CM055908M CM095803M CM004437M CM000788M CM034719M CM000644M CM065244M CM117075M CM108963M CM064058M CM033015M CM952423M CM920294M CM114769M CM052179M CM041078M CM022318M CM087946M CM107923M CM041750M CM010211M CM970331M CM090201M CM111281M CM109293M CM101378M CM106869M CM062892M CM110140M CM940441M CM070682M CM100510M HM070086M CM032299M CM970175M CM071616M CM110771M CM102760M CM940173M CM072062M CM080161M CM030007M CM067013M CM950171M CM117289M CM054794M CM111359M CM098952M CM940244M CM093872M HM090036M CM910237M CM100401M CM067460M CM042296M CM092494M CM087192M CM000326M CM093868M CM056047M CM109180M CM041378M CM990846M CM076444M CM111640M CM043444M CM001183M CM060507M CM081167M CM063934M CM950893M CM117784M CM070870M CM077967M CM991024M CM910304M CM971400M CM991107M CM980675M CM920976M CM990648M CM023646M CM981405M CM011946M CM085299M CM101221M CM070679M CM910091M A CM003667M CM920213M CM085649M CM109828M CM981815M CM940800M CM010936M CM970701M CM072090M CM971126M CM100645M CM990342M CM050541M CM083749M CM065130M CM114634M CM103638M CM961467M CM031626M CM076113M CM084854M CM010350M CM040731M CM113595M CM002783M CM993202M CM119018M CM950318M CM001385M CM081193M CM065516M CM075938M CM099636M CM110247M CM970342M CM114203M CM020306M CM962538M CM052841M CM002816M CM115271M CM992935M CM074199M CM0910621M CM115127M CM002840M CM110892M CM103188M CM096531M CM104304M CM103345M CM001705M HM040083M CM086455M CM099950M CM910236M CM070853M CM065353M CM108509M CM014510M CM960437M CM088156M CM072956M CM067721M CM020627M CM940008M CM073118M CM113334M CM1010026M CM016042M CM990297M CM083094M CM072078M CM950975M CM091824M CM051872M CM950140M CM080120M CM085332M CM115680M CM000318M CM990381M CM980338M CM107324M CM992306M CM001181M CM023601M CM001703M CM981730M CM992877M CM073954M CM115187M CM000418M CM990386M CM087781M CM098271M CM950772M CM073231M CM990619M CM041709M CM984016M CM994108M CM110503M CM093610M CM970739M CM051858M CM992420M CM003418M CM023149M 
CM082912M CM056033M CM983568M CM114307M CM111600M CM115863M CM981533M CM083171M CM950288M CM920365M CM960082M CM083794M CM0911073M CM021714M CM033932M CM080348M CM993619M CM000334M CM061729M CM001972M CM950526M CM001089M CM980712M CM020145M CM970214M CM051391M CM043976M CM020956M CM100385M CM960787M CM093511M CM024104M CM109550M CM095077M CM010957M CM0911233M CM960412M CM062967M CM070736M CM083748M CM981729M CM094232M CM060447M CM994056M CM010456M CM071747M CM050553M CM090004M CM040747M CM011307M CM103067M CM081709M CM000844M CM082994M CM106157M −2 n = 66 CM040224M CM970138M CM064183M CM051013M CM081429M CM002636M CM060038M CM013773M CM056547M CM101535M CM992367M CM111427M CM001284M CM992841M CMM0 059 77 6 2588 2 6432M CM041068M CM094766M CM081406M CM077528M M9 0 4 0 9 M 8 0087 9 0 26 0 4 165 7 7 2 7 M C C 097 9 7 2 0 3 5 0 4 5 2 9 0 M M CM CM C M1 1 6 0 5 1 8 6 9 4 1 7 0 M CM C M0 M 0 0 197 6 11 7 0 57 8 7 2 02 8 3 2 3 6 3 8 5M 3 1 C C 8 9 058 0 0 1 0 2 1 5 9 6 7 2 M M C M 9 0 9 6 9 8 0 1 7 6 3 7 6 6 8 5 4 M M C H M M1 M 09 0 9 2 1 0 3 7 4 00 3 9 0 4 0 3 1 20 52 6 8 0 24 M M C C M M 0 003 59 3 6 37 8 2425 36 M C C M91 01 2 6120 4285 2M M C M91 0 1 5 07 990 41 0 5 8MM 00 02 4 4 83 3 8 M C C CM MM0 9 9 92 5 6 61 151 5 0303 19 4 1M M C CM C M01 9 1 8 1 441 0 6 78 9 7 4 7 3 4 2M 7 C C M M 9 10 0 18 7 3624 052 06 11 5 06 77 2 M C HMM8 08 0 47 1 0330 79 3 45 7 8569M 0 M CCM9 1 0177 59 2 3 53 7 0 1 9 M 81M C C M 9 0 0 8 0 3 5 0 2 0 6 5 6 4 7 6 5 0 3 6 4 M CMM90 95 6 70 9 119 0 6 8 38 7 3 7 4 0 7 3 7 M M CCMM M 99 09 1 9 80 0 1 510 50 0 7 23 9 74 0M 9 M M C M M 0 9 0 0 32 2 3 50 2 9 04 09 4 88 1 3MM C C M 0 9 0 7 4 6 4 1 0 5 3 4 7 5 4 5 M C M M9 M 9 7 6 2 0 0 6 4 7 3 9 1 9 1 4 5 3 M CCMM1001 876 6 5495 029 95 064 10MM CC CM CM 0 049 32 1 0 9 9 3 1 6 0M2 MM C CMM 9 7 3 3 0 8 7MM C 1 9 0 0 9 006 11 9 2 06 8 6 8 1 7 78 1 5 0 M M C M000 7 630 919 8 8 0 326 127 81429 M M C C M M9 M 0 1 7 0 9 6 0 112 4 0 1 0 2 283 1 3 9 6 M M C M CCMM009525 33 22 0604 5 790 8 275MM C C M M 0 5 
8 2 1 0 7 2 4 3 4 7 6 1 7 1 2 0 8 4 0 M CMM 5 67 014012450 7 6 587 1 1 49 5MM C 0 9 7 6 1 0 6 3 4 3 4 6 M M CC CMM009 0 1 0 0 2 1 5 2 6 2 7 1 6 1 8 6 M M CM C M0 M 02 0 51 7 10 6 1 85 2 8 4 4 7 M C MM 9 0 0 8 2 0 1 4 23 8 0 2 1 6 9 M M C C M 1 9 1 0 0 6 0 9 8 1 0 1 1 6 2 5 8 36 3 9 5M 9 M C M 045 2 3 1 2 1 0 4 0 M M 0 8 5 4 8 1 9 0 6 7 2 5 0 M −6 n = 24 C C M 9 3 0 0 9 53 60M C C M M 1 0 1 0 1 6 0 2 4 5 3 1 8 4 8 1 4 8 3 1 3 9 M C M0 9 087 98 CM050562M 4284 98 9M C M 1 2 1 CM034729M CM096540M1 0 1 9 1 1 8 M M CM060024M CM024241M CM077241M CM952022M CM085331M CM112251M CM117110M CM021290M CM095557M CM951100M CM030844M CM021060M CM116213M CM920012M CM066538M CM103071M CM990417M CM116108M CM119020M CM031987M CM113284M CM015028M CM107081M CM074764M CM023961M CM043329M CM090164M CM011306M CM107926M CM061140M CM108953M CM0910982M CM091983M CM004440M CM031420M CM0910218M CM981267M CM065329M CM015134M CM099587M CM001093M CM060475M CM090760M CM114910M CM091203M CM093809M CM057507M CM980992M CM065499M CM972849M CM104004M CM930522M CM930193M CM074167M CM900033M CM045296M CM034614M CM046066M CM062989M CM000583M CM083766M CM920613M CM080180M CM920384M CM051887M CM000314M CM970359M CM108463M CM950081M CM035621M CM001055M CM083085M CM067386M CM020781M CM103435M CM061691M CM980822M CM994451M CM023192M CM043505M CM042916M CM984088M CM035696M CM111990M CM064130M CM030394M CM077910M CM060022M CM083154M CM043850M CM000328M CM071635M CM021099M CM100533M CM040236M CM094649M CM001980M BC CM053814M CM002626M CM097668M CM088136M CM003261M CM076324M CM103261M CM072059M CM111025M CM100724M CM025388M CM078351M CM890070M CM109834M CM085261M CM004249M CM033357M CM952105M CM022340M CM117115M CM055544M CM031384M CM105919M CM010880M CM115577M CM042297M CM051609M CM043548M CM066258M CM044230M CM013736M CM992428M CM981023M CM101209M CM022252M CM043440M CM1010025M CM096577M CM087391M CM057337M CM058290M CM071814M CM117105M CM110793M CM942084M CM118314M CM950331M CM012365M CM920209M CM016239M 
CM960337M CM065141M CM045624M CM994119M CM076528M CM023579M CM990418M CM092938M CM070755M CM116410M CM034236M HM070116M CM074544M CM106767M CM083662M CM041721M CM012187M CM993871M CM101662M CM993360M CM100563M CM043295M CM095836M CM063908M CM081648M CM920208M CM981290M CM002630M CM021106M CM107734M CM970698M CM950336M CM992389M CM072966M CM090725M CM112100M CM950855M CM031224M CM073277M CM940937M CM030206M CM113250M CM041685M CM060001M CM072926M CM080595M CM010524M CM043103M CM044721M CM034220M CM103734M CM111079M CM117702M CM990223M CM086712M CM010197M CM043967M CM104543M CM983734M CM973118M CM072886M CM970727M CM012946M CM070783M CM981932M CM077509M CM031201M CM068400M CM0910712M CM941116M CM051840M CM102665M CM107325M CM119019M CM085614M CM100646M CM880034M CM107293M CM077316M CM053167M CM114378M CM001963M CM111298M CM110413M CM078484M CM064026M CM930741M CM064031M CM090566M CM994055M CM992192M BC CM950163M CM076058M CM093833M CM076555M CM910166M CM094159M CM054817M CM020434M CM031257M CM090380M CM044589M CM099959M CM062547M CM010351M CM000327M CM086585M CM091351M CM005526M CM992311M CM086473M CM030427M CM108292M CM980271M CM034237M CM101681M CM982028M CM023058M CM970982M CM072957M CM040421M CM062428M CM114665M CM041715M CM086797M CM983520M CM041261M CM034103M CM087867M CM086635M CM118405M CM052168M CM097507M CM920219M CM981719M CM095021M CM041712M CM094509M CM087029M CM114728M CM051590M CM041707M CM071668M CM003386M CM095636M CM090377M CM980945M CM096480M CM000315M CM022440M CM068450M CM930740M CM085382M CM013930M CM094694M CM960846M CM020780M CM920296M CM109871 CM107576 CM001286 CM045312M CM087389M CM107179M M M CM960627 CM101903 CM093775 CM013523 CM112870M CM031307M CM040989M CM052344M M CM950934 CM041372 CM930246 CM090462 M CM115347M CM991079M CM042264 CM107604 M M CM068585 CM983310 CM961070 CM992371 M M CM095228 CM077954 CM090348 CM000502 MM CM950066 CM060041 CM101213 CM034741 CM000313 M CM040782 CM104056 M CM106596 CM074412 M CM942095 CM061989 CM013383 
CM053186 CM042394 M M CM981595 CM011410 M M CM096532 CM024622 CM994572 CM980740 MM CM110156 CM971220 2M CM994042 CM023684 CM085226 M M CM010237 CM042919 CM082550 CM960841 0M 4M 6M CM02231 CM00033 CM10720 CM09012 MM CM093354 CM960165 CM042679 CM971149 0M 4M 8M 8M 2M CM11300 CM10226 CM92047 CM07414 CM99330 M M CM01342 CM06311 6M 4M 3M CM04170 CM08302 CM95021 CM94093 M M CM10137 CM98021 CM04227 CM06291 0M 8M 5M 5M CM95063 CM01209 HM08004 CM01342 3M 9M 4M 6M CM11706 CM10845 CM99004 CM11849 CM078120M 7M 6M 1M 0M CM10573 CM07156 CM04174 CM02298 5M 6M 0M 3M 2M CM96025 CM08274 CM96078 4M 5M 7M 6M CM99238 CM11248 CM00441 CM00125 4M 7M 2M CM97131 CM98181 CM02079 CM03193 CM04306 4M 6M 0M 3M CM11102 CM10197 CM06418 CM02133 5M 6M 7M 5M CM06835 CM09095 CM02169 6M 5M CM95039 CM10363 CM06390 CM9502 CM9419 2M 1M 8M 6M CM00058 CM11060 CM01149 CM10197 0M 2M 7M 1M CM0853 CM0121 CM9840 CM0569 2M 3M 7M CM95123 CM10635 CM11825 CM01439 8M 3M CM0417 CM1107 CM9001 CM9703 CM0430 6M 2M 1M CM0625573M CM06604 CM09598 53M CM9402 CM9502 CM0343 CM9502 7M 40M 0M 69M 35M 01M CM08689 CM92019 CM07083 CM04213 84M 48M 07M 22M CM0733 CM9205 17M 75M 67M 29M 31M CM94036 CM08250 27M 01M 33M 45M 08M 28M CM0950 CM0625 CM0253 00M 29M 23M 10M CM1013 CM0427 CM1054 CM1115 99M 05M 51M 55M CM1042 CM9603 CM1033 CM9940 17M CM0662 92M 09M CM0610 CM0708 CM1150 CM0512 CM9941 CM9928 31M 95M 36M 76M 70M CM9903 CM0540 CM0744 CM0715 90M 35M 86M 55M CM0936 CM9703 CM0506 CM0629 63M 07M 01M 757M CM9715 CM1040 43M 90M 17M CM0237 CM9836 CM0404 CM0610 CM0046 21M 57M 26M 86M CM0859 CM0830 CM9804 54M CM0430 CM0707 69M 21M 01M CM0804 CM0514 CM0825 CM0028 93M 77M 58M 33M CM0312 CM1037 CM1101 CM0204 CM0600 73M 54M 31M CM0431 CM0910 CM1190 63M 93M 53M 92M CM1062 CM1037 CM9423 CM0739 CM0938 05M 53M 36M CM9943 CM0440 CM0834 CM9921 76M 93M 56M 29M CM064 CM103 CM061 20M 79M 11M 73M CM0868 CM0234 CM1094 CM1050 64M 15M 324M CM940 CM061 CM013 CM061 CM060 28M 47M 48M 11M 56M CM0856 CM9413 CM0414 CM0017 770M 864M 468M 028M CM114 CM094 CM980 29M 
21M 79M 173M 05M 036M CM1056 CM9612 CM0308 CM0979 649M 721M 172M CM010 CM109 CM990 CM070 874M 172M 861M 366M CM1184 CM000 CM022 771M 335M 351M 002M CM980 CM070 CM020 051M 023M 385M CM054 CM099 HM080 CM030 888M 165M 124M 709M CM011 CM062 CM023 CM094 222M 394M 226M 441M CM100 CM012 CM074 092M 937M 459M CM116 CM104 CM012 CM096 387M 013M Cluster 14 CM000 CM087 CM014 CM082 500M 287M 314M CM055 CM096 CM050 CM992 232M 009M 462M CM104 CM061 CM031 701M 569M 384M 206M 107M CM010 CM992 CM098 297M 898M 356M 795M CM071 CM023 CM097 CM100 CM085 254M 600M 720M 936M CM117 CM115 CM099 CM043 CM076 319M 437M 084M 530M 096M CM992 CM085 CM970 512M 844M 652M 953M 723M 618M CM982 CM994 CM072 CM071 CM930 CM085 536M 540M 439M 492M CM074 CM920 CM119 CM022 CM074 608M 691M 825M 695M CM074 CM023 CM971 CM092 340M 888M 109M 038M CM100 CM032 CM024 CM960 078M 218M 387M CM992 CM093 CM070 CM115 CM981 577M 001M 521M 029M 290M CM091 CM109 CM041 CM071 CM050 797M 966M 774M 967M CM066 CM920 CM103 828M 964M 673M 097M 984M CM081 CM109 CM004 CM910 166M 845M 682M CM962 CM024 CM012 CM980 355M 243M 551M 850M CM024 CM095 954M 731M 518M 377M 321M CM900 CM012 CM022 CM983 437M 823M 061M 391M CM094 CM095 CM022 CM118 606M 814M 069M 117M CM043 CM032 CM100 CM010 CM980 973M 567M 749M 468M CM053 CM055 CM074 CM081 992M 613M 353M 153M CM962 CM100 CM083 CM910 680M 757M 339M 007M CM113 CM024 CM041 CM000 634M 159M 547M 303M CM05 CM98 CM03 178M 969M 437M 078M CM962 CM070 CM900 607M CM11 CM95 383M 014M 217M Sp CM055 CM032 CM003 CM930 9843M 3799M 1014M CM04 CM03 CM00 CM10 CM09 155M 746M 165M 067M CM101 CM095 CM930 CM082 2393M 2175M 2825M CM07 CM02 CM99 6578M 620M 3737M 1229M 3919M 0913M CM011 CM051 CM10 1333M 0338M 4623M 3518M CM06 CM04 CM96 CM06 CM99 CM09 0192M 2090M 0191M CM05 CM99 CM01 5029M7560M 3143M 0595M 0524M CM11 CM10 4698M 1774M 5159M 2089M 3456M CM05 CM94 CM00 1118M 0068M 2531M CM02 CM01 HM07 3445M 5187M 0006M 4254M 2465M CM01 CM09 CM10 CM08 CM09 1157M 0379M 6937M 1454M 1995M CM98 CM09 0633M CM92 CM07 CM03 4150M 5585M 
CM91 CM93 CM94 HM09 3110M 2611M 8334M 1672M CM00 CM97 CM04 CM03 8969M 3163M 9870M CM06 CM94 CM09 4502M 3956M 6521M 6226M 2280M CM02 CM97 CM08 CM98 CM08 3424M 4166M 1293M 0107M 3656M CM10 CM11 CM05 CM97 2396M 1244M 6399M 2548M CM10 CM97 CM00 0359M 8064M 0186M 4439M Cluster 16 CM02 CM10 CM00 CM10 CM11 0153M 0388M 0360M CM99 CM11 CM10 CM08 0282M 1497M 4786M 0008M 0430M CM08 CM11 CM00 CM04 CM11 3783M 2470M 4927M 0955M 3197M CM99 CM09 CM98 CM07 7184M 1269M 7253M CM05 CM08 CM99 0001M 3305M 0101M 1053M 0794M CM09 CM00 CM11 CM00 CM02 CM99 7504M 0167M 2119M 3956M CM96 CM11 CM96 2550M 3061M 4765M 9952M 0023M CM98 CM06 CM99 CM95 CM96 0730M 7870M 3202M 0631M CM04 CM01 CM99 CM91 CM01 CM89 1704M 1135M 2526M CM01 CM95 CM11 CM98 10538 0772M 7947M 1932M 2633M CM04 CM11 CM06 CM07 0714M 3999M 1417M 1101M 0422M M CM97 CM11 CM06 CM90 CM06 CM04 9007M 2632M 1490M 3357M 0147M 7770M CM09 CM03 CM00 2946M 8416M 2463M 3564M 8339M 6570M CM03 CM98 CM06 0419M 0245M 0348M 0137M M CM02 CM06 CM00 1746M 5021M 1988M CM09 CM05 CM10 CM03 CM10 1323M 8067M 2027M 0498M CM03 CM96 CM97 CM00 2794M 1023M 1400M 1229M 3328M 11053 M M CM95 CM02 CM11 CM10 CM07 CM07 CM05 7895M 0010M 0037M 1119M 5192M CM06 CM11 CM07 CM08 CM11 CM00 3224M 2920M 5004M 6479M MM CM11 CM00 CM99 CM96 0925M 1005M 0939M 9474M CM07 CM94 CM01 0407M 0857M 1110M 1978 6944M M M CM10 CM96 CM10 4729M 6102M 4395M 4052M CM96 CM11 CM97 CM09CM99 0559 4457 1053 0074 2259 MMM CM93 CM08 CM10 CM05 CM96 0055M 2463M 2232M 7922 3159 1647 CM07 CM97 CM06 CM09 CM01 1912 5527 0879 CM11 CM06 CM03 CM07 CM95 5157 5029 1706 0322 1019 CM99 CM94 CM05 CM94 3570 3044 7187 6257 1029 5354 MM CM10 CM06 CM07 CM00 3186 0838 7090 CM98 CM99 CM11 CM90 CM11 2474 3109 7298 0156 MMM M CM92 CM07 CM0 1197 3746 7927 7155 1405 MMM CM01 CM11 CM01 1494 1710 7401 4368 4M 2M CM9 CM0 CM1 CM0 CM9 4364 1068 5769 4542 M CM04 CM11 CM05 CM09 CM07 CM08 6238 8874 2865 0931 4506 0M HM9 CM0 CM1 CM9 1177 0886 1060 5039 2333 MMM CM06 CM08 CM95 0191 13179M 5M 1M 8M 3M CM1 CM9 CM0 7440 0555 3254 1056 
8343 8750 0875 1193 M MM Sp CM02 CM03 CM06 CM11 0169 4099 7082 7M 5M 4M 8M 2M 9M CM9 CM0 CM9 CM0 9124 9468 7193 9837 CM11 CM07 CM97 CM10 0251 7146 9653 9351 5M 9M 0M 6M CM9 CM1 CM0 CM0 CM9 0164 3052 1503 0893 8122 9M 7MMMMM CM92 CM01 CM1 CM0 1000 7159 8138 1512 6096 9M 4M 5M 3M CM1HM9 CM0 0241 3055 5651 8M 2M 4M 5M 3M CM0 CM1 6294 5032 8150 6532 1M 0M 4M 5M CM0 CM0 8039 3458 7069 4113 1711 7M 4M 9M 0M CM9 CM0 CM9 CM0 CM9 8687 8358 9404 7109 6309 5021 0M 9M 5M 3M 9M 7M 2M CM9 CM0 CM9 CM0 9461 1522 0006 4326 7185 7114 3M 2M 7M 5M 2M CM0 CM1 CM9 CM0 CM0 0984 7075 9863 0327 0M 3M 5M 8M 7M CM0 CM9 CM0 6254 1026 6317 3472 6042 6M 7M 5M 1M CM9 CM0 CM9 2525 8259 8163 4466 8148 9311 1M 7M 5M 7M CM0 CM0 CM1 CM9 8188 2002 9845 3M 5M 6M 4M 1M CM9 CM0 CM1 CM0 4168 9957 1805 1292 5133 9286 0M 9M 2M 3M 6M 0M CM0 CM1 CM0 CM0 CM9 CM9 9038 8123 9945 6771 4456 5107 8M 7M 5M 3M 2M CM1 CM0 CM9 0793 1679 5640 1522 7M 2M 0M 8M 4M CM9 CM0 CM9 9210 6102 7M 0M 2M 6M CM0 CM9 CM0 9416 0326 5218 9273 6296 9964 6M 2M 5M 2M 9M CM0 CM0 CM1 CM9 CM0 CM1 4182 0049 6253 8492 0303 5052 3M 5M 9M 4M 0M CM1 CM1 CM9 CM1 2241 6122 1170 9351 8M 5M 5M CM0 CM0 CM1 CM0 HM0 4377 9321 6007 7069 2M8M 9M CM0 CM1 CM0 9899 0058 7296 1275 9124 3125 9M 2M 3M 3M 3M CM0 CM0 CM0 0138 0004 6041 0554 5188 0241 4M 8M CM1 CM0 CM0 CM0 9551 3080 0352 0036 0M 9M CM1 CM0 CM9 CM1 1781 7310 7012 0615 8679 9254 3M 7M 5M 9M 0M CM9 CM1 CM9 CM0 CM0 1631 1097 0073 8632 1325 9M 4M1M 8M 0M 4M 4M CM0 CM9 CM1 1170 5021 7003 7446 4M 3M 5M CM0 CM9 CM0 9315 5215 5021 1557 5M 5M 9M CM0 CM1 CM0 CM0 2156 9606 8382 4361 0542 4M 8M 6M 4M CM0 CM1 CM0 CM1 CM0 3083 1194 2M 1M 4M CM1 CM0 CM0 CM1 5827 7032 1618 8133 9107 4M 3M 8M 1M 9M 1M CM0 CM1 CM0 CM9 7083 3374 7082 5316 8457 3051 8202 7M 6M 4M 4M CM0 CM9 CM0 CM9 8116 3058 0792 0164 8M 7M 3M CM0 7653 3081 90M6M 9M 4M CM0 8810 1885 9363 3M 1M 3M CM1 6655 0756 3168 0084 5M 2M 1M 8M CM0 0792 7M 0946 4580 7M 3M 9124 3477 3M 1M 5M CM0 2536 0079 4M CM0 4385 0792 4M 8M 7168 3069 94M 4M CM0 CM0 8094 7159 4M 
5M CM9 3166 0M 8M CM9 9342 8343 5M 4M 6M CM9 3464 5773 CM0 9380 1M 3M CM0 4422 7M 4M CM0 9100 CM9 5060 0M 0M CM0 CM0 0054 8M 4M CM1 CM9 2M 8305 0M CM0 3M 5316 4M 8161 4M 2M CM0 CM1 4279 9M CM9 4M 0982 CM0 1M CM9 1175 7M CM0 8119 0M CM0 1M 8155 4M CM9 8M 7163 2M CM0 CM0 0M 9314 5M CM9 CM9 4M 7076 2M CM0 CM0 5M 4M 1720 CM0 7M 0512 5M CM0 3471 CM0 4M 1130 6M 4M CM0 6061 9M CM0 2022 8176 1M CM9 CM0 9102 8834 8M 7041 8187 7M CM0 3163 CM0 7079 0419 CM0 1522 4M CM1 CM0 5024 7456 8M 4249 CM0 0M CM0 4093 CM0 9849 5M CM1 9335 8M 9003 CM9 6388 CM0 8143 CM0 4M 2M 6094 7163 8772 3M CM0 8535 CM1 9035 5M 5166 6M CM1 5833 0465 CM1 8M 3M 8331 CM0 1219 CM1 4M 9M 4487 9653 CM9 2M 1622 6M 7M 7074 CM0 5007 16M 7M 8M 3254 CM0 CM9 8717 8M CM0 0033 7072 5M CM9 9M 9573 CM0 3M CM0 0005 CM9 9M 6192 CM1 4M 1M CM0 CM0 2029 2M 0684 1M CM0 3427 7015 2M 0M 4267 6M 4168 0M CM0 7437 3M CM0 9895 9M CM8 7633 8M CM1 0953 CM9 6M CM0 7042 7M 1M 3M 0049 0M 7107 CM0 CM9 6M 8M 5602 6M 9351 CM0 CM0 1M 1245 3161 CM9 3M 0012 CM0 CM0 6M 5017 CM9 7598 7M CM0 0M CM0 0111 5606 8M 8M 8293 9M 7616 2M 8155 1027 8318 CM0 3M 1377 3M 1M CM0 CM9 7M 4M 0M 3039 7073 7M 9042 CM0 2421 9M 7029 CM1 8M 5062 CM0 9397 7M CM0 6M 9565 5M 2231 3M 9538 7M CM0 9049 CM0 0781 1M CM1 1128 0M 8055 5M CM9 1566 1110 1175 CM0 7M 9332 2022 2M 8M 9236 0300 CM0 7000 9M CM0 9M 2435 0M 0773 CM0 9554 6M CM0 7015 4M CM9 0084 2M 6857 7M CM0 6040 1M CM0 7051 0539 9M CM0 6M 4M CM0 1551 5080 5M 3M CM1 2M 0043 6M 9M CM9 8165 CM0 4023 2044 8M 8049 2M 8M CM1 0033 4M 6M CM0 CM0 0755 8301 3M 2125 CM0 6511 3M 7M 3132 1352 CM0 9779 6M CM9 2597 8M 9417 CM1 CM0 4M 5M CM0 6520 0472 CM9 2M CM9 0M 1724 7056 CM1 6M CM1 3015 0855 2M CM9 CM1 CM1 CM9 3M 0283 CM0 0M 0181 CM1 7082 5M 8531 7M CM9 6045 2M CM0 9143 9283 CM9 9218 5M M CM0 5186 6M 6M CM1 6M 7M CM9 5112 4M CM9 2028 1M 0314 CM0 6677 4M CM1 3M 0247 8728 9M CM0 0104 5313 CM0 8M CM1 MMM 1095 9213 1014 7406 6309 CM0 CM0 1436 4504 9102 1830 CM0 7M CM9 4300 2006 5M CM0 1046 CM0 CM9 5148 CM9 9027 MMM 8826 6M 7377 8M 
CM1 5085 1269 9M 8005 M 0001 0M M 0924 1911 CM0 3565 5164 7M M CM07 6M 0118 M CM99 5146 CM0 0738 4M CM01 4M CM94 7026 3401 4137 6M CM0 CM09 5029 7908 0294 3608 CM98 9237 9392 3M 1M MM CM04 CM0 3335 0974 2M 8M CM99 1224 CM1 6029 0224 8M M CM96 1382 CM0 2059 4M 4349 CM0 6662 CM11 7M 36M CM1 0311 CM04 6087 CM97 2146 4234 4M CM10 3M CM05 CM1 6391 7938 CM10 5063 CM0 2529 8012 CM06 9958 CM9 6106 CM07 5030 CM0 0085 CM10 3M CM08 0044 5M CM1 CM05 5652 9M CM0 4083 CM06 CM1 2068 5518 1M CM99 CM0 3203 2961 CM99 3158 3M CM11 4575 4692 9M CM00 CM9 2203 4003 5M CM99 5472 0348 9M CM97 1001 CM9 CM05 1220 6M 7M CM0 1054 CM00 CM97 CM9 2842 CM98 4068 2M 2M 3926 5M CM07 4622 1031 M CM10 3859 CM08 CM08 CM10 2240 6061M M CM06 CM02 3610 3126 2932M CM97 CM09 8438 8275M CM02 CM96 7352 3176M M CM11 CM01 8367 0212M CM94 CM08 3113 CM95 0876M CM07 2867 2874M CM01 6912 4362M 6756 0M CM05 5981M MM 3558M 5924 CM08 1128M CM06 M 4989M CM07 CM97 M CM08 CM08 5208 1106M 0268 0634 M CM02 4047M 1691 CM09 CM95 CM96 1994M 1191 CM07 0464M CM94 CM07 M CM01 4129 CM01 CM06 5030M CM08 1532 CM97 3203M 3641M 2387 the MaPSy results (see Section 2.4.16 and Figure A.11). 
CM99 0118M 1090 CM10 1329M 1919 CM06 MM CM06 M 1548M CM11 0215M M CM98 CM07 9649M CM92 CM99 0753M CM10 M CM09 CM05 CM11 M CM03 CM04 0845M CM00 CM07 CM07 M CM05 CM03 5394M CM07 0191M CM03 5333M CM00 8246M CM07 7532M CM92 4288M CM04 5389M CM07 2551M CM02 2650M CM11 9159M CM99 2893M 0696M HM03 0342M CM01 CM09 1666M CM99 2875M CM98 M CM95 6152M 6782M CM99 CM11 1644M 2839M 8125M 4083M CM04 CM96 5372M CM01 CM01 1030M 3467M 5326M CM10 1108M CM02 0215M CM97 2095M CM04 0117M CM11 0448M CM92 7547M CM08 CM10 0635M CM06 CM07 0292M CM02 CM99 1687M 0946M 1334M CM98 0493M CM95 4563M 1164M CM00 0153M CM08 0355M CM11 1555M CM92 3286M 8102M CM09 CM02 1421M CM04 CM04 0391M 3116M CM95 1724M CM07 3446M 8773M CM95 3433M 4519M CM96 0574M 0379M CM98 0419M 7091M CM06 0667M CM00 CM10 1872M 0335M CM98 4688M CM08 1244M 8992M 4581M CM03 1979M CM03 0827M CM02 3833M CM06 2869M CM05 CM02 CM10 4690M CM97 CM09 0801M CM07 3373M CM95 2108M CM06 3870M CM98 2995M CM09 CM09 3600M CM95 7230M CM07 10576 0404M CM07 CM03 0023M CM08 1665M 0214M 4656M 5807M CM92 CM04 1257M CM96 4202M CM04 0338M 0707M 4953M CM01 4512M 1812M CM11 2762M CM07 4279M 8654M 0280M CM04 CM08 2742M 0431M CM97 382M CM04 CM11 3795M 1353M CM10 CM06 850M 3165M CM98 0029M CM11 341M 0357M CM08 7338M 7221M CM07 161M 2335M 534M CM11 0701M 549M 0527M CM03 131M 2831M CM98 317M 8063M CM04 528M 293M CM10 243M CM00 943M 2218M CM96 243M CM023 4880M 717M CM110 0269M CM032 CM99 0845M 077M 052M CM983 CM05 CM95 150M CM103 0872M CM09 551M 150M CM070 1084M CM11 567M CM985 0389M CM04 216M CM073 9744M CM00 976M 1593M CM09 947M 3354M 842M 986M CM073 3365M 191M CM103 1202M CM010 CM09 179M CM02 058M CM100 2024M 0312M CM00 CM091 0799M 0481M CM09 008M CM074 1518M CM04 681M CM043 0667M 492M CM115 2142M CM07 244M CM100 CM052 CM03 1108M CM96 CM04 CM961 0913M CM99 CM06 CM072 0002M CM03 705M CM094 0545M CM11 CM97 098M CM015 CM01 183M CM012 CM08 164M CM107 CM10 440M CM004 5520M CM97 351M CM104 CM105 8849M 734M CM066 CM011 4952M 107M 538M CM060 CM065 064M CM062 1692M 
... trans-acting factors recognizing exonic-binding motifs (for example, cluster 15) are more likely to be tissue and cell type dependent. Each mutation in the 5K panel represents a variant reported in a patient and/or family in the last four decades. We have established a large-scale collection of the effect of exonic mutations on splicing and created a public web server that enables visualization of ...

[Figure 2.10 panels: center dendrogram with leaves labeled by HGMD accession number; surrounding panels for Cluster 6, Cluster 7 and Cluster 9; x-axis: fraction (t0, A, B/C, Sp); y-axis: allele ratio; legend: Significant in vitro, Not significant]

Figure 2.10: Clustering of allelic ratios provides ESM mechanistic insights. The result of hierarchical clustering of allelic ratios in spliceosomal fractions is shown (center plot) with representative clusters shown in different colors. The individual panels surrounding the center plot show the allelic ratios of each mutant-wild type pair in the different fractions (t0, A, B/C and spliced (sp)) for the corresponding clusters. Each pair is colored according to its ESM classification (dark red for significance in both assays, orange for significance in vitro and gray for no significance). The complete profile of all clusters can be found in Figure A.9c. Pie charts in individual panels show the proportion of ESM classifications. Spliceosome stages are depicted at the right of the individual panels. Major disruptions in assembly transitions are indicated with red arrows, ...

2.3 Discussion

The need for better characterization of sequence variation is ever more urgent with the increasing number of rare variants being discovered from many large-scale sequencing efforts.53,170 Previous studies tested the effect of random k-mers in enhancing or silencing splicing.10,34,158 We present the results of a survey of the effects of 4,964 point mutations on splicing using MaPSy, a new parallel splicing system. We further characterized the splicing aberrations by their stage of disruption in spliceosome assembly. We found that ∼10% (513/4,964) of exonic disease-associated alleles disrupt splicing in vivo and in vitro. In contrast, only 3% (8/228) of common SNPs altered splicing in both assays. It is interesting that in diseases that are more frequently caused by splicing mutations, more exonic mutations were also found to disrupt splicing.
This likely reflects disease processes that occur through loss-of-function mechanisms. We found that exonic features have a large role in forming ESMs. We also identified 24 exonic RBP motifs that are associated with increased splicing and 38 RBP motifs that are associated with decreased splicing. MaPSy has certain limitations; in particular, only mutations in exons of fewer than 100 nt in length can be evaluated, owing to the current limitations of oligonucleotide synthesis technology. Given that the average length of internal exons is around 130 nt, half of all human exons are not eligible for splicing characterization using MaPSy. We also cannot rule out the presence of other influences, for example, flanking splice sites, different transcription efficiencies and tissue-specific effects, none of which are preserved in MaPSy. It is intriguing that some features previously shown to be predictors for SSMs but not present in MaPSy (for example, flanking intron length and number of introns) were also identified as predictors for ESMs (Figure 2.4b).8 These findings, together with the high concordance rate with splicing phenotypes in corresponding patient tissue samples, suggest that, despite these limitations, MaPSy contains most of the critical elements required for splicing in native conditions and thus is a powerful tool for characterization of the sequence variation underlying splicing aberrations. In conclusion, MaPSy facilitates large-scale identification and characterization of ESMs. The system effectively translates to 5K implementations of basic mutational approaches and can be further adapted to other mutation panels, thus accelerating efforts to characterize all sequence variation.

2.4 Methods

2.4.1 Library design and synthesis

Nonsynonymous mutations classified as disease causing (DM) were downloaded from the Human Gene Mutation Database156 (HGMD; accessed in May 2012) and mapped to the GRCh37/hg19 human reference sequence.
Mutations were mapped to internal exons that were ≤100 nt in length, and the exons that fit into 170-nt genomic windows that included 15 nt of the downstream intron and ≥55 nt of the upstream intron (n = 4,964) were selected. The mutant and wild-type versions of the 170-mer genomic fragments were flanked by 15-mer common primer sequences and designed into a 200-mer oligonucleotide library. Solid-phase oligonucleotide synthesis was performed by Agilent Technologies and used to generate in vivo and in vitro reporters.

170 D. G. MacArthur et al. Nature, 508: 469–76, 2014.
10 A. B. Rosenberg et al. Cell, 163: 698–711, 2015.
34 Y. Wang et al. Nat Struct Mol Biol, 19: 1044–52, 2012.

2.4.2 MaPSy in vivo assays

The in vivo splicing reporter includes a cytomegalovirus (CMV) promoter, an adenovirus (pHMS81)171 exon with part of its downstream intron, the 200-mer oligonucleotide library, exon 16 of ACTN1 with part of intron 15 and the bGH poly(A) signal sequence (Figure A.12). Common sequences (everything except the 200-mer library) were concatenated by overlapping PCR and cloned with TOPO TA (Invitrogen) to generate a 5′ common fragment and a 3′ common fragment. Each cloned fragment was PCR amplified, and equimolar amounts of the common fragments and the oligonucleotide library were concatenated in a single PCR reaction and purified and size selected twice with a 0.4:1 ratio of Agencourt AMPure beads (Beckman Coulter) to PCR reaction. The resulting in vivo reporters were transfected into human embryonic kidney HEK293T cells (ATCC) in three cell culture replicates using Lipofectamine 2000 (Invitrogen) in a six-well plate. RNA was extracted 24 h after transfection using TRIzol (Thermo Fisher) and DNase treated. Random 9-mers were used to generate cDNA with SuperScript III Reverse Transcriptase (Invitrogen) followed by PCR (GoTaq, Promega). All PCR reactions were kept to the lowest possible number of cycles (15–20 cycles).
Input reporters and spliced species were sequenced on an Illumina HiSeq 2500 (100-bp paired-end reads). Cultured cells were periodically tested for mycoplasma contamination.

2.4.3 MaPSy in vitro assays

The in vitro splicing reporter has a design similar to that of the in vivo reporters, but it lacks the ACTN1 exon and uses a T7 promoter (Figure A.12). In vitro reporters were obtained via transcription in vitro using T7 RNA polymerase (Stratagene), internally labeled with [α-32P]UTP (PerkinElmer) and capped with G(5′)ppp(5′)G (New England BioLabs). The resulting RNA was gel purified and used for splicing reactions containing 40% HeLa-S3 (NCCC) nuclear extract for 80 min at 30 °C (the salt conditions for splicing reactions have been previously described).172 Pools of input and spliced RNA were converted to cDNA (SuperScript III, Invitrogen) and used to generate an Illumina library (NEBNext kit, New England BioLabs) for deep sequencing. For glycerol-gradient fractionation, 120 µl of the splicing reaction was treated with 0.5 mg/ml heparin for 5 min at 30 °C and then loaded onto 3.75 ml of a 10–30% glycerol gradient and centrifuged at 175,000g using a SW55 rotor (Beckman Coulter) at 4 °C for 2.5 h. After centrifugation, the gradient was fractionated from top to bottom in 16 equal volumes, and the fractions were analyzed on 2.1% native agarose (UltraPure Low-Melting-Point Agarose, Invitrogen) or 8% denaturing polyacrylamide (29:1 cross-linking) gels. The in vitro MaPSy assays were done in two experimental replicates. Gels were visualized with a Typhoon PhosphorImager (GE Healthcare). Unspliced RNAs that were bound to different complexes were extracted from relevant fractions, converted to cDNA (SuperScript III, Invitrogen), reattached to the T7 promoter sequence by PCR, gel purified and used as template for subsequent in vitro transcription to make pre-mRNA substrates for the next round of SELEX (Figure A.9a). RNA pools recovered from each purification step were converted to cDNA, PCR amplified and analyzed by deep sequencing (Illumina HiSeq 2500, 100-bp paired-end reads).

171 O. Gozani, J. G. Patton, and R. Reed. EMBO J, 13: 3356–67, 1994.
172 V. Reichert and M. J. Moore. Nucleic Acids Res, 28: 416–23, 2000.

2.4.4 Library species alignment and counting

We generated ‘reference genomes’ for both the in vivo and in vitro libraries, with each pair of wild-type (reference) and mutant species treated as its own ‘chromosome’. Paired-end reads were mapped using the STAR aligner.173 For input alignment, we did not allow for split reads, and only uniquely mapped reads with a maximum of ten mismatches were allowed. We used the same settings for output alignment as we did for input alignment, with the exception that we allowed for split reads. Because there may be more than one mutation per exon in the 5K panel, the requirement for calling a species as wild type can be more stringent than the requirement for calling each of the mutants, given that calling the wild-type species would require the read pair to span all mutation positions in the same exon, whereas calling the mutant species would only require the read pair to span the respective mutant position. Thus, we also required all mapped reads to span all mutation positions in order to ensure balanced detection of wild-type and mutant species.

2.4.5 Allelic imbalance analyses

The allelic ratios for MaPSy analyses were calculated as

$$\log_2\left(\frac{m_o/m_i}{w_o/w_i}\right) \tag{2.1}$$

where $m_o$ is the count of mutant spliced species, $m_i$ is the count of mutant input, $w_o$ is the count of wild-type spliced species and $w_i$ is the count of wild-type input. To assess statistical significance, a two-sided Fisher’s exact test was used and the resulting P values were adjusted to account for multiple comparisons using the p.adjust function in R (method = ‘hochberg’). A significance level of <0.05 and an allelic ratio of ≥1.5-fold were used to call ESMs.
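The calculation in Section 2.4.5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study (which was run in R): the function names are hypothetical, the 2×2 table is assumed to hold spliced and input counts for the mutant and wild-type species, and the ‘≥1.5-fold’ threshold is read here as a fold change in either direction.

```python
from math import comb, log2


def allelic_ratio(mo, mi, wo, wi):
    """Equation 2.1: log2 of (mutant spliced/input) over (wild-type spliced/input)."""
    return log2((mo / mi) / (wo / wi))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


def hochberg_adjust(p_values):
    """Hochberg step-up adjustment (the definition behind method = 'hochberg')."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * len(p_values)
    running_min = 1.0
    for rank, idx in enumerate(order, start=1):
        # The k-th largest P value is multiplied by k, then a running minimum is kept.
        running_min = min(running_min, p_values[idx] * rank)
        adjusted[idx] = running_min
    return adjusted


def call_esms(counts, alpha=0.05, min_fold=1.5):
    """counts: list of (mo, mi, wo, wi) tuples; returns one True/False call per pair."""
    ratios = [allelic_ratio(*c) for c in counts]
    # Assumed table layout: rows = mutant / wild type, columns = spliced / input.
    raw = [fisher_exact_two_sided(mo, mi, wo, wi) for mo, mi, wo, wi in counts]
    adj = hochberg_adjust(raw)
    return [p < alpha and abs(r) >= log2(min_fold) for p, r in zip(adj, ratios)]
```

For example, `call_esms([(50, 500, 500, 500), (100, 100, 100, 100)])` flags the first pair, whose mutant allele is strongly depleted from the spliced pool, and leaves the balanced second pair uncalled.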
2.4.6 Splicing efficiency analyses

To compare splicing performance between individual species, the following splicing index was calculated for each species

$$\log_2\left(\frac{spl_i \big/ \sum_{j=1}^{n} spl_j}{inp_i \big/ \sum_{j=1}^{n} inp_j}\right) \tag{2.2}$$

where $spl_i$ is the count for spliced output for species $i$, $inp_i$ is the count for the input for species $i$ and $n$ is the number of species in the library pool.

2.4.7 MaPSy validation in patient samples

Tissue samples (n = 13) were obtained from the University of Utah School of Medicine (Salt Lake City, UT), the Washington University School of Medicine Alzheimer’s Disease Research Center (St. Louis, MO), Ohio State University (Columbus, OH), the National Institute of Child Health and Human Development (Bethesda, MD) and the Coriell Repository. Ethical approvals were granted by local institutional review boards, and informed consent was obtained from all participants. RNA was extracted using the PAXgene kit (Qiagen) for whole-blood samples, the RNeasy kit (Qiagen) for postmortem brain samples and TRIzol (Life Technologies) for all other samples, using the respective manufacturer’s protocols. SuperScript III Reverse Transcriptase (Invitrogen) was used to generate cDNA with random 9-mers, followed by PCR (GoTaq, Promega). PCR primers were designed to map to exons flanking the mutant exon. In the case of individuals with nonsense mutations for whom we had lymphoblastoid cell lines or fibroblasts available, the cells were also treated with 10 µg/ml cycloheximide for 3 h before RNA extraction.

173 A. Dobin et al. Bioinformatics, 29: 15–21, 2013.

2.4.8 MaPSy validation in ENCODE data

We downloaded 46 whole-cell RNA-seq long poly(A)+ ENCODE data sets for 19 different cell lines (for accession numbers, see Table A.4). Reads were mapped to hg19 using the STAR173 aligner with default parameters.
Each STAR output generates a splice-junction file, which was used to calculate percentage usage at each splice junction as follows.

$$\%\ \text{usage at 3}'\ \text{ss} = \left(\frac{\#\ 3'\ \text{ss reads}}{\#\ \text{upstream 5}'\ \text{ss reads}}\right) \times 100\% \tag{2.3}$$

$$\%\ \text{usage at 5}'\ \text{ss} = \left(\frac{\#\ 5'\ \text{ss reads}}{\#\ \text{downstream 3}'\ \text{ss reads}}\right) \times 100\% \tag{2.4}$$

Results from multiple runs of the same cell lines were collapsed. The hg19 positions of the 3′ splice sites (ss), 5′ splice sites, upstream 5′ splice sites and downstream 3′ splice sites for all wild-type exons in the 5K panel were retrieved and were binned into four groups of increasing splicing performance in MaPSy. Average percentage usage at both splice sites was plotted in each bin and compared.

2.4.9 HGMD mutation analyses

Disease-causing splicing and coding-sequence mutations were selected from HGMD (n = 77,943). The mutations were classified as splicing, missense or nonsense mutations, and the numbers of all classes of mutation were determined for each gene. The total number of mutations was plotted against the total number of SSMs in a gene (Figure 2.3a). Weighted random sampling was then used to construct a 99.9% confidence interval that captures the expected number of SSMs given the total number of mutations within a gene. Using the proportion of total SSMs to total mutations in the HGMD as a weight for random sampling, the proportion of SSMs given the total number of mutations in each gene was simulated 1,000 times. Genes falling outside the simulated values represent genes that have more (above the confidence interval) or fewer (below the confidence interval) SSMs than expected (P < 0.01) based on the distribution of mutation types within the data set. Haploinsufficiency scores were obtained from published data.157 HGMD genes were binned as haploinsufficient genes (haploinsufficiency (HI) score = 1), moderately haploinsufficient genes (HI score = 0.7–1) and haplosufficient genes (HI score ≤0.7).

2.4.10 Random forest classification

We used the R implementation of random forest,160 a nonparametric ensemble learning method, to model the contribution of various genomic, sequence and functional features to the likelihood that an exonic mutation will have an impact on splicing. Each tree in the forest is constructed with a different bootstrap sample from the original data set, with approximately two-thirds of the bootstrap samples being used for construction of the kth tree and the remaining one-third (out-of-bag data) used for cross-validation. The results from all trees are then averaged to provide unbiased estimates of predicted values, error rates and measures of variable importance. Default parameters were used to build the random forest model, with the exception that the number of trees was specified as 1,000. As variable importance measures may vary depending on the parameters of the algorithm, and both the degree of correlation and the scale of the variables can influence them, we opted to use two different methods for feature selection and measures of importance. The first method created shuffled copies of all the features (shadow features) and trained a random forest classifier using the supplementary set while iteratively removing irrelevant features (those with z scores less than the maximum z score of the respective shadow features). This was done until all features were either confirmed or rejected, using the Boruta174 package in R. For the second method, we generated the null distribution of the variable importance measures by permuting the response variable so that the relationship between the response and predictor variables was destroyed. This was done with 1,000 runs of random forest, and the empirical P values for importance measures were calculated by counting the number of occurrences in which each importance measure in the original data was lower than or equal to the respective importance measure in the permuted data.
Features that were selected by both methods at a significance level of <0.05 were used for the final random forest model.

2.4.11 Random forest predictor variables

Splice-site strength was computed using Perl scripts downloaded from the MaxEntScan126 package, which uses a maximum-entropy approach on large data sets of splice sites in humans while taking into account both adjacent and nonadjacent dependencies. The splice-site models assign log-odds ratios to 9-bp sequences (−3 to +6 positions) for the 5′ splice-site scores and 23-bp sequences (−20 to +3 positions) for the 3′ splice-site scores. ‘SS vars’ is the sum of the differences in wild type-mutant splice-site scores for all SSMs in the HGMD156 and ExAC53 datasets at each exon. ESEs and ESSs were downloaded from published data.1,33,158 ‘ESRseq diff’ was computed as the wild type-mutant difference in hexamer splicing scores.158 Haploinsufficiency scores were obtained from a previous study that developed a haploinsufficiency prediction model using a large deletion data set (Wellcome Trust Consortium Controls).157 PPT scores were computed as previously described.175 ‘Exon POS in gene’ was calculated as exon number divided by the total number of exons in the gene (values between 0 and 1). The free energy estimate (∆G) was computed using ViennaRNA package159 version 1.8.5, using default settings with the --d2 and --noLP options.

174 Miron B. Kursa, Aleksander Jankowski, and Witold R. Rudnicki. Fundam. Inf., 101: 271–285, 2010.
126 G. Yeo and C. B. Burge. J Comput Biol, 11: 377–94, 2004.
33 W. G. Fairbrother et al. Nucleic Acids Res, 32: W187–90, 2004.
175 C. L. Lin et al. Genome Res, 26: 12–23, 2016.

2.4.12 Motif analyses

RBP, ESE and ESS motifs were obtained from published sources.158,163 ESE and ESS hexamers were mapped and counted in each of the mutant and wild-type exons from the 5K panel.
The contribution of known splicing elements to MaPSy splicing was evaluated by plotting the mutant-wild type difference in ESE and ESS counts against the mutant/wild type splicing ratio in sliding windows (size = 1,000, step = 1). RBP motifs were mapped to the exons and upstream introns of the 5K panel using the matchPWM function from the Bioconductor package176 with default settings (minimum score = 0.8). We computed the maximum matchPWM score percentiles of all spanning n-mers at the mutation positions that overlapped the exonic motif maps and calculated the mutant-wild type difference for each mutation position (n = length of motif). The in vitro and in vivo splicing profiles of exonic motifs were generated by plotting the mean of the maximum score differences in rolling windows of increasing mutant allele inclusion of spliced species (i.e., mutant/wild type ratio, window size = 1,000, step = 1). Intronic motif maps of wild-type species (n = 2,086) were used to calculate intronic motif density for each RBP (Figure A.8a). Wild-type splicing profiles of intronic motifs were generated by plotting the mean motif density in rolling windows of increasing splicing efficiency (window size = 200, step = 1). In vitro and in vivo profiles were combined and fitted using the smooth.spline function in R.177 The Bayesian information criterion was used to determine the optimal number of clusters with the mclust function from the mclust R package.178 Profiles were clustered on the basis of the coefficient values from spline fitting using the hclust function in R (Figure 2.8 and Figure A.8b).

2.4.13 RBP-binding motif validation

We ordered small interfering RNA (siRNA) for human PTBP1 from Thermo Scientific (s11436) and siRNA for human SRSF1 from Dharmacon as previously described.164 For control siRNA, AllStar negative-control siRNA (Qiagen) was used. Minigenes were synthesized by Synbio Technologies. HeLa cells (ATCC) were plated 24 h before transfection.
For PTBP1 knockdown, 7.5 µl of Lipofectamine RNAiMAX (Invitrogen) was used to transfect siRNA for PTBP1 (20 nM, final concentration) in a six-well plate for 48 h according to the manufacturer’s protocol (Invitrogen). This was followed by a second transfection with 3.75 µl of Lipofectamine 3000 (Invitrogen) and the same siRNA in Opti-MEM (Life Technologies) and 500 ng of DNA in 100 µl of pure DMEM (Invitrogen). RNA was extracted 24 h later with TRIzol according to the manufacturer’s protocol (Ambion), followed by DNase treatment and RT-PCR as described above. For SRSF1 knockdown, 1.5 µl of Lipofectamine 3000 (Invitrogen) was used to transfect siRNA for SRSF1 (20 nM final concentration) in Opti-MEM (Life Technologies) and 500 ng of DNA in 100 µl of pure DMEM (Invitrogen). After 72 h, RNA was isolated, followed by DNase treatment and RT-PCR. Knockdown efficiencies were evaluated with immunoblotting using anti-SRSF1 (sc-33652, Santa Cruz), anti-PTBP1 (32-4800, Thermo Fisher) and anti-GAPDH (sc-47724 and FL-335, Santa Cruz). All experiments were done in two cell culture replicates that had been periodically tested for mycoplasma contamination.

176 W. W. Wasserman and A. Sandelin. Nat Rev Genet, 5: 276–87, 2004.
177 John M. Chambers and Trevor Hastie. Statistical models in S. Wadsworth & Brooks/Cole computer science series. Pacific Grove, Calif.: Wadsworth & Brooks/Cole Advanced Books & Software, 1992. xv, 608 p.
178 C. Fraley and A. E. Raftery. Journal of the American Statistical Association, 97: 611–631, 2002.

2.4.14 Functional SELEX analysis

The allele ratios were calculated as follows

$$\log_2\left(\frac{m_{ie}/m_{ii}}{m_{je}/m_{ji}}\right) \tag{2.5}$$

where $m_{ie}$ is the minor allele count in the enriched pool, $m_{ii}$ is the minor allele count in input, $m_{je}$ is the major allele count in the enriched pool and $m_{ji}$ is the major allele count in input.
The minor allele was the allele that spliced less efficiently than the respective major allele; these alleles differed by one nucleotide. All analyses were performed in R. Hierarchical clustering was performed on all mutant-wild type pairs that were recovered in all purified fractions (n = 4,873) using the hclust function with the complete linkage method and Euclidean distances. Bayesian information criterion plots were generated for k = 1 to k = 50 using the mclust package to estimate the optimal number of clusters. The resulting clusters were visualized, and the tree was cut using the cutree function (k = 32). To determine the significance of the observation that mutations in the same exons were more often clustered together, we permuted the exon assignment in the 32 clusters 10,000 times and obtained the distribution of the permuted data. The P value was obtained as the number of times the statistic computed on the permuted data equaled or exceeded that of the original data, divided by the number of permutations. To examine whether certain genomic features may act as ‘signatures’ of the identified clusters, we plotted the distribution of each feature in the different clusters, and significance was determined by the mean difference in two-sided t-statistics on the actual data and on data permuted 10,000 times using the flip function followed by flip.adjust (method = ‘hochberg’) to account for multiple testing.179

2.4.15 Data availability

The data generated from this study (raw allelic counts and allelic ratios from each mutant-wild type pair from MaPSy experiments with the corresponding genomic positions, variant allele and HGMD accession numbers) are available at http://fairbrother.biomed.brown.edu/ESM_browser/.

2.4.16 URLs

Visualization of MaPSy results, http://fairbrother.biomed.brown.edu/ESM_browser/.
2.5 Technical aspects of MaPSy – challenges and limitations

In this section, we define the limitations of MaPSy and focus on our read counting/filtering strategy for correcting RNA-sequencing biases that we encountered during the development of the assay.

2.5.1 Limitations of MaPSy

The fact that MaPSy does not rely on any barcoding strategy introduces a critical limitation in the detection of alternative events. Observations/mutations that lie directly downstream of a cryptic 3′ ss or directly upstream of a cryptic 5′ ss would not be detected in the assay. In other words, only a subset of alternative events would be reported by our method (Figure 2.11). Due to this limitation, we decided to analyze only the events that could be observed for all species (multiple mutants and the wild type) in the exon of interest. The rest of the alternative events were removed from the downstream analysis by our read filtering algorithm, which we describe in the next section.

[Figure 2.11 panels: (top) mutant position closer to the constitutive splice site, with the location of the cryptic 3′ ss marked; after splicing, we cannot see the mutation in the final product and therefore cannot assign that isoform to a specific species. (bottom) mutant position closer to the cryptic 3′ ss; after splicing, we can see the mutation in the final product and therefore assign that isoform to a specific species.]

Figure 2.11: MaPSy is unable to detect all possible alternative events. This limitation can be partially resolved by the use of our improved read counting strategy described in Section 2.5.2.

179 Fortunato Pesarin. Multivariate permutation tests: with applications in biostatistics. vol. 240. Wiley, Chichester, 2001.

2.5.2 Crucial step of MaPSy – read counting/filtering

We have established that the MaPSy approach, like any synthetic method, suffers from some limitations. We will now focus on the approach that we used in order to limit the influence of RNA-sequencing biases that emerge during the library preparation and sequencing steps.

RNA-sequencing overview

To perform an RNA-seq experiment, we first need to generate an RNA library from a biological sample of interest. It is not unusual for some filtering steps to be performed, for instance, selection based on size (to filter for the amplicons of interest). Then, a fragmentation step is performed to generate random RNA pieces. These RNA fragments are reverse transcribed into DNA, and another random fragmentation round may follow before they are sequenced. Random fragmentation is an important step for applications of RNA-seq in the characterization of alternative splicing (e.g., MaPSy), because it ensures that the proportion of read abundance is similar to the RNA abundance in the sample of interest. As a result, most RNA-seq pipelines assume uniformly random fragmentation during the downstream processing of the data. Unfortunately, this assumption is rarely correct.

Read counting strategy

In order to correct for uneven fragmentation and coverage of samples in MaPSy, we developed a conservative read counting strategy that allowed us to accurately compare mutant and wild type species in the assay (Figure 2.12).

[Figure 2.12 content: Definition: variable region, the smallest region that covers all mutation positions. Rule 1: all reads must cover the variable region. Rule 2: reads must align uniquely to wild type or one of the mutant alleles. Example calls for reads aligned against the wild type and mutant alleles 1–3: discarded (reason 1); inconclusive (reason 2); identifies allele 3.]

Figure 2.12: Read counting strategy used during the MaPSy protocol.

Briefly, because multiple variations may fall in the same exon in
the MaPSy assay, the requirement for calling a sequencing read as wild type can be more stringent than the requirement for calling each individual mutant. For the aligner to ‘call’ a sequencing read as wild type, the read must span all mutation positions in the same exon. However, calling the mutant species would only require the read to span the respective mutation position. The reason for this is that MaPSy’s initial run did not include combinations of mutations or barcodes. Accordingly, the following pipeline was used to ensure balanced detection of both species in the MaPSy assay:

1. We required all reads to cover the variable region of the exon of interest.
2. We required that each read be uniquely assigned to the wild type or to one of the mutant alleles.

Figure 2.12 shows some examples of reads that would pass the filtering step and some possible scenarios that would cause a read to be discarded.

2.6 Critical assessment of genome interpretation (CAGI) challenge

Due to the limited access to large datasets, machine learning approaches to identify single nucleotide variants that cause splicing defects have struggled to achieve good performance. With the development of MaPSy, the splicing/machine learning field received an enormous training set that can be used to improve classification of variants. Moreover, the same dataset can be utilized to benchmark multiple tools/approaches that are already available to researchers. CAGI challenges are very similar to Kaggle competitions, where multiple teams of machine learning experts try to develop predictive tools/approaches for a certain task (a biological one in the case of CAGI). For the purpose of the MaPSy CAGI challenge, we created a supplementary test set in addition to the training set already described in Chapter 2. The following section describes the MaPSy CAGI challenge that was released in late November of 2017 and ran until February 12, 2018.
2.6.1 Summary

The Massively Parallel Splicing Assay (MaPSy) approach was used to screen 797 reported exonic disease mutations using a mini-gene system, assaying both in vivo via transfection in tissue culture and in vitro via incubation in cell nuclear extract. The challenge is to predict the degree to which a given variant causes changes in splicing.

2.6.2 Background

The Fairbrother lab developed a Massively Parallel Splicing Assay (MaPSy)[11] to screen for splicing defects a panel of 4,964 exonic disease mutations that are reported in the Human Gene Mutation Database (HGMD)[156] and not classified as splicing mutations. A library of sequence pairs was synthesized, consisting of the reported mutant sequences and their wild-type (WT) counterparts. This library was then incorporated into artificial genes to be used in the splicing assay in vivo and in vitro. For 10% of the mutations, MaPSy confirmed that they alter splicing both in vivo and in vitro. Such mutations are now classified as exonic splicing mutations (ESMs).[11]

2.6.3 Experiment

Nonsynonymous mutations classified as disease-causing (DM) were downloaded from HGMD. Mutations were mapped to internal exons ≤ 100 nucleotides in length and selected for those that fit into 170-nucleotide genomic windows, which include 15 nucleotides of downstream intronic sequence and ≥ 55 nucleotides of upstream intronic sequence (for a total of 4,964 variants). The mutant and wild-type versions of the 170-mer genomic fragments were flanked with 15-mer common primers and synthesized as a 200-mer oligo library. Additional missense and nonsense mutations (n = 797) were also mapped to exons of length > 100 nucleotides. Each such exon was "cut" in the following way:

1. The 5′ and 3′ splice site signals were preserved.
2. One or more portions of the middle of the exon were removed to decrease its size to ≤ 100 nucleotides, in order to meet the requirements of oligonucleotide synthesis at the time of the experiment.
One library of the assay was designed to evaluate the effects of the mutations on splicing in vivo via transfection in cells grown in tissue culture. The second library comprised RNA substrates designed to evaluate the mutations' effects on splicing in vitro via incubation in cell nuclear extract.

For the in vivo assay, the sequence pairs were incorporated into three-exon minigenes and transfected into HEK293 cells. After 24 hours, RNA was recovered, converted to cDNA, and made into Illumina libraries for parallel sequencing. The allelic ratios (WT to mutant) were determined by sequencing the input DNA and the output RNA (cDNA). Three replicates were performed. To identify allelic imbalances in splicing efficiency, sequencing reads were classified into the following categories:

• wild-type (WT) DNA input (a),
• WT spliced RNA species (b),
• mutant (MT) DNA input (c),
• MT spliced RNA species (d).

The term "spliced" refers to species that are spliced perfectly so as to include the reference-annotated exon. Any other spliced species (such as exon skipping, alternative 3′ or 5′ splice sites, or intron retention) are not considered in this measure. The read counts from input DNA were compared with the read counts of correctly spliced cDNA. Statistical tests were subsequently conducted to determine the significance of the allelic imbalances in splicing efficiency. These changes were interpreted as follows:

• input WT (a) / input mutant (c) = spliced WT (b) / spliced mutant (d) implies the mutation was neutral,
• input WT (a) / input mutant (c) < spliced WT (b) / spliced mutant (d) implies a mutant loss in splicing efficiency,
• input WT (a) / input mutant (c) > spliced WT (b) / spliced mutant (d) implies a mutant gain in splicing efficiency.
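As an illustration, the three interpretation rules above can be sketched in Python. This is a minimal sketch and not the MaPSy pipeline itself: the function names, the `tol` argument, and the example counts are hypothetical, all four counts are assumed to be positive, and the statistical testing step is left out. The sketch uses the fact that comparing a/c with b/d is equivalent to checking the sign of the log2 of the ratio of ratios.

```python
import math


def log2_allelic_skew(a, b, c, d):
    """Return log2((d / c) / (b / a)) for one WT/mutant pair.

    a: WT input reads, b: WT spliced reads,
    c: mutant input reads, d: mutant spliced reads.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four read counts must be positive")
    return math.log2((d / c) / (b / a))


def classify_allelic_imbalance(a, b, c, d, tol=1e-9):
    """Label a pair as 'neutral', 'mutant loss' or 'mutant gain'.

    a/c < b/d (mutant loss) corresponds to a negative log2 skew and
    a/c > b/d (mutant gain) to a positive one. `tol` is an illustrative
    tolerance for treating the two ratios as equal.
    """
    skew = log2_allelic_skew(a, b, c, d)
    if abs(skew) <= tol:
        return "neutral"
    return "mutant loss" if skew < 0 else "mutant gain"


# Hypothetical counts: equal input, mutant four-fold depleted after splicing.
print(classify_allelic_imbalance(a=100, b=200, c=100, d=50))
```

In practice, a label like this would only be reported for pairs that also pass the significance test.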
The allelic ratios for MaPSy analyses were calculated as:

\[ \log_2\left(\frac{d/c}{b/a}\right) \tag{2.6} \]

where d is the count of mutant spliced species, c is the count of mutant input, b is the count of wild-type spliced species and a is the count of wild-type input. Across the 5K panel, 14% of mutations led to a loss in splicing efficiency, 4% led to a gain in splicing efficiency and the rest were unchanged. To assess splicing efficiency between individual species, the Fairbrother laboratory developed a specific splicing index for each species.[11] Fisher's exact test (adjusted with 5% FDR) was performed to find differences in splicing.

For the in vitro assay, the library was incorporated into two-exon constructs and incubated in HeLa nuclear extract so that splicing could occur. Afterwards, RNA was extracted, converted to cDNA, and subjected to parallel sequencing. Any change in the allelic ratio of the spliced/input was characterized as aberrant splicing. Two replicates were performed. A total of 17% of mutations led to a mutant loss, while 17% led to a mutant gain. Unspliced RNA was the most common change in vitro.

9% (453/4,964) of mutations in the 5K panel of exonic mutations showed splicing defects in both the in vivo and the in vitro splicing assays. In contrast, 3% (8/228) of common SNPs altered splicing in both assays. 81% (26/32) of tested MaPSy-detected ESMs were validated in patient tissue samples (consisting of LCLs, fibroblasts, whole blood and postmortem brain tissues).

2.6.4 Prediction challenge

Participants are asked to submit predictions for the variants in the test set. Participants should provide the probability that each variant is an ESM (i.e. that it passed the 1.5-fold change threshold and a two-sided Fisher's exact test adjusted with 5% FDR both in vitro and in vivo).
In addition, given the read counts of the WT input (a) and the MT input (c), participants should predict the log2 allelic skew ratio for the in vivo and in vitro panels for each pair in the test set.

2.7 Effects of RNA-binding proteins on intron removal

In this section, we describe a novel genome-wide method for extracting pairwise order-of-splicing measurements for adjacent introns from RNA-seq data. The order of splicing inferred from sequencing reads is well correlated with the order of splicing determined through traditional methods. In considering all adjacent intron pairs in the genome, certain introns have a strong bias toward splicing before their neighbors (i.e. 'always-first' introns). An analysis similar to the one that identified activators and repressors of splicing in Section 2.2.4 revealed that the order of splicing also correlates (both positively and negatively) with the density of binding sites for multiple RNA-binding proteins.

2.7.1 Global analysis of order of intron removal

Paired-end reads from ENCODE (GSE26284, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26284) that were 200 nt, non-polyadenylated, and from either the total cell or the nucleus (to enrich for partially spliced reads) were aligned to the human genome using TopHat. Using a combination of custom Perl scripts, samtools and bedtools, the pairs were filtered for reads that contain at least one spliced intron. These pairs were then intersected with a BED file of all introns from the UCSC hg19 database. Pairs that contained both evidence of a spliced intron and an intronic sequence were counted as an intermediate read (i.e. a partially spliced transcript). Intron pairs associated with 10 or more intermediate reads were retained for further analysis. Those with fewer than 10 intermediate reads were discarded.
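The final counting and filtering step can be sketched as follows. This is a simplified Python illustration, not the custom Perl, samtools and bedtools workflow used in this work: it assumes that each intermediate read pair has already been reduced to a hypothetical `(pair_id, spliced)` tuple stating which intron of an adjacent pair was already removed, and the function name and identifiers are invented for the example. Only the threshold of 10 intermediate reads comes from the text.

```python
from collections import Counter

MIN_INTERMEDIATE_READS = 10  # threshold stated in the text


def splicing_order(observations, min_reads=MIN_INTERMEDIATE_READS):
    """Summarise pairwise order of splicing for adjacent intron pairs.

    `observations` is an iterable of (pair_id, spliced) tuples, one per
    intermediate read pair, where `spliced` is "upstream" or "downstream"
    and names the intron already removed while its neighbour is retained.
    Returns {pair_id: fraction of reads in which the upstream intron was
    spliced first} for pairs with at least `min_reads` intermediate reads.
    """
    upstream_first = Counter()
    total = Counter()
    for pair_id, spliced in observations:
        if spliced not in ("upstream", "downstream"):
            raise ValueError(f"unexpected label: {spliced!r}")
        total[pair_id] += 1
        if spliced == "upstream":
            upstream_first[pair_id] += 1
    return {
        pair_id: upstream_first[pair_id] / n
        for pair_id, n in total.items()
        if n >= min_reads
    }


# Hypothetical input: 12 reads for one intron pair, 4 for another.
reads = (
    [("geneA:intron1-intron2", "upstream")] * 9
    + [("geneA:intron1-intron2", "downstream")] * 3
    + [("geneB:intron4-intron5", "upstream")] * 4
)
print(splicing_order(reads))
```

The second pair is dropped because it is supported by fewer than 10 intermediate reads.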
2.7.2 Poly-U signals exhibit strong influence on splicing order

To better understand the mechanisms that enforce the order of splicing, pairwise comparisons of sequence features in upstream and downstream introns were performed. In addition to other well-known features (e.g. splice-site scores), it was possible that splicing factor binding could influence the order of splicing. Analysis of the density of 175 RNA-binding protein (RBP) binding motifs in the pairwise data revealed 20 motifs, corresponding to sixteen RBPs, that significantly influenced the order of intron removal. The binding of a single factor, SRSF3, was associated with splicing last (Figure 2.13). The remaining 19 motifs were enriched in introns that spliced first.

[Figure 2.13 workflow: for each of the 175 RNAcompete binding motifs and each test sequence (adjacent pair of introns), (1) calculate the upstream motif density, (2) calculate the downstream motif density, (3) calculate the difference (upstream − downstream), and relate the difference to the splicing order percentages. Example binding trend for motif M124_0.6: density difference (y-axis) against increasing % of upstream intron splicing first (x-axis). Outcome: a clear pattern helping the intron to splice first for 19 RBP motifs and a clear pattern hindering it for 1 RBP motif. Binding proteins with correlation: CPEB2, CPEB3, HNRNPC, ELAVL1, ELAVL2, ELAVL3, TIA1, U2AF2, SRSF3, CPEB4, RALY, HNRNPCL1, SF3B4, PTBP1, PTBP2, ROD1.]

Figure 2.13: U-rich motifs enriched in always-first splicing introns can increase splicing efficiency. Methods to find motifs enriched in the always-first or always-last splicing introns. Most of the motifs that are enriched in always-first splicing introns were U-rich motifs. There was only one motif that is enriched in always-last splicing introns (SRSF3 motif, marked in red font).
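The density comparison outlined in Figure 2.13 can be sketched in Python as follows. This is a deliberately simplified illustration: RNAcompete motifs are position weight matrices, whereas the sketch assumes exact matching of a single k-mer, and the function names and example sequences are hypothetical.

```python
def motif_density(sequence, motif):
    """Occurrences of `motif` per nucleotide, counting overlapping matches.

    Both arguments are upper-cased and U is treated as T, so RNA and DNA
    spellings of the same sequence give the same result.
    """
    sequence = sequence.upper().replace("U", "T")
    motif = motif.upper().replace("U", "T")
    if not sequence or not motif:
        raise ValueError("sequence and motif must be non-empty")
    width = len(motif)
    hits = sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == motif
    )
    return hits / len(sequence)


def density_difference(upstream_intron, downstream_intron, motif):
    """Upstream minus downstream motif density for one adjacent intron pair."""
    return motif_density(upstream_intron, motif) - motif_density(
        downstream_intron, motif
    )


# Hypothetical toy introns: the upstream one carries a U-rich stretch.
print(density_difference("TTTTTGCA", "GCGCGCGC", "UUUU"))
```

A positive difference means the motif is denser in the upstream intron. Relating these differences to the percentage of reads in which the upstream intron splices first gives the kind of binding trend shown in the figure.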
Chapter 3

Splicing Aberrations and Human Hereditary Diseases

Summary and contributions

Defective splicing is a common cause of genetic diseases. On average, 13.4% of all hereditary disease alleles are classified as splicing mutations, with most mapping to the critical GT or AG nucleotides within the 5′ and 3′ splice sites. However, splicing mutations are underreported, and the fraction of disease alleles that are splicing mutations varies greatly across disease genes. For example, there is a great excess (46%; ∼3-fold) of retinoblastoma-causing hereditary disease alleles in RB1 that map to splice sites. Furthermore, mutations in exons and at deeper intronic positions may also affect splicing. We recently developed a high-throughput method that assays reported disease mutations for their ability to disrupt pre-mRNA splicing. Surprisingly, 27% of the RB1 coding mutations tested also disrupt splicing. High-throughput in vitro spliceosomal assembly assays reveal heterogeneity in which stage of spliceosomal assembly is affected by splicing mutations. 58% of exonic splicing mutations were primarily blocked at the A complex in transition to the B complex and 33% were blocked at the B complex. Several mutants appear to reduce more than one step in the assembly. As RB1 splicing mutants are enriched in retinoblastoma disease alleles, additional priority should be allocated to this class of allele when interpreting clinical sequencing experiments. Analysis of the spectrum of RB1 variants observed in 60,706 exomes identifies 197 variants that have enough potential to disrupt splicing to warrant further consideration.

Sections 3.1, 3.2, 3.3, and 3.4 were published in the following manuscript:

• K. J. Cygan, R. Soemedi, C. L. Rhine, A. Profeta, E. L. Murphy, M. F. Murray, and W. G. Fairbrother. Defective splicing of the RB1 transcript is the dominant cause of retinoblastomas. Hum Genet, 2017. DOI: 10.1007/s00439-017-1833-4

Kamil J. Cygan, Rachel Soemedi, and Christy L.
Rhine performed ESM analyses. Abraham Profeta performed conservation analyses. Michael F. Murray contributed the Geisinger dataset. Rachel Soemedi performed spliceosomal complex analyses. Kamil J. Cygan performed sampling of variants from the ExAC and HGMD datasets and generated distributions of variants. Kamil J. Cygan performed whole-exome analyses. Kamil J. Cygan performed the annotation of RB1 variants. Kamil J. Cygan, Rachel Soemedi, and Christy L. Rhine created figures. Kamil J. Cygan, Eileen L. Murphy, and William G. Fairbrother wrote the manuscript with contributions from all authors.

Section 3.5 is the result of my own set of computational analyses on a subsequent run of MaPSy. This experiment included de novo variants seen in patients with ASD and identified 55 variants that cause splicing defects. That section remains unpublished.

3.1 Defective splicing of the RB1 transcript is the dominant cause of retinoblastomas

Retinoblastoma, the most common intraocular cancer in children, has a complex etiology. While most instances of retinoblastoma occur in a sporadic form, a third of all cases start with an inherited mutation in one copy of the RB1 gene. Although retinoblastoma is a recessive disorder, the disease appears to follow a dominant mode of transmission, as a second loss-of-function mutation is acquired somatically. Retinoblastomas are treated effectively with surgery, though survivors of retinoblastoma face an increased risk of other cancers later in life. There are 293 single nucleotide mutations in RB1 reported to cause retinoblastoma currently listed in the public version of the Human Gene Mutation Database (HGMD).[156] Many of these mutations were reported to disrupt splicing.
Some RB1 splicing mutations have been associated with lower penetrance[180–182] as well as incomplete penetrance that varies between paternal and maternal inheritance.[183] Across all of the disease mutations in 2,314 intron-containing genes reported in the HGMD, on average 13.4% disrupt splicing. Splicing mutations most often result in exon skipping, but other consequences of splice site misrecognition include the usage of a cryptic 3′ or 5′ splice site (ss) or intron retention (IR). A common feature of these aberrant isoforms is an insertion or deletion (in/del) in the mRNA, which often results in a disruption of the reading frame and message decay by the nonsense-mediated decay (NMD) pathway.

The cost of sequencing has fallen several thousand-fold in the last ten years, which has driven the application of whole-genome sequencing (WGS) and whole-exome sequencing (WES) in personal genomics and clinical medicine. A typical exome will reveal thousands of variants of unknown significance. The effects of coding variants on protein function are particularly difficult to interpret, as individual functional assays do not exist for most proteins. In contrast, splicing function can be determined by polymerase chain reaction (PCR) analysis of complementary DNA (cDNA) from RNA extracted from mutant and wild-type cells (i.e. RT-PCR). Recently, our group developed a massively parallel splicing assay (MaPSy) to screen the effects of coding variants on splicing.[11] This dual in vivo/in vitro assay identifies previously annotated coding mutations as splicing mutations and pinpoints the stage in splicing that is disrupted. Here, we present the results of this assay on 30 coding variants in RB1 that cause retinoblastoma. A total of 8 mutations significantly disrupt splicing in vivo and in vitro. Numerous variants carried in the population at low levels have the potential to disrupt splicing and potentially contribute to elevated cancer risk in their carriers.
We have created an online tool that displays these results.

[156] P. D. Stenson et al. Hum Mutat, 21: 577–81, 2003.
[180] S. H. Lefevre et al. J Med Genet, 39: E21, 2002.
[181] H. Scheffer et al. J Med Genet, 37: E6, 2000.
[182] E. L. Schubert, L. C. Strong, and M. F. Hansen. Hum Genet, 100: 557–63, 1997.
[183] M. Klutz, D. Brockmann, and D. R. Lohmann. Am J Hum Genet, 71: 174–9, 2002.
[11] R. Soemedi et al. Nat Genet, 49: 848–855, 2017.

3.2 Results

3.2.1 Global analysis suggests retinoblastoma belongs to a distinct class of diseases driven by splicing mutations

To better understand the spectrum of mutations that cause retinoblastoma, the role of splicing mutations in disease was explored for all reported disease-causing mutations. For each of the hereditary disease genes reported in HGMD, the complete set of point mutations was downloaded and separated according to location (i.e. splice site and exonic). The fraction of point mutations that alter splicing was displayed graphically (Figure 3.1a, adapted from "Pathogenic variants that alter protein code often disrupt splicing", Nature Genetics[11]). Overall, the fraction of all disease-causing mutations that altered splicing was estimated to be 13.4% on average. A permutation approach was used to determine the 99.9% confidence interval for each disease gene. The retinoblastoma-causing gene RB1 was significantly enriched for mutations that fall within the splice sites. A total of 130 mutations, comprising 46% of all reported point mutations, were localized to canonical splice sites. The remaining coding mutations (Table 3.1) were analyzed with MaPSy. Briefly, this assay resynthesizes the wild-type and the mutant sequences and combines thousands of these allelic reporters in a single pool, which was tested in vivo (via transfection of HEK293 cells) and in vitro (via incubation in HeLa nuclear extract) for splicing efficiency.
A contingency table was created for each mutant/wild-type pair and included the counts obtained from deep sequencing of the input pool as well as the output-spliced fractions. To determine pairs with significant allelic skew, we required at least a 1.5-fold change and significance in a two-sided Fisher's exact test adjusted with a 5% false discovery rate (FDR). Of the five thousand exonic mutations tested in this panel, approximately 10% disrupted splicing in vivo and in vitro (dotted green line, Figure 3.1b).[11] Of all the diseases in HGMD, the highest fraction (27%) of splicing phenotypes seen in exonic mutations was found in the retinoblastoma gene, RB1. This represents a 2.5-fold enrichment of splicing-disrupting mutations among coding mutations and a 3-fold enrichment of mutations that coincide with splice sites. Taken together, these data suggest that RB1 is especially prone to being disrupted by splicing mutants.

Table 3.1: Variants in RB1 analyzed with MaPSy

HGMD ID    Variant description      Significant in MaPSy   Exon number   Frame-shifting if exon skipped
CM016042   NM_000321.2:c.1339A>T    No                     14            No
CM016043   NM_000321.2:c.1449T>G    No                     16            Yes
CM022059   NM_000321.2:c.604A>T     No                     6             Yes
CM025388   NM_000321.2:c.908T>A     No                     9             No
CM025389   NM_000321.2:c.1494T>A    No                     16            Yes
CM025390   NM_000321.2:c.1467C>A    No                     16            Yes
CM030499   NM_000321.2:c.920C>T     No                     9             No
CM030500   NM_000321.2:c.928G>T     Yes                    9             No
CM030504   NM_000321.2:c.1346G>A    Yes                    14            No
CM030505   NM_000321.2:c.1372G>T    Yes                    14            No
CM032660   NM_000321.2:c.1494T>G    No                     16            Yes
CM034897   NM_000321.2:c.584G>A     Yes                    6             Yes
CM040261   NM_000321.2:c.1396G>T    Yes                    15            Yes
CM044254   NM_000321.2:c.1129A>T    No                     12            Yes
CM063089   NM_000321.2:c.1390G>T    Yes                    15            Yes
CM071074   NM_000321.2:c.1166T>A    No                     12            Yes
CM117842   NM_000321.2:c.1447C>T    No                     16            Yes
CM900192   NM_000321.2:c.1333C>T    No                     14            No
CM942037   NM_000321.2:c.1150C>T    No                     12            Yes
CM951103   NM_000321.2:c.554T>C     No                     6             Yes
CM951105   NM_000321.2:c.1072C>T    No                     11            No
CM951106   NM_000321.2:c.1399C>T    No                     15            Yes
CM952105   NM_000321.2:c.2501C>G    Yes                    24            Yes
CM952423   NM_000321.2:c.2513C>G    Yes                    24            Yes
CM961225   NM_000321.2:c.1060C>T    No                     11            No
CM961226   NM_000321.2:c.1072C>G    No                     11            No
CM961227   NM_000321.2:c.1190C>A    No                     12            Yes
CM961228   NM_000321.2:c.1363C>T    No                     14            No
CM973041   NM_000321.2:c.1339A>C    No                     14            No
CM981700   NM_000321.2:c.1388C>G    No                     14            No

3.2.2 RB1 has a high fraction of exonic mutations that alter splicing in vitro and in vivo

To place these coding splicing mutations in their genomic context, RB1 and exon loci are depicted in Figure 3.2. RB1 is a relatively large gene (200 kb, 26 introns) that contains more introns than an average transcript (the mean number of introns in a transcript is 11). A total of 8 exons were selected for mutational analysis (Table 3.1). Of the 8 exons examined, 5 contained exonic mutations that affected splicing (Figure 3.2, Figure 3.3 and Table 3.1). The criteria used for identifying exonic splicing mutants were a fold difference of at least 1.5 between the weaker and the stronger allele and significance in a two-sided Fisher's exact test adjusted with a 5% false discovery rate (FDR).
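The two criteria can be sketched in Python using only the standard library. This is an illustrative sketch under stated assumptions rather than the code used for MaPSy: the text does not name the FDR procedure, so Benjamini-Hochberg adjustment is assumed here; the contingency table layout (rows are alleles, columns are input and spliced counts), the function names, and the example counts are likewise hypothetical; and all counts are assumed to be positive.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def log_prob(k):
        return (_log_comb(row1, k) + _log_comb(row2, col1 - k)
                - _log_comb(row1 + row2, col1))

    observed = log_prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        math.exp(lp)
        for lp in (log_prob(k) for k in range(lo, hi + 1))
        if lp <= observed + 1e-7  # small tolerance for floating-point ties
    )
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_esms(pairs, min_fold=1.5, fdr=0.05):
    """Flag pairs passing both the fold-change and the FDR criterion.

    Each pair is (a, b, c, d): WT input, WT spliced, mutant input,
    mutant spliced read counts.
    """
    pvalues = [fisher_exact_two_sided(a, b, c, d) for a, b, c, d in pairs]
    qvalues = benjamini_hochberg(pvalues)
    calls = []
    for (a, b, c, d), q in zip(pairs, qvalues):
        ratio = (d / c) / (b / a)
        fold = max(ratio, 1 / ratio)  # weaker versus stronger allele
        calls.append(fold >= min_fold and q <= fdr)
    return calls


# Hypothetical counts: a strongly depleted mutant and a near-neutral one.
print(call_esms([(1000, 1000, 1000, 250), (1000, 1000, 1000, 990)]))
```

With real data, a tested library implementation of Fisher's exact test and multiple-testing correction would normally be preferable to this hand-written version.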
To allow for comparison of splicing performance between any two species in the pool, an individual splicing index was calculated using the following formula:

\[ \log_2\left(\frac{spl_i / \sum_{i=1}^{n} spl_i}{inp_i / \sum_{i=1}^{n} inp_i}\right) \tag{3.1} \]

where spl_i is the count of reads in the spliced-output fraction for species i, inp_i is the count of reads in the input for species i and n is the total number of species in the pool. In the majority of cases (7/8), the mutant allele was the weaker substrate, splicing on average at 25% of the efficiency of the wild-type allele (Figure 3.3).[11]

[Figure 3.1: (a) scatter plot of the number of splicing mutations (y-axis) against the number of mutations (x-axis) per gene, with RB1 labeled; (b) bar chart of % splice mutants (y-axis) by disease (x-axis).]

Figure 3.1: RB1 mutations frequently disrupt splicing. (a) HGMD non-synonymous, nonsense, and splice-site mutations were grouped by genes (n = 2,314) and analyzed. Mutations that fell in the canonical splice sites (y-axis) were plotted against total mutations (x-axis). The gray area represents the region of the 99.9% confidence interval (see Electronic Supplementary Material Methods). (b) Proportions of coding mutations that disrupt splicing in different hereditary disorders were ranked according to their degrees of enrichment of splicing mutations. The red bar indicates that 27% of RB1 mutations disrupt splicing in vivo and in vitro. Error bars represent the 95% confidence interval. The number in round brackets following each disease name is the total number of mutations tested by MaPSy for that disease.

[Figure 3.2: RB1 RefSeq gene track with GERP scores for mammalian alignments (scale bar 50 kb); exons 6, 9, 14, 15 and 24 are shown enlarged with variant labels.]

Figure 3.2: RB1 mutations that disrupt splicing. Out of 30 RB1 mutations analyzed with MaPSy, eight disrupted splicing both in vivo and in vitro. RB1 exons that contain splicing mutations are shown in red and other exons in blue. Each exon with splicing mutations is shown enlarged with mutations depicted as dots (red for mutations that disrupt splicing and blue for mutations that do not disrupt splicing).

It is typical that splicing mutations result in exon skipping. Less frequently, usage of a cryptic 3′ ss or 5′ ss is observed, and occasional cases of IR have been reported. The consequence of these aberrant processing events is to create a transcript in/del in the message. This modification disrupts the reading frame unless the size of the in/del is a multiple of three.
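The individual splicing index (3.1) and the reading-frame rule stated above can be illustrated with a short Python sketch. The function names, the dictionary-based input, and the example counts are hypothetical and chosen only for illustration; all read counts are assumed to be positive.

```python
import math


def splicing_index(spliced, inputs):
    """Per-species log2 of (share of spliced reads) / (share of input reads).

    `spliced` and `inputs` map each species name to its read count in the
    spliced-output fraction and in the input pool, respectively.
    """
    total_spliced = sum(spliced.values())
    total_input = sum(inputs.values())
    return {
        name: math.log2(
            (spliced[name] / total_spliced) / (inputs[name] / total_input)
        )
        for name in inputs
    }


def disrupts_reading_frame(indel_length):
    """True when an in/del of `indel_length` nucleotides shifts the frame."""
    if indel_length <= 0:
        raise ValueError("in/del length must be a positive number of nucleotides")
    return indel_length % 3 != 0


# Hypothetical two-species pool with equal input representation.
index = splicing_index(
    spliced={"wt": 300, "mut": 100},
    inputs={"wt": 200, "mut": 200},
)
print(index["mut"])                 # mutant lost half of its share: -1.0
print(disrupts_reading_frame(120))  # 120 is a multiple of three: False
```

In this toy pool the mutant holds a quarter of the spliced reads but half of the input reads, which gives an index of -1. Skipping a hypothetical 120-nucleotide exon would leave the reading frame intact, whereas a 115-nucleotide exon would not.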
In the case of the 30 RB1 mutations screened, none of the mutations were associated with a cryptic splicing event, and 5 of the 8 significant MaPSy mutations in RB1 would be predicted to create a frame-shifting event if the exon was skipped. While there is good correlation between the relative strength of wild-type and mutant splicing in vitro and in vivo, the in vivo assay is more sensitive and tends to result in more extreme enrichment or depletion of an allele in the successfully spliced fraction.[11]

3.2.3 The transition between A and B complex is the major point of disruption for RB1 coding mutants

In addition to offering a highly sensitive test of a pre-mRNA's ability to serve as a substrate for splicing, the in vitro splicing assay can be utilized to follow spliceosome assembly.[167] The in vitro splicing/spliceosome assembly assay was performed by incubating in vitro transcribed RNA in cell nuclear extract. The spliceosome assembles in a step-wise fashion on in vitro transcribed RNA to form the A, the B and the C complex.[168,169] Spliceosome assembly and the formation of the A complex begin with the U1 and U2 small nuclear ribonucleoproteins (snRNPs) binding to the 5′ splice site and the branch point, respectively. In the next step, the U4/U6.U5 tri-snRNP is recruited to form the pre-catalytic B complex. Following the formation of the B complex, a series of conformational changes results in the release of the U1 and U4 subunits and establishes the C complex (Figure 3.4a). While the B and C complexes on the reporter substrates are biochemically inseparable, the A and B/C complexes can be separated into distinct fractions by glycerol gradient centrifugation.[11] The resulting fractions can be sequenced and the representation of the wild-type and mutant alleles can be ascertained from the read counts. The following formula was used to calculate allelic skew at each stage of spliceosome assembly:

\[ \log_2\left(\frac{mut_s / mut_i}{wt_s / wt_i}\right) \tag{3.2} \]

where mut_s is the count of reads in the stage for the mutant, mut_i is the count of reads in the input for the mutant, wt_s is the count of reads in the stage for the wild type, and wt_i is the count of reads in the input for the wild type.

[167] R. A. Padgett et al. Annu Rev Biochem, 55: 1119–50, 1986.
[168] M. M. Konarska and P. A. Sharp. Cell, 46: 845–55, 1986.
[169] R. Das and R. Reed. RNA, 5: 1504–8, 1999.

[Figure 3.3: bar plots of splicing efficiency, log2((spl_i / Σ spl_i) / (inp_i / Σ inp_i)), in vivo and in vitro, for mutant and wild type. Panels: c.928G>T, AAT(G>T)GAC; c.1346G>A, TTG(G>A)AGT; c.1372G>T, AAT(G>T)AAT; c.584G>A, CTT(G>A)GAT; c.1396G>T, GAA(G>T)AAC; c.1390G>T, CAG(G>T)AAG; c.2501C>G, TAT(C>G)AAT; c.2513C>G, AAT(C>G)ATT.]

Figure 3.3: Spliceosomal assembly results. Relative representation of the mutant (red bar) and wild type (blue bar) in the spliced product. Negative values indicate a species that lost representation in the fully spliced (i.e., output) pool relative to the starting library (i.e., input) pool; spl_i is the count of reads in the spliced-output fraction for species i, inp_i is the count of reads in the input for species i, and n is the total number of species in the pool.

The eight RB1 mutations were analyzed with the experiment described above. The results reveal diverse patterns of enrichment across the different fractions (Figure 3.4b).[11] The map of RB1 is shown in Figure 3.4b, top. Exons with exonic splicing mutations (ESMs) are colored red, exons that were tested with MaPSy but have no ESM are colored blue, and exons that were not tested are colored black (Figure 3.4b, top). The mutant to wild-type ratios in the A, BC and spliced fractions are illustrated as a heat map (Figure 3.4b, below). Degrees of enrichment are indicated in the red color spectrum and degrees of depletion in the blue color spectrum (the log2 scale of the ratios is shown in the bar legend below each heat map). The relative positions of ESMs (red dots) and controls (black dots) in each exon are illustrated to the left of the heat maps.

[Figure 3.4: (a) schematic of spliceosome assembly from pre-mRNA through the A, B and C complexes to the spliced product and lariat, annotated with the ESMs that affect each transition; (b) RB1 exon map with the number of ESMs out of the number of mutations tested per exon, and heat maps of enrichment, log2((mut_s / mut_i) / (wt_s / wt_i)), in the A, BC and spliced fractions.]

Figure 3.4: RB1 mutations disrupt splicing in various stages of the spliceosomal assembly. (a) RB1 mutations alter splicing at different stages of spliceosomal assembly. The mechanism of spliceosome assembly from A, BC, and spliced is illustrated. ESMs that impact each transition of the assembly are indicated (black font indicates major disruption and gray font indicates minor disruption). ESMs that act at multiple stages of spliceosome assembly are marked with asterisks. (b) Relative positions of all exons in the RB1 gene are shown, with exons containing ESMs in red, exons with no ESM that were tested with MaPSy in blue, and exons that were not tested in black (top). Numbers on top of each exon represent the number of ESMs out of the total number of mutations tested in the respective exon. The heat-map representation of the degree of enrichment in the spliceosomal fractions (A, BC, and spliced) for each mutation in the exon is shown, together with the relative position of the respective mutation in each exon (gray box to the left of each heat map). ESMs are indicated as red dots and non-ESMs as black dots, with the bottom end of the gray box representing the 5′ end of the exon and the top representing the 3′ end of the exon. The log2 scale of the mutant to wild-type ratios in each spliceosomal fraction is indicated in the color bar legend below each heat map; mut_s is the count of reads in the fraction for the mutant, mut_i is the count of reads in the input for the mutant, wt_s is the count of reads in the fraction for the wild type, and wt_i is the count of reads in the input for the wild type.

The first group represented cases where the mutation principally blocked A complex formation. In practice, this step was not the only step of splicing affected, but mutant NM_000321.2:c.1396G>T (HGMD ID CM040261) clearly illustrates an underrepresentation of the mutant allele in the A complex fraction, suggesting a failure in the events leading up to U1 snRNP and U2 snRNP recognizing the 5′ ss and 3′ ss. Mutant NM_000321.2:c.1372G>T (HGMD ID CM030505) has a minor defect in A complex assembly, and while the mechanism of its splicing defect was similar to that of NM_000321.2:c.1396G>T, NM_000321.2:c.1372G>T shows a more dramatic distortion in allelic ratio during the transition from the A complex to the B/C complex. This transition was the most common point of misregulation of exonic splicing mutants in RB1.
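The stage-wise comparison can be sketched in Python by applying the allelic skew of formula (3.2) to each spliceosomal fraction in turn. This is a minimal sketch with hypothetical function names and invented read counts; it assumes positive counts and is not the analysis code used for Figure 3.4.

```python
import math

STAGES = ("A", "BC", "spliced")  # fractions described in the text


def stage_enrichment(mut_stage, mut_input, wt_stage, wt_input):
    """log2((mut_s / mut_i) / (wt_s / wt_i)) for a single fraction."""
    return math.log2((mut_stage / mut_input) / (wt_stage / wt_input))


def enrichment_profile(mut_counts, wt_counts, mut_input, wt_input):
    """Allelic skew of one mutant/wild-type pair in every fraction.

    `mut_counts` and `wt_counts` map each stage name to the read count
    of the mutant and the wild type in that fraction.
    """
    return {
        stage: stage_enrichment(
            mut_counts[stage], mut_input, wt_counts[stage], wt_input
        )
        for stage in STAGES
    }


# Invented counts: the mutant is balanced in the A fraction but four-fold
# depleted from the BC fraction onwards.
profile = enrichment_profile(
    mut_counts={"A": 100, "BC": 25, "spliced": 25},
    wt_counts={"A": 100, "BC": 100, "spliced": 100},
    mut_input=100,
    wt_input=100,
)
print(profile)
```

A profile of this shape, with no skew in the A fraction and a drop that first appears in the BC fraction, is the pattern one would expect for a mutant affected at the transition from the A complex to the B/C complex.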
The remaining mutations appear to also block one of the events leading up to the entry of the U4/U6.U5 tri-snRNP into the spliceosome and the formation of the catalytically active complex (Figure 3.4a).

3.2.4 Deep sampling of genetic variation in the human population suggests the presence of rare deleterious alleles that may alter splicing

Disruption of RB1 splicing often leads to retinoblastoma. As demonstrated earlier, a high fraction of nonsynonymous variants cause aberrant splicing of RB1. Whole-exome sequencing is increasingly used to detect causal variants and diagnose diseases. Intronic variants can also be captured by exome sequencing, as the library fragments often extend into the intron. Because non-coding variation can contribute to disease risk, we examined a typical exome capture (NA12891WEX dataset, 1000 Genomes Project[184]). The reanalysis reveals how the power to detect intronic variants declines as a function of distance from the splice site (Figure 3.5a). While there is no strict consensus, a typical exome sequencing run will be designed at 20- to 100-fold mean transcriptome coverage (MTC).[185] An intronic position 75 nucleotides away from the splice site has about 20% of full coverage (e.g. 4X coverage for an experiment with 20X MTC) (Figure 3.5a). Several studies have suggested variable penetrance of RB1 splicing mutants, so it is entirely possible for asymptomatic individuals to be carriers of retinoblastoma alleles[180–182,186] or potentially to be more susceptible to other types of cancers.[187] To better understand the potential contribution of natural variants of RB1 to retinoblastoma disease risk, and to identify and estimate the proportion of variants that could be future targets in understanding the disease mechanisms, the aggregated data of all publicly available exome sequencing experiments from the Exome Aggregation Consortium (ExAC)[53] were downloaded for analysis.
According to prevailing models of population genetics, variants that accumulate to an appreciable frequency in the population will mostly be neutral, as variants that have a deleterious or advantageous effect on fitness will either be rapidly eliminated or fixed in the population by natural selection.188 The resulting prediction is that common variants will be more likely to be neutral than rare variants. However, it is possible that other factors like linkage disequilibrium can complicate inferences of selection drawn from data derived from single alleles. To explore the agreement of the observed human polymorphism data with theoretical models of selection, all ExAC variants were separated into classes based on the severity of their predicted effect on gene function (i.e. nonsense > missense > synonymous). Variants were also separated by minor allele frequency (MAF) into rare (MAF < 0.01%) and common (MAF > 1%). Common variants were depleted in all the functional categories tested (e.g. depleted in exons relative to introns, Figure 3.5b). Rare variants, on the other hand, were more evenly distributed. Selection against protein coding changes was observed in the enrichment of common variants in synonymous changes and their depletion in the missense and nonsense categories (Figure 3.5c). To determine whether selection was evident against splicing signals, variants that disrupted splice sites were evaluated relative to variants in other parts of the intron. Again, rare variants occurred at almost twice the rate of common variants in the 5′ ss (Figure 3.5d). As all common variants were initially rare variants, this result suggests that half of all rare single nucleotide polymorphisms (SNPs) that fall within splice site regions are eliminated by natural selection.

[Figure 3.5 panels: (a) mean coverage versus distance from the 3′ ss and from the 5′ ss; (b) proportion of allele type by region (exon, intron); (c) proportion of allele type by codon change (missense, nonsense, synonymous); (d) proportion of allele type by region (5′ splice site, intron). Allele types: common (MAF > 1%) and rare (MAF < 0.01%).]

Figure 3.5: Low-frequency variants predicted to disrupt splicing in RB1. (a) Power to discover intronic variants in an exome sequencing experiment declines as a function of distance from exon boundaries. (b) Common alleles’ underrepresentation in exons shows the mark of selection (yellow color). Rare alleles are more uniformly distributed across functional and non-functional categories (green color). (c) Common exonic alleles (yellow color) are underrepresented in deleterious categories (missense and nonsense) of mutations relative to rare variants (green color). (d) Common intronic variants (yellow color) are underrepresented in the 5′ ss relative to rare alleles (green color).

[Figure 3.6 axes: proportion of variant class (RB1 ExAC, RB1 HGMD) by mutation type: intronic, intronic / 3′ ss, intronic / 5′ ss, missense, missense / 3′ ss, missense / 5′ ss, splice acceptor, splice donor, nonsense, nonsense / 3′ ss, nonsense / 5′ ss, synonymous, synonymous / 3′ ss, synonymous / 5′ ss, 3′ UTR, 5′ UTR.]

Figure 3.6: Comparison of the distribution of polymorphisms (ExAC variants, green color) in RB1 to retinoblastoma causing RB1 disease alleles (HGMD, purple color).

184 Consortium Genomes Project et al. Nature, 526: 68 – 74, 2015.
185 A. M. Meynert et al. BMC Bioinformatics, 14: 195, 2013.
186 J. W. Harbour. Arch Ophthalmol, 119: 1699 – 704, 2001.
187 C. J. Dommering et al. Fam Cancer, 11: 225 – 33, 2012.
53 M. Lek et al. Nature, 536: 285 – 91, 2016.
188 Motoo Kimura. The neutral theory of molecular evolution. Cambridge Cambridgeshire ; New York: Cambridge University Press, 1983. xv, 367 p.

3.2.5 At least 553 RB1 variants exist in the human population

Given the previous analysis, a substantial proportion of the rare variants found in asymptomatic individuals might in fact carry an increased disease risk for retinoblastoma. We decided to evaluate all reported variants and create a curated list of the ones that can potentially cause splicing aberrations in RB1. Among RB1 variants discovered in the ExAC dataset of 60,706 exomes, all but 4 are rare variants (MAF < 0.01%). 284 variants occur in the coding region and 269 occur in the intronic region. Further analysis shows that 13.4% of the rare variants fall within splice sites (65 variants fall within the –20 to +3 window at the 3′ splice site and 9 variants fall within the –3 to +6 window at the 5′ splice site). This distribution of low frequency alleles is confirmed in another dataset of asymptomatic individuals (the Geisinger cohort contains 423 alleles per 50,000 people). The distribution of these variants across different categories (e.g. splice site variants, synonymous, non-synonymous, etc.) differs significantly from that of the 293 HGMD mutations reported to cause retinoblastoma, reflecting the fact that most variants are not as deleterious as disease alleles (Figure 3.6).
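The splice-site windows and the 13.4% figure quoted above can be illustrated with a minimal Python sketch. The function names and the sign convention (signed offsets with no position 0, negative offsets intronic at the 3′ splice site and exonic at the 5′ splice site) are assumptions made for illustration and are not taken from the analysis code used in this work. The denominator of 553 (284 coding plus 269 intronic variants) is also an assumption, chosen because it reproduces the quoted percentage.

```python
# Splice-site windows quoted in the text, as signed offsets with no position 0.
# Assumed convention: at the 3' ss negative offsets are intronic and positive
# offsets are exonic; at the 5' ss negative offsets are exonic and positive
# offsets are intronic.
SPLICE_WINDOWS = {"3ss": (-20, 3), "5ss": (-3, 6)}


def in_splice_window(offset: int, site: str) -> bool:
    """Return True if a signed offset falls inside the window for `site`."""
    if offset == 0:
        raise ValueError("offsets are signed and skip position 0")
    low, high = SPLICE_WINDOWS[site]
    return low <= offset <= high


def splice_site_fraction(n_3ss: int, n_5ss: int, n_total: int) -> float:
    """Fraction of variants that fall within either splice-site window."""
    return (n_3ss + n_5ss) / n_total


# Counts quoted in the text; the total of 553 is an assumed denominator.
fraction = splice_site_fraction(65, 9, 284 + 269)
print(f"{fraction:.1%}")  # prints 13.4%
```

With these counts, 74 of the 553 variants fall in a splice-site window.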
However, the exonic regions of RB1 are highly conserved across vertebrates, implying that the coding sequence is evolving under purifying selection (Figure 3.2). It is very likely that some of these rare variants affect RB1 function or splicing and potentially contribute to disease risk. While the HGMD disease alleles arose through spontaneous mutation, many of the rare alleles appeared to be inherited. As rare alleles are more likely to be deleterious, we have compiled a table of all variants in RB1 annotated by splicing location (intron, 5′ ss, branch-point region, 3′ ss and exon) (Table B.1). To construct the final list of mutations that have enough potential to disrupt splicing, RB1 variants were analyzed by a variety of predictive tools.7,9,24,126,158 In total, 197 variants were set aside for further consideration. These variants either created or disrupted splicing elements, modified the branch-point region, substantially changed the inclusion/exclusion ratio of the closest exon, or fell within a splice site (Table B.1).

3.2.6 Online visualization tool enables splicing phenotype to be added to annotations of disease alleles

Finally, a visualization tool is available to access the splicing data for disease causing variants in RB1. An online mutation browser was developed that diagrams the reporter constructs used in the assay (Figure 3.7). The location of each analyzed RB1 variant is depicted and can be used to check for clustering of mutations or the distance from known splicing signals (splice sites) (Figure 3.7). In addition, the visualization tool shows the splicing efficiency of each variant relative to its wild-type counterpart and allows for quick interpretation of the severity of the splicing phenotype (Figure 3.7).
Finally, the online tool contains the in vivo and in vitro results for all 30 mutations: read counts in vivo and in vitro for the starting libraries, A, B/C, and spliced complexes, for a total of 30×6 experiments (Figure 3.8). Mutations can be searched by mutant ID or author, and references link mutations to the original reports. The online tool allows researchers to analyze novel variants and submit alleles for analysis with MaPSy. The tool is freely available at http://fairbrother.biomed.brown.edu/RB1.htm and supports all major browsers.

[Figure 3.7 panels: in vivo and in vitro views of RB1 exon 14, each showing the efficiency of mutant types relative to wild type, with bars marked as significant in vivo/in vitro, significant in vivo, or significant in vitro against the WT baseline.]

Figure 3.7: Online browser enables navigation through RB1 mutations that disrupt splicing. http://fairbrother.biomed.brown.edu/RB1.htm contains an interactive online tool to browse the results of in vitro and in vivo splicing assays on 30 RB1 mutations. Main browser depicts the in vivo substrate (above) and the in vitro substrate (below) for each exon in RB1 that was tested for splicing efficiency. Histogram bars indicate the normalized efficiency of mutation as a fraction of the wild type (dashed blue line). Red bars indicate significant results. Blue bars indicate results that failed to meet the significance threshold (a fold difference of at least 1.5 between weaker and stronger allele, two-sided Fisher’s exact test adjusted with 5% FDR). Options convert HGMD labels to amino acid changes.

Figure 3.8: Online browser’s results. Mouse over buttons display raw data for each mutation tested in RB1.

3.3 Discussion

With the development of tools to screen thousands of variants for splicing defects, it has become clear that retinoblastoma is highly biased towards splicing mutations. Indeed, among all the exonic disease mutations that were included in a recent high throughput screen of disease variants, retinoblastoma had the highest fraction of ESMs. This suggests that diseases like retinoblastoma arise mainly through a loss-of-function mechanism. As many splicing defects result in mRNA in/dels that disrupt the reading frame, splicing mutations can be highly deleterious and therefore warrant additional attention during the variant classification process. In addition to discovering alleles that confer a splicing defect, the mechanism of the defect was uncovered. The spliceosome assembles in a progressive, step-wise fashion. While early work on splicing indicated that many spliceosomes were committed to splicing relatively early (i.e. during the formation of the A complex), defective alleles in RB1 frequently block splicing at later points in the assembly. Frequently, it is the formation of the B or C complex that is prevented by a mutation, perhaps by distorting enhancer signals used by the SR family and related proteins, which have been shown to drive early stage spliceosome formations into active ones.189 A recent systematic depletion of splicing factors also found substrate specific effects when targeting core components of the spliceosome. Components that were thought to be required for B complex or even mature catalytic spliceosomes affected splice site selection.190

As sequencing becomes less expensive, exome sequencing will be used more and more in precision medicine applications, and individual genomes will be sequenced more frequently. We demonstrated that the capture technology can be used to discover variants well into the intron. Unlike disease alleles, variants discovered in this fashion are, for the most part, biologically inert. Despite this, we can demonstrate the mark of selection on splicing signals such as the 5′ ss (Figure 3.5d) and estimate the proportion of variants that are in fact deleterious. By comparing common to rare alleles, the data suggest that many RB1 rare alleles are deleterious and subject to negative selection. As the vast majority of variants in a genome are rare variants, a major focus going forward will be to screen these variants for alleles that contribute to disease risk. This goal is especially important for genes like RB1, whose associated diseases seem to be especially biased toward splicing mutations. Lastly, we have created an online search tool for splicing defective RB1 variants and will systematically screen the population RB1 variants using MaPSy. In conclusion, by leveraging the power of biochemical assays, computational tools, and whole exome capture technologies, it is entirely possible to deeply characterize sequence variations and assign risk alleles for diseases like retinoblastoma. This task is more urgent than ever in order to keep pace with the variant discovery that is taking place in the clinic.

7 K. H. Lim and W. G. Fairbrother. Bioinformatics, 28: 1031 – 2, 2012.
9 H. Y. Xiong et al. Science, 347: 1254806, 2015.
24 A. J. Taggart et al. Nat Struct Mol Biol, 19: 719 – 21, 2012.
126 G. Yeo and C. B. Burge. J Comput Biol, 11: 377 – 94, 2004.
158 S. Ke et al. Genome Res, 21: 1360 – 74, 2011.
189 R. F. Roscigno and M. A. Garcia-Blanco. RNA, 1: 692 – 706, 1995.
190 P. Papasaikas et al. Mol Cell, 57: 7 – 22, 2015.
3.4 Methods

3.4.1 Splicing efficiency analyses

Representations of spliced species of wild-type and mutant RB1 exons were calculated as below:

\[
\log_2\left(\frac{spl_i \,/\, \sum_{i=1}^{n} spl_i}{inp_i \,/\, \sum_{i=1}^{n} inp_i}\right) \tag{3.3}
\]

where spl_i is the count for spliced species i, inp_i is the count for input i, and n is the number of species analyzed in RB1. MaPSy in vivo and in vitro experiments were performed as previously described.11 Mutant/wild-type allele ratios in the A, B/C and spliced fractions were calculated as follows:

\[
\log_2\left(\frac{m_{ie} / m_{ii}}{m_{je} / m_{ji}}\right) \tag{3.4}
\]

where m_{ie} and m_{ii} are the minor allele counts in the enriched pool and input, respectively, and m_{je} and m_{ji} are the major allele counts in the enriched pool and input, respectively. The minor allele is defined as the allele that splices less efficiently than the major allele.

3.4.2 RB1 HGMD mutation simulation

Disease-causing splicing and coding sequence mutations were selected from HGMD (n = 77,943). The mutations were classified as splicing, missense, or nonsense mutations, and the numbers of mutation types were determined for each gene. The total number of mutations was plotted against the total number of splice site mutations (SSM) in a gene (Figure 3.1a). Using the proportion of total SSM to total mutations in the HGMD as a weight for random sampling, the proportion of SSM given the total mutations in each gene was simulated 1,000 times. Genes falling outside the simulated values represent genes that have more (above the confidence interval) or fewer (below the confidence interval) SSM than expected (P < 0.01) based on the distribution of mutation types within the dataset. The RB1 gene is highlighted in red to show its position in the category of genes predicted to be susceptible to SSM (Figure 3.1a).

3.4.3 Genomic evolutionary rate profiling (GERP) conservation analysis

The GERP conservation UCSC genome browser track was intersected with the HGMD variant positions located in RB1.
The average GERP conservation of all the HGMD variants (n = 293), the neutral MaPSy variants (n = 22), and the variants that significantly altered splicing (n = 8) was determined and plotted in each category for RB1.

3.4.4 Calculating whole-exome intronic coverage

We downloaded the list of Broad Institute targets (n = 189,894) from the following address: ftp://gsapubftpanonymous@ftp.broadinstitute.org/bundle/b37/Broad.human.exome.b37.interval_list.gz and added 300 upstream and downstream positions to each target. Duplicate entries were removed and the list of unique targets (n = 113,000) was intersected with the list of RefSeq exons retrieved from the UCSC genome browser. Only the targets that overlapped a single entire RefSeq exon were kept for the coverage calculation (n = 81,802). Coverage was calculated using BEDTools genomecov (Quinlan and Hall 2010). The input consisted of the list of targets and the bam file for the single whole exome experiment (NA12878WEX) retrieved from: ftp://ftptrace.ncbi.nih.gov/1000genomes/ftp/technical/working/20120117_ceu_trio_b37_decoy/CEUTrio.HiSeq.WEx.b37_decoy.NA12878.clean.dedup.recal.20120117.bam. Lastly, the global mean coverage at each position relative to each splice site was calculated by collapsing the results of individual targets.

3.4.5 Global variant distribution in ExAC dataset

ExAC variants were obtained from the published source53 together with minor allele frequency (MAF) data. The list of targets used to calculate coverage was further filtered by intersecting it with canonical exons downloaded from the UCSC genome browser. The final list of targets (n = 80,337, canonical targets), including the intronic flanks, was intersected with the list of ExAC variants. ExAC variants that fell within the list of canonical targets were kept and plotted. ExAC variants that fell within the canonical RB1 target are listed in Table B.1.
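The rare/common split applied to the ExAC variants (MAF < 0.01% and MAF > 1%) and the per-region proportions plotted in Figure 3.5 can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names, the tuple-based input format, and the toy variants are all hypothetical, and variants with intermediate MAF are simply left out, which is an assumption based on the two thresholds given in the text.

```python
from collections import Counter

RARE_MAX = 0.0001   # MAF < 0.01%, expressed as a fraction
COMMON_MIN = 0.01   # MAF > 1%, expressed as a fraction


def maf_class(maf):
    """Return 'rare', 'common', or None for intermediate frequencies."""
    if maf < RARE_MAX:
        return "rare"
    if maf > COMMON_MIN:
        return "common"
    return None


def proportions_by_region(variants):
    """Proportion of variants per region within each allele class.

    `variants` is an iterable of (region, maf) pairs (hypothetical format).
    """
    counts = {"rare": Counter(), "common": Counter()}
    for region, maf in variants:
        allele_class = maf_class(maf)
        if allele_class is not None:
            counts[allele_class][region] += 1
    return {
        allele_class: {
            region: n / sum(counter.values())
            for region, n in counter.items()
        }
        for allele_class, counter in counts.items()
        if counter
    }


# Toy input for illustration only; these are not ExAC values.
toy_variants = [
    ("exon", 0.00005), ("intron", 0.00002),
    ("intron", 0.02), ("intron", 0.05), ("exon", 0.03),
    ("exon", 0.001),  # intermediate MAF, not assigned to either class
]
result = proportions_by_region(toy_variants)
```

In the toy input, the rare class is split evenly between exon and intron, while two thirds of the common class is intronic.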
3.4.6 Annotation of RB1 variants

ExAC variants reported in RB1 were annotated using the following bioinformatics tools: 3′ splice site and 5′ splice site MaxEnt scores,126 SPANR software,9 Spliceman,7 Chasin ESR score,158 and branch-point location.24

3.5 Genetic variation in autism spectrum disorder (ASD)

Autism spectrum disorder (ASD) is a neurodevelopmental condition that appears relatively early in childhood development. The most common characteristics of the disease include repetitive behaviors, communication problems, and impaired social skills. The severity of the disorder varies greatly from subject to subject and might interfere with affected individuals’ ability to function properly in society. Although the mechanisms that cause ASD are poorly understood, it clearly has a genetic component: ASD is highly heritable. Due to the enormous genetic heterogeneity of ASD, multiple consortia, including the Simons Foundation Autism Research Initiative (SFARI) (https://www.sfari.org/), have long been trying to identify etiologic genetic variants. Through the aggregation of multiple studies including, but not limited to, the Simons Simplex Collection (SSC) (a repository of genetic samples from families, each of which has one affected child and unaffected parents and siblings), SFARI gene (https://gene.sfari.org/database/gene-scoring/) now lists about 120 genes as syndromic. The SFARI resource also enables researchers to access particular variants that the consortium identified in ASD families, including inherited and de novo mutations. Denovo-DB191 is another useful resource for studying neurodevelopmental disorders including ASD. It is a database of human de novo variants and includes over 40 different studies. Denovo-DB provides detailed variant information including severity scores, frequency, validation status, and the phenotype of the individual with the variant.
In the following sections, we aim to identify a subset of de novo variants listed in the SFARI resource that potentially cause splicing defects in ASD cases. We also show that the distribution of variants found in ASD subjects in denovo-DB is quite different from the distributions of variants found in ASD controls, in the denovo-DB developmental disorders category (excluding ASD), in ExAC controls, and among disease-causing HGMD variants.

3.5.1 Distribution of de novo mutations

To visualize the distributions of variants found in denovo-DB and other datasets (HGMD and ExAC), we used an approach similar to the one described earlier in Section 3.4.5. Briefly, we downloaded variants from the following published URLs: http://denovo-db.gs.washington.edu/denovo-db.variants.v.1.5.tsv.gz; ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/ExAC.r1.sites.vep.vcf.gz; https://portal.biobase-international.com. Next, we used SnpEff192 to annotate all variants and their locations. Then, the first filtering step (done using Perl) discarded variants that were not located in known Ensembl transcripts. Finally, only variants that SnpEff assigned to one of the following categories were kept and plotted: 3′ UTR, 5′ UTR, intron, missense, synonymous, nonsense, splice acceptor, splice donor, splice region. An additional round of filtering was then performed on the denovo-DB database: only variants observed in the ASD subjects, controls, and neurodevelopmental disorder categories were kept. A similar filtering step was applied to the HGMD variants, where we kept only the disease-causing single nucleotide variants.

191 T. N. Turner et al. Nucleic Acids Res, 45: D804 – D811, 2017.
192 P. Cingolani et al. Fly (Austin), 6: 80 – 92, 2012.
Based on the plot shown in Figure 3.9, we can notice emerging patterns. None of the causal variants from any of the databases are enriched in either the 3′ or 5′ UTR. Disease-causing HGMD variants are enriched in several variant type categories, including missense variants, splice acceptor and donor variants, and variants that introduce a premature termination codon. HGMD variants are also depleted from the synonymous category as well as the intronic category. This shows the drawbacks of using the HGMD resource for some analyses: intronic variants and splice-site variants (with the exception of variants falling on the AG and GU di-nucleotides) are underreported in HGMD. On the other hand, a large proportion of the de novo variants found in subjects with ASD fall in the intronic regions of genes. In addition, de novo ASD variants are depleted from all other categories when compared to denovo-DB controls. Lastly, de novo variants found in patients that show signs of developmental disorders are enriched in similar categories as the variants in the HGMD database. This suggests that some developmental disorders might belong to the same distinct class of diseases as retinoblastoma, which are driven by splicing mutations.

[Figure 3.9 axes: proportion (%) of variants by variant type (3′ UTR, 5′ UTR, intron, missense, nonsense, splice acceptor, splice donor, splice region, synonymous) for each dataset (DenovoDB autism, DenovoDB controls, DenovoDB developmental disorder, ExAC, HGMD).]

Figure 3.9: Distribution of ExAC, HGMD, and denovo-DB variants in canonical transcripts.

3.5.2 Identification of de novo ASD mutations that cause splicing defects

We now consider the task of identifying de novo variants in ASD patients that cause aberrant splicing. Splicing misregulation is one possible underlying cause of ASD.
In the pilot experiment, we used our MaPSy approach to test 764 randomly selected exonic de novo variants that fit into the reporter system. Aiming to separate causal single nucleotide variants from non-causal ones, we wanted to make sure that similar coverage of both mutant and wild type species in each of the pairs was achieved after the sequencing run. Figure 3.10 illustrates the simulation approach comparing the actual distribution of input mutant/wild type ratios to the distribution that was acquired by random permutation of mutant/wild type labels. Based on the simulation results, we were satisfied with the similar representation of the species in each of the pairs. The MaPSy protocol described in Chapter 2 was followed to perform the downstream comparisons and record allele skew in each of the pairs tested.

[Figure 3.10 panels: histograms of the number of mut/wt pairs versus mut/wt input ratio for the shuffled method (top) and the actual data (bottom).]

Figure 3.10: Distributions of input ratios between mutant and wild type species in the SFARI pilot study. (Top) By permuting mutant and wild type labels, the random distribution of mutant/wild type ratios was recovered. (Bottom) The actual distribution of mutant/wild type ratios in the input fraction of the SFARI panel.

We identified 55 (7.2%) single nucleotide exonic de novo ASD variants that cause significant splicing defects in the assay. Significant variants included 12 splice site mutations (as categorized by SFARI) that were included in this experiment as controls. The percentage of significant single nucleotide variants was lower than the average percentage of ESMs across the previous 5K MaPSy panel (10%). This is not surprising, as the 5K panel tested disease-causing exonic mutations listed in HGMD, whereas here we selected a random subset of variants with no associated disease risk.
We are, however, quite encouraged that this percentage is higher than for the SNP panel tested by MaPSy (3%). Further testing of the main class of intronic de novo mutations, as well as of the inherited variants listed in both denovo-DB and the SFARI resource, is required in order to correctly estimate the role of splicing misregulation in ASD. The analyses performed in this section and in the previous one show that it is possible to discover new genetic determinants of ASD, and they also suggest that our MaPSy approach can be used to filter variants in support of precision medicine and of whole-genome and whole-exome variant studies.

Chapter 4

Visualization and Inference of Splicing Aberrations

Summary and contributions

Most pre-mRNA transcripts in eukaryotic cells must undergo splicing to remove introns and join exons, and splicing elements present a large mutational target for disease-causing mutations. Splicing elements are strongly position dependent with respect to the transcript annotations. In 2012, we presented Spliceman, an online tool that used positional dependence to predict how likely distant mutations around annotated splice sites were to disrupt splicing. Here, we present an improved version of the previous tool that will be more useful for predicting the likelihood of splicing mutations. We have added industry-standard input options (i.e., Spliceman now accepts Variant Call Format (VCF) files), which allow much larger inputs than previously possible. The tool can also visualize the locations, within exons and introns, of the sequence variants to be analyzed and their predicted effects on splicing of the pre-mRNA transcript. In addition, Spliceman2 integrates with RNAcompete motif libraries to predict which trans-acting factor binding sites are disrupted or created, and it links out to the UCSC genome browser.
In summary, the new features in Spliceman2 will allow scientists and physicians to better understand the effects of single nucleotide variations on splicing.

Sections 4.1, 4.2, 4.3, and 4.4 were published in the following manuscript:

• K. J. Cygan, C. H. Sanford, and W. G. Fairbrother. Spliceman2: a computational web server that predicts defects in pre-mRNA splicing. Bioinformatics, 33: 2943 – 2945, 2017. DOI: 10.1093/bioinformatics/btx343

Kamil J. Cygan and William G. Fairbrother designed the algorithms. Kamil J. Cygan calculated L1 distance. Kamil J. Cygan performed validation of L1 distance. Kamil J. Cygan created mutation databases. Kamil J. Cygan designed and performed speed tests. Clayton H. Sanford created the visualization tool and user outputs. Kamil J. Cygan created the website. Kamil J. Cygan created figures. Kamil J. Cygan, Clayton H. Sanford, and William G. Fairbrother wrote the manuscript with contributions from all authors.

Section 4.5 is the result of my own machine learning modeling experiments performed on the initial MaPSy dataset. This section includes approaches and algorithms for the detection of splicing aberrations in both classification and regression settings. Section 4.5 remains unpublished.

4.1 Spliceman2 – a computational web server that predicts defects in pre-mRNA splicing

During the process of precursor mRNA (pre-mRNA) splicing, non-coding portions (introns) are removed and coding sections (exons) are joined together to form a mature message. The spliceosome, a macromolecular ribonucleoprotein complex, catalyzes the splicing reaction. During its step-wise assembly on the pre-mRNA, the spliceosome has to rely on information encoded in the intron/exon boundaries (3′ splice site and 5′ splice site) as well as around them (branch-point sequence, polypyrimidine tract, intronic splicing enhancers and silencers, and exonic splicing enhancers and silencers).
It is estimated that about a third of disease causing mutations also affect splicing by disrupting the signals needed for the correct assembly of the spliceosome.61

[Figure 4.1 panels: exonic region L1 distance and entire region L1 distance, each plotted against ESR assignment change between N, E and S categories.]

Figure 4.1: L1 distance correlates with element’s ESR assignment change. L1 distance was calculated using either only the exonic portion of the region (right) or the entire region (left). The x-axis shows ESR change and follows the hexamer assignment computed by Ke et al. (2011); N - neutral, E - enhancer, S - silencer. Error bars represent 95% confidence intervals.

In a previous study,61 we found that splicing elements have unique positional distributions around splice sites. The L1 distance metric was used to measure the difference between positional distributions of individual splicing elements (see Section 4.4 for a description of the L1 distance calculation). The L1 distance proved to be a reliable way of detecting aberrant splicing caused by single nucleotide variations. Point mutations that caused higher L1 distances were more likely to affect splicing in vitro than those with smaller distances61 (Section 4.4 and Figure 4.1). A year later, the Spliceman tool was released with the goal of using the L1 distance metric as a predictor of the effects of single point mutations on splicing.7 We decided to revisit Spliceman’s pipeline and add functionality that, together with L1 distance, would make better predictions and provide a more user-friendly interface.

[Figure 4.2 screenshot: input panel with a genome build selector (GRCh38/hg38 or GRCh37/hg19, the default), a VCF file upload, and a text box taking one space-separated variant per line in the form "chr variant_position(1-based) reference_allele alternative_allele", for example "chr20 2301308 T G".]

Figure 4.2: Spliceman landing page and input panel for Spliceman2. Spliceman2 now accepts VCF formatted entries as its input.

61 K. H. Lim et al. Proc Natl Acad Sci U S A, 108: 11093 – 8, 2011.
7 K. H. Lim and W. G. Fairbrother. Bioinformatics, 28: 1031 – 2, 2012.

4.2 Improvements

4.2.1 Input improvement

The original Spliceman program accepted inputs only in the FASTA format.7 This limited the number of mutations that could be processed at once and restricted integration of the tool with established variant processing pipelines. Spliceman2 takes its input in the industry-standard VCF format. Users can either enter data into a text box for a small number of mutations or upload a VCF file for larger inputs, which allows Spliceman2 to handle much larger datasets than its predecessor (Figure 4.2).

[Figure 4.3 panels: (a) results page with job progress and links to download results, visualize results, and download errors; (b) absolute value of log10(p-values) by mutation position, with variants marked as likely or unlikely to affect splicing (example: chr8: 143956641, C to A, 9 nt from the 5′ ss); (c) visualization for CYP11B1, chr8: 143956574 - 143956649, listing RBPs, motifs, z-score change and direction for the mutation, with a link to the genome browser.]

Figure 4.3: Spliceman results pages. (a) Results page for Spliceman2. (b) Plot of L1 distance results. (c) Fragment of visualization page generated by the Spliceman2 pipeline. b.s - binding site

4.2.2 Algorithm methodology

Spliceman2 processes its input through a multi-step pipeline.

Step (1) Input validation: The algorithm first checks that its inputs are valid coordinates in either the hg19 or the hg38 reference genome by verifying that each reported ref base in the input matches the entry at the same location in the reference.

Step (2) Intersection of input with mutation database: Each mutation in the input file is intersected with the list of pre-computed valid mutation coordinates (Section 4.4).

Step (3) RNA-binding protein (RBP) score calculation: To determine if a 13-mer region is significantly enriched with matches to an RBP motif,39 we first summed the position weight matrix (PWM) scores of each 7-mer in the region for all possible regions of size 13. We then transformed these values to z-scores. We deemed that a particular sequence has an enrichment of matches if the z-score for that sequence was >1.96. These steps were repeated for each RBP with a PWM width of 7 (n = 122). Spliceman2 returns only the top 5 RBPs that differ in enrichment between wild type and mutant sequences (either mutant or wild type is enriched for matches, but not both) together with the z-score difference between the two species.

39 D. Ray et al. Nature, 499: 172 – 7, 2013.

Step (4) Exonic splicing regulatory (ESR) sequences score calculation: To determine whether a particular mutation in a particular exon disrupts any exonic cis-elements, Spliceman2 incorporates the findings of (S. Ke et al. Quantitative evaluation of all hexamers as exonic splicing elements. Genome Res, 21: 1360 – 74, 2011. DOI: 10.1101/gr.119628.110) regarding the effects of hexamers on splicing.
In brief, we summed the reported ESR scores of each hexamer overlapping the mutation position in the wild type and mutant sequences separately. We then took the difference between these two values. A negative final score was taken to mean that the mutation created a silencing element or disrupted an enhancing element; a positive final score was taken to mean that it created an enhancing element or disrupted a silencing element. This value is reported only for exonic mutations.

Step (5) Storage and retrieval of the processed variations: Each analyzed mutation (chromosome, coordinate, and nucleotide change) is added to a local database to make future searches for the same mutation significantly faster (Section 4.4 and Figure 4.4). No other information about the submission is captured.

4.3 Outputs

Spliceman2 expands on Spliceman’s output by adding a visualization component and returning more relevant data. After processing, users have several options to view their results (Figure 4.3a). Results can be downloaded as a text file with columns of data for L1-distance, ESR, genome, and RBPs as well as information about the mutation and its location. Spliceman2 features data visualization: it plots the L1-distance results (Figure 4.3b), presents the data contained in the text file in a table, and creates diagrams displaying the locations of mutations on exons (Figure 4.3c). Users can also download all mutations that could not be processed as an error text file with a description of each individual failure event.

4.4 Methods

In this methods section, we describe the L1 distance calculation, the validation of L1 distance with the Ke et al. dataset (S. Ke et al. Quantitative evaluation of all hexamers as exonic splicing elements. Genome Res, 21: 1360–74, 2011. DOI: 10.1101/gr.119628.110), the creation of valid mutation databases for Spliceman2, and the testing of Spliceman2 speed performance.
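As an illustration of the ESR scoring in Step (4) of Section 4.2.2, the following minimal sketch sums hexamer scores over every window that overlaps the variant position and reports the mutant minus wild type difference. The function name `esr_delta`, the toy score table, and the sign convention (positive scores for enhancers, negative scores for silencers, difference taken as mutant minus wild type) are assumptions made for this example. Real scores would come from the Ke et al. hexamer table, and hexamers missing from the table are scored as 0 here.

```python
def esr_delta(wild_type, mutant, pos, esr_scores, k=6):
    """Return sum(mutant k-mer scores) - sum(wild type k-mer scores).

    Only k-mers that overlap the 0-based variant position `pos` are counted.
    `esr_scores` maps a k-mer to its score (assumed: >0 enhancer, <0 silencer).
    """
    if len(wild_type) != len(mutant):
        raise ValueError("sequences must have equal length (substitutions only)")
    if not 0 <= pos < len(wild_type):
        raise ValueError("variant position is outside the sequence")

    def overlapping_sum(seq):
        first_start = max(0, pos - k + 1)      # leftmost window covering pos
        last_start = min(pos, len(seq) - k)    # rightmost window covering pos
        total = 0.0
        for start in range(first_start, last_start + 1):
            total += esr_scores.get(seq[start:start + k], 0.0)
        return total

    return overlapping_sum(mutant) - overlapping_sum(wild_type)


# Toy example (hypothetical scores): one enhancer-like hexamer.
toy_scores = {"AAAAAA": 1.0}
toy_wt = "AAAAAAAAAAA"
toy_mut = "AAAAACAAAAA"   # A>C substitution at 0-based position 5
toy_delta = esr_delta(toy_wt, toy_mut, 5, toy_scores)
```

Under the assumed sign convention, a negative `toy_delta` would be read as the loss of an enhancing element (or the gain of a silencing one), matching the interpretation given in Step (4).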
4.4.1 Definition and calculation of L1 distance

An exon database for the hg19 version of the human genome was built from the RefSeq annotation (n = 197,082). Duplicated entries were removed, and each sequence was divided into two distinct regions: upstream intron (up to 200 intronic and 100 exonic nucleotides around the 3′ splice site (ss)) and downstream intron (up to 200 intronic and 100 exonic nucleotides around the 5′ ss). In the case that the intronic or exonic length was less than 400 or 200 nucleotides, respectively, the sequence was divided in half and each half was assigned to its nearest splice site.

[Figure 4.4 plot: "Spliceman2 Speed Test Results". x-axis: input size (number of mutations); y-axis: Spliceman2 time to process the input (in seconds); run types: all mutations already in the Spliceman2 database, and no mutations in the Spliceman2 database.]

Figure 4.4: Spliceman2 speed test performance. Spliceman2 performance (y-axis) as the input size (x-axis) increases. In salmon: Spliceman2 performance if all mutations in the input have been processed previously and recorded in the database. In teal: Spliceman2 performance if none of the mutations in the input have been processed previously. Error bars represent standard error of the mean (SEM).

For each hexamer, the counting algorithm traversed the exon database and recorded the occurrences of that hexamer at 600 different positions relative to splice sites. Repeating this procedure for all hexamers generated 4,096 feature vectors. Spliceman2 uses the L1 distance metric to quantify the ‘closeness’ between two feature vectors.
The L1 distance was calculated as the sum of the absolute differences in feature vectors at each of the 600 positions using the following equation:

\[ d_1(p, q) = \sum_{i=-200}^{399} \lvert p_i - q_i \rvert \qquad (4.1) \]

4.4.2 Validation of L1 distance

To determine whether L1 distance correlates with the experimental hexamer data,158 as well as to detect the influence of exonic bias in the experimental data (the study captured only the influence of exonic hexamers), we computed two types of L1 distance: one based only on the exonic portion of the exon database and one based on the entire exon database. Each type of L1 distance was computed as previously described (Section 4.4.1). The two types of calculation were performed for all possible pairs of hexamers with edit distance = 1. Figure 4.1 shows good agreement between the experimental data and the L1 distance calculation. The neutral changes between elements have the lowest L1 distance while the most extreme changes (i.e. changing enhancer to silencer or vice versa) have the largest L1 distance.

4.4.3 Creation of valid mutation databases for Spliceman2

Exon databases were built from Ensembl annotations of the GRCh37 and GRCh38 assemblies using the EnsDb.Hsapiens.v75 and EnsDb.Hsapiens.v86 packages for R. Briefly, we used the ‘exonsBy’ function to extract all exons from all transcripts listed in these annotations. Next, we excluded all terminal exons, as the processing of these exons relies on different types of information than the processing of internal exons. Finally, for each exon in the list, we extracted an additional 75 nucleotides (nts) of flanking intronic sequence from each end. In the case where the intronic sequence length was < 75 nts, the entire sequence was included for each adjacent exon.
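The L1 distance of Section 4.4.1 and the hexamer pairing of Section 4.4.2 can be sketched in a few lines. The function names and the toy vectors below are hypothetical and chosen only for illustration; real feature vectors would hold 600 positional counts per hexamer. For two hexamers of equal length, an edit distance of 1 corresponds to a single substitution, which is what `single_substitution_neighbors` enumerates.

```python
def l1_distance(p, q):
    """Sum of absolute element-wise differences between two feature vectors."""
    if len(p) != len(q):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(p, q))


def single_substitution_neighbors(hexamer, alphabet="ACGT"):
    """All sequences that differ from `hexamer` by exactly one substitution."""
    neighbors = []
    for i, base in enumerate(hexamer):
        for alt in alphabet:
            if alt != base:
                neighbors.append(hexamer[:i] + alt + hexamer[i + 1:])
    return neighbors


# Toy positional-count vectors (4 positions instead of 600).
toy_p = [3, 0, 2, 5]
toy_q = [1, 0, 4, 5]
toy_distance = l1_distance(toy_p, toy_q)

# A hexamer has 6 positions x 3 alternative bases = 18 such neighbors.
toy_neighbors = single_substitution_neighbors("AAAAAA")
```

In this framing, a variant that converts one hexamer into a neighbor with a large L1 distance changes the positional distribution profile the most, which is the intuition behind using the metric to rank variants.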
4.4.4 Spliceman2 speed test preparation and results

To determine whether Spliceman2 is capable of processing large inputs, we randomly generated five sets of mutations of the following sizes: 1, 10, 100, 1,000, and 10,000 mutations. Each file was run three times on an empty database and three times on a database that had already recorded the Spliceman2 results for all mutations in the input file. Multiple runs of each file were collapsed and plotted (Figure 4.4) to estimate the mean performance at each input size. Recording mutation results considerably improves Spliceman2 performance.

4.5 Inference of splicing mutations using machine learning

Recall the random forest machine learning model from Chapter 2. Figure 2.4 displays the AUC performance and important features of the published model. To improve the performance of the classifier we incorporated thousands of new features1,33,34,39,158,193 including predicted binding sites for 175 RBPs, differences in agreement to PWMs of these RBPs between mutant and wild type species, binding of enhancers/silencers in different regions of the reporter and differences in those values between the mutant and wild type species, nucleotide/di-nucleotide/tri-nucleotide frequencies in different regions of the reporter construct, and occurrence of evolutionarily conserved sequences in the reporter construct. In addition to adding more features, we also tested multiple machine learning algorithms in both classification and regression settings. The following sections describe the results achieved.

1 Z. Wang et al. Cell, 119: 831–45, 2004.
33 W. G. Fairbrother et al. Nucleic Acids Res, 32: W187–90, 2004.
34 Y. Wang et al. Nat Struct Mol Biol, 19: 1044–52, 2012.
193 C. Zhang et al. Proc Natl Acad Sci U S A, 105: 5797–802, 2008.

4.5.1 Classification approach

The input file for the classification task contained about 1,800 features. We performed the analysis of the model improvements on in vivo MaPSy data. Only the mutant/wild type pairs that had at least 100 reads in the input fraction for each of the species were included. For the classification task we decided to use LightGBM (https://github.com/Microsoft/LightGBM), a tree boosting algorithm from Microsoft. LightGBM has been shown to be very robust and fast. Recall that the initial implementation of the model used the random forest algorithm. The parameters that we defined for the boosted approach were the following: number of leaves = 7; learning rate = 0.1; and number of trees = 500 with an option of early stopping after 100 trees (to avoid over-fitting). We split the data into training (80% of samples) and test sets (20% of samples). The performance achieved by the boosted model was evaluated using the AUC metric. The new model achieved an AUC of 83.5%, a significant improvement over the previously published MaPSy result (AUC = 81.5%).

[Figure 4.5 plot: ROC curves (sensitivity versus specificity) for the models vivo_with_wt_eff and vivo_without_wt_eff; p-value = 0.0013325.]

Figure 4.5: Improved performance of the model including wild type splicing efficiency. The model that includes the splicing efficiency of the wild type significantly outperforms the model that does not.

Looking at the important features of the published classifier, we noticed that the majority were exon-level features (see Section 2.2.3). We therefore decided to add one more feature, namely the splicing efficiency of the wild type exon calculated from the MaPSy data. If the global exon-level properties do sensitize an exon to ESMs (i.e. variants in “weak exons” are more likely to disrupt splicing), the wild type splicing efficiency should be an important predictor of ESMs.
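The evaluation protocol described above (an 80%/20% train/test split scored by AUC) can be illustrated without any machine learning library. The sketch below is not the LightGBM pipeline itself; `split_train_test` and `auc` are hypothetical helper names, and the labels and scores are toy values. It relies on the rank interpretation of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and hold out `test_fraction` of the samples."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Toy data: 10 sample indices, and 4 scored examples (1 = ESM, 0 = not an ESM).
train, test = split_train_test(range(10))
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be preferable for a panel of several thousand variants.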
After another run of the algorithm, the model that included the wild type splicing efficiency reached an AUC of 87.2%, a further significant improvement in classification power (Figure 4.5). As expected, the most important feature of this last model was indeed the splicing efficiency of the wild type (Table C.1). Even more strikingly, when we compared the feature lists of the two boosted models, we found that, together with the increase in AUC of about 4%, the list of useful features shrank by almost half (389 versus 192) (Table C.1 and Table C.2). One possible cause of this shrinkage is that the wild type splicing efficiency simultaneously captures information that is otherwise encoded by a large number of features that the model ranks as less important.

4.5.2 Regression approach and proposed Spliceman3 implementation

For the regression approach, we decided to use H2O (https://www.h2o.ai/), a machine learning platform that incorporates multiple machine learning algorithms (including random forests and boosting trees). To develop a regression model to solve for the in vitro log2 mut/wt splice ratio, we used the same input file as in the classification setting above (the list of features did not include wild type splicing efficiency). We did not perform any additional filtering of species based on coverage. We ran two algorithms, random forests and gradient boosting trees, as implemented in H2O. To our surprise, only the boosted trees algorithm was able to correctly predict the direction and effect of single nucleotide variants on splicing in MaPSy (Pearson’s r = 0.57 between actual and predicted splicing ratios) (Figure 4.6). Random forests, on the other hand, were unable to correctly split the feature space and arrive at a sufficient regression solution (Figure 4.6). This result is quite striking, given that the random forest was able to correctly classify ESMs in the published classifier.

[Figure 4.6 plot: two scatter panels of actual (y-axis) versus predicted (x-axis) Mt/wt log2 in vitro ratio; r = 0.57 (left) and r = 0.45 (right).]

Figure 4.6: Performance of GBM and RF regression models. Gradient boosting model was able to explain most of the in vitro MaPSy data (left). Random forests were unable to generate a correct regression model for the in vitro MaPSy 5K panel data (right).

Next, we propose to implement the Spliceman3 classification and regression model in the following way:

• We will use the existing Spliceman2 backbone (back-end and front-end) implemented on Google Cloud. This way, Spliceman3 would be primed to validate users’ input fields and input files.

• Most endogenous exonic features used by the model would be precomputed and stored in SQL databases for quick access.

• Most variant position features would be calculated on the fly, as it would be computationally infeasible at this time to store that much information for all possible variations inside a database.

• In contrast to Spliceman2, Spliceman3 would only return a text file to the user. The text file would contain the percent of confidence that a particular variant affects splicing and the predicted in vivo mut/wt splice ratio.

As we have seen, many improvements are still needed to achieve close-to-perfect performance of single nucleotide variant classifiers. Both approaches (classification and regression) might benefit further from an increase in the number of datasets/data points available for learning. We describe potential new panel/library designs in Chapter 5.

Chapter 5 Conclusions and Future Directions

In the last decade, the development of new sequencing platforms has driven down the cost of whole-genome and whole-exome sequencing. Consequently, it has become easier for individuals to obtain their own genomes, and for large consortia like the Simons Foundation Autism Research Initiative to sequence large cohorts of families. A typical whole-exome sequencing run will return between 300 and 600 variants of unknown significance. Clearly, our ability to classify variants as causal has not kept pace with the amount of sequencing data returned by research groups. In this thesis, multiple methods for identifying allele-specific splicing aberrations were developed. By utilizing large-scale data analysis, novel high-throughput splicing reporter systems, and machine learning approaches, we were able to discover patterns that might shed some light on the mechanisms of several diseases. The techniques described in this thesis may also bolster variant annotation efforts so that they can match the pace of next generation sequencing data output.

Chapters 1 and 2 form the first part of the thesis. They address the influence of multiple variables on the splicing process including splicing signals (Chapter 1), RNA-binding proteins (Chapters 1 and 2), and single nucleotide variants (Chapter 2). Chapter 2 also presents extensive data from our MaPSy 5K study of 4,964 disease variants and describes in detail our novel high-throughput in vivo and in vitro splicing reporter system. The correlation between the in vivo and in vitro results was reported as fair (Pearson’s r = 0.55). A higher degree of agreement was reported for mutations that affect signals present in both assays (e.g. variants closer to the 3′ splice site). The reliance on significant results both in vivo and in vitro constituted a very conservative approach to detect causal splicing variants.
Such a high-throughput annotation undertaking has not been presented before, especially considering that mechanistic signatures of splicing mutants (i.e. at which steps of the spliceosomal assembly each mutant disrupts splicing) were efficiently partitioned. This study unveiled that exonic disease-causing mutations often cause aberrant splicing (10% of cases). In addition, the pilot study reported that exonic splicing mutations are also non-uniformly distributed across disease genes, a property that we further exploited in Chapter 3. The same study compared the loss or gain of previously characterized cis-elements (RNA-binding proteins’ binding sites) to loss or gain of splicing in both in vitro and in vivo splicing reporter assays. Clustering of functional RBP profiles resulted in the discovery of two sizable sets: exonic splicing activators and repressors. Interestingly, when a similar analysis was performed on the intronic portions of wild type species (Figure A.8), we also recovered two large clusters containing intronic splicing repressors and activators. Overlapping the sets revealed that most exonic splicing repressors activate splicing when bound in introns and most exonic splicing activators repress splicing when bound in introns. The result reinforced our understanding of the role of RBPs in splicing and the notion that the splicing factors behave in a highly position-dependent manner.

In another study, we considered the order in which adjacent introns are spliced out from the precursor mRNA. Whereas many useful features were found to predispose a particular exon to either splice first or last (e.g. intron length), the large-scale analysis of RBPs and their binding sites revealed that the densities of some RBP sites are associated with the order of intron removal. While powerful, mini-gene approaches may not reflect the consequences that mutations have in the native setting.
Chapter 2 acknowledges some of the limitations of the pilot study and provides the technical aspects that help to overcome some of them. Despite the drawbacks of the mini-gene approaches, the published data indicate an 82% agreement with in vivo data based on analyzed patients’ samples. While we found only 32 instances where we could verify the phenotype in vivo, similar false-positive and false-negative error rates were recovered (<10%).

Part two of this thesis (Chapter 3) addresses the discovery of polymorphisms and variants that disrupt splicing. Clinical sequencing, as mentioned previously, faces a very pressing need to discover potentially etiologic regulatory region variants, especially now that the era of personal medicine is quickly approaching. We discovered selection signatures on splicing signals and found that both de novo and low frequency variants might play a significant role in many hereditary diseases. One such disease, retinoblastoma, was enriched for hereditary disease alleles that map to RB1 splice sites. In addition, the same disease showed an excess of mutations that affect splicing as tested by our pilot 5K MaPSy panel. As splicing mutants may be enriched in other hereditary diseases, additional priority should be allocated to these types of alleles while interpreting clinical sequencing experiments. Moreover, it was shown that 197 variants that fall in RB1 have enough potential to disrupt splicing to warrant subsequent experiments. Additionally, this section of the thesis determined signatures of selection on splicing signals, and across functional and non-functional regions of transcripts (exons and introns). Marks of selection on deleterious variants could be clearly seen in contrasting low to high frequency variants (Figure 3.5). While this analysis provides clear evidence that most variants are not causal, a sufficient fraction of rare alleles might contribute to disease risks.
To check that hypothesis, we analyzed 764 de novo mutations present in ASD patients and listed in the SFARI Simplex Collection in an expansion of the pilot study. Even though a lower fraction (7.2% compared to 10%) of de novo ASD variants disrupted splicing compared to the disease-causing alleles from the pilot study, all of the internal controls (12 variants that are categorized as splice site mutations by the SFARI consortium) affected splicing. This again confirms the usefulness of MaPSy and its ability to predict splicing defects.

Chapter 4 is dedicated to the development of visualization/prediction tools. The improvements to Spliceman2 allow for much larger inputs than previously available. By invoking calls to a few Perl scripts hosted in the cloud, Spliceman2 is capable of not only processing users’ inputs encoded in the industry-standard VCF format, but also quickly returning predictions of how likely mutations are to disrupt splicing. Additionally, the RNAcompete39 dataset was integrated and provides predictions about which trans-acting factor binding sites are disrupted or created. Visualization of sequence variants within exons and introns completes the list of newly implemented features. The new features in Spliceman2 allow users to better understand the effects of single nucleotide variations on splicing.

39 D. Ray et al. Nature, 499: 172–7, 2013.

Finally, after the required filtering steps were performed, it was shown that the data from the pilot 5K MaPSy panel permitted the development of multiple machine learning models (both classification and regression types). While the random forest classifier is effective and powerful (recall that the published model had an average prediction performance of 81%), we wanted to further improve it. The design and results from the enhanced version of the classifier are also described in Chapter 4.
By increasing the number of features to about 1,800 and switching the algorithm from random forest to boosting trees, the classification performance increased by approximately 2% as judged by the AUC metric. The preliminary data described in Chapter 2 suggest that many of the features that are useful in predicting which mutations disrupt splicing are exon-level features. Our updated machine learning models from Chapter 4 suggest that the splicing profile of the wild type version of the exon contributes greatly to both the classifier and the regression solver. In fact, inclusion of this one feature further improved the classifier’s AUC to above 87%. The importance of this feature was demonstrated again when the feature lists of the two models (one including and one excluding the feature) were compared. The wild type splicing efficiency decreased the number of relevant features of the classifier by about half. This suggests that the information content of the feature is effectively a combination of many other features. Finally, it was shown that boosting trees are capable of partitioning the feature space of the MaPSy in vitro data and solving for the regression of the mutant/wild type ratio. Random forest was incapable of performing the same task.

The multiple methods addressed in this work are capable of identifying allele-specific splicing aberrations, and we believe that they constitute a systematic route towards annotating sequence variations at a pace similar to that at which sequencing data are generated in clinics. In the following sections, we discuss the potential improvements of our classifier by means of external datasets and internal assessment of splicing phenotypes of additional sets of variants. We also briefly touch on the issue of using known annotations like conservation scores.

5.1 Usefulness of conservation scores in classification tasks

Conservation information is not causal in terms of determining the outcome of the splicing process.
The spliceosome is not capable of distinguishing whether a single nucleotide variant falls on a position that is highly conserved. Nevertheless, we found that our published machine learning model, as well as the other published classifiers, benefits from the inclusion of the conservation data. Recall that the PhastCons conservation of the wild type was one of the 17 main features used by the MaPSy classifier. There are several reasons why the inclusion of conservation improves the classification of splicing variants. First, conservation scores are usually correlated with splicing efficiency (see Figure A.6). Second, if we weight the sequence features by conservation, we can improve the classification task by including only the relevant features in our model (e.g. highly conserved binding sites). There are, however, some disadvantages of using conservation scores, the main one being that we would miss some informative events, for example RBP binding sites in newly emerged exons. Given the above reasons, we would advocate for limited use of conservation scores in statistical learning, especially given that similar (or better) classification performance can be achieved without supplying that feature information.

5.2 Potential improvements with CLIP/KD data

In the following section, we discuss how the classification model described in this thesis could be applied to a larger number of cell lines. Recall that the model was generated from the MaPSy data, which in turn was gathered from only one cell line using common flanking exons. Tissue-specific alternative splicing is regulated to a large extent by the concentration of different RBPs that bind to specific cis-elements surrounding a specific event site. The cis-elements are common to all cell types. Therefore, for each cell type there has to be a set of ‘true’ binding sites and ‘false’ binding sites for each RBP.
In order to develop a more general model which depends on the local environment of a cell, we would have to be aware of these cell-specific true binding events. Thanks to the ENCODE consortium (https://www.encodeproject.org/), it is now possible to map and characterize RNA elements recognized by a large collection of human RBPs in two cell lines, namely K562 and HepG2. More cell lines are expected to follow soon. The binding behaviors of the proteins and the influence of variants on the splicing outcome may be approximated more accurately by taking advantage of the experimental data provided by ENCODE that are specifically aimed at measuring binding, such as 57 RBPs with enhanced CLIP (eCLIP) and knockdown/RNA-seq performed in the same cell line. MaPSy panel(s) would have to be repeated in the same cell lines for which the ENCODE data are available in order to develop a fresh splicing model. Following the development of these RNA-binding models for each of the available cell lines, they could be linked with the already developed splicing model in order to account for the true binding events and for how their disruption or creation by sequence variants affects splicing.

5.3 Potential improvements with better datasets

There are two potential ways to improve the splicing classifier described in this thesis, namely better statistical approaches and better datasets. Thanks to MaPSy, it has become feasible to perform high-throughput in vivo and in vitro splicing experiments. Subsequent MaPSy runs on different sets of variants could be used to improve the splicing classifiers. Furthermore, incorporating intronic variant data can potentially take the predictive ability of the splicing classifier to a higher level. In the following section, we propose a set of new libraries that could be utilized to improve splicing predictions.

1. Perform minigene assays in a variety of cell lines.
As described in the previous section, cell-line specific splicing events are linked to the cell’s local environment. By running MaPSy on the same set of variants in different cell lines, we would look for robust as well as subtle signal changes. The latter would probably be linked with cell-specific RBP binding events and the former with, for example, binding events of more ubiquitously expressed trans-factors, secondary structure creation/disruption, and disruption of the canonical splicing elements.

2. Perform the splicing assays on the exhaustive list of human internal exons. The MaPSy classifier and subsequent improvements of the model suggest that many of the useful features are exon-level features. As seen from the analysis of the relevant features, by including wild type splicing efficiency we were able not only to improve the predictive power of the classifier, but also to remove many features that contributed little information. Therefore, we believe that collecting the behavior information for all wild type exons in the genome would aid the splicing field in developing better classification tools.

3. Perform minigene assays using the same variant sets with inclusion of different flanking exons. In order to correctly estimate the influence of variants tested in MaPSy on splicing, it would be extremely helpful to assay the variants tested in the pilot study in a different flanking sequence context. In addition, this dataset would address one of the limitations of the pilot MaPSy study, namely the inability to detect the effect of flanking exonic sequences on splicing.

4. Perform the splicing assay on intronic variants and small insertions and deletions. This high-quality training set that explicitly tests substitutions as well as deletions could be used as a gold standard to test multiple already developed splicing classifiers and compare them to each other.
This set would also make it possible to develop the first splicing classifier that models insertions and deletions.

5. Lastly, design a larger library of de novo and inherited exonic and intronic variants that are reported in ASD patients. Our initial pilot study of 764 exonic de novo ASD sequence variants might be too limited to truly estimate the influence and proportion of de novo splicing mutations in this heritable disorder. Additionally, machine learning models generally improve as the training set grows. By including a large enough number of intronic and exonic sequence variants, we should be able to boost the performance of the classifier simply by supplying it with a more diverse set of variants.

Appendix A Identification of Splicing Defects

Table A.1: SNPs evaluated with MaPSy.

dbSNP ref alt ESM
rs3207775 G A 0 rs7099565 G A 0 rs1050704 T C 0 rs1056782 A G 0 rs2230395 T G 0 rs2234978 T C 0 rs1326331 C T 0 rs726176 T C 0 rs2229700 T C 0 rs4068083 C A 0 rs4641 C T 0 rs1802778 G A 0 rs11558511 T C 0 rs1046934 A C 0 rs201594599 A C 0 rs4704 G A 0 rs11057401 T A 0 rs3213764 A G 0 rs116579083 G T 0 rs17599 A C 0 rs1923950 G A 0 rs11052110 G A 0 rs2272238 G A 0 rs61752088 A G 0 rs12410563 G C 0 rs6580870 A G 0 rs1043879 T C 0 rs34308410 T C 0 rs4902 A G 0 rs4075325 C T 0 rs1147096 C T 0 rs35512811 G A 1 rs2305293 C T 0 rs538229 A G 0 rs9509307 G A 0 rs9318554 G A 0 rs9578751 C T 0 rs2296645 A G 0 rs3764056 A G 0 rs3212102 C T 0 rs2295682 C T 1 rs343376 A G 0 rs3100906 T A 0 rs2074932 G A 1 rs1128468 A G 0 rs2278857 T C 0 rs11638215 A C 0 rs10220843 T C 0 rs10851726 G A 0 rs3812908 T C 0 rs1435163 C A 0 rs5030691 G A 0 rs2286466 A G 0 rs11866002 C T 0 rs5473 G A 0 rs721005 C G 0 rs9930567 G A 0 rs2286873 A G 0 rs3760454 T C 0 rs4601 A G 0 rs238239 T C 0 rs34818467 A G 0 rs8080100 C T 0 rs883541 G A 0 rs2286562 C T 0 rs9894429 C T 0 rs1130674 A G 0 rs657138 G A 0 rs3737374 T C 0
rs34227891 A G 0 rs2003149 C T 0 rs2612086 A G 0 rs2278161 T C 0 rs906807 T C 0 rs12488 T C 0 rs3745969 A G 0 rs3745859 C T 0 rs922063 C G 0 rs1064257 C G 0 Continued on next page Appendix A. Identification of Splicing Defects 99 Table A.1 – continued from previous page dbSNP ref alt ESM rs7951 G A 0 rs428453 C G 0 rs61738004 G A 0 rs16982513 A G 0 rs1178016 C T 0 rs1130146 G A 0 rs8957 G T 0 rs2294597 C T 0 rs1210133 T C 0 rs2155722 G A 1 rs2307394 T C 0 rs1155779 G A 0 rs17783344 T G 0 rs10176588 A G 0 rs116676813 C T 0 rs2244492 C T 0 rs788023 T C 0 rs2372536 C G 0 rs3796028 G A 0 rs2229814 G A 0 rs28930679 G A 0 rs1050224 G A 0 rs12724 A G 0 rs2303291 C T 0 rs738479 T C 0 rs226524 T C 0 rs778155 A G 0 rs1137930 A G 0 rs2287328 T C 0 rs10936352 C T 0 rs3772126 G A 0 rs2070178 G A 0 rs4548 C T 0 rs4303883 G A 0 rs41272321 T G 0 rs2280031 T C 0 rs61741194 A G 0 rs6762208 C A 0 rs364519 C A 0 rs2130407 A G 0 rs17850206 T C 0 rs36078246 T G 0 rs1141601 T C 0 rs3796386 G A 0 rs1138536 T C 0 rs9324 T C 0 rs1051485 T C 0 rs7637449 G A 0 rs17058639 C T 0 rs1522384 T C 0 rs4361282 C G 0 rs2172257 C T 0 rs26821 A C 0 rs25640 G A 0 rs10073922 G A 0 rs2304052 T C 0 rs2578377 C T 0 rs4702269 A G 0 rs26635 G T 0 rs275819 C T 0 rs706679 C T 0 rs28450427 G T 0 rs2973566 G A 0 rs16872235 T A 0 rs17852781 C T 0 rs17085249 G A 0 rs314359 T C 0 rs11556986 A T 0 rs34349457 T C 0 rs5574 C T 0 rs751296046 T C 0 rs3213709 A G 0 rs3750117 A G 0 rs2595701 A G 0 rs2230197 T C 0 rs1138495 T C 0 rs1800222 T C 0 rs17803441 C T 0 rs1803250 T C 0 rs7003969 G A 0 rs2292741 T C 1 Continued on next page 100 Table A.1 – continued from previous page dbSNP ref alt ESM rs2272736 G A 0 rs1129660 A G 0 rs1129152 C T 0 rs2230808 T C 0 rs16916040 C T 0 rs11552582 A G 0 rs3814547 C T 0 rs17519205 A G 0 rs3739576 G T 1 rs45452691 C G 0 rs6622126 G A 0 rs17318100 G A 0 rs1141608 G A 0 rs8094 C T 0 rs2230488 G T 0 rs2071932 C T 0 rs28938169 G A 0 rs4630153 C T 0 rs1926447 A G 0 rs25640 G C 0 rs34612342 T C 0 
rs1127354 C A 1
rs16945474 A G 0
rs11548937 C T 0
rs41507953 A G 0
rs751141 G A 0
rs28934585 C T 1
rs200220210 G A 0
rs121434580 A C 0
rs2476601 A G 0
rs2230037 G A 0
rs4746 T G 0
rs2233578 G A 0
rs28363284 T C 0
rs1126742 A G 0
rs5742912 A G 0
rs7946 C T 0
rs35366573 C T 0
rs144041067 C G 0
rs73569592 A C 0
rs56257827 G T 0
rs1676486 A G 0
rs2108622 C T 0
rs2227914 T C 0
rs6232 T C 0
rs738409 C G 0
rs11640851 C A 0
rs33996649 C T 0
rs199476317 G A 0
rs2229388 G C 0
rs2295283 A G 0
rs11574 T C 0
rs8065080 T C 0
rs2288904 A G 0
rs11553473 G A 0
rs751120439 G A 0
rs121965091 A G 0
rs76844316 T G 0
rs17602729 G A 0
rs2229137 A C 0
rs1800460 C T 0
rs854560 A T 0
rs5882 G A 0
rs11556084 G A 0
rs4793 A G 0
rs4717 A T 0
rs11554130 C T 0
rs200692117 G A 0
rs2230658 A G 0
rs2230659 C T 0
rs66626885 G T 0

Table A.2: Summary of MaPSy validation in patient samples
HGMD or SNP id | gene | hg19 position | nucleotide change | AA change | Library result | validation result | Patient tissue/RNA source | agreement | Reference / source
CM990215 | ATM | chr11:108186638:+ | c.6095G>A | p.R2032K | loss of splicing in vivo (0 fold) and in vitro (0 fold) | exon 44 skipping | lymphoblastoid cell line | TP | Teraoka et al., 1999
CM980147 | ATM | chr11:108183151:+ | c.5932G>T | p.E1978* | loss of splicing in vivo (0 fold) and in vitro (0 fold), but neither were significant due to low counts | exon 42 skipping | lymphoblastoid cell line | FN | Teraoka et al., 1999
CM940140 | ATP7A | chrX:77266736:+ | c.1933C>T | p.R645* | loss of splicing in vivo (0.4 fold) and in vitro (0.4 fold) | exon 8 skipping | fibroblasts | TP | Coriell Repository (Figure A.3)
CM994455 | BRCA1 | chr17:41258474:- | c.211A>G | p.R71G | loss of splicing in vivo (0 fold) and in vitro (0.5 fold) | exon 5 skipping | T-lymphocytes | TP | Sanz et al., 2010
CM950153 | BRCA1 | chr17:41215920:- | c.5123C>A | p.A1708E | loss of splicing in vivo (0.3 fold) and in vitro (0.3 fold) | exon 18 skipping | T-lymphocytes | TP | Sanz et al., 2010
CM980231 | BRCA1 | chr17:41215963:- | c.5080G>T | p.E1694* | loss of splicing in vivo (0.03 fold) and in vitro (0.1 fold) | exon 18 skipping | T-lymphocytes, minigene of exons 17-19, CRISPR/Cas9 engineered mutation | TP | Mazoyer et al., 1998, Liu et al., 2001, Findlay et al., 2014
CM101977 | BRCA1 | chr17:41267746:- | c.131G>A | p.C44Y | no change | no change | lymphoblastoid cell line | TN | Amanda Toland (data not shown)
CM053138 | BRCA2 | chr13:32921033:+ | c.673A>G | p.T225A | no change | no change | T-lymphocytes | TN | Sharp et al., 2004
CM042309 | BRCA2 | chr13:32903621:+ | c.7007G>A | p.R2336H | loss of splicing in vivo (0.03 fold) and in vitro (0.03 fold) | exon 13 skipping | T-lymphocytes | TP | Sanz et al., 2010
CM051414 | COL1A1 | chr9:130579482:- | c.608G>T | p.G203V | no change | no change | fibroblast | TN | Joan Marini (data not shown)
CM051415 | COL1A1 | chr9:130579482:- | c.634G>A | p.G212R | loss of splicing in vivo (0.1 fold) and in vitro (0.2 fold) | no change | fibroblast | FP | Joan Marini (data not shown)
CM111552 | ENG | chr9:130579482:- | c.1687G>T | p.E563* | loss of splicing in vivo (0.38 fold) and in vitro (0.17 fold) | loss of exon 13 | T-lymphocytes | TP | Pinar Bayrack-Toydemir and Jamie McDonald (Figure A.3)
CM950449 | FBN1 | chr15:48729559:- | c.6339T>G | p.Y2113* | loss of splicing in vivo (0.04 fold) and in vitro (0.18 fold) | exon 51 skipping | fibroblasts | TP | Dietz et al., 1993
CM930246 | FBN1 | chr15:48758037:- | c.4766G>T | p.C1589F | loss of splicing in vivo (0.3 fold) and in vitro (0.6 fold) | no change | fibroblasts | FP | Coriell Repository (data not shown)
CM910169 | GALT | chr9:34648167:+ | c.563A>G | p.Q188R | loss of splicing in vivo (0.01 fold) and in vitro (0.4 fold) | missplicing | lymphoblastoid cell line | TP | Coriell Repository (Figure A.3)
CM002632 | HPRT1 | chrX:133632695:+ | c.590A>T | p.E197V | loss of splicing in vivo (0 fold) and in vitro (0.3 fold) | exon 8 skipping | lymphoblastoid cell line | TP | Tu et al., 2000
CM002633 | HPRT1 | chrX:133632707:+ | c.602A>T | p.D201V | loss of splicing in vivo (0.2 fold) and in vitro (0.3 fold) | exon 8 skipping | T-lymphocytes | TP | Tu et al., 2000
CM004361 | HPRT1 | chrX:133632674:+ | c.569G>A | p.G190E | no change | no change | lymphoblastoid cell line | TN | Tu et al., 2000
CM920361 | HPRT1 | chrX:133632706:+ | c.601G>A | p.D201N | no change | no change | lymphoblastoid cell line | TN | Tu et al., 2000
CM920362 | HPRT1 | chrX:133632706:+ | c.601G>T | p.D201Y | loss of splicing in vivo (0.18 fold) but not in vitro | no change | lymphoblastoid cell line and T lymphocytes | TN | Tu et al., 2000
CM022413 | ITPA | chr20:3193842:+ | c.94C>A | p.P32T | loss of splicing in vivo (0.1 fold) and in vitro (0.3 fold) | exon 3 skipping | lymphoblastoid cell line and T-lymphocytes | TP | Coriell Repository (Figure A.3), Arenas et al., 2007
CM045463 | MLH1 | chr3:37053590:+ | c.677G>A | p.R226Q | loss of splicing in vivo (0 fold) and in vitro (0.01 fold) | loss of exon 8 | minigene and whole blood | TP | Pagenstecher et al., 2006, Tournier et al., 2008
CM960965 | MLH1 | chr3:37045935:+ | c.350C>T | p.T117M | no change | no change | minigene and lymphoblastoid cell line | TN | Auclair et al., 2006, Tournier et al., 2008
CM068362 | MLH1 | chr3:37045923:+ | c.338T>A | p.V113D | loss of splicing in vivo (0.6 fold) but no change in vitro | no change | lymphoblastoid cell line | TN | Auclair et al., 2006
CM000176 | MLH1 | chr3:37053550:+ | c.637G>A | p.V213M | loss of splicing in vivo (0.6 fold) and in vitro (0.65 fold) | no change | lymphoblastoid cell line | FP | Auclair et al., 2006
CM041068 | PSEN1 | chr14:73653628:+ | c.548G>T | p.G183V | loss of splicing in vivo (0.3 fold) and in vitro (0.1 fold) | loss of exon 6 or loss of exons 6 and 7 | frontal cortex (post-mortem), brain-specific knock-in mice (neocortex, hippocampus and cerebellum) | TP | Dermaut et al., 2004, Watanabe et al., 2012
CM064173 | PSEN1 | chr14:73673096:+ | c.871A>C | p.T291P | no change | modest exon 9 skipping | whole blood | FN | Dumanchin et al., 2006
CM054825 | PSEN1 | chr14:73653589:+ | c.509C>T | p.S170F | no change in vivo but loss of splicing in vitro (0.5 fold) | no change | frontal cortex and cerebellum (post-mortem) | TN | Alison Goate (data not shown)
CM961108 | PHKA2 | chrX:18958135:- | c.896A>G | p.D299G | loss of splicing in vivo (0.2 fold) but not in vitro | N/A (poor RNA quality) | whole blood (frozen) | N/A | Manfred Kilimann (data not shown)
CM961106 | PHKA2 | chrX:18963257:- | c.557G>A | p.R186H | no change | N/A (poor RNA quality) | whole blood (frozen) | N/A | Manfred Kilimann (data not shown)
CM961106 | PHKA2 | chrX:18963257:- | c.565A>G | p.K189E | increased splicing in vivo (1.6 fold) and in vitro (1.6 fold) | N/A (poor RNA quality) | whole blood (frozen) | N/A | Manfred Kilimann (data not shown)
rs2295682 | RBM23 | chr14:23374862:- | c.408G>A | p.R136 | loss of splicing in vivo (0.3 fold) and in vitro (0.5 fold) | exon 6 skipping | genome-wide survey in 166 lymphoblastoid cell lines | TP | Nembaware et al., 2008, Hull et al., 2007
CM113684 | SERPINC1 | chr1:173876649:- | c.1157T>C | p.I386T | no change | N/A (poor RNA quality) | whole blood (frozen) | N/A | Martin Olivieri and Beate Luxembourg (data not shown)
CM075012 | SLC12A3 | chr16:56902212:+ | c.433C>T | p.R145C | loss of splicing in vivo (0.4 fold) and in vitro (0.6 fold) but neither were significant | exon 7 skipping | whole blood | FN | Riveira-Munoz et al., 2007
CM115034 | TAZ | chrX:153647918:+ | c.497T>A | p.L166* | loss of splicing in vivo (0.3 fold) and in vitro (0.5 fold) | exon 6 and 7 skipping | lymphoblastoid cell line | TP | Coriell Repository (Figure A.3)
CM067040 | TAZ | chrX:153641901:+ | c.367C>T | p.R123* | loss of splicing in vivo (0.6 fold) and in vitro (0.4 fold) | intron retention | lymphoblastoid cell line | TP | Coriell Repository (Figure A.3)

Table A.3: Genes enriched with SSM
Gene
ATM
ATP7A
BBS1
BRCA1
BRCA2
BTK
CCM2
CHD7
CHM
COL11A1
COL1A1
COL2A1
COL3A1
COL4A5
COL5A1
COL6A2
COL7A1
CUL3
CYBB
DARS2
DMD
ENG
EVC
EXT1
FANCA
FANCG
FECH
FLCN
GH1
HMBS
KRIT1
L1CAM
LAMA2
LAMP2
MLH1
MSH2
MTM1
MTTP
NEB
NF1
NF2
NIPBL
OFD1
OPA1
PDCD10
PHEX
PKD2
PKP1
PMS2
PRKAR1A
RB1
RPGR
SGCE
SLC2A2
SPAST
SPINK5
SURF1
TCIRG1
TSC2
TTC8
VPS13A
VPS33B
WDR45
WRN

Table A.4: ENCODE Accession numbers used for the purpose of validating MaPSy results.
ENCODE Accession Number
SRR307897
SRR307898
SRR307901
SRR307902
SRR307903
SRR307904
SRR307905
SRR307906
SRR307907
SRR307908
SRR307911
SRR307912
SRR307920
SRR307926
SRR307927
SRR307932
SRR307933
SRR315301
SRR315302
SRR315313
SRR315314
SRR315315
SRR315316
SRR315327
SRR315328
SRR315329
SRR315330
SRR315331
SRR315336
SRR315337
SRR534301
SRR534302
SRR534309
SRR534310
SRR534319
SRR534320
SRR534321
SRR534322
SRR534323
SRR534324
SRR545695
SRR545696
SRR545697
SRR545698
SRR545699
SRR545700

[Figure A.1 panels: alternative events in vivo; alternative events in vitro; alternative events in vitro and in vivo (cryptic 3'ss and cryptic 5'ss). Annotation: ~5% of mutations induce cryptic splicing in vivo.]

Figure A.1: Alternative splicing events in the 5K panel. The majority of cryptic splicing occurred by creation of an AG or GT (Type I), while some other mutations increased the usage of a nearby weaker splice site (Type II). Very few mutations were found to abolish alternative splice-site usage (Type III).
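The Type I events in Figure A.1 arise when a substitution creates an AG or GT dinucleotide. As a minimal sketch of that sequence-level check (not the read-based analysis used to call the events in the figure), the snippet below reports which of the two dinucleotides a single-base substitution introduces. The function name `created_dinucleotides`, the 0-based `pos` argument and the example sequences are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

# Dinucleotides that can seed a cryptic splice site: AG ends an intron (3'ss),
# GT begins an intron (5'ss).
SPLICE_DINUCLEOTIDES = {"AG": "cryptic 3'ss", "GT": "cryptic 5'ss"}


def created_dinucleotides(wt_seq: str, pos: int, alt: str) -> dict[str, str]:
    """Return the AG/GT dinucleotides newly created by substituting `alt`
    at 0-based position `pos` of `wt_seq` (hypothetical helper)."""
    wt = wt_seq.upper()
    alt = alt.upper()
    if not 0 <= pos < len(wt) or len(alt) != 1:
        raise ValueError("pos must index wt_seq and alt must be one base")
    mt = wt[:pos] + alt + wt[pos + 1:]
    created = {}
    # A substitution only changes the two dinucleotides that overlap `pos`.
    for start in (pos - 1, pos):
        if start < 0 or start + 2 > len(wt):
            continue
        wt_di, mt_di = wt[start:start + 2], mt[start:start + 2]
        if mt_di in SPLICE_DINUCLEOTIDES and mt_di != wt_di:
            created[mt_di] = SPLICE_DINUCLEOTIDES[mt_di]
    return created


if __name__ == "__main__":
    print(created_dinucleotides("CCATCC", 3, "G"))  # T>G creates an AG
    print(created_dinucleotides("CCGACC", 3, "T"))  # A>T creates a GT
    print(created_dinucleotides("CCCCCC", 2, "A"))  # no AG or GT created
```

Creating the dinucleotide is only a necessary condition in this sketch; whether the new site is actually used depends on the surrounding splice-site context, which the function does not score.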
[Figure A.2 panels: scatter plots of replicate pairs (axes: Replicate 1, Replicate 2, Replicate 3); histograms of relative splicing efficiency (y-axis: count) for MaPSy in vivo and MaPSy in vitro; species: mutant, wild type.]

Figure A.2: MaPSy performance. (a-d) Agreement between allelic splicing ratios (log2) of three cell culture replicates of MaPSy in vivo (a-c) and two experimental replicates of MaPSy in vitro (d). (e) Stacked histogram of mutant (red) and wild-type (blue) relative splicing efficiency in MaPSy in vivo (top) and in vitro (bottom). (f) Full gel of output (spliced species) from MaPSy in vivo.

[Figure A.3 panels: RT-PCR gels for ITPA CM022413 (missense, LCL), GALT CM910169 (missense, LCL), ENG CM111552 (nonsense, whole blood), ATP7A CM940140 (nonsense, fibroblast), TAZ CM115034 (nonsense, LCL) and TAZ CM067040 (nonsense, LCL); percent usage at 3'ss and at 5'ss versus increasing WT splicing efficiency in the cell lines A549, GM12878, HMEC, K562, NHLF, AG04450, BJ, H1-hESC, HSMM, HeLa-S3, HUVEC, MCF-7, SK-N-SH, Monocytes-CD14+, SK-N-SH-RA, CD20+, HepG2, IMR90 and NHEK.]

Figure A.3: MaPSy validation in patient samples and ENCODE data.
(a-f) ESMs identified by MaPSy in mutations causing inosine triphosphatase deficiency (a), galactosemia (b), haemorrhagic telangiectasia (c), Menkes syndrome (d) and Barth syndrome (e,f) exhibited splicing aberrations (exon skipping and/or intron retention) in RNA derived from patient tissue samples. (g) Splicing efficiency in MaPSy corresponds to splicing in ENCODE data.

[Figure A.4 panels: (a) %ESM in 5K panel and (b) n mutations, each grouped by mode of inheritance (AD, AR, XL, XLD, XLR) and by haploinsufficiency class (HS, moderate HI, HI).]

Figure A.4: Mode of inheritance in the 5K panel. (a) Percent ESM in the 5K panel stratified by modes of inheritance in haploinsufficient genes (prediction score = 1), haplosufficient genes (prediction score < 0.7) and moderately haploinsufficient genes (1 > prediction score ≥ 0.7).¹⁵⁷ Error bars, 95% confidence intervals. (b) Number of mutations in the different modes of inheritance in the 5K panel.

[Figure A.5 panels: (a) mean %ESM in gene versus gene tolerance to truncating variants (tolerant, semi-tolerant, intolerant) for dominant and recessive traits; (b) density of n introns for MaPSy-ESM, SSM-enriched, PTV-intolerant, HGMD and RefSeq genes.]

Figure A.5: Genes intolerant to protein-truncating variants (PTVs) in the ExAC population are predisposed to disease-associated splicing mutations. (a) Mean fraction of ESMs in PTV-intolerant (pLI ≥ 0.9), semitolerant (0.1 < pLI < 0.9) and tolerant (pLI ≤ 0.1) genes in dominant and recessive traits. Error bars, s.e.m.
(b) PTV-intolerant genes also have more introns than other genes, similar to disease genes that lose function via splicing mutations.

[Figure A.6 panels: (a) splicing efficiency versus %GC differential exon-intron, ESE density, ESS density, exon length, exon PhastCons, intron length, SS strength and ΔG (kcal/mol); (b) %ESM versus binned %GC differential exon-intron, ESE density, ESS density and SS strength.]

Figure A.6: Features of splicing. (a) The mean of relative splicing efficiency of wild-type species in vivo (n = 2,086) is plotted against increasing mean of feature measures in sliding window (size = 200, step = 1). Shaded regions represent 95% confidence intervals. Intron length is plotted on a log10 scale. The mean of PhastCons score for all bases of the exon was used to measure conservation. Genomic features that have previously been associated with splicing are shown to display similar trends in MaPSy. P values were obtained from linear regression analyses. (b) The 5K panel is divided into five bins of increasing feature measures, and percent ESM in each bin is plotted. Error bars, 95% confidence intervals.
Low differential GC content between exon and intron, fewer ESEs, more ESSs and weaker agreement with the splice-site consensus sequence, all of which are associated with weaker splicing, are shown to sensitize exons to ESM. The Kruskal-Wallis test was used to obtain P values.

[Figure A.7 panels: (a) siPTBP1 versus control, WT and MUT lanes, probed for PTBP1 and GAPDH; (b) siSRSF1 versus control, WT and MUT lanes, probed for SRSF1 and GAPDH.]

Figure A.7: The role of PTBP1 and SRSF1 in ESM phenotypes. (a) The splicing phenotype of a mutation in exon 20 of COL1A2 that creates a PTBP1-binding motif was partially rescued when PTBP1 was knocked down. (b) A mutation that weakens an SRSF1-binding motif in exon 8 of MLH1 caused a modest but not significant increase of skipping events in the absence of SRSF1, whereas the wild-type exon that contains an SRSF1-binding site had a significant increase in skipping events when SRSF1 was knocked down.

[Figure A.8 panels: (a) test sequences and calculation of intronic motif density; (b) clustering of the upstream intron (20 clusters) into intronic splicing activators and intronic splicing repressors, in vivo and in vitro; (c, d) overlaps between intronic and exonic splicing activators and repressors, with SRSF1 and PTBP1 marked in vivo and in vitro; (e) exonic splicing repressors and exonic splicing activators in vivo and in vitro, listed as proteins (number of motifs).]

Figure A.8: Overlap of intronic and exonic splicing regulatory motifs. (a) The density for each RBP motif was calculated in all wild-type species (n = 2,048). (b) Clustering of intronic data reveals similar trends in vivo and in vitro. (c) Intronic splicing activators and exonic splicing repressors show a high degree of overlap. (d) Intronic splicing repressor motifs and exonic splicing activator motifs display a high degree of overlap. (e) Table of exonic splicing repressors and exonic splicing activators that exhibit the same function in vivo and in vitro.

[Figure A.9 panels: (a) SELEX series 1 to 3 with 15' and 80' incubations at HeLa NE, selecting the A, B/C and spliced fractions; (b) M/W at B/C and M/W at A versus M/W splicing, in vivo and in vitro; (c) allele ratio by fraction (t0, A, BC, spl) for clusters 1 to 16, 18, 19, 20 and 23, with significance classes: significant, significant in vitro, not significant.]

Figure A.9: In vitro functional SELEX. (a) Series of functional SELEX with MaPSy. (b) Mutant/wild type ratio in the B/C fraction in comparison to spliced species (left) and in the A fraction in comparison to spliced species (right).
Enrichment in B/C complex is positively correlated with splicing, while enrichment in A complex is negatively correlated with splicing. (c) Clustering the effects of exonic mutation disruptions on different stages of spliceosomal assembly revealed mechanistic signatures of ESM. Only clusters with ≥ 8 members are shown.

[Figure A.10 panels (x-axis: cluster 1 to 16, 18, 20 and 23; y-axis: value): ΔG (kcal/mol); ΔG (kcal/mol) wt exon; number of ESSs; %GC differential exon-intron; mutant severity in vitro; mutant severity in vivo; number of HGMD splice site variants; distance to splice site; exon splice site strength; PhastCons conservation - intron; wt splicing efficiency in vitro; wt splicing efficiency in vivo.]

Figure A.10: Mutant feature analyses in different clusters revealed distinct ESM mechanistic signatures. Horizontal dotted lines indicate the mean value of the features in the 5K panel. Box plots of feature values that are significantly different than background (permuted cluster assignment) are colored red. The medians are indicated as horizontal bold lines, and the means as black hollow dots.

Figure A.11: ESM visualization browser.
A web browser was developed to visualize raw counts and information on individual mutations from original publications. Mutations can be queried by HGMD ID, gene or author.

Figure A.12: Common sequences of the 5K panel reporters. (a) In vivo reporter sequence: CMV enhancer and promoter sequence (blue), adenovirus sequence (green; exon in uppercase and intron in lowercase), 200-mer library (red), ACTN1 intron 15 (lowercase, purple) and exon 16 (uppercase, purple), bGH poly(A) (cyan). (b) In vitro reporter sequence: adenovirus sequence (green; including T7 promoter sequence in bold), 200-mer oligo library (red) and additional intronic sequence (purple).

Appendix B
Splicing Aberrations and Human Hereditary Diseases

Table B.1: Low frequency variants predicted to disrupt splicing in RB1
Columns, in order: Chromosome, Coordinate (hg19), Exon/Intron number, Distance from closest splice site, Exon/Intron, Closest splice site, Variant category, ExAC allele frequency, Chasin score, Spliceman score, Further consideration
13 48878001 1 118 exon 3_ss 5_prime_UTR 0.0277624 -0.1107268 53 NO
13 48878023 1 140 exon 3_ss 5_prime_UTR 0.0186289 0.0698396 73 NO
13 48878048 1 -137 exon 5_ss 5_prime_UTR 0.0186254 -1.1491927 98 YES
13 48878065 1 -120 exon 5_ss missense 0.0185839 1.777245 92 YES
13 48878090 1 -95 exon 5_ss synonymous 0.0094518 0.1595113 74 NO
13 48878108 1 -77 exon 5_ss synonymous 0.00977326 2.5115433 61 NO
13 48878117 1 -68 exon 5_ss synonymous 0.0101688 -0.8908173 66 NO
13 48878125 1 -60 exon 5_ss missense 0.0101937 -0.0604118 69 NO
13 48878131 1 -54 exon 5_ss missense 0.0102775 1.6757234 45 NO
13 48878161 1 -24 exon 5_ss missense 0.0228519 1.6190222 78 NO
13 48878198 1 13 intron 5_ss intronic 0.0539957 NA 65 NO
13 48881366 1 -50 intron 3_ss intronic 0.000841099 NA 81 NO
13 48881370 1 -46 intron 3_ss intronic 0.000840647 NA 89 NO
13 48881376 1 -40 intron 3_ss intronic 0.00168093 NA 57 NO
13 48881379 1 -37 intron 3_ss intronic 0.00672088 NA 36 NO
13 48881386 1 -30 intron 3_ss intronic 0.000837816 NA 72 NO
13 48881387 1 -29 intron 3_ss intronic 0.00167482 NA 67 NO
13 48881388 1 -28 intron 3_ss intronic 0.0092044 NA 68 NO
13 48881396 1 -20 intron 3_ss intronic 0.00249717 NA 96 YES
13 48881423 2 7 exon 3_ss missense 0.000825232 0.7455179 48 NO
13 48881431 2 15 exon 3_ss missense 0.000824851 -1.1030541 19 NO
13 48881433 2 17 exon 3_ss missense 0.000824756 -2.0943103 18 NO
13 48881437 2 21 exon 3_ss synonymous 0.000824647 -1.0326177 40 NO
13 48881441 2 25 exon 3_ss missense 0.00329864 0.1056588 35 NO
13 48881443 2 27 exon 3_ss synonymous 0.000824579 -0.6616866 94 YES
13 48881451 2 35 exon 3_ss missense 0.00577177 -1.9451201 56 NO
13 48881453 2 37 exon 3_ss missense 0.00164913 -0.9214567 70 NO
13 48881455 2 39 exon 3_ss synonymous 0.00164897 -0.8776521 69 NO
13 48881483 2 -59 exon 5_ss missense 0.000824484 1.2430543 52 NO
13 48881484 2 -58 exon 5_ss missense 0.000824484 1.1319423 78 NO
13 48881485 2 -57 exon 5_ss synonymous 0.000824484 0.696165 92 YES
13 48881492 2 -50 exon 5_ss missense 0.000824606 -0.0371454 13 NO
13 48881493 2 -49 exon 5_ss missense 0.00164921 -1.585987 28 NO
13 48881496 2 -46 exon 5_ss missense 0.00164932 -0.6605404 62 NO
13 48881499 2 -43 exon 5_ss missense 0.000824783 -1.3398573 42 NO
13 48881523 2 -19 exon 5_ss missense 0.000825055 -2.2462666 44 NO
13 48881536 2 -6 exon 5_ss synonymous 0.000825859 -3.0675182 95 YES
13 48881537 2 -5 exon 5_ss missense 0.00330366 0.2310482 50 NO
13 48881562 2 20 intron 5_ss intronic 0.000834516 NA 74 NO
13 48881564 2 22 intron 5_ss intronic 0.00167386 NA 92 YES
13 48881573 2 31 intron 5_ss intronic 0.00252279 NA 54 NO
13 48881574 2 32 intron 5_ss intronic 0.00168865 NA 42 NO
13 48881581 2 39 intron 5_ss intronic 0.000847343 NA 88 NO
13 48881584 2 42 intron 5_ss intronic 0.00084995 NA 25 NO
13 48881589 2 47 intron 5_ss intronic 0.00170506 NA 75 NO
13 48916698 2 -37 intron 3_ss intronic 0.00254044 NA 49 NO
13 48916736 3 1 exon 3_ss missense and splice_region 0.000832376 1.49243942 21 YES
13 48916738 3 3 exon 3_ss missense 0.000831822 2.3748346 92 YES
13 48916739 3 4 exon 3_ss missense 0.00249501 1.1802176 79 NO
13 48916742 3 7 exon 3_ss missense 0.00249306 1.1472502 61 NO
13 48916743 3 8 exon 3_ss synonymous 0.000830896 1.396969 68 NO
13 48916755 3 20 exon 3_ss synonymous 0.000828555 2.6229368 18 NO
13 48916765 3 30 exon 3_ss missense 0.000826952 2.31396522 79 NO
13 48916785 3 50 exon 3_ss synonymous 0.000825696 1.5709402 79 NO
13 48916803 3 -47 exon 5_ss synonymous 0.00165115 1.8370629 65 NO
13 48916806 3 -44 exon 5_ss synonymous 0.000825573 -0.1486369 52 NO
13 48916812 3 -38 exon 5_ss synonymous 0.00165164 -1.425421 75 NO
13 48916815 3 -35 exon 5_ss synonymous 0.00247786 -0.5542286 85 NO
13 48916833 3 -17 exon 5_ss synonymous 0.000828171 -0.2551171 52 NO
13 48916837 3 -13 exon 5_ss missense 0.0066257 1.1602882 54 NO
13 48916840 3 -10 exon 5_ss missense 0.000828899 -0.0182508 90 YES
13 48916856 3 6 intron 5_ss splice_region and intronic 0.000837521 NA 72 YES
13 48916860 3 10 intron 5_ss intronic 0.176408 NA 11 NO
13 48916861 3 11 intron 5_ss intronic 0.000840534 NA 41 NO
13 48916862 3 12 intron 5_ss intronic 0.322407 NA 73 NO
13 48916870 3 20 intron 5_ss intronic 0.000847831 NA 40 NO
13 48916878 3 28 intron 5_ss intronic 0.000852544 NA 20 NO
13 48916887 3 37 intron 5_ss intronic 0.654649 NA 69 NO
13 48916896 3 46 intron 5_ss intronic 0.00791098 NA 81 NO
13 48919166 3 -50 intron 3_ss intronic 0.00184822 NA 82 NO
13 48919169 3 -47 intron 3_ss intronic 0.000922645 NA 46 NO
13 48919170 3 -46 intron 3_ss intronic 0.00554457 NA 63 NO
13 48919204 3 -12 intron 3_ss intronic 0.00088129 NA 20 YES
13 48919205 3 -11 intron 3_ss intronic 0.00175741 NA 23 YES
13 48919216 4 0 exon 3_ss splice_region and synonymous 0.000859727 2.5313409 NA YES
13 48919228 4 12 exon 3_ss missense 0.00339276 0.7651483 48 NO
13 48919231 4 15 exon 3_ss synonymous 0.000846081 4.1856216 54 NO
13 48919246 4 30 exon 3_ss missense 0.0418326 -0.8578923 43 NO
13 48919277 4 -58 exon 5_ss missense 0.000827787 0.57504588 64 NO
13 48919281 4 -54 exon 5_ss missense 0.00248225 -2.5170208 64 NO
13 48919294 4 -41 exon 5_ss missense 0.000826446 -1.7583517 40 NO
13 48919297 4 -38 exon 5_ss synonymous 0.00330524 -0.6273022 50 NO
13 48919299 4 -36 exon 5_ss missense 0.000826228 0.5253591 53 NO
13 48919320 4 -15 exon 5_ss missense 0.0090837 -0.9425876 65 NO
13 48919358 4 23 intron 5_ss intronic 96.0746 NA 70 NO
13 48919374 4 39 intron 5_ss intronic 0.000828349 NA 64 NO
13 48919378 4 43 intron 5_ss intronic 0.000828775 NA 51 NO
13 48919381 4 46 intron 5_ss intronic 0.178311 NA 76 NO
13 48921923 4 -38 intron 3_ss intronic 0.326055 NA 77 NO
13 48921930 4 -31 intron 3_ss intronic 0.00206898 NA 79 NO
13 48921931 4 -30 intron 3_ss intronic 0.00413121 NA 29 NO
13 48921939 4 -22 intron 3_ss intronic 0.00301987 NA 62 NO
13 48921957 4 -4 intron 3_ss splice_region and intronic 0.000941797 NA 9 YES
13 48921958 4 -3 intron 3_ss splice_region and intronic 0.00187783 NA 19 YES
13 48921987 5 -12 exon 5_ss missense 0.000909653 -0.4012625 68 NO
13 48921999 5 0 exon 5_ss missense and splice_region 0.00275938 -2.0827767 42 YES
13 48922011 5 12 intron 5_ss intronic 0.00187988 NA 53 NO
13 48922038 5 39 intron 5_ss intronic 0.00206033 NA 15 NO
13 48922039 5 40 intron 5_ss intronic 0.00102931 NA 22 NO
13 48922042 5 43 intron 5_ss intronic 0.0544163 NA 43 NO
13 48923044 5 -48 intron 3_ss intronic 0.000852646 NA 69 NO
13 48923048 5 -44 intron 3_ss intronic 0.00085189 NA 87 NO
13 48923051 5 -41 intron 3_ss intronic 0.000852021 NA 57 NO
13 48923059 5 -33 intron 3_ss intronic 0.0271845 NA 11 NO
13 48923069 5 -23 intron 3_ss intronic 0.00506389 NA 13 NO
13 48923070 5 -22 intron 3_ss intronic 0.000844053 NA 8 NO
13 48923078 5 -14 intron 3_ss intronic 0.00503812 NA 23 YES
13 48923089 5 -3 intron 3_ss splice_region and intronic 0.00669467 NA 15 YES
13 48923093 6 1 exon 3_ss missense and splice_region 0.000835869 -1.8680915 71 YES
13 48923101 6 9 exon 3_ss synonymous 0.000834641 0.4235853 88 NO
13 48923122 6 30 exon 3_ss synonymous 0.000833528 -0.5828414 79 NO
13 48923123 6 31 exon 3_ss synonymous 0.0325125 0.0023983 63 NO
13 48923132 6 -27 exon 5_ss missense 0.000833764 0.1139748 53 NO
13 48923149 6 -10 exon 5_ss synonymous 0.000835785 0.5307481 37 NO
13 48923152 6 -7 exon 5_ss missense 0.000836694 4.2474865 100 YES
13 48923163 6 4 intron 5_ss splice_region and intronic 0.00168637 NA 52 YES
13 48923172 6 13 intron 5_ss intronic 0.000852748 NA 74 NO
13 48923177 6 18 intron 5_ss intronic 0.000863737 NA 18 NO
13 48923199 6 40 intron 5_ss intronic 0.000931446 NA 61 NO
13 48934104 6 -49 intron 3_ss intronic 0.00086786 NA 83 NO
13 48934107 6 -46 intron 3_ss intronic 0.00347602 NA 88 NO
13 48934114 6 -39 intron 3_ss intronic 0.000865816 NA 63 NO
13 48934141 6 -12 intron 3_ss intronic 0.00092098 NA 41 YES
13 48934144 6 -9 intron 3_ss intronic 0.000833597 NA 39 YES
13 48934146 6 -7 intron 3_ss splice_region and intronic 0.000832778 NA 5 YES
13 48934158 7 5 exon 3_ss missense 0.000829119 -0.2373917 50 NO
13 48934173 7 20 exon 3_ss missense 0.0107561 -2.8782744 52 NO
13 48934188 7 35 exon 3_ss missense 0.00165281 -0.7839137 34 NO
13 48934189 7 36 exon 3_ss missense 0.000826378 -2.4428756 13 NO
13 48934203 7 50 exon 3_ss synonymous 0.00247836 0.2558824 59 NO
13 48934211 7 -52 exon 5_ss synonymous 0.00165224 -0.82170458 60 NO
13 48934219 7 -44 exon 5_ss missense 0.000826241 2.1612114 68 NO
13 48934226 7 -37 exon 5_ss synonymous 0.000826323 2.4408955 52 NO
13 48934228 7 -35 exon 5_ss missense 0.00165287 0.6958528 89 NO
13 48934230 7 -33 exon 5_ss missense 0.00165284 -0.8381001 51 NO
13 48934262 7 -1 exon 5_ss splice_region and synonymous 0.000829504 2.3451519 74 YES
13 48934275 7 12 intron 5_ss intronic 0.017502 NA 94 YES
13 48934282 7 19 intron 5_ss intronic 0.000837816 NA 69 NO
13 48934289 7 26 intron 5_ss intronic 0.108498 NA 28 NO
13 48934298 7 35 intron 5_ss intronic 0.299912 NA 16 NO
13 48934303 7 40 intron 5_ss intronic 0.000848651 NA 73 NO
13 48936913 7 -38 intron 3_ss intronic 0.00165961 NA 67 NO
13 48936917 7 -34 intron 3_ss intronic 0.000829573 NA 67 NO
13 48936923 7 -28 intron 3_ss intronic 0.000828748 NA 74 NO
13 48936937 7 -14 intron 3_ss intronic 0.001653 NA 52 YES
13 48936943 7 -8 intron 3_ss splice_region and intronic 0.000825791 NA 49 YES
13 48936962 8 11 exon 3_ss missense 0.000824906 0.8725631 80 NO
13 48936963 8 12 exon 3_ss missense 0.00247463 0.9546755 72 NO
13 48936967 8 16 exon 3_ss synonymous 0.000824797 0.0645219 75 NO
13 48936984 8 33 exon 3_ss missense 0.000824511 -2.3078504 79 NO
13 48936996 8 45 exon 3_ss missense 0.00164878 -1.3586711 42 NO
13 48936999 8 48 exon 3_ss missense 0.000824375 0.0767113 98 YES
13 48937014 8 63 exon 3_ss missense 0.00164883 0.18869685 90 YES
13 48937015 8 64 exon 3_ss synonymous 0.000824416 1.27772805 94 YES
13 48937017 8 66 exon 3_ss missense 0.000824484 -1.7005646 99 YES
13 48937030 8 -63 exon 5_ss synonymous 0.000824538 -1.35496 77 NO
13 48937072 8 -21 exon 5_ss synonymous 0.0107404 -0.698853 44 NO
13 48937082 8 -11 exon 5_ss missense 0.00331055 1.8882689 51 NO
13 48937103 8 10 intron 5_ss intronic 0.000835715 NA 80 NO
13 48937109 8 16 intron 5_ss intronic 0.00252262 NA 66 NO
13 48937120 8 27 intron 5_ss intronic 0.00592417 NA 26 NO
13 48937129 8 36 intron 5_ss intronic 0.000852718 NA 40 NO
13 48937130 8 37 intron 5_ss intronic 0.000853228 NA 39 NO
13 48937140 8 47 intron 5_ss intronic 0.000863662 NA 50 NO
13 48937142 8 49 intron 5_ss intronic 0.00173883 NA 73 NO
13 48938980 8 -50 intron 3_ss intronic 0.0563063 NA 100 YES
13 48939001 8 -29 intron 3_ss intronic 0.055938 NA 79 YES
13 48939013 8 -17 intron 3_ss intronic 0.00424701 NA 55 YES
13 48939015 8 -15 intron 3_ss intronic 0.161031 NA 61 YES
13 48939032 9 2 exon 3_ss splice_region and synonymous 0.00815129 -2.018635 18 YES
13 48939065 9 35 exon 3_ss missense 0.00339443 -0.1891095 39 YES
13 48939088 9 -19 exon 5_ss missense 0.275374 -1.2807164 60 YES
13 48939097 9 -10 exon 5_ss missense 0.0604449 -0.4073398 58 YES
13 48939120 9 13 intron 5_ss intronic 0.00315956 NA 49 YES
13 48939129 9 22 intron 5_ss intronic 0.00333511 NA 32 YES
13 48939132 9 25 intron 5_ss intronic 0.00672766 NA 61 NO
13 48939148 9 41 intron 5_ss intronic 0.00386279 NA 86 NO
13 48941582 9 -48 intron 3_ss intronic 0.000850369 NA 50 NO
13 48941584 9 -46 intron 3_ss intronic 0.00170117 NA 51 NO
13 48941593 9 -37 intron 3_ss intronic 0.00084894 NA 71 NO
13 48941631 10 1 exon 3_ss missense and splice_region 0.00752219 2.7806144 67 YES
13 48941638 10 8 exon 3_ss synonymous 0.000834585 0.3169406 44 NO
13 48941642 10 12 exon 3_ss missense 0.000834042 -0.6104506 32 NO
13 48941648 10 18 exon 3_ss missense 0.000833528 0.3026 96 YES
13 48941649 10 19 exon 3_ss missense 0.00166686 -1.1836168 94 YES
13 48941653 10 23 exon 3_ss synonymous 0.00166628 -0.2622869 99 YES
13 48941654 10 24 exon 3_ss missense 0.000833111 -1.8043326 75 NO
13 48941655 10 25 exon 3_ss missense 0.00249867 -0.5476233 93 YES
13 48941658 10 28 exon 3_ss missense 0.00249817 -0.442141 93 YES
13 48941669 10 39 exon 3_ss missense 0.000832251 2.0945058 54 NO
13 48941678 10 48 exon 3_ss
missense 0.000831476 -1.9090763 79 NO 13 48941685 10 -54 exon 5_ss missense 0.000830965 -2.3687165 36 NO 13 48941714 10 -25 exon 5_ss missense 0.000830165 0.1784271 63 NO 13 48941722 10 -17 exon 5_ss synonymous 0.000830096 -0.8444075 80 NO Appendix B. Splicing Aberrations and Human Hereditary Diseases 13 48941732 10 -7 exon 5_ss missense 0.00166085 1.055544 47 NO 13 48941747 10 8 intron 5_ss splice_region and intronic 0.000831062 NA 86 NO 13 48941759 10 20 intron 5_ss intronic 0.000831781 NA 89 NO 13 48941769 10 30 intron 5_ss intronic 0.000832612 NA 41 NO 13 48941789 10 50 intron 5_ss intronic 0.000835017 NA 60 NO 13 48941791 10 52 intron 5_ss intronic 0.000836526 NA 74 NO 13 48942620 10 -43 intron 3_ss intronic 0.00333322 NA 69 YES 13 48942623 10 -40 intron 3_ss intronic 0.000832709 NA 89 NO 13 48942633 10 -30 intron 3_ss intronic 0.00166276 NA 19 NO 13 48942639 10 -24 intron 3_ss intronic 0.00166069 NA 45 YES 13 48942648 10 -15 intron 3_ss intronic 0.000829105 NA 39 YES 13 48942652 10 -11 intron 3_ss intronic 0.000828665 NA 53 YES 13 48942673 11 10 exon 3_ss stop_gained 0.000826269 -1.4730995 42 YES Continued on next page 121 Table B.1 – continued from previous page 122 site Exon/Intron Chasin score Chromosome Spliceman score Variant category Closest splice site Coordinate (hg19) Exon/Intron number ExAC allele frequency Further consideration Distance from closest splice 13 48942680 11 17 exon 3_ss missense 0.00165164 -0.0796225 66 YES 13 48942682 11 19 exon 3_ss missense 0.000825764 -0.0076661 90 YES 13 48942686 11 23 exon 3_ss missense 0.00165112 -1.1536964 76 YES 13 48942708 11 -32 exon 5_ss missense 0.00165117 3.0637926 74 YES 13 48942758 11 18 intron 5_ss intronic 0.00182345 NA 58 NO 13 48942767 11 27 intron 5_ss intronic 0.000954016 NA 64 NO 13 48942769 11 29 intron 5_ss intronic 0.00194526 NA 59 NO 13 48947490 11 -51 intron 3_ss intronic 0.000828748 NA 3 NO 13 48947495 11 -46 intron 3_ss intronic 0.000828281 NA 8 NO 13 48947504 11 -37 intron 3_ss intronic 
0.000827869 NA 63 NO 13 48947521 11 -20 intron 3_ss intronic 0.000827075 NA 32 YES 13 48947530 11 -11 intron 3_ss intronic 0.0024793 NA 30 YES 13 48947542 12 1 exon 3_ss missense and splice_region 0.0256101 -2.5821741 47 YES 13 48947553 12 12 exon 3_ss synonymous 0.0165177 -0.2091465 75 NO 13 48947556 12 15 exon 3_ss synonymous 0.000825805 0.113653 77 NO 13 48947569 12 28 exon 3_ss missense 0.00247713 -1.3848463 79 NO 13 48947572 12 31 exon 3_ss missense 0.000825777 -1.4349625 46 NO 13 48947573 12 32 exon 3_ss missense 0.000825764 1.2761025 83 NO 13 48947576 12 35 exon 3_ss missense 0.0016515 0.1439533 59 NO 13 48947577 12 36 exon 3_ss synonymous 0.000825764 0.9485474 57 NO 13 48947587 12 -41 exon 5_ss missense 0.00082575 0.3270871 73 NO 13 48947593 12 -35 exon 5_ss missense 0.000825832 -0.9069538 54 NO 13 48947611 12 -17 exon 5_ss synonymous 0.000826433 -1.1439651 44 NO 13 48947617 12 -11 exon 5_ss missense 0.000826733 -0.3789838 63 NO 13 48947619 12 -9 exon 5_ss synonymous 0.00165369 0.4707983 31 NO splice_region and 13 48947628 12 0 exon 5_ss 0.00496549 -1.3954788 72 YES synonymous 13 48947638 12 10 intron 5_ss intronic 0.0024869 NA 74 NO 13 48947640 12 12 intron 5_ss intronic 0.00248859 NA 62 NO 13 48947658 12 30 intron 5_ss intronic 0.00166276 NA 72 NO 13 48947663 12 35 intron 5_ss intronic 0.00249559 NA 65 NO 13 48947666 12 38 intron 5_ss intronic 0.0041607 NA 75 NO 13 48951013 12 -41 intron 3_ss intronic 0.00165835 NA 86 NO 13 48951014 12 -40 intron 3_ss intronic 0.0489204 NA 59 NO 13 48951017 12 -37 intron 3_ss intronic 0.000829036 NA 56 NO 13 48951021 12 -33 intron 3_ss intronic 0.000828871 NA 98 YES 13 48951023 12 -31 intron 3_ss intronic 0.00580191 NA 95 YES 13 48951024 12 -30 intron 3_ss intronic 0.000828789 NA 79 NO 13 48951025 12 -29 intron 3_ss intronic 0.804487 NA 94 YES 13 48951037 12 -17 intron 3_ss intronic 0.00248353 NA 7 YES 13 48951038 12 -16 intron 3_ss intronic 0.00248307 NA 61 YES 13 48951059 13 5 exon 3_ss synonymous 0.000826569 -0.933416 
55 NO 13 48951062 13 8 exon 3_ss synonymous 0.0033054 1.6502479 89 YES 13 48951071 13 17 exon 3_ss synonymous 0.0008259 1.3949608 87 NO 13 48951080 13 26 exon 3_ss synonymous 0.000825518 0.7230549 66 NO 13 48951089 13 35 exon 3_ss synonymous 0.000825314 2.3057729 25 NO 13 48951106 13 52 exon 3_ss missense 0.000825042 0.2284209 62 NO 13 48951109 13 55 exon 3_ss missense 0.000825042 -0.3354879 79 NO 13 48951118 13 -52 exon 5_ss missense 0.000824974 0.7604512 44 NO 13 48951131 13 -39 exon 5_ss synonymous 0.000824865 0.1730923 51 NO 13 48951144 13 -26 exon 5_ss missense 0.0602201 0.0586915 21 YES 13 48951149 13 -21 exon 5_ss synonymous 0.000825083 1.7265656 54 YES 13 48951181 13 11 intron 5_ss intronic 0.000826296 NA 40 NO Continued on next page Table B.1 – continued from previous page site Exon/Intron Chasin score Chromosome Spliceman score Variant category Closest splice site Coordinate (hg19) Exon/Intron number ExAC allele frequency Further consideration Distance from closest splice 13 48951184 13 14 intron 5_ss intronic 0.00247909 NA 92 YES 13 48951185 13 15 intron 5_ss intronic 0.00165319 NA 41 NO 13 48951193 13 23 intron 5_ss intronic 0.00496179 NA 88 NO 13 48951194 13 24 intron 5_ss intronic 0.0132334 NA 92 YES 13 48951201 13 31 intron 5_ss intronic 0.00827445 NA 58 NO 13 48951202 13 32 intron 5_ss intronic 0.00165494 NA 64 NO 13 48953696 13 -34 intron 3_ss intronic 0.00111664 NA 81 NO 13 48953701 13 -29 intron 3_ss intronic 0.0821288 NA 42 YES 13 48953731 14 1 exon 3_ss missense and splice_region 0.00094697 -1.1888665 91 YES splice_region and 13 48953732 14 2 exon 3_ss 0.00377529 -0.3439336 98 YES synonymous 13 48953748 14 18 exon 3_ss missense 0.000920505 -2.0119367 87 YES 13 48953753 14 23 exon 3_ss synonymous 0.000923361 -1.2585838 99 YES 13 48953761 14 -25 exon 5_ss missense 0.000926046 -1.065754 81 YES 13 48953771 14 -15 exon 5_ss synonymous 0.00094345 -0.2002097 69 NO 13 48953802 14 16 intron 5_ss intronic 0.00220575 NA 59 NO 13 48953804 14 18 intron 5_ss 
intronic 0.00112785 NA 75 NO 13 48953819 14 33 intron 5_ss intronic 0.0112845 NA 51 NO 13 48953820 14 34 intron 5_ss intronic 0.00584676 NA 91 YES 13 48953822 14 36 intron 5_ss intronic 0.00146224 NA 95 YES 13 48953823 14 37 intron 5_ss intronic 0.00905496 NA 87 NO 13 48953824 14 38 intron 5_ss intronic 0.00605144 NA 86 NO 13 48953825 14 39 intron 5_ss intronic 0.00159923 NA 85 NO 13 48953826 14 40 intron 5_ss intronic 0.0625461 NA 92 YES 13 48953830 14 44 intron 5_ss intronic 0.00333433 NA 91 YES 13 48953836 14 50 intron 5_ss intronic 0.00346656 NA 58 NO 13 48954173 14 -16 intron 3_ss intronic 0.0163399 NA 48 YES 13 48954175 14 -14 intron 3_ss intronic 4.38724 NA 58 NO 13 48954178 14 -11 intron 3_ss intronic 0.167648 NA 64 YES 13 48954207 15 -13 exon 5_ss missense 0.00405548 -0.3415362 74 NO 13 48954209 15 -11 exon 5_ss synonymous 0.0162483 -0.2016752 53 NO 13 48954229 15 9 intron 5_ss intronic 0.0504159 NA 27 NO 13 48954232 15 12 intron 5_ss intronic 0.0042878 NA 10 YES 13 48954234 15 14 intron 5_ss intronic 0.00855286 NA 31 NO 13 48954236 15 16 intron 5_ss intronic 0.124517 NA 20 NO 13 48954244 15 24 intron 5_ss intronic 0.00434141 NA 39 NO 13 48954247 15 27 intron 5_ss intronic 0.158353 NA 27 YES 13 48954248 15 28 intron 5_ss intronic 0.00436529 NA 48 YES 13 48954289 15 -12 intron 3_ss intronic 0.00157639 NA 5 YES 13 48954295 15 -6 intron 3_ss splice_region and intronic 0.00154603 NA 66 YES Appendix B. 
Splicing Aberrations and Human Hereditary Diseases 13 48954328 16 27 exon 3_ss synonymous 0.00216109 0.1688401 57 YES 13 48954342 16 -35 exon 5_ss missense 0.00817261 -2.17473901 91 YES 13 48954343 16 -34 exon 5_ss synonymous 0.0091646 -1.04844747 94 YES 13 48954346 16 -31 exon 5_ss synonymous 0.00809848 -1.36463347 88 NO 13 48954354 16 -23 exon 5_ss missense 0.00100561 0.485148 92 YES 13 48954358 16 -19 exon 5_ss synonymous 0.00101812 2.2273739 77 NO 13 48954362 16 -15 exon 5_ss missense 0.00104515 0.3018338 93 YES 13 48954366 16 -11 exon 5_ss missense 0.00105845 -0.84805666 68 NO 13 48954368 16 -9 exon 5_ss missense 0.00106232 1.3096232 83 YES 13 48954370 16 -7 exon 5_ss synonymous 0.0343267 -0.3321144 72 YES 13 48954373 16 -4 exon 5_ss synonymous 0.00108389 2.8918246 65 YES 13 48954375 16 -2 exon 5_ss missense and splice_region 0.00108963 0.9110326 82 YES 13 48954404 16 27 intron 5_ss intronic 0.00121024 NA 42 NO Continued on next page 123 Table B.1 – continued from previous page 124 site Exon/Intron Chasin score Chromosome Spliceman score Variant category Closest splice site Coordinate (hg19) Exon/Intron number ExAC allele frequency Further consideration Distance from closest splice 13 48954407 16 30 intron 5_ss intronic 0.014875 NA 64 NO 13 48954412 16 35 intron 5_ss intronic 0.00643401 NA 68 NO 13 48954419 16 42 intron 5_ss intronic 0.00676242 NA 75 YES 13 48955352 16 -31 intron 3_ss intronic 0.00243974 NA 47 NO 13 48955373 16 -10 intron 3_ss intronic 0.00135925 NA 11 YES 13 48955389 17 6 exon 3_ss missense 0.00109779 -1.139595 68 NO 13 48955423 17 40 exon 3_ss synonymous 0.000929333 0.2884579 46 NO 13 48955457 17 74 exon 3_ss missense 0.0104928 0.9600019 70 NO 13 48955458 17 75 exon 3_ss missense 0.197149 0.2615609 52 NO 13 48955501 17 -78 exon 5_ss missense 0.00166731 0.4850163 91 YES 13 48955503 17 -76 exon 5_ss missense 0.000833389 0.2284832 69 NO 13 48955547 17 -32 exon 5_ss missense 0.000832473 -1.7123596 94 YES 13 48955551 17 -28 exon 5_ss missense 
0.000833333 -0.9232656 91 YES 13 48955565 17 -14 exon 5_ss missense 0.00167358 0.02859087 55 NO 13 48955596 17 17 intron 5_ss intronic 0.000856869 NA 40 NO 13 48955609 17 30 intron 5_ss intronic 0.000866521 NA 41 NO 13 49027086 17 -43 intron 3_ss intronic 0.00165843 NA 52 NO 13 49027115 17 -14 intron 3_ss intronic 0.00248476 NA 67 YES 13 49027116 17 -13 intron 3_ss intronic 0.000828157 NA 78 YES 13 49027122 17 -7 intron 3_ss splice_region and intronic 0.000827883 NA 50 YES 13 49027125 17 -4 intron 3_ss splice_region and intronic 0.000827568 NA 64 YES 13 49027140 18 11 exon 3_ss synonymous 0.0636416 1.8132878 42 NO 13 49027165 18 36 exon 3_ss missense 0.000825818 -1.7891234 87 NO 13 49027169 18 40 exon 3_ss missense 0.00165139 0.5465081 76 NO 13 49027178 18 49 exon 3_ss missense 0.000825573 1.7892799 92 YES 13 49027180 18 51 exon 3_ss missense 0.000825464 -0.9775666 83 NO 13 49027197 18 -50 exon 5_ss synonymous 0.00990311 -1.6286909 88 NO 13 49027199 18 -48 exon 5_ss missense 0.00165068 -0.4977329 48 NO 13 49027203 18 -44 exon 5_ss synonymous 0.0577663 -0.7205048 53 NO 13 49027217 18 -30 exon 5_ss missense 0.000825137 0.6852971 7 NO 13 49027240 18 -7 exon 5_ss missense 0.000825518 1.342536 82 NO 13 49027250 18 3 intron 5_ss splice_region and intronic 0.00412896 NA 10 YES 13 49027258 18 11 intron 5_ss intronic 0.00247827 NA 44 NO 13 49027274 18 27 intron 5_ss intronic 0.000827034 NA 83 NO 13 49030328 18 -12 intron 3_ss intronic 0.00171421 NA 40 YES 13 49030335 18 -5 intron 3_ss splice_region and intronic 0.00425156 NA 79 YES 13 49030348 19 8 exon 3_ss missense 0.000838223 -1.0965621 30 YES 13 49030353 19 13 exon 3_ss missense 0.000835101 -0.3722429 82 YES 13 49030364 19 24 exon 3_ss synonymous 0.000829807 2.9814507 67 YES 13 49030365 19 25 exon 3_ss missense 0.00082949 -0.17568329 60 NO 13 49030367 19 27 exon 3_ss missense 0.000828885 -3.236231 47 YES 13 49030384 19 44 exon 3_ss missense 0.00330109 -2.7608525 93 YES 13 49030385 19 45 exon 3_ss synonymous 0.000825178 
-1.9231755 93 YES 13 49030387 19 47 exon 3_ss missense 0.00577472 -1.1590514 98 YES 13 49030388 19 48 exon 3_ss synonymous 0.000824865 0.9064438 100 YES 13 49030401 19 61 exon 3_ss missense 0.00412106 -0.0296035 70 NO 13 49030412 19 72 exon 3_ss missense 0.00164837 -0.6131606 52 YES 13 49030413 19 -72 exon 5_ss missense 0.0107137 -0.96986786 43 YES 13 49030418 19 -67 exon 5_ss synonymous 0.000824158 -0.5278743 83 YES 13 49030439 19 -46 exon 5_ss synonymous 0.000824552 0.4700367 57 YES 13 49030441 19 -44 exon 5_ss missense 0.000824565 2.5109973 82 NO 13 49030469 19 -16 exon 5_ss synonymous 0.00083545 0.725129 34 YES 13 49030471 19 -14 exon 5_ss missense 0.001673 0.5924656 81 YES Continued on next page Table B.1 – continued from previous page site Exon/Intron Chasin score Chromosome Spliceman score Variant category Closest splice site Coordinate (hg19) Exon/Intron number ExAC allele frequency Further consideration Distance from closest splice 13 49030504 19 19 intron 5_ss intronic 0.00631393 NA 50 YES 13 49030514 19 29 intron 5_ss intronic 0.00184291 NA 82 NO 13 49030525 19 40 intron 5_ss intronic 0.000952599 NA 61 NO 13 49030526 19 41 intron 5_ss intronic 0.000951602 NA 34 YES 13 49030535 19 50 intron 5_ss intronic 0.00100644 NA 60 YES 13 49030537 19 52 intron 5_ss intronic 0.0010129 NA 95 YES 13 49033787 19 -37 intron 3_ss intronic 0.000825355 NA 65 NO 13 49033792 19 -32 intron 3_ss intronic 0.00247537 NA 73 NO 13 49033809 19 -15 intron 3_ss intronic 0.000824484 NA 15 YES 13 49033812 19 -12 intron 3_ss intronic 0.160764 NA 31 YES 13 49033813 19 -11 intron 3_ss intronic 0.000824429 NA 20 YES 13 49033818 19 -6 intron 3_ss splice_region and intronic 0.000824198 NA 49 YES 13 49033824 20 0 exon 3_ss missense and splice_region 0.000824022 1.3140645 NA YES 13 49033827 20 3 exon 3_ss missense 0.000825737 1.0298142 89 NO 13 49033829 20 5 exon 3_ss missense 0.0577358 -1.8382194 83 NO 13 49033830 20 6 exon 3_ss missense 0.00906454 -2.1098307 87 NO 13 49033839 20 15 exon 3_ss 
missense 0.00165379 1.2720632 71 NO 13 49033844 20 20 exon 3_ss missense 0.000826843 -1.6505926 82 NO 13 49033852 20 28 exon 3_ss synonymous 0.00247272 0.6380959 63 NO 13 49033865 20 41 exon 3_ss missense 0.00329489 -2.0690019 78 NO 13 49033866 20 42 exon 3_ss missense 0.00411855 -1.5611551 77 NO 13 49033867 20 43 exon 3_ss synonymous 0.00411848 0.6086048 85 NO 13 49033885 20 61 exon 3_ss synonymous 0.000823642 0.754654 97 YES 13 49033896 20 72 exon 3_ss missense 0.000823669 0.6818085 74 NO 13 49033915 20 -54 exon 5_ss synonymous 0.000823669 1.03734985 17 NO 13 49033921 20 -48 exon 5_ss synonymous 0.00164745 -0.50015767 72 NO 13 49033924 20 -45 exon 5_ss synonymous 0.00082375 0.0054827 57 NO 13 49033941 20 -28 exon 5_ss missense 0.00164766 0.0441486 70 NO 13 49033945 20 -24 exon 5_ss synonymous 0.000823927 1.0831888 46 NO 13 49033954 20 -15 exon 5_ss missense 0.000824185 0.2664617 50 NO 13 49033977 20 8 intron 5_ss splice_region and intronic 0.00165328 NA 54 NO 13 49033983 20 14 intron 5_ss intronic 0.000827623 NA 80 NO 13 49033990 20 21 intron 5_ss intronic 0.00083011 NA 72 NO 13 49033991 20 22 intron 5_ss intronic 0.00249112 NA 28 NO 13 49033992 20 23 intron 5_ss intronic 0.000831131 NA 37 NO 13 49033993 20 24 intron 5_ss intronic 0.0398969 NA 80 NO 13 49034000 20 31 intron 5_ss intronic 0.000845552 NA 5 NO 13 49034008 20 39 intron 5_ss intronic 0.00103108 NA 91 YES 13 49034011 20 42 intron 5_ss intronic 0.00121643 NA 53 NO 13 49034012 20 43 intron 5_ss intronic 0.0060163 NA 89 NO Appendix B. 
Splicing Aberrations and Human Hereditary Diseases 13 49037826 20 -41 intron 3_ss intronic 0.000838097 NA 65 NO 13 49037831 20 -36 intron 3_ss intronic 0.000837577 NA 65 NO 13 49037846 20 -21 intron 3_ss intronic 0.000835478 NA 62 NO 13 49037849 20 -18 intron 3_ss intronic 0.00167017 NA 68 YES 13 49037850 20 -17 intron 3_ss intronic 0.00166987 NA 71 YES 13 49037852 20 -15 intron 3_ss intronic 0.000834711 NA 67 YES 13 49037853 20 -14 intron 3_ss intronic 0.000834683 NA 75 YES 13 49037854 20 -13 intron 3_ss intronic 0.00083446 NA 57 YES 13 49037940 21 -31 exon 5_ss missense 0.000829614 -0.5944271 66 NO 13 49037948 21 -23 exon 5_ss missense 0.000829655 -1.6065973 77 NO 13 49037978 21 7 intron 5_ss splice_region and intronic 0.000832584 NA 23 YES 13 49037984 21 13 intron 5_ss intronic 0.000834446 NA 45 NO 13 49037988 21 17 intron 5_ss intronic 0.000835869 NA 60 NO 125 Continued on next page Table B.1 – continued from previous page 126 site Exon/Intron Chasin score Chromosome Spliceman score Variant category Closest splice site Coordinate (hg19) Exon/Intron number ExAC allele frequency Further consideration Distance from closest splice 13 49038004 21 33 intron 5_ss intronic 0.000841156 NA 59 NO 13 49038009 21 38 intron 5_ss intronic 0.000844766 NA 73 NO 13 49038015 21 44 intron 5_ss intronic 0.000854336 NA 71 NO 13 49038019 21 48 intron 5_ss intronic 0.000854088 NA 81 NO 13 49039088 21 -46 intron 3_ss intronic 0.0021978 NA 17 NO 13 49039093 21 -41 intron 3_ss intronic 0.00252309 NA 14 NO 13 49039096 21 -38 intron 3_ss intronic 0.0178508 NA 57 YES 13 49039128 21 -6 intron 3_ss splice_region and intronic 0.00089699 NA 4 YES 13 49039143 22 9 exon 3_ss missense 0.00254048 -1.0492645 84 YES 13 49039148 22 14 exon 3_ss synonymous 0.000838715 0.3175326 75 YES 13 49039162 22 28 exon 3_ss missense 0.000829215 0.2664049 76 YES 13 49039163 22 29 exon 3_ss missense 0.000828871 0.1946578 33 YES 13 49039169 22 35 exon 3_ss synonymous 0.000827499 0.5271194 85 YES 13 49039175 22 41 
exon 3_ss synonymous 0.000826542 0.5266042 58 YES 13 49039194 22 -53 exon 5_ss missense 0.00082575 1.1306303 98 YES 13 49039195 22 -52 exon 5_ss missense 0.00247713 -1.8515124 85 YES 13 49039196 22 -51 exon 5_ss synonymous 0.00165125 -1.2146363 98 YES 13 49039208 22 -39 exon 5_ss synonymous 0.000825369 -0.4626354 56 YES 13 49039212 22 -35 exon 5_ss missense 0.00082526 -1.385864 45 NO 13 49039232 22 -15 exon 5_ss synonymous 0.000824906 -0.1192568 81 YES 13 49039233 22 -14 exon 5_ss missense 0.000824865 1.0534587 52 YES 13 49039240 22 -7 exon 5_ss missense 0.000824824 -0.0497238 39 NO 13 49039251 22 4 intron 5_ss splice_region and intronic 0.000824756 NA 59 YES 13 49039257 22 10 intron 5_ss intronic 0.000824729 NA 30 YES 13 49039260 22 13 intron 5_ss intronic 0.00164946 NA 56 YES 13 49039264 22 17 intron 5_ss intronic 0.00164951 NA 26 YES 13 49039270 22 23 intron 5_ss intronic 0.00164943 NA 44 YES 13 49039281 22 34 intron 5_ss intronic 0.000824742 NA 77 YES 13 49039289 22 42 intron 5_ss intronic 0.0255666 NA 71 YES 13 49039293 22 46 intron 5_ss intronic 0.000824701 NA 40 YES 13 49039325 22 -16 intron 3_ss intronic 0.000824674 NA 13 YES 13 49039364 23 23 exon 3_ss synonymous 0.000824253 NA 69 NO 13 49039371 23 30 exon 3_ss missense 0.00412133 -0.233363 83 NO 13 49039372 23 31 exon 3_ss missense 0.000824253 0.8318881 91 YES 13 49039387 23 46 exon 3_ss missense 0.000824416 -1.9463308 74 NO 13 49039395 23 54 exon 3_ss missense 0.00164899 2.6218656 97 YES 13 49039407 23 66 exon 3_ss missense 0.0107232 -1.5488008 81 NO 13 49039408 23 67 exon 3_ss missense 0.00329995 -2.3661323 99 YES 13 49039412 23 71 exon 3_ss missense 0.00412473 0.4126715 91 YES 13 49039440 23 -64 exon 5_ss missense 0.000827061 2.067097 87 NO 13 49039454 23 -50 exon 5_ss synonymous 0.00082824 2.2833742 74 NO 13 49039464 23 -40 exon 5_ss missense 0.000828899 -0.2009135 65 NO 13 49039476 23 -28 exon 5_ss missense 0.00249153 -1.5048716 84 NO 13 49039478 23 -26 exon 5_ss synonymous 0.00332265 1.948504 79 NO 
13 49039479 23 -25 exon 5_ss missense 0.00166154 0.416194 66 NO 13 49039483 23 -21 exon 5_ss missense 0.000831352 1.7027778 75 NO 13 49039511 23 7 intron 5_ss splice_region and intronic 0.000839278 NA 34 NO 13 49039523 23 19 intron 5_ss intronic 0.0067619 NA 98 YES 13 49039540 23 36 intron 5_ss intronic 0.00256021 NA 28 NO 13 49039549 23 45 intron 5_ss intronic 0.0026024 NA 59 NO 13 49039554 23 50 intron 5_ss intronic 0.00789668 NA 56 NO 13 49047449 23 -47 intron 3_ss intronic 0.00331351 NA 40 NO 13 49047451 23 -45 intron 3_ss intronic 0.575751 NA 68 NO Continued on next page Table B.1 – continued from previous page site Exon/Intron Chasin score Chromosome Spliceman score Variant category Closest splice site Coordinate (hg19) Exon/Intron number ExAC allele frequency Further consideration Distance from closest splice 13 49047453 23 -43 intron 3_ss intronic 0.0405837 NA 41 NO 13 49047455 23 -41 intron 3_ss intronic 0.000828432 NA 94 YES 13 49047471 23 -25 intron 3_ss intronic 0.000827979 NA 31 YES 13 49047481 23 -15 intron 3_ss intronic 0.00662186 NA 34 YES 13 49047492 23 -4 intron 3_ss splice_region and intronic 0.000827623 NA 59 YES 13 49047497 24 1 exon 3_ss missense and splice_region 0.00248287 -0.3949513 64 YES 13 49047504 24 8 exon 3_ss missense 0.000827691 1.1036557 51 NO 13 49047510 24 14 exon 3_ss missense 0.000827623 0.7183685 63 YES 13 49047523 24 -3 exon 5_ss synonymous 0.00165659 -2.1128541 91 YES 13 49047524 24 -2 exon 5_ss missense and splice_region 0.000828473 -2.17568 84 YES 13 49047540 24 14 intron 5_ss intronic 0.00165997 NA 69 NO 13 49047546 24 20 intron 5_ss intronic 0.000830634 NA 69 NO 13 49047562 24 36 intron 5_ss intronic 0.000832321 NA 45 NO 13 49047564 24 38 intron 5_ss intronic 0.0166522 NA 74 NO 13 49047572 24 46 intron 5_ss intronic 0.00166886 NA 33 NO 13 49050793 24 -44 intron 3_ss intronic 0.000853665 NA 71 NO 13 49050817 24 -20 intron 3_ss intronic 0.00169245 NA 38 YES 13 49050819 24 -18 intron 3_ss intronic 0.000845266 NA 75 YES 13 
49050826 24 -11 intron 3_ss intronic 0.652397 NA 56 YES 13 49050875 25 38 exon 3_ss synonymous 0.00165093 1.645991 72 NO 13 49050882 25 45 exon 3_ss missense 0.0189791 -2.3973512 99 YES 13 49050899 25 62 exon 3_ss synonymous 0.00082462 -1.4074771 40 NO 13 49050919 25 -60 exon 5_ss missense 0.000824389 -0.4355552 63 NO 13 49050922 25 -57 exon 5_ss missense 0.000824362 -0.9890515 45 NO 13 49050923 25 -56 exon 5_ss synonymous 0.000824334 3.0337445 93 YES 13 49050930 25 -49 exon 5_ss synonymous 0.00082428 NA 65 NO 13 49050941 25 -38 exon 5_ss synonymous 0.000824239 0.0381317 99 YES 13 49050942 25 -37 exon 5_ss missense 0.00494544 -2.0980817 81 NO 13 49050943 25 -36 exon 5_ss missense 0.000824239 -1.4199771 84 NO 13 49050950 25 -29 exon 5_ss synonymous 0.00164845 0.9958345 53 NO 13 49050963 25 -16 exon 5_ss missense 0.000824239 -0.4756147 80 NO 13 49050968 25 -11 exon 5_ss missense 0.00741791 0.5152249 79 NO 13 49050969 25 -10 exon 5_ss missense 0.00164842 -0.1610481 33 NO 13 49050970 25 -9 exon 5_ss missense 0.000824212 3.803267 31 NO 13 49050975 25 -4 exon 5_ss missense 0.000824239 -2.915374 29 NO 13 49050987 25 8 intron 5_ss splice_region and intronic 0.0016488 NA 33 NO 13 49050995 25 16 intron 5_ss intronic 0.000824443 NA 37 NO 13 49050997 25 18 intron 5_ss intronic 0.000824375 NA 53 NO 13 49051012 25 33 intron 5_ss intronic 95.5207 NA 73 NO 13 49051024 25 45 intron 5_ss intronic 0.0140375 NA 45 NO Appendix B. 
Splicing Aberrations and Human Hereditary Diseases 13 49051029 25 50 intron 5_ss intronic 0.00165175 NA 59 NO 13 49051445 25 -46 intron 3_ss intronic 0.00562518 NA 19 YES 13 49051465 25 -26 intron 3_ss intronic 0.00107585 NA 53 YES 13 49051481 25 -10 intron 3_ss intronic 26.0321 NA 64 NO 13 49051482 25 -9 intron 3_ss intronic 0.00275022 NA 55 YES 13 49051497 26 6 exon 3_ss synonymous 0.000881896 -0.3171666 57 NO 13 49051505 26 14 exon 3_ss missense 0.000875304 -2.54360908 40 NO 13 49051506 26 15 exon 3_ss synonymous 0.000874202 -0.6871698 55 YES 13 49051521 26 -19 exon 5_ss missense 0.00174892 0.2432346 92 YES 13 49051557 26 17 intron 5_ss intronic 0.00179801 NA 60 NO 13 49051579 26 39 intron 5_ss intronic 0.00459086 NA 87 NO 13 49054083 26 -51 intron 3_ss intronic 0.000990138 NA 82 NO 13 49054084 26 -50 intron 3_ss intronic 0.001978 NA 84 NO 127 Continued on next page Table B.1 – continued from previous page 128 site Exon/Intron Chasin score Chromosome Spliceman score Variant category Closest splice site Coordinate (hg19) Exon/Intron number ExAC allele frequency Further consideration Distance from closest splice 13 49054085 26 -49 intron 3_ss intronic 0.000992221 NA 63 NO 13 49054090 26 -44 intron 3_ss intronic 0.000990374 NA 92 YES 13 49054092 26 -42 intron 3_ss intronic 0.00295055 NA 79 NO 13 49054093 26 -41 intron 3_ss intronic 0.000984504 NA 96 YES 13 49054096 26 -38 intron 3_ss intronic 0.00196259 NA 46 NO 13 49054109 26 -25 intron 3_ss intronic 0.000957946 NA 70 NO 13 49054116 26 -18 intron 3_ss intronic 0.000942454 NA 28 YES 13 49054121 26 -13 intron 3_ss intronic 0.000930821 NA 69 YES 13 49054122 26 -12 intron 3_ss intronic 0.00185739 NA 57 YES 13 49054123 26 -11 intron 3_ss intronic 0.0120482 NA 19 YES 13 49054127 26 -7 intron 3_ss splice_region and intronic 0.00183063 NA 4 YES 13 49054146 27 12 exon 3_ss missense 0.00088333 -0.6562643 95 YES 13 49054179 27 45 exon 3_ss missense 0.00175147 0.06092404 82 NO 13 49054194 27 60 exon 3_ss missense 0.00264952 
-2.0713659 77 NO 13 49054201 27 67 exon 3_ss missense 0.000887343 -1.4086353 29 NO 13 49054221 27 87 exon 3_ss 3_prime_UTR 0.00180917 0.1246464 53 NO 13 49054224 27 90 exon 3_ss 3_prime_UTR 0.000909091 1.1860224 76 NO 13 49054229 27 95 exon 3_ss 3_prime_UTR 0.000913392 -0.8977889 57 NO 13 49054232 27 98 exon 3_ss 3_prime_UTR 0.00091498 0.9686703 72 NO 13 49054233 27 99 exon 3_ss 3_prime_UTR 0.000916221 0.5625088 72 NO 13 49054234 27 100 exon 3_ss 3_prime_UTR 0.000918088 -0.5699223 86 NO 13 49054235 27 101 exon 3_ss 3_prime_UTR 0.00091932 0.176619 55 NO 13 49054238 27 104 exon 3_ss 3_prime_UTR 0.000921625 -1.1389377 71 NO 13 49054239 27 105 exon 3_ss 3_prime_UTR 0.000924539 0.3220003 62 NO 13 49054240 27 106 exon 3_ss 3_prime_UTR 0.00092495 -0.0097098 59 NO 13 49054242 27 108 exon 3_ss 3_prime_UTR 0.0120489 -0.1801982 48 NO 13 49054243 27 109 exon 3_ss 3_prime_UTR 0.000928281 0.8167725 40 NO 13 49054250 27 116 exon 3_ss 3_prime_UTR 0.000940628 -1.9176429 43 NO 13 49054256 27 122 exon 3_ss 3_prime_UTR 0.000946647 -0.24788342 66 NO 13 49054258 27 124 exon 3_ss 3_prime_UTR 0.0113928 -0.7154706 52 NO Appendix C Visualization and Inference of Splicing Aberrations 130 Table C.1: List of features of the boosted model that included the splicing efficiency of wild type Feature Gain Cover Frequency wt_vivo_splicing_efficiency 0.226619045387759 0.160169031644383 0.12568306010929 splice_site_5_alt_score 0.112184119122636 0.0691760638674371 0.0437158469945355 exon_intron_I_ratio 0.0534643985094615 0.0158898686565014 0.0273224043715847 splice_site_5_ref_score 0.0261128543772573 0.0360924533058154 0.0327868852459016 chasin_ESE_r1_alt 0.0214786082105662 0.0285784180416321 0.0218579234972678 trinucleotide_freq_ref_r2_AAC 0.0194818329622662 0.0298164767357968 0.0191256830601093 M026_0.6_alt_region_3 0.0144113862024745 0.0173762841599269 0.0109289617486339 splice_site_3_alt_score 0.0141126262832309 0.0135019179354595 0.0163934426229508 M177_0.6_alt_region_3 0.0135373283398652 
0.0186329696147862 0.0136612021857923 dinucleotide_freq_alt_r2_TA 0.0118618842029998 0.0166697090375801 0.0109289617486339 chasin_ESS_r2_ref 0.0115612384925859 0.0118031573601161 0.0136612021857923 trinucleotide_freq_alt_r2_TAG 0.0109045266241214 0.0189732184434734 0.0136612021857923 M073_0.6_alt_region_3 0.00902079350588522 0.013103305256596 0.00819672131147541 trinucleotide_freq_alt_r2_GGG 0.00868897140605968 0.016104697296512 0.0109289617486339 M153_0.6_alt_region_3 0.00810299382802634 0.0119000165157278 0.00819672131147541 trinucleotide_freq_ref_r2_TGA 0.00770620990687763 0.00708313620011598 0.00819672131147541 M167_0.6_alt_region_4 0.00741454718155548 0.0101130892730969 0.00819672131147541 M231_0.6_alt_region_4 0.0069672026491339 0.00881169959257065 0.00546448087431694 first_upstream_AG_ref 0.00684283678788311 0.00505778642059474 0.00819672131147541 EIE_alt_region_3 0.00661876547234142 0.00754135451320203 0.00819672131147541 exon_length 0.00624130069197946 0.00172111268817685 0.00273224043715847 trinucleotide_freq_ref_r3_AAA 0.00620457532797864 0.0130561174628365 0.0109289617486339 M319_0.6_alt_region_3 0.00602966516758952 0.00903770428899792 0.00546448087431694 trinucleotide_freq_alt_r2_GGA 0.00598852454114793 0.0126587465680193 0.00819672131147541 trinucleotide_freq_ref_r2_TAG 0.00556360827690694 0.0100572089910132 0.00819672131147541 M073_0.6_ref_region_4 0.00556244489663105 0.00101453756583008 0.00546448087431694 trinucleotide_freq_ref_r1_CTG 0.00552012070118283 0.00665720427223383 0.00819672131147541 dinucleotide_freq_alt_r2_GA 0.00544975334035794 0.0101987723722918 0.00819672131147541 trinucleotide_freq_ref_r2_GAT 0.0053670188769799 0.0135565564334969 0.00819672131147541 trinucleotide_freq_ref_r1_CTC 0.00518973545514846 0.00196326057720606 0.00546448087431694 dinucleotide_freq_ref_r3_GA 0.00503860148675939 0.00312184509240736 0.00546448087431694 M083_0.6_alt_region_3 0.00495651797670595 0.00506275355677995 0.00546448087431694 dinucleotide_freq_ref_r1_TT 
0.00468660663088999 0.00685588971964242 0.00546448087431694 Rescue_ESE_r1_ref 0.00463417391781286 0.00679628408541984 0.00546448087431694 trinucleotide_freq_alt_r2_CCC 0.00458992832141293 0.0180754085779958 0.0109289617486339 trinucleotide_freq_ref_r2_GCA 0.00418222925961378 0.00498948829804804 0.00546448087431694 M108_0.6_alt_region_3 0.00411580083138647 0.00642250708748244 0.00546448087431694 trinucleotide_freq_ref_r2_GGA 0.00410731411149997 0.0134497630055148 0.00819672131147541 M023_0.6_alt_region_3 0.00387004351142577 0.00399481927695882 0.00546448087431694 M068_0.6_ref_region_1 0.00385084076005588 0.0129033780251412 0.00819672131147541 P_ESS_r2_ref 0.00373284062185617 0.00147399766296242 0.00546448087431694 M151_0.6_ref_region_3 0.00365181669847533 0.000886633809060801 0.00546448087431694 first_downstream_GT_ref 0.00353436105192413 0.00414259157846896 0.00273224043715847 M017_0.6_alt_region_3 0.00344795884038178 0.000281884978510927 0.00273224043715847 M149_0.6_ref_region_2 0.00343467288694852 0.00425311035858998 0.00273224043715847 trinucleotide_freq_ref_r3_TTC 0.00336290765020132 0.00131504930503556 0.00273224043715847 M089_0.6_ref_region_2 0.00332187671885141 0.00182169719592745 0.00546448087431694 trinucleotide_freq_ref_r1_TTA 0.00308606546512448 0.000226004696427263 0.00273224043715847 M262_0.6_ref_region_5 0.00308188764064007 0.00366574650468836 0.00546448087431694 M170_0.6_ref_region_5 0.00304561410006998 0.00176705869789008 0.00273224043715847 P_ESE_r1_ref 0.00301820369604061 0.00735260333816387 0.00546448087431694 M103_0.6_ref_region_1 0.00296360475685997 0.00254938264706138 0.00273224043715847 P_ESE_r1_alt 0.00294909127175303 0.00799708925819546 0.00546448087431694 dinucleotide_freq_ref_r2_GA 0.0029087256793777 0.00832864559855854 0.00546448087431694 chasin_ESE_r1_ref 0.00285493257830208 0.00738613150741407 0.00546448087431694 trinucleotide_freq_ref_r3_TAA 0.00285076072143959 0.00390044368943974 0.00273224043715847 M140_0.6_ref_region_1 
0.00283797215026824 0.0063728357256303 0.00546448087431694 M147_0.6_ref_region_2 0.0028164310113392 0.00140942489255464 0.00273224043715847 M072_0.6_ref_region_2 0.0026975886311573 0.000163915494112081 0.00273224043715847 nucleotide_freq_ref_r2_T 0.00268451613418919 0.00376260566030004 0.00273224043715847 M044_0.6_ref_region_4 0.0026244841984936 0.000745070427782186 0.00273224043715847 exon_intron_II_ratio 0.00261653624977552 0.00133119249763751 0.00273224043715847 M163_0.6_ref_region_1 0.00259240434742842 0.000511615027077101 0.00273224043715847 M024_0.6_alt_region_3 0.00258478233902132 0.000803434277958457 0.00273224043715847 trinucleotide_freq_ref_r1_ACA 0.00256264557715061 0.00125296010272038 0.00273224043715847 M035_0.6_alt_region_3 0.00255659298361664 0.000116727700352542 0.00273224043715847 dinucleotide_freq_ref_r2_CG 0.00255592604147761 0.00713529113006074 0.00546448087431694 M105_0.6_ref_region_1 0.00252972090301034 0.000365084509613271 0.00273224043715847 trinucleotide_freq_ref_r1_ATA 0.00251885314465074 0.000674288737142878 0.00273224043715847 M061_0.6_alt_region_3 0.00250014278277346 0.0043623873546647 0.00273224043715847 trinucleotide_freq_alt_r2_GTA 0.00247908294792187 0.00427546247142344 0.00273224043715847 IIE_3_alt_region_3 0.00246855016819845 0.000948723011375984 0.00273224043715847 M052_0.6_alt_region_4 0.00246773113380761 0.00135354461047097 0.00273224043715847 M273_0.6_ref_region_3 0.00244449556124181 0.00256925119180224 0.00273224043715847 M319_0.6_ref_region_2 0.00243631796747268 0.000310446011575911 0.00273224043715847 IIE_5_alt_region_3 0.00239774253258026 0.00370424181012377 0.00273224043715847 M001_0.6_ref_region_2 0.0023866033416785 0.00398364322054209 0.00273224043715847 M144_0.6_alt_region_2 0.00236749145891642 0.00248977701283881 0.00273224043715847 M037_0.6_alt_region_3 0.00233037930159005 0.000166399062204688 0.00273224043715847
M021_0.6_alt_region_4 0.00231800352114509 0.00226004696427263 0.00273224043715847 M054_0.6_ref_region_4 0.00230291204007372 0.00197071128148388 0.00273224043715847 trinucleotide_freq_alt_r2_TTA 0.00229081178625778 0.000827028174838227 0.00273224043715847 dinucleotide_freq_ref_r2_GC 0.00228624062227937 0.00105924179149701 0.00273224043715847 dinucleotide_freq_ref_r3_AA 0.00227478558162833 0.0025568333513392 0.00273224043715847 M047_0.6_ref_region_3 0.00226644616233723 0.00362725119925294 0.00273224043715847 trinucleotide_freq_ref_r2_ACT 0.00226198986714734 0.000808401414143672 0.00273224043715847 M056_0.6_ref_region_5 0.00225408869670356 0.000389920190539344 0.00273224043715847 M112_0.6_ref_region_1 0.00225314728572878 0.00216815494484616 0.00273224043715847 trinucleotide_freq_ref_r1_GAT 0.0022391326730569 0.000733894371365453 0.00273224043715847 splice_site_3_ref_score 0.00223098970067673 0.00903770428899792 0.00546448087431694 M073_0.6_ref_region_3 0.00221267208510212 0.000346457748918717 0.00273224043715847 Wang_ESS_r1_ref 0.00219936550139629 0.0017384976648251 0.00273224043715847 dinucleotide_freq_ref_r2_AC 0.00219085458856284 0.00221285917051309 0.00273224043715847 M234_0.6_ref_region_4 0.0021859528426633 0.000180058686714028 0.00273224043715847 M262_0.6_ref_region_1 0.00218515231191829 0.00903770428899792 0.00546448087431694 EIE_alt_region_2 0.00215093055773571 0.00138707277972117 0.00273224043715847 trinucleotide_freq_ref_r1_CTT 0.00213629655878966 0.000158948357926866 0.00273224043715847 trinucleotide_freq_ref_r2_GAG 0.002127739906445 0.00202410799547494 0.00273224043715847 trinucleotide_freq_ref_r2_TGT 0.00212654838507769 0.000311687795622215 0.00273224043715847 trinucleotide_freq_ref_r2_AGA 0.00211869615696015 0.002147044616059 0.00273224043715847 M176_0.6_ref_region_4 0.00209798360102016 0.00131008216885034 0.00273224043715847
dinucleotide_freq_ref_r2_TT 0.00208276817671643 0.00284120189794274 0.00273224043715847 trinucleotide_freq_ref_r2_TTT 0.00207167219018355 0.00116976057161803 0.00273224043715847 M035_0.6_ref_region_5 0.00206205609565505 0.000330314556316769 0.00273224043715847 M021_0.6_ref_region_1 0.00206104059027894 0.000615924886966607 0.00273224043715847 trinucleotide_freq_ref_r2_TGC 0.00203882090328761 0.00175464085742705 0.00273224043715847 M109_0.6_ref_region_4 0.00201823595776332 0.00112008920976589 0.00273224043715847 M140_0.6_ref_region_2 0.00200720551281925 0.000257049297584854 0.00273224043715847 trinucleotide_freq_alt_r2_GGT 0.00200420473793718 0.000623375591244429 0.00273224043715847 trinucleotide_freq_ref_r2_GTC 0.00198799939653965 0.00134361033810054 0.00273224043715847 trinucleotide_freq_ref_r1_ATT 0.00197768061987494 0.00234945541560649 0.00273224043715847 dinucleotide_freq_ref_r2_TG 0.00195047184256026 0.00351921598722453 0.00273224043715847 EIE_ref_region_1 0.00192584902715971 0.000696640849976344 0.00273224043715847 M112_0.6_ref_region_2 0.0019136378273527 0.00295792959829528 0.00273224043715847 M349_0.6_ref_region_5 0.00189811817715973 0.00451885214449896 0.00273224043715847 trinucleotide_freq_ref_r2_GAA 0.00187992254995913 0.000623375591244429 0.00273224043715847 M242_0.6_ref_region_5 0.00186226077310159 0.00202162442738233 0.00273224043715847 M143_0.6_alt_region_3 0.00184855481370445 0.00248108452451468 0.00273224043715847 M211_0.6_ref_region_2 0.00184627867212381 0.000711542258531988 0.00273224043715847 M231_0.6_alt_region_3 0.00183564884815606 0.00277787091158125 0.00273224043715847 M168_0.6_ref_region_3 0.00183464437847521 0.000812126766282583 0.00273224043715847 trinucleotide_freq_ref_r2_AGG 0.00180580194016174 0.000373776997937397 0.00273224043715847 M022_0.6_alt_region_3 0.00180000974567739 0.00365829580041053 0.00273224043715847 M227_0.6_ref_region_1 0.00177765228332061 0.00206508686900296 0.00273224043715847 M082_0.6_ref_region_3 0.00177216843483874 
0.000584880285809016 0.00273224043715847 M161_0.6_ref_region_4 0.00177062782138779 0.00156092254620368 0.00273224043715847 M019_0.6_ref_region_2 0.00173694118813665 0.00166150705395428 0.00273224043715847 trinucleotide_freq_ref_r1_AGA 0.00173399231637597 0.000584880285809016 0.00273224043715847 M046_0.6_ref_region_2 0.00173134012507088 0.000500438970660368 0.00273224043715847 M048_0.6_alt_region_4 0.00171504216809883 0.00366947185682727 0.00273224043715847 dinucleotide_freq_ref_r2_TC 0.00170897908548646 0.00283250940961861 0.00273224043715847 Wang_ESS_r3_ref 0.0016854181592723 0.00251212912567227 0.00273224043715847 dinucleotide_freq_ref_r1_GG 0.00167424157003567 0.000483053994012117 0.00273224043715847 M069_0.6_ref_region_3 0.00167064184458533 0.000322863852038947 0.00273224043715847 M234_0.6_ref_region_3 0.00166272749326579 3.97370894817166e-05 0.00273224043715847 M105_0.6_ref_region_5 0.00166262867689942 0.000772389676800866 0.00273224043715847 M318_0.6_ref_region_5 0.00164097288878163 0.000404821599094988 0.00273224043715847 dinucleotide_freq_ref_r2_CT 0.00163244997882784 0.000889117377153409 0.00273224043715847 dinucleotide_freq_ref_r2_GG 0.00163233476030065 0.00300139203991591 0.00273224043715847 trinucleotide_freq_alt_r2_TGA 0.00161569075564488 3.10446011575911e-05 0.00273224043715847 M002_0.6_ref_region_2 0.00161332567194894 0.00451885214449896 0.00273224043715847 dinucleotide_freq_alt_r2_GT 0.00160377879914647 0.000130387324861883 0.00273224043715847 M162_0.6_alt_region_3 0.00160218482966207 3.84953054354129e-05 0.00273224043715847 M053_0.6_alt_region_2 0.00159763593727443 0.000218553992149441 0.00273224043715847 trinucleotide_freq_ref_r3_GTT 0.00158502797049588 0.00149759155984219 0.00273224043715847 M022_0.6_ref_region_3 0.00158485729765779 0.00225880518022633 0.00273224043715847 M127_0.6_ref_region_3 0.00158384568324762 0.000370051645798486 0.00273224043715847 M141_0.6_ref_region_4 0.00158050774137029 0.00202410799547494 0.00273224043715847 
M148_0.6_alt_region_4 0.0015787089360978 0.00165902348586167 0.00273224043715847 M001_0.6_ref_region_4 0.00157376285767011 0.00197071128148388 0.00273224043715847 nucleotide_freq_ref_r2_C 0.00156706435463435 0.00305230518581436 0.00273224043715847 dinucleotide_freq_ref_r2_AT 0.00156535507321145 0.0023867089369956 0.00273224043715847 trinucleotide_freq_ref_r2_AGT 0.00156243220657172 0.00135602817856358 0.00273224043715847 trinucleotide_freq_ref_r1_CAC 0.00155160454228356 0.000322863852038947 0.00273224043715847 M242_0.6_alt_region_2 0.00154564872660989 0.00269467138047891 0.00273224043715847 trinucleotide_freq_ref_r2_TCA 0.00154239158521477 5.2154929944753e-05 0.00273224043715847 M054_0.6_ref_region_3 0.0015232070528453 0.00264872537076567 0.00273224043715847 trinucleotide_freq_ref_r1_GTA 0.00151014435499394 8.6924883241255e-05 0.00273224043715847 trinucleotide_freq_ref_r2_ATT 0.00149787027249029 0.00451885214449896 0.00273224043715847 trinucleotide_freq_alt_r2_CTT 0.00147694121508965 0.00166150705395428 0.00273224043715847 M068_0.6_alt_region_4 0.00147664397348072 0.00161928639637995 0.00273224043715847 IIE_5_ref_region_4 0.00145820052650879 0.00118342019612737 0.00273224043715847 M348_0.6_alt_region_3 0.00144786730954136 0.000803434277958457 0.00273224043715847 M127_0.6_ref_region_4 0.00143180732574999 0.00424193430217325 0.00273224043715847 M077_0.6_ref_region_2 0.00142306587442881 0.00331183805149182 0.00273224043715847 trinucleotide_freq_ref_r1_CAA 0.00141703813422332 0.000584880285809016 0.00273224043715847 trinucleotide_freq_alt_r2_CGG 0.00141314235557081 0.00143922770966592 0.00273224043715847 M175_0.6_ref_region_4 0.00141020473402493 0.00152366902481457 0.00273224043715847 M195_0.6_ref_region_3 0.00140890366052813 0.00321373711183383 0.00273224043715847 M170_0.6_ref_region_3 0.00139603541343983 0.00322118781611165 0.00273224043715847 M126_0.6_ref_region_1
0.00138616483651838 0.00199927231454887 0.00273224043715847 M086_0.6_ref_region_2 0.00136772327235552 0.00039861267886347 0.00273224043715847 M145_0.6_ref_region_4 0.00136563161119356 0.000342732396779806 0.00273224043715847 trinucleotide_freq_ref_r1_ACT 0.00135085545731863 0.00130138968052622 0.00273224043715847 dinucleotide_freq_alt_r2_AA 0.00134553581744896 0.000260774649723765 0.00273224043715847 M065_0.6_ref_region_4 0.00134262993990498 0.00116106808329391 0.00273224043715847 M069_0.6_alt_region_2 0.0013023862331967 0.000276917842325712 0.00273224043715847 trinucleotide_freq_ref_r1_CTA 0.00130059805340494 0.000935063386866644 0.00273224043715847 trinucleotide_freq_alt_r2_GCC 0.00128726684487438 0.00121819014942387 0.00273224043715847 trinucleotide_freq_ref_r1_CCT 0.00127382332768373 0.00093382160282034 0.00273224043715847 trinucleotide_freq_ref_r2_CAC 0.00125912837237293 0.0035117652829467 0.00273224043715847 Rescue_ESE_r1_alt 0.00122982521960652 0.00451885214449896 0.00273224043715847 trinucleotide_freq_ref_r2_GTT 0.00122777299361904 0.00447787327097094 0.00273224043715847 dinucleotide_freq_ref_r1_GT 0.00121660586025947 0.00254814086301508 0.00273224043715847 M124_0.6_alt_region_3 0.00120248484478013 0.00440460801223902 0.00273224043715847 trinucleotide_freq_ref_r2_CAA 0.00104494044984704 0.000194960095269672 0.00273224043715847 dinucleotide_freq_ref_r1_AG 0.00102667100165864 0.00440460801223902 0.00273224043715847 chasin_ESS_r1_ref 0.000996041368294685 0.00451885214449896 0.00273224043715847 nucleotide_freq_alt_r2_T 0.000986089086961675 0.00451885214449896 0.00273224043715847 M012_0.6_ref_region_4 0.000931145062445918 0.00451885214449896 0.00273224043715847 trinucleotide_freq_ref_r1_TAC 0.000929180274941008 0.00451388500831374 0.00273224043715847 IIE_3_alt_region_2 0.000907538073159385 0.00451885214449896 0.00273224043715847
Table C.2: List of features of the boosted model that did not include the splicing efficiency of wild type
Feature Gain Cover Frequency
splice_site_5_alt_score 0.0856849457219962 0.0536148321783385 0.0301932367149758 splice_site_5_ref_score 0.0355992832240268 0.0450269471893177 0.0277777777777778 chasin_ESE_r1_alt 0.019846320965991 0.020211450876366 0.0181159420289855 splice_site_3_ref_score 0.0179467327622109 0.0239199795241296 0.0181159420289855 dinucleotide_freq_ref_r2_GC 0.0165056965168451 0.00996201919673354 0.0108695652173913 M026_0.6_alt_region_3 0.0154517346947544 0.0166644782979372 0.00966183574879227 trinucleotide_freq_ref_r2_AAC 0.0154455307116075 0.0269629213693904 0.0157004830917874 trinucleotide_freq_ref_r2_TAG 0.0136155912273 0.0207402898449001 0.0144927536231884 trinucleotide_freq_ref_r2_GAT 0.0135916929297015 0.00975629236686902 0.00845410628019324 dinucleotide_freq_alt_r2_TA 0.0123948939189044 0.0165331761741708 0.0108695652173913 dinucleotide_freq_ref_r2_TT 0.010983671431705 0.00607741258576237 0.00483091787439614 splice_site_3_alt_score 0.0107949936226416 0.0107752452536097 0.00845410628019324 M153_0.6_alt_region_3 0.009721689968184 0.00804996983681627 0.00603864734299517 P_ESS_r2_ref 0.00916319168748361 0.0024687219583742 0.00603864734299517 trinucleotide_freq_alt_r2_GGA 0.00909270304466198 0.0080493647579049 0.00966183574879227 chasin_ESE_r1_ref 0.00906180480346157 0.0125965327768221 0.00966183574879227 M177_0.6_alt_region_3 0.0086317232661131 0.00656268587267809 0.0036231884057971 trinucleotide_freq_ref_r1_CTG 0.00860616719857588 0.0078738918736087 0.00845410628019324 trinucleotide_freq_alt_r2_TAG 0.00855119720023966 0.0069977376099504 0.00966183574879227 nucleotide_freq_ref_r2_C 0.00832371811766609 0.00278880870248693 0.0036231884057971 chasin_ESS_r2_ref 0.00826130518049936 0.007456387424766 0.00603864734299517 trinucleotide_freq_ref_r2_CCT 0.00783673629182555 0.00155081724983164
0.00603864734299517 Rescue_ESE_r1_ref 0.00774957544089936 0.00478133355761592 0.00603864734299517 trinucleotide_freq_alt_r2_GGG 0.00761739792668654 0.0136765986336108 0.00845410628019324 M016_0.6_ref_region_3 0.00744335898016515 0.00475955071680673 0.00483091787439614 trinucleotide_freq_ref_r2_GTT 0.00743687698365728 0.0081855075129623 0.00603864734299517 M167_0.6_alt_region_4 0.00731661447625064 0.0065445335053371 0.0072463768115942 M147_0.6_ref_region_2 0.00690746752152153 0.0021105152428454 0.00483091787439614 M001_0.6_ref_region_4 0.00677597408625936 0.0028807806970146 0.0036231884057971 trinucleotide_freq_ref_r3_AAA 0.00663352681637593 0.0124785423891057 0.0072463768115942 dinucleotide_freq_ref_r2_GT 0.00604509415204921 0.00068131885419837 0.0036231884057971 M120_0.6_ref_region_3 0.00602686834284663 0.00405342362724235 0.00241545893719807 trinucleotide_freq_ref_r2_GGA 0.00588867414643134 0.00719741365070126 0.0072463768115942 trinucleotide_freq_ref_r1_AAA 0.00586133871146328 0.00132149234242384 0.00603864734299517 exon_intron_I_ratio 0.00578795135902259 0.00286081309293952 0.00241545893719807 trinucleotide_freq_ref_r2_GTG 0.00577322784092042 0.00308953292143595 0.00483091787439614 trinucleotide_freq_ref_r1_GAG 0.00574618147318161 0.00254072634882678 0.0036231884057971 M273_0.6_ref_region_4 0.00570797240444892 0.00320389283568417 0.0036231884057971 trinucleotide_freq_alt_r2_CCC 0.0055814257238191 0.0176150572676936 0.00966183574879227 trinucleotide_freq_ref_r2_ACC 0.00524955644032433 0.000179708436675769 0.00241545893719807 M061_0.6_alt_region_3 0.00517528011709927 0.00609193447963516 0.0036231884057971 M151_0.6_alt_region_4 0.00513808823037813 0.00484365668548664 0.0036231884057971 M083_0.6_ref_region_4 0.00506580855234183 0.00259699868758384 0.00241545893719807 M089_0.6_ref_region_2 0.00501783261970455 0.00140741354783784 0.0036231884057971 dinucleotide_freq_alt_r2_GA 0.00495828911729031 0.00742310808464086 0.00603864734299517 M227_0.6_ref_region_1 
0.00487324604159512 0.00538459723224804 0.0036231884057971 M170_0.6_ref_region_5 0.00473169970793561 0.00165549590149799 0.00241545893719807 trinucleotide_freq_ref_r2_GCA 0.00464024543682509 0.00244633403865365 0.0036231884057971 P_ESE_r1_alt 0.00457258909599286 0.00627890386324733 0.00603864734299517 M061_0.6_ref_region_4 0.00455613028866037 0.00379566001100033 0.0036231884057971 EIE_alt_region_3 0.00453795601421489 0.00727970438264706 0.00603864734299517 M036_0.6_ref_region_4 0.00447186069767512 0.00507419175071717 0.00483091787439614 M227_0.6_ref_region_3 0.00445959226374643 0.00246448640599464 0.0036231884057971 M108_0.6_alt_region_3 0.00433146527696342 0.0064943119556937 0.0036231884057971 M349_0.6_ref_region_1 0.0042661391603906 0.0131592561643927 0.0072463768115942 trinucleotide_freq_ref_r2_GGG 0.00423661658168875 0.00223516149858684 0.00241545893719807 IIE_5_alt_region_3 0.00418425683829413 0.00419077654012248 0.00483091787439614 dinucleotide_freq_ref_r1_TG 0.00414039792507911 0.0033860215880054 0.0036231884057971 M073_0.6_ref_region_3 0.00407345899884709 0.00337028953630988 0.00483091787439614 M228_0.6_ref_region_2 0.00390373181129037 0.00272890589026168 0.00241545893719807 M025_0.6_ref_region_2 0.00385510097333626 0.000614760173948085 0.00120772946859903 M147_0.6_ref_region_3 0.00382399891120764 0.00170329713549593 0.00241545893719807 trinucleotide_freq_ref_r1_ACA 0.0037336023204719 0.00100382591395657 0.0036231884057971 nucleotide_freq_alt_r2_T 0.00368182914687609 0.00880752863384678 0.00483091787439614 trinucleotide_freq_ref_r2_CGA 0.00367181528482477 0.00273798207393217 0.0036231884057971 trinucleotide_freq_ref_r2_AAA 0.00364051651208045 0.00155021217092027 0.0036231884057971 trinucleotide_freq_ref_r1_CAT 0.00342039081311575 7.20043904525809e-05 0.00120772946859903 trinucleotide_freq_ref_r3_TAA 0.00338244132746963 0.00397052781638517 0.00241545893719807 M073_0.6_alt_region_3 0.00331189225047663 0.00430090090199113 0.00241545893719807 
M022_0.6_alt_region_3 0.0032905057113943 0.00560787135054218 0.0036231884057971 M035_0.6_alt_region_3 0.00326246766282548 0.000112544677514118 0.00241545893719807 dinucleotide_freq_ref_r1_AG 0.00321428890682977 0.0067236368631015 0.00483091787439614 trinucleotide_freq_ref_r1_CCC 0.00321411696130448 0.00492836773307791 0.00483091787439614 dinucleotide_freq_ref_r2_GA 0.00320864660255673 0.00620689947279474 0.00483091787439614 M149_0.6_ref_region_2 0.00320854339244489 0.00189692238713312 0.00120772946859903 M143_0.6_ref_region_2 0.00320407154105901 0.000634727778023171 0.00120772946859903 trinucleotide_freq_ref_r2_TAC 0.00315963974041121 0.0132112929507702 0.0072463768115942 dinucleotide_freq_ref_r1_GT 0.00315785881293272 0.00234286554481003 0.00241545893719807 trinucleotide_freq_ref_r1_CAA 0.00313242570566414 0.00135174628799215 0.00241545893719807 trinucleotide_freq_ref_r1_TAT 0.00311283783473103 0.00221156342104356 0.00241545893719807 dinucleotide_freq_ref_r2_TA 0.00311071152910334 0.00104194588537264 0.00241545893719807 chasin_ESS_r1_ref 0.00310866773720741 0.00806570188851179 0.00483091787439614 M121_0.6_ref_region_4 0.00307978216452711 0.00105828301597953 0.0036231884057971 dinucleotide_freq_ref_r2_CG 0.00306547576221893 0.00853463804482061 0.00483091787439614 trinucleotide_freq_ref_r2_CTT 0.00304913493550003 0.000353366084237876 0.0036231884057971 M061_0.6_ref_region_5 0.00299969640881157 0.000104073572754991 0.00120772946859903 Rescue_ESE_r1_alt 0.00298391505652889 0.00861390338220959 0.00483091787439614 M127_0.6_ref_region_1 0.00297727669096165 0.00961409882269796 0.00603864734299517 dinucleotide_freq_ref_r2_CT 0.00297625497533412 0.00142738115191293 0.0036231884057971 M069_0.6_ref_region_5 0.00287243572109425 0.00389852342593259 0.0036231884057971 M211_0.6_alt_region_3 0.00286335567486106 0.00378597874841847 0.0036231884057971 trinucleotide_freq_ref_r2_AGC
0.00286087860672071 0.000689789958957497 0.00241545893719807 IIE_5_ref_region_3 0.00280925098164217 0.00183822973273059 0.00120772946859903 trinucleotide_freq_ref_r2_TGC 0.00278004973523154 0.00243181214478086 0.00241545893719807 dinucleotide_freq_ref_r1_CA 0.00277846371828262 0.00412361278096083 0.00483091787439614 M001_0.6_ref_region_2 0.0027224499916816 0.00184065004837606 0.00120772946859903 trinucleotide_freq_ref_r1_TTC 0.00270224951184422 0.00148909920087228 0.00241545893719807 M162_0.6_ref_region_5 0.00269930511675804 0.0010050360717793 0.0036231884057971 P_ESE_r1_ref 0.00268343521660409 0.00367706454437255 0.0036231884057971 M012_0.6_ref_region_4 0.00266918998555137 0.00285899785620542 0.00241545893719807 trinucleotide_freq_ref_r3_AGG 0.00266813898216179 0.00222971578838454 0.00241545893719807 M319_0.6_ref_region_2 0.00265987925759566 0.00087373394801283 0.00241545893719807 M162_0.6_ref_region_3 0.00258569375493735 0.0012863977655646 0.00120772946859903 M031_0.6_ref_region_3 0.00258157818769847 0.0022466579979028 0.00241545893719807 trinucleotide_freq_ref_r1_CCG 0.00252915059582005 0.0014152795736856 0.00241545893719807 M149_0.6_ref_region_3 0.00247810243570086 0.00129002823903279 0.00120772946859903 trinucleotide_freq_ref_r1_TTT 0.0024586350666885 0.000441707605297345 0.0036231884057971 M151_0.6_ref_region_2 0.00243958400804906 0.00157381024846355 0.00241545893719807 M201_0.6_ref_region_4 0.00243605780179882 0.00175049329058249 0.00241545893719807 dinucleotide_freq_ref_r2_AT 0.00232413122308186 0.00264419484267041 0.0036231884057971 trinucleotide_freq_alt_r2_GCC 0.00231240999904512 0.00678172443859266 0.00483091787439614 M022_0.6_ref_region_3 0.00230937399578609 0.00211596095304769 0.00241545893719807 M319_0.6_alt_region_3 0.00223740699452989 0.0022018821584617 0.00120772946859903 dinucleotide_freq_ref_r2_TG 0.00221949682110992 0.00342051108595327 0.00241545893719807 M065_0.6_ref_region_4 0.00219673062343251 0.000746062297714556 0.00241545893719807 
M126_0.6_ref_region_3 0.0021907001113793 0.0013705037342445 0.00120772946859903 M232_0.6_ref_region_2 0.00214677201949279 0.000969336416008694 0.00120772946859903 M175_0.6_ref_region_3 0.00213824598311648 0.00160587943076596 0.00241545893719807 M013_0.6_alt_region_4 0.00210724011911297 0.000157925595866585 0.00241545893719807 trinucleotide_freq_ref_r3_ATT 0.0020627963313293 0.000685554406577934 0.00120772946859903 M147_0.6_ref_region_1 0.00204379648250237 0.000140378307436964 0.0036231884057971 M291_0.6_ref_region_4 0.00197375301971973 0.00872039727061005 0.00483091787439614 M290_0.6_ref_region_4 0.00196605030737427 0.000917904708542565 0.00241545893719807 IIE_3_alt_region_2 0.00195830103198188 0.00660564647538509 0.0036231884057971 M163_0.6_ref_region_1 0.001946020902195 0.000392091134565314 0.00120772946859903 M169_0.6_ref_region_5 0.00193212283239788 0.00398383955243523 0.00241545893719807 trinucleotide_freq_ref_r1_CAG 0.00191098490326087 0.00129426379141236 0.00241545893719807 M167_0.6_ref_region_3 0.00190900019618462 0.00166517716407985 0.00120772946859903 M145_0.6_ref_region_4 0.00189824833526965 0.000392696213476681 0.00120772946859903 trinucleotide_freq_ref_r2_GAC 0.00189191479519977 0.00108248617243418 0.00120772946859903 trinucleotide_freq_alt_r2_GTC 0.00188885410026947 0.00290619401129198 0.0036231884057971 first_upstream_AG_ref 0.00188552073947861 0.00520307355883818 0.0036231884057971 IIE_5_ref_region_1 0.00188319315320979 0.00140378307436964 0.00120772946859903 M112_0.6_ref_region_1 0.00188289522556465 0.00139228657505369 0.00120772946859903 dinucleotide_freq_ref_r2_GG 0.00186642316561667 0.00258731742500198 0.00241545893719807 dinucleotide_freq_alt_r2_GG 0.00186399681634069 0.00114238898465943 0.00241545893719807 M159_0.6_ref_region_4 0.00179970833904775 0.00608104305923057 0.0036231884057971 trinucleotide_freq_ref_r1_GGG 0.00179183963442389 0.00108490648807964 0.00241545893719807 trinucleotide_freq_ref_r3_CAG 0.00177255048893839 0.000226299512850968 
0.00120772946859903 IIE_3_ref_region_5 0.00176400369403833 0.000104073572754991 0.00120772946859903 M051_0.6_alt_region_3 0.00175681779105109 0.0010032208350452 0.00241545893719807 trinucleotide_freq_ref_r1_GTC 0.00174049136036668 0.00346649708321711 0.00241545893719807 M231_0.6_ref_region_5 0.00173184400538036 0.00185880241571705 0.00120772946859903 M068_0.6_alt_region_2 0.00172690056659078 0.000537310073293209 0.00120772946859903 EIE_alt_region_2 0.00171237260677078 0.00146973667570856 0.00241545893719807 M151_0.6_ref_region_3 0.00166657116772917 0.000353366084237876 0.00120772946859903 M290_0.6_ref_region_3 0.00166087528942366 0.000718833746703076 0.00241545893719807 M022_0.6_ref_region_1 0.00163676183241677 0.000658325855566454 0.00241545893719807 dinucleotide_freq_ref_r3_GA 0.00160300655889139 0.0007327505616645 0.00120772946859903 M158_0.6_ref_region_4 0.00157957104499385 0.00130394505399422 0.00120772946859903 L1_distance_sum 0.00157546587179504 0.00113573311663441 0.00241545893719807 trinucleotide_freq_ref_r1_GCG 0.00157093740720975 0.000292253114189887 0.00120772946859903 trinucleotide_freq_ref_r3_GTA 0.00155623598174527 0.00657297221417131 0.0036231884057971 trinucleotide_freq_ref_r2_TAA 0.00151577338348705 0.00015550528022112 0.00241545893719807 IIE_3_ref_region_4 0.00150877786707176 0.00370368801647267 0.00241545893719807 M072_0.6_ref_region_2 0.00150840473719667 0.00271075352292069 0.00241545893719807 trinucleotide_freq_ref_r2_ACG 0.00150153892526461 0.000597817964429831 0.00120772946859903 M242_0.6_ref_region_4 0.00150070538695912 0.00114904485268446 0.00241545893719807 M054_0.6_ref_region_3 0.00149647736571312 0.00119866132341649 0.00120772946859903 trinucleotide_freq_alt_r2_CAC 0.00149229046395257 0.00657357729308268 0.0036231884057971
M075_0.6_ref_region_4 0.00147258641990305 0.00418835622447701 0.00241545893719807 trinucleotide_freq_alt_r2_ATA 0.00147091470504124 0.000168211937359811 0.00241545893719807 dinucleotide_freq_alt_r2_TT 0.00146449532248762 0.000223274118294137 0.00241545893719807 IIE_3_alt_region_3 0.00146146390053942 0.000217828408091841 0.00120772946859903 dinucleotide_freq_ref_r1_GA 0.0014600279582027 0.000259578852976111 0.00120772946859903 M121_0.6_ref_region_3 0.00145446473680438 0.000168817016271177 0.00120772946859903 trinucleotide_freq_ref_r1_GTT 0.00145235962456006 0.000473171708688389 0.00120772946859903 M140_0.6_ref_region_1 0.00144169440492023 0.00192657125379006 0.00120772946859903 M151_0.6_alt_region_3 0.0014228608916043 0.000184549067966699 0.00120772946859903 M262_0.6_ref_region_1 0.00142191017233013 0.000643198882782298 0.00120772946859903 M170_0.6_ref_region_2 0.00142179135748859 0.000248687432571519 0.00120772946859903 trinucleotide_freq_ref_r2_TGG 0.00140864480482242 0.000124646255741443 0.00120772946859903 M318_0.6_ref_region_2 0.00140009733517448 0.00224907831354826 0.00241545893719807 dinucleotide_freq_alt_r2_AT 0.0013809138587306 0.00419198669794521 0.00241545893719807 M049_0.6_ref_region_1 0.00137791356567057 0.00151814298861786 0.00241545893719807 trinucleotide_freq_ref_r1_AGT 0.00137456103268904 0.00216678758160245 0.00241545893719807 M155_0.6_ref_region_4 0.00135657699186415 0.00423313206391812 0.00241545893719807 trinucleotide_freq_alt_r2_GGT 0.00131320744545483 0.00156594422261579 0.00241545893719807 M024_0.6_ref_region_3 0.00130663012662392 0.00199555024968581 0.00120772946859903 trinucleotide_freq_ref_r2_ATG 0.00129536177501554 0.00119019021865737 0.00241545893719807 Wang_ESS_r1_alt 0.00128884875588037 0.0029685171391627 0.00241545893719807 trinucleotide_freq_ref_r3_ATG 0.00128319568221331 0.000455019341347402
0.00120772946859903 M236_0.6_ref_region_4 0.00127627522324203 0.00417564956733832 0.00241545893719807 M142_0.6_ref_region_5 0.00126385497230503 0.000890071078619718 0.00120772946859903 M087_0.6_ref_region_1 0.0012634007086213 0.00054033546785004 0.00120772946859903 dinucleotide_freq_alt_r2_CG 0.00126197782797973 0.00405886933744464 0.00241545893719807 trinucleotide_freq_ref_r2_TAT 0.00124280074469055 0.0008241174772808 0.00120772946859903 trinucleotide_freq_alt_r2_CTA 0.00123070311924787 0.000294068350923986 0.00120772946859903 trinucleotide_freq_ref_r2_AAT 0.00122865261524917 0.00172871044977331 0.00120772946859903 M056_0.6_alt_region_3 0.00122760377803931 3.81199714160722e-05 0.00241545893719807 M148_0.6_ref_region_2 0.00122592880320247 0.000168211937359811 0.00120772946859903 trinucleotide_freq_ref_r2_CTA 0.00122544544995496 8.41059686799054e-05 0.00241545893719807 chasin_ESS_r2_alt 0.00122299069348226 8.16856530344405e-05 0.00120772946859903 IIE_5_ref_region_5 0.00121338473354508 2.72285510114802e-05 0.00241545893719807 M083_0.6_ref_region_3 0.00120334115845567 0.000157320516955219 0.00120772946859903 trinucleotide_freq_alt_r2_TCC 0.00118654829517084 0.00240155819921255 0.00241545893719807 M111_0.6_ref_region_5 0.00118581800349251 0.000444732999854176 0.00120772946859903 M318_0.6_ref_region_4 0.00117544559260269 0.000102863414932258 0.00120772946859903 M082_0.6_ref_region_3 0.00117444265250726 0.000321296901935466 0.00120772946859903 M002_0.6_ref_region_3 0.00117093209759595 0.000612339858302621 0.00120772946859903 M087_0.6_ref_region_5 0.00116552573174848 0.00039753684476761 0.00120772946859903 M349_0.6_ref_region_5 0.00116272009027463 0.000203911593130418 0.00120772946859903 M023_0.6_ref_region_5 0.00112289386287637 0.00044896855223374 0.00120772946859903 M159_0.6_ref_region_1 0.00112263699165796 9.86278625526948e-05 0.00120772946859903 Wang_ESS_r3_ref 0.00112112111380983 0.000327347691049128 0.00120772946859903 M142_0.6_ref_region_2 0.00112083243112486 
0.00422466095915899 0.00241545893719807 M235_0.6_alt_region_2 0.001104361966756 0.000235980775432828 0.00120772946859903 nucleotide_freq_ref_r2_G 0.00109844792956596 0.00417564956733832 0.00241545893719807 trinucleotide_freq_ref_r1_AAC 0.00109481280805565 0.000656510618832355 0.00120772946859903 M319_0.6_alt_region_4 0.00109318557435288 0.000362442267908369 0.00120772946859903 M228_0.6_ref_region_4 0.00106923238638463 0.00197800296125619 0.00120772946859903 M160_0.6_alt_region_3 0.00106920439970023 0.00216073679248879 0.00241545893719807 M121_0.6_ref_region_2 0.00106495885931425 0.000795073689535221 0.00120772946859903 M026_0.6_ref_region_4 0.00106447913428432 0.000772080690903304 0.00120772946859903 M143_0.6_alt_region_2 0.00106010191743227 0.00202943466872232 0.00120772946859903 trinucleotide_freq_alt_r2_TCT 0.00105225830264984 0.000141588465259697 0.00241545893719807 trinucleotide_freq_ref_r2_AGG 0.00104990500584548 0.0042603606149296 0.00241545893719807 M201_0.6_ref_region_5 0.00104764630794906 0.0010982182241297 0.00120772946859903 M242_0.6_ref_region_1 0.00104730337105536 9.74177047299624e-05 0.00120772946859903 M243_0.6_ref_region_4 0.00103837940504791 0.00016518654280298 0.00120772946859903 M211_0.6_ref_region_2 0.00103450824551724 0.000398747002590343 0.00120772946859903 M013_0.6_ref_region_4 0.00103184917579028 0.000202096356396319 0.00120772946859903 M042_0.6_ref_region_3 0.00102853985858797 0.000281966772696661 0.00120772946859903 M065_0.6_alt_region_3 0.00102506389093992 7.26094693639471e-06 0.00120772946859903 M065_0.6_alt_region_4 0.00101884818908009 0.00440376431692339 0.00241545893719807 trinucleotide_freq_ref_r2_CCA 0.0010186453884323 0.0017377866334438 0.00120772946859903 trinucleotide_freq_ref_r2_GGC 0.00101802299918068 0.00415507688435187 0.00241545893719807 M037_0.6_alt_region_2 0.00101072865137626 0.00143524717776069 0.00120772946859903 dinucleotide_freq_alt_r2_CC 0.00100101480472363 0.00425915045710686 0.00241545893719807 
M168_0.6_ref_region_3 0.000997741624427407 7.26094693639471e-05 0.00120772946859903
trinucleotide_freq_ref_r1_AGG 0.000997204876074657 8.28958108571729e-05 0.00120772946859903
trinucleotide_freq_ref_r1_TAA 0.000992064848068086 0.000596002727695732 0.00120772946859903
M020_0.6_ref_region_1 0.000989025321144563 0.000186969383612164 0.00120772946859903
M037_0.6_alt_region_3 0.000985356299881373 0.000158530674777951 0.00120772946859903
dinucleotide_freq_alt_r2_GT 0.000972368702850426 6.59536013389186e-05 0.00120772946859903
trinucleotide_freq_ref_r1_GCA 0.000970363778400732 3.50945768592411e-05 0.00120772946859903
trinucleotide_freq_ref_r2_AAG 0.000963702383966026 0.00110063853977516 0.00120772946859903
P_ESS_r1_alt 0.000959506909107186 0.00027894137813983 0.00120772946859903
translated_in_vitro_ref 0.000956207837985616 0.000284992167253492 0.00120772946859903
trinucleotide_freq_ref_r1_CTC 0.000953087213058328 0.000634727778023171 0.00120772946859903
trinucleotide_freq_ref_r1_TCT 0.00095292237124975 0.00131120600093061 0.00120772946859903
first_upstream_GT_ref 0.000949177505833293 0.00417928004080652 0.00241545893719807
M036_0.6_ref_region_2 0.00094805546597036 0.00430634661219343 0.00241545893719807
M019_0.6_ref_region_3 0.000942502516005948 0.000310405481530874 0.00120772946859903
M236_0.6_alt_region_3 0.000938221813950283 4.35656816183683e-05 0.00120772946859903
M242_0.6_ref_region_2 0.000934658688362619 0.000210567461155447 0.00120772946859903
M120_0.6_alt_region_4 0.000934633588779932 0.00424704887887954 0.00241545893719807
trinucleotide_freq_ref_r2_TTT 0.000932060542600461 0.00096994149492006 0.00120772946859903
trinucleotide_freq_ref_r1_AAG 0.000929134355707191 6.23231278707213e-05 0.00120772946859903
M167_0.6_alt_region_3 0.000928612938280634 0.000166396700625712 0.00241545893719807
M043_0.6_ref_region_4 0.000915303397080843 9.07618367049339e-05 0.00120772946859903
P_ESS_r3_ref 0.000914967580310105 0.000259578852976111 0.00120772946859903
M195_0.6_ref_region_4 0.000900813866166878 0.0043499122938118 0.00241545893719807
M002_0.6_ref_region_1 0.000891725651818587 2.11777618978179e-05 0.00120772946859903
chasin_ESS_r3_ref 0.000888269130574767 5.68774176684252e-05 0.00120772946859903
M162_0.6_alt_region_3 0.000870428455352034 3.81199714160722e-05 0.00120772946859903
M053_0.6_ref_region_3 0.000869904488814481 4.05402870615371e-05 0.00120772946859903
first_downstream_AG_alt 0.000866357869819013 0.000734565798398598 0.00120772946859903
M126_0.6_ref_region_2 0.000861660766461348 4.17504448842696e-05 0.00120772946859903
trinucleotide_freq_alt_r2_ATC 0.000853120492334697 0.000225089355028236 0.00120772946859903
trinucleotide_freq_ref_r1_GTA 0.000851861518201787 0.0010032208350452 0.00120772946859903
trinucleotide_freq_ref_r3_GCA 0.000851726889327608 0.00435414784619136 0.00241545893719807
M048_0.6_ref_region_3 0.000846791441839217 0.00133480407847389 0.00120772946859903
trinucleotide_freq_ref_r1_TGA 0.000842391524527227 0.00178558786744173 0.00120772946859903
trinucleotide_freq_ref_r1_TAG 0.000840595892207064 3.99352081501709e-05 0.00120772946859903
nucleotide_freq_alt_r2_A 0.000831802361719349 0.00107885569896598 0.00120772946859903
M040_0.6_ref_region_4 0.000824181775879423 0.000317666428467269 0.00120772946859903
trinucleotide_freq_ref_r2_CAC 0.00081851706451636 0.0017861929463531 0.00120772946859903
M021_0.6_ref_region_2 0.000808291169301246 0.000737591192955429 0.00120772946859903
M075_0.6_alt_region_3 0.000807042545823086 0.000295883587658084 0.00120772946859903
M037_0.6_ref_region_2 0.00080655862895949 0.000119805624450513 0.00120772946859903
M201_0.6_alt_region_2 0.000806075991209083 0.000361232110085637 0.00120772946859903
M211_0.6_ref_region_5 0.00079637906082911 0.000611734779391254 0.00120772946859903
trinucleotide_freq_ref_r1_GGT 0.00079594401453301 9.3787231261765e-05 0.00120772946859903
trinucleotide_freq_ref_r1_GAC 0.000792329256704513 0.000147034175461993 0.00120772946859903
trinucleotide_freq_ref_r1_TGC 0.00079179527646546 9.86278625526948e-05 0.00120772946859903
IIE_5_alt_region_2 0.00079167010679267 0.000576035123620647 0.00120772946859903
trinucleotide_freq_alt_r2_CTG 0.00079159231137484 0.000307985165885409 0.00120772946859903
IIE_5_ref_region_2 0.00079066926176391 0.0022018821584617 0.00120772946859903
M021_0.6_ref_region_1 0.000789255965724549 9.68126258185961e-05 0.00120772946859903
trinucleotide_freq_alt_r2_ACT 0.000788033566511277 0.00015550528022112 0.00120772946859903
M053_0.6_ref_region_5 0.000787180501562573 0.000145218938727894 0.00120772946859903
trinucleotide_freq_alt_r2_TTA 0.000779132645155758 0.000432026342715485 0.00120772946859903
M142_0.6_alt_region_3 0.00077390378233819 6.05078911366226e-06 0.00120772946859903
M150_0.6_ref_region_4 0.000759695285329098 0.000347315295124214 0.00120772946859903
dinucleotide_freq_ref_r1_TC 0.000757084277878313 2.35980775432828e-05 0.00120772946859903
M143_0.6_alt_region_3 0.0007570745417154 0.00123194066354164 0.00120772946859903
M231_0.6_ref_region_4 0.000756487174346678 0.000607499227011691 0.00120772946859903
trinucleotide_freq_ref_r2_CAG 0.000755039552113031 0.000323112138669565 0.00120772946859903
M157_0.6_alt_region_2 0.000754961121348088 0.00181765704974414 0.00120772946859903
M082_0.6_ref_region_1 0.000745080986862225 0.0016427892443593 0.00120772946859903
M108_0.6_ref_region_2 0.00073693672014726 0.000356391478794707 0.00120772946859903
M079_0.6_ref_region_4 0.000734757033287552 0.000200281119662221 0.00120772946859903
M148_0.6_ref_region_4 0.000730583062402062 3.02539455683113e-05 0.00120772946859903
M004_0.6_ref_region_5 0.0007280143130539 0.00120592227035289 0.00120772946859903
trinucleotide_freq_alt_r2_CAG 0.000726335595266476 0.000358811794440172 0.00120772946859903
IIE_3_ref_region_2 0.000725812561046123 0.000409638422994935 0.00120772946859903
trinucleotide_freq_alt_r2_AAA 0.000724744130055324 0.000138563070702866 0.00120772946859903
trinucleotide_freq_ref_r1_GCT 0.000723467148982531 3.08590244796775e-05 0.00120772946859903
M120_0.6_alt_region_3 0.000721790785533951 0.00024687219583742 0.00120772946859903
exon_intron_II_ratio 0.000713340573875407 6.47434435161862e-05 0.00120772946859903
first_downstream_AG_ref 0.000712787513803575 1.51269727841556e-05 0.00120772946859903
trinucleotide_freq_alt_r2_CTT 0.000712139639172451 5.14317074661292e-05 0.00120772946859903
trinucleotide_freq_alt_r2_CTC 0.000709132457936678 0.000371518451578863 0.00120772946859903
M069_0.6_ref_region_2 0.000708048584318292 0.000226299512850968 0.00120772946859903
trinucleotide_freq_alt_r2_TGG 0.000706563793620764 0.000694025511337061 0.00120772946859903
M069_0.6_ref_region_4 0.000704172928439631 0.00108369633025691 0.00120772946859903
M155_0.6_ref_region_1 0.000701313448104991 5.14317074661292e-05 0.00120772946859903
M048_0.6_ref_region_5 0.000700471101421707 0.00037514892504706 0.00120772946859903
trinucleotide_freq_ref_r1_TCA 0.000700432197262535 6.1718048959355e-05 0.00120772946859903
M143_0.6_ref_region_5 0.00069872664611747 0.000335818795808255 0.00120772946859903
trinucleotide_freq_alt_r2_AAG 0.000697283393026723 0.000810805741230743 0.00120772946859903
M234_0.6_alt_region_4 0.000696370002672048 0.0022018821584617 0.00120772946859903
M111_0.6_alt_region_2 0.000693470214595666 0.00209417811223851 0.00120772946859903
M234_0.6_alt_region_2 0.000692646370069365 0.00151632775188376 0.00120772946859903
dinucleotide_freq_ref_r2_CC 0.000690332526422215 0.000109519282957287 0.00120772946859903
trinucleotide_freq_alt_r2_TAA 0.000690156325594972 0.000180918594498502 0.00120772946859903
M262_0.6_ref_region_2 0.000687736289358452 0.00202096356396319 0.00120772946859903
trinucleotide_freq_ref_r3_CAT 0.000686426489064437 0.00218191455438661 0.00120772946859903
dinucleotide_freq_ref_r3_TC 0.000683239710992878 0.000341264506010551 0.00120772946859903
M069_0.6_alt_region_4 0.000680927325683109 0.000148849412196092 0.00120772946859903
M209_0.6_ref_region_2 0.000678059253557091 2.48082353660153e-05 0.00120772946859903
nucleotide_freq_alt_r2_G 0.000676962527014372 0.00117445816696184 0.00120772946859903
M235_0.6_alt_region_3 0.00067270338680017 4.47758394411007e-05 0.00120772946859903
M056_0.6_ref_region_3 0.000670570990144911 0.000203911593130418 0.00120772946859903
M105_0.6_alt_region_2 0.000668531436867444 0.00148244333284725 0.00120772946859903
translated_in_vitro_alt 0.000667475375256859 0.000207542066598615 0.00120772946859903
M195_0.6_alt_region_3 0.000665256863617794 0.000856791738494576 0.00120772946859903
M016_0.6_ref_region_2 0.00066399755272024 2.66234721001139e-05 0.00120772946859903
M053_0.6_ref_region_1 0.000659828182514369 0.00103589509625898 0.00120772946859903
M026_0.6_ref_region_5 0.000656877890804702 0.000591767175316169 0.00120772946859903
M047_0.6_ref_region_3 0.000651548267765726 0.000116780229893682 0.00120772946859903
M143_0.6_ref_region_4 0.000648392105553881 0.00209296795441577 0.00120772946859903
M170_0.6_ref_region_1 0.00063922384327951 4.78012339979318e-05 0.00120772946859903
M017_0.6_ref_region_3 0.000636842101699629 0.000139773228525598 0.00120772946859903
M088_0.6_alt_region_3 0.000634842379853533 0.00012827672920964 0.00120772946859903
M273_0.6_ref_region_1 0.000628621041211851 0.000763609586144177 0.00120772946859903
trinucleotide_freq_alt_r2_AGT 0.000628071046376848 0.00202943466872232 0.00120772946859903
M013_0.6_alt_region_2 0.000627601978391258 0.00112000106493888 0.00120772946859903
M016_0.6_alt_region_3 0.0006218615661462 3.81199714160722e-05 0.00120772946859903
IIE_3_alt_region_4 0.000619721981519549 7.68450217435107e-05 0.00120772946859903
M089_0.6_ref_region_5 0.000619227136277386 0.000642593803870932 0.00120772946859903
trinucleotide_freq_ref_r1_TTG 0.00061738920189809 0.000747272455537289 0.00120772946859903
M209_0.6_ref_region_1 0.000613595832181705 0.000462885367195163 0.00120772946859903
M021_0.6_ref_region_5 0.000611885782349537 5.68774176684252e-05 0.00120772946859903
M229_0.6_alt_region_2 0.000610677448014438 0.00159014737907044 0.00120772946859903
M273_0.6_alt_region_3 0.000610481515497713 9.49973890844974e-05 0.00120772946859903
dinucleotide_freq_alt_r2_TC 0.000608244088894225 9.86278625526948e-05 0.00120772946859903
M023_0.6_ref_region_3 0.000607054841565286 0.000254133142773815 0.00120772946859903
M035_0.6_ref_region_2 0.000606582183798506 0.000134327518323302 0.00120772946859903
trinucleotide_freq_ref_r2_ACA 0.000596116421953147 1.57320516955219e-05 0.00120772946859903
M159_0.6_alt_region_3 0.000594680071738475 0.00108853696154784 0.00120772946859903
M177_0.6_ref_region_4 0.00058236852792819 3.02539455683113e-05 0.00120772946859903
trinucleotide_freq_ref_r1_CCA 0.000575411675460498 0.000394511450210779 0.00120772946859903
trinucleotide_freq_alt_r2_CGG 0.000564001461310919 0.000651064908630059 0.00120772946859903
M026_0.6_ref_region_1 0.000561508622490995 0.000142798623082429 0.00120772946859903
nucleotide_freq_alt_r2_C 0.000558030280474297 0.00010588880948909 0.00120772946859903
trinucleotide_freq_ref_r1_GAT 0.000556857891740283 0.00171479363481188 0.00120772946859903
M056_0.6_ref_region_5 0.000552202293637465 3.7514892504706e-05 0.00120772946859903
M149_0.6_ref_region_4 0.000550355958597403 0.00206694956122703 0.00120772946859903
dinucleotide_freq_ref_r3_GT 0.000545462326031505 0.00184186020619879 0.00120772946859903
M124_0.6_alt_region_4 0.000540866610127338 7.5029785009412e-05 0.00120772946859903
M140_0.6_alt_region_3 0.000537922432131363 0.00208933748094758 0.00120772946859903
M178_0.6_ref_region_4 0.000537359416292828 0.00122770511116207 0.00120772946859903
trinucleotide_freq_ref_r3_ATC 0.000531239965888014 0.000763609586144177 0.00120772946859903
trinucleotide_freq_ref_r2_TCC 0.000529392681880782 0.002146214898616 0.00120772946859903
dinucleotide_freq_ref_r1_TT 0.000519679786676864 0.00204940227279741 0.00120772946859903
M037_0.6_ref_region_3 0.000517099178005148 4.84063129092981e-06 0.00120772946859903
M055_0.6_alt_region_3 0.000514039060592604 0.0022018821584617 0.00120772946859903
M151_0.6_ref_region_4 0.000503642075454701 0.00211838126869316 0.00120772946859903
M144_0.6_ref_region_1 0.000493239751241447 0.00212019650542726 0.00120772946859903
M229_0.6_alt_region_4 0.00046281710325633 7.5029785009412e-05 0.00120772946859903
M055_0.6_ref_region_4 0.000458222000510016 0.0021480301353501 0.00120772946859903
trinucleotide_freq_ref_r2_TGA 0.000450246045764538 6.77688380730173e-05 0.00120772946859903
trinucleotide_freq_ref_r1_GCC 0.000444689174691617 0.00212685237345228 0.00120772946859903
trinucleotide_freq_ref_r2_TTG 0.000433655524401261 0.00217707392309568 0.00120772946859903
M229_0.6_alt_region_3 0.000432634470383574 7.86602584776094e-05 0.00120772946859903
M227_0.6_ref_region_4 0.000430940975132689 0.0022018821584617 0.00120772946859903
EIE_ref_region_5 0.000392376471726889 0.0022018821584617 0.00120772946859903
trinucleotide_freq_ref_r2_GAG 0.00038496837317453 0.0022018821584617 0.00120772946859903
trinucleotide_freq_alt_r2_TAC 0.000382188870042461 0.00213895395167961 0.00120772946859903
Wang_ESS_r2_ref 0.000373277676341528 0.0022018821584617 0.00120772946859903
M147_0.6_ref_region_4 0.000373053148910588 0.00214500474079327 0.00120772946859903
M157_0.6_ref_region_3 0.000355742889629934 0.00219583136934803 0.00120772946859903
dinucleotide_freq_alt_r2_AC 0.000345632565468438 0.0022018821584617 0.00120772946859903
trinucleotide_freq_ref_r2_TTC 0.000343325642237126 0.0022018821584617 0.00120772946859903
M044_0.6_ref_region_4 0.000343322059798908 0.0022018821584617 0.00120772946859903
dinucleotide_freq_ref_r2_CA 0.00032869044981896 0.0022018821584617 0.00120772946859903
first_downstream_GT_ref 0.000319666371838239 0.0022018821584617
0.00120772946859903
(see p. 17) [90] H. G E and J. L. M A N L E Y . A protein factor, ASF, controls cell-specific alternative splicing of SV40 early pre-mRNA in vitro. Cell, 62: 25 – 34, 1990. (see p. 17) [91] T. B. C H O U , Z. Z A C H A R , and P. M. B I N G H A M . Developmental expression of a regulatory gene is programmed at the level of splicing. EMBO J, 6: 4095 – 104, 1987. (see p. 18) [92] R. T. B O G G S , P. G R E G O R , S. I D R I S S , J. M. B E L O T E , and M. M C K E O W N . Regulation of sexual differentiation in D. melanogaster via alternative splicing of RNA from the transformer gene. Cell, 50: 739 – 47, 1987. (see p. 18) [93] H. A M R E I N , M. G O R M A N , and R. N O T H I G E R . The sex-determining gene tra-2 of Drosophila encodes a putative RNA binding protein. Cell, 55: 1025 – 35, 1988. (see p. 18) [94] P. J. S H E PA R D and K. J. H E RT E L . The SR protein family. Genome Biol, 10: 242, 2009. D O I : 10.1186/gb-2009-10-10-242 (see p. 18) [95] A. B U S C H and K. J. H E RT E L . Evolution of SR protein and hnRNP splicing regulatory factors. Wiley Interdiscip Rev RNA, 3: 1 – 12, 2012. D O I : 10.1002/wrna.100 (see p. 18) [96] M. S C H N E I D E R , C. L. W I L L , M. A N O K H I N A , J. TA Z I , H. U R L AU B , and R. L U H R M A N N . Exon definition complexes contain the tri-snRNP and can be directly converted into B-like precatalytic splicing complexes. Mol Cell, 38: 223 – 35, 2010. D O I : 10.1016/j.molcel.2010. 02.027 (see p. 18) [97] S. F U R U YA M A and J. P. B R U Z I K . Multiple roles for SR proteins in trans splicing. Mol Cell Biol, 22: 5337 – 46, 2002. (see p. 18) [98] M. N E U M A N N , E. B E N T M A N N , D. D O R M A N N , A. J AWA I D , M. D E J E S U S -H E R N A N D E Z , O. A N S O R G E , S. R O E B E R , H. A. K R E T Z S C H M A R , D. G. M U N O Z , H. K U S A KA , O. Y O KO TA , L. C. A N G , J. B I L B A O , R. R A D E M A K E R S , C. H A A S S , and I. R. M A C K E N Z I E . 
FET proteins TAF15 and EWS are selective markers that distinguish FTLD with FUS pathology from amyotrophic lateral sclerosis with FUS mutations. Brain, 134: 2595 – 609, 2011. D O I : 10.1093/brain/ awr201 (see p. 18) [99] H. D E N G , K. G A O , and J. J A N KOV I C . The role of FUS gene variants in neurodegenerative diseases. Nat Rev Neurol, 10: 337 – 48, 2014. D O I : 10.1038/nrneurol.2014.78 (see p. 18) [100] C. G I R A R D , C. L. W I L L , J. P E N G , E. M. M A KA R OV , B. K A S T N E R , I. L E M M , H. U R L AU B , K. H A RT M U T H , and R. L U H R M A N N . Post-transcriptional spliceosomes are retained in nuclear speckles until splicing completion. Nat Commun, 3: 994, 2012. D O I : 10.1038/ncomms1998 (see p. 18) [101] L. E N G , G. C O U T I N H O , S. N A H A S , G. Y E O , R. TA N O U Y E , M. B A B A E I , T. D O R K , C. B U R G E , and R. A. G AT T I . Nonclassical splicing mutations in the coding and noncoding regions of the ATM Gene: maximum entropy estimates of splice junction strengths. Hum Mutat, 23: 67 – 76, 2004. D O I : 10.1002/humu.10295 (see p. 19) Bibliography 147 [102] K. X I A , D. Z H E N G , Q. PA N , Z. L I U , X. X I , Z. H U , H. D E N G , X. L I U , D. J I A N G , H. D E N G , and J. X I A . A novel PRPF31 splice-site mutation in a Chinese family with autosomal dominant retinitis pigmentosa. Mol Vis, 10: 361 – 5, 2004. (see p. 19) [103] J. M. H A RT I KA I N E N , M. M. P I R S KA N E N , A. H. A R FF M A N , U. K. R I S T O N M A A , and A. J. M A N N E R M A A . A Finnish BRCA1 exon 12 4216-2nt A to G splice acceptor site mutation causes aberrant splicing and frameshift, leading to protein truncation. Hum Mutat, 15: 120, 2000. D O I : 10.1002/(SICI)1098-1004(200001)15:1<120::AID-HUMU31>3.0.CO;2-E (see p. 19) [104] H. S U N and L. A. C H A S I N . Multiple splicing defects in an intronic false exon. Mol Cell Biol, 20: 6414 – 25, 2000. (see p. 19) [105] N. P. B U R R O W S , A. C. N I C H O L L S , A. J. R I C H A R D S , C. 
L U C CA R I N I , J. B. H A R R I S O N , J. R. Y AT E S , and F. M. P O P E . A point mutation in an intronic branch site results in aberrant splicing of COL5A1 and in Ehlers-Danlos syndrome type II in two British families. Am J Hum Genet, 63: 390 – 8, 1998. D O I : 10.1086/301948 (see p. 19) [106] L. B. C R O T T I and D. S. H O R O W I T Z . Exon sequences at the splice junctions affect splicing fidelity and alternative splicing. Proc Natl Acad Sci U S A, 106: 18954 – 9, 2009. D O I : 10.1073/ pnas.0907948106 (see p. 19) [107] C. M A S L E N , D. B A B C O C K , M. R A G H U N AT H , and B. S T E I N M A N N . A rare branch-point muta- tion is associated with missplicing of fibrillin-2 in a large family with congenital contractural arachnodactyly. Am J Hum Genet, 60: 1389 – 98, 1997. D O I : 10.1086/515472 (see p. 19) [108] X. H. Z H A N G and L. A. C H A S I N . Computational definition of sequence motifs governing constitutive exon splicing. Genes Dev, 18: 1241 – 50, 2004. D O I : 10.1101/gad.1195304 (see pp. 20, 39) [109] R. M A RT I N E Z -C O N T R E R A S , J. F. F I S E T T E , F. U. N A S I M , R. M A D D E N , M. C O R D E AU , and B. C H A B O T . Intronic binding sites for hnRNP A/B and hnRNP F/H proteins stimulate pre-mRNA splicing. PLoS Biol, 4: e21, 2006. D O I : 10.1371/journal.pbio.0040021 (see p. 20) [110] A. K A N O P KA , O. M U H L E M A N N , and G. A KU S JA RV I . Inhibition by SR proteins of splicing of a regulated adenovirus pre-mRNA. Nature, 381: 535 – 8, 1996. D O I : 10.1038/381535a0 (see p. 20) [111] V. FA A , A. C O I A N A , F. I N CA N I , L. C O S TA N T I N O , A. C A O , and M. C. R O S AT E L L I . A syn- onymous mutation in the CFTR gene causes aberrant splicing in an italian patient affected by a mild form of cystic fibrosis. J Mol Diagn, 12: 380 – 3, 2010. D O I : 10.2353/jmoldx.2010. 090126 (see p. 20) [112] T. S T E R N E -W E I L E R , J. H O WA R D , M. M O RT , D. N. C O O P E R , and J. R. S A N F O R D . 
Loss of exon identity is a common mechanism of human inherited disease. Genome Res, 21: 1563 – 71, 2011. D O I : 10.1101/gr.118638.110 (see pp. 20, 40) [113] N. M. K U Y U M C U -M A RT I N E Z , G. S. W A N G , and T. A. C O O P E R . Increased steady-state levels of CUGBP1 in myotonic dystrophy 1 are due to PKC-mediated hyperphosphorylation. Mol Cell, 28: 68 – 78, 2007. D O I : 10.1016/j.molcel.2007.07.027 (see p. 21) 148 Bibliography [114] B. K. D R E D G E , A. D. P O LY D O R I D E S , and R. B. D A R N E L L . The splice of life: alternative splicing and neurological disease. Nat Rev Neurosci, 2: 43 – 50, 2001. D O I : 10.1038/35049061 (see p. 21) [115] E. S. A R N O L D , S. C. L I N G , S. C. H U E L G A , C. L A G I E R -T O U R E N N E , M. P O LY M E N I D O U , D. D I T S W O RT H , H. B. K O R DA S I E W I C Z , M. M C A L O N I S -D O W N E S , O. P L AT O S H Y N , P. A. PA R O N E , S. D A C R U Z , K. M. C LU TA R I O , D. S W I N G , L. T E S S A R O L L O , M. M A R S A L A , C. E. S H AW , G. W. Y E O , and D. W. C L E V E L A N D . ALS-linked TDP-43 mutations produce aberrant RNA splicing and adult-onset motor neuron disease without aggregation or loss of nuclear TDP-43. Proc Natl Acad Sci U S A, 110: E736 – 45, 2013. D O I : 10.1073/pnas.1222809110 (see p. 22) [116] W. G U O , S. S C H A F E R , M. L. G R E A S E R , M. H. R A D K E , M. L I S S , T. G OV I N DA R A JA N , H. M A AT Z , H. S C H U L Z , S. L I , A. M. PA R R I S H , V. D AU K S A I T E , P. VA K E E L , S. K L A A S S E N , B. G E R U L L , L. T H I E R F E L D E R , V. R E G I T Z -Z A G R O S E K , T. A. H A C K E R , K. W. S AU P E , G. W. D E C , P. T. E L L I N O R , C. A. M A C R A E , B. S PA L L E K , R. F I S C H E R , A. P E R R O T , C. O Z C E L I K , K. S A A R , N. H U B N E R , and M. G O T T H A R D T . RBM20, a gene for hereditary cardiomyopathy, regulates titin splicing. Nat Med, 18: 766 – 73, 2012. D O I : 10.1038/nm.2693 (see p. 22) [117] B. L. 
F O G E L , E. W E X L E R , A. W A H N I C H , T. F R I E D R I C H , C. V I JAY E N D R A N , F. G A O , N. PA R I K S H A K , G. K O N O P KA , and D. H. G E S C H W I N D . RBFOX1 regulates both splicing and transcriptional networks in human neuronal development. Hum Mol Genet, 21: 4171 – 86, 2012. D O I : 10.1093/hmg/dds240 (see p. 22) [118] J. Z H A N G and J. L. M A N L E Y . Misregulation of pre-mRNA alternative splicing in cancer. Cancer Discov, 3: 1228 – 37, 2013. D O I : 10.1158/2159-8290.CD-13-0253 (see p. 23) [119] O. A N C Z U KO W , A. Z. R O S E N B E R G , M. A K E R M A N , S. D A S , L. Z H A N , R. K A R N I , S. K. M U T H U S WA M Y , and A. R. K R A I N E R . The splicing factor SRSF1 regulates apoptosis and proliferation to promote mammary epithelial cell transformation. Nat Struct Mol Biol, 19: 220 – 8, 2012. D O I : 10.1038/nsmb.2207 (see p. 23) [120] A. B E N -H U R , C. S. O N G , S. S O N N E N B U R G , B. S C H O L KO P F , and G. R AT S C H . Support vector machines and kernels for computational biology. PLoS Comput Biol, 4: e1000173, 2008. D O I : 10.1371/journal.pcbi.1000173 (see p. 23) [121] S. D E G R O E V E , Y. S A E Y S , B. D E B A E T S , P. R O U Z E , and Y. VA N D E P E E R . SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics, 21: 1332 – 8, 2005. D O I : 10.1093/bioinformatics/bti166 (see p. 23) [122] P. M E I N I C K E , M. T E C H , B. M O R G E N S T E R N , and R. M E R K L . Oligo kernels for datamin- ing on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics, 5: 169, 2004. D O I : 10.1186/1471-2105-5-169 (see p. 23) [123] A. K. B AT E N , B. C. C H A N G , S. K. H A L G A M U G E , and J. L I . Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics, 7 Suppl 5: S15, 2006. D O I : 10.1186/1471-2105-7-S5-S15 (see p. 23) [124] A. K. B AT E N , S. K. 
H A L G A M U G E , and B. C. C H A N G . Fast splice site detection using in- formation content and feature reduction. BMC Bioinformatics, 9 Suppl 12: S8, 2008. D O I : 10.1186/1471-2105-9-S12-S8 (see p. 23) Bibliography 149 [125] R. R. W A L I A , C. C A R A G E A , B. A. L E W I S , F. T O W FI C , M. T E R R I B I L I N I , Y. E L -M A N Z A L AW Y , D. D O B B S , and V. H O N AVA R . Protein-RNA interface residue prediction using machine learn- ing: an assessment of the state of the art. BMC Bioinformatics, 13: 89, 2012. D O I : 10.1186/ 1471-2105-13-89 (see p. 23) [126] G. Y E O and C. B. B U R G E . Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol, 11: 377 – 94, 2004. D O I : 10 . 1089 / 1066527041410418 (see pp. 27, 38, 50, 71, 75) [127] S. B R U N A K , J. E N G E L B R E C H T , and S. K N U D S E N . Prediction of human mRNA donor and acceptor sites from the DNA sequence. J Mol Biol, 220: 49 – 65, 1991. (see p. 27) [128] A. G O R E N , E. K I M , M. A M I T , K. VA K N I N , N. K FI R , O. R A M , and G. A S T . Overlapping splicing regulatory motifs – combinatorial effects on splicing. Nucleic Acids Res, 38: 3318 – 27, 2010. D O I : 10.1093/nar/gkq005 (see p. 27) [129] G. TE X C O N S O RT I U M . The Genotype-Tissue Expression (GTEx) project. Nat Genet, 45: 580 – 5, 2013. D O I : 10.1038/ng.2653 (see p. 28) [130] T. A. C O O P E R . Use of minigene systems to dissect alternative splicing elements. Methods, 37: 331 – 40, 2005. D O I : 10.1016/j.ymeth.2005.07.015 (see p. 28) [131] K. B A S L E R , P. S I E G R I S T , and E. H A F E N . The spatial and temporal expression pattern of sevenless is exclusively controlled by gene-internal elements. EMBO J, 8: 2381 – 6, 1989. (see p. 28) [132] D. D. L I CATA L O S I , A. M E L E , J. J. FA K , J. U L E , M. K AY I KC I , S. W. C H I , T. A. C L A R K , A. C. S C H W E I T Z E R , J. E. B LU M E , X. W A N G , J. C. D A R N E L L , and R. B. 
D A R N E L L . HITS- CLIP yields genome-wide insights into brain alternative RNA processing. Nature, 456: 464 – 9, 2008. D O I : 10.1038/nature07488 (see p. 29) [133] Y. H UA , K. S A H A S H I , F. R I G O , G. H U N G , G. H O R E V , C. F. B E N N E T T , and A. R. K R A I N E R . Peripheral SMN restoration is essential for long-term rescue of a severe spinal muscular atrophy mouse model. Nature, 478: 123 – 6, 2011. D O I : 10.1038/nature10485 (see p. 29) [134] Y. H UA , K. S A H A S H I , G. H U N G , F. R I G O , M. A. PA S S I N I , C. F. B E N N E T T , and A. R. K R A I N E R . Antisense correction of SMN2 splicing in the CNS rescues necrosis in a type III SMA mouse model. Genes Dev, 24: 1634 – 44, 2010. D O I : 10.1101/gad.1941310 (see p. 29) [135] S. S VA S T I , T. S U WA N M A N E E , S. F U C H A R O E N , H. M. M O U LT O N , M. H. N E L S O N , N. M A E DA , O. S M I T H I E S , and R. K O L E . RNA repair restores hemoglobin expression in IVS2- 654 thalassemic mice. Proc Natl Acad Sci U S A, 106: 1205 – 10, 2009. D O I : 10.1073/pnas. 0812436106 (see p. 29) [136] K. E. L U N D I N , T. H O J L A N D , B. R. H A N S E N , R. P E R S S O N , J. B. B R A M S E N , J. K J E M S , T. K O C H , J. W E N G E L , and C. I. S M I T H . Biological activity and biotechnological aspects of locked nucleic acids. Adv Genet, 82: 47 – 107, 2013. D O I : 10.1016/B978-0-12-407676- 1.00002-0 (see p. 29) [137] N. O W E N , H. Z H O U , A. A. M A LY G I N , J. S A N G H A , L. D. S M I T H , F. M U N T O N I , and I. C. E P E R O N . Design principles for bifunctional targeted oligonucleotide enhancers of splicing. Nucleic Acids Res, 39: 7194 – 208, 2011. D O I : 10.1093/nar/gkr152 (see p. 29) 150 Bibliography [138] P. D I S T E R E R , A. K RYC Z KA , Y. L I U , Y. E. B A D I , J. J. W O N G , J. S. O W E N , and B. K H O O . Development of therapeutic splice-switching oligonucleotides. Hum Gene Ther, 25: 587 – 98, 2014. D O I : 10.1089/hum.2013.234 (see p. 
30) [139] M. L. H A S T I N G S , J. B E R N I A C , Y. H. L I U , P. A B AT O , F. M. J O D E L KA , L. B A RT H E L , S. K U M A R , C. D U D L E Y , M. N E L S O N , K. L A R S O N , J. E D M O N D S , T. B O W S E R , M. D R A P E R , P. H I G G I N S , and A. R. K R A I N E R . Tetracyclines that promote SMN2 exon 7 splicing as therapeu- tics for spinal muscular atrophy. Sci Transl Med, 1: 5ra12, 2009. D O I : 10.1126/scitranslmed. 3000208 (see p. 30) [140] T. M. W H E E L E R , K. S O B C Z A K , J. D. L U E C K , R. J. O S B O R N E , X. L I N , R. T. D I R K S E N , and C. A. T H O R N T O N . Reversal of RNA dominance by displacement of protein sequestered on triplet repeat RNA. Science, 325: 336 – 9, 2009. D O I : 10.1126/science.1173110 (see p. 30) [141] S. A. M U L D E R S , W. J. VA N D E N B R O E K , T. M. W H E E L E R , H. J. C R O E S , P. VA N K U I K - R O M E I J N , S. J. D E K I M P E , D. F U R L I N G , G. J. P L AT E N B U R G , G. G O U R D O N , C. A. T H O R N - T O N , B. W I E R I N G A , and D. G. W A N S I N K . Triplet-repeat oligonucleotide-mediated reversal of RNA toxicity in myotonic dystrophy. Proc Natl Acad Sci U S A, 106: 13915 – 20, 2009. D O I : 10.1073/pnas.0905780106 (see p. 30) [142] J. E. L E E , C. F. B E N N E T T , and T. A. C O O P E R . RNase H-mediated degradation of toxic RNA in myotonic dystrophy type 1. Proc Natl Acad Sci U S A, 109: 4221 – 6, 2012. D O I : 10.1073/pnas.1117019109 (see p. 30) [143] A. J. L E G E R , L. M. M O S Q U E A , N. P. C L AY T O N , I. H. W U , T. W E E D E N , C. A. N E L S O N , L. P H I L L I P S , E. R O B E RT S , P. A. P I E P E N H A G E N , S. H. C H E N G , and B. M. W E N T W O RT H . Systemic delivery of a Peptide-linked morpholino oligonucleotide neutralizes mutant RNA toxicity in a mouse model of myotonic dystrophy. Nucleic Acid Ther, 23: 109 – 17, 2013. D O I : 10.1089/nat.2012.0404 (see p. 30) [144] T. M. W H E E L E R , A. J. L E G E R , S. K. PA N D E Y , A. R. 
M A C L E O D , M. N A KA M O R I , S. H. C H E N G , B. M. W E N T W O RT H , C. F. B E N N E T T , and C. A. T H O R N T O N . Targeting nuclear RNA for in vivo correction of myotonic dystrophy. Nature, 488: 111 – 5, 2012. D O I : 10.1038/ nature11362 (see p. 30) [145] V. F R A N C O I S , A. F. K L E I N , C. B E L E Y , A. J O L L E T , C. L E M E R C I E R , L. G A R C I A , and D. F U R L I N G . Selective silencing of mutated mRNAs in DM1 by using modified hU7-snRNAs. Nat Struct Mol Biol, 18: 85 – 7, 2011. D O I : 10.1038/nsmb.1958 (see p. 30) [146] K. S O B C Z A K , T. M. W H E E L E R , W. W A N G , and C. A. T H O R N T O N . RNA interference targeting CUG repeats in a mouse model of myotonic dystrophy. Mol Ther, 21: 380 – 7, 2013. D O I : 10.1038/mt.2012.222 (see p. 30) [147] M. B. W A R F , M. N A KA M O R I , C. M. M AT T H Y S , C. A. T H O R N T O N , and J. A. B E R G LU N D . Pentamidine reverses the splicing defects associated with myotonic dystrophy. Proc Natl Acad Sci U S A, 106: 18551 – 6, 2009. D O I : 10.1073/pnas.0903234106 (see p. 30) [148] A. G A R C I A -L O P E Z , B. L L A M U S I , M. O R Z A E Z , E. P E R E Z -PAYA , and R. D. A RT E R O . In vivo discovery of a peptide that prevents CUG-RNA hairpin formation and reverses RNA toxicity in myotonic dystrophy models. Proc Natl Acad Sci U S A, 108: 11866 – 71, 2011. D O I : 10.1073/pnas.1018213108 (see p. 30) Bibliography 151 [149] J. L. C H I L D S -D I S N E Y , R. PA R K E S H , M. N A KA M O R I , C. A. T H O R N T O N , and M. D. D I S - N E Y.Rational design of bioactive, modularly assembled aminoglycosides targeting the RNA that causes myotonic dystrophy type 1. ACS Chem Biol, 7: 1984 – 93, 2012. D O I : 10.1021/ cb3001606 (see p. 30) [150] W. Z H A N G , Y. W A N G , S. D O N G , R. C H O U D H U RY , Y. J I N , and Z. W A N G . Treatment of type 1 myotonic dystrophy by engineering site-specific RNA endonucleases that target (CUG)(n) repeats. Mol Ther, 22: 312 – 20, 2014. 
D O I : 10.1038/mt.2013.251 (see p. 30) [151] P. A. B A I R D , T. W. A N D E R S O N , H. B. N E W C O M B E , and R. B. L O W RY . Genetic disorders in children and young adults: a population study. Am J Hum Genet, 42: 677 – 93, 1988. (see p. 33) [152] Y. Y A N G , D. M. M U Z N Y , F. X I A , Z. N I U , R. P E R S O N , Y. D I N G , P. W A R D , A. B R A X T O N , M. W A N G , C. B U H AY , N. V E E R A R A G H AVA N , A. H AW E S , T. C H I A N G , M. L E D U C , J. B E U T E N , J. Z H A N G , W. H E , J. S C U L L , A. W I L L I S , M. L A N D S V E R K , W. J. C R A I G E N , M. R. B E K H E I R N I A , A. S T R AY-P E D E R S E N , P. L I U , S. W E N , W. A L CA R A Z , H. C U I , M. W A L K I E W I C Z , J. R E I D , M. B A I N B R I D G E , A. PAT E L , E. B O E RW I N K L E , A. L. B E AU D E T , J. R. L U P S K I , S. E. P L O N , R. A. G I B B S , and C. M. E N G . Molecular findings among patients referred for clinical whole-exome sequencing. JAMA, 312: 1870 – 9, 2014. D O I : 10.1001/jama.2014.14601 (see p. 33) [153] M. J. B A M S H A D , S. B. N G , A. W. B I G H A M , H. K. TA B O R , M. J. E M O N D , D. A. N I C K E R S O N , and J. S H E N D U R E . Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet, 12: 745 – 55, 2011. D O I : 10.1038/nrg3031 (see p. 33) [154] J. A. T E N N E S S E N , A. W. B I G H A M , T. D. O’C O N N O R , W. F U , E. E. K E N N Y , S. G R AV E L , S. M C G E E , R. D O , X. L I U , G. J U N , H. M. K A N G , D. J O R DA N , S. M. L E A L , S. G A B R I E L , M. J. R I E D E R , G. A B E CA S I S , D. A LT S H U L E R , D. A. N I C K E R S O N , E. B O E RW I N K L E , S. S U N YA E V , C. D. B U S TA M A N T E , M. J. B A M S H A D , J. M. A K E Y , G. O. B R O A D , G. O. S E AT T L E , and N H L B I E XO M E S E Q U E N C I N G P R O J E C T . Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337: 64 – 9, 2012. D O I : 10.1126/science. 
1219240 (see p. 33) [155] Y. X U E , Y. C H E N , Q. A Y U B , N. H UA N G , E. V. B A L L , M. M O RT , A. D. P H I L L I P S , K. S H AW , P. D. S T E N S O N , D. N. C O O P E R , C. T Y L E R -S M I T H , and C O N S O RT I U M G E N O M E S P R O J E C T . Deleterious- and disease-allele prevalence in healthy individuals: insights from current predic- tions, mutation databases, and population-scale resequencing. Am J Hum Genet, 91: 1022 – 32, 2012. D O I : 10.1016/j.ajhg.2012.10.015 (see p. 33) [156] P. D. S T E N S O N , E. V. B A L L , M. M O RT , A. D. P H I L L I P S , J. A. S H I E L , N. S. T H O M A S , S. A B E Y S I N G H E , M. K R AW C Z A K , and D. N. C O O P E R . Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat, 21: 577 – 81, 2003. D O I : 10 . 1002 / humu . 10212 (see pp. 33, 34, 38, 44, 46, 50, 55, 61) [157] N. H UA N G , I. L E E , E. M. M A R C O T T E , and M. E. H U R L E S . Characterising and predicting haploinsufficiency in the human genome. PLoS Genet, 6: e1001154, 2010. D O I : 10.1371/ journal.pgen.1001154 (see pp. 36, 37, 49, 50, 108) 152 Bibliography [158] S. K E , S. S H A N G , S. M. K A L A C H I KOV , I. M O R O Z OVA , L. Y U , J. J. R U S S O , J. J U , and L. A. C H A S I N . Quantitative evaluation of all hexamers as exonic splicing elements. Genome Res, 21: 1360 – 74, 2011. D O I : 10.1101/gr.119628.110 (see pp. 37, 38, 46, 50, 51, 71, 75, 84 – 86) [159] R. L O R E N Z , S. H. B E R N H A RT , C. H O N E R Z U S I E D E R D I S S E N , H. TA F E R , C. F L A M M , P. F. S TA D L E R , and I. L. H O FA C K E R . ViennaRNA Package 2.0. Algorithms Mol Biol, 6: 26, 2011. D O I : 10.1186/1748-7188-6-26 (see pp. 38, 44, 50) [160] L E O B R E I M A N . Random Forests. Mach. Learn., 45: 5 – 32, 2001. D O I : 10 . 1023 / a : 1010933404324 (see pp. 38, 50) [161] S. K E , X. H. Z H A N G , and L. A. C H A S I N . Positive selection acting on splicing motifs reflects compensatory evolution. Genome Res, 18: 533 – 43, 2008. 
D O I : 10.1101/gr.070268.107 (see p. 39) [162] P. J. S M I T H , C. Z H A N G , J. W A N G , S. L. C H E W , M. Q. Z H A N G , and A. R. K R A I N E R . An increased specificity score matrix for the prediction of SF2/ASF-specific exonic splicing enhancers. Hum Mol Genet, 15: 2490 – 508, 2006. D O I : 10.1093/hmg/ddl171 (see p. 39) [163] D. R AY , H. K A Z A N , E. T. C H A N , L. P E N A C A S T I L L O , S. C H AU D H RY , S. TA LU K D E R , B. J. B L E N C O W E , Q. M O R R I S , and T. R. H U G H E S . Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat Biotechnol, 27: 667 – 70, 2009. D O I : 10.1038/nbt.1550 (see pp. 40, 51) [164] M. A. R A H M A N , Y. A Z U M A , F. N A S R I N , J. TA K E DA , M. N A Z I M , K. B I N A H S A N , A. M A S U DA , A. G. E N G E L , and K. O H N O . SRSF1 and hnRNP H antagonistically regulate splicing of COLQ exon 16 in a congenital myasthenic syndrome. Sci Rep, 5: 13208, 2015. D O I : 10 . 1038 / srep13208 (see pp. 40, 51) [165] H. S H E N , J. L. K A N , C. G H I G N A , G. B I A M O N T I , and M. R. G R E E N . A single polypyrimidine tract binding protein (PTB) binding site mediates splicing inhibition at mouse IgM exons M1 and M2. RNA, 10: 787 – 94, 2004. (see p. 40) [166] J. W A N G , S. H. X I A O , and J. L. M A N L E Y . Genetic analysis of the SR protein ASF/SF2: interchangeability of RS domains and negative control of splicing. Genes Dev, 12: 2222 – 33, 1998. (see p. 40) [167] R. A. PA D G E T T , P. J. G R A B O W S K I , M. M. K O N A R S KA , S. S E I L E R , and P. A. S H A R P . Splicing of messenger RNA precursors. Annu Rev Biochem, 55: 1119 – 50, 1986. D O I : 10.1146/annurev. bi.55.070186.005351 (see pp. 42, 65) [168] M. M. K O N A R S KA and P. A. S H A R P . Electrophoretic separation of complexes involved in the splicing of precursors to mRNAs. Cell, 46: 845 – 55, 1986. (see pp. 42, 65) [169] R. D A S and R. R E E D . 
Resolution of the mammalian E complex and the ATP-dependent spliceosomal complexes on native agarose mini-gels. RNA, 5: 1504 – 8, 1999. (see pp. 42, 65) Bibliography 153 [170] D. G. M A C A RT H U R , T. A. M A N O L I O , D. P. D I M M O C K , H. L. R E H M , J. S H E N D U R E , G. R. A B E CA S I S , D. R. A DA M S , R. B. A LT M A N , S. E. A N T O N A R A K I S , E. A. A S H L E Y , J. C. B A R R E T T , L. G. B I E S E C K E R , D. F. C O N R A D , G. M. C O O P E R , N. J. C OX , M. J. D A LY , M. B. G E R S T E I N , D. B. G O L D S T E I N , J. N. H I R S C H H O R N , S. M. L E A L , L. A. P E N N A C C H I O , J. A. S TA M AT O YA N N O P O U L O S , S. R. S U N YA E V , D. VA L L E , B. F. V O I G H T , W. W I N C K L E R , and C. G U N T E R . Guidelines for investigating causality of sequence variants in human disease. Nature, 508: 469 – 76, 2014. D O I : 10.1038/nature13127 (see p. 46) [171] O. G O Z A N I , J. G. PAT T O N , and R. R E E D . A novel set of spliceosome-associated proteins and the essential splicing factor PSF bind stably to pre-mRNA prior to catalytic step II of the splicing reaction. EMBO J, 13: 3356 – 67, 1994. (see p. 47) [172] V. R E I C H E RT and M. J. M O O R E . Better conditions for mammalian in vitro splicing provided by acetate and glutamate as potassium counterions. Nucleic Acids Res, 28: 416 – 23, 2000. (see p. 47) [173] A. D O B I N , C. A. D AV I S , F. S C H L E S I N G E R , J. D R E N KO W , C. Z A L E S K I , S. J H A , P. B AT U T , M. C H A I S S O N , and T. R. G I N G E R A S . STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29: 15 – 21, 2013. D O I : 10.1093/bioinformatics/bts635 (see pp. 48, 49) [174] M I R O N B. K U R S A , A L E K S A N D E R J A N KO W S K I , and W I T O L D R. R U D N I C K I . Boruta - A System for Feature Selection. Fundam. Inf., 101: 271 – 285, 2010. (see p. 50) [175] C. L. L I N , A. J. TA G G A RT , K. H. L I M , K. J. C Y G A N , L. F E R R A R I S , R. C R E T O N , Y. 
T. H UA N G , and W. G. FA I R B R O T H E R . RNA structure replaces the need for U2AF2 in splicing. Genome Res, 26: 12 – 23, 2016. D O I : 10.1101/gr.181008.114 (see p. 50) [176] W. W. W A S S E R M A N and A. S A N D E L I N . Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet, 5: 276 – 87, 2004. D O I : 10.1038/nrg1315 (see p. 51) [177] J O H N M. C H A M B E R S and T R E VO R H A S T I E . Statistical models in S. Wadsworth & Brooks/Cole computer science series Pacific Grove, Calif.: Wadsworth & Brooks/Cole Advanced Books & Software, 1992. xv, 608 p. (see p. 51) [178] C. F R A L E Y and A. E. R A F T E RY . Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97: 611 – 631, 2002. D O I : Doi10. 1198/016214502760047131 (see p. 51) [179] F O RT U N AT O P E S A R I N . Multivariate permutation tests: with applications in biostatistics. vol. 240 Wiley Chichester, 2001. (see p. 52) [180] S. H. L E F E V R E , L. C H AU V E I N C , D. S T O P PA -LY O N N E T , J. M I C H O N , L. L U M B R O S O , P. B E RT H E T , D. F R A P PA Z , B. D U T R I L L AU X , S. C H E V I L L A R D , and B. M A L F O Y . A T to C muta- tion in the polypyrimidine tract of the exon 9 splicing site of the RB1 gene responsible for low penetrance hereditary retinoblastoma. J Med Genet, 39: E21, 2002. (see pp. 61, 68) [181] H. S C H E FF E R , P. VA N D E R V L I E S , M. B U RT O N , E. V E R L I N D , A. C. M O L L , S. M. I M H O F , and C. H. B U Y S . Two novel germline mutations of the retinoblastoma gene (RB1) that show incomplete penetrance, one splice site and one missense. J Med Genet, 37: E6, 2000. (see pp. 61, 68) 154 Bibliography [182] E. L. S C H U B E RT , L. C. S T R O N G , and M. F. H A N S E N . A splicing mutation in RB1 in low penetrance retinoblastoma. Hum Genet, 100: 557 – 63, 1997. (see pp. 61, 68) [183] M. K LU T Z , D. B R O C K M A N N , and D. R. L O H M A N N . 
A parent-of-origin effect in two families with retinoblastoma is associated with a distinct splice mutation in the RB1 gene. Am J Hum Genet, 71: 174 – 9, 2002. D O I : 10.1086/341284 (see p. 61) [184] C O N S O RT I U M G E N O M E S P R O J E C T , A. A U T O N , L. D. B R O O K S , R. M. D U R B I N , E. P. G A R - R I S O N,H. M. K A N G , J. O. K O R B E L , J. L. M A R C H I N I , S. M C C A RT H Y , G. A. M C V E A N , and G. R. A B E CA S I S . A global reference for human genetic variation. Nature, 526: 68 – 74, 2015. D O I : 10.1038/nature15393 (see p. 68) [185] A. M. M E Y N E RT , L. S. B I C K N E L L , M. E. H U R L E S , A. P. J A C K S O N , and M. S. TAY L O R . Quantifying single nucleotide variant detection sensitivity in exome sequencing. BMC Bioin- formatics, 14: 195, 2013. D O I : 10.1186/1471-2105-14-195 (see p. 68) [186] J. W. H A R B O U R . Molecular basis of low-penetrance retinoblastoma. Arch Ophthalmol, 119: 1699 – 704, 2001. (see p. 68) [187] C. J. D O M M E R I N G , T. M A R E E S , A. H. VA N D E R H O U T , S. M. I M H O F , H. M E I J E R S - H E I J B O E R , P. J. R I N G E N S , F. E. VA N L E E U W E N , and A. C. M O L L . RB1 mutations and second primary malignancies after hereditary retinoblastoma. Fam Cancer, 11: 225 – 33, 2012. D O I : 10.1007/s10689-011-9505-3 (see p. 68) [188] M O T O O K I M U R A . The neutral theory of molecular evolution. Cambridge Cambridgeshire ; New York: Cambridge University Press, 1983. xv, 367 p. (see p. 68) [189] R. F. R O S C I G N O and M. A. G A R C I A -B L A N C O . SR proteins escort the U4/U6.U5 tri-snRNP to the spliceosome. RNA, 1: 692 – 706, 1995. (see p. 71) [190] P. PA PA S A I KA S , J. R. T E J E D O R , L. V I G E VA N I , and J. VA L CA R C E L . Functional splicing net- work reveals extensive regulatory potential of the core spliceosomal machinery. Mol Cell, 57: 7 – 22, 2015. D O I : 10.1016/j.molcel.2014.10.030 (see p. 72) [191] T. N. T U R N E R , Q. Y I , N. K R U M M , J. 
H U D D L E S T O N , K. H O E K Z E M A , F. S T E S S M A N HA, A. L. D O E B L E Y , R. A. B E R N I E R , D. A. N I C K E R S O N , and E. E. E I C H L E R . denovo-db: a compendium of human de novo variants. Nucleic Acids Res, 45: D804 – D811, 2017. D O I : 10.1093/nar/gkw865 (see p. 75) [192] P. C I N G O L A N I , A. P L AT T S , L. W A N G L E , M. C O O N , T. N G U Y E N , L. W A N G , S. J. L A N D , X. L U , and D. M. R U D E N . A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6: 80 – 92, 2012. D O I : 10.4161/fly.19695 (see p. 75) [193] C. Z H A N G , W. H. L I , A. R. K R A I N E R , and M. Q. Z H A N G . RNA landscape of evolution for optimal exon and intron discrimination. Proc Natl Acad Sci U S A, 105: 5797 – 802, 2008. D O I : 10.1073/pnas.0801692105 (see p. 86)