Top-Down Effects on Speech Perception: An Integrated Computational and Behavioral Approach

by
Neal P. Fox
B.A., University of Virginia, 2009
Sc.M., Brown University, 2012

Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in the Department of Cognitive, Linguistic, and Psychological Sciences at Brown University

Providence, Rhode Island
May 2016

© Copyright 2016 by Neal P. Fox

This dissertation by Neal P. Fox is accepted in its present form by the Department of Cognitive, Linguistic, and Psychological Sciences as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date _________   _______________________________________
                 Sheila E. Blumstein, Advisor

Recommended to the Graduate Council

Date _________   _______________________________________
                 Michael J. Frank, Reader

Date _________   _______________________________________
                 James L. Morgan, Reader

Approved by the Graduate Council

Date _________   _______________________________________
                 Peter M. Weber, Dean of the Graduate School

Neal P. Fox

Education
Brown University, Ph.D. in Cognitive Science, January 2016
  Top-Down Effects on Speech Perception: An Integrated Computational and Behavioral Approach (Advisor: Dr. Sheila E. Blumstein)
Brown University, M.S. in Cognitive Science, May 2012
  Top-down effects from syntactic category expectations on speech processing
University of Virginia, B.A. with Distinction in Cognitive Science, May 2009
  Minor: Mathematics; post-graduate training in Systems Engineering (2009–10)

Honors and Awards
Reisman Brain Science Graduate Fellowship, Brown Institute for Brain Science, 2014
Dissertation Fellowship, Brown University, 2013
Best Poster Award, Society for Teaching of Psychology, 2013
NSF Graduate Research Fellowship, National Science Foundation, 2010–2013
LSA Summer Institute Fellowship, Linguistic Society of America, 2011
Calvin & Rose G Hoffman Prize, The Marlowe Society, 2011
Graduate Fellowship, Brown University, 2010
Raven Society (Academic Honor Society), University of Virginia, 2010
Graduate Research Fellowship in Engineering, University of Virginia, 2009–2010

Publications
1. Fox, N. P., Reilly, M., & Blumstein, S. E. (2015). Phonological neighborhood competition affects spoken word production irrespective of sentential context. Journal of Memory and Language, 83, 97–117.
2. Fox, N. P., & Blumstein, S. E. (in press). Top-down effects of syntactic sentential context on phonetic processing. Journal of Experimental Psychology: Human Perception and Performance.
3. Luthra, S., Fox, N. P., & Blumstein, S. E. (2016). Speaker information affects false memory recognition of unstudied lexical-semantic associates. (under revision)
4. Caplan, S., Fox, N. P., McClosky, D. M., & Charniak, E. (2016). Lexical substitution for cross-domain parser adaptation. (under revision)
5. Reilly, M., Guediche, S., Fox, N. P., & Blumstein, S. E. (in preparation). Articulatory planning and motor reprogramming: An fMRI investigation.
6. Fox, N. P., Blumstein, S. E., & Frank, M. J. (2016, manuscript). Bayesian Integration of Acoustic and Sentential Evidence in Speech: The BIASES Model of Spoken Word Recognition in Context.
7. Fox, N. P., & Blumstein, S. E. (2016, manuscript). Bottom-up and top-down contributions to lexical processing deficits in aphasia.

Refereed Conference Presentations
1. Fox, N. P., & Larsen, E. W. (2009). A comparative optimality theoretic outlook on loanword phonology in Huave dialects of Mexico. Rice University Linguistics Society Third Biennial Meeting; Houston, TX.
2. Fox, N. P. (2012). Top-down effect of syntactic category expectations on spoken word recognition. CUNY Conference on Sentence Processing; New York, NY.
3. Fox, N. P., Ehmoda, O., & Charniak, E. (2012). Statistical stylometrics and the Marlowe–Shakespeare authorship debate. Georgetown University Roundtable on Languages and Linguistics; Washington, DC.
4. Fox, N. P., & Blumstein, S. E. (2013). Top-down effects from sentence context on speech processing in aphasia. Society for Neurobiology of Language; San Diego, CA.
5. Fox, N. P., & Reilly, M. (2013). Significantly different: A meta-analysis of the gap between statistics students learn and statistics psychologists use. Northeast Conference for Teachers of Psychology; Bridgeport, CT. (Best Poster Award)
6. Fox, N. P., Reilly, M., & Blumstein, S. E. (2014). Independent and interacting effects of sentential context and phonological neighborhood structure in spoken word production. Acoustical Society of America; Providence, RI.
7. Luthra, S., Fox, N. P., & Blumstein, S. E. (2015). Speaker information affects false recognition of unstudied lexical-semantic associates. Society for Neurobiology of Language; Chicago, IL.
8. Fox, N. P., & Blumstein, S. E. (2015). Computational and neural mechanisms of top-down effects on speech perception. Society for Neurobiology of Language; Chicago, IL.

Teaching Experience
Introduction to Cognitive Neuroscience, Teaching Assistant, Brown University, 2013
Quantitative Methods in Psychology, Teaching Assistant, Brown University, 2012
Computational Cognitive Science, Teaching Assistant, Brown University, 2012

Teaching Development and Leadership
Teaching Certifications:
  Certificate III: Professional Development Seminar, 2014
  Certificate IV: The Teaching Consultant, 2013
  Certificate I: Reflective Teaching Seminar, 2012
Program Facilitation and Teaching Consultation:
  Certificate I: Reflective Teaching Seminar, 2012–2015
  Center for Engaged Learning Seminar: Undergraduate Research Mentorship, 2014
  Principles & Practice in Reflective Mentorship, 2013–2014
  Senior Teaching Consultant, 2013–2014
  New Teaching Assistant Orientation: Interactive Classrooms, 2013
Undergraduate Research Mentorship:
  Spencer Caplan, Brown class of 2015, 2013–2015
  Sahil Luthra, Brown class of 2014, 2012–2014

Service
Session Chair, Speech Perception, Acoustical Society of America Meeting, 2014
CLPS Department Representative, Sheridan Center for Teaching & Learning, 2012–2014
Graduate Representative, Campus Access Advisory Committee, 2011–2014
Session Chair, Computational Linguistics, Georgetown Roundtable on Lang. & Ling., 2012
Graduate Representative, Brown Univ. Presidential Search Advisory Committee, 2011–2012

Professional Memberships
Society for Neurobiology of Language, 2013–2016
Acoustical Society of America, 2013–2016
American Psychological Association, 2013–2016
Society for the Teaching of Psychology, 2013–2016
Northeast Psychological Association, 2013–2016
Linguistic Society of America, 2011–2015
Institute of Electrical and Electronics Engineers, 2009–2010
IEEE Engineering in Medicine and Biology Society, 2009–2010
Society for Neuroscience, 2008–2010

Research Interests
Computational models of speech and language perception and production; cognitive neuroscience of language and aphasia; computational linguistics and natural language processing.

Birth Information
Neal P. Fox was born on August 3, 1989, in San Diego, CA.
Acknowledgements

"It is clear that the successful completion of graduate school requires the recruitment of resources from a variety of sources." Eye rolls aside, this statement is undoubtedly one of the most fundamental truths of graduate school. I have been lucky to work with, learn from, and get support from incredible scientists and outstanding people.

At every turn, Sheila Blumstein was there to guide me, to back me up, and – when necessary – to kick me in the butt. The role of advisor and mentor is often a thankless one, but it is impossible to express the impact Sheila has had on me and on my career. The breadth and depth of her work will always be an inspiration to me, along with the earnestness and fierceness of her loyalty, to say nothing of her unparalleled patience.

I was also fortunate to work with Michael J. Frank. Michael's multidisciplinary approach to cognitive science and computational neuroscience and his willingness to explore new questions opened many doors to me. Throughout my time at Brown, Jim Morgan was a valuable sounding board for my scientific work, but he is also representative of why the CLPS Department has become as much a home as a workplace. It was a privilege to "grow up" in a department in which a full professor would invite a graduate student to his house on a Saturday morning so the student could use his power tools to complete construction of a pair of Cornhole boards.

Eugene Charniak's mentorship and collaboration provided me with an indispensable perspective at every step of my graduate career. I was often impressed by his humility when he didn't know the answer (which was rare) and by his ability to discern worthwhile projects from interesting exercises (while still seeing the value in both).

The mentorship of several other faculty members and postdocs in the CLPS department has also been critical to my development as a scientist. Especially at the beginning, Laura Kertz, Paul Allopenna, Emily Myers, Rachel Theodore and Hugh Rabagliati were long-suffering teachers and mentors. I am also grateful to Thomas Serre, Kathy Spoehr and David Badre for their formative mentorship in my three teaching assistantships in the department. Finally, I would be remiss if I failed to thank the dedicated folks who were there for me (and the entire department) whenever I asked for help and who worked behind the scenes every day so I wouldn't need to. I feel lucky to call Reinette, Don, Jesse, Michelle, Rosa and Bill my friends.

My research projects spanned three labs (Sheila's, Michael's and Eugene's) over the last several years, and members of each lab have helped me clarify, shape, and also do much of that research. Among those colleagues were Megan Reilly, Sahil Luthra, Sara Guediche, John Mertus, Kathy Kurowski, Jeff Cockburn, Nick Franklin, Anne Franklin, Matt Nassar, Jason Scimeca, Micha Elsner, Spencer Caplan, Dave McClosky, Rebecca Mason and Omran Ehmoda. Additionally, several funding sources were critical to my research and training, including a Graduate Research Fellowship (DGE 0228243) from the National Science Foundation, a Graduate Fellowship from the Brown Institute for Brain Science, a summer training fellowship from the Linguistic Society of America, and funding from Brown's Graduate School and the CLPS Department.
As much as graduate school is and should be a disciplinary endeavor, I cannot overstate the amount of personal and professional development I gained by participating in the Sheridan Center's programming, thanks especially to the incredible mentorship of the Center's former director, Kathy Takayama. Working with her and the rest of the staff and students associated with the Sheridan Center was unquestionably one of the most important opportunities of my graduate career. Similarly, the opportunities to meet new colleagues, mentors, and friends from across the school by serving on University committees were formative experiences and fulfilling outlets during my time at Brown.

All work and no play would have made grad school a bummer. Luckily, I met Megan Reilly on my first day of my first year at Brown, so there was never any chance of that happening. She and the rest of the DuckTales Cohort (awoohoo: Jason Scimeca, Chris Erb, Patrick Heck, David Mély) made me a better scientist and person, and they put up with me when I constantly pointed out top-down effects on speech perception in real life (TDEB). Without them, these last few years would have been DISGUSTING.

Finally, as families go, I've got the best one. Thanks, Mom, Dad, Sean, Collin and Alyson. And thanks to the newest members of my family, Emily and Ginny. You have been the sources of many, many smiles while I wrote this thesis, and I will never be able to re-bay you.

Table of Contents

Introduction

Chapter 1. Top-down effects of syntactic sentential context on phonetic processing
  1.1. Introduction
    1.1.1. Models of Spoken Word Recognition: Competing Frameworks
    1.1.2. Time Course of Sentential Context Effects
  1.2. Experiment 1.1
    1.2.1. Methods
      1.2.1.1. Materials
        1.2.1.1.1. Target Word Selection
        1.2.1.1.2. Sentence Contexts
        1.2.1.1.3. Stimulus Recording
        1.2.1.1.4. Target Word Manipulation
        1.2.1.1.5. Target Word Token Selection
      1.2.1.2. Participants
      1.2.1.3. Task
    1.2.2. Results
    1.2.3. Discussion
  1.3. Experiment 1.2
    1.3.1. Methods
      1.3.1.1. Materials
        1.3.1.1.1. Critical Targets
        1.3.1.1.2. Filler Targets
        1.3.1.1.3. Sentence Contexts
      1.3.1.2. Participants
      1.3.1.3. Task
    1.3.2. Results
    1.3.3. Discussion
  1.4. General Discussion
    1.4.1. Implications for Interactive Models of Speech Perception
    1.4.2. Implications for Autonomous Models of Speech Perception
    1.4.3. Predicting Behavior with Comp. Models of Speech Perception
  1.5. Conclusion
  1.6. Overview of Next Steps

Chapter 2. Bayesian Integration of Acoustic and Sentential Evidence in Speech: The BIASES Model of Spoken Word Recognition in Context
  2.1. Introduction
    2.1.1. Brief Introduction
    2.1.2. Overview of Chapter 2
  2.2. Sentential Context and Connectionist Models of Spoken Word Recognition
    2.2.1. Modulation of Spoken Word Recognition by Sentential Context
    2.2.2. Challenges in Modeling Context Effects on Sp. Word Recognition
      2.2.2.1. Challenges...: Representing Context
      2.2.2.2. Challenges...: Activation Dynamics
      2.2.2.3. Challenges...: Representing Time
      2.2.2.4. Context Effects Without Connectionist Models
  2.3. A Computational-Level Analysis of Spoken Word Recognition
    2.3.1. Bayesian Models of Spoken Word Recognition
    2.3.2. Prior Expectations in SWR: Lexical Frequency
    2.3.3. Prior Expectations in SWR: Sentential Context
  2.4. BIASES: Bayesian Integration of Acoustic and Sentential Evidence in Speech
    2.4.1. Conditional Prior: A Model of Listeners' Contextual Knowledge
      2.4.1.1. Conditional Expectations from n-gram Language Models
      2.4.1.2. Consequences of Adopting an n-gram Lang. Model Prior
      2.4.1.3. BIASES' Conditional Prior: A Bigram Language Model
      2.4.1.4. Add. Constraints on Prior Expectations: Forced-Choice
      2.4.1.5. Implementing BIASES' Prior: Corpus Est., Smoothing
    2.4.2. Likelihood Term: Mapping an Acoustic Signal onto Lexical Forms
      2.4.2.1. Likelihood Functions: Many-to-One Mapping
      2.4.2.2. Phonetic Ambiguity: One-to-Many Mapping
      2.4.2.3. BIASES' Likelihood Term: A Mixture of Gaussians
      2.4.2.4. Comparing Likelihood Terms in BIASES & Shortlist B
    2.4.3. Integrating Prior Context and Perceptual Input in BIASES
    2.4.4. Conclusion and Next Steps

Chapter 3. Exploring and Evaluating the BIASES Model of Spoken Word Recognition in Context
  3.1. Understanding Top-Down Effects in BIASES
    3.1.1. Overview of the Mathematical Form of BIASES
      3.1.1.1. Components of BIASES: Phonetic Category Structure (g)
      3.1.1.2. Components of BIASES: Category Boundary (χ)
      3.1.1.3. Components of BIASES: Prior Context (Π)
    3.1.2. Towards Model-based Analyses of Top-Down Effects
      3.1.2.1. Shifting of Invisible Category Boundaries
      3.1.2.2. Boundary Shifts vs. Effect Sizes
      3.1.2.3. Predicting Effect Sizes
  3.2. Evaluating BIASES
    3.2.1. Observed Variability in the Size of Top-Down Context Effects
    3.2.2. Variability in the Ambiguity of Phonetic Cues: VOT
    3.2.3. Variability in the Ambiguity of Phonetic Cues: Additional Cues
    3.2.4. Variability in the Strength of Prior Cues
    3.2.5. Variability in the Effect Sizes Comp. to "Neutral" Prior Contexts
  3.3. Testing Predictions of BIASES: Experiment 3.1
    3.3.1. Methods
      3.3.1.1. Subjects
      3.3.1.2. Materials
      3.3.1.3. Procedure
    3.3.2. Results: Logistic Regression Analysis of Biased Contexts
    3.3.3. Results: Model Comparison 1 – Subject Variability
    3.3.4. Results: Model Comparison 2 – Inherent Biases in "Neutral" Priors
    3.3.5. Conclusion

Chapter 4. Top-Down Effects on Spoken Word Recognition in Aphasia: A Model-Based Assessment of Information Processing Impairments
  4.1. Introduction
    4.1.1. Brief Introduction
    4.1.2. Overview of Chapter 4
    4.1.3. Lexical Processing in Aphasia
      4.1.3.1. Lexical Processing Deficits
      4.1.3.2. The Lexical Activation Hypothesis
      4.1.3.3. Alternative Accounts of Lexical Processing Deficits
      4.1.3.4. Top-Down Effects and Lexical Processing
  4.2. Applying BIASES to Spoken Word Recognition in Aphasia
    4.2.1. Brief Overview of BIASES
    4.2.2. From Activations to Probabilities: Lexical Activation Hypothesis
      4.2.2.1. Preliminary Simulations: Lexical Activation Hypothesis
      4.2.2.2. Implications for Top-Down Effects on Speech Perception
    4.2.3. Implementing BIASES-A
      4.2.3.1. Adapting the Prior and Likelihood of BIASES
      4.2.3.2. Modeling Speech Processing Deficits in BIASES-A
  4.3. Top-Down Effects of Lexical Status on SWR in Aphasia
    4.3.1. Simulation Study 4.1: Lexical Effects in Aphasia
    4.3.2. Experiment 4.1: Lexical Effects in Aphasia
      4.3.2.1. Methods
        4.3.2.1.1. Subjects
        4.3.2.1.2. Stimuli
        4.3.2.1.3. Procedure
      4.3.2.2. Results: Statistical Analyses
        4.3.2.2.1. Motivation and Interp. of Logistic Regressions
        4.3.2.2.2. Control Subjects: YCs vs. AMCs
        4.3.2.2.3. Elderly Subjects: AMCs vs. BAs vs. W/CAs
        4.3.2.2.4. Summary of Results of Statistical Analyses
      4.3.2.3. Results: Model-Based Analyses
        4.3.2.3.1. Motivation of Model-Based Analyses
        4.3.2.3.2. Key Results of Model-Based Analyses
      4.3.2.4. General Discussion of Results of Experiment 4.1
  4.4. Top-Down Effects of Sentence Context on SWR in Aphasia
    4.4.1. Joint Modeling of Contextual & Lexical Effects on Word Recognition
    4.4.2. Simulation Study 4.2: Sentential Context Effects in Aphasia
    4.4.3. Experiment 4.2: Sentential Context Effects in Aphasia
      4.4.3.1. Methods
        4.4.3.1.1. Subjects
        4.4.3.1.2. Stimuli
        4.4.3.1.3. Procedure
        4.4.3.1.4. Methodological Diff's Between Subject Groups
      4.4.3.2. Results: Statistical Analyses
        4.4.3.2.1. Control Subjects: YCs vs. AMCs
        4.4.3.2.2. Elderly Subjects: AMCs vs. BAs vs. W/CAs
        4.4.3.2.3. Summary of Results of Statistical Analyses
      4.4.3.3. Results: Model-Based Analyses
        4.4.3.3.1. Motivation of Model-Based Analyses
        4.4.3.3.2. Key Results of Model-Based Analyses
      4.4.3.4. General Discussion of Results of Experiments 4.1 and 4.2

Conclusion

Introduction

During auditory language comprehension, a listener's principal objective is to infer what meaning the speaker intended to convey. For a healthy adult communicating in his or her native language, this task is typically both effortless and errorless. What makes this behavioral generalization noteworthy is the fact that there is rarely just one possible interpretation of a given acoustic signal; in fact, perceptual uncertainty is ubiquitous in speech communication. Indeed, listeners are faced with uncertainties arising from countless sources, ranging from unclearly produced speech to imperfect listening conditions to inescapable ambiguities inherent in language (e.g., homophony). It is easy to see how the challenge posed by such pervasive uncertainty might be crippling to a speech processing system that relied exclusively on the ambiguous acoustic cues available in the perceived speech signal to decode a speaker's meaning. A fundamental question, then, concerns how the perception of speech can be so robust despite these barriers.
In the present work, it is argued that at least part of the answer to that question is that, even though much ambiguity exists, when one source of information is unreliable, there are usually other cues available that can be leveraged to understand the speaker's intended meaning. For instance, although the spoken word /bɔrd/ could be an exemplar of either board or bored, words are rarely uttered in isolation, and – most of the time – there is little doubt as to which meaning to assign to the lexically ambiguous speech token. This is, in large part, because the word's linguistic and extralinguistic context provides another cue that can aid in the extraction of meaning, particularly when the acoustic cues are degraded or insufficient.

While this explanation may appear straightforward, it raises the key question of how multiple sources of information are integrated by the speech processing system. That question is the focus of this thesis. More specifically, the present work employs behavioral experiments and computational modeling in order to investigate how so-called bottom-up acoustic cues available in the sensory signal and top-down information about which words or sounds are likely in a given context are integrated during online auditory language comprehension.

For the purpose of exposition, let a word be defined as the discrete linguistic unit that stands at the juncture between the sound information perceived by the listener in the speech signal and the underlying meaning associated with the signal (Blumstein, 2009). Spoken word recognition, then, represents a critical subroutine of auditory language comprehension if a listener is to extract meaning from the perceived signal. A given word can be thought of as being associated with (1) a lexical form that defines how the word sounds, and (2) the word's meaning. When a speech token that resembles the lexical form of a word is perceived, that word can be recognized and its meaning accessed.

When it comes to recognizing a spoken word, acoustic cues in the sensory signal are certainly the paramount source of information available to the speech perception system. The perceptual system is adept at decoding the speech signal based on auditory cues alone. These information sources, which are the product of low-level sensory and phonetic processing of the input, are referred to as bottom-up cues. However, as important as bottom-up cues are, much research has shown that, in addition to integrating a host of bottom-up cues, word recognition also involves the recruitment of top-down information that is not immediately available in the signal, but instead relies on cognitive or higher-level linguistic processing. A general conclusion of this line of research is that listeners tend to perceive things that are more probable; for instance, identification is biased towards words rather than non-words (Ganong, 1980) and towards contextually consistent or sensible words over words that are inconsistent or nonsensical given the context (e.g., Borsky et al., 1998; Fox & Blumstein, in press). Although the roles of both bottom-up and top-down cues are well attested, this thesis examines the basic computational principles that underlie their integration.
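It is useful to preview how this integration can be stated at the computational level, anticipating the Bayesian treatment developed in Chapter 2. The formulation below is a deliberately simplified sketch (the notation is mine, not the model's official statement): w ranges over candidate words, A denotes the acoustic evidence, and C the preceding sentential context.

    \[
    \underbrace{P(w \mid A, C)}_{\text{word given sound and context}}
    \;\propto\;
    \underbrace{P(A \mid w)}_{\text{bottom-up acoustic likelihood}}
    \times
    \underbrace{P(w \mid C)}_{\text{top-down contextual prior}}
    \]

On this view, a lexical or sentential bias in identification is simply the prior term pulling the posterior toward contextually probable words whenever the likelihood term leaves the percept ambiguous.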
Although a number of models have sought to tackle the question of how top-down cues come to influence speech perception, several issues remain unclear. Firstly, a long-standing, much-debated topic regards whether the observed biases in listeners' responses (toward words over non-words and toward contextually consistent words over inconsistent words) reflect the direct, top-down modulation of perceptual processing of the input or whether they reflect processing biases at a later decision-making level (see, e.g., McClelland, Mirman & Holt, 2006; Norris, McQueen & Cutler, 2000). Secondly, despite substantial evidence that sentential context influences spoken word recognition, existing models lack an explicit characterization that can account for these effects. Thirdly, another weakness of existing spoken word recognition models is that they largely ignore the enormous variability that exists in the observed sizes of top-down effects. Finally, the extent to which patients with aphasia experience deficits in top-down processing and cue integration during speech perception is poorly understood.

In the present work, each of these four issues is considered in turn. Empirical and computational methodologies are employed in order to probe the questions each issue poses. Ultimately, the results of this thesis provide a more complete picture of the computations that take place at the interface between the perceptual processing of speech and the cognitive and linguistic processing of language, while also establishing a novel theoretical basis that promises to guide future work.

Chapter 1 [1]
Top-down effects of syntactic sentential context on phonetic processing

[1] At the time of submission of this dissertation, a version of Chapter 1 is currently in press at the Journal of Experimental Psychology: Human Perception and Performance (http://dx.doi.org/10.1037/a0039965). This article may not exactly replicate the final version published in the APA journal.

1.1. Introduction

During auditory language comprehension, listeners integrate information from a variety of sources in their categorization of sounds and words, especially when confronted with degraded or ambiguous speech. Besides low-level acoustic cues, listeners are also sensitive to higher-level information that is not immediately available in the raw sensory input. For instance, listeners exhibit a lexical bias in their categorization of a phonetically ambiguous segment between /g/ and /k/ such that they label the segment as /g/ more often when followed by –ift, but as /k/ more often when followed by –iss (Ganong, 1980; see also Burton, Baum & Blumstein, 1989; Burton & Blumstein, 1995; Connine, 1990; Connine & Clifton, 1987; Fox, 1984; McQueen, 1991; Miller & Dexter, 1988; Myers & Blumstein, 2008; Pitt, 1995; Pitt & Samuel, 1993). Moreover, when a stimulus is phonetically ambiguous between two words (e.g., between goat and coat), listeners exhibit a semantic bias such that they label the ambiguous word as goat more often when embedded in a sentence like The busy farmer hurried to milk the… but as coat more often in The elderly tailor had to dry-clean the… (Borsky, Tuller & Shapiro, 1998; Connine, 1987; Connine, Blasko & Hall, 1991; Garnes & Bond, 1976; Guediche, Salvata & Blumstein, 2013; Miller, Green & Schermer, 1984). Like semantic information, syntactic (Isenberg, Walker & Ryder, 1980; van Alphen & McQueen, 2001), morphosyntactic (Martin, Monahan & Samuel, 2012), and pragmatic
From this robust literature, it is clear that sensory processing alone cannot explain listeners’ judgments about the identities of spoken words and sounds. In order for higher- level information to influence spoken word recognition, perceptual input must make contact with lexical representations which act as a gateway to the semantic, syntactic and other properties of words, and those lexical representations must then be able to influence behavioral responses in tasks like those described above (Samuel, 2011). 1.1.1. Models of Spoken Word Recognition: Competing Frameworks There exists, however, a longstanding debate about how those lexical representations come to influence spoken word recognition (for reviews, see McClelland, Mirman & Holt, 2006; McQueen, Norris & Cutler, 2006). Two competing families of spoken word recognition models – interactive models and autonomous models – each account for the effects of higher-level cues by appealing to different mechanisms. In particular, they differ in how pre-lexical (i.e., perceptual/phonetic) representations and lexical representations influence one another. Both approaches allow for a bottom-up flow of information such that pre-lexical processing of speech modulates the extent to which competing lexical representations are supported. However, only interactive models of spoken word recognition, exemplified by TRACE (McClelland & Elman, 1986; McClelland, 1991), incorporate top-down feedback projections that, conversely, allow lexical representations to modulate the extent to which competing pre-lexical representations are supported (see also Adaptive Resonance Theory; Grossberg, 1980, 2003; Grossberg & Myers, 2000). 6 Autonomous models of spoken word recognition, on the other hand, eschew top- down modulation of phonetic processing. Instead, they account for lexical and contextual effects on listeners’ responses by positing that both higher-level and lower-level information can influence phonemic decisions, but it is maintained that pre-lexical representations remain faithful to the bottom-up acoustic input. Thus, under the autonomous view, the observed biases reflect the integration of multiple information sources, but, crucially, this integration does not affect lower-level phonetic processing itself (see, e.g., Norris, McQueen & Cutler, 2000; McQueen, Jesse & Norris, 2009). A succession of autonomous models has been proposed in the literature, including Race (Cutler & Norris, 1979; Cutler, Mehler, Norris & Segui, 1987), Shortlist (Norris, 1994), Merge (Norris et al., 2000), and Shortlist’s Bayesian implementation (Norris & McQueen, 2008). Although each varies in its details, none allows for higher-level modulation of phonetic processing through feedback (McQueen et al., 2006; see also Fuzzy Logical Model of Speech Perception; Massaro, 1989). Thus, any behavioral demonstration of lexical or contextual biases in phoneme judgments could, in theory, be explained at the level of the judgment itself (a post- perceptual explanation, as in autonomous models), or by direct modulation of pre-lexical processing prior to the judgment (a perceptual explanation, as in interactive models). Because of this, interactive and autonomous models of spoken word recognition have, in practice, proven difficult to distinguish. However, past work suggests that these two frameworks may diverge in their predictions about the time course of these effects. 1.1.2. 
1.1.2. Time Course of Sentential Context Effects

One result that proponents of autonomous models have cited as incompatible with TRACE (and interactive models more generally) concerns the time course of lexical and contextual effects. Specifically, they contend that if top-down feedback can directly alter the activation of pre-lexical representations (as it can in TRACE), then the influence of higher-level information could only grow or remain stable as a function of processing time, but could not diminish (McQueen, 1991; Tuinman, Mitterer & Cutler, 2014; van Alphen & McQueen, 2001). According to this argument, top-down biasing information within an interactive framework tends to overwrite the ambiguous bottom-up input pattern, shaping it and pulling it towards a pattern that would be expected for lexically- or contextually-consistent speech.

For instance, in one experiment, the identification of phonetically ambiguous function words (between de and te; roughly the and to in Dutch) was shown to be biased by manipulating the target words' grammaticality in context (van Alphen & McQueen, 2001). Ambiguous stimuli were labeled as /de/ more often in sentences like We verstoppen [?] schaatsen (We hide [the]/[to] skates; de-biased) than in sentences like We behoren [?] schaatsen (We ought [to]/[the] skate; te-biased) (cf. Isenberg, Walker & Ryder, 1980). Importantly, though, when responses were divided into bins based on reaction time (cf. Fox, 1984), contextual biases from an immediately preceding syntactic cue were strongest in fast responses and grew weaker with time. A similar pattern of results was found for the time course of syntactic context effects on a different type of phonetic judgment by Tuinman, Mitterer and Cutler (2014).

Can interactive models account for a smaller contextual bias in slower responses than in faster ones? Van Alphen and McQueen (2001) argue that they cannot: if ambiguous bottom-up information has been overwritten by top-down feedback such that even early activation levels at the pre-lexical level are biased by context, then the original (ambiguous) pre-lexical representation cannot be recovered in order to yield more ambiguous (i.e., less biased) responses later in processing. That is, in interactive models, feedback permanently and irrevocably biases the bottom-up record of the unbiased pre-lexical representation (Massaro, 1989). Although Dahan, Magnuson and Tanenhaus (2001) dismiss this argument on the grounds that using response latencies to track top-down influences on pre-lexical activation "is not straightforward" (p. 321), it remains unclear to what extent the time course of sentential context effects on spoken word recognition does, in fact, challenge interactive models of speech perception.

The present work aimed to examine this question by testing two possible alternative explanations of previous time course data (van Alphen & McQueen, 2001; Tuinman et al., 2014). Specifically, two experiments investigated whether characteristics of the experimental designs in earlier work might have allowed subjects to adopt strategies that would not only explain the diminishing bias effect, but also undermine the ability of those data to distinguish between interactive and autonomous models. Following previous work that showed diminishing contextual biases, both of the present experiments examined syntactic sentence context effects on subjects' perception of phonetically ambiguous speech.
Experiment 1.1 tested whether the diminishing influence of sentential context would persist when subjects could not plan contextually congruent responses prior to the presentation of the target stimulus. Experiment 1.2 tested whether the diminishing bias effect would persist when subjects were induced to engage in phoneme identification rather than word identification strategies.

1.2. Experiment 1.1

In prior work (van Alphen & McQueen, 2001; Tuinman et al., 2014), the experimental design allowed subjects to identify the contextually appropriate response before hearing the ambiguous target segment. For example, in van Alphen and McQueen's (2001) study, a preceding context biased the identification of an acoustically ambiguous target between de and te. Because this experiment utilized a single continuum with only two alternatives (de and te), subjects could have identified and prepared a grammatically congruent response before they encountered the target. This raises the question of whether some responses might reflect decisions that were generated before processing of the target could have actually begun. If the fastest responses were disproportionately contaminated with such pre-planned decisions, it would not be surprising to observe a fast-arising bias that appears to weaken at longer RTs (once subjects had heard the target word). Importantly, this explanation would not challenge interactive models; button-presses planned before a stimulus is presented could not bear on whether the pre-lexical representation of that stimulus is modulated by interactive feedback.

Thus, Experiment 1.1 utilized two voice-onset time (VOT) continua, rather than just one: a noun–verb continuum (bay–pay) and a verb–noun continuum (buy–pie), crossing target word voicing with syntactic category bias. Tokens from these VOT continua were appended to noun-biasing and verb-biasing sentence contexts (e.g., Valerie hated the..., Brett hated to...), and subjects identified the initial segment of sentence-final targets (/b/ vs. /p/). In this way, subjects could not know which phonemic response (/b/ vs. /p/) was congruent with the contextual bias of a sentence until the target word was presented, thereby preventing them from anticipating or planning a specific button-press response before hearing the target word.

1.2.1. Methods

1.2.1.1. Materials

1.2.1.1.1. Target Word Selection

Sixteen monolingual volunteers who spoke American English participated in a norming study to confirm that the four critical target words (bay, pay; buy, pie) had strong syntactic category biases in the expected directions. The four targets were included in a randomly ordered list of 40 words that included words from a variety of syntactic categories. For each word in the list, subjects wrote one sentence "that might be heard in everyday speech," and their responses were coded for the target words' part of speech usage in each sentence. The noun targets (bay and pie) were each used by at least 15 of 16 subjects as a noun; the verb targets (pay and buy) were each used by at least 14 subjects as a verb. Thus, bay/pay and buy/pie were judged to be sufficiently biased noun/verb and verb/noun minimal pairs, respectively.

1.2.1.1.2. Sentence Contexts

Twenty main verbs (e.g., hate, want) that could be followed by either a noun phrase or an infinitive phrase (e.g., hate the bay; hate to pay) were identified.
Forty sentence contexts (20 noun-biased; 20 verb-biased) were then constructed by concatenating a first name, the past tense form of the main verb, and either the or to, yielding pairs of sentence contexts like Valerie hated the... and Brett hated to... The full list of contexts can be found in Appendix A. This design helped ensure that participants could not use information from the main verb to predict the target word. In this way, any syntactic bias effect on responses could be attributed to the influence of the immediately preceding function word.

1.2.1.1.3. Stimulus Recording

Sentences ending with the target words were recorded by a female monolingual native American English speaker in a sound-dampened room with an Edirol digital recorder (model R09-HR; Sony microphone model ECM-MS907; sampled: 44,100 Hz / 24 bits / stereo; resampled in BLISS speech-editing software: 22,050 Hz / 16 bits / mono; Mertus, 1989). Target words were spliced out of the sentences' waveforms, yielding 40 partial sentences (20 noun-biased, 20 verb-biased).

1.2.1.1.4. Target Word Manipulation

A natural token of bay and of buy served as base tokens for two VOT continua, constructed using the BLISS waveform editor (Mertus, 1989). Beginning with the base voiced token of bay, an acoustically modified voiceless end of the bay–pay continuum and each intermediary token were generated by successively adding aspiration from the middle of the aspiration of a naturally-produced pay token and removing pitch periods of equal duration from the onset of the vowel of the natural bay token. This procedure yielded 12 stimuli with VOTs ranging from 2 to 64 milliseconds. The onset's burst was amplified (2x) for all tokens in the continuum because the aspiration from the pay token rendered the burst in the natural bay inaudible. Tokens of the buy–pie continuum were generated in the same manner, yielding 12 stimuli with VOTs ranging from 3 to 62 milliseconds. As with the bay–pay continuum, the onset's burst was amplified (3x) for all tokens in the continuum. Twenty milliseconds of silence was appended to the beginning of each token of both continua. The waveforms of all stimuli across the two continua were normalized for amplitude so that the highest peaks of the waveforms were equally high.
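The cross-splicing procedure just described can be sketched crudely in code. The following toy version is my own illustration, not the BLISS procedure actually used (which also manages pitch-period boundaries, zero crossings, and burst amplification); it conveys the core idea that each continuum step trades a stretch of the voiced token's vowel onset for an equal duration of aspiration taken from the voiceless token, lengthening the effective VOT.

    import numpy as np

    def continuum_token(voiced, voiceless, fs, vot_ms, burst_ms=5):
        """Toy VOT-continuum step: splice vot_ms of aspiration from the
        voiceless token (after its release burst) onto the voiced token,
        removing an equal duration from the voiced vowel's onset.
        `voiced` and `voiceless` are 1-D sample arrays; fs is in Hz."""
        n = int(fs * vot_ms / 1000)      # samples of aspiration to splice in
        b = int(fs * burst_ms / 1000)    # samples treated as the release burst
        aspiration = voiceless[b:b + n]  # aspiration following the burst
        return np.concatenate([voiced[:b], aspiration, voiced[b + n:]])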
1.2.1.1.5. Target Word Token Selection

Ten monolingual native English-speaking volunteers from the Brown University community participated in a norming study whose goal was to select the eight target stimuli: two ambiguous tokens and one token from each phonetic category endpoint for each continuum. Twenty trials each of the 12 tokens of each continuum (480 total trials) were presented in isolation (without sentential context) binaurally to participants in random order. Participants responded whether each target began with a "p" or "b" by pressing a corresponding button (response mapping was counterbalanced between subjects) and were instructed to respond as quickly as possible while maintaining accuracy, and to guess if they were unsure. The two tokens from each continuum with the identification rates closest to 50% and the highest mean response reaction times (Pisoni & Tash, 1974) were selected as the ambiguous tokens from each continuum. Endpoint tokens of each continuum were selected such that the /b/ and /p/ endpoints were equidistant from the ambiguous pair. Table 1.1 shows the results for the eight selected tokens.

Continuum   Token #   VOT (ms)   Mean % "p"   Mean RT (ms)
bay–pay     2         7          1.5          591
bay–pay     4         18         42.5         794
bay–pay     5         24         74.0         778
bay–pay     7         35         97.5         655
buy–pie     2         7          2.0          609
buy–pie     4         18         39.0         852
buy–pie     5         23         79.0         826
buy–pie     7         34         86.5         720

Table 1.1. Mean classification rates and reaction times (RTs) for the selected tokens from each voice-onset time (VOT) continuum.

1.2.1.2. Participants

Fifty self-reported native monolingual American English speakers with normal hearing from the Brown University community volunteered or received course credit to participate in Experiment 1.1. None had participated in any of the norming studies reported earlier. Due to technical difficulties, one subject's incomplete data were excluded from analysis.

1.2.1.3. Task

Each of the eight selected tokens was appended to each of the 40 sentence contexts, yielding 320 stimuli. The resulting design crossed two levels of CONTINUUM (bay–pay, buy–pie) with two levels of CONTEXT (noun-biased, verb-biased) and four tokens from each VOT continuum. All sentences were presented binaurally in a random order after eight practice trials. Participants were instructed to indicate whether the last word in each sentence began with a "b" or a "p" by pushing the appropriate button with either the index or middle finger of their dominant hand (response mapping was counterbalanced between subjects). The experiment did not advance to the next trial until a subject responded, but participants were instructed to respond as quickly as possible while maintaining accuracy, and to guess if they did not know. They were also warned that some sentences might not make sense. Reaction times were measured from the onset of the target word.

1.2.2. Results

Figure 1.1. Mean proportion of /p/-responses to tokens from each VOT continuum in Experiment 1.1 after noun-biasing and verb-biasing sentence contexts. Error bars represent standard error.

The results of Experiment 1.1 are shown in Figure 1.1. Because this study was designed to examine contextual effects on the processing of ambiguous speech, data from responses to the two middle tokens in each continuum were analyzed. The mean proportion of /p/-responses for those intermediate tokens in each context and continuum are summarized in Figure 1.2. Individual subjects' data were excluded if they did not make at least 10% /b/-responses and 10% /p/-responses to the ambiguous tokens in a given continuum. Based on this criterion, all 49 subjects perceived at least one of the continua ambiguously (36 for the bay–pay continuum; 46 for the buy–pie continuum; 33 for both continua). Finally, 205 trials (1.31% of responses) with extreme reaction time (RT) values were removed prior to analysis (>3 standard deviations from the mean RT for a given subject/target/context).

Figure 1.2. Mean proportion of /p/-responses to ambiguous tokens from each VOT continuum in Experiment 1.1 after noun-biasing and verb-biasing sentence contexts. Error bars represent standard error.

To test for an effect of sentential context on the identification of ambiguous stimuli, the data were analyzed using mixed effects logistic regression (Baayen, Davidson & Bates, 2008; Jaeger, 2008); a detailed description of the analyses can be found in Appendix C. The regression model included fixed effects for CONTEXT (verb-biased vs. noun-biased), CONTINUUM (bay–pay vs. buy–pie), and VOT (the VOT of each ambiguous token), along with all their two- and three-way interactions. All random intercepts and slopes were included for both subjects and items (i.e., main verbs; e.g., hated).
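Models of this kind are conventionally fit with a mixed-effects package such as lme4's glmer in R; the dissertation itself defers those details to Appendix C, which is not reproduced here. As a rough, simplified illustration of the fixed-effects specification only (column names are invented, and the reported by-subject and by-item random intercepts and slopes are omitted), the model could be sketched as:

    import statsmodels.formula.api as smf

    # trials: one row per response; resp_p = 1 if the subject reported /p/.
    # C(...) marks CONTEXT and CONTINUUM as categorical; '*' expands to all
    # main effects plus two- and three-way interactions. NOTE: this drops
    # the random-effects structure of the reported analysis.
    fit = smf.logit("resp_p ~ C(context) * C(continuum) * vot",
                    data=trials).fit()
    print(fit.summary())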
Since the CONTINUUM factor crossed voicing (/b/ vs. /p/) and syntactic category bias (noun vs. verb), the critical test of a syntactic context effect is a CONTEXT × CONTINUUM interaction. A significant CONTEXT × CONTINUUM interaction would indicate that, after hearing the (which requires a noun rather than a verb), subjects were more likely to report hearing a /p/ if the target came from the buy–pie continuum, and more likely to perceive a /b/ if it came from the bay–pay continuum. As suggested by Figure 1.2's reversal in direction of the context effect within each continuum, the results showed a robust CONTEXT × CONTINUUM interaction (β = 2.27, SE = 0.30, |z| = 7.53, p < 0.001). Follow-up tests confirmed a crossover interaction between CONTEXT and CONTINUUM, indicated by a significant simple effect of CONTEXT on responses to stimuli from each continuum, but in opposite directions (bay–pay: β = -1.37, SE = 0.19, |z| = 7.34, p < 0.001; buy–pie: β = 0.95, SE = 0.20, |z| = 4.82, p < 0.001). All other effects that reached significance in the omnibus and follow-up analyses are reported and discussed in Appendix C.

The central aim of Experiment 1.1 was to examine whether the obtained syntactic context effect was modulated by response latency. Thus, following the analysis procedure of Tuinman and colleagues (2014), we divided responses into two RT ranges (fast vs. slow) to test for differences in the size of this contextual bias. To do this, the responses of each participant within each cell of the experiment's design (20 responses for each subject/context/continuum/token, less outliers) were ranked according to their RTs. Then, from each ranked list of RTs, the eight trials (40%) with the shortest RTs were labeled fast (mean RT = 577 ms, SD = 149 ms) and the eight trials (40%) with the longest RTs were labeled slow (mean RT = 1,012 ms, SD = 410 ms), omitting the mid-range. Together, the fast and slow RT ranges constituted a binary factor: SPEED.

Tuinman and colleagues (2014) showed that sentential context interacted with SPEED such that subjects' responses were less influenced by context at slow RTs than at fast RTs. However, in the present study, CONTEXT (i.e., the vs. to) has the opposite effect on responses to stimuli from the bay–pay continuum than for stimuli from the buy–pie continuum. Therefore, the CONTEXT and CONTINUUM factors were recoded into a single factor (BIAS) with two levels (/p/-congruent vs. /b/-congruent), each corresponding to two types of trials. Trials were classified as /p/-congruent when a verb-biasing context (e.g., Brett hated to...) preceded a target from the bay–pay continuum (because a pay-response is congruent with the context in these trials) and when a noun-biasing context (e.g., Valerie hated the...) preceded a target from the buy–pie continuum (because a pie-response is congruent with the context in these trials). Conversely, trials were /b/-congruent if they contained a verb-biasing context and a target from the buy–pie continuum ("...to /?ai/") or a noun-biasing context and a target from the bay–pay continuum ("...the /?ei/"). Visually, the /p/-congruent trial-types correspond to the two conditions in Figure 1.2 with higher rates of /p/-responses, while the lower bars represent /b/-congruent trials.
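For concreteness, the SPEED binning and BIAS recoding just described might be implemented as follows; the column names and the pandas-based approach are my own illustration, not the dissertation's analysis code.

    import pandas as pd

    def add_speed_and_bias(trials, frac=0.40):
        """Label the fastest 40% of trials in each design cell 'fast' and
        the slowest 40% 'slow' (mid-range omitted), and recode CONTEXT x
        CONTINUUM into BIAS. Assumes columns: subject, context ('the'/'to'),
        continuum ('bay-pay'/'buy-pie'), token, rt."""
        trials = trials.copy()
        trials["speed"] = None

        def label_cell(cell):
            cell = cell.copy()
            k = int(round(len(cell) * frac))
            ranked = cell.sort_values("rt").index
            cell.loc[ranked[:k], "speed"] = "fast"   # shortest RTs
            cell.loc[ranked[-k:], "speed"] = "slow"  # longest RTs
            return cell

        trials = trials.groupby(
            ["subject", "context", "continuum", "token"], group_keys=False
        ).apply(label_cell)

        # A /p/-response is context-congruent after 'to' on bay-pay trials
        # (pay) and after 'the' on buy-pie trials (pie); otherwise the
        # congruent response is /b/.
        p_congruent = ((trials.context == "to") & (trials.continuum == "bay-pay")) | \
                      ((trials.context == "the") & (trials.continuum == "buy-pie"))
        trials["bias"] = p_congruent.map(
            {True: "/p/-congruent", False: "/b/-congruent"})
        return trials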
Having created the SPEED and BIAS factors, we examined whether there was a BIAS × SPEED interaction similar to what was observed in previous work (Tuinman et al., 2014; cf. van Alphen & McQueen, 2001). Figure 1.3 shows the mean proportion of /p/-responses subjects made for ambiguous tokens in /p/-congruent and /b/-congruent conditions in each of the RT ranges. A logistic regression analysis with fixed effects for BIAS, SPEED, VOT, their two- and three-way interactions, and all corresponding random intercepts and slopes for subjects and items was conducted (see Appendix C for details). This analysis revealed a significant BIAS × SPEED interaction (β = 0.51, SE = 0.15, |z| = 3.30, p < 0.001), wherein responses were more likely to be syntactically congruent – and hence show a larger bias effect – in faster responses (61.6% /p/-responses to ambiguous tokens in /p/-congruent conditions vs. 41.4% in /b/-congruent conditions) than in slower responses (/p/-congruent: 55.6%; /b/-congruent: 42.0%). Additional effects that reached significance are reported and discussed in Appendix C.

Figure 1.3. Mean proportion of /p/-responses in fast and slow responses to ambiguous tokens in /p/-congruent ("...to /?ei/" and "...the /?ai/") and /b/-congruent ("...the /?ei/" and "...to /?ai/") conditions in Experiment 1.1. Results indicate a weaker effect of BIAS in slow responses than in fast responses (see main text). Error bars represent standard error.

1.2.3. Discussion

Experiment 1.1 was designed to investigate one possible alternative explanation for previous results showing a diminishing influence of context on spoken word recognition (van Alphen & McQueen, 2001; Tuinman et al., 2014). Here, as in previous work, acoustic targets were manipulated to be phonetically ambiguous between two words and embedded in sentential contexts that rendered one of those words ungrammatical (or at least less plausible). However, unlike previous studies in which subjects could identify which response (button-press) would be congruent with the context before the target was presented, Experiment 1.1's design made it impossible for responses to the target stimuli to be systematically biased by the context if such responses were planned prior to target presentation. Despite the addition of this control, the results of Experiment 1.1 showed that contextual biases on spoken word recognition still diminished over time, replicating earlier results.

Having ruled out this alternative explanation, the central theoretical question that remains is whether the observation of a weakening bias effect in Experiment 1.1 (and elsewhere; van Alphen & McQueen, 2001; Tuinman et al., 2014) is inconsistent with interactive speech perception models. As described earlier, this interpretation follows from the argument that top-down feedback permanently overwrites pre-lexical information in such models (e.g., TRACE). However, this argument rests on the critical assumption that subjects' decisions (both in previous experiments and in Experiment 1.1) tap into pre-lexical processing levels. In previous studies (van Alphen & McQueen, 2001; Tuinman et al., 2014), subjects made word identification decisions, which reflect the relative activation of competing lexical representations. Results reflecting lexical-level decisions cannot be taken as evidence either for or against top-down feedback to pre-lexical levels.
Similarly, although Experiment 1.1 employed a phoneme identification task, it remains possible that subjects were monitoring for the four possible target words (bay, pay, buy, pie) and learning that each response button corresponded to two words, and thus were implicitly engaging in word identification. Experiment 1.2 aimed to resolve this potential issue by making lexical-level decisions difficult, if not impossible.

1.3. Experiment 1.2

Experiment 1.2 considered a second factor important for the interpretation of contextual influences that diminish in slower responses. Previous studies showing this pattern of results employed word identification tasks, not phoneme identification tasks (van Alphen & McQueen, 2001; Tuinman et al., 2014). It is essential to consider the task-specific linking hypothesis (cf. Magnuson, Mirman & Harris, 2012; Tanenhaus, Magnuson, Dahan & Chambers, 2000) that transforms model activations into behavioral predictions in TRACE: "word identification responses are assumed to be based on readout from the word level, and phoneme identification responses are assumed to be based on readout from the phoneme level" (McClelland & Elman, 1986, p. 21). Thus, since subjects were identifying words (not phonemes), a model like TRACE would predict that lexical responses should reflect word-level (not phoneme-level) activations. Consequently, the time course of context effects demonstrated in previous work may not, in fact, be inconsistent with either interactive models (generally) or TRACE (specifically), because word identification data are not relevant to the question of the presence or absence of feedback from lexical to pre-lexical nodes.

Experiment 1.2 modified the design of Experiment 1.1 to discourage subjects from adopting word identification strategies. In particular, subjects performed a phoneme identification task in which the critical target stimuli from the bay–pay and buy–pie continua were embedded among twenty filler target words beginning with /b/ or /p/. We hypothesized that, when responding to 24 unique target words, subjects would have to monitor for the identity of a target's initial consonant in order to perform the phoneme identification task rather than utilizing word-level strategies. As such, subjects' responses in Experiment 1.2 should reflect the relative evidence for competing pre-lexical representations, a prerequisite to using any time course analysis to discriminate between models with and without interactive feedback.

The stimuli for Experiment 1.2 included sentences of the same form as Experiment 1.1 (e.g., Brett hated to...), but 160 sentences ending with critical targets (an acoustically manipulated token from either the bay–pay or buy–pie continuum) were embedded among 800 sentences ending with filler targets.

1.3.1. Methods

1.3.1.1. Materials

1.3.1.1.1. Critical Targets

The same 8 critical target tokens used in Experiment 1.1 were used in Experiment 1.2.

1.3.1.1.2. Filler Targets

Twenty words (10 beginning with /b/; 10 beginning with /p/) were selected to serve as filler targets in Experiment 1.2 (for a complete list, see Appendix B). The list of filler words included nouns (e.g., bull), verbs (e.g., put), and syntactically ambiguous words (e.g., plan). The syntactic bias of each filler word (its rate of usage as a noun vs. a verb) was computed using the Penn Treebank (Marcus, Marcinkiewicz & Santorini, 1993), and the lists of filler words beginning with /b/ and /p/ were balanced for their average syntactic bias. Finally, of the twenty filler words, eight composed four minimal pairs (e.g., bull, pull) so that, when combined with the critical targets, half of the targets in Experiment 1.2 comprised minimal pairs. The same female speaker who recorded the stimuli for Experiment 1.1 read aloud sentences ending with the filler target words. The filler targets were then spliced out of the recorded sentences, scaled to have the same maximum volume as the critical targets, and appended to 20 ms of silence (like the critical targets), but were not acoustically manipulated (e.g., by altering the VOT of onsets).
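As an illustration of what such a corpus-based bias measure can look like (my own sketch, not the authors' script, and using the small Penn Treebank sample distributed with NLTK rather than the full treebank):

    from collections import Counter
    from nltk.corpus import treebank  # requires nltk.download('treebank')

    def noun_verb_bias(word):
        """Proportion of a word's noun-or-verb occurrences tagged as a noun
        (1.0 = always a noun, 0.0 = always a verb) in the tagged corpus."""
        tags = Counter(t for w, t in treebank.tagged_words()
                       if w.lower() == word)
        nouns = sum(n for t, n in tags.items() if t.startswith("NN"))
        verbs = sum(n for t, n in tags.items() if t.startswith("VB"))
        return nouns / (nouns + verbs) if nouns + verbs else None

    print(noun_verb_bias("plan"))  # a syntactically ambiguous filler

On a measure like this, a noun filler such as bull should come out near 1.0 and a verb filler such as put near 0.0, which is the sense in which the /b/ and /p/ filler lists could be matched on average syntactic bias.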
1.3.1.1.3. Sentence Contexts

Each critical and filler target was appended to each of the forty sentence contexts used in Experiment 1.1; for each of the twenty main verbs (that is, for each item; e.g., hated), there was a set of noun-biasing and verb-biasing contexts (Valerie hated the..., Brett hated to...) followed by every critical and filler target. Of the twenty item-sets, ten were randomly selected for each participant in Experiment 1.2, and that participant heard all sentence stimuli associated with those ten items (noun-biased and verb-biased sentences, ending in all critical and filler targets). Thus, the design remained fully within-subjects and within-items, although not every subject heard the same items. As in Experiment 1.1, critical stimuli included four tokens from each of the VOT continua. In an effort to balance the likelihood that subjects would hear a sentence that ended with any given word, participants heard each of their filler sentences twice over the course of Experiment 1.2. In all, subjects heard 800 filler trials (10 items × 2 levels of CONTEXT × 20 filler target words × 2 presentations) and 160 critical trials (10 items × 2 levels of CONTEXT × 2 levels of CONTINUUM × 4 tokens from each VOT continuum), yielding a total of 960 trials.

1.3.1.2. Participants

Twenty Brown University undergraduates who were self-reported native monolingual American English speakers with normal hearing received course credit to participate in Experiment 1.2. None had participated in Experiment 1.1 or any previous norming study.

1.3.1.3. Task

The task was identical to Experiment 1.1: all critical and filler sentences were presented binaurally in a random order after eight practice trials, and participants were asked to indicate with a button-press whether the last word in each sentence began with a "b" or a "p" (response mapping was counterbalanced between subjects). All instructions were the same as in Experiment 1.1. Subjects were offered three breaks (after every 240 stimuli).

1.3.2. Results

Phoneme identification responses to trials ending with filler words were highly accurate (98.2% correct). This was unsurprising because all filler words were naturally produced tokens, so they were not phonetically ambiguous. Responses to critical trials from each continuum are shown in Figure 1.4. All further analyses followed the same approach as Experiment 1.1's analyses, independently fitting identical logistic regression models using subjects' responses to ambiguous tokens from the two VOT continua (additional details in Appendix C). Using the same criterion as in Experiment 1.1, a subject's data were excluded if the intermediate tokens of a continuum were not perceived ambiguously (the responses of 18/20 subjects were included for at least one continuum: 15 for bay–pay; 13 for buy–pie; 10 for both).
No trials warranted removal following an RT outlier analysis (same criteria as Experiment 1.1).

Figure 1.4. Mean proportion of /p/-responses to tokens from each VOT continuum (bay–pay, top panel; buy–pie, bottom panel), plotted against VOT (ms), in Experiment 1.2 after noun-biasing (the) and verb-biasing (to) sentence contexts. Error bars represent standard error.

The results of the first three-factor (CONTEXT × CONTINUUM × VOT) mixed effects regression revealed a significant CONTEXT × CONTINUUM interaction (β = 2.75, SE = 0.51, |z| = 5.34, p < 0.001; see Figure 1.5), confirming that subjects' phonemic decisions about ambiguous targets were influenced by the grammaticality of targets given a preceding context. Follow-up tests confirmed the crossover interaction (opposite effects of CONTEXT in each continuum; bay–pay: β = -1.87, SE = 0.25, |z| = 7.46, p < 0.001; buy–pie: β = 0.66, SE = 0.27, |z| = 2.45, p < 0.02). Other effects are reported and discussed in Appendix C.

Figure 1.5. Mean proportion of /p/-responses to ambiguous tokens from each VOT continuum in Experiment 1.2 after noun-biasing (the) and verb-biasing (to) sentence contexts. Error bars represent standard error.

To test whether the influence of context on phoneme identification diminished over time, trials were recoded and split into two RT ranges (fast: mean RT = 616 ms, SD = 157 ms; slow: mean RT = 1,063 ms, SD = 423 ms) for a three-way BIAS × SPEED × VOT logistic regression, as in Experiment 1.1. This analysis provided no evidence of a BIAS × SPEED interaction (p > 0.86; see Figure 1.6). That is, the effect of the syntactic manipulation on the rate of /p/-responses did not diminish between faster responses (/p/-congruent: 60.6%; /b/-congruent: 40.6%) and slower responses (/p/-congruent: 61.3%; /b/-congruent: 41.9%). Other effects are discussed in Appendix C.

Figure 1.6. Mean proportion of /p/-responses in fast and slow responses to ambiguous tokens in /p/-congruent ("...to /?ei/" and "...the /?ai/") and /b/-congruent ("...the /?ei/" and "...to /?ai/") conditions in Experiment 1.2. Unlike Experiment 1.1, the BIAS effect in Experiment 1.2 is as strong in slow responses as in fast responses (see main text). Error bars represent standard error.

1.3.3. Discussion

The goal of Experiment 1.2 was to examine contextual influences on the identification of ambiguous targets using a task designed to elicit phonemic decisions. Of particular interest was the time course of such context effects, and especially whether these results would differ from previous word identification experiments (van Alphen & McQueen, 2001; Tuinman et al., 2014) and from Experiment 1.1, in which it was unclear whether subjects were engaging in word or phoneme monitoring strategies. Experiment 1.2, like Experiment 1.1, showed that syntactic context has a robust effect on subjects' responses (including their fastest responses). However, unlike Experiment 1.1, Experiment 1.2 showed that this contextual bias on phoneme identification was as strong in slow responses as it was in fast responses.
Van Alphen and McQueen (2001) argue that if the fast-arising bias in phoneme responses was the result of lexical feedback to pre-lexical representations (as hypothesized by interactive models), then "there should have been a similar shift (if not a stronger one as more time elapsed with more feedback) in slow responses" (p. 1069). Indeed, the results of Experiment 1.2 are consistent with these predictions and thus support the view that pre-lexical representations are modulated by lexical feedback, as hypothesized by interactive models. Additionally, the results of Experiment 1.2 suggest that tasks in which word identification decisions are required, or may be used strategically by participants, may not tap pre-lexical representations.

1.4. General Discussion

The present work aimed to evaluate the validity of claims that the time course of context effects on speech perception is incompatible with interactive models of speech perception (cf. van Alphen & McQueen, 2001; Tuinman et al., 2014). According to this view, context effects within an interactive system should remain stable or grow over time, but they should not become weaker, because biasing feedback from lexical representations irreversibly overwrites an initially ambiguous representation of the acoustic input. Once pre-lexical representations are altered by top-down modulation, there is no recovering the record of the ambiguous signal, so the size of the bias effect should not diminish. Crucially, this logic is predicated on the assumption that subjects' responses reflect activation levels of pre-lexical representations, an assumption that Experiments 1.1 and 1.2 examined more closely.

Two key results emerged from Experiments 1.1 and 1.2. First, as already discussed, the results of Experiment 1.2 suggest that when an experimental task is designed to tap into pre-lexical processing, contextual influences on speech perception are robust and persistent over time. Second, the biasing effects from a preceding sentential context arose very rapidly in both experiments. In the present studies, the button-press that would represent a contextually congruent response depended on the target stimulus itself, so it was impossible for a crossover interaction to emerge unless subjects waited to hear the target stimuli. In other words, the fact that subjects' fast responses were biased suggests that the processing of ambiguous speech is influenced by the grammatical properties of competing words, that this influence is virtually immediate, that this rapid influence is not attributable to pre-planned responses, and that top-down expectations rapidly propagate to bias both lexical and pre-lexical representations. As such, our data provide strong evidence for immediate top-down effects of sentence context on speech perception.

Taken together, these results suggest that the time course of contextual effects on phoneme recognition does not, in fact, challenge the interactive modeling framework. Next, we consider the extent to which these results actually support interactive models, and how they constrain autonomous models.

1.4.1. Implications for Interactive Models of Speech Perception

Despite the fact that subjects responded to the same stimuli in Experiments 1.1 and 1.2, the time course of the contextual bias effect differed between experiments, with Experiment 1.1's results matching the pattern obtained by word identification tasks (van Alphen & McQueen, 2001; Tuinman et al., 2014).
An important question is why lexically-driven and phonemically-driven responses would generate different patterns with respect to the time course of context effects. Recall that in TRACE, outputs in n-alternative forced-choice tasks (such as phoneme monitoring/identification or visual world eye-tracking) are generated probabilistically from among a set of alternatives that is identified depending on the task and stimuli (cf. Luce, 1959). TRACE's decision model keeps track of a running average of activation levels of the alternatives, which are nodes in a single layer of the model (cf. McClelland & Rumelhart, 1981). For instance, in a phoneme identification task, two units in the Phoneme layer (e.g., /b/ and /p/) constitute the output alternatives that are tracked (McClelland & Elman, 1986; McClelland, 1987), while in word recognition or visual world eye-tracking tasks, nodes in the Word layer (e.g., bear and pear) are identified as the output alternatives (e.g., Allopenna, Magnuson & Tanenhaus, 1998; Dahan, Magnuson & Tanenhaus, 2001; Dahan, Magnuson, Tanenhaus & Hogan, 2001; Magnuson, Dixon, Tanenhaus & Aslin, 2007; Magnuson, Tanenhaus, Aslin & Dahan, 2003; McMurray, Tanenhaus & Aslin, 2002). Since different tasks dictate that outputs reflect activation dynamics in different layers of the model, it is not surprising to observe unique patterns of results for word vs. phoneme identification tasks.

Nevertheless, TRACE (or any other model) still must explain why Word-level activations become less biased over time even though biased Phoneme-level activations persist. This question can only be fully addressed once models of speech perception incorporate sentence-level representations (cf. Strand, Simenstad, Cooperman & Rowe, 2014). However, it seems likely that an interactive model could capture this difference. A model with a relatively strong stabilizing force (i.e., decay) at the lexical level or a quickly decaying influence of lexical expectation from "supra-lexical" levels of representation would predict a transient context effect in word identification responses. Notably, Word-level decay is the strongest decay parameter in TRACE (McClelland & Elman, 1986). Meanwhile, as van Alphen and McQueen (2001) suggest, pre-lexical representations that are modulated by lexical feedback may not recover from the top-down biasing influence (or at least may not recover as quickly as the lexical representations), leading to more persistent context effects on phoneme identification responses (as in Experiment 1.2).

1.4.2. Implications for Autonomous Models of Speech Perception

It is less clear to what extent the results of Experiment 1.2 challenge autonomous models of speech perception. Van Alphen and McQueen (2001) note that "if...sentential context effects are the result of a decision bias, predictions about their time course are much less clear" (p. 1059). As they acknowledge, their verbal account (Coltheart, Rastle, Perry, Langdon & Ziegler, 2001; Magnuson, Mirman & Harris, 2012; Mirman & Britt, 2013) of sentential context effects on word recognition could predict many patterns of results, including the one their experiments suggest. Nonetheless, the present results do challenge one theoretical proposal they offer. In their data, responses in the slowest RT range failed to show a significant effect of preceding context.
They ultimately interpret this as consistent with a model in which context effects on subjects' decisions are time-limited, such that sentence-level processing can only bias responses while the syntactic parse of the sentence remains ambiguous (see also Mattys, Melhorn & White, 2007). However, the persistent top-down effects on subjects' responses in Experiment 1.2 challenge the view that sentence context has a time-limited effect (see also Bicknell, Jaeger & Tanenhaus, 2015; Bicknell, Tanenhaus & Jaeger, 2015; Connine, Blasko & Hall, 1991; Szostak & Pitt, 2013). In fact, even when the diminishing bias effect was replicated (in Experiment 1.1 and by Tuinman et al., 2014), the slow responses were still significantly biased. Thus, any autonomous model of speech perception must be able to account for long-lasting contextual biases on both phoneme and word identification responses.

1.4.3. Predicting Behavior with Computational Models of Speech Perception

The present work underscores the need for explicit, testable computational models bridging speech perception and sentence processing, while also highlighting the importance of appropriately interpreting existing models' predictions. Neither TRACE nor Merge makes any predictions about sentential context effects. Indeed, McClelland and Elman (1986) conceded this point when they introduced TRACE, explicitly leaving the question to future research: "We have not yet included...higher level contextual influences in TRACE, though of course we believe they are important" (p. 60). In order to understand the mechanisms underlying speech perception in naturalistic environments, it is necessary to develop and test models in which spoken sounds and words are accompanied by a rich linguistic and non-linguistic context.

1.5. Conclusion

Interactive and autonomous models represent two powerful theories, each of which is capable of explaining many data regarding speech perception. Despite decades of arguments, rejoinders, clarifications, and revisions, they have proven difficult to distinguish. Given the persistence of the debate and the considerable power of each modeling framework, some have wondered whether any unique predictions exist that might settle the question (Cutler et al., 1987; Pitt & Samuel, 1993). The goal of this study was to assess whether the time course of context effects on speech perception is incompatible with interactive models, as has been proposed. Our results suggest that this claim is unwarranted. Indeed, when an experiment is designed to tap into pre-lexical processing dynamics, the persistent influence of top-down lexical feedback can be observed. Our findings, therefore, indicate that there is even less evidence against interactive models of spoken word recognition than previously thought. On the other hand, while the present results provide some constraints for autonomous models, they cannot entirely rule out such a framework. Ultimately, it is clear that the lack of overt, testable, divergent predictions represents a fundamental barrier to resolving this long-running theoretical debate. Given this critical gap, we believe that the development of well-constrained, explicit computational models must play a central role in future research investigating the mechanisms underlying spoken word recognition.

1.6. Overview of Next Steps

Unfortunately, as noted earlier, neither TRACE nor Merge is designed to explicitly model the role of sentential context in spoken word recognition.
In fact, as we will discuss in Chapter 2, despite strong evidence illustrating context effects (as seen in the experiments reported here in Chapter 1), the influence of sentential context on speech perception is poorly characterized in existing psycholinguistic models. In order to address this major gap, we now turn to the task of developing a computational model of speech perception capable of explaining top-down effects from a word's sentence context. As we will further show in Chapter 3, the model developed in Chapter 2 also provides a straightforward, natural account of the enormous variability in the size of top-down effects across different subjects, stimuli, and experiments – an issue that has been almost completely ignored by previous models of speech perception. Finally, as we will discuss in Chapters 3 and 4, developing such a model promises to generate novel, testable predictions that can elucidate properties of the cognitive and perceptual processing underlying auditory language comprehension in both healthy adults and patients with brain damage.

Chapter 2

Bayesian Integration of Acoustic and Sentential Evidence in Speech: The BIASES Model of Spoken Word Recognition in Context

2.1. Introduction

2.1.1. Brief Introduction

In order to comprehend spoken language, a listener must ultimately map a perceived acoustic waveform onto some meaningful interpretation of the speaker's message. The processing that underlies this complex phenomenon can be thought of as consisting of at least three subroutines: pre-lexical speech processing, spoken word recognition, and auditory sentence comprehension. While all three of these are, of course, critical for arriving at an accurate, contextualized understanding of the meaning behind some speech that reaches a listener, spoken word recognition occupies a critical juncture between the lower-level perceptual processing of a signal and the higher-level processing by which listeners recruit linguistic and world knowledge to construe the meaning of the words they identify. Because of the crucial role words play as the gatekeepers of meaning, characterizing the computations involved in the recognition of spoken words is of critical importance (for recent reviews, see Mattys, 2013; Magnuson, Mirman & Myers, 2013; Tanenhaus, 2007).

The fundamental computational problem (Marr, 1982) that must be solved in order to recognize words in speech is to infer which word (or sequence of words) was most likely to have been produced by the speaker. This decoding of the perceived speech stream would be trivial if words were consistently produced with a single acoustic form, if any given acoustic signal corresponded to exactly one word, and if perception always took place under noise-free conditions. However, none of these conditions holds for speech perception in the real world. Quite to the contrary, the number of ways in which noise, ambiguity, and uncertainty can compromise processing is daunting. Nonetheless, despite all of the potential barriers to successful mapping of signal to meaning, healthy listeners rarely experience difficulties in speech processing. Even when speech is intentionally degraded in the laboratory to tax the perceptual system, sounds and words are typically perceived with high accuracy (e.g., Luce & Pisoni, 1998; Cutler, Weber, Smits & Cooper, 2004). Indeed, listeners often fail to notice when a segment in a word is replaced with white noise or a cough (Warren & Obusek, 1971), and, when they do, it seldom impedes comprehension.
What accounts for this robustness in the face of such pervasive degradation? A complete understanding of this issue requires, minimally, answering two fundamental questions: what cues are available to the listener, and how are these cues leveraged in order to overcome the ambiguity inherent in the input?

Broadly, investigations of the first question – which cues do listeners utilize during speech recognition – have shown that listeners integrate both bottom-up (sensory-based) and top-down (knowledge-based) cues (for review, see Samuel, 2011). That is, in addition to leveraging bottom-up acoustic cues derived from perceptual processing of the speech signal, such as voice-onset time and formant values, listeners also exploit cues that require higher-level cognitive processing of the signal. For instance, listeners are sensitive to whether or not different potential interpretations (e.g., goat vs. coat) of some speech input are sensible given the preceding sentential context (e.g., The busy farmer hurried to milk the...) (Borsky, Tuller & Shapiro, 1998).

The second question – how the various cues are integrated and come to influence speech recognition – is the focus of the present work. This wide-ranging question has motivated a host of theoretical and computational models focusing on different aspects of speech processing, from how multiple distinct bottom-up cues are weighted (e.g., Massaro & Oden, 1980; Nearey, 1990, 1997; Oden & Massaro, 1978; Repp, 1982, 1983; Toscano & McMurray, 2010), to how multisensory information sources are combined (e.g., Diehl & Kluender, 1989; Fowler & Rosenblum, 1991; Kluender, 1994; Massaro, 1987; McGurk & MacDonald, 1976; Ostrand, Blumstein & Morgan, 2011; Rosenblum, 2005), to what mechanisms give rise to lexical biases on word recognition (Elman & McClelland, 1988; McClelland, Mirman & Holt, 2006; McQueen, Norris & Cutler, 2006; McQueen, Jesse & Norris, 2009; Norris, McQueen & Cutler, 2000). However, one major theoretical gap that remains concerns how top-down information available from a sentence context is integrated with bottom-up cues. This gap is especially conspicuous given that everyday speech rarely features words produced and perceived in isolation, and sentential context has consistently been shown to impact the recognition of spoken words (e.g., Borsky et al., 1998; Lieberman, 1963; Warren & Warren, 1970).

The present work aims to narrow this gap by advancing one approach to modeling the integration of top-down and bottom-up cues during speech perception. In particular, we argue that viewing speech perception through the lens of Bayesian cue integration provides a powerful, principled framework for understanding a wide range of behavioral data. To this end, we outline the issues addressed in this chapter, which is organized into three parts.

2.1.2. Overview of Chapter 2

First, we address the question of why this gap exists at all. We discuss several practical and theoretical bottlenecks associated with representing sentential context and modeling the mechanisms by which context might come to influence spoken word recognition. Second, we argue that these challenges motivate the specification of a computational-level (Marr, 1982) model of spoken word recognition capable of explicitly integrating bottom-up and top-down information sources.
We present a rational analysis (Anderson, 1990) of spoken word recognition in context and propose that a Bayesian modeling approach may offer key insights into the information processing that underlies spoken language processing. Third, we introduce BIASES (short for Bayesian Integration of Acoustic and Sentential Evidence in Speech), a Bayesian model of spoken word recognition in context. BIASES is a novel, flexible computational framework for simulating human behavior in word recognition tasks and for testing psycholinguistic theories about how bottom-up and top-down information sources are represented and integrated by listeners. Adopting a model like BIASES involves embracing three basic assumptions: (1) that listeners are sensitive to fine-grained acoustic properties of spoken words; (2) that they are also sensitive to fine-grained differences in the chances of encountering different words in a given sentence context; and (3) that, when identifying spoken words, they integrate these information sources with consideration for the relative reliability of each available cue. We review the robust empirical evidence that supports these assumptions and, in turn, the Bayesian approach to spoken word recognition.

2.2. Sentential Context and Connectionist Models of Spoken Word Recognition

Existing computational models of spoken word recognition have not directly addressed how a word's processing is influenced by its sentential context. Before motivating the present model, it is worth reviewing the evidence under consideration and possible explanations for why current models fail to account for it.

2.2.1. Modulation of Spoken Word Recognition by Sentential Context

Several decades of research have made it clear that the recognition of a spoken word is not independent of its context. Words that are unintelligible when presented in isolation can be readily identified in context (Lieberman, 1963; Pickett & Pollack, 1963; Hunnicutt, 1985; Fowler & Housum, 1987), and prior exposure to an acoustically clear prime sentence improves listeners' recognition of conceptually related words in a subsequent acoustically degraded sentence (Guediche, Reilly & Blumstein, 2014). Moreover, in addition to facilitating recognition of speech in noise, a word's context can shape a listener's interpretation of spoken words that are ambiguous between two or more words in their language. For instance, words that have been digitally manipulated to replace a critical speech segment with extraneous non-speech acoustic material (such as a cough or white noise) are more often recognized as words that are consistent with the context in which the word was presented: /*il/, where * represents the digitally substituted non-speech sound, may be identified as wheel, heel, peel, or meal depending on other words appearing in the same sentence (e.g., axle, shoe, orange, or table) (Warren & Warren, 1970; Warren & Sherman, 1974).

Furthermore, related evidence shows that such contextual effects are not restricted to the restoration of phonetic information that is missing altogether. This line of work utilizes fine-grained acoustic manipulations of phonetically relevant parameters of natural speech tokens to render the resulting stimuli ambiguous between two possible words. When these stimuli are presented in sentences that are consistent with one word or the other, they tend to be perceived as the word that is congruent with the context in which they are embedded.
For example, subjects are more likely to identify a phonetically ambiguous stimulus between goat and coat as goat when it follows a sentence like The busy farmer hurried to milk the... than when it follows a sentence like The careful tailor stopped to button the... (Borsky, Tuller & Shapiro, 1998). Such biases have been widely corroborated, whether the manipulated contextual constraints operate at the semantic (Borsky et al., 1998; Garnes & Bond, 1976; Miller, Green & Schermer, 1984; Connine, 1987; Guediche, Salvata & Blumstein, 2013), syntactic (Fox & Blumstein, in press; Tuinman, Mitterer & Cutler, 2014; van Alphen & McQueen, 2001), morphological (Martin, Monahan & Samuel, 2011), or pragmatic level (Rohde & Ettlinger, 2012; Do, 2011).

With such strong evidence for contextual effects on spoken word recognition, it is somewhat surprising that word recognition models have thus far offered no explicit account of these data. The treatment of sentential context in most existing spoken word recognition models can generally be classified into four categories. First, some models ignore the role of sentential context, focusing on other aspects of spoken word recognition (e.g., PARSYN: Luce, Goldinger, Auer & Vitevitch, 2000; ARTWORD: Grossberg & Myers, 2000; Merge: Norris, McQueen & Cutler, 2000; LAFF: Stevens, 2002). A second group of models explicitly leaves the question to future research (e.g., LAFS: Klatt, 1979; TRACE: McClelland & Elman, 1986; MINERVA 2: Goldinger, 1998; Hintzman, 1986). A third set of models asserts that incorporating sentential context would be a "straightforward" extension of the more basic model (e.g., NAM: Luce & Pisoni, 1998; Shortlist: Norris, 1994; Shortlist B: Norris & McQueen, 2008; SpeM: Scharenborg, Norris, ten Bosch & McQueen, 2005; see also Norris, McQueen & Cutler, 2015). Finally, some theories have been presented about what role sentential context might play in speech recognition (e.g., Logogen: Morton, 1969; Race: Cutler & Norris, 1979; Cohort: Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978). Several members of this set – most notably the Cohort model – were prominent theories that guided early research on spoken word recognition, and, although they have been abandoned in light of empirical challenges to some of their specific claims, the principles they embodied (e.g., graded activation, competition, autonomous vs. interactive model architectures) remain influential today.

However, with respect to the subject of the present work – sentential influences on spoken word recognition – this fourth group exemplifies what has probably been the most common treatment of the issue. That is, many theories have relied on "verbal models" that describe what sorts of processing are likely implicated (or, just as often, precluded; cf. Shillcock & Bard, 1993; Tanenhaus, Leiman & Seidenberg, 1979; Tanenhaus & Lucas, 1987) during sentence-level speech processing. However, these models have a number of disadvantages, chief among them being that they are often incompletely specified. While verbal models are critical to theory development and are useful for generating and testing many predictions, it is often difficult or impossible to assess a theory's adequacy or viability if it is not mathematically or computationally implemented, and it is even more difficult to compare its predictions to those of another competing theory (see, e.g., Magnuson, Mirman & Harris, 2012).
In short, theories and models falling into the third and fourth categories above leave much work to be done. The lack of a comprehensive model of spoken word recognition in context is probably attributable to a number of factors. For one, many difficult and important questions can be (and have been) explored without the additional complication of modeling what are logically more abstract representations and cognitive functions (e.g., the composition of meaning). However, the exclusion of sentence-level information in existing models of speech perception can also be traced to major challenges presented by the predominant modeling approach to examining higher-level influences on word recognition.

2.2.2. Challenges in Modeling Context Effects on Spoken Word Recognition

Many of the most influential models of spoken word recognition, including TRACE (McClelland & Elman, 1986), Shortlist (Norris, 1994), and Merge (Norris, McQueen & Cutler, 2000), are based on interactive activation networks (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1981, 1982). In such models, cognitive representations are connected to one another in a network, with each representation characterized by some amount of activation. Activation propagates through the network as a function of the connections between representations and the sensory input presented to the network. In localist connectionist models of spoken word recognition, each representation is designated by a node that stands for a linguistically relevant unit (e.g., a word or a phoneme), and these nodes are organized into layers (cf. McClelland & Rumelhart, 1986; Page, 2000). The nodes within a given layer represent mutually exclusive hypotheses (cf. Smolensky, 1986) about which linguistic units might be (underlyingly) present within a given speech signal. For instance, the words goat and coat would be represented as unique nodes in the Word layer of a model because a spoken word may be an exemplar of goat or coat, but not both. Although the exact details of how units are connected differ from model to model, nodes within a layer are typically connected via inhibitory connections, while mutually consistent linguistic units (e.g., a word-initial /g/ in the Phoneme layer and goat in the Word layer) are connected via excitatory connections. Critically, if some node A is connected to some other node B via an excitatory connection, then when node A increases in activation, node B will also tend to increase in activation. On the other hand, if the connection from node A to node B is inhibitory, then when node A increases in activation, node B will tend to decrease in activation (for a recent review of interactive activation models in speech perception, see McClelland, Mirman, Bolger & Khaitan, 2014).

2.2.2.1. Challenges of Modeling Context Effects: Representing Context

Given this modeling framework, it is not immediately clear how one should incorporate sentential context. Perhaps the most obvious question is: how should sentence-level information be represented? In order to capture semantic context effects (e.g., more GOAT responses after sentences about milking than about buttoning), semantic relationships among words must somehow be incorporated into the model. This might be made possible by constructing a layer of semantic features that has excitatory connections to some nodes in the Word layer based on the words' meanings (Chen & Mirman, 2012; Cree, McRae & McNorgan, 1999; Rogers & McClelland, 2004).
Alternatively, it might be possible to ignore semantic features altogether and, instead, connect word units such that words can excite other related words based on semantic associativity norms (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990; Dumais, 2004; Landauer & Dumais, 1997) or other measures (Fellbaum, 1998; Miller, 1995; Miller, Beckwith, Fellbaum, Gross & Miller, 1990; Miller & Fellbaum, 1991). However, it is not clear which of these possibilities (or what other solution) is more consistent with the organization of listeners' semantic knowledge (cf. Andrews, Vigliocco & Vinson, 2009; De Deyne & Storms, 2008; Riordan & Jones, 2011; Steyvers & Tenenbaum, 2005), nor is it clear from existing data which model would best explain context effects in the domain of word recognition. Moreover, while some such model enhancement might be able to capture semantic context effects, explaining syntactic and pragmatic context effects would require the addition of more connections and/or layers (cf. McClelland, St. John & Taraban, 1989; Recchia, Sahlgren, Kanerva & Jones, 2015; Rohde, 2002; St. John & McClelland, 1990; Strand, Simenstad, Cooperman & Rowe, 2014).

2.2.2.2. Challenges of Modeling Context Effects: Activation Dynamics

Even if the issue of representation could be solved, it is not straightforward to merely add more connections to the architecture of an existing activation-based model, because it is not clear how contextual information should come to influence words' activation levels. Adopting the same activation dynamics assumptions used in existing models, the effect of sentential context might be excitatory (for words that are supported by the context). On the other hand, the effects could be implemented via inhibitory connections, essentially ruling out words that are not supported by the context (Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978; for a similar approach in spoken word production, see Dell, Oppenheim & Kittredge, 2008). Alternatively, rather than directly altering words' activation levels, sentential context might induce adjustments to words' propagation thresholds (a parameter governing the amount of activation required before a node's activation begins to influence other nodes in the network) or their activation gains (a parameter governing how easily words become more activated). There is no a priori reason to believe that one of these mechanisms is more likely than any other, so additional assumptions and free parameters are needed. Furthermore, what works for modeling semantic context effects, where contextual cues (e.g., milked) tend to support specific words (e.g., goat), may not be able to capture syntactic context effects, where contextual cues (e.g., the) tend to support categories of words (e.g., nouns) but not specific words (see Fox & Blumstein, in press).

2.2.2.3. Challenges of Modeling Context Effects: Representing Time

A third issue that makes it difficult to model sentential influences on word recognition in connectionist models arises from the way activation dynamics transpire over time as the speech signal unfolds. Recall that nodes in the same layer are generally considered to be mutually inhibitory: when more than one word is partially activated, the most activated representation(s) tends to crowd out other active nodes (Thomas & McClelland, 2008).
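The crowding-out dynamic just described can be seen in a toy simulation of a single layer with lateral inhibition. This is only a sketch with arbitrary parameter values, not TRACE's actual update rule or settings:

```python
import numpy as np

def step(act, inputs, inhibition=0.3, decay=0.1):
    """One synchronous update of a toy layer with lateral inhibition."""
    net = inputs - inhibition * (act.sum() - act)  # each node is inhibited by all the others
    act = act + net - decay * act                  # excitation, inhibition, then decay
    return np.clip(act, 0.0, 1.0)

act = np.array([0.0, 0.0])
inputs = np.array([0.12, 0.10])  # nearly equal bottom-up support for two "words"

for _ in range(30):
    act = step(act, inputs)

print(act.round(2))  # the slightly better-supported node has crowded out the other
```

Despite almost identical input, the network settles into a winner-take-all state, which is exactly the property that becomes problematic when several words in a sentence must be activated in sequence.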
This architectural feature is almost universally true of the Word layers of localist spoken word recognition models (Gaskell, 2007), as it allows activation-based models to account for competition among multiple lexical candidates during recognition (Allopenna, Magnuson & Tanenhaus, 1998; Andruski, Burton & Blumstein, 1994; Frauenfelder & Floccia, 1998; Gaskell & Marslen-Wilson, 1999, 2002; Magnuson, Dixon, Tanenhaus & Aslin, 2007; McMurray, Tanenhaus & Aslin, 2002; McMurray, Tanenhaus, Aslin & Spivey, 2003; McMurray, Tanenhaus, Aslin, Spivey & Subik, 2008; McQueen, Norris & Cutler, 1994; Norris, McQueen & Cutler, 1995; Righi, Blumstein, Mertus & Worden, 2009; Utman, Blumstein & Burton, 2000; Vitevitch & Luce, 1998, 1999). However, this crucial architectural feature becomes problematic when the model is scaled up in order to account for sentential context effects, where multiple words should become activated in sequence. In the Merge model (Norris et al., 2000), for instance, activation of a lexical representation at one point in time will tend to suppress activation levels of other words at future points in time. Even if segmentation of the speech signal into words is taken for granted, it is clear that adapting Merge to account for sentential context effects would depend on how context is represented and how it comes to influence words' activation levels.

TRACE (McClelland & Elman, 1986), on the other hand, deals with time in a very different way, replicating the entire network for each time slice, so the inhibitory word-word connections only act on words that begin at the same point in time. From its inception, the implausibility of this aspect of TRACE's architecture has been widely acknowledged by the model's proponents and opponents alike (McClelland, Mirman & Holt, 2006; Norris, 1994) because of the enormous number of nodes required to model continuous speech. Therefore, modeling context effects by adding connections between associated words at different time-points, or by adding layers replicated for every time-point along with the rest of TRACE, would only exacerbate this problem.

Meanwhile, Shortlist's (Norris, 1994) representation of time is based on time-delayed recurrent neural networks (Elman, 1990; Norris, 1988, 1990, 1993), which occupy a middle ground between the drawbacks of Merge and TRACE. Still, Norris' (1994) brief discussion of how Shortlist might be adapted in order to account for the various sentential context effects observed in the literature does not directly address either the representation of context or how the dynamic modulation of activation in the time-delayed network would function.

2.2.2.4. Context Effects Without Connectionist Models

While all of these issues stem from important questions about spoken word recognition, they also pose significant challenges that even the most successful existing models have, so far, not addressed. Ultimately, however, these hurdles arise directly from the first choice made at the inception of each model: the choice to adopt a connectionist framework. Existing models were designed to address the architecture of a system supporting isolated word recognition, and adapting these architectures to solve a distinct computational problem – recognizing spoken words within the rich context of natural language – is a limiting approach when it comes to modeling context effects.
As an alternative, it is possible to characterize the information processing architecture that must underlie any explanation of sentence-level influences on the recognition of spoken words, while acknowledging that there might be many possible architectures (representational systems, activation/dynamical assumptions, and implementations of time-varying input and processing) that could achieve the necessary computations (Marr, 1982). In the present work, we take this alternate path, analyzing the computational problem associated with word recognition in context from the beginning.

2.3. A Computational-Level Analysis of Spoken Word Recognition

A useful starting point for a computational analysis of spoken word recognition is a rational analysis (Anderson, 1990), wherein a cognitive system is considered with respect to the system's goals, the environment in which the system must operate, and the computational limitations of the system (Anderson, 1991). Anderson's (1990) Principle of Rationality presumes that a cognitive system is optimized with respect to these factors, so, to the extent that some data do not fit the rational model's predictions, these discrepancies will suggest that the modeler's original assumptions about the system's goals, environment, or limitations were inaccurate. These inconsistent data, in turn, guide the updating of a rational model's initial assumptions.

2.3.1. Bayesian Models of Spoken Word Recognition

Recent years have seen a notable rise in the application of rational analysis to questions in speech perception. Feldman, Griffiths and Morgan (2009) presented a rational analysis of speech sound perception and categorization, showing that listeners' discrimination and perceptual classification of vowel tokens could be explained by assuming their behavior reflected optimal (Bayesian) perceptual inference under uncertainty. In the same vein, Shortlist B (Norris & McQueen, 2008) exemplifies the rational analysis approach to spoken word recognition, accounting for several classic effects in the psycholinguistic literature without appealing to the notion of activation at all. Although their details differ, and although only the latter model focuses on word recognition specifically, both models follow the same basic logic. The present model of spoken word recognition in context also follows this logic, so we now turn to an outline of the foundational principles these models share.

Any rational analysis of a cognitive system must begin by identifying the goal of the system. Following Norris and McQueen (2008), and as suggested at the outset of this chapter, we take the purpose of the speech recognition system to be the recovery of the word (or words) produced by the speaker. For the purpose of exposition, we limit the present discussion to a special case wherein the listener's goal is to infer the most likely single word given the perceived speech signal.

How would a rational system achieve this goal? If there is only one word that could possibly have produced the perceived signal, then the optimal decision is obvious: if all other words have been ruled out, then the signal must be an exemplar of the only remaining option (Doyle, 1890). However, since such certainty is often elusive when recognizing words in the real world, how should a rational system select the most probable word given incomplete information? Under this view, spoken word recognition amounts to a specific case of a more general problem: inference under perceptual uncertainty.
All such computational problems share the same mathematically optimal solution, which is defined by the ideal observer framework (Geisler, 2003; Geisler & Kersten, 2002). An ideal observer is one that always makes the best possible guess when identifying the likely source of some observed data, and its behavior is given by Bayes' rule (Knill, Kersten & Yuille, 1996). According to Bayes' rule (Equation 2.1), for any exhaustive set of mutually exclusive hypotheses H, the probability that any given hypothesis $h_i$ in H is true, given some observed data d, is given by:

Equation 2.1
$$p(h_i \mid d) = \frac{p(d \mid h_i)\, p(h_i)}{\sum_{h_j \in H} p(d \mid h_j)\, p(h_j)}$$

Because the denominator of the right side of Bayes' rule is constant over all $h_j$ in H, Bayes' rule is often stated in its proportional form (Equation 2.2):

Equation 2.2
$$p(h_i \mid d) \propto p(d \mid h_i)\, p(h_i)$$

The key principle embodied by Bayes' rule is that, having observed d, the so-called posterior probability of a given alternative, $p(h_i \mid d)$, depends on two general classes of information: how representative of that alternative d is, and how probable the alternative ($h_i$) was in the first place. These two pieces of information are referred to, respectively, as an alternative's likelihood, $p(d \mid h_i)$, and its prior probability, $p(h_i)$. An ideal observer integrates these two sources of information by computing the posterior probability for each alternative in H and ultimately selecting the alternative with the largest posterior probability.

In the domain of spoken word recognition, a hypothesis is a word $w_i$ that the speaker might have produced, so the hypothesis space is an entire vocabulary of size $N_w$, and the observed data is the acoustic signal A perceived by the listener. Thus, an ideal observer model of spoken word recognition is given in Equation 2.3's restatement of Bayes' rule:

Equation 2.3
$$p(w_i \mid A) = \frac{p(A \mid w_i)\, p(w_i)}{\sum_{j=1}^{N_w} p(A \mid w_j)\, p(w_j)}$$

2.3.2. Prior Expectations in Spoken Word Recognition: Lexical Frequency

Equation 2.3 underlies the Shortlist B model presented by Norris and McQueen (2008). One of the most significant contributions of Shortlist B was a computational account of word frequency effects on spoken word recognition, a category of effects that ranks among the most robust findings throughout the psycholinguistic literature (e.g., Connine, Mullennix, Shernoff & Yellen, 1990; Dahan, Magnuson & Tanenhaus, 2001; Howes, 1954; Luce, 1986; Marslen-Wilson, 1987; Pollack, Rubenstein & Decker, 1960; Savin, 1963; Taft & Hambly, 1986). Following Norris' (2006) Bayesian Reader model of visual word recognition, Shortlist B adopts each word's relative frequency as an estimate of its prior probability $p(w_i)$. Although this innovation is theoretically straightforward, it allowed Shortlist B to account for several classic effects, such as improved accuracy when subjects identify frequent words in noise compared to less frequent words (Luce & Pisoni, 1998).

Two observations about the role of the prior in a Bayesian model bear noting. First, a Bayesian spoken word recognizer will never "hallucinate" (cf. Norris et al., 2000) a word that bears no resemblance to the acoustic signal, no matter how frequent it may be. This follows from the fact that words that are entirely incompatible with some perceived signal are realized in the model with a likelihood $p(A \mid w_i) = 0$, which also entails that the posterior $p(w_i \mid A) = 0$. In the same vein, if no other words are consistent with a given acoustic signal, then even the rarest words can be clearly perceived (Doyle, 1890).
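Before turning to a second observation, the first one can be made concrete. The following is a minimal sketch of Equation 2.3 with made-up priors and likelihoods (the words and numbers are purely illustrative): a word assigned zero likelihood receives zero posterior probability no matter how frequent it is.

```python
import numpy as np

def posterior(prior, likelihood):
    """Bayes' rule (Equation 2.3): p(w | A) is proportional to p(A | w) p(w),
    normalized over the whole vocabulary."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

words = ["goat", "coat", "vote"]
prior = np.array([0.2, 0.5, 0.3])       # hypothetical relative frequencies
likelihood = np.array([0.4, 0.4, 0.0])  # signal ambiguous between goat/coat; rules out "vote"

post = posterior(prior, likelihood)
print({w: round(float(p), 3) for w, p in zip(words, post)})
# {'goat': 0.286, 'coat': 0.714, 'vote': 0.0} -- "vote" cannot be hallucinated,
# and the goat/coat split is decided entirely by the prior.
```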
Second, even though a word's estimated frequency never changes in Shortlist B, the relative influence of the prior on subjects' behavior (as modeled by the posterior) will not be the same for all possible acoustic signals. Rather, the prior's influence on the posterior will be largest when the likelihood is most uncertain – that is, when there are many possible words that are somewhat consistent with the input. In contrast, when perceptual uncertainty is low, such that the likelihood is peaked over one or a small number of words, the same prior will be less influential on the posterior. As Norris and McQueen (2008) point out, this second observation matches findings of an interaction between word frequency and stimulus quality in word recognition accuracy data: the more degraded a stimulus is by noise, the larger the observed advantage for frequent words (Luce & Pisoni, 1998).

It is clear that Shortlist B, and the Bayesian framework more generally, offers straightforward explanations for a broad range of phenomena spanning concepts as basic as lexical frequency, neighborhood density and lexical competition, perceptual confusability, and lexical influences on speech segmentation and word recognition. However, it is just as important to observe that these effects follow automatically from the basic principles that are mathematically required by a Bayesian model. As Norris and McQueen (2008) suggest, for many of the effects examined, their model could not be made to predict anything but the established finding and still be called "Bayesian." This stands in contrast to activation-based models of spoken word recognition, which require many architectural and dynamical assumptions and whose performance depends heavily on exact parameter settings (Pitt, Kim, Navarro & Myung, 2006; see Norris, 2006 for discussion). That such a well-constrained model achieves such broad empirical coverage offers strong support for the notion that spoken word recognition might reflect optimal inference in the face of uncertain input.

2.3.3. Prior Expectations in Spoken Word Recognition: Sentential Context

Despite Shortlist B's successes, the rational analysis approach to computational modeling stresses the importance of revising a model's assumptions when the model cannot account for certain data. One type of data that is not accounted for by Shortlist B is the influence of sentential context on spoken word recognition. Indeed, the model assumes that the probability of each word in a sequence is independent of any other (non-overlapping) words in a multi-word speech signal. Clearly, this assumption is not warranted. In acknowledging this fact, Shortlist B's creators suggest how a more complete Bayesian model might approach this issue: "In all of the simulations reported here, we assume that [a word's prior probability] can be approximated by the word's frequency of occurrence in the language. However, [the word's prior] will also be influenced by factors outside the scope of the present model, such as semantic or syntactic context" (Norris & McQueen, 2008, p. 362). Since words do not occur randomly in language, an optimal listener's prior expectation about which words are likely should be highly context-dependent. Just as a model assuming that all words are equally likely fails to explain effects of word frequency, a model that assumes that some word is equally likely to occur in every context will necessarily fail to explain effects of sentential context.
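The second observation above – that the same prior matters more as the likelihood becomes flatter – can likewise be demonstrated in a few lines. Again, all numbers are illustrative:

```python
import numpy as np

def posterior(prior, likelihood):
    """Normalized product of prior and likelihood (Bayes' rule)."""
    p = prior * likelihood
    return p / p.sum()

prior = np.array([0.8, 0.2])    # frequent word vs. rare word (hypothetical frequencies)
peaked = np.array([0.1, 0.9])   # clear signal favoring the rare word
flat = np.array([0.45, 0.55])   # degraded signal, only weakly favoring the rare word

print(posterior(prior, peaked))  # ~[0.31, 0.69]: strong evidence largely overrides the prior
print(posterior(prior, flat))    # ~[0.77, 0.23]: with flat evidence, the prior dominates
```

This is the frequency-by-stimulus-quality interaction in miniature: the noisier the input, the larger the advantage conferred by a high prior.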
The main goal of the present work is to relax the assumption of a context-independent prior. To do so, we begin with the basic approach of Norris and McQueen (2008), but – in the tradition of rational analysis – we update their assumptions in order to investigate whether behavioral patterns of context effects on spoken word recognition can also be explained by an ideal observer model. We also diverge from some other aspects of Shortlist B, most notably by adopting a model of words' likelihood functions that explicitly takes into account acoustic cues in the speech signal (see also Clayards, Tanenhaus, Aslin & Jacobs, 2008; Feldman et al., 2009; Feldman et al., 2013). This approach emphasizes the power of the Bayesian framework to explain lawful, fine-grained variability in how cues as disparate as sentential context and acoustic input interact during spoken word recognition.

2.4. BIASES: Bayesian Integration of Acoustic and Sentential Evidence in Speech

As already discussed, Bayes' rule describes the optimal way of combining two information sources – prior knowledge about which words a listener is likely to encounter, incorporated into the prior term $p(W)$, and acoustic data perceived by a listener, incorporated into the likelihood term $p(A \mid W)$.² However, it is also useful to invoke another common interpretation of Bayes' rule that is particularly applicable to modeling the effects of preceding context on spoken word recognition. As suggested by the standard nomenclature of priors and posteriors, Bayes' rule is often presented as an equation describing optimal belief updating. Put simply, if $p(W)$ indexes a listener's set of beliefs about how likely each possible word is prior to observing the relevant data (A), then $p(W \mid A)$ represents the listener's updated set of beliefs about the identity of the unknown word after integrating the new perceptual data, A.

Since the posterior, $p(W \mid A)$, depends on the prior, $p(W)$, and the likelihood, $p(A \mid W)$, it is intuitive that the likelihood drives the updating of a listener's beliefs. For instance, when the newly observed A is highly unlikely to be a token of a particular word ($w_j$), then the prior belief for $w_j$ is revised downwards, rendering the posterior belief $p(w_j \mid A)$ smaller than the prior expectation $p(w_j)$. On the other hand, the more representative of $w_j$ the observed signal A is, the more $p(w_j)$ will be revised upwards, causing $p(w_j \mid A)$ to gain support relative to other words. Within incremental sentence processing theories (Hale, 2001; Levy, 2008; Marslen-Wilson, 1973, 1975), this updating process can be thought of as iterative, such that, after integrating each new piece of information or at each new time-step, the newly computed posterior becomes the updated prior for the next step in time. Our model adopts this perspective in order to incorporate the influence of a preceding sentence context, C, on the recognition of a spoken word.

² By convention, we use capital italicized letters to refer to a random variable. For instance, the prior distribution $p(W)$ defines the prior probability of each possible state of the random variable W. That is, if there are $N_w$ words that a listener could hear, then $p(W)$ is a vector with $N_w$ entries, such that each word $w_j$ has some prior probability $0 \le p(w_j) \le 1$ and $\sum_{j=1}^{N_w} p(w_j) = 1$.
Rather than assuming a static prior across all contexts, $p(W)$, we assume that listeners make use of a conditional prior, $p(W \mid C)$, such that their prior lexical expectations depend on the context up to that point (e.g., Altmann & Kamide, 1999; Eberhard, Spivey-Knowlton, Sedivy & Tanenhaus, 1995; Kamide, Altmann & Haywood, 2003). Upon observing some subsequent speech, A, an ideal speech recognizer should update its contextually-conditioned prior beliefs by evaluating the probability that A was a token of each possible word. Given the simplifying assumption that listeners expect words to be pronounced with roughly the same acoustic form irrespective of which words preceded them (an assumption we address in greater depth in Chapter 4), the probability that A was a token of word $w_i$ is given by Bayes' rule (Equation 2.4):

Equation 2.4
$$p(w_i \mid C, A) = \frac{p(w_i \mid C)\, p(A \mid w_i)}{\sum_{j=1}^{N_w} p(w_j \mid C)\, p(A \mid w_j)}$$

The model presented in Equation 2.4 serves as the basis for the remainder of this chapter. It represents a way of identifying which words were probably present in an imperfectly perceived speech signal by combining information from the preceding sentential context with subsequent acoustic cues. With this function in mind, we will refer to this model as the BIASES model, short for Bayesian Integration of Acoustic and Sentential Evidence in Speech. As we will show, the model's name also foreshadows the type of effect that it predicts should result when sentential evidence is brought to bear during word recognition. Next, we detail our implementations of the two fundamental components of BIASES: the context-dependent conditional prior, $p(W \mid C)$, and the likelihood function that relates words to their acoustic forms, $p(A \mid W)$.

2.4.1. Conditional Prior: A Model of Listeners' Contextual Knowledge

To define a conditional prior $p(W \mid C)$ for BIASES, every lexical candidate $w_i$ must be assigned a probability of occurrence following each possible context. A conditional prior has two basic properties. First, in general, for a given context, some words will be more expected than others. This property is what makes any prior (conditional or not) informative: if all words are equally likely in some context, then the posterior is proportional to the likelihood alone. This is clearly not the case in human language, and listeners clearly do not treat all words as being equally likely in a particular sentence context. Second, and in contrast to previous work (e.g., Norris & McQueen, 2008), different contexts will support the same word to different extents. It is this property that makes a prior conditional: $p(w_i \mid C = c_k)$ need not equal $p(w_i \mid C = c_l)$.

As already discussed, whereas Shortlist B employed a context-independent lexical frequency prior, a key goal of BIASES is to incorporate a conditional prior that more accurately reflects listeners' access to context-dependent lexical expectations. To do so in a computational model like BIASES, we must quantify the level of support that a given context provides for a word. It is undoubtedly the case that many factors collude to create a listener's expectation for any given word. A complete model of how context influences the probability of subsequent words would certainly depend on semantic (e.g., Borsky et al., 1998) and syntactic (e.g., Fox & Blumstein, in press) information contained within the preceding linguistic context, but it would also depend on many other information sources that are available to a listener.
For instance, a full model would need to address how listeners make pragmatic inferences about the implicatures in prior linguistic context (e.g., Rohde & Ettlinger, 2012), how listeners might employ speaker-specific and situation-specific knowledge about likely words or grammatical structures (e.g., Fine & Jaeger, 2013; Fine, Jaeger, Farmer & Qian, 2013; Kamide, 2012; Horton, 2007; van Berkum, van den Brink, Tesink, Kos & Hagoort, 2008), and how listeners treat knowledge about which words or concepts have recently been uttered in a discourse or are in common ground (e.g., Horton & Keysar, 1996), to name just a few. Quantifying the influence of such factors on subjects' lexical expectations is clearly not trivial, and doing so is beyond the scope of the current modeling effort. Instead, we focus on an admittedly limited model of context in order to illustrate the explanatory power of the BIASES model, and of the Bayesian framework more generally.

2.4.1.1. Conditional Expectations from n-gram Language Models

Most modern automatic speech recognition systems operate under the same fundamental hypothesis embraced by BIASES: that a word's context-independent frequency can capture only a fraction of the prior knowledge available during word recognition. The solution implemented in these systems incorporates local semantic and syntactic context via language models (Jelinek, 1990, 1997). Put simply, an n-gram language model is a conditional probability distribution over lexical candidates given the n-1 immediately preceding words. As n increases, the probability distribution over possible words is conditioned on more information, and, consequently, the conditional expectations for different lexical candidates become more fine-grained. For instance, a bigram language model $p(W_t \mid W_{t-1})$ estimates a word $W_t$'s probability based only on the previous word, $W_{t-1}$, while a trigram language model $p(W_t \mid W_{t-2}, W_{t-1})$ estimates $W_t$'s probability given the two words that preceded it.³ Intuitively, trigram language models make more specific predictions than bigram language models: fewer words are likely to follow ...hated to... than just to... On the other hand, a unigram language model (n = 1) is simply a formal definition of a lexical frequency distribution. A word's frequency $p(w_i)$ can be computed by collapsing over all $N_c$ possible preceding contexts via summation (referred to as marginalization; Equation 2.5):

Equation 2.5
$$p(w_i) = \sum_{k=1}^{N_c} p(w_i \mid C = c_k)\, p(c_k) = \sum_{k=1}^{N_c} p(W_t = w_i \mid W_{t-1} = w_k)\, p(W_{t-1} = w_k)$$

The observation presented in Equation 2.5 provides a mathematical justification for Norris and McQueen's (2008) original claim that (as they reiterated later) "frequency and context have the same explanation in a Bayesian model" (Norris, McQueen & Cutler, 2015, p. 4).

³ Note that trigram language models are order-sensitive. That is, in general, $p(W_t \mid W_{t-2} = u, W_{t-1} = v) \ne p(W_t \mid W_{t-2} = v, W_{t-1} = u)$; intuitively, a listener's expectation for the word pay is not the same after hearing wanted to as after hearing to wanted.

2.4.1.2. Consequences of Adopting an n-gram Language Model Prior

BIASES implements a language model as its conditional prior $p(W \mid C)$ for spoken word recognition. This decision has some obvious drawbacks, but also some important benefits. Under the strictest interpretation, the assumption entailed by employing an n-gram language model in this way is that all relevant information in C can be summarized by knowing the identities of the n-1 words preceding the target word.
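To make the estimation step concrete, the following is a minimal sketch of a maximum-likelihood bigram model built from a toy corpus. A real implementation would be trained on a large corpus and smoothed (a point taken up below); the sentences here are illustrative.

```python
from collections import Counter, defaultdict

def bigram_lm(sentences):
    """Maximum-likelihood bigram model: p(w_t | w_{t-1}) from raw counts."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split()
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

corpus = [
    "brett hated to pay",
    "valerie hated the bay",
    "they hated to pay",
]
lm = bigram_lm(corpus)
print(lm["to"])   # {'pay': 1.0} -- every attested continuation of "to" is "pay"
print(lm["the"])  # {'bay': 1.0}
```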
2.4.1.2. Consequences of Adopting an n-gram Language Model Prior
BIASES implements a language model as its conditional prior $p(W|C)$ for spoken word recognition. This decision has some obvious drawbacks, but also some important benefits. Under the strictest interpretation, the assumption entailed by employing an n-gram language model in this way is that all relevant information in C can be summarized by knowing the identities of the n-1 words preceding the target word. Clearly, as discussed earlier, such a model is severely impoverished compared to listeners' actual contextual knowledge. That said, even a bigram language model constitutes a far richer prior than Shortlist B's unigram language model (Norris & McQueen, 2008), which cannot account for any contextual effects on word recognition. Language models might be considered among the simplest possible models capable of predicting context-specific modulation of word recognition. For example, a bigram language model would predict that, if $w_i$ tends to follow $w_{t-1}$ more often than $w_j$ does, listeners should tend to identify an acoustic signal A that is perfectly ambiguous between $w_i$ and $w_j$ as $w_i$ when the word preceding it was $w_{t-1}$. To the extent that such a model might account for some aspects of human behavior, it would suggest some commonalities between the detailed, linguistically relevant information contained within a listener's contextual knowledge and the transition probabilities between sequential words. Note that it does not follow that listeners' models of context necessarily represent these word-by-word transition probabilities explicitly (see Levy, 2008 for discussion). This is another benefit of adopting a language model as BIASES's model of context for the purposes of the present computational-level analysis. The choice allows us to remain theory-neutral with respect to the actual representation of context used by listeners. We regard a language model as a convenient, useful tool to summarize some fraction of the information contained within a sentential context. As naïve as language models are, evidence suggests that they predict a number of measures in language processing, such as reading times and eye fixations during reading (e.g., Hale, 2001, 2006; Levy, 2008; McDonald & Shillcock, 2003a, 2003b). By implementing a language model as the conditional prior in BIASES, we aim to test whether the predictive power of such models observed in psycholinguistic studies of reading will transfer to the domain of spoken language processing. Of course, not all contextual influences on speech recognition will be explained by a standard language model. As just one example, Rohde and Ettlinger (2012) show that listeners' ratings of a phonetically ambiguous pronoun between he and she are biased towards the presumed gender of a referent who was the most likely "causer" of some event (cf. Garvey & Caramazza, 1974; McDonald & MacWhinney, 1995; Koornneef & van Berkum, 2006). Subjects preferentially rated phonetically ambiguous pronouns as he in sentences like Noah frightened Claire because [?e] drove 100 miles per hour, but as she if the referents' names/genders were reversed. Such an effect would be difficult to explain with a basic language model, because the result appears to rely on inferential/causal reasoning above and beyond words' collocational probabilities. Thus, the decision to use a language model as BIASES's model of context will prevent us from capturing effects like this one, but it is not implausible that Bayesian models of pragmatic reasoning (Bergen, Levy & Goodman, 2014; Frank & Goodman, 2012; Franke, 2009; Goodman & Stuhlmuller, 2013; Jager, 2012) could be incorporated into a Bayesian model of speech perception like BIASES. While some sentential context effects are unlikely to find explanation in any sort of standard language model, the ability to account for other findings will depend on the precise specifications adopted for a language model.
Indeed, most previously reported semantic context effects could not be explained by a bigram language model. For example, the same word (the) immediately precedes the phonetically ambiguous target word in every stimulus sentence in Borsky and colleagues' (1998) study showing that subjects made more GOAT responses in goat-biased than in coat-biased sentences. A trigram language model, on the other hand, would likely explain at least some of the differences between goat-biased (...milk the...) and coat-biased (...button the...) sentences, but it would also predict that the entire semantic context effect observed in the study is driven by the word that appeared two words before the target, irrespective of the rest of the context (e.g., whether the sentence featured a farmer or a tailor as its subject). Whether or not this is true, what we hope to have made clear from the preceding examples is that the ability of a prior based on a language model to account for sentential influences on word recognition will depend on the sort of language model used and the sort of context effect examined. A final observation about the consequences of selecting a language model as a prior regards a practical challenge it poses. Although more complex language models produce more fine-grained predictions, specificity of predictions trades off with sparsity of data (e.g., Katz, 1987). That is, as n increases, it becomes more difficult to estimate the probabilities associated with an n-gram language model because the number of possible contexts grows exponentially: if there are 10 words in a language, then there are 10 possible contexts in a bigram language model and $10^2 = 100$ two-word sequences for which to estimate probabilities. For a trigram language model in the same ten-word language, one must estimate all $10^3 = 1{,}000$ probabilities (each word in each of the $10^2 = 100$ possible two-word contexts). While most "possible" three-word sequences may never occur in language, some sequences that are rare but do occasionally occur in language may never occur in a given corpus from which the language model is being estimated. If the corpus were to be taken as a perfectly reliable model of language, then those sequences would erroneously be assigned a prior probability of 0, making it impossible for a Bayesian word recognizer to observe that sequence in the future. Although smoothing methods can be applied to ensure that all word sequences have some small but nonzero prior probability (Church & Gale, 1991; Dagan, Marcus & Markovitch, 1993; Good, 1953; Goodman, 2001; Katz, 1987; Jelinek & Mercer, 1985), these methods will also tend to reduce the model's ability to predict context-specific variability. When there is little information that might differentiate between the prior probabilities of two similarly rare sequences, they will tend to be treated as equally likely. Thus, although a bigram language model will lack a great deal of information that listeners will have access to, it will also provide a reliably estimated language model capable of capturing context-dependent response patterns when those effects tend to be driven by the word immediately preceding the target that subjects are tasked with recognizing.
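The zero-count problem and the standard remedy just mentioned can be seen in a few lines. The sketch below uses invented counts over a hypothetical four-word vocabulary and a constant-alpha scheme of the sort adopted later in this chapter; the names eta and p_smoothed are ours.

```python
# Invented counts eta[(c, w)] of word w following context c; "to bay"
# happens never to occur in this (hypothetical) corpus.
vocab = ["pay", "bay", "day", "way"]
eta = {("to", "pay"): 50, ("to", "day"): 3}

def p_smoothed(w, c, alpha):
    """p(w | c) with a constant alpha added to every count (alpha=0: raw)."""
    num = eta.get((c, w), 0) + alpha
    den = sum(eta.get((c, v), 0) + alpha for v in vocab)
    return num / den

print(p_smoothed("bay", "to", alpha=0))    # 0.0: the unseen bigram is "impossible"
print(p_smoothed("bay", "to", alpha=0.5))  # small but nonzero
print(p_smoothed("bay", "to", alpha=100))  # heavy smoothing flattens toward uniform (0.25)
```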
2.4.1.3. BIASES' Conditional Prior: A Bigram Language Model
With these factors in mind, and acknowledging the various limitations associated with language models, we adopted a bigram language model as the conditional prior for BIASES. As such, only the immediately preceding word influences the prior probability of the subsequent word. Furthermore, to examine how well BIASES could fit human behavior, the experiments we conducted utilized stimuli designed to elicit sentential context effects on word recognition that were driven by the immediately preceding word (Fox & Blumstein, in press). In particular, the simulations and experiments reported here evaluated the influence of different function words (to vs. the) on the identification of the next word in a sentence. Fox and Blumstein showed that subjects were more likely to recognize a phonetically ambiguous word between bay and pay as pay when it was preceded by a sentence like Brett hated to... than when it was preceded by a sentence like Valerie hated the... According to BIASES, the basic explanation for this effect is that the two critical function word contexts (to vs. the) differentially influence a listener's prior expectations for immediately subsequent target words (bay vs. pay). As we will show, despite this simplistic model of listeners' contextual knowledge, BIASES does remarkably well at predicting subjects' behavioral responses, including the overall pattern, patterns of subject-by-subject variability, and several other fine-grained quantitative predictions. We also conduct another set of simulations to examine the extent to which a richer model of context might account for additional, even more fine-grained context-specific patterns in our empirical results.

2.4.1.4. Additional Constraints on Prior Expectations: Forced-Choice
Our simulations invoke one other piece of contextual information that we assume subjects exploit. Since subjects receive a set of instructions before performing the experimental task in the laboratory, these instructions further constrain the conditional prior model of context that listeners use while recognizing words in the study. Specifically, we assume that, once instructed to identify the target word as either bay or pay, subjects assign all other words a prior probability of 0. This same assumption is almost universally implicit in other models of spoken word recognition. For instance, in TRACE (McClelland & Elman, 1986), responses during multiple-alternative forced-choice tasks (such as phoneme or word identification experiments) are generated probabilistically from among a set of alternatives that is identified based on the task and stimuli (cf. Luce, 1959; McClelland & Rumelhart, 1981). By reading activation levels out from only a few "clamped" response alternatives, TRACE's decision model has the effect of nullifying any prior probability of a response from any other words outside of the predefined set. Similarly, the output nodes of other models (e.g., Shortlist: Norris, 1994; Merge: Norris et al, 2000; Shortlist B: Norris & McQueen, 2008) are pre-specified "on-the-fly" based on task demands. Although they are not strictly driven by theoretically interesting assumptions about human cognition, it makes sense that computational models of human behavior should account for such exogenous factors, and this is especially true for ideal observer models. As Norris (2006) puts it, which model will produce optimal behavioral responses "is critically dependent on the precise specification of the task or goal" (p. 330).
With this additional constraint, the summation over all possible words in the denominator of Equation 2.4 can be simplified to the sum of two terms: one proportional to the posterior probability of pay given C and A, and the other proportional to the posterior probability of bay. Equation 2.6 incorporates this assumption, giving the posterior probability of pay, where $w_1$ = pay and $w_2$ = bay.

Equation 2.6
$$p(w_1 \mid C, A) = \frac{p(w_1 \mid C)\, p(A \mid w_1)}{p(w_1 \mid C)\, p(A \mid w_1) + p(w_2 \mid C)\, p(A \mid w_2)}$$

The posterior probability distribution in Equation 2.6 gives the expected rate with which a subject should identify an acoustic stimulus as pay in a given context, if that subject were optimally combining the information sources we are assuming. To the extent that subjects deviate from this behaviorally, Anderson's Principle of Rationality (1990) would demand that we update our assumptions. Note that the expected posterior probability of subjects making a BAY response, $p(w_2 \mid C, A)$, is equal to $1 - p(w_1 \mid C, A)$. Finally, as pointed out by Feldman and colleagues (2009), in the case of modeling two-alternative forced choice, the posterior in Equation 2.6 can be rewritten to take the form of a logistic function. By dividing both the numerator and denominator of the right side of Equation 2.6 by the quantity in the numerator and applying inverse functions (exponentiation and the natural logarithm), Equation 2.6 can be rewritten as shown in Equation 2.7:

Equation 2.7
$$p(w_1 \mid A, C) = \frac{1}{1 + e^{-\log\frac{p(w_1 \mid C)}{p(w_2 \mid C)} - \log\frac{p(A \mid w_1)}{p(A \mid w_2)}}}$$

2.4.1.5. Implementing BIASES' Prior: Corpus Estimates, Smoothing
An advantage of modeling subjects' responses in a two-alternative forced choice word identification task is that the implementation of BIASES' prior is quite flexible; flexible enough, in fact, that $p(w_1 \mid C)$ and $p(w_2 \mid C)$ need not actually be proper probabilities at all. To see this, one need simply note that the influence of the prior, $\log\frac{p(w_1 \mid C)}{p(w_2 \mid C)}$, is only dependent on the ratio of the prior probabilities of $w_1$ and $w_2$. A consequence of this is that their probabilities could just as easily be replaced by numeric values that are proportional to the words' relative prior probabilities. Because the prior in this implementation of BIASES is estimated from a bigram language model, counts from a corpus of the number of times $w_1$ and $w_2$ follow C in sequence will suffice ($\eta(C, w_1)$ and $\eta(C, w_2)$, respectively; see Equation 2.8).

Equation 2.8
$$\log\frac{p(w_1 \mid C)}{p(w_2 \mid C)} = \log\frac{p(W_t = w_1 \mid W_{t-1} = C)}{p(W_t = w_2 \mid W_{t-1} = C)} = \log\frac{\eta(C, w_1)\big/\sum_{j=1}^{N_w}\eta(C, w_j)}{\eta(C, w_2)\big/\sum_{j=1}^{N_w}\eta(C, w_j)} = \log\frac{\eta(C, w_1)}{\eta(C, w_2)}$$

Of course, many other words in the corpus besides $w_1$ and $w_2$ will also follow C (that is, $\sum_{j=1}^{N_w}\eta(C, w_j) \gg \eta(C, w_1) + \eta(C, w_2)$). However, the fact that the normalizing term (which represents the sum of all occurrences of C with any word) cancels out in Equation 2.8 reflects the assumption that subjects engaged in a two-alternative forced choice word identification task will only consider the relative contextual evidence for $w_1$ and $w_2$ in responding.
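The practical upshot of Equations 2.6-2.8 is easy to verify numerically. With arbitrary invented counts and likelihood values, the sketch below confirms that the two-term posterior (Equation 2.6) matches its logistic rewriting (Equation 2.7), and that raw bigram counts suffice for the prior term because only their ratio survives (Equation 2.8).

```python
import math

# Invented corpus counts of "C w1" and "C w2", and invented likelihoods.
eta_c_w1, eta_c_w2 = 120, 40        # eta(C, w1), eta(C, w2)
lik_w1, lik_w2 = 0.010, 0.025       # p(A | w1), p(A | w2)

# Equation 2.6: two-term posterior, with the prior renormalized over
# the two forced-choice alternatives.
prior_w1 = eta_c_w1 / (eta_c_w1 + eta_c_w2)
prior_w2 = 1.0 - prior_w1
post_eq26 = prior_w1 * lik_w1 / (prior_w1 * lik_w1 + prior_w2 * lik_w2)

# Equation 2.7: logistic form. Per Equation 2.8, the log prior ratio can
# be computed directly from the raw counts.
log_prior_ratio = math.log(eta_c_w1 / eta_c_w2)
log_lik_ratio = math.log(lik_w1 / lik_w2)
post_eq27 = 1.0 / (1.0 + math.exp(-(log_prior_ratio + log_lik_ratio)))

print(post_eq26, post_eq27)  # identical up to floating-point error
```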
The data for BIASES's bigram language model were collected from the 2009 Google Books corpus (Michel et al, 2010). Rather than assuming that the corpus counts of the relevant bigrams (the bay, the pay, to pay, to bay) were perfect estimates of subjects' contextual knowledge, model-fitting (see Chapter 3) allowed for "add-alpha" smoothing (Lidstone, 1920). Under such a model, one value $\alpha$ is added to all bigram counts, and the fitting process selects the $\alpha$ that minimizes the overall deviation of the model predictions from the data. As mentioned earlier, while the benefit of smoothing is that it protects against overconfidence in our estimate of listeners' prior (especially in estimates of the probability of relatively uncommon bigrams), higher values of $\alpha$ tend to diminish the specificity of the model's predictions. As the smoothing parameter $\alpha$ grows larger (and greater than the raw counts in the bigram language model itself), the model approaches a uniform distribution that renders $w_1$ and $w_2$ equally likely to follow every context.

2.4.2. Likelihood Term: Mapping an Acoustic Signal onto Lexical Forms
One of the most fundamental observations about speech communication is that there is no one-to-one mapping between words and their acoustic realizations (cf. Liberman, Cooper, Shankweiler & Studdert-Kennedy, 1967). On one hand, it is clear that the signal-to-word mapping is many-to-one: perfectly understandable productions of the same word can take on countless acoustic realizations that may differ from one another along many dimensions. On the other hand, the signal-to-word mapping is also sometimes one-to-many: the same acoustic signal may, on different occasions, be perceived as different sounds (e.g., Ganong, 1980; Liberman, Harris, Hoffman & Griffith, 1957; Sawusch & Jusczyk, 1981), words (e.g., Borsky et al, 1998; Fox & Blumstein, in press), or sequences of words (e.g., Foss & Swinney, 1973; Kim, Stephens & Pitt, 2012). Together, these two facts underlie the likelihood term in BIASES and how it interacts with the model's conditional prior term.

2.4.2.1. Likelihood Functions: Many-to-One Mapping
The immediate function of the likelihood term $p(A|W)$ in BIASES is to formalize the many-to-one mapping from speech tokens to words. $p(A|W)$ is best described as a composite of $N_w$ likelihood functions, where $N_w$ is the number of words in the lexicon, because each word $w_i$ has its own likelihood function $p(A|w_i)$. $p(A|w_i)$ defines a listener's phonetically-detailed knowledge about how $w_i$ tends to be pronounced. Implicit in each word's likelihood function is the notion that not all productions of a word will be equally clear, and some realizations will be more typical than others. Work examining the influence of category goodness and internal category structure has shown that fine-grained acoustic properties in speech modulate perception, recognition and lexical access (Andruski, Blumstein & Burton, 1994; Blumstein, Myers & Rissman, 2005; Kessinger & Blumstein, 2003; McMurray, Tanenhaus & Aslin, 2002, 2009; Miller, 1994; Miller & Volaitis, 1989; Pisoni & Tash, 1974; Volaitis & Miller, 1992). This internal category structure is modeled within $p(A|w_i)$ by making the more typical realizations of $w_i$ more probable than less typical realizations. The model assumes an acoustic space, $\mathcal{A}$, comprised of all possible speech waveforms, and each $a_x$ in $\mathcal{A}$ is a point within that multidimensional acoustic space. Each word, $w_i$, occupies some subspace of $\mathcal{A}$ comprised of all possible pronunciations of $w_i$. The likelihood function of $w_i$, $p(A|w_i)$, assigns some probability to every possible $a_x$. For most values of $a_x$, it will effectively be the case that $p(a_x \mid w_i) = 0$; after all, although each word in a lexicon can be pronounced in many⁴ ways, most possible waveforms will bear no similarity to some given $w_i$. However, among those speech tokens (values of $a_x$) that might plausibly be exemplars of $w_i$, the ones that most resemble $w_i$ will be most probable according to $p(A|w_i)$.
In this way, the role of the likelihood term of BIASES, $p(A|W)$, is to evaluate, for each lexical candidate $w_i$, how representative of $w_i$ a perceived speech token $a_x$ is. Of course, just as the likelihood function of $w_i$ will ensure that $p(a_x \mid w_i) = 0$ for most $a_x$, the overall effect of $p(A|W)$ is that, for a given acoustic signal $a_x$, $p(a_x \mid w_i) = 0$ for most words. As discussed earlier, when $p(a_x \mid w_i) = 0$ for all words but one, only one word will have a nonzero posterior probability, and there will be no question about the identity of the token $a_x$. Indeed, given the multiplicity of available bottom-up acoustic cues that comprise the many dimensions of $\mathcal{A}$, it may very often be possible to distinguish words and speech sounds from one another (see, e.g., Nearey, 1990, 1997). However, there is not always such a consistent mapping from a given acoustic signal to one (and only one) word. Put simply, while the many-to-one mapping between speech tokens and words lies at the heart of each word's likelihood function, the computational challenge that a spoken word recognition system must overcome arises due to the one-to-many (or at least one-to-more-than-one) mappings between a signal and multiple possible lexical candidates. In such cases, the system must adjudicate among the various words that the perceived signal resembles to any degree.

⁴ Indeed, because at least some relevant dimensions of $\mathcal{A}$ are continuous (e.g., VOT, vowel duration, formant values), BIASES assumes that the sample space of possible pronunciations of any word is infinite. Thus, formally, $p(A|w_i)$ must be a probability density function.

2.4.2.2. Phonetic Ambiguity: One-to-Many Mapping
Under what circumstances would an optimal listener believe that $p(a_x \mid w_i) > 0$ for more than one word? Feldman and colleagues (2009) identified several factors responsible for creating uncertainty in the mapping of a signal onto a single, best-matching word. Here, we classify these factors into two general categories. In short, the noisier the environment is and the more acoustically similar two words are, the more likely it is that the perceived signal $a_x$ will be ambiguous between the two words. One source of uncertainty is noise, which distorts the speaker's production of a word ($s_x$) and can cause the perceived acoustic signal ($a_x$) to be ambiguous between multiple words, even when $s_x$ may not have been. For instance, as already discussed, early research in the phoneme restoration paradigm (Warren, 1970; Samuel, 1981, 1996) showed that masking a short segment of uninterrupted, natural speech with a cough or white noise could render the corrupted signal (/*il/) consistent with any of several lexical candidates (e.g., wheel, heel, peel, meal) (Warren & Warren, 1970; Warren & Sherman, 1974). In general, adding noise to stimuli tends to "smear out" a word $w_i$'s likelihood function: to accommodate more variability in the acoustic signal due to noise, more values of $a_x$ will count as possible realizations of $w_i$. The result of this smearing is that some values of $a_x$ may come to correspond to multiple possible words (Luce & Pisoni, 1998; Warren & Warren, 1970) or sounds (e.g., Cutler, Weber, Smits & Cooper, 2004; Miller & Nicely, 1955; Smits, Warner, McQueen & Cutler, 2003; Warner, Smits, McQueen & Cutler, 2005). While the effect of noise is to increase the uncertainty of the speech signal after it is produced, a second source of uncertainty emerges naturally and is inescapable, even in a completely noiseless environment.
Distinct words are sometimes characterized by overlapping likelihood functions; this occurs when the acoustic space corresponding to one word intersects with that of another word. A trivial example illustrating this fact is the case of homophony: if a speaker produces $s_x$ = /baɪ/, $s_x$ could correspond to several words (buy, by, or bye) because all three words share virtually identical spaces of possible pronunciations (but see, e.g., Gahl, 2008). The present work considers a less extreme example of how phonetic ambiguity may lead to lexical ambiguity. While homophony inevitably leads to lexical ambiguity, the phonetic ambiguity examined here arises when the likelihood functions of a pair of word-initial segments (/b/ and /p/) overlap in acoustic space. The primary acoustic dimension on which /b/ and /p/ differ is voicing, with tokens of /p/ tending to be realized with longer voice-onset time (VOT) values than tokens of /b/ (Lisker & Abramson, 1964). Figure 2.1 displays two theoretical acoustic cue distributions over VOTs: one for word-initial /b/ and one for word-initial /p/. Although tokens of each category can generally be distinguished on the basis of VOT alone, the two categories' distributions overlap such that tokens with some intermediate VOT values could plausibly be an exemplar of either /b/ or /p/. Although other acoustic dimensions of a spoken word also provide reliable cues that can distinguish /b/-initial tokens from /p/-initial tokens (e.g., Klatt, 1975; Lisker, 1986; Miller & Dexter, 1988; Repp, 1984; Stevens & Klatt, 1974; Summerfield, 1981), much work has shown that, holding other variables constant, listeners perceive segments with some VOT values as phonetically ambiguous between /b/ and /p/ (Clayards et al, 2008; Connine, Blasko & Wang, 1994; Connine, Titone & Wang, 1993; Fox & Blumstein, in press; Ganong, 1980; Liberman, Harris, Kinney & Lane, 1961; McMurray, Clayards, Tanenhaus & Aslin, 2008; McMurray et al, 2002, 2009; Miller & Dexter, 1988; Miller et al, 1984; Pisoni & Lazarus, 1974; Toscano & McMurray, 2012; Wood, 1976). A consequence of this phonetic ambiguity for spoken word recognition is that an acoustic token /?eɪ/ with a phonetically ambiguous VOT could correspond to either bay or pay (Fox & Blumstein, in press). Thus, speech tokens whose initial consonants have intermediate VOT values exhibit a one-to-many mapping from acoustic signal to lexical forms, and, as illustrated in Figure 2.1, this one-to-many mapping can be modeled by assuming that bay and pay have overlapping likelihood functions.

Figure 2.1. Examples of normally distributed probability density functions over VOT (ms) for two categories: /b/ and /p/, or for bay and pay under the assumption that these words are otherwise (i.e., besides VOT) identical in their acoustic cue distributions. The light grey line represents the marginal density function, showing the relative amount of total probability mass associated with each voice-onset time (VOT) across all categories. The dashed black line indicates the category boundary (χ), defined as the point in acoustic space (or the plane in acoustic space, if the likelihood model has more than one dimension) for which the probability density functions of two or more categories are equal.
It can be equivalently defined as the point/plane for which the posterior probability distribution over a given set of categories is uniformly distributed when the prior probabilities of the set's members are also equal.

2.4.2.3. BIASES' Likelihood Term: A Mixture of Gaussians
An important issue that remains is the specification of each word's likelihood function. As already discussed, a complete model of the likelihood function for a word would define which acoustic signals could be recognized as that word, as well as how good an exemplar each possible signal would be. Although, in reality, such a model would be extremely complex and require a highly multidimensional space, the present model is far simpler. Following prior work (e.g., Clayards et al, 2008), we assume that the likelihood functions of word-initial voicing minimal pairs (e.g., bay and pay) can be approximated by normal distributions over a single continuous dimension, VOT (see also Kleinschmidt & Jaeger, in prep; Kronrod, Coppess & Feldman, 2012; Munson, 2011). Under this assumption, listeners expect that if they were to perceive a given word $w_i$, the probability that it would be realized with different initial VOT values (A) is given by a normal distribution (see Equations 2.9A and 2.9B), where $\mu_i$ represents the mean initial VOT for $w_i$ and $\sigma_i^2$ represents the variance in $w_i$'s initial VOT.

Equation 2.9A
$$A \mid w_i \sim N(\mu_i, \sigma_i^2)$$

Equation 2.9B
$$p(A \mid w_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(A - \mu_i)^2}{2\sigma_i^2}}$$

If the acoustic form of $w_i$ is assumed to be normally distributed, $\mu_i$ represents the most probable acoustic signal associated with $w_i$, and the further a stimulus is from this prototypical VOT, the less representative of $w_i$ the exemplar will be (and the lower its likelihood will be). Since each word $w_i$ has a Gaussian likelihood function $p(A|w_i)$ defined by its mean ($\mu_i$) and variance ($\sigma_i^2$) parameters, the full likelihood model of BIASES, $p(A|W)$, which is a composite of all $N_w$ words' likelihood functions, takes the form of a mixture of Gaussians. Gaussian mixture models are a common approach to statistical models of speech categorization and phoneme category learning (de Boer & Kuhl, 2003; Clayards et al, 2008; Dillon, Dunbar & Idsardi, 2013; Feldman et al, 2009, 2013; McMurray, Aslin & Toscano, 2009; Toscano & McMurray, 2010; Vallabha, McClelland, Pons, Werker & Amano, 2007) in which there exists some number of categories, and the exemplars of each category are normally distributed according to the category's mean and variance (or covariance matrix, in the case of multiple perceptual dimensions). Note that, for the present formalization of BIASES, the likelihood distribution over VOTs for each phonetic category (/b/ vs. /p/) is equivalent to the likelihood distribution over VOTs for each word (bay vs. pay) because (1) these two words are assumed not to differ on any other acoustic dimensions besides the VOT of their initial stop consonants, and (2) these words are assumed to be deterministically related to their associated phonetic categories. Admittedly, as we discuss later, it seems likely that neither of these assumptions is warranted; Chapters 3 and 4 illustrate some interesting and insightful implications of relaxing these assumptions. In any case, under these simplifying assumptions, in order to infer the most likely category label for a given exemplar $a_x$, $a_x$ must be compared to each category's likelihood function (see Equation 2.9B).
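A minimal sketch of Equation 2.9 follows, using scipy's normal density. The means and shared standard deviation are illustrative placeholders chosen to echo the values used in Simulation Study 3.1 below, not the parameters actually fitted in Chapter 3.

```python
from scipy.stats import norm

# Gaussian likelihood functions over VOT (Equation 2.9) for bay and pay.
# These parameter values are placeholders for illustration only.
mu_bay, mu_pay, sigma = 0.0, 64.0, 15.0

def likelihood(vot_ms, mu):
    """p(A = vot | w_i): a normal density over VOT (Equation 2.9B)."""
    return norm.pdf(vot_ms, loc=mu, scale=sigma)

# With equal variance, the two densities cross at the midpoint of the
# means, which is the category boundary chi shown in Figure 2.1.
chi = (mu_bay + mu_pay) / 2
print(likelihood(chi, mu_bay), likelihood(chi, mu_pay))  # equal at chi

# A short-VOT token is far more representative of bay than of pay.
print(likelihood(5.0, mu_bay), likelihood(5.0, mu_pay))
```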
Previous work suggests that listeners' perceptual identification behavior can be approximated by modeling words' likelihoods with a Gaussian mixture model over VOT values. In a study by Clayards and colleagues (2008), listeners identified acoustic stimuli along a VOT continuum between two words (e.g., beach and peach), and subjects' identification functions were consistent with an optimal Bayesian recognizer's behavior. A between-subject manipulation in their study provides further support for modeling words' likelihood functions with a mixture of Gaussians: two groups of subjects were exposed to different distributions of VOTs that implied either high overlap or less overlap in the words' likelihood functions (i.e., either higher or lower values of $\sigma_i^2$, respectively). Results showed that, when the overlap between the likelihood functions of beach and peach appeared to be higher, subjects perceived a greater number of the intermediate tokens from the continuum as ambiguous (Clayards et al, 2008). Despite this modeling success, the Gaussian mixture model employed by Clayards and colleagues (2008) is not without its weaknesses. For instance, although they manipulated the acoustic variability of the stimuli for different subjects, the stimuli presented to any one group mimicked likelihood functions that had equal variance terms for both candidate words (e.g., beach and peach). Their model, in turn, also presumed that $\sigma_1^2 = \sigma_2^2$. In reality, it is not generally the case that the initial VOTs of words with an initial /b/ and words with an initial /p/ are distributed with equal variance (Lisker & Abramson, 1964; Kronrod, Coppess & Feldman, 2012). In fact, VOT distributions, as measured in spoken word and segment production experiments (e.g., Baese-Berk & Goldrick, 2009; Fox, Reilly & Blumstein, 2015), tend to exhibit non-Gaussian skew, and evidence suggests that the distribution of VOTs of word-initial voiced stops in English (especially /b/) is highly bimodal (Lisker & Abramson, 1964; see also Docherty, Watt, Llamas, Hall & Nycz, 2011). Nonetheless, despite their divergence from speech production data, computational models of speech perception that have adopted the mixture of Gaussians approach and assumed equal variance across categories have achieved substantial success in capturing the overall patterns associated with category goodness and internal phonetic category structure (Clayards et al, 2008; Feldman et al, 2009; Kleinschmidt & Jaeger, 2015). Because of this fact, and because of the computational benefits associated with adopting this simplification (namely, the existence of a closed-form likelihood function), the likelihood model implemented in BIASES was identical to the mixture of Gaussians employed by Clayards and colleagues (2008). In principle, any likelihood function could replace that of Clayards and colleagues (2008) in BIASES. For now, we simply acknowledge that BIASES could be enhanced with more detailed (and realistic) likelihood functions that incorporate more acoustic cues (see, e.g., Feldman et al, 2013) and/or less simplistic distributional assumptions (see, e.g., Kleinschmidt & Jaeger, in prep; Kronrod, Coppess & Feldman, 2012).
Critically, though, unlike Clayards and colleagues (2008), and unlike other Bayesian models of speech perception that have adopted similar models of the likelihood term (Feldman et al, 2009; Kleinschmidt & Jaeger, 2015), the key innovation in BIASES is the inclusion of a context-dependent conditional prior that is integrated with the likelihood function. The model proposed by Clayards and colleagues (2008) fits the basic shape of subjects' responses to isolated words, despite its assumption of equal prior probabilities for beach and peach, because its likelihood model captures fundamental properties of subjects' signal-to-word mapping. Here, we adopt this same likelihood function in formulating BIASES in order to leverage the successes of their isolated word recognition model, while extending it to incorporate sentential context effects.

2.4.2.4. Comparing the Likelihood Terms in BIASES and Shortlist B
Finally, it is worth pointing out another fundamental difference between BIASES and Shortlist B (Norris & McQueen, 2008). In addition to its assumption of a context-independent prior based on lexical frequency instead of the conditional prior embraced by BIASES, Shortlist B also differs from BIASES in the mathematical form of its likelihood term. Although the likelihood function adopted in BIASES is identical to that of Clayards and colleagues (2008) and closely related to that of Feldman and colleagues (2009), Norris and McQueen's (2008) Shortlist B takes a very different approach. Rather than explicitly relating the acoustic properties of the speech signal to phonemes or words, Norris and McQueen (2008) avoided specifying a likelihood function that would directly relate to an acoustic signal. Instead of relying on assumptions about the distributions of acoustic cues and the ways in which these different cues covary, Shortlist B's likelihood model abstracts over all of the acoustic cues in speech to capture, broadly, the confusability of different words in the lexicon. Specifically, they assume that whatever likelihood model underlies subjects' performance in spoken word recognition tasks should also underlie their performance in lower-level perceptual tasks. Under this assumption, Norris and McQueen (2008) used perceptual confusion data from a gating task (Smits et al, 2003; Warner et al, 2005) to infer subjects' likelihood functions and taught Shortlist B, for each diphone in Dutch (e.g., /ba/), how likely subjects should be to perceive that diphone as itself or as any other Dutch diphone (e.g., /ba/, /pa/, /da/, /bɪ/, ...). The obvious advantage of Shortlist B's approach is that it can remain agnostic as to the computations that map an acoustic signal onto one or more words, so it can be applied to a large vocabulary without making many assumptions about pre-lexical representations or pre-lexical processing. However, the key question that motivated the development of BIASES was how bottom-up and top-down information sources are weighted and combined during spoken word recognition. In light of this question, the nature of the likelihood function and how it fits into the larger computational framework will play an important role in understanding the predictions of BIASES, as we discuss later. Thus, unlike Shortlist B, BIASES is capable of making fine-grained predictions about how acoustic-phonetic properties of speech and context-specific lexical predictions jointly modulate subjects' recognition of spoken words.
2.4.3. Integrating Prior Context and Perceptual Input in BIASES
Substituting the likelihood function (Equation 2.9) into the model of subjects' posterior (Equation 2.7), and simplifying based on the stated assumption of equal variance in the VOT distributions ($\sigma_1^2 = \sigma_2^2 = \sigma^2$) for pay ($w_1$) and bay ($w_2$), yields Equation 2.10 (cf. Feldman et al, 2009):

Equation 2.10
$$p(w_1 \mid A, C) = \frac{1}{1 + e^{-\log\frac{p(w_1 \mid C)}{p(w_2 \mid C)} - g(A - \chi)}}$$
where
$$\chi = \frac{\mu_1 + \mu_2}{2} \quad \text{and} \quad g = \frac{\mu_1 - \mu_2}{\sigma^2}$$

Equation 2.10 represents the optimal (Bayesian) posterior probability that a particular acoustic signal (A) following a particular sentence context (C) is an exemplar of the word pay, given the assumptions outlined above. Following a rational analysis approach (Anderson, 1990), Equation 2.10 can also be interpreted as an estimate of the optimal rate of PAY responses subjects should make when responding to different stimuli in different sentence contexts during a two-alternative forced choice word identification task. Finally, in order to evaluate the extent to which the predictions of BIASES are consistent with actual subjects' behavior, it is possible to simulate responses from BIASES based on the assumption that, on a given trial (t) consisting of a context and acoustic stimulus pairing ($c_t$, $a_t$), a subject's final identification decision ($Z_t$) is probabilistically generated from a Bernoulli distribution with $\theta_t = p(w_1 \mid a_t, c_t)$ (see Equation 2.11).

Equation 2.11
$$Z_t \mid a_t, c_t \sim \mathrm{Bern}\!\left(\frac{1}{1 + e^{-\log\frac{p(w_1 \mid c_t)}{p(w_2 \mid c_t)} - g(a_t - \chi)}}\right)$$

An oft-cited intuitive metaphor for Bernoulli-distributed random variables is the process of flipping a biased coin: the probability of a PAY response is like the probability of a coin-flip coming up heads, and different experimental conditions or stimuli affect the bias of the "coin" towards "heads" ($\theta_t$) differently. In particular, Equation 2.10's posterior distribution describes the way stimulus and context conditions in a given trial influence the probability of a PAY response on that trial.
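Equations 2.10 and 2.11 translate directly into a simulation. The sketch below, with placeholder parameter values echoing Simulation Study 3.1 in Chapter 3, computes the posterior rate of PAY responses and then generates Bernoulli trial-level responses from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters: mu_1 (pay), mu_2 (bay), shared sigma (ms).
mu1, mu2, sigma = 64.0, 0.0, 15.0
chi = (mu1 + mu2) / 2
g = (mu1 - mu2) / sigma**2

def p_pay(vot, prior_pay):
    """Equation 2.10: posterior probability of a PAY response."""
    Pi = np.log(prior_pay / (1.0 - prior_pay))  # log prior odds for pay
    return 1.0 / (1.0 + np.exp(-Pi - g * (vot - chi)))

def simulated_pay_rate(vot, prior_pay, n_trials=10000):
    """Equation 2.11: mean of Bernoulli responses for one stimulus/context."""
    return rng.binomial(1, p_pay(vot, prior_pay), size=n_trials).mean()

# The same boundary-value stimulus after a pay-biasing vs. a bay-biasing
# context yields different response rates (here ~0.75 vs. ~0.25).
print(simulated_pay_rate(vot=chi, prior_pay=0.75))
print(simulated_pay_rate(vot=chi, prior_pay=0.25))
```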
2.4.4. Conclusion and Next Steps
In sum, Equation 2.11 provides us with an explicit method of generating behavioral responses. The ability to simulate behavior from BIASES in this way affords many advantages. In Chapter 3, we take two somewhat different approaches to the simulation of behavioral data with BIASES. First, by providing BIASES with assumed values for all of the underlying parameters ($\mu_1$, $\mu_2$, $\sigma^2$, and $p(w_i \mid c_j)$ for every $w_i$ and $c_j$ relevant to the experiment) needed to generate behavioral responses, we can examine properties of the model's behavior and the extent to which empirical data match those predictions. As we will demonstrate, this approach reveals that our chosen theoretical framework provides some much-needed clarity for research in the field of top-down effects, organizing a confusing literature rife with apparent inconsistencies. Moreover, in the spirit of the iterative updating of cognitive models that is fundamental to the rational analysis approach (Anderson, 1990), this approach allows us to identify enhancements to the model that are critical for capturing and "post-dicting" existing behavioral data. One major disadvantage to this approach is that, in order to generate responses, we must make even more assumptions about the latent structure of the perceptual and cognitive processing that gives rise to the behavioral responses we can observe. On the other hand, a second approach allows us to discover features of the model underlying observed behavior, rather than making assumptions about the model's features. By generating data from BIASES over a very wide, weakly constrained range of possible parameter settings and comparing the simulated behavioral response patterns under different conditions to real data from human subjects, we can learn about the likely distribution of those parameters. As we will demonstrate, even though all we can actually observe is the left side of Equation 2.11 (i.e., subjects' responses to different stimulus/context pairings), this approach allows us to infer the distributions of all of the unknown parameters that ultimately give rise to those responses. Moreover, this approach can be used to directly compare the relative fit and explanatory power of different theories and models that make disparate assumptions about aspects of auditory language processing. As we will discuss, each approach has its own benefits and shortcomings, but both approaches can be leveraged to reveal important insights about the human speech perception system, and especially about how top-down information sources such as sentential context modulate word recognition.

Chapter 3
Exploring and Evaluating the BIASES Model of Spoken Word Recognition in Context

3.1. Understanding Top-Down Effects in BIASES
Although the theoretical and mathematical underpinnings of BIASES were presented in Chapter 2, Chapter 3 aims to explore the model's more fine-grained predictions about how higher-level (e.g., contextual) and lower-level (e.g., perceptual) information sources conspire during spoken word recognition to produce top-down effects on speech perception. To that end, Chapter 3 is divided into three main sections. First, Chapter 3 examines the mathematical form of BIASES more closely. We implement BIASES and perform two preliminary simulation studies to illustrate how a minimalistic implementation of BIASES can replicate subjects' sensitivity to a preceding function word when identifying a stimulus that is phonetically ambiguous between a noun and a verb (Fox & Blumstein, in press), and how the computational principles inherent to BIASES not only account for the overall pattern, but also provide fine-grained quantitative predictions about expected variability and asymmetries in the size of context effects on spoken word recognition. Second, we consider a major problem that is often ignored in the literature, and especially by computational models: the enormous amount of unexplained variability in the size of top-down effects on speech processing. These issues are discussed with consideration of how these data can be captured by BIASES. The present model recasts previously overlooked or poorly understood behavioral patterns and asymmetries, suggesting that apparent inconsistencies in top-down effects on speech perception (cf. Pitt & Samuel, 1993) actually follow from the theoretical principles embodied by BIASES. Illustrative simulations demonstrate the unique ability of BIASES to explain and predict lawful variability in the patterns of top-down effects across stimuli and across studies. Finally, Experiment 3.1 is conducted in order to directly test one novel prediction made by BIASES, and the model's simulated behavior is compared to human performance on an auditory word identification task. These new experimental data are also used in two model comparison analyses to demonstrate that the utility of this computational model extends beyond providing a theoretical framework for contextual influences on word recognition.
BIASES also represents a novel tool for comparing psycholinguistic theories about the two inputs to the model, including both the lower-level pre-lexical processing of speech that maps speech sounds to words, and the higher-level processing of sentences that reflects how listeners utilize contextual and linguistic information during auditory language processing. Overall, the results of the simulations and the experimental analyses suggest that subjects' recognition of spoken words in context exhibits certain hallmarks of a Bayesian cue integration system. Generally speaking, BIASES highlights the fact that top-down effects on speech perception offer a unique window into perceptual processing, cognitive processing, and the interface of cognitive and perceptual representations in human language function.

3.1.1. Overview of the Mathematical Form of BIASES
Recall Equation 2.10's statement of the form of the posterior probability distribution, reproduced in Equation 3.1 (substituting in a new term, Π, to summarize the effect of the prior):

Equation 3.1
$$p(w_1 \mid A, C) = \frac{1}{1 + e^{-\Pi - g(A - \chi)}}$$
where
$$\Pi = \log\frac{p(w_1 \mid C)}{p(w_2 \mid C)}, \quad \chi = \frac{\mu_1 + \mu_2}{2}, \quad \text{and} \quad g = \frac{\mu_1 - \mu_2}{\sigma^2}$$

As described in Chapter 2, Equation 3.1 represents the present model's estimate of $p(w_1 \mid A, C)$: the probability that a target stimulus was pay, given the voice-onset time (VOT) of its initial stop, A, given that it followed sentence context C, and given that the listener is performing a two-alternative forced choice word identification task with two possible candidates ($w_1$ = pay and $w_2$ = bay). The acoustic forms of pay and bay are modeled as having Gaussian distributions with means $\mu_1$ and $\mu_2$, respectively, and shared variance term $\sigma^2$ ($\sigma^2 = \sigma_1^2 = \sigma_2^2$). $p(W|C)$ reflects the strength of the contextual bias toward one or the other candidate word, and it is estimated from a corpus. Finally, Equation 3.1's logistic form is a consequence of a key non-linguistic constraint: the use of a two-alternative forced choice task. There are three key terms within the sigmoidal posterior (see Equation 3.1): (1) Π, the term summarizing the relative prior support for the candidate words; (2) g, the logistic function's gain term; and (3) χ, which denotes the VOT that is exactly halfway between the category means. Here, we discuss the interpretation of g, χ and Π in turn and explore their role in predicting the distribution of top-down effects in spoken word recognition.

3.1.1.1. Components of BIASES: Phonetic Category Structure (g)
The primary effect of g is to control the slope of the logistic, with higher values of g indicating a sharper identification curve. As implied by the definition of g in Equation 3.1, greater separation between the means of pay and bay ($\mu_1 - \mu_2$) and lower variability ($\sigma^2$) in the expected distribution of productions of pay and bay are associated with steeper slopes. Simulation Study 3.1 illustrates the nature of g's influence on the posterior probability function and on the size of sentential context effects on word recognition. The details behind these simulations and their key conclusions are described in Box 3.1. Figures 3.1 and 3.2 illustrate the tradeoff between category variance and category separation.
The shape (i.e., steepness) of the resulting posterior sigmoids in Figure 3.2 is affected by changing either the distance between the means of the underlying normally distributed density functions in Figure 3.1 (left vs. right panels of the figures) or the underlying category variance of the normal probability density functions (or, equivalently, the standard deviations, indicated in the top vs. bottom panels of Figures 3.1 and 3.2). Intuitively, and as described in Chapter 2, the less overlap there is between two words' likelihood functions (Figure 3.1), the fewer acoustic values (here, VOTs) there will be that are ambiguous between pay and bay (Figure 3.2). Note that g is the term that differed between groups in the study by Clayards and colleagues (2008). By manipulating the apparent variance of the VOT distributions of /b/- and /p/-initial minimal pair words (e.g., beach and peach), Clayards and colleagues were tapping into the denominator of g. Finally, note that, after hearing a sentence context C, all other terms in the posterior remain constant no matter what stimulus A is presented to the listener. In particular, the prior information (Π) does not influence the slope of the posterior distribution in BIASES (due to a conditional independence assumption; see Chapter 2), while g determines the overall shape of the posterior distribution over the acoustic space.

Box 3.1. Description of Simulation Study 3.1
Goal: Illustrate the influence of two aspects of underlying phonetic category structure on the posterior probability function and the size of sentential context effects.
Design: 4 simulated phonetic category structures in a 2 × 2 design
Parameters of BIASES manipulated: $\mu_1 - \mu_2 \in \{64, 36\}$, $\sigma^2 \in \{15^2, 20^2\}$
Parameters of BIASES held constant: $\chi = 32$, $p(w_1 \mid c_1) = 0.75$, $p(w_1 \mid c_2) = 0.25$
Results displayed in: Figures 3.1-3.5, Table 3.1
Key conclusions:
1. BIASES' gain parameter (g), which characterizes the slope of the sigmoidal posterior (cf. Feldman et al, 2009), is the ratio of $\mu_1 - \mu_2$ (the distance between the means of the two words' distributions over VOTs) to $\sigma^2$ (the shared variance of each word's VOT distribution).
2. Because $\mu_1 - \mu_2$ and $\sigma^2$ are collinear (having opposite effects on g), they are not identifiable parameters when fitting a model that assumes equal category variance. The tradeoff between these features of BIASES' likelihood model can be visualized in the top-right and bottom-left simulated phonetic category structures in Figure 3.1, where distinct likelihood functions yield identical posteriors (Figures 3.2-3.3) with the same gain parameter (see Table 3.1). Consequently, model-fitting in Chapters 3-4 assumes values for $\mu_1 - \mu_2$ (from Lisker & Abramson, 1964; for a similar approach, see Kleinschmidt & Jaeger, 2015) and fits $\sigma^2$.
3. Although the magnitude of the effective category boundary shift between two prior contexts ($\chi_{c_1} - \chi_{c_2}$) depends on g, the maximum expected effect size ($\Delta_{max}$) is independent of it (see Table 3.1; Figure 3.4 vs. 3.3). However, a narrower range of VOTs exhibits top-down effects, so it would be more difficult, practically speaking, to detect a large top-down effect size if VOTs are sampled from the space.
4. For all 4 likelihoods examined in Simulation Study 3.1, the locus (a) of the maximum expected effect size ($\Delta_{max}$) was consistently collocated with the category boundary (χ) (Figure 3.4). Note, however, that χ was confounded with the midpoint of the effective category boundaries for the prior contexts ($\chi_{c_1}$, $\chi_{c_2}$). We discuss this point in Simulation Study 3.2 (see Box 3.2).
5. When measured for each prior context relative to a neutral baseline, the expected effect size for any given VOT is, in general, asymmetrical; the locus of the maximum effect size is at the midpoint between χ and the prior context's effective category boundary ($\chi_{c_i}$) (Figure 3.5).

Figure 3.1. Results of Simulation Study 3.1: Influence of $\mu_1 - \mu_2$ and $\sigma^2$ on the probability density functions $p(VOT \mid w_i)$, for $w_i$ = bay and $w_i$ = pay. $p(VOT \mid w_i)$: solid/colored curves; category boundary (χ): dashed/grey vertical line; $\mu_i$ for each $p(VOT \mid w_i)$: dotted/colored vertical lines.

Figure 3.2. Results of Simulation Study 3.1: Influence of $\mu_1 - \mu_2$ and $\sigma^2$ on the posterior probability of pay. $p(w_1 \mid VOT, C = c_N)$: solid/black curves; χ: dashed/grey vertical line.

3.1.1.2. Components of BIASES: Category Boundary (χ)
To understand the roles of the two remaining terms, Π and χ, consider, first, a hypothetical context $c_N$ in which $w_1$ and $w_2$ are both equally probable a priori, such that $p(w_1 \mid C = c_N) = p(w_2 \mid C = c_N) = 0.5$ (cf. Clayards et al, 2008; Feldman et al, 2009). In this perfectly neutral context, $\Pi = \Pi_{c_N} = \log\frac{p(w_1 \mid c_N)}{p(w_2 \mid c_N)} = \log\frac{0.5}{0.5} = \log 1 = 0$. When this condition (Π = 0) is fulfilled, and when it is also true that $A = \chi$, the entire right side of Equation 3.1 becomes $\frac{1}{1 + e^0} = 0.5$. That is, χ is the point in acoustic space for which, if presented with that stimulus in a perfectly neutral context, a listener would, in theory, be equally likely to select either response: $p(w_1 \mid A, c_N) = p(w_2 \mid A, c_N) = 0.5$. This point, χ, is referred to as the category boundary. Note that, because BIASES assumes that the variability $\sigma^2$ associated with each word's likelihood function is equal, χ is guaranteed to be located exactly halfway between $\mu_1$ and $\mu_2$, as it is defined in Equation 3.1. However, note that if $\sigma_1^2 \neq \sigma_2^2$, then it is not, in general, true that $\chi = \frac{\mu_1 + \mu_2}{2}$.

3.1.1.3. Components of BIASES: Prior Context (Π)
How, then, does the prior information contained within Π influence spoken word recognition? Because the influence of Π is constant for a given C (i.e., independent of A), the overall effect of the prior is to produce a translation (i.e., a horizontal shift) of the logistic function towards the mean of the less probable word (Feldman et al, 2009). Figure 3.3 (also from Simulation Study 3.1) illustrates this for the same distributions displayed in Figures 3.1 and 3.2. Shifting the logistic means that a stimulus with the same word-initial VOT will be more likely to be recognized as an exemplar of the contextually-supported word (relative to context $c_N$, in which both lexical candidates are equally probable; black posterior distribution in Figures 3.2 and 3.3).

Figure 3.3. Results of Simulation Study 3.1: Influence of $\mu_1 - \mu_2$ and $\sigma^2$ on the posterior probability of pay, incorporating prior contexts. $p(w_1 \mid VOT, C = c_i)$: solid/colored curves; χ: dashed/grey vertical line; $\chi_{c_i}$ for each $\Pi_i$: dashed/colored vertical lines. Prior probability of pay, $p(w_1 \mid c_i)$: $c_N$ = 0.5, $c_1$ = 0.75, $c_2$ = 0.25.

In other words, Π induces a bias in the posterior sigmoid. Ultimately, this bias is realized, for a given context C (for which $\Pi = \Pi_C$), in the movement of the location in acoustic space for which $p(w_1 \mid A, C) = p(w_2 \mid A, C) = 0.5$. We refer to this VOT value as the effective category boundary ($\chi_C$). Unlike the "baseline" category boundary ($\chi = \chi_{c_N}$) defined earlier, which depends on characteristics of the likelihood function alone (cf. Clayards et al, 2008), the effective category boundary $\chi_C$ still, of course, depends on the likelihood function, but it is also context-dependent. Specifically, the magnitude and direction of its shift away from χ are given by $\chi_C - \chi = -\frac{\Pi_C}{g}$ (cf. Feldman et al, 2009). As Figure 3.3 clearly depicts, for the same $\Pi_C$, the magnitude of the boundary shift should be larger when g is smaller (i.e., when the posterior is shallower). Ultimately, the location of the effective category boundary for a given context is given by Equation 3.2:

Equation 3.2
$$\chi_C = \chi - \frac{\Pi_C}{g}$$
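The division of labor among g, χ, and Π can be checked directly. A brief sketch, using χ = 32 and the g values arising in Simulation Study 3.1: varying g changes the posterior's slope, while varying Π shifts the whole curve, moving the 50% point from χ to the effective boundary of Equation 3.2 without changing the slope.

```python
import numpy as np

def posterior(vot, Pi, g, chi=32.0):
    """Equation 3.1: sigmoidal posterior probability of a PAY response."""
    return 1.0 / (1.0 + np.exp(-Pi - g * (vot - chi)))

vots = np.array([12.0, 32.0, 52.0])
Pi_biased = np.log(0.75 / 0.25)  # the pay-biasing prior of Simulation Study 3.1

# Same Pi (neutral), different g: only the steepness changes.
print(posterior(vots, Pi=0.0, g=0.28))
print(posterior(vots, Pi=0.0, g=0.09))

# Same g, different Pi: the curve shifts horizontally; per Equation 3.2
# the 50% point moves from chi to chi - Pi/g.
print(posterior(vots, Pi=0.0, g=0.16))
print(posterior(vots, Pi=Pi_biased, g=0.16))
print(32.0 - Pi_biased / 0.16)  # effective boundary chi_C, ~25.13 ms
```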
How can we summarize the meanings of each of the terms in BIASES (Equation 3.1)? Firstly, g encapsulates the model of the phonetic category structure that comprises the likelihood functions of pay and bay (Figure 3.1) and gives the posterior probability distribution its slope (Figure 3.2). While g controls the overall shape of the posterior, Π and χ convey information about the posterior's "location" in acoustic space. They determine, for instance, which VOTs will be most ambiguous. Along with g, χ also derives from the likelihood term, representing the location (in acoustic space) of the (unbiased) category boundary. Meanwhile, the influence of sentential context is completely contained within Π, which indexes the relative amount of support a context C provides to one or the other candidate word. It is Π's context-dependent biasing effect on the posterior distribution (Figure 3.3) that we focus on in the present model. Simulation Study 3.1 has already shown that the two elements of g ($\mu_1 - \mu_2$ and $\sigma^2$) influence the magnitude of shifts in the effective category boundary, controlling for prior context. Next, we explore the basic predictions of BIASES, focusing on how different factors within the model influence the predicted size of the effect of prior context on subjects' word identification responses.

3.1.2. Towards Model-based Analyses of Top-Down Effects
3.1.2.1. Shifting of Invisible Category Boundaries
As discussed above, the "baseline" category boundary, χ, is the point at which a given VOT is equally likely to come from both categories' distributions (Figure 3.1). In the case considered here (and elsewhere, e.g., Clayards et al, 2008; Feldman et al, 2009), where the mixture distribution that comprises the model's likelihood term is effectively constrained to consider two Gaussian-distributed categories with equal variance, χ lies at the midpoint of the two categories' means ($\frac{\mu_1 + \mu_2}{2}$). χ is a fundamental property of phonetic category structure. In Simulation Study 3.1, all phonetic category structures assumed the same χ (= 32), which was held constant across all those examined. However, in practice, χ is not straightforward to measure. For one, we cannot directly observe the phonetic category structure that underlies a subject's behavioral responses to acoustic tokens. Instead, we are bound to try to infer χ from a subject's identification or discrimination of speech tokens. However, even this is difficult, since the definition of χ presupposes equal prior probability.
In reality, such a state is difficult to confidently tap into experimentally: due to pervasive effects of lexical and phonotactic frequency on subjects' recognition of speech sounds (e.g., Connine, Titone & Wang, 1993; Massaro & Cohen, 1983; Pitt & McQueen, 1998), many assumptions are required if one hopes to confidently infer its value (see Pitt & Samuel, 1993 for a discussion of this issue). In short, although the action of the prior in BIASES is attributed to a shift produced relative to an unobservable category boundary (see Equation 3.2), the same principles can be captured without explicitly relying on some assumed or inferred value of χ by examining relative effective category boundary shifts. In other words, although the biased posterior distributions in Figure 3.3 (in blue and red) are, based on the cognitive model (Equation 3.1), computed by biasing the latent (i.e., unobserved), neutral posterior (in black), in practice an experimenter would compare data from two observed (biased) conditions to one another. Later, we do consider another type of experimental design that includes a designated "neutral" condition (e.g., Fox, 1984; Guediche et al, 2013; van Alphen & McQueen, 2001), and we argue that even in those conditions subjects' responses are probably not completely unbiased, so the so-called neutral conditions are actually more likely to be just a third bias condition with an intermediate value of Π. In any case, for now, we focus on the far more common 2-condition experimental design. Formally, to the extent that any shift in the effective category boundary, $\chi_{c_1}$, is observed in a context, $c_1$, it is relative to another context, $c_2$, with some other prior $\Pi_{c_2}$. Equation 3.3 is a generalization of Equation 3.2, giving the magnitude of the VOT boundary shift between two biased contexts.

Equation 3.3
$$\chi_{c_1} - \chi_{c_2} = \left(\chi - \frac{\Pi_{c_1}}{g}\right) - \left(\chi - \frac{\Pi_{c_2}}{g}\right) = \frac{\Pi_{c_2} - \Pi_{c_1}}{g} = \frac{\log\frac{p(w_1 \mid c_2)}{p(w_2 \mid c_2)} - \log\frac{p(w_1 \mid c_1)}{p(w_2 \mid c_1)}}{g}$$

Equation 3.3 supersedes Equation 3.2 (and is more general) because the neutral prior ($\Pi_{c_N} = \log\frac{0.5}{0.5} = 0$) that, by definition, characterizes $\chi_{c_N} = \chi$ could be substituted into Equation 3.3 to obtain Equation 3.2. Table 3.1 reports the relative effective category boundary shift for each simulation in Simulation Study 3.1; these shifts can be visualized as the difference between the VOT of the dashed/red line and that of the dashed/blue line in Figure 3.3. As previously mentioned, the boundary shift is inversely proportional to the value of g.

               μ1 − μ2 = 64 ms                       μ1 − μ2 = 36 ms
σ = 15 ms      χ = 32, g = 0.28, a = 32,             χ = 32, g = 0.16, a = 32,
               Δmax = 0.50, |χc1 − χc2| = 7.72       Δmax = 0.50, |χc1 − χc2| = 13.73
σ = 20 ms      χ = 32, g = 0.16, a = 32,             χ = 32, g = 0.09, a = 32,
               Δmax = 0.50, |χc1 − χc2| = 13.73      Δmax = 0.50, |χc1 − χc2| = 24.41

Table 3.1. Summary of Results of Simulation Study 3.1: Influence of underlying phonetic category structure on the posterior probability function and the size of sentential context effects.
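The entries of Table 3.1 can be reproduced mechanically from Box 3.1's parameters; the sketch below recomputes g and the boundary-shift magnitude of Equation 3.3 for the 2 × 2 grid.

```python
import numpy as np

# Simulation Study 3.1 (Box 3.1): mu1 - mu2 in {64, 36} ms, sigma in
# {15, 20} ms, priors p(w1|c1) = 0.75 and p(w1|c2) = 0.25.
Pi_c1 = np.log(0.75 / 0.25)
Pi_c2 = np.log(0.25 / 0.75)

for sigma in (15.0, 20.0):
    for mean_sep in (64.0, 36.0):
        g = mean_sep / sigma**2             # gain term of Equation 3.1
        shift = abs(Pi_c1 - Pi_c2) / g      # |boundary shift|, Equation 3.3
        print(f"mu1-mu2={mean_sep:4.0f} ms, sigma={sigma:4.0f} ms: "
              f"g={g:.2f}, |shift|={shift:.2f} ms")
```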
3.1.2.2. Boundary Shifts vs. Effect Sizes

However, even having skirted one practical issue by avoiding reliance on an unobservable parameter value, another practical issue remains that ultimately suggests that merely explaining top-down effects as arising from shifting category boundaries is not ideal for the goal of accurately assessing and predicting top-down effects in actual behavioral data. To see this, consider the methodological and analytic techniques used in behavioral research on top-down effects.

Typically, experimenters construct an acoustic continuum and/or select a relatively small number of stimuli at discrete steps. In many cases, the step sizes are dictated by other practical constraints, such as the need to splice waveforms at zero-crossings to avoid discontinuities that introduce acoustic artifacts (e.g., clicks) into the stimuli. Participants are then presented with these tokens in different contexts for identification; the experimenter is effectively sampling from each subject's posterior distribution in order to characterize the listener's underlying prior and likelihood model. Finally, the experimenter adopts some analytic technique, typically aimed at producing evidence that responses to the same acoustic tokens were categorized reliably differently between conditions (such as logistic regression or ANOVA over proportions of response-types by condition). At no point in this process does the notion of a category boundary, or of some underlying horizontal shift, arise. Of course, that does not mean it is not a useful characterization of the underlying model. It may be that the "horizontal" shift of the biased sigmoid relative to another condition's sigmoid is epiphenomenal, or that the "vertical" differences in the rate of categorization decisions are, or that both are. What is important about this observation for the present work, however, is that the comparison of "effective category boundaries" across conditions is a theoretical construct, somewhat removed from actual empirical practice. That is, however fundamental or epiphenomenal a category boundary is to phonetic category structure, it is in some ways incidental to experimental research on top-down effects on spoken word recognition.

It should be noted that some analysis techniques are exceptions to those described above. For instance, some researchers do explicitly estimate boundaries for subjects in different conditions (e.g., Baum, 2001; Blumstein et al, 1994) and then compute statistics on the shift in the boundary between conditions. This technique is neatly connected to the theory espoused here, but the resulting statistics rely on estimated, not directly observed, category boundaries. Statistics based on such derived measures are necessarily less faithful than the original data themselves (see, e.g., Pitt & Samuel, 1993), and a single summary statistic ignores many fine-grained details of the distribution of top-down effects (cf. Pitt & Samuel, 1993). We revisit this analysis technique later (Chapter 4), ultimately showing that an explicit, model-based analysis approach allows for richer inferences about the nature of variability in top-down effects.

3.1.2.3. Predicting Effect Sizes

In order to better understand the way BIASES integrates prior information with bottom-up acoustic data, with an eye towards the ultimate need to understand effect size (i.e., the "vertical" differences in response rates for a given acoustic stimulus), we first define the function underlying BIASES' expected effect size when comparing the rate of pay-responses for any pair of contexts, at any VOT, as shown in Equation 3.4:

Equation 3.4
$$\Delta\!\left(\{\Pi_{c_1}, \Pi_{c_2}\}, \{\chi, g\}, A\right) = p(w_1 \mid A, c_1) - p(w_1 \mid A, c_2) = \frac{1}{1 + e^{-\Pi_{c_1} - g(A - \chi)}} - \frac{1}{1 + e^{-\Pi_{c_2} - g(A - \chi)}}$$

where
$$\Pi_{c_i} = \log\frac{p(w_1 \mid c_i)}{p(w_2 \mid c_i)}, \qquad \chi = \frac{\mu_1 + \mu_2}{2}, \qquad g = \frac{\mu_1 - \mu_2}{\sigma^2}$$
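Equation 3.4 translates directly into code. A minimal sketch, building on the illustrative helpers above (all parameter defaults are the Simulation Study 3.1 values):

```python
import numpy as np

def delta(vot, prior_c1, prior_c2, mu1=64.0, mu2=0.0, sigma=15.0):
    """Expected effect size Delta(A) = p(w1|A,c1) - p(w1|A,c2) (Equation 3.4)."""
    g = (mu1 - mu2) / sigma**2
    chi = (mu1 + mu2) / 2.0
    pi1 = np.log(prior_c1 / (1.0 - prior_c1))
    pi2 = np.log(prior_c2 / (1.0 - prior_c2))
    post1 = 1.0 / (1.0 + np.exp(-pi1 - g * (vot - chi)))
    post2 = 1.0 / (1.0 + np.exp(-pi2 - g * (vot - chi)))
    return post1 - post2
```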
For ease of exposition, we refer to the function defined in Equation 3.4 as $\Delta(A)$, with the understanding that $\Delta(A)$ is meaningless unless the other parameter values ($\{\Pi_{c_1}, \Pi_{c_2}, \chi, g\}$) are also provided. As expressed by Equation 3.4, $\Delta(A)$ is the difference between the posterior probabilities of a pay-response to a stimulus ($A$) after context $c_1$ vs. $c_2$. As such, the function's shape clearly depends on the same factors on which the posterior depends. For that reason, the first two argument groups of the function are: (1) the biasing information in the prior of each posterior distribution ($\Pi_{c_1}$ and $\Pi_{c_2}$), and (2) the two components of the posterior that derive from the phonetic category structure defined by the likelihood function ($\chi$ and $g$, which do not change from context to context).

While it is simple to show that $\Delta(A)$ is the difference of two sigmoidal curves (more specifically, a sigmoid and that same sigmoid under translation), the function is not easy to express compactly (cf. the difference-of-sigmoids membership function: e.g., Berkan & Trubatch, 1997). However, despite lacking a simple closed form, $\Delta(A)$ does have certain properties that are straightforward. More importantly, those properties are critical to the predictions and simulations discussed in this chapter and Chapter 4, so we review them now and illustrate several of them via simulation.⁵ Figures 3.4 and 3.5 are the final two figures associated with Simulation Study 3.1 (see Box 3.1 for a summary of these simulations, and Table 3.1 for a summary of their results). Figure 3.4 illustrates the shape of $\Delta(A)$ over the acoustic space for the same four simulated phonetic category structures as in Figures 3.1-3.3. A few points bear special note:

Property 1. If $\Pi_{c_1} > \Pi_{c_2}$, then $\Delta(A) > 0$ for all values of $A$, although $\Delta(A)$ approaches 0 for values of $A$ further from $\Delta(A)$'s peak, which we denote $\hat{a}$. Thus, in line with intuition, subjects should never, on average, show a reversal of a sentence context effect.

Property 2. $\Delta(A)$ is symmetrical under the assumptions imposed on BIASES in Chapter 2; in particular, $\Delta(\hat{a} - x) = \Delta(\hat{a} + x)$ if $\sigma^2 = \sigma_1^2 = \sigma_2^2$.

⁵ Note that $\Delta(A)$ is not a perfect tool for all purposes. For instance, although it approximates the expected effect size after many trials in each context are completed, it is not meant to simulate behavioral data directly. After all, like the boundary shift, effect size is a derived measure that is epiphenomenal (from the explanatory standpoint of BIASES). Thus, the function is used for illustrative purposes, but all actual simulated behavioral data in Chapters 3 and 4 are generated using Equation 2.11 and then subtracted to illustrate expected effect sizes.

Property 3. The difference of two sigmoids has no simple closed form, but the total area between $\Delta(A)$ and the x-axis (that is, the area between the two posteriors being compared, $p(w_1 \mid A, c_1)$ and $p(w_1 \mid A, c_2)$) is well-defined and can be approximated by numerical methods (verified in the sketch following Property 5 below). Conveniently, that total area is equal to the magnitude of the effective category boundary shift between the two contexts (cf. Equation 3.3), as shown in Equation 3.5:

Equation 3.5
$$\int \Delta(A)\, dA = \int \left[p(w_1 \mid A, c_1) - p(w_1 \mid A, c_2)\right] dA = \frac{\log\frac{p(w_1 \mid c_1)}{p(w_2 \mid c_1)} - \log\frac{p(w_1 \mid c_2)}{p(w_2 \mid c_2)}}{g} = \chi_{c_2} - \chi_{c_1}$$
Property 4. The maximum expected difference between the posteriors is located (in acoustic space) at $\hat{a}$ and has magnitude $\Delta_{max}$; Equations 3.6 and 3.7 give these values. $\hat{a}$ occurs at the midpoint of the two posterior probability distributions' effective category boundaries.

Equation 3.6
$$\hat{a} = \operatorname*{argmax}_{a' \in \mathcal{A}} \Delta(a') = \frac{\left(\chi - \frac{\Pi_{c_1}}{g}\right) + \left(\chi - \frac{\Pi_{c_2}}{g}\right)}{2} = \chi - \frac{\Pi_{c_1} + \Pi_{c_2}}{2g}$$

Equation 3.7
$$\Delta_{max} = \Delta(\hat{a}) = \frac{1}{1 + e^{-\frac{\Pi_{c_1} - \Pi_{c_2}}{2}}} - \frac{1}{1 + e^{-\frac{\Pi_{c_2} - \Pi_{c_1}}{2}}}$$

Property 5. As is clear from Equation 3.7, the value $\Delta_{max}$ is independent of the specific characteristics of the phonetic category structure (i.e., the likelihood) in BIASES, including both $\chi$ and $g$. Unsurprisingly, $\Delta_{max}$ does depend on the relative strengths of the biases of the prior contexts being compared. Simulation Study 3.2 examines this issue further (see Box 3.2; Figures 3.6-3.9; Tables 3.2-3.3).
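Properties 3 through 5 are easy to check numerically. The sketch below continues the illustrative Python helpers from earlier in the chapter (the hypothetical delta and boundary_shift functions), computing $\hat{a}$ and $\Delta_{max}$ and confirming that the area under $\Delta(A)$ matches the boundary shift of Equation 3.3:

```python
import numpy as np

def a_hat(prior_c1, prior_c2, mu1=64.0, mu2=0.0, sigma=15.0):
    """VOT of the maximum expected effect (Equation 3.6): the midpoint
    of the two contexts' effective category boundaries."""
    g = (mu1 - mu2) / sigma**2
    chi = (mu1 + mu2) / 2.0
    pi1 = np.log(prior_c1 / (1.0 - prior_c1))
    pi2 = np.log(prior_c2 / (1.0 - prior_c2))
    return chi - (pi1 + pi2) / (2.0 * g)

def delta_max(prior_c1, prior_c2):
    """Maximum expected effect size (Equation 3.7); independent of chi and g."""
    d = (np.log(prior_c1 / (1.0 - prior_c1))
         - np.log(prior_c2 / (1.0 - prior_c2))) / 2.0
    return 1.0 / (1.0 + np.exp(-d)) - 1.0 / (1.0 + np.exp(d))

# Property 3 (Equation 3.5): the area under Delta(A) equals the boundary shift.
vots = np.linspace(-400.0, 400.0, 200001)
print(np.trapz(delta(vots, 0.75, 0.25), vots))   # ~7.72
print(boundary_shift(0.75, 0.25))                # ~7.72 (Equation 3.3)
print(a_hat(0.75, 0.25), delta_max(0.75, 0.25))  # 32.0, 0.5 (cf. Table 3.1)
```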
One thing that we can conclude from these simulations and observations is that the effective category boundary shift, which underlies the model's explanation of top-down effects on spoken word recognition, is indeed closely tied to overall differences in the influence of two prior contexts on speech recognition (e.g., Property 3), but this shift does not tell the whole story of top-down effects on speech perception. According to BIASES, different effect sizes should be observed as a function of the underlying phonetic category structure (see Simulation Study 3.1) and as a function of the relative strengths of the biases of the two contexts being compared (see Simulation Study 3.2). In short, BIASES predicts fine-grained variation in the size of top-down effects that should be observed in subjects' responses to different acoustic tokens in different sentential contexts. The distribution and shape of the $\Delta(A)$ curve (i.e., expected top-down effects as a function of VOT, for a given pair of contexts) depends on many factors. This general statement is of great theoretical interest because of the enormous variability and inconsistency in top-down effects observed in the literature (see, e.g., Pitt & Samuel, 1993). It is this issue to which we now turn.

Figure 3.4. Results of Simulation Study 3.1: Influence of $\mu_1 - \mu_2$ and $\sigma^2$ on $\Delta(A) = p(w_1 \mid VOT, C = c_1) - p(w_1 \mid VOT, C = c_2)$. Four panels cross $\mu_1 - \mu_2 \in \{64, 36\}$ ms with $\sigma \in \{15, 20\}$ ms. $\Delta(A)$: solid black curves; $\chi$: dashed/grey vertical line.

Figure 3.5. Results of Simulation Study 3.1: Influence of $\mu_1 - \mu_2$ and $\sigma^2$ on effect size relative to a neutral context, $\Delta(A) = p(w_1 \mid VOT, C = c_i) - p(w_1 \mid VOT, C = c_N)$, for prior probabilities $p(w_1 \mid c_1) = 0.75$ and $p(w_1 \mid c_2) = 0.25$. $\Delta(A)$: solid/colored curves; $\chi$: dashed/grey vertical line.

Figure 3.6. Example simulation from Simulation Study 3.2: illustrates the posterior probability distributions as a function of VOT and prior context ($p(w_1 \mid VOT, C = c_i)$; top panel), effect size as a function of VOT for those two prior contexts ($\Delta(A)$; middle panel), and effect size as a function of VOT for each of those two prior contexts relative to a neutral baseline (bottom panel). All panels represent priors $p(w_1 \mid c_1) = 0.9$ and $p(w_1 \mid c_2) = 0.25$, with $\mu_1 - \mu_2 = 64$, $\chi = 32$, and $\sigma^2 = 20^2$. In all panels: $\chi$: dashed/grey vertical line; $\chi_{c_i}$ for each $\Pi_{c_i}$: dashed/colored vertical lines; $\Delta_{max}$: magnitude of solid/orange vertical marker; $\hat{a}$: VOT of solid/orange vertical marker; $\chi_{c_2} - \chi_{c_1}$: magnitude of solid/black horizontal marker. Top panel: $p(w_1 \mid VOT, C = c_i)$: solid/colored curves; $p(w_1 \mid VOT, C = c_N)$: dotted/grey curve. Middle panel: $\Delta(A) = p(w_1 \mid VOT, c_1) - p(w_1 \mid VOT, c_2)$: solid/black curve. Bottom panel: $\Delta(A) = p(w_1 \mid VOT, C = c_i) - p(w_1 \mid VOT, C = c_N)$: solid/colored curves; $\Delta_{max}$ of each context's posterior relative to $c_N$: magnitude of solid/colored vertical markers.

Figure 3.7. Results of Simulation Study 3.2: Influence of $p(w_1 \mid C = c_i)$ and $\sigma^2$ on the posterior probability function, incorporating prior contexts. $p(w_1 \mid VOT, C = c_i)$: colored curves (solid: $\sigma = 20$; dashed: $\sigma = 30$); $\chi$: dashed/grey vertical line; $\chi_{c_i}$ for each $\Pi_{c_i}$: colored vertical lines (solid: $\sigma = 20$; dashed: $\sigma = 30$).

Figure 3.8. Results of Simulation Study 3.2: Influence of $p(w_1 \mid C = c_i)$ and $\sigma^2$ on $\Delta(A) = p(w_1 \mid VOT, C = c_1) - p(w_1 \mid VOT, C = c_2)$. $\Delta(A)$: black curves (solid: $\sigma = 20$; dashed: $\sigma = 30$); $\chi$: dashed/grey vertical line.

Figure 3.9. Results of Simulation Study 3.2: Influence of $p(w_1 \mid C = c_i)$ and $\sigma^2$ on $\Delta(A) = p(w_1 \mid VOT, C = c_i) - p(w_1 \mid VOT, C = c_N)$. $\Delta(A)$: colored curves (solid: $\sigma = 20$; dashed: $\sigma = 30$); $\chi$: dashed/grey vertical line.

Box 3.2. Description of Simulation Study 3.2
Goal: Illustrate the influence of the strength of contextual priors and one aspect of phonetic category structure on the posterior probability function and the size of sentential context effects.
Design: 2 phonetic category structures; 3 fully crossed levels of contextual bias per context, in a 2×3×3 design.
Parameters of BIASES manipulated: $\sigma^2 \in \{20^2, 30^2\}$, $p(w_1 \mid C = c_1) \in \{0.25, 0.75, 0.90\}$, $p(w_1 \mid C = c_2) \in \{0.25, 0.75, 0.90\}$
Parameters of BIASES held constant: $\chi = 32$, $\mu_1 - \mu_2 = 64$
Results displayed in: Figures 3.6-3.9, Tables 3.2-3.3
Key conclusions:
1. Figure 3.6 illustrates the geometric interpretations of several critical variables in Chapter 3.
2. As in Simulation Study 3.1, the magnitude of the effective category boundary shift between two prior contexts ($\chi_{c_2} - \chi_{c_1}$) depends on $g$ (Figure 3.7), but the maximum expected effect size ($\Delta_{max}$) is independent of $g$. To see this, compare the solid and dashed curves in Figure 3.8: they peak at the same height. Note, also, that $\Delta_{max}$ is the same in the corresponding panel of Tables 3.2 and 3.3. Table 3.2 lists the summary statistics for the simulations using $\sigma^2 = 20^2$ and Table 3.3 lists those for the simulations using $\sigma^2 = 30^2$. Each panel represents a pair of prior contexts, with tan panels showing no expected context effects (see Figures 3.7-3.8), blue panels having higher posteriors for $c_1$, red panels having higher posteriors for $c_2$, and darker panels (of each hue) corresponding to larger expected effect sizes. $\Delta_{max}$ depends only on the strengths of the prior contexts' biases.
3. Nonetheless, although $\Delta_{max}$ is independent of BIASES' likelihood function, the VOT at which the maximum expected effect size is found ($\hat{a}$) is not (see Figure 3.8). As Equations 3.6 and 3.7 suggest, $\hat{a}$ lies midway between the two priors' effective category boundaries, which depend on $g$. Note the divergence between Tables 3.2 and 3.3 in $\hat{a}$ for the same panel (i.e., the same priors).
4. Similarly, the magnitude of the effective category boundary shift between two prior contexts ($\chi_{c_2} - \chi_{c_1}$) depends on both $g$ and the priors (see Figure 3.7).
5. When measured for each prior context relative to a neutral baseline, the expected effect size for any given VOT is, in general, asymmetrical; the locus of the maximum effect size is at the midpoint between $\chi$ and the prior context's effective category boundary ($\chi_{c_i}$) (see Figures 3.6 and 3.9).

Tables 3.2-3.3. Summary of Results of Simulation Study 3.2: Influence of $p(w_1 \mid C = c_i)$ and $\sigma^2$ on the posterior probability distribution and the size of sentential context effects (Table 3.2: $\sigma^2 = 20^2$; Table 3.3: $\sigma^2 = 30^2$). Each panel represents a pair of prior contexts, as described above.

σ² = 20² (all cells: χ = 32, g = 0.16; shift = χ_c2 − χ_c1)

p(w1|c2) \ p(w1|c1) |  0.25                 |  0.75                 |  0.90
0.25                |  â = 38.87,           |  â = 32.00,           |  â = 28.57,
                    |  Δmax = 0.00,         |  Δmax = 0.50,         |  Δmax = 0.68,
                    |  shift = 0.00         |  shift = 13.73        |  shift = 20.60
0.75                |  â = 32.00,           |  â = 25.13,           |  â = 21.70,
                    |  Δmax = −0.50,        |  Δmax = 0.00,         |  Δmax = 0.27,
                    |  shift = −13.73       |  shift = 0.00         |  shift = 6.87
0.90                |  â = 28.57,           |  â = 21.70,           |  â = 18.27,
                    |  Δmax = −0.68,        |  Δmax = −0.27,        |  Δmax = 0.00,
                    |  shift = −20.60       |  shift = −6.87        |  shift = 0.00

Table 3.2. Summary of Results of Simulation Study 3.2 (simulations utilizing $\sigma^2 = 20^2$).

σ² = 30² (all cells: χ = 32, g = 0.07; shift = χ_c2 − χ_c1)

p(w1|c2) \ p(w1|c1) |  0.25                 |  0.75                 |  0.90
0.25                |  â = 47.45,           |  â = 32.00,           |  â = 24.28,
                    |  Δmax = 0.00,         |  Δmax = 0.50,         |  Δmax = 0.68,
                    |  shift = 0.00         |  shift = 30.90        |  shift = 46.35
0.75                |  â = 32.00,           |  â = 16.55,           |  â = 8.83,
                    |  Δmax = −0.50,        |  Δmax = 0.00,         |  Δmax = 0.27,
                    |  shift = −30.90       |  shift = 0.00         |  shift = 15.45
0.90                |  â = 24.28,           |  â = 8.83,            |  â = 1.10,
                    |  Δmax = −0.68,        |  Δmax = −0.27,        |  Δmax = 0.00,
                    |  shift = −46.35       |  shift = −15.45       |  shift = 0.00

Table 3.3. Summary of Results of Simulation Study 3.2 (simulations utilizing $\sigma^2 = 30^2$).
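Every cell of Tables 3.2 and 3.3 follows from Equations 3.3, 3.6, and 3.7. As a check, the following sketch regenerates the table entries, reusing the hypothetical a_hat, delta_max, and boundary_shift helpers defined above (whose defaults already match the Box 3.2 constants $\mu_1 - \mu_2 = 64$ and $\chi = 32$):

```python
for sigma in (20.0, 30.0):                 # Table 3.2 vs. Table 3.3
    for p2 in (0.25, 0.75, 0.90):          # rows:    p(w1 | c2)
        for p1 in (0.25, 0.75, 0.90):      # columns: p(w1 | c1)
            print(f"sigma={sigma:.0f} p2={p2} p1={p1} "
                  f"a_hat={a_hat(p1, p2, sigma=sigma):6.2f} "
                  f"dmax={delta_max(p1, p2):5.2f} "
                  f"shift={boundary_shift(p1, p2, sigma=sigma):7.2f}")
```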
3.2. Evaluating BIASES

A central motivation behind the development of BIASES is to provide a theoretical explanation and computational framework within which to examine top-down effects of sentential context in spoken word recognition tasks. The foregoing work (Chapters 2 through 3.1) has focused on this task. However, BIASES provides more than a framework; as discussed above, it also provides an explicit mathematical model that makes specific, fine-grained quantitative predictions about the distribution of top-down effects on spoken word recognition. Thus, it is important to evaluate the extent to which BIASES can account for observed variability in such effects, and the extent to which its novel predictions are borne out experimentally.

3.2.1. Observed Variability in the Size of Top-Down Context Effects

Despite the strong evidence for top-down effects on spoken word recognition (see Chapter 2), substantial heterogeneity remains to be explained in the fine-grained details of the results of studies of this class of phenomena. Pitt and Samuel (1993) provide a thorough review of such variability for lexical effects, but we discuss a few examples here. Observed effects vary depending on the source of the bias (e.g., lexical, sentential, monetary payoff). Among lexical effects, the degree of top-down influence depends on the position of the manipulated phonetic cues in the word (e.g., word-initial: Ganong, 1980; word-medial: Connine, 1990; word-final: McQueen, 1991; see also Mattys, Melhorn & White, 2007). Among sentential context effects, the sizes of semantic, syntactic, and pragmatic effects are not consistent. Even restricting analysis to top-down effects of syntactic sentential context on speech recognition, effect sizes vary greatly
(see Chapter 1; Fox & Blumstein, in press). Such effects are reminiscent of word frequency effects (Connine et al, 1993) in phoneme identification tasks (see Pitt & Samuel, 1993 for a related explanation of inconsistent lexical effects in terms of word familiarity/frequency).

In addition to varying with the nature of the biasing information, top-down effects also depend on characteristics of the acoustic stimuli that comprise the test continua. Most obviously, and most consistently across studies, top-down effects are larger for more phonetically ambiguous stimuli; they vanish or are very small for tokens that are clearly identifiable. Other acoustic manipulations that reduce stimulus quality or otherwise render the tokens more ambiguous tend to be associated with larger effect sizes (e.g., Burton & Blumstein, 1995; McQueen, 1991; Pitt & Samuel, 1993). On the other hand, top-down effects are elusive when stimuli are more faithful to the phonetic properties of real speech and carry a greater number of reliable bottom-up acoustic cues (e.g., Burton, Baum & Blumstein, 1989). Indeed, there is even some indication that the size and prevalence of top-down effects depend on the specific phonetic contrasts and acoustic cues being manipulated in the stimuli (e.g., /sh/–/ch/ vs. /sh/–/h/ vs. /sh/–/s/ vs. /b/–/m/ vs. /b/–/d/ vs. /b/–/p/ vs. /g/–/k/ vs. /t/–/d/).

Furthermore, there is a high degree of individual variability in the extent to which subjects exhibit top-down effects, even within a homogeneous population of healthy, monolingual, English-speaking young adults with normal hearing (see, e.g., Chapter 1; Fox & Blumstein, in press). Far more variability exists in the size of such effects in elderly adults (e.g., Abada et al, 2008) or patients with aphasia (see Chapter 4; see also Baum, 2001; Blumstein et al, 1994; Boyczuk & Baum, 1999).

Finally, a review of the literature shows that there are also strong task effects and an influence of an experiment's demand characteristics on the observed size of top-down effects on speech recognition. Chapter 1 discussed the role of stimulus predictability, experimental task (phoneme vs. word identification), and response latency in determining the expected size of top-down effects (see Fox & Blumstein, in press; cf. Bicknell, Jaeger & Tanenhaus, in press; Bicknell, Tanenhaus & Jaeger, submitted; Connine, Blasko & Hall, 1991; McClelland, 1987; Pitt & Samuel, 1993; Szostak & Pitt, 2013; van Alphen & McQueen, 2001). Pitt and Samuel (1993) also acknowledge apparent modulations of top-down effects in mixed vs. blocked designs, and they highlight the potential for differences in measured effect sizes due to differences in the analytic techniques experimenters select.

In some cases, such variability may be due to chance. However, it is also possible that the observed asymmetries and inconsistencies are not merely noise but are, in fact, systematic variation attributable to the basic principles underlying speech perception and the probabilistic Bayesian framework within which we have formulated an explanation of sentential context effects on spoken word recognition. Because BIASES offers a formal, mathematical model of context effects on speech perception, it is possible to evaluate its quantitative predictions in light of available data. In this way, not only can we validate many of the fundamental principles underlying BIASES, but we can also identify its shortcomings and take measures to improve the model's empirical coverage.

Next, we consider four sources of variability, examining whether and/or how BIASES might capture the observed irregularities and, in the process, relaxing some of the simplifying assumptions adopted when the model was introduced in Chapter 2. Specifically, we examine two sources associated with the likelihood term of BIASES (variability in the ambiguity of phonetic cues based on VOT and based on additional cues) and two sources associated with the prior term of BIASES (variability based on the strength of prior context and based on a "neutral" context).

3.2.2. Variability in the Ambiguity of Phonetic Cues: VOT

One well-documented source of variability in top-down effects on spoken word recognition is variability along a continuum: top-down effects are not typically observed for phonetically unambiguous endpoint stimuli. For example, stimuli with very short VOTs are not good exemplars of /p/, as reflected in the likelihood distributions for /b/ and /p/ in Figure 3.1. There is a vanishingly small probability that a word-initial /p/ will be pronounced with a VOT of 10 ms, so the posterior probability (see Figure 3.2) of a /p/-response for such a token is virtually zero.
Even when the prior context strongly supports a word beginning with /p/ (see Figure 3.3), subjects are not likely to make a /p/-response; after all, the posterior in Bayes' rule is proportional to the product of the prior and the likelihood, so acoustic tokens that are not at all representative of /p/ (i.e., that have a likelihood close to zero) will not tend to show reliable context effects. The same is true of stop consonant tokens with VOTs of 50 ms, for example, because the likelihood that such a VOT is a token of bay is practically zero. This can be seen clearly in Figure 3.4, where no context effects are predicted for these VOT values. This pattern has been replicated in the literature going back to Ganong's original lexical effect (1980); for instance, a large effect for intermediate VOTs and much smaller or nonexistent effects for endpoint VOT tokens can be seen in the data from Chapter 1 (Fox & Blumstein, in press; see Chapter 2 for discussion). This is not a strictly Bayesian pattern: many models are capable of capturing this sort of effect. However, the pattern exemplified in Figure 3.4 is a fundamental property of the Bayesian framework (e.g., Massaro, 1989; Norris & McQueen, 2008), rather than the consequence of a design choice within the model. This contrasts with other models, such as Merge (Norris et al, 2000), which prevents top-down information from influencing responses to endpoint tokens by implementing a "bottom-up priority rule" that only allows higher-level sources to affect decisions when the acoustic information is ambiguous (Norris et al, 2000). Importantly, in order to explicitly implement something like Merge's bottom-up priority rule, a model must define additional computational machinery and/or assumptions to govern when bottom-up decisions are protected from contextual influences and when top-down information is integrated. Although other cue integration models (e.g., Toscano & McMurray, 2010), which focus on the integration of multiple acoustic cues in speech perception, also exhibit reliability-based cue-weighting like the present model, BIASES additionally makes fine-grained quantitative predictions about the distribution of top-down effects across VOTs that can be compared to patterns in empirical data. To the extent that these specific predictions are borne out, it would suggest that certain hallmarks of Bayesian cue integration exist in behavioral results. This issue is examined further later in this chapter.
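This reliability-based pattern can be read directly off Equation 3.4. Using the hypothetical delta helper sketched earlier (with the illustrative Simulation Study 3.1 parameters, $\mu_1 - \mu_2 = 64$ ms and $\sigma = 15$ ms), the predicted context effect nearly vanishes at the endpoints:

```python
# Contexts p(w1|c1) = 0.75 vs. p(w1|c2) = 0.25; w1 = pay.
for vot in (10.0, 32.0, 50.0):
    print(vot, round(float(delta(vot, 0.75, 0.25)), 3))
# 10.0 -> ~0.005 (clear bay token: negligible context effect)
# 32.0 ->  0.500 (maximally ambiguous token: largest effect)
# 50.0 -> ~0.016 (clear pay token: negligible context effect)
```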
3.2.3. Variability in the Ambiguity of Phonetic Cues: Additional Cues

Although, up to now, we have assumed that VOT is the primary acoustic dimension on which voiced and voiceless stop consonants (e.g., /b/ and /p/) are distinguished in speech perception (Liberman et al, 1961), listeners also make use of a large variety of other acoustic cues in their judgments about voicing in natural speech stimuli (see, e.g., Klatt, 1975; Lisker, 1986; Miller & Dexter, 1988; Repp, 1984; Stevens & Klatt, 1974; Summerfield, 1981). A complete model of top-down effects on spoken word recognition, then, would include all of these cues in the likelihood model that maps acoustic stimuli to words. Burton, Baum and Blumstein (1989) investigated one cue in particular: the amplitude of the burst of the stop consonant. In natural speech, VOT and burst amplitude co-vary (Lisker & Abramson, 1964; Pickett, 1980; Zue, 1976), even though most VOT continua hold burst amplitude constant in an attempt to isolate the effect of VOT duration on speech recognition. Burton and colleagues (1989) showed not only that subjects were sensitive to manipulations of burst amplitude as a cue to voicing in stop consonants, but also that the size of top-down effects (as indexed by the emergence of a lexical effect in subjects' responses) differed depending on whether the stimuli in the test VOT continuum varied from token to token in both burst amplitude and VOT or in VOT alone. Top-down effects occurred when only VOT varied, and were not, in fact, significant when burst amplitude and VOT co-varied along the continuum (as in natural speech).

How might this asymmetry be explained within the Bayesian framework? Clearly, BIASES is not equipped to explain this effect under its original assumptions, because it assumes that VOT is the only relevant acoustic cue to a word onset's identity. A simple adaptation, however, can explain how the effect emerges. First, we incorporate both the burst amplitude and the VOT of a given stimulus into BIASES' likelihood function: $p(A \mid W) = p(burst, VOT \mid W)$. With this change, the likelihood function is two-dimensional instead of one-dimensional; while this is still surely an oversimplification, since many other acoustic cues also influence word recognition, it illustrates the adaptability of BIASES. Next, we created an arbitrary range of burst amplitudes that was higher for voiceless tokens than for voiced tokens (cf. Lisker & Abramson, 1964). Finally, a simulation was conducted to compare the expected size of top-down effects for stimuli from two simulated VOT continua: one with a single burst amplitude across all tokens, and one whose VOT values co-varied with burst amplitude. Arbitrary mean and variance parameters were selected from among those used in Simulation Study 3.1 (any choice shows the same basic pattern, but the actual size of the effective category boundary shift depends on this choice; see Figure 3.3 and Table 3.1). Figure 3.10 shows the posterior probability distributions for the two simulated continua in two biasing contexts (blue vs. red) and a baseline neutral context (black).

Figure 3.10. Results of two model simulations of posterior probability distributions assuming the same biasing/neutral contexts ($p(w_1 \mid c_N) = 0.5$; $p(w_1 \mid c_1) = 0.75$; $p(w_1 \mid c_2) = 0.25$) and identical underlying likelihood models. The two simulations varied only in whether the simulated stimuli kept a constant /p/-burst amplitude across the VOT continuum (left panel) or co-varied burst amplitude with VOT (right panel).

As can be seen, the expected category boundary shift (and the expected effect sizes) are smaller for the second simulated stimulus set (burst amplitude and VOT co-varied) than for the first (VOT varies with burst amplitude fixed at the mean amplitude of the voiceless stop's burst). It is important to note that the model is identical in both simulations (BIASES with the likelihood model updated to include burst amplitude as an acoustic cue); only the characteristics of the simulated stimulus sets differ between the two simulations.
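A minimal sketch of this two-cue adaptation follows. The burst-amplitude means and standard deviation below are invented for illustration (the dissertation describes its values only as an arbitrary range that is higher for voiceless than for voiced tokens), and the two cues are treated as conditionally independent given the word, which is one simple way to define $p(burst, VOT \mid W)$:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posterior_2cue(vot, burst, prior_w1,
                   mu_vot=(64.0, 0.0), sd_vot=15.0,
                   mu_burst=(60.0, 45.0), sd_burst=6.0):
    """p(w1 | burst, VOT, C) with a two-dimensional likelihood: each word's
    likelihood is the product of a VOT Gaussian and a burst Gaussian.
    All numeric cue parameters here are illustrative assumptions."""
    like_w1 = normal_pdf(vot, mu_vot[0], sd_vot) * normal_pdf(burst, mu_burst[0], sd_burst)
    like_w2 = normal_pdf(vot, mu_vot[1], sd_vot) * normal_pdf(burst, mu_burst[1], sd_burst)
    num = prior_w1 * like_w1
    return num / (num + (1.0 - prior_w1) * like_w2)
```

Under this setup, when burst amplitude co-varies with VOT the two cues always agree, so the combined likelihood ratio changes more steeply along the continuum and the prior has less room to move the posterior, which is the direction of the result in Figure 3.10.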
Next, we consider two sources of variability related to the prior of BIASES.

3.2.4. Variability in the Strength of Prior Cues

Figure 3.11 is a reproduction of Figure 1.2 (in Chapter 1; Fox & Blumstein, in press), which shows the results of Experiment 1 from that study: the proportion of /p/-responses made to ambiguous tokens (i.e., the intermediate VOT values, as defined in Chapter 1) from the bay–pay and buy–pie continua following noun-biasing (e.g., Valerie hated the...) and verb-biasing (e.g., Brett hated to...) sentence contexts. Recall that the bay–pay continuum was designed to be a noun–verb continuum and the buy–pie continuum a verb–noun continuum. As explained in Chapter 1 and as can be seen in Figure 3.11, consistent with predictions, subjects exhibited a significant CONTEXT × CONTINUUM interaction: they were more likely to make /p/-responses when the most common grammatical category of the /p/-endpoint was consistent with the grammatical cue provided by the preceding function word (to vs. the). While this effect was quite robust, with the simple effects of CONTEXT significant and in opposite directions at each level of CONTINUUM, the effect sizes were not identical. This can be seen clearly in Figure 3.11: the magnitude of the effect of CONTEXT in the buy–pie continuum (β = 0.95) is smaller than the effect's magnitude in the bay–pay continuum (β = −1.37). Moreover, visual inspection of Figure 3.11 suggests that the level of CONTEXT primarily driving the interaction is the verb-biasing (to) level: the proportion of /p/-responses is far more disparate between the two continua following the verb-biasing contexts than following the noun-biasing contexts.

Figure 3.11. Reproduction of Figure 1.2. Mean proportion of /p/-responses to ambiguous tokens from each VOT continuum (bay–pay, buy–pie) in Experiment 1 of Chapter 1 after noun-biasing (the) and verb-biasing (to) sentence contexts. Error bars represent standard error. (Fox & Blumstein, in press)

One possible explanation for this asymmetry lies in the strength of the biasing information in the prior of BIASES. Intuitively, if the targets (i.e., bay, pay, buy, pie) all represent relatively "good" nouns (i.e., they are sensible and/or grammatically acceptable following the), but only pay and buy represent "good" verbs (i.e., are acceptable following to), then the second asymmetry is predicted: the verb-biasing contexts should drive the interaction. The stronger overall magnitude of the bias observed in the bay–pay continuum might arise under many circumstances, including if pay were particularly likely to follow to and bay particularly unlikely, thereby creating a much stronger bias in that condition than in the others. To determine whether this intuitive explanation could quantitatively capture the asymmetry observed for these particular contexts and target words, we implemented the prior of BIASES (a bigram language model; see Chapter 2) for these stimuli. Table 3.4 provides corpus counts of the number of tokens of each function-word/target bigram (e.g., to pay, the buy) in the Google Books corpus (Michel et al, 2010). As described in Chapter 2, a smoothing parameter⁶ is added to every corpus count (Lidstone, 1920) to yield an estimate of the conditional prior for each word, given the preceding context.

         ...bay       ...pay        ...buy       ...pie
to...    91,314       17,383,444    7,423,403    6,709
the...   3,236,957    945,799       56,284       249,243

Table 3.4. Number of tokens of each bigram found in the 2009 Google Books corpus (Michel et al, 2010).
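Concretely, the conditional priors can be computed from Table 3.4's counts with Lidstone smoothing. A sketch follows, restricted for brevity to the bay–pay continuum (the rime restriction described in the next paragraph zeroes out the other candidates); the counts are those in Table 3.4, $\alpha = 1 \times 10^7$ follows footnote 6, and the prior_pay helper name is ours:

```python
counts = {
    ("to", "bay"): 91_314,      ("to", "pay"): 17_383_444,
    ("the", "bay"): 3_236_957,  ("the", "pay"): 945_799,
}
alpha = 1e7  # Lidstone smoothing constant (see footnote 6)

def prior_pay(context):
    """p(pay | context) among the rime-compatible candidates {bay, pay}."""
    pay = counts[(context, "pay")] + alpha
    bay = counts[(context, "bay")] + alpha
    return pay / (pay + bay)

print(prior_pay("to"))   # ~0.73: 'to' biases toward pay
print(prior_pay("the"))  # ~0.45: 'the' biases weakly toward bay
```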
Furthermore, the likelihood model was improved to allow the rime (i.e., vowel + glide) of the target stimulus to influence the likelihood of BIASES, rather than just the VOT of the stimulus's initial stop consonant (see Chapter 4 for the mathematical details of this improvement). Words that differed from a target stimulus in its rime were assigned a likelihood (and therefore a posterior probability) of zero; that is, on every trial on which a subject heard /?ei/, the only words competing for recognition were bay and pay. The smoothed corpus estimates were incorporated into BIASES as the conditional prior. For the likelihood model, arbitrary mean and variance parameters were selected from among those used in Simulation Study 3.1 (any choice shows the same basic pattern). Figure 3.12 shows the posterior probability distributions for the two continua in each context (left panels) and reproduces Figure 1.1 (right panels) for comparison.

⁶ Although any value of alpha gives the same basic pattern of results (i.e., the same ordering of effect sizes), different values accentuate the disparities between the contexts and continua to greater or lesser extents. For the present simulations, the value 1×10⁷ was used, to illustrate the similarity between the model predictions and the experimental results.

Figure 3.12. Model simulations of posterior probability distributions (left panels) and original data (right panels) for the bay–pay (top) and buy–pie (bottom) continua in the noun- and verb-biasing contexts (verb-biasing: blue on left / dashed on right; noun-biasing: red on left / solid on right). The right panels reproduce Figure 1.1: mean proportion of /p/-responses to tokens from each VOT continuum in Experiment 1 of Chapter 1 after noun-biasing and verb-biasing sentence contexts; error bars represent standard error. (Fox & Blumstein, in press)

Finally, from the resulting posterior distributions, 20 sets (i.e., 20 simulated subjects) of 20 behavioral responses were generated for each context condition at an ambiguous VOT (a randomly selected VOT value within 5 ms of the assumed category boundary). Figure 3.13 shows the mean proportion of /p/-responses in each continuum to the ambiguous VOT value after each context. The same pattern of results seen in Figure 3.11 is observed: the overall interaction is robust, there is a larger effect of context in the bay–pay continuum than in the buy–pie continuum, and the effect is largely driven by the verb-biasing contexts. Importantly, these results are obtained without any significant parameter-fitting; rather, the pattern of results that emerges is inherent to a Bayesian model that assumes, like BIASES, that the prior word should bias the perception of subsequent spoken words when they are phonetically ambiguous. Thus, these results strongly suggest that variability in the strength of prior information in a sentence context modulates the size of observed top-down effects in systematic and predictable ways.
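The simulated response rates in Figure 3.13 are simply independent Bernoulli draws from the model posteriors. A sketch of that sampling step (a hypothetical helper; the posterior value would come from BIASES at the selected ambiguous VOT):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def simulate_subject_means(p_pay, n_subjects=20, n_trials=20):
    """Each simulated subject contributes n_trials i.i.d. Bernoulli(p_pay)
    identification responses; returns one mean /p/-response rate per subject."""
    responses = rng.random((n_subjects, n_trials)) < p_pay
    return responses.mean(axis=1)

means = simulate_subject_means(0.8)      # e.g., a verb-biasing-context posterior
print(means.mean(), means.std(ddof=1))   # group mean and SD across subjects
```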
Figure 3.13. Model simulations of behavioral response rates for the bay–pay (left bars) and buy–pie (right bars) continua in noun-biasing (red) and verb-biasing (blue) sentence contexts. Mean proportion of simulated /p/-responses to a randomly selected ambiguous VOT (within 5 ms of the simulated category boundary) by 20 simulated subjects with 20 Bernoulli (independent and identically distributed) trials each. Error bars represent the standard error of the 20 simulated subject means. Compare to Figure 3.11 (or Figure 1.2). (cf. Fox & Blumstein, in press)

3.2.5. Variability in Effect Sizes Compared to "Neutral" Prior Contexts

Another inconsistency that has received relatively little attention in the literature, despite its appearance in various studies (e.g., Guediche et al, 2013; van Alphen & McQueen, 2001), relates to the size of top-down effects when responses to stimuli in biasing contexts are compared to a context designed to serve as a neutral condition. For instance, Guediche and colleagues (2013) examined responses to stimuli that were occasionally phonetically ambiguous between goat and coat after goat-biasing sentence contexts (e.g., He milked the...), coat-biasing sentence contexts (e.g., He buttoned the...), and neutral sentence contexts in which either goat or coat could sensibly serve as a continuation (e.g., He painted the...). Researchers have also attempted to include similarly neutral contexts in lexical effect studies by using continua between two non-words or between two words (e.g., Fox, 1984). The goal of such studies is generally to compare each biasing context to the neutral context in order to show that sentential contexts affect the identification of stimuli relative to a neutral baseline. However, it has often been observed that the neutral context may differ significantly from only one of the biasing contexts, or that the effect size may be larger in one direction than in the other. These asymmetries have rarely been discussed in detail in the literature.

Nonetheless, because BIASES demands that even the allegedly "neutral" context have some prior (whether 0.5 or not), this exercise serves as a reminder that one must explicitly model the prior of even the neutral context. It may be that the sentence contexts representing the neutral condition are, indeed, truly unbiased: $p(w_1 \mid C = c_N) = 0.5$. In such a case, the asymmetries might be explained by asymmetric prior biases in the two biasing contexts: there is no guarantee that stimuli in the two biasing contexts will be equally biased away from a perfectly neutral context, as they are in Simulation Study 3.1 (see Box 3.1), where $p(w_1 \mid C = c_1) = 0.75$ and $p(w_1 \mid C = c_2) = 0.25$. Indeed, Simulation Study 3.2 (see Box 3.2) examined prior contexts that were not equally biased relative to the neutral context, with asymmetries resulting (see Figures 3.6 and 3.9). This issue is explored further later in this chapter, but for the moment it is worth noting one important conclusion from this discussion: an experimentally defined "neutral" context may not be neutral at all. There may be biases inherent in even those "neutral" stimuli, and, even if they are neutral, the only conditions under which one should expect equal effect sizes for each context relative to the "neutral" condition are when the stimuli are perfectly evenly biased around the perfectly neutral context.
These conditions are unlikely to be met without extremely tight experimental design controls, but BIASES allows an experimenter to predict, in advance of an experiment, the likely effect sizes when comparing stimuli from different conditions. Thus, BIASES can be employed for power analyses and for experimental design.

3.3. Testing Predictions of BIASES: Experiment 3.1

In order to further examine the extent to which the fine-grained predictions of BIASES can be observed in empirical data, a new set of stimuli was constructed and Experiment 3.1 was conducted. There were two goals: (a) to determine whether by-subject differences in pay-response rates to different acoustic stimuli predicted specific patterns of top-down effects, and (b) to determine whether there is evidence that subjects' responses to stimuli following a "neutral" context actually reflect Bayes-optimal processing. These two goals were addressed by two model comparison analyses of the results of Experiment 3.1.

3.3.1. Methods

3.3.1.1. Subjects

Fifteen healthy young adults participated in Experiment 3.1 as part of a multi-experiment session; all 15 subjects completed this experiment first. Participants received either course credit or 8 dollars. All subjects were right-handed, monolingual, native speakers of American English, and all self-reported normal hearing and no known neurological diseases.

3.3.1.2. Materials

The stimuli for this study comprised 4 acoustic tokens from a voice-onset time continuum between bay and pay, each of which was appended to a set of noun- and verb-biasing sentence contexts (e.g., He hated the... vs. He hated to...). Stimuli were recorded in a soundproof booth on an Edirol digital recorder (model R09-HR) with a Sony microphone (model ECM-MS907) (sampling rate: 44.1 kHz; 24 bits; stereo) and then resampled in BLISS speech-editing software (Mertus, 1989) (sampling rate: 22.05 kHz; 16 bits; mono: left channel). The speaker was a male native speaker of American English. All sentence frames (e.g., He hated...), biasing function words (to/the), and naturally produced target tokens of bay and pay were produced in isolation multiple times, and tokens were selected from among them for use in the experiment proper. The list of sentence frames consisted of the same 20 main verbs used by Fox and Blumstein (2015; see also Chapter 1), but first names were replaced with the pronoun "He" to reduce differences in stimulus duration and to ensure that subjects could not learn mappings between names and subsequent function words. Three contexts were appended to each of the 20 sentence frames: a naturally produced token of the, a naturally produced token of to (both 125 ms in duration), and 125 ms of unintelligible but spectrally similar speech babble (the initial and final 40 ms of which were ramped up/down, respectively). This third condition was dubbed the "noise" condition. In total, this yielded 60 sentence contexts (20 main verbs crossed with the three conditions; He hated to.../the.../[noise]...). To each of these 60 contexts, each of 4 acoustically manipulated tokens from a VOT continuum between bay and pay was appended, yielding 240 total sentences per subject (20 frames × 3 contexts × 4 tokens).
Tokens of the VOT continuum were constructed by concatenating: the unaltered burst of a pay token; a variable amount of aspiration from the natural pay token (its duration depended on the duration of the vowel removed; see below); the first quasiperiodic pitch period from the natural pay token; and all but the first N pitch periods of a naturally produced token of bay, where N = the stimulus number − 1. The duration of the N pitch periods removed was equal to the amount of aspiration added from the pay token, ensuring that all tokens had the same overall duration (within 1 ms of 439 ms). In this way, 7 VOT tokens were created. Four tokens with VOTs of 3, 22, 31, and 48 ms were selected as stimuli: the middle two were judged to be the most ambiguous, and the other two were strong endpoint tokens.

3.3.1.3. Procedure

All subjects heard all stimuli binaurally over headphones, in random order, in a sound-dampened booth. They were instructed to indicate whether the last word of each sentence was bay or pay by pressing the appropriately marked button as quickly and accurately as possible, and to guess if they did not know. Button assignment was counterbalanced across subjects. Subjects were told ahead of time that some sentences would not make sense. Subjects completed 2 practice trials before the experiment began. The experiment took about 12 minutes to complete, with no breaks.

3.3.2. Results: Logistic Regression Analysis of Biased Contexts

The results of Experiment 3.1 are analyzed throughout the remaining sections of Chapter 3. First, we consider only the results for the two biasing contexts (shown in Figure 3.14); subsequent analyses examine the results of all three conditions (including noise). To test for an effect of sentential context on speech recognition, the data were analyzed using mixed-effects logistic regression (Baayen, Davidson & Bates, 2008; Jaeger, 2008) (see Chapter 1). There was evidence of a strong influence of VOT on the rate of pay-responses (β = 0.31, p < 0.001) and also a strong sentential context effect (β = 3.03, p < 0.001). Figure 3.15 shows the results in terms of effect size as a function of VOT.

Figure 3.14. Mean proportion of pay-responses to tokens from the bay–pay VOT continuum after noun-biasing (...the) and verb-biasing (...to) sentence contexts. Error bars represent standard error.

Figure 3.16 shows the by-subject variability in effect sizes for responses to each VOT token. However, Figure 3.17 shows that subjects also vary in their underlying likelihood model. In particular, in their responses to these tokens, subjects' expected category boundaries differ: some subjects appear to expect exemplars of pay to have much longer VOTs than other subjects do. Because of this, we examined two models for BIASES: one in which all subjects share the same category boundary, and one in which subjects differ. If subjects do, indeed, differ in their category boundaries, then they should also vary in their expected effect size for a given VOT token.
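To make the structure of the analysis in section 3.3.2 concrete: the reported analysis is a mixed-effects logistic regression with by-subject random effects, which the simplified stand-in below omits. It fits an ordinary logistic regression to synthetic trial-level data (all data and generating coefficients here are fabricated for illustration, chosen near the reported fixed effects), so it shows the design of the analysis, not the actual pipeline:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic trial-level data: pay-responses by VOT (ms) and context (0 = the, 1 = to).
n = 2000
vot = rng.choice([3, 22, 31, 48], size=n).astype(float)
context = rng.integers(0, 2, size=n).astype(float)
p = 1.0 / (1.0 + np.exp(-(-8.0 + 0.31 * vot + 3.0 * context)))  # generative model
df = pd.DataFrame({"resp": (rng.random(n) < p).astype(int),
                   "vot": vot, "context": context})

fit = smf.logit("resp ~ vot + context", data=df).fit(disp=False)
print(fit.params)  # recovers coefficients near the generating values
```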
Figure 3.15. Mean difference in the proportion of pay-responses to tokens from the bay–pay continuum after verb- vs. noun-biasing sentence contexts. Error bars represent standard error.

Figure 3.16. For each subject (N=15), the difference in the proportion of pay-responses to tokens from the bay–pay continuum (VOTs of 3, 22, 31, and 48 ms) after verb- vs. noun-biasing contexts. Error bars represent standard error.

Figure 3.17. For each subject (N=15), the best-fitting (see section 3.3.2) unbiased posterior probability distribution: the probability of a pay-response to tokens from the bay–pay VOT continuum after a theoretical context that is truly neutral.

3.3.3. Results: Model Comparison 1 – Subject Variability

BIASES was implemented as a hierarchical Bayesian statistical model for further analysis. For the present analyses, only responses to the VOT tokens after noun- and verb-biasing contexts (not the noise context) were considered. Two separate versions of the model were implemented: in one, subjects shared a single group parameter for the mean of the normal likelihood function for their /b/ category; in the other, the mean of the likelihood function could differ between subjects (the hyperprior on subjects' $\mu_b$ was presumed to be normally distributed). As noted in Simulation Study 3.1 (see Box 3.1), because the current simulations of BIASES assume equal category variance (as do Feldman et al, 2009; Clayards et al, 2008; Kleinschmidt & Jaeger, 2015), category variance and the distance between category means are confounded. Thus, all model-fitting analyses assume a distance between categories based on the VOT distributions from production data reported by Lisker and Abramson (1964): $\mu_p - \mu_b = 55$ ms.

Tables 3.5 and 3.6 show the results of the model-fitting without and with by-subject variability, respectively. All chains converged, as judged from visual inspection of the chains and from the Gelman-Rubin statistics for each model: the multivariate psrf was 1.01 for both models, and point estimates were all between 1.00 and 1.01 (with upper 95% confidence intervals of 1.00-1.03). Critically, the DIC (popt) was computed for each model in order to determine whether the additional parameters allowing subjects to differ in their phonetic category structure significantly improved the model fit. Penalized deviance scores were 514 for the group-level model and 398.3 for the hierarchical model, despite penalty terms of 6.602 and 44.99, respectively. Table 3.6 provides estimates and HDIs for the parameters of the hierarchical version of BIASES.

Parameter   Median    Mean      SD       95% HDI min   95% HDI max
α           0.089     0.090     0.016    0.059         0.120
σ²          266.93    267.51    11.61    246.05        290.97
μ_b         0.15      0.44      0.43     −0.37         1.26

Table 3.5. Summary of posterior Markov chains from the model that assumed group-level category structure (i.e., the same $\mu_b$ for all subjects).

Parameter         Median    Mean      SD       95% HDI min   95% HDI max
α                 0.060     0.061     0.012    0.040         0.085
σ²                235.98    236.44    10.77    216.10        257.90
μ_b0              0.41      0.44      1.30     −2.12         3.04
σ_b² (= 1/τ_b)    21.08     24.13     12.27    7.36          48.97

Table 3.6. Summary of posterior Markov chains from the model that assumed hierarchical phonetic category structure (i.e., variable $\mu_b$ across subjects).

A posterior distribution was also obtained for each subject's $\mu_b$, so we took the median of each of these 15 posterior distributions and added half of the assumed $\mu_p - \mu_b$ (i.e., 27.5 ms) to compute a single point estimate of the approximate category boundary for each subject. Subjects' boundaries, calculated in this way, ranged between 20.72 ms and 33.37 ms (mean = 27.95, SD = 4.02). In order to illustrate the improved model fit obtained by hierarchical modeling of phonetic category structure, we computed the distance of each VOT token from the estimated boundary for each subject (calculated from the median of the subject's posterior distribution) and re-plotted Figure 3.16 with an x-axis reflecting the adjustment of subjects' boundaries to coincide at a single point (Figure 3.18). In short, subjects show larger effects where the model predicts that they should (i.e., closer to the category boundary), and this fine-grained variability among subjects is neatly captured by BIASES' assumption that subjects are not all identical in their phonetic expectations for acoustic realizations of exemplars of /b/ and /p/.
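The hierarchical structure can be summarized generatively. The sketch below draws subject-specific /b/ category means from a normal hyperprior and converts each to a predicted boundary, as in section 3.3.3 (the hyperparameter values here are illustrative, not the fitted estimates from Table 3.6):

```python
import numpy as np

rng = np.random.default_rng(2)

mu_b0, sd_b = 0.0, 4.5   # group-level mean and SD of subjects' mu_b (illustrative)
mu_sep = 55.0            # assumed mu_p - mu_b (Lisker & Abramson, 1964)

def sample_subject_boundary():
    mu_b = rng.normal(mu_b0, sd_b)   # subject-specific /b/ category mean
    return mu_b + mu_sep / 2.0       # boundary = midpoint of mu_b and mu_p

print([round(sample_subject_boundary(), 1) for _ in range(5)])
# Each simulated subject gets a different boundary near 27.5 ms, mirroring
# the 20.7-33.4 ms range estimated from the data.
```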
Subjects’ boundaries, calculated in this way, ranged between 20.72 ms and 33.37 ms (mean = 27.95, SD = 4.02). In order to illustrate the improved model fit obtained by hierarchical modeling of phonetic category structure, we computed the distance of each VOT token from the estimated boundary for each subject (calculated from the median of the subject’s posterior distribution) and re-plotted Figure 3.16 with an x-axis reflecting the adjustment of subjects’ boundaries to coincide at a single point. This can be seen in Figure 3.18. In short, subjects show larger effects when the model predicts that they should show larger effects (e.g., closer to the category boundary), and this fine- grained variability among subjects is neatly captured by BIASES’ assumption that 126 subjects are not all identical in their phonetic expectations for acoustic realizations of exemplars of /b/ and /p/. 1.0 ● ● 3 ● ● 22 ● ● 31 ● ● 48 0.8 Bias Effect Size (to − the) ● ● ● ●● ● ● ● ● 0.6 ● ● ● ● ● ● 0.4 ● ● ●● ● ● ● ●● ● ● 0.2 ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●●● ● 0.0 ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ● −40 −30 −20 −10 0 10 20 30 40 Distance of VOT from Subject's Boundary (ms) Figure 3.18. For each subject (N=15), the difference in the proportion of pay-responses to tokens from the bay–pay VOT continuum after verb-biasing vs. noun-biasing sentence contexts. Note that, unlike Figure 3.16 and others, the x-axis is adjusted for subjects’ VOT boundaries. Error bars represent standard error. 3.3.4. Results: Model Comparison 2 – Inherent Biases in “Neutral” Priors The previous analyses of the results of Experiment 3.1 have focused on the data from the noun-biasing and verb-biasing condition, but ignored the third “noise” condition in which subjects heard sentences like He hated [noise] /?ay/. As discussed earlier, when experimenters include a “neutral” condition, how subjects respond to stimuli in that baseline condition must be modeled just like subjects’ responses to biased contexts. Next, we examined eight possible models of subjects’ conditional priors in order to understand the principles underlying subjects’ responses to 127 stimuli both in the biasing contexts and the noise condition employed in the current experiment. If, in the noise condition, subjects are equally biased towards bay and pay, then the noise context would lie closest to the noun-biased sentence contexts because those noun-biased contexts are less biased, overall, than the verb-biased contexts (see Table 3.4). Note that, as discussion above, it is not likely that responses to the noise condition will lay perfectly midway between the noun- and verb-biasing contexts. On the other hand, BIASES makes a different prediction about how subjects should respond in the noise condition. In particular, one principle of Bayesian models is that, when some information is not available, the optimal way to integrate that (lack of) information is to “believe” (in the Bayesian sense) each possible value of the cue to the extent that that cue was likely. This is called marginalization (see Chapter 2). In the present circumstances, this would mean that subjects’ responses to stimuli in the noise condition should be closer to the verb-biasing contexts rather than the noun-biasing contexts. Thus, the “neutral” assumption and the Bayesian (marginalization) assumption make opposite predictions about where subjects’ responses to stimuli in the noise condition should fall. 
Figure 3.19 displays subjects' responses in the noise condition, added to the same data presented in Figure 3.14. As can be seen, responses in the noise condition lie closer to the verb-biased context than to the noun-biased context, suggesting that subjects were performing marginalization. To confirm this, we conducted a model comparison evaluating the extent to which the marginalization model improved in fit over alternative models.

Figure 3.19. Mean proportion of pay-responses to tokens from the bay–pay VOT continuum following the noun-biasing (e.g., He hated the...) and verb-biasing (e.g., He hated to...) sentence contexts (see also Figure 3.14), as well as in the noise condition (e.g., He hated [noise]...). Error bars represent standard error.

Eight models were compared (see Table 3.7); they differed in the assumed prior for the context conditions (to.../the...) and the assumed prior for the noise condition. For four models, the priors for the context conditions were estimated from the standard bigram model (described in Chapter 2), considering the probability of bay vs. pay based only on the preceding word (to vs. the). The remaining four models employed a more complex trigram language model for their priors, considering the probability of bay vs. pay based not only on the previous word (to vs. the) but also on the main verb preceding it. The four models of each type each employed a different model of the prior when subjects heard the target after noise; these models varied in complexity.
             Median    Mean      SD      95% HDI min   95% HDI max
α            0.044     0.044     0.007   0.031         0.059
σ²           214.71    214.83    7.89    199.96        230.83
μ₀           0.35      0.35      1.24    −2.10         2.74
1/τ₀²        18.35     20.77     7.89    7.07          40.68

Table 3.8. Summary of posterior Markov chains from the best-fitting model in Model Comparison 2 (bottom row of Table 3.7; the fully trigram-driven context model). Note that all models assumed hierarchical phonetic category structure (i.e., variable μⱼ for each subject j).

3.3.5. Conclusion

In conclusion, the results of Chapter 3 suggest that listeners’ behavior exhibits many hallmarks of a Bayesian spoken word recognition system. Overall, the results lend support to the validity and utility of BIASES for explaining and predicting subjects’ speech recognition behavior in experimental tasks. BIASES is capable of accounting for a wide range of variability that is usually ignored by other computational models, including variability among subjects, variability due to speech cues other than VOT, and variability due to prior contexts in an experiment. These findings, along with the theoretical contribution of BIASES – as a model of speech perception in sentential context – illustrate the novelty of the present work. An even more powerful demonstration of the utility of BIASES, though, would be to leverage it to inform theoretical debates in psychology or neuroscience. One key goal of computational modeling is to advance scientific theory by testing and comparing competing hypotheses. This is the aim of Chapter 4.

Chapter 4
Top-Down Effects on Spoken Word Recognition in Aphasia: A Model-Based Assessment of Information Processing Impairments

4.1. Introduction

4.1.1. Brief Introduction

In order to understand spoken language, a listener must ultimately map a continuous acoustic waveform onto discrete lexical forms, which stand at the interface between sound and meaning. However, spoken word recognition is not only influenced by the so-called bottom-up speech signal; as the signal undergoes higher-level cognitive processing, other information sources exert top-down influence on speech perception (for review, see Samuel, 2011). For instance, perception is lexically biased: a phonetically ambiguous segment between /b/ and /p/ tends to be identified as /b/ when followed by –ash (because bash is an English word, but *pash is not), but as /p/ when followed by –ast (where past is a word, but *bast is not) (Ganong, 1980). Similarly, perception is contextually biased: a phonetically ambiguous stimulus between two words (e.g., bay and pay) tends to be recognized as bay after a noun-biasing sentence context (e.g., Valerie hated the...), but as pay after a verb-biasing sentence context (e.g., Brett hated to...) (Fox & Blumstein, in press). Although the mechanisms underlying the integration of bottom-up and top-down cues remain the subject of considerable debate (see, e.g., McClelland, Mirman & Holt, 2006; Norris, McQueen & Cutler, 2000), there is no question that both types of information influence speech recognition. Top-down effects on speech perception are of particular interest because they reflect dynamics at the confluence of perceptual and cognitive processing, so their emergence and the characteristics of their distribution can reveal key insights about many aspects of human language function (see Chapter 3).
Given the theoretical significance of this class of phenomena, it is noteworthy that far less is known about the pattern of such top-down effects in patients with aphasia. Most such individuals experience at least some receptive language impairments (Boller, Kim & Mack, 1977; Goodglass, Gleason & Hyde, 1970), with deficits arising at many different levels of processing (Goodglass, 1993; Lesser, 1978). What grants the status of top-down effects in patients with aphasia special importance, however, is the substantial evidence that lexical processing – that is, processing at the level where sound contacts meaning – is particularly vulnerable in aphasic syndromes (for review, see Blumstein, 2007). For example, a classic finding about lexical processing by neurologically healthy adults is that, upon hearing a prime word (e.g., cat), the processing of a subsequent target that is a semantic associate of the prime (e.g., dog) is automatically facilitated relative to processing when the prime was not related (e.g., table), as measured by the time required to accurately decide that dog is a word (Meyer & Schvaneveldt, 1971). Moreover, the extent to which listeners access cat (and, in turn, the extent to which processing of dog is facilitated) is modulated by the acoustic (or phonological) distance from cat of a “mispronounced” prime, as indicated by the monotonic ordering of lexical decision latencies for dog after four different prime conditions (from fastest to slowest): cat < *gat < *wat < table (Milberg, Blumstein & Dworetsky, 1988a). Although the implicit processing of semantic associates of perceived primes is typically spared in aphasia (i.e., cat primes dog; Milberg & Blumstein, 1981; Milberg, Blumstein & Shrier, 1982), patients fail to exhibit the characteristic graded sensitivity observed in healthy adults (Milberg, Blumstein & Dworetsky, 1988b), a result that has been interpreted as evidence for lexical processing deficits in such patients. Nonetheless, it remains unclear what mechanisms are responsible for the observed dysfunctions (for review, see Mirman, Yee, Blumstein & Magnuson, 2011). One theory argues that lexical processing deficits arise directly from disruptions to processing dynamics at the level of the lexical representations themselves (Blumstein & Milberg, 2000; Janse, 2006; McNellis & Blumstein, 2001). However, in order to definitively conclude that lexical information is, indeed, specifically implicated, it is important to rule out the possibility that what appear to be lexical processing impairments are actually just downstream consequences of impairments in the bottom-up processing of the speech signal. Unfortunately, since auditory word processing must inevitably require both bottom-up speech processing and accessing the lexical representation, it is easy to see why it has been difficult to rule out this alternative explanation. However, top-down lexical and contextual effects on speech perception may offer a unique window through which to view this question. For instance, the lexical effect (Ganong, 1980) taps information stored within lexical representations because it reflects a comparison between two phonologically similar interpretations of a stimulus, only one of which corresponds to a lexical representation. The existence of one representation (bash) in the lexicon and the corresponding absence of another (*pash) conspire to bias subjects’ identification of speech stimuli toward words.
As such, to the extent that patients or groups of patients differ in the size of their lexical effect from controls, these differences might be taken to suggest disruptions arising at the lexical level itself. On the other hand, listeners’ identification of spoken words and sounds should, of course, also be affected by bottom-up phonetic and phonological processing deficits. Therefore, the pattern of lexical or contextual effects on speech perception in a patient with a bottom-up processing deficit might also be expected to diverge from the performance of healthy control subjects. The critical question, then, is “When it comes to top-down speech processing, are there unique predictions about the expected consequences of ‘virtual lesions’ at different levels of the spoken word recognition system?” It is not necessarily intuitive how – even in a healthy speech processing system – phonetic, phonological, lexical and sentential processing levels interact during online speech perception and ultimately drive subjects’ responses to, for instance, a stimulus that is ambiguous between bash and *pash. This challenge is multiplied when attempting to deduce how disruptions at different processing levels or to specific cognitive mechanisms might affect the behavior of patients with brain damage and a complex constellation of symptoms at any (or potentially many) of those processing levels. Thus, it is difficult to generate clear predictions about expected patterns of top-down effects in patients with aphasia, and it is also difficult to draw any strong conclusions about the relationship between such data and the nature of those patients’ fundamental information processing deficits, without first identifying a theoretical lens through which to view the data. To that end, the present work enlists the BIASES model (Bayesian Integration of Acoustic and Sentential Evidence in Speech; Chapters 2-3), a probabilistic computational model of spoken word recognition that has been shown to successfully capture key aspects of top-down effects on speech perception in healthy adults. As we will show, BIASES makes clear predictions about how fine-grained differences in the size and distribution of top-down influences from lexical and contextual cues should be expected to emerge as a function of which information-processing levels are disrupted. Thus, by examining the specific patterns of top-down effects from lexical and sentential context in patients with Broca’s aphasia (BA) and Wernicke’s or Conduction aphasia (W/CA), and comparing those results to the distribution of top-down effects in healthy controls, it is possible to distinguish between the independent contributions of a range of processing impairments (including at acoustic-phonetic, phonological, lexical, and contextual processing levels), even when multiple such impairments may coexist in a single patient or group of patients.

4.1.2. Overview of Chapter 4

The central aim of this chapter is to investigate the nature of top-down processing in patients with aphasia and to evaluate the extent to which the pattern of deficits observed in two groups – patients with Broca’s aphasia (BA) and patients with Wernicke’s or Conduction aphasia (W/CA) – might inform the broader theoretical question regarding the locus of patients’ apparent lexical processing deficits.
To that end, we further elaborate the Bayesian model of speech perception presented in earlier chapters, the BIASES model (Bayesian Integration of Acoustic and Sentential Evidence in Speech; Chapters 2-3). The fundamental principles embodied by this iteration of BIASES, which we call BIASES-A, are consistent with its parent model, as discussed earlier. For example, preceding words can still bias a listener’s identification of a stimulus via a context-dependent conditional prior, and the model’s likelihood function still computes the relative fit of candidate representations given some acoustic values. However, both the prior and likelihood terms of BIASES-A take different forms than they did when the model was introduced. These adaptations are critical for the model to address the question of theoretical interest here. In fact, in some ways, BIASES-A relaxes some of the assumptions present in the minimalist version of BIASES presented in Chapters 2-3. For instance, the drastically oversimplified likelihood function in BIASES, $p(A \mid w_i)$, characterized each word as a distribution over VOTs, implicitly ignoring subsequent cues (such as the rest of the word). This assumption was sufficient for modeling the perception of minimally paired words (e.g., bay vs. pay) that differ only as a function of VOT, but it must be updated in order to account for lexical biases arising as a function of subsequent phonological information (e.g., whether the rime of the word is –ast or –ash). Of course, while adding complexity to the model in this way improves its ability to accurately characterize the human speech processing system, relaxing certain assumptions requires committing to certain additional assumptions. However, most importantly, this approach illustrates one of the key strengths of BIASES: its flexibility. The architecture of BIASES and its fundamental properties do not change when the prior and likelihood functions are updated to more accurately capture additional findings about human cognition and perception. Thus, while the main goals of this chapter are to assess the prevalence of top-down effects on speech perception in patients with aphasia and to address the theoretical question about the locus of lexical processing deficits in aphasia, this work also serves as a demonstration of the broad range of questions that BIASES can be leveraged to study. Chapter 4 is organized into four parts. First, we briefly review the evidence for lexical processing deficits in aphasia, with special focus on two clinical groups – patients with Broca’s aphasia (BAs) and patients with Wernicke’s or Conduction aphasia (W/CAs). Patients belonging to each group exhibit a unique pattern of lexical processing deficits. These results motivated the proposal of a theory referred to here as the Lexical Activation Hypothesis (Blumstein & Milberg, 2000; Janse, 2006; McNellis & Blumstein, 2001), which posits that the observed impairments emerge due to disruptions at the level of patients’ lexical representations. However, as mentioned earlier, it is unclear whether the impairments might be fully accounted for by bottom-up processing deficits, which are known to afflict most patients with aphasia.
Although relatively little is known about top-down processing in patients with aphasia, and although it is not necessarily obvious how top-down processing might be implicated in or affected by lexical processing deficits, we propose that a model-based analysis of top-down effects on speech perception may offer unique insights about the nature of lexical processing deficits, more broadly. Second, we review the basic principles embodied by the BIASES model of speech perception and show how this probabilistic model can be theoretically linked to the Lexical Activation Hypothesis, which is predicated on a connectionist/activation-based approach to cognitive modeling. We update several assumptions and components of BIASES, calling this iteration BIASES-A, for Aphasia, highlighting not only the model’s viability for capturing the fine-grained statistics of language function (see Chapter 3), but also its ability to reveal novel insights about the sometimes subtle details of language dysfunction. Alternatively, the A could stand for Activation, highlighting another critical contribution of this chapter: linking BIASES to more traditional (that is, connectionist) theories, models and approaches to thinking about spoken word recognition. We review the mathematical form of BIASES-A, briefly addressing the most important implications of the changes from BIASES. Finally, we outline the present work’s model-based approach to the characterization of top-down processing deficits in patients with aphasia. Third, we present Simulation Study 4.1 and Experiment 4.1. Simulation Study 4.1 examines how information processing deficits at different levels should be expected to emerge in behavioral responses during an experiment testing for a lexical effect in patients with aphasia. Experiment 4.1, conducted over two decades ago (Blumstein, Burton, Baum, Waldstein & Katz, 1994), examined the lexical effect in patients with aphasia. The present study’s model-based reanalysis of its original data offers new insights into the specific deficits responsible for the patterns reported in the original study. Fourth, Simulation Study 4.2 and Experiment 4.2 follow the same approach as was taken in Simulation Study 4.1 and Experiment 4.1, but they examine the sentential context effects examined in previous chapters. We argue that Chapter 4’s computational and behavioral results, together, lend support to the key ideas embodied by the Lexical Activation Hypothesis.

4.1.3. Lexical Processing in Aphasia

4.1.3.1. Lexical Processing Deficits

Lexical access and spoken word comprehension are often profoundly disrupted in aphasia. Recall the early illustration of this by Milberg, Blumstein and Dworetsky (1988a, 1988b), who employed a lexical decision paradigm wherein subjects heard a prime-target pair and were instructed to decide whether the target was a word (e.g., dog) or a non-word (e.g., *jand). On those trials for which the target was a word (dog), the prime that immediately preceded it could come from one of four categories: it could be an unrelated word (table), a semantically related word (cat), a “close” mispronunciation of the semantically related prime (*gat), or a “distant” mispronunciation of the semantic associate (*wat).
Healthy controls tend to correctly identify the target stimulus, dog, as a word fastest when it is immediately preceded by the correctly pronounced, semantically related prime (cat), followed in order of speed by *gat, *wat, and table, suggesting that lexical access to a word (and therefore to its semantic associates) is graded based on the phonological similarity of the input to the word (Milberg et al, 1988a; see also, e.g., Connine, Blasko & Titone, 1993; Connine, Titone, Dellman & Blasko, 1997; McMurray, Tanenhaus & Aslin, 2002; Utman, 1997; Utman et al, 2001). In contrast, BAs exhibit semantic priming effects when dog is preceded by the correctly pronounced prime, cat, but they fail to show priming in either of the mispronunciation conditions. On the other hand, patients with W/CA are primed as much by *gat and *wat as they are by cat (Milberg et al, 1988b). These results have been interpreted as evidence that lexical access is disrupted in both patient groups, but that the nature of this disruption is not the same for all patients (see also Janse, 2006; Utman et al, 2001; Yee, Blumstein & Sedivy, 2008; but see, e.g., Del Toro, 2000; Tyler, 1992). In BAs, it is more difficult for bottom-up information to activate a lexical representation: only a very good perceptual match for cat is able to access the lexical-semantic network that must be engaged in order to facilitate subsequent recognition of dog. In W/CAs, though, even a poor match between the bottom-up signal and the stable phonological form of a word is able to access that word’s meaning. Clearly, deviation from typical lexical processing dynamics in either direction is likely to impair word comprehension in the real world, where speech is noisy and error-laden (Dell, 1988; Vitevitch, 1997, 2002) and words very often belong to dense phonological neighborhoods (e.g., cat is similar to hat, bat, pat, rat; Luce & Pisoni, 1998). Thus, the locus of lexical processing impairments is of great interest.

4.1.3.2. The Lexical Activation Hypothesis

What is the source of these lexical processing deficits? One theory holds that each group’s impairment can be traced to the resting activation level of lexical representations (Blumstein & Milberg, 2000; Janse, 2006; McNellis & Blumstein, 2001). According to this perspective, referred to here as the Lexical Activation Hypothesis, the extent to which *gat primes dog depends not only on the degree of phonological match between *gat and cat, but also on the baseline activation of cat. Consider a model of semantic priming in which activation spreads (cf. Collins & Loftus, 1975) from cat to dog only after the activation level of cat exceeds some propagation threshold (cf. Rumelhart, 1989), and, thereafter, the amount of priming is related to the amount of supra-threshold activation (up to some maximum activation level). McNellis and Blumstein (2001) showed that such a model captures the graded priming results in healthy controls (Milberg et al, 1988a), and that alterations to the resting activation levels could explain the patterns in BA and W/CA (Milberg et al, 1988b).
Lower resting activation levels rendered it impossible for poorly matching input to exceed cat’s propagation threshold, thereby preventing semantic priming by both close (*gat) and distant (*wat) mispronunciations (as in BA), while raising resting activation levels caused cat’s activation not only to exceed its propagation threshold, but also to quickly reach its maximum level, yielding ceiling-level facilitation of recognition of dog following cat, *gat, and *wat (as in W/CA).

4.1.3.3. Alternative Accounts of Lexical Processing Deficits

Critically, the Lexical Activation Hypothesis posits that the locus of the lexical processing deficit is inherent to the lexical representation: words’ resting activation levels are responsible for the observed impairment. However, some alternative explanations implicate the bottom-up processing of the speech signal and the time course of lexical activation. For example, the same pattern as was observed in W/CAs would be expected if those patients had perfectly normal lexical-level representations, but they sometimes misperceived *gat and *wat as cat. Since auditory comprehension is very frequently impaired in patients with W/CA (Blumstein, Baker & Goodglass, 1977a; Eggert, 1977; Luria, 1976; Robson, Keidel, Lambon Ralph & Sage, 2012), this possibility raises an important issue. Indeed, even though phonetic and phonological processing deficits are not as universally associated with BAs, virtually all patients, including BAs, appear to have at least some difficulties (Baker, Blumstein & Goodglass, 1981; Basso et al, 1977; Blumstein et al, 1977a, 1977b, 1984; Carpenter & Rutherford, 1973; Jauhiainen & Nuutila, 1977; Leeper, Shewan & Booth, 1986; Metz-Lutz, 1992; Miceli et al, 1978, 1980; Sasanuma et al, 1976; Utman et al, 2001; Yeni-Komshian & Lafontaine, 1983). This is consistent with neuroimaging research pointing to the involvement of both anterior and posterior brain regions in phoneme perception (Belin, Zatorre, Hoge, Evans & Pike, 1999; Blumstein, Myers & Rissman, 2005; Burton, 2001; Burton, Small & Blumstein, 2000; Poeppel, 1996). Notably, Milberg and colleagues (1988b) did try to rule out this explanation. In a post-experiment lexical decision task, patients in both groups were shown to correctly reject *gat and *wat as non-words while also correctly accepting cat as a word. In line with this finding, many other studies also suggest that, generally speaking, individual subjects’ lexical processing deficits cannot be fully predicted by their bottom-up pre-lexical processing deficits alone (Baker et al, 1981; Basso et al, 1977; Blumstein et al, 1977a, 1977b, 1984; Carpenter & Rutherford, 1973; Caplan et al, 1995; Caplan & Utman, 1994; Csepe et al, 2001; Gow & Caplan, 1996; Jauhiainen & Nuutila, 1977; Leeper, Shewan & Booth, 1986; Metz-Lutz, 1992; Miceli et al, 1978, 1980; Sasanuma et al, 1976; Yeni-Komshian & Lafontaine, 1983). Nevertheless, Robson and colleagues (2012) argue that the primary deficit in W/CA is at the level of the phonological code (cf. Luria, 1976), suggesting that these studies’ inability to find a significant correlation between W/CAs’ phonological processing deficits and their other comprehension difficulties is due to unduly heterogeneous clinical populations, poor task selection, and other factors.
Additionally, it has also been suggested that Milberg and colleagues’ (1988b) inability to detect priming in BAs following the mispronunciation conditions might have resulted not from an inherent disruption to lexical representations, but rather from a disruption to the dynamics (i.e., time course) of bottom-up lexical activation (Prather, Zurif, Love & Brownell, 1997; Swinney, Zurif & Prather, 1989; Swinney, Prather & Love, 2000). However, recent results using eye-tracking methodologies (which achieve more fine-grained temporal resolution than the priming paradigms) have disputed the idea that the time course of lexical activation is delayed in BA (for review, see Mirman et al, 2011; Yee et al, 2008).

4.1.3.4. Top-Down Effects and Lexical Processing

It is apparent that at least part of the theoretical bottleneck that has made the debate between bottom-up and lexical-level accounts of patients’ lexical processing deficits difficult to resolve arises from the inherent difficulty in teasing apart bottom-up processing (which accesses lexical information) and lexical-level information during typical word recognition tasks. For any task that evaluates behavioral responses to auditory words, potential lexical-level disruptions and potential downstream effects of bottom-up processing disruptions are necessarily confounded. However, top-down effects are somewhat unique. Top-down effects measure the extent to which higher-level information sources – including lexical-level information (like lexical status or frequency) and contextual information that influences lexical-level predictions (cf. Chapter 2) – bias the perception of spoken words or sounds. What does it mean to observe a top-down effect for a given acoustic stimulus? If the same word-initial segment that is phonetically ambiguous between /b/ and /p/ is labeled /b/ more often when followed by –ash than when followed by –ast, then this means that, for the same bottom-up stimulus, lexical-level information is influencing subjects’ ultimate speech recognition (Ganong, 1980). The significance of this observation is that the sizes of lexical or contextual effects are scaled with the strength of bias provided by top-down cues. However, bottom-up processing will also influence the ultimate size of the top-down effects: if the bottom-up processing reveals that a stimulus is almost certainly an exemplar of some particular word or phonetic category, then the top-down cue will have little impact on the response rate and there will not be a large top-down effect observed (cf. Chapters 2-3). Put another way, the ultimate size of a lexical or contextual effect on the perception of a stimulus will be influenced by both the bottom-up and the top-down processing of the stimulus (which includes lexical-level information and contextual information). As such, disrupting either bottom-up or lexical-level processing is likely to lead to behavioral differences in the size of top-down effects on speech perception. The challenge is to separate out the influences of each. This theoretical and computational challenge is addressed from an information-processing standpoint by the BIASES model (Bayesian Integration of Acoustic and Sentential Evidence in Speech; Chapters 2-3), and it is to this model that we now turn.

4.2. Applying BIASES to Spoken Word Recognition in Aphasia
4.2.1. Brief Overview of BIASES

The BIASES model of speech perception describes the mathematically optimal way of combining top-down information sources (such as lexical frequency or the contextual predictability of a word) and bottom-up information sources (such as acoustic cues in the stimulus). In Chapter 3, it was shown that BIASES provides a principled account of a number of fine-grained differences in the sizes of top-down effects on speech perception, explaining how properties of the model’s prior (which corresponds to top-down information sources) and the model’s likelihood (which corresponds to bottom-up information sources) should influence the ultimate size of the top-down effect for a given pair of conditions (e.g., two sentential contexts) and for a given acoustic signal (e.g., for a given voice-onset time, or VOT). Critically, though, the model’s predictions about how large a top-down effect should be depend on the information contained within the model’s prior and likelihood functions. Thus, if the underlying information contained within either the prior or the likelihood term of BIASES is disrupted, or if the information processing dynamics that govern the computations within the prior or the likelihood term of BIASES are disrupted, the expected size of top-down effects for a given pair of contexts and a given acoustic stimulus will also change. Thus, in order to gain some insight into the nature of the information processing deficits in aphasia, we adapted the parent model, BIASES, to allow for the examination of how different “virtual lesions” to a child model, BIASES-A (for Aphasia), should influence the predicted sizes of top-down effects from lexical status, from lexical frequency, and from sentence contexts.

4.2.2. From Activations to Probabilities: Lexical Activation Hypothesis

BIASES is a probabilistic computational model which, like Shortlist B (Norris & McQueen, 2008), but unlike connectionist models of spoken word recognition (e.g., TRACE: McClelland & Elman, 1986; Shortlist: Norris, 1994; Merge: Norris, McQueen & Cutler, 2000), does not rely on any notion of activation. Instead, the amount of support for a given candidate in some set of mutually exclusive alternatives (e.g., a word in the lexicon) is related to its probability, which is computed relative to the other candidates. While this approach has many advantages (see, e.g., Chater, Tenenbaum & Yuille, 2006; Norris, 2006; Norris & McQueen, 2008), it is important to consider the relationship between probabilistic models and activation-based models (for recent reviews, see McClelland, 2009, 2013; McClelland, Mirman, Bolger & Khaitan, 2014). This is particularly crucial for the present modeling effort because, while BIASES does not rely on any notion of activation, the theoretical claim instantiated within the Lexical Activation Hypothesis about the underlying basis of lexical processing deficits in aphasia is couched within the language of words’ baseline levels of activation: BAs have lower baseline levels of activation than healthy adults, while W/CAs have higher baseline levels of activation than healthy adults. This raises the following critical question: what sort of lesion to the lexical information in BIASES would mimic the effects of changes in baseline activation levels described by the Lexical Activation Hypothesis? The answer to this question requires drawing three theoretical links between activation-based models and probabilistic models.
First, note that real-valued activation levels in a finite set of units in a connectionist model can be scaled (i.e., each divided by the sum of the activations of the entire set) in order to create a probability distribution, and, critically, this computation preserves the relevant ratios of all pairs of activation levels (Hinton & Sejnowski, 1983; Khaitan & McClelland, 2010; Luce, 1959; McClelland, 1991; McClelland, Mirman, Bolger & Khaitan, 2014; Movellan & McClelland, 2001; for a tutorial and review, see McClelland, 2013). Second, lexical frequency (which, by definition, characterizes a probability distribution over the lexicon) has a robust effect on spoken word recognition and speech perception (e.g., Connine, Mullennix, Shernoff & Yellen, 1990; Dahan, Magnuson & Tanenhaus, 2001; Howes, 1954; Luce, 1986; Marslen-Wilson, 1987; Pollack, Rubenstein & Decker, 1960; Savin, 1963; Taft & Hambly, 1986). Applying the converse relationship described in the first theoretical link (connecting activation levels to probabilities) suggests that words’ baseline lexical activations should be scaled by their lexical frequencies (see, e.g., Dahan, Magnuson & Tanenhaus, 2001). Third, the last step is to determine how to mimic, within the probabilistic framework, the raising and lowering of words’ baseline lexical activation levels. The approach we take is to transform the probability of each word $w_i$ in the lexicon of $N_w$ words by applying the function $\mathbf{A}$ (for Activation), which is defined in Equation 4.1:

Equation 4.1
$$\mathbf{A}\big(p(w_i), \phi\big) = \frac{p(w_i)^{\phi}}{\sum_{j=1}^{N_w} p(w_j)^{\phi}}$$

The function A raises each $w_i$’s probability to the same exponent ($\phi$) and then rescales the distribution so that it sums to one (as is required for any probability distribution). Crucially, A has the following four properties:

Property 1. For $\phi = 1$: $\mathbf{A}(p(w_i), \phi) = p(w_i)$ for all $w_i$.
Property 2. For $\phi > 1$: $\mathbf{A}(p(w_i), \phi) < p(w_i)$ for (initially) less probable $w_i$, and $\mathbf{A}(p(w_i), \phi) > p(w_i)$ for (initially) more probable $w_i$.
Property 3. For $\phi < 1$: $\mathbf{A}(p(w_i), \phi) > p(w_i)$ for (initially) less probable $w_i$, and $\mathbf{A}(p(w_i), \phi) < p(w_i)$ for (initially) more probable $w_i$.
Property 4. For $\phi = 0$: $\mathbf{A}(p(w_i), \phi) = \frac{1}{N_w}$ for all $w_i$.

When $\phi > 1$, the distribution becomes more extreme, or peaked, with the most probable words becoming even more probable and the least probable words becoming even less likely, so a virtual lesion that increases $\phi$ will cause the “rich to get richer.” Conversely, when $\phi < 1$, the distribution becomes more uniform, essentially “watering down” frequency effects in the initial, un-lesioned distribution. Smaller values of $\phi$ reduce frequency effects further and further until $\phi = 0$, at which point frequency effects are totally eliminated by functionally transforming the prior distribution over words into the uniform distribution: $\mathbf{A}(p(W), \phi = 0) = \mathrm{Unif}(1, N_w)$.
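Equation 4.1 and its four properties can be verified directly. The following is a minimal sketch using a hypothetical three-word prior (values in the comments are rounded):

```python
import numpy as np

def A(p, phi):
    """Equation 4.1: raise a probability distribution to the power phi, renormalize."""
    p = np.asarray(p, dtype=float)
    q = p ** phi
    return q / q.sum()

p = np.array([0.5, 0.3, 0.2])   # a toy prior over three words
print(A(p, 1.0))   # Property 1: unchanged -> [0.5,   0.3,   0.2  ]
print(A(p, 2.0))   # Property 2: peakier   -> [0.658, 0.237, 0.105]
print(A(p, 0.5))   # Property 3: flatter   -> [0.415, 0.322, 0.263]
print(A(p, 0.0))   # Property 4: uniform   -> [0.333, 0.333, 0.333]
```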
4.2.2.1. Preliminary Simulations: Lexical Activation Hypothesis

In order to establish the theoretical link between BIASES-A and the Lexical Activation Hypothesis, we must determine how $\phi$ relates to changes in lexical activation levels. That is, will increases in the baseline lexical activation levels (as hypothesized to underlie the lexical processing deficits in W/CAs) more closely match Property 2 (the “rich get richer” case) or Property 3 (the “watering down” case)?

In order to match the Lexical Activation Hypothesis to the computational approach embodied by the function A, we simulated a simple example with a “toy lexicon” of $N_w = 20$ words. For each word $w_i$, a frequency $f_i$ between 10 and 100 was randomly determined ($f_i \sim \mathrm{Unif}(10, 100)$) to serve as that $w_i$’s effective baseline activation value, yielding a lexicon with activation values represented by the vector $F$ (containing the frequencies for all $N_w$ words). Then, in two separate simulations with the same lexicon $F$, to mimic the predicted activation levels for patients with BA and W/CA (McNellis & Blumstein, 2001), we either subtracted or added the same value, $\eta$, to each word’s activation level, yielding a new baseline activation vector $F' = F \pm \eta$.⁷ Finally, for both the activation subtraction simulation ($F'_{BA} = F - \eta$) and the activation addition simulation ($F'_{W/CA} = F + \eta$), the “pre-lesion” activation vector, $F$, and the updated “post-lesion” baseline activation vector, $F'$, were normalized to obtain pre-lesion and post-lesion probability distributions (Equations 4.2–4.4):

Equation 4.2
$$p(W) = \frac{F}{\sum_{i=1}^{N_w} f_i}$$

Equation 4.3
$$p_{BA}(W) = \frac{F'_{BA}}{\sum_{i=1}^{N_w} f'_i} = \frac{F - \eta}{\sum_{i=1}^{N_w} (f_i - \eta)}$$

Equation 4.4
$$p_{W/CA}(W) = \frac{F'_{W/CA}}{\sum_{i=1}^{N_w} f'_i} = \frac{F + \eta}{\sum_{i=1}^{N_w} (f_i + \eta)}$$

⁷ To prevent any of the words’ activation levels from becoming negative, $\eta$ was set to half of the least frequent word’s activation value ($\eta = \min(F)/2$).

Figure 4.1 shows the effects of subtracting and adding to the baseline activation levels for the corresponding probability distributions. Decreasing words’ baseline activation levels (the mechanism implicated in lexical processing deficits in BA, according to the Lexical Activation Hypothesis) enhances frequency effects. The most frequent words (i.e., words with relatively higher pre-lesion activation levels) became even more probable, and the rarest words became even less probable. Conversely, increasing words’ baseline activation levels (the mechanism implicated in lexical processing deficits in W/CA, according to the Lexical Activation Hypothesis) diminishes frequency effects. The most frequent words and the least frequent words have less disparate post-lesion probabilities.

[Figure 4.1: two panels (“Decrease Activation” and “Increase Activation”) plotting the probability of each word (sorted by frequency) pre-lesion vs. post-lesion.]

Figure 4.1. Results of simulations of Lexical Activation Hypothesis: the probability of each word before and after the virtual lesion. Virtual lesions involved either increasing or decreasing the activation level of each word by a constant amount (cf. McNellis & Blumstein, 2001).
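For concreteness, the toy-lexicon simulation can be reproduced in a few lines. The random seed and the particular sampler below are incidental choices for illustration, not details taken from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
Nw = 20
F = rng.uniform(10, 100, size=Nw)     # toy lexicon: f_i ~ Unif(10, 100)
eta = F.min() / 2                     # footnote 7: half the least frequent word's value

p_pre = F / F.sum()                   # Equation 4.2: pre-lesion prior
p_BA = (F - eta) / (F - eta).sum()    # Equation 4.3: subtract activation (BA)
p_WCA = (F + eta) / (F + eta).sum()   # Equation 4.4: add activation (W/CA)

# As in Figure 4.1: subtracting eta exaggerates frequency differences (cf. phi > 1),
# while adding eta compresses them toward uniformity (cf. phi < 1).
```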
Connecting the Lexical Activation Hypothesis and this simulation’s results back to the probabilistic model (BIASES-A) and the function A, it is clear that decreasing the baseline activation levels of words corresponds to increasing $\phi$ (see Property 2 of A), while increasing the baseline activation levels of words corresponds to decreasing $\phi$ (see Property 3 of A). At an intuitive level, if baseline activation levels are increased by a constant amount (as in W/CAs; Blumstein & Milberg, 2000) while leaving the threshold for lexical access or word recognition constant (McNellis & Blumstein, 2001), and thereby requiring less bottom-up activation to achieve lexical access (i.e., *wat primes dog as much as cat in W/CA; Milberg et al, 1988b), then the activation of the lexical representation will not be a very reliable cue to the actual presence of the word in the speech signal. Consequently, lexical-level cues should be less reliable for W/CAs than they are for healthy adults; from an information processing perspective, less reliable cues should be down-weighted, which is precisely the effect of decreasing $\phi$. The opposite is true of decreasing baseline activation levels and increasing $\phi$: the eventual activation of a lexical representation in BA is a more reliable cue to the actual presence of the word in the speech signal, leading to greater reliance on lexical-level information. Note that, while our approach in the activation-based simulations above was designed to match the approach taken by McNellis and Blumstein (2001) in their proof-of-concept computational implementation of the Lexical Activation Hypothesis (adding/subtracting a constant value $\eta$ to each word’s activation level; cf. Morton, 1969; see also Norris, 2006), this approach is not mathematically equivalent to A’s exponential re-weighting of the entire distribution by $\phi$. Our central aim was to match the overall directionality of the effects of parametric manipulations in the activation-based and probabilistic frameworks on extreme values (e.g., the most and least frequent words). It is also worth noting that McNellis and Blumstein (2001) did not explicitly account for frequency effects. What is most important, though, is that while the details of the present probabilistic approach are not identical to the approach taken by McNellis and Blumstein (2001), the overall theoretical link is clear: the Lexical Activation Hypothesis should predict that, if controls have $\phi = 1$, then behavioral responses in BAs should be more influenced by lexical-level information ($\phi > 1$), while W/CAs should exhibit less of an influence from lexical-level information ($\phi < 1$).

4.2.2.2. Implications for Top-Down Effects on Speech Perception

Having derived the theoretical implications of the Lexical Activation Hypothesis for the probabilistic modeling approach, it is now possible to deduce principled predictions about the influence of lexical-processing deficits on top-down processing of speech. Because the Lexical Activation Hypothesis, as interpreted here, predicts that lexical-level information will be weighted more heavily by BAs than by healthy controls, but less heavily by W/CAs than by healthy controls, lexical status should have a greater influence on BAs’ responses to stimuli ambiguous between a word and a non-word, but it should have a weaker influence on W/CAs’ responses. Implicit in this conclusion is the assumption of a relationship between “lexical status” and “lexical frequency.” This assumption represents the basic principle that non-words are, in the limit, not so different from very low frequency words. Thus, the effects of lexical status and lexical frequency are given a unified, if simple, explanation within the prior of the BIASES model: listeners expect to encounter more probable stimuli (see Chapters 2-3). Since non-words are less probable than words (see also Norris & McQueen, 2008), an effect of lexical status might be thought of as a special case of a lexical frequency effect. It is worth noting that, since Ganong’s (1980) original demonstration of the lexical effect, a number of studies have reported hints of frequency effects (e.g., Fox, 1984; Fox & Blumstein, in press; Newman et al, 1997), and Connine, Blasko and Titone (1993) showed that the frequency of words within an experiment could drive top-down effects on speech perception that mirrored Ganong’s (1980) lexical effect. Note, however, that, since lexical frequency is estimated by counting the number of times a word appears in some corpus, all non-words will, by definition, have a lexical frequency of 0.
Clearly, subjects must be capable of recognizing a string of phonemes that they have never heard before. While most speech an adult will hear on any given day will be composed of words in her lexicon, there must be some mechanism to “back off” to when a listener encounters foreign words, new words, or proper nouns such as names they have never heard before. Additionally, some such computational machinery is obviously critical for learning in infancy and childhood (Feldman, Griffiths, Goldwater & Morgan, 2013). Even more relevant to the current situation, in the context of an experimental setting like Ganong’s (1980), in which subjects hear dozens or sometimes hundreds of trials, subjects often identify the stimulus as a non-word. Thus, a subject’s prior expectation should certainly not be completely determined by the lexical frequency of a stimulus as estimated from a corpus. In order to account for top-down effects of both lexical status and lexical frequency, what is needed is a prior that is influenced by frequency, but which also allows subjects to “expect the unexpected,” as the case may be for non-words (with a corpus-estimated frequency of 0). Thus, in order to account for the effect of lexical status on spoken word recognition within the probabilistic framework, a simple approach is to allow non-words to have some small prior probability (Norris & McQueen, 2008; see also Chapters 2-3). In the current model, the prior probability for a non-word is estimated by fitting a smoothing parameter (Lidstone, 1920) to healthy controls’ lexical effect data. Based on the success of parallel assumptions in BIASES’ modeling of the sizes of sentential context effects (see Chapter 3), we assume that control subjects are optimally making use of lexical information ($\phi = 1$), but that their lexical prior includes lexical frequency information that is smoothed by some positive and nonzero “pseudo-count” ($\alpha$), which serves as an estimate of the prior expectancy for all non-words. It is further assumed that patients’ underlying model of lexical information is the same as the controls’ model (i.e., the same frequency counts for all words and the same smoothing parameter, $\alpha$), but that patients may weight this information differently than the controls (via $\phi$). It is important to note that drawing a relationship between lexical status and lexical frequency does not demand that every possible non-word be explicitly represented in the mental lexicon alongside every word, albeit with a different (lower) effective frequency estimate; indeed, this would not be a very plausible lexicon. Rather, we adopt the theoretical perspective that, during speech perception, candidate word-forms compete for recognition in a lexical buffer (Blumstein, 1994, 1998). A candidate’s prior probability is determined by the sum of $\alpha$ (the smoothing parameter discussed above) and the candidate’s lexical frequency (0 for non-words, but nonzero and positive for words). This framework allows all candidates (whether words or non-words) to have some baseline probability of being perceived (related to $\alpha$), while a word’s prior is also influenced by its frequency.
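A minimal sketch of this smoothed lexical prior follows; the counts and the value of $\alpha$ are hypothetical (in practice, $\alpha$ was fit to controls’ lexical effect data):

```python
# Lidstone-smoothed lexical prior: a candidate's prior mass is alpha plus its
# corpus frequency (0 for non-words). All numbers here are hypothetical.
alpha = 5.0                            # smoothing "pseudo-count"
freq = {"bash": 120.0, "pash": 0.0}    # corpus counts; *pash is a non-word

total = sum(f + alpha for f in freq.values())
prior = {w: (f + alpha) / total for w, f in freq.items()}
print(prior)  # {'bash': ~0.96, 'pash': ~0.04}: the word dominates, but the
              # non-word keeps some baseline probability of being perceived
```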
Because a constant $\alpha$ is added to the counts of all words, the prior probability of any word (with a nonzero frequency estimate) is greater than that of a non-word, allowing the model to capture top-down effects of lexical status (Ganong, 1980), and the prior probability of a given word is greater than that of any less frequent word, allowing the model to capture top-down effects of lexical frequency (Connine et al, 1993). Moreover, this framework predicts that the relative size of shifts in categorization due to lexical status should be more apparent the more common the word in a non-word/word pair is (e.g., a greater bias towards past in *bast–past than towards bash in bash–*pash). We return to this prediction later. The conclusions of this section are summarized in Table 4.1.

Clinical        Semantic Priming        Baseline       p(W)        Lexical      Frequency
Diagnosis^a     Lex. Dec. Latencies^b   Lex. Act.^c    weight^d    Effects^e    Effects^f
W/CA            cat = *gat = *wat