Top-Down Effects on Speech Perception: An Integrated Computational and Behavioral Approach by Neal P. Fox B. A., University of Virginia, 2009 Sc. M., Brown University, 2012 Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in the Department of Cognitive, Linguistic, and Psychological Sciences at Brown University Providence, Rhode Island May 2016 © Copyright 2016 by Neal P. Fox This dissertation by Neal P. Fox is accepted in its present form by the Department of Cognitive, Linguistic, and Psychological Sciences as satisfying the dissertation requirement for the degree of Doctor of Philosophy. _________ _______________________________________ Date Sheila E. Blumstein, Advisor Recommended to the Graduate Council _________ _______________________________________ Date Michael J. Frank, Reader _________ _______________________________________ Date James L. Morgan, Reader Approved by the Graduate Council _________ _______________________________________ Date Peter M. Weber, Dean of the Graduate School iii Neal P. Fox Education Brown University Ph.D. in Cognitive Science January 2016 Top-Down Effects on Speech Perception: An Integrated Computational and Behavioral Approach (Advisor: Dr. Sheila E. Blumstein) Brown University M.S. in Cognitive Science May 2012 Top-down effects from syntactic category expectations on speech processing University of Virginia B.A. with Distinction in Cognitive Science May 2009 Minor: Mathematics; Post-grad. training in Systems Engineering (2009-10) Honors and Awards Reisman Brain Science Graduate Fellowship Brown Institute for Brain Science 2014 Dissertation Fellowship Brown University 2013 Best Poster Award Society for Teaching of Psychology 2013 NSF Graduate Research Fellowship National Science Foundation 2010–2013 LSA Summer Institute Fellowship Linguistic Society of America 2011 Calvin & Rose G Hoffman Prize The Marlowe Society 2011 Graduate Fellowship Brown University 2010 Raven Society (Academic Honor Society) University of Virginia 2010 Graduate Research Fellowship in Engineering University of Virginia 2009–2010 Publications 1. Fox, N. P., Reilly, M., & Blumstein, S. E. (2015). Phonological neighborhood competition affects spoken word production irrespective of sentential context. Journal of Memory and Language, 83, 97-117. 2. Fox, N. P. & Blumstein, S. E. (in press). Top-down effects of syntactic sentential context on phonetic processing. Journal of Experimental Psychology: Human Perception and Performance. 3. Luthra, S., Fox, N. P., & Blumstein, S. E. (2016). Speaker information affects false memory recognition of unstudied lexical-semantic associates. (under revision) 4. Caplan, S., Fox, N. P., McClosky, D. M., & Charniak, E. (2016). Lexical substitution for cross-domain parser adaptation. (under revision) 5. Reilly, M., Guediche, S., Fox, N. P., & Blumstein, S. E. (2016, manuscript). Articulatory planning and motor reprogramming: An fMRI investigation. (in prep) 6. Fox, N. P., Blumstein, S. E., & Frank, M. J. (2016, manuscript). Bayesian Integration of Acoustic and Sentential Evidence in Speech: The BIASES Model of Spoken Word Recognition in Context. 7. Fox, N. P. & Blumstein, S. E. (2016, manuscript). Bottom-up and top-down contributions to lexical processing deficits in aphasia. Refereed Conference Presentations 1. Fox, N. P. & Larsen, E. W. (2009). A comparative optimality theoretic outlook on loanword phonology in Huave dialects of Mexico. Rice University Linguistics Society Third Biennial Meeting; Houston, TX. 
2. Fox, N. P. (2012). Top-down effect of syntactic category expectations on spoken word recognition. CUNY Conference on Sentence Processing; New York, NY. 3. Fox, N. P., Ehmoda, O., & Charniak, E. (2012). Statistical stylometrics and the Marlowe- Shakespeare authorship debate. Georgetown University Roundtable on Languages and Linguistics; Washington DC. 4. Fox, N. P. & Blumstein, S. E. (2013). Top-down effects from sentence context on speech processing in aphasia. Society for Neurobiology of Language; San Diego, CA. iv 5. Fox, N. P. & Reilly, M. (2013). Significantly different: A meta-analysis of the gap between statistics students learn and statistics psychologists use. Northeast Conference for Teachers of Psychology; Bridgeport, CT. (Best Poster Award) 6. Fox, N. P., Reilly, M., & Blumstein, S. E. (2014). Independent and interacting effects of sentential context and phonological neighborhood structure in spoken word production. Acoustical Society of America; Providence, RI. 7. Luthra, S., Fox, N. P. & Blumstein, S. E. (2015). Speaker information affects false recognition of unstudied lexical-semantic associates. Society for Neurobiology of Language; Chicago, IL. 8. Fox, N. P. & Blumstein, S. E. (2015). Computational and neural mechanisms of top-down effects on speech perception. Society for Neurobiology of Language; Chicago, IL. Teaching Experience Introduction to Cognitive Neuroscience Teaching Assistant Brown University 2013 Quantitative Methods in Psychology Teaching Assistant Brown University 2012 Computational Cognitive Science Teaching Assistant Brown University 2012 Teaching Development and Leadership Teaching Certifications: Certificate III: Professional Development Seminar 2014 Certificate IV: The Teaching Consultant 2013 Certificate I: Reflective Teaching Seminar 2012 Program Facilitation and Teaching Consultation: Certificate I: Reflective Teaching Seminar 2012–2015 Center for Engaged Learning Seminar: Undergraduate Research Mentorship 2014 Principles & Practice in Reflective Mentorship 2013–2014 Senior Teaching Consultant 2013–2014 New Teaching Assistant Orientation: Interactive Classrooms 2013 Undergraduate Research Mentorship Spencer Caplan, Brown class of 2015 2013–2015 Sahil Luthra, Brown class of 2014 2012–2014 Service Session Chair, Speech Perception, Acoustical Society of America Meeting 2014 CLPS Department Representative; Sheridan Center for Teaching & Learning 2012–2014 Graduate Representative, Campus Access Advisory Committee 2011–2014 Session Chair, Computational Linguistics, Georgetown Roundtable on Lang. & Ling. 2012 Graduate Representative, Brown Univ. Presidential Search Advisory Committee 2011–2012 Professional Memberships Society for Neurobiology of Language 2013–2016 Acoustical Society of America 2013–2016 American Psychological Association 2013–2016 Society for the Teaching of Psychology 2013–2016 Northeast Psychological Association 2013–2016 Linguistic Society of America 2011–2015 Institute for Electrical and Electronic Engineers 2009–2010 IEEE Engineering in Medicine and Biology Society 2009–2010 Society for Neuroscience 2008–2010 Research Interests computational models of speech and language perception and production, cognitive neuroscience of language and aphasia, computational linguistics and natural language processing Birth Information: Neal P. Fox was born on August 3, 1989, in San Diego, CA. 
Acknowledgements

"It is clear that the successful completion of graduate school requires the recruitment of resources from a variety of sources." Eye rolls aside, this statement is undoubtedly one of the most fundamental truths of graduate school. I have been lucky to work with, learn from, and get support from incredible scientists and outstanding people.

At every turn, Sheila Blumstein was there to guide me, to back me up, and – when necessary – to kick me in the butt. The role of advisor and mentor is often a thankless one, but it is impossible to express the impact Sheila has had on me and on my career. The breadth and depth of her work will always be an inspiration to me, along with the earnestness and fierceness of her loyalty, to say nothing of her unparalleled patience.

I was also fortunate to work with Michael J. Frank. Michael's multidisciplinary approach to cognitive science and computational neuroscience and his willingness to explore new questions opened many doors to me. Throughout my time at Brown, Jim Morgan was a valuable sounding board for my scientific work, but he is also representative of why the CLPS Department has become as much a home as a workplace. It was a privilege to "grow up" in a department in which a full professor would invite a graduate student to his house on a Saturday morning so the student could use his power tools to complete construction of a pair of Cornhole boards. Eugene Charniak's mentorship and collaboration provided me with an indispensable perspective at every step of my graduate career. I was often impressed by his humility when he didn't know the answer (which was rare) and by his ability to discern worthwhile projects from interesting exercises (while still seeing the value in both).

The mentorship of several other faculty members and postdocs in the CLPS department has also been critical to my development as a scientist. Especially at the beginning, Laura Kertz, Paul Allopenna, Emily Myers, Rachel Theodore and Hugh Rabagliati were long-suffering teachers and mentors. I am also grateful to Thomas Serre, Kathy Spoehr and David Badre for their formative mentorship in my three teaching assistantships in the department. Finally, I would be remiss if I failed to thank the dedicated folks who were there for me (and the entire department) whenever I asked for help and worked behind the scenes every day so I wouldn't need to. I feel lucky to call Reinette, Don, Jesse, Michelle, Rosa and Bill my friends.

My research projects spanned three labs (Sheila's, Michael's and Eugene's) over the last several years, and members of each lab have helped me clarify, shape, and also do much of that research. Among those colleagues were Megan Reilly, Sahil Luthra, Sara Guediche, John Mertus, Kathy Kurowski, Jeff Cockburn, Nick Franklin, Anne Franklin, Matt Nassar, Jason Scimeca, Micha Elsner, Spencer Caplan, Dave McClosky, Rebecca Mason and Omran Ehmoda. Additionally, several funding sources were critical to my research and training, including a Graduate Research Fellowship (DGE 0228243) from the National Science Foundation, a Graduate Fellowship from the Brown Institute for Brain Science, a summer training fellowship from the Linguistic Society of America, and funding from Brown's Graduate School and the CLPS Department.
As much as graduate school is and should be a disciplinary endeavor, I cannot overstate the amount of personal and professional development I gained by participating in the Sheridan Center's programming, and especially thanks to the incredible mentorship of the Center's former director Kathy Takayama. Working with her and the rest of the staff and students associated with the Sheridan Center was unquestionably one of the most important opportunities of my graduate career. Similarly, the opportunities to meet new colleagues, mentors, and friends from across the school by serving on University committees were formative experiences and fulfilling outlets during my time at Brown.

All work and no play would have made grad school a bummer. Luckily, I met Megan Reilly on my first day of my first year at Brown, so there was never any chance of that happening. She and the rest of the DuckTales Cohort (awoohoo: Jason Scimeca, Chris Erb, Patrick Heck, David Mély) made me a better scientist and person, and they put up with me when I constantly pointed out top-down effects on speech perception in real life (TDEB). Without them, these last few years would have been DISGUSTING.

Finally, as families go, I've got the best one. Thanks, Mom, Dad, Sean, Collin and Alyson. And thanks to the newest members of my family, Emily and Ginny. You have been the sources of many, many smiles while I wrote this thesis, and I will never be able to re-bay you.

Table of Contents

Introduction 1

Chapter 1 Top-down effects of syntactic sentential context on phonetic processing
1.1. Introduction 5
1.1.1. Models of Spoken Word Recognition: Competing Frameworks 6
1.1.2. Time Course of Sentential Context Effects 7
1.2. Experiment 1.1 10
1.2.1. Methods 11
1.2.1.1. Materials 11
1.2.1.1.1. Target Word Selection 11
1.2.1.1.2. Sentence Contexts 11
1.2.1.1.3. Stimulus Recording 12
1.2.1.1.4. Target Word Manipulation 12
1.2.1.1.5. Target Word Token Selection 13
1.2.1.2. Participants 14
1.2.1.3. Task 14
1.2.2. Results 15
1.2.3. Discussion 20
1.3. Experiment 1.2 21
1.3.1. Methods 23
1.3.1.1. Materials 23
1.3.1.1.1. Critical Targets 23
1.3.1.1.2. Filler Targets 23
1.3.1.1.3. Sentence Contexts 24
1.3.1.2. Participants 24
1.3.1.3. Task 25
1.3.2. Results 25
1.3.3. Discussion 28
1.4. General Discussion 29
1.4.1. Implications for Interactive Models of Speech Perception 30
1.4.2. Implications for Autonomous Models of Speech Perception 32
1.4.3. Predicting Behavior with Comp. Models of Speech Perception 33
1.5. Conclusion 33
1.6. Overview of Next Steps 34

Chapter 2 Bayesian Integration of Acoustic and Sentential Evidence in Speech: The BIASES Model of Spoken Word Recognition in Context
2.1. Introduction 36
2.1.1. Brief Introduction 36
2.1.2. Overview of Chapter 2 38
2.2. Sentential Context and Connectionist Models of Spoken Word Recognition 39
2.2.1. Modulation of Spoken Word Recognition by Sentential Context 40
2.2.2. Challenges in Modeling Context Effects on Sp. Word Recognition 43
2.2.2.1. Challenges...: Representing Context 44
2.2.2.2. Challenges...: Activation Dynamics 45
2.2.2.3. Challenges...: Representing Time 46
2.2.2.4. Context Effects Without Connectionist Models 48
2.3. A Computational-Level Analysis of Spoken Word Recognition 48
2.3.1. Bayesian Models of Spoken Word Recognition 49
2.3.2. Prior Expectations in SWR: Lexical Frequency 51
2.3.3. Prior Expectations in SWR: Sentential Context 53
2.4. BIASES: Bayesian Integration of Acoustic and Sentential Evidence in Speech 55
2.4.1. Conditional Prior: A Model of Listeners' Contextual Knowledge 57
2.4.1.1. Conditional Expectations from n-gram Language Models 58
2.4.1.2. Consequences of Adopting an n-gram Lang. Model Prior 59
2.4.1.3. BIASES' Conditional Prior: A Bigram Language Model 63
2.4.1.4. Add. Constraints on Prior Expectations: Forced-Choice 64
2.4.1.5. Implementing BIASES' Prior: Corpus Est., Smoothing 66
2.4.2. Likelihood Term: Mapping an Acoustic Signal onto Lexical Forms 68
2.4.2.1. Likelihood Functions: Many-to-One Mapping 68
2.4.2.2. Phonetic Ambiguity: One-to-Many Mapping 70
2.4.2.3. BIASES' Likelihood Term: A Mixture of Gaussians 73
2.4.2.4. Comparing Likelihood Terms in BIASES & Shortlist B 77
2.4.3. Integrating Prior Context and Perceptual Input in BIASES 79
2.4.4. Conclusion and Next Steps 80

Chapter 3 Exploring and Evaluating the BIASES Model of Spoken Word Recognition in Context
3.1. Understanding Top-Down Effects in BIASES 82
3.1.1. Overview of the Mathematical Form of BIASES 83
3.1.1.1. Components of BIASES: Phonetic Category Structure (g) 84
3.1.1.2. Components of BIASES: Category Boundary (χ) 88
3.1.1.3. Components of BIASES: Prior Context (Π) 88
3.1.2. Towards Model-based Analyses of Top-Down Effects 90
3.1.2.1. Shifting of Invisible Category Boundaries 90
3.1.2.2. Boundary Shifts vs. Effect Sizes 93
3.1.2.3. Predicting Effect Sizes 95
3.2. Evaluating BIASES 104
3.2.1. Observed Variability in the Size of Top-Down Context Effects 104
3.2.2. Variability in the Ambiguity of Phonetic Cues: VOT 108
3.2.3. Variability in the Ambiguity of Phonetic Cues: Additional Cues 109
3.2.4. Variability in the Strength of Prior Cues 112
3.2.5. Variability in the Effect Sizes Comp. to "Neutral" Prior Contexts 117
3.3. Testing Predictions of BIASES: Experiment 3.1 119
3.3.1. Methods 119
3.3.1.1. Subjects 119
3.3.1.2. Materials 119
3.3.1.3. Procedure 121
3.3.2. Results: Logistic Regression Analysis of Biased Contexts 121
3.3.3. Results: Model Comparison 1 – Subject Variability 125
3.3.4. Results: Model Comparison 2 – Inherent Biases in "Neutral" Priors 127
3.3.5. Conclusion 131

Chapter 4 Top-Down Effects on Spoken Word Recognition in Aphasia: A Model-Based Assessment of Information Processing Impairments
4.1. Introduction 132
4.1.1. Brief Introduction 132
4.1.2. Overview of Chapter 4 136
4.1.3. Lexical Processing in Aphasia 139
4.1.3.1. Lexical Processing Deficits 139
4.1.3.2. The Lexical Activation Hypothesis 141
4.1.3.3. Alternative Accounts of Lexical Processing Deficits 141
4.1.3.4. Top-Down Effects and Lexical Processing 143
4.2. Applying BIASES to Spoken Word Recognition in Aphasia 145
4.2.1. Brief Overview of BIASES 145
4.2.2. From Activations to Probabilities: Lexical Activation Hypothesis 146
4.2.2.1. Preliminary Simulations: Lexical Activation Hypothesis 148
4.2.2.2. Implications for Top-Down Effects on Speech Perception 152
4.2.3. Implementing BIASES-A 155
4.2.3.1. Adapting the Prior and Likelihood of BIASES 157
4.2.3.2. Modeling Speech Processing Deficits in BIASES-A 160
4.3. Top-Down Effects of Lexical Status on SWR in Aphasia 163
4.3.1. Simulation Study 4.1: Lexical Effects in Aphasia 164
4.3.2. Experiment 4.1: Lexical Effects in Aphasia 168
4.3.2.1. Methods 169
4.3.2.1.1. Subjects 169
4.3.2.1.2. Stimuli 170
4.3.2.1.3. Procedure 171
4.3.2.2. Results: Statistical Analyses 172
4.3.2.2.1. Motivation and Interp. of Logistic Regressions 173
4.3.2.2.2. Control Subjects: YCs vs. AMCs 176
4.3.2.2.3. Elderly Subjects: AMCs vs. BAs vs. W/CAs 178
4.3.2.2.4. Summary of Results of Statistical Analyses 182
4.3.2.3. Results: Model-Based Analyses 183
4.3.2.3.1. Motivation of Model-Based Analyses 184
4.3.2.3.2. Key Results of Model-Based Analyses 185
4.3.2.4. General Discussion of Results of Experiment 4.1 191
4.4. Top-Down Effects of Sentence Context on SWR in Aphasia 194
4.4.1. Joint Modeling Contextual & Lexical Effects on Word Recognition 196
4.4.2. Simulation Study 4.2: Sentential Context Effects in Aphasia 199
4.4.3. Experiment 4.2: Sentential Context Effects in Aphasia 205
4.4.3.1. Methods 206
4.4.3.1.1. Subjects 206
4.4.3.1.2. Stimuli 207
4.4.3.1.3. Procedure 208
4.4.3.1.4. Methodological Diff's Between Subject Groups 209
4.4.3.2. Results: Statistical Analyses 210
4.4.3.2.1. Control Subjects: YCs vs. AMCs 213
4.4.3.2.2. Elderly Subjects: AMCs vs. BAs vs. W/CAs 215
4.4.3.2.3. Summary of Results of Statistical Analyses 217
4.4.3.3. Results: Model-Based Analyses 219
4.4.3.3.1. Motivation of Model-Based Analyses 219
4.4.3.3.2. Key Results of Model-Based Analyses 220
4.4.3.4. General Discussion of Results of Experiment 4.1 and 4.2 226

Conclusion 228

List of Tables
Table 1.1 14
Box 3.1 86
Table 3.1 93
Box 3.2 103
Table 3.2 104
Table 3.3 104
Table 3.4 114
Table 3.5 126
Table 3.6 126
Table 3.7 130
Table 3.8 130
Table 4.1 155
Table 4.2 176
Table 4.3 176
Table 4.4 177
Table 4.5 178
Table 4.6 179
Table 4.7 181
Table 4.8 182
Table 4.9 186
Table 4.10 189
Table 4.11 214
Table 4.12 214
Table 4.13 215
Table 4.14 215
Table 4.15 217
Table 4.16 217
Table 4.17 221
Table 4.18 224

List of Figures
Figure 1.1 15
Figure 1.2 16
Figure 1.3 20
Figure 1.4 26
Figure 1.5 27
Figure 1.6 28
Figure 2.1 73
Figure 3.1 87
Figure 3.2 87
Figure 3.3 89
Figure 3.4 99
Figure 3.5 99
Figure 3.6 100
Figure 3.7 101
Figure 3.8 102
Figure 3.9 102
Figure 3.10 111
Figure 3.11 113
Figure 3.12 115
Figure 3.13 117
Figure 3.14 122
Figure 3.15 123
Figure 3.16 124
Figure 3.17 124
Figure 3.18 127
Figure 3.19 129
Figure 4.1 150
Figure 4.2 162
Figure 4.3 166
Figure 4.4 167
Figure 4.5 167
Figure 4.6 168
Figure 4.7 173
Figure 4.8 183
Figure 4.9 190
Figure 4.10 191
Figure 4.11 193
Figure 4.12 194
Figure 4.13 202
Figure 4.14 203
Figure 4.15 203
Figure 4.16 204
Figure 4.17 204
Figure 4.18 212
Figure 4.19 219
Figure 4.20 225
Figure 4.21 226

Introduction

During auditory language comprehension, a listener's principal objective is to infer what meaning the speaker intended to convey. For a healthy adult communicating in his or her native language, this task is typically both effortless and errorless. What makes this behavioral generalization noteworthy is the fact that there is rarely just one possible interpretation of a given acoustic signal; in fact, perceptual uncertainty is ubiquitous in speech communication. Indeed, listeners are faced with uncertainties arising from countless sources ranging from unclearly produced speech to imperfect listening conditions to inescapable ambiguities inherent in language (e.g., homophony). It is easy to see how the challenge posed by such pervasive uncertainty might be crippling to a speech processing system that relied exclusively on these ambiguous acoustic cues available in the perceived speech signal to decode a speaker's meaning. A fundamental question, then, regards how the perception of speech can be so robust despite these barriers.
In the present work, it is argued that at least part of the answer to that question is that even though much ambiguity exists, when one source of information is unreliable, there are usually other cues available in the signal that can be leveraged to understand the speaker’s intended meaning. For instance, although the spoken word /bɔrd/ could be an exemplar of either the word board or of bored, words are rarely uttered in isolation, and – most of the time – there is little doubt as to which meaning to assign to the lexically ambiguous speech token. This is, in large part, because the word’s linguistic and extra- linguistic context provides another cue that can aid in the extraction of meaning, particularly when the acoustic cues are degraded or insufficient. 1 While this explanation may appear straightforward, it raises the key question of how multiple sources of information are integrated by the speech processing system. It is that question which is the focus of this thesis. More specifically, the present work employs behavioral experiments and computational modeling in order to investigate how so-called bottom-up acoustic cues available in the sensory signal and top-down information about which words or sounds are likely in a given context are integrated during online auditory language comprehension. For the purpose of exposition, let a word be defined as the discrete linguistic unit that stands at the juncture between the sound information perceived by the listener in the speech signal and the underlying meaning associated with the signal (Blumstein, 2009). Spoken word recognition, then, represents a critical sub-routine of auditory language comprehension if a listener is to extract meaning from the perceived signal. A given word can be thought of as being associated with (1) a lexical form that defines how the word sounds, and (2) the word’s meaning. When a speech token that resembles the lexical form of a word is perceived, that word can be recognized and its meaning accessed. When it comes to recognizing a spoken word, acoustic cues in the sensory signal are certainly the paramount source of information available to the speech perception system. The perceptual system is adept at decoding the speech signal based on auditory cues alone. These information sources, which are the product of low-level sensory and phonetic processing of the input, are referred to as bottom-up cues. However, as important as bottom-up cues are, much research has shown that, in addition to integrating a host of bottom-up cues, word recognition also involves the recruitment of top-down information that is not immediately available in the signal, but 2 instead relies on cognitive or higher-level linguistic processing. A general conclusion of this line of research is that listeners tend to perceive things that are more probable; for instance, identification is biased towards words rather than non-words (Ganong, 1980) and towards contextually consistent or sensible words over words that are inconsistent or nonsensical given the context (e.g., Borsky et al, 1998; Fox & Blumstein, in press). Although the roles of both bottom-up and top-down cues are well attested, this thesis examines the basic computational principles that underlie their integration. Although a number of models have sought to tackle the question of how top-down cues come to influence speech perception, several issues remain unclear. 
Firstly, a long-standing, much-debated topic regards whether the observed biases in listeners' responses (toward words over non-words and toward contextually consistent words over inconsistent words) reflect the direct, top-down modulation of perceptual processing of the input or whether they reflect processing biases at a later decision-making level (see, e.g., McClelland, Mirman & Holt, 2006; Norris, McQueen & Cutler, 2000). Secondly, despite substantial evidence that sentential context influences spoken word recognition, existing models lack an explicit characterization that can account for these effects. Thirdly, another weakness of existing spoken word recognition models is that they largely ignore the enormous variability that exists in the observed sizes of top-down effects. Finally, the extent to which patients with aphasia experience deficits in top-down processing and cue integration during speech perception is poorly understood.

In the present work, each of these four issues is considered in turn. Empirical and computational methodologies are employed in order to probe the questions each issue poses. Ultimately, the results of this thesis provide a more complete picture of the computations that take place at the interface between the perceptual processing of speech and the cognitive and linguistic processing of language, while also establishing a novel theoretical basis that promises to guide future work.

Chapter 1
Top-down effects of syntactic sentential context on phonetic processing [1]

[1] At the time of submission of this dissertation, a version of Chapter 1 is currently in press at the Journal of Experimental Psychology: Human Perception and Performance (http://dx.doi.org/10.1037/a0039965). This article may not exactly replicate the final version published in the APA journal.

1.1. Introduction

During auditory language comprehension, listeners integrate information from a variety of sources in their categorization of sounds and words, especially when confronted with degraded or ambiguous speech. Besides low-level acoustic cues, listeners are also sensitive to higher-level information that is not immediately available in the raw sensory input. For instance, listeners exhibit a lexical bias in their categorization of a phonetically ambiguous segment between /g/ and /k/ such that they label the segment as /g/ more often when followed by –ift, but as /k/ more often when followed by –iss (Ganong, 1980; see also Burton, Baum & Blumstein, 1989; Burton & Blumstein, 1995; Connine, 1990; Connine & Clifton, 1987; Fox, 1984; McQueen, 1991; Miller & Dexter, 1988; Myers & Blumstein, 2008; Pitt, 1995; Pitt & Samuel, 1993). Moreover, when a stimulus is phonetically ambiguous between two words (e.g., between goat and coat), listeners exhibit a semantic bias such that they label the ambiguous word as goat more often when embedded in a sentence like The busy farmer hurried to milk the... but as coat more often in The elderly tailor had to dry-clean the... (Borsky, Tuller & Shapiro, 1998; Connine, 1987; Connine, Blasko & Hall, 1991; Garnes & Bond, 1976; Guediche, Salvata & Blumstein, 2013; Miller, Green & Schermer, 1984). Like semantic information, syntactic (Isenberg, Walker & Ryder, 1980; van Alphen & McQueen, 2001), morphosyntactic (Martin, Monahan & Samuel, 2012), and pragmatic information (Do, 2011; Rohde & Ettlinger, 2012) have all been shown to bias listeners' identifications of phonetically ambiguous words.
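These biases can be cast in probabilistic terms: an ambiguous token yields roughly equal acoustic evidence for two candidates, and the surrounding context supplies a prior that tips the balance. The sketch below, which anticipates the Bayesian treatment developed in Chapter 2, combines a context-dependent prior with Gaussian voice-onset time (VOT) likelihoods; the category means, standard deviation, and prior values are illustrative placeholders rather than estimates from any experiment reported here.

```python
import math

def gaussian(x, mu, sigma):
    """Normal density; stands in for a phonetic category's VOT likelihood."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_response(vot, prior_p, mu_b=5.0, mu_p=45.0, sigma=12.0):
    """Posterior probability of a /p/ percept given a VOT value and a
    context-driven prior. Category parameters are illustrative, not fitted."""
    like_p = gaussian(vot, mu_p, sigma)
    like_b = gaussian(vot, mu_b, sigma)
    return (prior_p * like_p) / (prior_p * like_p + (1 - prior_p) * like_b)

# The same ambiguous 22-ms token under a /p/-favoring, neutral, and /b/-favoring prior
for prior in (0.7, 0.5, 0.3):
    print(f"prior = {prior:.1f} -> P(/p/ | VOT = 22 ms) = {p_response(22, prior):.2f}")
```

With a /p/-favoring prior the same 22-ms token is labeled /p/ about half the time, and with a /b/-favoring prior only rarely, which is the qualitative pattern reported in the studies cited above.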
From this robust literature, it is clear that sensory processing alone cannot explain listeners’ judgments about the identities of spoken words and sounds. In order for higher- level information to influence spoken word recognition, perceptual input must make contact with lexical representations which act as a gateway to the semantic, syntactic and other properties of words, and those lexical representations must then be able to influence behavioral responses in tasks like those described above (Samuel, 2011). 1.1.1. Models of Spoken Word Recognition: Competing Frameworks There exists, however, a longstanding debate about how those lexical representations come to influence spoken word recognition (for reviews, see McClelland, Mirman & Holt, 2006; McQueen, Norris & Cutler, 2006). Two competing families of spoken word recognition models – interactive models and autonomous models – each account for the effects of higher-level cues by appealing to different mechanisms. In particular, they differ in how pre-lexical (i.e., perceptual/phonetic) representations and lexical representations influence one another. Both approaches allow for a bottom-up flow of information such that pre-lexical processing of speech modulates the extent to which competing lexical representations are supported. However, only interactive models of spoken word recognition, exemplified by TRACE (McClelland & Elman, 1986; McClelland, 1991), incorporate top-down feedback projections that, conversely, allow lexical representations to modulate the extent to which competing pre-lexical representations are supported (see also Adaptive Resonance Theory; Grossberg, 1980, 2003; Grossberg & Myers, 2000). 6 Autonomous models of spoken word recognition, on the other hand, eschew top- down modulation of phonetic processing. Instead, they account for lexical and contextual effects on listeners’ responses by positing that both higher-level and lower-level information can influence phonemic decisions, but it is maintained that pre-lexical representations remain faithful to the bottom-up acoustic input. Thus, under the autonomous view, the observed biases reflect the integration of multiple information sources, but, crucially, this integration does not affect lower-level phonetic processing itself (see, e.g., Norris, McQueen & Cutler, 2000; McQueen, Jesse & Norris, 2009). A succession of autonomous models has been proposed in the literature, including Race (Cutler & Norris, 1979; Cutler, Mehler, Norris & Segui, 1987), Shortlist (Norris, 1994), Merge (Norris et al., 2000), and Shortlist’s Bayesian implementation (Norris & McQueen, 2008). Although each varies in its details, none allows for higher-level modulation of phonetic processing through feedback (McQueen et al., 2006; see also Fuzzy Logical Model of Speech Perception; Massaro, 1989). Thus, any behavioral demonstration of lexical or contextual biases in phoneme judgments could, in theory, be explained at the level of the judgment itself (a post- perceptual explanation, as in autonomous models), or by direct modulation of pre-lexical processing prior to the judgment (a perceptual explanation, as in interactive models). Because of this, interactive and autonomous models of spoken word recognition have, in practice, proven difficult to distinguish. However, past work suggests that these two frameworks may diverge in their predictions about the time course of these effects. 1.1.2. 
Time Course of Sentential Context Effects 7 One result that proponents of autonomous models have cited as incompatible with TRACE (and interactive models more generally) concerns the time course of lexical and contextual effects. Specifically, they contend that if top-down feedback can directly alter the activation of pre-lexical representations (as it can in TRACE), then the influence of higher-level information could only grow or remain stable as a function of processing time, but could not diminish (McQueen, 1991; Tuinman, Mitterer & Cutler, 2014; van Alphen & McQueen, 2001). According to this argument, top-down biasing information within an interactive framework tends to overwrite the ambiguous bottom-up input pattern, shaping it and pulling it towards a pattern that would be expected for lexically- or contextually-consistent speech. For instance, in one experiment, the identification of phonetically ambiguous function words (between de and te; roughly the and to in Dutch) was shown to be biased by manipulating the target words’ grammaticality in context (van Alphen & McQueen, 2001). Ambiguous stimuli were labeled as /de/ more often in sentences like We verstoppen [?] schaatsen (We hide [the]/[to] skates; de-biased) than in sentences like We behoren [?] schaatsen (We ought [to]/[the] skate; te-biased] (cf. Isenberg, Walker & Ryder, 1980). Importantly, though, when responses were divided into bins based on reaction time (cf. Fox, 1984), contextual biases from an immediately preceding syntactic cue were strongest in fast responses and grew weaker with time. A similar pattern of results was found for the time course of syntactic context effects on a different type of phonetic judgment by Tuinman, Mitterer and Cutler (2014). Can interactive models account for a smaller contextual bias in slower responses than in faster ones? Van Alphen and McQueen (2001) argue that they cannot: if 8 ambiguous bottom-up information has been overwritten by top-down feedback such that even early activation levels at the pre-lexical level are biased by context, then the original (ambiguous) pre-lexical representation cannot be recovered in order to yield more ambiguous (i.e., less biased) responses later in processing. That is, in interactive models, feedback permanently and irrevocably biases the bottom-up record of the unbiased pre- lexical representation (Massaro, 1989). Although Dahan, Magnuson and Tanenhaus (2001) dismiss this argument on the grounds that using response latencies to track top- down influences on pre-lexical activation “is not straightforward” (p. 321), it remains unclear to what extent the time course of sentential context effects on spoken word recognition does, in fact, challenge interactive models of speech perception. The present work aimed to examine this question by testing two possible alternative explanations of previous time course data (van Alphen & McQueen, 2001; Tuinman et al, 2014). Specifically, two experiments investigated whether characteristics of the experimental designs in earlier work might have allowed subjects to adopt strategies that would not only explain the diminishing bias effect, but also undermine the ability of those data to distinguish between interactive and autonomous models. Following previous work that showed diminishing contextual biases, both of the present experiments examined syntactic sentence context effects on subjects’ perception of phonetically ambiguous speech. 
Experiment 1.1 tested whether the diminishing influence of sentential context would persist when subjects could not plan contextually congruent responses prior to the presentation of the target stimulus. Experiment 1.2 tested whether the diminishing bias effect would persist when subjects were induced to engage in phoneme identification rather than word identification strategies. 9 1.2. Experiment 1.1 In prior work (van Alphen & McQueen, 2001; Tuinman et al, 2014), the experimental design allowed subjects to identify the contextually appropriate response before hearing the ambiguous target segment. For example, in van Alphen and McQueen’s (2001) study, a preceding context biased the identification of an acoustically ambiguous target between de and te. Because this experiment utilized a single continuum with only two alternatives (de and te), subjects could have identified and prepared a grammatically congruent response before they encountered the target. This raises the question of whether some responses might reflect decisions that were generated before processing of the target could have actually begun. If the responses that were fastest were disproportionately contaminated with such pre-planned decisions, it would not be surprising to observe a fast-arising bias that appears to weaken at longer RTs (once subjects had heard the target word). Importantly, this explanation would not challenge interactive models; button-presses planned before a stimulus is presented could not bear on whether the pre-lexical representation of that stimulus is modulated by interactive feedback. Thus, Experiment 1.1 utilized two voice-onset time (VOT) continua, rather than just one: a noun–verb continuum (bay–pay) and a verb–noun continuum (buy–pie), crossing target word voicing with syntactic category bias. Tokens from these VOT continua were appended to noun-biasing and verb-biasing sentence contexts (e.g., Valerie hated the..., Brett hated to...), and subjects identified the initial segment of sentence-final targets (/b/ vs. /p/). In this way, subjects could not know which phonemic response (/b/ vs. /p/) was congruent with the contextual bias of a sentence until the target word was 10 presented, thereby preventing them from anticipating or planning a specific button-press response before hearing the target word. 1.2.1. Methods 1.2.1.1. Materials 1.2.1.1.1. Target Word Selection Sixteen monolingual volunteers who spoke American English participated in a norming study to confirm that the four critical targets words (bay, pay; buy, pie) had strong syntactic category biases in the expected directions. The four targets were included in a randomly ordered list of 40 words that included words from a variety of syntactic categories. For each word in the list, subjects wrote one sentence “that might be heard in everyday speech,” and their responses were coded for the target words’ part of speech usage in each sentence. The noun targets (bay and pie) were each used by at least 15 of 16 subjects as a noun; the verb targets (pay and buy) were each used by at least 14 subjects as a verb. Thus, bay/pay and buy/pie were judged to be sufficiently biased noun/verb and verb/noun minimal pairs, respectively. 1.2.1.1.2. Sentence Contexts Twenty main verbs (e.g., hate, want) that could be followed by either a noun phrase or an infinitive phrase (e.g., hate the bay; hate to pay) were identified. 
1.2.1.1.3. Stimulus Recording

Sentences ending with the target words were recorded by a female monolingual native American English speaker in a sound-dampened room with an Edirol digital recorder (model R09-HR; Sony microphone model ECM-MS907; sampled: 44,100 Hz / 24 bits / stereo; resampled in BLISS speech-editing software: 22,050 Hz / 16 bits / mono; Mertus, 1989). Target words were spliced out of the sentences' waveforms, yielding 40 partial sentences (20 noun-biased, 20 verb-biased).

1.2.1.1.4. Target Word Manipulation

A natural token of bay and of buy served as base tokens for two VOT continua, constructed using the BLISS waveform editor (Mertus, 1989). Beginning with the base voiced token of bay, the acoustically modified voiceless endpoint of the bay–pay continuum and each intermediary token were generated by successively adding aspiration from the middle of the aspiration of a naturally-produced pay token and removing pitch periods of equal duration from the onset of the vowel of the natural bay token. This procedure yielded 12 stimuli with VOTs ranging from 2 to 64 milliseconds. The onset's burst was amplified (2x) for all tokens in the continuum because the aspiration from the pay token rendered the burst in the natural bay inaudible. Tokens of the buy–pie continuum were generated in the same manner, yielding 12 stimuli with VOTs ranging from 3 to 62 milliseconds. As with the bay–pay continuum, the onset's burst was amplified (3x) for all tokens in the continuum. Twenty milliseconds of silence was appended to the beginning of each token of both continua. The waveforms of all stimuli across the two continua were normalized for amplitude so that the highest peaks of the waveforms were equally high.
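The logic of that cross-splicing procedure can be sketched schematically as follows. The sketch operates on raw sample arrays, trades pitch-period-sized deletions for sample-sized ones, and uses invented durations, so it is an illustration of the idea rather than a re-implementation of the BLISS editing steps.

```python
import numpy as np

SR = 22050  # Hz; matches the resampled stimulus rate reported above

def continuum_step(burst, aspiration, voiced_vowel, vot_ms):
    """Splice `vot_ms` of aspiration (from a natural 'pay') after the burst and delete
    an equal duration from the vowel onset of the natural 'bay', keeping overall token
    duration constant. Simplified: deletes samples rather than whole pitch periods."""
    n = int(round(vot_ms * SR / 1000))
    return np.concatenate([burst, aspiration[:n], voiced_vowel[n:]])

# Toy stand-ins for the recorded material (not the actual stimuli)
burst = 0.1 * np.random.randn(int(0.002 * SR))
aspiration = 0.05 * np.random.randn(int(0.080 * SR))
vowel = np.sin(2 * np.pi * 150 * np.arange(int(0.250 * SR)) / SR)

continuum = [continuum_step(burst, aspiration, vowel, ms)
             for ms in np.linspace(2, 64, 12)]  # 12 steps spanning roughly 2-64 ms of VOT
```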
1.2.1.1.5. Target Word Token Selection

Ten monolingual native English-speaking volunteers from the Brown University community participated in a norming study whose goal was to select the eight target stimuli: two ambiguous tokens and one token from each phonetic category endpoint for each continuum. Twenty trials of each of the 12 tokens of each continuum (480 total trials) were presented in isolation (without sentential context) binaurally to participants in random order. Participants indicated whether each target began with a "p" or "b" by pressing a corresponding button (response mapping was counterbalanced between subjects) and were instructed to respond as quickly as possible while maintaining accuracy, and to guess if they were unsure. The two tokens from each continuum with the identification rates closest to 50% and the highest mean response reaction times (Pisoni & Tash, 1974) were selected as the ambiguous tokens from each continuum. Endpoint tokens of each continuum were selected such that the /b/ and /p/ endpoints were equidistant from the ambiguous pair. Table 1.1 shows the results for the eight selected tokens.

Continuum   Token #   VOT (ms)   Mean % p   Mean RT (ms)
bay–pay     2         7          1.5        591
bay–pay     4         18         42.5       794
bay–pay     5         24         74         778
bay–pay     7         35         97.5       655
buy–pie     2         7          2          609
buy–pie     4         18         39         852
buy–pie     5         23         79         826
buy–pie     7         34         86.5       720

Table 1.1. Mean classification rates and reaction times (RTs) for the selected tokens from each voice-onset time (VOT) continuum.

1.2.1.2. Participants

Fifty self-reported native monolingual American English speakers with normal hearing from the Brown University community volunteered or received course credit to participate in Experiment 1.1. None had participated in any of the norming studies reported earlier. Due to technical difficulties, one subject's incomplete data were excluded from analysis.

1.2.1.3. Task

Each of the eight selected tokens was appended to each of the 40 sentence contexts, yielding 320 stimuli. The resulting design crossed two levels of CONTINUUM (bay–pay, buy–pie) with two levels of CONTEXT (noun-biased, verb-biased) and four tokens from each VOT continuum. All sentences were presented binaurally in a random order after eight practice trials. Participants were instructed to indicate whether the last word in each sentence began with a "b" or a "p" by pushing the appropriate button with either the index or middle finger of their dominant hand (response mapping was counterbalanced between subjects). The experiment did not advance to the next trial until a subject responded, but participants were instructed to respond as quickly as possible while maintaining accuracy, and to guess if they did not know. They were also warned that some sentences might not make sense. Reaction times were measured from the onset of the target word.

1.2.2. Results

[Figure 1.1. Mean proportion of /p/-responses to tokens from each VOT continuum in Experiment 1.1 after noun-biasing and verb-biasing sentence contexts. Error bars represent standard error.]

The results of Experiment 1.1 are shown in Figure 1.1. Because this study was designed to examine contextual effects on the processing of ambiguous speech, data from responses to the two middle tokens in each continuum were analyzed. The mean proportion of /p/-responses for those intermediate tokens in each context and continuum is summarized in Figure 1.2. Individual subjects' data were excluded if they did not make at least 10% /b/-responses and 10% /p/-responses to the ambiguous tokens in a given continuum. Based on this criterion, all 49 subjects perceived at least one of the continua ambiguously (36 for the bay–pay continuum; 46 for the buy–pie continuum; 33 for both continua). Finally, 205 trials (1.31% of responses) with extreme reaction time (RT) values were removed prior to analysis (>3 standard deviations from the mean RT for a given subject/target/context).

[Figure 1.2. Mean proportion of /p/-responses to ambiguous tokens from each VOT continuum in Experiment 1.1 after noun-biasing and verb-biasing sentence contexts. Error bars represent standard error.]

To test for an effect of sentential context on the identification of ambiguous stimuli, the data were analyzed using mixed effects logistic regression (Baayen, Davidson & Bates, 2008; Jaeger, 2008); a detailed description of the analyses can be found in Appendix C. The regression model included fixed effects for CONTEXT (verb-biased vs. noun-biased), CONTINUUM (bay–pay vs. buy–pie), and VOT (the VOT of each ambiguous token), along with all their two- and three-way interactions. All random intercepts and slopes were included for both subjects and items (i.e., main verbs; e.g., hated). Since the CONTINUUM factor crossed voicing (/b/ vs. /p/) and syntactic category bias (noun vs. verb), the critical test of a syntactic context effect is a CONTEXT × CONTINUUM interaction. A significant CONTEXT × CONTINUUM interaction would indicate that, after hearing the (which requires a noun rather than a verb), subjects were more likely to report hearing a /p/ if the target came from the buy–pie continuum, and more likely to perceive a /b/ if it came from the bay–pay continuum. As suggested by Figure 1.2's reversal in direction of the context effect within each continuum, the results showed a robust CONTEXT × CONTINUUM interaction (β = 2.27, SE = 0.30, |z| = 7.53, p < 0.001). Follow-up tests confirmed a crossover interaction between CONTEXT and CONTINUUM, indicated by a significant simple effect of CONTEXT on responses to stimuli from each continuum, but in opposite directions (bay–pay: β = -1.37, SE = 0.19, |z| = 7.34, p < 0.001; buy–pie: β = 0.95, SE = 0.20, |z| = 4.82, p < 0.001). All other effects that reached significance in the omnibus and follow-up analyses are reported and discussed in Appendix C.
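For readers who want to see the structure of this analysis concretely, the sketch below fits the same fixed-effects specification with Python's statsmodels. It is a simplified stand-in: the file and column names are hypothetical, and the random intercepts and slopes for subjects and items described above (and detailed in Appendix C) are omitted, since an ordinary logit does not estimate them.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file: one row per response to an ambiguous token, with columns
# resp_p (1 = "p" response), context ("the"/"to"), continuum ("bay-pay"/"buy-pie"),
# vot (in ms), plus subject and item identifiers.
trials = pd.read_csv("exp1_ambiguous_responses.csv")

# Fixed effects for CONTEXT, CONTINUUM, and VOT plus all interactions;
# the critical term is the CONTEXT x CONTINUUM interaction.
fit = smf.logit("resp_p ~ C(context) * C(continuum) * vot", data=trials).fit()
print(fit.summary())
```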
The central aim of Experiment 1.1 was to examine whether the obtained syntactic context effect was modulated by response latency. Thus, following the analysis procedure of Tuinman and colleagues (2014), we divided responses into two RT ranges (fast vs. slow) to test for differences in the size of this contextual bias. To do this, the responses of each participant within each cell of the experiment's design (20 responses for each subject/context/continuum/token, less outliers) were ranked according to their RTs. Then, from each ranked list of RTs, the eight trials (40%) with the shortest RTs were labeled fast (mean RT = 577 ms, SD = 149 ms) and the eight trials (40%) with the longest RTs were labeled slow (mean RT = 1,012 ms, SD = 410 ms), omitting the mid-range. Together, the fast and slow RT ranges constituted a binary factor: SPEED.

Tuinman and colleagues (2014) showed that sentential context interacted with SPEED such that subjects' responses were less influenced by context at slow RTs than at fast RTs. However, in the present study, CONTEXT (i.e., the vs. to) has opposite effects on responses to stimuli from the bay–pay and buy–pie continua. Therefore, the CONTEXT and CONTINUUM factors were recoded into a single factor (BIAS) with two levels (/p/-congruent vs. /b/-congruent), each corresponding to two types of trials. Trials were classified as /p/-congruent when a verb-biasing context (e.g., Brett hated to...) preceded a target from the bay–pay continuum (because a pay-response is congruent with the context in these trials) and when a noun-biasing context (e.g., Valerie hated the...) preceded a target from the buy–pie continuum (because a pie-response is congruent with the context in these trials). Conversely, trials were /b/-congruent if they contained a verb-biasing context and a target from the buy–pie continuum ("...to /?ai/") or a noun-biasing context and a target from the bay–pay continuum ("...the /?ei/"). Visually, the /p/-congruent trial-types correspond to the two conditions in Figure 1.2 with higher rates of /p/-responses, while the lower bars represent /b/-congruent trials.
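The construction of the SPEED and BIAS factors amounts to a per-cell ranking of RTs and a recoding of the design factors, which the following pandas sketch makes explicit (column names are assumptions, not the labels used in the original analysis files).

```python
import numpy as np
import pandas as pd

def add_speed_and_bias(trials):
    """Label each ambiguous-token trial with SPEED (fastest vs. slowest 40% of RTs
    within its subject x context x continuum x token cell) and BIAS
    (/p/-congruent vs. /b/-congruent), dropping the mid-range trials."""
    out = trials.copy()

    # Percentile rank of each RT within its design cell; with 20 trials per cell,
    # roughly 8 trials fall in each 40% tail, as in the analysis described above.
    pct = out.groupby(["subject", "context", "continuum", "token"])["rt"].rank(pct=True)
    out["speed"] = np.select([pct <= 0.40, pct > 0.60], ["fast", "slow"], default="mid")

    # A trial is /p/-congruent when its context favors the /p/-initial member of the
    # continuum: "to" with bay-pay (pay) or "the" with buy-pie (pie).
    p_congruent = ((out["context"] == "to") & (out["continuum"] == "bay-pay")) | \
                  ((out["context"] == "the") & (out["continuum"] == "buy-pie"))
    out["bias"] = np.where(p_congruent, "/p/-congruent", "/b/-congruent")

    return out[out["speed"] != "mid"]
```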
Having created the SPEED and BIAS factors, we examined whether there was a BIAS × SPEED interaction similar to what was observed in previous work (Tuinman et al., 2014; cf. van Alphen & McQueen, 2001). Figure 1.3 shows the mean proportion of /p/-responses subjects made for ambiguous tokens in /p/-congruent and /b/-congruent conditions in each of the RT ranges. A logistic regression analysis with fixed effects for BIAS, SPEED, VOT, their two- and three-way interactions, and all corresponding random intercepts and slopes for subjects and items was conducted (see Appendix C for details). This analysis revealed a significant BIAS × SPEED interaction (β = 0.51, SE = 0.15, |z| = 3.30, p < 0.001), wherein responses were more likely to be syntactically congruent – and hence show a larger bias effect – in faster responses (61.6% /p/-responses to ambiguous tokens in /p/-congruent conditions vs. 41.4% in /b/-congruent conditions) than in slower responses (/p/-congruent: 55.6%; /b/-congruent: 42.0%). Additional effects that reached significance are reported and discussed in Appendix C.

[Figure 1.3. Mean proportion of /p/-responses in fast and slow responses to ambiguous tokens in /p/-congruent ("...to /?ei/" and "...the /?ai/") and /b/-congruent ("...the /?ei/" and "...to /?ai/") conditions in Experiment 1.1. Results indicate a weaker effect of BIAS in slow responses than in fast responses (see main text). Error bars represent standard error.]

1.2.3. Discussion

Experiment 1.1 was designed to investigate one possible alternative explanation for previous results showing a diminishing influence of context on spoken word recognition (van Alphen & McQueen, 2001; Tuinman et al., 2014). Here, as in previous work, acoustic targets were manipulated to be phonetically ambiguous between two words and embedded in sentential contexts that rendered one of those words ungrammatical (or at least less plausible). However, unlike previous studies in which subjects could identify which response (button-press) would be congruent with the context before the target was presented, Experiment 1.1's design made it impossible for responses to the target stimuli to be systematically biased by the context if such responses were planned prior to target presentation. Despite the addition of this control, the results of Experiment 1.1 showed that contextual biases on spoken word recognition still diminished over time, replicating earlier results.

Having ruled out this alternative explanation, the central theoretical question that remains is whether the observation of a weakening bias effect in Experiment 1.1 (and elsewhere; van Alphen & McQueen, 2001; Tuinman et al., 2014) is inconsistent with interactive speech perception models. As described earlier, this interpretation follows from the argument that top-down feedback permanently overwrites pre-lexical information in such models (e.g., TRACE). However, this argument rests on the critical assumption that subjects' decisions (both in previous experiments and in Experiment 1.1) tap into pre-lexical processing levels. In previous studies (van Alphen & McQueen, 2001; Tuinman et al., 2014), subjects made word identification decisions, which reflect the relative activation of competing lexical representations. Results reflecting lexical-level decisions cannot be taken as evidence either for or against top-down feedback to pre-lexical levels.
Similarly, although Experiment 1.1 employed a phoneme identification task, it remains possible that subjects were monitoring for the four possible target words (bay, pay, buy, pie) and learning that each response button corresponded to two words, and thus were implicitly engaging in word identification. Experiment 1.2 aimed to resolve this potential issue by making lexical-level decisions difficult, if not impossible.

1.3. Experiment 1.2

Experiment 1.2 considered a second factor important for the interpretation of contextual influences that diminish in slower responses. Previous studies showing this pattern of results employed word identification tasks, not phoneme identification tasks (van Alphen & McQueen, 2001; Tuinman et al., 2014). It is essential to consider the task-specific linking hypothesis (cf. Magnuson, Mirman & Harris, 2012; Tanenhaus, Magnuson, Dahan & Chambers, 2000) that transforms model activations into behavioral predictions in TRACE: "word identification responses are assumed to be based on readout from the word level, and phoneme identification responses are assumed to be based on readout from the phoneme level" (McClelland & Elman, 1986; p. 21). Thus, since subjects were identifying words (not phonemes), a model like TRACE would predict that lexical responses should reflect word-level (not phoneme-level) activations. Consequently, the time course of context effects demonstrated in previous work may not, in fact, be inconsistent with either interactive models (generally) or TRACE (specifically), because word identification data are not relevant to the question of the presence or absence of feedback from lexical to pre-lexical nodes.

Experiment 1.2 modified the design of Experiment 1.1 to discourage subjects from adopting word identification strategies. In particular, subjects performed a phoneme identification task in which the critical target stimuli from the bay–pay and buy–pie continua were embedded among twenty filler target words beginning with /b/ or /p/. We hypothesized that, when responding to 24 unique target words, subjects would have to monitor for the identity of a target's initial consonant in order to perform the phoneme identification task rather than utilizing word-level strategies. As such, subjects' responses in Experiment 1.2 should reflect the relative evidence for competing pre-lexical representations, a prerequisite to using any time course analysis to discriminate between models with and without interactive feedback. The stimuli for Experiment 1.2 included sentences of the same form as Experiment 1.1 (e.g., Brett hated to...), but 160 sentences ending with critical targets (an acoustically manipulated token from either the bay–pay or buy–pie continuum) were embedded among 800 sentences ending with filler targets.
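Before turning to the methods, the readout distinction at the heart of this linking hypothesis can be made concrete with a toy two-layer simulation. The sketch below is not TRACE, and all of its parameters are invented; it simply shows that, in an architecture with lexical-to-pre-lexical feedback, a context bias injected at the word layer progressively biases phoneme-layer readout (and does not shrink over cycles), whereas with feedback removed the phoneme-layer readout stays at chance even while word-layer readout is biased.

```python
import numpy as np

def simulate(feedback=0.3, cycles=10, context_bias=0.2):
    """Toy two-node-per-layer activation sketch (not TRACE). Phoneme nodes [/b/, /p/]
    receive perfectly ambiguous bottom-up input; word nodes [b-word, p-word] receive a
    top-down context bias favoring the /p/ word. With feedback > 0 the word layer also
    feeds back to the phoneme layer; with feedback = 0 it does not."""
    phon = np.zeros(2)
    word = np.zeros(2)
    bottom_up = np.array([0.5, 0.5])          # ambiguous acoustic evidence
    context = np.array([0.0, context_bias])   # context favors the /p/-initial word
    for t in range(cycles):
        phon = 0.6 * phon + bottom_up + feedback * word
        word = 0.6 * word + phon + context
        p_phon = np.exp(phon)[1] / np.exp(phon).sum()   # Luce-choice readout, phoneme layer
        p_word = np.exp(word)[1] / np.exp(word).sum()   # Luce-choice readout, word layer
        print(f"cycle {t:2d}: P(/p/) from phonemes = {p_phon:.2f}, from words = {p_word:.2f}")

simulate(feedback=0.3)  # interactive-style: phoneme-layer bias grows over cycles
simulate(feedback=0.0)  # autonomous-style: phoneme-layer readout stays at 0.50
```

On this toy picture, biased word-level responses are compatible with either architecture, which is the sense in which word identification data cannot speak to the presence or absence of feedback.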
1.3.1. Methods

1.3.1.1. Materials

1.3.1.1.1. Critical Targets

The same 8 critical target tokens used in Experiment 1.1 were used in Experiment 1.2.

1.3.1.1.2. Filler Targets

Twenty words (10 beginning with /b/; 10 beginning with /p/) were selected to serve as filler targets in Experiment 1.2 (for a complete list, see Appendix B). The list of filler words included nouns (e.g., bull), verbs (e.g., put), and syntactically ambiguous words (e.g., plan). The syntactic bias of each filler word (rate of usage of the word as a noun vs. a verb) was computed using the Penn Treebank (Marcus, Marcinkiewicz & Santorini, 1993), and the lists of filler words beginning with /b/ and /p/ were balanced for their average syntactic bias. Finally, eight of the twenty filler words composed four minimal pairs (e.g., bull, pull) so that, when combined with the critical targets, half of the targets in Experiment 1.2 comprised minimal pairs. The same female speaker that recorded the stimuli for Experiment 1.1 read aloud sentences ending with the filler target words. The filler targets were then spliced out of the recorded sentences, scaled to have the same maximum volume as the critical targets, and appended to 20 ms of silence (like the critical targets), but were not acoustically manipulated (e.g., by altering the VOT of onsets).
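Estimating a word's noun-versus-verb bias from the Penn Treebank amounts to counting its part-of-speech tags. The sketch below does this with the small Treebank sample bundled with NLTK rather than the full corpus, so the resulting counts (and the word list) are illustrative only.

```python
from collections import Counter
from nltk.corpus import treebank  # requires: nltk.download("treebank")

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def noun_bias(word):
    """Proportion of a word's tagged occurrences that are nouns (vs. verbs)."""
    tags = Counter(tag for w, tag in treebank.tagged_words() if w.lower() == word)
    nouns = sum(tags[t] for t in NOUN_TAGS)
    verbs = sum(tags[t] for t in VERB_TAGS)
    return nouns / (nouns + verbs) if (nouns + verbs) else float("nan")

for filler in ["bull", "put", "plan"]:  # example filler words mentioned in the text
    print(filler, round(noun_bias(filler), 2))
```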
No trials warranted removal following an RT outlier analysis (same criteria as Experiment 1.1).

Figure 1.4. Mean proportion of /p/-responses to tokens from each VOT continuum in Experiment 1.2 after noun-biasing and verb-biasing sentence contexts. Error bars represent standard error.

The results of the first three-factor (CONTEXT × CONTINUUM × VOT) mixed effects regression revealed a significant CONTEXT × CONTINUUM interaction (β = 2.75, SE = 0.51, |z| = 5.34, p < 0.001; see Figure 1.5), confirming that subjects' phonemic decisions about ambiguous targets were influenced by the grammaticality of targets given a preceding context. Follow-up tests confirmed the crossover interaction (opposite effects of CONTEXT in each continuum; bay–pay: β = -1.87, SE = 0.25, |z| = 7.46, p < 0.001; buy–pie: β = 0.66, SE = 0.27, |z| = 2.45, p < 0.02). Other effects are reported and discussed in Appendix C.

Figure 1.5. Mean proportion of /p/-responses to ambiguous tokens from each VOT continuum in Experiment 1.2 after noun-biasing and verb-biasing sentence contexts. Error bars represent standard error.

To test whether the influence of context on phoneme identification diminished over time, trials were recoded and split into two RT ranges (fast: mean RT = 616 ms, SD = 157 ms; slow: mean RT = 1,063 ms, SD = 423 ms) for a three-way BIAS × SPEED × VOT logistic regression, as in Experiment 1.1. This analysis provided no evidence of a BIAS × SPEED interaction (p > 0.86; see Figure 1.6). That is, the effect of the syntactic manipulation on the rate of /p/-responses did not diminish between faster responses (/p/-congruent: 60.6%; /b/-congruent: 40.6%) and slower responses (/p/-congruent: 61.3%; /b/-congruent: 41.9%). Other effects are discussed in Appendix C.

Figure 1.6. Mean proportion of /p/-responses in fast and slow responses to ambiguous tokens in /p/-congruent ("...to /?ei/" and "the /?ai/") and /b/-congruent ("...the /?ei/" and "to /?ai/") conditions in Experiment 1.2. Unlike Experiment 1.1, the BIAS effect in Experiment 1.2 is as strong in slow responses as in fast responses (see main text). Error bars represent standard error.

1.3.3. Discussion

The goal of Experiment 1.2 was to examine contextual influences on the identification of ambiguous targets using a task designed to elicit phonemic decisions. Of particular interest was the time course of such context effects, and especially whether these results would differ from previous word identification experiments (van Alphen & McQueen, 2001; Tuinman et al., 2014) and from Experiment 1.1, in which it was unclear whether subjects were engaging in word or phoneme monitoring strategies. Experiment 1.2, like Experiment 1.1, showed that syntactic context has a robust effect on subjects' responses (including their fastest responses). However, unlike Experiment 1.1, Experiment 1.2 showed that this contextual bias on phoneme identification was as strong in slow responses as it was in fast responses.
Van Alphen and McQueen (2001) argue that if the fast-arising bias in phoneme responses was the result of lexical feedback to pre-lexical representations (as hypothesized by interactive models), then "there should have been a similar shift (if not a stronger one as more time elapsed with more feedback) in slow responses" (p. 1069). Indeed, results of Experiment 1.2 are consistent with these predictions and thus support the view that pre-lexical representations are modulated by lexical feedback, as hypothesized by interactive models. Additionally, the results of Experiment 1.2 suggest that tasks in which word identification decisions are required or may be used strategically by participants may not tap pre-lexical representations.

1.4. General Discussion

The present work aimed to evaluate the validity of claims that the time course of context effects on speech perception is incompatible with interactive models of speech perception (cf. van Alphen & McQueen, 2001; Tuinman et al., 2014). According to this view, context effects within an interactive system should remain stable or grow over time, but they should not become weaker, because biasing feedback from lexical representations irreversibly overwrites an initially ambiguous representation of the acoustic input. Once pre-lexical representations are altered by top-down modulation, there is no recovering the record of the ambiguous signal, so the size of the bias effect should not diminish. Crucially, this logic is predicated on the assumption that subjects' responses reflect activation levels of pre-lexical representations, an assumption that Experiments 1.1 and 1.2 examined more closely.

Two key results emerged from Experiments 1.1 and 1.2. Firstly, as already discussed, the results of Experiment 1.2 suggest that when an experimental task is designed to tap into pre-lexical processing, contextual influences on speech perception are robust and persistent over time. Secondly, the biasing effects from a preceding sentential context arose very rapidly in both experiments. In the present studies, the button-press that would represent a contextually congruent response depended on the target stimulus itself, so it was impossible for a crossover interaction to emerge unless subjects waited to hear the target stimuli. In other words, the fact that subjects' fast responses were biased suggests that the processing of ambiguous speech is influenced by the grammatical properties of competing words, that this influence is virtually immediate, that this rapid influence is not attributable to pre-planned responses, and that top-down expectations rapidly propagate to bias both lexical and pre-lexical representations. As such, our data provide strong evidence for immediate top-down effects from sentence context on speech perception. Taken together, these results suggest that the time course of contextual effects on phoneme recognition does not, in fact, challenge the interactive modeling framework. Next, we consider the extent to which these results actually support interactive models, and how they constrain autonomous models.

1.4.1. Implications for Interactive Models of Speech Perception

Despite the fact that subjects responded to the same stimuli in Experiments 1.1 and 1.2, the time course of the contextual bias effect differed between experiments, with Experiment 1.1's results matching the pattern obtained by word identification tasks (van Alphen & McQueen, 2001; Tuinman et al., 2014).
An important question is why lexically-driven and phonemically-driven responses would generate different patterns with respect to the time course of context effects. Recall that in TRACE, outputs in n- alternative forced-choice tasks (such as phoneme monitoring/identification or visual 30 world eye-tracking) are generated probabilistically from among a set of alternatives that is identified depending on the task and stimuli (cf. Luce, 1959). TRACE’s decision model keeps track of a running average of activation levels of the alternatives, which are nodes in a single layer of the model (cf. McClelland & Rumelhart, 1981). For instance, in a phoneme identification task, two units in the Phoneme layer (e.g., /b/ and /p/) constitute the output alternatives that are tracked (McClelland & Elman, 1986; McClelland, 1987), while in word recognition or visual world eye-tracking tasks, nodes in the Word layer (e.g., bear and pear) are identified as the output alternatives (e.g., Allopenna, Magnuson & Tanenhaus, 1998; Dahan, Magnuson & Tanenhaus, 2001; Dahan, Magnuson, Tanenhaus & Hogan, 2001; Magnuson, Dixon, Tanenhaus & Aslin, 2007; Magnuson, Tanenhaus, Aslin & Dahan, 2003; McMurray, Tanenhaus & Aslin, 2002). Since different tasks dictate that outputs reflect activation dynamics in different layers of the model, it is not surprising to observe unique patterns of results for word vs. phoneme identification tasks. Nevertheless, TRACE (or any other model) still must explain why Word-level activations become less biased over time even though biased Phoneme-level activations persist. This question can only be fully addressed once models of speech perception incorporate sentence-level representations (cf. Strand, Simenstad, Cooperman & Rowe, 2014). However, it seems likely that an interactive model could capture this difference. A model with a relatively strong stabilizing force (i.e., decay) at the lexical level or a quickly decaying influence of lexical expectation from “supra-lexical” levels of representation would predict a transient context effect in word identification responses. Notably, Word-level decay is the strongest decay parameter in TRACE (McClelland & 31 Elman, 1986). Meanwhile, as van Alphen and McQueen (2001) suggest, pre-lexical representations that are modulated by lexical feedback may not recover from the top- down biasing influence (or at least may not recover as quickly as the lexical representations), leading to more persistent context effects on phoneme identification responses (as in Experiment 1.2). 1.4.2. Implications for Autonomous Models of Speech Perception It is less clear to what extent the results of Experiment 1.2 challenge autonomous models of speech perception. Van Alphen and McQueen (2001) note that “if...sentential context effects are the result of a decision bias, predictions about their time course are much less clear” (p. 1059). As they acknowledge, their verbal account (Coltheart, Rastle, Perry, Langdon & Ziegler, 2001; Magnuson, Mirman & Harris, 2012; Mirman & Britt, 2013) of sentential context effects on word recognition could predict many patterns of results, including the one their experiments suggest. Nonetheless, the present results do challenge one theoretical proposal they offer. In their data, responses in the slowest RT range failed to show a significant effect of preceding context. 
They ultimately interpret this as consistent with a model in which context effects on subjects' decisions are time-limited such that sentence-level processing can only bias responses while the syntactic parse of the sentence remains ambiguous (see also Mattys, Melhorn & White, 2007). However, the persistent top-down effects on subjects' responses in Experiment 1.2 challenge the view that sentence context has a time-limited effect (see also Bicknell, Jaeger & Tanenhaus, 2015; Bicknell, Tanenhaus & Jaeger, 2015; Connine, Blasko & Hall, 1991; Szostak & Pitt, 2013). In fact, even when the diminishing bias effect was replicated (in Experiment 1.1 and by Tuinman et al., 2014), the slow responses were still significantly biased. Thus, any autonomous model of speech perception must be able to account for long-lasting contextual biases on both phoneme and word identification responses.

1.4.3. Predicting Behavior with Computational Models of Speech Perception

The present work underscores the need for explicit, testable computational models bridging speech perception and sentence processing, while also highlighting the importance of appropriately interpreting existing models' predictions. Neither TRACE nor Merge makes any predictions about sentential context effects. Indeed, McClelland and Elman (1986) conceded this point when they introduced TRACE, explicitly leaving the question to future research: "We have not yet included...higher level contextual influences in TRACE, though of course we believe they are important" (p. 60). In order to understand the mechanisms underlying speech perception in naturalistic environments, it is necessary to develop and test models in which spoken sounds/words are accompanied by a rich linguistic and non-linguistic context.

1.5. Conclusion

Interactive and autonomous models represent two powerful theories, each of which is capable of explaining a great deal of data regarding speech perception. Despite decades of arguments, rejoinders, clarifications, and revisions, they have proven difficult to distinguish. Given the persistence of the debate and the considerable power of each modeling framework, some have wondered whether any unique predictions exist that might settle the question (Cutler et al., 1987; Pitt & Samuel, 1993). The goal of this study was to assess whether the time course of context effects on speech perception is incompatible with interactive models, as has been proposed. Our results suggest that this claim is unwarranted. Indeed, when an experiment is designed to tap into pre-lexical processing dynamics, the persistent influence of top-down lexical feedback can be observed. Our findings, therefore, indicate that there is even less evidence against interactive models of spoken word recognition than previously thought. On the other hand, while the present results provide some constraints for autonomous models, they cannot entirely rule out such a framework. Ultimately, it is clear that the lack of overt, testable, divergent predictions represents a fundamental barrier to resolving this long-running theoretical debate. Given this critical gap, we believe that the development of well-constrained, explicit computational models must play a central role in future research investigating the mechanisms underlying spoken word recognition.

1.6. Overview of Next Steps

Unfortunately, as noted earlier, neither TRACE nor Merge is designed to explicitly model the role of sentential context in spoken word recognition.
In fact, as we will discuss in Chapter 2, despite strong evidence illustrating context effects (as seen in the experiments reported here in Chapter 1), the influence of sentential context on speech perception is poorly characterized in existing psycholinguistic models. In order to address this major gap, we now turn to the task of developing a computational model of speech perception capable of explaining top-down effects from a word’s sentence context. As we will further show in Chapter 3, the model developed in Chapter 2 also provides a straightforward, natural account for the enormous variability in the size of top-down effects across different subjects, stimuli and experiments – an issue that has been almost completely ignored by previous models of speech perception. Finally, as we will discuss in Chapters 3 and 4, developing such a model promises to generate novel, testable 34 predictions that can elucidate properties of the cognitive and perceptual processing underlying auditory language comprehension in both healthy adults and patients with brain damage. 35 Chapter 2 Bayesian Integration of Acoustic and Sentential Evidence in Speech: The BIASES Model of Spoken Word Recognition in Context 2.1. Introduction 2.1.1. Brief Introduction In order to comprehend spoken language, a listener must ultimately map a perceived acoustic waveform onto some meaningful interpretation of the speaker’s message. The processing that underlies this complex phenomenon can be thought of as consisting of at least three subroutines: pre-lexical speech processing, spoken word recognition, and auditory sentence comprehension. While all three of these are, of course, critical for arriving at an accurate, contextualized understanding of the meaning behind some speech that reaches a listener, spoken word recognition occupies a critical juncture between the lower-level perceptual processing of a signal and the higher-level processing by which listeners recruit linguistic and world knowledge to construe the meaning of the words they identify. Because of the crucial role words play as the gatekeepers of meaning, characterizing the computations involved in the recognition of spoken words is of critical importance (for recent reviews, see Mattys, 2013; Maguson, Mirman & Myers, 2013; Tanenhaus, 2007). The fundamental computational problem (Marr, 1982) that must be solved in order to recognize words in speech is to infer which word (or sequence of words) was most likely to have been produced by the speaker. This decoding of the perceived speech stream would be trivial if words were consistently produced with a single acoustic form, any given acoustic signal corresponded to exactly one word, and perception always took 36 place under noise-free conditions. However, none of these facts is true of speech perception in the real world. Quite to the contrary, the number of ways in which noise, ambiguity, and uncertainty can compromise processing is daunting. Nonetheless, despite all of the potential barriers to successful mapping of signal to meaning, healthy listeners rarely experience difficulties in speech processing. Even when speech is intentionally degraded in the laboratory to tax the perceptual system, sounds and words are typically perceived with high accuracy (e.g., Luce & Pisoni, 1998; Cutler, Weber, Smits & Cooper, 2004). Indeed, listeners often fail to notice when a segment in a word is replaced with white noise or a cough (Warren & Obusek, 1971), and, when they do, it seldom impedes comprehension. 
What accounts for this robustness in the face of such pervasive degradation? A complete understanding of this issue requires, minimally, answering two fundamental questions: what cues are available to the listener, and how are these cues leveraged in order to overcome the ambiguity inherent in the input. Broadly, investigations of the first question – which cues do listeners utilize during speech recognition – have shown that listeners integrate both bottom-up (sensory-based) and top-down (knowledge-based) cues (for review, see Samuel, 2011). That is, in addition to leveraging bottom-up acoustic cues derived from perceptual processing of the speech signal such as voice-onset time and formant values, listeners also exploit cues that require higher-level cognitive processing of the signal. For instance, listeners are sensitive to whether or not different potential interpretations (e.g., goat vs. coat) of some speech input are sensible given the preceding sentential context (e.g., The busy farmer hurried to milk the...) (Borsky, Tuller & Shapiro, 1998). 37 The second question – how the various cues are integrated and come to influence speech recognition – is the focus of the present work. This wide-ranging question has motivated a host of theoretical and computational models focusing on different aspects of speech processing from how multiple distinct bottom-up cues are weighted (e.g., Massaro & Oden, 1980; Nearey, 1990, 1997; Oden & Massaro, 1978; Repp, 1982, 1983; Toscano & McMurray, 2010) to how multisensory information sources are combined (e.g., Diehl & Kluender, 1989; Fowler & Rosenblum, 1991; Kluender, 1994; Massaro, 1987; McGurk & MacDonald, 1976; Ostrand, Blumstein & Morgan, 2011; Rosenblum, 2005) to what mechanisms give rise to lexical biases on word recognition (Elman & McClelland, 1988; McClelland, Mirman & Holt, 2006; McQueen, Norris & Cutler, 2006; McQueen, Jesse & Norris, 2009; Norris, McQueen & Cutler, 2000). However, one major theoretical gap that remains concerns how top-down information available from a sentence context is integrated with bottom-up cues. This gap is especially conspicuous given that everyday speech rarely features words produced and perceived in isolation, and sentential context has consistently been shown to impact the recognition of spoken words (e.g., Borsky et al, 1998; Lieberman, 1963; Warren & Warren, 1970). The present work aims to narrow this gap by advancing one approach to modeling the integration of top-down and bottom-up cues during speech perception. In particular, we argue that viewing speech perception through the lens of Bayesian cue integration provides a powerful, principled framework to understand a wide range of behavioral data. To this end, we outline the issues addressed in this chapter, which is organized into three parts. 2.1.2. Overview of Chapter 2 38 First, we address the question of why this gap exists at all. We discuss several practical and theoretical bottlenecks associated with representing sentential context and modeling the mechanisms by which context might come to influence spoken word recognition. Second, we argue that these challenges motivate the specification of a computational-level (Marr, 1982) model of spoken word recognition capable of explicitly integrating bottom-up and top-down information sources. 
We present a Rational Analysis (Anderson, 1990) of spoken word recognition in context and propose that a Bayesian modeling approach may offer key insights into the information processing that underlies spoken language processing. Third, we introduce BIASES (short for Bayesian Integration of Acoustic and Sentential Evidence in Speech), a Bayesian model of spoken word recognition in context. BIASES is a novel, flexible computational framework for simulating human behavior in word recognition tasks and for testing psycholinguistic theories about how bottom-up and top-down information sources are represented and integrated by listeners. Adopting a model like BIASES involves embracing three basic assumptions: (1) that listeners are sensitive to fine-grained acoustic properties of spoken words; (2) that they are also sensitive to fine-grained differences in the chances of encountering different words in a given sentence context; and (3) that, when identifying spoken words, they integrate these information sources with consideration for the relative reliability of each available cue. We review the robust empirical evidence that supports these assumptions, and, in turn, the Bayesian approach to spoken word recognition.

2.2. Sentential Context and Connectionist Models of Spoken Word Recognition

Existing computational models of spoken word recognition have not directly addressed how a word's processing is influenced by its sentential context. Before motivating the present model, it is worth reviewing the evidence under consideration and possible explanations for why current models fail to account for this evidence.

2.2.1. Modulation of Spoken Word Recognition by Sentential Context

Several decades of research have made it clear that the recognition of a spoken word is not independent of its context. Words that are unintelligible when presented in isolation can be readily identified in context (Lieberman, 1963; Pickett & Pollack, 1963; Hunnicutt, 1985; Fowler & Housum, 1987), and prior exposure to an acoustically clear prime sentence improves listeners' recognition of conceptually related words in a subsequent acoustically degraded sentence (Guediche, Reilly & Blumstein, 2014). Moreover, in addition to facilitating recognition of speech in noise, a word's context can shape a listener's interpretation of spoken words that are ambiguous between two or more words in their language. For instance, words that have been digitally manipulated to replace a critical speech segment with extraneous non-speech acoustic material (such as a cough or white noise) are more often recognized as words that are consistent with the context in which the word was presented: /*il/, where * represents the digitally substituted non-speech sound, may be identified as wheel, heel, peel or meal depending on other words appearing in the same sentence (e.g., axle, shoe, orange or table) (Warren & Warren, 1970; Warren & Sherman, 1974).

Furthermore, related evidence shows that such contextual effects are not restricted to the restoration of phonetic information that is missing altogether. This line of work utilizes fine-grained acoustic manipulations of phonetically relevant parameters of natural speech tokens to render the resulting stimuli ambiguous between two possible words. When these stimuli are presented in sentences that are consistent with one word or the other, they tend to be perceived as the word that is congruent with the context in which they are embedded.
For example, subjects are more likely to identify a phonetically ambiguous stimulus between goat and coat as goat when it follows a sentence like The busy farmer hurried to milk the... than when it follows sentences like The careful tailor stopped to button the... (Borsky, Tuller & Shapiro, 1998). Such biases have been widely corroborated, whether the manipulated contextual constraints operate at the semantic (Borsky et al., 1998; Garnes & Bond, 1976; Miller, Green & Schermer, 1984; Connine, 1987; Guediche, Salvata & Blumstein, 2013), syntactic (Fox & Blumstein, in press; Tuinman, Mitterer, & Cutler, 2014; van Alphen & McQueen, 2001), morphological (Martin, Monahan & Samuel, 2011), or pragmatic level (Rohde & Ettlinger, 2012; Do, 2011).

With such strong evidence for contextual effects on spoken word recognition, it is somewhat surprising that word recognition models have thus far offered no explicit account for these data. The treatment of sentential context in most existing spoken word recognition models can generally be classified into four categories. First, some models ignore the role of sentential context, focusing on other aspects of spoken word recognition (e.g., PARSYN: Luce, Goldinger, Auer & Vitevitch, 2000; ARTWORD: Grossberg & Myers, 2000; Merge: Norris, McQueen & Cutler, 2000; LAFF: Stevens, 2002). A second group of models explicitly leaves the question to future research (e.g., LAFS: Klatt, 1979; TRACE: McClelland & Elman, 1986; MINERVA 2: Goldinger, 1998; Hintzman, 1986). A third set of models asserts that incorporating sentential context would be a "straightforward" extension of the more basic model (e.g., NAM: Luce & Pisoni, 1998; Shortlist: Norris, 1994; Shortlist B: Norris & McQueen, 2008; SpeM: Scharenborg, Norris, ten Bosch & McQueen, 2005; see also Norris, McQueen & Cutler, 2015). Finally, some theories about what role sentential context might play in speech recognition have been presented (e.g., Logogen: Morton, 1969; Race: Cutler & Norris, 1979; Cohort: Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978). Several members of this set – most notably the Cohort model – were prominent theories that guided early research on spoken word recognition, and, although they have been abandoned in light of empirical challenges to some of their specific claims, the principles they embodied (e.g., graded activation, competition, autonomous vs. interactive model architectures) remain influential today.

However, with respect to the subject of the present work – sentential influences on spoken word recognition – this fourth group exemplifies what has probably been the most common treatment of the issue. That is, many theories have relied on "verbal models" that describe some aspects of the processing that is likely implicated (or, just as often, the sorts of processing that might be precluded; cf. Shillcock & Bard, 1993; Tanenhaus, Leiman & Seidenberg, 1979; Tanenhaus & Lucas, 1987) during sentence-level speech processing. However, these models have a number of disadvantages, chief among them being that they are often incompletely specified. While verbal models are critical to theory development and are useful for generating and testing many predictions, it is often difficult or impossible to assess a theory's adequacy or viability if it is not mathematically or computationally implemented, and it is even more difficult to compare its predictions to another competing theory (see, e.g., Magnuson, Mirman & Harris, 2012).
In short, theories and models falling into the third and fourth categories above leave much work to be done. The lack of a comprehensive model of spoken word recognition in context is probably attributable to a number of factors. For one, many difficult and important questions can be (and have been) explored without the additional complication of modeling what are logically more abstract representations and cognitive functions (e.g., the composition of meaning). However, the exclusion of sentence-level information in existing models of speech perception can also be traced to major challenges presented by the predominant modeling approach to examining higher-level influences on word recognition. 2.2.2. Challenges in Modeling Context Effects on Spoken Word Recognition Many of the most influential models of spoken word recognition, including TRACE (McClelland & Elman, 1986), Shortlist (Norris, 1994), and Merge (Norris, McQueen & Cutler, 2000), are based on interactive activation networks (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1981, 1982). In such models, cognitive representations are connected to one another in a network, with each representation characterized by some amount of activation. Activation propagates through the network as a function of the connections between representations and the sensory input presented to the network. In localist connectionist models of spoken word recognition, each representation is designated by a node that stands for a linguistically-relevant unit (e.g., a word or a phoneme), and these nodes are organized into layers (cf. McClelland & Rumelhart, 1986; Page, 2000). The nodes within a given layer represent mutually exclusive hypotheses (cf. Smolensky, 1986) about which linguistic units might be 43 (underlyingly) present within a given speech signal. For instance, the words goat and coat would be represented as unique nodes in the Word layer of a model because a spoken word may be an exemplar of goat or coat, but not both. Although the exact details of how units are connected differ from model to model, nodes within a layer are typically connected via inhibitory connections, while mutually consistent linguistic units (e.g., a word-initial /g/ in the Phoneme layer and goat in the Word layer) would be connected via excitatory connections. Critically, though, if some node A is connected to some other node B via an excitatory connection, then when node A increases in activation, node B will also tend to increase in activation. On the other hand, if the connection from node A to node B is inhibitory, then when node A increases in activation, node B will tend to decrease in activation (for a recent review of interactive activation models in speech perception, see McClelland, Mirman, Bolger & Khaitan, 2014). 2.2.2.1. Challenges of Modeling Context Effects: Representing Context Given this modeling framework, it is not immediately clear how one should incorporate sentential context. Perhaps the most obvious question is: how should sentence-level information be represented? In order to capture semantic context effects (e.g., more GOAT responses after sentences about milking than buttoning), semantic relationships among words must somehow be incorporated into the model. This might be made possible by constructing a layer of semantic features that has excitatory connections to some nodes in the Word layer based on the words’ meanings (Chen & Mirman, 2012; Cree, McRae, McNorgan, 1999; Rogers & McClelland, 2004). 
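To make the activation dynamics just described concrete, the sketch below simulates a toy localist network of this kind: a single hypothetical context node (standing in for whatever semantic-feature or sentence-level representation might support goat) sends an excitatory connection to one word node, and the two word nodes inhibit one another. The node set, connection weights, and parameter values are arbitrary illustrations, not the actual architecture or parameters of TRACE or of any other published model.

import numpy as np

nodes = ["context", "goat", "coat"]
# weights[i, j]: connection from node j to node i (+ excitatory, - inhibitory)
weights = np.array([
    [0.0,  0.0,  0.0],   # the context node receives no input in this sketch
    [0.3,  0.0, -0.4],   # goat: excited by context, inhibited by coat
    [0.0, -0.4,  0.0],   # coat: inhibited by goat (no contextual support)
])

rest, decay, a_min, a_max = -0.1, 0.1, -0.2, 1.0
a = np.full(3, rest)
a[0] = 0.8                                     # clamp the context node "on"

for t in range(50):
    net = weights @ np.clip(a, 0, None)        # only positive activation propagates
    delta = np.where(net > 0, net * (a_max - a), net * (a - a_min))
    a[1:] = a[1:] + delta[1:] - decay * (a[1:] - rest)   # context stays clamped
    a = np.clip(a, a_min, a_max)

print(dict(zip(nodes, a.round(3))))            # goat ends up more active than coat

Even this toy network shows the qualitative behavior at issue: the contextually supported word gains activation while its competitor is suppressed, and every such outcome depends on the particular wiring and parameter values chosen.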
Alternatively, it might be possible to ignore semantic features altogether and, instead, connect word units such that words can excite other related words based on semantic associativity norms (Deerwester, 44 Dumais, Fornas, Landauer & Harshman, 1990; Dumais, 2004; Landauer & Dumais, 1997) or other measures (Fellbaum, 1998; Miller, 1995; Miller, Beckwith, Fellbaum, Gross & Miller, 1990; Miller & Fellbaum, 1991). However, it is not clear which of these possibilities (or what other solution) is more consistent with the organization of listeners’ semantic knowledge (cf. Andrews, Vigliocco & Vinson, 2009; De Deyne & Storms, 2008; Riordan & Jones, 2011; Steyvers & Tenenbaum, 2005), nor is it clear from existing data which model would best explain context effects in the domain of word recognition. Moreover, while some such model enhancement might be able to capture semantic context effects, explaining syntactic and pragmatic context effects would require the addition of more connections and/or layers (cf. McClelland, St. John & Tarban, 1989; Recchia, Sahlgren, Kanerva & Jones, 2015; Rohde, 2002; St. John & McClelland, 1990; Strand, Simenstad, Cooperman & Rowe, 2014). 2.2.2.2. Challenges of Modeling Context Effects: Activation Dynamics Even if this the issue of representation could be solved, it is not straightforward to merely add more connections to the architecture of an existing activation-based model because it is not clear how contextual information should come to influence words’ activation levels. Adopting the same activation dynamics assumptions used in existing models, the effect of sentential context might be excitatory (for words that are supported by the context). On the other hand, the effects could be implemented via inhibitory connections, essentially ruling out words that are not supported by the context (Marslen- Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978; for a similar approach in spoken word production, see Dell, Oppenheim, & Kittredge, 2008). Alternatively, rather than directly altering words’ activation levels, sentential context might induce adjustments to 45 words’ propagation threshold (a parameter governing the amount of activation required before a node’s activation begins to influence other nodes in the network) or their activation gain (a parameter governing how easily words become more activated). There is no a priori reason to believe that one of these mechanisms is more likely than any other, so additional assumptions and free parameters are needed. Furthermore, what works for modeling semantic context effects, where contextual cues (e.g., milked) tend to support specific words (e.g., goat), may not be able to capture syntactic context effects, where contextual cues (e.g., the) tend to support categories of words (e.g., nouns), but not specific words (see Fox & Blumstein, in press). 2.2.2.3. Challenges of Modeling Context Effects: Representing Time A third issue that makes it difficult to model sentential influences on word recognition in connectionist models arises from the way activation dynamics transpire over time as the speech signal unfolds. Recall that nodes in the same layer are generally considered to be mutually inhibitory: when more than one word is partially activated, the most activated representation(s) tends to crowd out other active nodes (Thomas & McClelland, 2008). 
This architectural feature is almost universally true of the Word layers of localist spoken word recognition models (Gaskell, 2007), as it allows activation- based models to account for competition among multiple lexical candidates during recognition (Allopenna, Magnuson & Tanenhaus, 1998; Andruski, Burton & Blumstein, 1994; Frauenfelder & Floccia, 1998; Gaskell & Marslen-Wilson, 1999, 2002; Magnuson, Dixon, Tanenhaus & Aslin, 2007; McMurray, Tanenhaus & Aslin, 2002; McMurray, Tanenhaus, Aslin & Spivey, 2003; McMurray, Tanenhaus, Aslin, Spivey & Subik, 2008; McQueen, Norris & Cutler, 1994; Norris, McQueen & Cutler, 1995; Righi, Blumstein, 46 Mertus & Worden, 2009; Utman, Blumstein & Burton, 2000; Vitevitch & Luce, 1998, 1999). However, this crucial architectural feature becomes problematic when the model is scaled up in order to account for sentential context effects, where multiple words should become activated in sequence. In the Merge model (Norris et al, 2000), for instance, activation of a lexical representation at one point in time will tend to suppress activation levels of other words at future points in time. Even if segmentation of the speech signal into words is taken for granted, it is clear that adapting Merge to account for sentential context effects would depend on how context is represented and how it comes to influence words’ activation levels. TRACE (McClelland & Elman, 1986), on the other hand, deals with time in a very different way, replicating the entire network for each time slice, so the inhibitory word-word connections only act on words that begin at the same point in time. From its inception, the implausibility of this aspect of TRACE’s architecture has been widely acknowledged by the model’s proponents and opponents alike (McClelland, Mirman & Holt, 2006; Norris, 1994) because of the enormous number of nodes required to model continuous speech. Therefore, modeling context effects by adding additional connections between associated words at different time-points or adding additional layers replicated for every time-point along with the rest of TRACE, would only exacerbate this problem. Meanwhile, Shortlist’s (Norris, 1994) representation of time is based on time- delayed recurrent neural networks (Elman 1990; Norris, 1988, 1990, 1993), which occupy a middle ground between the drawbacks of Merge and TRACE. Still, Norris’ (1994) brief discussion of how Shortlist might be adapted in order to account for the 47 various sentential context effects observed in the literature does not directly address either the representation of context or how the dynamic modulation of activation in the time-delayed network would function. 2.2.2.4. Context Effects Without Connectionist Models While all of these issues stem from important questions about spoken word recognition, they also pose significant challenges that even the most successful existing models have, so far, not addressed. Ultimately, however, these hurdles arise directly from the first choice made at the inception of each model: the choice to adopt a connectionist framework. Existing models were designed to address the architecture of a system supporting isolated word recognition, and adapting these architectures to solve a distinct computational problem – recognizing spoken words within the rich context of natural language – is a limiting approach when it comes to modeling context effects. 
As an alternative, it is possible to characterize the information processing architecture that must underlie any explanation of sentence-level influences on the recognition of spoken words, while acknowledging that there might be many possible architectures (representational systems, activation/dynamical assumptions, and implementations of time-varying input and processing) that could achieve the necessary computations (Marr, 1982). In the present work, we take this alternate path, analyzing the computational problem associated with word recognition in context from the beginning. 2.3. A Computational-Level Analysis of Spoken Word Recognition A useful starting point for a computational analysis of spoken word recognition is with a rational analysis (Anderson, 1990), wherein a cognitive system is considered with respect to the system’s goals, the environment in which the system must operate, and the 48 computational limitations of the system (Anderson, 1991). Anderson’s (1990) Principle of Rationality presumes that a cognitive system is optimized with respect to these factors, so, to the extent that some data do not fit the rational model’s predictions, these discrepancies will suggest that the modeler’s original assumptions about the system’s goals, environment, or limitations were inaccurate. These inconsistent data, in turn, guide the updating of a rational model’s initial assumptions. 2.3.1. Bayesian Models of Spoken Word Recognition Recent years have seen a notable rise in the application of rational analysis to questions in speech perception. Feldman, Griffiths and Morgan (2009) presented a rational analysis of speech sound perception and categorization, showing that the listeners’ discrimination and perceptual classification of vowel tokens could be explained by assuming their behavior reflected optimal (Bayesian) perceptual inference under uncertainty. In the same vein, Shortlist B (Norris & McQueen, 2008) exemplifies the rational analysis approach to spoken word recognition, accounting for several classic effects in the psycholinguistic literature without appealing to the notion of activation at all. Although their details differ, and although only the latter model focuses on word recognition specifically, both models follow the same basic logic. Similarly, the present model of spoken word recognition in context also follows this logic, so we now turn to an outline of the foundational principles these models share. Any rational analysis of a cognitive system must begin by identifying the goal of the system. Following Norris and McQueen (2008), and as suggested at the outset of this chapter, we take the purpose of the speech recognition system to be the recovery of the word (or words) produced by the speaker. For the purpose of exposition, we limit the 49 present discussion to a special case wherein the listener’s goal is to infer the most likely single word given the perceived speech signal. How would a rational system achieve this goal? If there is only one word that could possibly have produced the perceived signal, then the optimal decision is obvious: if all other words have been ruled out, then the signal must be an exemplar of the only remaining option (Doyle, 1890). However, since such certainty is often elusive when recognizing words in the real world, how should a rational system select the most probable word given incomplete information? Under this view, spoken word recognition amounts to a specific case of a more general problem: inference under perceptual uncertainty. 
All such computational problems share the same mathematically optimal solution, which is defined by the ideal observer framework (Geisler, 2003; Geisler & Kersten, 2002). An ideal observer is one that always makes the best possible guess when identifying the likely source of some observed data, and its behavior is given by Bayes' rule (Knill, Kersten & Yuille, 1996). According to Bayes' rule (Equation 2.1), for any exhaustive set of mutually exclusive hypotheses H, the probability that any given hypothesis h_i in H is true, given some observed data d, is given by:

Equation 2.1
$$ p(h_i \mid d) = \frac{p(d \mid h_i)\, p(h_i)}{\sum_{h_j \in H} p(d \mid h_j)\, p(h_j)} $$

Because the denominator of the right side of Bayes' rule is constant over all h_j in H, Bayes' rule is often stated in its proportional form (Equation 2.2):

Equation 2.2
$$ p(h_i \mid d) \propto p(d \mid h_i)\, p(h_i) $$

The key principle embodied by Bayes' rule is that, having observed d, the so-called posterior probability of a given alternative, p(h_i | d), depends on two general classes of information: how representative d is of that alternative, and how probable the alternative (h_i) was in the first place. These two pieces of information are referred to, respectively, as an alternative's likelihood, p(d | h_i), and its prior probability, p(h_i). An ideal observer integrates these two sources of information by computing the posterior probability for each alternative in H, and ultimately selecting the alternative with the largest posterior probability. In the domain of spoken word recognition, a hypothesis is a word w_i that the speaker might have produced, so the hypothesis space is an entire vocabulary of size N_w, and the observed data is the acoustic signal A perceived by the listener. Thus, an ideal observer model of spoken word recognition is given in Equation 2.3's restatement of Bayes' rule:

Equation 2.3
$$ p(w_i \mid A) = \frac{p(A \mid w_i)\, p(w_i)}{\sum_{j=1}^{N_w} p(A \mid w_j)\, p(w_j)} $$

2.3.2. Prior Expectations in Spoken Word Recognition: Lexical Frequency

Equation 2.3 underlies the Shortlist B model presented by Norris and McQueen (2008). One of the most significant contributions of Shortlist B was a computational account of word frequency effects on spoken word recognition, a category of effects that ranks among the most robust findings throughout the psycholinguistic literature (e.g., Connine, Mullennix, Shernoff & Yellen, 1990; Dahan, Magnuson & Tanenhaus, 2001; Howes, 1954; Luce, 1986; Marslen-Wilson, 1987; Pollack, Rubenstein & Decker, 1960; Savin, 1963; Taft & Hambly, 1986). Following Norris' (2006) Bayesian Reader model of visual word recognition, Shortlist B adopts each word's relative frequency as an estimate of its prior probability p(w_i). Although this innovation is theoretically straightforward, it allowed Shortlist B to account for several classic effects, such as improved accuracy when subjects identify frequent words in noise compared to less frequent words (Luce & Pisoni, 1998).

Two observations about the role of the prior in a Bayesian model bear noting. First, a Bayesian spoken word recognizer will never "hallucinate" (cf. Norris et al., 2000) a word that bears no resemblance to the acoustic signal, no matter how frequent it may be. This follows from the fact that words that are entirely incompatible with some perceived signal are realized in the model with a likelihood p(A | w_j) = 0, and this will also entail that the posterior p(w_j | A) = 0. In the same vein, if no other words are consistent with a given acoustic signal, then even the rarest words can be clearly perceived (Doyle, 1890).
Second, even though a word's estimated frequency never changes in Shortlist B, the relative influence of the prior on subjects' behavior (as modeled by the posterior) will not be the same for all possible acoustic signals. Rather, the prior's influence on the posterior will be largest when the likelihood is most uncertain – that is, when there are many possible words that are somewhat consistent with the input. In contrast, when perceptual uncertainty is low, such that the likelihood is peaked over one or a small number of words, the same prior will be less influential on the posterior. As Norris and McQueen (2008) point out, this second observation matches findings of an interaction between word frequency and stimulus quality in word recognition accuracy data: the more degraded a stimulus is by noise, the larger the observed advantage for frequent words (Luce & Pisoni, 1998).

It is clear that Shortlist B, and the Bayesian framework more generally, offers straightforward explanations for a broad range of phenomena spanning concepts as basic as lexical frequency, neighborhood density and lexical competition, perceptual confusability, and lexical influences on speech segmentation and word recognition. However, it is just as important to observe that these effects follow automatically from the basic principles that are mathematically required by a Bayesian model. As Norris and McQueen (2008) suggest, for many of the effects examined, their model could not be made to predict anything but the established finding and still be called "Bayesian." This stands in contrast to activation-based models of spoken word recognition, which require many architectural and dynamical assumptions and whose performance depends heavily on exact parameter settings within such models (Pitt, Kim, Navarro & Myung, 2006; see Norris, 2006 for discussion). That such a well-constrained model achieves such broad empirical coverage offers strong support for the notion that spoken word recognition might reflect optimal inference in the face of uncertain input.

2.3.3. Prior Expectations in Spoken Word Recognition: Sentential Context

Despite Shortlist B's successes, the rational analysis approach to computational modeling stresses the importance of revising a model's assumptions when a model cannot account for certain data. One type of data that is not accounted for by Shortlist B is the influence of sentential context on spoken word recognition. Indeed, the model assumes that the probability of each word in a sequence is independent of any other (non-overlapping) words in a multi-word speech signal. Clearly, this assumption is not warranted. In acknowledging this fact, Shortlist B's creators suggest how a more complete Bayesian model might approach this issue: "In all of the simulations reported here, we assume that [a word's prior probability] can be approximated by the word's frequency of occurrence in the language. However, [the word's prior] will also be influenced by factors outside the scope of the present model, such as semantic or syntactic context" (Norris & McQueen, 2008, p. 362). Since words do not occur randomly in language, an optimal listener's prior expectation over which words are likely should be highly context-dependent. Just as a model assuming that all words are equally likely fails to explain effects of word frequency, a model that assumes that some word is equally likely to occur in every context will necessarily fail to explain effects of sentential context.
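Before turning to the present model, the logic of Equation 2.3, including the interaction between prior strength and perceptual uncertainty described above, can be illustrated with a minimal numerical sketch. The three-word vocabulary, "frequencies," and likelihood values below are hypothetical and chosen purely for illustration; nothing here corresponds to Shortlist B's actual implementation.

import numpy as np

def posterior(prior, likelihood):
    # Bayes' rule (Equation 2.3): normalize prior x likelihood over the candidate set
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

words = ["bay", "pay", "day"]
prior = np.array([0.2, 0.7, 0.1])            # hypothetical relative frequencies

clear_signal = np.array([0.90, 0.05, 0.05])  # likelihood sharply peaked on "bay"
noisy_signal = np.array([0.40, 0.35, 0.25])  # likelihood nearly flat across candidates

print(dict(zip(words, posterior(prior, clear_signal).round(3))))
print(dict(zip(words, posterior(prior, noisy_signal).round(3))))
# With the clear signal, the posterior follows the likelihood ("bay" wins);
# with the degraded signal, the frequency prior dominates ("pay" wins),
# mirroring the larger frequency advantage observed for degraded stimuli.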
The main goal of the present work is to relax the assumption of a context-independent prior. To do so, we begin with the basic approach of Norris and McQueen (2008), but – in the tradition of rational analysis – we update their assumptions in order to investigate whether behavioral patterns of context effects on spoken word recognition can also be explained by an ideal observer model. We also diverge from some other aspects of Shortlist B, most notably by adopting a model of words' likelihood functions that explicitly takes into account acoustic cues in the speech signal (see also Clayards, Tanenhaus, Aslin & Jacobs, 2008; Feldman et al., 2009; Feldman et al., 2013). This approach emphasizes the power of the Bayesian framework to explain lawful, fine-grained variability in how cues as disparate as sentential context and acoustic input interact during spoken word recognition.

2.4. BIASES: Bayesian Integration of Acoustic and Sentential Evidence in Speech

As already discussed, Bayes' rule describes the optimal way of combining two information sources – prior knowledge about which words a listener is likely to encounter, incorporated into the prior term p(W), and acoustic data perceived by a listener, incorporated into the likelihood term p(A | W). [Footnote: By convention, we use capital italicized letters to refer to a random variable. For instance, the prior distribution p(W) defines the prior probability of each possible state of the random variable W. That is, if there are N_w words that a listener could hear, then p(W) is a vector with N_w entries, such that each word w_j has some prior probability 0 ≤ p(w_j) ≤ 1 and the probabilities of all N_w words sum to 1.] However, it is also useful to invoke another common interpretation of Bayes' rule that is particularly applicable to modeling the effects of preceding context on spoken word recognition. As suggested by the standard nomenclature of priors and posteriors, Bayes' rule is often presented as an equation describing optimal belief updating. Put simply, if p(W) indexes a listener's set of beliefs about how likely each possible word is prior to observing the relevant data (A), then p(W | A) represents a listener's updated set of beliefs about the identity of the unknown word after integrating the new perceptual data, A. Since the posterior, p(W | A), depends on the prior, p(W), and the likelihood, p(A | W), it is intuitive that the likelihood drives the updating of a listener's beliefs. For instance, when the newly observed A is highly unlikely to be a token of a particular word (w_j), then the prior belief for w_j is revised downwards, rendering the posterior belief p(w_j | A) smaller than the prior expectation p(w_j). On the other hand, the more representative of w_j the observed signal A is, the more p(w_j) will be revised upwards, causing p(w_j | A) to gain support relative to other words. Within incremental sentence processing theories (Hale, 2001; Levy, 2008; Marslen-Wilson, 1973, 1975), this updating process can be thought of as iterative, such that, after integrating each new piece of information or at each new time-step, the newly computed posterior becomes the updated prior for the next step in time. Our model adopts this perspective in order to incorporate the influence of a preceding sentence context, C, on the recognition of a spoken word.
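A minimal sketch of this iterative updating is given below. The numbers are hypothetical, and the sketch assumes that successive pieces of evidence are conditionally independent given the word, so that each new likelihood simply reweights the current beliefs and the normalized result serves as the prior for the next step.

def update(prior, likelihood):
    # One Bayesian update: reweight the current beliefs by the new likelihood, then renormalize
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

beliefs = [0.5, 0.5]                       # initial expectations over two candidate words
evidence = [[0.7, 0.3], [0.8, 0.2]]        # likelihoods contributed by successive cues
for likelihood in evidence:
    beliefs = update(beliefs, likelihood)  # the previous posterior serves as the new prior
print([round(b, 3) for b in beliefs])      # beliefs shift progressively toward the first word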
Rather than assuming a static prior across all contexts, p(W), we assume that listeners make use of a conditional prior, p(W | C), such that their prior lexical expectations depend on the context up to that point (e.g., Altmann & Kamide, 1999; Eberhard, Spivey-Knowlton, Sedivy & Tanenhaus, 1995; Kamide, Altmann & Haywood, 2003). Upon observing some subsequent speech, A, an ideal speech recognizer should update its contextually-conditioned prior beliefs by evaluating the probability that A was a token of each possible word. Given the simplifying assumption that listeners expect words to be pronounced with roughly the same acoustic form irrespective of which words preceded them (an assumption we address in greater depth in Chapter 4), the probability that A was a token of word w_i is given by Bayes' rule (Equation 2.4):

Equation 2.4
$$ p(w_i \mid C, A) = \frac{p(w_i \mid C)\, p(A \mid w_i)}{\sum_{j=1}^{N_w} p(w_j \mid C)\, p(A \mid w_j)} $$

The model presented in Equation 2.4 serves as the basis for the remainder of this chapter. It represents a way of identifying which words were probably present in an imperfectly perceived speech signal by combining information from the preceding sentential context with subsequent acoustic cues. With this function in mind, we will refer to this model as the BIASES model, short for Bayesian Integration of Acoustic and Sentential Evidence in Speech. As we will show, the model's name also foreshadows the type of effect that it predicts should result when sentential evidence is brought to bear during word recognition. Next, we detail our implementations of the two fundamental components of BIASES: the context-dependent conditional prior, p(W | C), and the likelihood function that relates words to their acoustic forms, p(A | W).

2.4.1. Conditional Prior: A Model of Listeners' Contextual Knowledge

To define a conditional prior p(W | C) for BIASES, every lexical candidate w_i must be assigned a probability of occurrence following each possible context. A conditional prior has two basic properties. First, in general, for a given context, some words will be more expected than others. This property is what makes any prior (conditional or not) informative: if all words are equally likely in some context, then the posterior is proportional to the likelihood alone. This is clearly not the case in human language, and listeners clearly do not treat all words as being equally likely in a particular sentence context. Second, and in contrast to previous work (e.g., Norris & McQueen, 2008), different contexts will support the same word to different extents. It is this property that makes a prior conditional: p(w_i | C = c_1) need not equal p(w_i | C = c_2).

As already discussed, whereas Shortlist B employed a context-independent lexical frequency prior, a key goal of BIASES is to incorporate a conditional prior that more accurately reflects listeners' access to context-dependent lexical expectations. To do so in a computational model like BIASES, we must quantify the level of support that a given context provides for a word. It is undoubtedly the case that many factors collude to create a listener's expectation for any given word. A complete model of how context influences the probability of subsequent words would certainly depend on semantic (e.g., Borsky et al., 1998) and syntactic (e.g., Fox & Blumstein, in press) information contained within the preceding linguistic context, but it would also depend on many other information sources that are available to a listener.
For instance, a full model would need to address how listeners make pragmatic inferences about the implicatures in prior linguistic context (e.g., Rohde & Ettlinger, 2012), how listeners might employ speaker-specific and situation-specific knowledge about likely words or grammatical structures (e.g., Fine & Jaeger, 2013; Fine, Jaeger, Farmer & Qian, 2013; Kamide, 2012; Horton, 2007; van Berkum, van den Brink, Tesink, Kos & Hagoort, 2008), and how listeners treat knowledge about which words or concepts have recently been uttered in a discourse or are in common ground (e.g., Horton & Keysar, 1996), to name just a few. Quantifying the influence of such factors on subjects' lexical expectations is clearly not trivial, and doing so is beyond the scope of the current modeling effort. Instead, we focus on an admittedly limited model of context in order to illustrate the explanatory power of the BIASES model, and the Bayesian framework more generally.

2.4.1.1. Conditional Expectations from n-gram Language Models

Most modern automatic speech recognition systems operate under the same fundamental hypothesis embraced by BIASES: that a word's context-independent frequency can capture only a fraction of the prior knowledge available during word recognition. The solution implemented in these models incorporates local semantic and syntactic context via language models (Jelinek, 1990, 1997). Put simply, an n-gram language model is a conditional probability distribution over lexical candidates given the n-1 immediately preceding words. As n increases, the probability distribution over possible words is conditioned on more information, and, consequently, the conditional expectations for different lexical candidates become more fine-grained. For instance, a bigram language model p(W_t | W_{t-1}) estimates a word W_t's probability based on only the previous word, W_{t-1}, while a trigram language model p(W_t | [W_{t-2}, W_{t-1}]) estimates W_t's probability given that [W_{t-2}, W_{t-1}] preceded W_t. [Footnote: Note that trigram language models are order-sensitive. That is, in general, p(W_t | [W_{t-2}, W_{t-1}]) ≠ p(W_t | [W_{t-1}, W_{t-2}]); intuitively, a listener's expectation for the word pay is not the same after hearing wanted to and to wanted.] Intuitively, trigram language models make more specific predictions than bigram language models: fewer words are likely to follow ...hated to... than just to... On the other hand, a unigram language model (n = 1) is simply a formal definition of a lexical frequency distribution. A word's frequency p(w_i) can be computed by collapsing over all N_c possible preceding contexts via summation (referred to as marginalization; Equation 2.5).

Equation 2.5
$$ p(w_i) = \sum_{k=1}^{N_c} p(w_i \mid C = c_k)\, p(c_k) = \sum_{k=1}^{N_c} p(W_t = w_i \mid W_{t-1} = w_k)\, p(W_{t-1} = w_k) $$

The observation presented in Equation 2.5 provides a mathematical justification for Norris and McQueen's (2008) original claim that (as they reiterated later) "frequency and context have the same explanation in a Bayesian model" (Norris, McQueen & Cutler, 2015, p. 4).

2.4.1.2. Consequences of Adopting an n-gram Language Model Prior

BIASES implements a language model as its conditional prior p(W | C) for spoken word recognition. This decision has some obvious drawbacks, but also some important benefits. Under the strictest interpretation, the assumption entailed by employing an n-gram language model in this way is that all relevant information in C can be summarized by knowing the identities of the n-1 words preceding the target word.
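To make the use of such a prior concrete, the following is a minimal sketch of how a bigram conditional prior p(W_t | W_{t-1}) might be estimated from corpus counts. The toy corpus is hypothetical, and the add-alpha smoothing step anticipates the sparsity issue discussed below; real language models are estimated from corpora of millions of words, typically with more sophisticated smoothing.

from collections import Counter

corpus = "brett hated to pay . valerie hated the bay . brett wanted to pay .".split()

bigrams = Counter(zip(corpus[:-1], corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                        # counts of single words
vocab_size = len(unigrams)

def bigram_prior(word, prev, alpha=1.0):
    # Relative-frequency estimate of p(W_t = word | W_t-1 = prev), with add-alpha
    # smoothing so that unseen continuations receive a small but nonzero probability.
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

print(bigram_prior("pay", prev="to"))   # relatively high: "to pay" occurs in the toy corpus
print(bigram_prior("bay", prev="to"))   # small but nonzero, thanks to smoothing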
Clearly, as discussed earlier, such a model is severely impoverished compared to listeners’ actual contextual knowledge. That said, even a bigram language model constitutes far richer prior than Shortlist B’s unigram language model (Norris & McQueen, 2008), which cannot account for any contextual effects on word recognition. Language models might be considered among the simplest possible models capable of predicting context-specific modulation of word recognition. For example, a bigram language model would predict that, if wi tends to follow wt-1 more often than wj follows wt-1, listeners should tend to identify an acoustic signal A that is perfectly ambiguous between wi and wj as wi when the word preceding it was wt-1. To the extent that such a model might account for some aspects of human behavior, it would suggest some commonalities between the detailed, linguistically relevant information contained within a listener’s contextual knowledge and the transition probabilities between sequential words. Note that it does not follow that listeners’ models of context necessarily represent these word-by-word transition probabilities explicitly (see Levy, 2008 for discussion). This is another benefit of adopting a language model as BIASES’s model of context for the purposes of the present computational-level analysis. The choice allows us to remain theory-neutral with respect to the actual representation of context used by listeners. We regard a language model as a convenient, useful tool to summarize some fraction of the information contained within a sentential context. As naïve as language models are, evidence suggests that they predict a number of measures in language processing, such as 60 reading times and eye fixations during reading (e.g., Hale, 2001, 2006; Levy, 2008; McDonald & Shillcock, 2003a, 2003b). By implementing a language model as the conditional prior in BIASES, we aim to test whether the predictive power of such models observed in psycholinguistic studies of reading will transfer to the domain of spoken language processing. Of course, not all contextual influences on speech recognition will be explained by a standard language model. As just one example, Rohde and Ettlinger (2012) show that listeners’ ratings of a phonetically ambiguous pronoun between he and she are biased towards the presumed gender of a referent who was the most likely “causer” of some event (cf. Garvey & Caramazza, 1974; McDonald & MacWhinney, 1995; Koornneef & van Berkum, 2006). Subjects preferentially rated phonetically ambiguous pronouns as he in sentences like Noah frightened Claire because [?e] drove 100 miles per hour, but as she if the referents’ names/genders were reversed. Such an effect would be difficult to explain with a basic language model, because the result appears to rely on inferential/causal reasoning above and beyond words’ co-locational probabilities. Thus, the decision to use a language model as BIASES’s model of context will prevent us from capturing effects like this one, but it is not implausible that Bayesian models of pragmatic reasoning (Bergen, Levy & Goodman, 2014; Frank & Goodman, 2012; Franke, 2009; Goodman & Stuhlmuller, 2013; Jager, 2012) could be incorporated into a Bayesian model of speech perception like BIASES. While some sentential context effects are unlikely to find explanation in any sort of standard language model, the ability to account for other findings will depend on the precise specifications adopted for a language model. 
Indeed, most previously reported 61 semantic context effects could not be explained by a bigram language model. For example, the same word (the) immediately precedes the phonetically ambiguous target word in every stimulus sentence in Borsky and colleagues’ (1998) study showing that subjects made more GOAT responses in goat-biased than in coat-biased sentences. A trigram language model, on the other hand, would likely explain at least some of the differences between goat-biased (...milk the...) and coat-biased (...button the...) sentences, but it would also predict that the entire semantic context effect observed in the study is driven by the word that appeared two words before the target, irrespective of the rest of the context (e.g., whether the sentence featured a farmer or a tailor as its subject). Whether or not this is true, what we hope to have made clear from the preceding examples is that the ability for a prior based on a language model to account for sentential influences on word recognition will depend on the sort of language model used and the sort of context effect examined. A final observation about the consequences of selecting a language model as a prior regards a practical challenge it poses. Although more complex language models produce more fine-grained predictions, specificity of predictions trades off with sparsity of data (e.g., Katz, 1987). That is, as n increases, it becomes more difficult to estimate the probabilities associated with an n-gram language model because the number of possible contexts grows exponentially: if there are 10 words in a language, then there are 10 possible contexts in a bigram language model and 102 = 100 two-word sequences for which to estimate probabilities. For a trigram language model in the same ten-word language, one must estimate all 103 = 1,000 probabilities (each word in each of the 102 = 100 possible two-word contexts). While most “possible” three-word sequences may 62 never occur in language, some sequences that are rare but do occasionally occur in language may never occur in a given corpus from which the language model is being estimated. If the corpus were to be taken as a perfectly reliable model of language, then those sequences will erroneously be assigned a prior probability of 0, making it impossible for a Bayesian word recognizer to observe that sequence in the future. Although smoothing methods can be applied to ensure that all word sequences have some small but nonzero prior probability (Church & Gale, 1991; Dagan, Marcus & Markovitch, 1993; Good, 1953; Goodman, 2001; Katz, 1987; Jelinek & Mercer, 1985), these methods will also tend to reduce the model’s ability to predict context-specific variability. When there is little information that might differentiate between the prior probabilities of two similarly rare sequences, they will tend to be treated as equally likely. Thus, although a bigram language model will lack a great deal of information that listeners will have access to, it will also provide a reliably-estimated language model capable of capturing context-dependent response patterns when those effects tend to be driven by the word immediately preceding the target that subjects are tasked with recognizing. 2.4.1.3. BIASES’ Conditional Prior: A Bigram Language Model With these factors in mind, and acknowledging the various limitations associated with language models, we adopted a bigram language model as the conditional prior for BIASES. 
As such, only the immediately preceding word influences the prior probability of the subsequent word. Furthermore, to examine how well BIASES could fit human behavior, the experiments we conducted utilized stimuli designed to elicit sentential context effects on word recognition that were driven by the immediately preceding word 63 (Fox & Blumstein, in press). In particular, the simulations and experiments reported here evaluated the influence of different function words (to vs. the) on the identification of the next word in a sentence. Fox and Blumstein showed that subjects were more likely to recognize a phonetically ambiguous word between bay and pay as pay when it was preceded by a sentence like Brett hated to... than when it was preceded by a sentence like Valerie hated the... According to BIASES, the basic explanation for this effect is that the two critical function word contexts (to vs. the) differentially influence a listener’s prior expectations for immediately subsequent target words (bay vs. pay). As we will show, despite this simplistic model of listeners’ contextual knowledge, BIASES does remarkably well at predicting subjects’ behavioral responses, including the overall pattern, patterns of subject-by-subject variability, and several other fine-grained quantitative predictions. We also conduct another set of simulations to examine the extent to which a richer model of context might account for additional, even more fine-grained context-specific patterns in our empirical results. 2.4.1.4. Additional Constraints on Prior Expectations: Forced-Choice Our simulations invoke one other piece of contextual information that we assume subjects exploit. Since subjects receive a set of instructions before performing the experimental task in the laboratory, these instructions further constrain the conditional prior model of context that listeners use while recognizing words in the study. Specifically, we assume that, once instructed to identify the target word as either bay or pay, subjects assign all other words a prior probability of 0. This same assumption is almost universally implicit in other models of spoken word recognition. For instance, in TRACE (McClelland & Elman, 1986), responses during multiple-alternative forced- 64 choice tasks (such as phoneme or word identification experiments) are generated probabilistically from among a set of alternatives that is identified based on the task and stimuli (cf. Luce, 1959; McClelland & Rumelhart, 1981). By reading activation levels out from only a few “clamped” response alternatives, TRACE’s decision model has the effect of nullifying any prior probability of a response from any other words outside of the predefined set. Similarly, the output nodes of other models (e.g., Shortlist: Norris, 1994; Merge: Norris et al, 2000; Shortlist B: Norris & McQueen, 2008) are pre-specified “on- the-fly” based on task demands. Although they are not strictly driven by theoretically interesting assumptions about human cognition, it makes sense that computational models of human behavior should account for such exogenous factors, and this is especially true for ideal observer models. As Norris (2006) puts it, which model will produce optimal behavioral responses “is critically dependent on the precise specification of the task or goal” (p. 330). 
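A minimal sketch of how such a task-based restriction of the prior might be expressed follows; the probabilities below are placeholders, not estimates used by BIASES.

```python
def restrict_prior(prior, response_set):
    """Clamp the prior probability of every word outside the forced-choice response set
    to zero, then renormalize over the remaining candidates."""
    restricted = {w: (p if w in response_set else 0.0) for w, p in prior.items()}
    total = sum(restricted.values())
    return {w: p / total for w, p in restricted.items()}

# Illustrative contextual prior over a slightly larger vocabulary.
prior_after_to = {"pay": 0.020, "bay": 0.00004, "be": 0.030, "go": 0.025}
print(restrict_prior(prior_after_to, {"pay", "bay"}))
# All prior mass now falls on 'pay' and 'bay', in the same ratio as before.
```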
With this additional constraint, the summation over all possible words in the denominator of Equation 2.4 can be simplified to the sum of two terms: one proportional to the posterior probability of pay given C and A, and the other proportional to the posterior probability of bay. Equation 2.6 incorporates this assumption, giving the posterior probability of pay, where w1 = pay and w2 = bay.

Equation 2.6
$$p(w_1 \mid C, A) = \frac{p(w_1 \mid C)\, p(A \mid w_1)}{p(w_1 \mid C)\, p(A \mid w_1) + p(w_2 \mid C)\, p(A \mid w_2)}$$

The posterior probability distribution in Equation 2.6 gives the expected rate with which a subject should identify an acoustic stimulus as pay in a given context, if that subject were optimally combining the information sources we are assuming. To the extent that subjects deviate from this behaviorally, Anderson's Principle of Rationality (1990) would demand that we update our assumptions. Note that the expected posterior probability of subjects making a BAY response, p(w2 | C, A), is equal to 1 − p(w1 | C, A).

Finally, as pointed out by Feldman and colleagues (2009), in the case of modeling two-alternative forced choice, the posterior in Equation 2.6 can be rewritten to take the form of a logistic function. By dividing both the numerator and denominator of the right side of Equation 2.6 by the quantity in the numerator and applying inverse functions (exponentiation and the natural logarithm), Equation 2.6 can be rewritten as shown in Equation 2.7:

Equation 2.7
$$p(w_1 \mid A, C) = \frac{1}{1 + e^{-\left[\log\frac{p(w_1 \mid C)}{p(w_2 \mid C)} + \log\frac{p(A \mid w_1)}{p(A \mid w_2)}\right]}}$$

2.4.1.5. Implementing BIASES' Prior: Corpus Estimates, Smoothing

An advantage of modeling subjects' responses in a two-alternative forced choice word identification task is that the implementation of BIASES' prior is quite flexible – flexible enough, in fact, that p(w1 | C) and p(w2 | C) need not actually be proper probabilities at all. To see this, one need simply note that the influence of the prior, log[p(w1 | C)/p(w2 | C)], depends only on the ratio of the prior probabilities of w1 and w2. A consequence of this is that their probabilities could just as easily be replaced by numeric values that are proportional to the words' relative prior probabilities. Because the prior in this implementation of BIASES is estimated from a bigram language model, counts from a corpus of the number of times w1 and w2 follow C in sequence will suffice (η(C, w1) and η(C, w2), respectively; see Equation 2.8).

Equation 2.8
$$\log\frac{p(w_1 \mid C)}{p(w_2 \mid C)} = \log\frac{p(W_t = w_1 \mid W_{t-1} = C)}{p(W_t = w_2 \mid W_{t-1} = C)} = \log\frac{\eta(C, w_1)\,\big/\,\sum_{j=1}^{N_w}\eta(C, w_j)}{\eta(C, w_2)\,\big/\,\sum_{j=1}^{N_w}\eta(C, w_j)} = \log\frac{\eta(C, w_1)}{\eta(C, w_2)}$$

Of course, many other words in the corpus besides w1 and w2 will also follow C (that is, Σj η(C, wj) ≫ η(C, w1) + η(C, w2)). However, the fact that the normalizing term (which represents the sum of all occurrences of C with any word) cancels out in Equation 2.8 reflects the assumption that subjects engaged in a two-alternative forced choice word identification task will only consider the relative contextual evidence for w1 and w2 in responding.

The data for BIASES's bigram language model were collected from the 2009 Google Books corpus (Michel et al, 2010). Rather than assuming that the corpus counts of the relevant bigrams (the bay, the pay, to pay, to bay) were perfect estimates of subjects' contextual knowledge, model-fitting (see Chapter 3) allowed for "add-alpha" smoothing (Lidstone, 1920). Under such a model, one value α is added to all bigram counts, and the fitting process selects the α that minimizes the overall deviation of the model predictions from the data.
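As a rough sketch of how these pieces fit together (the counts and α below are placeholders rather than the Google Books estimates), the smoothed bigram counts enter the posterior only through their log ratio, as in Equations 2.7 and 2.8:

```python
import math

def log_prior_odds(count_c_w1, count_c_w2, alpha=0.0):
    """Equation 2.8 with Lidstone ('add-alpha') smoothing. Because the normalizing term
    cancels, raw (smoothed) bigram counts are all that is needed."""
    return math.log((count_c_w1 + alpha) / (count_c_w2 + alpha))

def posterior_w1(prior_log_odds, likelihood_log_ratio):
    """Equation 2.7: the two-alternative posterior is a logistic function of summed log odds."""
    return 1.0 / (1.0 + math.exp(-(prior_log_odds + likelihood_log_ratio)))

# Placeholder counts standing in for eta(to, pay) and eta(to, bay):
pi_to = log_prior_odds(50_000, 500, alpha=1.0)

# For a perfectly ambiguous token (likelihood log ratio = 0), context alone decides:
print(posterior_w1(pi_to, 0.0))                  # close to 1: 'to' strongly favors 'pay'

# A very large alpha washes out the contextual difference (see the discussion of smoothing below):
print(log_prior_odds(50_000, 500, alpha=1e7))    # close to 0
```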
As mentioned earlier, while the benefit of smoothing is that it protects against overconfidence in our estimate of listeners’ prior (especially in estimates of the probability of relatively uncommon bigrams), higher values of 𝛼 tend to diminish the specificity of the predictions of the model. As the smoothing parameter 𝛼 grows larger (and greater than the raw counts in the bigram language model itself), the model approaches a uniform distribution that renders w1 and w2 equally likely to follow every context. 67 2.4.2. Likelihood Term: Mapping an Acoustic Signal onto Lexical Forms One of the most fundamental observations about speech communication is that there is no one-to-one mapping between words and their acoustic realizations (cf. Liberman, Cooper, Shankweiler & Studdert-Kennedy, 1967). On one hand, it is clear that signal-to-word mapping is many-to-one: perfectly understandable productions of the same word can take on countless acoustic realizations that may differ from one another along many dimensions. On the other hand, signal-to-word mapping is also sometimes one-to-many: the same acoustic signal may, on different occasions, be perceived as different sounds (e.g., Ganong, 1980; Liberman, Harris, Hoffman & Griffith, 1957; Sawusch & Jusczyk, 1981), words (e.g., Borsky et al, 1998; Fox & Blumstein, in press), or sequences of words (e.g., Foss & Swinney, 1973; Kim, Stephens & Pitt, 2012). Together, these two facts underlie the likelihood term in BIASES and how it interacts with the model’s conditional prior term. 2.4.2.1. Likelihood Functions: Many-to-One Mapping The immediate function of the likelihood term 𝑝(𝐴|𝑊) in BIASES is to formalize the many-to-one mapping from speech tokens to words. 𝑝 𝐴 𝑊 is best described as a composite of Nw likelihood functions, where Nw is the number of words in the lexicon, because each word wi has its own likelihood function 𝑝(𝐴|𝑤! ). 𝑝(𝐴|𝑤! ) defines a listener’s phonetically-detailed knowledge about how wi tends to be pronounced. Implicit in each word’s likelihood function is the notion that not all productions of a word will be equally clear, and some realizations will be more typical than others. Work examining the influence of category goodness and internal category structure has shown that fine- grained acoustic properties in speech modulate perception, recognition and lexical access 68 (Andruski, Blumstein & Burton, 1994; Blumstein, Myers & Rissman, 2005; Kessinger & Blumstein, 2003; McMurray, Tanenhaus & Aslin, 2002, 2009; Miller, 1994; Miller & Volaitis, 1989; Pisoni & Tash, 1974; Volaitis & Miller, 1992). This internal category structure is modeled within 𝑝(𝐴|𝑤! ) by making the more typical realizations of wi more probable than less typical realizations. The model assumes an acoustic space, 𝒜, comprised of all possible speech waveforms, and each ax in 𝒜 is a point within that multidimensional acoustic space. Each word, wi, occupies some subspace of 𝒜 comprised of all possible pronunciations of wi. The likelihood function of wi, 𝑝(𝐴|𝑤! ), assigns some probability to every possible ax. For most values of ax, it will effectively be the case that 𝑝 𝑎! 𝑤! = 0; after all, although each word in a lexicon can be pronounced in many4 ways, most possible waveforms will bear no similarity to some given wi. However, among those speech tokens (values of ax) that might plausibly be exemplars of wi, the ones that most resemble wi will be most probable according to 𝑝(𝐴|𝑤! ). 
In this way, the role of the likelihood term of BIASES, 𝑝(𝐴|𝑊), is to evaluate, for each lexical candidate wi, how representative of wi a perceived speech token ax is. Of course, just as the likelihood function of wi will ensure that 𝑝 𝑎! 𝑤! = 0 for most ax, the overall effect of 𝑝(𝐴|𝑊) is that, for a given acoustic signal ax, 𝑝 𝑎! 𝑤! = 0 for most words. As discussed earlier, when 𝑝 𝑎! 𝑤! = 0 for all words but one, only one word will have a nonzero posterior probability, and there will be no question about the identity of the token ax. Indeed, given the multiplicity of available bottom-up, acoustic 4 Indeed, because at least some relevant dimensions of 𝒜 are continuous (e.g., VOT, vowel duration, formant values), BIASES assumes that the sample space of possible pronunciations of any word is infinite. Thus, formally, 𝑝(𝐴|𝑤! ) must be a probability density function. 69 cues that comprise the many dimensions of 𝒜 , it may very often be possible to distinguish words and speech sounds from one another (see, e.g., Nearey, 1990, 1997). However, there is not always such a consistent mapping from a given acoustic signal to one (and only one) word. Put simply, while the many-to-one mapping between speech tokens and words lies at the heart of each word’s likelihood function, the computational challenge that a spoken word recognition system must overcome arises due to the one-to- many (or at least one-to-more-than-one) mappings between a signal and multiple possible lexical candidates. In such cases, the system must adjudicate among the various words that the perceived signal resembles to any degree. 2.4.2.2. Phonetic Ambiguity: One-to-Many Mapping Under what circumstances would an optimal listener believe that, for more than one word, 𝑝 𝑎! 𝑤! > 0? Feldman and colleagues (2009) identified several factors responsible for creating uncertainty in the mapping of a signal onto a single, best- matching word. Here, we classify these factors into two general categories. In short, the noisier the environment is and the more acoustically similar two words are, the more likely it is that the perceived signal ax will be ambiguous between the two words. One source of uncertainty is noise, which distorts the speaker’s production of a word (sx) and can cause the perceived acoustic signal (ax) to be ambiguous between multiple words, even when sx may not have been. For instance, as already discussed, early research in the phoneme restoration paradigm (Warren, 1970; Samuel, 1981, 1996) showed that masking a short segment of uninterrupted, natural speech with a cough or white noise could render the corrupted signal (/*il/) consistent with any of several lexical candidates (e.g., wheel, heel, peel, meal) (Warren & Warren, 1970; Warren & Sherman, 70 1974). In general, adding noise to stimuli tends to “smear out” a word wi’s likelihood function: to accommodate more variability in the acoustic signal due to noise, more values of ax will count as possible realizations of wi. The result of this smearing is that some values of ax may come to correspond to multiple possible words (Luce & Pisoni, 1998; Warren & Warren, 1970) or sounds (e.g., Cutler, Weber, Smits & Cooper, 2004; Miller & Nicely, 1955; Smits, Warner, McQueen & Cutler, 2003; Warner, Smits, McQueen & Cutler, 2005). While the effect of noise is to increase the uncertainty of the speech signal after it is produced, a second source of uncertainty emerges naturally and is inescapable, even in a completely noiseless environment. 
Distinct words are sometimes characterized by overlapping likelihood functions; this occurs when the acoustic space corresponding to one word intersects with that of another word. A trivial example illustrating this fact is the case of homophony: if a speaker produces sx = /baɪ/, sx could correspond to several words (buy, by, or bye) because all three words share virtually identical spaces of possible pronunciations (but see, e.g., Gahl, 2008). The present work considers a less extreme example of how phonetic ambiguity may lead to lexical ambiguity. While homophony inevitably leads to lexical ambiguity, the phonetic ambiguity examined here arises when the likelihood functions of a pair of word-initial segments (/b/ and /p/) overlap in acoustic space.

The primary acoustic dimension on which /b/ and /p/ differ is voicing, with tokens of /p/ tending to be realized with longer voice-onset time (VOT) values than tokens of /b/ (Lisker & Abramson, 1964). Figure 2.1 displays two theoretical acoustic cue distributions over VOTs: one for word-initial /b/ and one for word-initial /p/. Although tokens of each category can generally be distinguished on the basis of VOT alone, the two categories' distributions overlap such that tokens with some intermediate VOT values could plausibly be an exemplar of either /b/ or /p/. Although other acoustic dimensions of a spoken word also provide reliable cues that can distinguish /b/-initial tokens from /p/-initial tokens (e.g., Klatt, 1975; Lisker, 1986; Miller & Dexter, 1988; Repp, 1984; Stevens & Klatt, 1974; Summerfield, 1981), much work has shown that, holding other variables constant, listeners perceive segments with some VOT values as phonetically ambiguous between /b/ and /p/ (Clayards et al, 2008; Connine, Blasko & Wang, 1994; Connine, Titone & Wang, 1993; Fox & Blumstein, in press; Ganong, 1980; Liberman, Harris, Kinney & Lane, 1961; McMurray, Clayards, Tanenhaus & Aslin, 2008; McMurray et al, 2002, 2009; Miller & Dexter, 1988; Miller et al, 1984; Pisoni & Lazarus, 1974; Toscano & McMurray, 2012; Wood, 1976). A consequence of this phonetic ambiguity for spoken word recognition is that an acoustic token /?eɪ/ with a phonetically ambiguous VOT could correspond to either bay or pay (Fox & Blumstein, in press). Thus, speech tokens whose initial consonants have intermediate VOT values exhibit a one-to-many mapping from acoustic signal to lexical forms, and, as illustrated in Figure 2.1, this one-to-many mapping can be modeled by assuming that bay and pay have overlapping likelihood functions.

[Figure 2.1: probability density functions over VOT (ms) for BAY and PAY.]

Figure 2.1. Examples of normally distributed probability density functions for two categories: /b/ and /p/, or for bay and pay under the assumption that these words are otherwise (i.e., besides VOT) identical in their acoustic cue distributions. The light grey line represents the marginal density function, showing the relative amount of total probability mass associated with each voice-onset time (VOT) across all categories. The dashed black line indicates the category boundary (χ), defined as the point in acoustic space (or the plane in acoustic space, if the likelihood model has more than one dimension) for which the probability density functions of two or more categories are equal.
It can be equivalently defined as the point/plane for which the posterior probability distribution over a given set of categories is uniformly distributed when the prior probabilities of the set's members are also equal.

2.4.2.3. BIASES' Likelihood Term: A Mixture of Gaussians

An important issue that remains is the specification of each word's likelihood function. As already discussed, a complete model of the likelihood function for a word would define which acoustic signals could be recognized as that word, as well as how good an exemplar each possible signal would be. Although, in reality, such a model would be extremely complex and require a highly multidimensional space, the present model is far simpler. Following prior work (e.g., Clayards et al, 2008), we assume that the likelihood functions of word-initial voicing minimal pairs (e.g., bay and pay) can be approximated by normal distributions over a single continuous dimension, VOT (see also Kleinschmidt & Jaeger, in prep; Kronrod, Coppess & Feldman, 2012; Munson, 2011). Under this assumption, listeners expect that if they were to perceive a given word wi, the probability that it would be realized with different initial VOT values (A) is given by a normal distribution (see Equations 2.9A and 2.9B), where μi represents the mean initial VOT for wi and σi² represents the variance in wi's initial VOT.

Equation 2.9A
$$A \mid w_i \sim N(\mu_i,\, \sigma_i^2)$$

Equation 2.9B
$$p(A \mid w_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(A - \mu_i)^2}{2\sigma_i^2}}$$

If the acoustic form of wi is assumed to be normally distributed, μi represents the most probable acoustic signal associated with wi, and the further a stimulus is from this prototypical VOT, the less representative of wi the exemplar will be (and the lower its likelihood will be). Since each word wi has a Gaussian likelihood function p(A | wi) defined by its mean (μi) and variance (σi²) parameters, the full likelihood model of BIASES, p(A | W), which is a composite of all Nw words' likelihood functions, takes the form of a mixture of Gaussians. Gaussian mixture models are a common approach to statistical models of speech categorization and phoneme category learning (de Boer & Kuhl, 2003; Clayards et al, 2008; Dillon, Dunbar & Idsardi, 2013; Feldman et al, 2009, 2013; McMurray, Aslin & Toscano, 2009; Toscano & McMurray, 2010; Vallabha, McClelland, Pons, Werker & Amano, 2007) in which there exists some number of categories, and the exemplars of each category are normally distributed according to that category's mean and variance (or covariance matrix, in the case of multiple perceptual dimensions).

Note that, for the present formalization of BIASES, the likelihood distribution over VOTs for each phonetic category (/b/ vs. /p/) is equivalent to the likelihood distribution over VOTs for each word (bay vs. pay) because (1) these two words are assumed not to differ on any other acoustic dimensions besides the VOT of their initial stop consonants, and (2) these words are assumed to be deterministically related to their associated phonetic categories. Admittedly, as we discuss later, it seems likely that neither of these assumptions is warranted; Chapters 3 and 4 illustrate some interesting and insightful implications of relaxing these assumptions. In any case, under these simplifying assumptions, in order to infer the most likely category label for a given exemplar ax, ax must be compared to each category's likelihood function (see Equation 2.9B).
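For illustration, the following sketch compares a few VOT values against two such likelihood functions; the means and shared standard deviation are stand-in values (roughly in the range reported by Lisker & Abramson, 1964), not the parameters fitted later:

```python
import math

def gaussian_likelihood(vot, mu, sigma):
    """Equation 2.9B: density of observing a token with this VOT, given word w_i."""
    return math.exp(-((vot - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

MU_BAY, MU_PAY, SIGMA = 0.0, 64.0, 15.0   # stand-in VOT parameters, in ms

for vot in (5.0, 32.0, 60.0):
    l_bay = gaussian_likelihood(vot, MU_BAY, SIGMA)
    l_pay = gaussian_likelihood(vot, MU_PAY, SIGMA)
    print(f"VOT = {vot:4.1f} ms   p(A|bay) = {l_bay:.5f}   p(A|pay) = {l_pay:.5f}")

# At VOT = 32 ms (halfway between the two means), the two densities are equal, so the
# likelihood term alone cannot adjudicate between bay and pay.
```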
Previous work suggests that listeners’ perceptual identification behavior can be approximated by modeling words’ likelihoods with a Gaussian mixture model over VOT values. In a study by Clayards and colleagues (2008) listeners identified acoustic stimuli along a VOT continuum between two words (e.g., beach and peach), and subjects’ identification functions were consistent with an optimal Bayesian recognizer’s behavior. A between-subject manipulation in their study provides further support for the modeling of words’ likelihood functions using a mixture of Gaussians approach: two groups of subjects were exposed to different distributions of VOTs that implied either high overlap or less overlap in the words’ likelihood functions (i.e., either higher values of 𝜎!! or lower 75 values of 𝜎!! , respectively). Results showed that, when the overlap between the likelihood functions of beach and peach appeared to be higher, subjects perceived a greater number of the intermediate tokens from the continuum as ambiguous (Clayards et al, 2008). Despite this modeling success, the Gaussian mixture model employed by Clayards and colleagues (2008) is not without its weaknesses. For instance, although they manipulated the acoustic variability of the stimuli for different subjects, the stimuli presented to any one group mimicked likelihood functions that had equal variance terms for both candidate words (e.g., beach and peach). Their model, in turn, also presumed that 𝜎!! = 𝜎!! . In reality, it is not generally the case that the initial VOTs of words with an initial /b/ and words with an initial /p/ are distributed with equal variance (Lisker & Abramson, 1964; Kronrod, Coppess & Feldman, 2012). In fact, VOT distributions, as measured in spoken word and segment production experiments (e.g., Baese-Berk & Goldrick, 2009; Fox, Reilly & Blumstein, 2015), tend to exhibit non-Gaussian skew, and evidence suggests that the distribution of VOTs of word-initial voiced stops in English (especially /b/) is highly bimodal (Lisker & Abramson, 1964; see also Docherty, Watt, Llamas, Hall & Nycz, 2011). Nonetheless, despite their divergence from speech production data, computational models of speech perception that have adopted the mixture of Gaussians approach and assumed equal variance across categories have achieved substantial success in capturing the overall patterns associated with category goodness and internal phonetic category structure (Clayards et al, 2008; Feldman et al, 2009; Kleinschmidt & Jaeger, 2015). Because of this fact, and because of the computational benefits associated with adopting this simplification (namely, the existence of a closed-form likelihood function), the 76 likelihood model implemented in BIASES was identical to the mixture of Gaussians employed by Clayards and colleagues (2008). In principle, any likelihood function could replace that of Clayards and colleagues (2008) in BIASES. For now, we simply acknowledge that BIASES could be enhanced with more detailed (and realistic) likelihood functions that incorporate more acoustic cues (see, e.g., Feldman et al, 2013) and/or less simplistic distributional assumptions (see, e.g., Kleinschmidt & Jaeger, in prep; Kronrod, Coppess & Feldman, 2012). 
Critically, though, unlike Clayards and colleagues (2008), and unlike other Bayesian models of speech perception that have adopted similar models of the likelihood term (Feldman et al, 2009; Kleinschmidt & Jaeger, 2015), the key innovation in BIASES is the inclusion of a context-dependent conditional prior which is integrated with the likelihood function. The model proposed by Clayards and colleagues (2008) fits the basic shape of subjects’ responses to isolated words despite their assumption of equal prior probabilities for beach and peach because their likelihood model captures fundamental properties of subjects’ signal-to-word mapping. Here, we adopt this same likelihood function in formulating BIASES in order to leverage the successes of their isolated word recognition model, while extending it to incorporate sentential context effects. 2.4.2.4. Comparing the Likelihood Terms in BIASES and Shortlist B Finally, it is worth pointing out another fundamental difference between BIASES and Shortlist B (Norris & McQueen, 2008). In addition to its assumption of a context- independent prior based on lexical frequency instead of the conditional prior embraced by BIASES, Shortlist B also differs from BIASES in the mathematical form of its likelihood term. Although the likelihood function adopted in BIASES is identical to that of Clayards 77 and colleagues (2008) and closely related to that of Feldman and colleagues (2009), Norris and McQueen’s (2008) Shortlist B takes a very different approach. Rather than explicitly relating the acoustic properties of the speech signal to phonemes or words, Norris and McQueen (2008) avoided specifying a likelihood function that would directly relate to an acoustic signal. Instead of relying on assumptions about the distributions of acoustic cues and the ways in which these different cues covary, Shortlist B’s likelihood model abstracts over all of the acoustic cues in speech to capture, broadly, the confusability of different words in the lexicon. Specifically, they assume that – whatever likelihood model subjects use – the same one that underlies subjects’ performance in spoken word recognition tasks should also underlie their performance in lower-level perceptual tasks. Under this assumption, McQueen and Norris (2008) used perceptual confusion data from a gating task (Smits et al, 2003; Warner et al, 2005) to infer subjects’ likelihood functions and taught Shortlist B, for each diphone in Dutch (e.g., /ba/), how likely subjects should be to perceive that diphone as itself or any other Dutch diphone (e.g., /ba/, /pa/, /da/, /bɪ/, ...). The obvious advantage of Shortlist B’s approach is that it can remain agnostic as to the computations that map an acoustic signal onto one or more words, so it can be applied to a large vocabulary without making many assumptions about pre-lexical representations or pre-lexical processing. However, the key question that motivated the development of BIASES was how bottom-up and top-down information sources are weighted and combined during spoken word recognition. In light of this question, the nature of the likelihood function and how it fits into the larger computational framework will play an important role in understanding the predictions of BIASES, as we discuss 78 later. Thus, unlike Shortlist B, BIASES is capable of making fine-grained predictions about how acoustic-phonetic properties of speech and context-specific lexical predictions jointly modulate subjects’ recognition of spoken words. 2.4.3. 
Integrating Prior Context and Perceptual Input in BIASES

Substituting the likelihood function (Equation 2.9) into the model of subjects' posterior (Equation 2.7), and simplifying based on the stated assumption of equal variance in the VOT distributions (σ1² = σ2² = σ²) for pay (w1) and bay (w2), yields Equation 2.10 (cf. Feldman et al, 2009):

Equation 2.10
$$p(w_1 \mid A, C) = \frac{1}{1 + e^{-\left[\log\frac{p(w_1 \mid C)}{p(w_2 \mid C)} + g(A - \chi)\right]}}, \quad \text{where } \chi = \frac{\mu_1 + \mu_2}{2} \text{ and } g = \frac{\mu_1 - \mu_2}{\sigma^2}$$

Equation 2.10 represents the optimal (Bayesian) posterior probability that a particular acoustic signal (A) following a particular sentence context (C) is an exemplar of the word pay, given the assumptions outlined above. Following a rational analysis approach (Anderson, 1990), Equation 2.10 can also be interpreted as an estimate of the optimal rate of PAY responses subjects should make when responding to different stimuli in different sentence contexts during a two-alternative forced choice word identification task.

Finally, in order to evaluate the extent to which the predictions of BIASES are consistent with actual subjects' behavior, it is possible to simulate responses from BIASES based on the assumption that, on a given trial (t) consisting of a context and acoustic stimulus pairing (ct, at), a subject's final identification decision (Zt) is probabilistically generated from a Bernoulli distribution with θt = p(w1 | at, ct) (see Equation 2.11).

Equation 2.11
$$Z_t \mid a_t, c_t \sim \mathrm{Bern}\!\left(\frac{1}{1 + e^{-\left[\log\frac{p(w_1 \mid c_t)}{p(w_2 \mid c_t)} + g(a_t - \chi)\right]}}\right)$$

An oft-cited intuitive metaphor for Bernoulli-distributed random variables is the process of flipping a biased coin: the probability of a PAY response is like the probability of a coin-flip coming out heads, and different experimental conditions or stimuli affect the bias of the "coin" towards "heads" (θt) differently. In particular, Equation 2.10's posterior distribution describes the way stimulus and context conditions in a given trial influence the probability of a PAY response on that trial.

2.4.4. Conclusion and Next Steps

In sum, Equation 2.11 provides us with an explicit method of generating behavioral responses. The ability to simulate behavior from BIASES in this way affords many advantages. In Chapter 3, we take two somewhat different approaches to the simulation of behavioral data with BIASES. First, by providing to BIASES assumed values for all of the underlying parameters (μ1, μ2, σ², and p(wi | cj) for every wi and cj relevant to the experiment) needed to generate behavioral responses, we can examine properties of the model's behavior and the extent to which empirical data match those predictions. As we will demonstrate, this approach reveals that our chosen theoretical framework provides some much-needed clarity for research in the field of top-down effects, organizing a confusing literature rife with apparent inconsistencies. Moreover, in the spirit of iterative updating of cognitive models that is fundamental to the rational analysis approach (Anderson, 1990), this approach allows us to identify enhancements to the model that are critical for capturing and "post-dicting" existing behavioral data. One major disadvantage of this approach is that, in order to generate responses, we must make even more assumptions about the latent structure of perceptual and cognitive processing that gives rise to the behavioral responses we can observe. On the other hand, a second approach allows us to discover features of the model underlying observed behavior, rather than making assumptions about the model's features.
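Both simulation approaches rest on the same generative step. A minimal sketch of that step, under assumed (illustrative) parameter values rather than fitted ones, might look as follows:

```python
import math
import random

def p_pay(vot, prior_log_odds, mu_pay=64.0, mu_bay=0.0, sigma=15.0):
    """Equation 2.10: posterior probability of 'pay' given a VOT and the contextual prior."""
    chi = (mu_pay + mu_bay) / 2.0
    gain = (mu_pay - mu_bay) / sigma ** 2
    return 1.0 / (1.0 + math.exp(-(prior_log_odds + gain * (vot - chi))))

def simulate_trial(vot, prior_log_odds):
    """Equation 2.11: one identification response drawn from a Bernoulli distribution."""
    return "PAY" if random.random() < p_pay(vot, prior_log_odds) else "BAY"

# Illustrative priors: 'to' favors pay, 'the' favors bay (placeholder log odds of +/- log 3).
prior_odds = {"to": math.log(3.0), "the": math.log(1.0 / 3.0)}

random.seed(1)
for context in ("to", "the"):
    responses = [simulate_trial(32.0, prior_odds[context]) for _ in range(1000)]
    print(context, responses.count("PAY") / 1000.0)
# For a boundary-value stimulus (VOT = 32 ms), simulated PAY rates track the prior (~.75 vs. ~.25).
```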
By generating data from BIASES over a very wide, weakly constrained range of possible parameter settings and comparing the simulated behavioral response patterns under different conditions to real data from human subjects, we can learn about the likely distribution of those parameters. As we will demonstrate, even though all we can actually observe is on the left side of Equation 2.11 (i.e., subjects’ responses to different stimulus/context pairings), this approach allows us to infer the distributions of all of the unknown parameters that ultimately give rise to those responses. Moreover, this approach can be used to directly compare the relative fit and explanatory power of different theories and models that make disparate assumptions about aspects of auditory language processing. As we will discuss, each approach has its own benefits and shortcomings, but both approaches can be leveraged to reveal important insights about the human speech perception system, and especially about how top-down information sources such as sentential context modulate word recognition. 81 Chapter 3 Exploring and Evaluating the BIASES Model of Spoken Word Recognition in Context 3.1. Understanding Top-Down Effects in BIASES Although the theoretical and mathematical underpinnings of BIASES were presented in Chapter 2, Chapter 3 aims to explore the model’s more fine-grained predictions about how higher-level (e.g., contextual) and lower-level (e.g., perceptual) information sources conspire during spoken word recognition to produce top-down effects on speech perception. To that end, Chapter 3 is divided into three main sections. First, Chapter 3 examines the mathematical form of BIASES more closely. We implement BIASES and perform two preliminary simulation studies to illustrate how a minimalistic implementation of BIASES can replicate subjects’ sensitivity to a preceding function word when identifying a stimulus that is phonetically ambiguous between a noun and verb (Fox & Blumstein, in press) and how the computational principles inherent to BIASES not only account for the overall pattern, but also provide fine-grained quantitative predictions about expected variability and asymmetries in the size of context effects on spoken word recognition. Second, we consider a major problem that is often ignored in the literature, and especially by computational models: the enormous amount of unexplained variability in the size of top-down effects on speech processing. These issues are discussed with consideration for how these data can be captured by BIASES. The present model recasts previously overlooked or poorly understood behavioral patterns and asymmetries, suggesting that apparent inconsistencies in top-down effect on speech perception (cf. Pitt & Samuel, 1993) actually follow from the theoretical principles embodied by BIASES. 82 Illustrative simulations demonstrate the unique ability of BIASES to explain and predict lawful variability in the patterns of top-down effects across stimuli and across studies. Finally, Experiment 3.1 is conducted in order to directly test one novel prediction made by BIASES, and the model’s simulated behavior is compared to human performance on an auditory word identification task. These new experimental data are also used in two model comparison analyses to demonstrate that the utility of this computational model extends beyond providing a theoretical framework for contextual influences on word recognition. 
BIASES also represents a novel tool for comparing psycholinguistic theories about the two inputs to the model, including both the lower-level pre-lexical processing of speech that maps speech sounds to words, and the higher-level processing of sentences that reflects how listeners utilize contextual and linguistic information during auditory language processing. Overall, the results of the simulations and the experimental analyses suggest that subjects' recognition of spoken words in context exhibits certain hallmarks of a Bayesian cue integration system. Generally speaking, BIASES highlights the fact that top-down effects on speech perception offer a unique window into perceptual processing, cognitive processing, and the interface of cognitive and perceptual representations in human language function.

3.1.1. Overview of the Mathematical Form of BIASES

Recall Equation 2.10's statement of the form of the posterior probability distribution, reproduced in Equation 3.1 (substituting in a new term, Π, to summarize the effect of the prior):

Equation 3.1
$$p(w_1 \mid A, C) = \frac{1}{1 + e^{-\left[\Pi + g(A - \chi)\right]}}, \quad \text{where } \Pi = \log\frac{p(w_1 \mid C)}{p(w_2 \mid C)},\;\; \chi = \frac{\mu_1 + \mu_2}{2},\;\; \text{and } g = \frac{\mu_1 - \mu_2}{\sigma^2}$$

As described in Chapter 2, Equation 3.1 represents the present model's estimate of p(w1 | A, C), the probability that a target stimulus was pay, given the voice-onset time (VOT) of its initial stop, A, given that it followed sentence context C, and given that the listener is performing a two-alternative forced choice word identification task with two possible candidates (w1 = pay and w2 = bay). The acoustic forms of pay and bay are modeled as having Gaussian distributions with means μ1 and μ2, respectively, and shared variance term σ² (σ² = σ1² = σ2²). p(W | C) reflects the strength of the contextual bias towards one or the other candidate word, and it is estimated from a corpus. Finally, Equation 3.1's logistic form is a consequence of a key non-linguistic constraint: the use of a two-alternative forced choice task.

There are three key terms within the sigmoidal posterior (see Equation 3.1): (1) Π, the term summarizing the relative prior support for the candidate words, (2) g, the logistic function's gain term, and (3) χ, which denotes the VOT that is exactly halfway between the category means. Here, we discuss the interpretation of g, χ and Π in turn and explore their role in predicting the distribution of top-down effects in spoken word recognition.

3.1.1.1. Components of BIASES: Phonetic Category Structure (g)

The primary effect of g is to control the slope of the logistic, with higher values of g indicating a sharper identification curve. As implied by the definition of g in Equation 3.1, greater separation between the means of pay and bay (μ1 − μ2) and lower variability (σ²) in the expected distribution of productions of pay and bay are associated with steeper slopes. Simulation Study 3.1 illustrates the nature of g's influence on the posterior probability function and on the size of sentential context effects on word recognition. The details behind these simulations and their key conclusions are described in Box 3.1.

Figures 3.1 and 3.2 illustrate the tradeoff between category variance and category separation. The shape (i.e., steepness) of the resulting posterior sigmoids in Figure 3.2 is affected by changing either the distance between the means of the underlying normally distributed density functions in Figure 3.1 (left vs.
right panels of the figures) or the underlying category variance of the normal probability density functions (or, equivalently, the standard deviations, as indicated in the top vs. bottom panels of Figures 3.1 and 3.2). Intuitively, and as described in Chapter 2, the less overlap there is between two words' likelihood functions (Figure 3.1), the fewer acoustic values (here, VOTs) there will be that are ambiguous between pay and bay (Figure 3.2). Note that g is the term that differed between groups in the study by Clayards and colleagues (2008). By manipulating the apparent variance of the VOT distributions of /b/- and /p/-initial minimal pair words (e.g., beach and peach), Clayards and colleagues were tapping into the denominator of g. Finally, note that, after hearing a sentence context C, all other terms in the posterior remain constant no matter what stimulus A is presented to the listener. In particular, the prior information (Π) does not influence the slope of the posterior distribution in BIASES (due to a conditional independence assumption; see Chapter 2), while g determines the overall shape of the posterior distribution over the acoustic space.

Box 3.1. Description of Simulation Study 3.1

Goal: Illustrate the influence of two aspects of underlying phonetic category structure on the posterior probability function and the size of sentential context effects.
Design: 4 simulated phonetic category structures in a 2 × 2 design
Parameters of BIASES Manipulated: μ1 − μ2 ∈ {64, 36} ms; σ ∈ {15, 20} ms
Parameters of BIASES Held Constant: χ = 32; p(w1 | c1) = 0.75; p(w1 | c2) = 0.25
Results displayed in: Figures 3.1-3.5, Table 3.1

Key conclusions:
1. BIASES' gain parameter (g), which characterizes the slope of the sigmoidal posterior (cf. Feldman et al, 2009), is the ratio of μ1 − μ2 (the distance between the means of the two words' distributions over VOTs) and σ² (the shared variance of each word's VOT distribution).
2. Because μ1 − μ2 and σ² are collinear (having opposite effects on g), they are not identifiable parameters when fitting a model that assumes equal category variance. The tradeoff between these features of BIASES' likelihood model can be visualized in the top-right and bottom-left simulated phonetic category structures in Figure 3.1, where distinct likelihood functions yield identical posteriors (Figures 3.2-3.3) with the same gain parameter (see Table 3.1). Consequently, model-fitting in Chapters 3-4 assumes values for μ1 − μ2 (from Lisker & Abramson, 1964; for a similar approach, see Kleinschmidt & Jaeger, 2015) and fits σ².
3. Although the magnitude of the effective category boundary shift between two prior contexts (χ_c2 − χ_c1) depends on g, the maximum expected effect size (Δmax) is independent of it (see Table 3.1; Figure 3.4 vs. 3.3). However, a narrower range of VOTs exhibits top-down effects, so it would be more difficult, practically speaking, to detect a large top-down effect size if VOTs are sampled from the space.
4. For all 4 likelihoods examined in Simulation Study 3.1, the locus (â) of the maximum expected effect size (Δmax) was consistently collocated with the category boundary (χ) (Figure 3.4). Note, however, that χ was confounded with the midpoint of the effective category boundaries for the prior contexts (χ_c1, χ_c2). We discuss this point in Simulation Study 3.2 (see Box 3.2).
5. When measured for each prior context relative to a neutral baseline, the expected effect size for any given VOT is asymmetrical (in general); the locus of the maximum effect size is at the midpoint between χ and the prior context's effective category boundary (χ_ci) (Figure 3.5).

[Figure 3.1: probability density functions p(VOT | wi) for bay and pay in each of the four simulated category structures (μ1 − μ2 ∈ {64, 36} ms; σ ∈ {15, 20} ms); x-axis: VOT (ms).]

Figure 3.1. Results of Simulation Study 3.1: Influence of μ1 − μ2 and σ² on probability density functions, p(VOT | wi). p(VOT | wi): solid/colored curves; category boundary (χ): dashed/grey vertical line; μi for each p(VOT | wi): dotted/colored vertical lines.

[Figure 3.2: posterior probability of pay, p(w1 | VOT), for the same four category structures; x-axis: VOT (ms).]

Figure 3.2. Results of Simulation Study 3.1: Influence of μ1 − μ2 and σ² on the posterior probability function. p(w1 | VOT, C = cN): solid/black curves; χ: dashed/grey vertical line.

3.1.1.2. Components of BIASES: Category Boundary (χ)

To understand the roles of the two remaining terms, Π and χ, consider, first, a hypothetical context cN in which w1 and w2 are both equally probable a priori, such that p(w1 | C = cN) = p(w2 | C = cN) = 0.5 (cf. Clayards et al, 2008; Feldman et al, 2009). In this perfectly neutral context, Π = Π_cN = log[p(w1 | cN)/p(w2 | cN)] = log(0.5/0.5) = log 1 = 0. When this condition (Π = 0) is fulfilled, and when it is also true that A = χ, the entire right side of Equation 3.1 becomes 1/(1 + e⁰) = 0.5. That is, χ is the point in acoustic space for which, if presented with that stimulus in a perfectly neutral context, a listener would, in theory, be equally likely to select either response: p(w1 | A, cN) = p(w2 | A, cN) = 0.5. This point, χ, is referred to as the category boundary. Note that, because BIASES assumes that the variability σ² associated with each word's likelihood function is equal, χ is guaranteed to be located exactly halfway between μ1 and μ2, as it is defined in Equation 3.1. However, note that if σ1² ≠ σ2², then it is not, in general, true that χ = (μ1 + μ2)/2.

3.1.1.3. Components of BIASES: Prior Context (Π)

How, then, does the prior information contained within Π influence spoken word recognition? Because the influence of Π is constant for a given C (i.e., independent of A), the overall effect of the prior is to produce a translation (i.e., a horizontal shift) of the logistic function towards the mean of the less probable word (Feldman et al, 2009). Figure 3.3 (also from Simulation Study 3.1) illustrates this for the same distributions displayed in Figures 3.1 and 3.2. Shifting the logistic means that a stimulus with the same word-initial VOT will be more likely to be recognized as an exemplar of the contextually-supported word (relative to context cN, in which both lexical candidates are equally probable; black posterior distribution in Figures 3.2 and 3.3).

[Figure 3.3: posterior probability of pay, p(w1 | VOT, C), for the same four category structures under three prior contexts (p(w1 | cN) = 0.5, p(w1 | c1) = 0.75, p(w1 | c2) = 0.25); x-axis: VOT (ms).]

Figure 3.3. Results of Simulation Study 3.1: Influence of μ1 − μ2 and σ² on the posterior probability function, incorporating prior contexts. p(w1 | VOT, C = ci): solid/colored curves; χ: dashed/grey vertical line; χ_ci for each Π_ci: dashed/colored vertical lines.

In other words, Π induces a bias in the posterior sigmoid.
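A brief numerical sketch of this biasing effect, using the parameter values from Simulation Study 3.1 (the helper function is ours, for illustration only):

```python
import math

def posterior_pay(vot, prior_pay, mu_diff=64.0, chi=32.0, sigma=15.0):
    """Equation 3.1: posterior probability of 'pay' for a given VOT and prior p(pay | c)."""
    pi = math.log(prior_pay / (1.0 - prior_pay))
    g = mu_diff / sigma ** 2
    return 1.0 / (1.0 + math.exp(-(pi + g * (vot - chi))))

# The same boundary-value stimulus (VOT = 32 ms) under the three priors of Figure 3.3:
for prior in (0.50, 0.75, 0.25):
    print(prior, round(posterior_pay(32.0, prior), 3))
# 0.50 -> 0.50, 0.75 -> 0.75, 0.25 -> 0.25: the entire sigmoid is translated left or right,
# so the VOT at which responses split 50/50 moves away from chi.
```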
Ultimately, this bias is realized, for a given context C (for which Π = Π_C), in the movement of the location in acoustic space for which p(w1 | A, C) = p(w2 | A, C) = 0.5. We refer to this VOT value as the effective category boundary (χ_C). Unlike the "baseline" category boundary (χ = χ_cN) defined earlier, which depends on characteristics of the likelihood function alone (cf. Clayards et al, 2008), the effective category boundary χ_C still, of course, depends on the likelihood function, but it is also context-dependent. Specifically, the magnitude and direction of its shift away from χ are given by χ_C − χ = −Π_C/g (cf. Feldman et al, 2009). As Figure 3.3 clearly depicts, for the same Π_C, the magnitude of the boundary shift should be larger when g is smaller (i.e., when the posterior is shallower). Ultimately, the location of the effective category boundary for a given context is given by Equation 3.2:

Equation 3.2
$$\chi_C = \chi - \frac{\Pi_C}{g}$$

How can we summarize the meanings of each of the terms in BIASES (Equation 3.1)? Firstly, g encapsulates the model of the phonetic category structure that comprises the likelihood functions of pay and bay (Figure 3.1) and gives the posterior probability distribution its slope (Figure 3.2). While g controls the overall shape of the posterior, Π and χ convey information about the posterior's "location" in acoustic space. They determine, for instance, which VOTs will be most ambiguous. Along with g, χ also derives from the likelihood term, representing the location (in acoustic space) of the (unbiased) category boundary. Meanwhile, the influence of sentential context is completely contained within Π, which indexes the relative amount of support a context C provides to one or the other candidate word. It is Π's context-dependent biasing effect on the posterior distribution (Figure 3.3) that we focus on in the present model. Simulation Study 3.1 has already shown that the two elements of g (μ1 − μ2 and σ²) have an influence on the magnitude of shifts in the effective category boundary, controlling for prior context. Next, we explore the basic predictions of BIASES, focusing on how different factors within the model influence the predicted size of the effect of prior context on subjects' word identification responses.

3.1.2. Towards Model-based Analyses of Top-Down Effects

3.1.2.1. Shifting of Invisible Category Boundaries

As discussed above, the "baseline" category boundary, χ, is the point at which a given VOT is equally likely to come from both categories' distributions (Figure 3.1). In the case considered here (and elsewhere, e.g., Clayards et al, 2008; Feldman et al, 2009), where the mixture distribution that comprises the model's likelihood term is effectively constrained to consider two Gaussian distributed categories with equal variance, χ lies at the midpoint of the two categories' means ((μ1 + μ2)/2). χ is a fundamental property of phonetic category structure. In Simulation Study 3.1, all simulated phonetic category structures assumed the same χ (= 32), which was held constant across the structures examined. However, in practice, χ is not straightforward to measure. For one, we cannot directly observe the phonetic category structure that underlies a subject's behavioral responses to acoustic tokens. Instead, we are bound to try to infer χ from a subject's identification or discrimination of speech tokens. However, even this is difficult, since the definition of χ presupposes equal prior probability.
In reality, such a state is difficult to confidently tap into experimentally: due to pervasive effects of lexical and phonotactic frequency on subjects' recognition of speech sounds (e.g., Connine, Titone & Wang, 1993; Massaro & Cohen, 1983; Pitt & McQueen, 1998), many assumptions are required if one hopes to confidently infer its value (see Pitt & Samuel, 1993 for a discussion of this issue). In short, although the action of the prior in BIASES is attributed to a shift produced relative to an unobservable category boundary (see Equation 3.2), the same principles can be captured without explicitly relying on some assumed or inferred value of χ by examining relative effective category boundary shifts. In other words, although the biased posterior distributions in Figure 3.3 (in blue and red) are, based on the cognitive model (Equation 3.1), computed by biasing the latent (i.e., unobserved), neutral posterior (in black), in practice, an experimenter would compare data from two observed (biased) conditions to one another. Later, we do consider another type of experimental design that includes a designated "neutral" condition (e.g., Fox, 1984; Guediche et al, 2013; van Alphen & McQueen, 2001), and we argue that even in those conditions, subjects' responses are probably not completely unbiased, so the so-called neutral conditions are actually more likely to be just a third bias condition with an intermediate Π_c. In any case, for now, we focus on the far more common 2-condition experimental design.

Formally, to the extent that any shift in the effective category boundary, χ_c1, is observed in context c1, it is relative to another context, c2, with some other prior Π_c2. Equation 3.3 is a generalization of Equation 3.2, giving the magnitude of the VOT boundary shift between two biased contexts.

Equation 3.3
$$\chi_{c_2} - \chi_{c_1} = \left(\chi - \frac{\Pi_{c_2}}{g}\right) - \left(\chi - \frac{\Pi_{c_1}}{g}\right) = \frac{\Pi_{c_1} - \Pi_{c_2}}{g} = \frac{\log\frac{p(w_1 \mid c_1)}{p(w_2 \mid c_1)} - \log\frac{p(w_1 \mid c_2)}{p(w_2 \mid c_2)}}{g}$$

Equation 3.3 supersedes Equation 3.2 (and is more general) because the neutral prior (Π_cN = log(0.5/0.5) = 0) that, by definition, characterizes χ_cN = χ could be substituted into Equation 3.3 to obtain Equation 3.2. Table 3.1 reports the relative effective category boundary shift for each simulation in Simulation Study 3.1; these shifts can be visualized as the difference between the VOT of the dashed/red line and that of the dashed/blue line in Figure 3.3. As previously mentioned, the boundary shift is inversely proportional to the value of g.

            | μ1 − μ2 = 64 ms                    | μ1 − μ2 = 36 ms
σ = 15 ms   | χ = 32; g = 0.28; â = 32;          | χ = 32; g = 0.16; â = 32;
            | Δmax = 0.50; χ_c2 − χ_c1 = 7.72    | Δmax = 0.50; χ_c2 − χ_c1 = 13.73
σ = 20 ms   | χ = 32; g = 0.16; â = 32;          | χ = 32; g = 0.09; â = 32;
            | Δmax = 0.50; χ_c2 − χ_c1 = 13.73   | Δmax = 0.50; χ_c2 − χ_c1 = 24.41

Table 3.1. Summary of Results of Simulation Study 3.1: Influence of underlying phonetic category structure on the posterior probability function and the size of sentential context effects.

3.1.2.2. Boundary Shifts vs. Effect Sizes

However, even having skirted one practical issue by avoiding reliance on an unobservable parameter value, another practical issue remains that, ultimately, suggests that merely explaining top-down effects as arising from shifting category boundaries is not ideal for the goal of accurately assessing and predicting top-down effects in actual behavioral data. To see this, consider the methodological and analytic techniques utilized by behavioral research examining top-down effects.
3.1.2.2. Boundary Shifts vs. Effect Sizes

However, even having skirted one practical issue by avoiding reliance on an unobservable parameter value, another practical issue remains that, ultimately, suggests that merely explaining top-down effects as arising from shifting category boundaries is not ideal for the goal of accurately assessing and predicting top-down effects in actual behavioral data. To see this, consider the methodological and analytic techniques typically used in behavioral research on top-down effects. Typically, experimenters construct an acoustic continuum and/or select a relatively small number of stimuli with discrete step sizes. In many cases, the durations of these steps are influenced by other practical constraints, such as the need to splice waveforms at zero-crossings to avoid discontinuities that introduce acoustic artifacts (e.g., clicks) into the stimuli. Then, participants are presented with these tokens in different contexts for identification; the experimenter is effectively sampling from the subject's posterior distribution in order to characterize the listener's underlying prior and likelihood model. Finally, the experimenter adopts some analytic technique, typically aimed at producing evidence that responses to the same acoustic tokens were categorized reliably differently between conditions (such as logistic regression or ANOVA over proportions of response types by condition). At no point in this process does the notion of a category boundary or some underlying horizontal shift arise. Of course, that does not mean it is not a useful characterization of the underlying model. It may suggest that either the "horizontal" shift of the biased sigmoid relative to another sigmoid (from another condition) or the "vertical" differences in the rate of categorization decisions is epiphenomenal, or even that both are. What is important about this observation for the present work, however, is that the comparison of "effective category boundaries" across conditions is a theoretical construct, one step removed from actual empirical research. That is, however fundamental or epiphenomenal a category boundary is to phonetic category structure, it is in some ways incidental to experimental research on top-down effects on spoken word recognition.

It should be noted that some analysis techniques are exceptions to those described above. For instance, some researchers do explicitly estimate boundaries for subjects in different conditions (e.g., Baum, 2001; Blumstein et al., 1994) and then compute statistics about the shift in the boundary between conditions. Obviously, this technique is closely connected to the theory espoused here, but these statistics ultimately rely on estimates (not directly observed category boundaries), and statistics based on derived measures are necessarily less accurate than the original data themselves (see, e.g., Pitt & Samuel, 1993). Also, using a single summary statistic ignores many fine-grained details regarding the distribution of top-down effects (cf. Pitt & Samuel, 1993). We address this analysis technique later (Chapter 4), ultimately showing that an explicit, model-based analysis approach allows for richer inferences about the nature of variability in top-down effects.

3.1.2.3. Predicting Effect Sizes

In order to better understand the way BIASES integrates prior information with bottom-up acoustic data, with an eye towards the ultimate need to understand effect size (i.e., the "vertical" differences in response rates for a given acoustic stimulus), we first defined the function underlying BIASES' expected effect size when comparing the rate of pay-responses between any pair of contexts, for any VOT, as shown in Equation 3.4:

Equation 3.4
$$\Delta(\Pi_{c_1}, \Pi_{c_2}, \chi, g, A) = p(w_1 \mid A, c_1) - p(w_1 \mid A, c_2) = \frac{1}{1 + e^{-\left(\Pi_{c_1} + g(A - \chi)\right)}} - \frac{1}{1 + e^{-\left(\Pi_{c_2} + g(A - \chi)\right)}}$$

where

$$\Pi_{c_i} = \log\frac{p(w_1 \mid c_i)}{p(w_2 \mid c_i)}, \qquad \chi = \frac{\mu_1 + \mu_2}{2}, \qquad \text{and} \qquad g = \frac{\mu_1 - \mu_2}{\sigma^2}$$
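A direct transcription of Equation 3.4 into code may make the function easier to follow. The sketch below is illustrative only; the parameter values passed in the example are the χ = 32, g = 0.16 structure from Simulation Study 3.1, not fitted values.

```python
import numpy as np

def delta(vot, prior_pay_1, prior_pay_2, chi, g):
    """Expected effect size Delta(A) = p(pay | A, c1) - p(pay | A, c2) (Equation 3.4)."""
    pi_1 = np.log(prior_pay_1 / (1.0 - prior_pay_1))
    pi_2 = np.log(prior_pay_2 / (1.0 - prior_pay_2))
    post_1 = 1.0 / (1.0 + np.exp(-(pi_1 + g * (vot - chi))))
    post_2 = 1.0 / (1.0 + np.exp(-(pi_2 + g * (vot - chi))))
    return post_1 - post_2

# Effect size across the continuum for priors of 0.75 vs. 0.25:
vots = np.linspace(-25, 100, 501)
effect = delta(vots, 0.75, 0.25, chi=32.0, g=0.16)   # peaks at 0.50 near 32 ms
```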
For ease of exposition, we refer to the function defined in Equation 3.4 as Δ(A), with the understanding that Δ(A) is meaningless unless the other parameter values ({Π_c1, Π_c2, χ, g}) are also provided. As expressed by Equation 3.4, Δ(A) is equal to the difference between the posterior probabilities of a pay-response to a stimulus (A) after context c1 vs. c2. As such, the function's shape will clearly depend on the same factors on which the posterior depends. For that reason, the first two arguments to the function are related to (1) the biasing information in the prior of each posterior distribution (Π_c1 and Π_c2), and the remaining two are related to (2) the components of the posterior that are based on the phonetic category structure defined by the likelihood function (χ and g, which do not change from context to context).

While it is simple to demonstrate that Δ(A) is the difference of two sigmoidal curves (more specifically, a sigmoid and that same sigmoid under translation), the function itself is not easy to express (cf. the difference-of-sigmoids membership function: e.g., Berkan & Trubatch, 1997). However, despite not having a simple, closed form, Δ(A) does have certain properties that are straightforward. More importantly, those properties are critical to the predictions and simulations discussed in this chapter and Chapter 4, so we review them now and illustrate several of them via simulation.⁵

⁵ Note that Δ(A) is not a perfect tool for all purposes. For instance, although it does approximate the expected effect size after many trials in each context are completed, it is not meant to simulate behavioral data directly. After all, like the boundary shift, effect size is a derived measure that is epiphenomenal (from the explanatory standpoint of BIASES). Thus, the function is used for illustrative purposes, but all actual simulated behavioral data in Chapters 3 and 4 are generated using Equation 2.11 and subtracted to illustrate expected effect sizes.

Figures 3.4 and 3.5 are the final two figures associated with Simulation Study 3.1 (see Box 3.1 for a summary of these simulations, and Table 3.1 for a summary of their results). Figure 3.4 illustrates the shape of Δ(A) over the acoustic space for the same four simulated phonetic category structures shown in Figures 3.1-3.3. A few points bear special note:

Property 1. If Π_c1 > Π_c2, then Δ(A) > 0 for all values of A, although Δ(A) approaches 0 for values of A further from Δ(A)'s peak, which we denote â. Thus, in line with intuition, subjects should never show a reversal of a sentence context effect, on average.

Property 2. Δ(A) is symmetrical under the assumptions imposed on BIASES in Chapter 2; in particular, Δ(â − x) = Δ(â + x) if σ² = σ²_{w1} = σ²_{w2}.

Property 3. In general, it is not possible to compute the definite integral of the sum or difference of two sigmoids in a simple closed form. This is notable because the integral of Δ(A) (i.e., the total area between Δ(A) and the x-axis) is equal to the area between the two posteriors being compared, p(w1 | A, c1) and p(w1 | A, c2). However, numerical methods can be used to approximate this value, and, conveniently, the total area under Δ(A), and therefore the total area between p(w1 | A, c1) and p(w1 | A, c2), is equal to the magnitude of the effective category boundary shift between the two contexts (cf. Equation 3.3), as shown in Equation 3.5:

Equation 3.5
$$\int \Delta(A)\, dA = \int \left[\, p(w_1 \mid A, c_1) - p(w_1 \mid A, c_2)\, \right] dA = \frac{\log\dfrac{p(w_1 \mid c_1)}{p(w_2 \mid c_1)} - \log\dfrac{p(w_1 \mid c_2)}{p(w_2 \mid c_2)}}{g} = \chi_{c_2} - \chi_{c_1}$$
Property 4. If the maximum expected difference between the posteriors is located (in acoustic space) at â and has an effect size of magnitude Δmax, Equations 3.6 and 3.7 give those values; â occurs at the midpoint of the two posterior probability distributions' effective category boundaries.

Equation 3.6
$$\hat{a} = \underset{a' \in \mathcal{A}}{\arg\max}\; \Delta(a') = \frac{\left(\chi - \dfrac{\Pi_{c_1}}{g}\right) + \left(\chi - \dfrac{\Pi_{c_2}}{g}\right)}{2} = \chi - \frac{\log\dfrac{p(w_1 \mid c_1)}{p(w_2 \mid c_1)} + \log\dfrac{p(w_1 \mid c_2)}{p(w_2 \mid c_2)}}{2g}$$

Equation 3.7
$$\Delta_{max} = \Delta(\hat{a}) = \frac{1}{1 + e^{-\frac{1}{2}\left(\Pi_{c_1} - \Pi_{c_2}\right)}} - \frac{1}{1 + e^{-\frac{1}{2}\left(\Pi_{c_2} - \Pi_{c_1}\right)}}$$

Property 5. As is obvious from Equation 3.7, the value Δmax is independent of the specific characteristics of the phonetic category structure (i.e., the likelihood) in BIASES, including both χ and g. Unsurprisingly, Δmax does depend on the relative strengths of the biases of the prior contexts being compared. Simulation Study 3.2 examines this issue further (see Box 3.2; Figures 3.6-3.9; Tables 3.2-3.3).

One thing that we can conclude from these simulations and observations is that the effective category boundary shift – which underlies the model's explanation of top-down effects on spoken word recognition – is, indeed, closely tied to overall differences in the influence of two prior contexts on speech recognition (e.g., Property 3), but this shift does not tell the whole story of top-down effects on speech perception. According to BIASES, different effect sizes should be observed as a function of the underlying phonetic category structure (see Simulation Study 3.1) and as a function of the relative strengths of the biases of the two contexts being compared (see Simulation Study 3.2). In short, BIASES predicts fine-grained variation in the size of top-down effects that should be observed in subjects' responses to different acoustic tokens in different sentential contexts. The distribution and shape of the Δ(A) curve (i.e., expected top-down effects as a function of VOT, for a given pair of contexts) depends on many factors. This general statement is of great theoretical interest because of the enormous variability and inconsistency in top-down effects observed in the literature (see, e.g., Pitt & Samuel, 1993). It is this issue to which we now turn.
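Properties 3-5 can be verified numerically. The short sketch below uses one Simulation Study 3.2 cell (priors of 0.90 vs. 0.25, with χ = 32 and g = 0.16) and recovers the values reported for that cell in Table 3.2; the grid resolution and variable names are arbitrary choices made for illustration.

```python
import numpy as np

def delta(vot, pi_1, pi_2, chi=32.0, g=0.16):
    """Delta(A) of Equation 3.4, written directly in terms of the log prior odds."""
    return (1.0 / (1.0 + np.exp(-(pi_1 + g * (vot - chi))))
            - 1.0 / (1.0 + np.exp(-(pi_2 + g * (vot - chi)))))

chi, g = 32.0, 0.16
pi_1 = np.log(0.90 / 0.10)          # strongly pay-biasing context
pi_2 = np.log(0.25 / 0.75)          # bay-biasing context

# Property 3 / Equation 3.5: area under Delta(A) equals the boundary shift
step = 0.01
vots = np.arange(-400.0, 500.0, step)         # wide grid so the tails are negligible
area = delta(vots, pi_1, pi_2).sum() * step   # approximately 20.60
shift = (pi_1 - pi_2) / g                     # 20.60 ms

# Property 4 / Equations 3.6-3.7: peak location and height
a_hat = chi - (pi_1 + pi_2) / (2 * g)         # ~28.57 ms
delta_max = delta(a_hat, pi_1, pi_2)          # ~0.68; unchanged if chi or g is altered (Property 5)
print(round(area, 2), round(shift, 2), round(a_hat, 2), round(float(delta_max), 2))
```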
Figure 3.4. Results of Simulation Study 3.1: Influence of μ1 − μ2 and σ² on Δ(A) = p(w1 | VOT, C = c1) − p(w1 | VOT, C = c2). Δ(A): solid black curves; χ: dashed/grey vertical line. [Four panels crossing μ1 − μ2 ∈ {64, 36} ms and σ ∈ {15, 20} ms; x-axis: VOT (ms); y-axis: effect size (difference in posterior probabilities of pay).]

Figure 3.5. Results of Simulation Study 3.1: Influence of μ1 − μ2 and σ² on Δ(A) = p(w1 | VOT, C = c_i) − p(w1 | VOT, C = c_N). Δ(A): solid/colored curves; χ: dashed/grey vertical line. [Four panels crossing μ1 − μ2 ∈ {64, 36} ms and σ ∈ {15, 20} ms; curve colors indicate the prior probability of pay in each context: c1 = 0.75, c2 = 0.25; y-axis: difference in posterior probability of pay vs. neutral.]

Figure 3.6. Example simulation from Simulation Study 3.2: illustrates the posterior probability distributions as a function of VOT and prior context (p(w1 | VOT, C = c_i); top panel), the effect size as a function of VOT for those two prior contexts (Δ(A); middle panel), and the effect size as a function of VOT for each of those two prior contexts relative to a neutral baseline (bottom panel). All panels in this illustration represent priors p(w1 | c1) = 0.9 and p(w1 | c2) = 0.25, with μ1 − μ2 = 64, χ = 32, and σ² = 20². In all panels: χ: dashed/grey vertical line; χ_{c_i} for each Π_{c_i}: dashed/colored vertical lines; Δmax: magnitude of the solid/orange vertical marker; â: VOT of the solid/orange vertical marker; χ_{c2} − χ_{c1}: magnitude of the solid/black horizontal marker. Top panel: p(w1 | VOT, C = c_i): solid/colored curves; p(w1 | VOT, C = c_N): dotted/grey curve. Middle panel: Δ(A) = p(w1 | VOT, c1) − p(w1 | VOT, c2): solid/black curve. Bottom panel: Δ(A) = p(w1 | VOT, C = c_i) − p(w1 | VOT, C = c_N): solid/colored curves; Δmax of each context's posterior relative to c_N: magnitude of the solid/colored vertical markers.

Figure 3.7. Results of Simulation Study 3.2: Influence of p(w1 | C = c_i) and σ² on the posterior probability function, incorporating prior contexts. p(w1 | VOT, C = c_i): colored curves (solid: σ = 20; dashed: σ = 30); χ: dashed/grey vertical line; χ_{c_i} for each Π_{c_i}: colored vertical lines (solid: σ = 20; dashed: σ = 30). [Nine panels crossing p(w1 | c1) ∈ {0.25, 0.75, 0.9} and p(w1 | c2) ∈ {0.25, 0.75, 0.9}; x-axis: VOT (ms).]

Figure 3.8. Results of Simulation Study 3.2: Influence of p(w1 | C = c_i) and σ² on Δ(A) = p(w1 | VOT, C = c1) − p(w1 | VOT, C = c2). Δ(A): black curves (solid: σ = 20; dashed: σ = 30); χ: dashed/grey vertical line.

Figure 3.9. Results of Simulation Study 3.2: Influence of p(w1 | C = c_i) and σ² on Δ(A) = p(w1 | VOT, C = c_i) − p(w1 | VOT, C = c_N). Δ(A): colored curves (solid: σ = 20; dashed: σ = 30); χ: dashed/grey vertical line.

Box 3.2. Description of Simulation Study 3.2
Goal: Illustrate the influence of the strength of contextual priors and of one aspect of phonetic category structure on the posterior probability function and the size of sentential context effects.
Design: 2 phonetic category structures; 3 fully crossed levels of contextual bias in a 2×3×3 design.
Parameters of BIASES manipulated: σ² ∈ {20², 30²}, p(w1 | C = c1) ∈ {0.25, 0.75, 0.90}, p(w1 | C = c2) ∈ {0.25, 0.75, 0.90}.
Parameters of BIASES held constant: χ = 32, μ1 − μ2 = 64.
Results displayed in: Figures 3.6-3.9, Tables 3.2-3.3.
Key conclusions:
1. Figure 3.6 illustrates the geometric interpretations of several critical variables in Chapter 3.
2. As in Simulation Study 3.1, the magnitude of the effective category boundary shift between two prior contexts (χ_c2 − χ_c1) depends on g (Figure 3.7), but the maximum expected effect size (Δmax) is independent of g. To see this, compare the solid and dashed curves in Figure 3.8: they peak at the same height. Note, also, that Δmax is the same for the corresponding pair of priors in Table 3.2 and Table 3.3. Table 3.2 lists the summary statistics for the simulations using σ² = 20², and Table 3.3 lists the summary statistics for the simulations using σ² = 30². Δmax depends only on the strengths of the prior contexts' biases.
3. Nonetheless, despite Δmax being independent of BIASES' likelihood function, the VOT at which the maximum expected effect size is found (â) is not (see Figure 3.8). As Equations 3.6 and 3.7 suggest, â lies midway between the two priors' effective category boundaries, which depend on g. Note the divergence between Tables 3.2 and 3.3 in â for the same pair of priors.
4. Similarly, the magnitude of the effective category boundary shift between two prior contexts (χ_c2 − χ_c1) depends on g and on the priors (see Figure 3.7).
5. When measured for each prior context relative to a neutral baseline, the expected effect size for any given VOT is (in general) asymmetrical; the locus of the maximum effect size is at the midpoint between χ and the prior context's effective category boundary (χ_{c_i}) (see Figures 3.6 and 3.9).

Tables 3.2-3.3. Summary of Results of Simulation Study 3.2: Influence of p(w1 | C = c_i) and σ² on the posterior probability distribution and the size of sentential context effects (Table 3.2: σ² = 20²; Table 3.3: σ² = 30²). Each row below corresponds to a pair of prior contexts; in the original color-coded layout, tan panels showed no expected context effects (see Figures 3.7-3.8), blue panels had higher posteriors for c1, red panels had higher posteriors for c2, and darker panels (of each hue) corresponded to larger expected effect sizes.

p(w1 | c1)   p(w1 | c2)    â        Δmax      χ_c2 − χ_c1
0.25         0.25          38.87     0.00       0.00
0.75         0.25          32.00     0.50      13.73
0.90         0.25          28.57     0.68      20.60
0.25         0.75          32.00    −0.50     −13.73
0.75         0.75          25.13     0.00       0.00
0.90         0.75          21.70     0.27       6.87
0.25         0.90          28.57    −0.68     −20.60
0.75         0.90          21.70    −0.27      −6.87
0.90         0.90          18.27     0.00       0.00

Table 3.2. Summary of Results of Simulation Study 3.2 (simulations utilizing σ² = 20²; χ = 32 and g = 0.16 in every cell).

p(w1 | c1)   p(w1 | c2)    â        Δmax      χ_c2 − χ_c1
0.25         0.25          47.45     0.00       0.00
0.75         0.25          32.00     0.50      30.90
0.90         0.25          24.28     0.68      46.35
0.25         0.75          32.00    −0.50     −30.90
0.75         0.75          16.55     0.00       0.00
0.90         0.75           8.83     0.27      15.45
0.25         0.90          24.28    −0.68     −46.35
0.75         0.90           8.83    −0.27     −15.45
0.90         0.90           1.10     0.00       0.00

Table 3.3. Summary of Results of Simulation Study 3.2 (simulations utilizing σ² = 30²; χ = 32 and g = 0.07 in every cell).
3.2. Evaluating BIASES

A central motivation behind the development of BIASES is to provide a theoretical explanation and computational framework within which to examine top-down effects of sentential context in spoken word recognition tasks. The foregoing work (Chapters 2 through 3.1) has focused on this task. However, BIASES provides more than a framework; as discussed above, BIASES also provides an explicit mathematical model that makes specific, fine-grained quantitative predictions about the distribution of top-down effects on spoken word recognition. Thus, it is important to evaluate the extent to which BIASES can account for observed variability in such effects, and the extent to which its novel predictions are borne out experimentally.

3.2.1. Observed Variability in the Size of Top-Down Context Effects

Despite the strong evidence for top-down effects on spoken word recognition (see Chapter 2), substantial heterogeneity remains to be explained in the fine-grained details of the results in studies of this class of phenomenon. Pitt and Samuel (1993) provide a thorough review of such variability for lexical effects, but we discuss a few examples here. Observed effects vary depending on the source of bias (e.g., lexical, sentential, monetary payoff). Among lexical effects, the degree of top-down influence depends on the position of the manipulated phonetic cues in the word (e.g., word-initial: Ganong, 1980; word-medial: Connine, 1990; word-final: McQueen, 1991; see also Mattys, Melhorn & White, 2007). Among sentential context effects, the sizes of semantic, syntactic, and pragmatic effects are not consistent. Even restricting analysis to top-down effects of syntactic sentential context on speech recognition, effect sizes vary greatly (see Chapter 1; Fox & Blumstein, in press). Such effects are reminiscent of word frequency effects (Connine et al., 1993) in phoneme identification tasks (see Pitt & Samuel, 1993 for a related explanation of inconsistent lexical effects due to word familiarity/frequency).

In addition to varying with the nature of the biasing information, top-down effects also depend on characteristics of the acoustic stimuli that comprise the test continua. Most obviously, and most consistently across studies, top-down effects are larger for more phonetically ambiguous stimuli, and they vanish or are very small for tokens that are clearly identifiable. Other acoustic manipulations that reduce stimulus quality or otherwise render the tokens more ambiguous tend to be associated with larger effect sizes (e.g., Burton & Blumstein, 1995; McQueen, 1991; Pitt & Samuel, 1993). On the other hand, top-down effects are elusive when stimuli are more faithful to the phonetic properties of real speech and have a greater number of reliable bottom-up acoustic cues (e.g., Burton, Baum & Blumstein, 1989). Indeed, there is even some indication that the size and prevalence of top-down effects depend on the specific phonetic contrasts and acoustic cues being manipulated in the stimuli (e.g., /sh/–/ch/ vs. /sh/–/h/ vs. /sh/–/s/ vs. /b/–/m/ vs. /b/–/d/ vs. /b/–/p/ vs. /g/–/k/ vs. /t/–/d/).
Furthermore, there is a high degree of individual subject variability in the extent to which subjects exhibit top-down effects, even within a homogeneous population of healthy, monolingual, English-speaking young adults with normal hearing (see, e.g., Chapter 1; Fox & Blumstein, in press). Far more variability exists when considering the size of such effects in elderly adults (e.g., Abada et al., 2008) or patients with aphasia (see Chapter 4; see also Baum, 2001; Blumstein et al., 1994; Boyczuk & Baum, 1999).

Finally, a review of the literature shows that there are also strong task effects and an influence of an experiment's demand characteristics on the observed size of top-down effects on speech recognition. Chapter 1 discussed the role of stimulus predictability, experimental task (phoneme vs. word identification), and response latency in determining the expected size of top-down effects (see Fox & Blumstein, in press; cf. Bicknell, Jaeger & Tanenhaus, in press; Bicknell, Tanenhaus & Jaeger, submitted; Connine, Blasko & Hall, 1991; McClelland, 1987; Pitt & Samuel, 1993; Szostak & Pitt, 2013; van Alphen & McQueen, 2001). Pitt and Samuel (1993) also acknowledge apparent modulations of top-down effects in mixed vs. blocked designs, and they highlight the potential for differences in measured effect sizes due to differences in the analytic techniques experimenters select.

In some cases, such variability may be due to chance. However, it is also possible that the observed asymmetries and inconsistencies are not merely noise but are, in fact, systematic variation attributable to the basic principles underlying speech perception and to the probabilistic Bayesian framework within which we have formulated an explanation of sentential context effects on spoken word recognition. Because BIASES offers a formal, mathematical model of context effects on speech perception, it is possible to evaluate the quantitative predictions of the model in light of available data. In this way, not only can we validate many of the fundamental principles underlying BIASES, but we can also identify shortcomings of BIASES and take measures to improve the model's empirical coverage.

Next, we consider four sources of variability, examining whether and/or how BIASES might capture the observed irregularities and, in the process, relaxing some of the simplifying assumptions adopted when the model was introduced in Chapter 2. Specifically, we examine two sources associated with the likelihood term of BIASES (variability in the ambiguity of phonetic cues based on VOT and based on additional cues) and two sources associated with the prior term of BIASES (variability based on the strength of prior context and based on a "neutral" context).

3.2.2. Variability in the Ambiguity of Phonetic Cues: VOT

One well-documented source of variability in top-down effects on spoken word recognition is variability along a continuum: top-down effects are not typically observed for phonetically unambiguous endpoint stimuli. For example, stimuli with very short VOTs are not good exemplars of /p/, as reflected in the likelihood distributions for /b/ and /p/ in Figure 3.1. There is a vanishingly small probability that a word-initial /p/ will be pronounced with a VOT of 10 ms, so the posterior probability (see Figure 3.2) of a /p/-response for such a token is virtually zero.
Even when the prior context strongly supports a word beginning with /p/ (see Figure 3.3), subjects are not likely to make a /p/-response; after all, the posterior in Bayes' rule is proportional to the product of the prior and the likelihood, so acoustic tokens that are not at all representative of /p/ (i.e., that have a likelihood close to zero) will not tend to show reliable context effects. The same is true of stop consonant tokens with VOTs of 50 ms, for example, because the likelihood that such a VOT is a token of bay is practically zero. This can be seen clearly in Figure 3.4, where no context effects are observed for these VOT values. This pattern has been replicated in the literature, going back to Ganong's original lexical effect (1980); for instance, a large effect for intermediate VOTs and much smaller or nonexistent effects for endpoint VOT tokens can be seen in the data from Chapter 1 (Fox & Blumstein, in press; see Chapter 2 for discussion). This is not a strictly Bayesian pattern: many models are capable of capturing this sort of effect. However, the pattern exemplified in Figure 3.4 is a fundamental property of the Bayesian framework (e.g., Massaro, 1989; Norris & McQueen, 2008), rather than the consequence of a design choice within the model. This contrasts with other models, such as Merge (Norris et al., 2000), which prevents top-down information from influencing responses to endpoint tokens by implementing a "bottom-up priority rule" that only allows higher-level sources to affect decisions if the acoustic information is ambiguous (Norris et al., 2000). Importantly, in order to explicitly implement something like Merge's bottom-up priority rule, a model must define additional computational machinery and/or assumptions to govern when bottom-up decisions are protected from contextual influences vs. when top-down information is integrated. Although other cue integration models (e.g., Toscano & McMurray, 2010), which focus on the integration of multiple acoustic cues in speech perception, also exhibit reliability-based cue weighting like the present model, BIASES additionally makes fine-grained quantitative predictions about the distribution of top-down effects across VOTs that can be compared to patterns in empirical data. To the extent that these specific predictions are borne out, it would suggest that there exist certain hallmarks of Bayesian cue integration in behavioral results. This issue is examined further later in this chapter.

3.2.3. Variability in the Ambiguity of Phonetic Cues: Additional Cues

Although we have assumed up to now that VOT is the primary acoustic dimension on which voiced and voiceless stop consonants (e.g., /b/ and /p/) are distinguished in speech perception (Liberman et al., 1961), listeners also make use of a large variety of other acoustic cues in their judgments about voicing in natural speech stimuli (see, e.g., Klatt, 1975; Lisker, 1986; Miller & Dexter, 1988; Repp, 1984; Stevens & Klatt, 1974; Summerfield, 1981). A complete model of top-down effects on spoken word recognition, then, would include all of these cues in the likelihood model that maps acoustic stimuli to words. Burton, Baum and Blumstein (1989) investigated one cue in particular: the amplitude of the burst of the stop consonant. In natural speech, VOT and burst amplitude co-vary (Lisker & Abramson, 1964; Pickett, 1980; Zue, 1976), even though most VOT continua hold burst amplitude constant in an attempt to isolate the effect of VOT duration on speech recognition.
Burton and colleagues (1989) showed not only that subjects were sensitive to manipulations of burst amplitude as a cue to voicing in stop consonants, but also that the size of top-down effects, as indexed by the emergence of a lexical effect in their responses, differed depending on whether the stimuli in the test VOT continuum varied from token to token in both burst amplitude and VOT or in VOT alone. Top-down effects occurred when only VOT varied, and were not, in fact, significant when burst amplitude and VOT co-varied along the continuum (as in natural speech).

How might this asymmetry be explained within the Bayesian framework? Clearly, BIASES is not equipped to explain this effect under its original assumptions, because it assumes that VOT is the only relevant acoustic cue to a word onset's identity. A simple adaptation, however, can explain how this effect emerges. First, we must incorporate both the burst amplitude and the VOT of a given stimulus into BIASES' likelihood function: p(A | W) = p(burst, VOT | W). With this change, the likelihood function is two-dimensional instead of one-dimensional; note that, while this is still surely an oversimplification, since many other acoustic cues also influence word recognition, it illustrates the adaptability of BIASES. Next, we created an arbitrary range of burst amplitudes that was higher for voiceless tokens than for voiced tokens (cf. Lisker & Abramson, 1964). Finally, a simulation was conducted to compare the expected size of top-down effects in responses to stimuli from two simulated VOT continua: a VOT continuum with a single burst amplitude across all tokens, and a VOT continuum whose VOT values co-varied with burst amplitude. Arbitrary mean and variance parameters were selected from among those used in Simulation Study 3.1 (any choice shows the same basic pattern, but the actual size of the effective category boundary shift depends on this choice; see Figure 3.3 and Table 3.1). Figure 3.10 shows the posterior probability distributions for the two simulated continua in two biasing contexts (blue vs. red) and a baseline neutral context (black).

Figure 3.10. Results of two model simulations of posterior probability distributions assuming the same biasing/neutral contexts (c_N = 0.5, c1 = 0.75, c2 = 0.25) and identical underlying likelihood models. The simulations varied only in whether the VOTs of the stimuli were correlated with the burst amplitude of the simulated stimuli (right panel, "covary burst and VOT") or not (left panel, "keep /p/-burst").

As can be seen, the expected category boundary shift (and the expected effect sizes) are smaller for the second simulated stimulus set (burst amplitude and VOT co-varied) than for the first (VOT varies while burst amplitude is held constant at the mean amplitude of the voiceless stop's burst). It is important to note that the model is the same in both simulations (it is BIASES with the likelihood model updated to include burst amplitude as an acoustic cue); only the characteristics of the simulated stimulus sets vary between the two simulations.
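A minimal sketch of this two-cue adaptation is given below. The burst-amplitude category means and variances, the linear VOT-burst covariation, and the function names are all hypothetical choices made for illustration (the dissertation states only that an arbitrary burst range was used). The sketch simply shows that, with independent equal-variance Gaussian likelihoods for the two cues, the boundary shift measured along the VOT axis comes out smaller when burst co-varies with VOT than when burst is held constant.

```python
import numpy as np

def posterior_pay(vot, burst, log_prior_odds,
                  mu_vot=(64.0, 0.0), sigma_vot=20.0,
                  mu_burst=(10.0, 0.0), sigma_burst=4.0):
    """p(pay | VOT, burst, C) with independent equal-variance Gaussian likelihoods
    per cue; mu_* give the (pay, bay) category means (hypothetical values)."""
    g_vot = (mu_vot[0] - mu_vot[1]) / sigma_vot**2
    g_burst = (mu_burst[0] - mu_burst[1]) / sigma_burst**2
    chi_vot = sum(mu_vot) / 2.0
    chi_burst = sum(mu_burst) / 2.0
    z = log_prior_odds + g_vot * (vot - chi_vot) + g_burst * (burst - chi_burst)
    return 1.0 / (1.0 + np.exp(-z))

vots = np.linspace(-25.0, 100.0, 2001)
burst_fixed = np.full_like(vots, 10.0)                      # "keep /p/-burst": constant at the pay mean
burst_covary = np.interp(vots, [0.0, 64.0], [0.0, 10.0])    # burst rises with VOT, as in natural speech

def boundary(posterior):
    """VOT at which the posterior crosses 0.5 (crude grid estimate)."""
    return vots[np.argmin(np.abs(posterior - 0.5))]

for burst, label in ((burst_fixed, "constant burst"), (burst_covary, "covarying burst")):
    b1 = boundary(posterior_pay(vots, burst, np.log(3.0)))      # pay-biasing context (0.75)
    b2 = boundary(posterior_pay(vots, burst, np.log(1 / 3.0)))  # bay-biasing context (0.25)
    print(label, round(b2 - b1, 2))   # the shift is smaller when burst co-varies with VOT
```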
Next, we consider two sources of variability related to the prior of BIASES.

3.2.4. Variability in the Strength of Prior Cues

Figure 3.11 is a reproduction of Figure 1.2 (in Chapter 1; Fox & Blumstein, in press), which shows the results of Experiment 1 from that study: the proportion of /p/-responses made to ambiguous tokens (i.e., the intermediate VOT values as defined in Chapter 1) from the bay–pay and buy–pie continua following noun-biasing (e.g., Valerie hated the...) and verb-biasing (e.g., Brett hated to...) sentence contexts. Recall that the bay–pay continuum was designed to be a noun–verb continuum and the buy–pie continuum was designed to be a verb–noun continuum. As explained in Chapter 1 and as can be seen in Figure 3.11, consistent with predictions, subjects exhibited a significant CONTEXT × CONTINUUM interaction, wherein they were more likely to make /p/-responses when the most common grammatical category of the /p/-endpoint was consistent with the grammatical cue provided by the preceding function word (to vs. the). While this effect was quite robust, with the simple effects of CONTEXT being significant and in opposite directions at each level of CONTINUUM, the effect sizes were not identical. This can be seen clearly in Figure 3.11: the magnitude of the effect of CONTEXT in the buy–pie continuum (β = 0.95) is smaller than the effect's magnitude in the bay–pay continuum (β = −1.37). Moreover, visual inspection of Figure 3.11 suggests that the level of CONTEXT primarily driving the interaction is the verb-biasing (to) level: the proportion of /p/-responses is far more disparate between the two continua following the verb-biasing contexts than following the noun-biasing contexts.

Figure 3.11. Reproduction of Figure 1.2. Mean proportion of /p/-responses to ambiguous tokens from each VOT continuum (bay–pay, buy–pie) in Experiment 1 of Chapter 1 after noun-biasing (the) and verb-biasing (to) sentence contexts. Error bars represent standard error. (Fox & Blumstein, in press)

One possible explanation for this asymmetry lies in the strength of the biasing information in the prior of BIASES. Intuitively, if the targets (i.e., bay, pay, buy, pie) all represent relatively "good" nouns (i.e., they are sensible and/or grammatically acceptable following the), but only pay and buy represent "good" verbs (i.e., they are acceptable following to), then the second asymmetry is predicted: the verb-biasing contexts should drive the interaction. The stronger overall magnitude of the bias observed in the bay–pay continuum might occur under many circumstances, including if pay were particularly likely to follow to and bay particularly unlikely, thereby creating a bias much stronger in that condition than in the others. To determine whether this intuitive explanation could quantitatively capture the asymmetry observed for these particular contexts and target words, we implemented the prior of BIASES (a bigram language model; see Chapter 2) for these stimuli. Table 3.4 provides corpus counts of the number of times each of the function word / target bigrams (e.g., to pay, the buy) appears in the Google Books corpus (Michel et al., 2010). As described in Chapter 2, a smoothing parameter⁶ is added to every corpus count (Lidstone, 1920) to yield an estimate of the conditional prior for each word, given the preceding context.

          ...bay        ...pay         ...buy        ...pie
to...     91,314        17,383,444     7,423,403     6,709
the...    3,236,957     945,799        56,284        249,243

Table 3.4. Number of tokens of each bigram found in the 2009 Google Books corpus (Michel et al., 2010).

⁶ Although any value of alpha will give the same basic pattern of results (i.e., the same ordering of effect sizes), different values will accentuate the disparities between the contexts and continua to greater or lesser extents. For the present simulations, the value 1×10⁷ was used to illustrate the similarity between the model predictions and the experimental results.
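Given the counts in Table 3.4, the conditional priors follow from Lidstone smoothing. The sketch below computes p(pay | to) and p(pay | the) renormalized over the {bay, pay} competitor set (as the rime constraint described below implies); the dictionary layout and the two-competitor renormalization are assumptions of this illustration, with alpha set to the 1×10⁷ value mentioned in footnote 6.

```python
# Bigram counts from Table 3.4 (Google Books, 2009)
counts = {"to":  {"bay": 91_314,    "pay": 17_383_444},
          "the": {"bay": 3_236_957, "pay": 945_799}}

alpha = 1e7  # Lidstone smoothing constant (footnote 6)

def prior_pay(context):
    """Smoothed conditional prior p(pay | context), renormalized over {bay, pay}."""
    pay = counts[context]["pay"] + alpha
    bay = counts[context]["bay"] + alpha
    return pay / (pay + bay)

for context in ("to", "the"):
    print(context, round(prior_pay(context), 3))
# With these counts, p(pay | to) is roughly 0.73 while p(pay | the) is roughly 0.45,
# i.e. the verb-biasing context carries the stronger bias, as suggested above.
```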
Furthermore, the likelihood model was improved so as to allow the rime (i.e., vowel + glide) of the target stimulus to influence the likelihood of BIASES, rather than just the VOT of the initial stop consonant of the stimulus (see Chapter 4 for the mathematical details of this improvement). Words that differed from a target stimulus in its rime were assigned a likelihood (and therefore a posterior probability) of zero; that is, for every trial on which a subject heard /?ei/, the only words competing for recognition were bay and pay. The smoothed corpus estimates were incorporated into BIASES as the conditional prior. For the likelihood model, arbitrary mean and variance parameters were selected from among those used in Simulation Study 3.1 (any choice shows the same basic pattern). Figure 3.12 shows the posterior probability distributions for the two continua in each context (left panels) and reproduces Figure 1.1 (right panels) for comparison.

Figure 3.12. Model simulations of posterior probability distributions (left) and original data (right) for the bay–pay (top) and buy–pie (bottom) continua in the noun- and verb-biasing contexts (verb-biasing: blue on left / dashed on right; noun-biasing: red on left / solid on right). Right panels are a reproduction of Figure 1.1: mean proportion of /p/-responses to tokens from each VOT continuum in Experiment 1 of Chapter 1 after noun-biasing and verb-biasing sentence contexts. Error bars represent standard error. (Fox & Blumstein, in press)

Finally, from the resulting posterior distributions, 20 sets (i.e., 20 simulated subjects) of 20 behavioral responses were simulated for each context condition for an ambiguous VOT (a randomly selected VOT value within 5 ms of the assumed category boundary). Figure 3.13 shows the mean proportion of /p/-responses in each continuum to the ambiguous VOT value after each context. The same pattern of results seen in Figure 3.11 can be observed there: the overall interaction is robust, there is a larger effect of context in the bay–pay continuum than in the buy–pie continuum, and the effect is largely driven by the verb-biasing contexts. Importantly, these results are obtained without any significant effort at parameter-fitting; rather, the pattern of results that emerges is inherent to a Bayesian model that assumes, like BIASES, that the prior word should bias the perception of subsequent spoken words when they are phonetically ambiguous. Thus, these results strongly suggest that variability in the strength of prior information in a sentence context modulates the size of observed top-down effects in systematic and predictable ways.
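The response simulation just described amounts to drawing independent Bernoulli trials from the model posteriors. The sketch below illustrates the procedure; the posterior values plugged in for the ambiguous VOT are placeholders, not the values actually produced by the fitted priors and likelihood above.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility
n_subjects, n_trials = 20, 20

# Placeholder posterior probabilities of a /p/-response at an ambiguous VOT,
# one per continuum x context cell (the real values come from the model above).
p_response = {("bay-pay", "the"): 0.30, ("bay-pay", "to"): 0.85,
              ("buy-pie", "the"): 0.65, ("buy-pie", "to"): 0.20}

for cell, p in p_response.items():
    trials = rng.binomial(1, p, size=(n_subjects, n_trials))   # i.i.d. Bernoulli responses
    subject_means = trials.mean(axis=1)
    se = subject_means.std(ddof=1) / np.sqrt(n_subjects)       # SE of subject means (error bars)
    print(cell, round(subject_means.mean(), 3), round(se, 3))
```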
Figure 3.13. Model simulations of behavioral response rates for the bay–pay (left bars) and buy–pie (right bars) continua in noun-biasing (red) and verb-biasing (blue) sentence contexts: mean proportion of simulated /p/-responses to a randomly selected ambiguous VOT (within 5 ms of the simulated category boundary) by 20 subjects with 20 Bernoulli (independent and identically distributed) trials each. Error bars represent standard error of the 20 simulated subject means. Compare to Figure 3.11 (or Figure 1.2). (cf. Fox & Blumstein, in press)

3.2.5. Variability in the Effect Sizes Compared to "Neutral" Prior Contexts

Another inconsistency that has received relatively little attention in the literature, despite its appearance in various studies (e.g., Guediche et al., 2013; van Alphen & McQueen, 2001), relates to the size of top-down effects when responses to stimuli in biasing contexts are compared to a context that is designed to serve as a neutral condition. For instance, Guediche and colleagues (2013) examined responses to stimuli that were occasionally phonetically ambiguous between goat and coat after goat-biasing sentence contexts (e.g., He milked the...), coat-biasing sentence contexts (e.g., He buttoned the...), and neutral sentence contexts in which either goat or coat could sensibly serve as a continuation (e.g., He painted the...). Researchers have also attempted to include similarly neutral contexts in lexical effect studies by using continua between two non-words or between two words (e.g., Fox, 1984). The goal of such studies is generally to compare each biasing context to the neutral context in order to show that sentential context affects the identification of stimuli relative to a neutral baseline. However, it has often been observed that the neutral context may differ significantly from only one of the biasing contexts, or that the effect size may be larger in one direction than in the other. These asymmetries have rarely been discussed in detail in the literature. Nonetheless, because BIASES demands that even the allegedly "neutral" context have some prior (whether 0.5 or not), this exercise serves as a reminder that one must explicitly model the prior for even the neutral context. It may be that the sentence contexts representing the neutral condition are, indeed, truly unbiased: p(w1 | C = c_N) = 0.5. In such a case, the asymmetries might be explained by asymmetric prior biases for the two biasing contexts: there is no guarantee that stimuli in the two biasing contexts will be equally biased away from a perfectly neutral context, as they were in Simulation Study 3.1 (see Box 3.1), where p(w1 | C = c1) = 0.75 and p(w1 | C = c2) = 0.25. Indeed, Simulation Study 3.2 (see Box 3.2) examined prior contexts that were not equally biased relative to the neutral context, with asymmetries resulting (see Figures 3.6 and 3.9). This issue is explored further later in this chapter, but, for the moment, it is worth noting one important conclusion from this discussion: an experimentally defined "neutral" context may not be neutral at all. There may be biases inherent in even those "neutral" stimuli, and, even if they are neutral, the only conditions under which one should expect equal effect sizes for each context relative to the "neutral" condition is when the stimuli are perfectly evenly biased around the perfectly neutral context.
These conditions are unlikely to be met without extremely tight experimental design controls, but BIASES allows an experimenter to predict, in advance of an experiment, the likely effect sizes when comparing stimuli from different conditions. Thus, BIASES can be employed for power analyses and experimental design purposes.

3.3. Testing Predictions of BIASES: Experiment 3.1

In order to further examine the extent to which fine-grained predictions of BIASES can be observed in empirical data, a new set of stimuli was constructed and Experiment 3.1 was conducted. In particular, there were two goals: (a) to determine whether by-subject differences in pay-response rates to different acoustic stimuli predicted specific patterns of top-down effects, and (b) to determine whether there is evidence that subjects' responses to stimuli following a "neutral" context actually reflect Bayes-optimal processing. These two goals were addressed by two model comparison analyses examining the results of Experiment 3.1.

3.3.1. Methods

3.3.1.1. Subjects

Fifteen healthy young adults participated in Experiment 3.1 as part of a multi-experiment session, although all 15 subjects completed this experiment first. Participants received either course credit or 8 dollars. All subjects were right-handed monolingual native speakers of American English, and all participants self-reported having normal hearing and no known neurological diseases.

3.3.1.2. Materials

The stimuli for this study comprised 4 acoustic tokens from a voice-onset time continuum between bay and pay, each of which was appended to a set of noun- and verb-biasing sentence contexts (e.g., He hated the... vs. He hated to...). Stimuli were recorded in a soundproof booth on an Edirol digital recorder (model R09-HR) with a Sony microphone (model ECM-MS907) (sampling rate: 44.1 kHz; 24 bits; stereo) and then resampled in BLISS speech-editing software (Mertus, 1989) (sampling rate: 22.05 kHz; 16 bits; mono: left channel). The speaker was a male native speaker of American English. All sentence frames (e.g., He hated...), biasing function words (to/the), and naturally produced target tokens of bay and pay were produced in isolation multiple times, and tokens were selected from among them for use in the experiment proper. The list of sentence frames consisted of the same 20 main verbs that were used by Fox and Blumstein (2015; see also Chapter 1), but first names were replaced with the pronoun "He" to reduce differences in stimulus duration and to ensure that subjects would not be able to learn mappings between names and subsequent function words. Three contexts were appended to each of the 20 sentence frames: a naturally produced token of the, a naturally produced token of to (both 125 ms in duration), and 125 ms of unintelligible but spectrally similar speech babble (the initial and final 40 ms of which were ramped up and down, respectively). This third condition was dubbed the "noise" condition. In total, this yielded 60 sentence contexts (20 main verbs crossed with the three conditions: He hated to.../the.../[noise]...). To each of these 60 contexts, each of 4 acoustically manipulated tokens from a VOT continuum between bay and pay was appended (yielding 240 total sentences for each subject to respond to in the experiment).
Tokens of the VOT continuum were constructed by concatenating: the unaltered burst of a pay token; a variable amount of aspiration from the natural pay token (whose duration depended on the duration of the vowel removed; see below); the first quasiperiodic pitch period from the natural pay token; and all but the first N pitch periods of a naturally produced token of bay, where N = the stimulus number − 1. The duration of the N pitch periods that were removed was equal to the amount of aspiration added from the pay token, in order to ensure that all tokens had the same overall duration (within 1 ms of 439 ms). In this way, 7 VOT tokens were created. Four tokens, with VOTs of 3, 22, 31, and 48 ms, were selected as stimuli because the middle two were judged to be the most ambiguous and the other two were strong endpoint tokens.

3.3.1.3. Procedure

All subjects heard all stimuli binaurally over headphones, in a random order, in a sound-dampened booth and were instructed to respond whether the last word of each sentence was bay or pay by pressing the appropriately marked button as quickly and accurately as possible, and to guess if they did not know. The button assignments were counterbalanced across subjects. Subjects were told ahead of time that some sentences would not make sense. Subjects completed 2 practice trials before the experiment began. The experiment took about 12 minutes to complete, with no breaks.

3.3.2. Results: Logistic Regression Analysis of Biased Contexts

The results of Experiment 3.1 are analyzed throughout the remaining sections of Chapter 3. First, we consider only the results for the two biasing contexts (shown in Figure 3.14); subsequent analyses examine the results of all three conditions (including noise). To test for an effect of sentential context on speech recognition, the data were analyzed using mixed-effects logistic regression (Baayen, Davidson & Bates, 2008; Jaeger, 2008) (see Chapter 1). There was evidence of a strong influence of VOT on the rate of pay-responses (β = 0.31, p < 0.001) and also a strong sentential context effect (β = 3.03, p < 0.001). Figure 3.15 shows the results by illustrating the effect size as a function of VOT.

Figure 3.14. Mean proportion of pay-responses to tokens from the bay–pay VOT continuum after noun-biasing (...the) and verb-biasing (...to) sentence contexts. Error bars represent standard error.

Figure 3.16 shows the by-subject variability in effect sizes for responses to each VOT token. However, Figure 3.17 shows that subjects also vary in their underlying likelihood model. In particular, in their responses to these tokens, subjects' expected category boundaries differ: some subjects appear to expect exemplars of pay to have much longer VOTs than other subjects do. Because of this, we examined two models for BIASES: one in which all subjects share the same category boundary, and one in which subjects differ. If subjects do, indeed, differ in their category boundaries, then they should also vary in their expected effect size for a given VOT token.

Figure 3.15. Mean difference in the proportion of pay-responses to tokens from the bay–pay continuum after verb- vs. noun-biasing sentence contexts. Error bars represent standard error.
Figure 3.16. For each subject (N = 15), the difference in the proportion of pay-responses to tokens from the bay–pay continuum after verb- vs. noun-biasing contexts, plotted by VOT token (3, 22, 31, and 48 ms). Error bars represent standard error.

Figure 3.17. For each subject (N = 15), their best-fitting (see section 3.3.2) unbiased posterior probability distributions (the probability of a pay-response to tokens from the bay–pay VOT continuum after a theoretical context that is truly neutral).

3.3.3. Results: Model Comparison 1 – Subject Variability

BIASES was implemented as a hierarchical Bayesian statistical model for further analysis. For the present analyses, only the data from responses to the VOT tokens after noun- and verb-biasing contexts (not in the noise condition) were considered. Two separate versions of the model were implemented: in one version, subjects shared one group parameter for the mean of the normal likelihood function for their /b/ category, and in the other version, the mean of the likelihood function could differ between subjects (the hyperprior on subjects' μ_b was presumed to be normally distributed). As noted in Simulation Study 3.1 (see Box 3.1), because the current simulations of BIASES assume equal category variance (as do Feldman et al., 2009; Clayards et al., 2008; Kleinschmidt & Jaeger, 2015), category variance and the distance between category means are confounded. Thus, all model-fitting analyses assume a distance between categories based on the VOT distributions from production data reported by Lisker and Abramson (1964): μ_p − μ_b = 55 ms.

Tables 3.5 and 3.6 show the results of the model-fitting without and with by-subject variability, respectively. All chains converged, as judged from visual inspection of the chains and the Gelman-Rubin statistics for each model: multivariate psrf = 1.01 for both models, and point estimates were all between 1.00 and 1.01 (with upper 95% confidence intervals of 1.00-1.03). Critically, the DIC (popt) was computed for each model in order to determine whether the additional parameters allowing subjects to differ in their phonetic category structure improved the model fit significantly. Penalized deviance scores were 514 for the group-level model and 398.3 for the hierarchical model, despite penalty terms of 6.602 and 44.99, respectively. Table 3.6 provides estimates and HDIs for the parameters in the hierarchical version of BIASES.

Parameter               Median     Mean      SD       95% HDI min    95% HDI max
α                       0.089      0.090     0.016     0.059          0.120
σ²                      266.93     267.51    11.61     246.05         290.97
μ_b                     0.15       0.44      0.43     −0.37           1.26

Table 3.5. Summary of the posterior Markov chains from the model that assumed group-level category structure (i.e., the same μ_b for all subjects).

Parameter               Median     Mean      SD       95% HDI min    95% HDI max
α                       0.060      0.061     0.012     0.040          0.085
σ²                      235.98     236.44    10.77     216.10         257.90
μ_{μ_b}                 0.41       0.44      1.30      −2.12          3.04
σ²_{μ_b} = 1/τ_{μ_b}    21.08      24.13     12.27     7.36           48.97

Table 3.6. Summary of the posterior Markov chains from the model that assumed hierarchical phonetic category structure (i.e., variable μ_b across subjects).
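For readers who want a concrete picture of the hierarchical version, the following sketch expresses its core structure in PyMC. It is an illustration only: the dissertation does not specify the software used, the priors and hyperpriors shown here are assumptions, the trial-level arrays are placeholders, and parameters of the full model not discussed in the text (e.g., α in Tables 3.5-3.6) are omitted.

```python
import numpy as np
import pymc as pm

# Placeholder trial-level data: 15 subjects x 160 biased-context trials each
# (2 contexts x 20 frames x 4 VOT tokens). Real analyses used the observed responses.
n_subj, n_trials = 15, 160
subj = np.repeat(np.arange(n_subj), n_trials)
vot = np.tile(np.tile([3.0, 22.0, 31.0, 48.0], 40), n_subj)
pi = np.tile(np.repeat([1.0, -0.2], 80), n_subj)                 # placeholder log prior odds per context
resp = np.random.default_rng(1).integers(0, 2, subj.size)        # placeholder 0/1 responses

delta_mu = 55.0   # assumed mu_p - mu_b (Lisker & Abramson, 1964)

with pm.Model() as hierarchical_biases:
    # Hyperprior: subjects' /b/ category means vary around a group-level mean
    mu_b_group = pm.Normal("mu_b_group", mu=0.0, sigma=20.0)
    tau_b = pm.HalfNormal("tau_b", sigma=20.0)
    mu_b = pm.Normal("mu_b", mu=mu_b_group, sigma=tau_b, shape=n_subj)

    sigma = pm.HalfNormal("sigma", sigma=30.0)       # shared category SD (ms)
    g = delta_mu / sigma**2                          # slope of the posterior
    chi = mu_b[subj] + delta_mu / 2.0                # each subject's baseline boundary

    p_pay = pm.math.sigmoid(pi + g * (vot - chi))    # logistic posterior of Equations 3.1/3.4
    pm.Bernoulli("resp", p=p_pay, observed=resp)

    trace = pm.sample(1000, tune=1000, chains=4)
```

Comparing this version against the single-boundary version would then proceed via an information criterion; the dissertation reports DIC, whereas a PyMC-based workflow would more typically report WAIC or LOO.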
A posterior distribution was also obtained for each subject's μ_b, so we determined the median of each of these 15 posterior distributions and added half of the assumed μ_p − μ_b (i.e., 27.5 ms) to compute a single point estimate of an approximate category boundary for each subject. Subjects' boundaries, calculated in this way, ranged between 20.72 ms and 33.37 ms (mean = 27.95, SD = 4.02). In order to illustrate the improved model fit obtained by hierarchical modeling of phonetic category structure, we computed the distance of each VOT token from the estimated boundary for each subject (calculated from the median of the subject's posterior distribution) and re-plotted Figure 3.16 with an x-axis reflecting the adjustment of subjects' boundaries to coincide at a single point. This can be seen in Figure 3.18. In short, subjects show larger effects where the model predicts that they should show larger effects (e.g., closer to the category boundary), and this fine-grained variability among subjects is neatly captured by BIASES' assumption that subjects are not all identical in their phonetic expectations for the acoustic realizations of exemplars of /b/ and /p/.

Figure 3.18. For each subject (N = 15), the difference in the proportion of pay-responses to tokens from the bay–pay VOT continuum after verb-biasing vs. noun-biasing sentence contexts. Note that, unlike Figure 3.16 and others, the x-axis is adjusted for subjects' VOT boundaries (distance of the VOT from the subject's boundary, in ms). Error bars represent standard error.

3.3.4. Results: Model Comparison 2 – Inherent Biases in "Neutral" Priors

The previous analyses of the results of Experiment 3.1 focused on the data from the noun-biasing and verb-biasing conditions but ignored the third, "noise" condition, in which subjects heard sentences like He hated [noise] /?ei/. As discussed earlier, when experimenters include a "neutral" condition, how subjects respond to stimuli in that baseline condition must be modeled just like subjects' responses in biased contexts. Next, we examined eight possible models of subjects' conditional priors in order to understand the principles underlying subjects' responses to stimuli both in the biasing contexts and in the noise condition employed in the current experiment. If, in the noise condition, subjects are equally biased towards bay and pay, then the noise context would lie closest to the noun-biased sentence contexts, because those noun-biased contexts are less biased, overall, than the verb-biased contexts (see Table 3.4). Note that, as discussed above, it is not likely that responses in the noise condition will lie perfectly midway between the noun- and verb-biasing contexts. On the other hand, BIASES makes a different prediction about how subjects should respond in the noise condition. In particular, one principle of Bayesian models is that, when some information is not available, the optimal way to integrate that (lack of) information is to "believe" (in the Bayesian sense) each possible value of the cue to the extent that that cue was likely. This is called marginalization (see Chapter 2). In the present circumstances, this would mean that subjects' responses to stimuli in the noise condition should be closer to the verb-biasing contexts than to the noun-biasing contexts. Thus, the "neutral" assumption and the Bayesian (marginalization) assumption make opposite predictions about where subjects' responses to stimuli in the noise condition should fall.
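The arithmetic behind the two assumptions is simple to sketch. Assuming conditional priors like those computed earlier for to and the, and assuming (for illustration) that the two contexts were equally frequent in the experiment, marginalizing over the unheard function word yields a prior for the noise condition that is itself biased rather than neutral. The specific numbers below are placeholders rather than the trigram-based values actually used in the model comparison.

```python
# Conditional priors for pay after each biasing context (placeholder values in the
# spirit of the smoothed bigram estimates computed from Table 3.4).
p_pay_given = {"to": 0.73, "the": 0.45}

# Probability of each context within the experiment (to and the occurred equally often).
p_context = {"to": 0.5, "the": 0.5}

# "Neutral" assumption: hearing noise carries no bias at all.
p_pay_neutral = 0.5

# Bayesian marginalization: weight each possible (unheard) context by its probability.
p_pay_marginal = sum(p_context[c] * p_pay_given[c] for c in p_context)  # about 0.59

# The marginal prior is not 0.5: even the "neutral" noise condition is predicted to
# carry some bias toward pay under these placeholder priors.
print(p_pay_neutral, round(p_pay_marginal, 2))
```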
Figure 3.19 displays subjects' responses in the noise condition added to the same data presented in Figure 3.14. As can be seen, responses in the noise condition lie closer to the verb-biased context than to the noun-biased context, suggesting that subjects were performing marginalization. To confirm this, however, we conducted a model comparison to evaluate the extent to which the marginalization model improved in fit over alternative models.

Figure 3.19. Mean proportion of pay-responses to tokens from the bay–pay VOT continuum following the noun-biasing (e.g., He hated the...) and verb-biasing (e.g., He hated to...) sentence contexts (see also Figure 3.14), as well as in the noise condition (e.g., He hated [noise]...). Error bars represent standard error.

Eight models were compared (see Table 3.7); they differed in the assumed prior for the context conditions (to.../the...) and the assumed prior for the noise condition. For four models, the priors for the context conditions were estimated from the standard bigram model (described in Chapter 2), considering the probability of bay vs. pay based only on the preceding word (to vs. the). The remaining four models employed a more complex trigram language model for their priors, considering the probability of bay vs. pay based not only on the previous word (to vs. the) but also on the main verb preceding it. The four models of each type each employed a different model of the prior when subjects heard the target after the noise condition. These models varied in complexity. One model assumed that subjects treated bay and pay as equally likely after hearing the noise condition (Equal Priors). A second presumed that subjects were biased based on the lexical frequencies of bay and pay. A third assumed that subjects considered the probabilities of bay and pay after marginalizing over the possible bigram contexts in the experiment (to and the). Finally, a fourth model assumed that subjects marginalized over all possible trigram contexts in the experiment (to and the, but only following the main verbs used in the study). If subjects were performing optimally and making use of all of the information available to them, BIASES predicts that they should respond after the biasing contexts (to.../the...) based on the trigram prior, and that they should marginalize over all trigrams in Experiment 3.1. Indeed, the results of the model comparison support that prediction (see Table 3.7). Table 3.8 reports the parameter estimates for this best-performing model.

Assumed Prior for          Assumed Prior for               Mean        Penalty    Penalized
Context Conditions         Noise Condition                 Deviance    Term       Deviance
Bigram (to.../the...)      Equal Priors                    680.4       36.35      716.8
Bigram (to.../the...)      Lexical Frequency               517.3       41.87      559.2
Bigram (to.../the...)      Context-Sensitive (Bigram)      501.9       39.77      541.6
Bigram (to.../the...)      Context-Sensitive (Trigram)     699.9       41.14      741.0
Trigram (marginal)         Equal Priors                    778.9       39.55      818.4
Trigram (marginal)         Lexical Frequency               578.1       40.75      618.9
Trigram (marginal)         Context-Sensitive (Bigram)      505.4       39.31      544.7
Trigram (marginal)         Context-Sensitive (Trigram)     474.5       40.31      514.8  *

Table 3.7. Summary of Model Comparison 2. The best-fitting model (marked here with an asterisk; shaded in the original table) was the model that assumed subjects make use not only of bigram contexts (i.e., the prior word) but also of the second-back word (i.e., trigram contexts) in their contextual priors. This was the most detailed representation of contextual information tested.
Parameter         Median    Mean      SD       95% HDI min   95% HDI max
α                 0.044     0.044     0.007    0.031         0.059
σ²                214.71    214.83    7.89     199.96        230.83
μ₀                0.35      0.35      1.24     -2.10         2.74
σ₀² = 1/τ₀²       18.35     20.77     7.89     7.07          40.68

Table 3.8. Summary of posterior Markov chains from the best-fitting model in Model Comparison 2 (the asterisked model in Table 3.7; fully trigram-driven context model). Note that all models assumed hierarchical phonetic category structure (i.e., variable category means across subjects).

3.3.5. Conclusion

In conclusion, the results of Chapter 3 suggest that listeners' behavior exhibits many hallmarks of a Bayesian spoken word recognition system. Overall, the results lend support to the validity and utility of BIASES for explaining and predicting subjects' speech recognition behavior in experimental tasks. BIASES is capable of accounting for a wide range of variability that is usually ignored by other computational models, including variability among subjects, variability due to speech cues other than VOT, and variability due to prior contexts in an experiment. These findings, along with the theoretical contribution of BIASES – as a model of speech perception in sentential context – illustrate the novelty of the present work. An even more powerful demonstration of the utility of BIASES, though, would be to leverage it to inform theoretical debates in psychology or neuroscience. One key goal of computational modeling is to advance scientific theory by testing and comparing competing hypotheses. This is the aim of Chapter 4.

Chapter 4
Top-Down Effects on Spoken Word Recognition in Aphasia: A Model-Based Assessment of Information Processing Impairments

4.1. Introduction

4.1.1. Brief Introduction

In order to understand spoken language, a listener must ultimately map a continuous acoustic waveform onto discrete lexical forms, which stand at the interface between sound and meaning. However, spoken word recognition is not only influenced by the so-called bottom-up speech signal; as the signal undergoes higher-level cognitive processing, other information sources exert top-down influence on speech perception (for review, see Samuel, 2011). For instance, perception is lexically biased: a phonetically ambiguous segment between /b/ and /p/ tends to be identified as /b/ when followed by –ash (because bash is an English word, but not *pash), but as /p/ when followed by –ast (where past is a word, but not *bast) (Ganong, 1980). Similarly, perception is contextually biased: a phonetically ambiguous stimulus between two words (e.g., bay and pay) tends to be recognized as bay after a noun-biasing sentence context (e.g., Valerie hated the...), but as pay after a verb-biasing sentence context (e.g., Brett hated to...) (Fox & Blumstein, in press). Although the mechanisms underlying the integration of bottom-up and top-down cues remain the subject of considerable debate (see, e.g., McClelland, Mirman & Holt, 2006; Norris, McQueen & Cutler, 2000), there is no question that both types of information influence speech recognition. Top-down effects on speech perception are of particular interest because they reflect dynamics at the confluence of perceptual and cognitive processing, so their emergence and the characteristics of their distribution can reveal key insights about many aspects of human language function (see Chapter 3).
Given the theoretical significance of this class of phenomena, it is noteworthy that far less is known about the pattern of such top-down effects in patients with aphasia. Most such individuals experience at least some receptive language impairments (Boller, Kim & Mack, 1977; Goodglass, Gleason & Hyde, 1970), with deficits arising at many different levels of processing (Goodglass, 1993; Lesser, 1978). What grants the status of top-down effects in patients with aphasia special importance, however, is the substantial evidence that lexical processing – that is, processing at the level where sound contacts meaning – is particularly vulnerable in aphasic syndromes (for review, see Blumstein, 2007). For example, a classic finding about lexical processing by neurologically healthy adults is that, upon hearing a prime word (e.g., cat), the processing of a subsequent target that is a semantic associate of the prime (e.g., dog) is automatically facilitated relative to processing when the prime was not related (e.g., table), as measured by the time required to accurately decide that dog is a word (Meyer & Schvaneveldt, 1971). Moreover, the extent to which listeners access cat (and, in turn, the extent to which processing of dog is facilitated) is modulated by the acoustic (or phonological) distance from cat of a “mispronounced” prime, as indicated by the monotonic ordering of lexical decision latencies for dog after four different prime conditions (from fastest to slowest): cat < *gat < *wat < table (Milberg, Blumstein & Dworetsky, 1988a). Although the implicit processing of semantic associates of perceived primes is typically spared in aphasia (i.e., cat primes dog; Milberg & Blumstein, 1981; Milberg, Blumstein & Shrier, 1982), patients fail to exhibit the characteristic graded sensitivity observed in healthy adults 133 (Milberg, Blumstein & Dworetsky, 1988b), a result that has been interpreted as evidence for lexical processing deficits in such patients. Nonetheless, it remains unclear what mechanisms are responsible for the observed dysfunctions (for review, see Mirman, Yee, Blumstein & Magnuson, 2011). One theory argues that lexical processing deficits arise directly from disruptions to processing dynamics at the level of the lexical representations themselves (Blumstein & Milberg, 2000; Janse, 2006; McNellis & Blumstein, 2001). However, in order to definitively conclude that lexical information is, indeed, specifically implicated, it is important to rule out the possibility that what appear to be lexical processing impairments are actually just downstream consequences of impairments in the bottom-up processing of the speech signal. Unfortunately, since auditory word processing must inevitably require both bottom-up speech processing and accessing the lexical representation, it is easy to see why it has been difficult to rule out this alternative explanation. However, top-down lexical and contextual effects on speech perception may offer a unique window through which to view this question. For instance, the lexical effect (Ganong, 1980) taps information stored within lexical representations because it reflects a comparison between two phonologically similar interpretations of a stimulus, only one of which corresponds to a lexical representation. The existence of one representation (bash) in the lexicon and the corresponding absence of another (*pash) conspire to bias subjects’ identification of speech stimuli toward words. 
As such, to the extent that patients or groups of patients differ in the size of their lexical effect from controls, these differences might be taken to suggest disruptions arising at the lexical level itself. On the other hand, listeners’ identification of spoken words and sounds should, of course, also be affected by 134 bottom-up phonetic and phonological processing deficits. Therefore, the pattern of lexical or contextual effects on speech perception in a patient with a bottom-up processing deficit might also be expected to diverge from the performance of healthy control subjects. The critical question, then, is “When it comes to top-down speech processing, are there unique predictions about the expected consequences of ‘virtual lesions’ at different levels of the spoken word recognition system?” It is not necessarily intuitive how – even in a healthy speech processing system – phonetic, phonological, lexical and sentential processing levels interact during online speech perception and ultimately drive subjects’ responses to, for instance, a stimulus that is ambiguous between bash and *pash. This challenge is multiplied when attempting to deduce how disruptions at different processing levels or to specific cognitive mechanisms might affect the behavior of patients with brain damage and a complex constellation of symptoms at any (or potentially many) of those processing levels. Thus, it is difficult to generate clear predictions about expected patterns of top-down effects in patients with aphasia, and it is also difficult to draw any strong conclusions about the relationship between such data and the nature of those patients’ fundamental information processing deficits, without first identifying a theoretical lens through which to view the data. To that end, the present work enlists the BIASES model (Bayesian Integration of Acoustic and Sentential Evidence in Speech; Chapters 2-3), a probabilistic computational model of spoken word recognition that has been shown to successfully capture key aspects of top-down effects on speech perception in healthy adults. As we will show, BIASES makes clear predictions about how fine-grained differences in the size and 135 distribution of top-down influences from lexical and contextual cues should be expected to emerge as a function of which information-processing levels are disrupted. Thus, by examining the specific patterns of top-down effects from lexical and sentential context in patients with Broca’s aphasia (BA) and Wernicke’s or Conduction aphasia (W/CA), and comparing those results to the distribution of top-down effects in healthy controls, it is possible to distinguish between the independent contributions of a range of processing impairments (including at acoustic-phonetic, phonological, lexical, and contextual processing levels), even when multiple such impairments may coexist in a single patient or group of patients. 4.1.2. Overview of Chapter 4 The central aim of this chapter is to investigate the nature of top-down processing in patients with aphasia and to evaluate the extent to which the pattern of deficits observed in two groups – patients with Broca’s aphasia (BA) and patients with Wernicke’s or Conduction aphasia (W/CA) – might inform the broader theoretical question regarding the locus of patients’ apparent lexical processing deficits. 
To that end, we further elaborate a Bayesian model of speech perception presented in earlier chapters, the BIASES model (Bayesian Integration of Acoustic and Sentential Evidence in Speech; Chapters 2-3). The fundamental principles embodied by this iteration of BIASES, which I call BIASES-A, are consistent with its parent model, as discussed earlier. For example, preceding words can still bias a listener's identification of a stimulus via a context-dependent conditional prior, and the model's likelihood function still computes the relative fit of candidate representations given some acoustic values. However, both the prior and likelihood terms of BIASES-A take different forms than they did when the model was introduced. These adaptations are critical for the model to address the question of theoretical interest here. In fact, in some ways, BIASES-A relaxes some of the assumptions present in the minimalist version of BIASES presented in Chapters 2-3. For instance, the drastically oversimplified likelihood function in BIASES, p(A | wi), characterized each word as a distribution over VOTs, implicitly ignoring subsequent cues (such as the rest of the word). This assumption was sufficient for modeling the perception of minimally paired words (e.g., bay vs. pay) that only differ as a function of VOT, but it must be updated in order to account for lexical biases arising as a function of subsequent phonological information (e.g., whether the rime of the word is –ast or –ash). Of course, while adding complexity to the model in this way improves its ability to accurately characterize the human speech processing system, relaxing certain assumptions requires committing to certain additional assumptions. However, most importantly, this approach illustrates one of the key strengths of BIASES: its flexibility. The architecture of BIASES and its fundamental properties do not change when the prior and likelihood functions are updated to more accurately capture additional findings about human cognition and perception. Thus, while the main goals of this chapter are to assess the prevalence of top-down effects on speech perception in patients with aphasia and to address the theoretical question about the locus of lexical processing deficits in aphasia, this work also serves as a demonstration of the broad range of questions that BIASES can be leveraged to study. Chapter 4 is organized into four parts. First, we briefly review the evidence for lexical processing deficits in aphasia, with special focus on two clinical groups – patients with Broca's aphasia (BAs) and patients with Wernicke's or Conduction aphasia (W/CAs). Patients belonging to each group exhibit a unique pattern of lexical processing deficits. These results motivated the proposal of a theory referred to here as the Lexical Activation Hypothesis (Blumstein & Milberg, 2000; Janse, 2006; McNellis & Blumstein, 2001), which posits that the observed impairments emerge due to disruptions at the level of patients' lexical representations. However, as mentioned earlier, it is unclear whether the impairments might be fully accounted for by bottom-up processing deficits, which are known to afflict most patients with aphasia.
Although relatively little is known about top-down processing in patients with aphasia, and although it is not necessarily obvious how top-down processing might be implicated in or affected by lexical processing deficits, we propose that a model-based analysis of top-down effects on speech perception may offer unique insights about the nature of lexical processing deficits, more broadly. Second, we review the basic principles embodied by the BIASES model of speech perception and show how this probabilistic model can be theoretically linked to the Lexical Activation Hypothesis, which is predicated on a connectionist/activation-based approach to cognitive modeling. We update several assumptions and components of BIASES, calling this iteration BIASES-A, for Aphasia, highlighting the model’s viability for not only capturing the fine-grained statistics of language function (see Chapter 3), but also its ability to reveal novel insights about the sometimes subtle details of language dysfunction. Alternatively, the A could stand for Activation, highlighting another critical contribution of this chapter: linking BIASES to more traditional (that is, connectionist) theories, models and approaches to thinking about spoken word recognition. We review the mathematical form of BIASES-A, briefly addressing the most important implications 138 of the changes from BIASES. Finally, we outline the present work’s model-based approach to the characterization of top-down processing deficits in patients with aphasia. Third, we present Simulation Study 4.1 and Experiment 4.1. Simulation Study 4.1 examines how information processing deficits at different levels should be expected to emerge in behavioral responses during an experiment testing for a lexical effect in patients with aphasia (Simulation Study 4.1). Experiment 4.1, conducted over two decades ago (Blumstein, Burton, Baum, Waldstein & Katz, 1994), examined the lexical effect in patients with aphasia. The present study’s model-based reanalysis of its original data offers new insights into the specific deficits responsible for the patterns reported in the original study. Fourth, Simulation Study 4.2 and Experiment 4.2 follow the same approach as was taken in Simulation Study 4.1 and Experiment 4.1, but they examine the sentential context effects examined in previous chapters. We argue that Chapter 4’s computational and behavioral results, together, lend support to the key ideas embodied by the Lexical Activation Hypothesis. 4.1.3. Lexical Processing in Aphasia 4.1.3.1. Lexical Processing Deficits Lexical access and spoken word comprehension is often profoundly disrupted in aphasia. Recall the early illustration of this by Milberg, Blumstein and Dworetsky (1988a, 1988b), who employed a lexical decision paradigm wherein subjects heard a prime-target pair and were instructed to decide whether the target was a word (e.g., dog) or a non-word (e.g., *jand). On those trials for which the target was a word (dog), the prime that immediately preceded it could come from one of four categories: it could be 139 an unrelated word (table), a semantically related word (cat), a “close” mispronunciation of the semantically related prime (*gat), or a “distant” mispronunciation of the semantic associate (*wat). 
Healthy controls tend to correctly identify the target stimulus, dog, as a word fastest when it was immediately preceded by the correctly pronounced, semantically related prime (cat), followed in order of speed by *gat, *wat, and table, suggesting that lexical access to a word (and therefore to its semantic associates) is graded based on the phonological similarity of the input to the word (Milberg et al, 1988a; see also, e.g., Connine, Blasko & Titone, 1993; Connine, Titone, Dellman & Blasko, 1997; McMurray, Tanenhaus & Aslin, 2002; Utman, 1997; Utman et al, 2001). In contrast, BAs exhibit semantic priming effects when dog is preceded by the correctly pronounced prime, cat, but they fail to show priming in either of the mispronunciation conditions. On the other hand, patients with W/CA are equally primed by *gat and *wat as they are by cat (Milberg et al, 1988b). These results have been interpreted as evidence that lexical access is disrupted in both patient groups, but that the nature of this disruption is not the same for all patients (see also Janse, 2006; Utman et al, 2001; Yee, Blumstein & Sedivy, 2008; but see, e.g., Del Toro, 2000; Tyler, 1992). In BAs, it is more difficult for bottom-up information to activate a lexical representation: only a very good perceptual match for cat is able to access the lexical-semantic network that must be engaged in order to facilitate subsequent recognition of dog. In W/CAs, though, even a poor match between the bottom-up signal and the stable phonological form of a word is able to access that word’s meaning. Clearly, deviation from typical lexical processing dynamics in either direction is likely to impair word comprehension in the real world, where speech is noisy and error-laden (Dell, 1988; Vitevitch, 1997, 2002) 140 and words very often belong to dense phonological neighborhoods (e.g., cat is similar to hat, bat, pat, rat; Luce & Pisoni, 1998). Thus, the locus of lexical processing impairments is of great interest. 4.1.3.2. The Lexical Activation Hypothesis What is the source of these lexical processing deficits? One theory holds that each group’s impairment can be traced to the resting activation level of lexical representations (Blumstein & Milberg, 2000; Janse, 2006; McNellis & Blumstein, 2001). According to this perspective, referred to here as the Lexical Activation Hypothesis, the extent to which *gat primes dog depends not only on the degree of phonological match between a *gat and cat, but also on the baseline activation of cat. Consider a model of semantic priming in which activation spreads (cf. Collins & Loftus, 1975) from cat to dog only after the activation level of cat exceeds some propagation threshold (cf. Rumelhart, 1989), and, thereafter, the amount of priming is related to the amount of supra-threshold activation (up to some maximum activation level). McNellis and Blumstein (2001) showed such a model captures the graded priming results in healthy controls (Milberg et al, 1988a), and alterations to the resting activation levels could explain the patterns in BA and W/CA (1988b). 
Lower resting activation levels rendered it impossible for poorly matching input to exceed cat’s propagation threshold, thereby preventing semantic priming by both close (*gat) and distant (*wat) mispronunciations (as in BA), while raising resting activation levels caused cat’s activation not only to exceed its propagation threshold, but also to quickly reach its maximum level, yielding ceiling-level facilitation of recognition of dog following cat, *gat, and *wat (as in W/CA). 4.1.3.3. Alternative Accounts of Lexical Processing Deficits 141 Critically, the Lexical Activation Hypothesis posits that the locus of the lexical processing deficit is inherent to the lexical representation: words’ resting activation levels are responsible for the observed impairment. However, some alternative explanations implicate the bottom-up processing of the speech signal and the time course of lexical activation. For example, the same pattern as was observed in W/CAs would be expected if those patients had perfectly normal lexical-level representations, but they sometimes misperceived *gat and *wat as cat. Since auditory comprehension is very frequently impaired in patients with W/CA (Blumstein, Baker & Goodglass, 1977a; Eggert, 1977; Luria, 1976; Robson, Keidel, Lambon Ralph & Sage, 2012), this possibility raises an important issue. Indeed, even though phonetic and phonological processing deficits are not as universally associated with BAs, virtually all patients, including BAs, appear to have at least some difficulties (Baker, Blumstein & Goodglass, 1981; Basso et al, 1977; Blumstein et al, 1977a, 1977b, 1984; Carpenter & Rutherford, 1973; Jauhiainen & Nuutila, 1977; Leeper, Shewan & Booth, 1986; Metz-Lutz, 1992; Miceli et al, 1978, 1980; Sasanuma et al, 1976; Utman et al, 2001; Yeni-Komshian & Lafontaine, 1983). This is consistent with neuroimaging research pointing to the involvement of both anterior and posterior brain regions in phoneme perception (Belin, Zatorre, Hoge, Evans & Pike, 1999; Blumstein, Myers & Rissman, 2005; Burton, 2001; Burton, Small & Blumstein, 2000; Poeppel, 1996). Notably, Milberg and colleagues (1988b) did try to rule out this explanation. In a post-experiment lexical decision task, patients in both groups were shown to correctly reject *gat and *wat as non-words while also correctly accepting cat as a word. In line with this finding, many other studies also suggest that, generally speaking, individual 142 subjects’ lexical processing deficits cannot be fully predicted by their bottom-up pre- lexical processing deficits alone (Baker et al, 1981; Basso et al, 1977; Blumstein et al, 1977a, 1977b, 1984; Carpenter & Rutherford, 1973; Caplan et al, 1995; Caplan & Utman, 1994; Csepe et al, 2001; Gow & Caplan, 1996; Jauhiainen & Nuutila, 1977; Leeper, Shewan & Booth, 1986; Metz-Lutz, 1992; Miceli et al, 1978, 1980; Sasanuma et al, 1976; Yeni-Komshian & Lafontaine, 1983). Nevertheless, Robson and colleagues (2012) argue that the primary deficit in W/CA is at the level of the phonological code (cf. Luria, 1976), suggesting that these studies’ inability to find a significant correlation between W/CAs’ phonological processing deficits and their other comprehension difficulties is due to unduly heterogeneous clinical populations, poor task selection, and other factors. 
Additionally, it has also been suggested that Milberg and colleagues’ (1988b) inability to detect priming in BAs following the mispronunciation conditions might have resulted not from an inherent disruption to lexical representations, but rather from a disruption to the dynamics (i.e., time course) of bottom-up lexical activation (Prather, Zurif, Love & Brownell, 1997; Swinney, Zurif & Prather, 1989; Swinney, Prather & Love, 2000). However, recent results using eye-tracking methodologies (which achieve more fine-grained temporal resolution than the priming paradigms) have disputed the idea that the time course of lexical activation is delayed in BA (for review, see Mirman et al, 2011; Yee et al, 2008). 4.1.3.4. Top-Down Effects and Lexical Processing It is apparent that at least part of the theoretical bottleneck that has made the debate between bottom-up and lexical-level accounts of patients’ lexical processing deficits difficult to resolve arises from the inherent difficulty in teasing apart bottom-up 143 processing which accesses lexical information and lexical-level information during typical word recognition tasks. For any task that evaluates behavioral responses to auditory words, potential lexical-level disruptions and potential downstream effects of bottom-up processing disruptions are necessarily confounded. However, top-down effects, are somewhat unique. Top-down effects measure the extent to which higher-level information sources – including lexical-level information (like lexical status or frequency) and contextual information that influences lexical-level predictions (cf. Chapter 2) – bias the perception of spoken words or sounds. What does it mean to observe a top-down effect for a given acoustic stimulus? If the same word-initial segment that is phonetically ambiguous between /b/ and /p/ is labeled /b/ when followed by –ash, more often than when followed by –ast, then this means that, for the same bottom-up stimulus, lexical-level information is influencing subjects’ ultimate speech recognition (Ganong, 1980). The significance of this observation is that the sizes of lexical or contextual effects are scaled with the strength of bias provided by top-down cues. However, bottom- up processing will also influence the ultimate size of the top-down effects: if the bottom- up processing reveals that a stimulus is almost certainly an exemplar of some particular word or phonetic category, then the top-down cue will have little impact on the response rate and there will not be a large top-down effect observed (cf. Chapters 2-3). Put another way, the ultimate size of a lexical or contextual effect on the perception of a stimulus will be influenced by both the bottom-up and the top-down processing of the stimulus (which includes lexical-level information and contextual information). As such, disrupting either bottom-up or lexical-level processing is likely to lead to behavioral differences in the size 144 of top-down effects on speech perception. The challenge is to separate out the influences of each. This theoretical and computational challenge is addressed from an information- processing standpoint by the BIASES model (Bayesian Integration of Acoustic and Sentential Evidence in Speech; Chapters 2-3), and it is to this model that we now turn. 4.2. Applying BIASES to Spoken Word Recognition in Aphasia 4.2.1. 
Brief Overview of BIASES The BIASES model of speech perception describes the mathematically optimal way of combining top-down information sources (such as lexical frequency or the contextual predictability of a word) and bottom-up information sources (such as acoustic cues in the stimulus). In Chapter 3, it was shown that BIASES provides a principled account for a number of fine-grained differences in the sizes of top-down effects on speech perception, explaining how properties of the model’s prior (which corresponds to top-down information sources) and the model’s likelihood (which corresponds to bottom-up information sources) should influence the ultimate size of the top-down effect for a given pair of conditions (e.g., two sentential contexts) and for a given acoustic signal (e.g., for a given voice-onset time, or VOT). Critically, though, the model’s predictions about how large a top-down effect should be depend on the information contained within a model’s prior and likelihood functions. Thus, if the underlying information contained within either the prior or the likelihood term of BIASES is disrupted, or if the information processing dynamics that govern the computations within the prior or the likelihood term of BIASES are disrupted, the expected size of top-down effects for a given pair of contexts and a given acoustic stimulus will also change. 145 Thus, in order to gain some insight into the nature of the information processing deficits in aphasia, we adapted the parent model, BIASES, to allow for the examination of how different “virtual lesions” to a child model, BIASES-A (for Aphasia), should influence the predicted sizes of top-down effects from lexical status, from lexical frequency and from sentence contexts. 4.2.2. From Activations to Probabilities: Lexical Activation Hypothesis BIASES is a probabilistic computational model which, like Shortlist B (Norris & McQueen, 2008), but unlike connectionist models of spoken word recognition (e.g., TRACE: McClelland & Elman, 1986; Shortlist: Norris, 1994; Merge: Norris, McQueen & Cutler, 2000), does not rely on any notion of activation. Instead, the amount of support for a given candidate in some set of mutually exclusive alternatives (e.g., a word in the lexicon) is related to its probability, which is computed relative to the other candidates. While this approach has many advantages (see, e.g., Chater, Tenenbaum & Yuille, 2006; Norris, 2006; Norris & McQueen, 2008) it is important to consider the relationship between probabilistic models and activation-based models (for recent reviews, see McClelland, 2009, 2013; McClelland, Mirman, Bolger & Khaitan, 2014). This is particularly crucial for the present modeling effort because, while BIASES does not rely on any notion of activation, the theoretical claim instantiated within the Lexical Activation Hypothesis about the underlying basis of lexical processing deficits in aphasia is couched within the language of words’ baseline levels of activation: BAs have lower baseline levels of activation than healthy adults, while W/CAs have higher baseline levels of activation than healthy adults. This raises the following critical question: what sort of lesion to the lexical information in BIASES would mimic the effects of changes in 146 baseline activation levels described by the Lexical Activation Hypothesis? The answer to this question requires drawing three theoretical links between activation-based models and probabilistic models. 
First, note that real-valued activation levels in a finite set of units in a connectionist model can be scaled (i.e., each divided by the sum of the activations of the entire set) in order to create a probability distribution, and, critically, this computation preserves the relevant ratios of all pairs of activation levels (Hinton & Sejnowski, 1983; Khaitan & McClelland, 2010; Luce, 1959; McClelland, 1991; McClelland, Mirman, Bolger & Khaitan, 2014; Movellan & McClelland, 2001; for a tutorial and review, see McClelland, 2013). Second, lexical frequency (which, by definition, characterizes a probability distribution over the lexicon) has a robust effect on spoken word recognition and speech perception (e.g., Connine, Mullennix, Shernoff & Yellen, 1990; Dahan, Magnuson & Tanenhaus, 2001; Howes, 1954; Luce, 1986; Marslen-Wilson, 1987; Pollack, Rubenstein & Decker, 1960; Savin, 1963; Taft & Hambly, 1986). Applying the converse of the relationship described in the first theoretical link (connecting activation levels to probabilities) suggests that words' baseline lexical activations should be scaled by their lexical frequencies (see, e.g., Dahan, Magnuson & Tanenhaus, 2001). Third, the last step is to determine how the raising and lowering of words' baseline lexical activation in an activation-based framework can be mimicked in the probabilistic framework. The approach we take is to transform the probability of each word wi in the lexicon of Nw words by applying the function A (for Activation), which is defined in Equation 4.1:

Equation 4.1
$$\mathbf{A}\big(p(w_i), \phi\big) = \frac{p(w_i)^{\phi}}{\sum_{j=1}^{N_w} p(w_j)^{\phi}}$$

The function A raises each wi's probability to the same exponent (φ), and then rescales the distribution so that it sums to one (as is required for any probability distribution). Crucially, A has the following four properties:

Property 1. For φ = 1: A(p(wi), φ) = p(wi) for all wi.
Property 2. For φ > 1: A(p(wi), φ) < p(wi) for less probable (initially) wi; A(p(wi), φ) > p(wi) for more probable (initially) wi.
Property 3. For φ < 1: A(p(wi), φ) > p(wi) for less probable (initially) wi; A(p(wi), φ) < p(wi) for more probable (initially) wi.
Property 4. For φ = 0: A(p(wi), φ) = 1/Nw for all wi.

When φ > 1, the distribution becomes more extreme, or peaked, with the most probable words becoming even more probable and the least probable words becoming even less likely, so a virtual lesion that increases φ will cause the "rich to get richer." Conversely, when φ < 1, the distribution becomes more uniform, essentially "watering down" frequency effects in the initial, un-lesioned distribution. Smaller values of φ reduce frequency effects further and further until φ = 0, at which point frequency effects are totally eliminated by functionally transforming the prior distribution over words into the uniform distribution: A(p(W), φ = 0) = Unif(1, Nw).

4.2.2.1. Preliminary Simulations: Lexical Activation Hypothesis

In order to establish the theoretical link between BIASES-A and the Lexical Activation Hypothesis, we must determine how φ relates to changes in lexical activation levels. That is, will increases in the baseline lexical activation levels (as hypothesized to underlie the lexical processing deficits in W/CAs) more closely match Property 2 (the "rich get richer" case) or Property 3 (the "watering down" case)?
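As a concrete illustration of the function A and its four properties, here is a minimal Python sketch applied to a small hypothetical lexicon; the frequency values are invented for illustration and are not drawn from any corpus used in this work.

```python
import numpy as np

def A(p, phi):
    """Equation 4.1: raise each probability to the exponent phi, then renormalize."""
    p = np.asarray(p, dtype=float)
    weighted = p ** phi
    return weighted / weighted.sum()

# Hypothetical prior over a 4-word lexicon, proportional to invented frequencies.
freqs = np.array([80.0, 15.0, 4.0, 1.0])
prior = freqs / freqs.sum()

print(A(prior, 1.0))   # Property 1: distribution unchanged
print(A(prior, 1.5))   # Property 2: "rich get richer" (more peaked)
print(A(prior, 0.5))   # Property 3: "watered down" (flatter)
print(A(prior, 0.0))   # Property 4: uniform distribution, 1/Nw for every word
```

With these invented numbers, φ = 1.5 raises the most probable word's prior from .80 to roughly .91 (Property 2), while φ = 0.5 lowers it to roughly .57 (Property 3).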
In order to match the Lexical Activation Hypothesis to the computational approach embodied by the function A, we simulated a simple example with a "toy lexicon" of Nw = 20 words. For each word wi, a frequency fi between 10 and 100 was randomly determined (fi ~ Unif(10, 100)) to serve as that wi's effective baseline activation value, yielding a lexicon with activation values represented by the vector F (containing the frequencies for all Nw words). Then, in two separate simulations with the same lexicon F, to mimic the predicted activation levels for patients with BA and W/CA (McNellis & Blumstein, 2001), we either subtracted or added the same value, η, to each word's activation level, yielding a new baseline activation vector F′ = F ± η.⁷ Finally, for both the activation subtraction simulation (F′BA = F − η) and the activation addition simulation (F′W/CA = F + η), the "pre-lesion" activation vector, F, and the updated "post-lesion" baseline activation vector, F′, were normalized to obtain pre-lesion and post-lesion probability distributions (Equations 4.2-4.4):

Equation 4.2
$$p(W) = \frac{F}{\sum_{i=1}^{N_w} f_i}$$

Equation 4.3
$$p_{BA}(W) = \frac{F'_{BA}}{\sum_{i=1}^{N_w} f'_i} = \frac{F - \eta}{\sum_{i=1}^{N_w} (f_i - \eta)}$$

Equation 4.4
$$p_{W/CA}(W) = \frac{F'_{W/CA}}{\sum_{i=1}^{N_w} f'_i} = \frac{F + \eta}{\sum_{i=1}^{N_w} (f_i + \eta)}$$

⁷ To prevent any of the words' activation levels from becoming negative, η was set to half of the least frequent word's activation value (η = min(F)/2).

Figure 4.1 shows the effects of subtracting and adding to the baseline activation levels for the corresponding probability distributions. Decreasing words' baseline activation levels (the mechanism implicated in lexical processing deficits in BA, according to the Lexical Activation Hypothesis) enhances frequency effects. The most frequent words (i.e., words with relatively higher pre-lesion activation levels) became even more probable, and the rarest words became even less probable. Conversely, increasing words' baseline activation levels (the mechanism implicated in lexical processing deficits in W/CA, according to the Lexical Activation Hypothesis) diminishes frequency effects. The most frequent words and the least frequent words have less disparate post-lesion probabilities.

[Figure 4.1 appears here: probability of each word (sorted by frequency), pre-lesion vs. post-lesion, in two panels (Decrease Activation; Increase Activation).]

Figure 4.1. Results of simulations of the Lexical Activation Hypothesis: the probability of each word before and after the virtual lesion. Virtual lesions involved either increasing or decreasing the activation level of each word by a constant amount (cf. McNellis & Blumstein, 2001).

Connecting the Lexical Activation Hypothesis and this simulation's results back to the probabilistic model (BIASES-A) and the function A, it is clear that decreasing the baseline activation levels of words corresponds to increasing φ (see Property 2 of A), while increasing baseline activation levels of words corresponds to decreasing φ (see Property 3 of A). At an intuitive level, if baseline activation levels are increased by a constant amount (as in W/CAs; Blumstein & Milberg, 2000) while leaving the threshold for lexical access or word recognition constant (McNellis & Blumstein, 2001), and thereby requiring less bottom-up activation to achieve lexical access (i.e., *wat primes dog as much as cat in W/CA; Milberg et al, 1988b), then the activation of the lexical representation will not be a very reliable cue to the actual presence of the word in the speech signal. Consequently, lexical-level cues should be less reliable for W/CAs than they are for healthy adults; from an information processing perspective, less reliable cues should be down-weighted, which is precisely the effect of decreasing φ.
The opposite is true of decreasing baseline activation levels and increasing 𝜙: the eventual activation of a lexical representation in BA is a more reliable cue to the actual presence of the word in the speech signal, leading to greater reliance on lexical-level information. Note that, while our approach in the activation-based simulations above was designed to match the approach taken by McNellis and Blumstein (2001) in their proof- of-concept computational implementation of the Lexical Activation Hypothesis (adding/subtracting a constant value η to each word’s activation level; cf. Morton, 1969; see also Norris, 2006), this approach is not mathematically equivalent to A’s exponential re-weighting of the entire distribution by 𝜙. Our central aim was to match the overall directionality of the effects of parametric manipulations in the activation-based and 151 probabilistic frameworks on extreme values (e.g., the most and least frequent words). It is also worth noting that McNellis and Blumstein (2001) did not explicitly account for frequency effects. What is most important, though, is that while the details of the present probabilistic approach are not identical to the approach taken by McNellis and Blumstein (2001), the overall theoretical link is clear: the Lexical Activation Hypothesis should predict that, if controls have 𝜙 = 1, then behavioral responses in BAs should be more influenced by lexical-level information (𝜙 > 1), while W/CAs should exhibit less of an influence from lexical-level information (𝜙 < 1). 4.2.2.2 Implications for Top-Down Effects on Speech Perception Having derived the theoretical implications of the Lexical Activation Hypothesis for the probabilistic modeling approach, it is now possible to deduce principled predictions about the influence of lexical-processing deficits on top-down processing of speech. Because the Lexical Activation Hypothesis, as interpreted here, predicts that lexical-level information will be weighted more by BAs than by healthy controls, but less by W/CAs than by healthy controls, lexical status should have a greater influence on BAs’ responses to stimuli ambiguous between a word and non-word, but it should have a weaker influence on W/CAs’ responses. Implicit in this conclusion is the assumption of a relationship between “lexical status” and “lexical frequency.” This assumption represents the basic principle that non-words are, in the limit, not so different from very low frequency words. Thus, the effects of lexical status and frequency effects are given a unified, if simple, explanation within the prior of the BIASES model: listeners expect to encounter more probable stimuli (see Chapters 2-3). Since non-words are less probable than words (see also, Norris & McQueen, 2008), an effect of lexical status might be 152 thought of as a special case of a lexical frequency effect. It is worth noting that since Ganong’s (1980) original demonstration of the lexical effect, a number studies have reported hints of frequency effects (e.g., Fox, 1984; Fox & Blumstein, in press; Newman et al, 1997) and Connine, Blasko and Titone (1993) showed that the frequency of words within an experiment could drive top-down effects on speech perception that mirrored Ganong’s lexical effect (1980). Note, however, that, since lexical frequency is estimated by counting the number of times a word appears in some corpus, all non-words will, by definition, have a lexical frequency of 0. 
Clearly, subjects must be capable of recognizing a string of phonemes that they have never heard before. While most speech an adult will hear on any given day will be composed of words in her lexicon, there must be some mechanism to "back off" to when a listener encounters foreign words, new words, or proper nouns such as names they have never heard before. Additionally, some such computational machinery is obviously critical for learning in infancy and childhood (Feldman, Griffiths, Goldwater & Morgan, 2013). Even more relevant to the current situation, in the context of an experimental setting like Ganong's (1980), in which subjects hear dozens or sometimes hundreds of trials, they often identify the stimulus as a non-word. Thus, a subject's prior expectation should certainly not be completely determined by the lexical frequency of a stimulus as estimated from a corpus. In order to account for top-down effects of both lexical status and lexical frequency, what is needed is a prior that is influenced by frequency, but which also allows subjects to "expect the unexpected," as the case may be for non-words (with a corpus-estimated frequency of 0). Thus, in order to account for the effect of lexical status on spoken word recognition within the probabilistic framework, a simple approach is to allow non-words to have some small prior probability (Norris & McQueen, 2008; see also Chapters 2-3). In the current model, the prior probability for a non-word is estimated by fitting a smoothing parameter (Lidstone, 1920) to healthy controls' lexical effect data. Based on the success of parallel assumptions about BIASES in modeling the sizes of sentential context effects (see Chapter 3), we assume that control subjects are optimally making use of lexical information (φ = 1), but that their lexical prior includes lexical frequency information that is smoothed by some positive and nonzero "pseudo-count" (α), which serves as an estimate of the prior expectancy for all non-words. It is further assumed that patients' underlying model of lexical information is the same as the controls' model (i.e., the same frequency counts for all words and the same smoothing parameter, α), but that patients may weight this information differently than the controls (φ). It is important to note that drawing a relationship between lexical status and lexical frequency does not demand that every possible non-word be explicitly represented in the mental lexicon alongside every word, albeit with a different (lower) effective frequency estimate; indeed, this would not be a very plausible lexicon. Rather, we adopt the theoretical perspective that, during speech perception, candidate word-forms compete for recognition in a lexical buffer (Blumstein, 1994, 1998). A candidate's prior probability is determined by the sum of α (the smoothing parameter discussed above) and the candidate's lexical frequency (0 for non-words, but non-zero and positive for words). This framework allows all candidates (whether words or non-words) to have some baseline probability of being perceived (related to α), while a word's prior is also influenced by its frequency.
Because a constant α is added to the counts of all words, the prior probability of any word (with a nonzero frequency estimate) is greater than that of a non-word, allowing the model to capture top-down effects of lexical status (Ganong, 1980), and the prior probability of a given word is greater than that of any less frequent word, allowing the model to capture top-down effects of lexical frequency (Connine et al, 1993). Moreover, this framework predicts that the relative size of shifts in categorization due to lexical status should be more apparent the more common the word in a non-word/word pair is (e.g., a greater bias towards past in *bast–past than towards bash in *bash–pash). We return to this prediction later. The conclusions of this section are summarized in Table 4.1.

Clinical        Semantic Priming        Baseline       p(W)       Lexical           Frequency
Diagnosis^a     Lex. Dec. Latencies^b   Lex. Act.^c    weight^d   Effects^e         Effects^f
W/CA            cat = *gat = *wat       ρ raised       φ < 1      λ_W/CA < λ_C      "watered down"
Control         cat < *gat < *wat       ρ baseline     φ = 1      λ_C (reference)   baseline
BA              cat < *gat = *wat       ρ lowered      φ > 1      λ_BA > λ_C        "rich get richer"

Table 4.1. Summary of Probabilistic Approach to the Lexical Activation Hypothesis: ^a W/CA = Wernicke's or Conduction aphasia; BA = Broca's aphasia; ^b Semantic Priming Lexical Decision Task; patterns of response latencies for YES responses to dog (Milberg, Blumstein & Dworetzky, 1988); ^c Pattern of resting activation levels (ρ) according to the Lexical Activation Hypothesis (Blumstein & Milberg, 2000; McNellis & Blumstein, 2001); ^d Pattern of weighting of lexical information (φ) in the probabilistic approach that matches the effects of baseline lexical activation modulation; see Equation 4.1 and Figure 4.1; ^e Predicted effects of model lesion on the influence of lexical status (λ); based on φ; ^f Predicted effects of model lesion on the influence of lexical frequency information; based on φ.

4.2.3. Implementing BIASES-A

As described in Chapter 2, the fundamental assumption of BIASES is that when subjects categorize speech stimuli, their responses reflect both (1) the relative perceptual match between the available acoustic signal and each candidate response, and (2) the relative predictability of each candidate, irrespective of what was ultimately perceived. In that sense, BIASES reflects the optimal integration of a bottom-up, perceptually driven processing stream and a top-down, expectation-driven processing stream. The introduction of BIASES in Chapters 2 and 3 places much focus on modeling the top-down constraints on expectations for future words provided by sentential context. However, as discussed above (and in Chapter 2), lexical status and lexical frequency can also serve to constrain expectations for upcoming linguistic material (cf. Norris & McQueen, 2008). Equation 4.5 encapsulates the fundamental properties of BIASES:

Equation 4.5
$$p(w_i \mid A) \propto p(A \mid w_i)\, p(w_i)$$

In short, subjects' word identification decisions are generated based on the posterior probability function, p(W | A), which is proportional to the product of the likelihood, p(A | W), and the prior, p(W). While the likelihood function indexes how representative of each candidate word the perceived speech signal is, the prior indexes how probable each candidate word was to begin with (or, a priori). Also, recall that BIASES implements the influence of sentential context, C, by allowing C to constrain the prior, p(W | C). While the basic form of Equation 4.5 also underlies BIASES-A (the "child model" presented in this chapter), several adaptations were made to BIASES, the effects of which were (a) to enhance the breadth of the empirical coverage of BIASES and, importantly for the questions addressed in this chapter, (b) to leverage the theoretical framework provided by BIASES for the purpose of providing a computational-level explanation for fine-grained differences in the patterns of top-down effects in patients with aphasia. In doing so, several simplifying assumptions made during the initial presentation of BIASES were revisited, ultimately yielding a more complicated, but also more realistic and more accurate, model of human speech perception.

4.2.3.1. Adapting the Prior and Likelihood of BIASES

The model was enhanced in four main ways. First, the likelihood function was updated to include a phonological-processing stage interceding between the acoustic and lexical levels. Second, the likelihood model was updated in another way to allow the rime of a stimulus to influence speech recognition, rather than only accounting for acoustic information available from the onset's VOT.
Third, a smoothing parameter was added in order to allow novel phonological forms to have nonzero prior probabilities. Finally, a lexical buffer was added to the model, following work suggesting that the phoneme identification task may not necessarily tap phonemic processing, per se (Fox & Blumstein, in press; Swinney & Prather, 1980). In the updated model, BIASES-A, upon perceiving a monosyllabic stimulus, Bayes' rule gives the probability of recognizing a candidate word-form, fi, given the initial segment's voice-onset time, V, and the stimulus's rime, R (Equation 4.6).

Equation 4.6
$$p(f_i \mid V, R) = \frac{p(f_i)\, p(V, R \mid f_i)}{\sum_{j=1}^{N_w} p(f_j)\, p(V, R \mid f_j)}$$

Assuming that the VOT and rime are independent cues to the phonological form of the stimulus yields Equation 4.7:

Equation 4.7
$$p(f_i \mid V, R) = \frac{p(f_i)\, p(R \mid f_i, V)\, p(V \mid f_i)}{\sum_{j=1}^{N_w} p(f_j)\, p(R \mid f_j, V)\, p(V \mid f_j)} = \frac{p(f_i)\, p(R \mid f_i)\, p(V \mid f_i)}{\sum_{j=1}^{N_w} p(f_j)\, p(R \mid f_j)\, p(V \mid f_j)}$$

In the current model, and as shown in Equation 4.8, we assume rimes to be deterministically related to word-forms; while this is certainly not true of real speech, the lexical effect stimuli examined in this study (see Experiment 4.1's Methods) were blocked by continuum, so subjects only heard stimuli with a single rime many times for half of the experiment, and then the rime switched for all of the stimuli for the rest of the experiment. Thus, for our purposes, it is probably safe to assume that participants could accurately map rimes to associated word-forms. In particular, the lexical effect continua considered here ranged from *dut to toot and from duke to *tuk (see Experiment 4.1's Methods).

Equation 4.8
$$p(r_{/ut/} \mid f_i) = \begin{cases} 1 & f_i \in \{/tut/, /dut/\} \\ 0 & f_i \in \{/tuk/, /duk/\} \end{cases} \qquad p(r_{/uk/} \mid f_i) = \begin{cases} 0 & f_i \in \{/tut/, /dut/\} \\ 1 & f_i \in \{/tuk/, /duk/\} \end{cases}$$

Add-alpha smoothing (Lidstone, 1920) was implemented for the prior. Thus, even word-forms that never appeared in the Brown corpus (Kucera & Francis, 1963) had some prior probability. Equation 4.9 indicates the smoothed frequency estimates for all four relevant word-forms (prior to normalization) in Experiment 4.1, where κtoot and κduke are the lexical frequency counts of those words in the Brown corpus and α is the smoothing parameter, fit as described earlier.

Equation 4.9
$$p(f_i) \propto \begin{cases} \kappa_{toot} + \alpha & f_i = /tut/ \\ \alpha & f_i = /dut/ \\ \alpha & f_i = /tuk/ \\ \kappa_{duke} + \alpha & f_i = /duk/ \end{cases}$$

The second term of the likelihood function, p(V | fi), which mapped VOTs onto word-forms, was modeled such that the mixture components of the Gaussian mixture model were onsets (namely, either /t/ or /d/) with normally distributed VOTs (still assuming equal category variance). As a simplifying assumption, it was assumed that the distribution of VOTs for a word's initial consonant depends on which consonant the word begins with (/t/ vs. /d/), but is otherwise independent of the identity of the word itself (but see, e.g., Baese-Berk & Goldrick, 2009; Fox, Reilly & Blumstein, 2015). Equations 4.10 – 4.12 show how p(V | fi) is expanded to account for a phonological processing level.

Equation 4.10
$$p(V \mid f_i) = \sum_{k=1}^{N_o} p(V, o_k \mid f_i) = \sum_{k=1}^{N_o} p(o_k \mid f_i)\, p(V \mid f_i, o_k) = \sum_{k=1}^{N_o} p(o_k \mid f_i)\, p(V \mid o_k)$$

Equation 4.11
$$p(V \mid f_i) = p(o_{/t/} \mid f_i)\, p(V \mid o_{/t/}) + p(o_{/d/} \mid f_i)\, p(V \mid o_{/d/})$$

Equation 4.12
$$V \mid o_k \sim N(\mu_k, \sigma_k^2), \qquad p(V \mid o_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(V-\mu_k)^2}{2\sigma_k^2}}$$

Additionally, a parameter was added that allows perceptual processing of the onset's VOT to be degraded, with the degree of degradation assumed to be independent of the value of the onset's VOT, as shown in Equation 4.13.

Equation 4.13
$$S \mid v \sim N(v, \sigma_N^2), \qquad S \mid o_k \sim N(\mu_k, \sigma_k^2 + \sigma_N^2) = N(\mu_k, \sigma^2 + \sigma_N^2) \text{ under equal category variance}$$
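To make Equations 4.6–4.12 concrete, the following is a minimal Python sketch of the un-lesioned BIASES-A posterior over the four word-forms from Experiment 4.1. The corpus counts, α, and the VOT category parameters below are invented placeholders for illustration, not the values fitted in this chapter.

```python
import numpy as np

# Candidate word-forms in the lexical buffer (Experiment 4.1 continua).
FORMS = ["/tut/", "/dut/", "/tuk/", "/duk/"]

# Hypothetical ingredients (NOT the fitted values from this chapter):
ALPHA = 0.5                                                  # Lidstone pseudo-count (Eq. 4.9)
COUNTS = {"/tut/": 4, "/dut/": 0, "/tuk/": 0, "/duk/": 25}   # invented corpus counts
MU = {"/t/": 60.0, "/d/": 10.0}                              # mean VOT (ms) per onset category
SIGMA2 = 15.0 ** 2                                           # shared category variance

def prior(form):
    """Eq. 4.9: smoothed, frequency-based prior (unnormalized)."""
    return COUNTS[form] + ALPHA

def p_rime_given_form(rime, form):
    """Eq. 4.8: rimes map deterministically onto word-forms."""
    return 1.0 if form.endswith(rime + "/") else 0.0

def p_vot_given_form(v, form):
    """Eqs. 4.10-4.12 with a deterministic form-to-onset mapping:
    the VOT likelihood is a Gaussian centered on the form's onset category."""
    onset = "/t/" if form[1] == "t" else "/d/"
    return np.exp(-(v - MU[onset]) ** 2 / (2 * SIGMA2)) / np.sqrt(2 * np.pi * SIGMA2)

def posterior(v, rime):
    """Eqs. 4.6-4.7: p(f_i | V, R) proportional to p(f_i) p(R | f_i) p(V | f_i)."""
    scores = np.array([prior(f) * p_rime_given_form(rime, f) * p_vot_given_form(v, f)
                       for f in FORMS])
    return dict(zip(FORMS, scores / scores.sum()))

# A VOT at the category boundary is pulled toward the word in each rime condition:
print(posterior(35.0, "ut"))   # favors /tut/ (toot is a word, *dut is not)
print(posterior(35.0, "uk"))   # favors /duk/ (duke is a word, *tuk is not)
```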
4.2.3.2. Modeling Speech Processing Deficits in BIASES-A

In addition to the assumptions outlined above, BIASES-A makes three basic assumptions, listed in Equation 4.14, about speech processing in healthy adults (young controls, YC, and age-matched [to the patients] controls, AMC). First, the relationship between word-forms and their onsets is assumed to be deterministic. Secondly, we assume that σN², the additional variance associated with perceptual processing deficits, is 0 for all healthy control subjects. That is because, when equal category variance is assumed, unless a noise manipulation is included in the experiment (cf. Feldman et al, 2009), σ² and σN² are not identifiable parameters in model-fitting. Thirdly, as mentioned earlier, we assume that healthy controls optimally weight lexical information after fitting a smoothing parameter, α, to the model.

Equation 4.14
$$\varepsilon_{YC} = \varepsilon_{AMC} = 0, \qquad \sigma^2_{N,YC} = \sigma^2_{N,AMC} = 0, \qquad \phi_{YC} = \phi_{AMC} = 1$$

Critically, these three assumptions were not made about speech processing in patients with aphasia. To the extent that patients do not have perfect lexical-phonological processing, BIASES-A implements this as shown in Equations 4.15 – 4.17.

Equation 4.15
$$p(o_{/t/} \mid f_i) = 1 - p(o_{/d/} \mid f_i) = \begin{cases} 1-\varepsilon & f_i \in \{/tut/, /tuk/\} \\ \varepsilon & f_i \in \{/dut/, /duk/\} \end{cases}$$

Equation 4.16
$$V \mid f_i \sim \begin{cases} (1-\varepsilon)\, N(\mu_{/t/}, \sigma_{/t/}^2) + \varepsilon\, N(\mu_{/d/}, \sigma_{/d/}^2) & f_i \in \{/tut/, /tuk/\} \\ \varepsilon\, N(\mu_{/t/}, \sigma_{/t/}^2) + (1-\varepsilon)\, N(\mu_{/d/}, \sigma_{/d/}^2) & f_i \in \{/dut/, /duk/\} \end{cases}$$

Equation 4.17
$$p(V \mid f_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \begin{cases} e^{-\frac{(V-\mu_{/t/})^2}{2\sigma^2}} + \varepsilon\!\left(e^{-\frac{(V-\mu_{/d/})^2}{2\sigma^2}} - e^{-\frac{(V-\mu_{/t/})^2}{2\sigma^2}}\right) & f_i \in \{/tut/, /tuk/\} \\ e^{-\frac{(V-\mu_{/d/})^2}{2\sigma^2}} + \varepsilon\!\left(e^{-\frac{(V-\mu_{/t/})^2}{2\sigma^2}} - e^{-\frac{(V-\mu_{/d/})^2}{2\sigma^2}}\right) & f_i \in \{/dut/, /duk/\} \end{cases}$$

Figure 4.2 illustrates simulations teasing apart the influence of lexical-phonological processing impairments (ε > 0) and acoustic-phonetic processing deficits (σN² > 0) on the model's likelihood function. Finally, Equation 4.18 summarizes the full model for the new BIASES-A from which behavioral data can be simulated, where p(zt | V, r/ut/) is the probability of a /t/-response given a stimulus with VOT value V from the */dut/–toot continuum and p(zt | V, r/uk/) is the probability of a /t/-response given a stimulus with VOT value V from the duke–*/tuk/ continuum.

Equation 4.18
$$p(z_t \mid V, r_{/ut/}) = \frac{1}{1 + \exp\!\left\{-\left[\phi \log\!\frac{\kappa_{toot}+\alpha}{\alpha} + \log\!\frac{p(V \mid f_{/tut/})}{p(V \mid f_{/dut/})}\right]\right\}}$$
$$p(z_t \mid V, r_{/uk/}) = \frac{1}{1 + \exp\!\left\{-\left[\phi \log\!\frac{\alpha}{\kappa_{duke}+\alpha} + \log\!\frac{p(V \mid f_{/tuk/})}{p(V \mid f_{/duk/})}\right]\right\}}$$

where p(V | fi) is given by Equation 4.17.

[Figure 4.2 appears here: p(VOT | fi) for the /d/-initial ({f/dut/, f/duk/}) and /t/-initial ({f/tut/, f/tuk/}) word-forms, with one column of panels varying ε (0, 0.01, 0.05, 0.15; lexical-phonological processing) and the other varying σN (0, 10, 14, 20; acoustic-phonetic processing).]

Figure 4.2. Dissociable influences of two bottom-up processing components on speech recognition. Acoustic-phonetic and lexical-phonological processing deficits are modeled as unique information processing transformations that, together, comprise the likelihood function of BIASES-A. Virtual lesions to each have unique effects on the probability density functions, p(V | fi), of each fi, where each fi is a lexical candidate (e.g., toot vs. *dut) competing for recognition in a lexical buffer.
The influence of acoustic-phonetic processing impairments (modeled as an increase in the category variance, σN) is to render a wider range of VOTs "somewhat representative" of each fi. The influence is assumed to be uniform over all onsets (and therefore over all fi). On the other hand, the influence of lexical-phonological processing impairments is modeled as an increased chance of "mishearing" the onset due to an increasingly noisy mapping between word-forms and onsets, where the probability of an errant lexical-phonological mapping is given by ε = p(/t/ | *dut) = p(/d/ | toot). The rate of noisy mappings is assumed to be symmetrical across word-forms: the probability of a word-form whose true onset is /d/ (e.g., *dut, duke) being activated when the listener perceives a /t/ onset is equal to ε, which is also equal to the probability of a word-form whose true onset is /t/ (e.g., toot, *tuk) being activated when the listener perceives a /d/ onset. The influence of ε is to increase the bimodality of the density function, since BIASES' likelihood is a mixture of Gaussians. Although it is possible for both levels of processing to be impaired (and for BIASES to detect both impairments by implicitly identifying and teasing apart their independent contributions; see results of Experiment 4.1), the simulations presented here only vary one dimension at a time. On the left panels, σN = 0, while on the right ε = 0 (baseline levels).

4.3. Top-Down Effects of Lexical Status on Spoken Word Recognition in Aphasia

Simulation Study 4.1 and Experiment 4.1 were designed to investigate the role of lexical status in spoken word recognition in aphasia. As discussed earlier, the classic finding about top-down lexical effects on phoneme identification is that neurologically healthy listeners show biases in their labeling of a phonetically ambiguous spoken segment based on the subsequent phonetic material when only one of the two competing candidate labels for the segment would represent a word in the listener's language (Ganong, 1980). Although the stimuli and task used in Experiment 4.1 and simulated in Simulation Study 4.1 are described in greater detail later on (see Sections 4.3.2.1.2 - 4.3.2.1.3), the following represents an overview of key aspects of the Methods. Subjects (including both healthy controls and patients with aphasia) heard tokens from a VOT continuum between /d/ and /t/ that were immediately followed by one of two rimes (/-uk/ or /-ut/), and their task was to decide whether the first segment was an exemplar of a /d/ or of a /t/. Critically, when the segment was followed by /-uk/, a /d/-response corresponded to a word-response (because duke is a word, but */tuk/ is not), but when the segment was followed by /-ut/, a /t/-response corresponded to a word-response (because toot is a word, but */dut/ is not). In this stimulus set, the presence of a top-down lexical effect would therefore be realized if, for the same ambiguous VOT token, /t/-responses were more likely in the /-ut/ condition than in the /-uk/ condition (Blumstein et al, 1994; Burton, Baum & Blumstein, 1989). As described earlier, BIASES-A captures this lexical bias by assuming that subjects' prior expectations for the words duke and toot are stronger than their expectations for non-word stimuli like */tuk/ and */dut/.
As discussed in Chapter 3, just how strong the lexical bias appears to be (i.e., how large the top-down lexical effect is) depends on several factors, including the strength of subjects' top-down lexical expectations (captured by the model's prior) and the degree of bottom-up acoustic ambiguity of a stimulus (captured by the model's likelihood). Consequently, to the extent that patients with aphasia suffer from disruptions to either their lexical-level processing or their bottom-up acoustic/phonetic/phonological processing, signatures of these impairments should be present in their behavioral response patterns. BIASES-A allows us to characterize the signatures associated with different functional linguistic deficits. To that end, Simulation Study 4.1 examines the expected consequences of disruptions at three levels of processing, and Experiment 4.1 assesses the extent to which patients actually do exhibit atypical patterns of top-down effects in their behavioral responses to these stimuli. Ultimately, we can leverage the theoretical framework provided by BIASES-A in order to assess the extent to which the responses of patients with BA and patients with W/CA indicate bottom-up processing deficits, lexical-level processing deficits, or both.

4.3.1. Simulation Study 4.1: Lexical Effects in Aphasia

Simulation Study 4.1 examined the independent contributions of lesions at three different processing levels on the expected size of lexical effects. The results of these simulations are summarized in Figure 4.3. In short, lesions to the prior (ϕ, which controls the weighting of lexical/frequency information) predict atypical patterns in behavioral results that are most notable for the exaggerated (when ϕ > 1) or diminished (when ϕ < 1) effect sizes close to the phonetic category boundary (where acoustic information is most ambiguous). On the other hand, as illustrated in Figure 4.2, lesions to the likelihood (ε > 0 or σN > 0) tend to predict atypicalities with respect to which VOT values are judged to be ambiguous, while the maximum effect size remains relatively unchanged. For lexical-phonological processing deficits, patients tend to mix up endpoint tokens (e.g., by mislabeling, or mishearing, clear exemplars of */tuk/ as duke) at higher rates. Note that the size of the expected top-down effects is greater at the /t/ end of the VOT continuum than at the /d/ end of the continuum. This is because the /t/ endpoint token that is a word (toot) is less frequent than the /d/ endpoint token that is a word (duke), i.e., κ_duke > κ_toot, illustrating the interacting roles of the prior and the likelihood during speech integration. Finally, for acoustic-phonetic processing impairments, lesions are expected to induce top-down effects for a wider array of VOTs, just as the addition of noise would cause.
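Under the same illustrative assumptions as the sketch above, effect-size curves of the kind summarized in Figure 4.3 below can be generated by varying one parameter at a time from its baseline; the particular non-baseline values here are placeholders in the same spirit.

```r
# Continuing the illustrative sketch above: effect-size curves analogous to those
# summarized in Figure 4.3, varying one parameter at a time from baseline.
effect_size <- function(...) {
  p_t_response(VOT, kappa_t = 200, kappa_d = 1, ...) -
    p_t_response(VOT, kappa_t = 1, kappa_d = 500, ...)
}
baseline     <- effect_size()                # phi = 1, eps = 0, sigma2_N = 0
weak_prior   <- effect_size(phi = 0.5)       # diminished lexical/frequency weighting
strong_prior <- effect_size(phi = 1.5)       # exaggerated lexical/frequency weighting
lex_phon     <- effect_size(eps = 0.15)      # lexical-phonological impairment
acous_phon   <- effect_size(sigma2_N = 400)  # acoustic-phonetic impairment (sigma_N = 20)
```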
[Figure 4.3 here: lexical effect size (%t in the /*ut/ condition minus %t in the /*uk/ condition, 0–0.5) as a function of VOT (0–60 ms), in three panels corresponding to manipulations of lexical/frequency weighting (ϕ), lexical-phonological processing (ε), and acoustic-phonetic processing (σN); four simulations per panel.]

Figure 4.3. Summary of results of Simulation Study 4.1: Effect of manipulating each parameter on the predicted lexical effect size, as a function of VOT. Each curve represents the difference between the posterior probability functions of the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions. In each panel, only the labeled parameter was manipulated; other baseline parameter values (ϕ = 1; ε = 0; σN = 0) were held constant in order to observe the effects of each parameter independently. Solid curves represent the simulation in each panel for which all baseline assumptions were held constant. Each panel summarizes four simulations (i.e., four levels of the relevant parameter for that panel), whose coloration corresponds to the panel number in which that simulation is further detailed in Figure 4.4, 4.5, or 4.6. Coloration darkens from simulation 1 to 4 because the boundary shift associated with each simulation also increases from simulation 1 to 4 in each panel. This can be seen in Figures 4.4, 4.5, and 4.6, which show the two conditions' posterior probability curves as a function of VOT.

[Figure 4.4 here: expected % /t/-responses as a function of VOT (0–60 ms) for the *dut–toot and duke–*tuk continua, at four levels of lexical weighting (ϕ = 0.1, 0.5, 1, 1.5).]

Figure 4.4. Detailed Results of Simulation Study 4.1: Effect of weighting lexical information (ϕ) on the expected rate of voiceless (/t/) responses, as a function of VOT and rime of the stimulus (/uk/ vs. /ut/), which corresponded to opposing lexical biases for the initial consonant. The panel with solid lines represents the baseline assumptions (ϕ = 1; ε = 0; σN = 0), and each other panel manipulated only the listed parameter value; all others remained at baseline. The vertical grey line denotes the phoneme category boundary in the simulations (the VOT at which, for an unbiased prior, the posterior probabilities of /t/- and /d/-responses are equal).

[Figure 4.5 here: expected % /t/-responses as a function of VOT for the two continua, at four levels of lexical-phonological processing efficacy (ε = 0, 0.01, 0.05, 0.15).]

Figure 4.5. Detailed Results of Simulation Study 4.1: Effect of the efficacy of phonological processing (ε) on the expected rate of voiceless (/t/) responses, as a function of VOT and rime of the stimulus (/uk/ vs. /ut/), which corresponded to opposing lexical biases for the initial consonant. The panel with solid lines represents the baseline assumptions (ϕ = 1; ε = 0; σN = 0), and each other panel manipulated only the listed parameter value; all others remained at baseline. The vertical grey line denotes the phoneme category boundary in the simulations (the VOT at which, for an unbiased prior, the posterior probabilities of /t/- and /d/-responses are equal).

[Figure 4.6 here: expected % /t/-responses as a function of VOT for the two continua, at four levels of acoustic-phonetic processing efficacy (σN = 0, 10, 14, 20).]

Figure 4.6. Detailed Results of Simulation Study 4.1: Effect of the efficacy of acoustic-phonetic processing (σN) on the expected rate of voiceless (/t/) responses, as a function of VOT and rime of the stimulus (/uk/ vs. /ut/), which corresponded to opposing lexical biases for the initial consonant. The panel with solid lines represents the baseline assumptions (ϕ = 1; ε = 0; σN = 0), and each other panel manipulated only the listed parameter value; all others remained at baseline. The vertical grey line denotes the phoneme category boundary in the simulations (the VOT at which, for an unbiased prior, the posterior probabilities of /t/- and /d/-responses are equal).

4.3.2.
Experiment 4.1: Lexical Effects in Aphasia Based on the simulations with BIASES-A presented in Simulation Study 4.1, and based on the logic outlined above in Section 4.2.2 (see Table 4.1 for a summary), it is clear that the Lexical Activation Hypothesis predicts that BAs should exhibit exaggerated lexical effects compared to healthy control subjects and W/CAs should exhibit diminished (or perhaps even undetectable) lexical effects. However, as discussed earlier, both patient groups may also suffer from bottom-up processing deficits. Simulation Study 4.1 further illustrated that bottom-up and lexical-level processing deficits have distinct signatures in the expected behavioral results, suggesting that, by analyzing the fine- grained pattern of behavioral responses from a study examining the lexical effect in patients with aphasia (as compared to healthy listeners) in light of the predictions of 168 BIASES-A, it may be possible to tease apart the impacts of different functional impairments and infer the nature of the underlying deficits in the BAs and W/CAs. To that end, the data from Experiment 4.1, described below, were examined in order to evaluate the extent to which these data support the predictions of the Lexical Activation Hypothesis. As previously mentioned, the data for Experiment 4.1 were originally presented by Blumstein and colleagues (1994). Here, we reanalyze the raw data from that study using the model-based approach. 4.3.2.1. Methods For a detailed description of the participants, stimuli, and procedure of the original study by Blumstein and colleagues (1994), readers should consult that article. However, a summary is provided below. 4.3.2.1.1. Subjects A total of thirty subjects participated in Experiment 4.1, including 10 young control subjects, 8 age-matched control subjects, 6 patients diagnosed with Broca’s aphasia, and 6 patients diagnosed with either Wernicke’s or Conduction aphasia. Ten Brown University students participated, serving as the young control (YC) sample. All reported having normal hearing and being native speakers of English. Eight right-handed males with a mean age of 64.0 years (minimum: 55; maximum: 75; sd: 6.5) participated, serving as the age-matched control (AMC) sample. All reported having normal hearing and being native speakers of English. Six patients with Broca’s aphasia (mean age: 58.3 years; minimum: 44; maximum: 72; sd: 9.9) participated, comprising the BA sample. Four other patients with Broca’s aphasia were originally selected to participate in the study, but were excluded 169 from the original study after a pre-test because they were unable to accurately identify phonetically unambiguous exemplars of the stimuli included in the study. Patients’ clinical diagnoses were determined based on clinical and neurological examinations (including CT scans) and performance on the Boston Diagnostic Aphasia Examination (BDAE) (Goodglass & Kaplan, 1983). Six patients with Wernicke’s or Conduction aphasia (mean age: 65.2 years; minimum: 52; maximum: 78; sd: 11.2) participated, comprising the W/CA sample. Four additional patients with Wernicke’s aphasia were excluded from the original study after failing the same pre-test administered to the patients in the BA sample, and clinical diagnoses were determined according to the same criteria. 4.3.2.1.2. Stimuli The stimuli for Experiment 4.1 were comprised of a total of 14 acoustic tokens from two continua that crossed initial consonant voicing with lexical status. 
In particular, they consisted of 7 tokens from a continuum between a word and a non-word (W-NW continuum; duke–*tuk) and 7 acoustic tokens with the same word-initial voice-onset time (VOT) values, but ending in a different final consonant (NW-W continuum; *dut–toot). The stimuli were a subset of those used in another previously published study (Burton, Baum & Blumstein, 1989). Burton and colleagues (1989) constructed two 12- step VOT continua. A naturally produced token of duke served as the /d/ endpoint of the duke–*tuk continuum. The other 11 steps of the duke–*tuk continuum were constructed by acoustically manipulating this token’s waveform, splicing out successively longer portions of the vowel and inserting equal durations of aspiration from the naturally produced *tuk token. Additionally, the duke token’s burst was replaced with the *tuk 170 token’s burst, and the amplitude of the burst varied over the continuum (see Burton et al, 1989). Finally, the tokens of the *dut–toot continuum was constructed by replacing the final /k/ of the tokens from the duke–*tuk continuum with the final /t/ from a naturally produced token of *dut, thus ensuring that the W-NW continuum and the NW-W continuum did not differ acoustically except in their final consonant. Blumstein and colleagues (1994) selected 7 of the 12 stimuli from each VOT continuum, corresponding to two /d/ endpoint tokens (VOTs = 14.7 and 18.7 ms), two /t/ endpoint tokens (VOTs = 55.7 and 60.2 ms), and three phonetically ambiguous tokens with intermediate VOTs (VOTs = 34.2, 37.3, and 41.7 ms). 4.3.2.1.3. Procedure The seven tokens from each VOT continuum were binaurally presented over headphones to each participant 10 times. Subjects heard stimuli from each continuum in two separate tests separated by a short break (order of presentation of the two continua was counterbalanced across subjects). The 70 trials for a given continuum were randomly ordered and presented in blocks of 10 trials, with sequential blocks separated by a 6- second interval. Trials within a block were separated by a 3-second inter-stimulus interval for young control subjects and a 4-second inter-stimulus interval for all older subjects (age-matched controls and all patients with aphasia). Subjects were instructed to identify the first sound of each stimulus as either “d” or “t” by pressing the appropriately labeled button (counterbalanced across subjects) with their preferred hand as quickly and accurately as possible. Prior to each test, all subjects completed 12 randomly ordered practice trials, including at least one trial with each of the 7 tokens from the continuum being tested and 5 additional trials with randomly selected tokens from that continuum. 171 Prior to the experiment, all patients with aphasia completed a pretest in which they heard each of the /d/ and /t/ endpoint stimuli from each continuum (VOTs = 14.7 ms and 60.2 ms) ten times. Only participants who achieved at least 70% accuracy on each of the endpoint VOTs completed the experiment (six participants in each patient group). 4.3.2.2. Results: Statistical Analyses The results of Experiment 4.1, as originally reported by Blumstein and colleagues (1994) are shown in Figure 4.7. 
Recall that, to the extent that subjects tend to label stimuli with the same word-initial VOT as beginning with a /t/ more often in the *dut–toot continuum than in the duke–*tuk continuum, those results would suggest that subjects are biased towards the word endpoint of each continuum, and such results would, in turn, represent evidence of top-down effects from lexical status on speech recognition.

[Figure 4.7 here: proportion of /t/-responses as a function of VOT for the *dut–toot and duke–*tuk continua, in four panels (Young Controls, Age-Matched Controls, PWA: Broca's, PWA: Wernicke's/Conduction).]

Figure 4.7. Results of Experiment 4.1: for each group, the proportion of /t/-responses as a function of voice-onset time (VOT) for the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions. Error bars represent by-subject standard error. Results represent a reanalysis of raw data from Blumstein et al. (1994). PWA = patients with aphasia.

4.3.2.2.1. Motivation and Interpretation of Logistic Regressions

We reanalyzed the raw data from Experiment 4.1 (that is, the number of /t/-responses by each subject, for each VOT value, in each continuum: duke–*tuk vs. *dut–toot) using mixed effects logistic regression (Baayen, Davidson & Bates, 2008; Jaeger, 2008), implemented using the lme4 package (Bates, Maechler, Bolker & Walker, 2014) in R (R Core Team, 2014). As noted earlier, the coefficients of a logistic regression relate directly to the underlying parameters of a Bayesian model of speech perception in the context of a two-alternative forced choice task (Feldman et al., 2009, Appendix B; see also Kleinschmidt & Jaeger, 2015, Appendix, pp. 200-201). In particular, the theoretical framework defined by BIASES implies the appropriate structure for the logistic regression models and how significance levels should be interpreted.

Consistent with the theoretical framework provided by BIASES, all analyses reported in this section included independent fixed effects for RIME (β2) (/-ut/ vs. /-uk/; or, equivalently, /t/-biased stimulus vs. /d/-biased stimulus) and for VOT (β1) (modeled here as a continuous, linear fixed effect). No RIME × VOT interaction term was included, reflecting the principle that the prior and the likelihood are independent sources of information in the Bayesian framework (cf. Chapter 2). Any significant main effect of RIME suggests an influence of lexical status. A significant main effect of VOT suggests that subjects' likelihood of making a /t/-response depends on the VOT of the stimulus. Thus, these two fixed effects reflect top-down and bottom-up processing, respectively.

Additionally, whenever analyses included subjects from more than one group (e.g., a comparison of Young and Age-Matched Controls or a comparison of all three groups of elderly subjects to one another), a fixed effect of GROUP was included in the model, along with its interactions with both RIME and VOT. Critically, a significant interaction between RIME and GROUP would reflect reliable differences in top-down processing between the two groups being compared, while a significant interaction between VOT and GROUP would reflect reliable differences in bottom-up processing between the two groups. Typically, if two groups differ in their best-fitting intercept coefficient (β0), it would indicate that subjects from the two groups differed reliably from one another in the locus of their category boundary.
However, this result could also be attributable to between-group differences in top-down processing due to the default choice of contrasts used to code factors in these models, a point to which we will return later. Finally, all analyses also included random by-subject intercepts, thereby allowing subjects to vary in their category boundary around some overall group mean (cf. Chapter 3).

Prior to analysis, VOT was centered (mean = 0) and RIME was deviation-coded (contrasts: -0.5/0.5 for /-uk/ and /-ut/, respectively). Deviation coding was also used for the GROUP factor in analyses comparing groups. In comparisons of the two control groups (Young vs. Age-Matched), the older subjects were represented by the positive contrast. In comparisons of the elderly subjects (Age-Matched Controls vs. BAs vs. W/CAs), the GROUP factor was coded using two planned contrasts. These contrasts were selected to be Age-Matched Controls vs. BAs and Age-Matched Controls vs. W/CAs, with Age-Matched Controls being coded as the negative contrast in both cases.

All results are reported in tables that include the best-fitting estimate of each regression coefficient (β), the estimate's standard error (SE), Wald's z statistic for the estimate of that parameter (|z|), and the significance level of the statistic (p). Table 4.2 summarizes the theoretical interpretation of each logistic regression coefficient.

Coefficient   Factor      Related terms in BIASES                     Interpretation of significance
β0            Intercept   −b = −(μ_/t/² − μ_/d/²) / (2σ²)             reflects estimate of category boundary
β1            VOT         g = (μ_/t/ − μ_/d/) / σ²                    reliable bottom-up influence of acoustic-phonetic cues on recognition (likelihood)
β2            RIME        log[p(f_1) / p(f_2)]                        reliable top-down influence of lexical status on recognition (prior)

Table 4.2. Summary of theoretical interpretations of logistic regression coefficients.

4.3.2.2.2. Control Subjects: YCs vs. AMCs

In order to confirm that Experiment 4.1's stimuli elicited a reliable lexical effect for the healthy control subjects, the data from all Young Control subjects (YCs) and all Age-Matched Control subjects (AMCs) were submitted to logistic regression. Unsurprisingly, results (see Table 4.3) revealed contributions of both bottom-up and top-down effects on speech perception. No significant differences between the two control groups emerged, but there was a marginally significant RIME × GROUP interaction, suggesting that the AMCs may be somewhat more influenced by lexical status than YCs.

Coefficient        β                 SE               |z|               p
β0                 -1.511 (-1.610)   0.286 (0.286)    -5.273 (-5.638)   < 0.001 (< 0.001)
β1                 0.449 (0.452)     0.028 (0.029)    15.958 (15.458)   < 0.001 (< 0.001)
β2                 0.762 (0.916)     0.177 (0.183)    4.314 (5.012)     < 0.001 (< 0.001)
β0: YC vs. AMC     -0.807 (-0.599)   0.572 (0.570)    -1.412 (-1.052)   0.158 (0.293)
β1: YC vs. AMC     -0.078 (-0.087)   0.055 (0.057)    -1.418 (-1.512)   0.156 (0.131)
β2: YC vs. AMC     0.627 (0.313)     0.353 (0.365)    1.777 (0.858)     0.076 (0.391)

Table 4.3. Results of the logistic regression analysis of Experiment 4.1 (Blumstein et al., 1994) that included Young and Age-Matched Controls. Shaded boxes indicate statistically significant effects. Statistics in parentheses report the comparable statistic for an identical analysis that excluded one young control subject (see main text).
β0 = intercept (related to phoneme category boundary); β1 = VOT (related to gain/slope of sigmoid); β2 = RIME (related to size of the boundary shift introduced by the lexical/frequency bias); β: best-fitting estimate of each regression coefficient, SE: the estimate’s standard error, |z|: Wald’s z statistic for the estimate of that parameter, p: the significance level of the test statistic. Further examination of the potential source of this effect revealed that one of the 10 YC subjects showed large “anti-lexical effects” on all three ambiguous tokens: across 176 all three tokens, the subject made 79% /t/-responses to the duke–*tuk continuum, but only 45% /t/-responses to the *dut–toot continuum. When this subject’s data are excluded (statistics reported in parentheses in Table 4.3 to facilitate comparison), the overall pattern is the same, but the marginal RIME × GROUP interaction evaporates completely. This suggests that the marginally significant interaction was being driven by a single subject’s atypical behavioral pattern. Although it is impossible to know why this subject was so strongly biased in the opposite direction than predicted, the follow-up analysis suggests that this subject is an outlier. Therefore, in the model-based analyses, this subject’s anomalous data were excluded in order to prevent group-level parameter estimation from being unduly influenced. With no evidence that the YCs and AMCs differed substantially in their overall pattern of responses to these data, each group’s data were analyzed separately. Results were consistent with the conclusion of the first analysis, showing both bottom-up and top-down influences on speech recognition in both YCs (Table 4.4) and AMCs (Table 4.5). This was true whether or not the atypical YC subject was included in the analysis (see parentheses of Table 4.4), although the effect of lexical status was more reliable when those data were excluded. Coefficient β SE |z| p β0 -1.102 (-1.299) 0.352 (0.346) -3.126 (-3.751) 0.002 (< 0.001) β1 0.486 (0.492) 0.041 (0.044) 11.736 (11.086) < 0.001 (< 0.001) β2 0.447 (0.754) 0.223 (0.242) 2.002 (3.119) 0.045 (0.002) Table 4.4. Results of logistic regression analysis of Experiment 4.1 (Blumstein et al, 1994) that included only Young Controls. Statistics in parentheses report the same value for an identical analysis that excluded one young control subject (see main text). β0 = intercept (related to phoneme category boundary); β1 = VOT (related to gain/slope of sigmoid); β2 = RIME (related to size of the boundary shift introduced by the lexical/frequency bias); β: best-fitting estimate of each regression coefficient, SE: the estimate’s standard error, |z|: Wald’s z statistic for the estimate of that parameter, p: the significance level of the test statistic. 177 Coefficient β SE |z| p β0 -1.926 0.464 -4.155 < 0.001 β1 0.412 0.038 10.798 < 0.001 β2 1.083 0.275 3.941 < 0.001 Table 4.5. Results of logistic regression analysis of Experiment 4.1 (Blumstein et al, 1994) that included only Age-Matched Controls. Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary); β1 = VOT (related to gain/slope of sigmoid); β2 = RIME (related to size of the boundary shift introduced by the lexical/frequency bias); β: best-fitting estimate of each regression coefficient, SE: the estimate’s standard error, |z|: Wald’s z statistic for the estimate of that parameter, p: the significance level of the test statistic. 
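For concreteness, the following is a minimal sketch of the type of mixed-effects logistic regression described in Section 4.3.2.2.1. The data frame d and its column names are hypothetical, introduced only for illustration; this is not the dissertation's actual analysis script.

```r
# A minimal sketch of the single-group analysis described above, using lme4.
# Assumed (hypothetical) columns of `d`: n_t = number of /t/-responses,
# n = trials per cell, VOT, RIME (two-level factor: "uk"/"ut"), subject.
library(lme4)

d$VOT_c <- d$VOT - mean(d$VOT)                       # center VOT (mean = 0)
contrasts(d$RIME) <- matrix(c(-0.5, 0.5), ncol = 1)  # deviation coding: /-uk/ vs. /-ut/

m <- glmer(cbind(n_t, n - n_t) ~ VOT_c + RIME + (1 | subject),
           data = d, family = binomial)
summary(m)  # beta_1 (VOT_c): bottom-up effect; beta_2 (RIME): top-down lexical effect

# For multi-group comparisons, a deviation- or planned-contrast-coded GROUP
# factor and its interactions with VOT_c and RIME would be added to the formula.
```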
In general, these effects were consistent with the results reported by Blumstein and colleagues (1994), although, where they found no significant lexical effect in the YCs, the present analysis did find one. Given that the data are identical, this is likely due, at least in part, to our use of a more powerful statistical approach: the analyses of Blumstein and colleagues only examined shifts in the estimated phoneme category boundary as a function of lexical status. However, any minor differences between the present results and those originally reported are of little importance to the theoretical interpretation of the data.

4.3.2.2.3. Elderly Subjects: AMCs vs. BAs vs. W/CAs

A logistic regression examined all of the elderly participants, including the AMCs and patients from both clinically defined groups, BAs and W/CAs. The results of this analysis are shown in Table 4.6. Overall, there was a significant lexical effect on subjects' responses (more /t/-responses in the *dut–toot continuum than in the duke–*tuk continuum), and an overall effect of VOT on subjects' responses (more /t/-responses to stimuli with longer VOTs).

Coefficient          β        SE      |z|      p
β0                   -0.794   0.174   -4.554   < 0.001
β1                   0.190    0.012   15.367   < 0.001
β2                   0.736    0.120   6.133    < 0.001
β0: AMC vs. BA       0.587    0.494   1.189    0.234
β0: AMC vs. W/CA     1.491    0.494   3.021    0.003
β1: AMC vs. BA       -0.209   0.026   -8.101   < 0.001
β1: AMC vs. W/CA     -0.197   0.026   -7.632   < 0.001
β2: AMC vs. BA       0.672    0.313   2.148    0.032
β2: AMC vs. W/CA     -1.245   0.309   -4.031   < 0.001

Table 4.6. Results of the logistic regression analysis of Experiment 4.1 (Blumstein et al., 1994) that included Age-Matched Controls (AMC), patients with Broca's aphasia (BA), and patients with Wernicke's or Conduction aphasia (W/CA). Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary); β1 = VOT (related to gain/slope of sigmoid); β2 = RIME (related to size of the boundary shift introduced by the lexical/frequency bias); β: best-fitting estimate of each regression coefficient; SE: the estimate's standard error; |z|: Wald's z statistic for the estimate of that parameter; p: the significance level of the test statistic.

However, both BAs and W/CAs differed from AMCs with respect to both of these effects. In particular, the influence of VOT on speech recognition was diminished in each of the patient groups when compared to the controls. Weaker effects of VOT correspond to a shallower slope of the sigmoidal categorization curve; for reference, a shallower slope is also the expected effect of adding Gaussian noise to a stimulus (Feldman et al., 2009). Thus, this pattern in the results indicates that BAs and W/CAs in Experiment 4.1 both exhibited bottom-up perceptual processing deficits relative to AMCs.

On the other hand, the patterns of lexical effects displayed by the BAs and W/CAs are quite different. Results indicated that both patient groups differed significantly from AMCs in the extent to which rime (i.e., lexical status) influenced speech perception, but in opposite directions. Whereas BAs were more influenced by lexical status compared to AMCs, W/CAs were less influenced than AMCs. Notably, this is precisely the prediction that emerged in the simulations that aimed to specify the relationship between the Lexical Activation Hypothesis and a probabilistic speech perception model like BIASES. Finally, W/CAs also differed significantly from AMCs in the best-fitting intercept. This result has two possible interpretations.
The simplest interpretation is that W/CAs (but not BAs) have a different overall category boundary compared to AMCs. This would suggest that these patients' underlying phonetic expectations for the VOTs of exemplars of the /t/ and/or /d/ phoneme categories are different from controls. While this is certainly possible, there is little evidence that patients with aphasia exhibit fundamentally different phonetic category structure from healthy controls. Quite to the contrary, evidence suggests that, while overall performance on phoneme discrimination and categorization tasks is very often impaired in patients with aphasia (including W/CAs), the typical signatures of phonetic category structure are preserved (Blumstein, Tartter, Nigro & Statlender, 1984; Blumstein et al., 1977b), even in patients who present with specifically impaired acoustic-phonetic processing (e.g., Caplan & Aydellott Utman, 1994; Gow & Caplan, 1996).

Alternatively, the apparent shift in W/CAs' phonetic category boundary could also be explained as an artifact of the smaller lexical effects. The decision to fit the logistic regressions with assumed contrasts for the RIME factor that were equally far from zero (/-uk/: -0.5; /-ut/: 0.5) implied that the strength of the bias towards /t/ imposed by the *dut–toot continuum and the strength of the bias towards /d/ in the duke–*tuk continuum should be equal. That is, the fit boundary is exactly halfway between the theoretical boundaries for the two VOT continua. However, if the bias created by the smoothed lexical frequency prior towards duke is stronger than the bias towards toot, as predicted by BIASES, then the actual phoneme category boundary should tend to be closer to the implicit category boundary of the *dut–toot continuum. On the other hand, if W/CAs show a smaller effect of lexical status overall, then the best-fitting category boundary should be closer to the midpoint between the implicit category boundaries of the two continua (or, if there is no effect of the prior and lexical status, both continua should have the same implicit category boundary, which should be the phonetic category boundary, χ = (μ_/d/ + μ_/t/) / 2).

To further examine the pattern of lexical effects in the two patient groups, each group's data were analyzed separately. Results confirmed that BAs (Table 4.7) exhibited a robust influence of lexical status. Moreover, bottom-up cues (VOT) also influenced speech recognition, although the raw effect size was weaker than in both control groups. Finally, as suggested by Figure 4.7, W/CAs (Table 4.8) showed no evidence of top-down effects from lexical status. It is especially notable that, although their bottom-up perception of the VOT continua was comparable to BAs, showing a significant effect of VOT with a similar effect size, the primary dimension on which these groups differed was the extent to which lexical-level information influenced speech recognition.

Coefficient   β        SE      |z|      p
β0            -0.493   0.103   -4.768   < 0.001
β1            0.085    0.006   13.257   < 0.001
β2            1.057    0.173   6.129    < 0.001

Table 4.7. Results of the logistic regression analysis of Experiment 4.1 (Blumstein et al., 1994) that included only patients with Broca's aphasia (BAs). Shaded boxes indicate statistically significant effects.
β0 = intercept (related to phoneme category boundary); β1 = VOT (related to gain/slope of sigmoid); β2 = RIME (related to size of the boundary shift introduced by the lexical/frequency bias); β: best-fitting estimate of each regression coefficient; SE: the estimate's standard error; |z|: Wald's z statistic for the estimate of that parameter; p: the significance level of the test statistic.

Coefficient   β        SE      |z|      p
β0            -0.048   0.156   -0.307   0.759
β1            0.091    0.006   13.998   < 0.001
β2            0.112    0.167   0.669    0.503

Table 4.8. Results of the logistic regression analysis of Experiment 4.1 (Blumstein et al., 1994) that included only patients with Wernicke's or Conduction aphasia (W/CAs). Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary); β1 = VOT (related to gain/slope of sigmoid); β2 = RIME (related to size of the boundary shift introduced by the lexical/frequency bias); β: best-fitting estimate of each regression coefficient; SE: the estimate's standard error; |z|: Wald's z statistic for the estimate of that parameter; p: the significance level of the test statistic.

4.3.2.2.4. Summary of Results of Statistical Analyses

Figure 4.8 provides an alternate way of visualizing differences in the size of top-down effects from lexical status for each group over the entire continuum. For each subject's responses to each of the seven VOT tokens, we computed the difference in the proportion of /t/-responses between the /t/-biased continuum (*dut–toot) and the /d/-biased continuum (duke–*tuk), and plotted the mean difference (i.e., effect size) for each group at each VOT. In summary, there are at least five tentative conclusions that find support in the statistical analyses presented above. All are also clearly visible in Figure 4.8:

1. Lexical status influences speech categorization in both YCs and AMCs.
2. Those effects only arise at intermediate VOTs; unambiguous speech tokens are consistently and accurately perceived by healthy adults, even when those speech tokens are non-words (e.g., *dut).
3. The size and distribution of top-down lexical effects are nearly identical between YCs and AMCs.
4. The mean size of the lexical effect in patients with BA is greater than in control subjects, and lexical influences appear over a wider array of the VOT continua.
5. There is no evidence for an influence of lexical status on speech categorization in W/CAs.

[Figure 4.8 here: lexical effect size (%t in the /*ut/ condition minus %t in the /*uk/ condition) as a function of VOT, in four panels (Young Controls, Age-Matched Controls, PWA: Broca's, PWA: Wernicke's/Conduction).]

Figure 4.8. Results of Experiment 4.1: Difference between the proportion of /t/-responses in the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions as a function of voice-onset time (VOT), for each group. Error bars represent by-subject standard error. Results represent a reanalysis of raw data from Blumstein et al. (1994). PWA = patients with aphasia.

4.3.2.3. Results: Model-Based Analyses

The statistical analyses described above provide some evidence for differences in the sizes of top-down effects from lexical status in patients with aphasia compared with healthy controls. Most interestingly, the results suggest that the identification of stimuli by BAs may be more influenced by lexical status than in healthy control subjects, while W/CAs are less influenced by lexical-level information.
According to the simulations summarized earlier in Table 4.1, this is precisely the pattern predicted by the Lexical Activation Hypothesis.

4.3.2.3.1. Motivation of Model-Based Analyses

However, despite these intriguing results, the interpretability of the findings is limited by the analytic techniques employed. Recall that in Chapter 3 it was shown that the size of a boundary shift depends on many factors. Teasing apart competing explanations is not always straightforward. For instance, consider Figures 4.4 and 4.6 in Simulation Study 4.1: in one simulation, manipulating the strength of the bias modulates the size of the boundary shift, while in the other simulation, manipulating the efficacy of acoustic-phonetic processing also modulates the size of the boundary shift. Only one of these parameters is associated with lexical-level processing, so it would be a mistake to conclude from the presence of a larger boundary shift in one group that the difference is attributable to lexical-level processing deficits.

While logistic regression models are much more powerful than comparisons of inferred category boundaries, even these models have other shortcomings. Foremost among these is the relative inflexibility of generalized linear models (e.g., logistic regression). Such models require a number of assumptions that are not necessarily appropriate for the present work. For instance, consider the influence of ε on expected categorization behavior (see Figure 4.5). The implementation of a fitting procedure for data with asymptotes other than 0 and 1 is not straightforward (Wichmann & Hill, 2001), but these asymptotes are a direct prediction of our model if subjects suffer from lexical-phonological processing impairments. Furthermore, as with the model intercept that differed between AMCs and W/CAs above (see Table 4.6), some coefficients in logistic regression analyses can be influenced by both top-down and bottom-up factors. This fact makes it difficult to isolate the differences caused by lexical-level deficits from those caused by bottom-up processing impairments. Moreover, since multiple unique parameters associated with bottom-up processing dynamics (here, acoustic-phonetic and lexical-phonological processing) are lumped together and expected to influence the same coefficients in a regression model, it is virtually impossible to recover the independent influences of different bottom-up information sources.

Fortunately, model-based Bayesian data analysis makes it possible to explicitly evaluate the independent contributions of multiple interacting model parameters to the observed behavioral data. Such an analytic approach is more theoretically informed, more flexible, more powerful and, ultimately, yields more informative results. Rather than attempting to interpret the relationship between model parameters and regression coefficients (as in Table 4.2), this approach directly models the parameters and processes of interest. Another advantage, especially in situations such as the current one, when data are limited, is that the Bayesian data analysis approach allows a researcher to explicitly choose which parameters should be shared or different between groups or individuals.
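Schematically, the core quantity such an analysis evaluates is the likelihood of each subject's observed response counts under the BIASES-A response function, with priors placed over the model's parameters. The sketch below reuses the illustrative p_t_response() function defined earlier; its parameter and column names are placeholders, and it stands in for, rather than reproduces, the dissertation's actual hierarchical, MCMC-based implementation.

```r
# Schematic only: the binomial likelihood of observed /t/-response counts under
# the BIASES-A response function (p_t_response() from the earlier sketch).
biases_log_lik <- function(theta, data) {
  # theta: list with elements eps, sigma2_N, phi (placeholder names)
  # data: one row per subject x continuum x VOT cell, with (assumed) columns
  #       n_t, n, VOT, kappa_t, kappa_d
  p <- p_t_response(data$VOT, kappa_t = data$kappa_t, kappa_d = data$kappa_d,
                    sigma2_N = theta$sigma2_N, eps = theta$eps, phi = theta$phi)
  sum(dbinom(data$n_t, size = data$n, prob = p, log = TRUE))
}
# A sampler can then draw posterior samples of the parameters, sharing some
# across subjects and letting others (e.g., eps, sigma2_N, phi) differ by group,
# which is what yields posterior summaries of the kind reported in Table 4.9.
```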
4.3.2.3.2. Key Results of Model-Based Analyses

Table 4.9 provides a summary of the posterior distributions of the parameters that were fit in the present analysis (i.e., the "best-fitting" model parameters).

Parameter     Mean     SD      95% HDI min   95% HDI max
α             1.25     0.27    0.78          1.82
σ²            175.28   10.74   154.18        196.59
μ_/d/         8.04     0.22    7.63          8.50
BAs: σN       148.93   82.52   13.64         310.09
W/CAs: σN     154.00   67.59   35.69         295.54
BAs: ε        0.14     0.03    0.09          0.19
W/CAs: ε      0.13     0.02    0.09          0.17
BAs: ϕ        1.36     0.30    0.77          1.94
W/CAs: ϕ      -0.12    0.20    -0.51         0.25

Table 4.9. Summary statistics of the posterior distributions from the Bayesian data analysis of Experiment 4.1 (Blumstein et al., 1994). HDI = highest density interval.

First, it is worth noting that the mean VOT of the /d/ onset (μ_/d/) was estimated to be approximately 8 ms, which is very close to the 5 ms value reported in the seminal analysis of VOTs in English by Lisker and Abramson (1964). This suggests that subjects treated the stimuli like real speech and that their responses indicated a category boundary in the typical range.

Of critical interest was the extent to which the behavioral responses of BAs and W/CAs might reflect either bottom-up or top-down processing deficits (or both) compared to healthy adults. As suggested by the statistical analyses reported earlier, the model-based analysis provided strong evidence that both patient groups exhibit bottom-up and top-down impairments, but the model-based analyses offer a more detailed picture of the specific deficits underlying abnormal response patterns.

Considering first those parameters associated with the efficacy of bottom-up processing, the results suggest that BAs and W/CAs both suffer from acoustic-phonetic processing deficits, as well as from lexical-phonological deficits. Notably, for both groups, the severity of these bottom-up processing deficits is quite similar in total magnitude (at the group level).

However, when it comes to lexical processing deficits, W/CAs showed significant deficits compared to AMCs (and YCs). As discussed earlier, their responses indicated a weaker influence of lexical status (i.e., frequency) information on phonetic speech categorization decisions. Meanwhile, BAs' lexical processing deficits were trending in the opposite direction. Among these patients, there was a tendency to weight lexical-level cues more heavily than controls (and much more than W/CAs; see footnote 8), as predicted by the Lexical Activation Hypothesis.

Footnote 8. Subtracting the posterior chains' samples (BA − W/CA) gives a 95% HDI of [0.26, 3.54], confirming the behavioral divergence on this task for these two clinically defined groups.

It is important to note that the simulations in Simulation Study 4.1 focused on illustrating the expected independent impacts of virtual lesions to each processing level of the computational model. The exploratory simulations did suggest the existence of distinct, independent behavioral signatures of bottom-up and lexical-level processing impairments, and the results of the model-based Bayesian data analysis presented here further suggest that the subtle, fine-grained effects of each parameter could be distinguished from one another in the data. However, a more powerful demonstration of this result would be a direct illustration that simulating new behavioral data from a model with the recovered parameter estimates for each group or subject produces patterns of results similar to the original data. This technique is referred to as a posterior predictive check (PPC), and PPCs can also be used to evaluate whether or not a model is sufficient to capture all of the key aspects of the behavioral data. To further evaluate the ability of the model to fit the data, a posterior predictive check (PPC) was performed.
One hundred random samples were selected from the joint posterior distribution of the model, and the parameter values for a given sample were set to the sampled values in the corresponding Markov chain. For each sample, we simulated data from the model, and we ran all of the statistical analyses reported in Section 4.3.2.2 on the simulated data. This yielded 100 samples of each of 6 statistical analyses. For each logistic regression coefficient in each statistical test, we computed the mean coefficient estimate (β) and determined how many of the statistical tests reached significance at the 0.05 level. To the extent that statistical tests on new, generated data yield the same inferences as the same statistical tests on the original data, it would suggest that the model from which the data were generated captures some fundamental aspects of the generative model underlying the psychological processes giving rise to the relevant empirical data.

The results are shown in Table 4.10. In general, the PPCs' coefficient estimates and pattern of significances were consistent with the statistics of the original experimental data. Overall, these results suggest that the posterior accurately captured the critical aspects of and patterns in the original data. Figures 4.9 and 4.10 superimpose the results of the PPC onto the original experimental data shown in Figures 4.7 and 4.8.

Control Subjects (YCs and AMCs)
Coefficient          Expt. 4.1 β   Expt. 4.1 p   PPC mean β   PPC: % sims p < .05
β0                   -1.610        < 0.001       -1.349       100
β1                   0.452         < 0.001       0.380        100
β2                   0.916         < 0.001       0.786        96
β0: YC vs. AMC       -0.599        0.293         0.006        13
β1: YC vs. AMC       -0.087        0.131         0.006        14
β2: YC vs. AMC       0.313         0.391         0.033        15

Elderly Subjects (AMCs, BAs, and W/CAs)
β0                   -0.794        < 0.001       -0.697       100
β1                   0.190         < 0.001       0.185        100
β2                   0.736         < 0.001       0.582        99
β0: AMC vs. BA       0.587         0.234         0.378        48
β0: AMC vs. W/CA     1.491         0.003         0.921        99
β1: AMC vs. BA       -0.209        < 0.001       -0.203       100
β1: AMC vs. W/CA     -0.197        < 0.001       -0.194       100
β2: AMC vs. BA       0.672         0.032         0.905        68
β2: AMC vs. W/CA     -1.245        < 0.001       -1.347       91

YCs only
β0                   -1.299        < 0.001       -1.352       100
β1                   0.492         < 0.001       0.377        100
β2                   0.754         0.002         0.770        84

AMCs only
β0                   -1.926        < 0.001       -1.346       100
β1                   0.412         < 0.001       0.383        100
β2                   1.083         < 0.001       0.803        92

BAs only
β0                   -0.493        < 0.001       -0.508       100
β1                   0.085         < 0.001       0.083        100
β2                   1.057         < 0.001       1.034        98

W/CAs only
β0                   -0.048        0.759         -0.236       45
β1                   0.091         < 0.001       0.087        100
β2                   0.112         0.503         -0.092       11

Table 4.10. Summary of the results of a posterior predictive check (PPC) examining the reliability of the model fit to the data from Experiment 4.1 (Blumstein et al., 1994). For each of the six logistic regression analyses, the table lists the coefficient estimates (β) and significance levels (p) obtained from the original data alongside the mean coefficient estimate across the 100 simulated data sets and the percentage of simulations in which that coefficient reached significance at the 0.05 level.
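The simulation step of such a check can be sketched as follows. This is illustrative only: it reuses p_t_response() and the placeholder counts from the earlier sketches, while the seven VOT tokens and the ten presentations per token come from the Methods.

```r
# Illustrative-only sketch of the PPC simulation step described above.
set.seed(1)
vots <- c(14.7, 18.7, 34.2, 37.3, 41.7, 55.7, 60.2)  # the seven VOT tokens
n_per_token <- 10                                    # presentations per token

simulate_counts <- function(theta) {
  # theta: a named vector (one posterior draw) with sigma2_N, eps, phi
  p_ut <- p_t_response(vots, kappa_t = 200, kappa_d = 1,
                       sigma2_N = theta["sigma2_N"], eps = theta["eps"], phi = theta["phi"])
  p_uk <- p_t_response(vots, kappa_t = 1, kappa_d = 500,
                       sigma2_N = theta["sigma2_N"], eps = theta["eps"], phi = theta["phi"])
  list(ut = rbinom(length(vots), n_per_token, p_ut),  # simulated *dut-toot counts
       uk = rbinom(length(vots), n_per_token, p_uk))  # simulated duke-*tuk counts
}
# For each of 100 draws from the joint posterior, counts like these would be
# simulated for every subject, the Section 4.3.2.2 regressions re-run on the
# simulated data, and the resulting estimates compared with Table 4.10.
```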
[Figure 4.9 here: proportion of /t/-responses as a function of VOT for the two continua, in four panels (Young Controls, Age-Matched Controls, PWA: Broca's, PWA: Wernicke's/Conduction), with model fits superimposed on the observed data.]

Figure 4.9. Results of Experiment 4.1 (data points; cf. Figure 4.7) with superimposed model fits (solid lines). For each group (panel), two curves display the two sigmoidal posterior probability functions of the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions. Young and Age-Matched Controls were fit together. Points indicate the proportion of /t/-responses in the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions for each VOT, for each group. Error bars represent by-subject standard error. PWA = patients with aphasia.

[Figure 4.10 here: lexical effect size (%t in the /*ut/ condition minus %t in the /*uk/ condition) as a function of VOT, in the same four panels, with model fits superimposed on the observed data.]

Figure 4.10. Results of Experiment 4.1 (data points; cf. Figure 4.8) with superimposed model fits (solid lines). For each group (panel), the curve represents the difference between the two sigmoidal posterior probability functions of the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions (cf. Figure 4.8). Young and Age-Matched Controls were fit together. Points indicate the difference between the proportion of /t/-responses in the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions as a function of VOT, for each group. Error bars represent by-subject standard error. PWA = patients with aphasia.

4.3.2.4. General Discussion of Results of Experiment 4.1

Together with the results of the statistical analyses reported in Section 4.3.2.2, the results of the model-based analyses in Section 4.3.2.3 provide evidence for a diminished influence of lexical status on the phoneme categorization decisions of patients with W/CA and, although somewhat less clearly, may also suggest a greater influence of lexical status on the phoneme categorization decisions of patients with BA. These conclusions would be consistent with the predictions of the Lexical Activation Hypothesis. At the same time, the analyses also point towards bottom-up processing impairments (at both the acoustic-phonetic and lexical-phonological levels) in both groups of patients, a finding that is in line with a great deal of work on speech perception in patients with aphasia (Baker, Blumstein & Goodglass, 1981; Basso et al., 1977; Blumstein et al., 1977a, 1977b, 1984; Carpenter & Rutherford, 1973; Jauhiainen & Nuutila, 1977; Leeper, Shewan & Booth, 1986; Metz-Lutz, 1992; Miceli et al., 1978, 1980; Sasanuma et al., 1976; Utman et al., 2001; Yeni-Komshian & Lafontaine, 1983).

At least two other general methodological conclusions also warrant mention. For one, regardless of patient classification, all of the patients have a constellation of deficits ranging from acoustic-phonetic to lexical-phonological to lexical-level processing deficits. Although many standard statistical techniques are more limited in the kinds of data they can model and in the kinds of inferences they allow us to draw, BIASES (and in particular BIASES-A) and hierarchical Bayesian data analysis techniques provide a powerful and principled framework for teasing apart subtle differences in the expected influence of different model parameters on subjects' response patterns.

Secondly, another important conclusion is that ignoring patients' clinical classification (and lumping all patients with aphasia into a single group; e.g., "patients with aphasia", or PWA) would preclude us from observing these divergent patterns. To see this, consider Figures 4.11 and 4.12, which merge the two patient groups into one, as compared to Figures 4.7 and 4.8. It is immediately clear that bottom-up processing deficits are implicated in the broad PWA group, but the opposing lexical-level processing impairments in the two patient groups essentially cancel each other out. Consequently, ignoring clinically relevant classifications could threaten to mask the existence of any lexical-processing deficits at all.
[Figure 4.11 here: proportion of /t/-responses as a function of VOT for the two continua, in three panels (Young Controls, Age-Matched Controls, PWA: All Patients).]

Figure 4.11. Results of Experiment 4.1, merging the clinically defined patient groups (BAs and W/CAs) into one single group (PWA = patients with aphasia). For each group, the proportion of /t/-responses as a function of voice-onset time (VOT) for the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions. Error bars represent by-subject standard error. Results represent a reanalysis of raw data from Blumstein et al. (1994).

[Figure 4.12 here: lexical effect size (%t in the /*ut/ condition minus %t in the /*uk/ condition) as a function of VOT, in the same three panels.]

Figure 4.12. Results of Experiment 4.1, merging the clinically defined patient groups (BAs and W/CAs) into one single group (PWA = patients with aphasia). Difference between the proportion of /t/-responses in the /*ut/ (/t/-biased) and /*uk/ (/d/-biased) conditions as a function of voice-onset time (VOT), for each group. Error bars represent by-subject standard error. Results represent a reanalysis of raw data from Blumstein et al. (1994).

4.4. Top-Down Effects of Sentence Context on Spoken Word Recognition in Aphasia

Broadly, the results of Experiment 4.1 provide evidence that spoken word recognition in patients with aphasia is affected by multiple functional linguistic deficits, including a deficit at the level of lexical processing as well as deficits in bottom-up processing of the speech signal, and that those deficits (and their consequent effects) differ as a function of clinical diagnosis. However, unlike the stimuli in Experiment 4.1, everyday speech rarely features words uttered in isolation, and it is important to note that, in individuals without aphasia, linguistic context has consistently been shown to impact recognition of acoustically ambiguous words. Words that are unintelligible when presented in isolation can often be identified in context (Lieberman, 1963; Pickett & Pollack, 1963; Hunnicutt, 1985; Fowler & Housum, 1987). Furthermore, as discussed at length in Chapters 1-3, stimuli that (when presented in isolation) are perceived as ambiguous between two possible words (e.g., between bay and pay) tend (when presented in sentences) to be perceived as whichever word is more congruent with a preceding context (e.g., as bay after sentences like He hated the... but as pay after sentences like He hated to...) (Fox & Blumstein, in press; see also Borsky et al., 1998; Connine, 1987; Garnes & Bond, 1976; Guediche et al., 2013; Miller et al., 1984; Rohde & Ettlinger, 2012; Tuinman et al., 2014; van Alphen & McQueen, 2001).

Importantly, most work examining lexical access impairments in aphasia has examined the recognition of isolated words (but see, e.g., Friederici, 1983; Baum, 2001). It remains unclear to what extent spoken word recognition processes in brain-injured patients with aphasia have access to the same information sources during auditory language processing that have been shown to influence speech perception in healthy subjects. This question is also of special interest because sentential context might, in fact, reduce apparent lexical processing deficits in such patients by providing top-down support for those lexical candidates whose processing could ordinarily be impaired when perceived in isolation (as in Experiment 4.1).
Thus, Simulation Study 4.2 and Experiment 4.2 were designed to explore the nature of top-down processing of words by patients with aphasia when the words are embedded in sentential contexts. In particular, the goal of Simulation Study 4.2 was similar to that of Simulation Study 4.1, but it further considered the role of sentential context. That is, Simulation Study 4.2 investigates the expected consequences of disruptions at the three levels of processing considered in Simulation Study 4.1 (acoustic-phonetic processing, lexical-phonological processing, and lexical processing), as well as at the level of contextual integration during auditory sentence processing. As we will show, by simulating disruptions at each level of processing in BIASES-A, it is possible to generate fine-grained quantitative predictions about the expected patterns of top-down effects in patients with various linguistic deficits. As with Experiment 4.1, Experiment 4.2 was designed to evaluate the extent to which patients with BA and W/CA actually exhibit atypical patterns of top-down effects in their behavioral responses to stimuli embedded in sentences that support one or the other lexical candidate.

The stimuli and task employed in Experiment 4.2 (detailed in Sections 4.4.2.1.2 – 4.4.2.1.3) resembled the stimuli and task described in Chapters 1-3: subjects (including both healthy controls and patients with aphasia) heard tokens from a VOT continuum between bay and pay after noun-biasing and verb-biasing sentence contexts (e.g., He hated the... vs. He hated to...), and their task was to decide whether the last word of each sentence was bay or pay. Applying the theoretical lens represented by BIASES-A, we submitted these data to a model-based analysis in order to assess the extent to which the responses of patients with BA and patients with W/CA provide evidence for bottom-up processing deficits, lexical-level impairments, deficits affecting the integration of cues from a preceding sentence context, or some combination thereof.

4.4.1. Jointly Modeling Contextual and Lexical Effects on Word Recognition

In order to model potential deficits at both lexical and contextual levels of processing and their independent effects on spoken word recognition, one addition was made to BIASES-A. The only difference in the mathematical formulation of BIASES-A was to allow context, C, to influence subjects' responses. This was implicit in the original formulation of BIASES-A because (as described in Chapter 2; see Equation 2.5) lexical frequency is equal to the total number of times the word appears after any context. However, since Experiment 4.1 did not involve any sentential contexts preceding the target stimulus, there could be no influence of context. In order to model the task examined in Experiment 4.2, though, it was critical to incorporate into BIASES-A both (1) a parameter that can model lexical-level impairments and (2) a parameter that can model contextual integration impairments. To do so, the form of BIASES-A was updated (Equation 4.19):

Equation 4.19
\[ p(f_i \mid C, V, R) = \frac{p(f_i \mid C)\, p(V, R \mid f_i, C)}{\sum_{j} p(f_j \mid C)\, p(V, R \mid f_j, C)} \]

In the updated model, upon perceiving a monosyllabic stimulus and the preceding context (here, limited to the function words to vs. the), Bayes' rule gives the probability of recognizing a candidate word-form, f_i, given the context, C, the initial segment's voice-onset time, V, and the stimulus's rime, R.
As described earlier, we assume that a subject's task is to identify the word-form of a stimulus. Equation 4.19's prior term can be expanded according to Bayes' rule (Equation 4.20).

Equation 4.20
\[ p(f_i \mid C) \propto p(C \mid f_i)\, p(f_i) \]

Put simply, Equation 4.20 states that the prior probability of a candidate word-form following context C is proportional to the product of the lexical frequency of the word-form, p(f_i), and p(C | f_i), a term related to the proportion of times the word f_i follows C compared with any other preceding context. That is, p(C | f_i) will be high if, when f_i occurs in a sentence, it usually occurs after C. For instance, the word Francisco almost always occurs after San, so p(C = San | f_i = Francisco) is high (even though many other cities have names beginning with San). On the other hand, p(C | f_i) might be low in two situations: (1) if, when f_i occurs in sentences, it usually occurs after something other than C (e.g., p(C = Sasha | f_i = Obama) ≪ p(C = Barack | f_i = Obama)), or (2) if f_i occurs in so many contexts that its occurrence is not specific to context C (e.g., p(C | f_i = Smith)). Thus, Equation 4.20's decomposition of p(f_i | C) includes a term associated with contextual integration and a term associated with lexical-level (frequency) information. The values for p(C | f_i) were estimated from the Google n-grams corpus (Michel et al., 2010) and were smoothed as described in Chapters 2-3.

The same basic assumptions about p(V, R | f_i, C), the likelihood function of BIASES-A described earlier, were maintained here, including: that the phonological form of a monosyllabic stimulus is composed of an onset and a rime; that the onset and the rime are conditionally independent cues to the phonological form of the stimulus; that rimes are deterministically related to word-forms; that rimes are consistently perceived accurately; that VOT is the only acoustic cue to the identity of the onset of the stimulus; that the VOTs of acoustic realizations of a given onset follow normal distributions with equal variance for all onsets; and that the distribution of VOTs is conditionally independent of lexical or higher-level information given the identity of the onset. Simplifying Equation 4.19 accordingly and applying the straightforward algebraic manipulations described in Chapter 2 (see Equations 2.6-2.7) yields Equation 4.21 (f_1 = pay; f_2 = bay).

Equation 4.21
\[ p(f_1 \mid C, V, R) = \frac{1}{1 + \dfrac{p(C \mid f_2)\, p(f_2)}{p(C \mid f_1)\, p(f_1)}\; e^{\frac{(V-\mu_{/p/})^2 - (V-\mu_{/b/})^2}{2\sigma^2}}} \]

In order to model the effects of impairments at the acoustic-phonetic, lexical-phonological, and lexical levels of processing, the same three parameters (σ²_N, ε, ϕ) were included as described earlier. A fourth parameter (ω) was also included to model the influence of impairments in the integration of a preceding contextual cue during spoken word recognition, as shown in Equations 4.22–4.24.

Equation 4.22
\[ p(f_1 \mid C, V, R) = \frac{1}{1 + \left(\dfrac{p(C \mid f_2)}{p(C \mid f_1)}\right)^{\omega} \left(\dfrac{p(f_2)}{p(f_1)}\right)^{\phi} e^{\frac{(V-\mu_{/p/})^2 - (V-\mu_{/b/})^2}{2\sigma^2}}} \]

Equation 4.23
\[ p(o_{/p/} \mid f_i) = 1 - p(o_{/b/} \mid f_i) = \begin{cases} 1-\varepsilon & f_i = pay \\ \varepsilon & f_i = bay \end{cases} \]

Equation 4.24
\[ V \mid o_j \sim N(\mu_j,\ \sigma^2 + \sigma_N^2) \]

Finally, in the present task, the rime of the target stimulus was always /ei/, allowing for yet another simplification.
Equation 4.25 summarizes a full model for BIASES-A that allows for independent estimation of parameters associated with contextual integration and lexical-level processing, where p(z_pay | C, V) is the probability of a pay-response given a stimulus with VOT value V after context C.

Equation 4.25
p(z_{pay} \mid C, V) = \left[ 1 + \frac{p(C \mid bay)^{\omega}\, p(bay)^{\phi} \left( (1-\varepsilon)\, e^{-\frac{(V-\mu_{/b/})^2}{2(\sigma^2+\sigma^2_N)}} + \varepsilon\, e^{-\frac{(V-\mu_{/p/})^2}{2(\sigma^2+\sigma^2_N)}} \right)}{p(C \mid pay)^{\omega}\, p(pay)^{\phi} \left( (1-\varepsilon)\, e^{-\frac{(V-\mu_{/p/})^2}{2(\sigma^2+\sigma^2_N)}} + \varepsilon\, e^{-\frac{(V-\mu_{/b/})^2}{2(\sigma^2+\sigma^2_N)}} \right)} \right]^{-1}

4.4.2. Simulation Study 4.2: Sentential Context Effects in Aphasia

Simulation Study 4.2 examined the independent contributions of lesions at four different processing levels on the expected size of top-down sentential context effects. The results of these simulations are summarized in Figure 4.13. First, it is worth noting that, as in Simulation Study 4.1 (see Figure 4.3), lesions to the likelihood function (ε > 0 or σN > 0) can best be characterized as driving changes in the distribution of top-down effects over VOT values, but not in the maximum effect size itself. As was seen with the lexical effect simulations (see Figure 4.6), acoustic-phonetic processing impairments (governed by the parameter σN) can be expected to induce top-down effects for a wider array of VOTs (see Figure 4.17). Meanwhile, lexical-phonological processing deficits (governed by the parameter ε) are associated with greater effect sizes for endpoint tokens of the VOT continua (see Figure 4.16), reflecting the bottom-up "mishearing" of acoustically clear exemplars of bay and pay (cf. Figure 4.5; see also Figure 4.2).

As for lesions affecting the weighting of information at the lexical level (governed by the parameter ϕ), recall that this parameter was responsible for changes in the maximum effect size in the lexical effect simulations (see Figures 4.3 and 4.4). In the present simulations of sentential context effects, that role is played instead by ω, the parameter responsible for the weighting of contextual information during word recognition (see Figure 4.14; compare the leftmost panels of Figure 4.13 and Figure 4.3). This discrepancy is due to the nature of the two tasks being considered and the notion of effect size in the two studies. The two conditions being directly compared in studies of the lexical effect (as in Simulation Study 4.1) differ as a function of lexical information, the influence of which is affected by varying ϕ; however, the two conditions being directly compared in studies of sentential context effects (as in Simulation Study 4.2) differ as a function of how strongly different words are predicted by the preceding cues (i.e., the vs. to), the influence of which is affected by varying ω. When contextual cues are weighted more strongly (when ω > 1), the relative fit of the candidates (bay vs. pay) with the perceived contextual cue will have a greater influence on subjects' behavioral responses, leading to exaggerated top-down context effects, especially when acoustic information is most ambiguous (i.e., close to the phonetic category boundary). The opposite is predicted when contextual cues are weighted less strongly (i.e., when ω < 1): the relative fit of competing candidates with the preceding context will be a less reliable predictor of subjects' responses, which will be reflected in diminished top-down context effects.
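For concreteness, the sketch below implements the form of Equation 4.25 in R and computes the predicted context effect size (%pay_to − %pay_the) across a grid of VOT values, in the spirit of the simulations summarized in Figure 4.13. The category means, variance, and context-conditioned priors used here are illustrative placeholders, not the values actually estimated from the Google n-grams corpus and the stimuli.

# Sketch of Equation 4.25: probability of a pay-response given context C and VOT V.
# All parameter values below are illustrative placeholders, not fitted values.
p_pay_response <- function(V, p_C_bay, p_C_pay,          # p(C | bay), p(C | pay)
                           p_bay = 0.3, p_pay = 0.7,      # illustrative lexical priors
                           mu_b = 0, mu_p = 40, sigma2 = 225,
                           omega = 1, phi = 1, eps = 0, sigma2_N = 0) {
  v <- sigma2 + sigma2_N                       # Equation 4.24: inflated VOT variance
  g <- function(mu) exp(-(V - mu)^2 / (2 * v))
  L_pay <- (1 - eps) * g(mu_p) + eps * g(mu_b) # Equation 4.23: epsilon mixes onsets
  L_bay <- (1 - eps) * g(mu_b) + eps * g(mu_p)
  num_bay <- p_C_bay^omega * p_bay^phi * L_bay
  num_pay <- p_C_pay^omega * p_pay^phi * L_pay
  1 / (1 + num_bay / num_pay)
}

# Context effect size (%pay_to - %pay_the) across the continuum, as in Figure 4.13.
vot <- seq(0, 60, by = 1)
effect_baseline  <- p_pay_response(vot, p_C_bay = 0.1, p_C_pay = 0.9) -   # "to..."
                    p_pay_response(vot, p_C_bay = 0.9, p_C_pay = 0.1)     # "the..."
effect_low_omega <- p_pay_response(vot, 0.1, 0.9, omega = 0.5) -
                    p_pay_response(vot, 0.9, 0.1, omega = 0.5)
c(baseline = max(effect_baseline), low_omega = max(effect_low_omega))
# The maximum effect shrinks when omega < 1, mirroring the leftmost panel of Fig. 4.13.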
Although lesions at the lexical level do not predict the same types of effects in an experiment examining the size of top-down effects from sentential context on subjects' categorization decisions between two words (like the present study) as they do in studies of the lexical effect, they are still predicted to have an effect on word recognition performance. Specifically, increasing and decreasing the weighting of lexical-level cues (governed by the parameter ϕ) tends to shift the locus (on the VOT continuum) of the maximum effect size (see panel 2 of Figure 4.13 and Figure 4.15). The reason for this lies in the relationship between lexical status and lexical frequency. Recall that the current model essentially treats non-words as "very low frequency words" such that every word that appears in a corpus is more frequent than any non-word, while incorporating a mechanism to allow the word recognition system to perceive stimuli that are not found in the corpus from which lexical frequencies are estimated (see Section 4.2.2.2). Because bay is a less frequent word than pay, increasing the weighting of lexical (i.e., frequency) cues leads to a stronger overall bias toward pay responses – independent of the preceding context – while decreasing the weighting of the frequency information will tend to reduce the top-down bias towards pay (see Figure 4.15). The overall effect of this parametric variation of ϕ is that as ϕ increases, the center of the distribution of top-down effects shifts closer to the mean of the VOT distribution for the /b/ onset, and as ϕ decreases, the center of the distribution of top-down effects shifts closer to the (unweighted) category boundary between the /b/ and /p/ onsets' VOT distributions. The stronger the frequency bias (i.e., the higher ϕ becomes), the more susceptible otherwise clear tokens of bay are to top-down biasing effects from the sentence context, because pay is already highly favored as a response; the weaker the frequency bias (i.e., as ϕ approaches 0), the more top-down effects begin to reflect only the fit between the candidates and the preceding context rather than the lexical frequency of the candidates themselves.

[Figure 4.13: four panels (Sentential Context Weighting, ω; Lexical/Frequency Weighting, ϕ; Lexical-Phonological Processing, ε; Acoustic-Phonetic Processing, σN) plotting context effect size (%pay_to − %pay_the) against VOT (ms), with four simulations per panel.]

Figure 4.13. Summary of results of Simulation Study 4.2: Effect of manipulating each parameter on the predicted sentential context effect size, as a function of VOT. Each curve represents the difference between the posterior distributions of the to... (pay-biased) and the... (bay-biased) conditions. In each panel, only the labeled parameter was manipulated; other baseline parameter values (ω = 1; ϕ = 1; ε = 0; σN = 0) were held constant in order to observe the effects of each parameter independently. Solid curves represent the simulation in each panel for which all baseline assumptions were held constant. Each panel summarizes four simulations (i.e., four levels of the relevant parameter for that panel), whose coloration corresponds to the panel number in which that simulation is further detailed in Figures 4.14, 4.15, 4.16 or 4.17. Coloration darkens from simulation 1-4, because the boundary shift associated with that simulation also increased from simulation 1-4 in each panel.
This can be seen in Figures 4.14–4.17, which show the two conditions' posterior probability curves as a function of VOT.

[Figure 4.14: four panels (ω1 = 0.1; ω2 = 0.5; ω3 = 1; ω4 = 1.5) plotting % pay responses against VOT (ms) for the to... and the... conditions.]

Figure 4.14. Detailed Results of Simulation Study 4.2: Effect of weighting of contextual information (ω) on expected rate of voiceless (pay) responses, as a function of VOT and the function word that preceded the stimulus (to... vs. the...), which corresponded to opposing contextual biases on the initial consonant. The panel with solid lines represents the baseline assumptions (ω = 1; ϕ = 1; ε = 0; σN = 0), and each other panel manipulated only the listed parameter value; all others remained at baseline. The vertical grey line denotes the phoneme category boundary in the simulations (the VOT at which, for an unbiased prior, the posterior probabilities of a pay- and a bay-response are equal).

[Figure 4.15: four panels (ϕ1 = 0.1; ϕ2 = 1; ϕ3 = 3; ϕ4 = 5) plotting % pay responses against VOT (ms) for the to... and the... conditions.]

Figure 4.15. Detailed Results of Simulation Study 4.2: Effect of weighting of lexical information (ϕ) on expected rate of voiceless (pay) responses, as a function of VOT and the function word that preceded the stimulus (to... vs. the...), which corresponded to opposing contextual biases on the initial consonant. The panel with solid lines represents the baseline assumptions (ω = 1; ϕ = 1; ε = 0; σN = 0), and each other panel manipulated only the listed parameter value; all others remained at baseline. The vertical grey line denotes the phoneme category boundary in the simulations (the VOT at which, for an unbiased prior, the posterior probabilities of a pay- and a bay-response are equal).

[Figure 4.16: four panels (ε1 = 0; ε2 = 0.01; ε3 = 0.05; ε4 = 0.15) plotting % pay responses against VOT (ms) for the to... and the... conditions.]

Figure 4.16. Detailed Results of Simulation Study 4.2: Effect of efficacy of phonological processing (ε) on expected rate of voiceless (pay) responses, as a function of VOT and the function word that preceded the stimulus (to... vs. the...), which corresponded to opposing contextual biases on the initial consonant. The panel with solid lines represents the baseline assumptions (ω = 1; ϕ = 1; ε = 0; σN = 0), and each other panel manipulated only the listed parameter value; all others remained at baseline. The vertical grey line denotes the phoneme category boundary in the simulations (the VOT at which, for an unbiased prior, the posterior probabilities of a pay- and a bay-response are equal).

[Figure 4.17: four panels (σN1 = 0; σN2 = 10; σN3 = 14; σN4 = 20) plotting % pay responses against VOT (ms) for the to... and the... conditions.]

Figure 4.17. Detailed Results of Simulation Study 4.2: Effect of efficacy of acoustic-phonetic processing (σN) on expected rate of voiceless (pay) responses, as a function of VOT and the function word that preceded the stimulus (to... vs. the...), which corresponded to opposing contextual biases on the initial consonant. The panel with solid lines represents the baseline assumptions (ω = 1; ϕ = 1; ε = 0; σN = 0), and each other panel manipulated only the listed parameter value; all others remained at baseline. The vertical grey line denotes the phoneme category boundary in the simulations (the VOT at which, for an unbiased prior, the posterior probabilities of a pay- and a bay-response are equal).
4.4.3. Experiment 4.2: Sentential Context Effects in Aphasia

Based on the simulations with BIASES-A presented in Simulation Study 4.2, and based on the logic outlined above in Section 4.2.2 (see Table 4.1 for a summary), the predictions of the Lexical Activation Hypothesis for Experiment 4.2 are considerably subtler than in Experiment 4.1. In particular, the hypothesis predicts that BAs should exhibit exaggerated frequency effects compared with healthy control subjects and that W/CAs should exhibit diminished (or perhaps even undetectable) frequency effects. According to Simulation Study 4.2, these effects would be realized as essentially horizontal shifts (along a VOT continuum) of the entire distribution of top-down context effects. However, both patient groups may also suffer from bottom-up processing deficits, as seen in Experiment 4.1.

Moreover, it is not altogether clear whether patients would be expected to differ from healthy controls in the extent to which they weight contextual cues (ω) during spoken word recognition. Baum (2001) showed that patients divided based on fluency (non-fluent vs. fluent) showed top-down effects of semantic sentential context in their identification responses to the first segments of words from two VOT continua (bath–path and dent–tent). While fluency does tend to correlate with clinical diagnosis (BAs tend to be non-fluent while W/CAs tend to be fluent), there were many other types of patients included in Baum's (2001) study that did not fall into the groups in question. Furthermore, the nature of the contextual cue in the present study (the vs. to, which are cues to the grammatical class of the subsequent word) is quite different from the semantic cues examined by Baum (2001). Finally, the analysis techniques employed by Baum (2001) suffer from the same issues discussed in Section 4.3.2.3.1; specifically, the size of boundary shifts can be influenced by both bottom-up and top-down factors, so it remains unclear whether patients might differ from healthy control subjects in their weighting of sentential/contextual cues to the syntactic category of the target word when identifying that target word.

Whether or not patients differ from healthy controls in the size of their top-down sentential context effects, Simulation Study 4.2 illustrated distinct signatures of bottom-up, lexical-level, and sentential processing deficits. Thus, following the same approach as was taken to the analysis of Experiment 4.1, it should be possible to tease apart the impacts of different functional linguistic impairments and infer the nature of the underlying deficits in patients with BA and W/CA. To that end, Experiment 4.2 investigated top-down effects from sentential context in healthy controls and patients with BA and W/CA.

4.4.3.1. Methods

4.4.3.1.1. Subjects

Data analyzed in Experiment 4.2 came from a total of fifty subjects, 14 of whom participated in the present study as described here (8 age-matched control subjects, 3 patients diagnosed with Broca's aphasia, and 3 patients diagnosed with either Wernicke's or Conduction aphasia). The remaining data came from 36 young, healthy control subjects who participated in Experiment 1.1, described in Chapter 1 (Fox & Blumstein, in press). Eight right-handed elderly adults (3 male) with a mean age of 73.3 years (minimum: 66; maximum: 78; sd: 4.6) participated, serving as the age-matched control (AMC) sample. All reported having age-appropriate hearing and being native speakers of English.
Three patients with Broca's aphasia (mean age: 70.3 years; minimum: 63.3; maximum: 78.1; sd: 7.4) participated, comprising the BA sample, and three patients with Wernicke's or Conduction aphasia (mean age: 69.8 years; minimum: 65.1; maximum: 78.5; sd: 7.6) participated, comprising the W/CA sample. As in Experiment 4.1, all patients' clinical diagnoses were determined based on clinical and neurological examinations (including CT scans and, where possible, MRIs) and performance on the Boston Diagnostic Aphasia Examination (BDAE) (Goodglass & Kaplan, 1983). Clinical and lesion information about the patients with aphasia who participated in Experiment 4.2 is summarized in Appendix D.

The young control (YC) sample included data from 36 native monolingual speakers of American English with self-reported normal hearing who participated in Chapter 1's Experiment 1.1 (Fox & Blumstein, in press). As described in Chapter 1, a total of 50 subjects participated in that experiment. One was excluded due to technical difficulties and, of the 49 remaining subjects, 36 perceived some tokens of the bay–pay continuum as ambiguous (defined as making at least 10% /b/-responses and 10% /p/-responses to the two intermediate VOT tokens of the bay–pay continuum).

4.4.3.1.2. Stimuli

The stimuli for Experiment 4.2 comprised 4 acoustic tokens from a voice-onset time continuum between bay and pay, each of which was appended to a set of noun- and verb-biasing sentence contexts (e.g., Valerie hated the... vs. Brett hated to...). Beginning with a naturally produced token of bay, Fox and Blumstein (in press; see also Chapter 1) created a 12-step bay–pay VOT continuum by successively removing pitch periods from the vowel of the bay token and adding aspiration from a naturally produced pay token of equal duration between bay's burst (which was amplified 2x in all tokens) and the onset of glottal pulsing. Fox and Blumstein selected 4 of the 12 tokens for inclusion in their study, corresponding to one bay endpoint token (VOT = 7 ms), one pay endpoint token (VOT = 35 ms), and two phonetically ambiguous tokens with intermediate VOTs (VOTs = 18 and 24 ms).

Minimally paired sentence contexts were selected such that, by changing only the function word that immediately preceded the target token, the contexts would create a bias for a noun (e.g., bay) vs. a verb (e.g., pay). For instance, the verb hated could be followed by either a noun phrase (e.g., the bay) or an infinitive phrase (e.g., to pay), so Valerie hated the... and Brett hated to... served as noun- and verb-biasing contexts, respectively. Experiment 4.2 employed a subset of 10 of Fox and Blumstein's 20 main verbs (e.g., hate, want). That reduced stimulus list can be found in Appendix E. Each of the four tokens was appended to each sentence context ending in to... and the... for a total of 80 trials (4 tokens × 10 main verbs × 2 contexts; see the design sketch below).

Note that, as mentioned earlier, data for the YCs came from Experiment 1.1 (Fox & Blumstein, in press), which included 20 main verbs (instead of 10). The full stimulus list for Experiment 1.1 can be found in Appendix A. Furthermore, although those subjects also responded to stimuli from a similarly constructed buy–pie continuum, the data reanalyzed here from those subjects only included their responses to the bay–pay continuum, since it was this continuum whose VOT tokens were also stimuli in the present experiment, Experiment 4.2.
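The fully crossed trial structure just described can be written out directly; the sketch below uses placeholder verb labels standing in for the ten verbs actually used (see Appendix E).

# Sketch of the fully crossed Experiment 4.2 design: 4 VOT tokens x 10 main verbs
# x 2 function-word contexts = 80 trials. Verb labels here are placeholders for
# the verbs listed in Appendix E.
design <- expand.grid(
  vot_ms  = c(7, 18, 24, 35),
  verb    = paste0("verb", 1:10),
  context = c("the", "to")
)
nrow(design)   # 80
head(design)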
4.4.3.1.3. Procedure

All sentences were presented to participants binaurally over headphones in a random order with a 4-second inter-stimulus interval between trials. AMCs and patients with aphasia completed a minimum of 6 practice trials (some patients received more practice trials in order to adjust the volume to an appropriate level and to ensure they understood the task). Subjects were instructed to identify the last word of each sentence as either bay or pay by pressing the appropriately labeled button with their preferred hand (response mapping counterbalanced between subjects) as quickly and accurately as possible. Participants were warned that some sentences might not make sense, and they were instructed to guess if they did not know. Note that participants also completed other tasks during the same experimental session (sometimes before this task; sometimes after it).

Again, note that, since data for the YCs came from Experiment 1.1 (Fox & Blumstein, in press), the procedure differed slightly from what is described above. Most notably, YCs in Experiment 1.1 were instructed to identify the first sound of the last word in each sentence (either "b" or "p") instead of the last word of each sentence.

4.4.3.1.4. Methodological Differences Between Subject Groups

To briefly summarize the methodological disparities between the data from the YCs and the data from the elderly subjects (AMCs, BAs, and W/CAs), two are most notable: (1) the elderly subjects who participated in Experiment 4.2 heard a subset of the stimuli heard by YCs in Experiment 1.1 (and the duration of Experiment 4.2 was thus shorter), and (2) elderly subjects performed a word identification task while YCs performed a phoneme identification task (at least explicitly; see Fox & Blumstein, in press). These differences warrant caution in drawing any conclusions based on direct comparisons of the younger and older subjects' responses. Importantly, though, the critical contrasts of interest involve comparing results among the age-matched groups of participants in Experiment 4.2 (AMCs, BAs, and W/CAs). The rationale behind the inclusion of data from Experiment 1.1 (discussed in more detail later) was that, within the hierarchical Bayesian data analysis framework, it is possible to leverage assumed commonalities between the cognitive processing underlying the two datasets, while still accounting for key differences between them.

4.4.3.2. Results: Statistical Analyses

The results of Experiment 4.2, including the results of the reanalysis of data originally reported by Fox and Blumstein (in press), are shown in Figure 4.18. Recall that, to the extent that subjects tend to label stimuli with the same word-initial VOT as pay more often in the verb-biasing context (to...) than after the noun-biasing context (the...), those results would represent evidence of top-down effects from sentential context on speech recognition.

All statistical analyses of the present data followed exactly the approach to the logistic regression analyses taken in Experiment 4.1 (see Section 4.3.2.2.1). Because the two conditions being compared in the design of Experiment 4.2 differed as a function of which function word context preceded the target word, all analyses reported here included independent fixed effects for CONTEXT (β2) (the... vs. to...; or, equivalently, noun- vs. verb-biased or bay- vs. pay-biased) and for VOT (β1) (modeled here as a continuous, linear fixed effect).
No CONTEXT × VOT interaction term was included, reflecting the principle that the prior and the likelihood are independent sources of information in the Bayesian framework (cf. Chapter 2). Any significant main effect of CONTEXT suggests an influence of sentential context. A significant main effect of VOT suggests that subjects' likelihood of making a pay-response depends on the VOT of the stimulus. Thus, these two fixed effects reflect top-down and bottom-up processing, respectively. As in the analyses of Experiment 4.1, whenever analyses included subjects from more than one group, a fixed effect of GROUP was included in the model, along with its interactions with both CONTEXT and VOT. A significant interaction between CONTEXT and GROUP would reflect reliable differences in top-down effects of contextual information between the two groups being compared, while a significant interaction between VOT and GROUP would reflect reliable differences in bottom-up processing between the two groups.

The results of Simulation Study 4.2 suggest that the weighting of lexical-level (frequency) information may be associated with the location of the inferred category boundary. Consequently, it would be difficult to determine whether differences between two groups in their best-fitting intercept coefficient (β0) are more likely driven by differences in lexical-level processing or in the locus of their phonetic category boundaries. Moreover, interpretation of any comparisons of elderly subjects with YCs is complicated by the methodological disparities discussed in Section 4.4.3.1.4.

As with the analyses of Experiment 4.1, random by-subject intercepts allowed for subject variability with respect to their category boundaries (cf. Chapter 3). Coding of fixed effects was identical to the analysis of Experiment 4.1, with a deviation-coded CONTEXT factor (contrasts: -0.5/0.5 for the... and to..., respectively) replacing the RIME factor. All results are reported in tables that include the best-fitting estimate of each regression coefficient (β), the estimate's standard error (SE), Wald's z statistic for the estimate of that parameter (|z|), and the significance level of the statistic (p). Note that the theoretical interpretations of the logistic regression coefficients are somewhat different for Experiment 4.2 than for Experiment 4.1 (see Table 4.2): β2 reflects the influence of context, not of lexical status/frequency, and β0 reflects not only the category boundary (a feature of the likelihood function), but also the influence of lexical-level processing (a top-down effect related to the model's prior).

[Figure 4.18: four panels (Young Controls; Age-Matched Controls; PWA: Broca's; PWA: Wernicke's/Conduction) plotting % pay responses against VOT (ms) for the to... and the... conditions.]

Figure 4.18. Results of Experiment 4.2: for each group, the proportion of pay-responses as a function of VOT in the to... (pay-biased) and the... (bay-biased) conditions. Error bars represent by-subject standard error. Results for Young Controls represent reanalysis of raw data from Experiment 1.1 (Fox & Blumstein, in press). PWA = Patients with aphasia.

4.4.3.2.1. Control Subjects: YCs vs. AMCs

First, the data from all young control subjects (YCs) and all age-matched control subjects (AMCs) were submitted to logistic regression.
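In lme4 syntax, the specification described above might look roughly as follows. This is a sketch only: the data frame dat and its column names (response, vot, context, group, subject) are hypothetical stand-ins for the actual analysis objects.

# Sketch of the logistic regression specification described above (lme4 syntax).
# 'dat' and its column names are hypothetical; 'response' is 1 for a pay-response.
library(lme4)

dat$CONTEXT <- ifelse(dat$context == "to", 0.5, -0.5)   # deviation-coded CONTEXT

# Single-group analysis: independent fixed effects for CONTEXT and VOT (no
# CONTEXT x VOT interaction) and random by-subject intercepts.
m_single <- glmer(response ~ CONTEXT + vot + (1 | subject),
                  data = dat, family = binomial)

# When more than one group is analyzed together: add GROUP and its
# interactions with CONTEXT and with VOT.
m_groups <- glmer(response ~ (CONTEXT + vot) * group + (1 | subject),
                  data = dat, family = binomial)
summary(m_groups)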
Unsurprisingly, results (see Table 4.11) revealed significant effects of both bottom-up and top-down information sources on speech perception. The groups did not differ in the strength of their sentential context effects (β2). The analysis also indicated that the two control groups differed in their intercept (β0) and in the size of the effect of VOT (β1), with results suggesting that the responses of AMCs were less influenced by VOT than those of YCs and that their category boundary occurred at a higher VOT value. The former finding could be interpreted as evidence for bottom-up processing deficits (cf. Abada, Baum & Titone, 2008) and the latter could be interpreted as evidence of either a different phonetic category structure in older adults or as evidence for a weaker effect of lexical-level processing. However, as highlighted in Section 4.4.3.1.4, it is difficult to isolate the source of these disparities because several differences between these data are confounded, including at least the following: (1) the age of the participants, (2) differences in the other stimuli subjects heard during the experiment, (3) the duration of the experiment, and (4) the experimental task employed.

Coefficient          β        SE      |z|       p
β0                 -0.318    0.234   -1.359    0.174
β1                  0.260    0.010   26.492   < 0.001
β2                  1.317    0.132    9.995   < 0.001
β0: YC vs. AMC     -1.097    0.468   -2.341    0.019
β1: YC vs. AMC     -0.071    0.019   -3.730   < 0.001
β2: YC vs. AMC      0.377    0.263    1.432    0.152

Table 4.11. Results of logistic regression analysis of Experiment 4.2 that included Young and Age-Matched Controls. Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary and lexical-level processing); β1 = VOT (related to gain/slope of sigmoid); β2 = CONTEXT (related to size of the boundary shift introduced by the contextual bias); β: best-fitting estimate of each regression coefficient, SE: the estimate's standard error, |z|: Wald's z statistic for the estimate of that parameter, p: the significance level of the test statistic. Results for Young Controls represent reanalysis of raw data from Experiment 1.1 (Fox & Blumstein, in press).

Given the between-group differences, follow-up tests were conducted to examine each group's data separately. Results showed that both YCs (Table 4.12) and AMCs (Table 4.13) exhibited strong effects of both bottom-up and top-down influences on speech recognition. Whatever the source of the differences between the YCs and AMCs, the results suggest that the two groups' data should not be fit together in the model-based analyses (Section 4.4.3.3). Based on these results, the model-based analyses fit unique values of μ (the parameter related to the phonetic category boundary, χ) for YCs and AMCs and allowed AMCs to have a greater category variance than YCs.

Coefficient    β        SE      |z|       p
β0            0.230    0.197    1.167    0.243
β1            0.297    0.008   36.418   < 0.001
β2            1.129    0.088   12.850   < 0.001

Table 4.12. Results of logistic regression analysis of Experiment 4.2 that included only Young Controls. Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary and lexical-level processing); β1 = VOT (related to gain/slope of sigmoid); β2 = CONTEXT (related to size of the boundary shift introduced by the contextual bias); β: best-fitting estimate of each regression coefficient, SE: the estimate's standard error, |z|: Wald's z statistic for the estimate of that parameter, p: the significance level of the test statistic.
Results for Young Controls represent reanalysis of raw data from Experiment 1.1 (Fox & Blumstein, in press).

Coefficient    β        SE      |z|       p
β0           -0.865    0.415   -2.085    0.037
β1            0.224    0.018   12.344   < 0.001
β2            1.503    0.249    6.036   < 0.001

Table 4.13. Results of logistic regression analysis of Experiment 4.2 that included only Age-Matched Controls. Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary and lexical-level processing); β1 = VOT (related to gain/slope of sigmoid); β2 = CONTEXT (related to size of the boundary shift introduced by the contextual bias); β: best-fitting estimate of each regression coefficient, SE: the estimate's standard error, |z|: Wald's z statistic for the estimate of that parameter, p: the significance level of the test statistic.

4.4.3.2.2. Elderly Subjects: AMCs vs. BAs vs. W/CAs

A logistic regression examined all of the elderly participants, including the AMCs, BAs and W/CAs. Notably, this between-group comparison does not suffer from the same methodological disparities as the comparison of the YCs and AMCs. The results of this analysis are shown in Table 4.14. Overall, there was a significant top-down contextual biasing effect on subjects' responses (more pay-responses after verb-biasing contexts than noun-biasing contexts), and an overall effect of VOT on subjects' responses (more pay-responses to stimuli with longer VOTs).

Coefficient            β        SE      |z|       p
β0                   -0.531    0.364   -1.457    0.145
β1                    0.154    0.012   13.085   < 0.001
β2                    1.382    0.187    7.371   < 0.001
β0: AMC vs. BA        2.054    1.085    1.893    0.058
β0: AMC vs. W/CA     -1.379    1.103   -1.250    0.211
β1: AMC vs. BA       -0.197    0.029   -6.832   < 0.001
β1: AMC vs. W/CA      0.055    0.038    1.466    0.143
β2: AMC vs. BA       -0.999    0.500   -1.999    0.046
β2: AMC vs. W/CA      0.747    0.604    1.236    0.216

Table 4.14. Results of logistic regression analysis of Experiment 4.2 that included Age-Matched Controls (AMC), patients with Broca's aphasia (BA), and patients with Wernicke's or Conduction aphasia (W/CA). Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary and lexical-level processing); β1 = VOT (related to gain/slope of sigmoid); β2 = CONTEXT (related to size of the boundary shift introduced by the contextual bias); β: best-fitting estimate of each regression coefficient, SE: the estimate's standard error, |z|: Wald's z statistic for the estimate of that parameter, p: the significance level of the test statistic.

BAs differed from AMCs with respect to both of these effects, and there was a marginal difference (p = 0.058) between BAs and AMCs in their intercept. In particular, the influence of VOT on speech recognition was diminished in BAs compared to AMCs, corresponding to a shallower slope of the sigmoidal categorization curve, suggesting bottom-up processing deficits. The effect of sentential context was weaker in BAs than AMCs, suggesting impairments in the integration of sentential cues during word recognition. The marginal difference in the intercept between BAs and AMCs suggested that BAs had an inferred category boundary at a much lower VOT value than AMCs, which could correspond either to a disruption in BAs' internal phonetic category structure or increased weighting of lexical-level information. Notably, the latter interpretation is consistent with the Lexical Activation Hypothesis.
Previous research has suggested that fundamental aspects of phonetic category structure are preserved in aphasia (Blumstein et al, 1984; Blumstein et al, 1977b; Caplan et al, 1994; Gow & Caplan, 1996), even while discrimination, categorization, and acoustic-phonetic processing are impaired, so there is little reason to suspect that phoneme category boundaries differ between AMCs and BAs. No significant differences were found between W/CAs and AMCs. However, the direction of the (non-significant) regression coefficient corresponding to the inferred category boundary was opposite that of the difference between BAs and AMCs. This is the direction predicted by the Lexical Activation Hypothesis.

To further examine the pattern of sentential context effects in the two patient groups, each group's data were analyzed separately. Results confirmed that both BAs (Table 4.15) and W/CAs (Table 4.16) exhibited a robust influence of sentential context in their responses. The VOT of the stimuli also influenced speech recognition in both BAs and W/CAs, although the raw effect size for BAs was much weaker than in both of the control groups (cf. Tables 4.12 and 4.13).

Coefficient    β        SE      |z|       p
β0            0.486    0.205    2.370    0.018
β1            0.054    0.014    3.759   < 0.001
β2            0.863    0.283    3.046    0.002

Table 4.15. Results of logistic regression analysis of Experiment 4.2 that included only patients with Broca's aphasia (BAs). Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary and lexical-level processing); β1 = VOT (related to gain/slope of sigmoid); β2 = CONTEXT (related to size of the boundary shift introduced by the contextual bias); β: best-fitting estimate of each regression coefficient, SE: the estimate's standard error, |z|: Wald's z statistic for the estimate of that parameter, p: the significance level of the test statistic.

Coefficient    β        SE      |z|       p
β0           -1.292    1.111   -1.163    0.245
β1            0.189    0.028    6.874   < 0.001
β2            1.835    0.427    4.296   < 0.001

Table 4.16. Results of logistic regression analysis of Experiment 4.2 that included only patients with Wernicke's or Conduction aphasia (W/CAs). Shaded boxes indicate statistically significant effects. β0 = intercept (related to phoneme category boundary and lexical-level processing); β1 = VOT (related to gain/slope of sigmoid); β2 = CONTEXT (related to size of the boundary shift introduced by the contextual bias); β: best-fitting estimate of each regression coefficient, SE: the estimate's standard error, |z|: Wald's z statistic for the estimate of that parameter, p: the significance level of the test statistic.

4.4.3.2.3. Summary of Results of Statistical Analyses

Figure 4.19 provides an alternate way of visualizing differences in the size of top-down effects from sentential context for each group over the entire continuum. For each subject's responses to each of the four VOT tokens, we computed the difference in the proportion of pay-responses between the verb-biased condition (to...) and the noun-biased condition (the...), and plotted the mean difference (i.e., effect size) for each group at each VOT. In summary, there are at least five tentative conclusions that find support in the statistical analyses presented above.

1. Sentential context influences speech categorization in all groups, including both patient groups.

2. For the healthy controls and for the W/CAs, those effects tend to arise most strongly at intermediate VOTs; unambiguous speech tokens are less likely to be susceptible to contextual biases.
3. The behavior of YCs and AMCs appears to reflect differences in the bottom-up processing of the tokens from the VOT continua; this may be due to methodological differences in the way those two datasets were collected.

4. There is fairly robust evidence for differences between BAs and AMCs in both top-down and bottom-up speech processing, but it is difficult to draw strong conclusions from the present analyses.

5. The present analyses do not provide clear evidence for differences between the behavioral response patterns of AMCs and W/CAs.

[Figure 4.19: four panels (Young Controls; Age-Matched Controls; PWA: Broca's; PWA: Wernicke's/Conduction) plotting context effect size (%pay_to − %pay_the) against VOT (ms).]

Figure 4.19. Results of Experiment 4.2: Difference between proportion of pay-responses in the to... (pay-biased) and the... (bay-biased) conditions as a function of voice-onset time (VOT), for each group. Error bars represent by-subject standard error. Results for Young Controls represent reanalysis of raw data from Experiment 1.1 (Fox & Blumstein, in press). PWA = Patients with aphasia.

4.4.3.3. Results: Model-Based Analyses

4.4.3.3.1. Motivation of Model-Based Analyses

The statistical analyses proved difficult to interpret for several reasons. Firstly, the data for the YCs came from a different experiment than the data for the AMCs, BAs and W/CAs. Secondly, there was limited data, with only three patients in each clinical group. Thirdly, simulations with BIASES-A in Simulation Study 4.2 suggested that the intercept parameter of the logistic regression can be influenced by both bottom-up and top-down processing components, so differences between groups were ambiguous. Moreover, the logistic regression analyses in Experiment 4.2 suffer from the same shortcomings described earlier (see Section 4.3.2.3.1). Many of these problems are addressed by the model-based Bayesian data analysis approach described in Section 4.3.2.3.1. Its ability to distinguish between the subtle influences of many parameters while avoiding typical assumptions of many standard statistical tests (e.g., frequentist logistic regression), and its ability to make the most out of limited data by employing theoretically informed hierarchical modeling, are especially advantageous for the present analyses.

4.4.3.3.2. Key Results of Model-Based Analyses

Table 4.17 provides a summary of the posterior distributions of the parameters that were fit in the present analysis (i.e., the "best-fitting" model parameters).

Parameter        Mean      SD       95% HDI min   95% HDI max
α                 0.82     0.09        0.65           1.01
σ²              221.81     5.99      210.60         233.07
YCs: μ           -5.95     0.18       -6.29          -5.60
Elderly: μ       -0.95     0.65       -2.13           0.46
AMCs: σ²N        85.41    24.48       39.85         135.76
BAs: σ²N        698.09   357.85       99.27        1467.97
W/CAs: σ²N      201.58    74.12       64.17         345.73
BAs: ε            0.13     0.09       9.00e-05        0.31
W/CAs: ε          0.03     0.03       5.36e-06        0.09
BAs: ϕ            2.03     0.47        1.15           3.01
W/CAs: ϕ         -0.09     0.50       -0.99           0.97
BAs: ω            0.91     0.32        0.35           1.56
W/CAs: ω          1.29     0.37        0.54           2.02

Table 4.17. Summary statistics of posterior distributions of the Bayesian data analysis of Experiment 4.2. HDI = highest density interval.

The most theoretically important results regard the posterior distributions of the BAs and W/CAs and the extent to which the present model-based analysis could confidently infer differences in the posterior estimates between the patient groups and the AMCs. Three key results emerged. First, BAs were substantially more impaired in their bottom-up acoustic-phonetic processing of speech tokens (σ²N)
compared to AMCs. Interestingly, no such difference between AMCs and W/CAs was found. The other two key results regard each patient group's weighting of lexical-level (frequency) information. Recall that optimal weighting of frequency information, which is assumed for AMCs, is given by ϕ = 1. To the extent that ϕ can be confidently assessed to be greater than 1 for some group, it suggests that those subjects are overweighting frequency information, which is the prediction the Lexical Activation Hypothesis makes for BAs (see Table 4.1). To the extent that ϕ can be confidently assessed to be less than 1 for some group, it suggests that those subjects are underweighting frequency information, which is the prediction the Lexical Activation Hypothesis makes for W/CAs (see Table 4.1). According to Kruschke (2011), a parameter can be confidently assessed to be different from some value if the 95% HDI of that parameter's posterior distribution excludes that value. The second key finding was that patients with BA reliably overweight frequency information, and the third key finding was that patients with W/CA reliably underweight frequency information. Both results are exactly what is predicted by the Lexical Activation Hypothesis.

This result can be seen visually in Figures 4.18 and 4.19 as the apparent shift of the entire distribution of top-down effects to the left for the BAs (to be centered over lower VOT values, which correspond to bay, the less frequent candidate word) or to the right for W/CAs (to be centered closer to the overall phonetic category boundary). However, among the many other differences between the behavioral patterns, it is virtually impossible to confidently assess the status of each effect's reliability without a theoretically and analytically powerful technique like the one presented here. The power of this analysis technique is that it can separate out all of the other differences between the distributions of responses and isolate the influence of each parameter on subjects' behavior.

As in the analyses of the results from Experiment 4.1, we performed a posterior predictive check (PPC) in order to evaluate the ability of the fit model to accurately capture the key aspects of the behavioral data. The PPC proceeded exactly as described in Experiment 4.1 (see Section 4.3.2.3.2): 100 random samples were selected from the joint posterior distribution of the model, and parameter values for a given sample were set to the sampled value in each corresponding Markov chain. For each sample we simulated data from the model, and we ran all of the statistical analyses reported in Section 4.4.3.2 on the simulated data. This yielded 100 samples of each of 6 statistical analyses. For each logistic regression coefficient in each statistical test, we computed the mean coefficient estimate (β) and we determined how many of the statistical tests reached significance at the 0.05 level. To the extent that statistical tests on new, generated data give similar inferences as the same statistical tests on the original data, it would suggest that the model from which the data were generated captures some fundamental aspects of the generative model underlying the psychological processes giving rise to the relevant empirical data. The results are shown in Table 4.18. Figures 4.20 and 4.21 superimpose the results of the PPC onto the original experimental data shown in Figures 4.18 and 4.19.
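One way the PPC loop just described could be organized is sketched below. The posterior_samples data frame and the helper functions simulate_responses() and fit_regressions() are hypothetical stand-ins for the actual sampling, simulation, and analysis code, which is not reproduced in the text.

# Sketch of the posterior predictive check described above. The helper functions
# and the 'posterior_samples' data frame are hypothetical stand-ins for the
# model's actual sampling and simulation code.
set.seed(1)
n_ppc <- 100
draws <- posterior_samples[sample(nrow(posterior_samples), n_ppc), ]

ppc_fits <- lapply(seq_len(n_ppc), function(i) {
  theta    <- draws[i, ]                 # one joint sample of all parameters
  sim_data <- simulate_responses(theta)  # generate a synthetic data set from the model
  fit_regressions(sim_data)              # re-run the six logistic regression analyses
})

# For each coefficient, summarize the mean estimate and the proportion of
# simulations in which it reached significance at the 0.05 level.
coefs  <- do.call(rbind, lapply(ppc_fits, function(f) f$beta))
mean_b <- colMeans(coefs)
p_sig  <- colMeans(do.call(rbind, lapply(ppc_fits, function(f) f$p < 0.05)))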
The PPC's coefficient estimates and pattern of significances were somewhat consistent with the statistics of the original experimental data, but future work should examine possible shortcomings. The inconsistencies may be related to the fact that the HDIs of some parameters were quite large (see Table 4.17), which is indicative of a dataset with inconsistent or too little data. It is also possible that the inconsistencies are due to incorrect assumptions in the model. These are important questions for future work.

                                       Results: Experiment 4.2    Results: PPC
Group              Coefficient            β           p           mean β    % sims p < .05
Control Subjects   β0                   -0.318       0.174         -0.24        94
                   β1                    0.260     < 0.001          0.21       100
                   β2                    1.317     < 0.001          0.97       100
                   β0: YC vs. AMC       -1.097       0.019         -0.87       100
                   β1: YC vs. AMC       -0.071     < 0.001         -0.07        85
                   β2: YC vs. AMC        0.377       0.152          0.07         7
Elderly Subjects   β0                   -0.531       0.145         -0.34        92
                   β1                    0.154     < 0.001          0.11       100
                   β2                    1.382     < 0.001          1.01       100
                   β0: AMC vs. BA        2.054       0.058          1.64       100
                   β0: AMC vs. W/CA     -1.379       0.211         -0.98        92
                   β1: AMC vs. BA       -0.197     < 0.001         -0.14       100
                   β1: AMC vs. W/CA      0.055       0.143          0.01        12
                   β2: AMC vs. BA       -0.999       0.046         -0.41        28
                   β2: AMC vs. W/CA      0.747       0.216          0.42        15
YCs                β0                    0.230       0.243          0.20        98
                   β1                    0.297     < 0.001          0.25       100
                   β2                    1.129     < 0.001          0.93       100
AMCs               β0                   -0.865       0.037         -0.67       100
                   β1                    0.224     < 0.001          0.18       100
                   β2                    1.503     < 0.001          1.01        97
BAs                β0                    0.486       0.018          0.48        80
                   β1                    0.054     < 0.001          0.04        64
                   β2                    0.863       0.002          0.81        60
W/CAs              β0                   -1.292       0.245         -0.83        96
                   β1                    0.189     < 0.001          0.12       100
                   β2                    1.835     < 0.001          1.23        85

Table 4.18. Summary of the results of a Posterior Predictive Check (PPC) examining the reliability of the model fit to data from Experiment 4.2. For each logistic regression analysis of the sentential context effect, the table shows the coefficient estimates (β) and significance levels (p) from the original data alongside the mean coefficient estimate and the percentage of simulations reaching significance at the 0.05 level across the 100 PPC simulations.

[Figure 4.20: four panels (Young Controls; Age-Matched Controls; PWA: Broca's; PWA: Wernicke's/Conduction) plotting % pay responses against VOT (ms) for the to... and the... conditions, with model fits overlaid on the actual data.]

Figure 4.20. Results of Experiment 4.2 (data points; cf. Figure 4.18) with superimposed model fits (solid lines). For each group (panel), two curves display the two sigmoidal posterior probability functions of the to... (pay-biased) and the... (bay-biased) conditions. Points indicate proportion of pay-responses in the to... (pay-biased) and the... (bay-biased) conditions for each VOT, for each group. Error bars represent by-subject standard error. PWA = Patients with aphasia.

[Figure 4.21: four panels (Young Controls; Age-Matched Controls; PWA: Broca's; PWA: Wernicke's/Conduction) plotting context effect size (%pay_to − %pay_the) against VOT (ms), with model fits overlaid on the actual data.]

Figure 4.21. Results of Experiment 4.2 (data points; cf. Figure 4.19) with superimposed model fits (solid lines). For each group (panel), the curve represents the difference between proportion of pay-responses in the to... (pay-biased) and the... (bay-biased) conditions as a function of voice-onset time (VOT). Points indicate the difference in proportion of pay-responses in the to... (pay-biased) and the... (bay-biased) conditions for each VOT, for each group. Error bars represent by-subject standard error. PWA = Patients with aphasia.
4.4.3.4. General Discussion of Results of Experiments 4.1 and 4.2

Together with the results of Experiment 4.1, the present work provides evidence for a diminished influence of lexical status and frequency information on word recognition in patients with W/CA, and for a greater influence of lexical status and frequency information on word recognition in patients with BA. These conclusions are consistent with the original predictions of the Lexical Activation Hypothesis. At the same time, the analyses may also point towards bottom-up processing impairments (especially at the acoustic-phonetic level) in patients. However, the present results suggest that lexical processing deficits are not likely to be accounted for as downstream effects of bottom-up processing deficits alone.

Methodologically, the present results have also illustrated the relative power of hierarchical Bayesian data analysis techniques over traditional methods. Having developed a computational model, BIASES, which served as a theoretical lens through which to view the issue of lexical processing deficits in aphasia, it was possible to avoid many of the inadequate and inappropriate assumptions of more traditional statistical analyses. Ultimately, we were able to draw novel, rich, and principled conclusions from previously published data (Blumstein et al, 1994) and from another dataset with several significant limitations (e.g., it was collected in two separate experiments with different methods and it featured a relatively small number of patients and trials). This represents a promising direction for future work interested in teasing apart subtle differences in the expected influences of different model parameters on subjects' response patterns.

Finally, as discussed earlier, aphasia is a heterogeneous disorder of language. Ignoring this fact – for instance, by analyzing data without respect to patients' symptomatology, clinical diagnosis, or underlying neurological etiology/lesion site – is not theoretically motivated and is likely to miss important differences between patients, both relative to each other and relative to healthy control subjects.

Conclusion

During auditory language comprehension, bottom-up acoustic cues in the sensory signal are critical to listeners' ability to recognize spoken words, but listeners are also sensitive to higher-level information; in general, identification of ambiguous targets is biased by prior expectations (e.g., words over non-words, contextually consistent words over inconsistent words). The focus of the present work has been to better characterize how such top-down cues are integrated with bottom-up cues. In particular, the goal was to improve our understanding of the computational principles underlying top-down effects on speech perception, especially those top-down effects which arise from a word's sentential context.

Chapter 1 considered a longstanding debate: do top-down effects result from interactive modulation of perceptual processing or from entirely autonomous, decision-level processing? Although some past work suggested that the time course of top-down effects was incompatible with interactive models, Experiments 1.1 and 1.2 illustrated that, with appropriate controls, the predictions of interactive models were supported. Ultimately, though, two major weaknesses of existing spoken word recognition models (whether interactive or autonomous) are that they ignore the role of sentential context and that they ignore the enormous variability in the size of top-down effects.
To address these gaps, Chapter 2 introduced BIASES (short for Bayesian Integration of Acoustic and Sentential Evidence in Speech), a newly developed computational model of speech perception. Chapter 3 demonstrated BIASES' ability to predict and explain fine-grained variability and asymmetries in previously published work, as well as in novel experimental data from Experiment 3.1. The results of Chapter 3 indicate that many of the hallmarks of a Bayesian cue integration model are present in listeners' behavior during spoken word recognition tasks. Finally, Chapter 4 employed BIASES to examine top-down processing in patients with aphasia. Experiment 4.1 reanalyzed previously published data (Blumstein et al, 1994) regarding top-down effects of lexical status on speech perception in patients with aphasia. Experiment 4.2 examined new data regarding top-down effects of sentential context in patients with aphasia. Model-based analysis of these data suggested that patients with aphasia experience both bottom-up processing deficits and lexical-level processing deficits, and that the lexical processing deficits are consistent with the predictions of the Lexical Activation Hypothesis (Blumstein & Milberg, 2000; McNellis & Blumstein, 2001). Importantly, those impairments differ as a function of patients' clinical diagnoses.

The BIASES model has the potential to guide future experimental research and to help advance both psycholinguistic and neurolinguistic theory. This work offers new insights into the computations occurring at the interface between the perceptual processing of speech and the cognitive and linguistic processing of language.

Appendix A: Context Sentences for Experiments 1 & 2

Noun-biased (bay/pie)            Verb-biased (buy/pay)
Tom liked the...                 Dennis liked to...
Jill preferred the...            Stephanie preferred to...
Valerie hated the...             Brett hated to...
Theresa chose the...             Bethany chose to...
Ronald remembered the...         Christopher remembered to...
Austin forgot the...             Rob forgot to...
Lillian neglected the...         Eliza neglected to...
Justin wanted the...             Joe wanted to...
Tina loved the...                Nathan loved to...
Noah prepared the...             Dustin prepared to...
Jasmine demanded the...          Tyler demanded to...
Josh declined the...             Grant declined to...
Celia offered the...             Kristen offered to...
Mark meant the...                Kate meant to...
Sue needed the...                Megan needed to...
Eileen expected the...           Dorothy expected to...
Katherine requested the...       Lance requested to...
Tony knew the...                 Bob knew to...
Tracy promised the...            Carl promised to...
Abigail thought the...           Jacqueline thought to...

Appendix B: Filler Target Words for Experiment 2

build, put, beat, pick, break, print, blame, play, block, plan, brief, press, back, pack, bet, pet, bear, pair, bull, pull

Appendix C: Supplementary Materials

Complete Results and Discussion of Experiment 1

A.1. Details of Analysis Procedures

Because subjects' responses were categorical (/p/ vs. /b/), the data were analyzed using mixed effects logistic regression (Baayen, Davidson & Bates, 2008; Jaeger, 2008), implemented using the lme4 package (Bates, Maechler, Bolker & Walker, 2014) in R (R Core Team, 2014). Factorial main effects (CONTEXT, CONTINUUM, BIAS, and SPEED) were deviation-coded (contrasts: 0.5, -0.5; positive contrasts corresponded to noun-biased, buy–pie, /p/-congruent, and fast trials). VOT was a centered, continuous fixed effect.
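A sketch of this coding scheme and of the resulting lme4 model specification is given below. The data frame dat1 and its column names are hypothetical stand-ins for the actual analysis objects, and the random-effects structure shown corresponds to the maximal, correlation-free structure described in the next paragraph (illustrated here for the omnibus Analysis 1a predictors).

# Sketch of the Experiment 1 analysis approach (lme4 syntax). 'dat1' and its
# columns are hypothetical. Factors are deviation coded as numeric -0.5/0.5
# predictors, so the double-bar syntax drops random correlations while keeping
# the random slopes and their interactions.
library(lme4)

dat1$CONTEXT   <- ifelse(dat1$context   == "noun",    0.5, -0.5)
dat1$CONTINUUM <- ifelse(dat1$continuum == "buy-pie", 0.5, -0.5)
dat1$cVOT      <- dat1$vot - mean(dat1$vot)            # centered, continuous VOT

m_exp1 <- glmer(
  response ~ CONTEXT * CONTINUUM * cVOT +
    (1 + CONTEXT * CONTINUUM * cVOT || subject) +
    (1 + CONTEXT * CONTINUUM * cVOT || item),
  data = dat1, family = binomial
)
summary(m_exp1)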
Since the design was fully within-subjects and within-items (an item corresponded to a main verb; e.g., hated), the maximal random effects structure (Barr, Levy, Scheepers & Tily, 2013) for this design included all random intercepts, slopes and interactions for every subject and item. In order to achieve convergence while minimizing the risk of inferential bias (Barr et al., 2013), random correlations were excluded.

A.2. Supplementary Results/Discussion

Besides the two critical findings discussed in the main text (the CONTEXT × CONTINUUM and BIAS × SPEED interactions), our results provided evidence that several other factors influenced subjects' responses. Most are attributable to phonetic factors in our stimuli and to well-established observations about how context effects interact with phonetic factors in speech perception.

A.2.1. Analysis 1a (omnibus): CONTEXT × CONTINUUM × VOT

In addition to the crucial CONTEXT × CONTINUUM interaction in Experiment 1, there was a main effect of VOT (β = 0.42, SE = 0.04, |z| = 11.70, p < 0.001) such that tokens with longer VOTs were more often labeled as beginning with /p/, as expected given that VOT is the primary cue distinguishing the /b/ and /p/ categories in English (Liberman, Harris, Kinney & Lane, 1961).

A significant VOT × CONTEXT interaction (β = 0.10, SE = 0.03, |z| = 3.58, p < 0.001) replicates previous work showing that the size of a top-down bias depends on the acoustic ambiguity of the stimuli (Burton, Baum & Blumstein, 1989; Ganong, 1980; McQueen, 1991; Pitt & Samuel, 1993; Tuinman et al, 2014; van Alphen & McQueen, 2001). As Figure 1.1 suggests, the closer a token's mean rate of /p/-responses was to the phoneme category boundary (the VOT at which one would expect to see 50% /b/-responses and 50% /p/-responses), the larger the difference between subjects' /p/-response rates at the two levels of CONTEXT appears to be.

VOT interacted with CONTINUUM (β = 0.18, SE = 0.03, |z| = 6.30, p < 0.001), suggesting a somewhat stronger influence of VOT in the buy–pie continuum than in the bay–pay continuum. Although the exact source of this asymmetry is not immediately obvious, one should not necessarily expect the effect of VOT to pattern identically in the bay–pay and buy–pie continua, because VOT is only one of many cues to the identity of phonetically ambiguous (between /b/ and /p/) stimuli. Burst amplitude (Repp, 1984), subsequent vowel duration (Miller & Dexter, 1988; Summerfield, 1981), vowel identity (Klatt, 1975; Stevens & Klatt, 1974), and the lexical frequency of continuum endpoints (Fox, 1984) can all influence voicing decisions about phonetically ambiguous stimuli, and although it is not clear which of these (if any) contributed to this asymmetry in Experiment 1, it is unclear how any of these factors could account for the theoretically important CONTEXT × CONTINUUM interaction.

Finally, there was a significant main effect of CONTEXT (β = -0.36, SE = 0.12, |z| = 2.89, p < 0.004), such that subjects were more likely to make /p/-responses after verb-biasing sentences than after noun-biasing sentences; this reflected the fact that the simple effect of CONTEXT was stronger in the bay–pay continuum (β = -1.37; /p/-responses to ambiguous tokens: 44.4% in noun-biased contexts vs. 65.5% in verb-biased contexts) than in the buy–pie continuum (β = 0.95; /p/-responses to ambiguous tokens: 58.4% in noun-biased contexts vs. 38.5% in verb-biased contexts).
Further research would be necessary to identify the specific source of this asymmetry, but one possibility is that the syntactic manipulation was more efficacious in the bay–pay continuum because the specific items in the experiment (e.g., Brett hated to...) created stronger preferences when judging between bay and pay than between buy and pie. Importantly, though, no matter the cause of this or any of the other ancillary effects discussed here, the prediction that CONTEXT would have robust, contrasting effects in the two continua was borne out by the data.

A.2.2. Analysis 1b (follow-up tests): CONTEXT × VOT

In addition to the reported simple effects of CONTEXT in the by-continuum follow-up tests, both analyses, as in the omnibus analysis, revealed simple effects of VOT (bay–pay: β = 0.33, SE = 0.03, |z| = 11.58, p < 0.001; buy–pie: β = 0.50, SE = 0.04, |z| = 13.74, p < 0.001) in the expected direction. Finally, there was a significant interaction between CONTEXT and VOT in the bay–pay continuum (β = 0.12, SE = 0.03, |z| = 3.48, p < 0.001) and a marginal interaction in the buy–pie continuum (β = 0.07, SE = 0.04, |z| = 1.80, p = 0.07), suggesting that ambiguous tokens were differentially impacted by CONTEXT in both continua (see Appendix C for discussion).

A.2.3. Analysis 2: BIAS × SPEED × VOT

In addition to the BIAS × SPEED interaction, the results revealed a main effect of VOT (β = 0.36, SE = 0.03, |z| = 12.47, p < 0.001) and a main effect of BIAS (β = 0.99, SE = 0.14, |z| = 7.21, p < 0.001), such that /p/-responses were more likely when targets had longer VOTs and in trials for which the CONTEXT and CONTINUUM jointly made /p/ the congruent response. There was also a significant VOT × SPEED interaction (β = 0.09, SE = 0.03, |z| = 3.12, p < 0.002), suggesting that slower responses to a token were less influenced by that token's VOT.

Complete Results and Discussion of Experiment 2

B.1. Details of Analysis Procedures

Analyses followed the same approach as Experiment 1's. When occasional convergence failures occurred, the random effects structure was simplified by removing random slopes for factors involving the VOT factor. In all cases, this simplification allowed for convergence, and the pattern of results (i.e., which fixed effects reached significance) was identical to the results of the unconverged models with all of the random effects.

B.2. Supplementary Results/Discussion

Besides the two critical findings of Experiment 2 discussed in the main text (a CONTEXT × CONTINUUM interaction, but no BIAS × SPEED interaction), there was evidence that several other effects influenced subjects' responses, including effects replicating most patterns seen in Experiment 1 (see above). However, aside from differing in the presence of a BIAS × SPEED interaction, Experiments 1 and 2 differed in a few other ways. In particular, the critical tokens were, on the whole, less often identified as /p/ in Experiment 2 than in Experiment 1, even for the /p/-endpoint tokens (pay-endpoint: 96.0% vs. 77.0% /p/-responses in Experiment 1 vs. 2; pie-endpoint: 86.8% vs. 76.5%). Such a pattern is consistent with earlier studies showing that the distributional statistics of acoustic-phonetic cues (e.g., VOTs) within an experimental context can produce range effects in the perception of phonetic category structure (Clayards, Tanenhaus, Aslin & Jacobs, 2008).
An analysis of the VOTs of the ten naturally produced /p/-initial filler targets showed that these fillers had a mean VOT of 90 ms (with the shortest VOT being 71 ms), in contrast to VOTs of 35 and 34 ms for the two critical /p/-endpoint stimuli. Thus, it appears that the longer VOTs of the filler targets affected the perception of voicing in the critical target stimuli such that the boundary between the /p/ and /b/ stimuli was skewed toward fewer /p/- and more /b/-responses, consistent with range effects (Brady & Darwin, 1978).

B.2.1. Analysis 1a (omnibus): CONTEXT × CONTINUUM × VOT

In addition to the crucial CONTEXT × CONTINUUM interaction in Experiment 2, main effects of CONTEXT (β = -0.67, SE = 0.20, |z| = 3.28, p < 0.002) and VOT (β = 0.22, SE = 0.05, |z| = 4.91, p < 0.001) emerged, both matching patterns observed in Experiment 1.

B.2.2. Analysis 1b (follow-up tests): CONTEXT × VOT

Follow-up tests in each continuum revealed significant simple effects of CONTEXT (see main text), as well as simple effects of VOT (bay–pay: β = 0.21, SE = 0.04, |z| = 5.75, p < 0.001; buy–pie: β = 0.21, SE = 0.05, |z| = 4.65, p < 0.001).

B.2.3. Analysis 2: BIAS × SPEED × VOT

Although there was no evidence for a BIAS × SPEED interaction, the results showed a main effect of BIAS (β = 1.01, SE = 0.21, |z| = 4.73, p < 0.001), which corresponds to the CONTEXT × CONTINUUM interaction in the primary analysis, and a main effect of VOT (β = 0.18, SE = 0.03, |z| = 5.40, p < 0.001).

Table A.1. All parameters from model-based analyses of Experiments 1 and 2. [Table not reproduced here.]

Appendix D: Patient Characteristics for Experiment 4.2

[Patient characteristics table: for each of the six aphasic patients (Subjects 1-3: Broca's aphasia; Subject 4: Wernicke's aphasia evolved to conduction aphasia; Subject 5: conduction aphasia; Subject 6: Wernicke's aphasia), the table reports diagnosis, sex, handedness, time post onset, age at test, education (years), etiology, BDAE subtest scores (Auditory Comprehension; Fluency; Articulation; Repetition), and a lesion analysis giving the percent damage to regions of interest (e.g., Wernicke's area, supramarginal gyrus, inferior frontal gyrus), determined by analysis or by eye.]

Appendix E: Sentence Contexts for Experiment 4.2

Noun-biased (bay)            Verb-biased (pay)
Theresa chose the...         Bethany chose to...
Jasmine demanded the...      Tyler demanded to...
Valerie hated the...         Brett hated to...
Tom liked the...             Dennis liked to...
Sue needed the...            Megan needed to...
Celia offered the...         Kristen offered to...
Jill preferred the...        Stephanie preferred to...
Ronald remembered the...     Christopher remembered to...
Katherine requested the...   Lance requested to...
Justin wanted the...         Joe wanted to...
References

Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38, 419-439.

Altmann, G. T., & Kamide, Y. (1999). Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73(3), 247-264.

Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3), 463.

Andruski, J. E., Blumstein, S. E., & Burton, M. (1994). The effect of subphonetic differences on lexical access. Cognition, 52(3), 163-187.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390-412.

Baker, E., Blumstein, S. E., & Goodglass, H. (1981). Interaction between phonological and semantic factors in auditory comprehension. Neuropsychologia, 19(1), 1-15.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255-278.

Basso, A., Casati, G., & Vignolo, L. A. (1977). Phonemic identification defect in aphasia. Cortex, 13(1), 85-95.

Bates, D., Maechler, M., Bolker, B., & Walker, S. (2014). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-7.

Belin, P., Zatorre, R. J., Hoge, R., Evans, A. C., & Pike, B. (1999). Event-related fMRI of the auditory cortex. NeuroImage, 10(4), 417-429.

Bicknell, K., Jaeger, T. F., & Tanenhaus, M. K. (2015, in press). Now or...later: Perceptual data is not immediately forgotten during language processing. Behavioral and Brain Sciences, 38.

Bicknell, K., Tanenhaus, M. K., & Jaeger, T. F. (2015). Listeners can maintain and rationally update uncertainty about prior words. Manuscript submitted for publication.

Blumstein, S. E., & Milberg, W. P. (2000). Language deficits in Broca's and Wernicke's aphasia: A singular impairment. Language and the brain: Representation and processing, 167-184.

Blumstein, S. E. (2007). Word recognition in aphasia. In G. Gaskell (Ed.), Oxford Handbook of Psycholinguistics.

Blumstein, S. E., Baker, E., & Goodglass, H. (1977). Phonological factors in auditory comprehension in aphasia. Neuropsychologia, 15(1), 19-30.

Blumstein, S. E., Burton, M., Baum, S., Waldstein, R., & Katz, D. (1994). The role of lexical status on phonetic categorization in aphasia. Brain and Language, 46, 181-197.

Blumstein, S. E., Myers, E. B., & Rissman, J. (2005). The perception of voice onset time: An fMRI investigation of phonetic category structure. Journal of Cognitive Neuroscience, 17(9), 1353-1366.

Brady, S. A. & Darwin, C. J. (1978). Range effects in the perception of voicing. Journal of the Acoustical Society of America, 63(5), 1556-1558.

Burton, M. W., Baum, S., & Blumstein, S. E. (1989). Lexical effects on the phonetic categorization of speech: The role of acoustic structure. Journal of Experimental Psychology: Human Perception and Performance, 15, 567-575.

Burton, M. W., & Blumstein, S. E. (1995). Lexical effects on phonetic categorization: The role of stimulus naturalness and stimulus quality.
Journal of Experimental Psychology: Human Perception and Performance, 21(5), 1230-1235.

Burton, M. W., Small, S. L., & Blumstein, S. E. (2000). The role of segmentation in phonological processing: An fMRI investigation. Journal of Cognitive Neuroscience, 12(4), 679-690.

Caplan, D., & Utman, J. A. (1994). Selective acoustic phonetic impairment and lexical access in an aphasic patient. The Journal of the Acoustical Society of America, 95(1), 512-517.

Carpenter, R. L., & Rutherford, D. R. (1973). Acoustic cue discrimination in adult aphasia. Journal of Speech, Language, and Hearing Research, 16(3), 534-544.

Chater, N., & Oaksford, M. (1999). The probability heuristics model of syllogistic reasoning. Cognitive Psychology, 38(2), 191-258.

Chater, N., Tenenbaum, J. B., & Yuille, A. (2006). Probabilistic models of cognition: Conceptual foundations. Trends in Cognitive Sciences, 10(7), 287-291.

Clayards, M., Tanenhaus, M. K., Aslin, R. N., & Jacobs, R. A. (2008). Perception of speech reflects optimal use of probabilistic speech cues. Cognition, 108(3), 804-809.

Coltheart, M., Rastle, K., Perry, C., Langdon, R., & Ziegler, J. (2001). DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108(1), 204-256.

Connine, C. M. (1987). Constraints on interactive processes in auditory word recognition: The role of sentence context. Journal of Memory and Language, 26(5), 527-538.

Connine, C. M. (1990). Recognition of spoken words with ambiguous word initial phonemes. Bulletin of the Psychonomic Society, 28, 497.

Connine, C. M., Blasko, D. G., & Hall, M. (1991). Effects of subsequent sentence context in auditory word recognition: Temporal and linguistic constraints. Journal of Memory and Language, 30(2), 234-250.

Connine, C. M., Blasko, D. G., & Titone, D. (1993). Do the beginnings of spoken words have a special status in auditory word recognition? Journal of Memory and Language, 32(2), 193-210.

Connine, C. M., Blasko, D. G., & Wang, J. (1994). Vertical similarity in spoken word recognition: Multiple lexical activation, individual differences, and the role of sentence context. Perception & Psychophysics, 56(6), 624-636.

Connine, C. M. & Clifton, C. C., Jr. (1987). Interactive use of lexical information in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 13, 291-299.

Connine, C. M., Titone, D., & Wang, J. (1993). Auditory word recognition: Extrinsic and intrinsic effects of word frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(1), 81.

Cree, G. S., McRae, K., & McNorgan, C. (1999). An attractor model of lexical conceptual processing: Simulating semantic priming. Cognitive Science, 23(3), 371-414.

Csépe, V., Osman-Sági, J., Molnár, M., & Gósy, M. (2001). Impaired speech perception in aphasic patients: Event-related potential and neuropsychological assessment. Neuropsychologia, 39(11), 1194-1208.

Cutler, A., Mehler, J., Norris, D., & Segui, J. (1987). Phoneme identification and the lexicon. Cognitive Psychology, 19(2), 141-177.

Cutler, A. & Norris, D. (1979). Monitoring sentence comprehension. In W. E. Cooper & E. C. T. Walker (Eds.), Sentence processing: Psycholinguistic studies presented to Merrill Garrett. Erlbaum.

Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004). Patterns of English phoneme confusions by native and non-native listeners. The Journal of the Acoustical Society of America, 116(6), 3668-3678.

Dagan, I., Marcus, S., & Markovitch, S. (1993, June).
Contextual word similarity and estimation from sparse data. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (pp. 164-171). Association for Computational Linguistics.

Dahan, D., Magnuson, J. S., & Tanenhaus, M. K. (2001). Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology, 42(4), 317-367.

Dahan, D., Magnuson, J. S., Tanenhaus, M. K., & Hogan, E. M. (2001). Subcategorical mismatches and the time course of lexical access: Evidence for lexical competition. Language and Cognitive Processes, 16(5-6), 507-534.

De Boer, B., & Kuhl, P. K. (2003). Investigating the role of infant-directed speech with a computer model. Acoustics Research Letters Online, 4(4), 129-134.

De Deyne, S., & Storms, G. (2008). Word associations: Network and semantic properties. Behavior Research Methods, 40(1), 213-231.

Dell, G. S., Oppenheim, G. M., & Kittredge, A. K. (2008). Saying the right word at the right time: Syntagmatic and paradigmatic interference in sentence production. Language and Cognitive Processes, 23(4), 583-608.

Diehl, R. L., & Kluender, K. R. (1989). On the objects of speech perception. Ecological Psychology, 1(2), 121-144.

Do, Y. A. (2011). Interaction of the top-most and the bottom-most: Pragmatic bias and phonetic perception. Presented at the 37th Annual Meeting of the Berkeley Linguistics Society, Berkeley, CA, Feb. 12-13.

Docherty, G. J., Watt, D., Llamas, C., Hall, D., & Nycz, J. (2011). Variation in voice onset time along the Scottish-English border. In Proceedings of the 17th International Congress of Phonetic Sciences (pp. 591-594).

Dumais, S. T. (2004). Latent semantic analysis. Annual Review of Information Science and Technology, 38(1), 188-230.

Eggert, G. H. (1977). Wernicke's works on aphasia: A sourcebook and review. Mouton de Gruyter.

Elman, J. L. & McClelland, J. L. (1988). Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27, 143-165.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

Feldman, N. H., Griffiths, T. L., & Morgan, J. L. (2009). The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. Psychological Review, 116(4), 752-782.

Feldman, N. H., Griffiths, T. L., Goldwater, S., & Morgan, J. L. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4), 751.

Fellbaum, C. (1998). WordNet. Blackwell Publishing Ltd.

Fine, A. B., & Jaeger, T. F. (2013). Evidence for implicit learning in syntactic comprehension. Cognitive Science, 37(3), 578-591.

Fine, A. B., Jaeger, T. F., Farmer, T. A., & Qian, T. (2013). Rapid expectation adaptation during syntactic comprehension.

Foss, D. J., & Swinney, D. A. (1973). On the psychological reality of the phoneme: Perception, identification, and consciousness. Journal of Verbal Learning and Verbal Behavior, 12(3), 246-257.

Fowler, C. A., & Housum, J. (1987). Talkers' signaling of "new" and "old" words in speech and listeners' perception and use of the distinction. Journal of Memory and Language, 26(5), 489-504.

Fowler, C. A., & Rosenblum, L. D. (1991). The perception of phonetic gestures. Modularity and the motor theory of speech perception, 33-59.

Fox, R. A. (1984). Effect of lexical status on phonetic categorization. Journal of Experimental Psychology: Human Perception and Performance, 10(4), 526-540.
Gahl, S. (2008). Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech. Language, 84(3), 474-496.

Ganong, W. F., III (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6(1), 110-125.

Garnes, S. & Bond, Z. S. (1976). The relationship between semantic expectation and acoustic information. Phonologica, 3, 285-293.

Garvey, C. & Caramazza, A. (1974). Implicit causality in verbs. Linguistic Inquiry, 5, 459-464.

Gaskell, M. G., & Marslen-Wilson, W. D. (1999). Ambiguity, competition, and blending in spoken word recognition. Cognitive Science, 23(4), 439-462.

Gaskell, M. G., & Marslen-Wilson, W. D. (2002). Representation and competition in the perception of spoken words. Cognitive Psychology, 45(2), 220-266.

Geisler, W. S., & Kersten, D. (2002). Illusions, perception and Bayes. Nature Neuroscience, 5(6), 508-510.

Goodglass, H. (1993). Understanding aphasia: Foundations of neuropsychology. San Diego, CA: Academic Press.

Goodglass, H., Gleason, J. B., & Hyde, M. R. (1970). Some dimensions of auditory language comprehension in aphasia. Journal of Speech, Language, and Hearing Research, 13(3), 595-606.

Gow, D. W., Jr., & Caplan, D. (1996). An examination of impaired acoustic-phonetic processing in aphasia. Brain and Language, 52(2), 386-407.

Grossberg, S. (1980). How does the brain build a cognitive code? Psychological Review, 87, 1-51.

Grossberg, S. (2003). Resonant neural dynamics of speech perception. Journal of Phonetics, 31(3), 423-445.

Grossberg, S. & Myers, C. W. (2000). The resonant dynamics of speech perception: Interword integration and duration-dependent backward effects. Psychological Review, 107, 735-767.

Guediche, S., Reilly, M., & Blumstein, S. E. (2014). Facilitating perception of speech in babble through conceptual relationships. The Journal of the Acoustical Society of America, 135(4), 2257-2258.

Guediche, S., Salvata, C., & Blumstein, S. E. (2013). Temporal cortex reflects effects of sentence context on phonetic processing. Journal of Cognitive Neuroscience, 25(5), 706-718.

Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. Proceedings of the North American Association for Computational Linguistics, 159-166.

Horton, W. S., & Keysar, B. (1996). When do speakers take into account common ground? Cognition, 59(1), 91-117.

Horton, W. S. (2007). The influence of partner-specific memory associations on language production: Evidence from picture naming. Language and Cognitive Processes, 22(7), 1114-1139.

Hunnicutt, S. (1985). Intelligibility versus redundancy-conditions of dependency. Language and Speech, 28(1), 47-56.

Isenberg, D., Walker, E. C. T., & Ryder, J. (1980). A top-down effect on the identification of function words. Paper presented at the annual meeting of the Acoustical Society of America, Los Angeles, CA.

Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. Journal of Memory and Language, 59(4), 434-446.

Janse, E. (2006). Lexical competition effects in aphasia: Deactivation of lexical candidates in spoken word processing. Brain and Language, 97(1), 1-11.

Jauhiainen, T., & Nuutila, A. (1977). Auditory perception of speech and speech sounds in recent and recovered cases of aphasia. Brain and Language, 4(4), 572-579.

Jelinek, F., & Mercer, R. (1985). Probability distribution estimation from sparse data.
IBM Technical Disclosure Bulletin, 28, 2591-2594.

Jelinek, F. (1990). Self-organized language modeling for speech recognition. Readings in Speech Recognition, 450-506.

Jelinek, F. (1997). Statistical methods for speech recognition. MIT Press.

Kamide, Y., Altmann, G. T., & Haywood, S. L. (2003). The time-course of prediction in incremental sentence processing: Evidence from anticipatory eye movements. Journal of Memory and Language, 49(1), 133-156.

Khaitan, P., & McClelland, J. L. (2010). Matching exact posterior probabilities in the Multinomial Interactive Activation Model. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Meeting of the Cognitive Science Society (p. 623). Austin, TX: Cognitive Science Society.

Kim, D., Stephens, J. D. W., & Pitt, M. A. (2012). How does context play a part in splitting words apart? Production and perception of word boundaries in casual speech. Journal of Memory and Language, 66, 509-529.

Klatt, D. H. (1975). Voice onset time, frication, and aspiration in word-initial consonant clusters. Journal of Speech, Language, and Hearing Research, 18(4), 686-706.

Klatt, D. H. (1979). Speech perception: A model of acoustic-phonetic analysis and lexical access. Journal of Phonetics, 7(312), 1-26.

Kleinschmidt, D. F., & Jaeger, T. F. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148.

Knill, D. C., Kersten, D., & Yuille, A. (1996). A Bayesian formulation of visual perception. In D. C. Knill & W. Richards (Eds.), Perception as Bayesian Inference. Cambridge University Press.

Koornneef, A. W. & van Berkum, J. J. A. (2006). On the use of verb-based implicit causality in sentence comprehension: Evidence from self-paced reading and eye tracking. Journal of Memory and Language, 54, 445-465.

Kronrod, Y., Coppess, E., & Feldman, N. H. (2012). A unified model of categorical effects in consonant and vowel perception. In Proceedings of the 34th Annual Conference of the Cognitive Science Society (pp. 629-634).

Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.

Leeper, H. A., Shewan, C. M., & Booth, J. C. (1986). Altered acoustic cue discrimination in Broca's and conduction aphasics. Journal of Communication Disorders, 19(2), 83-103.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431.

Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358-368.

Liberman, A. M., Harris, K. S., Kinney, J. A., & Lane, H. (1961). The discrimination of relative onset-time of the components of certain speech and nonspeech patterns. Journal of Experimental Psychology, 61(5), 379.

Lieberman, P. (1963). Some effects of semantic and grammatical context on the production and perception of speech. Language and Speech, 6(3), 172-187.

Lisker, L. & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20, 384-422.

Lisker, L. (1986). "Voicing" in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and Speech, 29, 3-11.

Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model.
Ear and Hearing, 19(1), 1.

Luce, R. D. (1959). Individual choice behavior. Oxford, England: John Wiley.

Luce, P. A., Goldinger, S. D., Auer, E. T., & Vitevitch, M. S. (2000). Phonetic priming, neighborhood activation, and PARSYN. Perception & Psychophysics, 62(3), 615-625.

Magnuson, J. S., Dixon, J. A., Tanenhaus, M. K., & Aslin, R. N. (2007). The dynamics of lexical competition during spoken word recognition. Cognitive Science, 31(1), 133-156.

Magnuson, J. S., Mirman, D., & Harris, H. D. (2012). Computational models of spoken word recognition. The Cambridge Handbook of Psycholinguistics, 76-103.

Magnuson, J. S., Mirman, D., & Myers, E. (2013). Spoken word recognition. Oxford Handbook of Cognitive Psychology, 412-441.

Magnuson, J. S., Tanenhaus, M. K., Aslin, R. N., & Dahan, D. (2003). The time course of spoken word learning and recognition: Studies with artificial lexicons. Journal of Experimental Psychology: General, 132(2), 202.

Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: W. H. Freeman & Co.

Marslen-Wilson, W., & Tyler, L. K. (1980). The temporal structure of spoken language understanding. Cognition, 8(1), 1-71.

Marslen-Wilson, W. D. & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10, 29-63.

Marslen-Wilson, W. (1973). Linguistic structure and speech shadowing at very short latencies. Nature.

Marslen-Wilson, W. D. (1975). Sentence perception as an interactive parallel process. Science, 189(4198), 226-228.

Marslen-Wilson, W. D. (1987). Functional parallelism in spoken word recognition. Cognition, 25, 71-102.

Martin, A. E., Monahan, P. J., & Samuel, A. (2012). Vowel identification shaped by phrasal gender agreement expectation. Poster at the 25th Annual CUNY Conference on Human Sentence Processing, New York, NY.

Massaro, D. W. & Oden, G. C. (1980). Evaluation and integration of acoustic features in speech perception. The Journal of the Acoustical Society of America, 67, 996-1013.

Massaro, D. W. (1987). Categorical partition: A fuzzy-logical model of categorization behavior.

Massaro, D. W. (1989). Testing between the TRACE model and the fuzzy logical model of speech perception. Cognitive Psychology, 21(3), 398-421.

Mattys, S. L., Melhorn, J. F., & White, L. (2007). Effects of syntactic expectations on speech segmentation. Journal of Experimental Psychology: Human Perception and Performance, 33, 960-977.

McClelland, J. L. & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.

McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88(5), 375.

McClelland, J. L., Rumelhart, D. E., & PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 2.

McClelland, J. L. (1987). The case for interactionism in language processing. In M. Coltheart (Ed.), Attention & performance XII: The psychology of reading (pp. 1-36). London, UK: Erlbaum.

McClelland, J. L. (1991). Stochastic interactive processes and the effect of context on perception. Cognitive Psychology, 23(1), 1-44.

McClelland, J. L. (2009). The place of modeling in cognitive science.
Topics in Cognitive Science, 1(1), 11-38.

McClelland, J. L., Mirman, D., & Holt, L. L. (2006). Are there interactive processes in speech perception? Trends in Cognitive Sciences, 10(8), 363-369.

McClelland, J. L., Mirman, D., Bolger, D. J., & Khaitan, P. (2014). Interactive activation and mutual constraint satisfaction in perception and cognition. Cognitive Science, 38(6), 1139-1189.

McDonald, J. L. & MacWhinney, B. (1995). The time course of anaphora resolution: Effects of implicit verb causality and gender. Journal of Memory and Language, 34, 543-566.

McDonald, S. A., & Shillcock, R. C. (2003). Eye movements reveal the on-line computation of lexical probabilities during reading. Psychological Science, 14(6), 648-652.

McDonald, S. A., & Shillcock, R. C. (2003). Low-level predictive inference in reading: The influence of transitional probabilities on eye movements. Vision Research, 43(16), 1735-1751.

McGurk, H. & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746-748.

McMurray, B., Aslin, R. N., & Toscano, J. C. (2009). Statistical learning of phonetic categories: Insights from a computational approach. Developmental Science, 12(3), 369-378.

McMurray, B., Clayards, M. A., Tanenhaus, M. K., & Aslin, R. N. (2008). Tracking the time course of phonetic cue integration during spoken word recognition. Psychonomic Bulletin & Review, 15(6), 1064-1071.

McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2009). Within-category VOT affects recovery from "lexical" garden-paths: Evidence against phoneme-level inhibition. Journal of Memory and Language, 60(1), 65-91.

McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within-category phonetic variation on lexical access. Cognition, 86, B32-B42.

McMurray, B., Tanenhaus, M. K., Aslin, R. N., & Spivey, M. J. (2003). Probabilistic constraint satisfaction at the lexical/phonetic interface: Evidence for gradient effects of within-category VOT on lexical access. Journal of Psycholinguistic Research, 32, 77-97.

McMurray, B., Aslin, R. N., Tanenhaus, M. K., Spivey, M. J., & Subik, D. (2008). Gradient sensitivity to within-category variation in words and syllables. Journal of Experimental Psychology: Human Perception and Performance, 34(6), 1609-1631.

McNellis, M. & Blumstein, S. E. (2001). Self-organizing dynamics of lexical access in normals and aphasics. Journal of Cognitive Neuroscience, 13, 151-170.

McQueen, J. M. (1991). The influence of the lexicon on phonetic categorization: Stimulus quality in word-final ambiguity. Journal of Experimental Psychology: Human Perception and Performance, 17, 433-443.

McQueen, J. M., Jesse, A., & Norris, D. (2009). No lexical-prelexical feedback during speech perception or: Is it time to stop playing those Christmas tapes? Journal of Memory and Language, 61(1), 1-18.

McQueen, J. M., Norris, D., & Cutler, A. (2006). Are there really interactive processes in speech perception? Trends in Cognitive Sciences, 10(12), 533.

McQueen, J. M., Norris, D., & Cutler, A. (1994). Competition in spoken word recognition: Spotting words in other words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20(3), 621.

Mertus, J. (1989). BLISS User Manual. Providence, RI: Brown University.

Meyer, D. E., & Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90(2), 227.

Miceli, G., Caltagirone, C., Gainotti, G., & Payer-Rigo, P. (1978).
Discrimination of voice versus place contrasts in aphasia. Brain and Language, 6(1), 47-51.

Miceli, G., Gainotti, G., Caltagirone, C., & Masullo, C. (1980). Some aspects of phonological impairment in aphasia. Brain and Language, 11(1), 159-169.

Milberg, W. & Blumstein, S. E. (1981). Lexical decision and aphasia: Evidence for semantic processing. Brain and Language, 14, 371-385.

Milberg, W., Blumstein, S. E., & Dworetzky, B. (1988a). Phonological factors in lexical access: Evidence from an auditory lexical decision task. Psychonomic Bulletin and Review, 26(4), 305-308.

Milberg, W., Blumstein, S. E., & Dworetzky, B. (1988b). Phonological processing and lexical access in aphasia. Brain and Language, 34, 279-293.

Miller, J. L. & Dexter, E. R. (1988). Effects of speaking rate and lexical status on phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 14(3), 369-378.

Miller, G. A., & Fellbaum, C. (1991). Semantic networks of English. Cognition, 41(1), 197-229.

Miller, J. L. & Volaitis, L. E. (1989). Effect of speaking rate on the perceptual structure of a phonetic category. Perception and Psychophysics, 46, 505-512.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-244.

Miller, J. L., Green, K., & Schermer, T. M. (1984). A distinction between the effects of sentential speaking rate and semantic congruity on word identification. Perception and Psychophysics, 36(4), 329-337.

Mirman, D., & Britt, A. E. (2014). What we talk about when we talk about access deficits. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 369(1634).

Mirman, D., Yee, E., Blumstein, S. E., & Magnuson, J. S. (2011). Theories of spoken word recognition deficits in aphasia: Evidence from eye-tracking and computational modeling. Brain and Language, 117(2), 53-68.

Morton, J. (1969). Interaction of information in word recognition. Psychological Review, 76(2), 165.

Movellan, J. R., & McClelland, J. L. (2001). The Morton-Massaro law of information integration: Implications for models of perception. Psychological Review, 108(1), 113.

Myers, E. B. & Blumstein, S. E. (2008). The neural bases of the lexical effect: An fMRI investigation. Cerebral Cortex, 18, 346-355.

Nearey, T. M. (1990). The segment as a unit of speech perception. Journal of Phonetics.

Nearey, T. M. (1997). Speech perception as pattern recognition. The Journal of the Acoustical Society of America, 101(6), 3241-3254.

Norris, D. (1994). Shortlist: A connectionist model of continuous speech recognition. Cognition, 52, 189-234.

Norris, D. & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115(2), 357-395.

Norris, D., McQueen, J. M., & Cutler, A. (1995). Competition and segmentation in spoken-word recognition. Journal of Experimental Psychology: Human Perception and Performance, 21(5), 1209-1228.

Norris, D., McQueen, J. M., & Cutler, A. (2000). Merging information in speech recognition: Feedback is never necessary. Behavioral and Brain Sciences, 23, 299-370.

Oden, G. C. & Massaro, D. W. (1978). Integration of feature information in speech perception. Psychological Review, 85, 172-191.

Ostrand, R., Blumstein, S. E., & Morgan, J. L. (2011).
When hearing lips and seeing voices becomes perceiving speech: Auditory-visual integration in lexical access. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 33, pp. 1376-1381).

Pickett, J. M., & Pollack, I. (1963). Intelligibility of excerpts from fluent speech: Effects of rate of utterance and duration of excerpt. Language and Speech, 6(3), 151-164.

Pisoni, D. B., & Lazarus, J. H. (1974). Categorical and noncategorical modes of speech perception along the voicing continuum. The Journal of the Acoustical Society of America, 55(2), 328-333.

Pisoni, D. B., & Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Attention, Perception, & Psychophysics, 15(2), 285-290.

Pitt, M. A. & Samuel, A. G. (1993). An empirical and meta-analytic evaluation of the phoneme identification task. Journal of Experimental Psychology: Human Perception and Performance, 19(4), 699-725.

Pitt, M. A., Kim, W., Navarro, D. J., & Myung, J. I. (2006). Global model analysis by parameter space partitioning. Psychological Review, 113(1), 57.

Pollack, I., Rubenstein, H., & Decker, L. (1960). Analysis of incorrect responses to an unknown message set. The Journal of the Acoustical Society of America, 32(4), 454-457.

Prather, P. A., Zurif, E., Love, T., & Brownell, H. (1997). Speed of lexical activation in nonfluent Broca's aphasia and fluent Wernicke's aphasia. Brain and Language, 59(3), 391-411.

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Recchia, G., Sahlgren, M., Kanerva, P., & Jones, M. N. (2015). Encoding sequential information in semantic space models: Comparing holographic reduced representation and random permutation. Computational Intelligence and Neuroscience, 2015.

Repp, B. H. (1982). Phonetic trading relations and context effects: New experimental evidence for a speech mode of perception. Psychological Bulletin, 92(1), 81.

Repp, B. H. (1983). Coarticulation in sequences of two nonhomorganic stop consonants: Perceptual and acoustic evidence. The Journal of the Acoustical Society of America, 74(2), 420-427.

Repp, B. H. (1984). Closure duration and release burst amplitude cues to stop consonant manner and place of articulation. Language and Speech, 27(3), 245-254.

Riordan, B., & Jones, M. N. (2011). Redundancy in perceptual and linguistic experience: Comparing feature-based and distributional models of semantic representation. Topics in Cognitive Science, 3(2), 303-345.

Robson, H., Keidel, J. L., Ralph, M. A. L., & Sage, K. (2012). Revealing and quantifying the impaired phonological analysis underpinning impaired comprehension in Wernicke's aphasia. Neuropsychologia, 50(2), 276-288.

Rogers, T. T., & McClelland, J. L. (2004). Semantic cognition: A parallel distributed processing approach. MIT Press.

Rohde, H. & Ettlinger, M. (2012). Integration of pragmatic and phonetic cues in spoken word recognition. Journal of Experimental Psychology: Learning, Memory and Cognition, 38(4), 967-983.

Rumelhart, D. E., & McClelland, J. L. (1982). An interactive activation model of context effects in letter perception: II. The contextual enhancement effect and some tests and extensions of the model. Psychological Review, 89(1), 60.

Samuel, A. G. (2011). Speech perception. Annual Review of Psychology, 62, 49-72.

Samuel, A. G. (1981). Phonemic restoration: Insights from a new methodology. Journal of Experimental Psychology: General, 110(4), 474.

Samuel, A. G. (1996).
Does lexical information influence the perceptual restoration of phonemes? Journal of Experimental Psychology: General, 125(1), 28.

Savin, H. B. (1963). Word-frequency effect and errors in the perception of speech. The Journal of the Acoustical Society of America, 35(2), 200-206.

Sawusch, J. R., & Jusczyk, P. W. (1981). Adaptation and contrast in the perception of voicing. Journal of Experimental Psychology: Human Perception and Performance, 7(2), 408.

Scharenborg, O., Norris, D., Bosch, L., & McQueen, J. M. (2005). How should a speech recognizer work? Cognitive Science, 29(6), 867-918.

Smits, R., Warner, N., McQueen, J. M., & Cutler, A. (2003). Unfolding of phonetic information over time: A database of Dutch diphone perception. The Journal of the Acoustical Society of America, 113(1), 563-574.

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory.

St. John, M. F., & McClelland, J. L. (1990). Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46(1), 217-257.

Stevens, K. N., & Klatt, D. H. (1974). Role of formant transitions in the voiced-voiceless distinction for stops. The Journal of the Acoustical Society of America, 55(3), 653-659.

Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. The Journal of the Acoustical Society of America, 111(4), 1872-1891.

Strand, J., Simenstad, A., Cooperman, A., & Rowe, J. (2014). Grammatical context constrains lexical competition in spoken word recognition. Memory & Cognition, 42(4), 676-687.

Summerfield, Q. (1981). Articulatory rate and perceptual constancy in phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 7(5), 1074-1095.

Swinney, D., Prather, P., & Love, T. (2000). The time-course of lexical access and the role of context: Converging evidence from normal and aphasic processing. Language and the brain: Representation and processing, 273-292.

Szostak, C. M., & Pitt, M. A. (2013). The prolonged influence of subsequent context on spoken word recognition. Attention, Perception, & Psychophysics, 75(7), 1533-1546.

Taft, M., & Hambly, G. (1986). Exploring the cohort model of spoken word recognition. Cognition, 22(3), 259-282.

Tanenhaus, M. K. (2007). Eye movements and spoken language processing. Eye movements: A window on mind and brain, 309-326.

Tanenhaus, M. K., Magnuson, J. S., Dahan, D., & Chambers, C. (2000). Eye movements and lexical access in spoken-language comprehension: Evaluating a linking hypothesis between fixations and linguistic processing. Journal of Psycholinguistic Research, 29(6), 557-580.

Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632-1634.

Thomas, M. S., & McClelland, J. L. (2008). Connectionist models of cognition. Cambridge Handbook of Computational Cognitive Modelling, 23-58.

Toscano, J. C. & McMurray, B. (2010). Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics. Cognitive Science, 34, 434-464.

Toscano, J. C. & McMurray, B. (2012). Cue-integration and context effects in speech: Evidence against speaking-rate normalization. Attention, Perception and Psychophysics, 74, 1284-1301.

Tuinman, A., Mitterer, H., & Cutler, A. (2014). Use of syntax in perceptual compensation for phonological reduction.
Language and Speech, 57(1), 68-85.

Utman, J. A., Blumstein, S. E., & Sullivan, K. (2001). Mapping from sound to meaning: Reduced lexical activation in Broca's aphasics. Brain and Language, 79, 444-472.

Utman, C. H. (1997). Performance effects of motivational state: A meta-analysis. Personality and Social Psychology Review, 1(2), 170-182.

Utman, J. A., Blumstein, S. E., & Burton, M. (2000). Effects of subphonetic and syllable structure variation on word recognition. Perception and Psychophysics, 62, 1297-1311.

Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., & Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33), 13273-13278.

van Alphen, P. & McQueen, J. M. (2001). The time-limited influence of sentential context on function word identification. Journal of Experimental Psychology: Human Perception and Performance, 27(5), 1057-1071.

Van Berkum, J. J., Van den Brink, D., Tesink, C. M., Kos, M., & Hagoort, P. (2008). The neural integration of speaker and message. Journal of Cognitive Neuroscience, 20(4), 580-591.

Vitevitch, M. S. & Luce, P. A. (1998). When words compete: Levels of processing in spoken word perception. Psychological Science, 9, 325-329.

Vitevitch, M. S. & Luce, P. A. (1999). Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory and Language, 40, 374-408.

Vitevitch, M. S. (1997). The neighborhood characteristics of malapropisms. Language and Speech, 40(3), 211-228.

Vitevitch, M. S. (2002). The influence of phonological similarity neighborhoods on speech production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(4), 735.

Volaitis, L. E., & Miller, J. L. (1992). Phonetic prototypes: Influence of place of articulation and speaking rate on the internal structure of voicing categories. The Journal of the Acoustical Society of America, 92(2), 723-735.

Warner, N., Smits, R., McQueen, J. M., & Cutler, A. (2005). Phonological and statistical effects on timing of speech perception: Insights from a database of Dutch diphone perception. Speech Communication, 46(1), 53-72.

Warren, R. M., & Obusek, C. J. (1971). Speech perception and phonemic restorations. Perception & Psychophysics, 9(3), 358-362.

Warren, R. M. & Sherman, G. L. (1974). Phonemic restorations based on subsequent context. Perception and Psychophysics, 16, 150-156.

Warren, R. M. & Warren, R. P. (1970). Auditory illusions and confusions. Scientific American, 223, 30-36.

Warren, R. M. (1970). Perceptual restoration of missing speech sounds. Science, 167, 392-393.

Wood, C. C. (1976). Discriminability, response bias, and phoneme categories in discrimination of voice onset time. The Journal of the Acoustical Society of America, 60, 1381-1389.

Yee, E., Blumstein, S. E., & Sedivy, J. C. (2008). Lexical-semantic activation in Broca's and Wernicke's aphasia: Evidence from eye movements. Journal of Cognitive Neuroscience, 20(4), 592-612.

Yeni-Komshian, G. H., & Lafontaine, L. (1983). Discrimination and identification of voicing and place contrasts in aphasic patients. Canadian Journal of Psychology/Revue canadienne de psychologie, 37(1), 107.