Can machines have desires?

Juanda Tan, Brown University

INTRODUCTION

This paper aims to bring us closer to answering the question of whether a machine could plausibly possess desires, by making the claim that the many (very intuitive) assumptions we have about desires being related to dispositions to behave or feel in particular ways perhaps do not penetrate what it fundamentally means to desire. Rather, it is the fundamental relation desires have with rewards, and that which rewards have with the production of learning signals, that provides the best characterization of desires — one which aligns well with our conceptual notions of desire and, strikingly, with various neurological findings. Under the reward-based theory of desire, it seems far less counterintuitive for machines or computer programs to have desires, since consciousness, or even the capacity for bodily movement, is no longer presupposed. In the following section, I outline what it means for a system to constitute something as a reward, a notion on which the reward-based theory of desire fundamentally depends. Next, I lay out the reward theory of desire, and explain the various advantages it has over competing theories. Following that, I provide an account of a reinforcement learning paradigm that is prototypical within the computer science literature — the actor-critic model — and explain why, according to a plausible theory of teleosemantics, it is capable of representing things as rewards in a meaningful way. Finally, I conclude that because such models qualify as constituting things as rewards, they can plausibly be said to have desires.

A PLAUSIBLE THEORY OF DESIRE

In the contemporary literature surrounding desire, there seem to be three dispositional states which are closely related to desire. Concretely, the first of these dispositional states is defined by one’s behavior, or one’s motivation to behave — that is, one desires P if one has the disposition to act so as to bring about P.
The second is defined by the feeling one might get when a desire is satisfied, perhaps rightfully termed the experience or phenomenology of desire — that is, one desires P if one has the disposition to feel pleasure whenever P. The third, and perhaps less commonly addressed, dispositional state is defined by one’s disposition to constitute something as a reward — that is, one desires P if one is disposed to constitute P as a reward. This separation into what Schroeder (2004) might term the “three faces of desire” might seem redundant to some readers, given that the three aforementioned readings of what a desire is seem to express overlapping dispositional states within us — something is a reward to us because it gives us pleasure, we behave in pursuit of what we consider rewards, and so on. This is but an indication of how close the relation is between these dispositional states and what we consider a desire. One goal of this paper is to show that making this distinction makes conceptual sense, and is also supported by empirical data. But more importantly, the chief purpose of this paper is to support the view that the third face of desire is the one that bears a constitutive relation to desire, while the other two bear merely causal ones, and that in laying out a theory of desire, we should demand that desire be rightfully defined only by what constitutes it as such. To this end, this section will first define what a reward is and how it figures in desire. Now, what is a reward? First, note that there is an important sense in which whether or not something is a reward to one depends crucially on whether one constitutes it as such. In determining whether P is a reward to one, it matters whether one takes a positive stance towards P, since it seems that to be rewarded when P obtains is to have a certain kind of favorable response to the state of affairs in which P obtains.
Thus, P’s status as a reward depends constitutively on one’s subjective perspective on P. Again, this is a mark of how closely regarding P as a reward is associated with desiring P, given that desiredness is also, crucially, a perspective-dependent property of things. So, rather than speak of something as being a reward, it is clearer and more efficacious for our purposes to speak of constituting something as a reward, since subjectivity is implied in the latter treatment. What, then, is meant by constituting something as a reward? Perhaps rewards are best understood in terms of their causal powers. After all, we are very familiar with such colloquialisms as “It sure feels good to finally be rewarded,” or “Wai Shing is only working so hard because he is trying to be Employee of the Month.” It appears that when an agent constitutes P as a reward, she has three relevant dispositions — namely, dispositions towards behavior, pleasure, and the production of a learning signal. First, the agent is so disposed that if P follows upon behavior B, her disposition to produce B in similar circumstances is strengthened. Second, the agent is disposed to feel pleasure whenever P obtains. And third, the agent is so disposed that if she represents P as obtaining, a certain sort of learning signal is produced. The first two dispositions are part of the common sense picture of reward, and if they seem to bear an uncanny resemblance to the dispositions caused by desire itself, it is only because of the close connection desire has with reward. The third disposition, though perhaps lesser known, is a major component of the emerging scientific picture surrounding reward.
Following Schroeder, I will argue in the immediately ensuing sections that the first two dispositions are less fundamental to reward than the third, given that there are situations in which an agent is (paradigmatically) rewarded but does not experience pleasure, or a strengthening of her behavioral dispositions; yet the release of a characteristic learning signal seems to occur in all such situations. Moreover, the first two dispositions can be causally explained by the third. Following this, I will provide an argument drawing from those of Kripke (1980) and Putnam (1975) in support of the claim that it is the property that rewards have to cause this third disposition, the creation of a learning signal, that should rightfully be considered essential and constitutive of reward, given that it is this property that provides the deepest explanatory account of the phenomenon of reward.

REWARD AS REWARD-GATHERING BEHAVIOR

Now, make no mistake — reward and behavior are closely intertwined, in human and non-human animals alike. A child receiving a gold star for helping to clean out the classroom may help to take out the trash with the hope of receiving such a reward again. A dog learns new tricks — that is, repeats a certain pattern of behavior upon a particular command — because it is rewarded with a treat every time it demonstrates that behavior on command. Even irrelevant behavior that is typically not praiseworthy or beneficial, and thus ought not to induce the receipt of a reward under ordinary circumstances, can be made to be repeated consistently by an organism, simply in virtue of its being followed, chronologically, by the reward. Consider the classic study on operant conditioning (Skinner, 1948) in which caged pigeons were given food at regular time intervals regardless of their behavior.
After a certain number of trials, the pigeons started developing determinate patterns of behavior which wouldn’t typically be observed in pigeons under non-experimental circumstances — one repeatedly walked in circles around the cage, another continuously thrust its head into a corner of the cage, and so on. These behavioral patterns were observed to arise only because they temporally preceded the release of food into the pigeon’s cage; in other words, even behavior that was merely spontaneous and wholly unrelated to the reward-gathering efforts of the pigeon (at least, initially) can be induced to be repeated by its being rewarded. Thus it is fairly apparent that the causal connection between reward and behavior runs deep. Nonetheless, this doesn’t seem to provide good reason for arguing that the strengthening of certain behavioral dispositions is essential to or constitutive of what makes something a reward. For if it were so, rewards should necessarily cause the strengthening of behavioral dispositions, but there seem to be conceivable situations where this is not the case. Consider Strawson’s Weather Watchers (1994), hypothetical creatures which lack the ability to act, but can justifiably be said to constitute certain things as rewards. Despite not being capable of any bodily movement whatsoever, these creatures enjoy rich emotional lives, and develop emotional responses to various weather patterns: they feel hope and optimism on a sunny day, and despair on a cloudy one; the prospect of light showers might fill them with delight, but a heavy thunderstorm might cause sadness, even strike fear in them. This seems to align with how we respond emotionally to paradigmatic rewards and punishments respectively — we too delight in things we consider rewarding, and despair in things we consider punishing.
Thus it seems intuitive to say, then, that these Weather Watchers constitute the onset of sunny weather and light drizzles as rewards, and don’t constitute cloudy skies and torrential rain as such. However, it wouldn’t be reasonable to argue that for a Weather Watcher, some disposition to behave in a certain way will strengthen on a sunny day, given that one needs to be able to act to have a disposition to so act in the first place, and Weather Watchers possess no such ability. To be clear, the claim that the possession of a disposition to act necessitates an ability to so act is trivially true — if one instead argues that one can be disposed to act in a certain way without being able to so act, then one must concede that there are countless preposterous things humans could be disposed to do, such as photosynthesizing, levitating, walking through walls, and so on. Thus, Weather Watchers cannot be said to possess any disposition to act, and therefore do not and indeed cannot experience the strengthening of any such disposition, despite being able to constitute certain things as rewards. It would thus be explanatorily insignificant, perhaps even wholly irrelevant, to account for the fact that the Weather Watchers constitute certain things as rewards with an explanation that some behavioral disposition in them was strengthened. This provides a good example of how it is not necessary for there to be a strengthening of certain behavioral dispositions for an agent to constitute something as a reward. If this is the case, then the causal power that rewards have to strengthen behavioral dispositions is not essential to or constitutive of what it means for something to be a reward after all.

REWARD AS PLEASURE-INDUCING

It seems plausible, then, that the causal property of reward to induce changes to behavioral dispositions doesn’t qualify as being constitutive of reward. Now, what about feelings of pleasure?
Reward’s relationship with pleasure, like its relationship with behavior, is a close and undeniable one. We feel happy when we’re rewarded; we look forward to being rewarded because of the pleasure it will bring; we prioritize certain things we consider rewards over others because they provide a greater degree of pleasure; and so on. However, not unlike the causal property of reward to induce changes to behavioral dispositions, its causal property to induce pleasure also does not seem essential to what it means to be a reward, because in certain situations one could demonstrably constitute something as a reward, yet have no such disposition to feel pleasure upon receiving it. Consider a study involving hospitalized heroin addicts (Lamb et al., 1991), in which subjects were given unknown doses of morphine from Monday to Friday of a given week. Although the dosage for that week was not disclosed to the subjects, it would have become apparent to them by the Wednesday of that week, and if subjects desired continued doses for the coming Thursday and Friday, they had to press a lever three thousand times within forty-five minutes to be granted them. The findings revealed that subjects proceeded to complete the task for as low a dose as 3.75 milligrams — which, importantly, subjects subjectively did not derive any pleasure from, and rated as worthless, in a separate study — but would not do so for doses of saline. It seems, however, that subjects did subjectively regard whatever dose of morphine they got as a reward despite not deriving any pleasure from it whatsoever, given that the doses seem to exhibit the same motivational profile that paradigmatic rewards would. Why else would subjects proceed to complete such an inane and tedious task, designed only to return them something under the specific conditions of the experiment?
Therefore it seems plausible that there are states of affairs which we consider rewards but don’t derive pleasure from, implying that the disposition to feel pleasure isn’t necessary for one to be able to constitute something as a reward. Could the study’s findings admit of another explanation: that under the specific conditions of Lamb’s experiments, the mere fact of being given morphine, whatever the dosage, was enough for subjects to feel pleasure and thereby constitute it as a reward? This is perhaps plausible given that each dosage wasn’t precisely disclosed to the subjects, such that it perhaps becomes easy for subjects to deceive themselves into thinking that they were given effective dosages of morphine, even if some of them were medicinally indistinguishable from saline. After all, the “placebo effect” is a well-known phenomenon — the mere fact that patients were given pills they thought to be medicine, even if the pills contained only sugar and had no medicinal properties whatsoever, was enough for them to subjectively feel that they were getting better (American Cancer Society, 2015). However, this effect doesn’t quite seem to be at work here, given that subjects weren’t motivated to complete the lever-pressing task for morphine dosages under 3.75 milligrams, though they should have been if the mere fact that they were given morphine was all it took for them to constitute it as a reward. The discovery of a lower limit to the dosage required for subjects to reproduce the behavior in question suggests that subjects were in fact discriminating between the dosages that were worth their time and effort and those that were not, and that they could in fact discern those quantities to some reliable degree. Thus it seems plausible that subjects constituted such ineffective dosages of morphine as 3.75 milligrams as rewards, despite not feeling any associated pleasure as a result.
Other studies also suggest that a disposition to feel pleasure isn’t necessary for one to constitute things as rewards. In a study where feeding behavior was elicited in rats via electrical stimulation to the lateral hypothalamus (Berridge and Valenstein, 1991), although rats consistently behaved so as to procure and consume food, they did not demonstrate the facial expressions or mannerisms that are indicative of a positive affective reaction, which include a rhythmic midline protrusion of the tongue, a nonrhythmic lateral protrusion of the tongue, and the licking of paws. These reactions are widely believed to be characteristic of a pleasure sensation in rats, because they have underlying parameters, like a certain allometric rhythm, that are identical to those which underlie the behavioral affective reactions demonstrated by humans and primates, suggesting that these demonstrations of positive affect may be homologous or evolutionarily derived (Steiner et al., 2001). Given that pleasure is not the motivating force behind the rats’ feeding behavior, some other non-hedonic psychological mechanism must be at play to drive behavior here; Berridge (2003) argues that this mechanism is governed by a kind of incentive salience, an internal psychological process that functions to make certain stimuli more attractive to the agent, in a way that is independent of whether the stimulus delivers a pleasure sensation or not. Incentives that are salient to an agent thus motivate incentive-seeking behavior in the agent. Such a process underlies what we might typically call ‘wanting’, a psychological state Berridge characterizes as strictly non-hedonic, and thus fundamentally different from that of ‘liking’, which is grounded in pleasurable utility. The rats wanted the food, but did not like it.
Strikingly, it is believed that it is the incentive salience governing ‘wanting’ that contributes to the psychological mechanism of learning through motivation by reward (Berridge and Robinson, 1998), suggesting yet again that rewards are perhaps not best characterized by their putative ability to induce pleasure. Therefore, it seems very plausible that although the rats viewed the food as rewarding and consistently sought to consume it, they did not in fact derive any feeling of pleasure from it whatsoever. In sum, these studies show that even when agents don’t possess a disposition to feel pleasure in response to a given stimulus, it is still plausible that they constitute the stimulus as a reward, given that they responded to the onset of the stimulus in a manner that aligns very closely with the ways in which we typically respond to paradigmatic rewards — namely, with repeated behavior that is directed at procuring the reward. These findings thus strongly suggest that it is not necessary for an agent to have a disposition to feel pleasure whenever P obtains in order for her to be able to constitute P as a reward, and thus that dispositions towards pleasure are not essential to or constitutive of what it means for something to be a reward.

REWARD AS THE CREATION OF A LEARNING SIGNAL

This leaves us with the connection between reward and the creation of some sort of learning signal. To begin, perhaps the most important reason which justifies our identifying reward as something which causes the production of a learning signal is the neurology underlying what empirical studies have decisively found to be the reward system in human and non-human animals alike.
These neural structures are known as the ventral tegmental area (VTA) and the substantia nigra pars compacta (SNpc), and are responsible for the release of dopamine to other areas of the brain, thereby reorganizing various neural connections and inducing feelings of pleasure, changes to behavioral dispositions, and even changes to sensory capacities. Crucially, it is the VTA and SNpc’s connection to multiple brain areas underlying memory, sensory and motor capacities, as well as their control of the release of dopamine to those areas in response to stimuli, which gives them a functional profile that very closely corresponds to that of a system which actualizes the various causal powers that, as we have seen, are typically associated with rewards. To be clear, the argument here is not that such structures are strictly necessary for one to constitute something as a reward. That would unnecessarily preclude organisms (actual and hypothetical) which do not possess the specific neural machinery from being able to be rewarded, and we have no reason to do so, since the general state of constituting something as a reward ought to be multiply realizable across many different neurologies. Rather, an understanding of the way in which rewards are instantiated in us and many other animals helps us nail down the way rewards work in general, thereby serving our purposes of finding a broader conceptual, extra-neural theory of reward. Importantly, the point is that it is by the creation of what we have called a learning signal — regardless of how this signal is implemented, neurally or otherwise — that we define what it means to constitute something as a reward. Now, what does the release of dopamine have to do with rewards in the first place?
Dopamine-releasing neurons in the VTA and SNpc exhibit firing activity that correlates strongly with an organism’s receipt of (paradigmatic) rewards: although they have baseline activity that is fairly low and stable, this activity spikes and briefly becomes more rapid when, for example, the organism is hungry and has just received a tasty treat. This correlation is a prima facie reason to believe that these neurons, and the dopamine they release, are somehow involved in the various causal relationships that rewards have. Moreover, it seems that when the dopamine receptors in the brain are blocked, the neurological changes that seem to be caused by paradigmatic rewards no longer occur. A study by Bao, Chan and Merzenich (2001) involving stimulation to the auditory cortices and VTA/SNpc of rats provides evidence for this claim. When the rats were first exposed to sound at a given frequency, and then subsequently received VTA/SNpc stimulation, the neural connections in their primary and higher auditory cortices were found to have been reorganized such that the rats were better able to pick up and process sounds of the same frequency. This suggests that VTA/SNpc stimulation, and the consequent release of dopamine to the auditory regions of the brain that were stimulated prior, was sufficient to create in the rat a propensity to be auditorily stimulated in the very same way. This indicates that the activity of dopaminergic neurons in the VTA/SNpc is not merely correlated with the receipt of paradigmatic rewards, as is widely observed, but in fact causes what paradigmatic rewards themselves cause. Namely, VTA/SNpc activity causes particular changes within the organism that seem to implement a certain process of learning — learning that, in this case, takes the form of learning to better perceive and process certain auditory tones.
On the other hand, when dopamine receptors in the aforementioned auditory cortices were blocked by the experimenters, no change in the rats’ cortical areas was observed, and thus no such learning occurred. This strongly suggests that it is in response to the chemical, dopamine, that these brain areas undergo the changes that they do, and therefore that dopamine is what facilitates the causal relationship between paradigmatic rewards and learning. If it is true that dopamine induces neurological change in the way rewards might, then the VTA and SNpc, which control the release of dopamine, are responsible for setting this change in motion. But perhaps it would aid the discussion if we examine further why such changes within the organism can be considered learning. The kind of learning involved here is what Schroeder terms contingency-based learning: learning that consists in the reinforcement of dispositions within the system — reinforcement that, if positive, better positions the system to procure that which did the reinforcing, and if negative, better positions the system to avoid it. Thus, such learning is future-directed — it concerns the system’s performance in similar contingencies which may arise in the system’s future. In the case of the tonally-trained rats, the specific way in which connections within the rat’s cortices were changed suggests that it was a targeted response to the particular stimuli the rat received. The fact that it heightened the rat’s capability to process a particular frequency of sound, as opposed to some other frequency, or as opposed to impeding this capability altogether, suggests that its prior processing of the sound was positively reinforced, such that for the rat, its propensity for future processing of the same sound is increased. The change was targeted and in response to the given stimuli.
Thus we say that the rat has learned to process sounds of this frequency better, in the same way that a dog learns to reproduce a certain behavior upon being positively reinforced by treats, or a child learns to sit still at the dinner table upon being positively reinforced by the granting of dessert. To be clear, such learning doesn’t necessarily have to involve changes in behavior — the rat didn’t necessarily act to hear sound at the same frequency again. Rather, this change is perhaps better exemplified by shifts in dispositions within the organism — a disposition to act, to perceive, even to think in a given way. Other studies provide more stereotypical examples of contingency-based learning. A study by Egelman, Person and Montague (1999) showed that the learned behavior of human subjects in various decision-making tasks matched closely with that of a predictive network modeled on mesencephalic dopamine systems, suggesting a direct relationship between the biases involved in human decision-making and changes in dopamine delivery to target structures in the brain. Action selection in non-human animals also seems to depend crucially on dopamine release (Berridge and Robinson, 1998). Thus it seems that learned biases in decision-making processes, as well as learned behavior, are guided by the causal activity of dopamine-releasing neurons in the VTA/SNpc. In light of these findings, the VTA and SNpc can be understood as making up our reward systems, given that their activity seems to coincide with the receipt of paradigmatic rewards, and that they cause what paradigmatic rewards cause — learning, in its various forms. The activity of these essential structures, then — that is, the activity that underlies our constituting something as a reward — can be said to be defined by the creation of a learning signal, a signal that then induces the various kinds of learning that we have observed in the experimental setups cited.
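Predictive networks of the kind Egelman, Person and Montague describe are commonly formalized as temporal-difference (TD) learning, in which the dopaminergic learning signal is modeled as a reward prediction error. The following sketch is my own illustration, not drawn from the paper or the studies it cites; the function name, learning rate, and discount factor are assumptions chosen only to make the idea concrete:

```python
# A minimal sketch of a temporal-difference (TD) "learning signal":
# the signal is the gap between the reward received and the reward
# predicted, and it (rather than the reward itself) is what drives
# learning. All names and parameter values here are illustrative.

def td_update(values, stimulus, reward, next_value=0.0,
              alpha=0.1, gamma=0.9):
    """Update the value estimate for `stimulus` in place.

    Returns the TD error, i.e. the "learning signal": positive when
    the reward exceeds what was predicted, negative when it falls
    short, and zero when the reward was fully anticipated.
    """
    delta = reward + gamma * next_value - values[stimulus]
    values[stimulus] += alpha * delta
    return delta

# A stimulus repeatedly followed by reward comes to be predicted, so
# the learning signal is large at first and shrinks toward zero over
# repeated trials, even though the reward itself never changes.
values = {"tone": 0.0}
signals = [td_update(values, "tone", reward=1.0) for _ in range(50)]
# signals[0] is large (the reward was fully unexpected);
# signals[-1] is near zero (the reward is now predicted).
```

On this picture, what is constitutively reward-like is the production of the signal `delta`, not the behavioral or hedonic consequences that the updated values may later cause, which is the distinction the section above is drawing.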
But given that rewards also induce behavioral change and feelings of pleasure, what reason do we have to believe that our constituting something as a reward is best defined by its relation to learning? First, the behavior-production and pleasure centers are situated elsewhere in the brain, and it is clear that their activity is preceded by VTA/SNpc stimulation, and by the release of dopamine by neurons within the VTA and SNpc to those centers. We have already seen that behavior production is causally affected by dopamine release; as for pleasure, studies have shown that the effects of many different pleasure-inducing drugs like cocaine (Gawin, 1991), “ecstasy” (Liechti and Vollenweider, 2000) and various others (ed. Kandel, Schwartz and Jessell, 2000) are mediated at least in part by the VTA/SNpc. To clarify, it is also unlikely that the VTA/SNpc is itself the pleasure center of the brain, given that rats have been found to experience pleasure in response to consuming sugar despite severe lesions to dopamine-releasing neurons in their VTA/SNpc (note that this doesn’t invalidate VTA/SNpc activity as a cause of pleasure, but only as a necessary one). Thus it seems most intuitive to identify the creation of a learning signal (as implemented by VTA/SNpc activity in us) as our constituting something as a reward, and to explain behavior production and feelings of pleasure as being subsequently caused by this signal. Furthermore, such an explanation aligns well with commonsense psychology. We previously noted in the various experimental findings cited the myriad effects of paradigmatic rewards — creating emotional changes, motivating particular behavioral and decision-making patterns, even perhaps enhancing sensory abilities — that are mediated via the action of the VTA/SNpc systems and the release of dopamine to relevant parts of the brain.
These are effects that we intuitively attribute to paradigmatic rewards, but the point here is that although these effects are certainly differentiated, they are nonetheless unified under the idea of reward as producing a characteristic learning signal within the organism. That is, identifying the essence of reward as such allows us to provide an explanatorily significant account of the multitude of effects rewards have on us and on non-human animals, in a way that aligns with, and indeed is supported by, the central role of the VTA and SNpc in our mental life. Note also that this means not only that the learning-based theory of reward doesn’t preclude the close relation between reward on the one hand, and behavior and pleasure on the other, but also that such a theory actually supports that relation. It is the conveying of a learning signal, via a release of dopamine, to motor cortices and the basal ganglia that causes the organism to exhibit decision-making and other bodily behaviors; indeed, the signal is only one of learning insofar as it has the potential to, among other things, produce learned patterns of behavior in the organism. Moreover, the learning signal creates a favorable attitude towards the stimulus within the organism in a similar manner: via a release of dopamine to hedonic centers of the brain (for example, the PGAC). Part of what makes the signal one of learning is also the kind of emotional reinforcement that it is capable of creating within the organism, which allows the organism to learn to make positive associations with rewarding stimuli. The causal relationships behavior and pleasure have with reward are thus not only preserved by a learning-based theory of reward, but are also accounted for with explanatory force.
By accounting for changes in behavioral and emotional dispositions in response to reward in causal terms, we give explanatory significance to many commonly uttered statements regarding reward — such statements as “Pascal always works late nowadays because he’s gunning for a promotion,” or “Hyun Joon’s spirits have really improved because after many weeks of working long hours, he was finally rewarded with an off-day.” However, one might rightfully object that a learning-based theory will, by characterizing the relationship between reward and learning in terms of identity and equivalence instead of causality, reduce certain statements meant to express causality to mere tautologies; one perhaps contrived example might be something like: “A learning signal was created in Greg because he was rewarded with a raise.” Notice first that such statements don’t really feature in our everyday appraisals of our own and others’ psychologies. Moreover, the theory identifies one’s constituting something as a reward with the creation of a learning signal within one, and not with the actual event of learning; it is only that a certain potential for learning has been actualized, and the learning itself — which, notably, may or may not happen — is but a consequence of the creation of this potential. Thus while reward is identified as the creation of such a potential for learning, it can still meaningfully be said to cause a specific event of learning. The effect of learning can therefore still be causally attributed to the fact of reward in an explanatorily significant way.

LEARNING SIGNAL CREATION AS ESSENTIAL TO REWARD

To bring home the argument that what is essential to reward is not the strengthening of behavioral dispositions or the production of feelings of pleasure, but rather the creation of a characteristic learning signal, we will conclude this section by examining the dominant essentialist theory of which Kripke (1980) and Putnam (1975) are advocates.
They identify the properties that are essential to an object P being of a certain kind K as the properties of P which provide the deepest explanatory account of the properties typically associated with K. These essential properties define what it means for something to be of kind K, because they provide important causal explanations for the other phenomena that K exhibits, and thus ground K’s other properties in a fundamental way. This seems to make good sense — intuitively, if the object P didn’t possess the properties which provided the causal ground for the properties commonly associated with K, then it wouldn’t instantiate K’s properties and therefore couldn’t reasonably be classified as being of kind K at all. This argument is perhaps best exemplified in chemistry. Kripke advances an argument that, especially for things with underlying chemistries, it is their microstructural properties which determine their natures. For example, gold exhibits certain chemical behavior and has a certain outward appearance because each gold atom has atomic number 79; its atomic structure thus constitutes its essence, since it adequately provides the ground for explaining all its other observable properties. Likewise, Putnam’s Twin Earth thought experiment advances a similar argument. On Twin Earth, a planet that is identical to Earth in all observable respects, there exists a liquid that similarly appears to be identical to water on Earth: it is colorless, odorless, tasteless, potable and so on. However, its chemical composition is not H2O, but rather some unknown chemical compound, XYZ. In spite of possessing all of water’s superficial properties, however, intuitively XYZ is not water.
This is because what fundamentally explains water’s identifying properties — its being composed of a particular chemical compound, H2O — is not instantiated in XYZ, and thus does not explain XYZ’s identifying properties, even if those properties happen to be identical to those of H2O. A similar argument can be applied to reward, and perhaps has already been somewhat explored in the preceding section. We identify the property that is truly essential to what it means to be a reward as the creation of a learning signal, because the creation of this signal crucially explains many of the other properties paradigmatic rewards have. We have already seen how the creation of a learning signal, implemented as activity of the VTA/SNpc in human and non-human animals alike, causes the strengthening of certain behavioral dispositions via a release of dopamine to the basal ganglia, motor cortices and other subcortical structures responsible for motion; we have also observed how a similar dopaminergic release to hedonic centers of the brain causes feelings of pleasure. We have also seen how sensory capacities can be heightened by similar activity. Other studies show that the release of dopamine is also directly involved in the learning of certain cognitive dispositions (Knowlton, Mangels and Squire, 1996). All these effects are characteristic effects of rewards, and all, very strikingly, can be adequately explained by the release of dopamine by neurons in the VTA/SNpc, which thereby creates a characteristic learning signal within the system. According to the aforementioned essentialist doctrine, then, the disposition of rewards to produce this learning signal is what is rightfully essential to reward — it is the property that provides the deepest and most fundamental explanatory account of many of the causal properties associated with paradigmatic rewards. 
It seems most reasonable, then, that it is reward’s third — albeit lesser-known — disposition, to release a characteristic learning signal, which constitutes the essence of reward after all. 
THE REWARD THEORY OF DESIRE 
Thus a plausible case has been made for a learning-based theory of reward. An organism’s constituting P as a reward can now be plausibly understood as P having the tendency to contribute to the production of a learning signal in the organism. How does this figure in desire? In what follows, I will argue that we should identify desiring something as precisely that: constituting that thing as a reward. To be precise, I will defend the following thesis, as posited by Schroeder (2004): 
The Reward Theory of Desire (RTD): To have an intrinsic desire that P is to use the capacity to represent that P to constitute P as a reward. 
In the following sections, I will first present some reasons to believe this view, and then outline several advantages that it has over contending views — namely those arguing that the essence of desire lies in one’s disposition to behave, or feel, in a given way. The first argument in favor of RTD is that it appeals to common sense. Presumably, if we would like to figure out how we might reward someone, it would help to first find out what the person desires; conversely, if we were to attempt to reward someone by giving her something she in fact does not desire, then it seems commonsensical to say that we have failed to reward her altogether. Here the subjective, representational quality of both desire and constituting something as a reward is apparent — desiring P, and constituting P as a reward, seem closely aligned in that they both involve the subject’s taking a positive stance towards P. Perhaps an example would help to illustrate the point. 
Suppose Yuko wants to reward her girlfriend for getting into the graduate school of her choice, and tries to do so by surprising her with tickets to go hike the famed Inca Trail in Peru. If her girlfriend loves the outdoors, and has dreamed of one day witnessing the beauty of the Machu Picchu ruins first-hand, then this would prove to be a successful attempt on Yuko’s part at rewarding her. However, if in fact she has an unspoken aversion to spending extended periods of time in the heat and humidity of the Peruvian wilderness, then as much as Yuko would like to reward her, it would seem that she has failed to do so, for her girlfriend does not desire hiking the Inca Trail at all. Thus it seems intuitive to identify desiring something with constituting it as a reward. Other observations about the striking similarities between the effects of being rewarded and having one’s desires satisfied also seem to support this claim. People are motivated to achieve what they desire, and look forward to having their desires satisfied; likewise, people seek rewards, and eagerly anticipate being rewarded. We consciously make decisions about what to do based on how we have had our desires satisfied in the past, or similarly how we have been rewarded; the positive reinforcement of both being rewarded and having desires satisfied may also influence our actions and decisions unconsciously. All these everyday observations thus provide a prima facie reason to believe that being rewarded and having one’s desires satisfied are one and the same. Moreover, it isn’t that it merely seems to us that rewards and desire satisfaction produce the same consequences in us; rather, neurological evidence shows that they actually do. We have previously looked at some of these experimental findings in relation to learning, but we will now examine them in relation to desire satisfaction. First, consider the effects of paradigmatic desires on actions. 
It is widely known that we are typically disposed to act so as to have our desires satisfied; neurological studies suggest that the brain areas responsible for action are not designed to function independently of the activity of our reward system — the VTA and the SNpc. The center of action production is the dorsal striatum, which receives input from the brain’s sensory and cognitive regions; these give it information about standing perceptual and belief-states — qualitatively, about the current state of affairs, and the relevant action options that are available to it at the given time. The output of the dorsal striatum is then directed to the brain’s motor and pre-motor cortices (as well as many subcortical structures), which collectively have full control over the performance of voluntary bodily movements. But more pertinently, the dorsal striatum also receives input from the reward system, and if the RTD is correct, then the psychological mechanism we intuitively believe underlies voluntary action can be more thoroughly explained — in biological terms, motor function is initiated by sensory, cognitive and reward information; in psychological terms, voluntary action is motivated by the contents of our perception, beliefs and desires, as is widely believed. A compelling reason to believe the reward system is essential for voluntary action is the striking loss of the capacity for such action in patients with (extreme forms of) Parkinson’s disease. In severe cases, the disease kills all the neuronal cells within the reward system which contribute to action production; the patient ends up afflicted by complete paralysis as a result (Langston and Palfreman, 1995). Importantly, patients do not regain voluntary motor capacities by drawing on beliefs about what is best or right to do. 
This suggests that reward-independent motivation is insufficient for voluntary action; rather, voluntary action production needs to be at least somewhat motivated by states of the reward system. To illustrate this point further, although bodily movements can be performed while bypassing the reward system, such as through the direct stimulation of motor cortices, or in Tourettic tics, such movements are not paradigmatically voluntary — that is, they seem to arise independently of desire. This suggests even further that the involvement of the reward system is essential for voluntary action, and underlies the motivation of such action by desire. Next, consider the causal relationship between the activity of the reward system and feelings of pleasure. We saw in the previous section that the release of dopamine by the reward system to hedonic centers in the brain is responsible for the pleasure sensations we experience. To emphasize this point further, other studies suggest that most pleasure-causing drugs directly or indirectly act on relevant parts of the reward system (Liechti et al., 2000; Liechti and Vollenweider, 2000); furthermore, direct stimulation of the reward system seems to suffice, under certain experimental conditions, to cause pleasurable feelings (Stellar, 2012). Such evidence is widely observed, and strongly suggests that the activity of the reward system, though artificially induced in these examples, plays an essential and perhaps even sufficient role in causing pleasure sensations. An additional point can be made about the role of reward learning in helping us position ourselves to attain what we intrinsically desire. We saw earlier how the activity of the reward system is best understood as the production of a characteristic learning signal. 
This signal is designed to causally affect other brain areas so as to motivate particular behavior, produce feelings of pleasure, enhance sensory capacities, and so on — that is, so as to place ourselves in a more advantageous position for reaping what we constitute as the reward, or equivalently, what has contributed (and has the tendency to contribute) to the production of that learning signal. This seems fundamentally aligned with the effects that desires can cause in our dispositions. In a general sense, we are disposed to overcome obstacles, navigate confusing territory (literally and metaphorically), and even improve our own abilities, so as to position ourselves to attain what we intrinsically desire. Again, it seems only natural to think that it is the promise of such desire satisfaction that underlies the learning that rewards bring about, and the learning that the activity of the reward system functions to cause; perhaps, then, desire satisfaction and the attainment of rewards are one and the same. The empirical findings cited so far show that the reward system, comprising the VTA and the SNpc, is in fact the common cause of all the aforementioned effects, for us and many non-human animals. It is a biological natural kind, a component of our neurophysiology that really does function to bring about those biological effects. In order to understand the concept of reward, and of constituting something as a reward, we can rightfully generalize this to say that rewards themselves are a psychological natural kind, regardless of how the constitution of something as such is implemented — by the VTA/SNpc, or otherwise. They do causally affect our behavior, our feelings, and also the way we learn. 
Now, in order to solidify our defense of the view that constituting something as a reward just is desiring it, we need also to show that the intrinsic desires we are trying to understand really do in fact have those very same causal powers — that is, that desires themselves are a natural kind. Only then would the striking alignment between the causal profiles of rewards and desires provide a plausible argument for identifying desires as the constituting of things as rewards. The first argument in favor of treating desires as a natural kind is a familiar one: an appeal to intuition. There seems to be something straightforwardly appealing and explanatorily significant in arguing the thesis of desire realism, that desires do exist and really do at once exert all the myriad causal powers we have examined thus far. When a baseball player hits a home run, and at once feels exhilaration, goes on to continue refining her swing with repetitive drills, and perhaps also internalizes how her body contorted and felt as she propelled the ball out of the park, it seems natural and uncontroversial to say that she has had a desire to hit a home run all along, and that it is this singular mental state of desire (satisfaction) that simultaneously explains all the different mental and physical effects she experienced. Such reasoning occurs to us immediately and makes good sense, without the need for further complicated hypothesizing. Furthermore, if desires as a natural kind aren’t invoked to make this account, then something else needs to fill the explanatory gap, and it doesn’t quite seem that any other mental state serves as a suitable candidate. Moreover, it seems that a good test for the realism and explanatory robustness of a state as a putative part of the mind is that it features in our best endeavors in psychological science. 
For the reasons canvassed earlier, it seems that the psychological sciences will likely continue to make good use of the reward system in explaining a multitude of other thought, feeling and behavior phenomena, given that no other system of the brain simultaneously causes all those effects. If the reward system is going to be preeminent in the psychological literature, then either we account for its explanatory significance by interpreting its activity as realizing desires, or we make the arguably unwarranted interpretation of its activity as realizing some other hitherto unidentified mental state, or worse: we disregard its importance to the philosophy of mind and morality altogether. Again, desire seems to be the best answer we have when examining the question of what mental state is realized by the activity of the reward system. Thus, there is good reason to understand desire as a natural kind. If desire and reward are both natural kinds, and have matching causal profiles, then it seems very plausible that they are identical. We have gone a considerable distance in showing that a reward theory of desire makes intuitive sense, and has distinctive explanatory force. As a final clarifying point, note that while this trivializes the apparent causality expressed in statements such as “getting that reward really did satisfy the desire she had,” it seems wholly unnecessary to invoke causality here in the first place. Getting a reward and having a desire satisfied occur simultaneously, not because getting the reward somehow caused the satisfaction of a desire for that same reward, but because getting the reward just is the satisfaction of that desire. In certain cases, however, it seems possible for us to desire a reward, but this is best expressed as a second-order desire — a desire to have some desire or other satisfied. 
This seems to make sense if we consider how we might desire to have a particular desire satisfied comparatively more than another. Thus constituting something as a reward and desiring it can still be one and the same without facing serious conceptual flaws, and the RTD remains intact. 
ADVANTAGES OVER THE MOTIVATIONAL THEORY OF DESIRE 
Perhaps one might oppose the RTD by saying that instead, desiring P consists in the disposition to act so as to bring it about that P. By now it should be well established that desires can cause action; so as not to belabor the point, let us affirm that it does seem plausible that if I have an intrinsic desire to, say, eat peaches, then I am disposed to eat peaches, and will do so when I have the means to. We’ll call this theory the motivational theory of desire. However, for this to constitute desire in an essential way, the disposition to act so that P must be both necessary and sufficient for desiring P. But it seems that such a disposition is neither sufficient nor necessary. First, consider some arguments against the sufficiency of behavioral dispositions. It seems that under certain circumstances, it is possible for us to have a disposition to act in a given way, and yet it still doesn’t amount to our having the associated desire. Behaving out of habit is one such example. We often hear people lament “I didn’t actually want to do it, I just did it out of habit,” and even express feelings of frustration after having done something habitually that might have been counter-productive or even undesired at the time. It seems that desires and habits are fundamentally different motivating forces for action, such that it doesn’t seem quite right for both to have dispositions to act as their essence. Suppose that Phong has a habit of leaving the refrigerator door open whenever he reaches into it for a snack. 
He is able to overcome this habit most of the time, but occasionally, when he reaches into the refrigerator while really hungry, he’s so focused on obtaining and consuming the snack that he inadvertently leaves the door open. From a third-person perspective, the consistency in his act of carelessness seems to constitute a disposition to so act, such that if desires were really equivalent to behavioral dispositions, Phong would be functionally indistinguishable from someone who has the (albeit very specific) desire to leave the refrigerator door open when he is hungry. Yet Phong doesn’t have such a desire — upon finding out that he has left the door open many hours later, Phong might find himself annoyed that his folly has caused the milk and eggs to go bad. It’s clear that he never wanted to leave the refrigerator open in the first place. Another case in which a disposition to act doesn’t suffice for desire involves the influence of other mental states which often cause the subject to act in opposition to what she intrinsically desires. Consider a runway model who is making his debut at a big event; as an aspiring model, the last thing he wants to do is trip and fall on stage, in front of such a large audience. Yet the anxiety and the mental image of him tripping on his own feet prove too much for him to overcome, so much so that he forms a disposition to trip on stage, and actually blunders and falls during the show. Here, again, we see that it is possible for one to develop a disposition to act that is decidedly contrary to what one wants to do; the motivational theory would thus wrongly classify the model as having desired to fall on stage, when we know he desired no such thing. 
One could perhaps push back on this argument by saying that in these cases the subjects can’t be said to have had desires to act as they did, since their actions were not voluntary, and thus were not the kind of genuine action that the motivational theory of desire deals with. However, such an objection seems to beg the question: the claim here is that part of what makes something a genuine action is that it is caused by desire, but that claim is precisely what a theory of desire should explain, so we cannot assume it in order to prove the theory itself. Thus the objection from the insufficiency of the disposition to act for desire still stands. Furthermore, the disposition to act is also not necessary for one to have a desire. We’re all familiar with having desires that don’t have much to do with our acting on them at all. For example, most of us desire an end to global poverty, yet do not feel compelled to fight for that end, and would not reason that we have a disposition to do something about it. The relation between a disposition to act and desire seems even more tenuous if we consider cases where we might desire something that is not within our power, such that harboring a disposition to act so as to make it happen would be irrational — for instance, some mathematicians have a desire that P = NP, because that would make many hard problems tractable, but it would be delusional to even think of trying to make that happen. Even more striking are cases where we have certain desires which are defined precisely by the disregard of one’s actions. Arpaly and Schroeder (2014) offer the example of desiring that someone love you regardless of what you do; this kind of desire for unconditional treatment, if anything, involves desiring such treatment in spite of the various dispositions to act one might have. 
These everyday examples strongly suggest that it is possible, and probably quite common, for us to have desires that do not and in some cases cannot depend on a given behavioral disposition. Another familiar argument comes to mind: the conceivability of the hypothetical creatures Strawson calls Weather Watchers. We examined this example in relation to reward in an earlier section, but it is relevant in relation to desire as well. To reiterate, Weather Watchers lack the capacity for any kind of bodily movement, yet possess the emotional disposition to feel delight and despair at specific weather patterns. Given that such emotional responses are not random, but targeted at particular weather phenomena, it seems intuitive for us to say that Weather Watchers desire the onset of some weather patterns, and are averse to the onset of others. But a disposition to act entails an ability to so act. These Weather Watchers, without such an ability, can have no such disposition; yet they possess desires. Thus, it doesn’t seem that dispositions to act are necessary for one to have desires either. Perhaps one could defend the motivational theory by saying that these examples don’t quite involve desires, but rather involve something more like wishes. The differences between these two states are subtle and very much debatable. It does seem that there is a sense in which wishes are accompanied by a kind of helplessness, that there is nothing in our power that can be done to help us get what we wish for; perhaps it isn’t even worth trying. But does this really qualify them as fundamentally different in kind? We might also say that both wishes and desires are favorable attitudes towards their objects, and that what occurs when one shifts from wishing for P, to figuring out that perhaps something can be done so that P, to finally desiring P, is merely a cognitive change and not a full-fledged transition from one conative state to another. 
One’s ability to act in service of what one feels favorably toward then seems inessential to the underlying attitude towards it, given that the difference lies crucially in a certain judgment of the options available to one based on a perceptual appraisal of the current state of affairs, and not in a change in the attitude itself. This seems plausible considering that it is easier to conceive of one oscillating between wishing for and desiring something due to changes in belief states regarding one’s circumstances than due to an incessant swing from one conative state to a completely different one. Thus, just as there are reasons to distinguish fundamentally between wishes and desires, there are equally compelling reasons to treat them as being underlaid by the same fundamental state. This objection thus doesn’t quite seem to stand after all. If behavioral dispositions are neither sufficient nor necessary for one to have desires, then it seems that we would be unjustified in identifying having a desire with being disposed to act in a certain way. Rather, a disposition to act so that P that is caused by a desire for P — a wholly separate state, which we identify as constituting P as a reward — would be more consistent with the observations just canvassed, given that such a causal relationship entails neither sufficiency nor necessity. 
ADVANTAGES OVER THE HEDONIC THEORY OF DESIRE 
Feelings of pleasure seem closely tied to desires as well — perhaps then, as briefly suggested at the start of this paper, desiring P is being disposed to feel pleasure whenever P obtains. Such feelings of pleasure, which arise when we have desires satisfied, should be exceedingly familiar to us — when our thirst is quenched with ice-cold lemonade on a hot day, when we achieve the grade we wanted on an exam… the examples are multitudinous and widespread. 
However, just as I did in the previous section with the motivational theory of desire, I argue here that identifying these desires with emotional dispositions isn’t justified, because this identification entails that the associated emotional dispositions are both necessary and sufficient for one’s having the desire, and this is decidedly not the case. We will henceforth call this theory the hedonic theory of desire. First, consider some arguments against the necessity of emotional dispositions for desire. Clinical studies of drugs like chlorpromazine have shown that these drugs are capable of suppressing subjects’ impulses for action, so much so that they are no longer disposed to feel the emotional sensations associated with such tasks as eating, walking or talking. Strikingly, however, subjects still perform those acts. To account for this observation, proponents of the hedonic theory of desire would have to argue that the subjects’ desires are in fact suppressed, but that subjects still exhibit behavior because other motivational structures, like belief states about what is right or good to do, are still intact. However, this doesn’t seem to be the case, given that further clinical observation reveals that subjects do not demonstrate any notable ability for accomplishing such tasks as dieting, curbing addictive tendencies like smoking, or seeing long-term plans through — tasks which (appetitive) desires distract us from, and which should reasonably be easier to accomplish if those desires were eliminated. Thus it seems that a better explanation would be that despite the effects of the drug, subjects still retained their intrinsic desires, and it was those desires that motivated subjects (albeit to a lesser degree) to continue their behavior. What was suppressed, then, was instead the causal link between the satisfaction of their desires and the associated feelings. 
We thus don’t seem justified in reasoning that the disposition to feel pleasure is strictly necessary for one to have desires. Nor does it seem that the disposition to feel pleasure whenever P obtains is sufficient for one to desire P. Certain conditions like drunkenness, the influence of drugs, and being caffeinated can alter emotional dispositions without altering our desires in a fundamental way. Maria might have in her possession an antique vase that was passed down to her as a family heirloom, and she might value it very much; however, in a fit of drunkenness, she accidentally knocks it over and shatters it all over her living room floor, but because of her inebriated state she delights in this accident, and somehow finds amusement in the fact that something so precious has been violated beyond repair. This seems to be an entirely imaginable scenario; some of us might even remember having a similar experience. But the pleasure that Maria derives from this situation doesn’t quite amount to her having an intrinsic desire to break the vase, since we know she had the contrary desire to preserve it, and it doesn’t quite seem reasonable to suggest that she developed the former desire just by being drunk either. Thus, the disposition to feel pleasure at destroying a precious object, in Maria’s case, isn’t sufficient for her to have possessed a desire to destroy it. The fact that feelings of pleasure and displeasure do not always correspond to the status of our desire satisfaction suggests an alternative account of pleasure to the hedonic theory: that perhaps such feelings are a representation of desire satisfaction. Arpaly and Schroeder (2014) echo this view. While feelings of pleasure often covary with desire satisfaction in predictable ways, the failures of this covariation highlighted by the observations mentioned earlier seem to be instances of misrepresentation. 
An anomalous covariation between two phenomena, A and B, can be explained by the fact that A is a representation of B which has, in the given instance, erroneously represented B; in deviating from its predictable trend of covarying with B, A is thereby said to have misrepresented B. The other properties which pleasure sensations seem to exhibit are best explained by such a representation-based view. Consider how the satisfaction of the same exact desire could arouse varying degrees of pleasure — sometimes indifference, but other times pure exhilaration. Someone who has always slept in a comfortable bed might continue to desire sleeping in one, but when she does sleep in one, might take it for granted, and be nonchalant at the satisfaction of this desire; however, if she were deprived of this comfort for ten days straight, sleeping in her bed might become an immensely pleasurable experience. This pleasure might then gradually fade with each successive time she gets to sleep in this bed, until a state of indifference is reached once again. This observation is perhaps best explained not by a strict one-to-one correspondence identifying a particular feeling of pleasure with the satisfaction of a particular desire, but by taking pleasure to be a representation of the net change in desire satisfaction. Such a view can at once adequately explain the various atypical pleasure phenomena outlined thus far: the suppression of this representational capacity, erroneous representations, and feelings of pleasure corresponding more closely with a relative level of desire satisfaction than an absolute one. This concludes our evaluation of the hedonic theory of desire: it seems implausible because dispositions to feel pleasure aren’t necessary or sufficient for one to have desires, and furthermore, feelings of pleasure admit of a much more coherent explanation as representations of changes in levels of desire satisfaction, caused by those changes. 
We thus have compelling reason to reject the hedonic theory as an account of desire. The arguments outlined above help show that behavior and feelings of pleasure are best understood as characteristic effects of desire satisfaction, but are nonetheless inessential to what it means to desire something. Rather, it seems that there are compelling reasons to instead identify desiring something with constituting it as a reward: such a theory appeals strongly to our intuitions about desire, is supported robustly by neurological evidence, and is also conceptually coherent when we consider desire and reward as the same natural psychological kind with real causal powers. This sums up our defense of the reward theory of desire; in the next section, we will begin to see how we can apply this theory to the standard reinforcement learning paradigm that is commonplace in the computer science literature, to see if we can reasonably attribute desires to the machine. 
THE REINFORCEMENT LEARNING PARADIGM 
The reinforcement learning model, generally speaking, is a program governed by mathematical principles that is designed for the purpose of performing a certain task — whether it is winning a chess game, or helping a car autonomously stay within its lane. But before the program can perform this task proficiently, it must first learn to do so, by executing relevant actions within its virtual environment, and receiving feedback about how it has performed, feedback according to which it can then tweak itself so as to receive more positive feedback in future. Because of its activity within its environment, in the literature the model is often called an actor or an agent; and because the positive feedback it receives is supposed to help it become more proficient in its task by reinforcing the approach it has just taken, this feedback is often (perhaps rather loosely) termed a reward. 
It remains to be seen whether this notion of reward corresponds with the one we intuitively use in our everyday lives, and if so, whether the model’s pursuit of these rewards qualifies it as having a desire for them. For the purposes of this paper, we will investigate a class of such reinforcement learning programs called the actor-to-critic (A2C) model. It is explanatorily efficacious for us to use this model as an example because the putative actions and rewards involved are expressed more literally. The model comprises two agents which work together to perform an arbitrary reward-garnering task: the actor, and the critic. One such prototypical task is a game of CartPole (Barto, Sutton and Anderson, 1983), in which a pole is attached by an unactuated joint to a cart, which moves along a straight, frictionless track. The cart can only move forwards or backwards, and the model is tasked with deciding in which of the two directions to move the cart so as to prevent the pole from falling over, for as long a period of time as possible. In the context of this task, there are only two action options available to the actor: A1 and A2, corresponding to moving the cart forwards and backwards respectively. The actor uses a given policy to select one of these actions; the policy takes as input, among other things, a vector encoding information about the current state of the environment (e.g. the angle between the pole and the base of the cart), and a probability distribution across the available actions, which, at the start of the task, is initialized either randomly or uniformly across the two actions. To be clear, the policy is simply governed by mathematical principles which perform computations over the aforementioned inputs, and which then produce a decision about which action to take — this can be represented by a binary number, since there are only two available actions. 
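The selection step just described can be sketched in a few lines of Python. This is a minimal illustration rather than the API of any reinforcement learning library; the function name `select_action` and the contents of the state vector are assumptions made for this example.

```python
import random

A1, A2 = 0, 1  # the two action options: move the cart forwards / backwards

def select_action(state, action_probs):
    """Sample an action from the current probability distribution.

    `state` is the vector encoding the environment (pole angle, cart
    position, etc.); in a fuller model it would feed a parameterized
    policy, but in this sketch only the probabilities matter.
    """
    return A1 if random.random() < action_probs[A1] else A2

# At the start of the task, the distribution is initialized equally
# across the two actions:
action_probs = {A1: 0.5, A2: 0.5}
state = [0.0, 0.0, 0.05, 0.0]  # an illustrative, made-up state vector
action = select_action(state, action_probs)
```

With the uniform distribution above, the two directions are equally likely to be chosen; learning will later reshape these probabilities.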
For the purposes of this paper, exposition about these (and other) complex mathematical principles can be elided. One other important input to the actor's action-selection policy is an estimation of the reward that is associated with the execution of either action. Notice that it seems to make good sense to offer this information to the actor, given that in our everyday lives, when we deliberate between various options, we often consider and compare what we expect the benefit of taking each option to be, and given our intrinsic benefit-maximizing tendency this information crucially affects our decision. Within the model, this estimation is made by the critic, so termed because it provides a "critique" of how rewarding each action is anticipated to be. These estimations are denoted by V1 and V2, which correspond to the expected rewards of A1 and A2 respectively. The expected rewards act as inputs to the actor's action-selection policy by affecting the probabilities which the actions are distributed across, with the probability that a given action is selected being proportional to the magnitude of its expected reward. Again, this seems reasonable — we want to preferentially select the action that we expect to bring us the greatest reward. The computation of the expected rewards involves state information as well, but the complexities behind that computation shall also be elided here. The actor then performs the selected action, and the action event is actualized within the environment, thereby changing its state, and the vector which encodes it. The vector that results acts as feedback about how the actor has fared with that last action — perhaps it has allowed the pole to stand more uprightly, or perhaps it has caused the pole to tip even further. This feedback is computed and processed by the critic as a scalar, and denotes the actual reward, r, associated with the action.
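A minimal sketch of this machinery, assuming a softmax rule for turning the critic's estimates into selection probabilities and a fixed learning rate for the critic's adjustments; neither is fixed by the text, but both are common choices in the reinforcement learning literature. The update function anticipates the reward prediction error defined in the next paragraph.

```python
import math
import random

# Toy actor-critic sketch: the critic's estimates bias the actor's choice
# (higher expected reward, higher selection probability), and each estimate
# is adjusted once the actual reward r is observed. The softmax form and
# the learning rate alpha are assumptions, not specified by the text.

def action_probabilities(values):
    """Map expected rewards V1, V2, ... to selection probabilities (softmax)."""
    exps = [math.exp(v) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def select_action(values):
    """Sample an action index according to those probabilities."""
    threshold, cumulative = random.random(), 0.0
    for action, p in enumerate(action_probabilities(values)):
        cumulative += p
        if threshold < cumulative:
            return action
    return len(values) - 1

def critic_update(v_i, r, alpha=0.5):
    """Adjust V_i in the direction and proportion of the RPE, delta = r - V_i."""
    delta = r - v_i
    return v_i + alpha * delta, delta

# An underestimated reward (r > V_i) yields a positive RPE and raises V_i:
# critic_update(0.25, 1.0) returns (0.625, 0.75).
```

Note that an action with a larger expected reward is not chosen always, only preferentially; under the softmax assumption, `action_probabilities([1.0, 0.0])` still assigns the second action roughly a quarter of the probability mass.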
From this, the numerical difference between the expected reward, Vi, and the actual reward is calculated — this denotes a reward prediction error (RPE), signifying that Vi was either an over- or under-estimation of the reward for the most recently executed action; mathematically, δ = r - Vi, where δ denotes the RPE. The critic then uses the RPE to adjust Vi, in the direction given by the sign of the RPE and by a magnitude proportional to that of the RPE, so as to reduce future RPE values and provide more accurate predictions of reward in future. Again, this seems to align with the process we use to make everyday decisions — upon finding out that the sushi at a Japanese restaurant was not as good as expected, one might lower one's preference for going to that restaurant for dinner, so much so that one might choose an alternative place instead in future. The newly-adjusted Vi values thus serve as inputs for the actor's next action selection procedure, and the whole process repeats itself, until the task is completed. Such a model has been widely shown to optimally stabilize the pole atop the cart after a certain number of trials, with technical differences in the algorithm used creating variations in the average number of moves needed (see the OpenAI website https://gym.openai.com/envs/CartPole-v0/ for a catalogue of some of these algorithms). The model so described (albeit somewhat simplistically) is thus able to make use of the various internal quantities listed to solve the pole-balancing problem optimally; one of those internal quantities is described as being an expected reward, and another, an actual reward. At face value, this notion of reward seems somewhat abstract, and arguably departs significantly from the notion of rewards that figures in everyday life. The ensuing
sections will argue that because rewards are most essentially grounded in learning, as we have observed in the earlier sections, perhaps this intuition is somewhat misplaced. Importantly, however, we have highlighted earlier that under the RTD, for an organism to desire P it must be able to constitute P as a reward, and implied in such a notion of constituting is one of representing. According to the RTD, then, an organism must possess the capability to represent, and the essential features of rewards must figure in the content of its representation, for it to be able to have desires. Thus an argument that the A2C model has desires must first show that the model is capable of representing things as rewards. In the next section, I provide a plausible account of representational content, and show that the account does not preclude but in fact supports the claim that the model can represent rewards in a semantically significant way.

A PLAUSIBLE THEORY OF REPRESENTATIONAL CONTENT

What we want to outline here is a way of explaining what it is that gives representations their meaning. To represent states of affairs as rewards, one needs to be able to ascribe meaning to one's representations of those states of affairs — that is, a specific kind of meaning, that of being a reward, encapsulating the essential qualities constituting the concept we try to express when we communicate with the word "reward". One perhaps misguided way of explaining how this meaning is ascribed is the proposal that meaning is already inherent in the representation, and that there needs to be a mind capable of grasping that meaning — a kind of internal homunculus — which is thereby able to access that content in such a way as to guide action. This proposal seems to raise more questions than it answers, for it leaves us with the task of explaining exactly how the homunculus accomplishes this difficult feat of understanding.
Instead, it seems to make more sense that representations, as products manufactured by representing systems, have content in virtue of the way those representations are used by the system to guide behavior. The way the system behaves in response to those representations constitutes the representations as having meaning. This crucially grounds representing systems as participating in a self-directed process of obtaining and making use of information about the distal properties of their environments, for various purposes. Shea (2018) offers support for this view. He argues that content can be ascribed to a representation in virtue of the role the representation plays in the organism's performance of task functions: the kind of tasks undertaken by representing systems which can be understood as having particular functions for the system's survival, persistence, development, performance, or other things we might intuitively think of as paradigmatic goals for the organism. He offers an account of such task functions, by arguing that a system can be said to be performing a task function if (1) it produces outcomes (relevant to the task) robustly, (2) the robust production of these outcomes is due to a kind of stabilizing process, and (3) there are internal components which correlate with distal properties in a way that is relevant to achieving the outcomes, and the system actually exploits these correlations to achieve the outcomes. We will now address each requirement in succession, and see if the A2C model fulfills each one of them. If it does, then the model can be said to be performing a task function, and if Shea's theory is right, then the model can rightfully be said to be representing in a contentful manner. First, what does it mean to produce robust outcomes?
Shea argues that for an outcome to be generated robustly it must be produced in a manner that is responsive to relevant external conditions, such that the same outcome can be produced even when distal conditions differ. Shea defines the relevant kind of distal conditions more precisely, by stating that an output F of a system S is a robust outcome if (i) S produces F in response to a range of different inputs, and (ii) S produces F in a range of different relevant external conditions. He introduces this distinction by drawing on Ernest Nagel's characterization of goal-directedness (Nagel, 1977), in which the same outcome is produced in the pursuit of a goal despite variations in initial conditions, and despite perturbations which arise as the action is being executed, loosely corresponding to points (i) and (ii) respectively. To clarify this point, Shea offers a paradigm example of robust outcomes in motor control (Goodale et al. 1986; Schindler et al. 2004; Milner and Goodale 2006). The studies show that subjects begin with an initial trajectory towards the perceived position of a target they are tasked with reaching toward and touching; if the target is displaced while the hand is in motion, the initial trajectory is modified so that the target is reached anyway. This is made possible by an online mechanism in the brain which adjusts an action as it is executed, in a way that is sensitive, on the fly, to changes in external conditions. However, Shea also explains that in many cases these perturbations can be considered in effect to be new initial conditions which the organism takes into consideration, that is, as input, to devise next steps towards producing the outcome. The displaced position of the target while the hand is mid-flight can simply be understood as forming a new visual percept which guides hand movement.
Thus, the main idea is that these distal properties specify a rich set of circumstances which the system must internalize and navigate in order to produce the outcome; outcomes are produced robustly if they are produced despite variations in those circumstances. Does the A2C model produce any outcome robustly? One such outcome is a more upright pole after the action of moving the cart either left or right is performed, as compared to before the action was performed. A sufficiently trained model is capable of successfully balancing the pole atop the cart, and importantly, is able to do so whether, at any given point in time, the pole is tipping towards the left, or towards the right. Contrast this with a model that always chooses to move the cart to the right with maximal probability. It would be possible for such a model to consistently keep the pole upright, if the hinge on which the pole sits is designed to disallow any leftward tipping whatsoever, such that the pole always tips towards the right. However, it cannot produce the outcome robustly despite changes to distal conditions, since if a regular two-way hinge is used instead and the pole tips over to the left, a default rightward movement would cause the pole to tip over even further. Such a model would thus fail to satisfy the requirements of robust outcome production. Contrastingly, it seems that the A2C model does in fact robustly produce the outcome of making the pole more upright, given that it is sensitive to information about the state of its environment (namely the direction of the incline of the pole), and can move the cart in response to this information so as to prevent the pole from falling over, with consistent success. Its ability to produce the same outcome despite variations in distal circumstances thus qualifies it as producing that outcome robustly.
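The contrast drawn here between a state-sensitive policy and a fixed one can be made concrete with a toy one-step "physics" — a deliberate simplification, invented for illustration, in which moving the cart toward the side the pole tips to rights the pole, and moving away from it worsens the tilt:

```python
# Toy contrast of the two policies discussed above. tilt is -1 (pole tipping
# left) or +1 (tipping right); a move matching the tilt rights the pole.
# This one-step "physics" is an illustrative assumption, not CartPole dynamics.

def corrects_tilt(policy, tilt):
    """True if the chosen move reduces the tilt under the toy physics."""
    return policy(tilt) == tilt

reactive = lambda tilt: tilt    # sensitive to the state of the environment
always_right = lambda tilt: 1   # insensitive to distal conditions
```

The reactive policy succeeds under both distal conditions (`corrects_tilt(reactive, -1)` and `corrects_tilt(reactive, 1)` are both true), while the fixed policy fails whenever the pole tips left — the failure of robustness described above.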
The next feature of task functions is that the robust outcomes produced in service of this function are so produced not by mere accident, but by a kind of stabilizing process that shapes the consistent behavioral patterns which are relevant to producing the outcome. A mechanized ball that bounces every which way within a box might always eventually find its way out of the box through a hole, but despite the consistency in the production of that outcome, it is produced by mere accident and not by a process that actually stabilizes its production. One such stabilizing process is natural selection. An organism produces an output in a stabilized way because that output has brought certain benefits to the organism in its evolutionary past — perhaps it contributed to its ancestors' survival, or promoted their good health. These behavioral dispositions are thereby passed down between generations, as Darwinian evolutionary theory might suggest, because it was their contribution to the organism's survival that made the organism capable of conferring them hereditarily on its offspring in the first place. The behavioral disposition is thereby selected, in a manner familiar to natural selection theorists, such that it can shape behavior in successive generations in a directed, non-accidental way. Many advocates of teleosemantics propose an etiological account of functions, and argue that the function of a perceptual organ which produces representations is what the organ was selected to do, and not simply what it is disposed to do (Millikan, 1984; Neander, 1995). The representations so produced thus have content, because the content is instrumental in guiding various other tasks the organism has to perform, tasks which the organism is poised to perform only if the perceptual organ does what it was selected to do — that is, if it fulfills its biological function.
Neander, in particular, argues that frogs do produce visual representations of flies with determinate content, because their visual systems were selected to represent flies determinately, and the content of those representations guides their fly-consuming behavior in a directed way (Neander, 2017). According to these teleosemantic accounts, then, the content of a putative representation that is being used by an organism depends crucially on the way those representations were used in prior processes that were favored by natural selection. Natural selection, however, cannot explain the stabilization of all robustly produced outcomes which might qualify as task functions, because it pertains mainly to biological functions and depends crucially on the system having an evolutionary history. Shea himself doesn't think this covers all cases of task functions, and raises the paradigmatic example of Swampman to express this point. It seems that Swampman, a molecule-for-molecule duplicate of a human being that was created by some scientific miracle in a swamp, would, given time, behave predictably in response to various stimuli — swat at a bothersome fly, or reach for an apple hanging on a tree — in the same way that the human being he is a duplicate of would. However, teleosemantic theories predicated on conferring content on representations in virtue of what the relevant representing systems were naturally selected to do are committed to arguing that Swampman cannot in fact have contentful representations, because he, being born non-reproductively, has no such evolutionary history. Yet it seems that no explanans other than that Swampman is creating and using contentful representations can sufficiently explain the consistent pattern of behavior he is likely to engage in. Perhaps this explanation is unavailable at the very moment of Swampman's birth.
But it seems likely that once Swampman begins to interact with his environment, a process of learning, reminiscent of the one examined in earlier sections of this paper, will begin to stabilize his behavior in an important way. He may perhaps find himself in an apple orchard; picking an apple from a tree for the first time, Swampman may first simply handle it, perhaps fling it on the ground, but later discover that these red round objects represented in his visual field quell his hunger when eaten, and thus develop a disposition to reach toward such red round objects in his visual field and try to consume them whenever he is hungry. The predictability of his behavior from that point provides a good reason to believe that it is produced by some stabilizing process or other. In this case, it seems clear that the fact that his visual state encodes information about an apple guides the production of the robust outcome of reaching for the apple and consuming it in a stabilized, non-accidental way, without presupposing that any deep evolutionary history is required to confer content on those representations. This brings us to another kind of stabilization process that Shea focuses on: learning through feedback. Shea explains that this kind of learning proceeds in a number of steps. The organism first identifies a statistical pattern within the inputs it receives from its environment. Here Shea refers to the kind of correlation identification that is familiar from classical conditioning studies, where a general principle is recognized by the organism, but the principle isn't yet used to direct behavior. Then, the organism begins acting on the input patterns it has observed; the effects of its action are registered, acting as feedback for its behavior. If the effects are positive for the organism, a positive learning signal is produced, strengthening the connection between the organism's recognition of the pattern and the behavior it just performed.
Correspondingly, if the effects are negative for the organism, a negative learning signal is produced, weakening this same connection, or perhaps strengthening the connection between recognition of that pattern and avoidance of the behavior it just performed. Changes to this connection — however it is implemented, neurally or otherwise — thus contribute to the stabilization of behavior, since a stronger connection will more robustly produce the behavior in the organism whenever the input pattern is recognized. Notice here that when such behavioral dispositions are regulated by feedback, the production of the behavioral outcomes is stabilized in a way that is not contingent on the presence of any evolutionarily selected function. Perhaps the mechanism that initially makes certain associations between inputs could be described as an evolutionary function, but the associations that are made don't inform the organism about anything other than the fact of the association, and are thus very general; the learning effect of feedback is more specific, such that it can direct specific behavior. The fact that either seeking or avoidance behavior may be reinforced, depending on whether the effect on the organism is positive or negative, suggests that the organism needs to go beyond merely recognizing patterns in order to learn. This kind of learning also does not always involve making connections that are evolutionarily beneficial for the organism — for example, ravens will learn to perform tasks in exchange for shiny objects; human beings will learn to jump off cliffs into lakes because it is an enjoyable thing to do. But such learning does happen. Stabilization by feedback-based learning thus explains many of the robust behavioral outcomes we observe in and around us.
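The strengthening and weakening of connections that Shea describes can be sketched as a simple table of connection strengths. The dict-based "connection", the learning rate, and the example patterns below are all illustrative assumptions, not part of Shea's account:

```python
# Sketch of feedback-driven stabilization: a learned connection between a
# recognized input pattern and a behavior is strengthened by positive
# learning signals and weakened by negative ones.

connections = {}   # (pattern, behavior) -> connection strength

def learn(pattern, behavior, signal, lr=0.25):
    key = (pattern, behavior)
    connections[key] = connections.get(key, 0.0) + lr * signal

# Repeated positive feedback stabilizes the behavior for that pattern...
for _ in range(4):
    learn("red round object", "reach and eat", signal=+1.0)
# ...while negative feedback weakens the corresponding connection:
learn("wasp", "grab", signal=-1.0)
```

After this, the connection for the rewarded pattern has strength 1.0 while the punished one sits at -0.25: the stronger connection will more robustly produce the behavior whenever the pattern is recognized, which is the stabilization at issue.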
Swampman, too, seems to be capable of such learning, since despite not possessing the evolutionary function of consuming apples, the prior consumption of apples during his own lifetime and the internalization of the positive effects of apple consumption have allowed him to learn to consume apples robustly. Does the A2C model robustly produce outcomes of keeping the pole upright because of such a process of stabilization? Given that it is quite literally a model that alters inputs to its action-selection policy based on feedback, it's clear that this is the case. When the model executes a selected action Ai, based on the expected reward of that action Vi under the particular circumstances at the time the action was selected, an actual reward is reaped. The RPE, as the mathematical difference between the expected and actual rewards (to reiterate: δ = r - Vi), is a measure of how much the reward was over- or underestimated. If this difference is positive, this means that the model's Vi value was an underestimation of the reward for the action just selected; the model responds by increasing Vi proportionally. Given the way the Vi value is used by the actor in its action selection, an increase in Vi has the effect of increasing the actor's preferential bias for selecting the corresponding action, Ai. Thus the RPE value, when mathematically positive, constitutes a positive learning signal, since it strengthens the connection — the connection being, quite literally, the preferential bias the model has for that action under those circumstances — between the inputs to the model (what the state of the pole's incline was at the time of action selection) and the most recently performed action. A symmetric explanation exists for when the RPE value is negative and the model is found to have overestimated the reward for the action just selected.
This would constitute a negative learning signal, weakening the preferential bias the model has for that action under those circumstances. And of course, in the event that the actual and expected rewards are equal (that is, Vi = r), the RPE value is 0 and the learning signal that is produced is a neutral one. Assuming there is actually a pattern of behavior to be learnt (as opposed to similar inputs creating positive, negative or neutral signals randomly), which is entirely conceivable given the CartPole task, these signals do have the effect of stabilizing the action selection of the model — they cause the model to predictably choose to move the cart rightwards when the pole is tipping towards the right, and leftwards when the pole is tipping to the left. So it does seem that a sufficiently trained A2C model produces outcomes robustly, and does so through a stabilizing, non-accidental process. Now that we have ascertained that the model does behave in this manner, to account for representational content what is left to be explained is how the workings of internal components can produce behavior that is responsive to distal properties — that is, whether there is a representational capacity that facilitates these workings. Shea argues that there is one if internal components stand in relations to properties of the environment that are relevant for the execution of the task function, and if the system actually exploits those relations in that execution. He explains this notion of correlation in terms of probability: a system's internal component α being in a state F carries exploitable correlational information about a component of its environment β being in a state G if and only if there are regions D and D' such that if α is in D and β is in D', then either (i) ℙ(Gβ | Fα) > ℙ(Gβ) for a univocal reason, or (ii) ℙ(Gβ | Fα) < ℙ(Gβ) for a univocal reason.
Concretely, a correlation is exploitable if the conditional probability of the environment being in a state G, given that the system is in the state F, is greater (or less) than the absolute probability of the environment being in the state G, for a univocal reason. That is, the system's being in the state F reliably indicates the value of a distal variable. Shea qualifies his use of the term "region" as encompassing not only regions of a spatiotemporal kind, but also sets and other kinds of collections; we needn't explore the complexities behind that here. Additionally, Shea's proposal also requires that the correlational information encoded by the internal state indicate the value of its corresponding distal variable in an unmediated way, so as to address content-indeterminacy worries. He explains this notion with the example of a frog's fly-catching mechanism: the correlation between a frog's retinal-ganglion firing (R) and little black objects (condition C) is mediated by a further correlation between R and the fly's being a nutritious flying object (condition C'), since without the latter correlation, we wouldn't be able to explain how the former correlation enables frogs to achieve the evolutionarily beneficial outcome of consuming flies and capturing the nutrition they provide. It is important for us to target the distal property to which an internal state has a fundamental, explanatorily significant connection, because if there really is such a connection, then there needs to be a fact of the matter about exactly what the internal state is connected to; furthermore, for the correlation to qualify as content-constituting under Shea's theory, it must actually be exploited by the system, and so there must also be a fact of the matter about what information is actually doing the behavior-producing work.
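Shea's condition (i) can be illustrated empirically on toy data: the internal state F carries exploitable correlational information about the distal state G when ℙ(G | F) > ℙ(G). The sample data below are invented solely to make the inequality concrete:

```python
# Each pair records (internal state is F?, distal state is G?).
# These toy observations are invented for illustration.
samples = [
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True),
]

# Absolute probability of G across all observations.
p_g = sum(g for _, g in samples) / len(samples)

# Conditional probability of G given that the internal state is F.
f_samples = [g for f, g in samples if f]
p_g_given_f = sum(f_samples) / len(f_samples)
```

Here ℙ(G) = 3/6 = 0.5 but ℙ(G | F) = 2/3: being in F raises the probability that G obtains, so the F–G correlation satisfies condition (i) and is exploitable in Shea's sense.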
Shea's characterization of correlational information in terms of probability is plausible, given that a true correlation would in fact verify the inequalities stated in his theory. Can we apply this theory to the internal quantities that figure in the A2C model? Take the RPE signal, δ, as an example. It should be exceedingly apparent that the δ value correlates with the mathematical difference between the actual reward and the expected reward, since that is precisely how it is calculated. For any value θ that the RPE δ takes, the probability that the difference between the actual and expected reward equals θ (this is the distal state Gβ, in Shea's definition outlined above), conditioned on the fact that the RPE δ equals θ (internal state Fα), is maximal, at 1, and decidedly greater than (or equal to) the absolute probability that said difference is θ — that is, the probability that state Gβ obtains distally, independent of whether any other state obtains. In the context of the A2C model, the conditional probability denoted by ℙ(Gβ | Fα) is maximal simply because the RPE value is calculated directly from the difference between the actual and expected reward; condition (i) of Shea's definition is thus satisfied by default. Therefore, the RPE value does in fact correlate with this difference. However, whether the model actually exploits this correlation remains another question altogether. Given that the model operates so as to maximize the reward it receives (indeed, this is presupposed), the fact that the exploitation of the aforementioned correlation contributes to the model's success is a prima facie reason to believe said correlation is the one being exploited. As we already understand, the RPE signal plays a kind of imperative role within the model: its sign dictates the direction in which Vi is adjusted, and its magnitude dictates by how much.
Vi is decreased, and consequently the probability that the corresponding action Ai is selected at the next time-step is decreased as well, if the reward for executing Ai at the current time-step was less than estimated (and thus gave rise to a negative RPE); likewise, Vi and the probability that Ai is selected in future are increased if the reward received was larger than estimated (and gave rise to a positive RPE). This makes intuitive sense, since the model should increase the preferential bias towards the selected action if the reward received exceeded what was expected, given that it was precisely that expectation that grounded the preferential bias towards the selection of that action in the first place (and likewise decrease it if the reward fell short of what was expected). More importantly, the A2C model so designed has been shown, in the computer science literature, to be an effective strategy for such reward-garnering tasks (Sutton and Barto, 2018), but only on the condition that the reward received actually deviated from the expected reward as the RPE signal suggests. That is, the RPE signal is only beneficial to the system in virtue of its tight correlation with the difference between the actual and expected reward. This makes a convincing case that it is correlational information about this difference that the model exploits in order to improve its performance, and gather rewards more robustly. Could the model instead be exploiting the correlations the RPE might have with other distal properties? Consider the fact that the RPE also correlates with the objective, worldly expected reward for an action, in that it is a measure of the difference between the actual reward and the objective expected reward (in probabilistic terms, the product of the objective probability and magnitude of the reward), instead of a mere estimation of it that is made by the system.
In order to home in on the correlational information that is actually exploited by the system here, we have been looking to how the representation is processed downstream — that is, how it is causally involved in the system's task-performing procedure; this follows from Dretske's view (1988), and Carruthers (2009) also agrees that this gives us an adequate explanation of the content of the putative representations involved. Given this, it doesn't seem that a correlation between the RPE and the difference between the actual and objective expected reward offers a viable account of the function of the RPE signal within the model. This is because the RPE is supposed to provide a guideline for how to adjust Vi, but such a difference does not itself rationalize any such adjustment, since the objective expected reward, as a property of the world, is a measure of what the average payoff across a large number of trials would be (by the Law of Large Numbers), and the actual reward is unlikely to match this value in individual trials. In other words, an accurate Vi value does not preclude a disparity between the actual reward and the objective expected reward, and so this disparity should not warrant any adjustment of Vi whatsoever. Rather, it is evidence of inaccuracy in the system's Vi value itself that should warrant its adjustment, because action selection in future trials depends crucially on the preferential bias for selecting a given action that is given by this value. Additionally, it doesn't seem that the model's exploitation of the RPE's correlation with the difference between the actual and expected reward is mediated in any way by the aforementioned correlation it has with the difference between the actual and objective expected reward.
Updating Vi according to the error in predicting the reward for a given action that is encoded in the RPE provides an adequate explanation of the optimality of the model (Sutton and Barto, 2018), independent of the RPE's correlation with the difference between the actual and objective expected reward. The correlational information encoded by the RPE thus satisfies the requirement outlined in Shea's theory of being explanatory in an unmediated way. This shows that it is the RPE's correlation with the difference between the actual and internally-represented expected reward that is exploited by the model.

A2C AS A REPRESENTATIONAL REWARD-GATHERER

In sum, it seems that there is a plausible explanation for how the A2C model satisfies each of the three main requirements of a system's having contentful representations in support of the performance of task functions: the production of robust outcomes, the utilization of processes which have stabilized the outcome production, and the exploitation of (unmediated, explanatory) correlational information in the production of those outcomes. If Shea's teleosemantics is right, and it seems plausible that it is, then the A2C model does perform genuine task functions — in the case of the CartPole environment, the task function of keeping the pole standing upright on the cart — and it makes use of representations in service of this task function which genuinely have content. It would be useful here to revisit some of the representations that are purportedly used by the A2C model to produce the outcome of keeping the pole upright. One putative representation we have not yet examined in detail is Vi, the model's estimation of the expected reward of executing action Ai. Why is this representation contentful?
We find this in the way it is used by the model: as quantifying the expected reward in a probabilistic sense — that is, as the product of the probability of receiving the payoff and its magnitude, or more qualitatively, what the average payoff would be if Ai were repeatedly selected over the course of a large number of trials. Given that the mathematical principles governing the use of Vi within the actor's action-selection policy interpret Vi as an estimate of the average payoff, the quantity denoted by Vi thus has content in a semantically significant way. The fact that this action-selection policy contributes to the success of the model also suggests that Vi really does have that meaning. Moreover, insofar as Vi is an estimation, it is also a representation of this average payoff, since here, just as it is relevant to invoke veridicality conditions when reasoning about representations, we can do the same for Vi: if Vi were veridical, then the average payoff would actually be Vi if Ai were selected over multiple trials. Another putative representation we have already examined in some detail is the RPE signal itself. Again we look to how this representation figures in the model's operation in order to answer the question of whether it has content. The way that the quantity denoted by δ is used by the model presupposes that δ does have some semantic content — Shea himself endorses this view (2012), and classifies this content as being indicative and imperative. The model presupposes that δ has indicative content in that its sign indicates that the actual reward received was either higher or lower than Vi; we can attribute a veridicality condition to this content, since the sign can be said to be (in)accurate in indicating the direction of the disparity between Vi and the reward.
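The veridicality condition for Vi described above can be illustrated with a toy simulation. Suppose, purely for illustration, that selecting Ai pays 1.0 with probability 0.7 (so the objective expected reward is 0.7); a veridical Vi of 0.7 then matches the long-run average payoff:

```python
import random

# Toy illustration of Vi as an estimate of average payoff. The payoff
# magnitude (1.0) and probability (0.7) are invented for this example.
random.seed(0)   # fixed seed for reproducibility

trials = [1.0 if random.random() < 0.7 else 0.0 for _ in range(10_000)]
average_payoff = sum(trials) / len(trials)

v_i = 0.7   # a veridical estimate: it matches the true expected reward
```

Over many trials, `average_payoff` comes out close to `v_i` (up to sampling noise), which is what it means for the estimate to be veridical; an inaccurate Vi would systematically diverge from this long-run average.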
Furthermore, it presupposes imperative content in ẟ, since its magnitude connotes an instruction to adjust Vi proportionally; here we attribute a satisfaction condition to this content, since it describes what an appropriate adjustment of Vi in response to this magnitude would be. The RPE thus seems to have the qualities that representations in general are widely believed to possess. The success of its exploitation by the model is predicated on the accuracy of its correlation with the degree to which the expected reward was under- or overestimated. The specific usage of ẟ within the model thereby confers on it the status of a contentful representation. So the A2C model performs task functions, and makes use of contentful representations to do so. It may have a representational capacity, and thus may be capable of representing certain things as rewards, but it remains a further question whether or not it actually does so. Perhaps the most suitable candidate for a representation of reward is the quantity that we denote as r. Why does r represent what it denotes as a reward? As the earlier sections of this paper have shown, what most crucially grounds a representatum as a reward is its contribution to the production of a learning signal within the representing system. But not just any kind of learning signal — rewards contribute to the production of a positive learning signal, one which has the potential to, among other things, cause pleasure in hedonic systems, and create behavioral patterns directed at positioning the organism to gather those rewards more effectively in the future. To be clear, this notion of positive does not entail a quantity's being mathematically greater than zero, nor does it pertain to the system's "likes" or "dislikes", or even to the quality of being beneficial.
It is meant to be a more abstract notion, expressing the learning signal's constructive effect on the system's disposition to seek out that which was responsible for producing the signal. Now, it seems that the quantity r has exactly that effect on the system: the very specific way in which it participates in the mathematical difference that the RPE consists in causes the system to alter its dispositions towards action selection in favor of more effective reward-gathering. To reiterate, the model interprets a (mathematically) positive RPE as a signal to increase Vi; this makes it more likely that the same action is chosen, and that the hitherto greater-than-expected reward is reaped again under similar circumstances, only if r precedes Vi in the difference calculation. If instead Vi preceded r in the calculation, and the system increased Vi in virtue of a positive result, then this would only tend to increase the degree to which Vi outstrips r, and would thus, in effect, encourage the model to select actions for which the reward is likely to be less than expected, thereby compromising its ability to reap rewards optimally. This effect is even clearer if a mathematically negative reward is reaped — a positive RPE where Vi precedes r in the difference calculation would encourage the model to literally reduce the total reward it has collected (given that total reward is defined as the sum of the individual rewards gathered at each action-execution step). Naturally, a symmetric explanation exists for the system's interpretation of a (mathematically) negative RPE. Thus the mathematical principles governing the use of the actual reward value, r, rightfully qualify the model as constituting r as a reward, that is, as something contributing to the production of a positive learning signal within the model. I have argued that this is the best way of characterizing a reward, and those arguments support a realist perspective on the A2C model's representation of reward.
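The operand-order point can be checked numerically. The following hypothetical sketch (again, not the paper's A2C code) applies the same rule, "increase the estimate in proportion to a positive RPE", under both orderings of the difference.

```python
# Hypothetical illustration of the operand-order argument: the model
# increases its estimate V on a positive RPE. Only when the RPE is
# computed as r - V does that rule pull V toward the actual reward;
# computing it as V - r makes the very same rule push V away from it.

def train(delta_of, true_reward=1.0, lr=0.1, steps=200):
    V = 0.0
    for _ in range(steps):
        V += lr * delta_of(V, true_reward)  # adjust V in proportion to the RPE
    return V

V_correct = train(lambda V, r: r - V)   # r precedes V: converges to r
V_reversed = train(lambda V, r: V - r)  # V precedes r: diverges

print(abs(V_correct - 1.0) < 1e-6)  # True: the estimate tracks the reward
print(abs(V_reversed) > 1e6)        # True: the estimate runs away
```

With the reversed ordering, each update compounds the disparity instead of correcting it, which is exactly why the model's reward-gathering would be compromised under that convention.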
The arguments canvassed thus far support the thesis that reinforcement learning models such as A2C can be regarded as possessing desires, because, as the preceding argument plausibly shows, they are capable of constituting things as rewards. To reiterate, there is a convincing case for identifying a desire for P with the constituting of P as a reward, because theorizing as such is conceptually sound, and the causal profile of rewards genuinely aligns with that of desires. Now, constituting something as a reward depends fundamentally on what it means for something to be a reward to a system. We have also seen compelling reasons to identify a reward as that which contributes to the production of a learning signal within that system; the learning signal is what grounds the formation of the other effects, like changes to behavior or emotional dispositions, phenomena that are often mistakenly identified with rewards when they should instead be explained by them. Putting these two arguments together, we have a complete reward-based theory of desire; and having affirmed that reinforcement learning models have the representational powers the theory requires, and that they do in fact use those powers to represent things as rewards, it seems that these models fit the reward-based theory of desire quite well. Perhaps, then, machines can have desires after all.

CONCLUDING REMARKS

There is an important connection between desiring something and viewing something as rewarding that plausibly qualifies as identity. The following is perhaps a commonly heard line of conversation: "Little Anwar seems to want to go to Disneyland particularly badly; he asks about it all the time." "Ah yes — we take him there after he finishes his exams at the end of every school year.
For him, it's a well-deserved reward for his hard work." For Anwar, and perhaps for the rest of us, desiring a trip to Disneyland is synonymous with treating such a trip as a reward. That doesn't mean that such psychological phenomena as being disposed to act so as to procure something we desire, or being disposed to feel pleasure when we do procure it, are not indicative of desires being at work. None of what has been said in this paper is aimed at refuting this commonsensical view. What this paper has hoped to show, rather, is that despite being characteristically associated with desire, these dispositions are decidedly inessential to its nature. It is reward that underlies its essence, because reward, as a natural kind, instantiates a causal profile strikingly similar to that of desire as a natural kind, and bears properties that provide the deepest explanatory account of the very dispositions we deem characteristic of desires. These arguments provide compelling reasons for the view that desiring something and constituting it as a reward are one and the same. A reward theory of desire doesn't immediately preclude such systems as machine learning models from being capable of having desires, because it doesn't presuppose the necessity of consciousness from the outset, as other theories of desire, such as a hedonic theory of desire, might do. All it requires for a system to have a desire for something is the capability to representationally constitute that thing as a reward — that is, to represent it as contributing to a characteristic reward learning signal. We have seen that it is entirely plausible that the reinforcement learning paradigm, exemplified by the A2C model, has such a capability, and can thus reasonably be said to have desires — if the reward theory of desire is right.
Dominant theories of desire condemn such systems to wholly desire-less lives; the arguments canvassed in this paper should, at the very least, convince us to question the intuitions on which these views are founded.

BIBLIOGRAPHY

American Cancer Society. (2015). Placebo effect. https://www.cancer.org/treatment/treatments-and-side-effects/clinical-trials/placebo-effect.html

Arpaly, N., & Schroeder, T. (2014). In praise of desire. Oxford University Press.

Bao, S., Chan, V. T., & Merzenich, M. M. (2001). Cortical remodelling induced by activity of ventral tegmental dopamine neurons. Nature, 412(6842), 79.

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5), 834-846.

Berridge, K. C. (2003). Pleasures of the brain. Brain and Cognition, 52(1), 106-128.

Berridge, K. C., & Robinson, T. E. (1998). What is the role of dopamine in reward: Hedonic impact, reward learning, or incentive salience? Brain Research Reviews, 28(3), 309-369.

Berridge, K. C., & Valenstein, E. S. (1991). What psychological process mediates feeding evoked by electrical stimulation of the lateral hypothalamus? Behavioral Neuroscience, 105(1), 3.

Breland, K., & Breland, M. (1961). The misbehavior of organisms. American Psychologist, 16(11), 681.

Carruthers, P. (2009). How we know our own minds: The relationship between mindreading and metacognition. Behavioral and Brain Sciences, 32, 164-182.

Dretske, F. I. (1991). Explaining behavior: Reasons in a world of causes. MIT Press.

Egelman, D. M., Person, C., & Montague, P. R. (1998). A computational role for dopamine delivery in human decision-making. Journal of Cognitive Neuroscience, 10(5), 623-630.

Goodale, M. A., Pelisson, D., & Prablanc, C. (1986). Large adjustments in visually guided reaching do not depend on vision of the hand or perception of target displacement. Nature, 320(6064), 748.

Heath, R. G. (1963). Electrical self-stimulation of the brain in man. American Journal of Psychiatry, 120(6), 571-577.

Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (Eds.). (2000). Principles of neural science (Vol. 4, pp. 1227-1246). New York: McGraw-Hill.

Knowlton, B. J., Mangels, J. A., & Squire, L. R. (1996). A neostriatal habit learning system in humans. Science, 273(5280), 1399-1402.

Kripke, S. A. (1972). Naming and necessity. In Semantics of natural language (pp. 253-355). Springer, Dordrecht.

Lamb, R. J., Preston, K. L., Schindler, C. W., Meisch, R. A., Davis, F., Katz, J. L., ... & Goldberg, S. R. (1991). The reinforcing and subjective effects of morphine in post-addicts: A dose-response study. Journal of Pharmacology and Experimental Therapeutics, 259(3), 1165-1173.

Langston, J. W., & Palfreman, J. (1995). The case of the frozen addicts. New York: Pantheon Books.

Liechti, M. E., Saur, M. R., Gamma, A., Hell, D., & Vollenweider, F. X. (2000). Psychological and physiological effects of MDMA ("Ecstasy") after pretreatment with the 5-HT2 antagonist ketanserin in healthy humans. Neuropsychopharmacology, 23(4), 396.

Liechti, M. E., & Vollenweider, F. X. (2000). Acute psychological and physiological effects of MDMA ("Ecstasy") after haloperidol pretreatment in healthy humans. European Neuropsychopharmacology, 10(4), 289-295.

Millikan, R. G. (1984). Language, thought, and other biological categories: New foundations for realism. MIT Press.

Milner, D., & Goodale, M. (2006). The visual brain in action. Oxford University Press.

Nagel, E. (1977). Goal-directed processes in biology. The Journal of Philosophy, 74(5), 261-279.

Neander, K. (1995). Misrepresenting & malfunctioning. Philosophical Studies, 79(2), 109-141.

Neander, K. (2017). A mark of the mental: In defense of informational teleosemantics. MIT Press.

Putnam, H. (1975). The meaning of 'meaning'. Philosophical Papers, 2.

Schindler, I., Rice, N. J., McIntosh, R. D., Rossetti, Y., Vighetto, A., & Milner, A. D. (2004). Automatic avoidance of obstacles is a dorsal stream function: Evidence from optic ataxia. Nature Neuroscience, 7(7), 779.

Schroeder, T. (2004). Three faces of desire. Oxford University Press.

Shea, N. (2012). Reward prediction error signals are meta-representational. Noûs, 48(2), 314.

Shea, N. (2018). Representation in cognitive science. Oxford University Press.

Skinner, B. F. (1948). 'Superstition' in the pigeon. Journal of Experimental Psychology, 38(2), 168-172.

Steiner, J. E., Glaser, D., Hawilo, M. E., & Berridge, K. C. (2001). Comparative expression of hedonic impact: Affective reactions to taste by human infants and other primates. Neuroscience & Biobehavioral Reviews, 25(1), 53-74.

Stellar, J. (2012). The neurobiology of motivation and reward. Springer Science & Business Media.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.