Can machines have desires?

Juanda Tan, Brown University

INTRODUCTION

This paper aims to bring us closer to answering the question of whether a machine could plausibly possess desires, by making the claim that the many (very intuitive) assumptions we have about desires being related to dispositions to behave or feel in particular ways perhaps do not penetrate what it fundamentally means to desire. Rather, it is the fundamental relation desires have with rewards, and that which rewards have with the production of learning signals, that provides the best characterization of desires — one which aligns well with our conceptual notions of desire and, strikingly, with various neurological findings. Under the reward-based theory of desire, it seems far less counterintuitive for machines or computer programs to have desires, since consciousness, or even the capacity for bodily movement, is no longer presupposed. In the following section, I outline what it means for a system to constitute something as a reward, a notion on which the reward-based theory of desire fundamentally depends. Next, I lay out the reward theory of desire, and explain the various advantages it has over competing theories. Following that, I provide an account of a reinforcement learning paradigm that is prototypical within the computer science literature — the actor-critic model — and explain why, according to a plausible theory of teleosemantics, it is capable of representing things as rewards in a meaningful way. Finally, I conclude that because such models qualify as constituting things as rewards, they can plausibly be said to have desires.

A PLAUSIBLE THEORY OF DESIRE

In the contemporary literature surrounding desire, there seem to be three dispositional states which are closely related to desire. Concretely, the first of these dispositional states is defined by one’s behavior, or one’s motivation to behave — that is, one desires P if one has the disposition to act so as to bring about P.
The second is defined by the feeling one might get when a desire is satisfied, perhaps rightfully termed the experience or phenomenology of desire — that is, one desires P if one has the disposition to feel pleasure whenever P. The third, and perhaps less commonly addressed, dispositional state is defined by one’s disposition to constitute something as a reward — that is, one desires P if one is disposed to constitute P as a reward. This separation into what Schroeder (2004) might term the “three faces of desire” might seem redundant to some readers, given that the three aforementioned readings of what a desire is seem to express overlapping dispositional states within us — something is a reward to us because it gives us pleasure, we behave in pursuit of what we consider rewards, and so on. This is but an indication of how close the relation is between these dispositional states and what we consider a desire. One goal of this paper is to show that making this distinction makes conceptual sense, and is also supported by empirical data. But more importantly, the chief purpose of this paper is to support the view that the third face of desire is the one that bears a constitutive relation to desire, while the other two bear merely causal ones, and that in laying out a theory of desire, we should demand that desire be rightfully defined only by what constitutes it as such. To this end, this section will first define what a reward is and how it figures in desire. Now, what is a reward? First, note that there is an important sense in which whether or not something is a reward to one depends crucially on whether one constitutes it as such. In determining whether P is a reward to one, it matters whether one takes a positive stance towards P, since it seems that to be rewarded when P obtains is to have a certain kind of favorable response to the state of affairs in which P obtains.
Thus, P’s status as a reward depends constitutively on one’s subjective perspective on P. Again, this is a mark of how closely regarding P as a reward is associated with desiring P, given that desiredness is also, crucially, a perspective-dependent property of things. So, rather than speak of something as being a reward, it is clearer and more efficacious for our purposes to speak of constituting something as a reward, since subjectivity is implied in the latter treatment. What, then, is meant by constituting something as a reward? Perhaps rewards are best understood in terms of their causal powers. After all, we are very familiar with such colloquialisms as “It sure feels good to finally be rewarded,” or “Wai Shing is only working so hard because he is trying to be Employee of the Month.” It appears that when an agent constitutes P as a reward, she has three relevant dispositions — namely, dispositions towards behavior, pleasure, and the production of a learning signal. First, the agent is so disposed that if P follows upon behavior B, her disposition to produce B in similar circumstances is strengthened. Second, the agent is disposed to feel pleasure whenever P obtains. And third, the agent is so disposed that if she represents P as obtaining, a certain sort of learning signal is produced. The first two dispositions are part of the common sense picture of reward, and if they seem to bear an uncanny resemblance to the dispositions caused by desire itself, it is only because of the close connection desire has with reward. The third disposition, though perhaps lesser known, is a major component of the emerging scientific picture surrounding reward.
Following Schroeder, I will argue in the immediately ensuing sections that the first two dispositions are less fundamental to reward than the third, given that there are situations in which an agent is (paradigmatically) rewarded but does not experience pleasure, or a strengthening of her behavioral dispositions; yet the release of a characteristic learning signal seems to occur in all such situations. Moreover, the first two dispositions can be causally explained by the third. Following this, I will provide an argument drawing from those of Kripke (1980) and Putnam (1975) in support of the claim that it is the property that rewards have to cause this third disposition, the creation of a learning signal, that should rightfully be considered essential and constitutive of reward, given that it is this property that provides the deepest explanatory account of the phenomenon of reward.

REWARD AS REWARD-GATHERING BEHAVIOR

Now, make no mistake — reward and behavior are closely intertwined, in human and non-human animals alike. A child receiving a gold star for helping to clean out the classroom may help to take out the trash with the hope of receiving such a reward again. A dog learns new tricks — that is, repeats a certain pattern of behavior upon a particular command — because it is rewarded with a treat every time it demonstrates that behavior on command. Even irrelevant behavior that is typically not praiseworthy or beneficial, and thus ought not to induce the receipt of a reward under ordinary circumstances, can be made to be repeated consistently by an organism, simply in virtue of its being followed, chronologically, by the reward. Consider the classic study on operant conditioning (Skinner, 1948) in which caged pigeons were given food at regular time intervals regardless of their behavior.
After a certain number of trials, the pigeons started developing determinate patterns of behavior which wouldn’t typically be observed in pigeons under non-experimental circumstances — one repeatedly walked in circles around the cage, another continuously thrust its head into a corner of the cage, and so on. These behavioral patterns were observed to arise only because they temporally preceded the release of food into the pigeon’s cage; in other words, even behavior that was merely spontaneous and wholly unrelated to the reward-gathering efforts of the pigeon (at least, initially) can be induced to be repeated by its being rewarded. Thus it is fairly apparent that the causal connection between reward and behavior runs deep. Nonetheless, this doesn’t seem to provide good reason for arguing that the strengthening of certain behavioral dispositions is essential to or constitutive of what makes something a reward. For if it were so, rewards should necessarily cause the strengthening of behavioral dispositions, but there seem to be conceivable situations where this is not the case. Consider Strawson’s Weather Watchers (1994), hypothetical creatures which lack the ability to act, but can justifiably be said to constitute certain things as rewards. Despite not being capable of any bodily movement whatsoever, these creatures enjoy rich emotional lives, and develop emotional responses to various weather patterns: they feel hope and optimism on a sunny day, and despair on a cloudy one; the prospect of light showers might fill them with delight, but a heavy thunderstorm might cause sadness, even strike fear in them. This seems to align with how we respond emotionally to paradigmatic rewards and punishments respectively — we too delight in things we consider rewarding, and despair in things we consider punishing.
Thus it seems intuitive to say, then, that these Weather Watchers constitute the onset of sunny weather and light drizzles as rewards, and don’t constitute cloudy skies and torrential rain as such. However, it wouldn’t be reasonable to argue that for a Weather Watcher, some disposition to behave in a certain way will strengthen on a sunny day, given that one needs to be able to act to have a disposition to so act in the first place, and Weather Watchers possess no such ability. To be clear, the claim that the possession of a disposition to act necessitates an ability to so act is trivially true — if one instead argues that one can be disposed to act in a certain way without being able to so act, then one must concede that there are countless preposterous things humans could be disposed to do, such as photosynthesizing, levitating, walking through walls, and so on. Thus, Weather Watchers cannot be said to possess any disposition to act, and therefore do not and indeed cannot experience the strengthening of any such disposition, despite being able to constitute certain things as rewards. It would thus be explanatorily insignificant, perhaps even wholly irrelevant, to account for the fact that the Weather Watchers constitute certain things as rewards with an explanation that some behavioral disposition in them was strengthened. This provides a good example of how it is not necessary for there to be a strengthening of certain behavioral dispositions for an agent to constitute something as a reward. If this is the case, then the causal power that rewards have to strengthen behavioral dispositions is not essential to or constitutive of what it means for something to be a reward after all.

REWARD AS PLEASURE-INDUCING

It seems plausible, then, that the causal property of reward to induce changes to behavioral dispositions doesn’t qualify as being constitutive of reward. Now, what about feelings of pleasure?
Reward’s relationship with pleasure, like its relationship with behavior, is a close and undeniable one. We feel happy when we’re rewarded; we look forward to being rewarded because of the pleasure it will bring; we prioritize certain things we consider rewards over others because they provide a greater degree of pleasure; and so on. However, not unlike the causal property of reward to induce changes to behavioral dispositions, its causal property to induce pleasure also does not seem essential to what it means to be a reward, because in certain situations one could demonstrably constitute something as a reward, yet have no such disposition to feel pleasure upon receiving it. Consider a study involving hospitalized heroin addicts (Lamb et al., 1991), in which subjects were given unknown doses of morphine from Monday to Friday of a given week. Although the dosage for that week was not disclosed to the subjects, it would have become apparent to them by the Wednesday of that week, and if subjects desired continued doses for the coming Thursday and Friday, they had to press a lever three thousand times within forty-five minutes to be granted them. The findings revealed that subjects proceeded to complete the task for as low a dose as 3.75 milligrams — which, importantly, subjects subjectively did not derive any pleasure from, and rated as worthless, in a separate study — but would not do so for doses of saline. It seems, however, that subjects did subjectively regard whatever dose of morphine they got as a reward despite not deriving any pleasure from it whatsoever, given that the doses seem to exhibit the same motivational profile that paradigmatic rewards would. Why else would subjects proceed to complete such an inane and tedious task, designed only to return them something under the specific conditions of the experiment?
Therefore it seems plausible that there are states of affairs which we consider rewards but don’t derive pleasure from, implying that the disposition to feel pleasure isn’t necessary for one to be able to constitute something as a reward. Could the study’s findings admit of another explanation: that under the specific conditions of Lamb’s experiments, the mere fact of being given morphine, whatever the dosage, was enough for subjects to feel pleasure and thereby constitute it as a reward? This is perhaps plausible given that each dosage wasn’t precisely disclosed to the subjects, such that it perhaps becomes easy for subjects to deceive themselves into thinking that they were given effective dosages of morphine, even if some of them were medicinally indistinguishable from saline. After all, the “placebo effect” is a well-known phenomenon — the mere fact that patients were given pills they thought to be medicine, even if the pills contained only sugar and had no medicinal properties whatsoever, was enough for them to subjectively feel that they were getting better (American Cancer Society, 2015). However, this effect doesn’t quite seem to be at work here, given that subjects weren’t motivated to complete the lever-pressing task for morphine dosages under 3.75 milligrams, though they should have been if the mere fact that they were given morphine was all it took for them to constitute it as a reward. The discovery of a lower limit to the dosage required for subjects to reproduce the behavior in question suggests that subjects were in fact discriminating between the dosages that were worth their time and effort and those that were not, and that they could in fact discern those quantities to some reliable degree. Thus it seems plausible that subjects constituted such ineffective dosages of morphine as 3.75 milligrams as rewards, despite not feeling any associated pleasure as a result.
Other studies also suggest that a disposition to feel pleasure isn’t necessary for one to constitute things as rewards. In a study where feeding behavior was elicited in rats via electrical stimulation to the lateral hypothalamus (Berridge and Valenstein, 1991), although rats consistently behaved so as to procure and consume food, they did not demonstrate the facial expressions or mannerisms that are indicative of a positive affective reaction, which include a rhythmic midline protrusion of the tongue, a nonrhythmic lateral protrusion of the tongue, and the licking of paws. These reactions are widely believed to be characteristic of a pleasure sensation in rats, because they have underlying parameters, like a certain allometric rhythm, that are identical to those which underlie the behavioral affective reactions demonstrated by humans and primates, suggesting that these demonstrations of positive affect may be homologous or evolutionarily derived (Steiner et al., 2001). Given that pleasure is not the motivating force behind the rats’ feeding behavior, some other non-hedonic psychological mechanism must be at play to drive behavior here; Berridge (2003) argues that this mechanism is governed by a kind of incentive salience, an internal psychological process that functions to make certain stimuli more attractive to the agent, in a way that is independent of whether the stimulus delivers a pleasure sensation or not. Incentives that are salient to an agent thus motivate incentive-seeking behavior in the agent. Such a process underlies what we might typically call ‘wanting’, a psychological state Berridge characterizes as strictly non-hedonic, and thus fundamentally different from that of ‘liking’, which is grounded in pleasurable utility. The rats wanted the food, but did not like it.
Strikingly, it is believed that it is the incentive salience governing ‘wanting’ that contributes to the psychological mechanism of learning through motivation by reward (Berridge and Robinson, 1998), suggesting yet again that rewards are perhaps not best characterized by their putative ability to induce pleasure. Therefore, it seems very plausible that although the rats viewed the food as rewarding and consistently sought to consume it, they did not in fact derive any feeling of pleasure from it whatsoever. In sum, these studies show that even when agents don’t possess a disposition to feel pleasure in response to a given stimulus, it is still plausible that they constitute the stimulus as a reward, given that they responded to the onset of the stimulus in a manner that aligns very closely with the ways in which we typically respond to paradigmatic rewards — namely, with repeated behavior that is directed at procuring the reward. These findings thus strongly suggest that it is not necessary for an agent to have a disposition to feel pleasure whenever P obtains in order for her to be able to constitute P as a reward, and thus that dispositions towards pleasure are not essential to or constitutive of what it means for something to be a reward.

REWARD AS THE CREATION OF A LEARNING SIGNAL

This leaves us with the connection between reward and the creation of some sort of learning signal. To begin, perhaps the most important reason which justifies our identifying reward as something which causes the production of a learning signal is the neurology underlying what empirical studies have decisively found to be the reward system in human and non-human animals alike.
These neural structures are known as the ventral tegmental area (VTA) and the substantia nigra pars compacta (SNpc), and are responsible for the release of dopamine to other areas of the brain, thereby reorganizing various neural connections and inducing feelings of pleasure, changes to behavioral dispositions, and even changes to sensory capacities. Crucially, it is the VTA and SNpc’s connection to multiple brain areas underlying memory, sensory and motor capacities, as well as their control of the release of dopamine to those areas in response to stimuli, which gives them a functional profile that very closely corresponds to that of a system which actualizes the various causal powers that, as we have seen, are typically associated with rewards. To be clear, the argument here is not that such structures are strictly necessary for one to constitute something as a reward. That would unnecessarily preclude organisms (actual and hypothetical) which do not possess the specific neural machinery from being able to be rewarded, and we have no reason to do so, since the general state of constituting something as a reward ought to be multiply realizable across many different neurologies. Rather, an understanding of the way in which rewards are instantiated in us and many other animals helps us nail down the way rewards work in general, thereby serving our purposes of finding a broader conceptual, extra-neural theory of reward. Importantly, the point is that it is by the creation of what we have called a learning signal — regardless of how this signal is implemented, neurally or otherwise — that we define what it means to constitute something as a reward. Now, what does the release of dopamine have to do with rewards in the first place?
Dopamine-releasing neurons in the VTA and SNpc exhibit firing activity that correlates strongly with an organism’s receipt of (paradigmatic) rewards: although they have baseline activity that is fairly low and stable, this activity spikes and briefly becomes more rapid when, for example, the organism is hungry and has just received a tasty treat. This correlation is a prima facie reason to believe that these neurons, and the dopamine they release, are somehow involved in the various causal relationships that rewards have. Moreover, it seems that when the dopamine receptors in the brain are blocked, the neurological changes that seem to be caused by paradigmatic rewards no longer occur. A study by Bao, Chan and Merzenich (2001) involving stimulation to the auditory cortices and VTA/SNpc of rats provides evidence for this claim. When the rats were first exposed to sound at a given frequency, and then subsequently received VTA/SNpc stimulation, the neural connections in their primary and higher auditory cortices were found to have been reorganized such that the rats were better able to pick up and process sounds of the same frequency. This suggests that VTA/SNpc stimulation, and the consequent release of dopamine to the auditory regions of the brain that were stimulated prior, was sufficient to create in the rat a propensity to be auditorily stimulated in the very same way. This indicates that the activity of dopaminergic neurons in the VTA/SNpc is not merely correlated with the receipt of paradigmatic rewards, as is widely observed, but in fact causes what paradigmatic rewards themselves cause. Namely, VTA/SNpc activity causes particular changes within the organism that seem to implement a certain process of learning — learning that, in this case, takes the form of learning to better perceive and process certain auditory tones.
On the other hand, when dopamine receptors in the aforementioned auditory cortices were blocked by the experimenters, no change in the rats’ cortical areas was observed, and thus no such learning occurred. This strongly suggests that it is in response to the chemical, dopamine, that these brain areas undergo the changes that they do, and therefore that dopamine is what facilitates the causal relationship between paradigmatic rewards and learning. If it is true that dopamine induces neurological change in the way rewards might, then the VTA and SNpc, which control the release of dopamine, are responsible for setting this change in motion. But perhaps it would aid the discussion if we examine further why such changes within the organism can be considered learning. The kind of learning involved here is what Schroeder terms contingency-based learning: learning that consists in the reinforcement of dispositions within the system — reinforcement that, if positive, better positions the system to procure that which did the reinforcing, and if negative, better positions the system to avoid it. Thus, such learning is future-directed — it concerns the system’s performance in similar contingencies which may arise in the system’s future. In the case of the tonally-trained rats, the specific way in which connections within the rat’s cortices were changed suggests that it was a targeted response to the particular stimuli the rat received. The fact that it heightened the rat’s capability to process a particular frequency of sound, as opposed to some other frequency, or as opposed to impeding this capability altogether, suggests that its prior processing of the sound was positively reinforced, such that for the rat, its propensity for future processing of the same sound is increased. The change was targeted and in response to the given stimuli.
Thus we say that the rat has learned to process sounds of this frequency better, in the same way that a dog learns to reproduce a certain behavior upon being positively reinforced by treats, or a child learns to sit still at the dinner table upon being positively reinforced by the granting of dessert. To be clear, such learning doesn’t necessarily have to involve changes in behavior — the rat didn’t necessarily act to hear sound at the same frequency again. Rather, this change is perhaps better exemplified by shifts in dispositions within the organism — a disposition to act, to perceive, even to think in a given way. Other studies provide more stereotypical examples of contingency-based learning. A study by Egelman, Person and Montague (1999) showed that the learned behavior of human subjects in various decision-making tasks matched closely with that of a predictive network modeled on mesencephalic dopamine systems, suggesting a direct relationship between the biases involved in human decision-making and changes in dopamine delivery to target structures in the brain. Action selection in non-human animals also seems to depend crucially on dopamine release (Berridge and Robinson, 1998). Thus it seems that learned biases in decision-making processes, as well as learned behavior, are guided by the causal activity of dopamine-releasing neurons in the VTA/SNpc. In light of these findings, the VTA and SNpc can be understood as making up our reward systems, given that their activity seems to coincide with the receipt of paradigmatic rewards, and that they cause what paradigmatic rewards cause — learning, in its various forms. The activity of these essential structures, then — that is, the activity that underlies our constituting something as a reward — can be said to be defined by the creation of a learning signal, a signal that then induces the various kinds of learning that we have observed in the experimental setups cited.
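Predictive networks of the kind Egelman, Person and Montague describe are commonly formalized as temporal-difference (TD) learning, in which the dopaminergic learning signal is modeled as a reward prediction error. The following sketch is my own illustration, not drawn from the paper or the studies it cites; the function name, learning rate, and discount factor are assumptions chosen only to make the idea concrete:

```python
# A minimal sketch of a temporal-difference (TD) "learning signal":
# the signal is the gap between the reward received and the reward
# predicted, and it (rather than the reward itself) is what drives
# learning. All names and parameter values here are illustrative.

def td_update(values, stimulus, reward, next_value=0.0,
              alpha=0.1, gamma=0.9):
    """Update the value estimate for `stimulus` in place.

    Returns the TD error, i.e. the "learning signal": positive when
    the reward exceeds what was predicted, negative when it falls
    short, and zero when the reward was fully anticipated.
    """
    delta = reward + gamma * next_value - values[stimulus]
    values[stimulus] += alpha * delta
    return delta

# A stimulus repeatedly followed by reward comes to be predicted, so
# the learning signal is large at first and shrinks toward zero over
# repeated trials, even though the reward itself never changes.
values = {"tone": 0.0}
signals = [td_update(values, "tone", reward=1.0) for _ in range(50)]
# signals[0] is large (the reward was fully unexpected);
# signals[-1] is near zero (the reward is now predicted).
```

On this picture, what is constitutively reward-like is the production of the signal `delta`, not the behavioral or hedonic consequences that the updated values may later cause, which is the distinction the section above is drawing.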
But given that rewards also induce behavioral change and feelings of pleasure, what reason do we have to believe that our constituting something as a reward is best defined by its relation to learning? First, the behavior-production and pleasure centers are situated elsewhere in the brain, and it is clear that their activity is preceded by VTA/SNpc stimulation, and by the release of dopamine by neurons within the VTA and SNpc to those centers. We have already seen that behavior production is causally affected by dopamine release; as for pleasure, studies have shown that the effects of many different pleasure-inducing drugs like cocaine (Gawin, 1991), “ecstasy” (Liechti and Vollenweider, 2000) and various others (ed. Kandel, Schwartz and Jessell, 2000) are mediated at least in part by the VTA/SNpc. To clarify, it is also unlikely that the VTA/SNpc is itself the pleasure center of the brain, given that rats have been found to experience pleasure in response to consuming sugar despite severe lesions to dopamine-releasing neurons in their VTA/SNpc (note that this doesn’t invalidate VTA/SNpc activity as a cause of pleasure, but only as a necessary one). Thus it seems most intuitive to identify the creation of a learning signal (as implemented by VTA/SNpc activity in us) as our constituting something as a reward, and to explain behavior production and feelings of pleasure as being subsequently caused by this signal. Furthermore, such an explanation aligns well with commonsense psychology. We previously noted in the various experimental findings cited the myriad effects of paradigmatic rewards — creating emotional changes, motivating particular behavioral and decision-making patterns, even perhaps enhancing sensory abilities — that are mediated via the action of the VTA/SNpc systems and the release of dopamine to relevant parts of the brain.
These are effects that we intuitively attribute to paradigmatic rewards, but the point here is that although these effects are certainly differentiated, they are nonetheless unified under the idea of reward as producing a characteristic learning signal within the organism. That is, identifying the essence of reward as such allows us to provide an explanatorily significant account of the multitude of effects rewards have on us and on non-human animals, in a way that aligns with, and indeed is supported by, the central role of the VTA and SNpc in our mental life. Note also that this means not only that the learning-based theory of reward doesn’t preclude the close relation between reward on the one hand, and behavior and pleasure on the other, but also that such a theory actually supports that relation. It is the conveying of a learning signal, via a release of dopamine, to motor cortices and the basal ganglia that causes the organism to exhibit decision-making and other bodily behaviors; indeed, the signal is only one of learning insofar as it has the potential to, among other things, produce learned patterns of behavior in the organism. Moreover, the learning signal creates a favorable attitude towards the stimulus within the organism in a similar manner: via a release of dopamine to hedonic centers of the brain (for example, the PGAC). Part of what makes the signal one of learning is also the kind of emotional reinforcement that it is capable of creating within the organism, which allows the organism to learn to make positive associations with rewarding stimuli. The causal relationships behavior and pleasure have with reward are thus not only preserved by a learning-based theory of reward, but are also accounted for with explanatory force.
By accounting for changes in behavioral and emotional dispositions in response to reward in causal terms, we give explanatory significance to many commonly uttered statements regarding reward — such statements as “Pascal always works late nowadays because he’s gunning for a promotion,” or “Hyun Joon’s spirits have really improved because after many weeks of working long hours, he was finally rewarded with an off-day.” However, one might rightfully object that a learning-based theory will, by characterizing the relationship between reward and learning in terms of identity and equivalence instead of causality, reduce certain statements meant to express causality to mere tautologies; one perhaps contrived example might be something like: “A learning signal was created in Greg because he was rewarded with a raise.” Notice first that such statements don’t really feature in our everyday appraisals of our own and others’ psychologies. Moreover, the theory identifies one’s constituting something as a reward with the creation of a learning signal within one, and not with the actual event of learning; it is only that a certain potential for learning has been actualized, and the learning itself — which, notably, may or may not happen — is but a consequence of the creation of this potential. Thus while reward is identified as the creation of such a potential for learning, it can still meaningfully be said to cause a specific event of learning. The effect of learning can therefore still be causally attributed to the fact of reward in an explanatorily significant way.

LEARNING SIGNAL CREATION AS ESSENTIAL TO REWARD

To bring home the argument that what is essential to reward is not the strengthening of behavioral dispositions or the production of feelings of pleasure, but rather the creation of a characteristic learning signal, we will conclude this section by examining the dominant essentialist theory of which Kripke (1980) and Putnam (1975) are advocates.
They identify the properties that are essential to an object P being of a certain kind K as the properties of P which provide the deepest explanatory account of the properties typically associated with K. These essential properties define what it means for something to be of kind K, because they provide important causal explanations for the other phenomena that K exhibits, and thus ground K’s other properties in a fundamental way. This seems to make good sense — intuitively, if the object P didn’t possess the properties which provided the causal ground for the properties commonly associated with K, then it wouldn’t instantiate K’s properties and therefore couldn’t reasonably be classified as being of kind K at all. This argument is perhaps best exemplified in chemistry. Kripke advances an argument that, especially for things with underlying chemistries, it is their microstructural properties which determine their natures. For example, gold exhibits certain chemical behavior and has a certain outward appearance because each gold atom has atomic number 79; its atomic structure thus constitutes its essence, since it adequately provides the ground for explaining all its other observable properties. Likewise, Putnam’s Twin Earth thought experiment advances a similar argument. On Twin Earth, a planet that is identical to Earth in all observable respects, there exists a liquid that similarly appears to be identical to water on Earth: it is colorless, odorless, tasteless, potable and so on. However, its chemical composition is not H2O, but rather some unknown chemical compound, XYZ. In spite of possessing all of water’s superficial properties, however, intuitively XYZ is not water.
This is because what fundamentally explains water’s identifying properties — its being composed of a particular chemical compound, H2O — is not instantiated in XYZ, and thus does not explain XYZ’s identifying properties, even if those properties happen to be identical to those of H2O. A similar argument can be applied to reward, and perhaps has already been somewhat explored in the preceding section. We identify the property that is truly essential to what it means to be a reward as the creation of a learning signal, because the creation of this signal crucially explains many of the other properties paradigmatic rewards have. We have already seen how the creation of a learning signal, implemented as activity of the VTA/SNpc in human and non-human animals alike, causes the strengthening of certain behavioral dispositions via a release of dopamine to the basal ganglia, motor cortices and other subcortical structures responsible for motion; we have also observed how a similar dopaminergic release to hedonic centers of the brain causes feelings of pleasure. We have also seen how sensory capacities can be heightened by similar activity. Other studies show that the release of dopamine is also directly involved in the learning of certain cognitive dispositions (Knowlton, Mangels and Squire, 1996). All these effects are characteristic effects of rewards, and all, very strikingly, can be adequately explained by the release of dopamine by neurons in the VTA/SNpc, which thereby creates a characteristic learning signal within the system. According to the aforementioned essentialist doctrine, then, the disposition of rewards to produce this learning signal is what is rightfully essential to reward — it is the property that provides the deepest and most fundamental explanatory account of many of the causal properties associated with paradigmatic rewards. 
It seems most reasonable, then, that it is reward’s third — albeit lesser-known — disposition, to release a characteristic learning signal, which constitutes the essence of reward after all. 
THE REWARD THEORY OF DESIRE 
Thus a plausible case has been made for a learning-based theory of reward. An organism’s constituting P as a reward can now be plausibly understood as P having the tendency to contribute to the production of a learning signal in the organism. How does this figure in desire? In what follows, I will argue that we should identify desiring something as precisely that: constituting that thing as a reward. To be precise, I will defend the following thesis, as posited by Schroeder (2004): 
The Reward Theory of Desire (RTD): To have an intrinsic desire that P is to use the capacity to represent that P to constitute P as a reward. 
In the following sections, I will first present some reasons to believe this view, and then outline several advantages that it has over contending views — namely those arguing that the essence of desire lies in one’s disposition to behave, or feel, in a given way. The first argument in favor of RTD is that it appeals to common sense. Presumably, if we would like to figure out how we might reward someone, it would help to first find out what the person desires; conversely, if we were to attempt to reward someone by giving her something she in fact does not desire, then it seems commonsensical to say that we have failed to reward her altogether. Here the subjective, representational quality of both desire and constituting something as a reward is apparent — desiring P, and constituting P as a reward, seem closely aligned in that they both involve the subject’s taking a positive stance towards P. Perhaps an example would help to illustrate the point. 
Suppose Yuko wants to reward her girlfriend for getting into the graduate school of her choice, and tries to do so by surprising her with tickets to go hike the famed Inca Trail in Peru. If her girlfriend loves the outdoors, and has dreamed of one day witnessing the beauty of the Machu Picchu ruins first-hand, then this would prove to be a successful attempt on Yuko’s part at rewarding her. However, if in fact she has an unspoken aversion to spending extended periods of time in the heat and humidity of the Peruvian wilderness, then as much as Yuko would like to reward her, it would seem that she has failed to do so, for her girlfriend does not desire hiking the Inca Trail at all. Thus it seems intuitive to identify desiring something with constituting it as a reward. Other observations about the striking similarities between the effects of being rewarded and having one’s desires satisfied also seem to support this claim. People are motivated to achieve what they desire, and look forward to having their desires satisfied; likewise, people seek rewards, and eagerly anticipate being rewarded. We consciously make decisions about what to do based on how we have had our desires satisfied in the past, or similarly how we have been rewarded; the positive reinforcement of both being rewarded and having desires satisfied may also influence our actions and decisions unconsciously. All these everyday observations thus provide a prima facie reason to believe that being rewarded and having one’s desires satisfied are one and the same. Moreover, it isn’t that it merely seems to us that rewards and desire satisfaction produce the same consequences in us; rather, neurological evidence shows that they actually do. We have previously looked at some of these experimental findings in relation to learning, but we will now examine them in relation to desire satisfaction. First, consider the effects of paradigmatic desires on actions. 
It is widely known that we are typically disposed to act so as to have our desires satisfied; neurological studies suggest that the brain areas responsible for action are not designed to function independently of the activity of our reward system — the VTA and the SNpc. The center of action production is the dorsal striatum, which receives input from the brain’s sensory and cognitive regions; these give it information about standing perceptual and belief-states — qualitatively, about the current state of affairs, and the relevant action options that are available to it at the given time. The output of the dorsal striatum is then directed to the brain’s motor and pre-motor cortices (as well as many subcortical structures), which collectively have full control over the performance of voluntary bodily movements. But more pertinently, the dorsal striatum also receives input from the reward system, and if the RTD is correct, then the psychological mechanism we intuitively believe underlies voluntary action can be more thoroughly explained — in biological terms, motor function is initiated by sensory, cognitive and reward information; in psychological terms, voluntary action is motivated by the contents of our perception, beliefs and desires, as is widely believed. A compelling reason to believe the reward system is essential for voluntary action is the striking loss of the capacity for such action in patients with (extreme forms of) Parkinson’s disease. In severe cases, the disease kills all the neuronal cells within the reward system which contribute to action production; the patient ends up afflicted by complete paralysis as a result (Langston and Palfreman, 1995). Importantly, patients do not regain voluntary motor capacities by drawing on beliefs about what is best or right to do. 
This suggests that reward-independent motivation is insufficient for voluntary action; rather, voluntary action production needs to be at least somewhat motivated by states of the reward system. To illustrate this point further, although bodily movements can be performed while bypassing the reward system, such as through the direct stimulation of motor cortices, or in Tourettic tics, such movements are not paradigmatically voluntary — that is, they seem to arise independently of desire. This suggests even further that the involvement of the reward system is essential for voluntary action, and underlies the motivation of such action by desire. Next, consider the causal relationship between the activity of the reward system and feelings of pleasure. We saw in the previous section that the release of dopamine by the reward system to hedonic centers in the brain is responsible for the pleasure sensations we experience. To emphasize this point further, other studies suggest that most pleasure-causing drugs directly or indirectly act on relevant parts of the reward system (Liechti et al., 2000; Liechti and Vollenweider, 2000); furthermore, direct stimulation of the reward system seems to suffice, under certain experimental conditions, to cause pleasurable feelings (Stellar, 2012). Such evidence is widely observed, and strongly suggests that the activity of the reward system, though artificially induced in these examples, plays an essential and perhaps even sufficient role in causing pleasure sensations. An additional point can be made about the role of reward learning in helping us position ourselves to attain what we intrinsically desire. We saw earlier how the activity of the reward system is best understood as the production of a characteristic learning signal. 
This signal is designed to causally affect other brain areas so as to motivate particular behavior, produce feelings of pleasure, enhance sensory capacities, and so on — that is, so as to place ourselves in a more advantageous position for reaping what we constitute as the reward, or equivalently, what has contributed (and has the tendency to contribute) to the production of that learning signal. This seems fundamentally aligned with the effects that desires can cause in our dispositions. In a general sense, we are disposed to overcome obstacles, navigate confusing territory (literally and metaphorically), and even improve our own abilities, so as to position ourselves to attain what we intrinsically desire. Again, it seems only natural to think that it is the promise of such desire satisfaction that underlies the learning that rewards bring about, and the learning that the activity of the reward system functions to cause; perhaps, then, desire satisfaction and the attainment of rewards are one and the same. The empirical findings cited so far show that the reward system, comprising the VTA and the SNpc, is in fact the common cause of all the aforementioned effects, for us and many non-human animals. It is a biological natural kind, a component of our neurophysiology that really does function to bring about those biological effects. In order to understand the concept of reward, and of constituting something as a reward, we can rightfully generalize this to say that rewards themselves are a psychological natural kind, regardless of how the constitution of something as such is implemented — by the VTA/SNpc, or otherwise. They do causally affect our behavior, our feelings, and also the way we learn. 
Now, in order to solidify our defense of the view that constituting something as a reward just is desiring it, we need also to show that the intrinsic desires we are trying to understand really do in fact have those very same causal powers — that is, that desires themselves are a natural kind. Only then would the striking alignment between the causal profiles of rewards and desires provide a plausible argument for identifying desires as the constituting of things as rewards. The first argument in favor of treating desires as a natural kind is a familiar one: an appeal to intuition. There seems to be something straightforwardly appealing and explanatorily significant in arguing the thesis of desire realism, that desires do exist and really do at once exert all the myriad causal powers we have examined thus far. When a baseball player hits a home run, and at once feels exhilaration, goes on to continue refining her swing with repetitive drills, and perhaps also internalizes how her body contorted and felt as she propelled the ball out of the park, it seems natural and uncontroversial to say that she has had a desire to hit a home run all along, and that it is this singular mental state of desire (satisfaction) that simultaneously explains all the different mental and physical effects she experienced. Such reasoning occurs to us immediately and makes good sense, without the need for further complicated hypothesizing. Furthermore, if desires as a natural kind aren’t invoked to make this account, then something else needs to fill the explanatory gap, and it doesn’t quite seem that any other mental state serves as a suitable candidate. Moreover, it seems that a good test for the realism and explanatory robustness of a state as a putative part of the mind is that it features in our best endeavors in psychological science. 
For the reasons canvassed earlier, it seems that the psychological sciences will likely continue to make good use of the reward system in explaining a multitude of other thought, feeling and behavior phenomena, given that no other system of the brain simultaneously causes all those effects. If the reward system is going to be preeminent in the psychological literature, then either we account for its explanatory significance by interpreting its activity as realizing desires, or we make the arguably unwarranted interpretation of its activity as realizing some other hitherto unidentified mental state, or worse: we disregard its importance to the philosophy of mind and morality altogether. Again, desire seems to be the best answer we have when examining the question of what mental state is realized by the activity of the reward system. Thus, there is good reason to understand desire as a natural kind. If desire and reward are both natural kinds, and have matching causal profiles, then it seems very plausible that they are identical. We have gone a considerable distance in showing that a reward theory of desire makes intuitive sense, and has distinctive explanatory force. As a final clarifying point, note that while this trivializes the apparent causality expressed in statements such as “getting that reward really did satisfy the desire she had,” it seems wholly unnecessary to invoke causality here in the first place. Getting a reward and having a desire satisfied occur simultaneously, not because getting the reward somehow caused the satisfaction of a desire for that same reward, but because getting the reward just is the satisfaction of that desire. In certain cases, however, it seems possible for us to desire a reward, but this is best expressed as a second-order desire — a desire to have some desire or other satisfied. 
This seems to make sense if we consider how we might desire to have a particular desire satisfied comparatively more than another. Thus constituting something as a reward and desiring it can still be one and the same without facing serious conceptual flaws, and the RTD remains intact. 
ADVANTAGES OVER THE MOTIVATIONAL THEORY OF DESIRE 
Perhaps one might oppose the RTD by saying that instead, desiring P consists in the disposition to act so as to bring it about that P. By now it should be well established that desires can cause action; so as not to belabor the point, let us affirm that it does seem plausible that if I have an intrinsic desire to, say, eat peaches, then I am disposed to eat peaches, and will do so when I have the means to. We’ll call this theory the motivational theory of desire. However, for this to constitute desire in an essential way, the disposition to act so that P must be both necessary and sufficient for desiring P. But it seems that such a disposition is neither sufficient nor necessary. First, consider some arguments against the sufficiency of behavioral dispositions. It seems that under certain circumstances, it is possible for us to have a disposition to act in a given way, and yet it still doesn’t amount to our having the associated desire. Behaving out of habit is one such example. We often hear people lament “I didn’t actually want to do it, I just did it out of habit,” and even express feelings of frustration after having done something habitually that might have been counter-productive or even undesired at the time. It seems that desires and habits are fundamentally different motivating forces for action, such that it doesn’t seem quite right for both to have dispositions to act as their essence. Suppose that Phong has a habit of leaving the refrigerator door open whenever he reaches into it for a snack. 
He is able to overcome this habit most of the time, but occasionally, when he reaches into the refrigerator while really hungry, he’s so focused on obtaining and consuming the snack that he inadvertently leaves the door open. From a third-person perspective, the consistency in his act of carelessness seems to constitute a disposition to so act, such that if desires were really equivalent to behavioral dispositions, Phong would be functionally indistinguishable from someone who has the (albeit very specific) desire to leave the refrigerator door open when he is hungry. Yet Phong doesn’t have such a desire — upon finding out that he has left the door open many hours later, Phong might find himself annoyed that his folly has caused the milk and eggs to go bad. It’s clear that he never wanted to leave the refrigerator open in the first place. Another case in which a disposition to act doesn’t suffice for desire involves the influence of other mental states which often cause the subject to act in opposition to what she intrinsically desires. Consider a runway model who is making his debut at a big event; as an aspiring model, the last thing he wants to do is trip and fall on stage, in front of such a large audience. Yet the anxiety and the mental image of him tripping on his own feet prove too much for him to overcome, so much so that he forms a disposition to trip on stage, and actually blunders and falls during the show. Here, again, we see that it is possible for one to develop a disposition to act that is decidedly contrary to what one wants to do; the motivational theory would thus wrongly classify the model as having desired to fall on stage, when we know he desired no such thing. 
One could perhaps push back on this argument by saying that in these cases the subjects can’t be said to have had desires to act as they did, since their actions were not voluntary, and thus were not the kind of genuine action that the motivational theory of desire deals with. However, such an objection seems to beg the question: the claim here is that part of what makes something a genuine action is that it is caused by desire, but that claim is precisely what a theory of desire should explain, so we cannot assume it in order to prove the theory itself. Thus the objection from the insufficiency of the disposition to act for desire still stands. Furthermore, the disposition to act is also not necessary for one to have a desire. We’re all familiar with having desires that don’t have much to do with our acting on them at all. For example, most of us desire an end to global poverty, yet do not feel compelled to fight for that end, and would not reason that we have a disposition to do something about it. The relation between a disposition to act and desire seems even more tenuous if we consider cases where we might desire something that is not within our power, such that harboring a disposition to act so as to make it happen would be irrational — for instance, some mathematicians have a desire that P = NP, because that would make many hard problems tractable, but it would be delusional to even think of trying to make that happen. Even more striking are cases where we have certain desires which are defined precisely by the disregard of one’s actions. Arpaly and Schroeder (2014) offer the example of desiring that someone love you regardless of what you do; this kind of desire for unconditional treatment, if anything, involves desiring such treatment in spite of the various dispositions to act one might have. 
These everyday examples strongly suggest that it is possible, and probably quite common, for us to have desires that do not and in some cases cannot depend on a given behavioral disposition. Another familiar argument comes to mind: the conceivability of the hypothetical creatures Strawson calls Weather Watchers. We examined this example in relation to reward in an earlier section, but it is relevant in relation to desire as well. To reiterate, Weather Watchers lack the capacity for any kind of bodily movement, yet possess the emotional disposition to feel delight and despair at specific weather patterns. Given that such emotional responses are not random, but targeted at particular weather phenomena, it seems intuitive for us to say that Weather Watchers desire the onset of some weather patterns, and are averse to the onset of others. But a disposition to act entails an ability to so act. These Weather Watchers, without such an ability, can have no such disposition; yet they possess desires. Thus, it doesn’t seem that dispositions to act are necessary for one to have desires either. Perhaps one could defend the motivational theory by saying that these examples don’t quite involve desires, but rather involve something more like wishes. The differences between these two states are subtle and very much debatable. It does seem that there is a sense in which wishes are accompanied by a kind of helplessness, that there is nothing in our power that can be done to help us get what we wish for; perhaps it isn’t even worth trying. But does this really qualify them as fundamentally different in kind? We might also say that both wishes and desires are favorable attitudes towards their objects, and that what occurs when one shifts from wishing for P, to figuring out that perhaps something can be done so that P, to finally desiring P, is merely a cognitive change and not a full-fledged transition from one conative state to another. 
One’s ability to act in service of what one feels favorably toward then seems inessential to the underlying attitude towards it, given that the difference lies crucially in a certain judgment of the options available to one based on a perceptual appraisal of the current state of affairs, and not in a change in the attitude itself. This seems plausible considering that it is easier to conceive of one oscillating between wishing for and desiring something due to changes in belief states regarding one’s circumstances than due to an incessant swing from one conative state to a completely different one. Thus, just as there are reasons to distinguish fundamentally between wishes and desires, there are equally compelling reasons to treat them as being underlaid by the same fundamental state. This objection thus doesn’t quite seem to stand after all. If behavioral dispositions are neither sufficient nor necessary for one to have desires, then it seems that we would be unjustified in identifying having a desire with being disposed to act in a certain way. Rather, a disposition to act so that P that is caused by a desire for P — a wholly separate state, which we identify as constituting P as a reward — would be more consistent with the observations just canvassed, given that such a causal relationship entails neither sufficiency nor necessity. 
ADVANTAGES OVER THE HEDONIC THEORY OF DESIRE 
Feelings of pleasure seem closely tied to desires as well — perhaps then, as briefly suggested at the start of this paper, desiring P is being disposed to feel pleasure whenever P obtains. Such feelings of pleasure, which arise when we have desires satisfied, should be exceedingly familiar to us — when our thirst is quenched with ice-cold lemonade on a hot day, when we achieve the grade we wanted on an exam… the examples are multitudinous and widespread. 
However, just as I did in the previous section with the motivational theory of desire, I argue here that identifying these desires with emotional dispositions isn’t justified, because this identification entails that the associated emotional dispositions are both necessary and sufficient for one’s having the desire, and this is decidedly not the case. We will henceforth call this theory the hedonic theory of desire. First, consider some arguments against the necessity of emotional dispositions for desire. Clinical studies of drugs like chlorpromazine have shown that these drugs are capable of suppressing subjects’ impulses for action, so much so that they are no longer disposed to feel the emotional sensations associated with such tasks as eating, walking or talking. Strikingly, however, subjects still perform those acts. To account for this observation, proponents of the hedonic theory of desire would have to argue that the subjects’ desires are in fact suppressed, but that subjects still exhibit behavior because other motivational structures, like belief states about what is right or good to do, are still intact. However, this doesn’t seem to be the case, given that further clinical observation reveals that subjects do not demonstrate any notable ability for accomplishing such tasks as dieting, curbing addictive tendencies like smoking, or seeing long-term plans through — tasks which (appetitive) desires distract us from, and which should reasonably be easier to accomplish if those desires were eliminated. Thus it seems that a better explanation would be that despite the effects of the drug, subjects still retained their intrinsic desires, and it was those desires that motivated subjects (albeit to a lesser degree) to continue their behavior. What was suppressed, then, was instead the causal link between the satisfaction of their desires and the associated feelings. 
We thus don’t seem justified in reasoning that the disposition to feel pleasure is strictly necessary for one to have desires. Nor does it seem that the disposition to feel pleasure whenever P obtains is sufficient for one to desire P. Certain conditions like drunkenness, the influence of drugs, and being caffeinated can alter emotional dispositions without altering our desires in a fundamental way. Maria might have in her possession an antique vase that was passed down to her as a family heirloom, and she might value it very much; however, in a fit of drunkenness, she accidentally knocks it over and shatters it all over her living room floor, but because of her inebriated state she delights in this accident, and somehow finds amusement in the fact that something so precious has been violated beyond repair. This seems to be an entirely imaginable scenario; some of us might even remember having a similar experience. But the pleasure that Maria derives from this situation doesn’t quite amount to her having an intrinsic desire to break the vase, since we know she had the contrary desire to preserve it, and it doesn’t quite seem reasonable to suggest that she developed the former desire just by being drunk either. Thus, the disposition to feel pleasure at destroying a precious object, in Maria’s case, isn’t sufficient for her to have possessed a desire to destroy it. The fact that feelings of pleasure and displeasure do not always correspond to the status of our desire satisfaction suggests an alternative account of pleasure to the hedonic theory: that perhaps such feelings are a representation of desire satisfaction. Arpaly and Schroeder (2014) echo this view. While feelings of pleasure often covary with desire satisfaction in predictable ways, the failures of this covariation highlighted by the observations mentioned earlier seem to be instances of misrepresentation. 
An anomalous covariation between two phenomena, A and B, can be explained by the fact that A is a representation of B which has, in the given instance, erroneously represented B; in deviating from its predictable trend of covarying with B, A is thereby said to have misrepresented B. The other properties which pleasure sensations seem to exhibit are best explained by such a representation-based view. Consider how the satisfaction of the same exact desire could arouse varying degrees of pleasure — sometimes indifference, but other times pure exhilaration. Someone who has always slept in a comfortable bed might continue to desire sleeping in one, but when she does sleep in one, might take it for granted, and be nonchalant at the satisfaction of this desire; however, if she were deprived of this comfort for ten days straight, sleeping in her bed might become an immensely pleasurable experience. This pleasure might then gradually fade with each successive time she gets to sleep in this bed, until a state of indifference is reached once again. This observation is perhaps best explained not by a strict one-to-one correspondence identifying a particular feeling of pleasure with the satisfaction of a particular desire, but by taking pleasure to be a representation of the net change in desire satisfaction. Such a view can at once adequately explain the various atypical pleasure phenomena outlined thus far: the suppression of this representational capacity, erroneous representations, and feelings of pleasure corresponding more closely with a relative level of desire satisfaction than an absolute one. This concludes our evaluation of the hedonic theory of desire: it seems implausible because dispositions to feel pleasure aren’t necessary or sufficient for one to have desires, and furthermore, feelings of pleasure admit of a much more coherent explanation as representations of changes in levels of desire satisfaction, caused by those changes. 
We thus have compelling reason to reject the hedonic theory as an account of desire. The arguments outlined above help show that behavior and feelings of pleasure are best understood as characteristic effects of desire satisfaction, but are nonetheless inessential to what it means to desire something. Rather, it seems that there are compelling reasons to instead identify desiring something with constituting it as a reward: such a theory appeals strongly to our intuitions about desire, is supported robustly by neurological evidence, and is also conceptually coherent when we consider desire and reward as the same natural psychological kind with real causal powers. This sums up our defense of the reward theory of desire; in the next section, we will begin to see how we can apply this theory to the standard reinforcement learning paradigm that is commonplace in the computer science literature, to see if we can reasonably attribute desires to the machine. 
THE REINFORCEMENT LEARNING PARADIGM 
The reinforcement learning model, generally speaking, is a program governed by mathematical principles that is designed for the purpose of performing a certain task — whether it is winning a chess game, or helping a car autonomously stay within its lane. But before the program can perform this task proficiently, it must first learn to do so, by executing relevant actions within its virtual environment, and receiving feedback about how it has performed, feedback according to which it can then tweak itself so as to receive more positive feedback in future. Because of its activity within its environment, in the literature the model is often called an actor or an agent; and because the positive feedback it receives is supposed to help it become more proficient in its task by reinforcing the approach it has just taken, this feedback is often (perhaps rather loosely) termed a reward. 
It remains to be seen whether this notion of reward corresponds with the one we intuitively use in our everyday lives, and if so, whether the model’s pursuit of these rewards qualifies it as having a desire for them. For the purposes of this paper, we will investigate a class of such reinforcement learning programs called the actor-to-critic (A2C) model. It is explanatorily efficacious for us to use this model as an example because the putative actions and rewards involved are expressed more literally. The model comprises two agents which work together to perform an arbitrary reward-garnering task: the actor, and the critic. One such prototypical task is a game of CartPole (Barto, Sutton and Anderson, 1983), in which a pole is attached by an unactuated joint to a cart, which moves along a straight, frictionless track. The cart can only move forwards or backwards, and the model is tasked with deciding in which of the two directions to move the cart so as to prevent the pole from falling over, for as long a period of time as possible. In the context of this task, there are only two action options available to the actor: A1 and A2, corresponding to moving the cart forwards and backwards respectively. The actor uses a given policy to select one of these actions; the policy takes as input, among other things, a vector encoding information about the current state of the environment (e.g. the angle between the pole and the base of the cart), and a probability distribution across the available actions, which, at the start of the task, is initialized either randomly or uniformly across the two actions. To be clear, the policy is simply governed by mathematical principles which perform computations over the aforementioned inputs, and which then produce a decision about which action to take — this can be represented by a binary number, since there are only two available actions. 
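The selection step just described can be sketched in a few lines of Python. This is a minimal illustration rather than the API of any reinforcement learning library; the function name `select_action` and the contents of the state vector are assumptions made for this example.

```python
import random

A1, A2 = 0, 1  # the two action options: move the cart forwards / backwards

def select_action(state, action_probs):
    """Sample an action from the current probability distribution.

    `state` is the vector encoding the environment (pole angle, cart
    position, etc.); in a fuller model it would feed a parameterized
    policy, but in this sketch only the probabilities matter.
    """
    return A1 if random.random() < action_probs[A1] else A2

# At the start of the task, the distribution is initialized equally
# across the two actions:
action_probs = {A1: 0.5, A2: 0.5}
state = [0.0, 0.0, 0.05, 0.0]  # an illustrative, made-up state vector
action = select_action(state, action_probs)
```

With the uniform distribution above, the two directions are equally likely to be chosen; learning will later reshape these probabilities.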
For the purposes of this paper, exposition about these (and other) complex mathematical principles can be elided. One other important input to the actor's action-selection policy is an estimation of the reward that is associated with the execution of either action. Notice that it seems to make good sense to offer this information to the actor, given that in our everyday lives, when we deliberate between various options, we often consider and compare what we expect the benefit of taking each option to be, and given our intrinsic benefit-maximizing tendency this information crucially affects our decision. Within the model, this estimation is made by the critic, so termed because it provides a "critique" of how rewarding each action is anticipated to be. These estimations are denoted by V1 and V2, which correspond to the expected rewards of A1 and A2 respectively. The expected rewards act as inputs to the actor's action-selection policy by affecting the probabilities which the actions are distributed across, with the probability that a given action is selected being proportional to the magnitude of its expected reward. Again, this seems reasonable — we want to preferentially select the action that we expect to bring us the greatest reward. The computation of the expected rewards involves state information as well, but the complexities behind that computation shall also be elided here. The actor then performs the selected action, and the action event is actualized within the environment, thereby changing its state, and the vector which encodes it. The vector that results acts as feedback about how the actor has fared with that last action — perhaps it has allowed the pole to stand more uprightly, or perhaps it has caused the pole to tip even further. This feedback is computed and processed by the critic as a scalar, and denotes the actual reward, r, associated with the action.
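A minimal sketch of this machinery, assuming a softmax rule for turning the critic's estimates into selection probabilities and a fixed learning rate for the critic's adjustments; neither is fixed by the text, but both are common choices in the reinforcement learning literature. The update function anticipates the reward prediction error defined in the next paragraph.

```python
import math
import random

# Toy actor-critic sketch: the critic's estimates bias the actor's choice
# (higher expected reward, higher selection probability), and each estimate
# is adjusted once the actual reward r is observed. The softmax form and
# the learning rate alpha are assumptions, not specified by the text.

def action_probabilities(values):
    """Map expected rewards V1, V2, ... to selection probabilities (softmax)."""
    exps = [math.exp(v) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def select_action(values):
    """Sample an action index according to those probabilities."""
    threshold, cumulative = random.random(), 0.0
    for action, p in enumerate(action_probabilities(values)):
        cumulative += p
        if threshold < cumulative:
            return action
    return len(values) - 1

def critic_update(v_i, r, alpha=0.5):
    """Adjust V_i in the direction and proportion of the RPE, delta = r - V_i."""
    delta = r - v_i
    return v_i + alpha * delta, delta

# An underestimated reward (r > V_i) yields a positive RPE and raises V_i:
# critic_update(0.25, 1.0) returns (0.625, 0.75).
```

Note that an action with a larger expected reward is not chosen always, only preferentially; under the softmax assumption, `action_probabilities([1.0, 0.0])` still assigns the second action roughly a quarter of the probability mass.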
From this, the numerical difference between the expected reward, Vi, and the actual reward is calculated — this denotes a reward prediction error (RPE), signifying that Vi was either an over- or under-estimation of the reward for the most recently executed action; mathematically, δ = r - Vi, where δ denotes the RPE. The critic then uses the RPE to adjust Vi, in the direction given by the sign of the RPE and by a magnitude proportional to that of the RPE, so as to reduce future RPE values and provide more accurate predictions of reward in future. Again, this seems to align with the process we use to make everyday decisions — upon finding out that the sushi at a Japanese restaurant was not as good as expected, one might lower one's preference for going to that restaurant for dinner, so much so that one might choose an alternative place instead in future. The newly-adjusted Vi values thus serve as inputs for the actor's next action selection procedure, and the whole process repeats itself, until the task is completed. Such a model has been widely shown to optimally stabilize the pole atop the cart after a certain number of trials, with technical differences in the algorithm used creating variations in the average number of moves needed (see the OpenAI website https://gym.openai.com/envs/CartPole-v0/ for a catalogue of some of these algorithms). The model so described (albeit somewhat simplistically) is thus able to make use of the various internal quantities listed to solve the pole-balancing problem optimally; one of those internal quantities is described as being an expected reward, and another, an actual reward. At face value, this notion of reward seems somewhat abstract, and arguably departs significantly from the notion of rewards that figures in everyday life. The ensuing
sections will argue that because rewards are most essentially grounded in learning, as we have observed in the earlier sections, perhaps this intuition is somewhat misplaced. Importantly, however, we have highlighted earlier that under the RTD, for an organism to desire P it must be able to constitute P as a reward, and implied in such a notion of constituting is one of representing. According to the RTD, then, an organism must possess the capability to represent, and the essential features of rewards must figure in the content of its representation, for it to be able to have desires. Thus an argument that the A2C model has desires must first show that the model is capable of representing things as rewards. In the next section, I provide a plausible account of representational content, and show that the account does not preclude but in fact supports the claim that the model can represent rewards in a semantically significant way.

A PLAUSIBLE THEORY OF REPRESENTATIONAL CONTENT

What we want to outline here is a way of explaining what it is that gives representations their meaning. To represent states of affairs as rewards, one needs to be able to ascribe meaning to one's representations of those states of affairs — that is, a specific kind of meaning, that of being a reward, encapsulating the essential qualities constituting the concept we try to express when we communicate with the word "reward". One perhaps misguided way of explaining how this meaning is ascribed is the proposal that meaning is already inherent in the representation, and that there needs to be a mind capable of grasping that meaning — a kind of internal homunculus — which is thereby able to access that content in such a way as to guide action. This proposal seems to raise more questions than it answers, for it leaves us with the task of explaining exactly how the homunculus accomplishes this difficult feat of understanding.
Instead, it seems to make more sense that representations, as products manufactured by representing systems, have content in virtue of the way those representations are used by the system to guide behavior. The way the system behaves in response to those representations constitutes the representations as having meaning. This crucially grounds representing systems as participating in a self-directed process of obtaining and making use of information about the distal properties of their environments, for various purposes. Shea (2018) offers support for this view. He argues that content can be ascribed to a representation in virtue of the role the representation plays in the organism's performance of task functions: the kind of tasks undertaken by representing systems which can be understood as having particular functions for the system's survival, persistence, development, performance, or other things we might intuitively think of as paradigmatic goals for the organism. He offers an account of such task functions, by arguing that a system can be said to be performing a task function if (1) it produces outcomes (relevant to the task) robustly, (2) the robust production of these outcomes is due to a kind of stabilizing process, and (3) there are internal components which correlate with distal properties in a way that is relevant to achieving the outcomes, and the system actually exploits these correlations to achieve the outcomes. We will now address each requirement in succession, and see if the A2C model fulfills each one of them. If it does, then the model can be said to be performing a task function, and if Shea's theory is right, then the model can rightfully be said to be representing in a contentful manner. First, what does it mean to produce robust outcomes?
Shea argues that for an outcome to be generated robustly it must be produced in a manner that is responsive to relevant external conditions, such that the same outcome can be produced even when distal conditions differ. Shea defines the relevant kind of distal conditions more precisely, by stating that an output F of a system S is a robust outcome if (i) S produces F in response to a range of different inputs, and (ii) S produces F in a range of different relevant external conditions. He introduces this distinction by drawing on Ernest Nagel's characterization of goal-directedness (Nagel, 1977), in which the same outcome is produced in the pursuit of a goal despite variations in initial conditions, and despite perturbations which arise as the action is being executed, loosely corresponding to points (i) and (ii) respectively. To clarify this point, Shea offers a paradigm example of robust outcomes in motor control (Goodale et al. 1986; Schindler et al. 2004; Milner and Goodale 2006). The studies show that subjects begin with an initial trajectory towards the perceived position of a target they are tasked with reaching toward and touching; if the target is displaced while the hand is in motion, the initial trajectory is modified so that the target is reached anyway. This is made possible by an online mechanism in the brain which adjusts an action as it is executed, in a way that is sensitive, on the fly, to changes in external conditions. However, Shea also explains that in many cases these perturbations can be considered in effect to be new initial conditions which the organism takes into consideration, that is, as input, to devise next steps towards producing the outcome. The displaced position of the target while the hand is mid-flight can simply be understood as forming a new visual percept which guides hand movement.
Thus, the main idea is that these distal properties specify a rich set of circumstances which the system must internalize and navigate in order to produce the outcome; outcomes are produced robustly if they are produced despite variations in those circumstances. Does the A2C model produce any outcome robustly? One such outcome is a more upright pole after the action of moving the cart either left or right is performed, as compared to before the action was performed. A sufficiently trained model is capable of successfully balancing the pole atop the cart, and importantly, is able to do so whether, at any given point in time, the pole is tipping towards the left, or towards the right. Contrast this with a model that always chooses to move the cart to the right with maximal probability. It would be possible for such a model to consistently keep the pole upright, if the hinge on which the pole sits is designed to disallow any leftward tipping whatsoever, such that the pole always tips towards the right. However, it cannot produce the outcome robustly despite changes to distal conditions, since if a regular two-way hinge is used instead and the pole tips over to the left, a default rightward movement would cause the pole to tip over even further. Such a model would thus fail to satisfy the requirements of robust outcome production. Contrastingly, it seems that the A2C model does in fact robustly produce the outcome of making the pole more upright, given that it is sensitive to information about the state of its environment (namely the direction of the incline of the pole), and can move the cart in response to this information so as to prevent the pole from falling over, with consistent success. Its ability to produce the same outcome despite variations in distal circumstances thus qualifies it as producing that outcome robustly.
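The contrast drawn here between a state-sensitive policy and a fixed one can be made concrete with a toy one-step "physics" — a deliberate simplification, invented for illustration, in which moving the cart toward the side the pole tips to rights the pole, and moving away from it worsens the tilt:

```python
# Toy contrast of the two policies discussed above. tilt is -1 (pole tipping
# left) or +1 (tipping right); a move matching the tilt rights the pole.
# This one-step "physics" is an illustrative assumption, not CartPole dynamics.

def corrects_tilt(policy, tilt):
    """True if the chosen move reduces the tilt under the toy physics."""
    return policy(tilt) == tilt

reactive = lambda tilt: tilt    # sensitive to the state of the environment
always_right = lambda tilt: 1   # insensitive to distal conditions
```

The reactive policy succeeds under both distal conditions (`corrects_tilt(reactive, -1)` and `corrects_tilt(reactive, 1)` are both true), while the fixed policy fails whenever the pole tips left — the failure of robustness described above.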
The next feature of task functions is that the robust outcomes produced in service of this function are so produced not by mere accident, but by a kind of stabilizing process that shapes the consistent behavioral patterns which are relevant to producing the outcome. A mechanized ball that bounces every which way within a box might always eventually find its way out of the box through a hole, but despite the consistency in the production of that outcome, it is produced by mere accident and not by a process that actually stabilizes its production. One such stabilizing process is natural selection. An organism produces an output in a stabilized way because that output has brought certain benefits to the organism in its evolutionary past — perhaps it contributed to its ancestors' survival, or promoted their good health. These behavioral dispositions are thereby passed down between generations, as Darwinian evolutionary theory might suggest, because it was their contribution to the organism's survival that made the organism capable of conferring them hereditarily on its offspring in the first place. The behavioral disposition is thereby selected, in a manner familiar to natural selection theorists, such that it can shape behavior in successive generations in a directed, non-accidental way. Many advocates of teleosemantics propose an etiological account of functions, and argue that the function of a perceptual organ which produces representations is what the organ was selected to do, and not simply what it is disposed to do (Millikan, 1984; Neander, 1995). The representations so produced thus have content, because the content is instrumental in guiding various other tasks the organism has to perform, tasks which the organism is poised to perform only if the perceptual organ does what it was selected to do — that is, if it fulfills its biological function.
Neander, in particular, argues that frogs do produce visual representations of flies with determinate content, because their visual systems were selected to represent flies determinately, and the content of those representations guides their fly-consuming behavior in a directed way (Neander, 2017). According to these teleosemantic accounts, then, the content of a putative representation that is being used by an organism depends crucially on the way those representations were used in prior processes that were favored by natural selection. Natural selection, however, cannot explain the stabilization of all robustly produced outcomes which might qualify as task functions, because it pertains mainly to biological functions and depends crucially on the system having an evolutionary history. Shea himself doesn't think this covers all cases of task functions, and raises the paradigmatic example of Swampman to express this point. It seems that Swampman, a molecule-for-molecule duplicate of a human being that was created by some scientific miracle in a swamp, would, given time, behave predictably in response to various stimuli — swat at a bothersome fly, or reach for an apple hanging on a tree — in the same way that the human being he is a duplicate of would. However, teleosemantic theories predicated on conferring content on representations in virtue of what the relevant representing systems were naturally selected to do are committed to arguing that Swampman cannot in fact have contentful representations, because he, being born non-reproductively, has no such evolutionary history. Yet it seems that no explanans other than that Swampman is creating and using contentful representations can sufficiently explain the consistent pattern of behavior he is likely to engage in. Perhaps this explanation is unavailable at the very moment of Swampman's birth.
But it seems likely that once Swampman begins to interact with his environment, a process of learning, reminiscent of the one examined in earlier sections of this paper, will begin to stabilize his behavior in an important way. He may perhaps find himself in an apple orchard; picking an apple from a tree for the first time, Swampman may first simply handle it, perhaps fling it on the ground, but later discover that these red round objects represented in his visual field quell his hunger when eaten, and thus develop a disposition to reach toward such red round objects in his visual field and try to consume them whenever he is hungry. The predictability of his behavior from that point provides a good reason to believe that it is produced by some stabilizing process or other. In this case, it seems clear that the fact that his visual state encodes information about an apple guides the production of the robust outcome of reaching for the apple and consuming it in a stabilized, non-accidental way, without presupposing that any deep evolutionary history is required to confer content on those representations. This brings us to another kind of stabilization process that Shea focuses on: learning through feedback. Shea explains that this kind of learning proceeds in a number of steps. The organism first identifies a statistical pattern within the inputs it receives from its environment. Here Shea refers to the kind of correlation identification that is familiar from classical conditioning studies, where a general principle is recognized by the organism, but the principle isn't yet used to direct behavior. Then, the organism begins acting on the input patterns it has observed; the effects of its action are registered, acting as feedback for its behavior. If the effects are positive for the organism, a positive learning signal is produced, strengthening the connection between the organism's recognition of the pattern and the behavior it just performed.
Correspondingly, if the effects are negative for the organism, a negative learning signal is produced, weakening this same connection, or perhaps strengthening the connection between recognition of that pattern and avoidance of the behavior it just performed. Changes to this connection — however it is implemented, neurally or otherwise — thus contribute to the stabilization of behavior, since a stronger connection will more robustly produce the behavior in the organism whenever the input pattern is recognized. Notice here that when such behavioral dispositions are regulated by feedback, the production of the behavioral outcomes is stabilized in a way that is not contingent on the presence of any evolutionarily selected function. Perhaps the mechanism that initially makes certain associations between inputs could be described as an evolutionary function, but the associations that are made don't inform the organism about anything other than the fact of the association, and are thus very general; the learning effect of feedback is more specific, such that it can direct specific behavior. The fact that either seeking or avoidance behavior may be reinforced, depending on whether the effect on the organism is positive or negative, suggests that the organism needs to go beyond merely recognizing patterns in order to learn. This kind of learning also does not always involve making connections that are evolutionarily beneficial for the organism — for example, ravens will learn to perform tasks in exchange for shiny objects; human beings will learn to jump off cliffs into lakes because it is an enjoyable thing to do. But such learning does happen. Stabilization by feedback-based learning thus explains many of the robust behavioral outcomes we observe in and around us.
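The strengthening and weakening of connections that Shea describes can be sketched as a simple table of connection strengths. The dict-based "connection", the learning rate, and the example patterns below are all illustrative assumptions, not part of Shea's account:

```python
# Sketch of feedback-driven stabilization: a learned connection between a
# recognized input pattern and a behavior is strengthened by positive
# learning signals and weakened by negative ones.

connections = {}   # (pattern, behavior) -> connection strength

def learn(pattern, behavior, signal, lr=0.25):
    key = (pattern, behavior)
    connections[key] = connections.get(key, 0.0) + lr * signal

# Repeated positive feedback stabilizes the behavior for that pattern...
for _ in range(4):
    learn("red round object", "reach and eat", signal=+1.0)
# ...while negative feedback weakens the corresponding connection:
learn("wasp", "grab", signal=-1.0)
```

After this, the connection for the rewarded pattern has strength 1.0 while the punished one sits at -0.25: the stronger connection will more robustly produce the behavior whenever the pattern is recognized, which is the stabilization at issue.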
Swampman, too, seems to be capable of such learning, since despite not possessing the evolutionary function of consuming apples, the prior consumption of apples during his own lifetime and the internalization of the positive effects of apple consumption have allowed him to learn to consume apples robustly. Does the A2C model robustly produce outcomes of keeping the pole upright because of such a process of stabilization? Given that it is quite literally a model that alters inputs to its action-selection policy based on feedback, it's clear that this is the case. When the model executes a selected action Ai, based on the expected reward of that action Vi under the particular circumstances at the time the action was selected, an actual reward is reaped. The RPE, as the mathematical difference between the expected and actual rewards (to reiterate: δ = r - Vi), is a measure of how much the reward was over- or underestimated. If this difference is positive, this means that the model's Vi value was an underestimation of the reward for the action just selected; the model responds by increasing Vi proportionally. Given the way the Vi value is used by the actor in its action selection, an increase in Vi has the effect of increasing the actor's preferential bias for selecting the corresponding action, Ai. Thus the RPE value, when mathematically positive, constitutes a positive learning signal, since it strengthens the connection — the connection being, quite literally, the preferential bias the model has for that action under those circumstances — between the inputs to the model (what the state of the pole's incline was at the time of action selection) and the most recently performed action. A symmetric explanation exists for when the RPE value is negative and the model is found to have overestimated the reward for the action just selected.
This would constitute a negative learning signal, weakening the preferential bias the model has for that action under those circumstances. And of course, in the event that the actual and expected rewards are equal (that is, Vi = r), the RPE value is 0 and the learning signal that is produced is a neutral one. Assuming there is actually a pattern of behavior to be learnt (as opposed to similar inputs creating positive, negative or neutral signals randomly), which is entirely conceivable given the CartPole task, these signals do have the effect of stabilizing the action selection of the model — they cause the model to predictably choose to move the cart rightwards when the pole is tipping towards the right, and leftwards when the pole is tipping to the left. So it does seem that a sufficiently trained A2C model produces outcomes robustly, and does so through a stabilizing, non-accidental process. Now that we have ascertained that the model does behave in this manner, to account for representational content what is left to be explained is how the workings of internal components can produce behavior that is responsive to distal properties — that is, whether there is a representational capacity that facilitates these workings. Shea argues that there is one if internal components stand in relations to properties of the environment that are relevant for the execution of the task function, and if the system actually exploits those relations in that execution. He explains this notion of correlation in terms of probability: a system's internal component α being in a state F carries exploitable correlational information about a component of its environment β being in a state G if and only if there are regions D and D' such that if α is in D and β is in D', then either (i) ℙ(Gβ | Fα) > ℙ(Gβ) for a univocal reason, or (ii) ℙ(Gβ | Fα) < ℙ(Gβ) for a univocal reason.
Concretely, a correlation is exploitable if the conditional probability of the environment being in a state G, given that the system is in the state F, is greater (or less) than the absolute probability of the environment being in the state G, for a univocal reason. That is, the system's being in the state F reliably indicates the value of a distal variable. Shea qualifies his use of the term "region" as encompassing not only regions of a spatiotemporal kind, but also sets and other kinds of collections; we needn't explore the complexities behind that here. Additionally, Shea's proposal also requires that the correlational information encoded by the internal state indicate the value of its corresponding distal variable in an unmediated way, so as to address content-indeterminacy worries. He explains this notion with the example of a frog's fly-catching mechanism: the correlation between a frog's retinal-ganglion firing (R) and little black objects (condition C) is mediated by a further correlation between R and the fly's being a nutritious flying object (condition C'), since without the latter correlation, we wouldn't be able to explain how the former correlation enables frogs to achieve the evolutionarily beneficial outcome of consuming flies and capturing the nutrition they provide. It is important for us to target the distal property to which an internal state has a fundamental, explanatorily significant connection, because if there really is such a connection, then there needs to be a fact of the matter about exactly what the internal state is connected to; furthermore, for the correlation to qualify as content-constituting under Shea's theory, it must actually be exploited by the system, and so there must also be a fact of the matter about what information is actually doing the behavior-producing work.
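Shea's condition (i) can be illustrated empirically on toy data: the internal state F carries exploitable correlational information about the distal state G when ℙ(G | F) > ℙ(G). The sample data below are invented solely to make the inequality concrete:

```python
# Each pair records (internal state is F?, distal state is G?).
# These toy observations are invented for illustration.
samples = [
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True),
]

# Absolute probability of G across all observations.
p_g = sum(g for _, g in samples) / len(samples)

# Conditional probability of G given that the internal state is F.
f_samples = [g for f, g in samples if f]
p_g_given_f = sum(f_samples) / len(f_samples)
```

Here ℙ(G) = 3/6 = 0.5 but ℙ(G | F) = 2/3: being in F raises the probability that G obtains, so the F–G correlation satisfies condition (i) and is exploitable in Shea's sense.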
Shea's characterization of correlational information in terms of probability is plausible, given that a true correlation would in fact verify the inequalities stated in his theory. Can we apply this theory to the internal quantities that figure in the A2C model? Take the RPE signal, δ, as an example. It should be exceedingly apparent that the δ value correlates with the mathematical difference between the actual reward and the expected reward, since that is precisely how it is calculated. For any value θ that the RPE δ takes, the probability that the difference between the actual and expected reward equals θ (this is the distal state Gβ, in Shea's definition outlined above), conditioned on the fact that the RPE δ equals θ (internal state Fα), is maximal, at 1, and decidedly greater than (or equal to) the absolute probability that said difference is θ — that is, the probability that state Gβ obtains distally, independent of whether any other state obtains. In the context of the A2C model, the conditional probability denoted by ℙ(Gβ | Fα) is maximal simply because the RPE value is calculated directly from the difference between the actual and expected reward; condition (i) of Shea's definition is thus satisfied by default. Therefore, the RPE value does in fact correlate with this difference. However, whether the model actually exploits this correlation remains another question altogether. Given that the model operates so as to maximize the reward it receives (indeed, this is presupposed), the fact that the exploitation of the aforementioned correlation contributes to the model's success is a prima facie reason to believe said correlation is the one being exploited. As we already understand, the RPE signal plays a kind of imperative role within the model: its sign dictates the direction in which Vi is adjusted, and its magnitude dictates by how much.
Vi is decreased, and consequently the probability that the corresponding action Ai is selected at the next time-step is decreased as well, if the reward for executing Ai at the current time-step was less than estimated (and thus gave rise to a negative RPE); likewise, Vi and the probability that Ai is selected in future are increased if the reward received was larger than estimated (and gave rise to a positive RPE). This makes intuitive sense, since the model should increase the preferential bias towards the selected action if the reward received exceeded what was expected, given that it was precisely that expectation that grounded the preferential bias towards the selection of that action in the first place (and likewise decrease it if the reward fell short of what was expected). More importantly, the A2C model so designed has been shown, in the computer science literature, to be an effective strategy for such reward-garnering tasks (Sutton and Barto, 2018), but only on the condition that the reward received actually deviated from the expected reward as the RPE signal suggests. That is, the RPE signal is only beneficial to the system in virtue of its tight correlation with the difference between the actual and expected reward. This makes a convincing case that it is correlational information about this difference that the model exploits in order to improve its performance, and gather rewards more robustly. Could the model instead be exploiting the correlations the RPE might have with other distal properties? Consider the fact that the RPE also correlates with the objective, worldly expected reward for an action, in that it is a measure of the difference between the actual reward and the objective expected reward (in probabilistic terms, the product of the objective probability and magnitude of the reward), instead of a mere estimation of it that is made by the system.
In order to home in on the correlational information that is actually exploited by the system here, we have been looking to how the representation is processed downstream — that is, how it is causally involved in the system's task-performing procedure; this follows from Dretske's view (1988), and Carruthers (2009) also agrees that this gives us an adequate explanation of the content of the putative representations involved. Given this, it doesn't seem that a correlation between the RPE and the difference between the actual and objective expected reward offers a viable account of the function of the RPE signal within the model. This is because the RPE is supposed to provide a guideline for how to adjust Vi, but such a difference does not itself rationalize any such adjustment, since the objective expected reward, as a property of the world, is a measure of what the average payoff across a large number of trials would be (by the Law of Large Numbers), and the actual reward is unlikely to match this value in individual trials. In other words, an accurate Vi value does not preclude a disparity between the actual reward and the objective expected reward, and so this disparity should not warrant any adjustment of Vi whatsoever. Rather, it is evidence of inaccuracy in the system's Vi value itself that should warrant its adjustment, because action selection in future trials depends crucially on the preferential bias for selecting a given action that is given by this value. Additionally, it doesn't seem that the model's exploitation of the RPE's correlation with the difference between the actual and expected reward is mediated in any way by the aforementioned correlation it has with the difference between the actual and objective expected reward.
Updating Vi according to the error in predicting the reward for a given action that is encoded in the RPE provides an adequate explanation of the optimality of the model (Sutton and Barto, 2018), independent of the RPE's correlation with the difference between the actual and objective expected reward. The correlational information encoded by the RPE thus satisfies the requirement outlined in Shea's theory of being explanatory in an unmediated way. This shows that it is the RPE's correlation with the difference between the actual and internally-represented expected reward that is exploited by the model.

A2C AS A REPRESENTATIONAL REWARD-GATHERER

In sum, it seems that there is a plausible explanation for how the A2C model satisfies each of the three main requirements of a system's having contentful representations in support of the performance of task functions: the production of robust outcomes, the utilization of processes which have stabilized the outcome production, and the exploitation of (unmediated, explanatory) correlational information in the production of those outcomes. If Shea's teleosemantics is right, and it seems plausible that it is, then the A2C model does perform genuine task functions — in the case of the CartPole environment, the task function of keeping the pole standing upright on the cart — and it makes use of representations in service of this task function which genuinely have content. It would be useful here to revisit some of the representations that are purportedly used by the A2C model to produce the outcome of keeping the pole upright. One putative representation we have not yet examined in detail is Vi, the model's estimation of the expected reward of executing action Ai. Why is this representation contentful?
We find this in the way it is used by the model: as quantifying the expected reward in a probabilistic sense — that is, as the product of the probability of receiving the payoff and its magnitude, or more qualitatively, what the average payoff would be if Ai were repeatedly selected over the course of a large number of trials. Given that the mathematical principles governing the use of Vi within the actor's action-selection policy interpret Vi as an estimate of the average payoff, the quantity denoted by Vi thus has content in a semantically significant way. The fact that this action-selection policy contributes to the success of the model also suggests that Vi really does have that meaning. Moreover, insofar as Vi is an estimation, it is also a representation of this average payoff, since here, just as it is relevant to invoke veridicality conditions when reasoning about representations, we can do the same for Vi: if Vi were veridical, then the average payoff would actually be Vi if Ai were selected over multiple trials. Another putative representation we have already examined in some detail is the RPE signal itself. Again we look to how this representation figures in the model's operation in order to answer the question of whether it has content. The way that the quantity denoted by δ is used by the model presupposes that δ does have some semantic content — Shea himself endorses this view (2012), and classifies this content as being indicative and imperative. The model presupposes that δ has indicative content in that its sign indicates that the actual reward received was either higher or lower than Vi; we can attribute a veridicality condition to this content, since the sign can be said to be (in)accurate in indicating the direction of the disparity between Vi and the reward.
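The veridicality condition for Vi described above can be illustrated with a toy simulation. Suppose, purely for illustration, that selecting Ai pays 1.0 with probability 0.7 (so the objective expected reward is 0.7); a veridical Vi of 0.7 then matches the long-run average payoff:

```python
import random

# Toy illustration of Vi as an estimate of average payoff. The payoff
# magnitude (1.0) and probability (0.7) are invented for this example.
random.seed(0)   # fixed seed for reproducibility

trials = [1.0 if random.random() < 0.7 else 0.0 for _ in range(10_000)]
average_payoff = sum(trials) / len(trials)

v_i = 0.7   # a veridical estimate: it matches the true expected reward
```

Over many trials, `average_payoff` comes out close to `v_i` (up to sampling noise), which is what it means for the estimate to be veridical; an inaccurate Vi would systematically diverge from this long-run average.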
Furthermore, it presupposes imperative content in ẟ, since its magnitude connotes an instruction to adjust Vi proportionally; here we attribute a satisfaction condition to this content, since it describes what an appropriate adjustment of Vi in response to this magnitude would be. The RPE thus seems to have the qualities that representations in general are widely believed to possess. The success of its exploitation by the model is predicated on the accuracy of its correlation with the degree to which the expected reward was under- or overestimated. The specific usage of ẟ within the model thereby confers on it the status of a contentful representation. So the A2C model performs task functions, and makes use of contentful representations to do so. It may have a representational capacity, and thus may be capable of representing certain things as rewards, but it remains a further question whether or not it actually does so. Perhaps the most suitable candidate for a representation of reward is the quantity that we denote as r. Why does r represent what it denotes as a reward? As the earlier sections of this paper have shown, what most crucially grounds a representatum as a reward is its contribution to the production of a learning signal within the representing system. But not just any kind of learning signal — rewards contribute to the production of a positive learning signal, one which has the potential to, among other things, cause pleasure in hedonic systems, and create behavioral patterns directed at positioning the organism to gather those rewards more effectively in the future. To be clear, this notion of positive does not entail a quantity's being mathematically greater than zero, nor does it pertain to the system's "likes" or "dislikes", or even to the quality of being beneficial.
It is meant to be a more abstract notion, expressing the learning signal's constructive effect on the system's disposition to seek out that which was responsible for producing the signal. Now, it seems that the quantity r has exactly that effect on the system: the very specific way in which it participates in the mathematical difference that the RPE consists in causes the system to alter its dispositions towards action selection in favor of more effective reward-gathering. To reiterate, the model interprets a (mathematically) positive RPE as a signal to increase Vi; this makes it more likely that the same action is chosen, and that the hitherto greater-than-expected reward is reaped again under similar circumstances, only if r precedes Vi in the difference calculation. If instead Vi preceded r in the calculation, and the system increased Vi in virtue of a positive result, then this would only tend to increase the degree to which Vi outstrips r, and would thus, in effect, encourage the model to select actions for which the reward is likely to be less than expected, thereby compromising its ability to reap rewards optimally. This effect is even clearer if a mathematically negative reward is reaped — a positive RPE where Vi precedes r in the difference calculation would encourage the model to literally reduce the total reward it has collected (given that total reward is defined as the sum of the individual rewards gathered at each action-execution step). Naturally, a symmetric explanation exists for the system's interpretation of a (mathematically) negative RPE. Thus the mathematical principles governing the use of the actual reward value, r, rightfully qualify the model as constituting r as a reward, that is, as something contributing to the production of a positive learning signal within the model. I have argued that this is the best way of characterizing a reward, and those arguments support a realist perspective on the A2C model's representation of reward.
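The operand-order point can be checked numerically. The following hypothetical sketch (again, not the paper's A2C code) applies the same rule, "increase the estimate in proportion to a positive RPE", under both orderings of the difference.

```python
# Hypothetical illustration of the operand-order argument: the model
# increases its estimate V on a positive RPE. Only when the RPE is
# computed as r - V does that rule pull V toward the actual reward;
# computing it as V - r makes the very same rule push V away from it.

def train(delta_of, true_reward=1.0, lr=0.1, steps=200):
    V = 0.0
    for _ in range(steps):
        V += lr * delta_of(V, true_reward)  # adjust V in proportion to the RPE
    return V

V_correct = train(lambda V, r: r - V)   # r precedes V: converges to r
V_reversed = train(lambda V, r: V - r)  # V precedes r: diverges

print(abs(V_correct - 1.0) < 1e-6)  # True: the estimate tracks the reward
print(abs(V_reversed) > 1e6)        # True: the estimate runs away
```

With the reversed ordering, each update compounds the disparity instead of correcting it, which is exactly why the model's reward-gathering would be compromised under that convention.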
The arguments canvassed thus far support the thesis that reinforcement learning models such as A2C can be regarded as possessing desires, because, as the preceding argument plausibly shows, they are capable of constituting things as rewards. To reiterate, there is a convincing case for identifying a desire for P with the constituting of P as a reward, because theorizing as such is conceptually sound, and the causal profile of rewards genuinely aligns with that of desires. Now, constituting something as a reward depends fundamentally on what it means for something to be a reward to a system. We have also seen compelling reasons to identify a reward as that which contributes to the production of a learning signal within that system; the learning signal is what grounds the formation of the other effects, like changes to behavior or emotional dispositions, phenomena that are often mistakenly identified with rewards when they should instead be explained by them. Putting these two arguments together, we have a complete reward-based theory of desire; and having affirmed that reinforcement learning models have the representational powers the theory requires, and that they do in fact use those powers to represent things as rewards, it seems that these models fit the reward-based theory of desire quite well. Perhaps, then, machines can have desires after all.

CONCLUDING REMARKS

There is an important connection between desiring something and viewing something as rewarding that plausibly qualifies as identity. The following is perhaps a commonly heard line of conversation: "Little Anwar seems to want to go to Disneyland particularly badly; he asks about it all the time." "Ah yes — we take him there after he finishes his exams at the end of every school year.
For him, it's a well-deserved reward for his hard work." For Anwar, and perhaps for the rest of us, desiring a trip to Disneyland is synonymous with treating such a trip as a reward. That doesn't mean that such psychological phenomena as being disposed to act so as to procure something we desire, or being disposed to feel pleasure when we do procure it, are not indicative of desires being at work. None of what has been said in this paper is aimed at refuting this commonsensical view. What this paper has hoped to show, rather, is that despite being characteristically associated with desire, these dispositions are decidedly inessential to its nature. It is reward that underlies its essence, because reward, as a natural kind, instantiates a causal profile strikingly similar to that of desire as a natural kind, and bears properties that provide the deepest explanatory account of the very dispositions we deem characteristic of desires. These arguments provide compelling reasons for the view that desiring something and constituting it as a reward are one and the same. A reward theory of desire doesn't immediately preclude such systems as machine learning models from being capable of having desires, because it doesn't presuppose the necessity of consciousness from the outset, as other theories of desire, such as a hedonic theory of desire, might do. All it requires for a system to have a desire for something is the capability to representationally constitute that thing as a reward — that is, to represent it as contributing to a characteristic reward learning signal. We have seen that it is entirely plausible that the reinforcement learning paradigm, exemplified by the A2C model, has such a capability, and can thus reasonably be said to have desires — if the reward theory of desire is right.
Dominant theories of desire condemn such systems to wholly desire-less lives; the arguments canvassed in this paper should, at the very least, convince us to question the intuitions on which these views are founded.

BIBLIOGRAPHY

American Cancer Society. (2015). Placebo effect. https://www.cancer.org/treatment/treatments-and-side-effects/clinical-trials/placebo-effect.html

Arpaly, N., & Schroeder, T. (2014). In praise of desire. Oxford University Press.

Bao, S., Chan, V. T., & Merzenich, M. M. (2001). Cortical remodelling induced by activity of ventral tegmental dopamine neurons. Nature, 412(6842), 79.

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5), 834-846.

Berridge, K. C. (2003). Pleasures of the brain. Brain and Cognition, 52(1), 106-128.

Berridge, K. C., & Robinson, T. E. (1998). What is the role of dopamine in reward: Hedonic impact, reward learning, or incentive salience? Brain Research Reviews, 28(3), 309-369.

Berridge, K. C., & Valenstein, E. S. (1991). What psychological process mediates feeding evoked by electrical stimulation of the lateral hypothalamus? Behavioral Neuroscience, 105(1), 3.

Breland, K., & Breland, M. (1961). The misbehavior of organisms. American Psychologist, 16(11), 681.

Carruthers, P. (2009). How we know our own minds: The relationship between mindreading and metacognition. Behavioral and Brain Sciences, 32, 164-182.

Dretske, F. I. (1991). Explaining behavior: Reasons in a world of causes. MIT Press.

Egelman, D. M., Person, C., & Montague, P. R. (1998). A computational role for dopamine delivery in human decision-making. Journal of Cognitive Neuroscience, 10(5), 623-630.

Goodale, M. A., Pelisson, D., & Prablanc, C. (1986). Large adjustments in visually guided reaching do not depend on vision of the hand or perception of target displacement. Nature, 320(6064), 748.

Heath, R. G. (1963). Electrical self-stimulation of the brain in man. American Journal of Psychiatry, 120(6), 571-577.

Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (Eds.). (2000). Principles of neural science (Vol. 4, pp. 1227-1246). New York: McGraw-Hill.

Knowlton, B. J., Mangels, J. A., & Squire, L. R. (1996). A neostriatal habit learning system in humans. Science, 273(5280), 1399-1402.

Kripke, S. A. (1972). Naming and necessity. In Semantics of natural language (pp. 253-355). Springer, Dordrecht.

Lamb, R. J., Preston, K. L., Schindler, C. W., Meisch, R. A., Davis, F., Katz, J. L., ... & Goldberg, S. R. (1991). The reinforcing and subjective effects of morphine in post-addicts: A dose-response study. Journal of Pharmacology and Experimental Therapeutics, 259(3), 1165-1173.

Langston, J. W., & Palfreman, J. (1995). The case of the frozen addicts. New York: Pantheon Books.

Liechti, M. E., Saur, M. R., Gamma, A., Hell, D., & Vollenweider, F. X. (2000). Psychological and physiological effects of MDMA ("Ecstasy") after pretreatment with the 5-HT2 antagonist ketanserin in healthy humans. Neuropsychopharmacology, 23(4), 396.

Liechti, M. E., & Vollenweider, F. X. (2000). Acute psychological and physiological effects of MDMA ("Ecstasy") after haloperidol pretreatment in healthy humans. European Neuropsychopharmacology, 10(4), 289-295.

Millikan, R. G. (1984). Language, thought, and other biological categories: New foundations for realism. MIT Press.

Milner, D., & Goodale, M. (2006). The visual brain in action. Oxford University Press.

Nagel, E. (1977). Goal-directed processes in biology. The Journal of Philosophy, 74(5), 261-279.

Neander, K. (1995). Misrepresenting & malfunctioning. Philosophical Studies, 79(2), 109-141.

Neander, K. (2017). A mark of the mental: In defense of informational teleosemantics. MIT Press.

Putnam, H. (1975). The meaning of 'meaning'. Philosophical Papers, 2.

Schindler, I., Rice, N. J., McIntosh, R. D., Rossetti, Y., Vighetto, A., & Milner, A. D. (2004). Automatic avoidance of obstacles is a dorsal stream function: Evidence from optic ataxia. Nature Neuroscience, 7(7), 779.

Schroeder, T. (2004). Three faces of desire. Oxford University Press.

Shea, N. (2012). Reward prediction error signals are meta-representational. Noûs, 48(2), 314.

Shea, N. (2018). Representation in cognitive science. Oxford University Press.

Skinner, B. F. (1948). 'Superstition' in the pigeon. Journal of Experimental Psychology, 38(2), 168-172.

Steiner, J. E., Glaser, D., Hawilo, M. E., & Berridge, K. C. (2001). Comparative expression of hedonic impact: Affective reactions to taste by human infants and other primates. Neuroscience & Biobehavioral Reviews, 25(1), 53-74.

Stellar, J. (2012). The neurobiology of motivation and reward. Springer Science & Business Media.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.