The PGD model: giving Space to Reinforcement Learning Temporal models

Anonymous Author(s) Affiliation Address email

Abstract

Imagine that the CS to US delay of reinforcement learning (RL) experiments is the traversal duration of a reward item along a pre-oral pipe-shaped segment fitted with a reward sensor, from an entry port (at CS) to an exit port (at US). Downstream, reward items are transmitted to an intra-body segment and then steadily transmitted to reward consumers, notably the brain. Dopamine (DA) specialists focusing on feeding neurobiology show that most segments along the nutritive path from mouth to brain are fitted with caloric sensors projecting to dopamine circuits. Stimulating this natural PGD structural model reproduces the main response component dynamics of the phasic signal recorded on pretrained vertebrates’ DA neurons during basic RL scenarios, notably: unexpected, predicted, and omitted rewards. Spatiotemporal analysis leads to signal production and learning rules which differ slightly but fundamentally from TD (Temporal Difference) rules. The PGD model suggests that learning may spread the PGD structure upstream, by recruiting cognitive segments which apparently transmit value to its foremost entry ports. The PGD model quantifies value logistics, including value transformation. It should benefit numerous domains, notably biology and thus medicine, neurosciences, neuroeconomics, psychology of reward (and probably of other vital values: damage risk, care, sex, ...), robotics, and human activities related to value logistics (among others: production and economy).


1 Introduction to the PGD model

A classical way to make an animal expect a reward (e.g. delivery of a bolus of sucrose) is to train it with several cue-reward pairings, with a constant delay D between a cue (e.g. a tone) and a signal indicating the actual and immediate reward delivery (classically denoted “US”, i.e. unconditioned stimulus). After this training, the announcing cue becomes a “CS” (a conditioned stimulus). Let δDA(t) denote the main phasic component signal produced by vertebrates’ midbrain DA (dopamine) neurons, devoid of biological details (e.g. baseline). DAP stands for “phasic DA”, PIC is a positive impulse, DIP is a negative impulse. Let us call URPROR the three following basic DAP scenarios (Fig. 1) [7]:

- “UR” (unexpected reward) elicits a PIC at US.
- “PR” (predicted reward) elicits a PIC at CS and no significant DAP signal when the reward is actually provided on time.
- “OR” (omitted reward) elicits a PIC at CS and a DIP on omission of an expected reward.

The TD (temporal difference) algorithm computes a difference δTD(t) between expected and experienced reward acquisition [1]. When a TD device is trained and then stimulated with scenarios such as URPROR, δTD(t) shows a striking similarity with the phasic component of the δDA(t) signal recorded when a trained animal is stimulated by the same scenarios, hence a highly active study domain, denoted hereafter DAP/TD. This similarity led many DAP and TD authors (e.g. [3, 5, 9]) to


postulate the RPE (reward prediction error) hypothesis: δDA(t) and δTD(t) indicate a discrepancy between reward prediction and actual experience.

Figure 1: URPROR scenarios. Left: DA signal. Right: PGD viewpoint.

The initial insight about the PGD cognitive model resulted from the idea of a natural temporal bridge between CS and US. Imagine a sensor ĊCS that measures the amount RCS(t) of a given valuable substance (e.g. sugar concentration) in its surroundings, which I call ṠCS. Assume that ṠCS is a pipe-shaped segment, fitted with the ĊCS sensor and three ports: a CS entry port, and two exit ports, namely a transmission port and an omission port. Let dRCS(t) denote the variation of RCS(t) during one unit of time: dRCS(t) := RCS(t)−RCS(t−1). RCS(t) is a level signal, dRCS(t) is a variation signal. An admission into a segment occurs when an item enters this segment. An emission from a segment occurs when an item exits from this segment.

Now imagine another signal dRUS(t) which indicates variations of the same valuable substance as dRCS, but inside another segment ṠUS. For example, imagine that ṠUS is the nutritive duct from the mouth to a brain sensor (Fig. 2). This long segment is made of several shorter abutting segments. DA specialists focusing on feeding neurobiology (see e.g. [10]) depict a “gut–brain dopamine axis” fitted with several caloric sensors scattered all along the path from mouth to brain; dopamine sensing may function as a central caloric sensor.

An axiogogue is a structured set of real segments Ṡk (k=0···K) conveying value items to consumers downstream. Axiogogues typically have a tree structure. Some of their parts may be graphs (value path divergence then convergence). Let ṠCU denote the axiogogue ṠCS + ṠUS. In the present case (Fig. 1), the transmission port of ṠCS abuts an admission port of ṠUS, so that some items are directly transmitted from ṠCS to ṠUS. Some items may exit from ṠCS (and from ṠCU) via the omission port of ṠCS without entering ṠUS: they are omitted from ṠCS. Sometimes an item emitted from ṠCS is partly transmitted to other segments of ṠCU (to ṠUS in the present case), and partly omitted from ṠCU during the same time interval: this is a partial transmission.

Assume that items admitted into ṠUS are steadily emitted later from ṠUS. For example, the brain steadily consumes glucose which was ingested earlier as a bolus and rapidly absorbed. The intra-body nutritive segment is a RASE segment: rapid admission (usually), slow emission (structurally).

Figure 2: The extended nutritive system


Figure 3: RASE segment and exp-pulse

Exponential decay RASE segments emit at a rate proportional to their current level (rate constant λ). Fig. 3 shows the typical level response EPλt0(t) (called hereafter exp-pulse) and the variation response dEPλt0(t) of an exponential decay RASE segment to a rapid admission at t0: EPλt0(t) := e^(−λ(t−t0)) if t ≥ t0, else EPλt0(t) := 0. Hence, while admission of a significant item into ṠUS elicits a dRUS(t) PIC, its later omission from ṠUS does not elicit a significant dRUS(t) DIP. Pharmacokineticists are familiar with such asymmetric dynamics.

The CS-US delay produced by the artificial learning apparatus may be considered as the ratio L/V, L being the spatial length of the segment between the cue (at CS) and the reward receptors (at US), and V the usual speed of reward items in this segment. Now consider a ΠCU device, which produces the signal δΠ(t) = dRCS(t) + dRUS(t). δΠ(t) dynamics are similar to those of δDA(t) when ΠCU is stimulated by each of the three URPROR scenarios (Fig. 1): in the PR scenario, δΠ(t) ≃ 0 during the ṠCS to ṠUS transmission; in the OR scenario, reward omission elicits a significant δΠ(t) DIP.

Hereafter I define the structural, functional, and dynamical PGD cognitive models sketched above, aiming to derive simple δΠ(t) production and learning rules. PGD beginners will find useful information in the supplementary material [15].

CS announces a future reward. Announcement is theoretically transitive: if cue K announces a later event J, and if prior cue L announces K, then L announces J. Thus announcement learning is theoretically transitive. Indirect experimental evidence (e.g. fMRI records) suggests a high-order (> 2) DAP learning transitivity. However, direct evidence (DA neuron recording during simple scenarios) is lacking, both for and against high-order conditioning.
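To make the ΠCU dynamics above concrete, here is a minimal simulation sketch of the ΠCU device, in Python. The scenario timings, the unit item magnitude, and the decay rate λ are illustrative assumptions, not values taken from the paper:

# Sketch: Pi_CU device producing delta_Pi(t) = dR_CS(t) + dR_US(t).
# Assumed toy parameters: unit reward item, CS at t=5, US event at t=15,
# exponential-decay rate lam=0.2 for the RASE segment S_US.
import math

def run(scenario, T=40, t_cs=5, t_us=15, lam=0.2):
    R_cs, R_us = [0.0] * T, [0.0] * T
    for t in range(1, T):
        R_us[t] = R_us[t-1] * math.exp(-lam)   # S_US: slow structural emission
        R_cs[t] = R_cs[t-1]
        if scenario in ("PR", "OR") and t == t_cs:
            R_cs[t] += 1.0                     # admission into S_CS at CS
        if t == t_us:
            if scenario == "UR":
                R_us[t] += 1.0                 # unexpected admission into S_US
            elif scenario == "PR":
                R_cs[t] -= 1.0                 # emission from S_CS ...
                R_us[t] += 1.0                 # ... transmitted to S_US
            elif scenario == "OR":
                R_cs[t] -= 1.0                 # omission: the item leaves S_CU
    # delta_Pi(t) = dR_CS(t) + dR_US(t)
    return [(R_cs[t] - R_cs[t-1]) + (R_us[t] - R_us[t-1]) for t in range(1, T)]

for s in ("UR", "PR", "OR"):
    d = run(s)
    print(s, "max:", round(max(d), 2), "min:", round(min(d), 2))
# Expected shape: UR -> PIC at US; PR -> PIC at CS and ~0 at the transmission;
# OR -> PIC at CS and a DIP at the usual delivery time.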


So far, comparison between δTD(t) and δDA(t) has mainly focused on the classical DAP/TD stimulation frame described above: rewards are usually delivered into (or just before) the mouth, and CS is a proximal cue delivered shortly before the oral admission: both CS and US lie in a proximal pre-oral and oral space domain. However, the PGD structural model and theoretical announcement learning transitivity, strengthened by growing neurobiological evidence, suggest an extension of the classical CS-US span, both downstream and upstream. Downstream, the oral reward signal (classically denoted US) may also act as a reward cue for later reward admissions into downstream segments (stomach, upper intestine, ... and eventually the brain). Production of the δDA(t) signal related to post-ingestive segments by a PGD cognitive device seems plausible. PGD learning seems more speculative in this intra-body domain (where neuronal wiring could be innate). Upstream, a signal usually considered as a cue (classically denoted CS) could also be considered (at least theoretically) as a reward announced by more distal upstream cues, which may themselves be rewards announced by even more distal cues.


2 PGD functional and structural architectures

A PGD device is a natural or artificial device which implements the PGD model. A driven device usually includes three functions: a cognitive function, including perception (sensing) and cognition; a motor function (involving e.g. motor neurons); and a driving function which uses the


cognitive function output to act on the motor function. A driven device may embed zero, one, or several PGD cognitive devices, each one tracking variations of a specific valuable substance, indicated by a reference signal, e.g. dR0glucose(t) or dR0water(t). In this paper, I consider only one PGD cognitive device, based on a reference variation signal dR0(t). I assume that the cognitive components of the PGD device are functionally separated from the driving and motor parts, and I focus on the cognitive component of the PGD device, the PGD cognitive device.

I present a discrete time PGD cognitive model. Time is divided into time steps of constant duration dt. For simplification, let dt = 1. Operator d computes the difference between a signal U(t) at the current time step t and the same signal U(t−1) at the previous time step: dU(t) := U(t)−U(t−1). Where unambiguous, I omit the currently considered time step t, writing U instead of U(t).

A PGD cognitive device is composed of a PGD core function, denoted Π♥, and a set of PGD cognitive input operators (notably delay operators), denoted ΠCIO. Π♥ has K+1 inputs: one reference signal dR0(t), and K recruitable inputs dRk(t), k=1···K. Π♥ inputs are produced by ΠCIO, which transform N perceived signals dXi(t) into K+1 signals dRj(t). At time step t, Π♥ gets K+1 input signals dRk(t), and computes K weights Wk(t) and the output signal δΠ(t) using W0 and the K weights Wk(t−1) computed at t−1. Assuming that dR0(t) indicates the variations of a set v0 of substances inside a real segment, let Ṡ0 denote this segment, and ṠA the axiogogue ended downstream by Ṡ0. Let Sk (k=0···K) denote the sensitivity (spatial) domain of dRk(t). The Sk (k=1···K) are cognitive segments. S0 and cognitive segments are denoted with an undotted ‘S’; they “represent” the axiogogue reality, transformed (sometimes altered) by ΠCIO.

dṼk(t) is the product of a dRk(t) signal by its weight Wk computed at t−1: dṼk(t) := Wk(t−1)×dRk(t). dṼk(t) is the estimation at t, by Π♥, of the variation, between instants t−1 and t, of the net value held by Sk (net value is defined below). Let SΠ(t) (or simply SΠ) denote the set of Sk segments with a non-null weight. The Π♥ output is a sum dṼΠ(t), also denoted δΠ(t), of all dṼk(t) signals (k=0···K):

δΠ(t) := dṼΠ(t) := ΣSΠ dṼk(t) = Σk=0···K dṼk(t) = Σk=0···K Wk(t−1) × dRk(t)    (1)


dṼΠ(t) is the estimation at t, by Π♥, of the variation, between instants t−1 and t, of the net value held by SΠ. The Π♥ functional architecture defined above is a Π♥d×Σ one: derivation (d), then weighting (×), then sum (Σ). Π♥×Σd is an equivalent alternative functional architecture, since the d, ×, and Σ operators are linear. Π♥×Σd produces both ṼΠ(t) and dṼΠ(t) := δΠ(t). In this paper I concentrate on the Π♥ architecture necessary to produce dṼΠ(t): Π♥ stands for Π♥d×Σ.
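As a concrete reading of (1), here is a minimal sketch of the Π♥d×Σ core in Python: it keeps the weights Wk, differentiates level inputs (the d step), and outputs the weighted sum of the variations. The class name, the level-signal interface, and the toy call are illustrative assumptions, not an implementation prescribed by the paper:

# Sketch of the Pi-heart d-then-x-then-Sigma core:
#   delta_Pi(t) = sum_k W_k(t-1) * dR_k(t)        (equation (1))
class PGDCore:
    def __init__(self, n_inputs, w0=1.0):
        # W_0 is a fixed strictly positive constant (reference segment S_0);
        # recruitable weights W_1..W_K start at 0 (novice device).
        self.W = [w0] + [0.0] * (n_inputs - 1)
        self.prev_R = [0.0] * n_inputs

    def step(self, R):
        """R: current level signals R_k(t); returns delta_Pi(t)."""
        dR = [r - p for r, p in zip(R, self.prev_R)]    # d: variations dR_k(t)
        self.prev_R = list(R)
        return sum(w * d for w, d in zip(self.W, dR))   # x then Sigma

core = PGDCore(n_inputs=2)
print(core.step([0.0, 0.0]), core.step([1.0, 0.0]))  # admission into S_0 -> delta_Pi = W_0

(In the paper the dRk(t) variations are delivered by ΠCIO; the differencing is shown here only to make the d×Σ composition explicit.)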

3 PGD cognitive learning: an overview

Initially, all weights Wk are set to 0, except W0, which is assumed to be a strictly positive constant. The PGD cognitive device is then a novice one, SΠ is truncated to S0, and dṼΠ(t) = dṼ0(t) = W0×dR0(t), i.e. dṼΠ(t) is initially the measure of gross value variations inside S0. Π♥ recruits cognitive segments in order to spread SΠ upstream. When Π♥ recruits a given cognitive segment Sk, it increases its weight Wk from 0 to a strictly positive weight, thus adding Sk to SΠ. Progressively, PGD cognitive learning adds segments to SΠ.

In a novice PGD cognitive device, learning first recruits stratum 1 segments, i.e. segments which apparently transmit directly to S0 (which belongs to stratum 0). Then, when SΠ contains stratum 1 segments, learning may add stratum 2 segments, i.e. segments which apparently transmit directly to stratum 1 segments, and so on. Hence net value estimation learning backpropagates upwards, starting from the reference segment S0. Qualitatively, PGD cognitive learning progressively spreads SΠ upstream, by recruiting cognitive segments at the foremost entry ports of the SΠ structure (see the sketch at the end of this section). Quantitatively, PGD cognitive learning adjusts the Wk weights, providing Π♥ and the driving function with an estimation of the net value afforded by an SΠ acquisition as soon as it occurs. A PGD cognitive device is mature when its weights no longer vary significantly in a stationary context.

The prognodendron is a cognitive object, a mental “representation” of all the segments belonging to SΠ and their spatial structure, which is mainly a tree (it may include graph parts, notably when it represents human activities). While human brains are able to imagine or even draw prognodendrons, a basic PGD cognitive device is space-agnostic (section 7), unaware of its prognodendron structure.


PGD and Π stand for prognodendron, which could mean “foreknowledge tree”. Keeping the prognodendron construct in mind is useful to understand and to study the PGD model and its numerous applications.
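The following minimal sketch illustrates the stratum-by-stratum recruitment described above on a toy chain S2 → S1 → S0, using the learning rule (3) defined in section 5. The episode timing, the learning rate, and the unit magnitudes are illustrative assumptions:

# Sketch: PGD learning spreads S_Pi upstream on a chain S_2 -> S_1 -> S_0.
# Each episode, one item admits into S_2, is transmitted to S_1, then to S_0.
alpha = 0.2
W = {0: 1.0, 1: 0.0, 2: 0.0}            # W_0 fixed and positive; others novice

def episode():
    events = [{2: +1.0},                 # admission into S_2 (distal segment)
              {2: -1.0, 1: +1.0},        # transmission S_2 -> S_1
              {1: -1.0, 0: +1.0}]        # transmission S_1 -> S_0 (reference)
    for dR in events:
        dV_pi = sum(W[k] * d for k, d in dR.items())   # delta_Pi at this step
        for k, d in dR.items():
            if k != 0:                   # the reference weight W_0 is not adjusted
                W[k] += alpha * max(-d, 0.0) * dV_pi    # rule (3), section 5

for ep in range(1, 4):
    episode()
    print(ep, round(W[1], 2), round(W[2], 2))
# Episode 1 recruits only S_1 (stratum 1); W_2 starts rising from episode 2,
# once S_1 belongs to S_Pi: recruitment backpropagates upstream.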

4 Net value estimation

Let Vnet(ḃ, tX) denote the net value of an item ḃ located in Ṡk at tX. Vnet(ḃ, tX) is the integral of the variations dṼ0(t) after tX, letting ḃ follow its “natural” fate departing from Ṡk, including transmissions, possible partial omissions, and maybe transformations (section 7), ceteris paribus (neither admission into nor emission from ṠA of other items): Vnet(ḃ, tX) := Σt=tX+1···∞ dṼ0(t). Thus the net value of an item ḃ at tX is the total gross value that will eventually be admitted at Ṡ0 after tX following its natural fate, ceteris paribus. The net value measure is only available after the duration needed for ḃ to “naturally” flow downstream from Sk to S0. This duration is usually bounded. However, a driven device needs to know the value afforded by an admission as soon as it occurs, in order to reinforce (if immature, or else adjust) its potential causes shortly afterwards: the weight of an upstream segment that has just emitted, or a just-triggered action which seems to favor this admission. The strength of such reinforcements should depend on transmission evidence, which itself depends both on the magnitude of potential causes (emissions or actions) and on the magnitude of the noticed effect (an admission into SΠ). Π♥ deals with these temporal credit assignment and magnitude issues by backpropagating a net value estimation capability (stored in the Wk weights), from stratum 1 segments up to distal upstream segments, as explained below.

Consider a recruitable signal dRk(t) and its corresponding real (Ṡk) and cognitive (Sk) segments. Let ELTS[Sk, tE] denote a time step tE such that dRk(tE) < 0: from the Π♥ point of view, an item may have been emitted at tE from Ṡk; it is an “emission-like time step” (ELTS). Assume that an item ḃE is really emitted from Ṡk at tE: dRk(tE) < 0. Let S̄k denote the complement of Sk in SΠ: S̄k := SΠ − Sk. Ṡ̄k is the real segment corresponding to S̄k. A non-null part ḃT of ḃE is transmitted to Ṡ̄k at tE, thus eliciting a strictly positive variation dṼk̄(tE) > 0. The remaining part ḃO = ḃE − ḃT is omitted from ṠΠ. Let Ṽ(ḃ, t) denote the estimated net value of an item ḃ at t. Ṽ(ḃO, tE) = 0, because only cognitive segments belonging to SΠ have non-null weights. Thus Ṽ(ḃE, tE) = Ṽ(ḃT, tE) + Ṽ(ḃO, tE) = Ṽ(ḃT, tE). Assume the following (unrealistic: see below) ceteris paribus condition: nothing other than the emission of ḃE from Ṡk (including its non-null transmission to Ṡ̄k) happens at tE. Thus dṼk(tE) = −Ṽ(ḃE, tE), and dṼk̄(tE) = Ṽ(ḃT, tE) = Ṽ(ḃE, tE) = −dṼk(tE). Let W*k(tE) denote the ratio dṼk̄(tE)/(−dRk(tE)). So dṼk(tE) = −dṼk̄(tE) = [dṼk̄(tE)/(−dRk(tE))] × dRk(tE), hence dṼk(tE) = W*k(tE) × dRk(tE). Assume W*k(tE) is constant (denoted W*k) at every ELTS[Sk, tE]. Thus dṼk(tE) = W*k × dRk(tE) at every ELTS[Sk, tE]. In that case, Π♥ could perform one-shot learning, storing the constant W*k in Wk at the first experienced ELTS[Sk, tE]: Wk(tE) ← W*k = dṼk̄(tE)/(−dRk(tE)). But W*k(tE) is seldom constant. For example, consider a boolean segment Ṡdice which gives you $6 each time you get a 6 (win), and $0 otherwise (void). Thus W*dice(twin) = $6, W*dice(tvoid) = $0, and on average W̄*dice = $1.

To deal with W*k variability, Π♥ computes a moving average W̄*k(t) of the various W*k(tEi) experienced during recent ELTS[Sk, tEi], which it stores in the Wk weight at each ELTS[Sk, tEi]: Wk(tEi) ← W̄*k(tEi). The above rationale is correct if both the numerator and the denominator of W*k(tE) = dṼk̄(tE)/(−dRk(tE)) are elicited only by the emission of ḃE from Sk, ceteris paribus. W*k(tE) may be wrong if other items are admitted into or emitted from SΠ at tE; notably in case of transmission illusion, when a fortuitous coincidence happens at tE between an omission of ḃE from Ṡk and an admission of another item ḃI into Ṡ̄k. Π♥ is not aware of what really happens at tE, hence it cannot distinguish between a real transmission and a transmission illusion. In both cases it notices an apparent transmission, i.e. dRk(tE) < 0 and dṼk̄(tE) > 0. Averaging recently experienced W*k(tEi) usually cleans up inappropriate adjustments due to such illusions.

Thanks to Wk adjustment at previous time steps, Π♥ obtains an estimation dṼk(t) = Wk(t−1)×dRk(t) of the net value variation inside Sk at each time step t: either a net value loss during emission-like time steps (dṼk(tE) < 0), or a net value acquisition during admission-like time steps (dṼk(tA) > 0),


or no variation. More interestingly, for a driving device using dṼΠ(t) = δΠ(t) as a training signal, a positive δΠ(t) indicates either a net value acquisition into SΠ at t (in which case the driving device should reinforce a recently triggered action which seems to favor such a net value acquisition), or a better-than-average transmission inside SΠ (better and worse than average transmissions occur frequently: see the Ṡdice illustration above).
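A small numeric sketch of the Ṡdice illustration: each roll is an emission-like time step with dRdice = −1, the complement admits $6 on a win and $0 otherwise, and averaging the one-shot ratios W*dice(tE) recovers the $1 expected net value per emission. The uniform-die enumeration is the only added assumption:

# Sketch: one-shot ratios W*_dice(t_E) = dV~_kbar(t_E) / (-dR_dice(t_E))
# for the boolean dice segment (one winning face out of six).
samples = []
for face in range(1, 7):                      # the six equiprobable outcomes
    dV_complement = 6.0 if face == 6 else 0.0 # value appearing downstream ($)
    dR_dice = -1.0                            # boolean emission from S_dice
    samples.append(dV_complement / (-dR_dice))
print(samples)                       # [0.0, 0.0, 0.0, 0.0, 0.0, 6.0]
print(sum(samples) / len(samples))   # 1.0 -> average W*_dice = $1 per roll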

5 A simple PGD cognitive learning rule

W*k(tE) is usually variable. A classical method of dealing with sample variability in non-stationary contexts is to update an exponential moving average W̄*k(t): Wk(tE) = Wk(tE−1) + a × [W*k(tE) − Wk(tE−1)] at each ELTS[Sk, tE]. The learning rate a is set such that 0 < a ≪ 1. Let dWk(tE) := Wk(tE) − Wk(tE−1), and ∆W*k(tE) := W*k(tE) − Wk(tE−1). Hence dWk(tE) = a × ∆W*k(tE).

∆W*k(tE) = dṼk̄(tE)/(−dRk(tE)) − dṼk(tE)/dRk(tE) = [dṼk̄(tE) + dṼk(tE)]/(−dRk(tE)) = dṼΠ(tE)/(−dRk(tE))    (2)

Therefore: dWk(tE) = a × dṼΠ(tE)/(−dRk(tE)).

Note that dRk(tE) is negative at every ELTS[Sk, tE]. Thus dWk(tE) and dṼΠ(tE) have the same sign; therefore dṼΠ(tE) provides the sign of the Wk adjustment at tE: if during an ELTS[Sk, tE] dṼk̄(tE) > −dṼk(tE), then Wk should be reinforced; if dṼk̄(tE) < −dṼk(tE), then Wk should be attenuated. As a consequence, the Wk(t) of a mature variable-W*k segment varies around W̄*k. Recall the Ṡdice illustration above; after training, Wdice is mature: Wdice(t) ≃ W̄*dice = $1, ∆W*dice(twin) ≃ +$5, and ∆W*dice(tvoid) ≃ −$1. A low learning rate (a ≪ 1) deadens adjustment noise.


As stated above, dWk(tE) = a × dṼΠ(tE)/(−dRk(tE)). But if dWk(tE) were inversely proportional to dRk(tE), Wk adjustment would be highly sensitive to tiny emissions and to dRk noise. The simple rule (3) below takes into account the above rationale and the magnitude guideline related to transmission evidence (section 4). Rule (3) may be used to evaluate the PGD cognitive model on various natural or artificial cases.

Wk(t) = Wk(t−1) + α ⌊−dRk(t)⌋+ × dṼΠ(t)    (3)


The rectifier bracket notation implements the “adjust only at ELTS” condition: ⌊u⌋+ := u if u ≥ 0, else 0. Recall that dṼΠ(t) := δΠ(t) is similar to δDA(t) and to δTD(t), at least during URPROR scenarios. In fact, the PGD cognitive learning rule (3) is equivalent to that of conventional TD, except for the rectifier bracket term (see section 8). Rule (3) is fully dynamistic: both terms dRk(tE) and dṼΠ(tE) are variations. The TD rule is halfway between statistics and dynamistics. Translating the conventional TD(0) learning rule (using linear approximation, as PGD does) to PGD notation gives Wk(t) = Wk(t−1) + α Rk(t−1) × δTD(t). While δTD(t) is a variation, Rk(t−1) is a level (the feature k of state s at t−1) [2]. See section 8 below for the TD(λ) case.
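As a quick check of rule (3), the following sketch applies it to the Ṡdice illustration: at every roll the boolean segment emits (dRdice = −1) and the payoff is admitted downstream, so dṼΠ = payoff − Wdice and the weight settles around the $1 average. The learning rate, seed, and number of rolls are illustrative assumptions:

# Sketch: rule (3) on the boolean dice segment.
import random
random.seed(0)
alpha, W = 0.02, 0.0                       # low learning rate, novice weight
trace = []
for roll in range(20000):
    payoff = 6.0 if random.randint(1, 6) == 6 else 0.0  # value admitted downstream
    dR = -1.0                              # emission-like time step (ELTS)
    dV_pi = W * dR + payoff                # dV~_dice + dV~_complement
    W += alpha * max(-dR, 0.0) * dV_pi     # rule (3)
    trace.append(W)
print(round(sum(trace[10000:]) / 10000, 2))   # ~1.0: W_dice varies around $1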

6 Cognitive input operators

So far I have considered only axiogogues with abutting segments, each providing a dRx(t) signal to Π♥. Now consider an axiogogue with the following chain of segments, upstream to downstream: Ṡk, Ṡm, Ṡj. Assume that Ṡk and Ṡj provide input signals dRk(t) and dRj(t) to Π♥, and that dRj(t) has already been recruited. Ṡm is located between Ṡk and Ṡj, but it does not provide an input signal to Π♥: Ṡm is a mute segment, from the Π♥ point of view. Consequently, Π♥ can recruit neither Sm nor Sk. One way to deal with this situation is to produce a recruitable delayed signal dRm(t), set by each emission of an item from Ṡk (upon each dRk(t) DIP), and reset after the travel duration Dm of items through Ṡm. But Dm is not known a priori, and it may vary. As a simple solution, dRm(t) could be an exp-pulse triggered by the dRk(t) DIP, with a decay lasting beyond the usual range of Dm. A more accurate but more expensive solution uses a set of several delay operators Smi (see [8]), which are all triggered at each dRk(t) DIP, setting every signal Rmi(t), and which decay with various delays between set and decay, such that the Rmi(t) decays elicited by a given dRk(t) DIP overlap and cover the range of usual Dm durations. Plug the set of derivatives dRmi(t) as inputs of Π♥. If an item actually emits from Ṡk and then admits into Ṡj after a usual duration, one or several

248 249 250 251

delayed signals dRmi(t) triggered by the emission from Ṡk may be recruited, and then Sk will also be recruited. Resettable delay operators should prevent a δΠ(t) DIP at the usual reward delivery time, when value transmission occurs earlier than usual [8] (clarifying DA ramps [11], which have a similar timescale, may help to address this issue).
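A sketch of the delay-operator bank described above: each dRk(t) DIP sets a family of exp-pulse signals Rmi(t) that decay at different rates, so that their variations bridge the mute segment Ṡm over the plausible range of Dm. The decay rates and timings are illustrative assumptions:

# Sketch: a bank of exponential-decay delay operators S_mi, all set by a
# dR_k(t) DIP (an emission from S_k), decaying at different rates.
import math

class ExpPulseBank:
    def __init__(self, rates):
        self.rates = rates                  # one decay rate per operator
        self.R = [0.0] * len(rates)         # levels R_mi(t)

    def step(self, dR_k):
        """Advance one time step; returns the variations dR_mi(t)."""
        prev = list(self.R)
        self.R = [r * math.exp(-lam) for r, lam in zip(self.R, self.rates)]
        if dR_k < 0:                        # DIP: an emission from S_k sets the bank
            self.R = [r - dR_k for r in self.R]
        return [r - p for r, p in zip(self.R, prev)]

bank = ExpPulseBank(rates=[0.5, 0.2, 0.05])   # short, medium, long decays
for t in range(6):
    dR_k = -1.0 if t == 1 else 0.0            # one emission from S_k at t=1
    print(t, [round(d, 2) for d in bank.step(dR_k)])
# These dR_mi(t) variations are the recruitable inputs plugged into Pi-heart.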


Various other types of cognitive input operators may be useful upstream of Π♥. Notably, static input operators may combine perceptions, e.g. performing logical functions (AND, OR, NOT, XOR, ...).

7 Extended transitivity beyond value transformations; logical segments


v0 is the set of substances detected by the sensor Ċ0, including the substance (e.g. glucose) interesting value consumers downstream. Transitivity spreads v0 net value tracking upstream of Ṡ0. But v0 net value tracking may spread far upstream of the v0 domain. Imagine a segment ṠT which transforms upstream items of type vu (e.g. fructose) into downstream items of another type vd (e.g. glucose): vu → vd. Fit this segment with two sensors, Ċu (sensing vu items) and Ċd (sensing vd items), and plug the corresponding variation signals dRu(t) and dRd(t) as Π♥ inputs. Transformation of a vu item into a vd one elicits a dRu DIP and a dRd PIC. If the dRd signal has been previously recruited by Π♥, then vu → vd will cause dRu recruitment. Therefore, upstream segments transmitting vu items to ṠT will also be recruited. Physical segments such as ṠT, where transformation occurs in situ, may be considered as a chain of two logical segments Su and Sd. Π♥ is substance-agnostic: aware of its input dynamics, but unaware of value types.

“Substance transformation” should be understood beyond its usual chemical sense. It includes every in-situ change in spatial configuration, either physical (e.g. plowing a field) or logical (a changing image; a paper being edited). Transformations and combinations contribute to an extended transitivity which may spread prognodendrons far upstream from their origin (S0), beyond value transformations (e.g., manufacturing yields money spent to buy food).
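A minimal sketch of the recruitment across a transformation: one vu → vd event inside ṠT elicits a dRu DIP and a dRd PIC at the same time step, so rule (3) gives the upstream logical segment Su a positive weight as soon as Sd is already recruited. The unit magnitudes, weights, and learning rate are illustrative assumptions:

# Sketch: in-situ transformation v_u -> v_d seen through the two sensors of S_T.
alpha = 0.1
W_u, W_d = 0.0, 1.0           # S_d already recruited, S_u not yet
dR_u, dR_d = -1.0, 1.0        # one unit of v_u transformed into one unit of v_d

dV_pi = W_u * dR_u + W_d * dR_d         # = +1: apparent transmission S_u -> S_d
W_u += alpha * max(-dR_u, 0.0) * dV_pi  # rule (3): W_u becomes positive
print(W_u)                              # 0.1 -> S_u has been recruited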

8 From TD to PGD

The conventional TD production rule [14] writes, using some PGD model notation: δt := rt + γ Wt−1ᵀRt − Wt−1ᵀRt−1. At t, the TD algorithm inputs the reward rt := r(t) and the vector Rt of recruitable signals, and it computes δt := δTD(t) using the weight vector Wt−1 computed at the previous time step (Uᵀ is the transpose of vector U). V(t) := Wt−1ᵀRt is the sum of rewards expected after t. The algorithmic justification for the discount rate parameter γ is to avoid getting an infinite sum V(t) of expected future rewards [2]. In the Π♥×Σd case, ṼΠ(t) is a weighted sum of value levels Rk(t) (k=0···K). In physical reality, ṼΠ(t) is limited, both upstream (even if Π♥ is farsighted, SΠ is limited by Π♥ perception capabilities, since the set of Rk inputs is limited) and downstream, thanks to value consumption (a PGD requirement for artificial device design). In the PGD framework, the psychological justification (“better an egg today than tomorrow”) [2] translates into an a priori omission rate in cognitive segments. While TD applies a systematic and identical temporal discount to all cues (with a rise of V(t) between CS and US), PGD takes into account a specific (Sk-dependent) a priori omission rate at acquisition, with no ṼΠ(t) variation after acquisition if everything happens as expected. Hence, assume no systematic discount hereafter: γ = 1.

A significant amount of DAP and TD literature considers δDA and δTD as a reward prediction error (RPE), since it indicates a discrepancy between actual and expected rewards. In the DAP/TD framework, RPE(t) is a TD error: it is the difference, at t, between the reward r(t) actually obtained at t and the reward TD(t) expected by the TD device at t [2]. Without systematic temporal discount, TD(t) is the difference between the sum V(t−1) of rewards which were expected after t−1 and the sum V(t) of rewards currently expected after t: TD(t) := V(t−1)−V(t). TD(t) is the “temporal difference” at t, i.e. the error or difference between temporally successive predictions [1] at t−1 and t. If the prediction is correct, a reward r(t) = TD(t) is delivered at t. RPE(t) quantifies the discrepancy between actual and expected rewards at t: RPE(t) := δTD(t) := r(t)−TD(t) = r(t)−[V(t−1)−V(t)] = r(t)+V(t)−V(t−1). In classical CS-US cases (assuming that SΠ = SCS + SUS, SUS being a RASE segment), the conventional TD rule translates in the PGD framework to: δTD(t) ≡ dṼUS(t) + ṼCS(t) − ṼCS(t−1) = dṼCS(t) + dṼUS(t) = dṼΠ(t) = δΠ(t). Hence in this case the δΠ(t) and δTD(t) production rules are equivalent.


DAP and TD authors have proposed various functional or neuronal architectures (see e.g. [4]), including a subtraction operator to compute V(t−1)−V(t) [12], and either a dt delay operator (when dt is constant and small) or a V(t−1) memory (when dt is variable or large, e.g. in game cases). The δΠ(t) production rule requires neither a subtraction operator, nor a delay operator, nor an estimated value memory: it simply performs a weighted sum of its inputs. The simple PGD cognitive functional model may help neuroscientists in clarifying neuronal architectures, notably that of the dopamine system (and presumably the cerebellum [13]).

The PGD cognitive learning rule (3) is similar to the conventional TD(λ) rule [14], replacing eligibility traces ek(t) by ⌊−dRk(t)⌋+. Note that: 1) ek(t) is an exp-pulse triggered by the corresponding feature φk(t); 2) ⌊−dEPλt0(t)⌋+ = λ EPλt0(t−1) (Fig. 3). Indeed, TD(λ) eligibility traces ek(t) are rectified variations of exponential decay RASE cognitive segments. Hence, while the PGD cognitive learning rule (3) with exp-pulse delay operators is equivalent to the conventional TD(λ) one, the PGD rule makes explicit their dynamistic nature, inferred from the PGD spatiotemporal model of RL.
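For completeness, here is a short check behind point 2), under the exp-pulse definition EPλt0(t) := e^(−λ(t−t0)) given with Fig. 3; the exact constant depends on how the discrete decay is implemented:

dEPλt0(t) = EPλt0(t) − EPλt0(t−1) = (e^(−λ) − 1) EPλt0(t−1), hence ⌊−dEPλt0(t)⌋+ = (1 − e^(−λ)) EPλt0(t−1) ≃ λ EPλt0(t−1) for small λ (and exactly λ EPλt0(t−1) if the decay is implemented as a multiplication by 1−λ at each time step).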


TD(λ) rules embed temporal discount and delay operators, with systematic and identical γ and λ for all input signals. PGD externalizes delay operators out of Π♥ (allowing use of various temporal profiles and delays) and lets weight adjustment apply an Sk -dependent temporal discount.

9 Conclusion

The PGD cognitive model provides an extended viewpoint, adding directed space to the classical temporal framework of Reinforcement Learning models. The structural, functional, and dynamical PGD cognitive models defined in this paper have natural grounding: tracking substance amount variations inside a structure of segments. Several concepts are defined (some of them being quantified), among others: “x-mission” events (admission, omission, etc.) and acquisition, net value and expected value, axiogogue and prognodendron. The PGD framework should help to interpret various phenomena elicited in many application domains related to value logistics.

The PGD production rule is simple (a weighted sum of variations, devoid of subtraction operator) and equivalent to the TD production rule. While the PGD learning rule (using exponential decay delay operators) is equivalent to the TD(λ) learning rule, the PGD model unveils their dynamistic nature. Noting that basic PGD cognitive devices are space-agnostic leads us to extend the value announcement transitivity upstream to substance transformations occurring at interfaces between logical segments. Extended transitivity may shed light on various psychological phenomena, bridging the gap between vital values and accessory needs (e.g. seeking tasty meals?). While this paper focuses on the cognitive component of value logistics, studying the driving component (aiming to favor downward progress of value items) should help to clarify several still obscure psychological concepts such as drives or desires, and their neuronal implementation.

In this paper I focus on the progression in space of value items to consumers downstream. But frequently the roles are inverted: the consumer progresses in space toward value items (goals). Indeed, animals (notably humans) chain both types of progression to eventually consume values: you move to reach your kitchen, then grasp a food item and put it in your mouth, then the food item travels along your nutritive system to nutriment consumers. Possible applications include games, software and internet use, and several human activities prefixed by “pro-”, e.g. production, projects, procedures. Several paths may lead a consumer from her current location to a given goal, hence graph-structured axiogogues and prognodendrons. The PGD model applies to both progress types, notably expected value quantification.

Nutriments flow downwards along unchanging structures. It may be interesting to study an extension of the PGD model to 2D spaces, where value items move in various directions. Even if wind direction changes on a day timescale, it is rather constant on a shorter timescale, which allows us to predict sun or rain here within ten minutes by looking in the right direction at the horizon. Similarly, a subset of cognitive input operators could be activated by direction sensors, to track objects of interest in the field of vision.

The PGD model could shed light on several physiological phenomena which display anticipation capabilities triggered by extra-body cues, notably: cephalic phase responses (e.g. insulin rise), sexual arousal (e.g. erection), and risk appraisal as anticipation of damage (eliciting a rise of sympathetic activity). The intra-body immunity role of serotonin [6] may extend to extra-body concerns related to self-care, shelter, and attachment, which may explain its involvement in mood disorders.



Acknowledgments


To be completed...


References


[1] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1), 9-44.


[2] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT press.


[3] Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature neuroscience, 1(4), 304-309.




[4] Kawato, M., & Samejima, K. (2007). Efficient reinforcement learning: computational theories, neuroscience and robotics. Current opinion in neurobiology, 17(2), 205-212.


[5] Montague, R. (2007). Your brain is (almost) perfect: How we make decisions. Penguin.


[6] Rubio-Godoy, M., Aunger, R., & Curtis, V. (2007). Serotonin–A link between disgust and immunity? Medical hypotheses, 68(1), 61-66.


[7] Schultz, W. (2007). Reward signals. Scholarpedia, 2(6):2184.


[8] Ludvig, E. A., Sutton, R. S., & Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation, 20(12), 3034-3054.



[9] Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in cognitive sciences, 12(7), 265-272.

[10] de Araujo, I. E., Ferreira, J. G., Tellez, L. A., Ren, X., & Yeckel, C. W. (2012). The gut–brain dopamine axis: a regulatory system for caloric intake. Physiology & behavior, 106(3), 394-399.

[11] Howe, M. W., Tierney, P. L., Sandberg, S. G., Phillips, P. E., & Graybiel, A. M. (2013). Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature, 500(7464), 575-579.

[12] Eshel, N., Bukwich, M., Rao, V., Hemmelder, V., Tian, J., & Uchida, N. (2015). Arithmetic and local circuitry underlying dopamine prediction errors. Nature.

[13] Ohmae, S., & Medina, J. F. (2015). Climbing fibers encode a temporal-difference prediction error during cerebellar learning in mice. Nature neuroscience.

[14] van Seijen, H., Mahmood, A. R., Pilarski, P. M., Machado, M. C., & Sutton, R. S. (2015). True Online Temporal-Difference Learning. arXiv preprint arXiv:1512.04087.

[15] Anonymous, A. (2016). The PGD model: giving Space to Reinforcement Learning Temporal models - Supplementary material. [website] (in the final version).
