Introduction to Reinforcement Learning Models

Someone very near and dear to me just sent me a picture of herself cuddled up on the couch in her pajamas with an Argentinian Tegu. That's right lady, I said Tegu. The second coming of Sodom and Gomorrah - you heard it here first, folks! I mean, I know it's the twenty-first century and all, but what the heck.

Looks like I'll be pushing her to buy that lucrative life insurance policy much earlier than planned!

Anyway, I think that little paroxysm of righteous anger provides an appropriate transition into our discussion of reinforcement learning. Previously we talked about how a simple model can simulate an organism processing a stimulus, such as a tone, and beginning to associate it with rewards or their absence, which in turn leads to either elevated or depressed levels of dopamine firing. Over time, dopamine firing begins to respond to the conditioned stimulus itself instead of the reward, as the stimulus becomes more tightly linked to receiving the reward in the near future. This phenomenon is so strong and reliable across species that it can even be observed in the humble sea slug Aplysia, which is one ugly sucker if I've ever seen one. Probably wouldn't stop her from cuddling up with that monstrosity, though!

Anyway, that only describes one form of learning - to wit, classical conditioning. (Do you think I am putting on airs when I use a phrase like "to wit"? She thinks that I do; but then again, she also has passionate, perverted predilections for cold-blooded wildlife.) Obviously, any animal in the food chain - even the ones she associates with - can be classically conditioned to almost anything. Much more interesting is operant conditioning, in which an individual has to make certain choices, or actions, and then evaluate the consequences of those choices. Kind of like hugging reptiles! Oh hey, she probably thinks, let's see if hugging this lizard - this pebbly-skinned, fork-tongued, unblinking beast - results in some kind of reward, like gold coins shooting out of my mouth. In operant conditioning parlance, the rush of gold coins flowing out of one's orifice would be a reinforcer, which increases the probability of that action in the future; while a negative event, such as being fatally bitten by the reptile - which pretty much any sane person would expect to happen - would be a punisher, which decreases the probability of that action in the future.

The classically conditioned responses, in other words, serve the function of a critic, which monitors for stimuli and for the reinforcers or punishers those stimuli reliably predict; operant conditioning, by contrast, can be thought of as an actor role, in which choices are made and the results evaluated against what was expected. Sutton and Barto, a pair of researchers considerably less sanguinary than Hodgkin and Huxley, were among the first to propose and refine this model, assigning the critic role to the ventral striatum and the actor role to the dorsal striatum. So, that's where they are; if you want to find the actor component of reinforcement learning, for example, just grab a flashlight and examine the dorsal striatum inside someone's skull, and, hey presto! there it is. I won't tell you what it looks like.

However, we can form some abstract idea about what the actor component looks like by simulating it in Matlab. No, just in case you were wondering, this won't help you hook up with Komodo Dragons! It will, however, refine our understanding of how reinforcement learning works, by building upon the classical conditioning architecture we discussed previously. In this case, weights are still updated, but now we have two actions to choose from, which results in four combinations: either one or the other, both at the same time, or neither. In this example, only doing action 1 leads to a reward, and this gets learned right quick by the simulation. As before, a surface map of delta shows the reward signal being transferred from the actual reward itself to the action associated with that reward, and a plot of the action weight vectors shows action 1 clearly dominating over action 2. The following code will help you visualize these plots, and see how tweaking parameters such as the discount factor and learning rate affects delta and the action weights. But it won't help you get those gold coins, will it?




clear
clc
close all

numTrials = 200;
numSteps = 100;
weights = zeros(numSteps, numTrials+1); %Critic weights for steps 1-100, initialized to zero (extra column for the final trial's update)

discFactor = 0.995; %Discounting factor
learnRate = 0.3; %Learning Rate
delta = zeros(numSteps, numTrials); %Prediction error at each timestep and trial
V = zeros(numSteps, numTrials); %Value predictions (expected future reward)
x = [zeros(1,19) ones(1,81)]; %Conditioned stimulus, presented from timestep 20 onward

r = zeros(numSteps, numTrials); %Reward array, which will be populated with 1's whenever a reward occurs (in this case, when action1 == 1 and action2 == 0)

W1=0; %Actor weights for each action
W2=0;
a1=zeros(1,numTrials); %Records which action(s) were taken on each trial
a2=zeros(1,numTrials);
w1Vect=zeros(1,numTrials); %Actor weights recorded at the end of each trial
w2Vect=zeros(1,numTrials);


for idx = 1:numTrials
   
    for t = 1:numSteps-1
        if t==20
            as1=x(t)*W1; %Compute action signals at time step 20 within each trial
            as2=x(t)*W2;
           
            ap1 =  exp(as1)/(exp(as1)+exp(as2)); %Softmax function to calculate probability associated with each action
            ap2 =  exp(as2)/(exp(as1)+exp(as2));
           
            n=rand;
            if n<ap1
                a1(idx)=1;
            end
           
            n=rand;
            if n<ap2
                a2(idx)=1;
            end
        
            if a1(idx)==1 && a2(idx)==0 %Only deliver reward when action1 ==1 and action2 ==0
                r(50:55,idx)=1;
            end                       
        end
       
        V(t,idx) = x(t).*weights(t, idx); %Value predictions at the current and next timestep
        V(t+1,idx) = x(t+1).*weights(t+1, idx);
       
        delta(t+1,idx) = r(t+1,idx) + discFactor.*V(t+1,idx) - V(t,idx); %Temporal difference prediction error
       
        weights(t, idx+1) = weights(t, idx)+learnRate.*x(t).*delta(t+1,idx); %Critic update, carried over to the next trial
       
        W1 = W1 + learnRate*delta(t+1,idx)*a1(idx); %Actor updates, gated by whichever actions were taken
        W2 = W2 + learnRate*delta(t+1,idx)*a2(idx);
       
    end
   
    w1Vect(idx) = W1; %Record the actor weights at the end of each trial
    w2Vect(idx) = W2;

   
   
end


figure
set(gcf, 'renderer', 'zbuffer') %Can prevent crashes associated with surf command
surf(delta)

figure
hold on
plot(w1Vect)
plot(w2Vect, 'r')
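A note on one design choice in that code: the softmax step is what keeps the simulation exploring. Even after W1 has grown much larger than W2, action 2 still gets sampled now and then, so the model keeps testing whether its preferences are warranted. In the second figure, you should see w1Vect climb steadily while w2Vect hovers near zero - action 1 dominating, just as advertised.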

 





======================

Oh, and one more thing that gets my running tights in a twist - people who don't like Bach. Who the Heiligenstadt Testament doesn't like Bach? Philistines, pederasts, and pompous, nattering, Miley Cyrus-cunnilating nitwits, that's who! I get the impression that most people have this image of Bach as some bewigged fogey dithering around in a musty church somewhere improvising fugues on an organ, when in fact he wrote some of the most hot-blooded, sphincter-tightening, spiritually liberating music ever composed. He was also, clearly, one of the godfathers of modern metal; listen, for example, to the guitar riffs starting at 6:38.


...Now excuse me while I clean up some of the coins off the floor...

Computational Models of Reinforcement Learning: An Introduction

The process of learning what is good for us, and what is bad for us, is incredibly complex; but the rudiments have been outlined, and we can gain some insight by starting with the basic building blocks of what is known as reinforcement learning. During reinforcement learning, we come to associate certain actions with specific outcomes - push one button and get a piece of cake; push another button, and receive a blast of voltage to your nipples. Through experience we begin to flesh out a mental picture of which decisions are likely to lead to which events; and, through merely observing someone else, we can learn what to do, or what not to do, even in the absence of reinforcers or punishers.

Before we get there, however, let's approach our subject from an even more basic form of learning - classical conditioning. Here, no actions are needed; one merely observes a stimulus, such as a tone of a certain frequency, and learns that it predicts a specific outcome, such as the arrival of food. In this case, the tone is the conditioned stimulus, the food is the unconditioned stimulus, and salivating in response to the tone, after enough pairings between the tone and the food, becomes the conditioned response.

Let me give an example from my dating history. You may find this particular story I am about to relate to be way, way too much information; but if you've been reading for this long, I assume that we're on close enough terms that divulging such graphic details of my personal life will, far from driving us apart, bring us closer together by allowing us to bond over our shared humanity.

So. Onions. I am - or I used to be - indifferent to them. All I could say about them was that they had a smooth, eely texture when fried in oil; that they released a pungent aroma when sliced, diced, and crushed; and that their flavor was particularly sharp. Other than that, I had nothing else to say about them. Onions were onions.

But one day - never mind when - I began to see a girl who absolutely loved onions. Onions were inseparable from any dish she made; and so close was the association between her mood and the amount of onions she put into her cooking - casseroles, curries, tartlets, you name it - that, were I to witness her eating an entire onion in the raw, I would assume her to be in the seventh heaven.

My little onion, I used to call her, as a sign of my undying affection; and whenever we made love, we would first scatter onion shavings upon the bed, or the grass, or the movie theater seat, as a ritual to consecrate the beautiful, sacred act that was about to be made manifest. And when she would part the pillowy gates of her mouth and cleave her lips to mine, that pregnant moment filled with an anticipation so poignant you could hardly bear it, I would inhale deeply, feeling the overpowering, acrid smell of onions run over every single one of my nose hairs, driving my olfactory bulbs insensate with desire.

"Darling, do you love me?" she would ask, breathing heavily, the odorous waves of onion wafting across the thin slit of air between us and mooring within my nostrils.

"Yes, my little onion," I would reply. "Yes; yes; a thousand times yes!"

Such was our love, then; and you can hardly imagine my shock and desolation when, several years into this relationship of onions and unadulterated bliss, some knave, jealous of our happiness no doubt, took it upon himself to poison one of her onions, and kill her! My bereavement was only slightly assuaged by the fact that she had, only a few days before, taken out an extremely lucrative life insurance policy, naming me as the sole beneficiary.

After three painful, soul-searching days of mourning, however, I eventually gained the strength to renew my courtships with several other desirable young ladies. Yet, while throughout this period I continued to seduce innumerable women and live a Byronesque lifestyle of aristocratic excess, I couldn't help feeling some conspicuous lack, some defect in any affair, any tryst I willingly thrust myself into. At first I blamed the girls themselves: this one with long, lithe arms, but perhaps a shade too willowy; this one with a bold, intriguing personality, but perhaps a bit too pert for my taste; and yet another, sloe-eyed, with beautiful brown irises, but which, upon closer inspection, revealed the slightest of discrepancies in the size of one pupil compared to the other. Not having taken myself for a very discriminating fellow before my relationship with Mary, that light of my life, that fire of my loins - in other words, that onion chick I was talking about earlier, in case you couldn't tell - I found myself at a total loss.

While ruminating over my sudden change in amorous tastes, one day I found myself absentmindedly skimming the menu at a local bistro; and then - mirabile dictu! - I saw the item French onion soup inconspicuously nestled under the Appetizers section. Feeling my pulse quicken, I followed my instincts and ordered the soup, aware that I had hit upon the answer to my problem. Soon after, a disembodied hand placed the soup in front of me; and, slowly, meaningfully, I gazed down into the thick brown liquid. I braced myself, inhaled deeply, and somewhere in my brain a key unlocked the overflowing warehouse of my desire. Memories came flooding back; memories of Mary; memories of onion; and, most of all, memories of that pungent, acidic smell crushed out from the shavings underneath our bodies.

Having solved the puzzle, I now embark on a new chapter of my life; and nowhere do I go now without my peeler, and without my paring knife!


This story wonderfully illustrates some of the key components of classical conditioning. First, an unconditioned stimulus - Mary - elicited an unconditioned response from me - feelings of arousal. Because of Mary's repeated pairings with onions, the onions became a conditioned stimulus that signified an upcoming session of especially gratifying hanky-panky, and eventually by themselves elicited the conditioned response of arousal.

In psychological terms, this process of learning constitutes the "critic" part of learning: a stimulus signifies some kind of upcoming reward, and over time a person learns this association, eventually shifting the usual feelings of pleasure and excitement from the reward itself to the stimulus signifying the reward. The critic evaluates how reliable the association is, and, depending on the individual, associations can be learned relatively slowly, or relatively quickly.
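To make "relatively slowly, or relatively quickly" concrete, here is a little sketch of my own devising (not from any paper we're discussing) showing how the learning rate alone changes how fast the critic's value estimate approaches a constant reward, using a bare-bones delta-rule update:

nTrials = 50; r = 1;                     %Reward present on every trial
V_slow = zeros(1,nTrials);               %Value estimates for a slow learner...
V_fast = zeros(1,nTrials);               %...and for a fast learner
for t = 1:nTrials-1
    V_slow(t+1) = V_slow(t) + 0.05*(r - V_slow(t)); %Small learning rate
    V_fast(t+1) = V_fast(t) + 0.5*(r - V_fast(t));  %Large learning rate
end
figure; hold on
plot(V_slow); plot(V_fast, 'r')          %The fast learner saturates much sooner
legend('learnRate = 0.05', 'learnRate = 0.5')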

Let's focus on a landmark Science paper by Schultz, Dayan, & Montague (1997). This paper mathematically modeled different phases of reinforcement learning, and outlined several equations that can simulate how much an organism will respond to the conditioned stimulus and to the reward itself. The following Matlab code implements equations 3 and 4 from the paper, using 200 trials and 100 timesteps within each trial. The weights are updated on each trial, and the prediction error, represented by delta, will grow increasingly large and migrate toward the time of presentation of the conditioned stimulus. Note, in the following figure from the Schultz et al. paper, that when an organism has been conditioned to expect a reward at a certain time, the omission of that reward leads to a large negative deflection in the prediction error signal.
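For reference, here is my paraphrase of what those two equations boil down to, in the notation of the code below: the value prediction at timestep t is V(t) = w(t)*x(t); the prediction error is delta(t+1) = r(t+1) + gamma*V(t+1) - V(t), where gamma is the discount factor (discFactor); and the weights are updated as w(t) = w(t) + alpha*x(t)*delta(t+1), where alpha is the learning rate (learnRate).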

Similar surface maps can be generated using the following code; I suggest adjusting the learning rate and discount factor parameters to see how they affect the prediction error signal, and also trying the administration of the reward at different times. Building up this intuition will be critical for understanding more advanced models of reinforcement learning, in which outcomes are contingent upon particular actions. And don't forget to keep eating those onions!



%Parameters
numTrials = 200;
numSteps = 100;
weights = zeros(numSteps, numTrials+1); %Array of weights for steps 1-100, initialized to zero (extra column for the final trial's update)

discFactor = 0.995; %Discounting factor
learnRate = 0.3; %Learning Rate
delta = zeros(numSteps, numTrials); %Prediction error at each timestep and trial
V = zeros(numSteps, numTrials); %Value predictions, sum of expected future rewards
x = [zeros(1,19) ones(1,81)]; %Presentation of the conditioned stimulus from timestep 20 onward
r = zeros(numSteps, numTrials); %Reward
r(50:55,1:190)=1; %Reward at timesteps 50-55 on trials 1-190; omitted on the final 10 trials to produce the negative deflection


for idx = 1:numTrials
   
    for t = 1:numSteps-1   
       
        V(t,idx) = x(t).*weights(t, idx); %Value predictions at the current and next timestep
        V(t+1,idx) = x(t+1).*weights(t+1, idx);
       
        delta(t+1,idx) = r(t+1,idx) + discFactor.*V(t+1,idx) - V(t,idx); %Temporal difference prediction error
       
        weights(t, idx+1) = weights(t, idx)+learnRate.*x(t).*delta(t+1,idx); %Weight update, carried over to the next trial
    end
   
   
end

surf(delta)
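To see that negative deflection directly - this bit is my own addition, not part of the snippet above - pull out the first trial on which the reward is withheld; delta should plunge right around timestep 50:

figure
plot(delta(:,191)) %First trial on which the reward is omitted
xlabel('Timestep'); ylabel('delta')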

Establishing Causality Between Prediction Errors and Learning

You've just submitted a big grant, and you anxiously await the verdict on your proposal, which is due any day now. Finally, the email arrives. Sweat drips from your brow and onto your hands and onto your pantlegs and soaks through your clothing until you look like some alien creature excavated from a marsh. You read the first line - and then read it again. You can't believe what you just saw - you got the grant!

Never in a million years did you think this would happen. The proposal was crap, you thought; and everyone else you sent it to for review thought it was crap, too. You can just imagine their faces now as they are barely able to restrain their choked-back venom while they congratulate you on getting the big grant while they have to go another year without funding and force their graduate students to work part-time at Kilroy's for the summer and get hit on by sleazy patrons with slicked-back ponytails and names like Tony and Butch and save money by moving into that rundown, cockroaches-on-your-miniwheats-infested, two-bedroom apartment downtown with five roommates and sewage backup problems on the regular.

This scenario illustrates a key component of reinforcement learning known as prediction error: organisms tend to associate outcomes with particular actions - sometimes randomly, at first - and over time come to form a cause-effect relationship between actions and results. Computational modeling and neuroimaging have implicated dopamine (DA) as a critical neurotransmitter responsible for making these associations, as shown in a landmark study by Schultz and colleagues back in 1997. When you have no prediction about what is going to happen, but a reward - or punishment - appears out of the blue, DA tracks this occurrence by increasing firing, usually originating from clusters of DA neurons in midbrain areas such as the ventral tegmental area (VTA). Over time, these outcomes can become associated with particular stimuli or particular actions, and DA firing drifts to the onset of the stimulus or action. Other types of predictions and violations you may be familiar with include certain forms of humor, items failing to drop from the vending machine, and the Houdini.
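As a bare-bones illustration - mine, not Schultz's - the sign of the prediction error in each of those cases falls straight out of delta = r - V, where V is the predicted reward and r the reward actually received:

V = 0; r = 1; fprintf('Unexpected reward: delta = %+.1f\n', r - V)
V = 1; r = 1; fprintf('Fully predicted:   delta = %+.1f\n', r - V)
V = 1; r = 0; fprintf('Omitted reward:    delta = %+.1f\n', r - V)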

Figure 1 reproduced from Schultz et al (1997). Note that when a reward is predicted but no reward occurs, DA firing drops precipitously.

In spite of a large body of empirical results, most reinforcement learning experiments have difficulty establishing a causal link between DA firing and the learning process, often due to relatively poor temporal resolution. However, a recent study in Nature Neuroscience by Steinberg et al (2013) used a form of neuronal activation known as optogenetics to stimulate neurons with pulses of light during critical time periods of learning. One aspect of learning, known as blocking, presented an opportunity to use the superior temporal resolution of optogenetics to test the role of DA in reinforcement learning.

To illustrate the concept of blocking, imagine that you are a rat. Life isn't terribly interesting, but you get to run around inside a box, run on a wheel, and push a lever to get pellets. One day you hear a tone, and a pellet falls down a nearby chute; and it turns out to be the juiciest, moistest, tastiest pellet you've ever had in your life since you were born about seven weeks ago. The same thing happens again and again, with the same tone and the same super-pellet delivered into your cage. Then, at some point, right after you hear the tone a light is flashed into your cage. The pellet is still delivered; all that has changed is that now you have a tone and a light, instead of just the tone. At this point, you get all hot and excited whenever you hear the tone; the light, however, isn't really doing it for you, and you couldn't care less about it. Your learning toward the light has been blocked: everything is in place to learn an association between the light and the uber-pellet, but since you've already been highly trained on the association between the tone and the pellet, the light doesn't add any predictive power to the situation.
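Blocking, it turns out, falls naturally out of the same error-driven learning we've been simulating. Here is a minimal sketch of my own (a Rescorla-Wagner-style update, not the code or paradigm from the Steinberg paper) in which the light's weight stays near zero because the tone already predicts the pellet perfectly:

alpha = 0.2;                %Learning rate
wTone = 0; wLight = 0;      %Associative weights

for trial = 1:100           %Stage 1: tone alone -> pellet
    pe = 1 - wTone;                 %Prediction error (pellet worth 1)
    wTone = wTone + alpha*pe;
end

for trial = 1:100           %Stage 2: tone + light -> the same pellet
    pe = 1 - (wTone + wLight);      %Compound prediction is already near 1
    wTone  = wTone  + alpha*pe;
    wLight = wLight + alpha*pe;     %Almost no error left, so almost no learning
end

fprintf('wTone = %.3f, wLight = %.3f\n', wTone, wLight) %wLight stays near zero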

What Steinberg and colleagues did was to optogenetically stimulate DA neurons whenever rats were presented with the blocked stimulus - in the example above, the light. This induced a prediction error that was then associated with the blocked stimulus, and rats later presented with it exhibited learning behavior similar to what they showed toward the previously conditioned stimulus - in the example above, the tone - lending direct support to the theory that DA serves as a prediction error signal, rather than a salience or surprise signal. Followup experiments showed that optogenetic stimulation of DA neurons could also interfere with the extinction process, in which responding fades once stimuli are no longer followed by a reward: stimulating at the moment the expected reward failed to arrive blunted the negative prediction error that normally drives extinction. Taken together, these results are a solid contribution to reinforcement learning theory, and have prompted the FDA to recommend more dopamine as part of a healthy diet.
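And, continuing the sketch above, here is a crude cartoon of the optogenetic manipulation - again mine, and very much a cartoon: inject a positive bump into the prediction error on the compound trials, and the light acquires value after all:

daBoost = 0.5;              %Stand-in for optogenetic DA stimulation
for trial = 1:100           %Compound trials with stimulated 'dopamine'
    pe = 1 - (wTone + wLight) + daBoost; %Artificially inflated prediction error
    wTone  = wTone  + alpha*pe;
    wLight = wLight + alpha*pe;          %The light is unblocked: wLight grows
end
fprintf('After stimulation: wLight = %.3f\n', wLight)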

And now, what you've all been waiting for - a gunfight scene from Django Unchained.