Computational Models of Reinforcement Learning: An Introduction

The process of learning what is good for us, and what is bad for us, is incredibly complex; but the rudiments have been outlined, and we can gain some insight by starting with the basic building blocks of what is known as reinforcement learning. During reinforcement learning, we come to associate certain actions with specific outcomes - push one button and get a piece of cake; push another button, and receive a blast of voltage to your nipples. Through experience we begin to flesh out a mental picture of what decisions are likely to lead to certain events; and, through merely observing someone else, we can learn about what to do, or what not to do, even in the absence of reinforcers or punishers.

Before we get there, however, let's approach our subject from an even more basic form of learning - classical conditioning. In this case, no actions are needed; one merely observes a stimulus, such as a tone of a certain frequency, and learns that it predicts a specific outcome, such as the arrival of food. Here the tone is the conditioned stimulus, the food is the unconditioned stimulus, and salivating in response to the tone, after enough pairings between the tone and the food, becomes the conditioned response.

Let me give an example from my dating history. You may find this particular story I am about to relate to be way, way too much information; but if you've been reading for this long, I assume that we're on close enough terms that divulging such graphic details of my personal life will, far from driving us apart, bring us closer together by allowing us to bond over our shared humanity.

So. Onions. I am - or I used to be - indifferent to them. All I could say about them was that they had a smooth, eely texture when fried in oil; that they released a pungent aroma when sliced, diced, and crushed; and that their flavor was particularly sharp. Other than that, I had nothing else to say about them. Onions were onions.

But one day - never mind when - I began to see a girl who absolutely loved onions. Onions were inseparable from any dish she made; and so close was the association between her mood and the amount of onions she put into her cooking - casseroles, curries, tartlets, you name it - that, were I to witness her eating an entire onion in the raw, I would assume her to be in the seventh heaven.

My little onion, I used to call her, as a sign of my undying affection; and whenever we made love, we would first scatter onion shavings upon the bed, or the grass, or the movie theater seat, as a ritual to consecrate the beautiful, sacred act that was about to be made manifest. And when she would part the pillowy gates of her mouth and cleave her lips to mine, that pregnant moment filled with an anticipation so poignant you could hardly bear it, I would inhale deeply, feeling the overpowering, acrid smell of onions run over every single one of my nose hairs and drive my olfactory bulbs insensate with desire.

"Darling, do you love me?" she would ask, breathing heavily, the odorous waves of onion wafting across the thin slit of air between us and mooring within my nostrils.

"Yes, my little onion," I would reply. "Yes; yes; a thousand times yes!"

Such was our love, then; and you can hardly imagine my shock and desolation when, several years into this relationship of onions and unadulterated bliss, some knave, jealous of our happiness no doubt, took it upon himself to poison one of her onions, and kill her! My bereavement was only slightly assuaged by the fact that she had, only a few days before, taken out an extremely lucrative life insurance policy, having named me as the sole beneficiary.

After three painful, soul-searching days of mourning, however, I eventually gained the strength to renew my courtships with several other desirable young ladies. Yet, while throughout this period I continued to seduce innumerable women and live a Byronesque lifestyle of aristocratic excess, I couldn't help feeling some conspicuous lack, some defect in any affair, any tryst I willingly thrust myself into. At first I blamed the girls themselves: this one with long, lithe arms, but perhaps a shade too willowy; this one with a bold, intriguing personality, but perhaps a bit too pert for my taste; and yet another, sloe-eyed, with beautiful brown irises, but which, upon closer inspection, revealed the slightest of discrepancies in the size of one pupil compared to the other. Not having taken myself for a very discriminating fellow before my relationship with Mary, that light of my life, that fire of my loins - in other words, that onion chick I was talking about earlier, in case you couldn't tell - I found myself at a total loss.

While ruminating over my sudden change in amorous tastes, one day I found myself absentmindedly skimming the menu at a local bistro; and then - mirabile dictu! - I saw the item French onion soup inconspicuously nestled under the Appetizers section. Feeling my pulse quicken, I followed my instincts and ordered the soup, aware that I had hit upon the answer to my problem. Soon after, a disembodied hand placed the soup in front of me; and, slowly, meaningfully, I gazed down into the thick brown liquid. I braced myself, inhaled deeply, and somewhere in my brain a key unlocked the overflowing warehouse of my desire. Memories came flooding back; memories of Mary; memories of onion; and, most of all, memories of that pungent, acidic smell crushed out from the shavings underneath our bodies.

Having solved the puzzle, I now embark on a new chapter of my life; and nowhere do I go now without my peeler, and without my paring knife!


This story wonderfully illustrates some of the key components of classical conditioning. First, an unconditioned stimulus - Mary - elicited an unconditioned response from me - feelings of arousal. Because of Mary's repeated pairings with onions, the onions became a conditioned stimulus that signified an upcoming session of especially gratifying hanky-panky, and eventually by themselves elicited the conditioned response of arousal.

In psychological terms, this process of learning is called the "critic" part of learning; a stimulus signifies some kind of upcoming reward, and over time a person learns this association, eventually beginning to shift their usual feelings of pleasure and excitement from the reward itself to the stimulus signifying the reward. The critic evaluates how reliable the association is, and, depending on the individual, associations can be learned relatively slowly, or relatively quickly.

Let's focus on a landmark Science paper by Schultz, Dayan, & Montague (1997). This paper mathematically modeled different phases of reinforcement learning, and outlined several equations that can simulate how much an organism will respond to the conditioned stimulus and to the reward itself. The following Matlab code implements equations 3 and 4 from the paper, using 200 trials and 100 timesteps within each trial. The weights are updated on each trial, and the prediction error, represented by delta, will grow increasingly large and move closer to the time of the presentation of the conditioned stimulus. Note in the following figure from the Schultz et al. paper that when an organism has been conditioned to expect a reward at a certain time, the omission of that reward will lead to a large negative deflection in the prediction error signal.
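
For readers who prefer symbols to loops, here is how I would write down what the Matlab code computes. This is my reading of the code rather than a transcription of the paper, and the notation is an assumption on my part: n indexes trials, t indexes timesteps, gamma stands for discFactor, and alpha stands for learnRate.

```latex
\begin{align}
V_n(t) &= x(t)\, w_n(t) \\
\delta_n(t+1) &= r_n(t+1) + \gamma\, V_n(t+1) - V_n(t) \\
w_{n+1}(t) &= w_n(t) + \alpha\, x(t)\, \delta_n(t+1)
\end{align}
```

In words: the value estimate is the stimulus times its weight, the prediction error is the reward plus the discounted next value minus the current value, and each weight is nudged by the prediction error whenever the stimulus is present.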

Similar surface maps can be generated using the following code; I suggest adjusting the learning rate and discount factor parameters to see how they affect the prediction error signal, and also administering the reward at different times. Building up this intuition will be critical in understanding more advanced models of reinforcement learning, in which outcomes are contingent upon particular actions. And don't forget to keep eating those onions!



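%Parameters
numTrials = 200; %Number of trials
numSteps = 100; %Number of timesteps within each trial
weights = zeros(numSteps,numTrials+1); %Weights for each timestep (rows) and trial (columns), initialized to zero; the extra column holds the update from the final trial

discFactor = 0.995; %Discounting factor
learnRate = 0.3; %Learning Rate
delta = zeros(numSteps,numTrials); %Prediction error
V = zeros(numSteps,numTrials); %Value estimate: predicted sum of discounted future rewards
x = [zeros(1,19) ones(1,81)]; %Presentation of conditioning stimulus (on from timestep 20 onward)
r = zeros(numSteps,numTrials); %Reward
r(50:55,1:190)=1; %Reward delivered at timesteps 50-55 on trials 1-190, omitted on the last 10 trials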


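for idx = 1:numTrials

    for t = 1:numSteps-1

        %Value estimates at the current and next timestep
        V(t,idx) = x(t).*weights(t, idx);
        V(t+1,idx) = x(t+1).*weights(t+1, idx);

        %Prediction error: reward plus discounted next value, minus current value
        delta(t+1,idx) = r(t+1,idx) + discFactor.*V(t+1,idx) - V(t,idx);

        %Update the weight carried into the next trial
        weights(t, idx+1) = weights(t, idx)+learnRate.*x(t).*delta(t+1,idx);
    end

end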

surf(delta)
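
If you would rather compare learning rates side by side than edit one number and rerun, something along these lines should do it. Treat it as a rough sketch: the three learning rates are values I picked for illustration, and the names learnRates, onsetError, and omissionDip are my own additions, not anything from the paper. It repeats the same update as the code above and prints two summary numbers per learning rate instead of drawing a surface.

```matlab
%Sketch: rerun the same simulation for several learning rates
numTrials = 200;
numSteps = 100;
discFactor = 0.995;
x = [zeros(1,19) ones(1,81)]; %Stimulus on from timestep 20 onward
r = zeros(numSteps,numTrials);
r(50:55,1:190) = 1; %Reward omitted on the last 10 trials

learnRates = [0.05 0.1 0.3]; %Hypothetical values, chosen for illustration
onsetError = zeros(size(learnRates)); %delta at stimulus onset, last rewarded trial
omissionDip = zeros(size(learnRates)); %Most negative delta on the final trial

for k = 1:numel(learnRates)
    learnRate = learnRates(k);
    weights = zeros(numSteps,numTrials+1);
    delta = zeros(numSteps,numTrials);

    for idx = 1:numTrials
        for t = 1:numSteps-1
            Vt = x(t).*weights(t,idx);
            Vnext = x(t+1).*weights(t+1,idx);
            delta(t+1,idx) = r(t+1,idx) + discFactor.*Vnext - Vt;
            weights(t,idx+1) = weights(t,idx) + learnRate.*x(t).*delta(t+1,idx);
        end
    end

    onsetError(k) = delta(20,190);
    omissionDip(k) = min(delta(:,numTrials));
    fprintf('learnRate = %.2f: onset error = %.3f, omission dip = %.3f\n', ...
        learnRate, onsetError(k), omissionDip(k));
end
```

Swapping the fprintf for a surf(delta) inside a subplot would give you one surface map per learning rate; the same loop structure works for sweeping discFactor or the reward window.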