Puppy training and reinforcement learning
What RL algorithms can teach us about training puppies
This is issue #8 of Russell’s Index, where I write about the lessons I’ve learned—and continue to learn—as a founding employee at SharpestMinds. Subscribe for a new issue in your inbox every week(ish). Emphasis on the ish.
This is another departure from my usual topics. But they say to write about what you’re interested in. I want to keep up the writing habit and right now our puppy is consuming a lot of my mental cycles. I will return to SharpestMinds and startup-related content in the next post.
Dog training and reinforcement learning
I grew up with dogs, but I’ve never had the responsibility of training one until now. It’s an uphill battle. But puppies are quick learners, and it is incredibly satisfying to watch them learn.
The gist of dog training is simple—reward the behaviours that you want to replicate. Do not reward the ones you want to stop.
Tell the puppy to sit. If she sits, give her a treat. Reward the puppy when she pees outside, but not when she pees inside. With enough reinforcement, the rewarded behaviours become ingrained habits. The other ones die out.
Training a puppy is very much like training a reinforcement learning (RL) algorithm.
The basic concept of RL is simple. You have an agent that can perform actions in an environment. Every action produces a new state (a new configuration of the environment) along with a reward. The goal of an RL algorithm is to have the agent learn which action to take, given the current state of the environment, to maximize its reward.
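Here’s the shape of that loop in a few lines of Python. This is only a sketch of the standard agent-environment interface; the Environment and Agent classes here are hypothetical stand-ins for illustration, not any particular library:

```python
# Sketch of the basic RL loop. Environment and Agent are hypothetical
# stand-ins for illustration, not a real library's API.

class Environment:
    def reset(self):
        """Start a new episode and return the initial state."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (new_state, reward, done)."""
        raise NotImplementedError

class Agent:
    def act(self, state):
        """Pick an action given the current state."""
        raise NotImplementedError

    def learn(self, state, action, reward, new_state):
        """Update internal estimates based on what just happened."""
        raise NotImplementedError

def run_episode(env, agent):
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                      # agent chooses
        new_state, reward, done = env.step(action)     # environment responds
        agent.learn(state, action, reward, new_state)  # agent updates
        total_reward += reward
        state = new_state
    return total_reward
```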
For a chess-playing algorithm, the environment is the chessboard. The actions—all the legal chess moves. The state—the positions of all the pieces on the board. The reward is simple—winning is good, losing is bad.
But there can be a lot of steps between each move and the end of the game when the reward is finally doled out (or not). There are a lot of intermediate states. So a typical RL algorithm has the agent learn to assign a value for each of these intermediate states. The value of a state is a measure of how likely it is to result in a future reward.
RL algorithms typically learn through lots and lots of iterations. Make the agent play thousands—or millions—of chess games. When a game is won, look back at all the intermediate states that led to the win and increase their value a little bit. When a game is lost, decrease their value a little bit. Do this over and over and the agent starts to learn which states are more valuable, which actions will lead to higher-value states, and which actions will eventually lead to a reward.
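The update itself is tiny. Here’s a minimal sketch of that backup in Python, assuming each finished game hands us the list of states it passed through and a result of +1 for a win or -1 for a loss:

```python
from collections import defaultdict

values = defaultdict(float)  # value estimate per state; unseen states start at 0
STEP_SIZE = 0.01             # how far one game nudges the estimates

def update_values(visited_states, result):
    """Nudge every state visited during a game toward the final result.

    result is +1 for a win and -1 for a loss. Over many games, states
    that tend to precede wins drift toward +1 and states that tend to
    precede losses drift toward -1.
    """
    for state in visited_states:
        values[state] += STEP_SIZE * (result - values[state])
```

An agent can then pick moves greedily, choosing whichever legal move leads to the highest-valued next state, with some random exploration mixed in so it keeps discovering new states.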
Puppies—like most living things—already have a built-in reward function that can be exploited. They like food. From birth, they are already learning which actions will get them more food. If our puppy Lucy were out in the wild, she might be learning to hunt and find scraps. But, under my roof, she gets food when she behaves.
Since I control the food, I can choose which actions to reward. Training her to obey my commands boils down to an RL problem. We have an agent (the puppy) that can take actions (sit, stay, come) in an environment (our apartment, a park). Certain states will result in a reward (a treat, attention, a nice stick). Luckily, puppies don’t require thousands of examples to learn simple commands.
RL research is often done with games because games are simple. A game is a self-contained environment with a well-defined set of possible states and a limited set of actions the agent can take. The bigger the state or action space, the harder it becomes to train the RL agent.
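You can see this effect even in the simplest RL setting, a multi-armed bandit. Here’s a toy simulation of my own (not from any paper): an epsilon-greedy agent gets a fixed budget of tries, and only one action ever pays off. The smaller the action space, the faster it finds the good action:

```python
import random

def average_reward(n_actions, steps=200, epsilon=0.1):
    """Epsilon-greedy agent on a toy bandit where exactly one action
    (chosen at random) pays a reward of 1 and all others pay 0."""
    rewarding_action = random.randrange(n_actions)
    estimates = [0.0] * n_actions  # estimated reward per action
    counts = [0] * n_actions
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            action = random.randrange(n_actions)  # explore
        else:
            best = max(estimates)                 # exploit, breaking ties randomly
            action = random.choice([a for a, v in enumerate(estimates) if v == best])
        reward = 1.0 if action == rewarding_action else 0.0
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return total / steps

print(average_reward(n_actions=3))   # high: the good action is found quickly
print(average_reward(n_actions=30))  # lower: far more fumbling around first
```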
The same goes for training puppies. Training is easier in our apartment where there are fewer distractions. The outside world has much more going on: cars, people, smells. Keeping the space small and limiting distractions reduces the number of possible states and the amount of information processing the puppy needs to do to make sense of her environment.
It’s also useful to put a limit on her possible actions. When teaching “sit”, for example, you can stand on her leash, limiting her to a few actions: stand, sit, or lie down. When there are only three actions to choose from, she’ll converge faster on the one that gets her the reward.
Of course, you’ll want to generalize and make sure the puppy learns to sit on command in more distracting environments—leash or no leash. The way to do this is to gradually expand the state and action space. This approach—starting with simple examples and adding complexity—has also been adopted by the machine learning community and dubbed curriculum learning.
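As a sketch, a puppy curriculum might look like this in code. Everything here is made up for illustration; make_env, train, and evaluate are hypothetical helpers, and the thresholds are invented:

```python
# Curriculum learning sketch: master the easiest setting first, then
# gradually expand the state and action space. make_env, train, and
# evaluate are hypothetical helpers, not a real API.

CURRICULUM = [
    {"location": "apartment", "leash": True},   # tiny state/action space
    {"location": "apartment", "leash": False},  # more actions available
    {"location": "backyard",  "leash": False},  # more states (smells, sounds)
    {"location": "busy park", "leash": False},  # the full, distracting world
]

def train_with_curriculum(agent, success_rate=0.9):
    for stage in CURRICULUM:
        env = make_env(**stage)
        # Keep practicing at this stage until "sit" is reliable,
        # then move on to the harder setting.
        while evaluate(agent, env) < success_rate:
            train(agent, env, episodes=100)
```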
One of the things I found most interesting about animal training is a technique called clicker training. It’s fairly straightforward. Make the puppy associate a reward (a treat) with a distinct sound (like a “click” from a mechanical clicker).
Click. Treat. Click. Treat. Click. Treat. Repeat.
The puppy will catch on. Click equals treat. Then you can use that click as a proxy for an actual reward. You say “sit”. The puppy sits. Mark it with a click (but don’t forget to still give a treat after).
So what’s the point? Why mark a behaviour with a click if you’re just going to reward them with a treat after? Because a click lets you be far more precise about which action you are rewarding.
It takes time to pull out a treat and get it into the dog’s mouth. By the time they get it, they might not know exactly what action you were rewarding. A click is instantaneous. A click says to the dog, “That exact thing you just did earned you a reward.”
Of course, training does not require a clicker. But it speeds up learning. We know from RL that learning works backwards from the terminal state (e.g. winning the chess game, getting the treat). The more intermediate states between an action and the reward, the longer it takes to learn. A clicker shortens the delay between action and reward, cutting down the number of intermediate states and the number of repetitions needed to learn a new behaviour.
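A toy illustration of why the delay matters: put the reward at the end of a chain of states and use a one-step update rule. Credit can only flow backwards one state per repetition, so the longer the chain between action and reward, the more repetitions it takes before the action itself looks valuable. (This is my own sketch of the credit-assignment idea, not a model of actual dog cognition.)

```python
def reps_until_action_looks_valuable(delay, step_size=0.5, threshold=0.1):
    """States form a chain: the action lands you in state 0, and the
    treat arrives `delay` steps later. With one-step backups, value
    flows backwards roughly one state per repetition."""
    values = [0.0] * delay + [1.0]  # the final state is the treat itself
    reps = 0
    while values[0] < threshold:    # has credit reached the action yet?
        reps += 1
        for s in range(delay):      # one-step backup along the chain
            values[s] += step_size * (values[s + 1] - values[s])
    return reps

print(reps_until_action_looks_valuable(delay=2))   # click: nearly immediate reward
print(reps_until_action_looks_valuable(delay=10))  # slow treat: many more reps
```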
- Russell
Thanks for reading. That should be my last puppy post for now. Back to business for future posts. There are lots of things I want to write about. Creating a culture of writing. How to do async meetings. How to find process/team fit. Why the internet + mentorship is the future of education. Subscribe for a new post every week(ish)—emphasis on the ish.