Ben Cook - Computational Projects

Baseball Statistics

Applying Collaborative Filtering to Pitcher-Batter matchups

For more technical details, see our project webpage and the project notebook on GitHub.

Imagine you're the manager of the Chicago Cubs, trying to win the last game of the World Series. Your best player has just been injured, and you need to pick the best pinch-hitter to replace him for a crucial at-bat. Your last two available players, Abe and Bruce, have identical .250 batting averages. You think, "Maybe one has more success against the opposing pitcher, Charlie," but alas, neither of them has ever faced him before!

In the final project for Harvard's Data Science course AC 209 (Fall 2015), our team of four developed an approach to solving this problem using collaborative filtering, the same framework used by companies like Netflix to predict whether you will like a movie. In the Netflix model, every other user is assigned a similarity rating relative to you, based on how much the two of you liked the same sets of movies. Netflix makes the assumption that if users who are similar to you liked the new movie in question, then you probably will too!

Our baseball analysis relied on a similar supposition: that we should be able to find sets of batters "similar" to Abe and Bruce, and that those batters' success against Charlie will predict how well Abe and Bruce are likely to do. Just as Netflix finds that some users really like sci-fi movies, we hoped to identify groups of batters who (for example) do well against fastball pitchers, and others who do well against the curveball.
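The idea can be sketched in a few lines of Python. This is a minimal neighborhood-based example, not the project's actual models: the batters Gus and Hank, the pitchers Dan, Eli, and Fred, and every number in `averages` are invented for illustration, and the choice of Pearson correlation as the similarity measure is an assumption.

```python
from math import sqrt

# Hypothetical data: averages[batter][pitcher]. A missing key means the
# batter has never faced that pitcher. All names and numbers are made up.
averages = {
    "Abe":   {"Dan": 0.300, "Eli": 0.200, "Fred": 0.250},
    "Bruce": {"Dan": 0.200, "Eli": 0.300, "Fred": 0.250},
    "Gus":   {"Dan": 0.310, "Eli": 0.190, "Charlie": 0.350},
    "Hank":  {"Dan": 0.190, "Eli": 0.310, "Charlie": 0.150},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def similarity(a, b):
    """Pearson correlation over the pitchers both batters have faced."""
    shared = sorted(set(averages[a]) & set(averages[b]))
    if len(shared) < 2:
        return 0.0  # not enough overlap to compare the two batters
    mean_a = mean(averages[a][p] for p in shared)
    mean_b = mean(averages[b][p] for p in shared)
    dev_a = [averages[a][p] - mean_a for p in shared]
    dev_b = [averages[b][p] - mean_b for p in shared]
    norm = sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(dev_a, dev_b)) / norm


def predict(batter, pitcher):
    """Batter's own mean plus similarity-weighted deviations of other batters."""
    base = mean(averages[batter].values())
    weighted_sum = 0.0
    weight_total = 0.0
    for other, record in averages.items():
        if other == batter or pitcher not in record:
            continue
        weight = similarity(batter, other)
        # How much better or worse than usual did `other` do against `pitcher`?
        weighted_sum += weight * (record[pitcher] - mean(record.values()))
        weight_total += abs(weight)
    if weight_total == 0:
        return base  # no informative neighbors: fall back to the overall mean
    return base + weighted_sum / weight_total


abe = predict("Abe", "Charlie")
bruce = predict("Bruce", "Charlie")
print(f"Abe: {abe:.3f}, Bruce: {bruce:.3f}")
```

In this toy data Abe resembles Gus, who hit Charlie well, while Bruce resembles Hank, who did not, so the sketch favors Abe even though both have the same overall average.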

My primary contribution to the project was in collecting, cleaning, and distributing a database of player statistics. The key data required were: how many times has each MLB batter faced each MLB pitcher, and how successful was he? While there are neatly curated databases listing all MLB batters and pitchers, such as the Lahman Database, the only online database of batter-pitcher matchups is collected in a large series of fairly unstructured HTML pages through retrosheet.org (example page for Sammy Sosa is shown below).

I developed code to crawl the Retrosheet encyclopedia and scrape the matchup data for each MLB player, collecting the HTML data into a Pandas dataframe to be analyzed in Python. Our group then developed a variety of collaborative filtering models and used them to try to predict the success of individual batters against particular pitchers.
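As a rough illustration of the scraping step, the sketch below pulls rows out of an HTML table using only the Python standard library. The `SAMPLE_HTML` snippet, its column names, and the helper names `TableParser` and `parse_matchups` are all hypothetical: the real Retrosheet pages are laid out differently and need more careful handling, and this is not the project's crawler.

```python
from html.parser import HTMLParser

# Hypothetical page fragment. The real Retrosheet layout is NOT reproduced
# here; names and numbers are placeholders.
SAMPLE_HTML = """
<table>
<tr><th>Pitcher</th><th>AB</th><th>H</th></tr>
<tr><td>Pitcher A</td><td>12</td><td>4</td></tr>
<tr><td>Pitcher B</td><td>7</td><td>1</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_matchups(html, batter):
    """Turn one batter's (hypothetical) matchup table into a list of records."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    records = []
    for row in body:
        cells = dict(zip(header, row))
        records.append({
            "batter": batter,
            "pitcher": cells["Pitcher"],
            "at_bats": int(cells["AB"]),
            "hits": int(cells["H"]),
        })
    return records


records = parse_matchups(SAMPLE_HTML, "Example Batter")
print(records)
```

A list of dictionaries in this shape can be passed directly to `pandas.DataFrame(records)` to get one row per batter-pitcher pair.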

Unfortunately, our models were only marginally more successful than simply using a batter's overall batting average. We suspect this indicates that a batter's observed success against a particular pitcher is primarily due to luck and small-number statistics (most batters hadn't faced a given pitcher more than 10 times), and not to statistically significant differences in his abilities against that pitcher.
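A quick back-of-the-envelope check shows how noisy ten at-bats are. The sketch below assumes each at-bat is an independent trial with a fixed hit probability (a simplification, and not a calculation taken from the project) and computes the binomial standard error of the observed average.

```python
from math import sqrt


def batting_average_std_error(true_average, at_bats):
    """Standard error of an observed average under a simple binomial model."""
    return sqrt(true_average * (1 - true_average) / at_bats)


# A true .250 hitter observed over 10 at-bats versus 500 at-bats.
se_10 = batting_average_std_error(0.250, 10)
se_500 = batting_average_std_error(0.250, 500)
print(f"10 at-bats:  +/- {se_10:.3f}")   # prints 0.137
print(f"500 at-bats: +/- {se_500:.3f}")  # prints 0.019
```

Under this assumption, a .250 hitter's observed average over ten at-bats could plausibly land anywhere from roughly .100 to .400, which swamps the matchup-specific differences the models were trying to detect.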

Autonomous Gameplay

Reinforcement learning used to outperform humans on a computer game

For more details, see the technical writeup produced for the course. The video below shows my algorithm learning and playing the game. The algorithm successfully learns and improves on human performance by run 57 (1:09 in the video).

Computers are continuing to outperform humans in a variety of skill- and intelligence-based games, such as Chess and Go. Many of the successful algorithms include a combination of expert-written heuristics and analysis of millions of human-played games.

For the final project of Harvard's Machine Learning course CS 181 (Spring 2016), my partner and I designed algorithms which could learn to play a simple computer game (Swingy Monkey, an open-source game inspired by Flappy Bird) solely through trial and error. After only about a minute of play (by its 57th game), my algorithm could consistently match or beat human abilities.

The video game is straightforward: a monkey sprite, controlled by the user, moves through the jungle at a constant x-velocity and is subject to gravity. The user must choose when to "jump" (grab a vine and swing), such that the monkey passes through gaps in the trees and does not fall off the screen. The basic algorithm to learn gameplay is likewise fairly straightforward: given a particular state (position, velocity, distance to tree, etc.), decide whether to jump or continue falling. The algorithm begins with no prior knowledge of which actions should be taken, so it must begin by taking random actions, seeing how well those actions do, and eventually transitioning to using its knowledge to play optimally (see the Multi-armed Bandit problem).
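One common way to implement this shift from random play to informed play is an epsilon-greedy policy with a decaying exploration rate, paired with a Q-learning update. The sketch below is generic and is not taken from the course writeup: the function names, the decay schedule, and the values of `alpha` and `gamma` are all assumptions chosen for illustration.

```python
import random

ACTIONS = (0, 1)  # 0 = keep falling, 1 = jump


def epsilon(game_number, start=1.0, decay=0.9, floor=0.01):
    """Exploration probability: starts fully random, shrinks with each game."""
    return max(floor, start * decay ** game_number)


def choose_action(q_values, game_number, rng=random):
    """Epsilon-greedy: explore at random, otherwise take the best-known action."""
    if rng.random() < epsilon(game_number):
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda action: q_values[action])


def q_update(q_old, reward, best_next_q, alpha=0.1, gamma=0.9):
    """One Q-learning step toward reward plus discounted future value."""
    return q_old + alpha * (reward + gamma * best_next_q - q_old)


# Early on the agent acts at random; much later it mostly exploits.
q_values = {0: 0.0, 1: 1.0}  # hypothetical value estimates for one state
print(epsilon(0), epsilon(1000))
print(choose_action(q_values, game_number=1000, rng=random.Random(0)))
```

How quickly `epsilon` decays controls how long the agent keeps experimenting before it commits to what it has learned.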

The key conceptual challenges to designing a successful algorithm were:

(1) Which properties should be included and modeled as part of the monkey's state?
(2) The monkey never returns to exactly the same point in state-space. How do we identify "similar" situations to learn from?
(3) How long should the algorithm "learn" for, and when should it instead switch over to trying to play the best it can?

One of my major contributions to the project was in answering question (1), using domain knowledge from physics to design a set of state properties well suited to solving the problem. We then developed two different solutions to question (2): discretizing the state-space into a matrix, and linearly interpolating between observed points. The discretization algorithm was inefficient: either the grid was too coarse to make accurate predictions, or it was so fine that it took extremely long to collect enough data points in each state-space bin to make a correct prediction.
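The difference between the two approaches can be seen in one dimension. The sketch below is a simplified illustration under stated assumptions, not the project's code: it uses a single hypothetical state variable (distance to the next tree) with made-up value estimates, whereas the real state has several dimensions.

```python
from bisect import bisect_left


def bin_index(value, low, high, n_bins):
    """Discretization: map a value to one of n_bins equal-width bins."""
    fraction = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(fraction * n_bins)))


def interpolate(x, xs, ys):
    """Piecewise-linear estimate through observed points (xs sorted, distinct)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # xs[i - 1] < x <= xs[i]
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


# Hypothetical observations: estimated value of jumping at three distances.
distances = [0.0, 100.0, 200.0]
jump_values = [1.0, 0.0, -1.0]

# With a coarse two-bin grid, distances 50 and 99 are indistinguishable...
print(bin_index(50.0, 0.0, 200.0, 2), bin_index(99.0, 0.0, 200.0, 2))
# ...while interpolation gives each its own estimate from nearby observations.
print(interpolate(50.0, distances, jump_values))
print(interpolate(99.0, distances, jump_values))
```

Interpolation lets every observation inform nearby states immediately, instead of waiting for each bin to fill up with its own data.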

In contrast, the linear-interpolation algorithm was extremely efficient at learning to play the game successfully, and after only 1 minute of gameplay (56 games played), managed to outperform human abilities by playing for over 2.5 consecutive minutes without failing. The video above shows the results (best run at 1:09).
