Building Worlds for Reinforcement Learning

OpenAI’s Gym brings reinforcement learning to the masses. It comes with a wealth of environments, from the classic cart pole and board games to Atari, and now the new Universe, which adds flash games and PC games like GTA or Portal. This is great news, but for someone starting out, some of these games are overkill. You can learn a lot more in a shorter time by playing around with some of the smaller toy environments.

One area I like within the gym environments is the classic control problems (besides the fun of eating melon and poop). These are great problems for understanding the basics of reinforcement learning because we intuitively understand the rewards, and they run really fast. It’s not like Pong, which can take several days to train; you can train these environments within minutes!

If you aren’t happy with the current environments, it is possible to modify them or even add your own. In this post, I will highlight other environments and share how I modified the Acrobot-v1 environment.


To begin, grab the repo for the OpenAI gym. Inside the repo, navigate to gym/envs/classic_control, where you will see the scripts that define the classic control environments. If you open one of the scripts, you will see a heading at the top that says:

__copyright__ = "Copyright 2013, RLPy"

Ahh! In the spirit of open source, OpenAI stands on the shoulders of another reinforcement learning library, RLPy. You can learn a lot more about it at the RLPy site or take a look at its github. If you browse there, you can find the original scripts that OpenAI used under rlpy/rlpy/Domains. The interesting thing here is that there are a ton more interesting reinforcement learning problems!


You can run these using RLPy, or you can try and hack them into OpenAI’s gym.

Modifying OpenAI Environments

I decided to modify the Acrobot environment. Acrobot is a 2-link pendulum with only the second joint actuated (there are three actions: torque left, torque right, and no torque). The goal is to swing the end to a height of at least one link above the base. If you look at the leaderboard on OpenAI’s site, the entries meet that criterion, but they are not very impressive. Here is the current highest scoring entry:


This is way boring compared to what hardmaru shows in his demo, where the pendulum is capable of balancing for a short time:

So I decided to try and modify the Acrobot environment to make this task a little more interesting (Acrobot gist here). The main change was to the reward system. I added a variable, steps_beyond_done, that keeps track of consecutive steps where the end was swung high. I also changed the reward structure so the agent is gradually rewarded as it swings higher. Finally, I changed g to 0, which removes gravity’s effect.

self.rewardx = -np.cos(s[0]) - np.cos(s[1] + s[0])  # height of the swung end
if self.rewardx < .5:
    reward = -1.
    self.steps_beyond_done = 0
if (self.rewardx > .5 and self.rewardx < .8):
    reward = -0.8
    self.steps_beyond_done = 0
if self.rewardx > .8:
    reward = -0.6
if self.rewardx > 1:
    reward = -0.4
    self.steps_beyond_done += 1
if self.steps_beyond_done > 4:
    reward = -0.2
if self.steps_beyond_done > 8:
    reward = -0.1
if self.steps_beyond_done > 12:
    reward = 0.
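The same shaping can be restated as a small standalone function for tinkering (an illustration, slightly regularized at the band boundaries; swing_height computes the same quantity as self.rewardx above):

```python
import math

def swing_height(s):
    # height of the tip above the base, from the two joint angles
    return -math.cos(s[0]) - math.cos(s[1] + s[0])

def shaped_reward(height, steps_high):
    # less negative as the tip swings higher, approaching 0 the longer
    # it stays above the bar (height > 1)
    if height < 0.5:
        return -1.0
    if height < 0.8:
        return -0.8
    if height <= 1.0:
        return -0.6
    if steps_high > 12:
        return 0.0
    if steps_high > 8:
        return -0.1
    if steps_high > 4:
        return -0.2
    return -0.4
```

Hanging straight down gives a height of -2 and the worst reward; fully inverted gives a height of 2 and, once held for a dozen steps, a reward of 0.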

Another important file to be aware of is where the benchmarks are kept for each environment. You can find this at gym/gym/benchmarks/__init__.py. Within this file, you will see the following:

{'env_id': 'Acrobot-v1',
 'trials': 3,
 'max_timesteps': 100000,
 'reward_floor': -500.0,
 'reward_ceiling': 0.0,
}

I then ran an implementation of Asynchronous Advantage Actor-Critic (A3C) by Arno Moonens. After running for a half hour, you can see the improvement in the algorithm:


Now a half hour later:


The result is a pendulum taught to stay up for an extended time! This is much more interesting and what I was looking for. I hope this inspires others to build new and interesting environments.

Taking an H2O Model to Production

One of the best feelings as a data scientist is when the model you have poured your heart and soul into moves into production. Your model is now grown up, and you get to watch it mature.

This post shows how to take an H2O model and move it into a production environment. We will develop a simple H2O-based predictive model, convert it into a Plain Old Java Object (POJO), compile it along with other Java packages, and package the compiled class files into a deployable JAR file so that it can readily be deployed onto any Java-based application server. The model will accept an input data set as a CSV file and return the predicted output in CSV format.

H2O is one of my favorite tools for building models because it is well designed from an algorithm perspective, easy to use, and can scale to larger datasets. However, H2O’s documentation, though voluminous, doesn’t have clear instructions for moving a POJO model into production. This post discusses that approach in greater detail and provides the code to do it. (H2O does have a post on doing real-time predictions with Storm.) Special thanks to Socrates Krishnamurthy, who co-wrote this post with me.

Building H2O Model

As a starting point, let’s use our favorite ice cream dataset to create a toy model in H2O:

  library(h2o)
  library(Ecdat)   # provides the Icecream dataset
  h2o.init()
  train.h2o <- as.h2o(Icecream)
  rf <- h2o.randomForest(x = 2:4, y = 1, ntrees = 2, training_frame = train.h2o)

Once you have developed your model in H2O, the next step is downloading the POJO:

h2o.download_pojo(rf, getjar=TRUE, path="~/Code/h2o-")
# you must give a path to download a file

This will save two files: the h2o-genmodel.jar file with the classes needed to run the model, and the actual model file (which begins with DRF and ends with .java). Go ahead and open the model file in a text editor if you want to have a look at it.

Compiling the H2O Model

The next step is to compile the model. Say the downloaded model file is DRF_model_R_1470416938086_15.java (the name used in the commands below); then type:

> javac -cp h2o-genmodel.jar -J-Xmx2g DRF_model_R_1470416938086_15.java

This creates a bunch of Java class files.

Scoring the Input Data

The final step is scoring some input data. Prior to running the model, you need files for the input and output. For the input, the default setting is to read the first row as a header. The assumption is that the CSV is well formed (this approach does not use the H2O parser). Once that is done, run:

> java -cp .:h2o-genmodel.jar hex.genmodel.tools.PredictCsv --header --model DRF_model_R_1470416938086_15 --input input.csv --output output.csv
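Since this approach skips the H2O parser, a malformed input file surfaces as a confusing Java stack trace. A quick sanity check in Python (illustrative, not part of H2O’s tooling) can catch problems before the Java step:

```python
import csv

def csv_looks_ok(path):
    # minimal sanity check: a header row plus data rows that all have
    # the same number of comma-separated fields as the header
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    return len(header) > 1 and all(len(r) == len(header) for r in data)
```

A tab-separated file, for example, parses as a single comma-delimited column and fails the check.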

If you open the output.csv file, you will notice that the predicted values are in hexadecimal rather than numeric format. For example, the output will be something like this:


Fixing the Hexadecimal Issue

The model is now predicting, but the predictions are in the wrong format. Yikes! Fixing this requires some hacking of the Java code. The rest of this post shows how to hack PredictCsv, which fixes this issue and also lets you handle other unexpected issues with PredictCsv (for example, if your input comes tab-separated).

If we take a deeper look at the PredictCsv java file located in the h2o github, the myDoubleToString method returns a hexadecimal string. The challenge is that this method is static, so it cannot be overridden in a subclass, and it cannot be updated directly to return a regular numeric string, since it is provided by the H2O jar file.
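To see exactly what is going on, Java’s Double.toHexString writes the exact double value as a hexadecimal floating-point literal. Python’s built-in float.hex produces the same notation, which makes the odd-looking output easy to reproduce:

```python
# float.hex mirrors Java's Double.toHexString: the same double value,
# printed as a hexadecimal floating-point literal.
p = 0.5
hex_form = p.hex()              # '0x1.0000000000000p-1'
back = float.fromhex(hex_form)  # the hex form round-trips exactly
```

So the predictions are not wrong, just printed in an exact-but-unreadable notation, which is why switching the return statement to .toString() is all the fix requires.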

This can be fixed by creating a new java file (say, NewPredictCsv.java) by copying the entire content of PredictCsv.java from the above location and saving it locally. You then need to:

  • comment out the package declaration on the first line (prefix it with //);
  • change the class name (~line 20) to read: public class NewPredictCsv {
  • correct the hexadecimal issue by changing the return statement of the myDoubleToString method to use .toString() in lieu of .toHexString() (~line 131).

After creating NewPredictCsv.java, compile it using the following command:

> javac -cp h2o-genmodel.jar -J-Xmx2g NewPredictCsv.java

Run the compiled class by providing input and output CSV files, using the following command (ensure that input.csv is in the folder from which you run this):

> java -cp .:h2o-genmodel.jar NewPredictCsv --header --model DRF_model_R_1470416938086_15 --input input.csv --output output.csv

If you open the output.csv file now, it will be in the proper numeric format as follows:


Deploying the Solution into Production

At this point, we have a workable flow for using our model to score new data. But we can clean up the process to make it a little friendlier for our data engineers. First, create a jar file out of the class files created in the previous steps. To do that, issue the following command:

> jar cf my-RF-model.jar *.class

This will place all the class files, including our NewPredictCsv, inside the jar. This is helpful when we have a model with, say, 500 trees. Now all we need are three files to run our scorer. So copy the two jar files along with the input.csv file into any folder from which the program will be executed. After copying, the folder should contain the following files:

> my-RF-model.jar  
> h2o-genmodel.jar  
> input.csv  

The above input.csv file contains the dataset for which the dependent variable has to be predicted. To predict the values, run the java command as below:

> java -cp .:my-RF-model.jar:h2o-genmodel.jar NewPredictCsv --header --model DRF_model_R_1470416938086_15 --input input.csv --output output.csv


Replace : with ; in the above commands if you are working on Windows (yuck).
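If your data engineers script this step, the invocation (including the Windows classpath quirk) can be built programmatically. A minimal sketch in Python, assuming the jar and class names used in this post:

```python
import os

def scoring_command(model_name, input_csv, output_csv, windows=(os.name == "nt")):
    # Build the java invocation shown above; the jar and class names
    # mirror the ones created earlier in this post.
    sep = ";" if windows else ":"  # Windows uses ';' as the classpath separator
    classpath = sep.join([".", "my-RF-model.jar", "h2o-genmodel.jar"])
    return ["java", "-cp", classpath, "NewPredictCsv",
            "--header", "--model", model_name,
            "--input", input_csv, "--output", output_csv]
```

Passing the resulting list to subprocess.run avoids shell-quoting surprises on either platform.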

Using xgbfi to reveal feature interactions

Tree-based methods excel at using feature (or variable) interactions. As a tree is built, it picks up on interactions between features. For example, buying ice cream may not be affected by having extra money unless the weather is hot. It is the interaction of both of these features that affects whether ice cream will be consumed.
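To make that example concrete, here is a toy function (numbers invented, not fit to any dataset) where extra money matters only on hot days:

```python
# Purely illustrative: consumption responds to extra money only when it is hot.
def cones_bought(extra_money, hot):
    return 1.0 + (0.5 * extra_money if hot else 0.0)

cold_days = [cones_bought(m, hot=False) for m in (0, 5, 10)]  # flat
hot_days = [cones_bought(m, hot=True) for m in (0, 5, 10)]    # rises with money
```

Looking at money alone would suggest it has no effect half the time; only the money-and-heat combination explains the behavior, which is exactly the kind of structure a tree can capture.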

The traditional way to examine interactions is to rely on measures of variable importance. However, these measures don’t provide insights into second- or third-order interactions. Identifying these interactions is important for building better models, especially when finding features to use within linear models.

In this post, I show how to find higher-order interactions using XGBoost Feature Interactions & Importance (xgbfi). This tool has been available for a while, but outside of Kagglers, it has received relatively little attention.

As a starting point, I used the ice cream dataset to illustrate xgbfi. This walkthrough is in R, but Python instructions are also available at the repo. I am going to break the code into three sections: the initial build of the model, exporting the files necessary for xgbfi, and running xgbfi.

Building the model

Let’s start by loading the data:

library(Ecdat)   # provides the Icecream dataset
data(Icecream)
train <- data.matrix(Icecream[,-1])

The next step is running xgboost:

bst <- xgboost(data = train, label = Icecream$cons, max.depth = 3, eta = 1, nthread = 2, nround = 2, objective = "reg:linear")

To better understand how the model is working, let’s go ahead and look at the trees:

xgb.plot.tree(feature_names = names(Icecream[,-1]), model = bst)

xg tree plot

The results here line up with our intuition. Temperature seems to be the biggest variable just from eyeing the plot. This lines up with the results of a variable importance calculation:

> xgb.importance(colnames(train, do.NULL = TRUE, prefix = "col"), model = bst)
   Feature       Gain      Cover Frequency
1:    temp 0.75047187 0.66896552 0.4444444
2:  income 0.18846270 0.27586207 0.4444444
3:   price 0.06106542 0.05517241 0.1111111

All of this should be very familiar to anyone who has used decision trees for modeling. But what are the second order interactions? Third order interactions? Can you rank them?

Exporting the tree

The next step involves saving the tree and moving it outside of R so xgbfi can parse it. The code below creates the two files that are needed: xgb.dump and fmap.txt.

featureList <- names(Icecream[,-1])
featureVector <- c()
for (i in 1:length(featureList)) {
  featureVector[i] <- paste(i-1, featureList[i], "q", sep="\t")
}
write.table(featureVector, "fmap.txt", row.names=FALSE, quote = FALSE, col.names = FALSE)
xgb.dump(model = bst, fname = 'xgb.dump', fmap = "fmap.txt", with.stats = TRUE)
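For reference, the fmap.txt format the loop above produces (zero-based index, feature name, and feature type, with "q" meaning quantitative, all tab-separated) can be sketched in Python as well, which is handy if you build the model on the Python side:

```python
def fmap_lines(feature_names):
    # one line per feature: zero-based index, name, and type "q"
    # (quantitative), separated by tabs -- the layout xgbfi expects
    return ["\t".join([str(i), name, "q"]) for i, name in enumerate(feature_names)]

lines = fmap_lines(["temp", "income", "price"])
```

Writing these lines to fmap.txt with a newline after each gives the same file as the R loop.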

Running xgbfi

The first step is to clone the xgbfi repository onto your computer. Then copy the files xgb.dump and fmap.txt into the bin directory.

Go to your terminal or command line and run the XgbFeatureInteractions.exe application. On a Mac, download Mono and then run: mono XgbFeatureInteractions.exe. There is also an XgbFeatureInteractions.exe.config file in the bin directory that contains configuration settings.

After the application runs, it writes out an Excel spreadsheet titled XgbFeatureInteractions.xlsx. This spreadsheet has the good stuff! Open it up and you should see:

interaction depth 0

This tab of the spreadsheet shows the first-order interactions. These results are similar to what variable importance showed. The good stuff appears when you click on the tab for Interaction Depth 1 or Interaction Depth 2.

interaction depth 1

interaction depth 2

It is now possible to rank the higher-order interactions. With this simple dataset, you can see that the results out of xgbfi match what is happening in the tree. The real value of this tool is on much larger datasets, where it’s difficult to examine the trees for the interactions.

Outlier App

I was recently trying various outlier detection algorithms. For me, the best way to understand an algorithm is to tinker with it. I wanted to share my recent work on a shiny app that allows you to play around with various outlier algorithms.

The shiny app is available on my site, but even better, the code is on github for you to run locally or improve! I also posted a video that provides background on the app. Let me give you a quick tour of the app:

The available algorithms include:

  • Hierarchical Clustering (DMwR)
  • Kmeans (distance metrics from proxy)
    • Kmeans Euclidean Distance
    • Kmeans Mahalanobis
    • Kmeans Manhattan
  • Fuzzy kmeans (all from fclust)
    • Fuzzy kmeans - Gustafson and Kessel
    • Fuzzy k-medoids
    • Fuzzy k-means with polynomial fuzzifier
  • Local Outlier Factor (dbscan)
  • RandomForest (proximity from randomForest)
    • Isolation Forest (IsolationForest)
  • Autoencoder (Autoencoder)
  • FBOD and SOD (HighDimOut)


There is also a wide range of datasets to try:


Once the data is loaded, you can start exploring. One thing you can do is look at the effect scaling can have. In this example, you can see how outliers differ when scaling is used. The values on the far right no longer dominate the distance measurements, and there are now outliers from other areas:


By trying different algorithms, you can see how each one selects outliers. In this case, you see a difference between the outliers selected using an autoencoder versus an isolation forest.

Another example is the difference between kmeans and fuzzy kmeans, as shown below:


A density-based algorithm can also select different outliers than a distance-based algorithm. This example nicely shows the difference between kmeans and LOF (local outlier factor from dbscan):


An important part of using this visualization is studying the distance numbers that are calculated. Do these numbers mesh with your intuition? How big a quantitative difference is there between outliers and other points?


So that is the 2D app. Please send me bug fixes, additional algorithms, or tighter code!

3D+ App?

The next question is whether to expand this to larger datasets. This is something you would run locally (large datasets take too long for my shiny server). The downside of larger datasets is that they get trickier to visualize. For now, I am using a t-SNE plot. I am open to suggestions, but the intent is a way to evaluate outlier algorithms on a variety of datasets.


RNN Addition (1st Grade)

Ever since I ran across RNNs, they have intrigued me with their ability to learn. The best background reading is Denny Britz’s tutorial, Karpathy’s totally accessible and fun post on character-level language models, and Colah’s detailed description of LSTMs. Besides all the fun examples of generating content with RNNs, people have been applying them to win Kaggle competitions and the ECML/PKDD challenge.

I am still blown away by how RNNs can learn to add. RNNs are trained on thousands of examples and can learn how to sum numbers. For example, the Keras addition example shows how to add two numbers of up to 5 digits each (e.g., 54678 + 78967). It achieves 99% train/test accuracy in 30 epochs with a one-layer LSTM (128 hidden units) and 550k training examples.

My eventual goal is to use RNNs to study various kinds of sequenced data (such as NBA SportVu data), so I thought I should start simple. I wanted to teach an RNN to add a series of numbers, for example: 5+7+9. The rest of the post discusses this journey.

1st Grade Model

My first model taught an RNN to add between 5 and 15 single-digit numbers. This would be at the level of a first grader in the US. For example, using a 2-layer LSTM network with 100 hidden units, a batch size of 50 training examples, and 5000 epochs, the RNN summed up:

8+6+4+4+0+9+1+1+7+3+9+2+8 as 66.2154007

This isn’t too far from the actual answer of 62. The Keras addition example shows that with even more examples/training, the RNN can get much better. The code for this RNN is available as a gist using TensorFlow. I made it in notebook format so it’s easy to play with.
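For anyone wanting to reproduce the setup, generating training pairs for this task is straightforward (the exact encoding in my gist differs; this just sketches the data):

```python
import random

def make_example(rng, min_len=5, max_len=15):
    # one training pair: a sequence of 5 to 15 single digits and its sum
    # (turning these into LSTM inputs/targets is left to the model code)
    digits = [rng.randint(0, 9) for _ in range(rng.randint(min_len, max_len))]
    return digits, sum(digits)

rng = random.Random(42)
dataset = [make_example(rng) for _ in range(1000)]
```

With a fixed seed the dataset is reproducible, which makes it easier to compare runs while tweaking hyperparameters.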

There are lots of parameters to tweak in RNN models, such as the number of hidden units, epochs, batch size, dropout, and learning rate. Each of these has a different effect on the model. For example, increasing the number of hidden units provides more capacity for learning but takes longer to train. The chart below shows the effect of different choices. Please take the time to really study the role of hidden units. It’s a dynamic plot, so you can zoom in and examine each series individually by clicking on the legend.

Cost by epoch