Using xgbfi for revealing feature interactions

Tree-based methods excel at using feature (variable) interactions. As a tree is built, it picks up on interactions between features. For example, buying ice cream may not be affected by having extra money unless the weather is hot. It is the interaction of these two features that affects whether ice cream will be consumed.

The traditional way to examine interactions is to rely on measures of variable importance. However, these measures don’t provide insight into second- or third-order interactions. Identifying these interactions is important for building better models, especially when looking for features to use within linear models.

In this post, I show how to find higher-order interactions using xgbfi (XGBoost Feature Interactions & Importance). This tool has been available for a while, but outside of Kagglers it has received relatively little attention.

As a starting point, I use the Ice Cream dataset to illustrate xgbfi. This walkthrough is in R, but Python instructions are also available at the repo. I am going to break the code into three sections: the initial build of the model, exporting the files needed for xgbfi, and running xgbfi.

Building the model

Let’s start by loading the data (the Icecream dataset ships with the Ecdat package):

data(Icecream, package = "Ecdat")
train <- data.matrix(Icecream[, -1])   # drop cons, the response we want to predict

The next step is running xgboost (load the xgboost package first):

library(xgboost)
bst <- xgboost(data = train, label = Icecream$cons, max.depth = 3, eta = 1, nthread = 2, nrounds = 2, objective = "reg:linear")

To better understand how the model is working, let’s go ahead and look at the trees:

xgb.plot.tree(feature_names = names(Icecream[, -1]), model = bst)

[Figure: xgboost tree plot]

The results here line up with our intuition. Hot days (temp) seem to be the biggest variable just by eyeing the plot. This lines up with the results of a variable importance calculation:

> xgb.importance(colnames(train, do.NULL = TRUE, prefix = "col"), model = bst)
   Feature       Gain      Cover Frequency
1:    temp 0.75047187 0.66896552 0.4444444
2:  income 0.18846270 0.27586207 0.4444444
3:   price 0.06106542 0.05517241 0.1111111

All of this should be very familiar to anyone who has used decision trees for modeling. But what are the second order interactions? Third order interactions? Can you rank them?

Exporting the tree

The next step involves saving the tree and moving it outside of R so xgbfi can parse it. The code below creates the two files that are needed: xgb.dump and fmap.txt.

featureList <- names(Icecream[,-1])
featureVector <- c()
for (i in 1:length(featureList)) {
  featureVector[i] <- paste(i - 1, featureList[i], "q", sep = "\t")   # "q" marks a quantitative feature
}
write.table(featureVector, "fmap.txt", row.names = FALSE, quote = FALSE, col.names = FALSE)
xgb.dump(model = bst, fname = 'xgb.dump', fmap = "fmap.txt", with.stats = TRUE)
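
For this dataset, fmap.txt should end up looking roughly like this (feature index, feature name, and type, where q marks a quantitative feature; the order follows the columns of the Icecream data):

0	income	q
1	price	q
2	temp	q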

Running xgbfi

The first step is to clone the xgbfi repository onto your computer. Then copy the files xgb.dump and fmap.txt to the bin directory.

Go to your terminal or command line and run the XgbFeatureInteractions.exe application. On a Mac, install Mono and then run the command: mono XgbFeatureInteractions.exe. There is also an XgbFeatureInteractions.exe.config file in the bin directory that contains configuration settings.

After the application runs, it will write out an Excel spreadsheet named XgbFeatureInteractions.xlsx. This spreadsheet has the good stuff! Open up the spreadsheet and you should see:

[Figure: Interaction Depth 0 tab]

This tab of the spreadsheet shows the first order interactions. These results are similar to what variable importance showed. The good stuff is when you click on the tab for Interaction Depth 1 or Interaction Depth 2.

[Figure: Interaction Depth 1 tab]

[Figure: Interaction Depth 2 tab]

It is now possible to rank the higher-order interactions. With this simple dataset, you can see that the results out of xgbfi match what is happening in the tree. The real value of this tool is for much larger datasets, where it’s difficult to examine the trees for interactions.
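
As one way to act on what xgbfi surfaces (my own illustration, not part of the tool), a highly ranked temp and income interaction could be fed back into a linear model for the Icecream data and compared against a main-effects-only fit:

# Hypothetical follow-up: test an interaction surfaced by xgbfi in a linear model
lm_main <- lm(cons ~ temp + income + price, data = Icecream)
lm_int  <- lm(cons ~ temp * income + price, data = Icecream)   # adds the temp:income term
anova(lm_main, lm_int)   # does the interaction improve the fit?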

Outlier App

I was recently trying various outlier detection algorithms. For me, the best way to understand an algorithm is to tinker with it. I wanted to share my recent work on a shiny app that allows you to play around with various outlier algorithms.

The shiny app is available on my site, but even better, the code is on github for you to run locally or improve! I also posted a video that provides background on the app. Let me give you a quick tour of the app:

The available algorithms include:

  • Hierarchical Clustering (DMwR)
  • Kmeans (distance metrics from proxy)
    • Kmeans Euclidean Distance
    • Kmeans Mahalanobis
    • Kmeans Manhattan
  • Fuzzy kmeans (all from fclust)
    • Fuzzy kmeans - Gustafson and Kessel
    • Fuzzy k-medoids
    • Fuzzy k-means with polynomial fuzzifier
  • Local Outlier Factor (dbscan)
  • RandomForest (proximity from randomForest)
    • Isolation Forest (IsolationForest)
  • Autoencoder (Autoencoder)
  • FBOD and SOD (HighDimOut)


There is also a wide range of datasets to try:


Once the data is loaded, you can start exploring. One thing you can do is look at the effect scaling can have. In this example, you can see how outliers differ when scaling is used. The values on the far right no longer dominate the distance measurements, and there are now outliers from other areas:
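
To make the scaling point concrete outside the app, here is a minimal standalone sketch (simulated data, not the app’s code) of how an unscaled, high-variance column can dominate a distance-based outlier score:

# b has a much larger scale than a, so it dominates raw distances
set.seed(1)
dat <- cbind(a = rnorm(100), b = rnorm(100, sd = 100))

outlier_score <- function(m) {
  km <- kmeans(m, centers = 3)
  sqrt(rowSums((m - km$centers[km$cluster, ])^2))   # distance to assigned cluster center
}

head(order(outlier_score(dat), decreasing = TRUE))        # ranking driven almost entirely by b
head(order(outlier_score(scale(dat)), decreasing = TRUE)) # after scaling, both columns contribute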


By trying different algorithms, you can see how each one selects outliers. In this case, you can see the difference between the outliers selected by an autoencoder versus an isolation forest:

[Figure: autoencoder vs. isolation forest outliers]

Another example is the difference between kmeans and fuzzy kmeans, as shown below:


A density-based algorithm can also select different outliers than a distance-based algorithm. This example nicely shows the difference between kmeans and lof (local outlier factor from dbscan):
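
For a standalone flavor of the same contrast (simulated data, not the app’s code; this assumes the dbscan package, where lof() takes a minPts argument in recent versions):

library(dbscan)
set.seed(2)
pts <- rbind(matrix(rnorm(200, sd = 0.5), ncol = 2),   # one tight cluster
             matrix(rnorm(200, sd = 3), ncol = 2))     # one diffuse cluster

km <- kmeans(pts, centers = 2)
dist_score <- sqrt(rowSums((pts - km$centers[km$cluster, ])^2))   # distance-based score
lof_score  <- lof(pts, minPts = 10)                               # density-based score

head(order(dist_score, decreasing = TRUE))   # tends to flag the far edge of the diffuse cluster
head(order(lof_score,  decreasing = TRUE))   # tends to flag points isolated relative to their neighbors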


An important part of using this visualization is studying the distance numbers that are calculated. Are these numbers meshing with your intuition? How big of a quantitative difference is there between outliers and other points?


So that is the 2D app. Please send me bug fixes, additional algorithms, or tighter code!

3D+ App?

The next question is whether to expand this to larger datasets. This is something you would run locally (large datasets take too long to run on my shiny server). The downside of larger datasets is that it gets trickier to visualize them. For now, I am using a t-SNE plot. I am open to suggestions, but the intent here is a way to evaluate outlier algorithms on a variety of datasets.


RNN Addition (1st Grade)

Ever since I ran across RNNs, they have intrigued me with their ability to learn. The best background reading is Denny Britz’s tutorial, Karpathy’s totally accessible and fun post on character-level language models, and Colah’s detailed descriptions of LSTMs. Besides all the fun examples of generating content with RNNs, other people have been applying them to win Kaggle competitions and the ECML/PKDD challenge.

I am still blown away by how RNNs can learn to add. RNNs are trained on thousands of examples and can learn how to sum numbers. For example, the Keras addition example shows how to add two numbers of up to 5 digits each (e.g., 54678 + 78967). It achieves 99% train/test accuracy in 30 epochs with a one-layer LSTM (128 hidden units) and 550k training examples.

My eventual goal is to use RNNs to study various kinds of sequence data (such as the NBA SportVu data), so I thought I should start simple. I wanted to teach an RNN to add a series of numbers, for example: 5+7+9. The rest of the post discusses this journey.

1st Grade Model

My first model taught an RNN to add between 5 and 15 single-digit numbers, which would be at the level of a first grader in the US. For example, using a 2-layer LSTM network with 100 hidden units, a batch of 50 training examples, and 5000 epochs, the RNN summed up:

8+6+4+4+0+9+1+1+7+3+9+2+8 as 66.2154007

This isn’t too far from the actual answer of 62. The Keras addition example shows that with even more examples/training, the RNN can get much better. The code for this RNN is available as a gist using tensorflow. I made it in notebook format so it’s easy to play with.
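
To make the setup concrete, here is a minimal sketch in R of the data side of the task (the original gist is a tensorflow notebook, so treat this as an illustration, not the actual code): each input is a sequence of 5 to 15 single digits padded to a fixed length, and the target is their sum.

# Hypothetical data generation for the digit-summing task
make_example <- function(max_len = 15) {
  n      <- sample(5:15, 1)
  digits <- sample(0:9, n, replace = TRUE)
  list(x = c(digits, rep(0, max_len - n)),   # zero-padding is a simplification
       y = sum(digits))                      # target: the sum of the digits
}

set.seed(42)
examples <- replicate(50, make_example(), simplify = FALSE)   # one batch of 50 training examples
X <- t(sapply(examples, `[[`, "x"))   # 50 x 15 matrix of input sequences
y <- sapply(examples, `[[`, "y")      # length-50 vector of target sums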

There are lots of parameters to tweak with RNN models, such as the number of hidden units, epochs, batch size, dropout, and learning rate. Each of these has a different sort of effect on the model. For example, increasing the number of hidden units provides more space for learning, but consequently takes longer to train. The chart below shows the effect of different choices. Please take the time to really study the role of hidden units. It’s a dynamic plot, so you can zoom in and examine each series individually by clicking on the legend.

[Figure: cost by epoch for different model choices]

SportVu Analysis

This post shares some of the code I have created for analyzing NBA SportVu data. For background, the NBA SportVu data is motion data for the basketball and players, captured 25 times a second. For a typical NBA game, this means about 2 million rows of data. The data for over 600 NBA games (the first half of the 2015-2016 season) is available, which is over a billion rows of telematics (IoT) type data. This is a gold mine, and here are some early pieces from studying that data.

The first is basic EDA on the movement data. This code allows you to start analyzing ball and player movement.

The next markdown, PBP, shows how to merge play-by-play data with the SportVu movement data. This allows using the annotated data, which contains information on the type of play, the score, and home/visitor info.

The next set of documents starts analyzing the data. The first measures player spacing using convex hulls. The next shows how to calculate player velocity, acceleration, and jerk; a minimal sketch of this calculation follows below. (I really wanted to do a post on the biggest jerk in the NBA, but unfortunately the jerk data is way too noisy.) The third document offers a few different ways of analyzing player and ball trajectories.
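
To give a flavor of the velocity/acceleration/jerk calculation, here is a minimal sketch that assumes a data frame with one player’s x and y coordinates sampled at 25 Hz (the column names are my own; the repo’s markdown files work from the actual SportVu fields):

dt <- 1 / 25   # SportVu samples 25 times per second

kinematics <- function(df) {
  speed <- c(NA, sqrt(diff(df$x)^2 + diff(df$y)^2) / dt)   # feet per second
  accel <- c(NA, diff(speed) / dt)                          # rate of change of speed
  jerk  <- c(NA, diff(accel) / dt)                          # rate of change of acceleration (very noisy)
  cbind(df, speed = speed, accel = accel, jerk = jerk)
}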

You can find all these files at my SportVu Github repo.

Shiny front end for Tensorflow demo

I built a GUI front end for tensorflow using shiny; the code is available on Github. The shiny app lets you try different inputs, RNN cell types, and even optimizers. The results are shown with plots as well as a link to tensorboard. The app allows anyone to try out these models with a variety of modeling options.

The code for the shiny web app was based on work by Sachin Jogleka. Sachin focused on RNNs that had two numeric inputs. (This is slightly different from most RNN examples, which focus on language models.)

Sachin’s code was modified to allow different cell types and reworked so it could be called from rPython. The shiny web app relies on rPython to run the tensorflow models; a rough sketch of that bridge is below. There is also an iPython notebook in the repository if you would like to test this outside of shiny.
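
As a sketch of the bridge (the module and function names here are illustrative, not the repo’s actual ones), the shiny server can load the Python file once and then call into it whenever a button is pressed:

library(rPython)

python.load("rnn_model.py")                  # hypothetical Python file defining build_model()/train_model()
python.call("build_model", "LSTM")           # cell type chosen in the shiny UI
result <- python.call("train_model", 1000)   # run a chosen number of training iterations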


Live Demo:

I have a live demo of this app, but it’s flaky. Building RNN models is computationally intensive, and the shiny front end is intended to be used on development boxes with tensorflow. My live demo app is limited in several ways. First, the server lacks the horsepower to build models quickly. Second, if the instructions below are not carefully followed, the app will crash. Third, it’s not designed for multiple people building different types of models at the same time. Finally, the tensorboard application sometimes stops running, so the link to tensorboard within the live demo app may not work. Again, to really use this app, please install it locally.

The requirements for the app include tensorflow and numpy on the Python side, and shiny, Metrics, plotly, and rPython on the R side. rPython can be difficult to install/configure, so please verify that rPython is working correctly if you have problems running the code.
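
A quick way to confirm the R-to-Python bridge is working before launching the app (a sanity check I would suggest, not something from the repo):

library(rPython)
python.exec("import tensorflow")   # should run without error if tensorflow is visible to rPython
python.exec("x = 1 + 1")
python.get("x")                    # should return 2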

Using the App:

To use the app, select your model options. For the inputs, there are three options of increasing complexity. Steps for prediction window refers to how far ahead the model is supposed to predict; for this data, 20s seemed a reasonable window. For Cell Type, select one of the cell types and press Initialize Model. Then select the number of iterations (max of 10,000) and press Train. After a few seconds, you will see the output.

Take advantage of the plots to zoom in and out and see the shape of the actual and predicted outputs. To further improve the model, you can add iterations by pressing the train button. The plots show how the RNN model is learning and getting better at predicting the output.

To try a new model, select a new cell type and press initialize model. Then select the number of iterations and press train.

If the app crashes, no worries, it happens. I have not accounted for everything that could go wrong.