Using Unlabeled Data to Label Data

Your boss hands you a pile of a 100,000 unlabeled images and asks you to categorize whether they are sandals, pants, boots, etc.

So now you have a massive set of unlabeled data and you need labels. What should you do?

This problem is commonplace. Lots of companies are swimming with data, whether its transactional, IoT sensors, security logs, images, voice, or more, and its all unlabeled. With so little labeled data, it is a tedious and slow process for data scientists to build machine learning models in most all enterprises.

Take Google’s street view data. Gebru had to figure out how to label cars in 50 million images with very little labeled data. Over at Facebook, they used algorithms to label half a million videos, a task that would have otherwise taken 16 years.

This post shows you how to label hundreds of thousands of images in an afternoon. You can use the same approach whether you are labeling images or labeling traditional tabular data (e.g, identifying cyber security atacks or potential part failures.)

The Manual Method

For most data scientists when asked to do something, the first step is to calculate who else should do this.

But 100,000 images could cost you at least $30,000 on Mechanical Turk or some other competitor. Your boss expects this done cheaply, since after all, they hired you because you use free software. Now, she doesn’t budget for anything other than your salary (if you don’t believe me, ask to go to pydata).

You take a deep breath and figure you can probably label 200 images in an hour. So that means in three weeks of non stop work, you can get this done!! Yikes!

Just Build a Model

The first idea is to label a handful of the images, train a machine learning algorithm, and then predict the remaining set of labels. For this exercise, I am using the Fashion-MNIST dataset (you could also make your own using quickdraw). There are ten classes of images to identify and here is a sample of what they look like:

I like this dataset, because each image is 28 by 28 pixels, which means it contains 784 unique features/variables. For a blog post this works great, but its also not like any datasets you see in the real world, which are often either much narrower (traditional tabluar business problem datasets) or much wider (real images are much bigger and include color).

I built models using the most common data science algorithms: logistic regression, support vector machines (SVM), random forest and gradient boosted machines (GBM).

I evaluated the performance based on labeling 100, 200, 500, 1000, and 2000 images.


At this point in the post, if you are still with me, slow down and mull this graph over. There is a lot of good stuff here. Which algorithm does the best? (If you a data scientist, you shouldn’t fall for that question.) It really depends on the context.

You want something quick and dependable out of the box, you could go for the logistic regression. While the random forest starts way ahead, the SVM is coming on fast. If we had more labeled data the SVM would pass the random forest. And the GBM works great, but can take a bit of work to perform their best. The scores here are using out of the box implementations in R (e1071, randomForest, gbm, nnet).

If our benchmark is 80% accuracy for ten classes of images, we could get there by building a Random Forest model with 1000 images. But 1000 images is still a lot of data to label, 5 hours by my estimate. Lets think about ways we can improve.

Let’s Think About Data

After a little reflection, you remember what you often tell others — that data isn’t random, but has patterns. By taking advantage of these patterns we can get insight in our data.

Lets start with an autoencoder (AE). An autoencoder squeezes and compresses your data, kind of like turning soup into a bouillon cube. Autoencoders are the hipster’s Principle Component Analysis (PCA) , since they support nonlinear transformations.

Effectively this means we are taking our wide data (784 features/variables) reducing it down to 128 features. We then take this new compressed data and train our machine learning algorithm (SVM in this case). The graph below shows the difference in performance between an SVM fed with an autoencoder (AE_SVM) versus the SVM on the raw data.


By squeezing the information down to 128 features, we were able to actually improve the performance of the SVM algorithm at the low end. At the 100 labels mark, accuracy went from 44% to 59%. At the 1000 labels mark, the autoencoder was still helping, we see an improvement from 74% to 78%. So we are on to something here. We just need think a bit more about the distribution and patterns in our data that we can take advantage of.

Thinking Deeper About Your Data

We know that our data are images and since 2012, the hammer for images is a convolutional neural network (CNN). There are a couple of ways we could use a CNN, from a pretrained network or as a simple model to pre-process the images. For this post, I am going to use a Convolutional Variational Autoencoder as a path towards the technique by Kingma for semi-supervised learning.

So lets build a Convolutional Variational Autoencoder (CVAE). The leap here is twofold. First, “variational” means the autoenconder compress the information down into a probability distribution. Second is the addition of using a convolutional neural networks as an encoder. This is a bit of deep learning, but the emphasis here is on how we are solving the problem, not the latest shiny toy.

For coding my CVAE, I used the example CVAE from the list of examples over at RStudio’s Keras page. Like the previous autoencoder, we design the latent space to reduce the data to 128 features. We then use this new data to train an SVM model. Below is a plot of the performance of the CVAE as compared to the SVM and RandomForest on the raw data.


Wow! The new model is much more accurate. We can get well past 80% accuracy with just 500 labels. By using these techniques we get better performance and require less labelled images! At the top end, we can also do much better than the RandomForest or SVM model.

Next Steps

By using some very simple semi-supervised techniques with autoencoders, its possible to quickly and accurately label data. But the takeaway is not to use deep learning auto encoders! Instead, I hope you understand the methodology here of starting very simple and then trying gradually more complex solutions. Don’t fall for the latest shiny toy — pratical data science is not about using the latest approaches found in arxiv.

If this idea of semi-supervised learning inspires you, this post is the logistic regression of semi-supervised learning. If you want to dig further into Semi-Supervised Learning and Domain Adaptation, check out Brian Keng’s great walkthrough of using variational autoencoders (which goes beyond what we have done here) or the work of Curious AI, which has been advancing semi-supervised learning using deep learning and sharing their code. But at the very least, don’t reflexively think all your data has to be hand labeled.


Rajiv Shah is a data scientist at DataRobot, where he works with customers to make and implement predictions. Previously, Rajiv has been part of data science teams at Caterpillar and State Farm. He enjoys data science and spends time mentoring data scientists, speaking at events, and having fun with blog posts. He has a PhD from the University of Illinois at Urbana Champaign.

Using Google's Quickdraw to create an MNIST style dataset!

For those running deep learning models, MNIST is ubiquotuous. This dataset of handwritten digits serves many purposes from benchmarking numerous algorithms (its referenced in thousands of papers) and as a visualization, its even more prevelant than Napoleon’s 1812 March. The digits look like this:

There are many reasons for its enduring use, but much of it is the lack of an alternative. In this post, I want to introduce an alternative, the Google QuickDraw dataset. The quickdraw dataset was captured in 2017 by Google’s drawing game, Quick, Draw!. The dataset consists of 50 million drawings across 345 categories. The drawings look like this:

Build your own Quickdraw dataset

I want to walk through how you can use this drawings and create your own MNIST like dataset. Google has made available 28x28 grayscale bitmap files of each drawing. These can serve as drop in replacements for the MNIST 28x28 grayscale bitmap images.

As a starting point, Google has graciously made the dataset publicly available with documentation on the dataset. All the data is sitting in Google’s Cloud Console, but for the images, you want this link of the numpy_bitmaps.


You should arrive on a page that allows you to download all the images for any category. So this is when you have fun! Go ahead and pick your own categories. I started with eyeglasses, face, pencil, and television. As I learned from the face, the drawings that have fine points can be more difficult to learn. But you should play around and pick fun categories.


The next challenge is taking these .npy files and using them. Here is a short python gist that I used to read the .npy files and combine them to create a 80,000 images dataset that I could use in place of MNIST. They are saved in a hdf5 format that is cross platform and often used in deep learning.


Using Quickdraw instead of MNIST

The next thing is to go have fun with it. I used this dataset in place of MNIST for some work playing around with autoencoders in Python from the Keras tutorials. The below picture represents the original images at the top and reconstructed ones at the bottom, using an autoencoder.


I next used this dataset with a variational autoencoder in R. Here is the code snippet to import the data:

x_test <- t(h5read("x_test.h5", "name-of-dataset"))
x_train <- t(h5read("x_train.h5", "name-of-dataset"))
y_test <- (h5read("y_test.h5", "name-of-dataset"))
y_train <- (h5read("y_train.h5", "name-of-dataset"))

Here is a visualization of its latent space using my custom quickdraw dataset. For me, this was a nice fresh alternative to always staring at the MNIST dataset. So next time you see MNIST listed . . . go build your own!


Deep Learning with R

For R users, there hasn’t been a production grade solution for deep learning (sorry MXNET). This post introduces the Keras interface for R and how it can be used to perform image classification. The post ends by providing some code snippets that show Keras is intuitive and powerful 💪🏽.


Last January, Tensorflow for R was released, which provided access to the Tensorflow API from R. This was signficant, as Tensorflow is the most popular library for deep learning. However, for most R users, the Tensorflow for R interface was not very R like. 🤢 Take a look at this code chunk for training a model:

cross_entropy <- tf$reduce_mean(-tf$reduce_sum(y_ * tf$log(y_conv), reduction_indices=1L))
train_step <- tf$train$AdamOptimizer(1e-4)$minimize(cross_entropy)
correct_prediction <- tf$equal(tf$argmax(y_conv, 1L), tf$argmax(y_, 1L))
accuracy <- tf$reduce_mean(tf$cast(correct_prediction, tf$float32))

for (i in 1:20000) {
  batch <- mnist$train$next_batch(50L)
  if (i %% 100 == 0) {
    train_accuracy <- accuracy$eval(feed_dict = dict(
        x = batch[[1]], y_ = batch[[2]], keep_prob = 1.0))
    cat(sprintf("step %d, training accuracy %g\n", i, train_accuracy))
  train_step$run(feed_dict = dict(
    x = batch[[1]], y_ = batch[[2]], keep_prob = 0.5))

test_accuracy <- accuracy$eval(feed_dict = dict(
     x = mnist$test$images, y_ = mnist$test$labels, keep_prob = 1.0))
cat(sprintf("test accuracy %g", test_accuracy))


Unless you are familiar with tensorflow, it’s not readily apparent what is going on. A quick search on Github finds less than a 100 code results using tensorflow for R. 😔


All this is going to change with Keras and R! ☺️

For background, Keras is a high-level neural network API that is designed for experimentation and can run on top of Tensorflow. Keras is what data scientists like to use. 🤓 Keras has grown in popularity and supported on a wide set of platforms including Tensorflow, CNTK, Apple’s CoreML, and Theano. It is becoming the de factor language for deep learning.

As a simple example, here is the code to train a model in Keras:

model_top %>% fit(
        x = train_x, y = train_y,

Image Classification with Keras

So if you are still with me, let me show you how to build deep learning models using R, Keras, and Tensorflow together. You will find a Github repo at that contains the code and data you will need. Included is an R notebook (and Python notebooks) that walks through building an image classifier (telling 🐱 from 🐶), but can easily be generalized to other images. The walk through includes advanced methods that are commonly used for production deep learning work including:

  • augmenting data
  • using the bottleneck features of a pre-trained network
  • fine-tuning the top layers of a pre-trained network
  • saving weights of models

Code Snippets of Keras

The R interface to Keras truly makes it easy to build deep learning models in R. Here are some code snippets based on my example of building an image classifier to illustrate how intuitive and useful Keras for R is:

To load 🖼 from a folder:

train_generator <- flow_images_from_directory(train_directory, generator = image_data_generator(), target_size = c(img_width, img_height), color_mode = "rgb",
  class_mode = "binary", batch_size = batch_size, shuffle = TRUE,
  seed = 123)

To define a simple convolutional neural network:

model <- keras_model_sequential()

model %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3), input_shape = c(img_width, img_height, 3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 
  layer_conv_2d(filter = 32, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filter = 64, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_flatten() %>%
  layer_dense(64) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(1) %>%

To augment data:

augment <- image_data_generator(rescale=1./255,

To load a pretrained network:

model_vgg <- application_vgg16(include_top = FALSE, weights = "imagenet")

To save model weights:

save_model_weights_hdf5(model_ft, 'finetuning_30epochs_vggR.h5', overwrite = TRUE)

The Keras for R interface makes it much easier for R users to build and refine deep learning models. Its no longer necessary to force everyone to use Python to build, refine, and test deep learning models. I really think this will open up deep learning to a wider audience that was a bit apprehensive on using python.

To start with, you can grab my repo, fire up RStudio (or your IDE of choice), and go build a simple classifier using Keras. There are also a wealth of other examples such as generating text from Nietzsche’s writings, deep dreaming, or creating a variational encoder.

So for now, give it a spin!

An earlier version of this post was posted at Datascience+.

Building Worlds for Reinforcement Learning

OpenAI’s Gym places reinforcement learning into the masses. It comes with a wealth of environments from the classic cart pole, board games, Atari, and now the new Universe which adds flash games and PC Games like GTA or Portal. This is great news, but for someone starting out, working on some of these games is overkill. You can learn a lot more in a shorter time, by playing around with some smaller toy environments.

One area I like within the gym environments are the classic control problems (besides the fun of eating melon and poop). These are great problems for understanding the basics of reinforcement learning because we intutiively understand the rewards and they run really fast. Its not like pong that can take several days to train, instead, you can train these environments within minutes!

If you aren’t happy with the current environments, it is possible to modify and even add more environments. In this post, I will highlight other environments and share how I modified an Acrobot-v1 environment.


To begin, grab the repo for the OpenAI gym. Inside the repo, navigate to gym/envs/classic_controlwhere you will see the scripts that define the class control environments. If you open one of the scripts, you will see a heading on the top that says:

__copyright__ = "Copyright 2013, RLPy"

Ahh! In the spirit of open source, OpenAI stands on the shoulders of another reinforcement library, RLPy. You can learn a lot more about them at the RLPy site or take a look at their github. If you browse here, you can find the original script that was used in OpenAI under rlpy /rlpy/Domains. The interesting thing here is that there are a ton more interesting reinforcement problems!


You can run these using RLPy or you can try and hack this into OpenAI.

Modifying OpenAI Environments

I decided to modify the Acrobot environment. Acrobot is a 2-link pendulum with only the second joint actuated (it has three states, left, right, and no movement). The goal is to swing the end to a height of at least one link above the base. If you look at the leaderboard on OpenAIs site, they meet that criterion, but its not very impressive. Here is the current highest scoring entry:


This is way boring compared to what Hardmaru shows in his demo, where a pendulum is capable of balancing for a short time.hardmaru

So I decided to try and modify the Acrobot demo to make this task a little more interesting, Acrobot gist here. The main change was to the reward system. I added a variable steps_beyond_done that would keep track of successes when the end was swung high. I also changed the reward structure, so it would gradually be rewarded as it swung higher. I also changed g to 0, this removes gravity’s effect.

self.rewardx = (-np.cos(s[0]) - np.cos(s[1] + s[0])) ##Swung height is calculated 
if self.rewardx < .5:
    reward = -1.
    self.steps_beyond_done = 0
if (self.rewardx > .5 and self.rewardx < .8):
    reward = -0.8
    self.steps_beyond_done = 0  
if self.rewardx > .8:
    reward = -0.6 
if self.rewardx > 1:
    reward = -0.4
    self.steps_beyond_done += 1 
if self.steps_beyond_done > 4:
    reward = -0.2
if self.steps_beyond_done > 8:
    reward = -0.1
if self.steps_beyond_done > 12:
    reward = 0.

Another important file to be aware of is where the benchmarks are kept for each environment. You can navigate to this at gym/gym/benchmarks/__init__.pyWithin this file, you will see the following:

{'env_id': 'Acrobot-v1',
         'trials': 3,
         'max_timesteps': 100000,
         'reward_floor': -500.0,
         'reward_ceiling': 0.0,

I then ran an implementation of Asynchronous Advantage Actor Critic A3C) by Arno Moonens. After running for a half hour, you can see the improvement in the algorithm:


Now a half hour later:


The result is teaching the pendulum to stay up for an extended time! This is much more interesting and what I was looking for. I hope this will inspire others to build new and interesting environments.

Taking an H2O Model to Production

One of the best feelings as a Data Scientist is when the model you have poured your heart and soul into, moves into production. Your model is now grown-up and you get to watch it mature.

This post shows how to take a H2O model and move it into a production environment. In this post, we will develop a simple H2O based predictive model, convert it into a Plain Old Java Object (POJO), compile it along with other Java packages, and package the compiled class files into a deployable JAR file so that it can readily be deployed onto any Java based application servers. This model will accept the input data set in the form of CSV file and return the predicted output in CSV format.

H2O is one of my favorite tools for building models because it is well designed from an algorithm perspective, easy to use, and can scale to larger datasets. However, H2O’s documentation, though voluminous, doesn’t have clear instructions for moving a POJO model into production. This post will discuss this approach in greater detail besides providing code for how to do this. (H2O does have a post on doing real time predictions with storm). Special thanks to Socrates Krishnamurthy who co-wrote this post with me.

Building H2O Model

As a starting point, lets use our favorite ice cream dataset to create a toy model in H2O:

  train.h2o <- as.h2o(Icecream)  
  rf <- h2o.randomForest(x=2:4, y=1, ntrees=2, training_frame=train.h2o)   

Once you have developed your model in H2O, then the next step is downloading the POJO:

h2o.download_pojo(rf, getjar=TRUE, path="~/Code/h2o-")
# you must give a path to download a file

This will save two files, a H2O jar file about the model and an actual model file (that begins with DRF and ends with .java). Go ahead and open the model file in a text editor if you want to have a look at it.

Compiling the H2O Model

The next step is to compile and run the model (say, the downloaded model name is, then type:

> javac -cp h2o-genmodel.jar -J-Xmx2g  

This creates a bunch of java class files.

Scoring the Input Data

The final step is scoring some input data. Prior to running the model, it is necessary to have files created for the input and output. For the input, the default setting is to read the first row as a header. The assumption is that the csv is well formed (this approach is not using the H2O parser). Once that is done, run:

> java -cp .:h2o-genmodel.jar --header --model DRF_model_R_1470416938086_15 --input input.csv --output output.csv

If you open the output.csv file, it can be noticed that the predicted values are in Hexadecimal and not in Numeric format. For example, the output will be something like this:


Fixing the Hexadecimal Issue

The model is now predicting, but the predictions are in the wrong format. Yikes! To fix this issue requires some hacking of the java code. The rest of this post will show you how to hack the java code in PredictCsv, which can fix this issue and other unexpected issues with PredictCsv (for example, if your input comes tab separated).

If we take a deeper look at the PredictCsv java file located in the h2o github, the myDoubleToString method returns Hexadecimal string. But the challenge is this method being static in nature, cannot be overridden in a subclass or cannot be updated directly since it was provided by H2O jar file, to return regular numeric value in String format.

This can be fixed by creating a new java file (say, by copying the entire content of from the above location and saving it locally. You then need to:

  • comment out the first line, so it should be //package;
  • change the name of the class name (~line 20) to read: public class NewPredictCsv {
  • correct the hexadecimal issue by changing the return statement of myDoubleToString method to .toString() in lieu of .toHexString() (~line 131).

After creating, compile it using the following command:

> javac -cp h2o-genmodel.jar -J-Xmx2g

Run the compiled file by providing input and output CSV files using the following command (Ensure that the input.csv file is in the current folder where you will run this):

> java -cp .:h2o-genmodel.jar NewPredictCsv --header --model DRF_model_R_1470416938086_15 --input input.csv --output output.csv

If you open the output.csv file now, it will be in the proper numeric format as follows:


Deploying the Solution into Production:

At this point, we have a workable flow for using our model to score new data. But we can clean up the code to make it a little friendlier for our data engineers. First, create a jar file out of the class files created in previous steps. To do that, issue the following command:

> jar cf my-RF-model.jar *.class

This will place all the class files and our NewPredictCsv inside the jar. This is helpful when we have a model with say 500 trees. Now all we need is three files to run our scorer. So copy the above two jar files along with input.csv file in any folder/directory from where the program has to be executed. After copying, the folder should contain following files:

> my-RF-model.jar  
> h2o-genmodel.jar  
> input.csv  

The above input.csv file contains the dataset for which the dependent variable has to be predicted. To compute/ predict the values, run the java command as below:

> java -cp .:my-RF-model.jar:h2o-genmodel.jar NewPredictCsv --header --model DRF_model_R_1470416938086_15 --input input.csv --output output.csv


Replace : with ; in above commands if you are working in Windows (yuck).