Serverless predictions at scale

Yufeng G
5 min read · Sep 14, 2017

Once you're happy with your trained machine learning model, how can you serve up predictions at scale? Find out on this episode of Cloud AI Adventures!

Google’s Cloud Machine Learning Engine lets you create a prediction service for your TensorFlow model without any ops work. Get more time to work with your data by going from a trained model to a deployed, auto-scaling prediction service in a matter of minutes.

Serving Predictions: the final step

So, we’ve gathered our data, finished training a suitable model, and validated that it performs well. We’re now ready to move on to the final phase: serving predictions.

When taking on the challenge of serving predictions, we ideally want to deploy a model that is purpose-built for serving: a fast, lightweight model that is static, since we don’t want any updates happening while we’re serving.

Additionally, we want our prediction server to scale with demand, which adds another layer of complexity to the problem.

Exporting TensorFlow models

It turns out that TensorFlow has a built-in function that can take care of generating an optimized model for serving predictions! It handles all the adjustments needed, which saves you a lot of work.

The function we’re interested in is called export_savedmodel(), and we can run it directly on the classifier object once we’re satisfied with the state of the trained model.

This takes a snapshot of your model and exports it as a set of files that you can use elsewhere. As your model improves, you can continue to produce updated exports, giving you multiple versions of your model over time.

The export consists of a file and a folder: the saved_model.pb file, which defines the model structure, and the variables folder, which holds two files supplying the trained weights of our model.
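To make this concrete, here’s a minimal sketch of what that export might look like for an Iris classifier built with the tf.estimator API. The feature names, the exports directory, and the already-trained classifier object are placeholders for illustration:

```python
import tensorflow as tf

# Serving-time inputs: placeholders for the four Iris measurements.
# The feature names here are assumptions; they should match the
# feature columns the classifier was trained with.
def serving_input_fn():
    feature_placeholders = {
        'SepalLength': tf.placeholder(tf.float32, [None]),
        'SepalWidth': tf.placeholder(tf.float32, [None]),
        'PetalLength': tf.placeholder(tf.float32, [None]),
        'PetalWidth': tf.placeholder(tf.float32, [None]),
    }
    # Pass the raw features straight through to the model.
    return tf.estimator.export.ServingInputReceiver(
        feature_placeholders, feature_placeholders)

# Snapshot the trained classifier into a timestamped SavedModel
# directory, e.g. exports/1505422483/, containing saved_model.pb
# and a variables/ folder.
export_dir = classifier.export_savedmodel('exports', serving_input_fn)
print(export_dir)
```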

Serving a model in production

Once you have an exported model, you are ready to serve it in production. Here you have two primary options: use TensorFlow Serving, or the Cloud Machine Learning Engine prediction service.

TensorFlow Serving is a part of TensorFlow and is available on GitHub. It is useful if you enjoy configuring your production infrastructure and scaling for demand.

However, today we will focus our attention on the Cloud Machine Learning Engine prediction service; both options consume the same exported model files.

Cloud Machine Learning Engine allows you to take an exported TensorFlow model and turn it into a prediction service with a built-in API endpoint and auto-scaling that goes all the way down to zero (aka no compute charges when no one is requesting predictions!).

It’s also complete with a feature-rich command line tool, API, and UI, so we can interact with it in a number of different ways depending on our preferences.

Deploying a new prediction model

Let’s see an example of how to use Cloud Machine Learning Engine’s prediction service with our Iris example.

Export and upload

First, we’ll run export_savedmodel() on our trained classifier, as sketched above. This generates an exported model that we can use for our prediction service.

Next, we’ll upload the exported files to Google Cloud Storage. Cloud Machine Learning Engine will read from Cloud Storage when creating a new model version.

Be sure to choose the regional storage class when creating your bucket, to ensure your compute and storage are in the same region.
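If you’d rather script the upload than use the console (gsutil cp -r works just as well), here’s a rough sketch using the google-cloud-storage Python library; the bucket name and export path are placeholders:

```python
import os
from google.cloud import storage

bucket_name = 'my-iris-bucket'      # placeholder: a regional bucket
export_dir = 'exports/1505422483'   # directory from export_savedmodel()

client = storage.Client()
bucket = client.bucket(bucket_name)

# Copy saved_model.pb and the variables/ folder up to Cloud Storage,
# preserving the directory layout.
for root, _, files in os.walk(export_dir):
    for filename in files:
        local_path = os.path.join(root, filename)
        bucket.blob(local_path).upload_from_filename(local_path)
```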

Create a new model

In the Cloud Machine Learning Engine UI, we can create a new model, which is really just a wrapper around all of our released versions. Versions hold our individual exported models, while the model abstraction routes incoming traffic to the version of our choice.

Here we are in the models list view, where we can create a new model.

All it takes to create a Model is to give it a name. Let’s call ours iris_model.
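The same step can be done from the command line or the API. For example, here’s a rough sketch using the Google API Python client (the project ID is a placeholder):

```python
from googleapiclient import discovery

project_id = 'my-project'  # placeholder for your GCP project ID

# Build a client for the Cloud Machine Learning Engine API.
ml = discovery.build('ml', 'v1')

# Create the model "wrapper" that will hold our deployed versions.
ml.projects().models().create(
    parent='projects/{}'.format(project_id),
    body={'name': 'iris_model'}
).execute()
```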

Create a new version

Next, we’ll create a version by choosing a name for this particular model version and pointing it to the Cloud Storage directory that holds our exported files.

And just like that, we’ve created our model! All it took was pointing the service at our exported model and giving it a name.
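For reference, the equivalent API call looks roughly like this, assuming the same placeholder project and bucket as before; deploymentUri points at the Cloud Storage directory holding saved_model.pb and variables/:

```python
from googleapiclient import discovery

project_id = 'my-project'                                  # placeholders
deployment_uri = 'gs://my-iris-bucket/exports/1505422483'

ml = discovery.build('ml', 'v1')

# Create version "v1" of iris_model, pointing at the exported files.
# This kicks off a long-running operation while the service deploys it.
ml.projects().models().versions().create(
    parent='projects/{}/models/iris_model'.format(project_id),
    body={
        'name': 'v1',
        'deploymentUri': deployment_uri,
        'runtimeVersion': '1.2',  # the TensorFlow version used in training
    }
).execute()
```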

How could it take so little work? Well, the service handled all the operational aspects of setting up and securing the endpoint. Moreover, we didn’t need to write our own code for scaling it out based on demand. And since this is the cloud, this elasticity means you don’t need to pay for unused compute when demand is low.
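Once the version is deployed, requesting predictions from the endpoint is a single API call. Here’s a sketch using the same Python client, with placeholder project and feature values:

```python
from googleapiclient import discovery

project_id = 'my-project'  # placeholder, as before

ml = discovery.build('ml', 'v1')

# Send one set of Iris measurements; the service routes the request
# to the model's default version and returns its predictions.
response = ml.projects().predict(
    name='projects/{}/models/iris_model'.format(project_id),
    body={'instances': [
        {'SepalLength': 6.9, 'SepalWidth': 3.1,
         'PetalLength': 5.1, 'PetalWidth': 2.3}
    ]}
).execute()

print(response['predictions'])
```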

By setting up a prediction service for our Iris model without any ops work, we’ve gone from a trained model to a deployed, auto-scaling prediction service in a matter of minutes, which means more time to get back to working with our data!

Thanks for reading this edition of Cloud AI Adventures. Be sure to subscribe to the channel to catch future episodes as they come out.

If you want to learn more about TensorFlow Serving, check out this talk from earlier this year at the TensorFlow Dev Summit: https://www.youtube.com/watch?v=sqYdlSF0BI8

And don’t forget, when it comes to scaling your machine learning service in production: Cloud Machine Learning Engine’s got your back.

