Machine Learning

How to Build Production-Level Machine Learning Systems

Peter Foy
Peter Foy

Table of Contents

In this article, we'll look at how to build production-level machine learning systems.

Building models that perform well on unseen data is, of course, a key component of machine learning. That said, real world production-level ML systems are much larger ecosystems of which the model is only a small part.

This article is based on notes from the second course in the Advanced Machine Learning with TensorFlow on GCP Specialization and is organized as follows:

  • Architecting Production ML Systems
  • Training Architecture Decisions
  • Serving Architecture Decisions
  • Ingesting Data into the Cloud
  • Designing Adaptable Machine Learning Systems
  • Performance Consideration for ML Models
  • Distributed Training Architectures

This post may contain affiliate links. See our policy page for more information.

Architecting Production ML Systems

In general, the machine learning model will account for roughly 5% of the overall code of a production-level system.

The other components that make up a system's code include:

  • Data ingestion
  • Data analytics and validation
  • Data transformation and training
  • Tuning, model evaluation and validation
  • Serving and logging
  • Orchestration and workflow
  • Integrated frontend & storage

Let's look at each of these components in a bit more detail.

Data Ingestion

Before we can start working with the data, it needs to be ingested into the system. The first consideration here is to determine where and how it will be ingested. For example, you may need to build an ingestion pipeline for streaming data. Other times the data may be ingested from a data warehouse through structured batches.

Data Analysis & Validation

The next step is to determine the quality of the data through analysis and validation. Data analysis is all about understanding the distribution of the data, whereas validation is centered around verifying the health of the data.

Validating the health of the data includes understanding the similarity of distribution in the data, whether or not all the expected features are present, and so on.

Data Transformation and Training

Data transformation allows for feature wrangling, such as generating feature to integer mappings.

The trainer component is responsible for training the model. It should support both data and model parallelism, which we'll discuss in more detail later. Training should also automatically monitor and log everything and support hyperparameter tuning.

Tuning, Model Evaluation, and Validation

Since the model will typically need hyperparameter tuning, the system needs to be able to operate multiple experiments in parallel.

Model evaluation and validation is responsible for ensuring the models are performing well before moving them into production.

Model evaluation is an iterative process and often involves a person or group of people assessing the model against a relevant business metric. Model validation, on the other hand, is not human-facing and involves evaluating the model against fixed thresholds.

Serving & Logging

The serving component should be low-latency, highly efficient, and scale horizontally in order to be reliable and robust.

Serving a model is centered around responding to variable user demand, maximizing throughput, and minimizing response latency. It should also be simple to update and transition to new versions of the model.

Logging is critical for debugging and comparison. All logs must be accessible and integrated.

Orchestration and Workflow

Since there are many individual components that all talk to each other, it's imperative that they all share resources and a configuration framework. This requires that you have a common architecture for machine learning R&D and production engineers.

Orchestration refers to the component responsible for gluing all the other components together.

In GCP, for example, orchestration can be done with Cloud Composer and the steps to compose a workflow include:

  • Define the Ops
  • Arrange it into a DAG, or Directed acyclic graph
  • Upload the DAG to the environment
  • Explore the DAG Run it in the web UI

You can learn more about writing workflows for Cloud Composer in this GCP tutorial.

Integrated Frontend and Storage

Similar to how components need to talk to each other, users of the system need to be able to accomplish tasks in a central location. This can be done with TensorBoard, which is TensorFlow's visualization toolkit. It also allows you to debug TensorFlow code in real-time.


Storage is also needed for staging intermediate output fo these components.

Training Architecture Decisions

A key decision to make in a production ML system relates to model training.

In particular, there are two paradigms to choose from: static vs. dynamic training.

In the case of static training, at a high-level the workflow operates as follows:

  • Acquire data
  • Transform data
  • Train model
  • Test model
  • Deploy model

Dynamic training involves the same steps, although they are done repeatedly as new data arrives.

The benefit of static training is that it's much easier to build, although the model can become stale quickly. Dynamic systems are slightly harder to build, although the benefit is that they can adapt to changes.

One of the reasons that dynamic systems are harder to build is that the new data may contain bugs—this will be discussed later in the section on designing adaptable ML systems.

Serving Architecture Decisions

In addition to the training architecture, the specific use case determines the necessary serving architecture.

One of the goals of the serving architecture is to minimize average latency. When serving machine learning models, this means we want to avoid bottlenecks from slow predictions.

One way to optimize performance is to use a cache.

In the case of static serving, the system can compute the label ahead of time and serve it by just looking in up in a table.

Dynamic serving, on the other hand, computes the label on-demand.

There is a space-time tradeoff between these serving architectures:

  • Static serving is space-intensive as it stores pre-computed predictions, but it has low, fixed latency and lower maintenance costs
  • Dynamic serving is compute intensive with a lower storage cost, variable latency, and higher maintenance

In order to choose between these two, you need to consider the importance of latency, storage, and CPU costs.

In practice, however, you often choose a hybrid solution that statically cache some of the predictions and respond on-demand for long-tail predictions.

Ingesting Data into the Cloud

The main reason that you need to bring your data into the cloud for a production-level system is to take advantage of the scale and fully-managed services available.

To understand how to get data into the cloud, we'll discuss three data scenarios:

  • Data on-premise
  • Large datasets
  • Data on other cloud platforms
  • Existing databases

Data On-Premise

If your data is on-premise, the easiest way to transfer it to the cloud is often to simply drag-and-drop it into a Google Cloud Storage bucket from your web browser.

You can also use the JSON API to manage new and existing buckets.

Once your data is in Cloud Storage, there are four options to store it:

  • Multi-Regional
  • Regional
  • Nearline
  • Coldline

For machine learning data, it's recommended to store it in a single region that's geographically closest to you. This will often speed up performance if the data is in the same region as your compute resources.

Large Datasets

The threshold for a large dataset is roughly 60 terabytes of data. In this case, one option is to receive a physical device called the Google Cloud Transfer Appliance, which can hold up to 1 petabyte of data.

To provide some context, if you were transferring 1 petabyte of data over a typical 100 bit connection, migration would take 3 years. This would take on average 43 days with the Transfer Appliance.

Data on Other Clouds

If your organization uses multiple cloud providers, you can use a service such as Cloud-to-Cloud Transfer, which transfers data from an online data source to a data sink.

Existing Databases

If data already exists in your database application, there are several options for transferring it including the BigQuery data transfer service, migrations to Cloud SQL, and Cloud Dataproc for Hadoop jobs.

Designing Adaptable Machine Learning Systems

In this section we'll look at how our models are dependent on data, cost-conscious engineering discussions, and implementing a pipeline that is immune to a certain type of dependency.

Most software today is modular, meaning that it depends on other software.

The reason for this is that modular software is more maintainable—they are easier to reuse, test, and fix since you can focus on small pieces of code instead of the entire program.

Managing dependencies is also much easier today thanks to package management tools such as pip. Containers also make managing dependencies easier. Docker describes containers as follows:

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

As in traditional software engineering, managing dependencies properly is key to production ML systems in order to avoid losses in prediction quality, decreases in system stability, and decreases in team productivity.

Adapting to Data Distribution Changes

We'll now look at how we can properly manage data dependencies. First, we'll look at a few ways that data changes can affect the model.

In the context of an upstream model,  performance would typically degrade if it expected one input and got another—for example, if you're using data from another team and they made a change without telling you.

The statistical term for changes in the likelihood of observed values is changes in distribution.

The distribution of labels may naturally change over time, which could indicate the relationship between features and labels is also changing.

It could also be the features that change their distribution, for example ZIP codes are not fixed and new ones are released and old ones deprecated.

There are a few methods to protect your models from distribution changes, including:

  • Monitor descriptive statistics for your inputs and outputs
  • Monitor residuals as a function of your inputs
  • Use custom weights in the loss function to emphasize more recent data
  • Use dynamic training architecture and regularly retrain the model

Mitigating Training-Serving Skew

Training-serving skew refers to differences in performance that occur due to differences in environment.

There are three main causes of training-serving skew:

  • A discrepancy between how the model handles data in training and serving pipelines
  • A change in the data between when you train and when you serve the model
  • A feedback loop between the model and algorithm

One of the best ways to mitigate training-serving skew is to write code that can be used in development and production environments. You can find a lab that demonstrates how to do this in Week 2 of this course on Production Level Machine Learning Systems.

Performance Consideration for ML Models

In this section, we'll discuss several performance considerations for production ML models.

Keep in mind that not all machine learning models have the same performance requirements—some will be focused on improving IO performance while others will be focused on computational speed.

Depending on the model requirements, you will need to choose the appropriate ML infrastructure. In addition to hardware selection, you also need to select a distribution strategy.

Training Considerations

A key consideration for performance is the time it takes to train a model. In particular, we need to consider how long various architectures will take to get an equivalent level of accuracy.

Aside from training time, you need to consider the cost of various training architectures. When it comes to optimizing your training budget, there are three main considerations: time, cost, and scale.

Another way to optimize training the training time and budget is to use an earlier model checkpoint, which will typically converge faster than starting from scratch.

Model training performance will be constrained by one of three things:

  • Input/output: consider how fast you can get the data into the model for training
  • CPU: consider how fast you can compute the gradient in each training step?
  • Memory: consider how many weights can you hold in memory so you can do matrix multiplications in memory or use a GPU or TPU

Depending on the type of model, you will need to consider which one of these three constraints apply.

Prediction Considerations

In addition to training, you'll need to consider the performance requirements of the models predictions.

If you're doing batch prediction, the requirements will be similar to that of training—such as the time, cost, and scale of inference.

If you're doing online predictions, you'll typically perform prediction for one end-user on one machine.

The performance consideration for predictions is not how many training steps you can carry out, but how many queries you can handle per second. The unit for this is called QPS, or Queries Per Second.

Distributed Training Architectures

One of the reasons for the success of deep learning is because datasets are large, but this also requires compute to keep increasing.

This growth in data size and algorithmic complexity means that distributed systems are essentially a necessity for machine learning.

There are several architectures to scale distributed training, including:

Data Parallelism

Data parallelism refers to running the same model and computation on every device, each using different training samples. You then update the model parameters using each of the gradients.

There are two main approaches to data parallelism:

  • Async Parameter Server: Some devices are designated to be parameter servers and others are workers. Each worker gets parameters from the server and computes gradients based on a subset of training samples.
  • Sync Allreduce: In this architecture, each worker holds a copy of the model's parameters. Each worker computes gradients based on the training samples and communicate with each other to propagate the gradients and update parameters.

Given these two strategies, here are a few considerations to pick one over the other:

  • Use the Async Parameter Server approach if there are many low-power or unreliable workers. This is the more common and mature approach to data parallelism.
  • Use Sync Allreduce if you have multiple devices on one host and fast devices with strong links, such as TPUs. This approach is getting more traction due to advances in hardware.

Aside from data parallelism, there is also a type of distributed training called model parallelism. This is used when the model gets so large that it doesn't fit into a single devices' memory.

Summary: Production-Level Machine Learning Systems

In this article, we discussed various components of building production-level machine learning systems.

Building production ML systems requires decisions about both training and serving architectures, such as whether or not to use static or dynamic training and inference.

When it comes to designing adaptable ML systems, there are several ways to reduce the affects of changes in data distribution such as monitoring descriptive statistics and using custom weights in the loss function to emphasize recent data.

We also looked at how to mitigate training-serving skew, which refers to variations in performance due to differences in environments. Finally, we reviewed several performance-related considerations and common distributed training architectures.

When building production-level ML systems, it's important to remember that the model is only a small part of a much larger ecosystem that requires many engineering considerations depending on the data and use case.


Join the conversation.