Solving numerical optimization problems like scheduling, routing, and allocation with Amazon SageMaker Processing

In this post, we discuss solving numerical optimization problems using the very flexible Amazon SageMaker Processing API. Optimization is the process of finding the minimum (or maximum) of a function that depends on some inputs, called design variables. This pattern is relevant to solving business-critical problems such as scheduling, routing, allocation, shape optimization, trajectory optimization, and others. Several commercial and open-source solvers are available for solving such problems. We demonstrate this solution with three popular Python libraries and solvers that are free to use, and provide a sample notebook that shows how to solve these optimization problems using SageMaker Processing.

Solution overview

SageMaker Processing lets data scientists and ML engineers easily run preprocessing, postprocessing, and model evaluation workloads on SageMaker. The SageMaker Python SDK provides built-in containers for scikit-learn and Spark. You can also use your own Docker images without having to conform to any Docker image specification, which is what we do in this post. This gives you maximum flexibility in running any code you want, whether on SageMaker Processing, on AWS container services like Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS), or even on premises. First, we build and push a Docker image that includes several popular optimization packages and solvers, and then we use this Docker image to solve three example problems:

  • Minimize the cost of shipping goods through a distribution network
  • Schedule shifts for a set of nurses in a hospital
  • Find a trajectory for landing the Apollo 11 Lunar Module with the least amount of fuel

We solve each use case using a different interface that connects to a different solver. We complete the following high-level steps (as in the provided example notebook) for each problem:

  1. Build a Docker container that contains useful Python interfaces (such as Pyomo and PuLP) to optimization solvers (such as GLPK and CBC).
  2. Build and push the image to a repository in Amazon Elastic Container Registry (Amazon ECR).
  3. Use the SageMaker Python SDK (from a notebook or elsewhere with the right permissions) to point to the Docker image in Amazon ECR and send in a Python file with the actual optimization problem.
  4. Monitor the logs in a notebook or Amazon CloudWatch Logs, and obtain any outputs you need from a dedicated output folder that you specify in Amazon Simple Storage Service (Amazon S3).

Schematically, this process looks like the following diagram.

Let’s get started!

Building and pushing a Docker container

Start with the following Dockerfile:

FROM continuumio/anaconda3

RUN pip install boto3 pandas scikit-learn pulp pyomo inspyred ortools scipy deap 

RUN conda install -c conda-forge ipopt coincbc glpk


ENTRYPOINT ["python"]

In this code, we install the Python libraries PuLP, Pyomo, Inspyred, OR-Tools, SciPy, and DEAP with pip, and the solvers Ipopt, CBC, and GLPK with conda. For more information about these libraries and solvers, see the References section at the end of this post.

We then use the following commands from the notebook to build and push this container to Amazon ECR:

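import boto3

# Look up the account ID and Region that make up the Amazon ECR image URI
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name
ecr_repository = 'sagemaker-opt-container'
tag = ':latest'
processing_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)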

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

Sample output for these commands looks like the following:

Sending build context to Docker daemon  2.048kB
Step 1/5 : FROM continuumio/anaconda3
 ---> 5fbf7bac70a0
Step 2/5 : RUN pip install boto3 pandas scikit-learn pulp pyomo inspyred ortools scipy deap
 ---> Using cache
 ---> 98864164a472
Step 3/5 : RUN conda install -c conda-forge ipopt coincbc glpk
 ---> Using cache
 ---> 1fde58988350
 ---> Using cache
 ---> 06cc27c84a9a
Step 5/5 : ENTRYPOINT ["python"]
 ---> Using cache
 ---> 0ae65a2ad5b9
Successfully built <code>
Successfully tagged sagemaker-opt-container:latest
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See

Login Succeeded

An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'sagemaker-opt-container' already exists in the registry with id '<account number>'
The push refers to repository [<account number>.dkr.ecr.<region>]

8611989d: Preparing 
d960633d: Preparing 
db96c31c: Preparing 
ea6160d7: Preparing 
4bce66cd: Layer already exists latest: digest: sha256:<hash> size: 1379
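
The image URI that the push targets follows the standard Amazon ECR naming pattern. As a minimal sketch, the following helper assembles it from its parts; the function name ecr_image_uri and the account ID and Region used in the example are placeholders introduced here for illustration, not values from the notebook.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Return an image URI of the form <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID and Region, used only to show the shape of the URI
uri = ecr_image_uri("123456789012", "us-east-1", "sagemaker-opt-container")
print(uri)
```

Building the URI in one place makes it easier to spot a missing Region or registry domain before running docker tag and docker push.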

Using the SageMaker Python SDK to start a job

Typically, we first initialize a SageMaker Processing ScriptProcessor as follows:

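from sagemaker.processing import ScriptProcessor

# role and instance_type are assumptions: pass your own IAM role ARN and instance type
script_processor = ScriptProcessor(command=['python'], image_uri=processing_repository_uri,
                                   role=role, instance_count=1, instance_type='ml.m5.xlarge')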

Then we write a Python file that contains the optimization problem (for this post, we always use the same file name) and run a processing job on SageMaker as follows:

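from sagemaker.processing import ProcessingInput, ProcessingOutput

# '<script>.py' is a placeholder for the Python file that contains the optimization problem
script_processor.run(code='<script>.py',
                     outputs=[ProcessingOutput(output_name='data', source='/opt/ml/processing/data')])
script_processor_job_description = script_processor.jobs[-1].describe()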
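
Inside the container, anything the script writes to the processing output folder is what ends up in Amazon S3. The following is a minimal sketch of that pattern, assuming the output folder is /opt/ml/processing/data (the path used by the use case 1 script); the helper name save_results and the file name results.json are hypothetical, and the example writes to a temporary folder so that it also runs outside a processing job.

```python
import json
import os
import tempfile

# Assumption: the container folder that the processing job uploads to Amazon S3
DEFAULT_OUTPUT_DIR = "/opt/ml/processing/data"


def save_results(results: dict, output_dir: str = DEFAULT_OUTPUT_DIR,
                 filename: str = "results.json") -> str:
    """Write a results dictionary as JSON into the output folder and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return path


# Local demonstration with made-up values, written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    path = save_results({"status": "Optimal", "objective": 42.0}, output_dir=tmp)
    with open(path) as f:
        loaded = json.load(f)
print(loaded)
```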

Use case 1: Minimizing the cost of shipping goods through a distribution network

In this use case, American Steel, an Ohio-based steel manufacturing company, produces steel at its two steel mills located at Youngstown and Pittsburgh. The company distributes finished steel to its retail customers through the distribution network of regional and field warehouses.

The network represents shipment of finished steel from American Steel's two steel mills, located at Youngstown (node 1) and Pittsburgh (node 2), to its field warehouses at Albany, Houston, Tempe, and Gary (nodes 6, 7, 8, and 9) through three regional warehouses located at Cincinnati, Kansas City, and Chicago (nodes 3, 4, and 5). Some field warehouses can also be supplied directly from the steel mills.

The following table presents the minimum and maximum flow amounts of steel that may be shipped between different cities, along with the cost per 1,000 tons per month of shipping the steel. For example, the shipment from Youngstown to Kansas City is contracted out to a railroad company with a minimum shipping clause of 1,000 tons per month. However, the railroad can’t ship more than 5,000 tons per month due to the shortage of rail cars.

From node      To node        Cost    Minimum    Maximum
Youngstown     Albany         500     -          1000
Youngstown     Cincinnati     350     -          3000
Youngstown     Kansas City    450     1000       5000
Youngstown     Chicago        375     -          5000
Pittsburgh     Cincinnati     350     -          2000
Pittsburgh     Kansas City    450     2000       3000
Pittsburgh     Chicago        400     -          4000
Pittsburgh     Gary           450     -          2000
Cincinnati     Albany         350     1000       5000
Cincinnati     Houston        550     -          6000
Kansas City    Houston        375     -          4000
Kansas City    Tempe          650     -          4000
Chicago        Tempe          600     -          2000
Chicago        Gary           120     -          4000

The objective of transshipment problems in general, and of the American Steel problem in particular, is to minimize the cost of shipping goods through the network.

Every node has a supply and a demand: demand = 0 for supply nodes, supply = 0 for demand nodes, and supply = demand = 0 for transshipment nodes. Apart from the minimum and maximum flow on each arc, the only constraints in the transshipment problem are flow conservation constraints. These constraints simply state that the flow of goods into a node must be greater than or equal to the flow of goods out of a node.
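
In symbols, one way to write this description is the following linear program. This is a sketch in notation introduced here for illustration: A is the set of arcs, x_ij is the flow on arc (i, j), c_ij is its cost, l_ij and u_ij are its minimum and maximum flow, and s_n and d_n are the supply and demand at node n.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{(i,j) \in A} c_{ij}\, x_{ij} \\
\text{subject to} \quad & s_n + \sum_{(i,n) \in A} x_{in} \;\ge\; d_n + \sum_{(n,j) \in A} x_{nj}
  \quad \text{for every node } n, \\
& l_{ij} \le x_{ij} \le u_{ij} \quad \text{for every arc } (i,j) \in A .
\end{aligned}
```

The first constraint is the flow conservation condition, with supply counted as inflow and demand counted as outflow.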

This problem can be formulated as follows:


import argparse
import os
import warnings

"""
The American Steel Problem for the PuLP Modeller

Authors: Antony Phillips, Dr Stuart Mitchell  2007
"""

# Import PuLP modeller functions
from pulp import *

# List of all the nodes
Nodes = ["Youngstown",
         "Pittsburgh",
         "Cincinatti",
         "Kansas City",
         "Chicago",
         "Albany",
         "Houston",
         "Tempe",
         "Gary"]

nodeData = {# NODE        Supply Demand
         "Youngstown":    [10000,0],
         "Pittsburgh":    [15000,0],
         "Cincinatti":    [0,0],
         "Kansas City":   [0,0],
         "Chicago":       [0,0],
         "Albany":        [0,3000],
         "Houston":       [0,7000],
         "Tempe":         [0,4000],
         "Gary":          [0,6000]}

# List of all the arcs
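Arcs = [("Youngstown","Albany"),
        ("Youngstown","Cincinatti"),
        ("Youngstown","Kansas City"),
        ("Youngstown","Chicago"),
        ("Pittsburgh","Cincinatti"),
        ("Pittsburgh","Kansas City"),
        ("Pittsburgh","Chicago"),
        ("Pittsburgh","Gary"),
        ("Cincinatti","Albany"),
        ("Cincinatti","Houston"),
        ("Kansas City","Houston"),
        ("Kansas City","Tempe"),
        ("Chicago","Tempe"),
        ("Chicago","Gary")]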

arcData = { #      ARC                Cost Min Max
        ("Youngstown","Albany"):      [0.5,0,1000],
        ("Youngstown","Cincinatti"):  [0.35,0,3000],
        ("Youngstown","Kansas City"): [0.45,1000,5000],
        ("Youngstown","Chicago"):     [0.375,0,5000],
        ("Pittsburgh","Cincinatti"):  [0.35,0,2000],
        ("Pittsburgh","Kansas City"): [0.45,2000,3000],
        ("Pittsburgh","Chicago"):     [0.4,0,4000],
        ("Pittsburgh","Gary"):        [0.45,0,2000],
        ("Cincinatti","Albany"):      [0.35,1000,5000],
        ("Cincinatti","Houston"):     [0.55,0,6000],
        ("Kansas City","Houston"):    [0.375,0,4000],
        ("Kansas City","Tempe"):      [0.65,0,4000],
        ("Chicago","Tempe"):          [0.6,0,2000],
        ("Chicago","Gary"):           [0.12,0,4000]}

# Splits the dictionaries to be more understandable
(supply, demand) = splitDict(nodeData)
(costs, mins, maxs) = splitDict(arcData)

# Creates the boundless Variables as Integers
vars = LpVariable.dicts("Route",Arcs,None,None,LpInteger)

# Creates the upper and lower bounds on the variables
for a in Arcs:
    vars[a].bounds(mins[a], maxs[a])

# Creates the 'prob' variable to contain the problem data    
prob = LpProblem("American Steel Problem",LpMinimize)

# Creates the objective function
prob += lpSum([vars[a]* costs[a] for a in Arcs]), "Total Cost of Transport"

# Creates all problem constraints - this ensures the amount going into each node is at least equal to the amount leaving
for n in Nodes:
    prob += (supply[n]+ lpSum([vars[(i,j)] for (i,j) in Arcs if j == n]) >=
             demand[n]+ lpSum([vars[(i,j)] for (i,j) in Arcs if i == n])), "Steel Flow Conservation in Node %s"%n

# The problem data is written to an .lp file
prob.writeLP('/opt/ml/processing/data/' + 'AmericanSteelProblem.lp')

# The problem is solved using PuLP's choice of Solver
prob.solve()

# The status of the solution is printed to the screen
print("Status:", LpStatus[prob.status])

# Each of the variables is printed with its resolved optimum value
for v in prob.variables():
    print(v.name, "=", v.varValue)

# The optimised objective function value is printed to the screen    
print("Total Cost of Transportation = ", value(prob.objective))

We solve this problem using the PuLP interface and its default solver. Logs from this optimization job provide the following solution:

Status: Optimal
Route_('Chicago',_'Gary') = 4000.0
Route_('Chicago',_'Tempe') = 2000.0
Route_('Cincinatti',_'Albany') = 2000.0
Route_('Cincinatti',_'Houston') = 3000.0
Route_('Kansas_City',_'Houston') = 4000.0
Route_('Kansas_City',_'Tempe') = 2000.0
Route_('Pittsburgh',_'Chicago') = 3000.0
Route_('Pittsburgh',_'Cincinatti') = 2000.0
Route_('Pittsburgh',_'Gary') = 2000.0
Route_('Pittsburgh',_'Kansas_City') = 3000.0
Route_('Youngstown',_'Albany') = 1000.0
Route_('Youngstown',_'Chicago') = 3000.0
Route_('Youngstown',_'Cincinatti') = 3000.0
Route_('Youngstown',_'Kansas_City') = 3000.0
Total Cost of Transportation =  15005.0
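
As a quick sanity check on these logs, the following sketch recomputes the total cost and the flow conservation condition from the logged flows, without calling a solver. The costs, supplies, and demands are copied from the script, and the flows are copied from the log output; the variable and function names are introduced here for illustration.

```python
# arc -> (cost per ton from arcData, flow from the job logs)
solution = {
    ("Youngstown", "Albany"): (0.5, 1000),
    ("Youngstown", "Cincinatti"): (0.35, 3000),
    ("Youngstown", "Kansas City"): (0.45, 3000),
    ("Youngstown", "Chicago"): (0.375, 3000),
    ("Pittsburgh", "Cincinatti"): (0.35, 2000),
    ("Pittsburgh", "Kansas City"): (0.45, 3000),
    ("Pittsburgh", "Chicago"): (0.4, 3000),
    ("Pittsburgh", "Gary"): (0.45, 2000),
    ("Cincinatti", "Albany"): (0.35, 2000),
    ("Cincinatti", "Houston"): (0.55, 3000),
    ("Kansas City", "Houston"): (0.375, 4000),
    ("Kansas City", "Tempe"): (0.65, 2000),
    ("Chicago", "Tempe"): (0.6, 2000),
    ("Chicago", "Gary"): (0.12, 4000),
}
supply = {"Youngstown": 10000, "Pittsburgh": 15000}
demand = {"Albany": 3000, "Houston": 7000, "Tempe": 4000, "Gary": 6000}

# Objective: sum of cost times flow over all arcs
total_cost = sum(cost * flow for cost, flow in solution.values())


def balance(node):
    """Supply plus inflow minus demand plus outflow; must be >= 0 for every node."""
    inflow = sum(flow for (i, j), (_, flow) in solution.items() if j == node)
    outflow = sum(flow for (i, j), (_, flow) in solution.items() if i == node)
    return supply.get(node, 0) + inflow - demand.get(node, 0) - outflow


nodes = {node for arc in solution for node in arc}
conserved = all(balance(node) >= 0 for node in nodes)

print(round(total_cost, 2), conserved)
```

Pittsburgh is the only node with slack: it ships 10,000 of its 15,000 tons of supply, which the greater-than-or-equal form of the constraint allows.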

Use case 2: Scheduling shifts of a set of nurses in a hospital

In the next example, a hospital supervisor must create a schedule for four nurses over a 3-day period, subject to the following conditions:

  • Each day is divided into three 8-hour shifts
  • Every day, each shift is assigned to a single nurse, and no nurse works more than one shift
  • Each nurse is assigned to at least two shifts during the 3-day period

For more information about this scheduling use case, see Employee Scheduling.

This problem can be formulated as follows:


from __future__ import print_function
from ortools.sat.python import cp_model

class NursesPartialSolutionPrinter(cp_model.CpSolverSolutionCallback):
    """Print intermediate solutions."""

    def __init__(self, shifts, num_nurses, num_days, num_shifts, sols):
        self._shifts = shifts
        self._num_nurses = num_nurses
        self._num_days = num_days
        self._num_shifts = num_shifts
        self._solutions = set(sols)
        self._solution_count = 0

    def on_solution_callback(self):
        if self._solution_count in self._solutions:
            print('Solution %i' % self._solution_count)
            for d in range(self._num_days):
                print('Day %i' % d)
                for n in range(self._num_nurses):
                    is_working = False
                    for s in range(self._num_shifts):
                        if self.Value(self._shifts[(n, d, s)]):
                            is_working = True
                            print('  Nurse %i works shift %i' % (n, s))
                    if not is_working:
                        print('  Nurse {} does not work'.format(n))
        self._solution_count += 1

    def solution_count(self):
        return self._solution_count

def main():
    # Data.
    num_nurses = 4
    num_shifts = 3
    num_days = 3
    all_nurses = range(num_nurses)
    all_shifts = range(num_shifts)
    all_days = range(num_days)
    # Creates the model.
    model = cp_model.CpModel()

    # Creates shift variables.
    # shifts[(n, d, s)]: nurse 'n' works shift 's' on day 'd'.
    shifts = {}
    for n in all_nurses:
        for d in all_days:
            for s in all_shifts:
                shifts[(n, d,
                        s)] = model.NewBoolVar('shift_n%id%is%i' % (n, d, s))

    # Each shift is assigned to exactly one nurse in the schedule period.
    for d in all_days:
        for s in all_shifts:
            model.Add(sum(shifts[(n, d, s)] for n in all_nurses) == 1)

    # Each nurse works at most one shift per day.
    for n in all_nurses:
        for d in all_days:
            model.Add(sum(shifts[(n, d, s)] for s in all_shifts) <= 1)

    # min_shifts_per_nurse is the largest integer such that every nurse
    # can be assigned at least that many shifts. If the number of nurses doesn't
    # divide the total number of shifts over the schedule period,
    # some nurses have to work one more shift, for a total of
    # min_shifts_per_nurse + 1.
    min_shifts_per_nurse = (num_shifts * num_days) // num_nurses
    max_shifts_per_nurse = min_shifts_per_nurse + 1
    for n in all_nurses:
        num_shifts_worked = sum(
            shifts[(n, d, s)] for d in all_days for s in all_shifts)
        model.Add(min_shifts_per_nurse <= num_shifts_worked)
        model.Add(num_shifts_worked <= max_shifts_per_nurse)

    # Creates the solver and solve.
    solver = cp_model.CpSolver()
    solver.parameters.linearization_level = 0
    # Display the first five solutions.
    a_few_solutions = range(5)
    solution_printer = NursesPartialSolutionPrinter(shifts, num_nurses,
                                                    num_days, num_shifts,
                                                    a_few_solutions)
    solver.SearchForAllSolutions(model, solution_printer)

    # Statistics.
    print('  - conflicts       : %i' % solver.NumConflicts())
    print('  - branches        : %i' % solver.NumBranches())
    print('  - wall time       : %f s' % solver.WallTime())
    print('  - solutions found : %i' % solution_printer.solution_count())

if __name__ == '__main__':
    main()

We solve this problem using the OR-Tools interface and its CP-SAT solver. Logs from this optimization job provide the following solutions:

Solution 0
Day 0
  Nurse 0 does not work
  Nurse 1 works shift 0
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 works shift 2
  Nurse 1 does not work
  Nurse 2 works shift 1
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

Solution 1
Day 0
  Nurse 0 works shift 0
  Nurse 1 does not work
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 does not work
  Nurse 1 works shift 2
  Nurse 2 works shift 1
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

Solution 2
Day 0
  Nurse 0 works shift 0
  Nurse 1 does not work
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 works shift 1
  Nurse 1 works shift 2
  Nurse 2 does not work
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

Solution 3
Day 0
  Nurse 0 works shift 0
  Nurse 1 does not work
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 does not work
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

Solution 4
Day 0
  Nurse 0 does not work
  Nurse 1 works shift 0
  Nurse 2 works shift 1
  Nurse 3 works shift 2
Day 1
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 does not work
  Nurse 3 works shift 0
Day 2
  Nurse 0 works shift 2
  Nurse 1 works shift 1
  Nurse 2 works shift 0
  Nurse 3 does not work

  - conflicts       : 37
  - branches        : 41231
  - wall time       : 0.367511 s
  - solutions found : 5184
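
Because this instance is tiny, the number of solutions reported by the solver can be cross-checked by brute force. The following sketch uses only the Python standard library, not OR-Tools, and enumerates every possible three-day schedule under the same rules; the variable and function names are introduced here for illustration.

```python
from itertools import permutations, product

num_nurses, num_shifts, num_days = 4, 3, 3
min_shifts = (num_shifts * num_days) // num_nurses   # 9 // 4 = 2
max_shifts = min_shifts + 1                          # 3

# One day's roster is a tuple whose entry s is the nurse who works shift s.
# Taking permutations of distinct nurses enforces "one nurse per shift" and
# "at most one shift per nurse per day" by construction.
daily_rosters = list(permutations(range(num_nurses), num_shifts))


def is_valid(schedule):
    """Check that every nurse works between min_shifts and max_shifts over the period."""
    worked = [0] * num_nurses
    for roster in schedule:
        for nurse in roster:
            worked[nurse] += 1
    return all(min_shifts <= w <= max_shifts for w in worked)


valid_count = sum(1 for schedule in product(daily_rosters, repeat=num_days)
                  if is_valid(schedule))
print(len(daily_rosters), valid_count)
```

The count matches the 5184 solutions in the log: the only feasible workload split is 3, 2, 2, 2 shifts, which gives 24 ways to choose who rests on which day and 6 shift orderings on each of the 3 days.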

Use case 3: Finding a trajectory for landing the Apollo 11 Lunar Module with the least amount of fuel

This example uses Pyomo and a simple model of a rocket to compute a control policy for a soft landing. The parameters used correspond to the descent of the Apollo 11 Lunar Module to the moon on July 20, 1969. For a rocket with a mass 𝑚 in vertical flight at altitude ℎ, a momentum balance yields the following model:

$$m \frac{d^2 h}{dt^2} = -m g + v_e u$$


In this model, 𝑢 is the mass flow of propellant and 𝑣𝑒 is the velocity of the exhaust relative to the rocket. In this first attempt at modeling and control, we neglect the change in rocket mass due to fuel burn.

Fuel consumption can be calculated as the following:

$$m_{\text{fuel}} = \int_0^T u(t)\, dt$$

Here, 𝑇 is the time at which the module lands.

We want to find a trajectory that minimizes fuel consumption:

$$\min_{u(t),\, T} \int_0^T u(t)\, dt$$

The minimization is over the propellant flow 𝑢(𝑡) and the free landing time 𝑇.

This problem can be formulated as follows:


import numpy as np

from pyomo.environ import *
from pyomo.dae import *

#Define constants ...
# lunar module
m_ascent_dry = 2445.0          # kg mass of ascent stage without fuel
m_ascent_fuel = 2376.0         # kg mass of ascent stage fuel
m_descent_dry = 2034.0         # kg mass of descent stage without fuel
m_descent_fuel = 8248.0        # kg mass of descent stage fuel

m_fuel = m_descent_fuel
m_dry = m_ascent_dry + m_ascent_fuel + m_descent_dry
m_total = m_dry + m_fuel

# descent engine characteristics
v_exhaust = 3050.0             # m/s
u_max = 45050.0/v_exhaust      # 45050 newtons / exhaust velocity

# landing mission specifications
h_initial = 100000.0           # meters
v_initial = 1520               # orbital velocity m/s
g = 1.62                       # m/s**2

m = ConcreteModel()
m.t = ContinuousSet(bounds=(0, 1))
m.h = Var(m.t)
m.u = Var(m.t, bounds=(0, u_max))
m.T = Var(domain=NonNegativeReals)

m.v = DerivativeVar(m.h, wrt=m.t)
m.a = DerivativeVar(m.v, wrt=m.t)

m.fuel = Integral(m.t, wrt=m.t, rule = lambda m, t: m.u[t]*m.T)
m.obj = Objective(expr=m.fuel, sense=minimize)

m.ode1 = Constraint(m.t, rule = lambda m, t: m_total*m.a[t]/m.T**2 == -m_total*g + v_exhaust*m.u[t])


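m.h[0].fix(h_initial)     # start at the initial altitude
m.v[0].fix(-v_initial)    # start with the initial descent velocity (downward, so negative)

m.h[1].fix(0)    # land on surface
m.v[1].fix(0)    # soft landing

def solve(m):
    TransformationFactory('dae.finite_difference').apply_to(m, nfe=50, scheme='FORWARD')
    SolverFactory('ipopt').solve(m, tee=True)

solve(m)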
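
The model above is written on a rescaled time axis: m.t runs from 0 to 1, and m.T is the unknown landing time. The following short derivation is a sketch that assumes the substitution t = τT (the symbol τ is introduced here for illustration); it shows where the factor 1/T² in m.ode1 and the factor T in m.fuel come from.

```latex
t = \tau T, \quad \tau \in [0, 1]
\quad\Longrightarrow\quad
\frac{dh}{dt} = \frac{1}{T}\frac{dh}{d\tau}, \qquad
\frac{d^2 h}{dt^2} = \frac{1}{T^2}\frac{d^2 h}{d\tau^2}

m \frac{d^2 h}{dt^2} = -m g + v_e u
\quad\Longrightarrow\quad
\frac{m}{T^2}\frac{d^2 h}{d\tau^2} = -m g + v_e u

\int_0^T u(t)\, dt = \int_0^1 u(\tau)\, T\, d\tau
```

Rescaling time this way turns a problem with an unknown end time into one on the fixed interval from 0 to 1, with T as an ordinary decision variable.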

We use the Pyomo interface and the nonlinear optimization solver Ipopt to solve this continuous-time trajectory optimization problem. Logs from this optimization job provide the following solution:

Ipopt 3.12.13: 

This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit

This is Ipopt version 3.12.13, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:      448
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:      154

Error in an AMPL evaluation. Run with "halt_on_ampl_error yes" to see details.
Error evaluating Jacobian of equality constraints at user provided starting point.
  No scaling factors for equality constraints computed!
Total number of variables............................:      201
                     variables with only lower bounds:        1
                variables with lower and upper bounds:       51
                     variables with only upper bounds:        0
Total number of equality constraints.................:      151
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0  9.9999800e-05 5.00e+06 9.90e-01  -1.0 0.00e+00    -  0.00e+00 0.00e+00   0
   1r 9.9999800e-05 5.00e+06 9.99e+02   6.7 0.00e+00    -  0.00e+00 4.29e-14R  4
   2r 2.1397987e+02 5.00e+06 4.78e+08   6.7 2.14e+05    -  1.00e+00 6.83e-05f  1
   3r 2.1342176e+02 5.00e+06 1.36e+08   3.2 4.37e+04    -  7.16e-01 6.16e-01f  1
   4r 1.7048263e+02 4.99e+06 4.67e+07   3.2 1.60e+04    -  9.85e-01 4.16e-01f  1
   5r 1.5143799e+02 4.99e+06 2.50e+07   3.2 3.57e+03    -  5.88e-01 7.62e-01f  1
   6r 1.3041897e+02 4.99e+06 2.08e+07   3.2 1.89e+03    -  2.75e-01 8.14e-01f  1
   7r 1.1452223e+02 4.99e+06 3.17e+04   3.2 1.97e+03    -  9.78e-01 8.18e-01f  1
   8r 1.1168709e+02 4.99e+06 2.72e+05   3.2 3.36e-01   4.0 9.78e-01 1.00e+00f  1
   9r 1.0774716e+02 4.99e+06 1.66e+05   3.2 4.28e+03    -  9.36e-01 9.70e-02f  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  10r 8.7784873e+01 5.00e+06 5.08e+04   3.2 3.69e+03    -  8.74e-01 7.24e-01f  1
  11r 7.9008215e+01 5.00e+06 1.88e+04   2.5 1.09e+03    -  1.22e-01 8.35e-01h  1
  12r 1.1960245e+02 5.00e+06 4.34e+03   2.5 1.81e+03    -  6.76e-01 1.00e+00f  1
  13r 1.2344166e+02 5.00e+06 1.35e+03   1.8 1.66e+02    -  8.23e-01 1.00e+00f  1
  14r 2.0065756e+02 4.99e+06 6.85e+02   1.1 4.28e+03    -  4.26e-01 1.00e+00f  1
  15r 3.0115879e+02 4.99e+06 4.78e+01   1.1 9.69e+03    -  7.64e-01 1.00e+00f  1
  16r 3.0355974e+02 4.99e+06 5.30e+00   1.1 4.92e+00    -  1.00e+00 1.00e+00f  1
  17r 3.0555655e+02 4.99e+06 6.83e+02   0.4 7.49e+00    -  1.00e+00 1.00e+00f  1
  18r 4.4494526e+02 4.97e+06 2.28e+01   0.4 2.17e+04    -  8.05e-01 1.00e+00f  1
  19r 3.9588385e+02 4.97e+06 3.77e+00   0.4 4.73e+00    -  1.00e+00 1.00e+00f  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  20r 4.0158949e+02 4.97e+06 7.79e-02   0.4 5.70e-01    -  1.00e+00 1.00e+00h  1
  21r 4.0076180e+02 4.97e+06 9.88e+02  -1.0 1.80e+00    -  1.00e+00 1.00e+00f  1
  22r 5.4964501e+02 4.95e+06 7.59e+02  -1.0 1.57e+05    -  2.48e-01 2.32e-01f  1
  23r 5.5056601e+02 4.95e+06 7.57e+02  -1.0 1.21e+05    -  1.00e+00 3.02e-03f  1
  24r 5.5057553e+02 4.95e+06 7.57e+02  -1.0 1.09e+05    -  8.13e-01 3.34e-05f  1
  25r 5.5898777e+02 4.95e+06 7.00e+02  -1.0 3.82e+04    -  1.00e+00 7.48e-02f  1
  26r 6.0274077e+02 4.96e+06 3.93e+02  -1.0 3.53e+04    -  1.00e+00 4.39e-01f  1
  27r 6.0301192e+02 4.96e+06 3.90e+02  -1.0 1.98e+04    -  1.00e+00 7.83e-03f  1
  28r 6.0301418e+02 4.96e+06 3.89e+02  -1.0 1.61e+04    -  1.00e+00 9.62e-05f  1
  29r 5.9834909e+02 4.96e+06 3.71e+02  -1.0 3.63e+03    -  1.00e+00 1.85e-01f  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  30r 5.7601446e+02 4.95e+06 1.67e+00  -1.0 2.96e+03    -  1.00e+00 1.00e+00f  1
  31r 5.6977301e+02 4.95e+06 6.41e-02  -1.0 1.22e+00    -  1.00e+00 1.00e+00h  1
  32r 5.7024128e+02 4.95e+06 9.05e-05  -1.0 4.89e-02    -  1.00e+00 1.00e+00h  1
  33r 5.6989454e+02 4.95e+06 6.84e+02  -2.5 9.30e-02    -  1.00e+00 1.00e+00f  1
  34r 5.7613459e+02 4.94e+06 5.38e+02  -2.5 5.65e+04    -  4.67e-01 2.13e-01f  1
  35r 5.7617358e+02 4.94e+06 5.37e+02  -2.5 4.45e+04    -  1.00e+00 9.52e-04f  1
  36r 6.6264177e+02 4.90e+06 3.78e+01  -2.5 4.45e+04    -  6.62e-01 9.30e-01f  1
  37r 7.5101828e+02 4.90e+06 7.59e+01  -2.5 3.12e+03    -  1.25e-02 1.00e+00f  1
  38r 7.5705424e+02 4.90e+06 8.60e-02  -2.5 7.04e-01    -  1.00e+00 1.00e+00h  1
  39r 7.5713736e+02 4.90e+06 2.85e-05  -2.5 9.02e-03    -  1.00e+00 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  40r 7.5713093e+02 4.90e+06 4.90e+02  -5.7 6.76e-03    -  1.00e+00 9.99e-01f  1
  41r 1.0909809e+03 4.78e+06 4.67e+02  -5.7 2.54e+06    -  6.15e-02 4.62e-02f  1
  42r 1.0909867e+03 4.78e+06 4.67e+02  -5.7 2.42e+06    -  1.00e+00 9.55e-07f  1
  43r 1.5672936e+03 4.59e+06 8.15e+03  -5.7 2.42e+06    -  3.36e-03 7.69e-02f  1
  44r 1.7598365e+03 4.50e+06 8.17e+03  -5.7 2.24e+06    -  4.43e-08 4.23e-02f  1
  45r 5.7264420e+03 2.36e+06 4.60e+03  -5.7 2.14e+06    -  7.07e-02 1.00e+00f  1
  46  4.3546591e+03 2.35e+06 1.50e+01  -1.0 2.51e+08    -  3.52e-03 2.97e-03f  1
  47  3.7700543e+03 2.16e+06 1.94e+01  -1.0 2.87e+06    -  3.27e-01 8.10e-02f  1
  48  3.9963720e+03 1.02e+06 7.97e+00  -1.0 3.70e+05    -  3.47e-01 5.26e-01h  1
  49  4.0601733e+03 5.28e+05 5.09e+00  -1.0 1.57e+06    -  5.24e-03 4.85e-01h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  50  4.0596593e+03 5.27e+05 3.53e+00  -1.0 4.32e+06    -  7.60e-01 1.81e-03h  1
  51  4.1577305e+03 9.40e+04 7.32e-01  -1.0 4.01e+05    -  9.09e-01 8.22e-01h  1
  52  4.1754490e+03 1.27e+01 4.74e-02  -1.0 5.08e+04    -  8.32e-01 1.00e+00h  1
  53  4.1752565e+03 7.78e-02 8.68e-07  -1.0 1.49e+04    -  1.00e+00 1.00e+00h  1
  54  4.1704409e+03 1.60e+00 3.18e-05  -2.5 1.16e+04    -  1.00e+00 1.00e+00f  1
  55  4.1704236e+03 6.98e-04 2.83e-08  -2.5 1.41e+03    -  1.00e+00 1.00e+00h  1
  56  4.1702897e+03 1.15e-03 2.31e-08  -3.8 2.98e+02    -  1.00e+00 1.00e+00f  1
  57  4.1702823e+03 3.63e-06 5.75e-11  -5.7 1.67e+01    -  1.00e+00 1.00e+00h  1
  58  4.1702822e+03 1.28e-09 1.62e-14  -8.6 2.04e-01    -  1.00e+00 1.00e+00h  1

Number of Iterations....: 58

                                   (scaled)                 (unscaled)
Objective...............:   4.1702822027548118e+03    4.1702822027548118e+03
Dual infeasibility......:   1.6235231869939369e-14    1.6235231869939369e-14
Constraint violation....:   1.2805685400962830e-09    1.2805685400962830e-09
Complementarity.........:   2.5079038009909822e-09    2.5079038009909822e-09
Overall NLP error.......:   2.5079038009909822e-09    2.5079038009909822e-09

Number of objective function evaluations             = 63
Number of objective gradient evaluations             = 16
Number of equality constraint evaluations            = 63
Number of inequality constraint evaluations          = 0
Number of equality constraint Jacobian evaluations   = 60
Number of inequality constraint Jacobian evaluations = 0
Number of Lagrangian Hessian evaluations             = 58
Total CPU secs in IPOPT (w/o function evaluations)   =      0.682
Total CPU secs in NLP function evaluations           =      0.002

EXIT: Optimal Solution Found.
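
To put the reported objective in context, the following sketch compares it with the constants of the model. The constants are copied from the script and the objective value from the Ipopt log; the derived quantities and their names are introduced here for illustration, and they inherit the model's simplification that the rocket mass stays constant.

```python
# Constants copied from the model
m_ascent_dry, m_ascent_fuel = 2445.0, 2376.0       # kg
m_descent_dry, m_descent_fuel = 2034.0, 8248.0     # kg
v_exhaust = 3050.0                                 # m/s
max_thrust = 45050.0                               # newtons
g = 1.62                                           # m/s**2

m_total = m_ascent_dry + m_ascent_fuel + m_descent_dry + m_descent_fuel
u_max = max_thrust / v_exhaust                     # kg/s of propellant at full thrust

# Objective value reported by Ipopt in the log (kg of fuel)
fuel_used = 4170.2822027548118

# Full thrust has to exceed lunar weight, otherwise a soft landing is impossible
max_net_deceleration = max_thrust / m_total - g    # m/s**2
fuel_fraction = fuel_used / m_descent_fuel         # share of descent fuel consumed
full_thrust_seconds = fuel_used / u_max            # burn time if run at full thrust throughout

print(m_total, round(u_max, 2), round(max_net_deceleration, 3),
      round(fuel_fraction, 3), round(full_thrust_seconds, 1))
```

Under these assumptions, the optimal trajectory uses roughly half of the descent stage fuel.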


We used various examples, front ends, and solvers to solve numerical optimization problems using SageMaker Processing. Next, try using scipy.optimize, DEAP, or Inspyred to explore other examples. See the references in the next section for documentation and other examples to help solve your own business problems using SageMaker Processing. If you currently use SageMaker APIs for your machine learning projects, using SageMaker Processing to run optimization is a simple extension. However, other compute options on AWS, such as AWS Lambda or AWS Fargate, may be a better fit for running some of these open-source libraries in a serverless fashion, especially when your team already has expertise with them. Lastly, open-source libraries are provided as is, with minimal support, whereas commercial libraries such as CPLEX and Gurobi are constantly being tuned for higher performance and usually provide premium support.


About the Author

Shreyas Subramanian is an AI/ML Specialist Solutions Architect who helps customers use machine learning to solve their business challenges on the AWS platform.

Building an omnichannel Q&A chatbot with Amazon Connect, Amazon Lex, Amazon Kendra, and the open-source QnABot project

For many students, embarking on a higher education journey is an exciting time filled with new experiences. However, like anything new, it can also bring plenty of questions to answer and obstacles to overcome. Oklahoma State University, Oklahoma City (OSU-OKC) recognized this, and was intent on providing a better solution to address student questions using machine learning (ML) technology from AWS.

They knew that if they could develop a solution that accurately anticipated their students’ needs and delivered timely and relevant information, they could boost their chances of attracting future students. After all, universities need students the same way businesses need customers.

“The first thing we wanted to address was the lack of visibility we had into customer sentiment at any given time,” says Michael Widell, Interim President at OSU-OKC. “Building on that, we also had a real focus on consistency and accuracy of information—it mattered to us that current and future students could rely on the information they were getting across school and faculty communication channels.”

The team identified conversational chatbots as a way to address the information gap that students face. ML-powered chatbots are dynamic, and help connect with students through the communication channels they prefer, whether that’s a website, a phone call, a chat window, or an Alexa-enabled device.

With this in mind, OSU-OKC began working with AWS Professional Services in January 2020, and became the first university to deploy a call center using Amazon Connect and the QnABot.

Amazon Connect is a cloud contact center that provides a seamless experience across voice and chat for customers and agents. The QnABot is an open-source project that uses Amazon Lex to provide a conversational interface for your questions and answers, and can be applied to a host of communication channels, including websites, contact centers, chatbots, collaboration tools like Slack, and Amazon Alexa-enabled devices.

Deploying QnABot in the call center

Although OSU-OKC’s use of the QnABot evolved throughout 2020, its initial area of focus centered on boosting call center efficiency. They achieved this by automating answers to student FAQs, thereby delivering accurate and up-to-date information, reducing call hold times, and enabling human call center agents to focus on handling higher-value interactions.

The following diagram illustrates the solution architecture.

For OSU-OKC, QnABot simplified bot deployment and administration, allowing even non-technical users to maximize the impact of the solution by allowing them to:

Extending the QnABot to the website

After implementing the QnABot to assist agents inside their call center, OSU-OKC decided to extend the bot’s reach to the university’s website. They used the AWS open-source Amazon Lex Web UI project, a sample Amazon Lex Web UI that helps provide a full-featured web client for Amazon Lex chatbots.

After content was gathered from the campus, creating question and answer responses for the bot was an easy process. The content designer provided customization options that allowed for organization and readability. The built-in test features aided the tuning and development process by attributing a matching score to a response.

Shortly after expanding the QnABot to their website, OSU-OKC realized that providing more channels for students to interact with didn’t dilute engagement levels. In fact, they increased overall engagement from their student body and doubled the average number of conversations with students.

Adding the QnABot to the university website wasn’t a replacement for human interaction; it was an aid to increase quality interactions by reducing repetitive phone traffic. Try asking OSU-OKC bot, OKC Pete, some questions of your own via the university website.

OKC Pete on the university website

Equipping the QnABot with more responses

While QnABot answered high volumes of questions for students and delivered consistent service at scale, the OSU-OKC team learned a great deal about student sentiment by observing which questions the QnABot couldn’t answer.

For example, some questions highlighted how much prospective students knew about the campus and its resources. Incoming students asked about dorms when in fact the campus doesn’t have any student housing.

The team could use the QnABot’s Content Designer UI to continuously enhance the bot, and equip it with appropriate responses about student housing or any other campus resources. This helped students avoid a phone call, which freed call center agents to focus on more critical or higher-quality interactions.

This flexibility proved particularly helpful during the onset of the COVID-19 pandemic in the spring of 2020. OSU-OKC was able to rapidly expand the newly deployed QnABot’s knowledge base to include answers to many pandemic-related questions. Students and parents could quickly get answers to the questions that mattered to them via the QnABot-assisted university call center or via the website chatbot.

Scaling QnABot’s knowledge with Amazon Kendra

QnABot’s Content Designer UI allowed OSU-OKC to add new questions and answers to the bot when they identified a gap. However, the team also wanted to ensure that customers could still get answers when a question had not yet been added.

To achieve this, they used Amazon Kendra, a highly accurate intelligent search service. In the summer of 2020, the team at OSU-OKC integrated the QnABot with Amazon Kendra to enhance the accuracy and relevance of responses in the following ways:

  • Use the document index in Amazon Kendra as an additional source of answers when a question and answer isn’t found in QnABot’s knowledge base. This allows QnABot to find answers to questions that may not have been added to its knowledge base, including unstructured data contained in word documents or PDFs that have been indexed by Amazon Kendra.
  • Without extensive QnABot tuning, the natural language processing and reading comprehension capabilities of Amazon Kendra more accurately understand user queries, and its ML models expertly handle variations in how users phrase their questions to increase search accuracy and return relevant responses to user queries.

By using ML to automate the handling of common customer questions via their call center and website, OSU-OKC ensured consistent service levels even during their busiest time of year. Widell says, “During peak we can receive over 2,000 calls, which is too many for one agent to handle—however, since launching the QnABot, it’s supported over 34,000 conversations and saved 833 hours in staff time, while ensuring every customer received the same level of service and accuracy.”

Creating your own QnABot and integrating with Amazon Kendra

To get started on your own QnABot journey, see Create a Question and Answer Bot with Amazon Lex and Amazon Alexa. The section Turbocharging QnABot with Amazon Kendra outlines how to integrate QnABot with Amazon Kendra. If you want to follow OSU-OKC’s lead and add the QnABot to your website, you can take advantage of our companion Chatbot UI project.

As you think about configuration and deployment, consider the following options:

  • Deploy the QnABot and Chatbot UI yourself (self-serve), using the project as is
  • Make your own customizations and enhancements to the open-source code
  • Follow OSU-OKC’s example and contact AWS Professional Services for expert help to customize and enhance QnABot, and to integrate with your own communication channels

For more information, watch the team at OSU-OKC present their QnABot solution at Re:Invent 2020.


The team at OSU-OKC is excited to build on the early success they have seen from deploying the QnABot, Amazon Kendra, and Amazon Lex. “For customers and students, this has been the most impactful technology that we’ve implemented,” Widell says.

Our overarching vision for ML technology will evolve student interactions from being transactional exchanges to becoming more meaningful experiences, allowing us to easily connect to customers, understand their needs, and serve them better. Widell adds, “In the future, we hope to expand our use of QnABot to provide personalized information to students as it relates to their academic schedules, advisement, and other relevant information related to their course of study.”

About the Authors

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.





Michael Widell is the Interim President at OSU-OKC. As an innovative agent of change he has worked to strengthen organizations through redesign and resource optimization, allowing individuals to excel, and deliver transformative products and services. In his career, Widell has also held leadership positions in the private sector for AT&T and key roles within the General Office of Walmart Inc. where he began his post collegiate career.

Data processing options for AI/ML

Training an accurate machine learning (ML) model requires many different steps, but none are potentially more important than data processing. Examples of processing steps include converting data to the input format expected by the ML algorithm, rescaling and normalizing, cleaning and tokenizing text, and many more. However, data processing at scale involves considerable operational overhead: managing complex infrastructure like processing clusters, writing code to tie all the moving pieces together, and implementing security and governance. Fortunately, AWS provides a wide variety of data processing options to suit every ML workload and teams’ preferred workflows. This set of options expanded even more at AWS re:Invent 2020, so now is the perfect time to examine how to choose between them.

In this post, we review the primary options and provide guidance on how to select one to match your use case and how your team prefers to work with Python, Spark, SQL, and other tools. Although the discussion centers around Amazon SageMaker for ML model building, training, and hosting, it’s equally applicable to workflows where other AWS services are used for these tasks, such as Amazon Personalize or Amazon Comprehend. The main assumption we make is that the decision is being made by those in data science, ML engineering, or MLOps roles. Other factors that are important in making the decision are team experience level, and inclination for writing code and managing infrastructure. Lower experience and inclination typically map to choosing a more fully managed option instead of a less managed or “roll your own” approach.

Prerequisite: Data Lake or Lake House

Before we dive deep into the options, there is a question we must answer: how do we reconcile our chosen option with the preferred technology choices of data engineering teams? Different tools may be suited to different roles; the tools a data scientist may prefer for an ML workflow may have little overlap with the tools used by a data engineer to support analytics workloads such as reporting. The good news is that AWS makes it very easy for these roles to pick their own tools and apply them to their organization’s data without conflict. The key is to create a data lake in Amazon Simple Storage Service (Amazon S3) at the center of the organization’s architecture for all data. This separates data and compute and avoids the problem of each team having individual data silos.

With a data lake in the center of the architecture, data engineering teams can apply their own tools for analytics workloads. At the same time, data science teams can also use their own separate tools to access the same data for ML workloads. Multiple separate processing clusters run by various teams can access the same data, always keeping in mind the need to retain the raw data in Amazon S3 for all teams as a source of truth. Additionally, use of a feature store for transformed data, such as Amazon SageMaker Feature Store, by data science teams helps delineate the boundary with data engineering, as well as provide benefits such as feature discovery, sharing, updating, and reuse.

As an alternative to a “classic” data lake, the teams might build on top of a Lake House Architecture, an evolution of the concept of a data lake. Featuring support for ACID transactions, this architecture enables multiple users to concurrently insert, delete, and modify rows across tables, while still allowing other users to simultaneously run analytical queries and ML models on the same datasets. AWS Lake Formation recently added new features to support the Lake House Architecture (currently in preview).

Now that we’ve solved the conundrum of enabling data engineering and data science teams to use their separate, preferred tools without conflict, let’s examine the data processing options for ML workloads on AWS.

Options overview

In this post, we review the following processing options. They’re in categories ranked by the following formula: (user friendliness for data scientists and ML engineers) x (usefulness for ML-specific tasks).

  1. SageMaker managed features only
  2. Low (or no) code solutions with other AWS services
  3. Spark in Amazon EMR
  4. Self-managed stack with Python or R

Keep in mind that these are not mutually exclusive; you can use them in various combinations to suit your team’s preferred workflow. For example, some teams may prefer to use SQL as much as possible, while others may use Spark for some tasks in addition to Python frameworks like Pandas. Another point to consider is that some services have built-in data visualization capabilities, while others do not and require use of other services for visualization. Let’s discuss the specifics of each option.

SageMaker managed features

SageMaker is a fully managed service that helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML. These capabilities include robust data processing features. For data processing and data preparation, you can use either Amazon SageMaker Data Wrangler or Amazon SageMaker Processing for the processing itself, and either Amazon SageMaker Studio or SageMaker notebook instances for data visualization. You can process datasets with sizes ranging from small to very large (petabytes) with SageMaker.

SageMaker Data Wrangler is a feature of SageMaker, enabled through SageMaker Studio, that makes it easy for data scientists and ML engineers to aggregate and prepare data for ML applications using a visual interface to accelerate data cleansing, exploration, and visualization. It allows you to easily connect to various data sources such as Amazon S3 and apply built-in transformations or custom transformations written in PySpark, Pandas, or SQL.

SageMaker Processing comes built in with SageMaker, and provides you with full control of your cluster resources such as instance count, type, and storage. It includes prebuilt containers for Spark and Scikit-learn, and offers an easy path for integrating your custom containers. For a “lift and shift” of an existing workload to SageMaker, SageMaker Processing may be a good fit.
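As a rough illustration of the kind of script a SageMaker Processing job can run, the following sketch rescales the numeric columns of a CSV file using only the Python standard library. The container paths /opt/ml/processing/input and /opt/ml/processing/output, the file name data.csv, and the function names are assumptions made for this example; in a real job, the paths are whatever you configure for the job's inputs and outputs.

```python
import csv
from pathlib import Path

# Assumed container paths; a real job uses whatever locations you configure
# for its inputs and outputs.
INPUT_DIR = Path("/opt/ml/processing/input")
OUTPUT_DIR = Path("/opt/ml/processing/output")


def min_max_scale(values):
    """Rescale a list of floats to [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def process(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR, filename="data.csv"):
    """Read a numeric CSV with a header row, scale each column, and write it out."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    with open(input_dir / filename, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [[float(x) for x in row] for row in reader]

    # Scale column by column, then transpose back into rows.
    columns = [min_max_scale(list(col)) for col in zip(*rows)]
    scaled_rows = [list(row) for row in zip(*columns)]

    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(scaled_rows)
    return scaled_rows
```

In a processing job, the script would call process() as its last step. Because the directories are parameters, the same function can be exercised locally against temporary folders before you run it at scale.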

The following table compares SageMaker Processing and SageMaker Data Wrangler across some key dimensions.

The SageMaker option is ranked first due to its ease of use for data scientists and ML engineers, and its usefulness for ML-specific tasks; it was built from the ground up specifically to support ML workloads. However, several other options may be useful even though they weren’t developed solely for, or dedicated specifically to, ML workloads. Let’s review those options next.

Low (or no) code

This option involves several services that are serverless: infrastructure details and management are hidden under the hood. Additionally, there might be no need to write custom code, or in some cases, any code at all. This may lead to a relatively fast path to results, while potentially causing greater workflow friction by requiring you to switch between multiple services, UIs, and tools, and sacrificing some flexibility and ability to customize. For our purposes, we consider a solution that requires SQL queries to be a low code solution, and one that doesn’t require any code, even SQL, to be a no code solution.

For example, one low code possibility involves Amazon Athena, a serverless interactive query service, for transforming data using standard SQL queries, in combination with Amazon QuickSight, a serverless BI tool that offers no code, built-in visualizations. When evaluating this powerful combination, consider whether your data transformations can be accomplished with SQL. On the visualization side, an alternative to QuickSight is to use a library such as PyAthena to run the queries in SageMaker notebooks with Python code and visualize the results there.
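To make the PyAthena alternative more concrete, here is a minimal sketch of running a SQL transformation from a notebook. The table and column names, the S3 staging location, and both function names are hypothetical. The query only lowercases and trims one text column and casts one numeric column, so treat it as a placeholder for your own transformations.

```python
def build_transform_query(table, text_column, numeric_column):
    """Build a simple cleaning query.

    The identifiers are interpolated directly, so they must come from
    trusted code, never from user input.
    """
    return (
        f"SELECT LOWER(TRIM({text_column})) AS {text_column}, "
        f"CAST({numeric_column} AS DOUBLE) AS {numeric_column} "
        f"FROM {table} "
        f"WHERE {numeric_column} IS NOT NULL"
    )


def run_query(sql, s3_staging_dir, region_name):
    """Run the query through PyAthena and return all rows.

    Requires the third-party pyathena package and AWS credentials, so the
    import is deferred until the function is actually called.
    """
    from pyathena import connect

    cursor = connect(s3_staging_dir=s3_staging_dir, region_name=region_name).cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```

In a notebook, you might call run_query(build_transform_query("sales", "region", "amount"), "s3://your-bucket/athena-results/", "us-east-1") and plot the returned rows with your preferred Python library.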

Another low code possibility involves AWS Glue, a serverless ETL service that catalogs your data and offers built-in transforms, along with the ability to write custom PySpark code. For visualizations, besides QuickSight, you can attach either SageMaker or Zeppelin notebooks to an AWS Glue development endpoint. Choosing between AWS Glue and Athena comes down to a team’s preference for using SQL versus PySpark code (in the case when AWS Glue built-in transforms don’t fully cover the desired set of data transforms).

A no code possibility is AWS Glue DataBrew, a serverless visual data preparation tool, to transform data, combined with either the SageMaker console to start model training jobs using built-in algorithms such as XGBoost, or the SageMaker Studio UI to start AutoML model training jobs with SageMaker Autopilot. With many built-in transformations and built-in visualizations, DataBrew covers both data processing and data visualization. However, if your dataset requires custom transformations other than the built-in ones, you need to pair DataBrew with another solution that allows you to write custom code. Autopilot automatically performs typical featurization of data (such as one-hot encoding of categorical values) as part of its AutoML pipeline, so you might find the set of transformations in DataBrew sufficient if paired with Autopilot. The following table provides a more detailed comparison.

Spark in Amazon EMR

Many organizations use Spark for data processing and other purposes, such as the basis for a data warehouse. In these situations, Spark clusters are typically run in Amazon EMR, a managed service for Hadoop-ecosystem clusters, which eliminates the need to do your own setup, tuning, and maintenance. From the perspective of a data scientist or ML engineer, Spark in Amazon EMR may be considered in the following circumstances:

  • Spark is already used for a data warehouse or other application with a persistent cluster. Unlike the other options we described, which only provision transient resources, Amazon EMR also enables creation of persistent clusters to support analytics applications.
  • The team already has a complete end-to-end pipeline in Spark and also the skillset and inclination to run a persistent Spark cluster for the long term. Otherwise, the SageMaker and AWS Glue options for Spark generally are preferable.

Another consideration is the wider range of instance types offered by Amazon EMR, including AWS Graviton2 processors and Amazon EC2 Spot Instances for cost optimization.

For visualization with Amazon EMR, there are several choices. To keep your primary ML workflow within SageMaker, use SageMaker Studio and its built-in SparkMagic kernel to connect to Amazon EMR. You can start to query, analyze, and process data with Spark in a few steps. For added security, you can connect to EMR clusters using Kerberos authentication. Amazon EMR also features other integrations with SageMaker, for example you can start a SageMaker model training job from a Spark pipeline in Amazon EMR. Another visualization possibility is to use Amazon EMR Studio (preview), which provides access to fully managed Jupyter notebooks, and includes the ability to log in via AWS Single Sign-On (AWS SSO). However, EMR Studio lacks the many SageMaker-specific UI integrations of SageMaker Studio.

There are other factors to consider when evaluating this option. Spark is based on the Scala/Java stack, with all the problems that entails in regard to dependency management and JVM issues that may be unfamiliar to data scientists. Also keep in mind that Spark’s PySpark API has often lagged behind its primary API in Scala, which is a language less familiar to data scientists. In this regard, if you prefer the alternative Dask framework for your workloads, you can install Dask on your EMR clusters.

Self-managed stack using Python or R

For this option, teams roll their own solutions using Amazon Elastic Compute Cloud (Amazon EC2) compute resources, or the container services Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). Integration with SageMaker is most conveniently achieved using the Amazon SageMaker Python SDK. Any machine with AWS Identity and Access Management (IAM) permissions to SageMaker can use the SageMaker Python SDK to invoke SageMaker functionality for model building, training, tuning, deployment, and more.

This option provides the most flexibility to mix and match any data processing tools and frameworks. It also offers access to the widest range of EC2 instance types and storage options. In addition to the possibility of using Spot Instances similarly to Amazon EMR, you can also use this option with the flexible pricing model of AWS Savings Plans. These plans can be applied not only to Amazon EC2 resources, but also to serverless compute AWS Lambda resources, and serverless compute engine AWS Fargate resources.

However, keep in mind that, in terms of user-friendliness for data scientists and ML engineers, this option requires them to manage low-level infrastructure, a task better suited to other roles. Also, with respect to usefulness for ML-specific tasks, although there are many frameworks and tools that can be layered on top of these services to make management easier and provide specific functionality for ML workloads, this option is still far less managed than the preceding options. It requires more personnel time to manage, tune, and maintain infrastructure and dependencies, and to write code to fill functionality gaps. As a result, this option is also likely to prove the most costly in the long run.

Review and conclusion

Your choice of a data processing option for ML workloads typically depends on your team’s preference for tools (Spark, SQL, or Python) and inclination for writing code and managing infrastructure. The following table summarizes the options across several relevant dimensions. The first column emphasizes that separate services or features may be used for processing and related visualization, and the third column refers to resources used to process data rather than for visualization, which tends to happen on lighter-weight resources.

Workloads evolve over time, and you don’t need to be locked in to one set of tools forever. You can mix and match according to your use case. When you use Amazon S3 at the center of your data lake and the fully managed SageMaker service for core ML workflow steps, it’s easy to switch tools as needed or desired to accommodate the latest technologies. Whichever option you choose now, AWS provides the flexibility to evolve your tool chain to best fit the then-current data processing needs of your ML workloads.

About the Author

Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.

Translating JSON documents using Amazon Translate

JavaScript Object Notation (JSON) is a schema-less, lightweight format for storing and transporting data. It’s a text-based, self-describing representation of structured data that is based on key-value pairs. JSON is supported either natively or through libraries in most major programming languages, and is commonly used to exchange information between web clients and web servers. Over the last 15 years, JSON has become ubiquitous on the web and is the format of choice for almost all web services.

To reach more users, you often want to localize your content and applications that may be in JSON format. This post shows you a serverless approach for easily translating JSON documents using Amazon Translate. Serverless architecture is ideal because it is event-driven and can automatically scale, making it a cost effective solution. In this approach, JSON tags are left as they are, and the content within those tags is translated. This allows you to preserve the context of the text, so that translations can be handled with greater precision. The approach presented here was recently used by a large higher education customer of AWS for translating media documents that are in JSON format.

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Neural machine translation uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms. The translation service is trained on a wide variety of content across different use cases and domains to perform well on many kinds of content. Its asynchronous batch processing capability enables you to translate a large collection of text, HTML, and OOXML documents with a single API call.

In this post, we walk you through creating an automated and serverless pipeline for translating JSON documents using Amazon Translate.

Solution overview

Amazon Translate currently supports the ability to ignore tags and only translate text content in XML documents. In this solution, we therefore first convert JSON documents to XML documents, use Amazon Translate to translate the text content in the XML documents, and then convert the XML documents back to JSON.
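The round trip can be sketched with the Python standard library alone. The solution itself uses the json2xml and xmltodict packages; this example swaps in xml.etree.ElementTree so that it runs without extra dependencies, and it replaces the Amazon Translate call with a stand-in function. All function names are hypothetical, and the sketch assumes that every JSON key is also a valid XML tag name.

```python
import xml.etree.ElementTree as ET


def to_xml(value, tag="all"):
    """Convert a JSON-compatible value into an XML element.

    Dictionary keys become tags and list entries become <item> children.
    """
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_xml(child, key))
    elif isinstance(value, list):
        for child in value:
            elem.append(to_xml(child, "item"))
    else:
        elem.text = str(value)
    return elem


def translate_text_nodes(elem, translate):
    """Apply `translate` to every non-blank text node; tag names stay untouched."""
    for node in elem.iter():
        if node.text and node.text.strip():
            node.text = translate(node.text)
    return elem


def from_xml(elem):
    """Convert an element back into dicts, lists, and strings.

    Scalars come back as strings because XML text carries no type information.
    """
    children = list(elem)
    if not children:
        return elem.text or ""
    if all(child.tag == "item" for child in children):
        return [from_xml(child) for child in children]
    return {child.tag: from_xml(child) for child in children}
```

For example, to_xml({"phoneNumbers": [{"type": "home"}]}) yields an element whose text nodes can be passed through translate_text_nodes and converted back with from_xml. Because only text nodes are rewritten, the keys of the original document survive unchanged.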

The solution uses serverless technologies and managed services to provide maximum scalability and cost-effectiveness. In addition to Amazon Translate, the solution uses the following services:

  • AWS Lambda – Runs code in response to triggers such as changes in data, changes in application state, or user actions. Because services like Amazon S3 and Amazon SNS can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
  • Amazon Simple Notification Service (Amazon SNS) – Enables you to decouple microservices, distributed systems, and serverless applications with a highly available, durable, secure, fully managed publish/subscribe messaging service.
  • Amazon Simple Storage Service (Amazon S3) – Stores your documents and allows for central management with fine-tuned access controls.
  • AWS Step Functions – Coordinates multiple AWS services into serverless workflows.

Solution architecture

The architecture workflow contains the following steps:

  1. Users upload one or more JSON documents to Amazon S3.
  2. The Amazon S3 upload triggers a Lambda function.
  3. The function converts the JSON documents into XML, stores them in Amazon S3, and invokes Amazon Translate in batch mode to translate the text in the XML documents into the target language.
  4. The Step Functions-based job poller polls for the translation job to complete.
  5. Step Functions sends an SNS notification when the translation is complete.
  6. A Lambda function reads the translated XML documents in Amazon S3, converts them to JSON documents, and stores them back in Amazon S3.
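The polling in step 4 is handled by Step Functions in this solution. Purely to illustrate the logic, the following sketch expresses the same idea as a plain Python loop. The function names and default intervals are assumptions for this example, and the status callback is injected so the loop can be exercised without calling AWS.

```python
import time

# Job statuses after which Amazon Translate will not change the job again.
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call `get_status()` until it returns a terminal status or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("translation job did not finish within the polling budget")


def translate_job_status(job_id):
    """Look up the status of a batch translation job (requires AWS credentials)."""
    import boto3  # deferred so the polling loop can be tested offline

    response = boto3.client("translate").describe_text_translation_job(JobId=job_id)
    return response["TextTranslationJobProperties"]["JobStatus"]
```

A call such as wait_for_job(lambda: translate_job_status(job_id)) would block until the job settles. A state machine avoids keeping a process alive while it waits, which is one reason the solution delegates this step to Step Functions.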

The following diagram illustrates this architecture.

Deploying the solution with AWS CloudFormation

The first step is to use an AWS CloudFormation template to provision the resources needed for the solution, including the AWS Identity and Access Management (IAM) roles, IAM policies, and SNS topics.

  1. Launch the AWS CloudFormation template by choosing Launch Stack (this creates the stack in the us-east-1 Region).
  2. For Stack name, enter a unique stack name for this account; for example, translate-json-document.
  3. For SourceLanguageCode, enter the language code for the current language of the JSON documents; for example, en for English.
  4. For TargetLanguageCode, enter the language code that you want your translated documents in; for example, es for Spanish.

For more information about supported languages, see Supported Languages and Language Codes.

  5. For TriggerFileName, enter the name of the file that triggers the translation serverless pipeline; the default is triggerfile.
  6. In the Capabilities and transforms section, select the check boxes to acknowledge that AWS CloudFormation will create IAM resources and transform the AWS Serverless Application Model (AWS SAM) template.

AWS SAM templates simplify the definition of resources needed for serverless applications. When deploying AWS SAM templates in AWS CloudFormation, AWS CloudFormation performs a transform to convert the AWS SAM template into a CloudFormation template. For more information, see Transform.

  7. Choose Create stack.

The stack creation may take up to 20 minutes, after which the status changes to CREATE_COMPLETE. You can see the name of the newly created S3 bucket on the Outputs tab.

Translating JSON documents

To translate your documents, upload one or more JSON documents to the input folder of the S3 bucket you created in the previous step. For this post, we use the following JSON file:

  {
    "firstName": "John",
    "lastName": "Doe",
    "isAlive": true,
    "age": 27,
    "address": {
      "streetAddress": "100 Main Street",
      "city": "Anytown",
      "postalCode": "99999-9999"
    },
    "phoneNumbers": [
      {
        "type": "home",
        "number": "222 555-1234"
      },
      {
        "type": "office",
        "number": "222 555-4567"
      }
    ],
    "children": ["Richard Roe", "Paulo Santos"],
    "spouse": "Jane Doe"
  }

After you upload all the JSON documents, upload the file that triggers the translation workflow. This file can be a zero-byte file, but the filename should match the TriggerFileName parameter in the CloudFormation stack. The default name for the file is triggerfile.
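The trigger-file check can be pictured as a small filter over the S3 event that Lambda receives. The following sketch is an illustration only and is not taken from the solution's code; the function name and the default values are assumptions. It relies on the standard S3 event notification layout, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def find_trigger(event, trigger_file_name="triggerfile", prefix="input/"):
    """Return the bucket name if the event reports the trigger file, else None."""
    for record in event.get("Records", []):
        # Keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key == prefix + trigger_file_name:
            return record["s3"]["bucket"]["name"]
    return None
```

With a filter like this, uploads of ordinary JSON documents are ignored, and the pipeline starts only when the trigger file arrives.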

This upload event triggers the Lambda function <Stack name>-S3FileEventProcessor-<Random string>, which converts the uploaded JSON documents into XML and places them in the xmlin folder of the S3 bucket. The function then invokes the Amazon Translate startTextTranslationJob, with the xmlin folder in the S3 bucket location as the input location and the xmlout folder as the output location for the translated XML files.
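For readers calling the API directly, a batch job request along these lines could be assembled with boto3. This is a sketch under stated assumptions, not the solution's own code: the function names and the job name are hypothetical, and the text/html content type is our assumption for having Amazon Translate skip the XML tags.

```python
def build_job_request(bucket, source_lang, target_lang, role_arn,
                      job_name="translate-json-xml"):
    """Assemble keyword arguments for a batch translation job.

    Input comes from the xmlin folder and output goes to the xmlout folder,
    matching the layout described in this post.
    """
    return {
        "JobName": job_name,
        "InputDataConfig": {
            "S3Uri": f"s3://{bucket}/xmlin/",
            "ContentType": "text/html",  # assumption: tags are left untranslated
        },
        "OutputDataConfig": {"S3Uri": f"s3://{bucket}/xmlout/"},
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_lang,
        "TargetLanguageCodes": [target_lang],
    }


def start_job(request):
    """Submit the job and return its ID (requires AWS credentials)."""
    import boto3  # deferred so the request builder can be tested offline

    response = boto3.client("translate").start_text_translation_job(**request)
    return response["JobId"]
```

Keeping the request builder separate from the API call makes it straightforward to check the S3 locations and language codes before a job is submitted.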

The following code is the processRequest method in the <Stack name>-S3FileEventProcessor-<Random string> Lambda function:

def processRequest(request):
    output = ""
    logger.debug("request: {}".format(request))

    bucketName = request["bucketName"]
    sourceLanguageCode = request["sourceLanguage"]
    targetLanguageCode = request["targetLanguage"]
    access_role = request["access_role"]
    triggerFile = request["trigger_file"]
    try:
        # Filter only the JSON files for processing
        objs = S3Helper().getFilteredFileNames(bucketName,"input/","json")
        for obj in objs:
            try:
                content = S3Helper().readFromS3(bucketName,obj)
                jsonDocument = json.loads(content)
                # Convert the JSON document into XML
                outputXML = json2xml.Json2xml(jsonDocument, attr_type=False).to_xml()
                newObjectKey = "xmlin/{}.xml".format(FileHelper.getFileName(obj))
                # Store the XML in the S3 location for Translation
                output = "Output Object: {}/{}".format(bucketName, newObjectKey)
                # Rename the JSON files to prevent reprocessing
            except ValueError:
                logger.error("Error occurred loading the json file:{}".format(obj))
            except ClientError as e:
                logger.error("An error occurred with S3 Bucket Operation: %s" % e)
        # Start the translation batch job using Amazon Translate
    except ClientError as e:
        logger.error("An error occurred with S3 Bucket Operation: %s" % e)

The Amazon Translate job completion SNS notification from the job poller triggers the Lambda function <Stack name>-TranslateJsonJobSNSEventProcessor-<Random string>. The function converts the XML document created by the Amazon Translate batch job to JSON documents in the output folder of the S3 bucket with the following naming convention: TargetLanguageCode-<inputJsonFileName>.json.

The following code shows the JSON document translated into Spanish.

{
    "firstName": "John",
    "lastName": "Cierva",
    "isAlive": "Es verdad",
    "age": "27",
    "address": {
        "streetAddress": "100 Calle Principal",
        "city": "En cualquier ciudad",
        "postalCode": "99999-9999"
    },
    "phoneNumbers": {
        "item": [
            {
                "type": "hogar",
                "number": "222 555-1234"
            },
            {
                "type": "oficina",
                "number": "222 555-4567"
            }
        ]
    },
    "children": {
        "item": [
            "Richard Roe",
            "Paulo Santos"
        ]
    },
    "spouse": "Jane Doe"
}

The following code is the processRequest method containing the logic in the <Stack name>-TranslateJsonJobSNSEventProcessor-<Random string> Lambda function:

def processRequest(request):
    output = ""
    logger.debug("request: {}".format(request))
    up = urlparse(request["s3uri"], allow_fragments=False)
    accountid = request["accountId"]
    jobid =  request["jobId"]
    bucketName = up.netloc
    objectkey = up.path.lstrip('/')
    # choose the base path for iterating within the translated files for the specific job
    basePrefixPath = objectkey  + accountid + "-TranslateText-" + jobid + "/"
    languageCode = request["langCode"]
    logger.debug("Base Prefix Path:{}".format(basePrefixPath))
    # Filter only the translated XML files for processing
    objs = S3Helper().getFilteredFileNames(bucketName,basePrefixPath,"xml")
    for obj in objs:
        try:
            content = S3Helper().readFromS3(bucketName,obj)
            #Convert the XML file to Dictionary object
            data_dict = xmltodict.parse(content)
            #Generate the Json content from the dictionary
            data_dict =  data_dict["all"]
            flatten_dict = {k: (data_dict[k]["item"] if (isinstance(v,dict) and len(v.keys()) ==1 and "item" in v.keys())  else v) for (k,v) in data_dict.items()}
            json_data = json.dumps(flatten_dict,ensure_ascii=False).encode('utf-8')
            newObjectKey = "output/{}.json".format(FileHelper.getFileName(obj))
            #Write the JSON object to the S3 output folder within the bucket
            output = "Output Object: {}/{}".format(bucketName, newObjectKey)
        except ValueError:
            logger.error("Error occurred loading the json file:{}".format(obj))
        except ClientError as e:
            logger.error("An error occurred with S3 bucket operations: %s" % e)
        except :
            e = sys.exc_info()[0]
            logger.error("Error occurred processing the xmlfile: %s" % e)
    objs = S3Helper().getFilteredFileNames(bucketName,"xmlin/","xml")
    if( request["delete_xmls"] and request["delete_xmls"] == "true") :
        for obj in objs:
            try:
                logger.debug("Deleting temp xml files {}".format(obj))
            except ClientError as e:
                logger.error("An error occurred with S3 bucket operations: %s" % e)
            except :
                e = sys.exc_info()[0]
                logger.error("Error occurred processing the xmlfile: %s" % e)
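
The dictionary comprehension in processRequest unwraps any value that is a dictionary whose only key is item, so that repeated XML elements come back as plain JSON lists. The following is a minimal, self-contained sketch of that single step. The sample dictionary stands in for the parsed contents of the XML root element, and the function name flatten_items is our own, not part of the solution's code.

```python
import json


def flatten_items(data_dict):
    """Unwrap values of the form {"item": ...} into the inner value."""
    return {
        key: (value["item"]
              if isinstance(value, dict) and list(value.keys()) == ["item"]
              else value)
        for key, value in data_dict.items()
    }


# Hypothetical sample shaped like the parsed contents of the XML root element.
parsed = {
    "firstName": "John",
    "address": {"city": "En cualquier ciudad"},
    "children": {"item": ["Richard Roe", "Paulo Santos"]},
}

flattened = flatten_items(parsed)
print(json.dumps(flattened, ensure_ascii=False))
```

Only top-level keys are unwrapped, which mirrors the comprehension in the function above; nested item wrappers would need a recursive variant.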

For any pipeline failures, check the Amazon CloudWatch Logs for the corresponding Lambda function and look for potential errors that caused the failure.

To do a translation for a different source-target language combination, you can update the SOURCE_LANG_CODE and TARGET_LANG_CODE environment variable for the <Stack name>-S3FileEventProcessor-<Random string> Lambda function and trigger the solution pipeline by uploading JSON documents and the TriggerFileName into the input folder of the S3 bucket.

All code used in this post is available in the GitHub repo. If you want to build your own pipeline and don’t need to use the CloudFormation template provided, you can use the file under the directory translate_json in the GitHub repo. That file carries code to convert a JSON file into XML as well as to call the Amazon Translate API. The code for converting translated XML back to JSON format is available in the file


In this post, we demonstrated how to translate JSON documents using Amazon Translate asynchronous batch processing.

You can easily integrate the approach into your own pipelines as well as handle large volumes of JSON text given the scalable serverless architecture. This methodology works for translating JSON documents between over 70 languages that are supported by Amazon Translate (as of this writing). Because this solution uses asynchronous batch processing, you can customize your machine translation output using parallel data. For more information on using parallel data, see Customizing Your Translations with Parallel Data (Active Custom Translation). For a low-latency, low-throughput solution translating smaller JSON documents, you can perform the translation through the real-time Amazon Translate API.

For further reading, we recommend the following:

About the Authors

Siva Rajamani is a Boston-based Enterprise Solutions Architect for AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are serverless, application integration, and security. Outside of work, he enjoys outdoors activities and watching documentaries.



Raju Penmatcha is a Senior AI/ML Specialist Solutions Architect at AWS. He works with education, government, and non-profit customers on machine learning and artificial intelligence related projects, helping them build solutions using AWS. When not helping customers, he likes traveling to new places with his family.



Using container images to run PyTorch models in AWS Lambda

PyTorch is an open-source machine learning (ML) library widely used to develop neural networks and ML models. Those models are usually trained on multiple GPU instances to speed up training, resulting in expensive training time and model sizes up to a few gigabytes. After they’re trained, these models are deployed in production to produce inferences. They can be synchronous, asynchronous, or batch-based workloads. Those endpoints must be highly scalable and resilient in order to process from zero to millions of requests. This is where AWS Lambda can be a compelling compute service for scalable, cost-effective, and reliable synchronous and asynchronous ML inferencing. Lambda offers benefits such as automatic scaling, reduced operational overhead, and pay-per-inference billing.

This post shows you how to use any PyTorch model with Lambda for scalable inferences in production with up to 10 GB of memory. This allows us to use ML models of up to a few gigabytes in Lambda functions. For the PyTorch example, we use the open-source Huggingface Transformers library to build a question-answering endpoint.

Overview of solution

Lambda is a serverless compute service that lets you run code without provisioning or managing servers. Lambda automatically scales your application by running code in response to every event, allowing event-driven architectures and solutions. The code runs in parallel and processes each event individually, scaling with the size of the workload, from a few requests per day to hundreds of thousands of workloads. The following diagram illustrates the architecture of our solution.


You can package your code and dependencies as a container image using tools such as the Docker CLI. The maximum container size is 10 GB. After the model for inference is Dockerized, you can upload the image to Amazon Elastic Container Registry (Amazon ECR). You can then create the Lambda function from the container image stored in Amazon ECR.


For this walkthrough, you should have the following prerequisites:

Implementing the solution

We use a pre-trained language model (DistilBERT) from Huggingface. Huggingface provides a variety of pre-trained language models; the model we’re using is about 250 MB in size and can be used to build a question-answering endpoint.

We use the AWS SAM CLI to create the serverless endpoint with an Amazon API Gateway. The following diagram illustrates our architecture.

To implement the solution, complete the following steps: 

  1. On your local machine, run sam init.
  2. Enter 1 for the template source (AWS Quick Start Templates).
  3. As a package type, enter 2 for image.
  4. For the base image, enter 3 - amazon/python3.8-base.
  5. As a project name, enter lambda-pytorch-example.
  6. Change your workdir to lambda-pytorch-example and copy the following code snippets into the hello_world folder.

The following code is an example of a requirements.txt file to run PyTorch code in Lambda. Huggingface has PyTorch as a dependency, so we don’t need to add it here separately. Add the requirements to the empty requirements.txt in the folder hello_world.

# List all python libraries for the lambda

The following is the code for the file:

import json
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("model/")
model = AutoModelForQuestionAnswering.from_pretrained("model/")

def lambda_handler(event, context):

    body = json.loads(event['body'])

    question = body['question']
    context = body['context']

    inputs = tokenizer.encode_plus(question, context,add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    output = model(**inputs)
    answer_start_scores = output.start_logits
    answer_end_scores = output.end_logits

    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print('Question: {0}, Answer: {1}'.format(question, answer))

    return {
        'statusCode': 200,
        'body': json.dumps({
            'Question': question,
            'Answer': answer
        })
    }
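
The handler picks the answer span by taking the highest-scoring start position and the highest-scoring end position, then slicing the token list between them. The following sketch reproduces only that selection step with plain Python lists, so it runs without PyTorch. The function name, tokens, and scores are made-up values for illustration and are not produced by the model.

```python
def select_answer_span(tokens, start_scores, end_scores):
    """Return the tokens between the best start and best end positions."""
    # Index of the largest start score (plain-Python stand-in for torch.argmax).
    answer_start = max(range(len(start_scores)), key=start_scores.__getitem__)
    # Adding 1 makes the end index exclusive, as in the handler.
    answer_end = max(range(len(end_scores)), key=end_scores.__getitem__) + 1
    return tokens[answer_start:answer_end]


# Made-up tokens and scores, for illustration only.
tokens = ["the", "car", "was", "invented", "in", "1886", "."]
start_scores = [0.1, 0.2, 0.1, 0.3, 0.5, 4.2, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.4, 3.9, 0.6]

answer = " ".join(select_answer_span(tokens, start_scores, end_scores))
print(answer)
```

Because the start and end positions are chosen independently, the best end can fall before the best start, in which case the slice is empty. The handler shown above doesn't guard against that case.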

The following Dockerfile is an example for Python 3.8, which downloads and uses the DistilBERT language model fine-tuned for the question-answering task. For more information, see DistilBERT base uncased distilled SQuAD. You can use your custom models by copying them to the model folder and referencing it in the

# Pull the base image with python 3.8 as a runtime for your Lambda

# Copy the earlier created requirements.txt file to the container
COPY requirements.txt ./

# Install the python requirements from requirements.txt
RUN python3.8 -m pip install -r requirements.txt

# Copy the earlier created file to the container

# Load the BERT model from Huggingface and store it in the model directory
RUN mkdir model
RUN curl -L -o ./model/pytorch_model.bin
RUN curl -o ./model/config.json
RUN curl -o ./model/tokenizer.json
RUN curl -o ./model/tokenizer_config.json

# Set the CMD to your handler
CMD ["app.lambda_handler"]

Change your working directory back to lambda-pytorch-example and copy the following content into the template.yaml file:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >

  Sample SAM Template for lambda-pytorch-example

Resources:
  pytorchEndpoint:
    Type: AWS::Serverless::Function
    Properties:
      PackageType: Image
      MemorySize: 5000
      Timeout: 300
      Events:
        # The event name below is illustrative; any logical name works
        ApiEndpoint:
          Type: HttpApi
          Properties:
            Path: /inference
            Method: post
            TimeoutInMillis: 29000
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: ./hello_world
      DockerTag: python3.8-v1

Outputs:
  InferenceApi:
    Description: "API Gateway endpoint URL for Prod stage for inference function"
    Value: !Sub "https://${ServerlessHttpApi}.execute-api.${AWS::Region}"

Now we need to create an Amazon ECR repository in AWS and authenticate the local Docker client to it. The repositoryUri is displayed in the output; save it for later.

# Create an ECR repository
aws ecr create-repository --repository-name lambda-pytorch-example --image-scanning-configuration scanOnPush=true --region <REGION>

# Register docker to ECR
aws ecr get-login-password --region <REGION> | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com

Deploying the application

The following steps deploy the application to your AWS account:

  1. Run sam build && sam deploy --guided.
  2. For Stack Name, enter pytorch-lambda-example.
  3. Choose the same Region that you created the Amazon ECR repository in.
  4. Enter the image repository for the function (enter the earlier saved repositoryUri of the Amazon ECR repository).
  5. For Confirm changes before deploy and Allow SAM CLI IAM role creation, keep the defaults.
  6. For pytorchEndpoint may not have authorization defined, Is this okay?, select y.
  7. Keep the defaults for the remaining prompts.

AWS SAM uploads the container images to the Amazon ECR repository and deploys the application. During this process, you see a change set along with the status of the deployment. For a more detailed description about AWS SAM and container images for Lambda, see Using container image support for AWS Lambda with AWS SAM.

When the deployment is complete, the stack output is displayed. Use the InferenceApi endpoint to test your deployed application. The endpoint URL is displayed as an output during the deployment of the stack.

Overcoming a Lambda function cold start

Because the plain language model is already around 250 MB, the initial function run can take up to 25 seconds and may even exceed the maximum API timeout of 29 seconds. The same delay can occur when the function hasn’t been called for some time and is therefore in a cold start mode. When the Lambda function is warm, one inference run takes about 150 milliseconds.

There are multiple ways to mitigate the runtime of Lambda functions in a cold state. Lambda supports provisioned concurrency to keep the functions initialized. Another way is to create an Amazon CloudWatch event that periodically calls the function to keep it warm.
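
If you take the scheduled-event route, the handler should recognize the keep-warm invocation and return early instead of trying to parse a request body. The following is a minimal sketch under the assumption that the rule sends the default scheduled event payload, whose source field is aws.events. The helper name is_warmup_event and the placeholder response are ours; the real inference code from earlier in this post would replace the placeholder.

```python
import json


def is_warmup_event(event):
    """Return True for scheduled events sent only to keep the function warm."""
    # Scheduled rules deliver events whose "source" is "aws.events".
    return isinstance(event, dict) and event.get("source") == "aws.events"


def lambda_handler(event, context):
    if is_warmup_event(event):
        # Skip inference; the invocation alone keeps the container initialized.
        return {"statusCode": 200, "body": json.dumps({"warmup": True})}
    body = json.loads(event["body"])
    # Placeholder for the real inference code shown earlier in this post.
    return {"statusCode": 200, "body": json.dumps({"Question": body["question"]})}


scheduled_event = {"source": "aws.events", "detail-type": "Scheduled Event"}
print(lambda_handler(scheduled_event, None))
```

A periodic ping keeps only one execution environment warm, so concurrent requests can still hit a cold start; provisioned concurrency addresses that case.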

Make sure to change <API_GATEWAY_URL> to the URL of your API Gateway endpoint. In the following example code, the text is copied from the Wikipedia page on cars. You can change the question and context as you like and check the model’s answers.

curl --header "Content-Type: application/json" --request POST --data '{"question": "When was the car invented?","context": "Cars came into global use during the 20th century, and developed economies depend on them. The year 1886 is regarded as the birth year of the modern car when German inventor Karl Benz patented his Benz Patent-Motorwagen. Cars became widely available in the early 20th century. One of the first cars accessible to the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced animal-drawn carriages and carts, but took much longer to be accepted in Western Europe and other parts of the world."}' <API_GATEWAY_URL>

The response shows the correct answer to the question:

{"Question": "When was the car invented?", "Answer": "1886"}
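
If you prefer to call the endpoint from Python instead of curl, the same request can be built with the standard library. The following sketch only constructs the request object; the URL is a placeholder, and the function name build_inference_request is our own. Sending the request is left commented out so that the example runs without a deployed endpoint.

```python
import json
import urllib.request


def build_inference_request(api_url, question, context):
    """Build (but do not send) a POST request for the inference endpoint."""
    payload = json.dumps({"question": question, "context": context}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; replace it with your API Gateway endpoint before sending.
request = build_inference_request(
    "https://example.com/inference",
    "When was the car invented?",
    "The year 1886 is regarded as the birth year of the modern car.",
)
print(request.get_method(), request.full_url)

# To send the request, uncomment the following lines:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.loads(response.read()))
```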


Container image support for Lambda allows you to customize your function even more, opening up many new use cases for serverless ML. You can bring your custom models and deploy them on Lambda using up to 10 GB for the container image size. For smaller models that don’t need much computing power, you can perform online training and inference purely in Lambda. When the model size increases, cold start issues become more and more important and need to be mitigated. There is also no restriction on the framework or language with container images; other ML frameworks such as TensorFlow, Apache MXNet, XGBoost, or scikit-learn can be used as well!

If you do require a GPU for your inference, you can consider using container services such as Amazon Elastic Container Service (Amazon ECS) or Kubernetes, or deploying the model to an Amazon SageMaker endpoint.

About the Author

Jan Bauer is a Cloud Application Developer at AWS Professional Services. His interests are serverless computing, machine learning, and everything that involves cloud computing.

Building secure machine learning environments with Amazon SageMaker

As businesses and IT leaders look to accelerate the adoption of machine learning (ML) and artificial intelligence (AI), there is a growing need to understand how to build secure and compliant ML environments that meet enterprise requirements. One major challenge you may face is integrating ML workflows into existing IT and business work streams. A second challenge is bringing together stakeholders from business leadership, data science, engineering, risk and compliance, and cybersecurity to define the requirements and guardrails for the organization. Third, because building secure ML environments in the cloud is a relatively new topic, understanding recommended practices is also helpful.

In this post, we introduce a series of hands-on workshops and associated code artifacts to help you build secure ML environments on top of Amazon SageMaker, a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. The objective of these workshops is to address the aforementioned challenges by helping bring together different IT stakeholders and data scientists and provide best practices to build and operate secure ML environments. These workshops are a summary of recommended practices from large enterprises and small and medium businesses. You can access these workshops on Building Secure Environments, and you can find the associated code on GitHub. We believe that these workshops are valuable for the following primary teams:

  • Cloud engineering – This team is responsible for creating and maintaining a set of enterprise-wide guardrails for operating in the cloud. Key requirements for these teams include isolation from public internet, restriction of data traffic flows, use of strict AWS Identity and Access Management (IAM) controls to allow only authorized and authenticated users the ability to access project resources, and the use of defense-in-depth methodologies to detect and mitigate potential threats. This team can use tools like AWS Service Catalog to build repeatable patterns using infrastructure as code (IaC) practices via AWS CloudFormation.
  • ML platform – This team is responsible for building and maintaining the infrastructure for supporting ML services, such as provisioning notebooks for data scientists to use, creating secure buckets for storing data, managing costs for ML from various lines of business (LOBs), and more.
  • Data science COE – Data scientists within an AI Center of Excellence (COE) or embedded within the LOBs are responsible for building, training, and deploying models. In regulated industries, data scientists need to adhere to the organization’s security boundaries, such as using encrypted buckets for data access, use of private networking for accessing APIs, committing code to source control, ensuring all their experiments and trials are properly logged, enforcing encryption of data in transit, and monitoring deployed models.

The following diagram is the architecture for the secure environment developed in this workshop.


In the Building Secure Environments workshop aimed at the cloud engineering and ML platform teams, we cover how this architecture can be set up in Labs 1–2. Specifically, we use AWS Service Catalog to provision a Shared Services Amazon Virtual Private Cloud (Amazon VPC), which hosts a private PyPI package repository to pull packages from an Amazon Simple Storage Service (Amazon S3) bucket via a secure VPC endpoint.

After the environment is provisioned, the following architecture diagram illustrates the typical data scientist workflow within the project VPC, which is covered in detail in the workshop Using Secure Environments aimed at data scientists.


This workshop quickly sets up the secure environment (Steps 1–3) and then focuses on using SageMaker notebook instances to securely explore and process data (Steps 4–5). Following that, we train a model (Steps 6–7) and deploy and monitor the model and model metadata (Steps 8–9) while enforcing version control (Step 4).

The workshops and associated code let you implement recommended practices and patterns, help you quickly get started building secure environments, and improve productivity with the ability to securely build, train, deploy, and monitor ML models. Although the workshop is built using SageMaker notebook instances, in this post we highlight how you can adapt this to Amazon SageMaker Studio, the first integrated development environment for machine learning on AWS.

Workshop features

The workshop is a collection of feature implementations grouped together to provide a coherent starting point for customers looking to build secure data science environments. The features implemented are broadly categorized across seven areas:

  • Enforce your existing IT policies in your AWS account and data science environment to mitigate risks
  • Create environments with least privilege access to sensitive data in the interest of reducing the blast radius of a compromised or malicious actor
  • Protect sensitive data against data exfiltration using a number of controls designed to mitigate the data exfiltration risk
  • Encrypt sensitive data and intellectual property at rest and in transit as part of a defense-in-depth strategy
  • Audit and trace activity in your environment
  • Reproduce results in your environment by tracking the lineage of ML artifacts throughout the lifecycle and using source and version control tools such as AWS CodeCommit
  • Manage costs and allow teams to self service using a combination of tagging and the AWS Service Catalog to automate building secure environments

In the following sections, we cover in more depth how these different features have been implemented.

Enforcing existing IT policies

When entrusting sensitive data to AWS services, you need confidence that you can govern your data to the same degree with the managed service as if you were running the service yourself. A typical starting point to govern your data in an AWS environment is to create a VPC that is tailored and configured to your standards in terms of information security, firewall rules, and routing. This becomes a starting point for your data science environment and the services that projects use to deliver on their objectives. SageMaker, and many other AWS services, can be deployed into your VPC. This allows you to use network-level controls to manage the Amazon Elastic Compute Cloud (Amazon EC2)-based resources that reside within the network. To learn about how to set up SageMaker Studio in a private VPC, see Securing Amazon SageMaker Studio connectivity using a private VPC.

The network-level controls deployed as a part of this workshop include the following:

  • Security groups to manage which resources and services, such as SageMaker, can communicate with other resources in the VPC
  • VPC endpoints to grant explicit access to specific AWS services from within the VPC, like Amazon S3 or Amazon CloudWatch
  • VPC endpoints to grant explicit access to customer-managed shared services such as a PyPI repository server

The shared service PyPI repository demonstrates how you can create managed artifact repositories that can then be shared across project environments. Because the environments don’t have access to the open internet, access to common package and library repositories is restricted to your repositories that hold your packages. This limits any potential threats from unapproved packages entering your secure environment.

With the launch of AWS CodeArtifact, you can now use CodeArtifact as your private PyPI repository. CodeArtifact provides VPC endpoints to maintain private networking. To learn more about how to integrate CodeArtifact with SageMaker notebook instances and Studio notebooks, see Private package installation in Amazon SageMaker running in internet-free mode.

In addition to configuring a secure network environment, this workshop also uses IAM policies to create a preventive control that requires that all SageMaker resources be provisioned within a customer VPC. An AWS Lambda function is also deployed as a corrective control to stop any SageMaker resources that are provisioned without a VPC attachment.
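
A preventive control of this kind is typically an IAM policy statement that denies SageMaker create calls when the request carries no VPC configuration. The following sketch builds such a policy document as a Python dictionary. It is not the workshop's actual policy: the statement ID and the list of actions are chosen for illustration, and it assumes the sagemaker:VpcSubnets condition key applies to each listed action, which you should confirm against the IAM reference for SageMaker.

```python
import json

# Illustrative policy; the actions listed are a subset chosen for this sketch.
require_vpc_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenySageMakerJobsWithoutVpc",
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:CreateProcessingJob",
                "sagemaker:CreateModel",
            ],
            "Resource": "*",
            # The Null operator matches requests that omit the VPC subnets.
            "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
        }
    ],
}

print(json.dumps(require_vpc_policy, indent=2))
```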

One of the unique elements of SageMaker notebooks is that they are managed EC2 instances in which you can tailor the operating system. This workshop uses SageMaker lifecycle configuration policies to configure the Linux operating system of the SageMaker notebook to be in line with IT policy, such as disabling root access for data scientists. For SageMaker Studio, you can enforce your IT policies of using security-approved containers and packages for running notebooks by bringing your own custom image. SageMaker handles versioning of the images, and provides data scientists with a user-friendly drop-down to select the custom image of their choice.

Labs 1–3 in the Building Secure Environments and Labs 1–2 in the Using Secure Environments workshops focus on how you can enforce IT policies on your ML environments.

Least privilege access to sensitive data

In the interest of least-privilege access to sensitive data, it’s simpler to provide isolated environments to any individual project. These isolated environments provide a method of restricting access to customer-managed assets, datasets, and AWS services on a project-by-project basis, with a lower risk of cross-project data movement. The following paragraphs discuss some of the key mechanisms used in the workshops to provide isolated, project-specific environments. The workshop hosts multiple projects in a single AWS account, but given sufficient maturity of automation, you could provide the same level of isolation using project-specific AWS accounts. Although you can have multiple SageMaker notebook instances within a single account, you can only have one Studio domain per Region in an account. You can therefore use a domain to create isolated project-specific environments in separate accounts.

To host multiple projects in a single AWS account, the workshop dedicates a private, single-tenant VPC to each project. This creates a project-specific network boundary that grants access to specific AWS resources and services using VPC endpoints and endpoint policies. This combination creates logically isolated single-tenant project environments that are dedicated to a project team.

In addition to a dedicated network environment, the workshop creates AWS resources that are dedicated to individual projects. S3 buckets, for instance, are created per project and bound to the VPC for the project. An S3 bucket policy restricts the objects in the bucket to only be accessed from within the VPC. Equally, the endpoint policy associated with the Amazon S3 VPC endpoint within the VPC only allows principals in the VPC to communicate with those specific S3 buckets. This could be expanded as needed in order to support accessing other buckets, perhaps in conjunction with an Amazon S3-based data lake.
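
The two policies described above can be sketched as plain policy documents. The following example is illustrative rather than the workshop's actual templates: the function names, statement ID, bucket name, and VPC ID are hypothetical, and the bucket policy assumes requests reach Amazon S3 through a VPC endpoint so that the aws:SourceVpc condition key is populated.

```python
import json


def vpc_only_bucket_policy(bucket_name, vpc_id):
    """Deny object access to any request that does not originate in vpc_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessFromOutsideProjectVpc",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {"StringNotEquals": {"aws:SourceVpc": vpc_id}},
            }
        ],
    }


def project_endpoint_policy(bucket_names):
    """Allow the S3 VPC endpoint to reach only the listed project buckets."""
    resources = []
    for name in bucket_names:
        resources.extend([f"arn:aws:s3:::{name}", f"arn:aws:s3:::{name}/*"])
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": resources,
            }
        ],
    }


# Hypothetical bucket name and VPC ID, for illustration only.
bucket_policy = vpc_only_bucket_policy("project-a-data", "vpc-0123456789abcdef0")
endpoint_policy = project_endpoint_policy(["project-a-data"])
print(json.dumps(bucket_policy, indent=2))
```

Supporting an additional bucket, such as one in a data lake, then amounts to adding its name to the list passed to project_endpoint_policy.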

Other AWS resources that are created on behalf of an individual project include IAM roles that govern who can access the project environment and what permissions they have within the environment. This prevents project teams from accessing resources in the AWS account that aren’t dedicated to their own project.

To manage intellectual property developed by the project, a CodeCommit repository is created to provide the project with a dedicated Git repository to manage and version control their source code. We use CodeCommit to commit any code developed in notebooks by data scientists in Labs 3–4 in the Using Secure Environments workshop.

Protecting against data exfiltration

As described earlier, project teams have access to AWS services and resources like Amazon S3 and objects in Amazon S3 through the VPC endpoints in the project’s VPC. The isolated VPC environment gives you full control over the ingress and egress of data flowing across the network boundary. The workshop uses security groups to govern which AWS resources can communicate with specific AWS services. The workshop also uses VPC endpoint policies to limit the AWS resources that can be accessed using the VPC endpoints.

When data is in Amazon S3, the bucket policy applied to the bucket doesn’t allow resources from outside the VPC to read data from the bucket, ensuring that it’s bound, as a backing store, to the VPC.

Data protection

The application of ML technologies is often done using sensitive customer data. This data may contain commercially sensitive, personally identifiable, or proprietary information that must be protected over the data’s lifetime. SageMaker and associated services such as Amazon Elastic Container Registry (Amazon ECR), Amazon S3, and CodeCommit all support end-to-end encryption both at rest and in transit.

Encryption at rest

SageMaker prefers to source information from Amazon S3, which supports multiple methods of encrypting data. For the purposes of this workshop, the S3 buckets are configured to automatically encrypt objects with a specified customer master key (CMK) that is stored in AWS Key Management Service (AWS KMS). A preventive control is also configured to require that data put into Amazon S3 is encrypted using a KMS key. These two mechanisms ensure that data stored in Amazon S3 is encrypted using a key that is managed and controlled by the customer.
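
The preventive control on uploads can be expressed as a bucket policy that denies any PutObject request not asking for KMS server-side encryption. The following is a sketch of that idea, not the workshop's actual policy; the function name, statement ID, and bucket name are hypothetical.

```python
import json


def require_kms_encryption_policy(bucket_name):
    """Deny uploads that do not request server-side encryption with AWS KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutKmsEncryption",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # Matches requests whose encryption header is not "aws:kms".
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
kms_policy = require_kms_encryption_policy("project-a-data")
print(json.dumps(kms_policy, indent=2))
```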

Similar to Amazon S3, Amazon ECR is also used to store customer-built Docker containers that are likely to contain intellectual property. Amazon ECR supports the encryption of images at rest using a CMK. This enables you to support PCI-DSS compliance requirements for separate authentication of the storage and cryptography. With this feature enabled, Amazon ECR automatically encrypts images when pushed, and decrypts them when pulled.

As data is moving into SageMaker-managed resources from Amazon S3, it’s important to ensure that the encryption at rest of the data persists. SageMaker supports this by allowing the specification of KMS CMKs for encrypting the EBS volumes that hold the data retrieved from Amazon S3. Encryption keys can be specified to encrypt the volumes of all Amazon EC2-based SageMaker resources, such as processing jobs, notebooks, training jobs, and model endpoints. A preventive control is deployed in this workshop, which allows the provisioning of SageMaker resources only if a KMS key has been specified to encrypt the volumes.
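
For a training job, the KMS key is specified in two places: on the attached storage volume and on the output written back to Amazon S3. The following sketch builds just those two parts of a request, using field names from the SageMaker CreateTrainingJob API. The function name, instance type, volume size, key ARN, and S3 path are placeholder assumptions, and a complete request needs further fields such as the job name, role, and algorithm specification.

```python
def encrypted_training_settings(kms_key_id, output_s3_uri):
    """Return the request fragments that carry the KMS key for a training job."""
    return {
        "OutputDataConfig": {
            "S3OutputPath": output_s3_uri,
            # Encrypts the model artifacts written to Amazon S3.
            "KmsKeyId": kms_key_id,
        },
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # placeholder instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,  # placeholder volume size
            # Encrypts the storage volume attached to the training instance.
            "VolumeKmsKeyId": kms_key_id,
        },
    }


# Placeholder key ARN and output location, for illustration only.
settings = encrypted_training_settings(
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    "s3://project-a-data/output/",
)
print(sorted(settings))
```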

Encryption in transit

AWS makes extensive use of HTTPS communication for its APIs. The services mentioned earlier are no exception. In addition to passing all API calls through a TLS encrypted channel, AWS APIs also require that requests are signed using the Signature version 4 signing process. This process uses client access keys to sign every API request, adding authentication information as well as preventing tampering of the request in flight.
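
To make the signing step more concrete, the following sketch shows the key-derivation part of Signature Version 4, in which the secret access key is chained through HMAC-SHA256 with the date, Region, and service name. It covers only the derivation of the signing key; the full process also builds a canonical request and a string to sign, and the AWS SDKs and CLI do all of this for you. The secret key and the string to sign are dummy values, and the function names are our own.

```python
import hashlib
import hmac


def _hmac_sha256(key, message):
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service):
    """Derive a Signature Version 4 signing key scoped to date, Region, and service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Dummy credentials and string to sign, for illustration only.
signing_key = derive_signing_key("EXAMPLE-SECRET-KEY", "20210101", "us-east-1", "sagemaker")
signature = hmac.new(signing_key, b"example string to sign", hashlib.sha256).hexdigest()
print(len(signing_key), len(signature))
```

Because the derived key is scoped to a single day, Region, and service, a signature computed for one request context can't be reused in another.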

As services like SageMaker, Amazon S3, and Amazon ECR interact with one another, they must also communicate using Signature V4 signed packets over encrypted HTTPS channels. This ensures that communication between AWS services is encrypted to a known standard, protecting customer data as it moves between services.

When communicating with SageMaker resources such as notebooks or hosted models, the communication is also performed over authenticated and signed HTTPS requests as with other AWS services.

Inter-node encryption

SageMaker provides an added benefit for securing your data when training on distributed clusters. When performing distributed training, some ML frameworks pass coefficients between the different instances of the algorithm in plain text. This shared state is not your training data, but is the information that the algorithms require to stay synchronized with one another. You can instruct SageMaker to automatically encrypt inter-node communication for your training job. The data passed between nodes is then passed over an encrypted tunnel without your algorithm having to take on responsibility for encrypting and decrypting the data. To enable inter-node encryption, ensure that your security groups are configured to permit UDP traffic over port 500 and that you have set EnableInterContainerTrafficEncryption to True. For more detailed instructions, see Protect Communications Between ML Compute Instances in a Distributed Training Job.
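
The two settings named above can be sketched as request fragments: one for the training job and one for the security group rule. The field names follow the SageMaker CreateTrainingJob and Amazon EC2 AuthorizeSecurityGroupIngress APIs; the function name, security group ID, and subnet ID are hypothetical. The rule covers only the UDP port 500 requirement stated in this post, so check the linked instructions for any additional rules your setup needs.

```python
def inter_node_encryption_settings(security_group_id, subnet_ids):
    """Return training job and security group fragments for inter-node encryption."""
    training_job_settings = {
        "EnableInterContainerTrafficEncryption": True,
        "VpcConfig": {
            "SecurityGroupIds": [security_group_id],
            "Subnets": list(subnet_ids),
        },
    }
    ingress_rule = {
        "GroupId": security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "udp",
                "FromPort": 500,
                "ToPort": 500,
                # Self-referencing rule: only members of the same group may connect.
                "UserIdGroupPairs": [{"GroupId": security_group_id}],
            }
        ],
    }
    return training_job_settings, ingress_rule


# Hypothetical identifiers, for illustration only.
job_settings, rule = inter_node_encryption_settings(
    "sg-0123456789abcdef0", ["subnet-0123456789abcdef0"]
)
print(job_settings["EnableInterContainerTrafficEncryption"], rule["IpPermissions"][0]["FromPort"])
```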

Ensuring encryption at rest and in transit during the ML workflow is covered in detail in Labs 3–4 of the Using Secure Environments workshop.

Traceability, reproducibility, and auditability

A common pain point that you may face is a lack of recommended practices around code and ML lifecycle traceability. Often, this can arise from data scientists not being trained in MLOps (ML and DevOps) best practices, and the inherent experimental nature of the ML process. In regulated industries such as financial services, regulatory bodies such as the Office of the Comptroller of the Currency (OCC) and Federal Reserve Board (FRB) have documented guidelines on managing the risk of analytical models.

Lack of best practices around documenting the end-to-end ML lifecycle can lead to lost cycles in trying to trace the source code, model hyperparameters, and training data. The following figure shows the different steps in the lineage of a model that may be tracked for traceability and reproducibility reasons.


Traceability refers to the ability to map outputs from one step in the ML cycle to the inputs of another, thereby keeping a record of the entire lineage of a model. Requiring data scientists to use source and version control tools such as Git or BitBucket to regularly check in code, and refusing to approve or promote models until the code has been checked in, can help mitigate this issue. In this workshop, we provision a private CodeCommit repository for use by data scientists, along with their notebook instance. Admins can tag these repositories with the users' identities to identify who is responsible for the commits and to ensure code is being checked into source control frequently. One way to do this is to use project-specific branches, and ensure that the branch has been merged with the primary branch in the shared services environment before it is promoted to pre-production or test. Data scientists should not be allowed to promote code directly from dev to production without this intermediate step.

In addition to versioning code, versioning the data used for training models is important as well. All the buckets created in this workshop have versioning automatically enabled to enforce version control on any data stored there, such as processed data and the training, validation, and test datasets. SageMaker Experiments automatically keeps track of the pointer to the specific version of the training data used during model training.
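
To illustrate what a pointer to a specific data version can look like, here is a minimal sketch. It builds the parameters for the S3 PutBucketVersioning API and extracts a version pointer from a HeadObject-style response. The bucket name, object key, response values, and helper functions are hypothetical; this is not how SageMaker Experiments stores the pointer internally.

```python
def versioning_request(bucket):
    """Parameters for the S3 PutBucketVersioning API that turn versioning on."""
    return {"Bucket": bucket, "VersioningConfiguration": {"Status": "Enabled"}}


def training_data_pointer(bucket, key, head_object_response):
    """Build a lineage record that pins one exact version of an S3 object.

    head_object_response is a dict shaped like the S3 HeadObject response,
    which includes a VersionId when bucket versioning is enabled.
    """
    return {
        "bucket": bucket,
        "key": key,
        "version_id": head_object_response["VersionId"],
        "etag": head_object_response["ETag"].strip('"'),
    }


# Hypothetical values standing in for a real HeadObject response.
fake_response = {"VersionId": "example-version-id", "ETag": '"abc123"'}
pointer = training_data_pointer("example-bucket", "train/data.csv", fake_response)
print(versioning_request("example-bucket"))
print(pointer)
# With boto3 and credentials, the real calls would be:
# s3 = boto3.client("s3")
# s3.put_bucket_versioning(**versioning_request("example-bucket"))
# s3.head_object(Bucket="example-bucket", Key="train/data.csv")
```

Storing the version ID alongside the model metadata means that a later overwrite of the same key does not change which data the model is traced back to.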

Data scientists often tend to explore data in notebooks, and use notebooks to engineer features as well. In this workshop, we demonstrate how to use SageMaker Processing to not only offload the feature engineering code from the notebook instance onto separate compute instances to run at scale, but also to subsequently track parameters used for engineering features in SageMaker Experiments for reproducibility reasons. SageMaker recently launched SageMaker Clarify, which allows you to detect bias in your data as well as extract feature importances. You can run these jobs as you would run SageMaker Processing jobs using the Clarify SDK.

Versioning and tagging experiments, hyperparameter tuning jobs, and data processing jobs allow data scientists to collaborate faster. SageMaker Experiments automatically tracks and logs metadata from SageMaker training, processing, and batch transform jobs, and surfaces relevant information such as model hyperparameters, model artifact location, and model container metadata in a searchable way. For more information, see Amazon SageMaker Experiments – Organize, Track And Compare Your Machine Learning Trainings.
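
As a rough sketch of what searching this metadata can look like, the following builds a request for the SageMaker Search API that finds training jobs carrying a given tag, and then summarizes a response of the expected shape. The tag key "project", the tag value, and the sample response are assumptions made for illustration; the workshop labs use SageMaker Experiments directly.

```python
def build_search_request(project, max_results=10):
    """Parameters for the SageMaker Search API, filtered by a resource tag."""
    return {
        "Resource": "TrainingJob",
        "SearchExpression": {
            "Filters": [
                # Tag filters use the form Tags.<key>; "project" is hypothetical.
                {"Name": "Tags.project", "Operator": "Equals", "Value": project}
            ]
        },
        "SortBy": "CreationTime",
        "SortOrder": "Descending",
        "MaxResults": max_results,
    }


def summarize(search_response):
    """Pull job name, hyperparameters, and artifact location from each result."""
    rows = []
    for record in search_response.get("Results", []):
        job = record["TrainingJob"]
        rows.append({
            "name": job["TrainingJobName"],
            "hyperparameters": job.get("HyperParameters", {}),
            "artifacts": job.get("ModelArtifacts", {}).get("S3ModelArtifacts"),
        })
    return rows


# Hypothetical response standing in for a real Search API result.
fake_response = {
    "Results": [{
        "TrainingJob": {
            "TrainingJobName": "example-job",
            "HyperParameters": {"max_depth": "5"},
            "ModelArtifacts": {"S3ModelArtifacts": "s3://example-bucket/model.tar.gz"},
        }
    }]
}
search_request = build_search_request("example-project")
summary = summarize(fake_response)
print(search_request)
print(summary)
# Real call: boto3.client("sagemaker").search(**search_request)
```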

SageMaker Experiments also keeps track of model metrics, which lets data scientists compare different trained models and identify the ones that meet their business objectives. You can also use SageMaker Experiments to track which user launched a training job and use IAM condition keys to enforce resource tags on the Experiment APIs.
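
One way tag enforcement on the Experiment APIs could be expressed is an IAM policy that denies creation requests that don't carry a required tag. The sketch below generates such a policy document in Python. The tag key "project" and the helper function are hypothetical, and the action list is an assumption to verify against the IAM documentation for SageMaker before use.

```python
import json


def require_tag_policy(tag_key):
    """IAM policy document denying Experiment creation calls that lack a tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUntaggedExperimentResources",
                "Effect": "Deny",
                "Action": [
                    "sagemaker:CreateExperiment",
                    "sagemaker:CreateTrial",
                    "sagemaker:CreateTrialComponent",
                ],
                "Resource": "*",
                # The Null condition is true when the request has no such tag.
                "Condition": {"Null": {f"aws:RequestTag/{tag_key}": "true"}},
            }
        ],
    }


tag_policy = require_tag_policy("project")  # "project" is a hypothetical key
print(json.dumps(tag_policy, indent=2))
```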

In SageMaker Studio, SageMaker Experiments also tracks the user profile of the user launching jobs, which provides additional auditability. We demonstrate the use of SageMaker Experiments, and how you can use it to search for specific trials and extract the model metadata, in Labs 3–4 of the Using Secure Environments workshop.

Although accurately capturing the lineage of ML models can certainly help reproduce the model outputs, depending on the model’s risk level, you may also be required to document feature importance from your models. In this workshop, we demonstrate one methodology for doing so, using Shapley values. Note, however, that this approach is by no means exhaustive, and you should work with your risk, legal, and compliance teams to assess legal, ethical, regulatory, and compliance requirements for, and implications of, building and using ML systems.

As a best practice, deployed endpoints should be monitored for data drift. In these workshops, we demonstrate how SageMaker Model Monitor automatically extracts statistics from the features as a baseline, captures the input payload and the model predictions, and checks for data drift against the baseline at regular intervals. The detected drift can be visualized in SageMaker Studio and used to set thresholds and alarms that trigger model retraining or alert developers to model drift.
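
To make the baseline-versus-current comparison concrete, here is a deliberately simple sketch of a drift check for a single numeric feature. It is not the algorithm SageMaker Model Monitor uses. The threshold, the feature values, and the function names are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def baseline_stats(values):
    """Summarize a feature from the training (baseline) data."""
    return {"mean": mean(values), "std": stdev(values)}


def has_drifted(baseline, current_values, threshold=3.0):
    """Flag drift when the current mean moves more than `threshold`
    baseline standard deviations away from the baseline mean."""
    shift = abs(mean(current_values) - baseline["mean"]) / baseline["std"]
    return shift > threshold


# Hypothetical feature values.
baseline = baseline_stats([10, 11, 9, 10, 12, 8])
stable_batch = [10, 9, 11]
shifted_batch = [15, 16, 14]

print(has_drifted(baseline, stable_batch))
print(has_drifted(baseline, shifted_batch))
```

In practice, a check like this would run on captured endpoint inputs at a regular interval, and a positive result would raise an alarm or start a retraining workflow.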

To audit ML environments, admins can monitor instance-level metrics related to training jobs, processing jobs, and hyperparameter tuning jobs using CloudWatch Events. You can also use lifecycle configurations to publish Jupyter logs to CloudWatch; Studio notebook logs are published there automatically. Here we demonstrate the use of detective and preventive controls to prevent data scientists from launching training jobs outside the project VPC. Additional preventive controls using IAM condition keys such as sagemaker:InstanceTypes may be added to prevent data scientists from misusing certain instance types (such as the more expensive GPU instances) or to enforce that data scientists only train models using AWS Nitro System instances, which offer enhanced security.
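
The preventive controls described above can be sketched as an IAM policy document. The following Python generates a policy with two Deny statements: one for jobs launched without VPC subnets, and one for instance types outside an approved list. The approved instance types, the statement IDs, and the helper function are hypothetical, and the exact policies used in the workshop may differ.

```python
import json

JOB_ACTIONS = [
    "sagemaker:CreateTrainingJob",
    "sagemaker:CreateHyperParameterTuningJob",
    "sagemaker:CreateProcessingJob",
]


def training_guardrail_policy(allowed_instance_types):
    """IAM policy document with preventive controls for SageMaker jobs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny any job request that does not specify VPC subnets.
                "Sid": "DenyJobsOutsideVpc",
                "Effect": "Deny",
                "Action": JOB_ACTIONS,
                "Resource": "*",
                "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
            },
            {
                # Deny if any requested instance type is not on the list.
                "Sid": "DenyUnapprovedInstanceTypes",
                "Effect": "Deny",
                "Action": JOB_ACTIONS,
                "Resource": "*",
                "Condition": {
                    "ForAnyValue:StringNotLike": {
                        "sagemaker:InstanceTypes": allowed_instance_types
                    }
                },
            },
        ],
    }


# Hypothetical approved list for illustration.
guardrail_policy = training_guardrail_policy(["ml.m5.*", "ml.c5.*"])
print(json.dumps(guardrail_policy, indent=2))
```

Using explicit Deny statements means the guardrails hold even if another attached policy grants broad SageMaker permissions.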


Customers are rapidly adopting infrastructure as code (IaC) best practices using tools such as AWS CloudFormation or HashiCorp Terraform to ensure repeatability across their cloud workflows. However, a consistent pain point for data science and IT teams across enterprises has been the challenge of creating repeatable environments that can be easily scaled across the organization.

AWS Service Catalog allows you to build products that abstract the underlying CloudFormation templates. These products can be shared across accounts, and a consistent taxonomy can be enforced using the TagOptions Library. Administrators can design products for the data science teams to run in their accounts that provision all the underlying resources automatically, while allowing data scientists to customize resources such as underlying compute instances (GPU or CPU) required for running notebooks, but disallowing data scientists from creating notebook instances any other way. Similarly, admins can enforce that data scientists enter their user information while creating products to have visibility on who is creating notebooks.
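
As an illustration of how a product can let data scientists choose compute while still constraining their options, the sketch below generates a small CloudFormation template in Python. The parameter names, the allowed instance types, and the "owner" tag key are hypothetical and are not taken from the workshop templates.

```python
import json


def notebook_product_template(allowed_instance_types):
    """CloudFormation template for a notebook with constrained parameters."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "NotebookInstanceType": {
                "Type": "String",
                "Default": allowed_instance_types[0],
                # Users can pick only from the admin-approved list.
                "AllowedValues": allowed_instance_types,
            },
            "NotebookOwner": {
                "Type": "String",
                "MinLength": 1,  # user information is mandatory
                "Description": "Name of the data scientist using the notebook",
            },
            "NotebookRoleArn": {"Type": "String"},
        },
        "Resources": {
            "Notebook": {
                "Type": "AWS::SageMaker::NotebookInstance",
                "Properties": {
                    "InstanceType": {"Ref": "NotebookInstanceType"},
                    "RoleArn": {"Ref": "NotebookRoleArn"},
                    "Tags": [
                        {"Key": "owner", "Value": {"Ref": "NotebookOwner"}}
                    ],
                },
            }
        },
    }


# Hypothetical approved instance types.
template = notebook_product_template(["ml.t3.medium", "ml.m5.xlarge"])
print(json.dumps(template, indent=2))
```

A template like this would be registered as an AWS Service Catalog product, so data scientists launch it through the catalog rather than creating notebook instances directly.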

To allow teams to move at speed and to free constrained cloud operations teams from easily automated work, this workshop uses AWS Service Catalog to automate common activities such as SageMaker notebook creation. AWS Service Catalog provides you with a way to codify your own best practices for deploying logically grouped assets, such as a project team environment, and allows project teams to deploy these assets for themselves.

AWS Service Catalog allows cloud operations teams to give business users a self-service way to obtain on-demand assets that are deployed in a manner compliant with internal IT policies. Business users no longer have to submit tickets for common activities and wait for the cloud operations team to service them. Additionally, AWS Service Catalog provides the cloud operations team with a centralized location to understand who has deployed various assets and to manage those deployed assets so that, as IT policy evolves, updates can be rolled out across provisioned products. This is covered in detail in Labs 1–2 of the Building Secure Environments workshop.

Cost management

It’s important to be able to track expenses during the lifecycle of a project. To demonstrate this capability, the workshop uses cost tags to track all resources associated with any given project. The cost tags used in this workshop tag resources like SageMaker training jobs, VPCs, and S3 buckets with the project name and the environment type (development, testing, production). You can use these tags to identify a project’s costs across services and environments to ensure that teams are accountable for their consumption. You can also use SageMaker Processing to offload feature engineering tasks and SageMaker Training jobs to train models at scale, which lets you run lightweight notebooks and save further on costs. As we show in this workshop, admins can enforce this directly by allowing data scientists to create notebooks only through AWS Service Catalog and only with approved instance types.
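
The tagging scheme described above can be sketched as a small helper that produces a tag list in the Key/Value format accepted by SageMaker and other AWS APIs. The tag keys "project" and "environment" and the project name are hypothetical; the workshop may use different key names.

```python
ENVIRONMENTS = ("development", "testing", "production")


def cost_tags(project, environment):
    """Return cost allocation tags for a resource in the given project."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"environment must be one of {ENVIRONMENTS}")
    return [
        {"Key": "project", "Value": project},
        {"Key": "environment", "Value": environment},
    ]


# Hypothetical project name.
tags = cost_tags("example-project", "development")
print(tags)
# The same list can be passed as the Tags argument when creating
# resources, for example in a CreateTrainingJob request.
```

Validating the environment value in one place keeps tag values consistent, which matters because cost reports group by exact tag value.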


In this series of workshops, we have implemented a number of features and best practices that cover the most common pain points that CTO teams face when provisioning and using secure environments for ML. For a detailed discussion on ML governance as it applies to regulated industries such as financial services, see Machine Learning Best Practices in Financial Services. Additionally, you may want to look at the AWS Well-Architected guidelines as they apply to machine learning and financial services, respectively. Feel free to connect with the authors and don’t hesitate to reach out to your AWS account teams if you wish to run these hands-on labs.

About the Authors

Jason Barto works as a Principal Solutions Architect with AWS. Jason supports customers to accelerate and optimize their business by leveraging cloud services. Jason has 20 years of professional experience developing systems for use in secure, sensitive environments. He has led teams of developers and worked as a systems architect to develop petabyte scale analytics platforms, real-time complex event processing systems, and cyber-defense monitoring systems. Today he is working with financial services customers to implement secure, resilient, and self-healing data and analytics systems using open-source technologies and AWS services.

Stefan Natu is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.

