Run ensemble ML models on Amazon SageMaker

Model deployment in machine learning (ML) is becoming increasingly complex. You want to deploy not just one ML model but large groups of ML models represented as ensemble workflows. These workflows are comprised of multiple ML models. Productionizing these ML models is challenging because you need to adhere to various performance and latency requirements.

Amazon SageMaker supports single-instance ensembles with Triton Inference Server. This capability allows you to run model ensembles that fit on a single instance. Behind the scenes, SageMaker leverage Triton Inference Server to manage the ensemble on every instance behind the endpoint to maximize throughput and hardware utilization with ultra-low (single-digit milliseconds) inference latency. With Triton, you can also choose from a wide range of supported ML frameworks (including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT) and infrastructure backends, including GPUs, CPUs, and AWS Inferentia.

With this capability on SageMaker, you can optimize your workloads by avoiding costly network latency and reaping the benefits of compute and data locality for ensemble inference pipelines. In this post, we discuss the benefits of using Triton Inference Server along with considerations on if this is the right option for your workload.

Solution overview

Triton Inference Server is designed to enable teams to deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure. In addition, it has been optimized to offer high-performance inference at scale with features like dynamic batching, concurrent runs, optimal model configuration, model ensemble capabilities, and support for streaming inputs.

Workloads should take into account the capabilities that Triton provides to ensure their models can be served. Triton supports a number of popular frameworks out of the box, including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT. Triton also supports various backends that are required for algorithms to run properly. You should ensure that your models are supported by these backends and in the event that a backend does not, Triton allows you to implement your own and integrate it. You should also verify that your algorithm version is supported as well as ensure that the model artifacts are acceptable by the corresponding backend. To check if your particular algorithm is supported, refer to Triton Inference Server Backend for a list of natively supported backends maintained by NVIDIA.

There may be some scenarios where your models or model ensembles won’t work on Triton without requiring more effort, such as if a natively supported backend doesn’t exist for your algorithm. There are some other considerations to take into account, such as the payload format may not be ideal, especially when your payload size may be large for your request. As always, you should validate your performance after deploying these workloads to ensure that your expectations are met.

Let’s take an image classification neural network model and see how we can accelerate our workloads. In this example, we use the NVIDIA DALI backend to accelerate our preprocessing in the context of our ensemble.

Create Triton model ensembles

Triton Inference Server simplifies the deployment of AI models at scale. Triton Inference Server comes with a convenient solution that simplifies building preprocessing and postprocessing pipelines. The Triton Inference Server platform provides the ensemble scheduler, which you can use to build pipelining ensemble models participating in the inference process while ensuring efficiency and optimizing throughput.

NVIDIA Triton Ensemble

Triton Inference Server serves models from model repositories. Let’s look at the model repository layout for ensemble model containing the DALI preprocessing model, the TensorFlow inception V3 model, and the model ensemble configuration. Each subdirectory contains the repository information for the corresponding models. The config.pbtxt file describes the model configuration for the models. Each directory must have one numeric sub-folder for each version of the model and it’s run by a specific backend that Triton supports.

NVIDIA Triton Model Repository

NVIDIA DALI

For this post, we use the NVIDIA Data Loading Library (DALI) as the preprocessing model in our model ensemble. NVIDIA DALI is a library for data loading and preprocessing to accelerate deep learning applications. It provides a collection of optimized building blocks for loading and processing image, video, and audio data. You can use it as a portable drop-in replacement for built-in data loaders and data iterators in popular deep learning frameworks.

NVIDIA Dali

The following code shows the model configuration for a DALI backend:

name: "dali"
backend: "dali"
max_batch_size: 256
input [
  {
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 299, 299, 3 ]
  }
]
parameters: [
  {
    key: "num_threads"
    value: { string_value: "12" }
  }
]

Inception V3 model

For this post, we show how DALI is used in a model ensemble with Inception V3. The Inception V3 TensorFlow pre-trained model is saved in GraphDef format as a single file named model.graphdef. The config.pbtxt file has information about the model name, platform, max_batch_size, and input and output contracts. We recommend setting the max_batch_size configuration to less than the inception V3 model batch size. The label file has class labels for 1,000 different classes. We copy the inception classification model labels to the inception_graphdef directory in the model repository. The labels file contains 1,000 class labels of the ImageNet classification dataset.

name: "inception_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 256
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
    label_filename: "inception_labels.txt"
  }
]

Triton ensemble

The following code shows a model configuration of an ensemble model for DALI preprocessing and image classification:

name: "ensemble_dali_inception"
platform: "ensemble"
max_batch_size: 256
input [
  {
    name: "INPUT"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 1001 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali"
      model_version: -1
      input_map {
        key: "DALI_INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "inception_graphdef"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "InceptionV3/Predictions/Softmax"
        value: "OUTPUT"
      }
    }
  ]
}

Create a SageMaker endpoint

SageMaker endpoints allow for real-time hosting where millisecond response time is required. SageMaker takes on the undifferentiated heavy lifting of model hosting management and has the ability to auto scale. In addition, a number of capabilities are also provided, including hosting multiple variants of your model, A/B testing of your models, integration with Amazon CloudWatch to gain observability of model performance, and monitoring for model drift.

Let’s create a SageMaker model from the model artifacts we uploaded to Amazon Simple Storage Service (Amazon S3).

Next, we also provide an additional environment variable: SAGEMAKER_TRITON_DEFAULT_MODEL_NAME, which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to Amazon S3. This variable is optional in cases where you’re using a single model. In the case of ensemble models, this key must be specified for Triton to start up in SageMaker.

Additionally, you can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT and SAGEMAKER_TRITON_THREAD_COUNT for optimizing the thread counts.

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble_dali_inception"},
}
create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

With the preceding model, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint:

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)
endpoint_config_arn = create_endpoint_config_response["EndpointConfigArn"]

We use this endpoint configuration to create a new SageMaker endpoint and wait for the deployment to finish. The status changes to InService when the deployment is successful.

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
endpoint_arn = create_endpoint_response["EndpointArn"]

Inference payload

The input payload image goes through the preprocessing DALI pipeline and is used in the ensemble scheduler provided by Triton Inference Server. We construct the payload to be passed to the inference endpoint:

payload = {
    "inputs": [
        {
            "name": "INPUT",
            "shape": rv2.shape,
            "datatype": "UINT8",
            "data": rv2.tolist(),
        }
    ]
}

Ensemble inference

When we have the endpoint running, we can use the sample image to perform an inference request using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols.

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload)
)
print(json.loads(response["Body"].read().decode("utf8")))

With the binary+json format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header application/vnd.sagemaker-triton.binary+json;json-header-size={}.

This is different from using an Inference-Header-Content-Length header on a standalone Triton server because custom headers aren’t allowed in SageMaker.

The tritonclient package provides utility methods to generate the payload without having to know the details of the specification. We use the following methods to convert our inference request into a binary format, which provides lower latencies for inference. Refer to the GitHub notebook for implementation details.

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
)

Conclusion

In this post, we showcased how you can productionize model ensembles that run on a single instance on SageMaker. This design pattern can be useful for combining any preprocessing and postprocessing logic along with inference predictions. SageMaker uses Triton to run the ensemble inference on a single container on an instance that supports all major frameworks.

For more samples on Triton ensembles on SageMaker, refer the GitHub repo. Try it out!


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

Vikram Elango is a Senior AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, US. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization, and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Read More

Host code-server on Amazon SageMaker

Machine learning (ML) teams need the flexibility to choose their integrated development environment (IDE) when working on a project. It allows you to have a productive developer experience and innovate at speed. You may even use multiple IDEs within a project. Amazon SageMaker lets ML teams choose to work from fully managed, cloud-based environments within Amazon SageMaker Studio, SageMaker Notebook Instances, or from your local machine using local mode.

SageMaker provides a one-click experience to Jupyter and RStudio to build, train, debug, deploy, and monitor ML models. In this post, we will also share a solution for hosting code-server on SageMaker.

With code-server, users can run VS Code on remote machines and access it in a web browser. For ML teams, hosting code-server on SageMaker provides minimal changes to a local development experience, and allows you to code from anywhere, on scalable cloud compute. With VS Code, you can also use built-in Conda environments with AWS-optimized TensorFlow and PyTorch, managed Git repositories, local mode, and other features provided by SageMaker to speed up your delivery. For IT admins, it allows you to standardize and expedite the provisioning of managed, secure IDEs in the cloud, to quickly onboard and enable ML teams in their projects.

Solution overview

In this post, we cover installation for both Studio environments (Option A), and notebook instances (Option B). For each option, we walk through a manual installation process that ML teams can run in their environment, and an automated installation that IT admins can set up for them via the AWS Command Line Interface (AWS CLI).

The following diagram illustrates the architecture overview for hosting code-server on SageMaker.

ml-10244-architecture-overview

Our solution speeds up the install and setup of code-server in your environment. It works for both JupyterLab 3 (recommended) and JupyterLab 1 that run within Studio and SageMaker notebook instances. It is made of shell scripts that do the following based on the option.

For Studio (Option A), the shell script does the following:

For SageMaker notebook instances (Option B), the shell script does the following:

  • Installs code-server.
  • Adds a code-server shortcut on the Jupyter notebook file menu and JupyterLab launcher for fast access to the IDE.
  • Creates a dedicated Conda environment for managing dependencies.
  • Installs the Python and Docker extensions on the IDE.

In the following sections, we walk through the solution install process for Option A and Option B. Make sure you have access to Studio or a notebook instance.

Option A: Host code-server on Studio

To host code-server on Studio, complete the following steps:

  1. Choose System terminal in your Studio launcher.
    ml-10244-studio-terminal-click
  2. To install the code-server solution, run the following commands in your system terminal:
    curl -LO https://github.com/aws-samples/amazon-sagemaker-codeserver/releases/download/v0.1.5/amazon-sagemaker-codeserver-0.1.5.tar.gz
    tar -xvzf amazon-sagemaker-codeserver-0.1.5.tar.gz
    
    cd amazon-sagemaker-codeserver/install-scripts/studio
     
    chmod +x install-codeserver.sh
    ./install-codeserver.sh
    
    # Note: when installing on JL1, please prepend the nohup command to the install command above and run as follows: 
    # nohup ./install-codeserver.sh

    The commands should take a few seconds to complete.

  3. Reload the browser page, where you can see a Code Server button in your Studio launcher.
    ml-10244-code-server-button
  4. Choose Code Server to open a new browser tab, allowing you to access code-server from your browser.
    The Python extension is already installed, and you can get to work in your ML project.ml-10244-vscode

You can open your project folder in VS Code and select the pre-built Conda environment to run your Python scripts.

ml-10244-vscode-conda

Automate the code-server install for users in a Studio domain

As an IT admin, you can automate the installation for Studio users by using a lifecycle configuration. It can be done for all users’ profiles under a Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.

For this post, we create a lifecycle configuration from the install-codeserver script, and attach it to an existing Studio domain. The install is done for all the user profiles in the domain.

From a terminal configured with the AWS CLI and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-codeserver/releases/download/v0.1.5/amazon-sagemaker-codeserver-0.1.5.tar.gz
tar -xvzf amazon-sagemaker-codeserver-0.1.5.tar.gz

cd amazon-sagemaker-codeserver/install-scripts/studio

LCC_CONTENT=`openssl base64 -A -in install-codeserver.sh`

aws sagemaker create-studio-lifecycle-config 
    --studio-lifecycle-config-name install-codeserver-on-jupyterserver 
    --studio-lifecycle-config-content $LCC_CONTENT 
    --studio-lifecycle-config-app-type JupyterServer 
    --query 'StudioLifecycleConfigArn'

aws sagemaker update-domain 
    --region <your_region> 
    --domain-id <your_domain_id> 
    --default-user-settings 
    '{
    "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
    "LifecycleConfigArn": "arn:aws:sagemaker:<your_region>:<your_account_id>:studio-lifecycle-config/install-codeserver-on-jupyterserver",
    "InstanceType": "system"
    },
    "LifecycleConfigArns": [
    "arn:aws:sagemaker:<your_region>:<your_account_id>:studio-lifecycle-config/install-codeserver-on-jupyterserver"
    ]
    }}'

# Make sure to replace <your_domain_id>, <your_region> and <your_account_id> in the previous commands with
# the Studio domain ID, the AWS region and AWS Account ID you are using respectively.

After Jupyter Server restarts, the Code Server button appears in your Studio launcher.

Option B: Host code-server on a SageMaker notebook instance

To host code-server on a SageMaker notebook instance, complete the following steps:

  1. Launch a terminal via Jupyter or JupyterLab for your notebook instance.
    If you use Jupyter, choose Terminal on the New menu.
  2.  To install the code-server solution, run the following commands in your terminal:
    curl -LO https://github.com/aws-samples/amazon-sagemaker-codeserver/releases/download/v0.1.5/amazon-sagemaker-codeserver-0.1.5.tar.gz
    tar -xvzf amazon-sagemaker-codeserver-0.1.5.tar.gz
    
    cd amazon-sagemaker-codeserver/install-scripts/notebook-instances
     
    chmod +x install-codeserver.sh
    chmod +x setup-codeserver.sh
    sudo ./install-codeserver.sh
    sudo ./setup-codeserver.sh

    The code-server and extensions installations are persistent on the notebook instance. However, if you stop or restart the instance, you need to run the following command to reconfigure code-server:

    sudo ./setup-codeserver.sh

    The commands should take a few seconds to run. You can close the terminal tab when you see the following.

    ml-10244-terminal-output

  3. Now reload the Jupyter page and check the New menu again.
    The Code Server option should now be available.

You can also launch code-server from JupyterLab using a dedicated launcher button, as shown in the following screenshot.

ml-10244-jupyterlab-code-server-button

Choosing Code Server will open a new browser tab, allowing you to access code-server from your browser. The Python and Docker extensions are already installed, and you can get to work in your ML project.

ml-10244-notebook-vscode

Automate the code-server install on a notebook instance

As an IT admin, you can automate the code-server install with a lifecycle configuration running on instance creation, and automate the setup with one running on instance start.

Here, we create an example notebook instance and lifecycle configuration using the AWS CLI. The on-create config runs install-codeserver, and on-start runs setup-codeserver.

From a terminal configured with the AWS CLI and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-codeserver/releases/download/v0.1.5/amazon-sagemaker-codeserver-0.1.5.tar.gz
tar -xvzf amazon-sagemaker-codeserver-0.1.5.tar.gz

cd amazon-sagemaker-codeserver/install-scripts/notebook-instances

aws sagemaker create-notebook-instance-lifecycle-config 
    --notebook-instance-lifecycle-config-name install-codeserver 
    --on-start Content=$((cat setup-codeserver.sh || echo "")| base64) 
    --on-create Content=$((cat install-codeserver.sh || echo "")| base64)

aws sagemaker create-notebook-instance 
    --notebook-instance-name <your_notebook_instance_name> 
    --instance-type <your_instance_type> 
    --role-arn <your_role_arn> 
    --lifecycle-config-name install-codeserver

# Make sure to replace <your_notebook_instance_name>, <your_instance_type>,
# and <your_role_arn> in the previous commands with the appropriate values.

The code-server install is now automated for the notebook instance.

Conclusion

With code-server hosted on SageMaker, ML teams can run VS Code on scalable cloud compute, code from anywhere, and speed up their ML project delivery. For IT admins, it allows them to standardize and expedite the provisioning of managed, secure IDEs in the cloud, to quickly onboard and enable ML teams in their projects.

In this post, we shared a solution you can use to quickly install code-server on both Studio and notebook instances. We shared a manual installation process that ML teams can run on their own, and an automated installation that IT admins can set up for them.

To go further in your learnings, visit AWSome SageMaker on GitHub to find all the relevant and up-to-date resources needed for working with SageMaker.


About the Authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years software engineering an ML background, he works with customers of any size to deeply understand their business and technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, Computer Vision, NLP, and involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Sofian Hamiti is an AI/ML specialist Solutions Architect at AWS. He helps customers across industries accelerate their AI/ML journey by helping them build and operationalize end-to-end machine learning solutions.

Eric Pena is a Senior Technical Product Manager in the AWS Artificial Intelligence Platforms team, working on Amazon SageMaker Interactive Machine Learning. He currently focuses on IDE integrations on SageMaker Studio . He holds an MBA degree from MIT Sloan and outside of work enjoys playing basketball and football.

Read More

Real estate brokerage firm John L. Scott uses Amazon Textract to strike racially restrictive language from property deeds for homeowners

Founded more than 91 years ago in Seattle, John L. Scott Real Estate’s core value is Living Life as a Contribution®. The firm helps homebuyers find and buy the home of their dreams, while also helping sellers move into the next chapter of their home ownership journey. John L. Scott currently operates over 100 offices with more than 3,000 agents throughout Washington, Oregon, Idaho, and California.

When company operating officer Phil McBride joined the company in 2007, one of his initial challenges was to shift the company’s public website from an on-premises environment to a cloud-hosted one. According to McBride, a world of resources opened up to John L. Scott once the company started working with AWS to build an easily controlled, cloud-enabled environment.

Today, McBride is taking on the challenge of uncovering and modifying decades-old discriminatory restrictions in home titles and deeds. What he didn’t expect was enlisting the help of AWS for the undertaking.

In this post, we share how John L. Scott uses Amazon Textract and Amazon Comprehend to identify racially restrictive language from such documents.

A problem rooted in historic discrimination

Racial covenants restrict who can buy, sell, lease, or occupy a property based on race (see the following example document). Although no longer enforceable since the Fair Housing Act of 1968, racial covenants became pervasive across the country during the post-World War II housing boom and are still present in the titles of millions of homes. Racial covenants are direct evidence of the real estate industry’s complicity and complacency when it came to the government’s racist policies of the past, including redlining.

In 2019, McBride spoke in support of Washington state legislation that served as the next step in correcting the historic injustice of racial language in covenants. In 2021, a bill was passed that required real estate agents to provide notice of any unlawful recorded covenant or deed restriction to purchasers at the time of sale. A year after the legislation passed and homeowners were notified, John L. Scott discovered that only five homeowners in the state of Washington acted on updating their own property deeds.

“The challenge lies in the sheer volume of properties in the state of Washington, and the current system to update your deeds,” McBride said. “The process to update still is very complicated, so only the most motivated homeowners would put in the research and legwork to modify their deed. This just wasn’t going to happen at scale.”

Initial efforts to find restrictive language have found university students and community volunteers manually reading documents and recording findings. But in Washington state alone, millions of documents needed to be analyzed. A manual approach wouldn’t scale effectively.

Machine learning overcomes manual and complicated processes

With the support of AWS Global Impact Computing Specialists and Solutions Architects, John L. Scott has built an intelligent document processing solution that helps homeowners easily identify racially restrictive covenants in their property title documents. This intelligent document processing solution uses machine learning to scan titles, deeds, and other property documents, searching the text for racially restrictive language. The Washington State Association of County Auditors is also working with John L. Scott to provide digitized deeds, titles, and CC&Rs from their database, starting with King County, Washington.

Once these racial covenants are identified, John L. Scott team members guide homeowners through the process of modifying the discriminatory restrictions from their home’s title, with the support of online notary services such as Notarize.

With a goal of building a solution that the lean team at John L. Scott could manage, McBride’s team worked with AWS to evaluate different services and stitch them together in a modular, repeatable way that met the team’s vision and principles for speed and scale. To minimize management overhead and maximize scalability, the team worked together to build a serverless architecture for handling document ingestion and restrictive language identification using several key AWS services:

  • Amazon Simple Storage Service – Documents are stored in an Amazon S3 data lake for secure and highly available storage.
  • AWS Lambda – Documents are processed by Lambda as they arrive in the S3 data lake. Original document images are split into single-page files and analyzed with Amazon Textract (text detection) and Amazon Comprehend (text analysis).
  • Amazon Textract – Amazon Textract automatically converts raw images into text blocks, which are scanned using fuzzy string pattern matching for restrictive language. When restrictive language is identified, Lambda functions create new image files that highlight the language using the coordinates supplied by Amazon Textract. Finally, records of the restrictive findings are stored in an Amazon DynamoDB table.
  • Amazon Comprehend – Amazon Comprehend analyzes the text output from Amazon Textract and identifies useful data (entities) like dates and locations within the text. This information is key to identifying where and when restrictions were in effect.

The following diagram illustrates the architecture of the serverless ingestion and identification pipeline.

Building from this foundation, the team also incorporates parcel information (via GeoJSON and shapefiles) from county governments to identify affected property owners so they can be notified and begin the process of remediation. A forthcoming public website will also soon allow property owners to input their address to see if their property is affected by restrictive documents.

Setting a new example for the 21st Century

When asked about what’s next, McBride said working with Amazon Textract and Amazon Comprehend has helped his team serve as an example to other counties and real estate firms across the country who want to bring the project into their geographic area.

“Not all areas will have robust programs like we do in Washington state, with University of Washington volunteers indexing deeds and notifying the homeowners,” McBride said. “However, we hope offering this intelligent document processing solution in the public domain will help others drive change in their local communities.”

Learn more


About the authors

Jeff Stockamp is a Senior Solutions Architect based in Seattle, Washington. Jeff helps guide customers as they build well architected-applications and migrate workloads to AWS. Jeff is a constant builder and spends his spare time building Legos with his son.

Jarman Hauser is a Business Development and Go-to-Market Strategy leader at AWS. He works with customers on leveraging technology in unique ways to solve some of the worlds most challenging social, environmental, and economic challenges globally.

Moussa Koulbou is a Senior Solutions Architecture leader at AWS. He helps customers shape their cloud strategy and accelerate their digital velocity by creating the connection between intent and action. He leads a high-performing Solutions Architects team to deliver enterprise-grade solutions that leverage AWS cutting-edge technology to enable growth and solve the most critical business and social problems.

Read More

Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints

Amazon SageMaker multi-model endpoint (MME) enables you to cost-effectively deploy and host multiple models in a single endpoint and then horizontally scale the endpoint to achieve scale. As illustrated in the following figure, this is an effective technique to implement multi-tenancy of models within your machine learning (ML) infrastructure. We have seen software as a service (SaaS) businesses use this feature to apply hyper-personalization in their ML models while achieving lower costs.

For a high-level overview of how MME work, check out the AWS Summit video Scaling ML to the next level: Hosting thousands of models on SageMaker. To learn more about the hyper-personalized, multi-tenant use cases that MME enables, refer to How to scale machine learning inference for multi-tenant SaaS use cases.

Multi model endpoint architecture

In the rest of this post, we dive deeper into the technical architecture of SageMaker MME and share best practices for optimizing your multi-model endpoints.

Use cases best suited for MME

SageMaker multi-model endpoints are well suited for hosting a large number of models that you can serve through a shared serving container and you don’t need to access all the models at the same time. Depending on the size of the endpoint instance memory, a model may occasionally be unloaded from memory in favor of loading a new model to maximize efficient use of memory, therefore your application needs to be tolerant of occasional latency spikes on unloaded models.

MME is also designed for co-hosting models that use the same ML framework because they use the shared container to load multiple models. Therefore, if you have a mix of ML frameworks in your model fleet (such as PyTorch and TensorFlow), SageMaker dedicated endpoints or multi-container hosting is a better choice.

Finally, MME is suited for applications that can tolerate an occasional cold start latency penalty, because models are loaded on first invocation and infrequently used models can be offloaded from memory in favor of loading new models. Therefore, if you have a mix of frequently and infrequently accessed models, a multi-model endpoint can efficiently serve this traffic with fewer resources and higher cost savings.

We have also seen some scenarios where customers deploy an MME cluster with enough aggregate memory capacity to fit all their models, thereby avoiding model offloads altogether yet still achieving cost savings because of the shared inference infrastructure.

Model serving containers

When you use the SageMaker Inference Toolkit or a pre-built SageMaker model serving container compatible with MME, your container has the Multi Model Server (JVM process) running. The easiest way to have Multi Model Server (MMS) incorporated into your model serving container is to use SageMaker model serving containers compatible with MME (look for those with Job Type=inference and CPU/GPU=CPU). MMS is an open source, easy-to-use tool for serving deep learning models. It provides a REST API with a web server to serve and manage multiple models on a single host. However, it’s not mandatory to use MMS; you could implement your own model server as long as it implements the APIs required by MME.

When used as part of the MME platform, all predict, load, and unload API calls to MMS or your own model server are channeled through the MME data plane controller. API calls from the data plane controller are made over local host only to prevent unauthorized access from outside of the instance. One of the key benefits of MMS is that it enables a standardized interface for loading, unloading, and invoking models with compatibility across a wide range of deep learning frameworks.

Advanced configuration of MMS

If you choose to use MMS for model serving, consider the following advanced configurations to optimize the scalability and throughput of your MME instances.

Increase inference parallelism per model

MMS creates one or more Python worker processes per model based on the value of the default_workers_per_model configuration parameter. These Python workers handle each individual inference request by running any preprocessing, prediction, and post processing functions you provide. For more information, see the custom service handler GitHub repo.

Having more than one model worker increases the parallelism of predictions that can be served by a given model. However, when a large number of models are being hosted on an instance with a large number of CPUs, you should perform a load test of your MME to find the optimum value for default_workers_per_model to prevent any memory or CPU resource exhaustion.

Design for traffic spikes

Each MMS process within an endpoint instance has a request queue that can be configured with the job_queue_size parameter (default is 100). This determines the number of requests MMS will queue when all worker processes are busy. Use this parameter to fine-tune the responsiveness of your endpoint instances after you’ve decided on the optimal number of workers per model.

In an optimal worker per model ratio, the default of 100 should suffice for most cases. However, for those cases where request traffic to the endpoint spikes unusually, you can reduce the size of the queue if you want the endpoint to fail fast to pass control to the application or increase the queue size if you want the endpoint to absorb the spike.

Maximize memory resources per instance

When using multiple worker processes per model, by default each worker process loads its own copy of the model. This can reduce the available instance memory for other models. You can optimize memory utilization by sharing a single model between worker processes by setting the configuration parameter preload_model=true. Here you’re trading off reduced inference parallelism (due to a single model instance) with more memory efficiency. This setting along with multiple worker processes can be a good choice for use cases where model latency is low but you have heavier preprocessing and postprocessing (done by the worker processes) per inference request.

Set values for MMS advanced configurations

MMS uses a config.properties file to store configurations. MMS uses the following order to locate this config.properties file:

  1. If the MMS_CONFIG_FILE environment variable is set, MMS loads the configuration from the environment variable.
  2. If the --mms-config parameter is passed to MMS, it loads the configuration from the parameter.
  3. If there is a config.properties in current folder where the user starts MMS, it loads the config.properties file from the current working directory.

If none of the above are specified, MMS loads the built-in configuration with default values.

The following is a command line example of starting MMS with an explicit configuration file:

multi-model-server --start --mms-config /home/mms/config.properties

Key metrics to monitor your endpoint performance

The key metrics that can help you optimize your MME are typically related to CPU and memory utilization and inference latency. The instance-level metrics are emitted by MMS, whereas the latency metrics come from the MME. In this section, we discuss the typical metrics that you can use to understand and optimize your MME.

Endpoint instance-level metrics (MMS metrics)

From the list of MMS metrics, CPUUtilization and MemoryUtilization can help you evaluate whether or not your instance or the MME cluster is right-sized. If both metrics have percentages between 50–80%, then your MME is right-sized.

Typically, low CPUUtilization and high MemoryUtilization is an indication of an over-provisioned MME cluster because it indicates that infrequently invoked models aren’t being unloaded. This could be because of a higher-than-optimal number of endpoint instances provisioned for the MME and therefore higher-than-optimal aggregate memory is available for infrequently accessed models to remain in memory. Conversely, close to 100% utilization of these metrics means that your cluster is under-provisioned, so you need to adjust your cluster auto scaling policy.

Platform-level metrics (MME metrics)

From the full list of MME metrics, a key metric that can help you understand the latency of your inference request is ModelCacheHit. This metric shows the average ratio of invoke requests for which the model was already loaded in memory. If this ratio is low, it indicates your MME cluster is under-provisioned because there’s likely not enough aggregate memory capacity in the MME cluster for the number of unique model invocations, therefore causing models to be frequently unloaded from memory.

Lessons from the field and strategies for optimizing MME

We have seen the following recommendations from some of the high-scale uses of MME across a number of customers.

Horizontal scaling with smaller instances is better than vertical scaling with larger instances

You may experience throttling on model invocations when running high requests per second (RPS) on fewer endpoint instances. There are internal limits to the number of invocations per second (loads and unloads that can happen concurrently on an instance), and therefore it’s always better to have a higher number of smaller instances. Running a higher number of smaller instances means a higher total aggregate capacity of these limits for the endpoint.

Another benefit of horizontally scaling with smaller instances is that you reduce the risk of exhausting instance CPU and memory resources when running MMS with higher levels of parallelism, along with a higher number of models in memory (as described earlier in this post).

Avoiding thrashing is a shared responsibility

Thrashing in MME is when models are frequently unloaded from memory and reloaded due to insufficient memory, either in an individual instance or on aggregate in the cluster.

From a usage perspective, you should right-size individual endpoint instances and right-size the overall size of the MME cluster to ensure enough memory capacity is available per instance and also on aggregate for the cluster for your use case. The MME platform’s router fleet will also maximize the cache hit.

Don’t be aggressive with bin packing too many models on fewer, larger memory instances

Memory isn’t the only resource on the instance to be aware of. Other resources like CPU can be a constraining factor, as seen in the following load test results. In some other cases, we have also observed other kernel resources like process IDs being exhausted on an instance, due to a combination of too many models being loaded and the underlying ML framework (such as TensorFlow) spawning threads per model that were multiples of available vCPUs.

The following performance test demonstrates an example of CPU constraint impacting model latency. In this test, a single instance endpoint with a large instance, while having more than enough memory to keep all four models in memory, produced comparatively worse model latencies under load when compared to an endpoint with four smaller instances.

single instance endpoint model latency

single instance endpoint CPU & memory utilization

four instance endpoint model latency

four instance endpoint CPU & memory utilization

To achieve both performance and cost-efficiency, right-size your MME cluster with higher number of smaller instances that on aggregate give you the optimum memory and CPU capacity while being relatively at par for cost with fewer but larger memory instances.

Mental model for optimizing MME

There are four key metrics that you should always consider when right-sizing your MME:

  • The number and size of the models
  • The number of unique models invoked at a given time
  • The instance type and size
  • The instance count behind the endpoint

Start with the first two points, because they inform the third and fourth. For example, if not enough instances are behind the endpoint for the number or size of unique models you have, the aggregate memory for the endpoint will be low and you’ll see a lower cache hit ratio and thrashing at the endpoint level because the MME will load and unload models in and out of memory frequently.

Similarly, if the invocations for unique models are higher than the aggregate memory of all instances behind the endpoint, you’ll see a lower cache hit. This can also happen if the size of instances (especially memory capacity) is too small.

Vertically scaling with really large memory instances could also lead to issues because although the models may fit into memory, other resources like CPU and kernel processes and thread limits could be exhausted. Load test horizontal scaling in pre-production to get the optimum number and size of instances for your MME.

Summary

In this post, you got a deeper understanding of the MME platform. You learned which technical use cases MME is suited for and reviewed the architecture of the MME platform. You gained a deeper understanding of the role of each component within the MME architecture and which components you can directly influence the performance of. Finally, you had a deeper look at the configuration parameters that you can adjust to optimize MME for your use case and the metrics you should monitor to maintain optimum performance.

To get started with MME, review Amazon SageMaker Multi-Model Endpoints using XGBoost and Host multiple models in one container behind one endpoint.


About the Author

Syed Jaffry is a Principal Solutions Architect with AWS. He works with a range of companies from mid-sized organizations, large enterprises, financial services and ISVs to help them build and operate cost efficient and scalable AI/ML applications in the cloud.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Read More

Testing approaches for Amazon SageMaker ML models

This post was co-written with Tobias Wenzel, Software Engineering Manager for the Intuit Machine Learning Platform.

We all appreciate the importance of a high-quality and reliable machine learning (ML) model when using autonomous driving or interacting with Alexa, for examples. ML models also play an important role in less obvious ways—they’re used by business applications, healthcare, financial institutions, amazon.com, TurboTax, and more.

As ML-enabled applications become core to many businesses, models need to follow the same vigor and discipline as software applications. An important aspect of MLOps is to deliver a new version of the previously developed ML model in production by using established DevOps practices such as testing, versioning, continuous delivery, and monitoring.

There are several prescriptive guidelines around MLOps, and this post gives an overview of the process that you can follow and which tools to use for testing. This is based on collaborations between Intuit and AWS. We have been working together to implement the recommendations explained in this post in practice and at scale. Intuit’s goal of becoming an AI-driven expert platform is heavily dependent on a strategy of increasing velocity of initial model development as well as testing of new versions.

Requirements

The following are the main areas of consideration while deploying new model versions:

  1. Model accuracy performance – It’s important to keep track of model evaluation metrics like accuracy, precision, and recall, and ensure that the objective metrics remain relatively the same or improve with a new version of the model. In most cases, deploying a new version of the model doesn’t make sense if the experience of end-users won’t improve.
  2. Test data quality – Data in non-production environments, whether simulated or point-in-time copy, should be representative of the data that the model will receive when fully deployed, in terms of volume or distribution. If not, your testing processes won’t be representative, and your model may behave differently in production.
  3. Feature importance and parity – Feature importance in the newer version of the model should relatively compare to the older model, though there might be new features introduced. This is to ensure that the model isn’t becoming biased.
  4. Business process testing – It’s important that a new version of a model can fulfill your required business objectives within acceptable parameters. For example, one of the business metrics can be that the end-to-end latency for any service must not be more than 100 milliseconds, or the cost to host and retrain a particular model can’t be more than $10,000 per year.
  5. Cost – A simple approach to testing is to replicate the whole production environment as a test environment. This is a common practice in software development. However, such an approach in the case of ML models might not yield the right ROI depending upon the size of data and may impact the model in terms of the business problem it’s addressing.
  6. Security – Test environments are often expected to have sample data instead of real customer data and as a result, data handling and compliance rules can be less strict. Just like cost though, if you simply duplicate the production environment into a test environment, you could introduce security and compliance risks.
  7. Feature store scalability – If an organization decides to not create a separate test feature store because of cost or security reasons, then model testing needs to happen on the production feature store, which can cause scalability issues as traffic is doubled during the testing period.
  8. Online model performance – Online evaluations differ from offline evaluations and can be important in some cases like recommendation models because they measure user satisfaction in real time rather than perceived satisfaction. It’s hard to simulate real traffic patterns in non-production due to seasonality or other user behavior, so online model performance can only be done in production.
  9. Operational performance – As models get bigger and are increasingly deployed in a decentralized manner on different hardware, it’s important to test the model for your desired operational performance like latency, error rate, and more.

Most ML teams have a multi-pronged approach to model testing. In the following sections, we provide ways to address these challenges during various testing stages.

Offline model testing

The goal of this testing phase is to validate new versions of an existing model from an accuracy standpoint. This should be done in an offline fashion to not impact any predictions in the production system that are serving real-time predictions. By ensuring that the new model performs better for applicable evaluation metrics, this testing addresses challenge 1 (model accuracy performance). Also, by using the right dataset, this testing can address challenges 2 and 3 (test data quality, feature importance and parity), with the additional benefit of tackling challenge 5 (cost).

This phase is done in the staging environment.

You should capture production traffic, which you can use to replay in offline back testing. It’s preferable to use past production traffic instead of synthetic data. The Amazon SageMaker Model Monitor capture data feature allows you to capture production traffic for models hosted on Amazon SageMaker. This allows model developers to test their models with data from peak business days or other significant events. The captured data is then replayed against the new model version in a batch fashion using Sagemaker batch transform. This means that the batch transform run can tests with data that has been collected over weeks or months in just a few hours. This can significantly speed up the model evaluation process compared to running two or more versions of a real-time model side by side and sending duplicate prediction requests to each endpoint. In addition to finding a better-performing version faster, this approach also uses the compute resources for a shorter amount of time, reducing the overall cost.

A challenge with this approach to testing is that the feature set changes from one model version to another. In this scenario, we recommend creating a feature set with a superset of features for both versions so that all features can be queried at once and recorded through the data capture. Each prediction call can then work on only those features necessary for the current version of the model.

As an added bonus, by integrating Amazon SageMaker Clarify in your offline model testing, you can check the new version of model for bias and also compare feature attribution with the previous version of the model. With pipelines, you can orchestrate the entire workflow such that after training, a quality check step can take place to perform an analysis of the model metrics and feature importance. These metrics are stored in the SageMaker model registry for comparison in the next run of training.

Integration and performance testing

Integration testing is needed to validate end-to-end business processes from a functional as well as a runtime performance perspective. Within this process, the whole pipeline should be tested, including fetching, and calculating features in the feature store and running the ML application. This should be done with a variety of different payloads to cover a variety of scenarios and requests and achieve high coverage for all possible code runs. This addresses challenges 4 and 9 (business process testing and operational performance) to ensure none of the business processes are broken with the new version of the model.

This testing should be done in a staging environment.

Both integration testing and performance testing need to be implemented by individual teams using their MLOps pipeline. For the integration testing, we recommend the tried and tested method of maintaining a functionally equivalent pre-production environment and testing with a few different payloads. The testing workflow can be automated as shown in this workshop. For the performance testing, you can use Amazon SageMaker Inference Recommender, which offers a great starting point to determine which instance type and how many of those instances to use. For this, you’ll need to use a load generator tool, such as the open-source projects perfsizesagemaker and perfsize that Intuit has developed. Perfsizesagemaker allows you to automatically test model endpoint configurations with a variety of payloads, response times, and peak transactions per second requirements. It generates detailed test results that compare different model versions. Perfsize is the companion tool that tries different configurations given only the peak transactions per second and the expected response time.

A/B testing

In many cases where user reaction to the immediate output of the model is required, such as ecommerce applications, offline model functional evaluation isn’t sufficient. In these scenarios, you need to A/B test models in production before making the decision of updating models. A/B testing also has its risks because there could be real customer impact. This testing method serves as the final ML performance validation, a lightweight engineering sanity check. This method also addresses challenges 8 and 9 (online model performance and operational excellence).

A/B testing should be performed in a production environment.

With SageMaker, you can easily perform A/B testing on ML models by running multiple production variants on an endpoint. Traffic can be routed in increments to the new version to reduce the risk that a badly behaving model could have on production. If results of the A/B test look good, traffic is routed to the new version, eventually taking over 100% of traffic. We recommend using deployment guardrails to transition from model A to B. For a more complete discussion on A/B testing using Amazon Personalize models as an example, refer to Using A/B testing to measure the efficacy of recommendations generated by Amazon Personalize.

Online model testing

In this scenario, the new version of a model is significantly different to the one already serving live traffic in production, so the offline testing approach is no longer suitable to determine the efficacy of the new model version. The most prominent reason for this is a change in features required to produce the prediction, so that previously recorded transactions can’t be used to test the model. In this scenario, we recommend using shadow deployments. Shadow deployments offer the capability to deploy a shadow (or challenger) model alongside the production (or champion) model that is currently serving predictions. This lets you evaluate how the shadow model performs in production traffic. The predictions of the shadow model aren’t served to the requesting application; they’re logged for offline evaluation. With the shadow approach for testing, we address challenges 4, 5, 6, and 7 (business process testing, cost, security, and feature store scalability).

Online model testing should be done in staging or production environments.

This method of testing new model versions should be used as a last resort if all the other methods can’t be used. We recommend it as a last resort because duplexing calls to multiple models generates additional load on all downstream services in production, which can lead to performance bottlenecks as well as increased cost in production. The most obvious impact this has is on the feature serving layer. For use cases that share features from a common pool of physical data, we need to be able to simulate multiple use cases concurrently accessing the same data table to ensure no resource contention exists before transitioning to production. Wherever possible, duplicate queries to the feature store should be avoided, and features needed for both versions of the model should be reused for the second inference. Feature stores based on Amazon DynamoDB, as the one Intuit has built, can implement Amazon DynamoDB Accelerator(DAX) to cache and avoid doubling the I/O to the database. These and other caching options can mitigate challenge 7 (feature store scalability).

To address challenge 5 (cost) as well as 7, we propose using shadow deployments to sample the incoming traffic. This gives model owners another layer of control to minimize impact on the production systems.

Shadow deployment should be onboarded to the Model Monitor offerings just like the regular production deployments in order to observe the improvements of the challenger version.

Conclusion

This post illustrates the building blocks to create a comprehensive set of processes and tools to address various challenges with model testing. Although every organization is unique, this should help you get started and narrow down your considerations when implementing your own testing strategy.


About the authors

Tobias Wenzel is a Software Engineering Manager for the Intuit Machine Learning Platform in Mountain View, California. He has been working on the platform since its inception in 2016 and has helped design and build it from the ground up. In his job, he has focused on the operational excellence of the platform and bringing it successfully through Intuit’s seasonal business. In addition, he is passionate about continuously expanding the platform with the latest technologies.

Shivanshu Upadhyay is a Principal Solutions Architect in the AWS Business Development and Strategic Industries group. In this role, he helps most advanced adopters of AWS transform their industry by effectively using data and AI.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Read More

Encode multi-lingual text properties in Amazon Neptune to train predictive models

Amazon Neptune ML is a machine learning (ML) capability of Amazon Neptune that helps you make accurate and fast predictions on your graph data. Under the hood, Neptune ML uses Graph Neural Networks (GNNs) to simultaneously take advantage of graph structure and node/edge properties to solve the task at hand. Traditional methods either only use properties and no graph structure (e.g., XGBoost, Neural Networks), or only graph structure and no properties (e.g., node2vec, Label Propagation). To better manipulate the node/edge properties, ML algorithms require the data to be well behaved numerical data, but raw data in a database can have other types, like raw text. To make use of these other types of data, we need specialized processing steps that convert them from their native type into numerical data, and the quality of the ML results is strongly dependent on the quality of these data transformations. Raw text, like sentences, are among the most difficult types to transform, but recent progress in the field of Natural Language Processing (NLP) has led to strong methods that can handle text coming from multiple languages and a wide variety of lengths.

Beginning with version 1.1.0.0, Neptune ML supports multiple text encoders (text_fasttext, text_sbert, text_word2vec, and text_tfidf), which bring the benefits of recent advances in NLP and enables support for multi-lingual text properties as well as additional inference requirements around languages and text length. For example, in a job recommendation use case, the job posts in different countries can be described in different languages and the length of job descriptions vary considerably. Additionally, Neptune ML supports an auto option that automatically chooses the best encoding method based on the characteristics of the text feature in the data.

In this post, we illustrate the usage of each text encoder, compare their advantages and disadvantages, and show an example of how to choose the right text encoders for a job recommendation task.

What is a text encoder?

The goal of text encoding is to convert the  text-based edge/node properties in Neptune into fixed-size vectors for use in downstream machine learning models for either node classification or link prediction tasks. The length of the text feature can vary a lot. It can be a word, phrase, sentence, paragraph, or even a document with multiple sentences (the maximum size of a single property is 55 MB in Neptune). Additionally, the text features can be in different languages. There may also be sentences that contain words in several different languages, which we define as code-switching.

Beginning with the 1.1.0.0 release, Neptune ML allows you to choose from several different text encoders. Each encoder works slightly differently, but has the same goal of converting a text value field from Neptune into a fixed-size vector that we use to build our GNN model using Neptune ML. The new encoders are as follows:

  • text_fasttext (new) – Uses fastText encoding. FastText is a library for efficient text representation learning. text_fasttext is recommended for features that use one and only one of the five languages that fastText supports (English, Chinese, Hindi, Spanish, and French). The text_fasttext method can optionally take the max_length field, which specifies the maximum number of tokens in a text property value that will be encoded, after which the string is truncated. You can regard a token as a word. This can improve performance when text property values contain long strings, because if max_length is not specified, fastText encodes all the tokens regardless of the string length.
  • text_sbert (new) – Uses the Sentence BERT (SBERT) encoding method. SBERT is a kind of sentence embedding method using the contextual representation learning models, BERT-Networks. text_sbert is recommended when the language is not supported by text_fasttext. Neptune supports two SBERT methods: text_sbert128, which is the default if you just specify text_sbert, and text_sbert512. The difference between them is the maximum number of tokens in a text property that get encoded. The text_sbert128 encoding only encodes the first 128 tokens, whereas text_sbert512 encodes up to 512 tokens. As a result, using text_sbert512 can require more processing time than text_sbert128. Both methods are slower than text_fasttext.
  • text_word2vec – Uses Word2Vec algorithms originally published by Google to encode text. Word2Vec only supports English.
  • text_tfidf – Uses a term frequency-inverse document frequency (TF-IDF) vectorizer for encoding text. TF-IDF encoding supports statistical features that the other encodings do not. It quantifies the importance or relevance of words in one node property among all the other nodes.

Note that text_word2vec and text_tfidf were previously supported and the new methods text_fasttext and text_sbert are recommended over the old methods.

Comparison of different text encoders

The following table shows the detailed comparison of all the supported text encoding options (text_fasttext, text_sbert, and text_word2vec). text_tfidf is not a model-based encoding method, but rather a count-based measure that evaluates how relevant a token (for example, a word) is to the text features in other nodes or edges, so we don’t include text_tfidf for comparison. We recommend using text_tfidf when you want to quantify the importance or relevance of some words in one node or edge property amongst all the other node or edge properties.)

. . text_fasttext text_sbert text_word2vec
Model Capability Supported language English, Chinese, Hindi, Spanish, and French More than 50 languages English
Can encode text properties that contain words in different languages No Yes No
Max-length support No maximum length limit Encodes the text sequence with the maximum length of 128 and 512 No maximum length limit
Time Cost Loading Approximately 10 seconds Approximately 2 seconds Approximately 2 seconds
Inference Fast Slow Medium

Note the following usage tips:

  • For text property values in English, Chinese, Hindi, Spanish, and French, text_fasttext is the recommended encoding. However, it can’t handle cases where the same sentence contains words in more than one language. For other languages than the five that fastText supports, use text_sbert encoding.
  • If you have many property value text strings longer than, for example, 120 tokens, use the max_length field to limit the number of tokens in each string that text_fasttext encodes.

To summarize, depending on your use case, we recommend the following encoding method:

  • If your text properties are in one of the five supported languages, we recommend using text_fasttext due to its fast inference. text_fasttext is the recommended choices and you can also use text_sbert in the following two exceptions.
  • If your text properties are in different languages, we recommend using text_sbert because it’s the only supported method that can encode text properties containing words in several different languages.
  • If your text properties are in one language that isn’t one of the five supported languages, we recommend using text_sbert because it supports more than 50 languages.
  • If the average length of your text properties is longer than 128, consider using text_sbert512 or text_fasttext. Both methods can use encode longer text sequences.
  • If your text properties are in English only, you can use text_word2vec, but we recommend using text_fasttext for its fast inference.

Use case demo: Job recommendation task

The goal of the job recommendation task is to predict what jobs users will apply for based on their previous applications, demographic information, and work history. This post uses an open Kaggle dataset. We construct the dataset as a three-node type graph: job, user, and city.

A job is characterized by its title, description, requirements, located city, and state. A user is described with the properties of major, degree type, number of work history, total number of years for working experience, and more. For this use case, job title, job description, job requirements, and majors are all in the form of text.

In the dataset, users have the following properties:

  • State – For example, CA or 广东省 (Chinese)
  • Major – For example, Human Resources Management or Lic Cytura Fisica (Spanish)
  • DegreeType – For example, Bachelor’s, Master’s, PhD, or None
  • WorkHistoryCount – For example, 0, 1, 16, and so on
  • TotalYearsExperience – For example, 0.0, 10.0, or NAN

Jobs have the following properties:

  • Title – For example, Administrative Assistant or Lic Cultura Física (Spanish).
  • Description – For example, “This Administrative Assistant position is responsible for performing a variety of clerical and administrative support functions in the areas of communications, …” The average number of words in a description is around 192.2.
  • Requirements – For example, “JOB REQUIREMENTS: 1. Attention to detail; 2.Ability to work in a fast paced environment;3.Invoicing…”
  • State: – For example, CA, NY, and so on.

The node type city like Washington DC and Orlando FL only has the identifier for each node. In the following section, we analyze the characteristics of different text features and illustrate how to select the proper text encoders for different text properties.

How to select different text encoders

For our example, the Major and Title properties are in multiple languages and have short text sequences, so text_sbert is recommended. The sample code for the export parameters is as follows. For the text_sbert type, there are no other parameter fields. Here we choose text_sbert128 other than text_sbert512, because the text length is relatively shorter than 128.

"additionalParams": {
    "neptune_ml": {
        "version": "v2.0",
        "targets": [ ... ],
        "features": [
            {
                "node": "user",
                "property": "Major",
                "type": "text_sbert128"
            },
            {
                "node": "job",
                "property": "Title",
                "type": "text_sbert128",
            }, ...
        ], ...
    }
}

The Description and Requirements properties are usually in long text sequences. The average length of a description is around 192 words, which is longer than the maximum input length of text_sbert (128). We can use text_sbert512, but it may result in slower inference. In addition, the text is in a single language (English). Therefore, we recommend text_fasttext with the en language value because of its fast inference and not limited input length. The sample code for the export parameters is as follows. The text_fasttext encoding can be customized using language and max_length. The language value is required, but max_length is optional.

"additionalParams": {
    "neptune_ml": {
        "version": "v2.0",
        "targets": [ ... ],
        "features": [
            {
                "node": "job",
                "property": "Description",
                "type": "text_fasttext",
                "language": "en",
                "max_length": 256
            },
            {
                "node": "job",
                "property": "Requirements",
                "type": "text_fasttext",
                "language": "en"
            }, ...
        ], ...
    }
}

More details of the job recommendation use cases can be found in the Neptune notebook tutorial.

For demonstration purposes, we select one user, i.e., user 443931, who holds a Master’s degree in ‘Management and Human Resources. The user has applied to five different jobs, titled as “Human Resources (HR) Manager”, “HR Generalist”, “Human Resources Manager”, “Human Resources Administrator”, and “Senior Payroll Specialist”. In order to evaluate the performance of the recommendation task, we delete 50% of the apply jobs (the edges) of the user (here we delete “Human Resources Administrator” and “Human Resources (HR) Manager) and try to predict the top 10 jobs this user is most likely to apply for.

After encoding the job features and user features, we perform a link prediction task by training a relational graph convolutional network (RGCN) model. Training a Neptune ML model requires three steps: data processing, model training, and endpoint creation. After the inference endpoint has been created, we can make recommendations for user 443931. From the predicted top 10 jobs for user 443931 (i.e., “HR Generalist”, “Human Resources (HR) Manager”, “Senior Payroll Specialist”, “Human Resources Administrator”, “HR Analyst”, et al.), we observe that the two deleted jobs are among the 10 predictions.

Conclusion

In this post, we showed the usage of the newly supported text encoders in Neptune ML. These text encoders are simple to use and can support multiple requirements. In summary,

  • text_fasttext is recommended for features that use one and only one of the five languages that text_fasttext supports.
  • text_sbert is recommended for text that text_fasttext doesn’t support.
  • text_word2vec only supports English, and can be replaced by text_fasttext in any scenario.

For more details about the solution, see the GitHub repo. We recommend using the text encoders on your graph data to meet your requirements. You can just choose an encoder name and set some encoder attributes, while keeping the GNN model unchanged.


About the authors

Jiani Zhang is an applied scientist of AWS AI Research and Education (AIRE). She works on solving real-world applications using machine learning algorithms, especially natural language and graph related problems.

Read More