LiDAR 3D point cloud labeling with Velodyne LiDAR sensor in Amazon SageMaker Ground Truth

LiDAR is a key enabling technology in growing autonomous markets, such as robotics, industrial, infrastructure, and automotive. LiDAR delivers precise 3D data about its environment in real time to provide “vision” for autonomous solutions. For autonomous vehicles (AVs), nearly every carmaker uses LiDAR to augment camera and radar systems for a comprehensive perception stack capable of safely navigating complex roadway environments. Computer vision systems can use the 3D maps generated by LiDAR sensors for object detection, object classification, and scene segmentation. Like any other supervised machine learning (ML) system, the point cloud data generated by LiDAR sensors should be labeled correctly in order for the ML model to make correct inferences. This allows AVs to operate smoothly and efficiently, avoiding incidents and collisions with objects, pedestrians, vehicles, and other road users.

In this post, we demonstrate how to label 3D point cloud data generated by Velodyne LiDAR sensors using Amazon SageMaker Ground Truth. We break down the process of sending data for annotation so that you can obtain precise, high-quality results.

The code for this example is available on GitHub.

Solution overview

SageMaker Ground Truth is a data labeling service that you can use to create high-quality labeled datasets for various types of ML use cases. SageMaker Ground Truth is a capability in Amazon SageMaker, which is a comprehensive and fully managed ML service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready environment.

In addition to LiDAR data, we also include camera images, using the sensor fusion feature in SageMaker Ground Truth to deliver robust visual information about the scenes that annotators are labeling. Through sensor fusion, annotators can adjust labels in the 3D scene as well as in 2D images. It delivers the unique capability to ensure that annotations in LiDAR data are mirrored in 2D imagery, making the process more efficient.

With SageMaker Ground Truth, 3D point cloud data generated by a Velodyne LiDAR sensor mounted on a vehicle can be labeled for tracking moving objects. In this challenging use case, we can follow the trajectory of an object like a car or a pedestrian in a dynamic environment, while our point of reference is also moving. In this case, our point of reference is a car that is equipped with a Velodyne LiDAR sensor.

To perform this task, we walk through the following topics:

  • Velodyne technology
  • The dataset
  • Creating a labeling job
  • The point cloud sequence input manifest file
  • Building the sequence input manifest file
  • Creating the label category configuration file
  • Specifying the job resources
  • Completing a labeling job

Prerequisites

To implement the solution in this post, you must have the following prerequisites:

  • An AWS account for running the code.
  • An Amazon Simple Storage Service (Amazon S3) bucket you can write to. The bucket must be in the same Region as the SageMaker notebook instance. We can also define a valid S3 prefix. All the files related to this experiment are stored in that prefix of our bucket. We must attach the CORS policy to this bucket. For instructions, refer to Configuring cross-origin resource sharing (CORS). Enter the following policy in the CORS configuration editor:
<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>HEAD</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <MaxAgeSeconds>3000</MaxAgeSeconds>
    <ExposeHeader>Access-Control-Allow-Origin</ExposeHeader>
    <AllowedHeader>*</AllowedHeader>
</CORSRule>
<CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
</CORSRule>
</CORSConfiguration>
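
If you prefer to attach the CORS policy programmatically instead of through the console editor, the following is a minimal boto3 sketch of the same configuration; the bucket name is a placeholder to replace with your own.

import boto3

s3 = boto3.client("s3")

cors_configuration = {
    "CORSRules": [
        {
            "AllowedOrigins": ["*"],
            "AllowedMethods": ["GET", "HEAD", "PUT"],
            "AllowedHeaders": ["*"],
            "ExposeHeaders": ["Access-Control-Allow-Origin"],
            "MaxAgeSeconds": 3000,
        },
        {"AllowedOrigins": ["*"], "AllowedMethods": ["GET"]},
    ]
}

# Attach the CORS policy to the bucket used for this labeling job (placeholder bucket name)
s3.put_bucket_cors(Bucket="<your-bucket-name>", CORSConfiguration=cors_configuration)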

Velodyne technology

LiDAR can be divided into different categories, including scanning LiDAR and flash LiDAR. Conventional scanning LiDAR uses mechanical rotation to spin the sensor for 360-degree detection. Velodyne, which invented the industry’s first 3D LiDAR, continues to innovate and launch new rotational products with cutting-edge technology. Velodyne’s Ultra Puck is a scanning LiDAR sensor that uses Velodyne’s patented surround view technology. It provides a full 360-degree environmental view to deliver accurate real-time 3D data. The Ultra Puck has a compact form factor and delivers the real-time object detection needed for safe navigation and reliable operation. With a combination of optimal power and high performance, this sensor provides distance and calibrated reflectivity measurements at all rotational angles. It’s an ideal solution for robotics, mapping, security, driver assistance, and autonomous navigation.

Besides the LiDAR sensor itself, Velodyne has created the Vella Development Kit (VDK), a collection of tools, hardware, and documentation that facilitates access to Velodyne’s autonomy software stack. The VDK can be configured for different custom interfaces and environments, providing you with a broad range of applications for increased autonomy and improved safety.

Additionally, the VDK can reduce the upfront work you would have to otherwise put in to enable an end-to-end data collection and annotation pipeline by providing the following necessary capabilities:

  • Clock synchronization between LiDAR, odometry, and camera frames
  • LiDAR-to-vehicle 5-DOF extrinsic calibration (z is not observable)
  • LiDAR-to-camera extrinsic, intrinsic, and distortion calibration
  • Collection of motion-compensated (intra-frame or multi-frame), synchronized LiDAR point clouds and camera images

To develop vehicle-based perception capabilities, Velodyne’s software team has set up their own data collection vehicle with one of their Ultra Puck LiDAR units, a camera, and GPS/IMU sensors mounted to the vehicle hood. In the subsequent steps, we refer to their internal processes that use the VDK to prepare, collect, and annotate data needed to develop their vehicle-based perception capabilities as an example for other customers trying to solve their own perception use cases.

Clock synchronization

Accurate clock synchronization of the LiDAR, odometry, and camera outputs can be crucial for any multi-sensor application that combines those data streams. For best results, you should use a PTP synchronization system with a primary clock that is supported by all sensors. One advantage of PTP is the ability to synchronize multiple devices to high accuracy with a single timing source. Such a system can achieve synchronization accuracy better than 1 microsecond. Other solutions include PPS distribution and per-device time sources. As an alternative, the VDK supports software synchronization using time-of-arrival timestamping, which can be a great way to get an application off the ground quickly in the absence of proper clock synchronization infrastructure. This can result in timestamping errors on the order of 1–10 milliseconds due to a combination of latency and queuing delays at various levels of the network infrastructure and host operating system, which may or may not be acceptable, depending on the application.

LiDAR vehicle calibration

The LiDAR vehicle calibration estimates the extrinsic position of the LiDAR in the vehicle frame along five axes. The z value is unobservable; therefore you must measure the z value independently. Our process is a targetless calibration technique, but it works best in an environment where the ground is relatively flat and that contains contiguous static object features rather than dynamic features (vehicles, pedestrians) or non-contiguous features (shrubs and bushes). Think of a parking lot with few obstacles and buildings with flat facades. The presence of geometric structures is ideal for improving the calibration quality. The user is required to drive in some predefined driving patterns indicated by the VDK to expose most of the parameters. One minute of data is sufficient for this calibration. After the data is uploaded to Velodyne’s platform service, the calibration takes place in the cloud and the result is made available within 24 hours. For the purposes of this notebook, the calibration parameters have already been processed and provided.

The LiDAR dataset

The dataset and resources used in this notebook are provided by Velodyne. This dataset contains one continuous scene from an autonomous vehicle experiment driving around on a highway in California. The entire scene contains 60 frames. The dataset contents are as follows:

  • lidar_cam_calib_vlp32_06_10_2021.yaml – Camera calibration information, one camera only
  • images/ – Camera footage for each frame
  • poses/ – Pose JSON file containing LiDAR extrinsic matrix for each frame
  • rectified_scans_local/ – .pcd files in the LiDAR sensor’s local coordinate system

Run the following code to download the dataset locally and then upload to your S3 bucket, which we defined in the initialization section:

source_bucket = 'velodyne-blog'
source_prefix = 'highway_data_07'
source_data = f's3://{source_bucket}/{source_prefix}'

!aws s3 cp $source_data ./$PREFIX --recursive
target_s3 = f's3://{BUCKET}/{PREFIX}'
!aws s3 cp ./$PREFIX $target_s3 --recursive

Create a labeling job

As the next step, we need to create a data labeling job in SageMaker Ground Truth. We select the task type as object tracking. For more information about 3D point cloud labeling task types, refer to 3D Point Cloud Task types. To create an object tracking point cloud labeling job, we need to add the following resources as the labeling job inputs:

  • Point cloud sequence input manifest – A JSON file defining the point cloud frame sequence and associated sensor fusion data. For more information, see Create a Point Cloud Sequence Input Manifest.
  • Input manifest file – The input file for the labeling job. Each line of the manifest file contains a link to a sequence file defined in the point cloud sequence input manifest.
  • Label category configuration file – This file is used to specify your labels, label category, frame attributes, and worker instructions. For more information, see Create a Labeling Category Configuration File with Label Category and Frame Attributes.
  • Predefined AWS resources – Includes the following:

    • Pre-annotation Lambda ARN – Refer to PreHumanTaskLambdaArn.
    • Annotation consolidation ARN – The AWS Lambda function used to consolidate labels from different workers. Refer to AnnotationConsolidationLambdaArn.
    • Workforce ARN – Defines which workforce type we want to use. Refer to Create and Manage Workforces for more details.
    • HumanTaskUiArn – Defines the worker UI template to do the labeling job. This should have a format similar to arn:aws:sagemaker:<region>:123456789012:human-task-ui/PointCloudObjectTracking.

Keep in mind the following:

  • There should not be an entry for the UiTemplateS3Uri parameter.
  • Your LabelAttributeName must end in -ref. For example, ot-labels-ref.
  • The number of workers specified in NumberOfHumanWorkersPerDataObject should be 1.
  • 3D point cloud labeling doesn’t support active learning, so we shouldn’t specify values for parameters in LabelingJobAlgorithmsConfig.
  • 3D point cloud object tracking labeling jobs can take multiple hours to complete. You should specify a longer time limit for these labeling jobs in TaskTimeLimitInSeconds (up to 7 days, or 604,800 seconds).
    # Object tracking as our 3D point cloud task type
    task_type = "3DPointCloudObjectTracking"

Point cloud sequence input manifest file

The following are the most important steps in generating a sequence input manifest file:

  1. Convert the 3D points to a world coordinate system.
  2. Generate the sensor extrinsic matrix to enable the sensor fusion feature in SageMaker Ground Truth.

The LiDAR sensor is mounted on a moving vehicle (ego vehicle), which captures the data in its own frame of reference. To perform object tracking, we need to convert this data to a global frame of reference to account for the moving ego vehicle itself. This is the world coordinate system.

Sensor fusion is a feature in SageMaker Ground Truth that synchronizes the 3D point cloud frame side by side with the camera frame. This provides visual context for human labelers and allows labelers to adjust annotation in 3D and 2D images synchronously. For instructions on matrix transformation, refer to Labeling data for 3D object tracking and sensor fusion in Amazon SageMaker Ground Truth.

The generate_transformed_pcd_from_point_cloud function performs the coordinate translation and then generates the 3D point data file, which SageMaker Ground Truth can consume.

To translate the data from the local (sensor) coordinate system to the global coordinate system, multiply each point in a 3D frame by the extrinsic matrix of the LiDAR sensor.
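
The following is a minimal sketch of this transformation, assuming the pose file provides a 4 x 4 LiDAR extrinsic matrix and the frame’s points are loaded into an N x 3 NumPy array; the function and variable names are illustrative only.

import numpy as np

def transform_points_to_world(points_xyz, lidar_extrinsic):
    """Transform N x 3 sensor-frame points into the world frame with a 4 x 4 extrinsic matrix."""
    ones = np.ones((points_xyz.shape[0], 1))
    points_h = np.hstack([points_xyz, ones])          # homogeneous coordinates, N x 4
    world_points = (lidar_extrinsic @ points_h.T).T   # apply the extrinsic matrix
    return world_points[:, :3]

def write_ascii_frame(world_points, intensities, out_path):
    """Write one frame in the ASCII (.txt) x y z i format accepted by Ground Truth."""
    with open(out_path, "w") as f:
        for (x, y, z), i in zip(world_points, intensities):
            f.write(f"{x} {y} {z} {i}\n")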

SageMaker Ground Truth renders the 3D point cloud data in either Compact Binary Pack (.bin) or ASCII (.txt) format. Files in these formats need to contain information about the location (x, y, and z coordinates) of all points that make up that frame, and, optionally, information about the intensity and pixel color of each point (i, r, g, b) for colored point clouds.

To read more about SageMaker Ground Truth accepted raw 3D data formats, see Accepted Raw 3D Data Formats.

Build the sequence input manifest file

The next step is to build the point cloud sequence input manifest file. The steps listed in this section are also available in the notebook.

  1. Read the point cloud data from the .pcd file, the LiDAR extrinsic matrix from the pose file, and the camera extrinsic, intrinsic, and distortion data from the camera calibration .yaml file.
  2. Perform a per-frame transform of the raw point cloud to the global frame of reference. Generate an ASCII (.txt) file for each frame and store it in Amazon S3.
  3. Extract the ego vehicle pose from the LiDAR extrinsic matrix.
  4. Build a sensor position in the global coordinate system by extracting the camera pose from the camera inverse extrinsic matrix.
  5. Provide camera calibration parameters (such as distortion and skew).
  6. Build the array of data frames. Reference the ASCII file location, define the vehicle position in world coordinate system, and so on.
  7. Create the sequence manifest file sequence.json (an abbreviated sketch follows this list).
  8. Create our input manifest file. Each line identifies a single sequence file we just uploaded.
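
An abbreviated sketch of what one sequence file and the corresponding input manifest line might look like follows; the S3 paths, timestamps, calibration values, and poses are placeholders, and the authoritative schema is described in Create a Point Cloud Sequence Input Manifest.

{
  "seq-no": 1,
  "prefix": "s3://<bucket>/<prefix>/frames/",
  "number-of-frames": 60,
  "frames": [
    {
      "frame-no": 0,
      "unix-timestamp": 1566861644.759115,
      "frame": "frame_0.txt",
      "format": "text/xyzi",
      "ego-vehicle-pose": {
        "position": {"x": 0.0, "y": 0.0, "z": 0.0},
        "heading": {"qx": 0.0, "qy": 0.0, "qz": 0.0, "qw": 1.0}
      },
      "images": [
        {
          "image-path": "images/frame_0_camera_0.jpg",
          "unix-timestamp": 1566861644.759115,
          "fx": 847.7, "fy": 856.8, "cx": 576.2, "cy": 317.2,
          "position": {"x": 0.0, "y": 0.0, "z": 0.0},
          "heading": {"qx": 0.0, "qy": 0.0, "qz": 0.0, "qw": 1.0},
          "camera-model": "pinhole"
        }
      ]
    }
  ]
}

Each line of the input manifest then references one sequence file, for example: {"source-ref": "s3://<bucket>/<prefix>/manifests_sequences/sequence.json"}.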

Create the label category configuration file

Our label category configuration file is used to specify labels, or classes, for our labeling job. When we use the object detection or object tracking task types, we can also include label attributes in our label category configuration file. Workers can assign one or more attributes we provide to annotations to give more information about that object. For example, we may want to use the attribute occluded to have workers identify when an object is partially obstructed. Let’s look at an example of the label category configuration file for an object detection or object tracking labeling job:

label_category = {
  "categoryGlobalAttributes": [
    {
      "enum": [
        "75-100%",
        "25-75%",
        "0-25%"
      ],
      "name": "Visibility",
      "type": "string"
    }
  ],
  "documentVersion": "2020-03-01",
  "instructions": {
    "fullInstruction": "Draw a tight Cuboid. You only need to annotate those in the first frame. Please make sure the direction of the cubiod is accurately representative of the direction of the vehicle it bounds.",
    "shortInstruction": "Draw a tight Cuboid. You only need to annotate those in the first frame."
  },
  "labels": [
    {
      "categoryAttributes": [],
      "label": "Car"
    },
    {
      "categoryAttributes": [],
      "label": "Truck"
    },
    {
      "categoryAttributes": [],
      "label": "Bus"
    },
    {
      "categoryAttributes": [],
      "label": "Pedestrian"
    },
    {
      "categoryAttributes": [],
      "label": "Cyclist"
    },
    {
      "categoryAttributes": [],
      "label": "Motorcyclist"
    },
  ]
}

category_key = f'{PREFIX}/manifests_categories/label_category.json'
write_json_to_s3(label_category, BUCKET, category_key)

label_category_file = f's3://{BUCKET}/{category_key}'
print(f"label category file uri: {label_category_file}")

Specify the job resources

As the next step, we specify various labeling job resources:

  • Human task UI ARN – HumanTaskUiArn is a resource that defines the worker task template used to render the worker UI and tools for the labeling job. This attribute is defined under UiConfig, and the resource name is configured by Region and task type:

    human_task_ui_arn = (
        f"arn:aws:sagemaker:{region}:123456789012:human-task-ui/{task_type[2:]}"
    )

  • Work team resource – In this example, we use a private work team. For instructions, refer to Create a Private Workforce (Amazon Cognito Console). When we’re done, we should put our work team ARN in the following parameter:

    workteam_arn = f"arn:aws:sagemaker:{region}:123456789012:workteam/private-crowd/test-team"#"<REPLACE W/ YOUR Private Team ARN>"

  • Pre-annotation Lambda ARN and post-annotation Lambda ARN – See the following code:

    ac_arn_map = {
        "us-west-2": "081040173940",
        "us-east-1": "432418664414",
        "us-east-2": "266458841044",
        "eu-west-1": "568282634449",
        "ap-northeast-1": "477331159723",
    }
    
    prehuman_arn = "arn:aws:lambda:{}:{}:function:PRE-{}".format(region, ac_arn_map[region], task_type)
    acs_arn = "arn:aws:lambda:{}:{}:function:ACS-{}".format(region, ac_arn_map[region], task_type)

  • HumanTaskConfig – We use this to specify our work team and configure our labeling job task. Feel free to update the task description in the following code:

    job_name = f"velodyne-blog-test-{str(time.time()).split('.')[0]}"
    
    # Task description info =================
    task_description = "Draw 3D boxes around required objects"
    task_keywords = ['lidar', 'pointcloud']
    task_title = job_name
    
    human_task_config = {
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": acs_arn,
        },
        "WorkteamArn": workteam_arn,
        "PreHumanTaskLambdaArn": prehuman_arn,
        "MaxConcurrentTaskCount": 200,
        "NumberOfHumanWorkersPerDataObject": 1,  # One worker will work on each task
        "TaskAvailabilityLifetimeInSeconds": 18000, # Your workteam has 5 hours to complete all pending tasks.
        "TaskDescription": task_description,
        "TaskKeywords": task_keywords,
        "TaskTimeLimitInSeconds": 36000, # Each seq must be labeled within 1 hour.
        "TaskTitle": task_title,
        "UiConfig": {
            "HumanTaskUiArn": human_task_ui_arn,
        },
    }

Create the labeling job

Next, we create the labeling request, as shown in the following code:

labelAttributeName = f"{job_name}-ref" #must end with -ref

output_path = f"s3://{BUCKET}/{PREFIX}/output"

ground_truth_request = {
    "InputConfig" : {
      "DataSource": {
        "S3DataSource": {
          "ManifestS3Uri": manifest_uri,
        }
      },
      "DataAttributes": {
        "ContentClassifiers": [
          "FreeOfPersonallyIdentifiableInformation",
          "FreeOfAdultContent"
        ]
      },  
    },
    "OutputConfig" : {
      "S3OutputPath": output_path,
    },
    "HumanTaskConfig" : human_task_config,
    "LabelingJobName": job_name,
    "RoleArn": role, 
    "LabelAttributeName": labelAttributeName,
    "LabelCategoryConfigS3Uri": label_category_file,
    "Tags": [],
}

Finally, we create the labeling job:

sagemaker_client.create_labeling_job(**ground_truth_request)
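
After submitting the request, we can poll the job status until it completes; the following is a minimal sketch that reuses the sagemaker_client and job_name defined above:

response = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)
print(response["LabelingJobStatus"])  # InProgress, Completed, Failed, Stopping, or Stopped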

Complete a labeling job

When our labeling job is ready, we can add ourselves to our private work team and experiment with the worker’s portal. We should receive an email with the portal link, our user name, and a temporary password. When we log in, we choose the labeling job from the list, and then we should see the worker’s portal like the following screenshot. (It may take a few minutes for a new labeling job to show up in the portal.) For more information about setting up workers and providing worker instructions, refer to the SageMaker Ground Truth documentation.

When we’re done with the labeling job, we can choose Submit, and then view the output data in the S3 output location we specified earlier.
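
Assuming the standard Ground Truth output layout, the consolidated annotations are written to an output manifest under the output path we configured, which we can locate as follows:

# Assumes the standard Ground Truth output layout for the job name used above
output_manifest_uri = f"{output_path}/{job_name}/manifests/output/output.manifest"
print(f"Output manifest: {output_manifest_uri}")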

Conclusion

In this post, we showed how we can create a 3D point cloud labeling job for object tracking for data captured using Velodyne’s LiDAR sensor. We followed the step-by-step instructions in this post and ran the provided code to create a SageMaker Ground Truth labeling job to label the 3D point cloud data. ML models can use the labels created with this job to train object detection, object recognition, and object tracking models commonly used in autonomous vehicle scenarios.

If you are interested in labeling 3D point cloud data captured via Velodyne’s LiDAR sensor, follow the steps in this article to label the data using Amazon SageMaker Ground Truth.


About the Authors

Sharath Nair leads the Computer Vision team that focuses on building perception algorithms for some of Velodyne’s software products like Object Detection & Tracking, Semantic Segmentation, SLAM, etc. Prior to Velodyne, Sharath worked on Autonomous Vehicles and Robotics and has been involved in this space for the past 6 years.

Oliver Monson is a Senior Data Operations Manager at Velodyne Lidar, responsible for the data pipelines and acquisition strategies that support the development of perception software. Prior to Velodyne, Oliver has managed operational teams executing on HD mapping, geospatial, and archaeological applications.

John Kua is Director of Software Engineering at Velodyne, overseeing the System Integration and Robotics, Vella Go, and Software Production teams. Prior to joining Velodyne, John spent over a decade building multimodal sensor platforms for a wide range of 3D localization and mapping applications in commercial and government applications. These platforms included a wide array of sensors including visible light, thermal, and hyperspectral cameras, lidar, GPS, IMUs, and even gamma-ray spectrometers and imagers.

Sally Frykman, Chief Marketing Officer at Velodyne, oversees the strategic development and execution of global marketing and communications programs that advance the company’s innovative vision and goals. Her multifaceted role encompasses a wide array of responsibilities, including promotion of the Velodyne brand, thought leadership development, and robust sales lead generation fueled by highly engaging digital marketing. Previously, Sally worked in public education and social work.

Nitin Wagh is Sr. Business Development Manager for Amazon AI. He likes the opportunity to help customers understand machine learning and the power of Augmented AI in the AWS Cloud. In his spare time, he loves spending time with family in outdoor activities.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from The University of Texas at Austin and an MS in Computer Science from the Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and make long road trips.

Read More

Onboard PaddleOCR with Amazon SageMaker Projects for MLOps to perform optical character recognition on identity documents

Optical character recognition (OCR) is the task of converting printed or handwritten text into machine-encoded text. OCR has been widely used in various scenarios, such as document electronization and identity authentication. Because OCR can greatly reduce the manual effort to register key information and serve as an entry step for understanding large volumes of documents, an accurate OCR system plays a crucial role in the era of digital transformation.

The open-source community and researchers are concentrating on how to improve OCR accuracy, ease of use, integration with pre-trained models, extension, and flexibility. Among many proposed frameworks, PaddleOCR has gained increasing attention recently. The proposed framework concentrates on obtaining high accuracy while balancing computational efficiency. In addition, the pre-trained models for Chinese and English make it popular in the Chinese language-based market. See the PaddleOCR GitHub repo for more details.

At AWS, we have also proposed integrated AI services that are ready to use with no machine learning (ML) expertise. To extract text and structured data such as tables and forms from documents, you can use Amazon Textract. It uses ML techniques to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.

For the data scientists who want the flexibility to use an open-source framework to develop your own OCR model, we also offer the fully managed ML service Amazon SageMaker. SageMaker enables you to implement MLOps best practices throughout the ML lifecycle, and provides templates and toolsets to reduce the undifferentiated heavy lifting to put ML projects in production.

In this post, we concentrate on developing customized models within the PaddleOCR framework on SageMaker. We walk through the ML development lifecycle to illustrate how SageMaker can help you build and train a model, and eventually deploy the model as a web service. Although we illustrate this solution with PaddleOCR, the general guidance is true for arbitrary frameworks to be used on SageMaker. To accompany this post, we also provide sample code in the GitHub repository.

PaddleOCR framework

As a widely adopted OCR framework, PaddleOCR contains rich text detection, text recognition, and end-to-end algorithms. It chooses Differentiable Binarization (DB) and Convolutional Recurrent Neural Network (CRNN) as the basic detection and recognition models, and proposes a series of models, named PP-OCR, for industrial applications after a series of optimization strategies.

The PP-OCR model is aimed at general scenarios and forms a model library of different languages. It consists of three parts: text detection, box detection and rectification, and text recognition, illustrated in the following figure on the PaddleOCR official GitHub repository. You can also refer to the research paper PP-OCR: A Practical Ultra Lightweight OCR System for more information.

To be more specific, PaddleOCR consists of three consecutive tasks:

  • Text detection – The purpose of text detection is to locate the text area in the image. Such tasks can be based on a simple segmentation network.
  • Box detection and rectification – Each text box needs to be transformed into a horizontal rectangle box for subsequent text recognition. To do this, PaddleOCR proposes to train a text direction classifier (image classification task) to determine the text direction.
  • Text recognition – After the text box is detected, the text recognizer model performs inference on each text box and outputs the results according to text box location. PaddleOCR adopts the widely used method CRNN.

PaddleOCR provides high-quality pre-trained models that are comparable to commercial effects. You can either use the pre-trained model for a detection model, direction classifier, or recognition model, or you can fine tune and retrain each individual model to serve your use case. To increase the efficiency and effectiveness of detecting Traditional Chinese and English, we illustrate how to fine-tune the text recognition model. The pre-trained model we choose is ch_ppocr_mobile_v2.0_rec_train, which is a lightweight model, supporting Chinese, English, and number recognition. The following is an example inference result using a Hong Kong identity card.
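
Before any fine-tuning, a quick way to get a feel for the framework is to run the pre-trained models through the paddleocr Python package. The following sketch assumes the package is installed and uses a placeholder image path; the exact structure of the returned results can vary slightly between PaddleOCR versions.

# pip install paddlepaddle paddleocr
from paddleocr import PaddleOCR

# use_angle_cls enables the text direction classifier; lang="ch" covers Chinese, English, and numbers
ocr = PaddleOCR(use_angle_cls=True, lang="ch")
result = ocr.ocr("sample_document.jpg", cls=True)

# Each detected line consists of a bounding box and a (text, confidence) pair
for box, (text, confidence) in result[0]:
    print(text, confidence)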

In the following sections, we walk through how to fine-tune the pre-trained model using SageMaker.

MLOps best practices with SageMaker

SageMaker is a fully managed ML service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready managed environment.

Many data scientists use SageMaker for accelerating the ML lifecycle. In this section, we illustrate how SageMaker can help you from experimentation to productionizing ML. Following the standard steps of an ML project, from the experimental phase (code development and experiments) to the operational phase (automation of the model build workflow and deployment pipelines), SageMaker can bring efficiency in the following steps:

  1. Explore the data and build the ML code with Amazon SageMaker Studio notebooks.
  2. Train and tune the model with a SageMaker training job.
  3. Deploy the model with a SageMaker endpoint for model serving.
  4. Orchestrate the workflow with Amazon SageMaker Pipelines.

The following diagram illustrates this architecture and workflow.

It’s important to note that you can use SageMaker in a modular way. For example, you can build your code with a local integrated development environment (IDE) and train and deploy your model on SageMaker, or you can develop and train your model in your own cluster compute sources, and use a SageMaker pipeline for workflow orchestration and deploy on a SageMaker endpoint. This means that SageMaker provides an open platform to adapt for your own requirements.

See the code in our GitHub repository and README to understand the code structure.

Provision a SageMaker project

You can use Amazon SageMaker Projects to start your journey. With a SageMaker project, you can manage the versions for your Git repositories so you can collaborate across teams more efficiently, ensure code consistency, and enable continuous integration and continuous delivery (CI/CD). Although notebooks are helpful for model building and experimentation, when you have a team of data scientists and ML engineers working on an ML problem, you need a more scalable way to maintain code consistency and have stricter version control.

SageMaker projects create a preconfigured MLOps template, which includes the essential components for simplifying the PaddleOCR integration:

  • A code repository to build custom container images for processing, training, and inference, integrated with CI/CD tools. This allows us to configure our custom Docker image and push to Amazon Elastic Container Registry (Amazon ECR) to be ready to use.
  • A SageMaker pipeline that defines steps for data preparation, training, model evaluation, and model registration. This prepares us to be MLOps ready when the ML project goes to production.
  • Other useful resources, such as a Git repository for code version control, model group that contains model versions, code change trigger for the model build pipeline, and event-based trigger for the model deployment pipeline.

You can use SageMaker seed code to create standard SageMaker projects, or a specific template that your organization created for team members. In this post, we use the standard MLOps template for image building, model building, and model deployment. For more information about creating a project in Studio, refer to Create an MLOps Project using Amazon SageMaker Studio.

Explore data and build ML code with SageMaker Studio Notebooks

SageMaker Studio notebooks are collaborative notebooks that you can launch quickly because you don’t need to set up compute instances and file storage beforehand. Many data scientists prefer to use this web-based IDE for developing the ML code, quickly debugging the library API, and getting things running with a small sample of data to validate the training script.

In Studio notebooks, you can use a pre-built environment for common frameworks such as TensorFlow, PyTorch, Pandas, and Scikit-Learn. You can install the dependencies to the pre-built kernel, or build up your own persistent kernel image. For more information, refer to Install External Libraries and Kernels in Amazon SageMaker Studio. Studio notebooks also provide a Python environment to trigger SageMaker training jobs, deployment, or other AWS services. In the following sections, we illustrate how to use Studio notebooks as an environment to trigger training and deployment jobs.

SageMaker provides a powerful IDE; it’s an open ML platform where data scientists have the flexibility to use their preferred development environment. For data scientists who prefer a local IDE such as PyCharm or Visual Studio Code, you can use the local Python environment to develop your ML code, and use SageMaker for training in a managed scalable environment. For more information, see Run your TensorFlow job on Amazon SageMaker with a PyCharm IDE. After you have a solid model, you can adopt the MLOps best practices with SageMaker.

Currently, SageMaker also provides SageMaker notebook instances as our legacy solution for the Jupyter Notebook environment. You have the flexibility to run the Docker build command and use SageMaker local mode to train on your notebook instance. We also provide sample code for PaddleOCR in our code repository: ./train_and_deploy/notebook.ipynb.

Build a custom image with a SageMaker project template

SageMaker makes extensive use of Docker containers for build and runtime tasks. You can run your own container with SageMaker easily. See more technical details at Use Your Own Training Algorithms.

However, as a data scientist, building a container might not be straightforward. SageMaker projects provide a simple way for you to manage custom dependencies through an image building CI/CD pipeline. When you use a SageMaker project, you can make updates to the training image with your custom container Dockerfile. For step-by-step instructions, refer to Create Amazon SageMaker projects with image building CI/CD pipelines. With the structure provided in the template, you can modify the provided code in this repository to build a PaddleOCR training container.

For this post, we showcase the simplicity of building a custom image for processing, training, and inference. The GitHub repo contains three corresponding image-build folders, one each for the processing, training, and inference containers.

These projects follow a similar structure. Take the training container image as an example; the image-build-train/ repository contains the following files:

  • The codebuild-buildspec.yml file, which is used to configure AWS CodeBuild so that the image can be built and pushed to Amazon ECR.
  • The Dockerfile used for the Docker build, which contains all dependencies and the training code.
  • The train.py entry point for the training script, with all hyperparameters (such as learning rate and batch size) configurable as arguments. These arguments are specified when you start the training job.
  • The dependencies.

When you push the code into the corresponding repository, it triggers AWS CodePipeline to build a training container for you. The custom container image is stored in an Amazon ECR repository, as illustrated in the previous figure. A similar procedure is adopted for generating the inference image.

Train the model with the SageMaker training SDK

After your algorithm code is validated and packaged into a container, you can use a SageMaker training job to provision a managed environment to train the model. This environment is ephemeral, meaning that you can have separate, secure compute resources (such as GPU) or a Multi-GPU distributed environment to run your code. When the training is complete, SageMaker saves the resulting model artifacts to an Amazon Simple Storage Service (Amazon S3) location that you specify. All the log data and metadata persist on the AWS Management Console, Studio, and Amazon CloudWatch.

The training job includes several important pieces of information:

  • The URL of the S3 bucket where you stored the training data
  • The URL of the S3 bucket where you want to store the output of the job
  • The managed compute resources that you want SageMaker to use for model training
  • The Amazon ECR path where the training container is stored

For more information about training jobs, see Train Models. The example code for the training job is available at experiments-train-notebook.ipynb.

SageMaker makes the hyperparameters in a CreateTrainingJob request available in the Docker container in the /opt/ml/input/config/hyperparameters.json file.
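
Inside the container, the training entry point can read these values directly; the following is a minimal sketch, with illustrative hyperparameter names:

import json

# SageMaker passes hyperparameters as strings in this well-known location
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)

learning_rate = float(hyperparameters.get("learning_rate", "0.001"))
epochs = int(hyperparameters.get("epochs", "10"))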

We use the custom training container as the entry point and specify a GPU environment for the infrastructure. All relevant hyperparameters are detailed as parameters, which allows us to track each individual job configuration, and compare them with the experiment tracking.
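
A sketch of submitting such a training job with the SageMaker Python SDK follows, assuming the custom training image has already been pushed to Amazon ECR; the image URI, role, channel name, and hyperparameter names are placeholders.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/paddleocr-train:latest",
    role="<your-sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # GPU instance for training
    hyperparameters={"learning_rate": 0.001, "epochs": 10},
    output_path="s3://<your-bucket>/paddleocr/output",
)

# The channel name ("training") maps to /opt/ml/input/data/training inside the container
estimator.fit({"training": "s3://<your-bucket>/paddleocr/train-data"})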

Because the data science process is very research-oriented, it’s common that multiple experiments are running in parallel. This requires an approach that keeps track of all the different experiments, different algorithms, and potentially different datasets and hyperparameters attempted. Amazon SageMaker Experiments lets you organize, track, compare, and evaluate your ML experiments. We demonstrate this as well in experiments-train-notebook.ipynb. For more details, refer to Manage Machine Learning with Amazon SageMaker Experiments.

Deploy the model for model serving

As for deployment, especially for real-time model serving, many data scientists might find it hard to do without help from operation teams. SageMaker makes it simple to deploy your trained model into production with the SageMaker Python SDK. You can deploy your model to SageMaker hosting services and get an endpoint to use for real-time inference.

In many organizations, data scientists might not be responsible for maintaining the endpoint infrastructure. However, testing your model as an endpoint and guaranteeing the correct prediction behaviors is indeed the responsibility of data scientists. Therefore, SageMaker simplified the tasks for deploying by adding a set of tools and SDK for this.

For the use case in the post, we want to have real-time, interactive, low-latency capabilities. Real-time inference is ideal for this inference workload. However, there are many options adapting to each specific requirement. For more information, refer to Deploy Models for Inference.

To deploy the custom image, data scientists can use the SageMaker SDK, as illustrated in experiments-deploy-notebook.ipynb.

In the create_model request, the container definition includes the ModelDataUrl parameter, which identifies the Amazon S3 location where model artifacts are stored. SageMaker uses this information to determine from where to copy the model artifacts. It copies the artifacts to the /opt/ml/model directory for use by your inference code. The serve and predictor.py is the entry point for serving, with the model artifact that is loaded when you start the deployment. For more information, see Use Your Own Inference Code with Hosting Services.
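
The following sketch shows what that deployment can look like with the SageMaker Python SDK, assuming a custom inference image in Amazon ECR and a trained model artifact in Amazon S3; all URIs and names are placeholders.

from sagemaker.model import Model

model = Model(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/paddleocr-inference:latest",
    model_data="s3://<your-bucket>/paddleocr/output/<training-job-name>/output/model.tar.gz",
    role="<your-sagemaker-execution-role-arn>",
)

# Creates the SageMaker model, endpoint configuration, and real-time endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="paddleocr-endpoint",
)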

Orchestrate your workflow with SageMaker Pipelines

The last step is to wrap your code as end-to-end ML workflows, and to apply MLOps best practices. In SageMaker, the model building workload, a directed acyclic graph (DAG), is managed by SageMaker Pipelines. Pipelines is a fully managed service supporting orchestration and data lineage tracking. In addition, because Pipelines is integrated with the SageMaker Python SDK, you can create your pipelines programmatically using a high-level Python interface that we used previously during the training step.

We provide an example of pipeline code to illustrate the implementation at pipeline.py.

The pipeline includes a preprocessing step for dataset generation, training step, condition step, and model registration step. At the end of each pipeline run, data scientists may want to register their model for version controls and deploy the best performing one. The SageMaker model registry provides a central place to manage model versions, catalog models, and trigger automated model deployment with approval status of a specific model. For more details, refer to Register and Deploy Models with Model Registry.
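
At a high level, assembling and running such a pipeline with the SageMaker Python SDK looks roughly like the following sketch; the step objects (preprocessing, training, and the condition step that wraps model registration) are assumed to be defined as in pipeline.py, and the names are illustrative.

from sagemaker.workflow.pipeline import Pipeline

# step_process, step_train, and step_cond are assumed to be defined as in pipeline.py
pipeline = Pipeline(
    name="PaddleOCRPipeline",
    steps=[step_process, step_train, step_cond],
)

pipeline.upsert(role_arn="<your-sagemaker-execution-role-arn>")  # create or update the pipeline definition
execution = pipeline.start()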

In an ML system, automated workflow orchestration helps prevent model performance degradation, in other words model drift. Early and proactive detection of data deviations enables you to take corrective actions, such as retraining models. You can trigger the SageMaker pipeline to retrain a new version of the model after deviations have been detected. The trigger of a pipeline can be also determined by Amazon SageMaker Model Monitor, which continuously monitors the quality of models in production. With the data capture capability to record information, Model Monitor supports data and model quality monitoring, bias, and feature attribution drift monitoring. For more details, see Monitor models for data and model quality, bias, and explainability.

Conclusion

In this post, we illustrated how to run the framework PaddleOCR on SageMaker for OCR tasks. To help data scientists easily onboard SageMaker, we walked through the ML development lifecycle, from building algorithms, to training, to hosting the model as a web service for real-time inference. You can use the template code we provided to migrate an arbitrary framework onto the SageMaker platform. Try it out for your ML project and let us know your success stories.


About the Authors

Junyi (Jackie) LIU is a Senior Applied Scientist at AWS. She has many years of working experience in the field of machine learning, with rich practical experience in developing and implementing machine learning solutions for supply chain prediction algorithms, advertising recommendation systems, and OCR and NLP applications.

Yanwei Cui, PhD, is a Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial intelligence powered industrial applications in computer vision, natural language processing and online user behavior prediction. At AWS, he shares the domain expertise and helps customers to unlock business potentials, and to drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Yi-An CHEN is a Software Developer at Amazon Lab 126. She has more than 10 years of experience developing machine learning-driven products across diverse disciplines, including personalization, natural language processing, and computer vision. Outside of work, she likes long-distance running and biking.

Read More

Drive efficiencies with CI/CD best practices on Amazon Lex

Let’s say you have identified a use case in your organization that you would like to handle via a chatbot. You familiarized yourself with Amazon Lex, built a prototype, and did a few trial interactions with the bot. You liked the overall experience and now want to deploy the bot in your production environment, but aren’t sure about best practices for Amazon Lex. In this post, we review the best practices for developing and deploying Amazon Lex bots, enabling you to streamline the end-to-end bot lifecycle and optimize your operations.

We have covered the planning, design, and configuration phases in previous blog posts. We suggest reviewing these posts to help you build engaging conversations with your bot before you proceed. After you’ve initially configured the bot, you should test it internally and iterate on the bot definition. You’re now ready to deploy it in your production environment (such as a call center), where the bot will process live conversations. Once in production, you should monitor it continuously to make sure it’s meeting your desired business goals. This cycle repeats as you add new use cases and enhancements.

Let’s review the best practices for development, testing, deployment, and monitoring bots.

Development

Consider the following best practices when developing your bot:

  • Manage bot schema via code – The Amazon Lex console provides an easy-to-use interface as you design and configure the bot, but relies on manual actions to replicate the setup. We recommend converting the bot schema into code after finishing the design to simplify this step. You can use APIs or AWS CloudFormation (see Creating Amazon Lex V2 resources with AWS CloudFormation) to manage the bot programmatically.
  • Checkpoint bot schema with bot versioning – Checkpointing is a common approach often used to revert an application to a last-known stable state. Amazon Lex offers this functionality via bot versioning. We recommend using a new version at each milestone in your development process. This allows you to make incremental changes to your bot definition, with an easy way to revert them in case they don’t work as expected (see the sketch after this list).
  • Identify data handling requirements and configure appropriate controls – Amazon Lex follows the AWS shared responsibility model, which includes guidelines for data protection to comply with industry regulations and with your company’s own data privacy standards. Additionally, Amazon Lex adheres to compliance programs such as SOC, PCI, and FedRAMP. Amazon Lex provides the ability to obfuscate slots that are considered sensitive. You should identify your data privacy requirements and configure the appropriate controls in your bot.
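
For example, creating a checkpoint version of an Amazon Lex V2 bot can be scripted with boto3, as in the following sketch; the bot ID and locale are placeholders.

import boto3

lex = boto3.client("lexv2-models")

# Create a numbered version from the DRAFT of the specified locale (placeholder bot ID)
response = lex.create_bot_version(
    botId="<your-bot-id>",
    description="Checkpoint before production rollout",
    botVersionLocaleSpecification={"en_US": {"sourceBotVersion": "DRAFT"}},
)
print(response["botVersion"])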

Testing

After you have a bot definition, you should test the bot to ensure that it works as intended and is configured correctly. For example, it should have permissions to trigger other services, such as AWS Lambda functions. In addition, you should also test the bot to confirm it’s able to interpret different types of user requests. Consider the following best practices for testing:

  • Identify test data – You should gather relevant test data to test bot performance. The test data should include a comprehensive representation of expected user conversations with the bot, especially for IVR use cases where the bot will need to understand voice inputs. The test data should cover different speaking styles and accents. Such test data can provide experience validation for your target customer base.
  • Identify user experience metrics – Defining the conversational experience can be hard. You have to anticipate and plan for all the different ways users might engage with the bot. How do you guide the caller without sounding too prescriptive? How do you recover if the caller provides incorrect or incomplete information? To manage the dialog through many different scenarios, you should set a clear goal that covers different speaking styles, acoustic conditions, and modality, and identify objective metrics that you can track. For example, an objective indicator would be “90% of conversations should have less than two re-prompts played to the user,” versus a subjective indicator such as “the majority of conversations should not ask users to repeat their input.”
  • Evaluate user experience along the way – In some cases, seemingly small changes can have a big impact on the user experience. For example, consider a situation where you inadvertently introduce a typo in the regular expression used for an account ID slot type, which leads to the bot re-prompting the user to provide input again. You should evaluate the user experience, and invest in automated testing to generate key metrics. You can refer to Evaluating an automatic speech recognition service and Testing accuracy and regression with Amazon Connect and Amazon Lex for examples of how to test and generate key metrics.

Deployment

Once you’re satisfied with the bot’s performance, you’ll want to deploy the bot to start serving your production traffic. As you iterate the bot over the course of its lifecycle, you repeat the deployments, making it a continuous process, so it’s critical to have a streamlined, automated deployment to reduce the chance of errors. Consider the following best practices for deployment:

  • Use a multi-account environment – You should follow the AWS recommended multi-account environment setup in your organization and use separate AWS accounts for your development stage and production stage. If you have a multi-Region presence, then you should also use a separate AWS account per Region for production. Using separate AWS accounts per stage offers you security, access, and billing boundaries for your AWS resources.
  • Automate promoting a bot from development through to production – When replicating the bot setup in your development stage to your production stage, you should use automated solutions and minimize manual touch points. You should use CloudFormation templates to create your bots. Alternatively, you can use Amazon Lex export and import APIs to provide an automated means to copy a bot schema across accounts.
  • Roll out changes in a phased manner – You should deploy changes to your production environment in a phased manner, so that changes are released to a subset of your production traffic before being released to all users. Such an approach gives you the chance to limit the blast radius in case there are any issues with the change. One way you can achieve this is by having a two-phased deployment approach: you create two aliases for a bot (for example, prod-05 and prod-95). You first associate the new bot version with one alias (prod-05 in this example). After you validate the key metrics meet the success criteria, you associate the second alias (prod-95) with the new bot version.

Note that you need to control the distribution of traffic on the client application used to integrate with Amazon Lex bots. For example, if you’re using Amazon Connect to integrate with your bots, you can use a Distribute by percentage contact block in conjunction with two or more Get customer input blocks.

It’s important to note that Amazon Lex provides a test alias out of the box. The test alias is meant to be used for ad hoc manual testing via the Amazon Lex console only, and is not meant to handle production-scale loads. We recommend using a dedicated alias for your production traffic.

Monitoring

Monitoring is important for maintaining reliability, availability, and an effective end-user experience. You should analyze your bot’s metrics and use the learnings as a feedback mechanism to improve the bot schema as well as your development, testing, and deployment practices. Amazon Lex supports multiple mechanisms to monitor bots. Consider the following best practices for monitoring your Lex bots:

  • Monitor constantly and iterate – Amazon Lex integrates with Amazon CloudWatch to provide near-real-time metrics that can provide you with key insights into your users’ interactions with the bot. These insights can help you gain perspective on the end-user experience. To learn more about the different types of metrics that Amazon Lex emits, see Monitoring Amazon Lex V2 with Amazon CloudWatch. We recommend setting up thresholds to trigger alarms. Similarly, Amazon Lex gives you visibility into the raw input utterances from your users’ interactions with the bot. You should use utterance statistics or conversation logs to gain insights to identify communication patterns and make appropriate changes to your bot as necessary. To learn how to create a personalized analytics dashboard for your bots, refer to Monitor operational metrics for your Amazon Lex chatbot.

The best practices discussed in this post focus primarily on Amazon Lex-specific use cases. In addition to these, you should review and adhere to best practices when managing your cloud infrastructure in AWS. Make sure that your cloud infrastructure is secure and only accessible by authorized users. You should also review and adopt the appropriate AWS security best practices within your organization. Lastly, you should proactively review the AWS quotas for individual AWS services (including Amazon Lex quotas) and request appropriate changes if necessary.

Conclusion

You can use Amazon Lex to enable sophisticated natural language conversations and drive customer service efficiencies. In this post, we reviewed the best practices for the development, testing, deployment, and monitoring phases of a bot lifecycle. With these guidelines, you can improve the end-user experience and achieve better customer engagement. Start building your Amazon Lex conversational experience today!


About the Author

Swapandeep Singh is an engineer with the Amazon Lex team. He works on making interactions with bots smoother and more human-like. Outside of work, he likes to travel and learn about different cultures.

Read More

Feature engineering at scale for healthcare and life sciences with Amazon SageMaker Data Wrangler

Machine learning (ML) is disrupting a lot of industries at an unprecedented pace. The healthcare and life sciences (HCLS) industry has been going through a rapid evolution in recent years embracing ML across a multitude of use cases for delivering quality care and improving patient outcomes.

In a typical ML lifecycle, data engineers and scientists spend the majority of their time on the data preparation and feature engineering steps before even getting started with the process of model building and training. Having a tool that can lower the barrier to entry for data preparation, thereby improving productivity, is a highly desirable ask for these personas. Amazon SageMaker Data Wrangler is purpose built by AWS to reduce the learning curve and enable data practitioners to accomplish data preparation, cleaning, and feature engineering tasks in less effort and time. It offers a GUI interface with many built-in functions and integrations with other AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon SageMaker Feature Store, as well as partner data sources including Snowflake and Databricks.

In this post, we demonstrate how to use Data Wrangler to prepare healthcare data for training a model to predict heart failure, given a patient’s demographics, prior medical conditions, and lab test result history.

Solution overview

The solution consists of the following steps:

  1. Acquire a healthcare dataset as input to Data Wrangler.
  2. Use Data Wrangler’s built-in transformation functions to transform the dataset. This includes drop columns, featurize data/time, join datasets, impute missing values, encode categorical variables, scale numeric values, balance the dataset, and more.
  3. Use Data Wrangler’s custom transform function (Pandas or PySpark code) to supplement additional transformations beyond the built-in transformations and demonstrate the extensibility of Data Wrangler. This includes filtering rows, grouping data, forming new dataframes based on conditions, and more (see the Pandas sketch after this list).
  4. Use Data Wrangler’s built-in visualization functions to perform visual analysis. This includes target leakage, feature correlation, quick model, and more.
  5. Use Data Wrangler’s built-in export options to export the transformed dataset to Amazon S3.
  6. Launch a Jupyter notebook to use the transformed dataset in Amazon S3 as input to train a model.
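
As an example of step 3, a custom transform in Data Wrangler is a short snippet that receives the current dataframe as df and assigns the transformed result back to df. The following Pandas sketch assumes the Synthea column names (PATIENT, CODE, VALUE) and uses an illustrative observation code to average one lab value per patient.

import pandas as pd

# Data Wrangler custom transform (Python (Pandas)): the input dataframe is provided as df
# Illustrative example: keep heart rate observations (LOINC 8867-4) and average them per patient
df = df[df["CODE"] == "8867-4"].copy()
df["VALUE"] = pd.to_numeric(df["VALUE"], errors="coerce")
df = (
    df.groupby("PATIENT", as_index=False)["VALUE"]
      .mean()
      .rename(columns={"VALUE": "avg_heart_rate"})
)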

Generate a dataset

Now that we have settled on the ML problem statement, we first set our sights on acquiring the data we need. Research studies such as Heart Failure Prediction may provide data that’s already in good shape. However, we often encounter scenarios where the data is quite messy and requires joining, cleansing, and several other transformations that are very specific to the healthcare domain before it can be used for ML training. We want to find or generate data that is messy enough and walk you through the steps of preparing it using Data Wrangler. With that in mind, we picked Synthea as a tool to generate synthetic data that fits our goal. Synthea is an open-source synthetic patient generator that models the medical history of synthetic patients. To generate your dataset, complete the following steps:

  1. Follow the instructions in the quick start documentation to create an Amazon SageMaker Studio domain and launch Studio.
    This is a prerequisite step; you can skip it if Studio is already set up in your account.
  2. After Studio is launched, on the Launcher tab, choose System terminal.
    This launches a terminal session that gives you a command line interface to work with.
  3. To install Synthea and generate the dataset in CSV format, run the following commands in the launched terminal session:
    $ sudo yum install -y java-1.8.0-openjdk-devel
    $ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
    $ export PATH=$JAVA_HOME/bin:$PATH
    $ git clone https://github.com/synthetichealth/synthea
    $ cd synthea
    $ git checkout v3.0.0
    $ ./run_synthea --exporter.csv.export=true -p 10000

We supply a parameter to generate the datasets with a population size of 10,000. Note that the size parameter denotes the number of living members of the population. Synthea also generates data for deceased members of the population, which might add a few extra data points on top of the specified sample size.

Wait until the data generation is complete. This step usually takes around an hour or less. Synthea generates multiple datasets, including patients, medications, allergies, conditions, and more. For this post, we use three of the resulting datasets:

  • patients.csv – This dataset is about 3.2 MB and contains approximately 11,000 rows of patient data (25 columns including patient ID, birthdate, gender, address, and more)
  • conditions.csv – This dataset is about 47 MB and contains approximately 370,000 rows of medical condition data (six columns including patient ID, condition start date, condition code, and more)
  • observations.csv – This dataset is about 830 MB and contains approximately 5 million rows of observation data (eight columns including patient ID, observation date, observation code, value, and more)

There is a one-to-many relationship between the patients and conditions datasets. There is also a one-to-many relationship between the patients and observations datasets. For a detailed data dictionary, refer to CSV File Data Dictionary.

  1. To upload the generated datasets to a source bucket in Amazon S3, run the following commands in the terminal session:
    $ cd ./output/csv
    $ aws s3 sync . s3://<source bucket name>/

Launch Data Wrangler

Choose SageMaker resources in the navigation pane in Studio, and on the Projects menu, choose Data Wrangler to create a Data Wrangler data flow. For detailed steps on how to launch Data Wrangler from within Studio, refer to Get Started with Data Wrangler.

Import data

To import your data, complete the following steps:

  1. Choose Amazon S3 and locate the patients.csv file in the S3 bucket.
  2. In the Details pane, choose First K for Sampling.
  3. Enter 1100 for Sample size.
    In the preview pane, Data Wrangler pulls the first 100 rows from the dataset and lists them as a preview.
  4. Choose Import.
    Data Wrangler selects the first 1,100 patients from the total patients (11,000 rows) generated by Synthea and imports the data. The sampling approach lets Data Wrangler only process the sample data. It enables us to develop our data flow with a smaller dataset, which results in quicker processing and a shorter feedback loop. After we create the data flow, we can submit the developed recipe to a SageMaker processing job to horizontally scale out the processing for the full or larger dataset in a distributed fashion.
  5. Repeat this process for the conditions and observations datasets.
    1. For the conditions dataset, enter 37000 for Sample size, which is 1/10 of the total 370,000 rows generated by Synthea.
    2. For the observations dataset, enter 500000 for Sample size, which is 1/10 of the total 5 million observation rows generated by Synthea.

You should see three datasets as shown in the following screenshot.

Transform the data

Data transformation is the process of changing the structure, value, or format of one or more columns in the dataset. The process is usually developed by a data engineer, and the logic behind a proposed transformation can be hard to decipher for people with less data engineering experience. Data transformation is part of the broader feature engineering process, and the correct sequence of steps is another important criterion to keep in mind while devising such recipes.

Data Wrangler is designed to be a low-code tool to reduce the barrier of entry for effective data preparation. It comes with over 300 preconfigured data transformations for you to choose from without writing a single line of code. In the following sections, we see how to transform the imported datasets in Data Wrangler.

Drop columns in patients.csv

We first drop some columns from the patients dataset. Dropping redundant columns removes non-relevant information from the dataset and helps us reduce the amount of computing resources required to process the dataset and train a model. In this section, we drop columns such as SSN or passport number because common sense tells us these columns have no predictive value. In other words, they don’t help our model predict heart failure. Our study is also not concerned with the influence of other columns, such as birthplace or healthcare expenses, on a patient’s heart failure, so we drop them as well. Redundant columns can also be identified by running analyses such as target leakage, feature correlation, and multicollinearity, which are built into Data Wrangler. For more details on the supported analysis types, refer to Analyze and Visualize. Additionally, you can use the Data Quality and Insights Report to perform automated analyses on the datasets and arrive at a list of redundant columns to eliminate.

  1. Choose the plus sign next to Data types for the patients.csv dataset and choose Add transform.
  2. Choose Add step and choose Manage columns.
  3. For Transform, choose Drop column.
  4. For Columns to drop, choose the following columns:
    1. SSN
    2. DRIVERS
    3. PASSPORT
    4. PREFIX
    5. FIRST
    6. LAST
    7. SUFFIX
    8. MAIDEN
    9. RACE
    10. ETHNICITY
    11. BIRTHPLACE
    12. ADDRESS
    13. CITY
    14. STATE
    15. COUNTY
    16. ZIP
    17. LAT
    18. LON
    19. HEALTHCARE_EXPENSES
    20. HEALTHCARE_COVERAGE
  5. Choose Preview to review the transformed dataset, then choose Add.

    You should see the step Drop column in your list of transforms.
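
For reference, the same operation could be expressed as a custom Pandas transform. The following is only a sketch of the equivalent logic; in this walkthrough we use the built-in Drop column transform:

    # Illustrative equivalent of the built-in Drop column step
    df = df.drop(columns=[
        'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX', 'FIRST', 'LAST', 'SUFFIX', 'MAIDEN',
        'RACE', 'ETHNICITY', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE', 'COUNTY',
        'ZIP', 'LAT', 'LON', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'
    ])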

Featurize date/time in patients.csv

Now we use the Featurize date/time function to generate the new feature Year from the BIRTHDATE column in the patients dataset. We use the new feature in a subsequent step to calculate a patient’s age at the time an observation takes place.

  1. In the Transforms pane of your Drop column page for the patients dataset, choose Add step.
  2. Choose the Featurize date/time transform.
  3. Choose Extract columns.
  4. For Input columns, add the column BIRTHDATE.
  5. Select Year and deselect Month, Day, Hour, Minute, Second.

  6. Choose Preview, then choose Add.
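
Under the hood, this step extracts the year component from the birth date. The following is a minimal Pandas sketch of the same idea, for illustration only (the built-in transform generates the BIRTHDATE_year column we use later):

    import pandas as pd

    # Illustrative equivalent of Featurize date/time with only Year selected
    df['BIRTHDATE_year'] = pd.to_datetime(df['BIRTHDATE']).dt.year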

Add transforms in observations.csv

Data Wrangler supports custom transforms using Python (user-defined functions), PySpark, Pandas, or PySpark (SQL). You can choose your transform type based on your familiarity with each option and preference. For the latter three options, Data Wrangler exposes the variable df for you to access the dataframe and apply transformations on it. For a detailed explanation and examples, refer to Custom Transforms. In this section, we add three custom transforms to the observations dataset.

  1. Add a transform to observations.csv and drop the DESCRIPTION column.
  2. Choose Preview, then choose Add.
  3. In the Transforms pane, choose Add step and choose Custom transform.
  4. On the drop-down menu, choose Python (Pandas).
  5. Enter the following code:
    df = df[df["CODE"].isin(['8867-4','8480-6','8462-4','39156-5','777-3'])]

    These are LOINC codes that correspond to the following observations we’re interested in using as features for predicting heart failure:

    • Heart rate: 8867-4
    • Systolic blood pressure: 8480-6
    • Diastolic blood pressure: 8462-4
    • Body mass index (BMI): 39156-5
    • Platelets [#/volume] in blood: 777-3

  6. Choose Preview, then choose Add.
  7. Add a transform to extract Year and Quarter from the DATE column.
  8. Choose Preview, then choose Add.
  9. Choose Add step and choose Custom transform.
  10. On the drop-down menu, choose Python (PySpark).

    The five types of observations may not always be recorded on the same date. For example, a patient may visit their family doctor on January 21 and have their systolic blood pressure, diastolic blood pressure, heart rate, and body mass index measured and recorded. However, a lab test that includes platelets may be done at a later date on February 2. Therefore, it’s not always possible to join dataframes by the exact observation date. Here we join dataframes at a coarser granularity, on a quarterly basis.
  11. Enter the following code:
    from pyspark.sql.functions import col

    systolic_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "systolic")
          .filter(col("code") == "8480-6")
    )

    diastolic_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "diastolic")
          .filter(col("code") == "8462-4")
    )

    hr_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "hr")
          .filter(col("code") == "8867-4")
    )

    bmi_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "bmi")
          .filter(col("code") == "39156-5")
    )

    platelets_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "platelets")
          .filter(col("code") == "777-3")
    )

    df = (
        systolic_df.join(diastolic_df, ["patient", "DATE_year", "DATE_quarter"])
                   .join(hr_df, ["patient", "DATE_year", "DATE_quarter"])
                   .join(bmi_df, ["patient", "DATE_year", "DATE_quarter"])
                   .join(platelets_df, ["patient", "DATE_year", "DATE_quarter"])
    )

  12. Choose Preview, then choose Add.
  13. Choose Add step, then choose Manage rows.
  14. For Transform, choose Drop duplicates.
  15. Choose Preview, then choose Add.
  16. Choose Add step and choose Custom transform.
  17. On the drop-down menu, choose Python (Pandas).
  18. Enter the following code to take an average of data points that share the same time value:
    import pandas as pd
    df.loc[:, df.columns != 'patient']=df.loc[:, df.columns != 'patient'].apply(pd.to_numeric)
    df = df.groupby(['patient','DATE_year','DATE_quarter']).mean().round(0).reset_index()

  19. Choose Preview, then choose Add.

Join patients.csv and observations.csv

In this step, we showcase how to effectively and easily perform complex joins on datasets without writing any code via Data Wrangler’s powerful UI. To learn more about the supported types of joins, refer to Transform Data.

  1. To the right of Transform: patients.csv, choose the plus sign next to Steps and choose Join.
    You can see the transformed patients.csv file listed under Datasets in the left pane.
  2. To the right of Transform: observations.csv, choose Steps to initiate the join operation.
    The transformed observations.csv file is now listed under Datasets in the left pane.
  3. Choose Configure.
  4. For Join Type, choose Inner.
  5. For Left, choose Id.
  6. For Right, choose patient.
  7. Choose Preview, then choose Add.
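
For readers who prefer code, this UI-driven inner join is conceptually similar to the following Pandas sketch. The dataframe names patients_df and observations_df are hypothetical; the join columns match the ones chosen above:

    # Conceptual equivalent of the inner join configured in the UI (illustrative only)
    joined_df = patients_df.merge(
        observations_df, left_on='Id', right_on='patient', how='inner'
    )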

Add a custom transform to the joined datasets

In this step, we calculate a patient’s age at the time of observation. We also drop columns that are no longer needed.

  1. Choose the plus sign next to 1st Join and choose Add transform.
  2. Add a custom transform in Pandas:
    df['age'] = df['DATE_year'] - df['BIRTHDATE_year']
    df = df.drop(columns=['BIRTHDATE','DEATHDATE','BIRTHDATE_year','patient'])

  3. Choose Preview, then choose Add.

Add custom transforms to conditions.csv

  1. Choose the plus sign next to Transform: conditions.csv and choose Add transform.
  2. Add a custom transform in Pandas:
    df = df[df["CODE"].isin(['84114007', '88805009', '59621000', '44054006', '53741008', '449868002', '49436004'])]
    df = df.drop(columns=['DESCRIPTION','ENCOUNTER','STOP'])

Note: As we demonstrated earlier, you can drop columns either using custom code or using the built-in transformations provided by Data Wrangler. Custom transformations within Data Wrangler provide the flexibility to bring your own transformation logic in the form of code snippets in the supported frameworks. These snippets can later be searched and applied if needed.

The codes in the preceding transform are SNOMED-CT codes that correspond to the following conditions. The heart failure or chronic congestive heart failure condition becomes the label. We use the remaining conditions as features for predicting heart failure. We also drop a few columns that are no longer needed.

  • Heart failure: 84114007
  • Chronic congestive heart failure: 88805009
  • Hypertension: 59621000
  • Diabetes: 44054006
  • Coronary Heart Disease: 53741008
  • Smokes tobacco daily: 449868002
  • Atrial Fibrillation: 49436004

  1. Next, let’s add a custom transform in PySpark:
    from pyspark.sql.functions import col, when

    heartfailure_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "heartfailure")
          .filter((col("code") == "84114007") | (col("code") == "88805009"))
    )

    hypertension_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "hypertension")
          .filter(col("code") == "59621000")
    )

    diabetes_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "diabetes")
          .filter(col("code") == "44054006")
    )

    coronary_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "coronary")
          .filter(col("code") == "53741008")
    )

    smoke_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "smoke")
          .filter(col("code") == "449868002")
    )

    atrial_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "atrial")
          .filter(col("code") == "49436004")
    )

    df = (
        heartfailure_df.join(hypertension_df, ["patient"], "leftouter").withColumn("has_hypertension", when(col("hypertension") < col("heartfailure"), 1).otherwise(0))
        .join(diabetes_df, ["patient"], "leftouter").withColumn("has_diabetes", when(col("diabetes") < col("heartfailure"), 1).otherwise(0))
        .join(coronary_df, ["patient"], "leftouter").withColumn("has_coronary", when(col("coronary") < col("heartfailure"), 1).otherwise(0))
        .join(smoke_df, ["patient"], "leftouter").withColumn("has_smoke", when(col("smoke") < col("heartfailure"), 1).otherwise(0))
        .join(atrial_df, ["patient"], "leftouter").withColumn("has_atrial", when(col("atrial") < col("heartfailure"), 1).otherwise(0))
    )

    We perform a left outer join to keep all entries in the heart failure dataframe. A new column has_xxx is calculated for each condition other than heart failure based on the condition’s start date. We’re only interested in medical conditions that were recorded prior to the heart failure and use them as features for predicting heart failure.

  2. Add a built-in Manage columns transform to drop the redundant columns that are no longer needed:
    1. hypertension
    2. diabetes
    3. coronary
    4. smoke
    5. atrial
  3. Extract Year and Quarter from the heartfailure column (see the sketch after this list).
    This matches the granularity we used earlier in the transformation of the observations dataset.
  4. We should have a total of 6 steps for conditions.csv.
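
This extraction again uses the built-in Featurize date/time transform. The following PySpark sketch shows the equivalent idea, assuming the heartfailure column holds the condition start date (illustrative only; the built-in transform is what we apply):

    from pyspark.sql.functions import year, quarter

    # Illustrative equivalent of Featurize date/time with Year and Quarter selected
    df = (
        df.withColumn("heartfailure_year", year("heartfailure"))
          .withColumn("heartfailure_quarter", quarter("heartfailure"))
    )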

Join conditions.csv to the joined dataset

We now perform a new join to join the conditions dataset to the joined patients and observations dataset.

  1. Choose Transform: 1st Join.
  2. Choose the plus sign and choose Join.
  3. Choose Steps next to Transform: conditions.csv.
  4. Choose Configure.
  5. For Join Type, choose Left outer.
  6. For Left, choose Id.
  7. For Right, choose patient.
  8. Choose Preview, then choose Add.

Add transforms to the joined datasets

Now that we have all three datasets joined, let’s apply some additional transformations.

  1. Add the following custom transform in PySpark so has_heartfailure becomes our label column:
    from pyspark.sql.functions import col, when
    df = (
        df.withColumn("has_heartfailure", when(col("heartfailure").isNotNull(), 1).otherwise(0))
    )

  2. Add the following custom transformation in PySpark:
    from pyspark.sql.functions import col
    
    df = (
        df.filter(
          (col("has_heartfailure") == 0) | 
          ((col("has_heartfailure") == 1) & ((col("date_year") <= col("heartfailure_year")) | ((col("date_year") == col("heartfailure_year")) & (col("date_quarter") <= col("heartfailure_quarter")))))
        )
    )

    We’re only interested in observations recorded before the heart failure condition is diagnosed and use them as features for predicting heart failure. Observations taken after heart failure is diagnosed may be affected by the medication a patient takes, so we want to exclude them.

  3. Drop the redundant columns that are no longer needed:
    1. Id
    2. DATE_year
    3. DATE_quarter
    4. patient
    5. heartfailure
    6. heartfailure_year
    7. heartfailure_quarter
  4. On the Analysis tab, for Analysis type, choose Table summary.
    A quick scan through the summary shows that the MARITAL column has missing data.
  5. Choose the Data tab and add a step.
  6. Choose Handle Missing.
  7. For Transform, choose Fill missing.
  8. For Input columns, choose MARITAL.
  9. For Fill value, enter S.
    Our strategy here is to assume the patient is single if the marital status is missing. You can choose a different strategy.
  10. Choose Preview, then choose Add.
  11. Fill the missing value as 0 for has_hypertension, has_diabetes, has_coronary, has_smoke, and has_atrial (a Pandas sketch of these fills follows this list).
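
Both fills use the built-in Fill missing transform. For illustration only, the equivalent logic in Pandas looks roughly like this:

    # Illustrative equivalent of the Fill missing steps
    df['MARITAL'] = df['MARITAL'].fillna('S')  # assume single when marital status is missing
    condition_flags = ['has_hypertension', 'has_diabetes', 'has_coronary', 'has_smoke', 'has_atrial']
    df[condition_flags] = df[condition_flags].fillna(0)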

Marital and Gender are categorical variables. Data Wrangler has a built-in function to encode categorical variables.

  1. Add a step and choose Encode categorical.
  2. For Transform, choose One-hot encode.
  3. For Input columns, choose MARITAL.
  4. For Output style, choose Column.
    This output style produces encoded values in separate columns.
  5. Choose Preview, then choose Add.
  6. Repeat these steps for the Gender column.

The one-hot encoding splits the Marital column into Marital_M (married) and Marital_S (single), and splits the Gender column into Gender_M (male) and Gender_F (female). Because Marital_M and Marital_S are mutually exclusive (as are Gender_M and Gender_F), we can drop one column to avoid redundant features.

  1. Drop Marital_S and Gender_F.
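
The encode-and-drop sequence is roughly equivalent to the following Pandas sketch (the exact column names produced by Data Wrangler may differ; this is illustrative only):

    import pandas as pd

    # One-hot encode, then drop one column from each mutually exclusive pair
    df = pd.get_dummies(df, columns=['MARITAL', 'GENDER'])
    df = df.drop(columns=['MARITAL_S', 'GENDER_F'])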

Numeric features such as systolic, heart rate, and age have different unit standards. For a linear regression-based model, we need to normalize these numeric features first. Otherwise, features with higher absolute values may have an unwarranted advantage over features with lower absolute values and result in poor model performance. Data Wrangler has the built-in transform Min-max scaler to normalize the data. For a decision tree-based classification model, normalization isn’t required. Our study is a classification problem, so we don’t need to apply normalization.

Imbalanced classes are a common problem in classification. Imbalance happens when the training dataset contains a severely skewed class distribution. For example, when our dataset contains disproportionally more patients without heart failure than patients with heart failure, it can cause the model to be biased toward predicting no heart failure and perform poorly. Data Wrangler has a built-in function to tackle this problem.
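
For context on the normalization point above, min-max scaling maps each numeric value into the [0, 1] range. The following is a minimal Pandas sketch of the idea, with example column names (we don’t apply it in this flow because our model is tree-based):

    # Min-max scaling sketch (not applied in this data flow)
    for column in ['systolic', 'age']:
        col_min, col_max = df[column].min(), df[column].max()
        df[column] = (df[column] - col_min) / (col_max - col_min)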

  1. Add a custom transform in Pandas to convert data type of columns from “object” type to numeric type:
    import pandas as pd
    df=df.apply(pd.to_numeric)

  2. Choose the Analysis tab.
  3. For Analysis type, choose Histogram.
  4. For X axis, choose has_heartfailure.
  5. Choose Preview.

    It’s obvious that we have an imbalanced class (more data points labeled as no heart failure than data points labeled as heart failure).
  6. Go back to the Data tab. Choose Add step and choose Balance data.
  7. For Target column, choose has_heartfailure.
  8. For Desired ratio, enter 1.
  9. For Transform, choose SMOTE.

    SMOTE stands for Synthetic Minority Over-sampling Technique. It’s a technique that creates new synthetic minority-class instances and adds them to the dataset to reach class balance (see the sketch after this list). For detailed information, refer to SMOTE: Synthetic Minority Over-sampling Technique.
  10. Choose Preview, then choose Add.
  11. Repeat the histogram analysis from the earlier steps (Analysis type Histogram, with has_heartfailure on the X axis). The result is a balanced class.
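
To build intuition for what the built-in Balance data (SMOTE) transform does, the following sketch uses the open-source imbalanced-learn library. This is not the code Data Wrangler runs internally; it only illustrates the technique:

    # Conceptual SMOTE example with imbalanced-learn (illustrative only)
    from imblearn.over_sampling import SMOTE

    X = df.drop(columns=['has_heartfailure'])
    y = df['has_heartfailure']
    X_resampled, y_resampled = SMOTE(sampling_strategy=1.0, random_state=42).fit_resample(X, y)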

Visualize target leakage and feature correlation

Next, we’re going to perform a few visual analyses using Data Wrangler’s rich toolset of advanced ML-supported analysis types. First, we look at target leakage. Target leakage occurs when data in the training dataset is strongly correlated with the target label, but isn’t available in real-world data at inference time.

  1. On the Analysis tab, for Analysis type, choose Target Leakage.
  2. For Problem Type, choose classification.
  3. For Target, choose has_heartfailure.
  4. Choose Preview.

    Based on the analysis, hr is a target leakage. We’ll drop it in a subsequent step. age is also flagged as a target leakage. It’s reasonable to say that a patient’s age will be available at inference time, so we keep age as a feature. Systolic and diastolic are also flagged as likely target leakage. We expect to have these two measurements at inference time, so we keep them as features.
  5. Choose Add to add the analysis.

Then, we look at feature correlation. We want to select features that are correlated with the target but are uncorrelated among themselves.

  1. On the Analysis tab, for Analysis type, choose Feature Correlation.
  2. For Correlation Type, choose linear.
  3. Choose Preview.

The coefficient scores indicate strong correlations between the following pairs:

  • systolic and diastolic
  • bmi and age
  • has_hypertension and has_heartfailure (label)

When features are strongly correlated, the corresponding matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To mitigate the correlation, we can simply remove one feature from each pair. We drop diastolic and bmi, and keep systolic and age, in a subsequent step.
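
If you want to double-check these correlations outside Data Wrangler, a quick Pandas sketch looks like the following (column names as used in this flow; illustrative only):

    # Pearson correlation matrix for a few numeric features and the label
    corr = df[['systolic', 'diastolic', 'bmi', 'age', 'has_heartfailure']].corr(method='pearson')
    print(corr)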

Drop diastolic and bmi columns

Add additional transform steps to drop the hr, diastolic, and bmi columns using the built-in transform.

Generate the Data Quality and Insights Report

AWS recently announced the new Data Quality and Insights Report feature in Data Wrangler. This report automatically verifies data quality and detects abnormalities in your data. Data scientists and data engineers can use this tool to efficiently and quickly apply domain knowledge to process datasets for ML model training. This step is optional. To generate this report on our datasets, complete the following steps:

  1. On the Analysis tab, for Analysis type, choose Data Quality and Insights Report.
  2. For Target column, choose has_heartfailure.
  3. For Problem type, select Classification.
  4. Choose Create.

In a few minutes, it generates a report with a summary, visuals, and recommendations.

Generate a Quick Model analysis

We have completed our data preparation, cleaning, and feature engineering. Data Wrangler has a built-in function that provides a rough estimate of the expected prediction quality and the predictive power of the features in our dataset.

  1. On the Analysis tab, for Analysis type, choose Quick Model.
  2. For Label, choose has_heartfailure.
  3. Choose Preview.

As per our Quick Model analysis, we can see that the feature has_hypertension has the highest feature importance score among all features.

Export the data and train the model

Now let’s export the transformed, ML-ready features to a destination S3 bucket and scale out the entire feature engineering pipeline we have created so far, using the samples, to the full dataset in a distributed fashion.

  1. Choose the plus sign next to the last box in the data flow and choose Add destination.
  2. Choose Amazon S3.
  3. Enter a Dataset name. For Amazon S3 location, choose an S3 bucket, then choose Add Destination.
  4. Choose Create job to launch a distributed PySpark processing job to perform the transformation and output the data to the destination S3 bucket.

    Depending on the size of the datasets, this option lets us easily configure the cluster and scale horizontally in a no-code fashion. We don’t have to worry about partitioning the datasets or managing the cluster and Spark internals. All of this is automatically taken care of for us by Data Wrangler.
  5. On the left pane, choose Next, 2. Configure job.

  6. Then choose Run.

Alternatively, we can also export the transformed output to Amazon S3 via a Jupyter notebook. With this approach, Data Wrangler automatically generates a Jupyter notebook with all the code needed to kick off a processing job that applies the data flow steps (created using a sample) to the larger full dataset, and to use the transformed dataset as features to kick off a training job later. The notebook code can be run with or without changes. Let’s now walk through the steps to get this done via the Data Wrangler UI.

  1. Choose the plus sign next to the last step in the data flow and choose Export to.
  2. Choose Amazon S3 (via Jupyter Notebook).
  3. It automatically opens a new tab with a Jupyter notebook.
  4. In the Jupyter notebook, locate the cell in the (Optional) Next Steps section and change run_optional_steps from False to True.
    The enabled optional steps in the notebook perform the following:

    • Train a model using XGBoost
  5. Go back to the top of the notebook and on the Run menu, choose Run All Cells.

If you use the generated notebook as is, it launches a SageMaker processing job that scales out the processing across two m5.4xlarge instances to process the full dataset in the S3 bucket. You can adjust the number of instances and instance types based on the dataset size and the time you need to complete the job.

Wait until the training job from the last cell is complete. It generates a model in the SageMaker default S3 bucket.

The trained model is ready for deployment for either real-time inference or batch transformation. Note that we used synthetic data to demonstrate the functionalities in Data Wrangler and used the processed data to train the model. Given that the data we used is synthetic, the inference results from the trained model are not meant for real-world medical diagnosis or as a substitute for the judgment of medical practitioners.

You can also directly export your transformed dataset to Amazon S3 by choosing Export at the top of the transform preview page. The direct export option only exports the transformed sample if sampling was enabled during the import. This option is best suited if you’re dealing with smaller datasets. The transformed data can also be ingested directly into a feature store. For more information, refer to Amazon SageMaker Feature Store. The data flow can also be exported as a SageMaker pipeline that can be orchestrated and scheduled as per your requirements. For more information, see Amazon SageMaker Pipelines.

Conclusion

In this post, we showed how to use Data Wrangler to process healthcare data and perform scalable feature engineering in a tool-driven, low-code fashion. We learned how to apply the built-in transformations and analyses wherever needed, combining them with custom transformations to add even more flexibility to our data preparation workflow. We also walked through the different options for scaling out the data flow recipe via distributed processing jobs, and saw how the transformed data can be used to train a model to predict heart failure.

There are many other features in Data Wrangler we haven’t covered in this post. Explore what’s possible in Prepare ML Data with Amazon SageMaker Data Wrangler and learn how to leverage Data Wrangler for your next data science or machine learning project.


About the Authors

Forrest Sun is a Senior Solution Architect with the AWS Public Sector team in Toronto, Canada. He has worked in the healthcare and finance industries for the past two decades. Outside of work, he enjoys camping with his family.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
