How Foxconn built an end-to-end forecasting solution in two months with Amazon Forecast

This is a guest post by Foxconn. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post. 

In their own words, “Established in Taiwan in 1974, Hon Hai Technology Group (Foxconn) is the world’s largest electronics manufacturer. Foxconn is also the leading technological solution provider and it continuously leverages its expertise in software and hardware to integrate its unique manufacturing systems with emerging technologies.” 

At Foxconn, we manufacture some of the most widely used electronics worldwide. Our effectiveness comes from our ability to plan our production and staffing levels weeks in advance, while maintaining the ability to respond to short-term changes. For years, Foxconn has relied on predictable demand in order to properly plan and allocate resources within our factories. However, as the COVID-19 pandemic began, the demand for our products became more volatile. This increased uncertainty impacted our ability to forecast demand and estimate our future staffing needs.

This highlighted a crucial need for us to develop an improved forecasting solution that could be implemented right away. With Amazon Forecast and AWS, our team was able to build a custom forecasting application in only two months. With limited data science experience internally, we collaborated with the Machine Learning Solutions Lab at AWS to identify a solution using Forecast. The service makes AI-powered forecasting algorithms available to non-expert practitioners. Now we have a state-of-the-art solution that has improved demand forecasting accuracy by 8%, saving an estimated $553,000 annually. In this post, I show you how easy it was to use AWS services to build an application that fit our needs.

Forecasting challenges at Foxconn

Our factory in Mexico assembles and ships electronics equipment to all regions in North and South America. Each product has its own seasonal variations and requires different levels of complexity and skill to build. Having individual forecasts for each product is important to understand the mix of skills we need in our workforce. Forecasting short-term demand allows us to staff for daily and weekly production requirements. Long-term forecasts are used to inform hiring decisions aimed at meeting demand in the upcoming months.

Inaccurate demand forecasts can impact our business in several ways, but the most critical impact for us is staffing our factories. Underestimating demand can result in understaffing and require overtime to meet production targets. Overestimating can lead to overstaffing, which is very costly because workers are underutilized. Over- and underestimating carry different costs, and balancing these costs is crucial to our business.

Prior to this effort, we relied on forecasts provided by our customers in order to make these staffing decisions. With the COVID-19 pandemic, our demand became more erratic. This unpredictability caused over- and underestimating of demand to become more common and staffing-related costs to increase. It became clear that we needed to find a better approach to forecasting.

Processing and modeling

Initially, we explored traditional forecasting methods such as ARIMA on our local machines. However, these approaches took a long time to develop, test, and tune for each product. It also required us to maintain a model for each individual product. From this experience, we learned that the new forecasting solution had to be fast, accurate, easy to manage, and scalable. Our team reached out to data scientists at the Amazon Machine Learning (ML) Solutions Lab, who advised and guided us through the process of building our solution around Forecast.

For this solution, we used a 3-year history of daily sales across nine different product categories. We chose these nine categories because they had a long history for the model to train on and exhibited different seasonal buying patterns. To begin, we uploaded the data from our on-premises servers into an Amazon Simple Storage Service (Amazon S3) bucket. After that, we preprocessed the data by removing known anomalies and organizing the data in a format compatible with Forecast. Our final dataset consisted of three columns: timestamp, item_id, and demand.
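The following is a minimal sketch of this kind of preprocessing step. The raw column names are illustrative placeholders rather than our actual schema, and it assumes pandas can read from and write to Amazon S3 directly (via s3fs):

import pandas as pd

# Hypothetical raw export from the on-premises database; the column names
# here are placeholders, not the actual Foxconn schema.
raw = pd.read_csv("s3://my-forecast-bucket/raw/daily_sales.csv")

# Keep only the three columns Forecast expects and rename them.
df = raw.rename(columns={"sale_date": "timestamp",
                         "product_category": "item_id",
                         "units_sold": "demand"})[["timestamp", "item_id", "demand"]]

# Drop known anomalies, for example negative demand caused by corrections.
df = df[df["demand"] >= 0]

# Forecast accepts ISO-8601 style timestamps such as yyyy-MM-dd for daily data.
df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d")

df.to_csv("s3://my-forecast-bucket/processed/target_time_series.csv", index=False)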

For model training, we decided to use the AutoML functionality in Forecast. The AutoML tool tries to fit several different algorithms to the data and tunes each one to obtain the highest accuracy. The AutoML feature was vital for a team like ours with limited background in time-series modeling. It only took a few hours for Forecast to train a predictor. After the service identifies the most effective algorithm, it further tunes that algorithm through hyperparameter optimization (HPO) to get the final predictor. This AutoML capability eliminated weeks of development time that the team would have spent researching, training, and evaluating various algorithms.
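To give a concrete idea of what enabling AutoML looks like with the AWS SDK, the following sketch calls create_predictor with PerformAutoML turned on. The predictor name, forecast horizon, and dataset group ARN are placeholders, not our production values:

import boto3

forecast = boto3.client("forecast")

response = forecast.create_predictor(
    PredictorName="product_demand_automl",
    PerformAutoML=True,                 # let Forecast pick and tune the algorithm
    ForecastHorizon=30,                 # illustrative horizon of 30 days
    InputDataConfig={
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:123456789012:dataset-group/demo"
    },
    FeaturizationConfig={"ForecastFrequency": "D"},  # daily data
)
print(response["PredictorArn"])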

Forecast evaluation

After the AutoML finished training, it output results for a number of key performance metrics, including root mean squared error (RMSE) and weighted quantile loss (wQL). We chose to focus on wQL, which provides probabilistic estimates by evaluating the accuracy of the model’s predictions for different quantiles. A model with low wQL scores was important for our business because we face different costs associated with underestimating and overestimating demand. Based on our evaluations, the best model for our use case was CNN-QR.
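For reference, these metrics can also be retrieved programmatically once training finishes. The following sketch, with a placeholder predictor ARN, prints the wQL values and RMSE for each backtest window:

import boto3

forecast = boto3.client("forecast")

metrics = forecast.get_accuracy_metrics(
    PredictorArn="arn:aws:forecast:us-east-1:123456789012:predictor/product_demand_automl"
)

# Each evaluation result reports the weighted quantile losses and RMSE per test window.
for result in metrics["PredictorEvaluationResults"]:
    for window in result["TestWindows"]:
        print(window["Metrics"]["WeightedQuantileLosses"])
        print(window["Metrics"]["RMSE"])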

We applied an additional evaluation step using a held-out test set. We combined the estimated forecast with internal business logic to evaluate how we would have planned staffing using the new forecast. The results were a resounding success. The new solution improved our forecast accuracy by 8%, saving an estimated $553,000 per year.

Application architecture

At Foxconn, much of our data resides on premises, so our application is a hybrid solution. The application loads the data to AWS from the on-premises server, builds the forecasts, and allows our team to evaluate the output on a client-side GUI.

To ingest the data into AWS, we have a program running on premises that queries the latest data from the on-premises database on a weekly basis. It uploads the data to an S3 bucket via an SFTP server managed by AWS Transfer Family. This upload triggers an AWS Lambda function that performs the data preprocessing and loads the prepared data back into Amazon S3. The preprocessed data being written to the S3 bucket triggers two Lambda functions. The first loads the data from Amazon S3 into an OLTP database. The second starts the Forecast training on the processed data. After the forecast is trained, the results are loaded into a separate S3 bucket and also into the OLTP database. The following diagram illustrates this architecture.

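To give a flavor of the glue between these components, the following is a simplified sketch of the Lambda function that reacts to processed data landing in Amazon S3 and imports it into Forecast. The dataset ARN, role ARN, and the choice to leave predictor retraining to a follow-up step are illustrative assumptions rather than our exact implementation:

import time

import boto3

forecast = boto3.client("forecast")

# Placeholder ARNs for illustration only.
DATASET_ARN = "arn:aws:forecast:us-east-1:123456789012:dataset/demand_ts"
ROLE_ARN = "arn:aws:iam::123456789012:role/ForecastImportRole"


def handler(event, context):
    # The S3 event tells us which processed object was just written.
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Import the fresh data into the Forecast dataset before retraining.
    forecast.create_dataset_import_job(
        DatasetImportJobName=f"import-{int(time.time())}",
        DatasetArn=DATASET_ARN,
        DataSource={"S3Config": {"Path": s3_path, "RoleArn": ROLE_ARN}},
        TimestampFormat="yyyy-MM-dd",
    )
    # In the real pipeline, a follow-up step would wait for the import to
    # finish and then create or update the predictor and forecast.
    return {"imported": s3_path}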

Finally, we wanted a way for customers to review the forecast outputs and provide their own feedback to the system. The team put together a GUI that uses Amazon API Gateway to allow users to visualize and interact with the forecast results in the database. The GUI allows the customer to review the latest forecast and choose a target production for upcoming weeks. The targets are uploaded back to the OLTP database and used in further planning efforts.

Summary and next steps

In this post, we showed how a team new to AWS and data science built a custom forecasting solution with Forecast in two months. The application improved our forecast accuracy by 8%, saving an estimated $553,000 annually for our Mexico facility alone. Using Forecast also gave us the flexibility to scale out if we add new product categories in the future.

We’re thrilled to see the high performance of the Forecast solution using only the historical demand data. This is the first step in a larger plan to expand our use of ML for supply chain management and production planning.

Over the coming months, the team will migrate other planning data and workloads to the cloud. We’ll use the demand forecast in conjunction with inventory, backlog, and worker data to create an optimization solution for labor planning and allocation. These solutions will make the improved forecast even more impactful by allowing us to better plan production levels and resource needs.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program. To learn more about how to use Amazon Forecast, check out the service documentation.


About the Authors

Azim Siddique serves as Technical Advisor and CoE Architect at Foxconn. He provides architectural direction for the Digital Transformation program, conducts PoCs with emerging technologies, and guides engineering teams to deliver business value by leveraging digital technologies at scale.

Felice Chuang is a Data Architect at Foxconn. She uses her diverse skillset to implement end-to-end architecture and design for big data, data governance, and business intelligence applications. She supports analytic workloads and conducts PoCs for Digital Transformation programs.

Yash Shah is a data scientist in the Amazon ML Solutions Lab, where he works on a range of machine learning use cases from healthcare to manufacturing and retail. He has a formal background in Human Factors and Statistics, and was previously part of the Amazon SCOT team designing products to guide 3P sellers with efficient inventory management.

Dan Volk is a Data Scientist at Amazon ML Solutions Lab, where he helps AWS customers across various industries accelerate their AI and cloud adoption. Dan has worked in several fields including manufacturing, aerospace, and sports and holds a Masters in Data Science from UC Berkeley.

Xin Chen is a senior manager at Amazon ML Solutions Lab, where he leads Automotive Vertical and helps AWS customers across different industries identify and build machine learning solutions to address their organization’s highest return-on-investment machine learning opportunities. Xin obtained his Ph.D. in Computer Science and Engineering from the University of Notre Dame.

Read More

Controlling and auditing data exploration activities with Amazon SageMaker Studio and AWS Lake Formation

Highly-regulated industries, such as financial services, are often required to audit all access to their data. This includes auditing exploratory activities performed by data scientists, who usually query data from within machine learning (ML) notebooks.

This post walks you through the steps to implement access control and auditing capabilities on a per-user basis, using Amazon SageMaker Studio notebooks and AWS Lake Formation access control policies. This is a how-to guide based on the Machine Learning Lens for the AWS Well-Architected Framework, following the design principles described in the Security Pillar:

  • Restrict access to ML systems
  • Ensure data governance
  • Enforce data lineage
  • Enforce regulatory compliance

Additional ML governance practices for experiments and models using Amazon SageMaker are described in the whitepaper Machine Learning Best Practices in Financial Services.

Overview of solution

This implementation uses Amazon Athena and the PyAthena client on a Studio notebook to query data on a data lake registered with Lake Formation.
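The following is a minimal sketch of what such a query looks like from a Studio notebook using PyAthena. The staging location is a placeholder, the database and table are the ones created later in this post, and the query runs with whatever permissions Lake Formation grants the notebook’s execution role:

import pandas as pd
from pyathena import connect

# Placeholder staging location for Athena query results.
conn = connect(
    s3_staging_dir="s3://my-athena-query-results/",
    region_name="us-east-1",
)

df = pd.read_sql(
    "SELECT product_title, star_rating "
    "FROM amazon_reviews_db.amazon_reviews_parquet LIMIT 10",
    conn,
)
print(df.head())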

SageMaker Studio is the first fully integrated development environment (IDE) for ML. Studio provides a single, web-based visual interface where you can perform all the steps required to build, train, and deploy ML models. Studio notebooks are collaborative notebooks that you can launch quickly, without setting up compute instances or file storage beforehand.

Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run.

Lake Formation is a fully managed service that makes it easier for you to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes, including securely making that data available for analytics and ML.

For an existing data lake registered with Lake Formation, the following diagram illustrates the proposed implementation.

The workflow includes the following steps:

  1. Data scientists access the AWS Management Console using their AWS Identity and Access Management (IAM) user accounts and open Studio using individual user profiles. Each user profile has an associated execution role, which the user assumes while working on a Studio notebook. The diagram depicts two data scientists that require different permissions over data in the data lake. For example, in a data lake containing personally identifiable information (PII), user Data Scientist 1 has full access to every table in the Data Catalog, whereas Data Scientist 2 has limited access to a subset of tables (or columns) containing non-PII data.
  2. The Studio notebook is associated with a Python kernel. The PyAthena client allows you to run exploratory ANSI SQL queries on the data lake through Athena, using the execution role assumed by the user while working with Studio.
  3. Athena sends a data access request to Lake Formation, with the user profile execution role as principal. Data permissions in Lake Formation offer database-, table-, and column-level access control, restricting access to metadata and the corresponding data stored in Amazon S3. Lake Formation generates short-term credentials to be used for data access, and informs Athena what columns the principal is allowed to access.
  4. Athena uses the short-term credential provided by Lake Formation to access the data lake storage in Amazon S3, and retrieves the data matching the SQL query. Before returning the query result, Athena filters out columns that aren’t included in the data permissions informed by Lake Formation.
  5. Athena returns the SQL query result to the Studio notebook.
  6. Lake Formation records data access requests and other activity history for the registered data lake locations. AWS CloudTrail also records these and other API calls made to AWS during the entire flow, including Athena query requests.

Walkthrough overview

In this walkthrough, I show you how to implement access control and audit using a Studio notebook and Lake Formation. You perform the following activities:

  1. Register a new database in Lake Formation.
  2. Create the required IAM policies, roles, group, and users.
  3. Grant data permissions with Lake Formation.
  4. Set up Studio.
  5. Test Lake Formation access control policies using a Studio notebook.
  6. Audit data access activity with Lake Formation and CloudTrail.

If you prefer to skip the initial setup activities and jump directly to testing and auditing, you can deploy the following AWS CloudFormation template in a Region that supports Studio and Lake Formation:

You can also deploy the template by downloading the CloudFormation template. When deploying the CloudFormation template, you provide the following parameters:

  • User name and password for a data scientist with full access to the dataset. The default user name is data-scientist-full.
  • User name and password for a data scientist with limited access to the dataset. The default user name is data-scientist-limited.
  • Names for the database and table to be created for the dataset. The default names are amazon_reviews_db and amazon_reviews_parquet, respectively.
  • VPC and subnets that are used by Studio to communicate with the Amazon Elastic File System (Amazon EFS) volume associated to Studio.

If you decide to deploy the CloudFormation template, after the CloudFormation stack is complete, you can go directly to the section Testing Lake Formation access control policies in this post.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • A data lake set up in Lake Formation with a Lake Formation Admin. For general guidance on how to set up Lake Formation, see Getting started with AWS Lake Formation.
  • Basic knowledge on creating IAM policies, roles, users, and groups.

Registering a new database in Lake Formation

For this post, I use the Amazon Customer Reviews Dataset to demonstrate how to provide granular access to the data lake for different data scientists. If you already have a dataset registered with Lake Formation that you want to use, you can skip this section and go to Creating required IAM roles and users for data scientists.

To register the Amazon Customer Reviews Dataset in Lake Formation, complete the following steps:

  1. Sign in to the console with the IAM user configured as Lake Formation Admin.
  2. On the Lake Formation console, in the navigation pane, under Data catalog, choose Databases.
  3. Choose Create Database.
  4. In Database details, select Database to create the database in your own account.
  5. For Name, enter a name for the database, such as amazon_reviews_db.
  6. For Location, enter s3://amazon-reviews-pds.
  7. Under Default permissions for newly created tables, make sure to clear the option Use only IAM access control for new tables in this database.

  8. Choose Create database.

The Amazon Customer Reviews Dataset is currently available in TSV and Parquet formats. The Parquet dataset is partitioned on Amazon S3 by product_category. To create a table in the data lake for the Parquet dataset, you can use an AWS Glue crawler or manually create the table using Athena, as described in Amazon Customer Reviews Dataset README file.

  9. On the Athena console, create the table.

If you haven’t specified a query result location before, follow the instructions in Specifying a Query Result Location.

  10. Choose the data source AwsDataCatalog.
  11. Choose the database created in the previous step.
  12. In the Query Editor, enter the following query:
    CREATE EXTERNAL TABLE amazon_reviews_parquet(
      marketplace string, 
      customer_id string, 
      review_id string, 
      product_id string, 
      product_parent string, 
      product_title string, 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine string, 
      verified_purchase string, 
      review_headline string, 
      review_body string, 
      review_date bigint, 
      year int)
    PARTITIONED BY (product_category string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://amazon-reviews-pds/parquet/'

  13. Choose Run query.

You should receive a Query successful response when the table is created.

  14. Enter the following query to load the table partitions:
    MSCK REPAIR TABLE amazon_reviews_parquet

  15. Choose Run query.
  16. On the Lake Formation console, in the navigation pane, under Data catalog, choose Tables.
  17. For Table name, enter a table name.
  18. Verify that you can see the table details.

  19. Scroll down to see the table schema and partitions.

Finally, you register the database location with Lake Formation so the service can start enforcing data permissions on the database.

  20. On the Lake Formation console, in the navigation pane, under Register and ingest, choose Data lake locations.
  21. On the Data lake locations page, choose Register location.
  22. For Amazon S3 path, enter s3://amazon-reviews-pds/.
  23. For IAM role, you can keep the default role.
  24. Choose Register location.

Creating required IAM roles and users for data scientists

To demonstrate how you can provide differentiated access to the dataset registered in the previous step, you first need to create IAM policies, roles, a group, and users. The following diagram illustrates the resources you configure in this section.

In this section, you complete the following high-level steps:

  1. Create an IAM group named DataScientists containing two users: data-scientist-full and data-scientist-limited, to control their access to the console and to Studio.
  2. Create a managed policy named DataScientistGroupPolicy and assign it to the group.

The policy allows users in the group to access Studio, but only using a SageMaker user profile that matches their IAM user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

  3. For each IAM user, create individual IAM roles, which are used as user profile execution roles in Studio later.

The naming convention for these roles consists of a common prefix followed by the corresponding IAM user name. This allows you to audit activities on Studio notebooks—which are logged using Studio’s execution roles—and trace them back to the individual IAM users who performed the activities. For this post, I use the prefix SageMakerStudioExecutionRole_.

  4. Create a managed policy named SageMakerUserProfileExecutionPolicy and assign it to each of the IAM roles.

The policy establishes coarse-grained access permissions to the data lake.

Follow the remainder of this section to create the IAM resources described. The permissions configured in this section grant common, coarse-grained access to data lake resources for all the IAM roles. In a later section, you use Lake Formation to establish fine-grained access permissions to Data Catalog resources and Amazon S3 locations for individual roles.

Creating the required IAM group and users

To create your group and users, complete the following steps:

  1. Sign in to the console using an IAM user with permissions to create groups, users, roles, and policies.
  2. On the IAM console, use the JSON tab to create a new IAM managed policy named DataScientistGroupPolicy.
    1. Use the following JSON policy document to provide permissions, providing your AWS Region and AWS account ID:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "sagemaker:DescribeDomain",
                      "sagemaker:ListDomains",
                      "sagemaker:ListUserProfiles",
                      "sagemaker:ListApps"
                  ],
                  "Resource": "*",
                  "Effect": "Allow"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedDomainUrl",
                      "sagemaker:DescribeUserProfile"
                  ],
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}",
                  "Effect": "Allow"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedDomainUrl",
                      "sagemaker:DescribeUserProfile"
                  ],
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}"
              },
              {
                  "Action": "sagemaker:*App",
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/${aws:username}/*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sagemaker:*App",
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/${aws:username}/*"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedNotebookInstanceUrl",
                      "sagemaker:*NotebookInstance",
                      "sagemaker:*NotebookInstanceLifecycleConfig",
                      "sagemaker:CreateUserProfile",
                      "sagemaker:DeleteDomain",
                      "sagemaker:DeleteUserProfile"
                  ],
                  "Resource": "*",
                  "Effect": "Deny"
              }
          ]
      }

This policy forces an IAM user to open Studio using a SageMaker user profile with the same name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

  3. Create an IAM group.
    1. For Group name, enter DataScientists.
    2. Search and attach the AWS managed policy named DataScientist and the IAM policy created in the previous step.
  4. Create two IAM users named data-scientist-full and data-scientist-limited.

Alternatively, you can provide names of your choice, as long as they’re a combination of lowercase letters, numbers, and hyphen (-). Later, you also give these names to their corresponding SageMaker user profiles, which at the time of writing only support those characters.
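If you prefer to script these steps, the following boto3 sketch creates the group and users. It assumes the DataScientistGroupPolicy from the previous step already exists (the account ID in its ARN is a placeholder) and omits console-only details such as setting console passwords and attaching the AWS managed DataScientist policy:

import boto3

iam = boto3.client("iam")

# Placeholder account ID in the customer managed policy ARN.
group_policy_arn = "arn:aws:iam::123456789012:policy/DataScientistGroupPolicy"

iam.create_group(GroupName="DataScientists")
iam.attach_group_policy(GroupName="DataScientists", PolicyArn=group_policy_arn)

for user_name in ["data-scientist-full", "data-scientist-limited"]:
    iam.create_user(UserName=user_name)
    iam.add_user_to_group(GroupName="DataScientists", UserName=user_name)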

Creating the required IAM roles

To create your roles, complete the following steps:

  1. On the IAM console, create a new managed policy named SageMakerUserProfileExecutionPolicy.
    1. Use the following policy code:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "lakeformation:GetDataAccess",
                      "glue:GetTable",
                      "glue:GetTables",
                      "glue:SearchTables",
                      "glue:GetDatabase",
                      "glue:GetDatabases",
                      "glue:GetPartitions"
                  ],
                  "Resource": "*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sts:AssumeRole",
                  "Resource": "*",
                  "Effect": "Deny"
              }
          ]
      }

This policy provides common coarse-grained IAM permissions to the data lake, leaving Lake Formation permissions to control access to Data Catalog resources and Amazon S3 locations for individual users and roles. This is the recommended method for granting access to data in Lake Formation. For more information, see Methods for Fine-Grained Access Control.

  2. Create an IAM role for the first data scientist (data-scientist-full), which is used as the corresponding user profile’s execution role.
    1. On the Attach permissions policy page, search and attach the AWS managed policy AmazonSageMakerFullAccess.
    2. For Role name, use the naming convention introduced at the beginning of this section to name the role SageMakerStudioExecutionRole_data-scientist-full.
  3. To add the remaining policies, on the Roles page, choose the role name you just created.
  4. Under Permissions, choose Attach policies.
  5. Search and select the SageMakerUserProfileExecutionPolicy and AmazonAthenaFullAccess policies.
  6. Choose Attach policy.
  7. To restrict the Studio resources that can be created within Studio (such as image, kernel, or instance type) to only those belonging to the user profile associated to the first IAM role, embed an inline policy in the IAM role.
    1. Use the following JSON policy document to scope down permissions for the user profile, providing the Region, account ID, and IAM user name associated to the first data scientist (data-scientist-full). You can name the inline policy DataScientist1IAMRoleInlinePolicy.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": "sagemaker:*App",
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/<IAMUSERNAME>/*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sagemaker:*App",
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/<IAMUSERNAME>/*"
              }
          ]
      }

  8. Repeat the previous steps to create an IAM role for the second data scientist (data-scientist-limited).
    1. Name the role SageMakerStudioExecutionRole_data-scientist-limited and the second inline policy DataScientist2IAMRoleInlinePolicy.

Granting data permissions with Lake Formation

Before data scientists are able to work on a Studio notebook, you grant the individual execution roles created in the previous section access to the Amazon Customer Reviews Dataset (or your own dataset). For this post, we implement different data permission policies for each data scientist to demonstrate how to grant granular access using Lake Formation.

  1. Sign in to the console with the IAM user configured as Lake Formation Admin.
  2. On the Lake Formation console, in the navigation pane, choose Tables.
  3. On the Tables page, select the table you created earlier, such as amazon_reviews_parquet.
  4. On the Actions menu, under Permissions, choose Grant.
  5. Provide the following information to grant full access to the Amazon Customer Reviews Dataset table for the first data scientist:
  6. Select My account.
  7. For IAM users and roles, choose the execution role associated to the first data scientist, such as SageMakerStudioExecutionRole_data-scientist-full.
  8. For Table permissions and Grantable permissions, select Select.
  9. Choose Grant.
  10. Repeat the first step to grant limited access to the dataset for the second data scientist, providing the following information:
  11. Select My account.
  12. For IAM users and roles, choose the execution role associated to the second data scientist, such as SageMakerStudioExecutionRole_data-scientist-limited.
  13. For Columns, choose Include columns.
  14. Choose a subset of columns, such as: product_category, product_id, product_parent, product_title, star_rating, review_headline, review_body, and review_date.
  15. For Table permissions and Grantable permissions, select Select.
  16. Choose Grant.
  17. To verify the data permissions you have granted, on the Lake Formation console, in the navigation pane, choose Tables.
  18. On the Tables page, select the table you created earlier, such as amazon_reviews_parquet.
  19. On the Actions menu, under Permissions, choose View permissions to open the Data permissions menu.

You see a list of permissions granted for the table, including the permissions you just granted and permissions for the Lake Formation Admin.

If you see the principal IAMAllowedPrincipals listed on the Data permissions menu for the table, you must remove it. Select the principal and choose Revoke. On the Revoke permissions page, choose Revoke.
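The grants you just made through the console can also be scripted with the AWS SDK. The following sketch reproduces the column-level grant for the second data scientist using boto3; the account ID is a placeholder, and the database, table, and column names match the ones used earlier in this walkthrough:

import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/SageMakerStudioExecutionRole_data-scientist-limited"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "amazon_reviews_db",
            "Name": "amazon_reviews_parquet",
            "ColumnNames": [
                "product_category", "product_id", "product_parent",
                "product_title", "star_rating", "review_headline",
                "review_body", "review_date",
            ],
        }
    },
    Permissions=["SELECT"],
)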

Setting up SageMaker Studio

You now onboard to Studio and create two user profiles, one for each data scientist.

When you onboard to Studio using IAM authentication, Studio creates a domain for your account. A domain consists of a list of authorized users, configuration settings, and an Amazon EFS volume, which contains data for the users, including notebooks, resources, and artifacts.

Each user receives a private home directory within Amazon EFS for notebooks, Git repositories, and data files. All traffic between the domain and the Amazon EFS volume is communicated through specified subnet IDs. By default, all other traffic goes over the internet through a SageMaker system Amazon Virtual Private Cloud (Amazon VPC).

Alternatively, instead of using the default SageMaker internet access, you could secure how Studio accesses resources by assigning a private VPC to the domain. This is beyond the scope of this post, but you can find additional details in Securing Amazon SageMaker Studio connectivity using a private VPC.

If you already have a Studio domain running, you can skip the onboarding process and follow the steps to create the SageMaker user profiles.

Onboarding to Studio

To onboard to Studio, complete the following steps:

  1. Sign in to the console using an IAM user with service administrator permissions for SageMaker.
  2. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  3. On the Studio menu, under Get started, choose Standard setup.
  4. For Authentication method, choose AWS Identity and Access Management (IAM).
  5. Under Permission, for Execution role for all users, choose an option from the role selector.

You’re not using this execution role for the SageMaker user profiles that you create later. If you choose Create a new role, the Create an IAM role dialog opens.

  6. For S3 buckets you specify, choose None.
  7. Choose Create role.

SageMaker creates a new IAM role named AmazonSageMaker-ExecutionPolicy with the AmazonSageMakerFullAccess policy attached.

  8. Under Network and storage, for VPC, choose the private VPC that is used for communication with the Amazon EFS volume.
  9. For Subnet(s), choose multiple subnets in the VPC from different Availability Zones.
  10. Choose Submit.
  11. On the Studio Control Panel, under Studio Summary, wait for the status to change to Ready and the Add user button to be enabled.

Creating the SageMaker user profiles

To create your SageMaker user profiles, complete the following steps:

  1. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  2. On the Studio Control Panel, choose Add user.
  3. For User name, enter data-scientist-full.
  4. For Execution role, choose Enter a custom IAM role ARN.
  5. Enter arn:aws:iam::<AWSACCOUNT>:role/SageMakerStudioExecutionRole_data-scientist-full, providing your AWS account ID.
  6. After creating the first user profile, repeat the previous steps to create a second user profile.
    1. For User name, enter data-scientist-limited.
    2. For Execution role, enter the associated IAM role ARN.

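For reference, the equivalent of these console steps with boto3 looks roughly like the following; the domain ID and account ID are placeholders:

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_user_profile(
    DomainId="d-xxxxxxxxxxxx",  # placeholder Studio domain ID
    UserProfileName="data-scientist-limited",
    UserSettings={
        "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerStudioExecutionRole_data-scientist-limited"
    },
)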

Testing Lake Formation access control policies

You now test the implemented Lake Formation access control policies by opening Studio using both user profiles. For each user profile, you run the same Studio notebook containing Athena queries. You should see different query outputs for each user profile, matching the data permissions implemented earlier.

  1. Sign in to the console with IAM user data-scientist-full.
  2. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  3. On the Studio Control Panel, choose user name data-scientist-full.
  4. Choose Open Studio.
  5. Wait for SageMaker Studio to load.

Due to the IAM policies attached to the IAM user, you can only open Studio with a user profile matching the IAM user name.

  6. In Studio, on the top menu, under File, under New, choose Terminal.
  7. At the command prompt, run the following command to import a sample notebook to test Lake Formation data permissions:
    git clone https://github.com/aws-samples/amazon-sagemaker-studio-audit.git

  8. In the left sidebar, choose the file browser icon.
  9. Navigate to amazon-sagemaker-studio-audit.
  10. Open the notebook folder.
  11. Choose sagemaker-studio-audit-control.ipynb to open the notebook.
  12. In the Select Kernel dialog, choose Python 3 (Data Science).
  13. Choose Select.
  14. Wait for the kernel to load.

  15. Starting from the first code cell in the notebook, press Shift + Enter to run the code cell.
  16. Continue running all the code cells, waiting for the previous cell to finish before running the following cell.

After running the last SELECT query, because the user has full SELECT permissions for the table, the query output includes all the columns in the amazon_reviews_parquet table.

  17. On the top menu, under File, choose Shut Down.
  18. Choose Shutdown All to shut down all the Studio apps.
  19. Close the Studio browser tab.
  20. Repeat the previous steps in this section, this time signing in as the user data-scientist-limited and opening Studio with this user.
  21. Don’t run the code cell in the section Create S3 bucket for query output files.

For this user, after running the same SELECT query in the Studio notebook, the query output only includes a subset of columns for the amazon_reviews_parquet table.

Auditing data access activity with Lake Formation and CloudTrail

In this section, we explore the events associated to the queries performed in the previous section. The Lake Formation console includes a dashboard where it centralizes all CloudTrail logs specific to the service, such as GetDataAccess. These events can be correlated with other CloudTrail events, such as Athena query requests, to get a complete view of the queries users are running on the data lake.

Alternatively, instead of filtering individual events in Lake Formation and CloudTrail, you could run SQL queries to correlate CloudTrail logs using Athena. Such integration is beyond the scope of this post, but you can find additional details in Using the CloudTrail Console to Create an Athena Table for CloudTrail Logs and Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena.

Auditing data access activity with Lake Formation

To review activity in Lake Formation, complete the following steps:

  1. Sign out of the AWS account.
  2. Sign in to the console with the IAM user configured as Lake Formation Admin.
  3. On the Lake Formation console, in the navigation pane, choose Dashboard.

Under Recent access activity, you can find the events associated to the data access for both users.

  4. Choose the most recent event with event name GetDataAccess.
  5. Choose View event.

Among other attributes, each event includes the following:

  • Event date and time
  • Event source (Lake Formation)
  • Athena query ID
  • Table being queried
  • IAM user embedded in the Lake Formation principal, based on the chosen role name convention

Auditing data access activity with CloudTrail

To review activity in CloudTrail, complete the following steps:

  1. On the CloudTrail console, in the navigation pane, choose Event history.
  2. In the Event history menu, for Filter, choose Event name.
  3. Enter StartQueryExecution.
  4. Expand the most recent event, then choose View event.

This event includes additional parameters that are useful to complete the audit analysis, such as the following:

  • Event source (Athena).
  • Athena query ID, matching the query ID from Lake Formation’s GetDataAccess event.
  • Query string.
  • Output location. The query output is stored in CSV format in this Amazon S3 location. Files for each query are named using the query ID.

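You can also pull these events programmatically. The following sketch uses the boto3 CloudTrail client to list recent StartQueryExecution events; parsing each event’s CloudTrailEvent payload for the query string and query ID is left as a follow-up step:

import boto3

cloudtrail = boto3.client("cloudtrail")

# Retrieve recent Athena query submissions recorded by CloudTrail.
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "StartQueryExecution"}
    ],
    MaxResults=10,
)
for event in events["Events"]:
    print(event["EventTime"], event["Username"])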

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

If you followed this walkthrough using the CloudFormation template, after shutting down the Studio apps for each user profile, deleting the stack deletes the remaining resources.

If you encounter any errors, open the Studio Control Panel and verify that all the apps for every user profile are in Deleted state before deleting the stack.

If you didn’t use the CloudFormation template, you can manually delete the resources you created:

  1. On the Studio Control Panel, for each user profile, choose User Details.
  2. Choose Delete user.
  3. When all users are deleted, choose Delete Studio.
  4. On the Amazon EFS console, delete the volume that was automatically created for Studio.
  5. On the Lake Formation console, delete the table and the database created for the Amazon Customer Reviews Dataset.
  6. Remove the data lake location for the dataset.
  7. On the IAM console, delete the IAM users, group, and roles created for this walkthrough.
  8. Delete the policies you created for these principals.
  9. On the Amazon S3 console, empty and delete the bucket created for storing Athena query results (starting with sagemaker-audit-control-query-results-), and the bucket created by Studio to share notebooks (starting with sagemaker-studio-).

Conclusion

This post described how to implement access control and auditing capabilities on a per-user basis in ML projects, using Studio notebooks, Athena, and Lake Formation to enforce access control policies when performing exploratory activities in a data lake.

I thank you for following this walkthrough and I invite you to implement it using the associated CloudFormation template. You’re also welcome to visit the GitHub repo for the project.


About the Author

Rodrigo Alarcon is a Sr. Solutions Architect with AWS based out of Santiago, Chile. Rodrigo has over 10 years of experience in IT security and network infrastructure. His interests include machine learning and cybersecurity.

Read More

Building and deploying an object detection computer vision application at the edge with AWS Panorama

Computer vision (CV) is a sought-after technology among companies looking to take advantage of machine learning (ML) to improve their business processes. Enterprises have access to large amounts of video assets from their existing cameras, but the data remains largely untapped without the right tools to gain insights from it. CV provides the tools to unlock opportunities with this data, so you can automate processes that typically require visual inspection, such as evaluating manufacturing quality or identifying bottlenecks in industrial processes. You can take advantage of CV models running in the cloud to automate these inspection tasks, but there are circumstances when relying exclusively on the cloud isn’t optimal due to latency requirements or intermittent connectivity that make a round trip to the cloud infeasible.

AWS Panorama enables you to bring CV to on-premises cameras and make predictions locally with high accuracy and low latency. On the AWS Panorama console, you can easily bring custom trained models to the edge and build applications that integrate with custom business logic. You can then deploy these applications on the AWS Panorama Appliance, which auto-discovers existing IP cameras and runs the applications on video streams to make real-time predictions. You can easily integrate the inference results with other AWS services such as Amazon QuickSight to derive ML-powered business intelligence (BI) or route the results to your on-premises systems to trigger an immediate action.

Sign up for the preview to learn more and start building your own CV applications.

In this post, we look at how you can use AWS Panorama to build and deploy a parking lot car counter application.

Parking lot car counter application

Parking facilities, like the one in the image below, need to know how many cars are parked in a given facility at any point of time, to assess vacancy and intake more customers. You also want to keep track of the number of cars that enter and exit your facility during any given time. You can use this information to improve operations, such as adding more parking payment centers, optimizing price, directing cars to different floors, and more. Parking center owners typically operate more than one facility and are looking for real-time aggregate details of vacancy in order to direct traffic to less-populated facilities and offer real-time discounts.

To achieve these goals, parking centers sometimes manually count the cars to provide a tally. This inspection can be error prone and isn’t optimal for capturing real-time data. Some parking facilities install sensors that give the number of cars in a particular lot, but these sensors are typically not integrated with analytics systems to derive actionable insights.

With the AWS Panorama Appliance, you can get a real-time count of the number of cars, collect metrics across sites, and correlate them to improve your operations. Let’s see how we can solve this once manual (and expensive) problem using CV at the edge. We go through the details of the trained model and the business logic code, and walk through the steps to create and deploy an application on your AWS Panorama Appliance Developer Kit so you can view the inferences on a connected HDMI screen.

Computer vision model

A CV model helps us extract useful information from images and video frames. We can detect and localize objects in a scene, identify and classify images, and recognize actions. You can choose from a variety of frameworks such as TensorFlow, MXNet, and PyTorch to build your CV models, or you can choose from a variety of pre-trained models available from AWS or from third parties such as ISVs.

For this example, we use a pre-trained GluonCV model downloaded from the GluonCV model zoo.

The model we use is the ssd_512_resnet50_v1_voc model. It’s trained on the very popular PASCAL VOC dataset. It has 20 classes of objects annotated and labeled for a model to be trained on. The following code shows the classes and their indexes.

voc_classes = {
	'aeroplane'		: 0,
	'bicycle'		: 1,
	'bird'			: 2,
	'boat'			: 3,
	'bottle'		: 4,
	'bus'			: 5,
	'car'			: 6,
	'cat'			: 7,
	'chair'			: 8,
	'cow'			: 9,
	'diningtable'	: 10,
	'dog'			: 11,
	'horse'			: 12,
	'motorbike'		: 13,
	'person'		: 14,
	'pottedplant'	: 15,
	'sheep'			: 16,
	'sofa'			: 17,
	'train'			: 18,
	'tvmonitor'		: 19
}


For our use case, we’re detecting and counting cars. Because we’re talking about cars, we use class 6 as the index in our business logic later in this post.

Our input image shape is [1, 3, 512, 512]. These are the dimensions of the input image the model expects to be given:

  • Batch size – 1
  • Number of channels – 3
  • Width and height of the input image – 512, 512

Uploading the model artifacts

We need to upload the model artifacts to an Amazon Simple Storage Service (Amazon S3) bucket. The bucket name must begin with aws-panorama-. After downloading the model artifacts, we upload the ssd_512_resnet50_v1_voc.tar.gz file to the S3 bucket. To create your bucket, complete the following steps:

  1. Download the model artifacts.
  2. On the Amazon S3 console, choose Create bucket.
  3. For Bucket name, enter a name starting with aws-panorama-.

  4. Choose Create bucket.

You can view the object details in the Object overview section. The model URI is s3://aws-panorama-models-bucket/ssd_512_resnet50_v1_voc.tar.gz.
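If you prefer to script the upload, the following boto3 sketch creates the bucket and uploads the artifact. The bucket name is illustrative, and in Regions other than us-east-1 you would also pass a CreateBucketConfiguration:

import boto3

s3 = boto3.client("s3")

# Bucket names for AWS Panorama must begin with "aws-panorama-".
bucket_name = "aws-panorama-models-bucket"

s3.create_bucket(Bucket=bucket_name)
s3.upload_file(
    "ssd_512_resnet50_v1_voc.tar.gz",  # local path to the downloaded artifacts
    bucket_name,
    "ssd_512_resnet50_v1_voc.tar.gz",
)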

The business logic code

After we upload the model artifacts to an S3 bucket, let’s turn our attention to the business logic code. For more information about the sample developer code, see Sample application code. For a comparative example of code samples, see AWS Panorama People Counter Example on GitHub.

Before we look at the full code, let’s look at a skeleton of the business logic code we use:

### Lambda skeleton

class car_counter(object):
    def interface(self):
        # defines the parameters that interface with other services from Panorama
        return

    def init(self, parameters, inputs, outputs):
        # defines the attributes such as arrays and model objects that will be used in the application
        return

    def entry(self, inputs, outputs):
        # defines the application logic responsible for predicting using the inputs and handles what to do
        # with the outputs
        return

The business logic code in the AWS Lambda function must include at least the interface, init, and entry methods.

Let’s go through the Python business logic code next.

import panoramasdk
import cv2
import numpy as np
import time
import boto3

# Global Variables 

HEIGHT = 512
WIDTH = 512

class car_counter(panoramasdk.base):
    
    def interface(self):
        return {
                "parameters":
                (
                    ("float", "threshold", "Detection threshold", 0.10),
                    ("model", "car_counter", "Model for car counting", "ssd_512_resnet50_v1_voc"), 
                    ("int", "batch_size", "Model batch size", 1),
                    ("float", "car_index", "car index based on dataset used", 6),
                ),
                "inputs":
                (
                    ("media[]", "video_in", "Camera input stream"),
                ),
                "outputs":
                (
                    ("media[video_in]", "video_out", "Camera output stream"),
                    
                ) 
            }
    
            
    def init(self, parameters, inputs, outputs):  
        try:  
            
            print('Loading Model')
            self.model = panoramasdk.model()
            self.model.open(parameters.car_counter, 1)
            print('Model Loaded')
            
            # Detection probability threshold.
            self.threshold = parameters.threshold
            # Frame Number Initialization
            self.frame_num = 0
            # Number of cars
            self.number_cars = 0
            # Bounding Box Colors
            self.colours = np.random.rand(32, 3)
            # Car Index for Model from parameters
            self.car_index = parameters.car_index
            # Set threshold for model from parameters 
            self.threshold = parameters.threshold
                        
            class_info = self.model.get_output(0)
            prob_info = self.model.get_output(1)
            rect_info = self.model.get_output(2)

            self.class_array = np.empty(class_info.get_dims(), dtype=class_info.get_type())
            self.prob_array = np.empty(prob_info.get_dims(), dtype=prob_info.get_type())
            self.rect_array = np.empty(rect_info.get_dims(), dtype=rect_info.get_type())

            return True
        
        except Exception as e:
            print("Exception: {}".format(e))
            return False

    def preprocess(self, img, size):
        
        resized = cv2.resize(img, (size, size))
        mean = [0.485, 0.456, 0.406]  # RGB
        std = [0.229, 0.224, 0.225]  # RGB
        
        # converting array of ints to floats
        img = resized.astype(np.float32) / 255. 
        img_a = img[:, :, 0]
        img_b = img[:, :, 1]
        img_c = img[:, :, 2]
        
        # Extracting single channels from 3 channel image
        # The above code could also be replaced with cv2.split(img)
        # normalizing per channel data:
        
        img_a = (img_a - mean[0]) / std[0]
        img_b = (img_b - mean[1]) / std[1]
        img_c = (img_c - mean[2]) / std[2]
        
        # putting the 3 channels back together:
        x1 = [[[], [], []]]
        x1[0][0] = img_a
        x1[0][1] = img_b
        x1[0][2] = img_c
        x1 = np.asarray(x1)
        
        return x1
    
    def get_number_cars(self, class_data, prob_data):
        
        # get indices of car detections in class data
        car_indices = [i for i in range(len(class_data)) if int(class_data[i]) == self.car_index]
        # use these indices to filter out anything that is less than self.threshold
        prob_car_indices = [i for i in car_indices if prob_data[i] >= self.threshold]
        return prob_car_indices

    
    def entry(self, inputs, outputs):        
        for i in range(len(inputs.video_in)):
            stream = inputs.video_in[i]
            car_image = stream.image

            # Pre Process Frame
            x1 = self.preprocess(car_image, 512)
                                    
            # Do inference on the new frame.
            
            self.model.batch(0, x1)        
            self.model.flush()
            
            # Get the results.            
            resultBatchSet = self.model.get_result()
            class_batch = resultBatchSet.get(0)
            prob_batch = resultBatchSet.get(1)
            rect_batch = resultBatchSet.get(2)

            class_batch.get(0, self.class_array)
            prob_batch.get(1, self.prob_array)
            rect_batch.get(2, self.rect_array)

            class_data = self.class_array[0]
            prob_data = self.prob_array[0]
            rect_data = self.rect_array[0]
            
            
            # Get Indices of classes that correspond to Cars
            car_indices = self.get_number_cars(class_data, prob_data)
            
            try:
                self.number_cars = len(car_indices)
            except:
                self.number_cars = 0
            
            # Visualize with Opencv or stream.(media) 
            
            # Draw Bounding boxes on HDMI output
            if self.number_cars > 0:
                for index in car_indices:
                    
                    left = np.clip(rect_data[index][0] / np.float(HEIGHT), 0, 1)
                    top = np.clip(rect_data[index][1] / np.float(WIDTH), 0, 1)
                    right = np.clip(rect_data[index][2] / np.float(HEIGHT), 0, 1)
                    bottom = np.clip(rect_data[index][3] / np.float(WIDTH), 0, 1)
                    
                    stream.add_rect(left, top, right, bottom)
                    stream.add_label(str(prob_data[index][0]), right, bottom) 
                    
            stream.add_label('Number of Cars : {}'.format(self.number_cars), 0.8, 0.05)
        
            self.model.release_result(resultBatchSet)            
            outputs.video_out[i] = stream
        return True


def main():
        
    car_counter().run()
main()

For a full explanation of the code and the methods used, see the AWS Panorama Developer Guide.

The code has the following notable features:

  • car_index – 6
  • model_used – ssd_512_resnet50_v1_voc (parameters.car_counter)
  • add_label – Adds text to the HDMI output
  • add_rect – Adds bounding boxes around the object of interest
  • Image – Gets the NumPy array of the frame read from the camera

Now that we have the code ready, we need to create a Lambda function with the preceding code.

  1. On the Lambda console, choose Functions.
  2. Choose Create function.
  3. For Function name, enter a name.
  4. Choose Create function.

  5. Rename the Python file to car_counter.py.

  6. Change the handler to car_counter_main.

  7. In the Basic settings section, confirm that the memory is 2048 MB and the timeout is 2 minutes.

  8. On the Actions menu, choose Publish new version.

We’re now ready to create our application and deploy to the device. We use the model we uploaded and the Lambda function we created in the subsequent steps.

Creating the application

To create your application, complete the following steps:

  1. On the AWS Panorama console, choose My applications.
  2. Choose Create application.

  3. Choose Begin creation.

  4. For Name, enter car_counter.
  5. For Description, enter an optional description.
  6. Choose Next.

  7. Choose Choose model.

  8. For Model artifact path, enter the model S3 URI.
  9. For Model name, enter the same name that you used in the business logic code.
  10. In the Input configuration section, choose Add input.
  11. For Input name, enter the input Tensor name (for this post, data).
  12. For Shape, enter the frame shape (for this post, 1, 3, 512, 512).

  13. Choose Next.
  14. Under Lambda functions, select your function (CarCounter).

  15. Choose Next.
  16. Choose Proceed to deployment.

Deploying your application

To deploy your new application, complete the following steps:

  1. Choose Choose appliance.

  2. Choose the appliance you created.
  3. Choose Choose camera streams.

  4. Select your camera stream.

  5. Choose Deploy.

Checking the output

After we deploy the application, we can check the HDMI output or use Amazon CloudWatch Logs. For more information, see Setting up the AWS Panorama Appliance Developer Kit or Viewing AWS Panorama event logs in CloudWatch Logs, respectively.

If we have an HDMI output connected to the device, we should see the output from the device on the HDMI screen, as in the following screenshot.

And that’s it. We have successfully deployed a car counting use case to the AWS Panorama Appliance.

Extending the solution

We can do so much more with this application and extend it to other parking-related use cases, such as the following:

  • Parking lot routing – Where are the vacant parking spots?
  • Parking lot monitoring – Are cars parked in appropriate spots? Are they too close to each other?

You can integrate these use cases with other AWS services such as Amazon QuickSight, Amazon S3, and AWS IoT Core (for MQTT messaging), to name a few, and get real-time inference data for monitoring cars in a parking lot.

You can adapt this example and build other object detection applications for your use case. We will also continue to share more examples with you so you can build, develop, and test with the AWS Panorama Appliance Developer Kit.

Conclusion

The applications of computer vision at the edge are only now being imagined and built out. As a data scientist, I’m very excited to be innovating in lockstep with AWS Panorama customers to help you ideate and build CV models that are uniquely tailored to solve your problems.

And we’re just scratching the surface of what’s possible with CV at the edge and the AWS Panorama ecosystem.

Resources

For more information about using AWS Panorama, see the AWS Panorama Developer Guide.
About the Author

Surya Kari is a Data Scientist who works on AI devices within AWS. His interests lie in computer vision and autonomous systems.

Read More

Population health applications with Amazon HealthLake – Part 1: Analytics and monitoring using Amazon QuickSight


Healthcare has recently been transformed by two remarkable innovations: medical interoperability and machine learning (ML). Medical interoperability refers to the ability to share healthcare information across multiple systems. To take advantage of these transformations, at re:Invent 2020 we launched Amazon HealthLake, a new HIPAA-eligible healthcare service that is now in preview. In the re:Invent announcement, we talk about how HealthLake enables organizations to structure, tag, index, query, and apply ML to analyze health data at scale. In a series of posts, starting with this one, we show you how to use HealthLake to derive insights or ask new questions of your health data using advanced analytics.

The primary source of healthcare data is patient electronic health records (EHRs). Health Level Seven International (HL7), a non-profit standards development organization, announced a standard for exchanging structured medical data called the Fast Healthcare Interoperability Resources (FHIR). FHIR is widely supported by healthcare software vendors, and EHR vendors announced their support for it at an American Medical Informatics Association meeting. The FHIR specification makes structured medical data easily accessible to clinical researchers and informaticians, and also makes it easy for ML tools to process this data and extract valuable information from it. For example, FHIR provides a resource to capture documents, such as doctor’s notes or lab report summaries. However, this data needs to be extracted and transformed before it can be searched and analyzed.

As the FHIR-formatted medical data is ingested, HealthLake uses natural language processing trained to understand medical terminology to enrich unstructured data with standardized labels (such as for medications, conditions, diagnoses, and procedures), so all this information can be normalized and easily searched. One example is parsing clinical narratives in the FHIR DocumentReference resource to extract, tag, and structure the medical entities, including ICD-10-CM codes. This transformed data is then added to the patient’s record, providing a complete view of all of the patient’s attributes (such as medications, tests, procedures, and diagnoses) that is optimized for search and applying advanced analytics. In this post, we walk you through the process of creating a population health dashboard on this enriched data, using AWS Glue, Amazon Athena, and Amazon QuickSight.

Building a population health dashboard

After HealthLake extracts and tags the FHIR-formatted data, you can use advanced analytics and ML with your now normalized data to make sense of it all. Next, we walk through using QuickSight to build a population health dashboard to quickly analyze data from HealthLake. The following diagram illustrates the solution architecture.

In this example, we build a dashboard for patients diagnosed with congestive heart failure (CHF), a chronic medical condition in which the heart doesn’t pump blood as well as it should. We use the MIMIC-III (Medical Information Mart for Intensive Care III) data, a large, freely available database comprising de-identified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. [1]

The tools used for processing the data and building the dashboard include AWS Glue, Athena, and QuickSight. AWS Glue is a serverless data preparation service that makes it easy to extract, transform, and load (ETL) data, in order to prepare the data for subsequent analytical processing and presentation in charts and dashboards. An AWS Glue crawler is a program that determines the schema of data and creates a metadata table in the AWS Glue Data Catalog that describes the data schema. An AWS Glue job encapsulates a script that reads, processes, and writes data to a new schema. Finally, we use Athena, an interactive query service that can query data in Amazon Simple Storage Service (Amazon S3) using standard SQL queries on tables in a Data Catalog.

Connecting Athena with HealthLake

We first convert the MIMIC-III data to FHIR format and then copy the formatted data into a data store in HealthLake, which extracts medical entities from textual narratives such as doctors’ notes and discharge summaries. The clinical notes are stored in the DocumentReference resource, where the extracted entities are tagged to each patient’s record with FHIR extension fields represented in the JSON object. The following screenshot is an example of how the augmented DocumentReference resource looks.

Now that the data is indexed and tagged in HealthLake, we export the normalized data to an S3 bucket. The exported data is in NDJSON format, with one folder per resource.

An AWS Glue crawler is configured for each folder to crawl the NDJSON files and create a table in the Data Catalog. Because the default classifiers work with NDJSON files directly, no special classifiers are needed. There is one crawler per FHIR resource, and each crawler creates one table. These tables can then be queried directly from Athena; however, for some queries, we use AWS Glue jobs to transform and partition the data to make the queries simpler and faster.
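Before moving on to the jobs, here is a reference sketch of how one of these crawlers can be created with boto3; the IAM role and S3 path are placeholders you need to replace with your export location.

import boto3

glue = boto3.client("glue")

# One crawler per FHIR resource folder; this example targets DocumentReference.
glue.create_crawler(
    Name="documentreference-crawler",
    Role="AWSGlueServiceRole-HealthLake",  # placeholder IAM role
    DatabaseName="healthai_mimic",         # Data Catalog database used in the queries below
    Targets={"S3Targets": [{"Path": "s3://my-healthlake-export/DocumentReference/"}]},  # placeholder path
)
glue.start_crawler(Name="documentreference-crawler")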

We create two AWS Glue jobs for this project to transform the DocumentReference and Condition tables. Both jobs transform the data from JSON to Apache Parquet, to improve query performance and reduce data storage and scanning costs. In addition, both jobs partition the data by patient first, and then by the identity of the individual FHIR resources. This improves the performance of patient- and record-based queries issued through Athena. The resulting Parquet files are tabular in structure, which also simplifies queries issued via clients, because they can reference detected entities and ICD-10 codes directly, and no longer need to navigate the nested FHIR structure of the DocumentReference extension element. After these jobs create the Parquet files in Amazon S3, we create and run crawlers to add the table schema into the Data Catalog.
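The following is a simplified sketch of what such a job script can look like, assuming the crawler registered a documentreference table in the healthai_mimic database and that a patient identifier column is available to partition on; the output bucket path and partition column names are placeholders.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the NDJSON-backed table that the crawler registered in the Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="healthai_mimic",
    table_name="documentreference",  # assumed table name created by the crawler
)

# Write Parquet, partitioned by patient and then by resource id (placeholder column names).
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-healthlake-analytics/documentreference_parquet/",  # placeholder
        "partitionKeys": ["patient_id", "id"],
    },
    format="parquet",
)
job.commit()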

Finally, to support keyword-based queries for conditions via the QuickSight dashboard, we create a view of the transformed DocumentReference table that includes ICD-10-CM textual descriptions and the corresponding ICD-10-CM codes.

Building a population health dashboard with QuickSight

QuickSight is a cloud-based business intelligence (BI) service that makes it easy to build dashboards in the cloud. It can obtain data from various sources, but for our use case, we use Athena as the data source for our QuickSight dashboard. From the previous step, we have Athena tables that use data from HealthLake. As the next step, we create a dataset in QuickSight from a table in Athena. We use SPICE (Super-fast, Parallel, In-memory Calculation Engine) to store the data because this allows us to import the data once and reuse it multiple times.

After creating the dataset, we create a number of analytic components in the dashboard. These components allow us to aggregate the data and create charts and time-series visualizations at the patient and population levels.

The first tab of the dashboard that we build provides a view into the entire patient population and their encounters with the health system (see the following screenshot). The target audience for this dashboard consists of healthcare providers or caregivers.

The dashboard contains filters that allow us to drill down into the results by referring hospital or by date. It shows the number of patients, their demographic distribution, the number of encounters, the average hospital stay, and more.

The second tab joins hospital encounters with patient medical conditions. This view provides the number of encounters per referring hospital, broken down by encounter type and by age. We also create a word cloud of major medical conditions so we can easily drill down into the details and understand the distribution of these conditions across the entire population by encounter type.

The third component contains a patient timeline. The timeline is in the form of a tree table. The first column is the patient name. The second column contains the start date of the encounter sorted chronologically. The third column contains the list of ranked conditions diagnosed in that encounter. The last column contains the list of procedures performed during that encounter.

To build the patient timeline, we create a view in Athena that joins multiple tables. We build the preceding view by joining the condition, patient, encounter, and observation tables. The encounter table contains an array of conditions, and therefore we need to use the unnest command. The following code is a sample SQL query to join the tables:

SELECT o.code.text, o.effectivedatetime, o.valuequantity, p.name[1].family, e.hospitalization.dischargedisposition.coding[1].display as dischargeddisposition, e.period.start, e.period."end", e.hospitalization.admitsource.coding[1].display as admitsource, e.class.display as encounter_class, c.code.coding[1].display as condition
    FROM "healthai_mimic"."encounter" e, unnest(diagnosis) t(cond), condition c, patient p, observation o
    AND ("split"("cond"."condition"."reference", '/')[2] = "c"."id")
    AND ("split"("e"."subject"."reference", '/')[2] = "p"."id")
    AND ("split"("o"."subject"."reference", '/')[2] = "p"."id")
    AND ("split"("o"."encounter"."reference", '/')[2] = "e"."id")

The last, but probably most exciting, part is where we compare patient data found in structured fields with data parsed from text. As described earlier, the AWS Glue jobs transformed the DocumentReference and Condition tables, so the modified DocumentReference table can now be queried to retrieve the parsed medical entities.

In the following screenshot, we search for all patients who have the word sepsis in the condition text. The condition equals field is a filter that lets us match all conditions containing a given text string. The results show that 209 patients have a sepsis-related condition in their structured data, whereas 288 patients have sepsis-related conditions as parsed from textual notes. The table on the left shows timelines for patients based on structured data, and the table on the right shows timelines for patients based on parsed data.

Next steps

In this post, we joined the data from multiple FHIR resources to create a holistic view of a patient. We also used Athena to search for a single patient. If the data volume is high, it’s a good idea to create year, month, and day partitions within Amazon S3 and store the NDJSON files in those partitions. This allows the dashboard to be created for a restricted time period, such as the current month or current year, making it faster and more cost-effective.

Conclusion

HealthLake creates exciting new possibilities for extracting medical entities from unstructured data and quickly building a dashboard on top of it. The dashboard helps clinicians and health administrators make informed decisions and improve patient care. It also helps researchers improve the performance of their ML models by incorporating medical entities that were hidden in unstructured data. You can start building a dashboard on your raw FHIR data by importing it into Amazon S3, creating AWS Glue crawlers and Data Catalog tables, and creating a QuickSight dashboard!

[1] MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016).

 


About the Author

Mithil Shah is an ML/AI Specialist at Amazon Web Services. He currently helps public sector customers improve the lives of citizens by building machine learning solutions on AWS.

Paul Saxman is a Principal Solutions Architect at AWS, where he helps clinicians, researchers, executives, and staff at academic medical centers adopt and leverage cloud technologies. As a clinical and biomedical informaticist, Paul is passionate about accelerating healthcare advancement and innovation by supporting the translation of science into medical practice.

Read More

Focusing on disaster response with Amazon Augmented AI and Mechanical Turk


It’s easy to distinguish a lake from a flood. But when you’re looking at an aerial photograph, factors like angle, altitude, cloud cover, and context can make the task more difficult. And when you need to identify 100,000 aerial images in order to give first responders the information they need to accelerate disaster response efforts? That’s when you need to combine the speed and accuracy of machine learning (ML) with the precision of human judgement.

With a constant supply of low altitude disaster imagery and satellite imagery coming online, researchers are looking for faster and more affordable ways to label this content so that it can be utilized by stakeholders like first responders and state, local, and federal agencies. Because the process of labeling this data is expensive, manual, and time consuming, developing ML models that can automate image labeling (or annotation) is critical to bringing this data into a more usable state. And to develop an effective ML model, you need a ground truth dataset: a labeled set of data that is used to train your model. The lack of an adequate ground truth dataset for Low Altitude Disaster Imagery (LADI) images put model development out of reach until now.

A broad array of organizations and agencies are developing solutions to this problem, and Amazon is there to support them with technology, infrastructure, and expertise. By integrating the full suite of human-in-the-loop services into a single AWS data pipeline, we can improve model performance, reduce the cost of human review, simplify the process of implementing an annotation pipeline, and provide prebuilt templates for the worker user interface, all while supplying access to an elastic, on-demand Amazon Mechanical Turk workforce that can scale to natural disaster event-driven annotation task volumes.

One of the projects that has made headway in the annotation of disaster imagery was developed by students at Penn State. Working alongside a team of MIT Lincoln Laboratory researchers, students at Penn State College of Information Sciences and Technology (IST) developed a computer model that can improve the classification of disaster scene images and inform disaster response.

Developing solutions

The Penn State project began with an analysis of imagery from the Low Altitude Disaster Imagery (LADI) dataset, a collection of aerial images taken above disaster scenes since 2015. Based on work supported by the United States Air Force, the LADI dataset was developed by the New Jersey Office of Homeland Security and Preparedness and MIT Lincoln Laboratory, with support from the National Institute of Standards and Technology’s Public Safety Innovation Accelerator Program (NIST PSIAP) and AWS.

“We met with the MIT Lincoln Laboratory team in June 2019 and recognized shared goals around improving annotation models for satellite and LADI objects, as we’ve been developing similar computer vision solutions here at AWS,” says Kumar Chellapilla, General Manager of Amazon Mechanical Turk, Amazon SageMaker Ground Truth, and Amazon Augmented AI (Amazon A2I) at AWS. “We connected the team with the AWS Machine Learning Research Awards (now part of the Amazon Research Awards program) and the AWS Open Data Program and funded MTurk credits for the development of MIT Lincoln Laboratory’s ground truth dataset.” Mechanical Turk is a global marketplace for requesters and workers to interact on human intelligence-related work, and is often leveraged by ML and artificial intelligence researchers to label large datasets.

With the annotated dataset hosted as part of the AWS Open Data Program, the Penn State students developed a computer model to create an augmented classification system for the images. This work has led to a trained model with an expected accuracy of 79%. The students’ code and models are now being integrated into the LADI project as an open-source baseline classifier and tutorial.

“They worked on training the model with only a subset of the full dataset, and I anticipate the precision will get even better,” says Dr. Jeff Liu, Technical Staff at MIT Lincoln Laboratory. “So we’ve seen, just over the course of a couple of weeks, very significant improvements in precision. It’s very promising for the future of classifiers built on this dataset.”

“During a disaster, a lot of data can be collected very quickly,” explains Andrew Weinert, Staff Research Associate at MIT Lincoln Laboratory who helped facilitate the project with the College of IST. “But collecting data and actually putting information together for decision-makers is a very different thing.”

Integrating human-in-the-loop services

Amazon also supported the development of an annotation user interface (UI) that aligned with common disaster classification codes, such as those used by urban search and rescue teams, which enabled MIT Lincoln Laboratory to pilot real-time Civil Air Patrol (CAP) image annotation following Hurricane Dorian. The MIT Lincoln Laboratory team is in the process of building a pipeline to bring CAP data through this classifier using Amazon A2I to route low-confidence results to Mechanical Turk for human review. Amazon A2I seamlessly integrates human intelligence with AI to offer human-level accuracy at machine-level scale for AWS AI services and custom models, and enables routing low-confidence ML results for human review.
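At the API level, routing a low-confidence prediction for human review amounts to starting a human loop against an Amazon A2I flow definition. The following is a minimal boto3 sketch; the flow definition ARN, loop name, threshold, and input payload are placeholders, and in practice the loop would only be started when the classifier’s confidence falls below a chosen cutoff.

import json
import boto3

a2i = boto3.client("sagemaker-a2i-runtime")

confidence = 0.42            # example classifier score for one image
CONFIDENCE_THRESHOLD = 0.7   # example threshold; tune for your use case

if confidence < CONFIDENCE_THRESHOLD:
    a2i.start_human_loop(
        HumanLoopName="ladi-image-00123",  # placeholder, must be unique
        FlowDefinitionArn="arn:aws:sagemaker:us-east-1:123456789012:flow-definition/ladi-review",  # placeholder
        HumanLoopInput={
            "InputContent": json.dumps({
                "taskObject": "s3://my-ladi-bucket/images/image-00123.jpg",  # placeholder image location
                "predictedLabel": "flooding",
                "confidence": confidence,
            })
        },
    )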

“Amazon A2I is like ‘phone a friend’ for the model,” Weinert says. “It helps us route the images that can’t confidently be labeled by the classifier to MTurk workers for review. Ultimately, developing the tools that can be used by first responders to get help to those that need it is on top of our mind when working on this type of classifier, so we are now building a service to combine our results with other datasets like GIS (geographic information systems) to make it useful to first responders in the field.”

Weinert says that in a hurricane or other large-scale disaster, there could be up to 100,000 aerial images for emergency officers to analyze. For example, an official may be seeking images of bridges to assess damage or flooding nearby and needs a way to review the images quickly.

“Say you have a picture that at first glance looks like a lake,” says Dr. Marc Rigas, Assistant Teaching Professor, Penn State College of IST. “Then you see trees sticking out of it and realize it’s a flood zone. The computer has to know that and be able to distinguish what is a lake and what isn’t.” If it can’t distinguish between the two with confidence, Amazon A2I can route that image for human review.

There is a critical need to develop new technology to support incident and disaster response following natural disasters, such as computer vision models that detect damaged infrastructure or dangerous conditions. Looking forward, we will combine the power of custom ML models with Amazon A2I to route low-confidence predictions to workers who annotate images to identify categories of natural disaster damage.

During hurricane season, redundant systems that enable a workforce to access annotation tools from home make it possible to label data in real time as new image sets become available.

Looking forward

Grace Kitzmiller from the Amazon Disaster Response team envisions a future where projects such as these can change how disaster response is handled. “By working with researchers and students, we can partner with the computer vision community to build a set of open-source resources that enable rich collaboration among diverse stakeholders,” Kitzmiller says. “With the idea that open-source development can be driven on the academic side with support from Amazon, we can accelerate the process of bringing some of these microservices into production for first responders.”

Joe Flasher of the AWS Open Data Program discussed the huge strides in predictive accuracy that classifiers have made in the last few years. “Using what we know about a specific image, its GIS coordinates and other metadata can help us improve classifier performance of both LADI and satellite datasets,” Flasher says. “As we begin to combine and layer complementary datasets based on geospatial metadata, we can both improve accuracy and enhance the depth and granularity of results by incorporating attributes from each dataset in the results of the selected set.”

Mechanical Turk and the MIT Lincoln Laboratory are putting together a workshop that enables a broader group of researchers to leverage the LADI ground truth dataset to train classifiers using SageMaker Ground Truth. Low-confidence results are routed through Amazon A2I for human annotation using Mechanical Turk, and the team can rerun models using the enhanced ground truth set to measure improvements in model performance. The workshop results will contribute to the open-source resources shared through the AWS Open Data Program. “We are very excited to support these academic efforts through the Amazon Research Awards,” says An Luo, Senior Technical Program Manager for academic programs, Amazon AI. “We look for opportunities where the work being done by academics advances ML research and is complementary to AWS goals while advancing educational opportunities for students.”

To get started, see the Amazon Augmented AI resources, the resources available for the Low Altitude Disaster Imagery (LADI) dataset, and the Amazon Mechanical Turk documentation.


About the Author

Morgan Dutton is a Senior Program Manager with the Amazon Augmented AI and Mechanical Turk team. She works with academic and public sector customers to accelerate their use of human-in-the-loop ML services. Morgan is interested in collaborating with academic customers to support adoption of ML technologies by students and educators.

Read More