Building a secure search application with access controls using Amazon Kendra

January 4, 2021

by Abhinav Jawadekar Amazon AWS

For many enterprises, critical business information is often stored as unstructured data scattered across multiple content repositories. Not only is it challenging for organizations to make this information available to employees when they need it, but it’s also difficult to do so securely so relevant information is available to the right employees or employee groups.

Amazon Kendra is a highly accurate and easy-to-use intelligent search service powered by machine learning (ML). Amazon Kendra delivers secure search for enterprise applications and can make sure the results of a user’s search query only include documents the user is authorized to read. In this post, we illustrate how to build an Amazon Kendra-powered search application supporting access controls that reflect the security model of an example organization.

Amazon Kendra supports search filtering based on user access tokens that are provided by your search application, as well as document access control lists (ACLs) collected by the Amazon Kendra connectors. When user access tokens are applied, search results return links to the original document repositories and include a short description. Access control to the full document is still enforced by the original repository.

In this post, we demonstrate token-based user access control in Amazon Kendra with Open ID. We use Amazon Cognito user pools to authenticate users and provide Open ID tokens. You can use a similar approach with other Open ID providers.

Application overview

This application is designed for guests and registered users to make search queries to a document repository, and results are returned only from those documents that are authorized for access by the user. Users are grouped based on their roles, and access control is at a group level. The following table outlines which documents each user is authorized to access for our use case. The documents being used in this example are a subset of AWS public documents.

User	Role	Group	Document Type Authorized for Access
	Guest		Blogs
Patricia	IT Architect	Customer	Blogs, user guides
James	Sales Rep	Sales	Blogs, user guides, case studies
John	Marketing Exec	Marketing	Blogs, user guides, case studies, analyst reports
Mary	Solutions Architect	Solutions Architect	Blogs, user guides, case studies, analyst reports, whitepapers

Architecture

The following diagram illustrates our solution architecture.

The documents being queried are stored in an Amazon Simple Storage Service (Amazon S3) bucket. Each document type has a separate folder: blogs, case-studies, analyst-reports, user-guides, and white-papers. This folder structure is contained in a folder named Data. Metadata files including the ACLs are included in a folder named Meta.

We use the Amazon Kendra S3 connector to configure this S3 bucket as the data source. When the data source is synced with the Amazon Kendra index, it crawls and indexes all documents as well as collects the ACLs and document attributes from the metadata files. For this example, we use a custom attribute DocumentType to denote the type of the document.
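As a rough sketch of what one of these metadata files might contain, the following Python snippet builds the JSON for a single document. The field names (DocumentId, Attributes, AccessControlList) follow the Amazon Kendra S3 metadata format as we understand it. The group-to-document-type mapping is inferred from the access table and the group names used later in this post, and the helper name, document ID, and DocumentType values are hypothetical.

```python
import json

# Groups allowed to read each document type (an assumption inferred from the
# access table). Blogs carry no ACL, so they stay visible to guest users.
ALLOWED_GROUPS = {
    "blogs": [],
    "user-guides": ["customer", "AWS-Sales", "AWS-Marketing", "AWS-SA"],
    "case-studies": ["AWS-Sales", "AWS-Marketing", "AWS-SA"],
    "analyst-reports": ["AWS-Marketing", "AWS-SA"],
    "white-papers": ["AWS-SA"],
}


def build_metadata(document_id, doc_type):
    """Return a metadata dict for one document (hypothetical helper)."""
    metadata = {
        "DocumentId": document_id,
        "Attributes": {"DocumentType": doc_type},
    }
    acl = [
        {"Name": group, "Type": "GROUP", "Access": "ALLOW"}
        for group in ALLOWED_GROUPS[doc_type]
    ]
    if acl:
        metadata["AccessControlList"] = acl
    return metadata


# Hypothetical document ID used only for illustration.
example = build_metadata("Data/white-papers/example.pdf", "white-papers")
print(json.dumps(example, indent=2))
```

Leaving the ACL out for blogs is what lets unauthenticated guests see them, because queries without a token only return documents that have no access restrictions.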

We use an Amazon Cognito user pool to authenticate registered users, and use an identity pool to authorize the application to use Amazon Kendra and Amazon S3. The user pool is configured as an Open ID provider in the Amazon Kendra index by configuring the signing URL of the user pool.

When a registered user authenticates and logs in to the application to perform a query, the application sends the user’s access token provided by the user pool to the Amazon Kendra index as a parameter in the query API call. For guest users, there is no authentication and therefore no access token is sent as a parameter to the query API. The results of a query API call without the access token parameter only return the documents without access control restrictions.
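The application itself is written in ReactJS, but the same request can be sketched in Python with boto3. The snippet below is a minimal illustration, assuming the Query API's UserContext parameter carries the token; the function names and the index ID are hypothetical, and the actual call needs AWS credentials, so it is wrapped in a separate function that isn't invoked here.

```python
def build_query_kwargs(index_id, query_text, access_token=None):
    """Assemble keyword arguments for the Amazon Kendra Query API."""
    kwargs = {"IndexId": index_id, "QueryText": query_text}
    if access_token:
        # Registered user: pass the token so results are filtered by ACL.
        kwargs["UserContext"] = {"Token": access_token}
    # Guest user: no UserContext, so only unrestricted documents come back.
    return kwargs


def run_query(index_id, query_text, access_token=None):
    """Call Amazon Kendra with boto3 (requires AWS credentials; not run here)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("kendra")
    return client.query(**build_query_kwargs(index_id, query_text, access_token))


guest_kwargs = build_query_kwargs("example-index-id", "what is serverless?")
user_kwargs = build_query_kwargs("example-index-id", "what is serverless?", "<jwt>")
print(guest_kwargs)
print(user_kwargs)
```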

When an Amazon Kendra index receives a query API call with a user access token, it validates the access token using the user pool signing URL and reads parameters such as cognito:username and cognito:groups associated with the user. The Amazon Kendra index filters the search results based on the stored ACLs and the information received in the user access token. These filtered results are returned in response to the query API call made by the application.
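To see what these parameters look like inside a token, the following sketch decodes the payload of a JSON Web Token using only the Python standard library. It deliberately does not verify the signature (Amazon Kendra does that with the keys published at the signing URL), and the token here is a hypothetical, unsigned one built just for the illustration.

```python
import base64
import json


def decode_jwt_payload(token):
    """Return the claims of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with the padding stripped.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def _b64url(data):
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(data).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hypothetical, unsigned token built only for this illustration.
claims = {"cognito:username": "mary", "cognito:groups": ["AWS-SA"]}
fake_token = ".".join([_b64url({"alg": "none"}), _b64url(claims), ""])

decoded = decode_jwt_payload(fake_token)
print(decoded["cognito:username"], decoded["cognito:groups"])
```

Never rely on unverified claims for authorization decisions; this is only a way to inspect a token while debugging.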

The application, which the users can download with its source, is written in ReactJS using components from the AWS Amplify framework. We use the AWS Amplify console to implement the continuous integration and continuous deployment pipelines. We use an AWS CloudFormation template to deploy the AWS infrastructure, which includes the following:

An Amazon Kendra index
An Amazon Cognito user pool and identity pool
AWS Identity and Access Management (IAM) roles and policies
An AWS CodeCommit source code repository
AWS Amplify console application configurations

In this post, we provide a step-by-step walkthrough to configure the backend infrastructure, build and deploy the application code, and use the application.

Prerequisites

To complete the steps in this post, make sure you have the following:

An AWS account with privileges to create IAM roles and policies.
Basic knowledge of AWS.
An S3 bucket for your documents. For more information, see Creating a bucket and What is Amazon S3?
Access to a command terminal with the AWS Command Line Interface (CLI) installed or AWS CloudShell. For instructions, see Installing, updating, and uninstalling the AWS CLI version 2.
Amazon Kendra.

Preparing your S3 bucket as a data source

To prepare an S3 bucket as a data source, create an S3 bucket. In the terminal with the AWS CLI or AWS CloudShell, run the following commands to upload the documents and the metadata to the data source bucket:

aws s3 cp s3://aws-ml-blog/artifacts/building-a-secure-search-application-with-access-controls-kendra/docs.zip .
unzip docs.zip
aws s3 cp Data/ s3://<REPLACE-WITH-NAME-OF-S3-BUCKET>/Data/ --recursive
aws s3 cp Meta/ s3://<REPLACE-WITH-NAME-OF-S3-BUCKET>/Meta/ --recursive

Deploying the infrastructure as a CloudFormation stack

In a separate browser tab open the AWS Management Console, and make sure that you are logged in to your AWS account. Click the button below to launch the CloudFormation stack to deploy the infrastructure.

You should see a page similar to the image below:

For S3DataSourceBucket, enter your data source bucket name without the s3:// prefix, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack.

Stack creation can take 30–45 minutes to complete. While you wait, you can look at the different tabs, such as Events, Resources, and Template. You can monitor the stack creation status on the Stack info tab.

When stack creation is complete, keep the Outputs tab open. We need values from the Outputs and Resources tabs in subsequent steps.

Reviewing Amazon Kendra configuration and starting the data source sync

In the following steps, we configure Amazon Kendra to enable secure token access and start the data source sync to begin crawling and indexing documents.

On the Amazon Kendra console, choose the index AuthKendraIndex, which was created as part of the CloudFormation stack.

Under User access control, token-based user access control is enabled, the signing key object is set to the Open ID provider URL of the Amazon Cognito user pool, and the user name and group are set to cognito:username and cognito:groups, respectively.

In the navigation pane, choose Data sources.
On the Settings tab, you can see the data source bucket being configured.
Select the radio button for the data source and choose Sync now.

The data source sync can take 10–15 minutes to complete, but you don’t have to wait to move to the next step.

Creating users and groups in the Amazon Cognito user pool

In the terminal with the AWS CLI or AWS CloudShell, run the following commands to create users and groups in the Amazon Cognito user pool to use for our application. You need to copy the contents of the Physical ID column in the UserPool row from the Resources tab of the CloudFormation stack. This is the user pool ID to use in the following steps. We set AmazonKendra@2020 as the temporary password for all the users. This password is required when logging in for the first time, and Amazon Cognito enforces a password reset.

USER_POOL_ID=<PASTE-USER-POOL-ID-HERE>
aws cognito-idp create-group --group-name customer --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-Sales --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-Marketing --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-SA --user-pool-id ${USER_POOL_ID}
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username patricia --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username james  --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username john  --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username mary  --temporary-password AmazonKendra@2020
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username patricia --group-name customer
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username james --group-name AWS-Sales
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username john --group-name AWS-Marketing
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username mary --group-name AWS-SA

Building and deploying the app

Now we build and deploy the app using the following steps:

On the AWS Amplify console, choose the app AWSKendraAuthApp.
Choose Run build.

You can monitor the build progress on the console.

Let the build continue and complete the steps: Provision, Build, Deploy, and Verify. After this, the application is deployed and ready to use.

You can browse through the source code by opening up the CodeCommit repository. The important file to look at is src/App.tsx.

Choose the link on the left to start the application in a new browser tab.

Trial run

We can now take a trial run of our app.

On the login page, sign in with the username patricia and the temporary password AmazonKendra@2020.

Amazon Cognito requires you to reset your password the first time you log in. After you log in, you can see the search field.

In the search field, enter a query, such as what is serverless?
Expand Filter search results to see different document types.

You can select different document types to filter the search results.

Sign out and repeat this process for other users that are created in the Cognito user pool, namely, james, john, and mary.

You can also choose Continue as Guest to use the app without authenticating. However, this option only shows results from blogs.

You can return to the login screen by choosing Welcome Guest! Click here to sign up or sign in.

Using the application

You can use the application we developed by making a few search queries logged in as different users. To experience how access control works, issue the same query from different user accounts and observe the difference in the search results. The following users get results from different sources:

Guests and anonymous users – Only blogs
Patricia (Customer) – Blogs and user guides
James (Sales) – Blogs, user guides, and case studies
John (Marketing) – Blogs, user guides, case studies, and analyst reports
Mary (Solutions Architect) – Blogs, user guides, case studies, analyst reports, and whitepapers

We can make additional queries and observe the results. Some suggested queries include “What is machine learning?”, “What is serverless?”, and “Databases”.

Cleaning up

To delete the infrastructure that was deployed as part of the CloudFormation stack, delete the stack from the AWS CloudFormation console. Stack deletion can take 20–30 minutes.

When the stack status shows as Delete Complete, go to the Events tab and confirm that each of the resources has been removed. You can also cross-verify by checking the respective management consoles for Amazon Kendra, AWS Amplify, and the Amazon Cognito user pool and identity pool.

You must delete your data source bucket separately, because it was not created as part of the CloudFormation stack.

Conclusion

In this post, we demonstrated how you can create a secure search application using Amazon Kendra. Organizations that use an Open ID-compliant identity management system with a new or pre-existing Amazon Kendra index can now enable secure token access to make sure their intelligent search applications are aligned with their organizational security model. For more information about access control in Amazon Kendra, see Controlling access to documents in an index.

About the Author

Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS partners to help them in their cloud journey.

Extracting buildings and roads from AWS Open Data using Amazon SageMaker

December 30, 2020

by Yunzhi Shi Amazon AWS

Sharing data and computing in the cloud allows data users to focus on data analysis rather than data access. Open Data on AWS helps you discover and share public open datasets in the cloud. The Registry of Open Data on AWS hosts a large amount of public open data. The datasets range from genomics to climate to transportation information. They are well structured and easily accessible. Additionally, you can use these datasets in machine learning (ML) model development in the cloud.

In this post, we demonstrate how to extract buildings and roads from two large-scale geospatial datasets: SpaceNet satellite images and USGS 3DEP LiDAR data. Both datasets are hosted on the Registry of Open Data on AWS. We show you how to launch an Amazon SageMaker notebook instance and walk you through the tutorial notebooks at a high level. The notebooks reproduce winning algorithms from the SpaceNet challenges (which only use satellite images). In addition to the SpaceNet satellite images, we compare and combine the USGS 3D Elevation Program (3DEP) LiDAR data to extract the same.

This post demonstrates running ML services on AWS to extract features from large-scale geospatial data in the cloud. By following our examples, you can train the ML models on AWS, apply the models to other regions where satellite or LiDAR data is available, and experiment with new ideas to improve performance. For the complete code and notebooks of this tutorial, see our GitHub repo.

Datasets

In this section, we provide more detail about the datasets we use in this post.

SpaceNet dataset

SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. It’s a large corpus of labeled satellite imagery. The project has also launched a series of competitions ranging from automatic building extraction and road extraction to the recently published multi-temporal urban development analysis. The dataset covers 11 areas of interest (AOIs), including Rio de Janeiro, Las Vegas, and Paris. For this post, we use Las Vegas; the images in this AOI cover 216 km² with 151,367 building polygon labels and 3,685 km of road labels.

The following image is from DigitalGlobe’s SpaceNet Challenge Concludes First Round, Moves to Higher Resolution Challenges.

USGS 3DEP LiDAR dataset

Our second dataset comes from the USGS 3D Elevation Program (3DEP) in the form of LiDAR (Light Detection and Ranging) data. The program’s goal is to complete the acquisition of nationwide LiDAR to provide the first-ever national baseline of consistent high-resolution topographic elevation data, collected in a timeframe of less than a decade. LiDAR is a remote sensing method that emits hundreds of thousands of near-infrared light pulses each second to measure distances to the Earth. These light pulses generate precise, 3D information about the shape of the Earth and its surface characteristics.

The USGS 3DEP LiDAR is presented in two formats. The first is a public repository in Entwine Point Tiles (EPT) format, which is a lossless, full resolution, streamable octree structure. This format is suitable for online visualization. The following image shows an example of LiDAR visualization in Las Vegas.

The other format is in LAZ (compressed LAS) with requester-pays access. In this post, we use LiDAR data in the second format.

Data registration

For this post, we select the Las Vegas AOI where both SpaceNet satellite images and USGS LiDAR data are available. Among SpaceNet data categories, we use the 30cm resolution pan-sharpened 3-band RGB geotiff and corresponding building and road labels. To improve the visual feature extraction performance, we process the data by white balancing and convert it to 8-bit (0–255) values for ease of postprocessing. The following graph shows the RGB value aggregated histogram of all images after processing.
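The post doesn't spell out the white balancing method, so the following NumPy sketch shows one common approach as an assumption: stretch each channel between two percentiles and rescale the result to 8-bit. The function name, the percentile choices, and the synthetic input tile are all hypothetical.

```python
import numpy as np


def white_balance_to_uint8(image, low_pct=2.0, high_pct=98.0):
    """Stretch each channel between two percentiles and rescale to 0-255."""
    image = image.astype(np.float64)
    out = np.empty(image.shape, dtype=np.uint8)
    for c in range(image.shape[-1]):
        channel = image[..., c]
        lo, hi = np.percentile(channel, [low_pct, high_pct])
        scale = max(hi - lo, 1e-12)  # avoid division by zero on flat channels
        stretched = np.clip((channel - lo) / scale, 0.0, 1.0)
        out[..., c] = np.round(stretched * 255).astype(np.uint8)
    return out


# Synthetic 11-bit tile standing in for a raw satellite image.
rng = np.random.default_rng(0)
raw = rng.integers(0, 2048, size=(64, 64, 3), dtype=np.uint16)
balanced = white_balance_to_uint8(raw)
print(balanced.dtype, balanced.shape)
```

Clipping at percentiles rather than at the absolute minimum and maximum keeps a few very bright or very dark pixels from compressing the rest of the histogram.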

Satellite images are 2D images, whereas the USGS LiDAR data are 3D point clouds and therefore require conversion and projection to align with 2D satellite images. We use MATLAB and LAStools to map each 3D LiDAR point to a pixel-wise location corresponding to SpaceNet tiles, and generate two sets of attribute images: elevation and reflectivity intensity. The elevation ranges from approximately 2,000–3,000 feet, and the intensity ranges from 0–5,000 units. The following graphs show the aggregated histograms of all images for elevation and reflectivity intensity values.

Finally, we take one of the two LiDAR attributes and merge it with the RGB images. The images are saved in 16-bit because LiDAR attribute values can be larger than 255, the 8-bit upper limit. We make this processed and merged data available via a publicly accessible Amazon Simple Storage Service (Amazon S3) bucket for this tutorial. The following are three samples of merged RGB+LiDAR images. From left to right, the columns are RGB image, LiDAR elevation attribute, and LiDAR reflectivity intensity attribute.
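The merge step can be pictured with a short NumPy sketch. It assumes the RGB tile and the LiDAR attribute image are already aligned pixel for pixel; the function name and the sample values are hypothetical, and the actual tutorial data was produced with different tooling.

```python
import numpy as np


def merge_rgb_lidar(rgb_uint8, lidar_attr):
    """Stack an 8-bit RGB tile with one LiDAR attribute into a 4-channel uint16 array."""
    if rgb_uint8.shape[:2] != lidar_attr.shape[:2]:
        raise ValueError("RGB and LiDAR tiles must share height and width")
    # LiDAR values can exceed 255, so the merged array is stored as 16-bit.
    lidar_uint16 = np.clip(np.round(lidar_attr), 0, 65535).astype(np.uint16)
    return np.dstack([rgb_uint8.astype(np.uint16), lidar_uint16])


rgb = np.full((4, 4, 3), 200, dtype=np.uint8)
elevation = np.full((4, 4), 2500.0)  # feet, within the range quoted in this post
merged = merge_rgb_lidar(rgb, elevation)
print(merged.shape, merged.dtype, merged[0, 0])
```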

Creating a notebook instance

SageMaker is a fully managed service that allows you to build, train, deploy, and monitor ML models. Its modular design allows you to pick and choose features that suit your use cases at different stages of the ML lifecycle. SageMaker offers capabilities that abstract the heavy lifting of infrastructure management and provide the agility and scalability you desire for large-scale ML activities with different features and a pay-as-you-use pricing model.

The SageMaker on-demand notebook instance is a fully managed compute instance running the Jupyter Notebook app. SageMaker manages creating instances and related resources. Notebooks contain everything needed to run or recreate an ML workflow. You can use Jupyter notebooks in your notebook instance to prepare and process data, write code to train models, deploy models to SageMaker hosting, and test or validate your models. For different problems, you can select the type of instance to best fit each scenario (such as high throughput, high memory usage, or real-time inference).

Although training the deep learning model can take a long time, you can reproduce the inference part of this post with a reasonable computing cost. It’s recommended to run the notebooks inside a SageMaker notebook instance of type ml.p3.8xlarge (4 x V100 GPUs) or larger. Network training and inference is a memory-intensive process; if you run into out of memory or out of RAM errors, consider decreasing the batch_size in the configuration files (.yml format).

To create a notebook instance, complete the following steps:

On the SageMaker console, choose Notebook instances.
Choose Create notebook instance.
Enter the name of your notebook instance, such as open-data-tutorial.
Set the instance type to ml.p3.8xlarge.
Choose Additional configuration.
Set the volume size to 60 GB.
Choose Create notebook instance.
When the instance is ready, choose Open in JupyterLab.
From the launcher, you can open a terminal and run the provided code.

Deploy environment and download datasets

At the JupyterLab terminal, run the following commands:

$ cd ~/SageMaker/
$ git clone https://github.com/aws-samples/aws-open-data-satellite-lidar-tutorial.git
$ cd aws-open-data-satellite-lidar-tutorial

This downloads the tutorial repository from GitHub and takes you to the tutorial directory.

Next, set up a Conda environment by running setup-env.sh (see the following code). You can change the environment name from tutorial_env to any other name.

$ ./setup-env.sh tutorial_env

This may take 10–15 minutes to complete, after which you have a new Jupyter kernel called conda_tutorial_env, or conda_[name] if you change the environment name. You may need to wait a few minutes after conda completion and refresh the Jupyter page.

Next, download the necessary data from the public S3 bucket hosting the tutorial files:

$ ./download-from-s3.sh

This may take up to 5 minutes to complete and requires at least 23 GB of notebook instance storage.

Building extraction

To reproduce this section, launch the notebook Building-Footprint.ipynb.

The first and second SpaceNet challenges aimed to extract building footprints from satellite images at various AOIs. The fourth SpaceNet challenge posed a similar task with more challenging off-nadir (oblique-looking angles) imagery. We reproduce a winning algorithm and evaluate its performance with both RGB images and LiDAR data.

Training data

In the Las Vegas AOI, SpaceNet data is tiled to size 200m x 200m. We select 3,084 tiles in which both SpaceNet imagery and LiDAR data are available and merge them together. Unfortunately, the labels of test data for scoring in the SpaceNet challenges are not published, so we split the merged data into 70% and 30% for training and evaluation. Between LiDAR elevation and intensity, we choose elevation for building extractions. See the following code:

import numpy as np
import pandas as pd

# Create Pandas data frame, containing columns 'image' and 'label'.
# img_path_list and mask_path_list are built earlier in the notebook.
total_df = pd.DataFrame({'image': img_path_list,
                         'label': mask_path_list})
# Split this data frame to training data and blind test data.
split_mask = np.random.rand(len(total_df)) < 0.7
train_df = total_df[split_mask]
test_df = total_df[~split_mask]
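Because the mask above is drawn at random, the split is only approximately 70%/30% and changes on every run. If you want a repeatable split with exact proportions, one option is to shuffle the row indices with a fixed seed, as in the following sketch. The function name and seed are hypothetical and not part of the tutorial code.

```python
import numpy as np


def split_indices(n_items, train_fraction=0.7, seed=0):
    """Return shuffled train and test index arrays with a fixed seed."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(train_fraction * n_items))
    return order[:n_train], order[n_train:]


# 3,084 is the number of merged tiles mentioned in this section.
train_idx, test_idx = split_indices(3084)
print(len(train_idx), len(test_idx))
```

With a data frame such as total_df, the index arrays could then be applied with total_df.iloc[train_idx] and total_df.iloc[test_idx].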

Model

We reproduce the winning algorithm from SpaceNet challenge 4 by XD_XD. The model has a U-net architecture with skip-connections between encoder and decoder, and a modified VGG16 as backbone encoder. The model takes three different types of input:

Three-channel RGB image, same as the original contest
One-channel LiDAR elevation image
Four-channel RGB+LiDAR merged image

We train three models based on the three types of inputs described in this post and compare their performances.

The label for training is binary mask converted from polygon geojson by Solaris, an ML pipeline library developed by CosmiQ Works. We select a combined loss of binary cross-entropy and Jaccard loss with a weight factor alpha=0.8:

$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{BCE}} + (1 - \alpha) \mathcal{L}_{\mathrm{Jaccard}}$$
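To make the weighting concrete, here is a minimal NumPy sketch of a combined loss of this form. It is not the Solaris implementation; it assumes a soft (probability-based) Jaccard term, and the function names are hypothetical.

```python
import numpy as np


def bce_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between a binary mask and probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


def soft_jaccard_loss(y_true, y_prob, eps=1e-7):
    """One minus the soft intersection-over-union of mask and probabilities."""
    intersection = np.sum(y_true * y_prob)
    union = np.sum(y_true) + np.sum(y_prob) - intersection
    return float(1.0 - (intersection + eps) / (union + eps))


def combined_loss(y_true, y_prob, alpha=0.8):
    """alpha * BCE + (1 - alpha) * Jaccard, with alpha = 0.8 as in this post."""
    return alpha * bce_loss(y_true, y_prob) + (1 - alpha) * soft_jaccard_loss(y_true, y_prob)


mask = np.array([[1.0, 0.0], [0.0, 1.0]])
confident = np.array([[0.99, 0.01], [0.01, 0.99]])
uncertain = np.full((2, 2), 0.5)
print(combined_loss(mask, confident), combined_loss(mask, uncertain))
```

The cross-entropy term penalizes each pixel independently, while the Jaccard term rewards overlap of the predicted footprint as a whole, which is why the two are often combined for segmentation.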

We train the models with a batch size of 20, the Adam optimizer, and a learning rate of 1e-4 for 100 epochs. The training takes approximately 100 minutes to finish on an ml.p3.8xlarge SageMaker notebook instance. See the following code:

import solaris as sol

# Select config file and link training datasets.
# The config must be parsed first because the model reads its channel count.
config = sol.utils.config.parse('./configs/buildings/RGB+ELEV.yml')
config['training_data_csv'] = train_csv_path

# Load customized multi-channel input VGG16-Unet model.
from networks.vgg16_unet import get_modified_vgg16_unet

custom_model = get_modified_vgg16_unet(
    in_channels=config['data_specs']['channels'])
custom_model_dict = {
    'model_name': 'modified_vgg16_unet',
    'arch': custom_model}

# Create solaris trainer, and train with configuration.
trainer = sol.nets.train.Trainer(config, custom_model_dict=custom_model_dict)
trainer.train()

The following images show examples of building extraction inputs and outputs. From left to right, the columns are RGB image, LiDAR elevation image, model prediction trained with RGB and LiDAR data, and ground truth building footprint mask.

Evaluation

Use the trained model to perform model inference on the test dataset (30% hold-out):

custom_model_dict = {
    'model_name': 'modified_vgg16_unet',
    'arch': custom_model,
    'weight_path': config['training']['model_dest_path']}
config['train'] = False

# Create solaris inferer, and do inference on test data.
inferer = sol.nets.infer.Inferer(config, custom_model_dict=custom_model_dict)
inferer(test_df)

After model inference, we evaluate the model performance using the same metric as in the original contest: an aggregated F-1 score with an intersection over union (IoU) ≥ 0.5 criterion. There are two steps to compute this score. First, convert the building footprint binary masks to proposed polygons:

import os
import skimage.io
import solaris as sol

# Convert these probability maps to building polygons.
# pred_dir and prop_dir are directories defined earlier in the notebook.
def pred_to_prop(pred_file, img_path):
    pred_path = os.path.join(pred_dir, pred_file)
    pred = skimage.io.imread(pred_path)[..., 0]
    # Derive the output geojson file name from the prediction file name.
    prop_file = pred_file.replace(
        'RGB+ELEV', 'geojson_buildings').replace('tif', 'geojson')
    prop_path = os.path.join(prop_dir, prop_file)
    prop = sol.vector.mask.mask_to_poly_geojson(
        pred_arr=pred,
        reference_im=img_path,
        do_transform=True,
        min_area=1e-10,
        output_path=prop_path)

Next, compare the proposed polygons against the ground truth polygons (SpaceNet building labels), and compute the aggregated F-1 scores:

# Evaluate aggregated F-1 scores.
def compute_score(prop_path, bldg_path):
    evaluator = sol.eval.base.Evaluator(bldg_path)
    evaluator.load_proposal(prop_path, conf_field_list=[])
    score = evaluator.eval_iou(miniou=0.5, calculate_class_scores=False)
    # score_list.append(score[0]) # skip because single-class
    return score[0] # single-class
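To illustrate what the aggregated F-1 score measures, the following pure-Python sketch counts true positives, false positives, and false negatives across tiles and then computes a single F-1 value. It is a simplification, not the Solaris evaluator: it assumes axis-aligned boxes (xmin, ymin, xmax, ymax) instead of polygons and a simple greedy matching, and all names and sample values are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(proposals, truths, min_iou=0.5):
    """Greedily match proposals to truths; return (TP, FP, FN) for one tile."""
    unmatched = list(truths)
    tp = 0
    for prop in proposals:
        best = max(unmatched, key=lambda t: box_iou(prop, t), default=None)
        if best is not None and box_iou(prop, best) >= min_iou:
            unmatched.remove(best)  # each ground truth can be matched once
            tp += 1
    return tp, len(proposals) - tp, len(unmatched)


def aggregated_f1(tiles, min_iou=0.5):
    """Sum the counts over all tiles, then compute one F-1 score."""
    tp = fp = fn = 0
    for proposals, truths in tiles:
        t, p, n = count_matches(proposals, truths, min_iou)
        tp, fp, fn = tp + t, fp + p, fn + n
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


tiles = [
    ([(0, 0, 10, 10), (20, 20, 30, 30)], [(1, 1, 10, 10), (50, 50, 60, 60)]),
    ([(0, 0, 4, 4)], [(0, 0, 4, 4)]),
]
print(aggregated_f1(tiles))
```

Aggregating the counts before computing F-1 means tiles with many buildings influence the score more than sparse tiles, unlike an average of per-tile scores.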

The following table shows the F-1 scores from the three models trained with RGB images, LiDAR elevation images, and RGB+LiDAR merged images. Compared to using RGB only as in the original SpaceNet competition, the model trained using only LiDAR elevation images achieves a score only a few percent worse. When combining both RGB and LiDAR elevation in training, the model outperforms the RGB-only model. For reference, the F-1 scores of the top three teams from SpaceNet challenge 2 in this AOI are 0.885, 0.829, and 0.787 (we don’t compare them directly because they use a different test set for scoring).

Training data type	Aggregated F-1 scores
RGB images	0.8268
LiDAR elevation	0.80676
RGB+LiDAR merged	0.85312

Road extraction

To reproduce this section, launch the notebook Road-Network.ipynb.

The third SpaceNet challenge aimed to extract road networks from satellite images. The fifth SpaceNet challenge added predicting road speed along with the road network extraction in order to minimize travel time and plan optimal routing. Similar to building extraction, we reproduce a top winning algorithm, train different models with either RGB images, LiDAR attributes, or both of them, and evaluate their performance.

Training data

The road network extraction uses larger tiles with size 400m x 400m. We generate 918 merged tiles, and split by 70%/30% for training and evaluation. In this case, we select reflectivity intensity for road extraction because road surfaces often consist of materials that have distinctive reflectivity among backgrounds, such as a paved surface, dirt road, or asphalt.

Model

We reproduce the CRESI algorithm for road network extraction. It also has a U-net architecture but uses ResNet as the backbone encoder. Again, we train the model with three different types of input:

Three-channel RGB image
One-channel LiDAR intensity image
Four-channel RGB+LiDAR merged image

To extract road location and speed together, a binary road mask doesn’t provide enough information for training. As mentioned in the CRESI paper, we can convert the speed metadata to either a continuous mask (0–1 values) or a multi-class binary mask. Because their test results show that the multi-class binary mask performs better, we use the latter conversion scheme. The following images break down the eight-class road masks. The first seven binary masks represent roads corresponding to seven speed bins within 0–65 mph. The eighth mask (bottom right) represents the aggregation of all previous masks.

The following images show the visualization of multi-class road masks. The left is the RGB image tile. The right is the road mask with color coding in which the yellow-to-red colormap represents speed values from low to high speed (0–65 mph).
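The multi-class conversion can be sketched in a few lines of NumPy. This is an illustration rather than the CRESI code: it assumes a per-pixel speed raster in which 0 means no road, and it assumes seven equal-width bins over 0–65 mph, whereas the tutorial reads its bin definitions from a speed conversion file. The function name is hypothetical.

```python
import numpy as np


def speed_to_multiclass_mask(speed_mph, max_speed=65.0, n_bins=7):
    """Convert a per-pixel speed raster into n_bins + 1 binary channels."""
    speed = np.clip(np.asarray(speed_mph, dtype=np.float64), 0.0, max_speed)
    edges = np.linspace(0.0, max_speed, n_bins + 1)  # assumed equal-width bins
    road = speed > 0
    # With right=True, values in (edges[i-1], edges[i]] map to index i.
    bin_index = np.digitize(speed, edges, right=True)
    channels = [road & (bin_index == i) for i in range(1, n_bins + 1)]
    channels.append(road)  # last channel aggregates all speed bins
    return np.stack(channels, axis=-1).astype(np.uint8)


speeds = np.array([[0.0, 5.0], [30.0, 65.0]])
mask = speed_to_multiclass_mask(speeds)
print(mask.shape)
print(mask[..., -1])
```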

We train the model with the same setup as in the building extraction. The following images show examples of road extraction inputs and outputs. From left to right, the columns are RGB image, LiDAR reflectivity intensity image, model prediction trained with RGB and LiDAR data, and ground truth road mask.

Evaluation

We implement the average path length similarity (APLS) score to evaluate the road extraction performance. This metric is used in SpaceNet road challenges because APLS considers both logical topology (connections within road network) and physical topology (location of the road edges and nodes). The APLS can be weighted by either length or travel time; a higher score means better performance. See the following code:

# Skeletonize the prediction mask into non-geo road network graph.
!python ./libs/apls/skeletonize.py --results_dir={results_dir}
# Match geospatial info and create geo-projected graph.
!python ./libs/apls/wkt_to_G.py --imgs_dir={img_dir} --results_dir={results_dir}
# Infer road speed on each graph edge based on speed bins.
!python ./libs/apls/infer_speed.py --results_dir={results_dir} \
    --speed_conversion_csv_file='./data/roads/speed_conversion_binned7.csv'

# Compute length-based APLS score.
!python ./libs/apls/apls.py --output_dir={results_dir} \
    --truth_dir={os.path.join(data_dir, 'geojson_roads_speed')} \
    --im_dir={img_dir} \
    --prop_dir={os.path.join(results_dir, 'graph_speed_gpickle')} \
    --weight='length'

# Compute time-based APLS score.
!python ./libs/apls/apls.py --output_dir={results_dir} \
    --truth_dir={os.path.join(data_dir, 'geojson_roads_speed')} \
    --im_dir={img_dir} \
    --prop_dir={os.path.join(results_dir, 'graph_speed_gpickle')} \
    --weight='travel_time_s'
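The core idea behind APLS can be shown with a simplified sketch: compare shortest-path lengths between the same node pairs in the ground truth and proposal graphs, cap each relative difference at 1, and give missing paths the maximum penalty. The snippet below assumes the path lengths have already been computed and stored in dictionaries keyed by node pair. The full metric in the tutorial scripts also handles node snapping and scores both directions, which this hypothetical helper does not.

```python
def apls_one_direction(truth_lengths, proposal_lengths):
    """Score proposal path lengths against ground-truth path lengths."""
    if not truth_lengths:
        return 0.0
    penalty = 0.0
    for pair, true_len in truth_lengths.items():
        prop_len = proposal_lengths.get(pair)
        if prop_len is None or true_len <= 0:
            penalty += 1.0  # a missing path gets the maximum penalty
        else:
            penalty += min(1.0, abs(true_len - prop_len) / true_len)
    return 1.0 - penalty / len(truth_lengths)


# Path weights can be lengths (meters) or travel times (seconds).
truth = {("a", "b"): 100.0, ("a", "c"): 200.0, ("b", "c"): 150.0}
proposal = {("a", "b"): 110.0, ("a", "c"): 200.0}  # path b-c is missing
print(apls_one_direction(truth, proposal))
```

Because a single missing road segment can break many shortest paths, this kind of score penalizes connectivity errors much more heavily than a pixel-wise metric would.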

We convert multi-class road mask predictions to skeleton and speed-weighted graph and compute APLS scores. The following table shows the APLS scores of the three models. Similar to the building extraction results, the LiDAR-only result achieves scores close to the RGB-only result, whereas RGB+LiDAR gives the best performance.

Training data type	APLS (length)	APLS (time)
RGB images	0.59624	0.54298
LiDAR intensity	0.57811	0.52697
RGB+LiDAR merged	0.63651	0.58518

Conclusion

We demonstrate how to extract buildings and roads from two large-scale geospatial datasets hosted on the Registry of Open Data on AWS using a SageMaker notebook instance. The SageMaker notebook instance contains everything needed to run or recreate an ML workflow. It’s easy to use and customize to best fit different scenarios.

By using the LiDAR dataset from the Registry of Open Data on AWS and reproducing winning algorithms from SpaceNet building and road challenges, we show that you can use LiDAR data to perform the same task with similar accuracy, and even outperform the RGB models when combined.

With the full code and notebooks shared on GitHub and the necessary data hosted in the public S3 bucket, you can reproduce the map feature extraction tasks, apply the models to any other area of interest, and innovate with new ideas to improve model performance. For the complete code and notebooks of this tutorial, see our GitHub repo.

About the Authors

Yunzhi Shi is a data scientist at the Amazon ML Solutions Lab where he helps AWS customers address business problems with AI and cloud capabilities. Recently, he has been building computer vision, search, and forecast solutions for various customers.

Xin Chen is a senior manager at Amazon ML Solutions Lab, where he leads the Automotive Vertical and helps AWS customers across different industries identify and build machine learning solutions to address their organization’s highest return-on-investment machine learning opportunities. Xin obtained his Ph.D. in Computer Science and Engineering from the University of Notre Dame.

Tianyu Zhang is a data scientist at the Amazon ML Solutions Lab. He helps AWS customers solve business problems by applying ML and AI techniques. Most recently, he has built NLP and predictive models for procurement and sports.

How an important change in web standards impacts your image annotation jobs

December 29, 2020

by Talia Chopra Amazon AWS

Earlier in 2020, widely used browsers like Chrome and Firefox changed their default behavior for rotating images based on image metadata, referred to as EXIF data. Previously, images always displayed in browsers exactly how they’re stored on disk, which is typically unrotated. After the change, images now rotate according to a piece of image metadata called orientation value. This has important implications for the entire machine learning (ML) community. For example, if the EXIF orientation isn’t considered, applications that you use to annotate images may display images in unexpected orientations and result in confusing or incorrect labels.

For example, before the change, by default images would display in the orientation stored on the device, as shown in the following image. After the change, by default, images display according to the orientation value in EXIF data, as shown in the second image.

Here, the image was stored in portrait mode, with EXIF data attached to indicate it should be displayed with a landscape orientation.

To ensure images are predictably oriented, ML annotation services need to be able to view image EXIF data. The recent change to global web standards requires you to grant explicit permission to image annotation services to view your image EXIF data.

To guarantee data consistency between workers and across datasets, the annotation tools used by Amazon SageMaker Ground Truth, Amazon Augmented AI (Amazon A2I), and Amazon Mechanical Turk need to understand and control orientations of input images that are shown to workers. Therefore, from January 12, 2021, onward, AWS requires that you add a cross-origin resource sharing (CORS) header configuration to Amazon Simple Storage Service (Amazon S3) buckets that contain labeling job or human review task input data. This policy allows these AWS services to view EXIF data and verify that images are predictably oriented in labeling and human review tasks.

This post provides details on the image metadata change, how it can impact labeling jobs and human review tasks, and how you can update your S3 buckets with these new, required permissions.

What is EXIF data?

EXIF data is metadata that tells us things about the image. EXIF data typically includes the height and width of an image but can also include things like the date a photo was taken, what kind of camera was used, and even GPS coordinates where the image was captured. For the image annotation web application community, the orientation property of EXIF is about to become very important.

When you take a photo, whether it’s landscape or portrait, the data is written to storage in the landscape orientation. Instead of storing a portrait photo in the portrait orientation, the camera writes a piece of metadata to the image to explain to applications how that image should be rotated when it’s shown to humans. To learn more, see Exif.
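To make the orientation property concrete, the following minimal Python sketch lists the eight orientation values defined by the EXIF standard and shows how the displayed size differs from the stored size once a viewer honors the tag. The helper name `displayed_size` is our own illustration and not part of any AWS tool.

```python
# EXIF orientation tag values (1-8) and the transform a viewer applies
# to the stored pixels so that the image appears upright.
EXIF_ORIENTATION_TRANSFORMS = {
    1: "none",
    2: "mirror horizontally",
    3: "rotate 180 degrees",
    4: "mirror vertically",
    5: "mirror horizontally, then rotate 270 degrees clockwise",
    6: "rotate 90 degrees clockwise",
    7: "mirror horizontally, then rotate 90 degrees clockwise",
    8: "rotate 270 degrees clockwise",
}

# Orientations that involve a quarter turn swap width and height on display.
_QUARTER_TURN_ORIENTATIONS = {5, 6, 7, 8}


def displayed_size(stored_width, stored_height, orientation):
    """Return (width, height) as shown by a viewer that honors EXIF orientation."""
    if orientation not in EXIF_ORIENTATION_TRANSFORMS:
        raise ValueError(f"unknown EXIF orientation value: {orientation}")
    if orientation in _QUARTER_TURN_ORIENTATIONS:
        return (stored_height, stored_width)
    return (stored_width, stored_height)
```

For example, a photo stored as 4000 x 3000 pixels with orientation value 6 is shown as 3000 x 4000 by a browser that applies the tag. This is why annotation coordinates drawn on one form don't line up with the other.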

A big change to browsers: Why EXIF data is important

Until recently, popular web browsers such as Chrome and Firefox didn’t use EXIF orientation values, meaning that images that users annotated were never rotated. This means the annotation data matched how the image was stored and the orientation value didn’t matter.

Earlier in 2020, Chrome and Firefox changed their default behavior to begin using EXIF data by default. To make sure image annotating tasks weren’t impacted, AWS mitigated this change by preventing rotation so that users continued to annotate images in their unrotated form. However, AWS can no longer automatically prevent the rotation of images because the web standards group W3C has decided that the ability to control image rotation violates the web’s Same Origin Policy.

It is estimated that, starting with Chrome 88 on January 19, 2021, annotation services like the ones offered by AWS will require additional permissions to control the orientation of your images when displayed to human workers.

When using AWS services, you can grant these permissions by adding a CORS header policy to the S3 buckets that contain your input images.

Upcoming change to AWS image annotation job security requirements

We recommend that you add a CORS configuration to all S3 buckets that contain input data used for active and future labeling jobs as soon as possible. Starting January 12, 2021, to ensure human workers annotate your input images in a predictable orientation, you must add a CORS header policy to the S3 buckets that contain your input images when you submit requests to create Ground Truth labeling jobs, Amazon A2I human review jobs, or Mechanical Turk tasks.

If you have pre-existing active resources like Ground Truth streaming labeling jobs, you must add a CORS header policy to the S3 bucket used to create those resources. For Ground Truth, this is the input data S3 bucket identified when you created the streaming labeling job.

Additionally, if you reuse resources, such as cloning a Ground Truth labeling job, make sure the input data S3 bucket you use has a CORS header policy attached.

In the context of input image data, AWS services use CORS headers to view EXIF orientation data to control image rotation.

If you don’t add a CORS header policy to an S3 bucket that contains input data by January 12, 2021, Ground Truth, Amazon A2I, and Mechanical Turk tasks created using this S3 bucket will fail.

Adding a CORS header policy to an S3 bucket

If you’re creating an Amazon A2I human loop or Mechanical Turk job, or you’re using the CreateLabelingJob API to create a Ground Truth labeling job, you can add a CORS policy to an S3 bucket that contains input data on the Amazon S3 console.

If you create your job through the Ground Truth console, under Enable enhanced image access, a check box is selected by default to enable CORS configuration on the S3 bucket that contains your input manifest file, as shown in the following image. Keep this check box selected. If all of your input data is not located in the same S3 bucket as your input manifest file, you must manually add a CORS configuration to all S3 buckets that contain input data using the following instructions.

For instructions on setting the required CORS headers on the S3 bucket that hosts your images, see How do I add cross-domain resource sharing with CORS? Use the following CORS configuration code for the buckets that host your images.

The following is the code in JSON format:

[{
   "AllowedHeaders": [],
   "AllowedMethods": ["GET"],
   "AllowedOrigins": ["*"],
   "ExposeHeaders": []
}]

The following is the code in XML format:

<CORSConfiguration>
 <CORSRule>
   <AllowedOrigin>*</AllowedOrigin>
   <AllowedMethod>GET</AllowedMethod>
 </CORSRule>
</CORSConfiguration>

The following GIF demonstrates the instructions found in the Amazon S3 documentation to add a CORS header policy using the Amazon S3 console.
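If you prefer to script this step instead of using the console, the same rule can be applied with the AWS SDK for Python (Boto3). The following is a minimal sketch rather than an official procedure: it assumes that Boto3 is installed and credentials are configured, and the helper names `build_cors_configuration` and `apply_cors` are our own.

```python
def build_cors_configuration():
    """Return the CORS rule from this post in the shape put_bucket_cors expects."""
    return {
        "CORSRules": [
            {
                "AllowedHeaders": [],
                "AllowedMethods": ["GET"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": [],
            }
        ]
    }


def apply_cors(bucket_name, s3_client=None):
    """Attach the CORS rule to one bucket that holds input images."""
    if s3_client is None:
        # Imported lazily so the helper above can be used without Boto3 installed.
        import boto3

        s3_client = boto3.client("s3")
    s3_client.put_bucket_cors(
        Bucket=bucket_name,
        CORSConfiguration=build_cors_configuration(),
    )
```

Call `apply_cors` once for every bucket that contains input data. Note that `put_bucket_cors` replaces any CORS configuration already on the bucket, so merge existing rules first if you have any.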

Conclusion

In this post, we explained how a recent decision made by the web standards group W3C will impact the ML community. AWS image annotation service providers will now require you to grant permission to view orientation values of your input images, which are stored in image EXIF data.

Make sure you enable CORS headers on the S3 buckets that contain your input images before creating Ground Truth labeling jobs, Amazon A2I human review jobs, and Mechanical Turk tasks on or after January 12th, 2021.

About the Authors

Talia Chopra is a Technical Writer in AWS specializing in machine learning and artificial intelligence. She works with multiple teams in AWS to create technical documentation and tutorials for customers using Amazon SageMaker, MXNet, and AutoGluon.

Phil Cunliffe is an engineer turned Software Development Manager for Amazon Human in the Loop services. He is a JavaScript fanboy with an obsession for creating great user experiences.

How Foxconn built an end-to-end forecasting solution in two months with Amazon Forecast

December 23, 2020

by Azim Siddique Amazon AWS

This is a guest post by Foxconn. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

In their own words, “Established in Taiwan in 1974, Hon Hai Technology Group (Foxconn) is the world’s largest electronics manufacturer. Foxconn is also the leading technological solution provider and it continuously leverages its expertise in software and hardware to integrate its unique manufacturing systems with emerging technologies.”

At Foxconn, we manufacture some of the most widely used electronics worldwide. Our effectiveness comes from our ability to plan our production and staffing levels weeks in advance, while maintaining the ability to respond to short-term changes. For years, Foxconn has relied on predictable demand in order to properly plan and allocate resources within our factories. However, as the COVID-19 pandemic began, the demand for our products became more volatile. This increased uncertainty impacted our ability to forecast demand and estimate our future staffing needs.

This highlighted a crucial need for us to develop an improved forecasting solution that could be implemented right away. With Amazon Forecast and AWS, our team was able to build a custom forecasting application in only two months. With limited data science experience internally, we collaborated with the Machine Learning Solutions Lab at AWS to identify a solution using Forecast. The service makes AI-powered forecasting algorithms available to non-expert practitioners. Now we have a state-of-the-art solution that has improved demand forecasting accuracy by 8%, saving an estimated $553,000 annually. In this post, I show you how easy it was to use AWS services to build an application that fit our needs.

Forecasting challenges at Foxconn

Our factory in Mexico assembles and ships electronics equipment to all regions in North and South America. Each product has its own seasonal variations and requires different levels of complexity and skill to build. Having individual forecasts for each product is important to understand the mix of skills we need in our workforce. Forecasting short-term demand allows us to staff for daily and weekly production requirements. Long-term forecasts are used to inform hiring decisions aimed at meeting demand in the upcoming months.

If demand forecasts are inaccurate, it can impact our business in several ways, but the most critical impact for us is staffing our factories. Underestimating demand can result in understaffing and require overtime to meet production targets. Overestimating can lead to overstaffing, which is very costly because workers are underutilized. Over- and underestimating present different costs, and balancing these costs is crucial to our business.

Prior to this effort, we relied on forecasts provided by our customers in order to make these staffing decisions. With the COVID-19 pandemic, our demand became more erratic. This unpredictability caused over- and underestimates of demand to become more common and staffing-related costs to increase. It became clear that we needed to find a better approach to forecasting.

Processing and modeling

Initially, we explored traditional forecasting methods such as ARIMA on our local machines. However, these approaches took a long time to develop, test, and tune for each product. It also required us to maintain a model for each individual product. From this experience, we learned that the new forecasting solution had to be fast, accurate, easy to manage, and scalable. Our team reached out to data scientists at the Amazon Machine Learning (ML) Solutions Lab, who advised and guided us through the process of building our solution around Forecast.

For this solution, we used a 3-year history of daily sales across nine different product categories. We chose these nine categories because they had a long history for the model to train on and exhibited different seasonal buying patterns. To begin, we uploaded the data from our on-premises servers into an Amazon Simple Storage Service (Amazon S3) bucket. After that, we preprocessed the data by removing known anomalies and organizing the data in a format compatible with Forecast. Our final dataset consisted of three columns: timestamp, item_id, and demand.
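As an illustration of the three-column layout, the following Python sketch writes daily sales records as CSV using only the standard library. It is not the code we ran in production: the function name, the sample values, and the `yyyy-MM-dd` timestamp format are assumptions for this example, and whether a header row is wanted depends on how the dataset import is configured.

```python
import csv
import io


def to_forecast_csv(daily_sales, include_header=False):
    """Serialize (date, item_id, units) tuples as timestamp,item_id,demand rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if include_header:
        writer.writerow(["timestamp", "item_id", "demand"])
    for day, item_id, units in daily_sales:
        writer.writerow([day, item_id, units])
    return buffer.getvalue()


# Hypothetical sample rows; real data would come from the on-premises database.
sample_rows = [
    ("2020-01-01", "category_a", 120),
    ("2020-01-01", "category_b", 45),
    ("2020-01-02", "category_a", 133),
]
print(to_forecast_csv(sample_rows, include_header=True))
```

Each product category becomes one `item_id`, so a single file can carry the history for all nine categories.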

For model training, we decided to use the AutoML functionality in Forecast. The AutoML tool tries to fit several different algorithms to the data and tunes each one to obtain the highest accuracy. The AutoML feature was vital for a team like ours with limited background in time-series modeling. It only took a few hours for Forecast to train a predictor. After the service identifies the most effective algorithm, it further tunes that algorithm through hyperparameter optimization (HPO) to get the final predictor. This AutoML capability eliminated weeks of development time that the team would have spent researching, training, and evaluating various algorithms.
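For readers who want to start AutoML training programmatically, the following is a hedged sketch of a request through the AWS SDK for Python (Boto3) `create_predictor` call. The predictor name, dataset group ARN, and forecast horizon are placeholders that you supply; they are not values from our project. The helper names are our own.

```python
def build_automl_predictor_request(
    predictor_name, dataset_group_arn, forecast_horizon, forecast_frequency="D"
):
    """Assemble keyword arguments for create_predictor with AutoML turned on."""
    return {
        "PredictorName": predictor_name,
        # Number of time steps to predict; the right value depends on your planning window.
        "ForecastHorizon": forecast_horizon,
        # Let Forecast evaluate its algorithms and pick the best performer.
        "PerformAutoML": True,
        "InputDataConfig": {"DatasetGroupArn": dataset_group_arn},
        # "D" matches daily data such as the daily sales history described above.
        "FeaturizationConfig": {"ForecastFrequency": forecast_frequency},
    }


def create_automl_predictor(request, forecast_client=None):
    """Submit the request and return the ARN of the predictor being trained."""
    if forecast_client is None:
        # Imported lazily so the request builder works without Boto3 installed.
        import boto3

        forecast_client = boto3.client("forecast")
    response = forecast_client.create_predictor(**request)
    return response["PredictorArn"]
```

Training runs asynchronously, so a caller would typically poll the predictor status before requesting forecasts.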

Forecast evaluation

After the AutoML finished training, it output results for a number of key performance metrics, including root mean squared error (RMSE) and weighted quantile loss (wQL). We chose to focus on wQL, which provides probabilistic estimates by evaluating the accuracy of the model’s predictions for different quantiles. A model with low wQL scores was important for our business because we face different costs associated with underestimating and overestimating demand. Based on our evaluations, the best model for our use case was CNN-QR.
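To show why wQL suits asymmetric costs, here is a small Python sketch of the weighted quantile loss for a single quantile, following the definition in the Forecast documentation: under-forecasts are weighted by the quantile and over-forecasts by one minus the quantile, and the total is normalized by the sum of absolute actuals. The function name and the sample numbers are ours, for illustration only.

```python
def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """Compute wQL at quantile tau (0 < tau < 1) for paired actuals and forecasts."""
    if len(actuals) != len(quantile_forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    total_demand = sum(abs(y) for y in actuals)
    if total_demand == 0:
        raise ValueError("wQL is undefined when all actuals are zero")
    loss = 0.0
    for y, q in zip(actuals, quantile_forecasts):
        # Penalize under-forecasting by tau and over-forecasting by (1 - tau).
        loss += tau * max(y - q, 0) + (1 - tau) * max(q - y, 0)
    return 2 * loss / total_demand


# Hypothetical example: the same 20-unit miss costs more at the 0.9 quantile
# when the forecast is too low than when it is too high.
under = weighted_quantile_loss([100], [80], tau=0.9)
over = weighted_quantile_loss([100], [120], tau=0.9)
print(under > over)
```

A high quantile such as 0.9 therefore rewards forecasts that rarely fall short of demand, while a low quantile rewards forecasts that rarely overshoot.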

We applied an additional evaluation step using a held-out test set. We combined the estimated forecast with internal business logic to evaluate how we would have planned staffing using the new forecast. The results were a resounding success. The new solution improved our forecast accuracy by 8%, saving an estimated $553,000 per year.

Application architecture

At Foxconn, much of our data resides on premises, so our application is a hybrid solution. The application loads the data to AWS from the on-premises server, builds the forecasts, and allows our team to evaluate the output on a client-side GUI.

To ingest the data into AWS, we have a program running on premises that queries the latest data from the on-premises database on a weekly basis. It uploads the data to an S3 bucket via an SFTP server managed by AWS Transfer Family. This upload triggers an AWS Lambda function that performs the data preprocessing and loads the prepared data back into Amazon S3. The preprocessed data being written to the S3 bucket triggers two Lambda functions. The first loads the data from Amazon S3 into an OLTP database. The second starts the Forecast training on the processed data. After the forecast is trained, the results are loaded into a separate S3 bucket and also into the OLTP database. The following diagram illustrates this architecture.
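The upload-triggered step can be pictured with the following Python sketch of a Lambda handler that reads the S3 event notification and finds the newly uploaded object. This is an illustrative outline under our own naming, not the function we deployed. The preprocessing itself is left as a comment.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    """Entry point invoked by Lambda when a new file lands in the bucket."""
    objects = uploaded_objects(event)
    for bucket, key in objects:
        print(f"New upload: s3://{bucket}/{key}")
        # Hypothetical next step: read the file, remove known anomalies,
        # and write the prepared data back to Amazon S3.
    return {"processed": len(objects)}
```

Writing the prepared file to a different prefix or bucket than the raw upload keeps the function from triggering itself.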

Finally, we wanted a way for customers to review the forecast outputs and provide their own feedback into the system. The team put together a GUI that uses Amazon API Gateway to allow users to visualize and interact with the forecast results in the database. The GUI allows the customer to review the latest forecast and choose a target production for upcoming weeks. The targets are uploaded back to the OLTP and used in further planning efforts.

Summary and next steps

In this post, we showed how a team new to AWS and data science built a custom forecasting solution with Forecast in 2 months. The application improved our forecast accuracy by 8%, saving an estimated $553,000 annually for our Mexico facility alone. Using Forecast also gave us the flexibility to scale out if we add new product categories in the future.

We’re thrilled to see the high performance of the Forecast solution using only the historical demand data. This is the first step in a larger plan to expand our use of ML for supply chain management and production planning.

Over the coming months, the team will migrate other planning data and workloads to the cloud. We’ll use the demand forecast in conjunction with inventory, backlog, and worker data to create an optimization solution for labor planning and allocation. These solutions will make the improved forecast even more impactful by allowing us to better plan production levels and resource needs.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program. To learn more about how to use Amazon Forecast, check out the service documentation.

About the Authors

Azim Siddique serves as Technical Advisor and CoE Architect at Foxconn. He provides architectural direction for the Digital Transformation program, conducts PoCs with emerging technologies, and guides engineering teams to deliver business value by leveraging digital technologies at scale.

Felice Chuang is a Data Architect at Foxconn. She uses her diverse skillset to implement end-to-end architecture and design for big data, data governance, and business intelligence applications. She supports analytic workloads and conducts PoCs for Digital Transformation programs.

Yash Shah is a data scientist in the Amazon ML Solutions Lab, where he works on a range of machine learning use cases from healthcare to manufacturing and retail. He has a formal background in Human Factors and Statistics, and was previously part of the Amazon SCOT team designing products to guide 3P sellers with efficient inventory management.

Dan Volk is a Data Scientist at Amazon ML Solutions Lab, where he helps AWS customers across various industries accelerate their AI and cloud adoption. Dan has worked in several fields including manufacturing, aerospace, and sports and holds a Masters in Data Science from UC Berkeley.

Xin Chen is a senior manager at Amazon ML Solutions Lab, where he leads the Automotive Vertical and helps AWS customers across different industries identify and build machine learning solutions to address their organization’s highest return-on-investment machine learning opportunities. Xin obtained his Ph.D. in Computer Science and Engineering from the University of Notre Dame.

The science behind Amazon’s new StyleSnap for Home feature

December 22, 2020

by admin Amazon AWS

StyleSnap for fashion and home features are made possible by use of multiple convolutional neural networks.

Controlling and auditing data exploration activities with Amazon SageMaker Studio and AWS Lake Formation

December 22, 2020

by Rodrigo Alarcon Amazon AWS

Highly regulated industries, such as financial services, are often required to audit all access to their data. This includes auditing exploratory activities performed by data scientists, who usually query data from within machine learning (ML) notebooks.

This post walks you through the steps to implement access control and auditing capabilities on a per-user basis, using Amazon SageMaker Studio notebooks and AWS Lake Formation access control policies. This is a how-to guide based on the Machine Learning Lens for the AWS Well-Architected Framework, following the design principles described in the Security Pillar:

Restrict access to ML systems
Ensure data governance
Enforce data lineage
Enforce regulatory compliance

Additional ML governance practices for experiments and models using Amazon SageMaker are described in the whitepaper Machine Learning Best Practices in Financial Services.

Overview of solution

This implementation uses Amazon Athena and the PyAthena client on a Studio notebook to query data on a data lake registered with Lake Formation.

SageMaker Studio is the first fully integrated development environment (IDE) for ML. Studio provides a single, web-based visual interface where you can perform all the steps required to build, train, and deploy ML models. Studio notebooks are collaborative notebooks that you can launch quickly, without setting up compute instances or file storage beforehand.

Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run.

Lake Formation is a fully managed service that makes it easier for you to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes, including securely making that data available for analytics and ML.

For an existing data lake registered with Lake Formation, the following diagram illustrates the proposed implementation.

The workflow includes the following steps:

Data scientists access the AWS Management Console using their AWS Identity and Access Management (IAM) user accounts and open Studio using individual user profiles. Each user profile has an associated execution role, which the user assumes while working on a Studio notebook. The diagram depicts two data scientists who require different permissions over data in the data lake. For example, in a data lake containing personally identifiable information (PII), user Data Scientist 1 has full access to every table in the Data Catalog, whereas Data Scientist 2 has limited access to a subset of tables (or columns) containing non-PII data.
The Studio notebook is associated with a Python kernel. The PyAthena client allows you to run exploratory ANSI SQL queries on the data lake through Athena, using the execution role assumed by the user while working with Studio.
Athena sends a data access request to Lake Formation, with the user profile execution role as principal. Data permissions in Lake Formation offer database-, table-, and column-level access control, restricting access to metadata and the corresponding data stored in Amazon S3. Lake Formation generates short-term credentials to be used for data access, and informs Athena which columns the principal is allowed to access.
Athena uses the short-term credentials provided by Lake Formation to access the data lake storage in Amazon S3, and retrieves the data matching the SQL query. Before returning the query result, Athena filters out columns that aren’t included in the data permissions reported by Lake Formation.
Athena returns the SQL query result to the Studio notebook.
Lake Formation records data access requests and other activity history for the registered data lake locations. AWS CloudTrail also records these and other API calls made to AWS during the entire flow, including Athena query requests.
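From the notebook side, the workflow above amounts to a few lines of Python. The following is a minimal sketch assuming the PyAthena package is installed in the Studio kernel. The staging location and Region are placeholders that you supply, and the helper names are our own.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_preview_query(database, table, columns, limit=10):
    """Build a small exploratory SELECT, rejecting anything that isn't a plain identifier."""
    for name in [database, table, *columns]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    column_list = ", ".join(columns)
    return f"SELECT {column_list} FROM {database}.{table} LIMIT {int(limit)}"


def run_athena_query(sql, staging_dir, region_name):
    """Run the query through Athena using the credentials of the current execution role."""
    # Imported lazily; PyAthena is a third-party package (pip install pyathena).
    from pyathena import connect

    cursor = connect(s3_staging_dir=staging_dir, region_name=region_name).cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```

Because no credentials are passed explicitly, the connection uses the user profile execution role, so Lake Formation evaluates permissions for that role.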

Walkthrough overview

In this walkthrough, I show you how to implement access control and audit using a Studio notebook and Lake Formation. You perform the following activities:

Register a new database in Lake Formation.
Create the required IAM policies, roles, group, and users.
Grant data permissions with Lake Formation.
Set up Studio.
Test Lake Formation access control policies using a Studio notebook.
Audit data access activity with Lake Formation and CloudTrail.

If you prefer to skip the initial setup activities and jump directly to testing and auditing, you can deploy the following AWS CloudFormation template in a Region that supports Studio and Lake Formation:

You can also deploy the template by downloading the CloudFormation template. When deploying the CloudFormation template, you provide the following parameters:

User name and password for a data scientist with full access to the dataset. The default user name is data-scientist-full.
User name and password for a data scientist with limited access to the dataset. The default user name is data-scientist-limited.
Names for the database and table to be created for the dataset. The default names are amazon_reviews_db and amazon_reviews_parquet, respectively.
VPC and subnets that are used by Studio to communicate with the Amazon Elastic File System (Amazon EFS) volume associated with Studio.

If you decide to deploy the CloudFormation template, after the CloudFormation stack is complete, you can go directly to the section Testing Lake Formation access control policies in this post.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account.
A data lake set up in Lake Formation with a Lake Formation Admin. For general guidance on how to set up Lake Formation, see Getting started with AWS Lake Formation.
Basic knowledge of creating IAM policies, roles, users, and groups.

Registering a new database in Lake Formation

For this post, I use the Amazon Customer Reviews Dataset to demonstrate how to provide granular access to the data lake for different data scientists. If you already have a dataset registered with Lake Formation that you want to use, you can skip this section and go to Creating required IAM roles and users for data scientists.

To register the Amazon Customer Reviews Dataset in Lake Formation, complete the following steps:

Sign in to the console with the IAM user configured as Lake Formation Admin.
On the Lake Formation console, in the navigation pane, under Data catalog, choose Databases.
Choose Create Database.
In Database details, select Database to create the database in your own account.
For Name, enter a name for the database, such as amazon_reviews_db.
For Location, enter s3://amazon-reviews-pds.
Under Default permissions for newly created tables, make sure to clear the option Use only IAM access control for new tables in this database.

Choose Create database.

The Amazon Customer Reviews Dataset is currently available in TSV and Parquet formats. The Parquet dataset is partitioned on Amazon S3 by product_category. To create a table in the data lake for the Parquet dataset, you can use an AWS Glue crawler or manually create the table using Athena, as described in Amazon Customer Reviews Dataset README file.

On the Athena console, create the table.

If you haven’t specified a query result location before, follow the instructions in Specifying a Query Result Location.

Choose the data source AwsDataCatalog.
Choose the database created in the previous step.

In the Query Editor, enter the following query:

CREATE EXTERNAL TABLE amazon_reviews_parquet(
  marketplace string, 
  customer_id string, 
  review_id string, 
  product_id string, 
  product_parent string, 
  product_title string, 
  star_rating int, 
  helpful_votes int, 
  total_votes int, 
  vine string, 
  verified_purchase string, 
  review_headline string, 
  review_body string, 
  review_date bigint, 
  year int)
PARTITIONED BY (product_category string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://amazon-reviews-pds/parquet/'

Choose Run query.

You should receive a Query successful response when the table is created.

Enter the following query to load the table partitions:
```
MSCK REPAIR TABLE amazon_reviews_parquet
```

Choose Run query.
On the Lake Formation console, in the navigation pane, under Data catalog, choose Tables.
For Table name, enter a table name.
Verify that you can see the table details.

Scroll down to see the table schema and partitions.

Finally, you register the database location with Lake Formation so the service can start enforcing data permissions on the database.

On the Lake Formation console, in the navigation pane, under Register and ingest, choose Data lake locations.
On the Data lake locations page, choose Register location.
For Amazon S3 path, enter s3://amazon-reviews-pds/.
For IAM role, you can keep the default role.
Choose Register location.
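The registration can also be scripted. The following Python sketch uses the Lake Formation `register_resource` call in the AWS SDK for Python (Boto3). It assumes that keeping the default role on the console corresponds to using the Lake Formation service-linked role, and the helper names are our own.

```python
def build_register_request(bucket_name, prefix=""):
    """Assemble keyword arguments for register_resource for an S3 location."""
    resource_arn = f"arn:aws:s3:::{bucket_name}"
    if prefix:
        resource_arn += "/" + prefix.strip("/")
    return {
        "ResourceArn": resource_arn,
        # Assumption: the console default maps to the service-linked role.
        "UseServiceLinkedRole": True,
    }


def register_location(request, lakeformation_client=None):
    """Register the location so Lake Formation can enforce data permissions on it."""
    if lakeformation_client is None:
        # Imported lazily so the request builder works without Boto3 installed.
        import boto3

        lakeformation_client = boto3.client("lakeformation")
    lakeformation_client.register_resource(**request)
```

For this walkthrough, the request would be built with the bucket name `amazon-reviews-pds` and run by the Lake Formation Admin.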

Creating required IAM roles and users for data scientists

To demonstrate how you can provide differentiated access to the dataset registered in the previous step, you first need to create IAM policies, roles, a group, and users. The following diagram illustrates the resources you configure in this section.

In this section, you complete the following high-level steps:

Create an IAM group named DataScientists containing two users: data-scientist-full and data-scientist-limited, to control their access to the console and to Studio.
Create a managed policy named DataScientistGroupPolicy and assign it to the group.

The policy allows users in the group to access Studio, but only using a SageMaker user profile that matches their IAM user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

For each IAM user, create individual IAM roles, which are used as user profile execution roles in Studio later.

The naming convention for these roles consists of a common prefix followed by the corresponding IAM user name. This allows you to audit activities on Studio notebooks—which are logged using Studio’s execution roles—and trace them back to the individual IAM users who performed the activities. For this post, I use the prefix SageMakerStudioExecutionRole_.

Create a managed policy named SageMakerUserProfileExecutionPolicy and assign it to each of the IAM roles.

The policy establishes coarse-grained access permissions to the data lake.

Follow the remainder of this section to create the IAM resources described. The permissions configured in this section grant common, coarse-grained access to data lake resources for all the IAM roles. In a later section, you use Lake Formation to establish fine-grained access permissions to Data Catalog resources and Amazon S3 locations for individual roles.
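The role naming convention described above is what makes audit records traceable, and it can be expressed in a few lines of Python. In this sketch, the function names are our own, and the parsing assumes roles are created without an IAM path. It accepts either an IAM role ARN or the assumed-role ARN form that appears in audit logs.

```python
ROLE_PREFIX = "SageMakerStudioExecutionRole_"


def execution_role_name(iam_user_name):
    """Derive the Studio execution role name for an IAM user."""
    return ROLE_PREFIX + iam_user_name


def iam_user_from_role_arn(role_arn):
    """Recover the IAM user name from a role or assumed-role ARN, or None if it doesn't match."""
    # arn:aws:iam::<account>:role/<role-name>
    # arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
    parts = role_arn.split("/")
    role_name = parts[1] if len(parts) > 1 else ""
    if not role_name.startswith(ROLE_PREFIX):
        return None
    return role_name[len(ROLE_PREFIX):]


# Example with a placeholder account ID.
arn = "arn:aws:sts::111122223333:assumed-role/SageMakerStudioExecutionRole_data-scientist-limited/SageMaker"
print(iam_user_from_role_arn(arn))
```

Keeping the prefix in one constant means the same value can be reused when creating the roles and when filtering audit records.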

Creating the required IAM group and users

To create your group and users, complete the following steps:

On the IAM console, use the JSON tab of the policy editor to create a new IAM managed policy named DataScientistGroupPolicy.

Use the following JSON policy document to provide permissions, providing your AWS Region and AWS account ID:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sagemaker:DescribeDomain",
                "sagemaker:ListDomains",
                "sagemaker:ListUserProfiles",
                "sagemaker:ListApps"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeUserProfile"
            ],
            "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}",
            "Effect": "Allow"
        },
        {
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeUserProfile"
            ],
            "Effect": "Deny",
            "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}"
        },
        {
            "Action": "sagemaker:*App",
            "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/${aws:username}/*",
            "Effect": "Allow"
        },
        {
            "Action": "sagemaker:*App",
            "Effect": "Deny",
            "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/${aws:username}/*"
        },
        {
            "Action": [
                "sagemaker:CreatePresignedNotebookInstanceUrl",
                "sagemaker:*NotebookInstance",
                "sagemaker:*NotebookInstanceLifecycleConfig",
                "sagemaker:CreateUserProfile",
                "sagemaker:DeleteDomain",
                "sagemaker:DeleteUserProfile"
            ],
            "Resource": "*",
            "Effect": "Deny"
        }
    ]
}

This policy forces an IAM user to open Studio using a SageMaker user profile with the same name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.
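As a rough illustration of the placeholder substitution this step asks for, the following sketch fills in <AWSREGION> and <AWSACCOUNT> in a policy template and checks that the result is still valid JSON. The helper name render_policy and the sample Region and account ID are made up for this example, and only one statement of the policy is shown to keep it short.

```python
import json

# One statement excerpted from DataScientistGroupPolicy; the full policy
# could be rendered the same way.
POLICY_TEMPLATE = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeUserProfile"
            ],
            "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}",
            "Effect": "Allow"
        }
    ]
}
"""


def render_policy(template: str, region: str, account_id: str) -> dict:
    """Hypothetical helper: substitute the placeholders and parse the JSON."""
    rendered = template.replace("<AWSREGION>", region).replace("<AWSACCOUNT>", account_id)
    if "<AWS" in rendered:
        raise ValueError("unresolved placeholder left in policy")
    return json.loads(rendered)


# Example values only; use your own Region and account ID.
policy = render_policy(POLICY_TEMPLATE, "us-east-1", "123456789012")
policy_json = json.dumps(policy, indent=4)
```

The policy_json string is what you would paste on the JSON tab. The ${aws:username} policy variable is deliberately left untouched, because IAM resolves it at request time.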

Create an IAM group.
1. For Group name, enter DataScientists.
2. Search and attach the AWS managed policy named DataScientist and the IAM policy created in the previous step.
Create two IAM users named data-scientist-full and data-scientist-limited.

Alternatively, you can provide names of your choice, as long as they’re a combination of lowercase letters, numbers, and hyphen (-). Later, you also give these names to their corresponding SageMaker user profiles, which at the time of writing only support those characters.
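If you pick your own names, a quick check like the following can catch characters outside the set described above before you create the users. This is a minimal sketch: the function name is hypothetical, and SageMaker may impose additional rules (such as a maximum length) that aren't checked here.

```python
import re

# Allowed characters per the text above: lowercase letters, numbers, and hyphen.
_NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_profile_name(name: str) -> bool:
    """Return True if the name uses only the characters described above."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, data-scientist-full passes this check, whereas a name containing uppercase letters or underscores doesn't.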

Creating the required IAM roles

To create your roles, complete the following steps:

On the IAM console, create a new managed policy named SageMakerUserProfileExecutionPolicy.

Use the following policy code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:GetTable",
                "glue:GetTables",
                "glue:SearchTables",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartitions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": "sts:AssumeRole",
            "Resource": "*",
            "Effect": "Deny"
        }
    ]
}

This policy provides common coarse-grained IAM permissions to the data lake, leaving Lake Formation permissions to control access to Data Catalog resources and Amazon S3 locations for individual users and roles. This is the recommended method for granting access to data in Lake Formation. For more information, see Methods for Fine-Grained Access Control.

Create an IAM role for the first data scientist (data-scientist-full), which is used as the corresponding user profile’s execution role.
1. On the Attach permissions policy page, search and attach the AWS managed policy AmazonSageMakerFullAccess.
2. For Role name, use the naming convention introduced at the beginning of this section to name the role SageMakerStudioExecutionRole_data-scientist-full.
To add the remaining policies, on the Roles page, choose the role name you just created.
Under Permissions, choose Attach policies.
Search and select the SageMakerUserProfileExecutionPolicy and AmazonAthenaFullAccess policies.
Choose Attach policy.
To restrict the Studio resources that can be created within Studio (such as image, kernel, or instance type) to only those belonging to the user profile associated with the first IAM role, embed an inline policy in the IAM role.
1. Use the following JSON policy document to scope down permissions for the user profile, providing the Region, account ID, and IAM user name associated to the first data scientist (data-scientist-full). You can name the inline policy DataScientist1IAMRoleInlinePolicy.
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "sagemaker:*App",
            "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/<IAMUSERNAME>/*",
            "Effect": "Allow"
        },
        {
            "Action": "sagemaker:*App",
            "Effect": "Deny",
            "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/<IAMUSERNAME>/*"
        }
    ]
}
```

Repeat the previous steps to create an IAM role for the second data scientist (data-scientist-limited).
1. Name the role SageMakerStudioExecutionRole_data-scientist-limited and the second inline policy DataScientist2IAMRoleInlinePolicy.
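Because the two roles differ only in the IAM user name, the role names and inline policies can be derived from a single template. The following sketch shows one way to do that; the function names, Region, and account ID are illustrative assumptions and not part of the walkthrough.

```python
import json

ROLE_PREFIX = "SageMakerStudioExecutionRole_"  # prefix used in this post


def execution_role_name(iam_user_name: str) -> str:
    """Apply the naming convention: common prefix followed by the IAM user name."""
    return ROLE_PREFIX + iam_user_name


def inline_policy(region: str, account_id: str, iam_user_name: str) -> dict:
    """Build the inline policy that scopes Studio apps to one user profile."""
    app_arn = f"arn:aws:sagemaker:{region}:{account_id}:app/*/{iam_user_name}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Action": "sagemaker:*App", "Resource": app_arn, "Effect": "Allow"},
            {"Action": "sagemaker:*App", "Effect": "Deny", "NotResource": app_arn},
        ],
    }


# Example Region and account ID only.
users = ["data-scientist-full", "data-scientist-limited"]
role_policies = {
    execution_role_name(user): inline_policy("us-east-1", "123456789012", user)
    for user in users
}
first_policy_json = json.dumps(
    role_policies["SageMakerStudioExecutionRole_data-scientist-full"], indent=4
)
```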

Granting data permissions with Lake Formation

Before data scientists are able to work on a Studio notebook, you grant the individual execution roles created in the previous section access to the Amazon Customer Reviews Dataset (or your own dataset). For this post, we implement different data permission policies for each data scientist to demonstrate how to grant granular access using Lake Formation.

Sign in to the console with the IAM user configured as Lake Formation Admin.
On the Lake Formation console, in the navigation pane, choose Tables.
On the Tables page, select the table you created earlier, such as amazon_reviews_parquet.
On the Actions menu, under Permissions, choose Grant.
Provide the following information to grant full access to the Amazon Customer Reviews Dataset table for the first data scientist:
Select My account.
For IAM users and roles, choose the execution role associated to the first data scientist, such as SageMakerStudioExecutionRole_data-scientist-full.
For Table permissions and Grantable permissions, select Select.
Choose Grant.
Repeat the first step to grant limited access to the dataset for the second data scientist, providing the following information:
Select My account.
For IAM users and roles, choose the execution role associated to the second data scientist, such as SageMakerStudioExecutionRole_data-scientist-limited.
For Columns, choose Include columns.
Choose a subset of columns, such as: product_category, product_id, product_parent, product_title, star_rating, review_headline, review_body, and review_date.
For Table permissions and Grantable permissions, select Select.
Choose Grant.
To verify the data permissions you have granted, on the Lake Formation console, in the navigation pane, choose Tables.
On the Tables page, select the table you created earlier, such as amazon_reviews_parquet.
On the Actions menu, under Permissions, choose View permissions to open the Data permissions menu.

You see a list of permissions granted for the table, including the permissions you just granted and permissions for the Lake Formation Admin.

If you see the principal IAMAllowedPrincipals listed on the Data permissions menu for the table, you must remove it. Select the principal and choose Revoke. On the Revoke permissions page, choose Revoke.
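The two grants configured on the console can also be written down as request parameters, which makes the difference between them easy to see. The following sketch builds dictionaries shaped for the Lake Formation GrantPermissions API (grant_permissions in Boto3). The database name amazon_reviews_db and the account ID are placeholders assumed for this example.

```python
def grant_request(role_arn, database, table, columns=None):
    """Build keyword arguments shaped for Lake Formation's GrantPermissions API."""
    if columns:
        # Column-level grant: only the listed columns are included.
        resource = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    else:
        # Table-level grant: all columns.
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": resource,
        "Permissions": ["SELECT"],
        "PermissionsWithGrantOption": ["SELECT"],
    }


ACCOUNT = "123456789012"  # placeholder account ID
ROLE_ARN = "arn:aws:iam::" + ACCOUNT + ":role/SageMakerStudioExecutionRole_{}"

full_request = grant_request(
    ROLE_ARN.format("data-scientist-full"), "amazon_reviews_db", "amazon_reviews_parquet"
)
limited_request = grant_request(
    ROLE_ARN.format("data-scientist-limited"),
    "amazon_reviews_db",
    "amazon_reviews_parquet",
    columns=[
        "product_category", "product_id", "product_parent", "product_title",
        "star_rating", "review_headline", "review_body", "review_date",
    ],
)

# With credentials for the Lake Formation Admin, each request could be sent as:
#   boto3.client("lakeformation").grant_permissions(**full_request)
```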

Setting up SageMaker Studio

You now onboard to Studio and create two user profiles, one for each data scientist.

When you onboard to Studio using IAM authentication, Studio creates a domain for your account. A domain consists of a list of authorized users, configuration settings, and an Amazon EFS volume, which contains data for the users, including notebooks, resources, and artifacts.

Each user receives a private home directory within Amazon EFS for notebooks, Git repositories, and data files. All traffic between the domain and the Amazon EFS volume is communicated through specified subnet IDs. By default, all other traffic goes over the internet through a SageMaker system Amazon Virtual Private Cloud (Amazon VPC).

Alternatively, instead of using the default SageMaker internet access, you could secure how Studio accesses resources by assigning a private VPC to the domain. This is beyond the scope of this post, but you can find additional details in Securing Amazon SageMaker Studio connectivity using a private VPC.

If you already have a Studio domain running, you can skip the onboarding process and follow the steps to create the SageMaker user profiles.

Onboarding to Studio

To onboard to Studio, complete the following steps:

Sign in to the console using an IAM user with service administrator permissions for SageMaker.
On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
On the Studio menu, under Get started, choose Standard setup.
For Authentication method, choose AWS Identity and Access Management (IAM).
Under Permission, for Execution role for all users, choose an option from the role selector.

You’re not using this execution role for the SageMaker user profiles that you create later. If you choose Create a new role, the Create an IAM role dialog opens.

For S3 buckets you specify, choose None.
Choose Create role.

SageMaker creates a new IAM role named AmazonSageMaker-ExecutionPolicy with the AmazonSageMakerFullAccess policy attached.

Under Network and storage, for VPC, choose the private VPC that is used for communication with the Amazon EFS volume.
For Subnet(s), choose multiple subnets in the VPC from different Availability Zones.
Choose Submit.
On the Studio Control Panel, under Studio Summary, wait for the status to change to Ready and the Add user button to be enabled.

Creating the SageMaker user profiles

To create your SageMaker user profiles, complete the following steps:

On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
On the Studio Control Panel, choose Add user.
For User name, enter data-scientist-full.
For Execution role, choose Enter a custom IAM role ARN.
Enter arn:aws:iam::<AWSACCOUNT>:role/SageMakerStudioExecutionRole_data-scientist-full, providing your AWS account ID.
After creating the first user profile, repeat the previous steps to create a second user profile.
1. For User name, enter data-scientist-limited.
2. For Execution role, enter the associated IAM role ARN.
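The same two profiles can be described as request parameters for the SageMaker CreateUserProfile API (create_user_profile in Boto3). The following is a sketch under assumptions: the domain ID and account ID are placeholders, and the helper name is hypothetical.

```python
def user_profile_request(domain_id: str, account_id: str, user_name: str) -> dict:
    """Build keyword arguments shaped for SageMaker's CreateUserProfile API."""
    role_arn = f"arn:aws:iam::{account_id}:role/SageMakerStudioExecutionRole_{user_name}"
    return {
        "DomainId": domain_id,
        "UserProfileName": user_name,
        "UserSettings": {"ExecutionRole": role_arn},
    }


# Placeholder domain ID and account ID; substitute your own values.
profile_requests = [
    user_profile_request("d-xxxxxxxxxxxx", "123456789012", name)
    for name in ("data-scientist-full", "data-scientist-limited")
]

# Each request could then be sent as:
#   boto3.client("sagemaker").create_user_profile(**request)
```

Keeping the user profile name and the role suffix identical is what lets the group policy and the audit trail tie a Studio session back to one IAM user.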

Testing Lake Formation access control policies

You now test the implemented Lake Formation access control policies by opening Studio using both user profiles. For each user profile, you run the same Studio notebook containing Athena queries. You should see different query outputs for each user profile, matching the data permissions implemented earlier.

Sign in to the console with IAM user data-scientist-full.
On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
On the Studio Control Panel, choose user name data-scientist-full.
Choose Open Studio.
Wait for SageMaker Studio to load.

Due to the IAM policies attached to the IAM user, you can only open Studio with a user profile matching the IAM user name.

In Studio, on the top menu, under File, under New, choose Terminal.
At the command prompt, run the following command to import a sample notebook to test Lake Formation data permissions:
```
git clone https://github.com/aws-samples/amazon-sagemaker-studio-audit.git
```

In the left sidebar, choose the file browser icon.
Navigate to amazon-sagemaker-studio-audit.
Open the notebook folder.
Choose sagemaker-studio-audit-control.ipynb to open the notebook.
In the Select Kernel dialog, choose Python 3 (Data Science).
Choose Select.
Wait for the kernel to load.

Starting from the first code cell in the notebook, press Shift + Enter to run the code cell.
Continue running all the code cells, waiting for the previous cell to finish before running the following cell.

After running the last SELECT query, because the user has full SELECT permissions for the table, the query output includes all the columns in the amazon_reviews_parquet table.

On the top menu, under File, choose Shut Down.
Choose Shutdown All to shut down all the Studio apps.
Close the Studio browser tab.
Repeat the previous steps in this section, this time signing in as the user data-scientist-limited and opening Studio with this user.
Don’t run the code cell in the section Create S3 bucket for query output files.

For this user, after running the same SELECT query in the Studio notebook, the query output only includes a subset of columns for the amazon_reviews_parquet table.

Auditing data access activity with Lake Formation and CloudTrail

In this section, we explore the events associated with the queries performed in the previous section. The Lake Formation console includes a dashboard that centralizes all CloudTrail logs specific to the service, such as GetDataAccess. These events can be correlated with other CloudTrail events, such as Athena query requests, to get a complete view of the queries users are running on the data lake.

Alternatively, instead of filtering individual events in Lake Formation and CloudTrail, you could run SQL queries to correlate CloudTrail logs using Athena. Such integration is beyond the scope of this post, but you can find additional details in Using the CloudTrail Console to Create an Athena Table for CloudTrail Logs and Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena.

Auditing data access activity with Lake Formation

To review activity in Lake Formation, complete the following steps:

Sign out of the AWS account.
Sign in to the console with the IAM user configured as Lake Formation Admin.
On the Lake Formation console, in the navigation pane, choose Dashboard.

Under Recent access activity, you can find the events associated to the data access for both users.

Choose the most recent event with event name GetDataAccess.
Choose View event.

Among other attributes, each event includes the following:

Event date and time
Event source (Lake Formation)
Athena query ID
Table being queried
IAM user embedded in the Lake Formation principal, based on the chosen role name convention
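Because the role names follow a fixed prefix, the IAM user can be recovered from the principal string with a small amount of parsing. The following sketch assumes the principal contains the role name somewhere in an ARN-like string; the sample ARN is illustrative only and isn't copied from a real event.

```python
import re
from typing import Optional

ROLE_PREFIX = "SageMakerStudioExecutionRole_"  # prefix used in this post
_USER_PATTERN = re.compile(re.escape(ROLE_PREFIX) + r"([a-z0-9-]+)")


def iam_user_from_principal(principal: str) -> Optional[str]:
    """Return the IAM user name embedded in the role name, or None if absent."""
    match = _USER_PATTERN.search(principal)
    return match.group(1) if match else None


# Illustrative principal string (assumed format).
sample = "arn:aws:sts::123456789012:assumed-role/SageMakerStudioExecutionRole_data-scientist-full/SageMaker"
user = iam_user_from_principal(sample)
```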

Auditing data access activity with CloudTrail

To review activity in CloudTrail, complete the following steps:

On the CloudTrail console, in the navigation pane, choose Event history.
In the Event history menu, for Filter, choose Event name.
Enter StartQueryExecution.
Expand the most recent event, then choose View event.

This event includes additional parameters that are useful to complete the audit analysis, such as the following:

Event source (Athena).
Athena query ID, matching the query ID from Lake Formation’s GetDataAccess event.
Query string.
Output location. The query output is stored in CSV format in this Amazon S3 location. Files for each query are named using the query ID.
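The correlation described in this section amounts to a join on the Athena query ID. The following sketch illustrates the idea with simplified, hypothetical records; the field names are invented for this example and don't reflect the actual CloudTrail event schema.

```python
def correlate_by_query_id(lake_formation_events, athena_events):
    """Join GetDataAccess records with StartQueryExecution records on query ID."""
    athena_by_id = {event["query_id"]: event for event in athena_events}
    joined = []
    for lf_event in lake_formation_events:
        athena_event = athena_by_id.get(lf_event["query_id"])
        if athena_event is not None:
            joined.append({
                "query_id": lf_event["query_id"],
                "principal": lf_event["principal"],
                "table": lf_event["table"],
                "query_string": athena_event["query_string"],
                "output_location": athena_event["output_location"],
            })
    return joined


# Made-up sample records for illustration.
lf_events = [
    {"query_id": "q-1", "principal": "SageMakerStudioExecutionRole_data-scientist-full",
     "table": "amazon_reviews_parquet"},
    {"query_id": "q-9", "principal": "SageMakerStudioExecutionRole_data-scientist-limited",
     "table": "amazon_reviews_parquet"},
]
athena_events = [
    {"query_id": "q-1", "query_string": "SELECT * FROM amazon_reviews_parquet LIMIT 10",
     "output_location": "s3://example-bucket/q-1.csv"},
]
audit_rows = correlate_by_query_id(lf_events, athena_events)
```

Each resulting row answers who ran which query against which table, and where the output was written.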

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

If you followed this walkthrough using the CloudFormation template, after shutting down the Studio apps for each user profile, deleting the stack deletes the remaining resources.

If you encounter any errors, open the Studio Control Panel and verify that all the apps for every user profile are in Deleted state before deleting the stack.

If you didn’t use the CloudFormation template, you can manually delete the resources you created:

On the Studio Control Panel, for each user profile, choose User Details.
Choose Delete user.
When all users are deleted, choose Delete Studio.
On the Amazon EFS console, delete the volume that was automatically created for Studio.
On the Lake Formation console, delete the table and the database created for the Amazon Customer Reviews Dataset.
Remove the data lake location for the dataset.
On the IAM console, delete the IAM users, group, and roles created for this walkthrough.
Delete the policies you created for these principals.
On the Amazon S3 console, empty and delete the bucket created for storing Athena query results (starting with sagemaker-audit-control-query-results-), and the bucket created by Studio to share notebooks (starting with sagemaker-studio-).
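When an account contains many buckets, filtering by the two prefixes mentioned above can help you find the ones to review. This is a minimal sketch operating on a plain list of bucket names; the sample names are made up.

```python
BUCKET_PREFIXES = (
    "sagemaker-audit-control-query-results-",
    "sagemaker-studio-",
)


def buckets_to_review(bucket_names):
    """Return the bucket names that start with one of the walkthrough prefixes."""
    return [name for name in bucket_names if name.startswith(BUCKET_PREFIXES)]


# Made-up bucket names for illustration.
candidates = buckets_to_review([
    "sagemaker-audit-control-query-results-us-east-1-123456789012",
    "sagemaker-studio-123456789012-abcdefgh",
    "my-company-data-lake",
])
```

Review the matches before deleting anything, because the sagemaker-studio- prefix can also match buckets that belong to other Studio domains.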

Conclusion

This post described how to implement access control and auditing capabilities on a per-user basis in ML projects, using Studio notebooks, Athena, and Lake Formation to enforce access control policies when performing exploratory activities in a data lake.

I thank you for following this walkthrough and I invite you to implement it using the associated CloudFormation template. You’re also welcome to visit the GitHub repo for the project.

About the Author

Rodrigo Alarcon is a Sr. Solutions Architect with AWS based out of Santiago, Chile. Rodrigo has over 10 years of experience in IT security and network infrastructure. His interests include machine learning and cybersecurity.

Amazon Machine Learning University launches new, advanced course

December 21, 2020

by admin Amazon AWS

Decision trees class gives students access to cutting-edge instruction on key machine-learning topic.

Using knowledge graphs to streamline COVID-19 research

December 21, 2020

by admin Amazon AWS

A knowledge graph linking research papers, authors, and topics should make it easier for researchers fighting COVID-19 to discover relevant information.

Building and deploying an object detection computer vision application at the edge with AWS Panorama

December 19, 2020

by Surya Kari Amazon AWS

Computer vision (CV) is a sought-after technology among companies looking to take advantage of machine learning (ML) to improve their business processes. Enterprises have access to large amounts of video assets from their existing cameras, but the data remains largely untapped without the right tools to gain insights from it. CV provides the tools to unlock opportunities with this data, so you can automate processes that typically require visual inspection, such as evaluating manufacturing quality or identifying bottlenecks in industrial processes. You can take advantage of CV models running in the cloud to automate these inspection tasks, but there are circumstances when relying exclusively on the cloud isn’t optimal due to latency requirements or intermittent connectivity that make a round trip to the cloud infeasible.

AWS Panorama enables you to bring CV to on-premises cameras and make predictions locally with high accuracy and low latency. On the AWS Panorama console, you can easily bring custom trained models to the edge and build applications that integrate with custom business logic. You can then deploy these applications on the AWS Panorama Appliance, which auto-discovers existing IP cameras and runs the applications on video streams to make real-time predictions. You can easily integrate the inference results with other AWS services such as Amazon QuickSight to derive ML-powered business intelligence (BI) or route the results to your on-premises systems to trigger an immediate action.

In this post, we look at how you can use AWS Panorama to build and deploy a parking lot car counter application.

Parking lot car counter application

Parking facilities, like the one in the image below, need to know how many cars are parked in a given facility at any point in time, to assess vacancy and take in more customers. You also want to keep track of the number of cars that enter and exit your facility during any given time. You can use this information to improve operations, such as adding more parking payment centers, optimizing price, directing cars to different floors, and more. Parking center owners typically operate more than one facility and are looking for real-time aggregate details of vacancy in order to direct traffic to less-populated facilities and offer real-time discounts.

To achieve these goals, parking centers sometimes manually count the cars to provide a tally. This inspection can be error prone and isn’t optimal for capturing real-time data. Some parking facilities install sensors that give the number of cars in a particular lot, but these sensors are typically not integrated with analytics systems to derive actionable insights.

With the AWS Panorama Appliance, you can get a real-time count of the number of cars, collect metrics across sites, and correlate them to improve your operations. Let’s see how we can solve this once manual (and expensive) problem using CV at the edge. We go through the details of the trained model and the business logic code, and walk through the steps to create and deploy an application on your AWS Panorama Appliance Developer Kit so you can view the inferences on a connected HDMI screen.

Computer vision model

A CV model helps us extract useful information from images and video frames. We can detect and localize objects in a scene, identify and classify images, and recognize actions. You can choose from a variety of frameworks such as TensorFlow, MXNet, and PyTorch to build your CV models, or you can choose from a variety of pre-trained models available from AWS or from third parties such as ISVs.

For this example, we use a pre-trained GluonCV model downloaded from the GluonCV model zoo.

The model we use is the ssd_512_resnet50_v1_voc model. It’s trained on the very popular PASCAL VOC dataset. It has 20 classes of objects annotated and labeled for a model to be trained on. The following code shows the classes and their indexes.

voc_classes = {
	'aeroplane'		: 0,
	'bicycle'		: 1,
	'bird'			: 2,
	'boat'			: 3,
	'bottle'		: 4,
	'bus'			: 5,
	'car'			: 6,
	'cat'			: 7,
	'chair'			: 8,
	'cow'			: 9,
	'diningtable'	: 10,
	'dog'			: 11,
	'horse'			: 12,
	'motorbike'		: 13,
	'person'		: 14,
	'pottedplant'	: 15,
	'sheep'			: 16,
	'sofa'			: 17,
	'train'			: 18,
	'tvmonitor'		: 19
}

For our use case, we’re detecting and counting cars. Because we’re talking about cars, we use class 6 as the index in our business logic later in this post.
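Rather than hardcoding the index, you can derive it from the ordered list of class names. The following sketch rebuilds the mapping shown above and looks up the index for car.

```python
# PASCAL VOC class names in the index order listed above.
VOC_CLASS_NAMES = [
    'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
    'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person',
    'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor',
]

# Map each class name to its position in the list.
voc_classes = {name: index for index, name in enumerate(VOC_CLASS_NAMES)}

car_index = voc_classes['car']  # 6
```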

Our input image shape is [1, 3, 512, 512]. These are the dimensions of the input image the model expects to be given:

Batch size – 1
Number of channels – 3
Width and height of the input image – 512, 512
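To make the shape concrete, the following sketch checks a candidate shape against the expected one and computes how many values a single input holds. The byte count assumes 32-bit floats, which is an assumption of this example.

```python
from math import prod

# Batch size, number of channels, then the two 512-pixel spatial dimensions.
INPUT_SHAPE = (1, 3, 512, 512)


def matches_input_shape(shape) -> bool:
    """Return True if the given shape is exactly what the model expects."""
    return tuple(shape) == INPUT_SHAPE


num_values = prod(INPUT_SHAPE)   # 1 * 3 * 512 * 512 = 786432
float32_bytes = num_values * 4   # 3145728 bytes, assuming 4 bytes per value
```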

Uploading the model artifacts

We need to upload the model artifacts to an Amazon Simple Storage Service (Amazon S3) bucket whose name starts with aws-panorama-. After downloading the model artifacts, we upload the ssd_512_resnet50_v1_voc.tar.gz file to the S3 bucket. To create your bucket, complete the following steps:

Download the model artifacts.
On the Amazon S3 console, choose Create bucket.
For Bucket name, enter a name starting with aws-panorama-.

Choose Create bucket.

You can view the object details in the Object overview section. The model URI is s3://aws-panorama-models-bucket/ssd_512_resnet50_v1_voc.tar.gz.
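The following sketch shows how the model URI is composed from the bucket name and object key, and checks the bucket naming convention mentioned earlier. The helper name is hypothetical.

```python
REQUIRED_PREFIX = "aws-panorama-"


def model_s3_uri(bucket: str, key: str) -> str:
    """Compose the S3 URI for a model artifact, enforcing the bucket prefix."""
    if not bucket.startswith(REQUIRED_PREFIX):
        raise ValueError(f"bucket name must start with {REQUIRED_PREFIX!r}")
    return f"s3://{bucket}/{key}"


uri = model_s3_uri("aws-panorama-models-bucket", "ssd_512_resnet50_v1_voc.tar.gz")
```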

The business logic code

After we upload the model artifacts to an S3 bucket, let’s turn our attention to the business logic code. For more information about the sample developer code, see Sample application code. For a comparative example of code samples, see AWS Panorama People Counter Example on GitHub.

Before we look at the full code, let’s look at a skeleton of the business logic code we use:

### Lambda skeleton

class car_counter(object):
    def interface(self):
        # defines the parameters that interface with other services from Panorama
        return

    def init(self, parameters, inputs, outputs):
        # defines the attributes such as arrays and model objects that will be used in the application
        return

    def entry(self, inputs, outputs):
        # defines the application logic responsible for predicting using the inputs and handles what to do
        # with the outputs
        return

The business logic code and AWS Lambda function expect to have at least the interface method, init method, and the entry method.
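A quick way to confirm that a class provides these three methods before deploying is a check like the following. This is only a local sanity check written for this example; it isn't part of the AWS Panorama SDK, and the two sample classes are stand-ins.

```python
REQUIRED_METHODS = ("interface", "init", "entry")


def missing_methods(cls):
    """Return the required method names that the class doesn't define."""
    return [name for name in REQUIRED_METHODS
            if not callable(getattr(cls, name, None))]


class complete_skeleton(object):
    def interface(self):
        return

    def init(self, parameters, inputs, outputs):
        return

    def entry(self, inputs, outputs):
        return


class incomplete_skeleton(object):
    def interface(self):
        return
```

For the first class, missing_methods returns an empty list; for the second, it reports init and entry.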

Let’s go through the Python business logic code next.

import panoramasdk
import cv2
import numpy as np
import time
import boto3

# Global Variables 

HEIGHT = 512
WIDTH = 512

class car_counter(panoramasdk.base):
    
    def interface(self):
        return {
                "parameters":
                (
                    ("float", "threshold", "Detection threshold", 0.10),
                    ("model", "car_counter", "Model for car counting", "ssd_512_resnet50_v1_voc"), 
                    ("int", "batch_size", "Model batch size", 1),
                    ("float", "car_index", "car index based on dataset used", 6),
                ),
                "inputs":
                (
                    ("media[]", "video_in", "Camera input stream"),
                ),
                "outputs":
                (
                    ("media[video_in]", "video_out", "Camera output stream"),
                    
                ) 
            }
    
            
    def init(self, parameters, inputs, outputs):  
        try:  
            
            print('Loading Model')
            self.model = panoramasdk.model()
            self.model.open(parameters.car_counter, 1)
            print('Model Loaded')
            
            # Detection probability threshold.
            self.threshold = parameters.threshold
            # Frame Number Initialization
            self.frame_num = 0
            # Number of cars
            self.number_cars = 0
            # Bounding Box Colors
            self.colours = np.random.rand(32, 3)
            # Car Index for Model from parameters
            self.car_index = parameters.car_index
            # Set threshold for model from parameters 
            self.threshold = parameters.threshold
                        
            class_info = self.model.get_output(0)
            prob_info = self.model.get_output(1)
            rect_info = self.model.get_output(2)

            self.class_array = np.empty(class_info.get_dims(), dtype=class_info.get_type())
            self.prob_array = np.empty(prob_info.get_dims(), dtype=prob_info.get_type())
            self.rect_array = np.empty(rect_info.get_dims(), dtype=rect_info.get_type())

            return True
        
        except Exception as e:
            print("Exception: {}".format(e))
            return False

    def preprocess(self, img, size):
        
        resized = cv2.resize(img, (size, size))
        mean = [0.485, 0.456, 0.406]  # RGB
        std = [0.229, 0.224, 0.225]  # RGB
        
        # converting array of ints to floats
        img = resized.astype(np.float32) / 255. 
        img_a = img[:, :, 0]
        img_b = img[:, :, 1]
        img_c = img[:, :, 2]
        
        # Extracting single channels from 3 channel image
        # The above code could also be replaced with cv2.split(img)
        # normalizing per channel data:
        
        img_a = (img_a - mean[0]) / std[0]
        img_b = (img_b - mean[1]) / std[1]
        img_c = (img_c - mean[2]) / std[2]
        
        # putting the 3 channels back together:
        x1 = [[[], [], []]]
        x1[0][0] = img_a
        x1[0][1] = img_b
        x1[0][2] = img_c
        x1 = np.asarray(x1)
        
        return x1
    
    def get_number_cars(self, class_data, prob_data):
        
        # get indices of car detections in class data
        car_indices = [i for i in range(len(class_data)) if int(class_data[i]) == self.car_index]
        # use these indices to filter out anything that is less than self.threshold
        prob_car_indices = [i for i in car_indices if prob_data[i] >= self.threshold]
        return prob_car_indices

    
    def entry(self, inputs, outputs):        
        for i in range(len(inputs.video_in)):
            stream = inputs.video_in[i]
            car_image = stream.image

            # Pre Process Frame
            x1 = self.preprocess(car_image, 512)
                                    
            # Do inference on the new frame.
            
            self.model.batch(0, x1)        
            self.model.flush()
            
            # Get the results.            
            resultBatchSet = self.model.get_result()
            class_batch = resultBatchSet.get(0)
            prob_batch = resultBatchSet.get(1)
            rect_batch = resultBatchSet.get(2)

            class_batch.get(0, self.class_array)
            prob_batch.get(1, self.prob_array)
            rect_batch.get(2, self.rect_array)

            class_data = self.class_array[0]
            prob_data = self.prob_array[0]
            rect_data = self.rect_array[0]
            
            
            # Get Indices of classes that correspond to Cars
            car_indices = self.get_number_cars(class_data, prob_data)
            
            try:
                self.number_cars = len(car_indices)
            except:
                self.number_cars = 0
            
            # Visualize with Opencv or stream.(media) 
            
            # Draw Bounding boxes on HDMI output
            if self.number_cars > 0:
                for index in car_indices:
                    
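                    # np.float was removed in NumPy 1.24; the built-in float behaves the same here.
                    # Horizontal coordinates are scaled by WIDTH and vertical ones by HEIGHT.
                    left = np.clip(rect_data[index][0] / float(WIDTH), 0, 1)
                    top = np.clip(rect_data[index][1] / float(HEIGHT), 0, 1)
                    right = np.clip(rect_data[index][2] / float(WIDTH), 0, 1)
                    bottom = np.clip(rect_data[index][3] / float(HEIGHT), 0, 1)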
                    
                    stream.add_rect(left, top, right, bottom)
                    stream.add_label(str(prob_data[index][0]), right, bottom) 
                    
            stream.add_label('Number of Cars : {}'.format(self.number_cars), 0.8, 0.05)
        
            self.model.release_result(resultBatchSet)            
            outputs.video_out[i] = stream
        return True


def main():
        
    car_counter().run()
main()

For a full explanation of the code and the methods used, see the AWS Panorama Developer Guide.

The code has the following notable features:

car_index – 6
model_used – ssd_512_resnet50_v1_voc (parameters.car_counter)
add_label – Adds text to the HDMI output
add_rect – Adds bounding boxes around the object of interest
Image – Gets the NumPy array of the frame read from the camera
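To see the detection filtering and box scaling in isolation, the following sketch reproduces the same logic on plain Python lists, without the AWS Panorama SDK or NumPy. The flat list inputs and the box order (left, top, right, bottom, in pixels of the 512 x 512 input) are simplifying assumptions for this example.

```python
def filter_cars(class_data, prob_data, car_index=6, threshold=0.10):
    """Return indices of detections that are cars at or above the threshold."""
    return [
        i for i, (cls, prob) in enumerate(zip(class_data, prob_data))
        if int(cls) == car_index and prob >= threshold
    ]


def normalize_box(box, width=512, height=512):
    """Scale pixel coordinates to the 0-1 range and clip values outside it."""
    left, top, right, bottom = box

    def clip(value):
        return min(max(value, 0.0), 1.0)

    return (clip(left / width), clip(top / height),
            clip(right / width), clip(bottom / height))


# Four made-up detections: three cars (class 6) and one person (class 14).
kept = filter_cars([6, 14, 6, 6], [0.9, 0.8, 0.05, 0.10])  # [0, 3]
box = normalize_box((128, 256, 640, -10))                   # (0.25, 0.5, 1.0, 0.0)
```

The third detection is dropped because its score of 0.05 falls below the 0.10 threshold, and the box coordinates that fall outside the frame are clipped to the frame edges.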

Now that we have the code ready, we need to create a Lambda function with the preceding code.

On the Lambda console, choose Functions.
Choose Create function.
For Function name, enter a name.
Choose Create function.

Rename the Python file to car_counter.py.

Change the handler to car_counter.main.

In the Basic settings section, confirm that the memory is 2048 MB and the timeout is 2 minutes.

On the Actions menu, choose Publish new version.
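The memory and timeout settings from these steps can also be expressed as parameters for the Lambda UpdateFunctionConfiguration API (update_function_configuration in Boto3), which takes the timeout in seconds. The following is a sketch; the helper name is hypothetical, and the function name should be whatever you entered when creating the function.

```python
def lambda_settings(function_name: str, memory_mb: int = 2048, timeout_minutes: int = 2) -> dict:
    """Build keyword arguments shaped for Lambda's UpdateFunctionConfiguration API."""
    return {
        "FunctionName": function_name,
        "MemorySize": memory_mb,          # in MB
        "Timeout": timeout_minutes * 60,  # the API expects seconds
    }


settings = lambda_settings("CarCounter")

# With suitable credentials, the settings could be applied as:
#   boto3.client("lambda").update_function_configuration(**settings)
```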

We’re now ready to create our application and deploy to the device. We use the model we uploaded and the Lambda function we created in the subsequent steps.

Creating the application

To create your application, complete the following steps:

On the AWS Panorama console, choose My applications.
Choose Create application.

Choose Begin creation.

For Name, enter car_counter.
For Description, enter an optional description.
Choose Next.

Click Choose model.

For Model artifact path, enter the model S3 URI.
For Model name, enter the same name that you used in the business logic code.
In the Input configuration section, choose Add input.
For Input name, enter the input Tensor name (for this post, data).
For Shape, enter the frame shape (for this post, 1, 3, 512, 512).

Choose Next.
Under Lambda functions, select your function (CarCounter).

Choose Next.
Choose Proceed to deployment.

Deploying your application

To deploy your new application, complete the following steps:

Choose Choose appliance.

Choose the appliance you created.
Choose Choose camera streams.

Select your camera stream.

Choose Deploy.

Checking the output

After we deploy the application, we can check the HDMI output or use Amazon CloudWatch Logs. For more information, see Setting up the AWS Panorama Appliance Developer Kit or Viewing AWS Panorama event logs in CloudWatch Logs, respectively.

If we have an HDMI output connected to the device, we should see the output from the device on the HDMI screen, as in the following screenshot.

And that’s it. We have successfully deployed a car counting use case to the AWS Panorama Appliance.

Extending the solution

We can do so much more with this application and extend it to other parking-related use cases, such as the following:

Parking lot routing – Where are the vacant parking spots?
Parking lot monitoring – Are cars parked in appropriate spots? Are they too close to each other?

You can integrate these use cases with other AWS services like QuickSight, S3 buckets, and MQTT, just to name a few, and get real-time inference data for monitoring cars in a parking lot.
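One way to prepare inference results for such integrations is to serialize each count as a small JSON message. The following is a minimal sketch; the field names, the camera ID, and the helper build_count_message are hypothetical and chosen only for illustration.

```python
import json
from datetime import datetime, timezone

def build_count_message(camera_id, car_count, when=None):
    """Serialize one car-count reading as a JSON string."""
    when = when or datetime.now(timezone.utc)
    return json.dumps({
        "camera_id": camera_id,         # hypothetical identifier
        "car_count": car_count,         # number of cars detected in the frame
        "timestamp": when.isoformat(),  # ISO 8601, UTC
    })

print(build_count_message("lot-a-cam-1", 12))
```

The resulting string could then be passed as the payload of an MQTT publish call or written as the body of an object in an S3 bucket using Boto3.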

You can adapt this example and build other object detection applications for your use case. We will also continue to share more examples with you so you can build, develop, and test with the AWS Panorama Appliance Developer Kit.

Conclusion

The applications of computer vision at the edge are only now being imagined and built out. As a data scientist, I’m very excited to be innovating in lockstep with AWS Panorama customers to help you ideate and build CV models that are uniquely tailored to solve your problems.

And we’re just scratching the surface of what’s possible with CV at the edge and the AWS Panorama ecosystem.

Resources

For more information about using AWS Panorama, see the following resources:

GitHub examples – Introduction to AWS Panorama
Sending output to an S3 bucket – AWS SDK for Python (Boto3)
Sending MQTT messages – Using the AWS IoT MQTT topic
Setting up the AWS Panorama Appliance Developer Kit – Register and configure the developer kit
Setting up a camera stream – Add a camera stream

About the Author

Surya Kari is a Data Scientist who works on AI devices within AWS. His interests lie in computer vision and autonomous systems.

Population health applications with Amazon HealthLake – Part 1: Analytics and monitoring using Amazon QuickSight

December 18, 2020

by Mithil Shah Amazon AWS

Healthcare has recently been transformed by two remarkable innovations: Medical Interoperability and machine learning (ML). Medical Interoperability refers to the ability to share healthcare information across multiple systems. To take advantage of these transformations, we launched a new HIPAA-eligible healthcare service, Amazon HealthLake, now in preview at re:Invent 2020. In the re:Invent announcement, we talk about how HealthLake enables organizations to structure, tag, index, query, and apply ML to analyze health data at scale. In a series of posts, starting with this one, we show you how to use HealthLake to derive insights or ask new questions of your health data using advanced analytics.

The primary source of healthcare data is the patient electronic health record (EHR). Health Level Seven International (HL7), a non-profit standards development organization, announced a standard for exchanging structured medical data called the Fast Healthcare Interoperability Resources (FHIR). FHIR is widely supported by healthcare software vendors and was supported at an American Medical Informatics Association meeting by EHR vendors. The FHIR specification makes structured medical data easily accessible to clinical researchers and informaticians, and also makes it easy for ML tools to process this data and extract valuable information from it. For example, FHIR provides a resource to capture documents, such as doctor’s notes or lab report summaries. However, this data needs to be extracted and transformed before it can be searched and analyzed.

As the FHIR-formatted medical data is ingested, HealthLake uses natural language processing trained to understand medical terminology to enrich unstructured data with standardized labels (such as for medications, conditions, diagnoses, and procedures), so all this information can be normalized and easily searched. One example is parsing clinical narratives in the FHIR DocumentReference resource to extract, tag, and structure the medical entities, including ICD-10-CM codes. This transformed data is then added to the patient’s record, providing a complete view of all of the patient’s attributes (such as medications, tests, procedures, and diagnoses) that is optimized for search and applying advanced analytics. In this post, we walk you through the process of creating a population health dashboard on this enriched data, using AWS Glue, Amazon Athena, and Amazon QuickSight.

Building a population health dashboard

After HealthLake extracts and tags the FHIR-formatted data, you can use advanced analytics and ML with your now normalized data to make sense of it all. Next, we walk through using QuickSight to build a population health dashboard to quickly analyze data from HealthLake. The following diagram illustrates the solution architecture.

In this example, we build a dashboard for patients diagnosed with congestive heart failure (CHF), a chronic medical condition in which the heart doesn’t pump blood as well as it should. We use the MIMIC-III (Medical Information Mart for Intensive Care III) data, a large, freely available database comprising de-identified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. [1]

The tools used for processing the data and building the dashboard include AWS Glue, Athena, and QuickSight. AWS Glue is a serverless data preparation service that makes it easy to extract, transform, and load (ETL) data, in order to prepare the data for subsequent analytical processing and presentation in charts and dashboards. An AWS Glue crawler is a program that determines the schema of data and creates a metadata table in the AWS Glue Data Catalog that describes the data schema. An AWS Glue job encapsulates a script that reads, processes, and writes data to a new schema. Finally, we use Athena, an interactive query service that can query data in Amazon Simple Storage Service (Amazon S3) using standard SQL queries on tables in a Data Catalog.

Connecting Athena with HealthLake

We first convert the MIMIC-III data to FHIR format and then copy the formatted data into a data store in HealthLake, which extracts medical entities from textual narratives such as doctors’ notes and discharge summaries. The clinical notes are stored in the DocumentReference resource, and the extracted entities are tagged to each patient’s record in the DocumentReference with FHIR extension fields represented in the JSON object. The following screenshot is an example of how the augmented DocumentReference looks.

Now that the data is indexed and tagged in HealthLake, we export the normalized data to an S3 bucket. The exported data is in NDJSON format, with one folder per resource.
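NDJSON (newline-delimited JSON) stores one complete JSON object per line, which is why each exported file can be read record by record. The following is a minimal sketch of reading such a file in Python; the two Patient records are made up for illustration and are not taken from the MIMIC-III export.

```python
import json

def parse_ndjson(text):
    """Return one dict per non-empty line of an NDJSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Two made-up Patient resources, one per line.
sample = (
    '{"resourceType": "Patient", "id": "p1"}\n'
    '{"resourceType": "Patient", "id": "p2"}\n'
)

patients = parse_ndjson(sample)
print([p["id"] for p in patients])
```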

An AWS Glue crawler is written for each folder to crawl the NDJSON file and create tables in the Data Catalog. Because the default classifiers can work with NDJSON files directly, no special classifiers are needed. There is one crawler per FHIR resource and each crawler creates one table. These tables are then queried directly from within Athena; however, for some queries, we use AWS Glue jobs to transform and partition the data to make the queries simpler and faster.

We create two AWS Glue jobs for this project to transform the DocumentReference and Condition tables. Both jobs transform the data from JSON to Apache Parquet, to improve query performance and reduce data storage and scanning costs. In addition, both jobs partition the data by patient first, and then by the identity of the individual FHIR resources. This improves the performance of patient- and record-based queries issued through Athena. The resulting Parquet files are tabular in structure, which also simplifies queries issued via clients, because they can reference detected entities and ICD-10 codes directly, and no longer need to navigate the nested FHIR structure of the DocumentReference extension element. After these jobs create the Parquet files in Amazon S3, we create and run crawlers to add the table schema into the Data Catalog.

Finally, to support keyword-based queries for conditions via the QuickSight dashboard, we create a view of the transformed DocumentReference table that includes ICD-10-CM textual descriptions and the corresponding ICD-10-CM codes.

Building a population health dashboard with QuickSight

QuickSight is a cloud-based business intelligence (BI) service that makes it easy to build dashboards in the cloud. It can obtain data from various sources, but for our use case, we use Athena to create a data source for our QuickSight dashboard. From the previous step, we have Athena tables that use data from HealthLake. As the next step, we create a dataset in QuickSight from a table in Athena. We use SPICE (Super-fast, Parallel, In-memory Calculation Engine) to store the data because this allows us to import the data only one time and use it multiple times.

After creating the dataset, we create a number of analytic components in the dashboard. These components allow us to aggregate the data and create charts and time-series visualizations at the patient and population levels.

The first tab of the dashboard that we build provides a view into the entire patient population and their encounters with the health system (see the following screenshot). The target audience for this dashboard consists of healthcare providers or caregivers.

The dashboard contains filters that allow us to drill further into the results by referring hospital or by date. It shows the number of patients, their demographic distribution, the number of encounters, the average hospital stay, and more.

The second tab joins hospital encounters with patient medical conditions. This view provides the number of encounters per referring hospital, broken down by type of encounter and by age. We also create a word cloud for major medical conditions to easily drill down on the details and understand the distribution of these conditions across the entire population by encounter type.

The third component contains a patient timeline. The timeline is in the form of a tree table. The first column is the patient name. The second column contains the start date of the encounter sorted chronologically. The third column contains the list of ranked conditions diagnosed in that encounter. The last column contains the list of procedures performed during that encounter.

To build the patient timeline, we create a view in Athena that joins the condition, patient, encounter, and observation tables. The encounter table contains an array of conditions, so we need to use the unnest command to expand it into one row per condition. The following code is a sample SQL query to join the tables:

SELECT o.code.text, o.effectivedatetime, o.valuequantity, p.name[1].family, e.hospitalization.dischargedisposition.coding[1].display as dischargeddisposition, e.period.start, e.period."end", e.hospitalization.admitsource.coding[1].display as admitsource, e.class.display as encounter_class, c.code.coding[1].display as condition
    FROM "healthai_mimic"."encounter" e, unnest(diagnosis) t(cond), condition c, patient p, observation o
    WHERE ("split"("cond"."condition"."reference", '/')[2] = "c"."id")
    AND ("split"("e"."subject"."reference", '/')[2] = "p"."id")
    AND ("split"("o"."subject"."reference", '/')[2] = "p"."id")
    AND ("split"("o"."encounter"."reference", '/')[2] = "e"."id")
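To make the join logic concrete, the following is a small in-memory sketch in Python of what the query does for the encounter, condition, and patient tables. It assumes references have the form ResourceType/id; the helper names and any records you pass in are hypothetical. Note that Athena arrays are 1-based, so the [2] in the query corresponds to index 1 in Python.

```python
def reference_id(reference):
    """Return the ID part of a FHIR reference such as 'Patient/123'.

    Mirrors split(reference, '/')[2] in the query (1-based in Athena).
    """
    return reference.split("/")[1]

def encounter_conditions(encounters, conditions, patients):
    """Join encounters to their conditions and patients, one row per diagnosis."""
    conditions_by_id = {c["id"]: c for c in conditions}
    patients_by_id = {p["id"]: p for p in patients}
    rows = []
    for enc in encounters:
        patient = patients_by_id.get(reference_id(enc["subject"]["reference"]))
        if patient is None:
            continue  # inner join: drop encounters without a matching patient
        # Equivalent of unnest(diagnosis): one output row per array element.
        for diag in enc.get("diagnosis", []):
            cond = conditions_by_id.get(reference_id(diag["condition"]["reference"]))
            if cond is None:
                continue
            rows.append({
                "family": patient["name"][0]["family"],
                "start": enc["period"]["start"],
                "condition": cond["code"]["coding"][0]["display"],
            })
    return rows
```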

The last but probably most exciting part is where we compare patient data found in structured fields vs. data parsed from text. As described before, the AWS Glue jobs have transformed the DocumentReference and Condition tables so that the modified DocumentReference table can now be queried to retrieve parsed medical entities.

In the following screenshot, we search for all patients that have the word [s]epsis in the condition text. The condition equals field is a filter that lets us select all conditions that match a text string. The results show that 209 patients have a sepsis-related condition in their structured data. However, 288 patients have sepsis-related conditions as parsed from textual notes. The table on the left shows timelines for patients based on structured data, and the table on the right shows timelines for patients based on parsed data.
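The comparison amounts to running the same keyword filter over two sources and counting distinct patients in each. The following is a minimal sketch of that idea; the record layout (patient_id and condition fields) and the sample rows are invented for illustration and do not reproduce the counts shown in the dashboard.

```python
def patients_matching(records, keyword):
    """Return the set of patient IDs whose condition text contains keyword.

    The match is case-insensitive, so 'Sepsis' and 'sepsis' both count.
    """
    needle = keyword.lower()
    return {r["patient_id"] for r in records if needle in r["condition"].lower()}

# Invented rows: conditions from structured fields vs. parsed from notes.
structured = [
    {"patient_id": "p1", "condition": "Sepsis"},
    {"patient_id": "p2", "condition": "Congestive heart failure"},
]
parsed = [
    {"patient_id": "p1", "condition": "severe sepsis"},
    {"patient_id": "p2", "condition": "possible sepsis noted"},
]

in_structured = patients_matching(structured, "sepsis")
in_parsed = patients_matching(parsed, "sepsis")
# Patients whose condition appears only in the textual notes.
print(sorted(in_parsed - in_structured))
```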

Next steps

In this post, we joined the data from multiple FHIR resources to create a holistic view for a patient. We also used Athena to search for a single patient. If the data volume is high, it’s a good idea to create year, month, and day partitions within Amazon S3 and store the NDJSON files in those partitions. This allows the dashboard to be created for a restricted time period, such as the current month or current year, making the dashboard faster and more cost-effective.
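A date-based layout like this can be sketched as a function that builds the S3 key prefix for a given day. The example below assumes Hive-style key=value folder names, which AWS Glue crawlers can recognize as partition columns; the resource folder name and the helper partition_prefix are hypothetical.

```python
from datetime import date

def partition_prefix(resource, day):
    """Build a Hive-style year/month/day key prefix for one FHIR resource."""
    return f"{resource}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"

print(partition_prefix("Encounter", date(2020, 12, 18)))
```

A query restricted to the current month then only needs to scan the objects under the matching year and month prefixes.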

Conclusion

HealthLake creates exciting new possibilities for extracting medical entities from unstructured data and quickly building a dashboard on top of it. The dashboard helps clinicians and health administrators make informed decisions and improve patient care. It also helps researchers improve the performance of their ML models by incorporating medical entities that were hidden in unstructured data. You can start building a dashboard on your raw FHIR data by importing it into Amazon S3, creating AWS Glue crawlers and Data Catalog tables, and creating a QuickSight dashboard!

[1] MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016).

About the Authors

Mithil Shah is an ML/AI Specialist at Amazon Web Services. Currently he helps public sector customers improve the lives of citizens by building machine learning solutions on AWS.

Paul Saxman is a Principal Solutions Architect at AWS, where he helps clinicians, researchers, executives, and staff at academic medical centers adopt and leverage cloud technologies. As a clinical and biomedical informatician, Paul is passionate about accelerating healthcare advancement and innovation by supporting the translation of science into medical practice.
