Aerobotics improves training speed by 24 times per sample with Amazon SageMaker and TensorFlow

Editor’s note: This is a guest post written by Michael Malahe, Head of Data at Aerobotics, a South African startup that builds AI-driven tools for agriculture.

Aerobotics is an agri-tech company operating in 18 countries around the world, based out of Cape Town, South Africa. Our mission is to provide intelligent tools to feed the world. We aim to achieve this by providing farmers with actionable data and insights on our platform, Aeroview, so that they can make the necessary interventions at the right time in the growing season. Our predominant data source is aerial drone imagery: capturing visual and multispectral images of trees and fruit in an orchard.

In this post, we look at how we use Amazon SageMaker and TensorFlow to improve our Tree Insights product, which provides per-tree measurements of important quantities like canopy area and health, as well as the locations of dead and missing trees. Farmers use this information to make precise interventions like fixing irrigation lines, applying fertilizers at variable rates, and ordering replacement trees. The following is an image of the tool that farmers use to understand the health of their trees and make some of these decisions.

To provide this information, we first must accurately assign each foreground pixel to a single unique tree. For this instance segmentation task, it’s important that we’re as accurate as possible, so we use a machine learning (ML) model that’s been effective in large-scale benchmarks. The model is a variant of Mask R-CNN, which pairs a convolutional neural network (CNN) for feature extraction with several additional components for detection, classification, and segmentation. In the following image, we show some typical outputs, where the pixels belonging to a given tree are outlined by a contour.

Glancing at the outputs, you might think that the problem is solved.

The challenge

The main challenge with analyzing and modeling agricultural data is that it’s highly varied across a number of dimensions.

The following image illustrates some extremes of the variation in the size of trees and the extent to which they can be unambiguously separated.

In the grove of pecan trees, we have one of the largest trees in our database, with an area of 654 m² (a little over a minute to walk around at a typical speed). The vines to the right of the grove measure 50 cm across (the size of a typical potted plant). Our models need to be tolerant of these variations to provide accurate segmentations regardless of the scale.

An additional challenge is that the sources of variation aren’t static. Farmers are highly innovative, and best practices can change significantly over time. One example is ultra-high-density planting for apples, where trees are planted as close as a foot apart. Another is the adoption of protective netting, which obscures aerial imagery, as in the following image.

In this domain with broad and shifting variations, we need to maintain accurate models to provide our clients with reliable insights. Models should improve with every challenging new sample we encounter, and we should deploy them with confidence.

In our initial approach to this challenge, we simply trained on all the data we had. As we scaled, however, we quickly got to the point where training on all our data became infeasible, and the cost of doing so became an impediment to experimentation.

The solution

Although we have variation in the edge cases, we recognized that there was a lot of redundancy in our standard cases. Our goal was to get to a point where our models are trained on only the most salient data, and can converge without needing to see every sample. Our approach to achieving this was first to create an environment where it’s simple to experiment with different approaches to dataset construction and sampling. The following diagram shows our overall workflow for the data preprocessing that enables this.

The outcome is that training samples are available as individual files in Amazon Simple Storage Service (Amazon S3), an approach that only makes sense for bulky data like multispectral imagery, with references and rich metadata stored in Amazon Redshift tables. This makes it trivial to construct datasets with a single query, and makes it possible to fetch individual samples with arbitrary access patterns at train time. We use UNLOAD to create an immutable dataset in Amazon S3, and we create a reference to the file in our Amazon Relational Database Service (Amazon RDS) database, which we use for training provenance and evaluation result tracking. See the following code:

UNLOAD ('[subset_query]')
TO '[s3://path/to/dataset]'
IAM_ROLE '[redshift_write_to_s3_role]'
FORMAT PARQUET

The ease of querying the tile metadata allowed us to rapidly create and test subsets of our data, and eventually we were able to train to convergence after seeing only 1.1 million samples of a total 3.1 million. This sub-epoch convergence has been very beneficial in bringing down our compute costs, and we got a better understanding of our data along the way.

The second step we took in reducing our training costs was to optimize our compute. We used the TensorFlow profiler heavily throughout this step:

import tensorflow as tf

options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=2, python_tracer_level=1, device_tracer_level=1
)
tf.profiler.experimental.start("[log_dir]", options=options)
# [train model]
tf.profiler.experimental.stop()

For training, we use Amazon SageMaker with P3 instances provisioned by Amazon Elastic Compute Cloud (Amazon EC2), and initially we found that the NVIDIA Tesla V100 GPUs in the instances were bottlenecked by CPU compute in the input pipeline. The overall pattern for alleviating the bottleneck was to shift as much of the compute from native Python code to TensorFlow operations as possible to ensure efficient thread parallelism. The largest benefit was switching to tf.io for data fetching and deserialization, which improved throughput by 41%. See the following code:

# Read and decompress the serialized example with tf.io, then parse it
serialised_example = tf.io.decode_compressed(
    tf.io.gfile.GFile(fname, 'rb').read(), compression_type='GZIP')
example = tf.train.Example.FromString(serialised_example.numpy())

A bonus feature with this approach was that switching between local files and Amazon S3 storage required no code changes due to the file object abstraction provided by GFile.
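
As an illustration of the overall pattern, the following is a minimal sketch (not our exact pipeline) of a tf.data input that keeps reading, decompression, and parsing inside TensorFlow ops so the work runs with thread parallelism; the bucket path and feature names are hypothetical.

import tensorflow as tf

# Hypothetical feature spec; the real records carry imagery and masks
feature_spec = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'mask': tf.io.FixedLenFeature([], tf.string),
}

def parse_example(serialized):
    # Deserialize a single tf.train.Example entirely inside the TensorFlow graph
    return tf.io.parse_single_example(serialized, feature_spec)

filenames = tf.data.Dataset.list_files('s3://example-bucket/train/*.tfrecord.gz')
dataset = (
    tf.data.TFRecordDataset(filenames, compression_type='GZIP',
                            num_parallel_reads=tf.data.experimental.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(16)
    .prefetch(tf.data.experimental.AUTOTUNE)
)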

We found that the last remaining bottleneck came from the default TensorFlow CPU parallelism settings, which we optimized using a SageMaker hyperparameter tuning job (see the following example config).
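
The following is a hedged sketch of what such a tuning job can look like with the SageMaker Python SDK; the entry point, role, metric definition, and parameter ranges are illustrative assumptions, and the training script is assumed to read these hyperparameters and apply them with tf.config.threading.set_inter_op_parallelism_threads and set_intra_op_parallelism_threads.

from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# Hypothetical estimator; entry_point and role are placeholders
estimator = TensorFlow(
    entry_point='train.py',
    role='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.4',
    py_version='py37',
)

# Tune the CPU parallelism settings to maximize training throughput
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='samples_per_second',
    objective_type='Maximize',
    metric_definitions=[{'Name': 'samples_per_second',
                         'Regex': 'samples/sec: ([0-9\\.]+)'}],
    hyperparameter_ranges={
        'inter_op_parallelism_threads': IntegerParameter(1, 8),
        'intra_op_parallelism_threads': IntegerParameter(1, 8),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({'train': 's3://example-bucket/train/'})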

With the CPU bottleneck removed, we moved to GPU optimization, and made the most of the V100’s Tensor Cores by using mixed precision training:

from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

The mixed precision guide is a solid reference, but the change to using mixed precision still requires some close attention to ensure that the operations happening in half precision are not ill-conditioned or prone to underflow. Some specific cases that were critical were terminal activations and custom regularization terms. See the following code:

import tensorflow as tf
...
# Keep the terminal activation in float32 so it is not computed in half precision
y_pred = tf.keras.layers.Activation('sigmoid', dtype=tf.float32)(x)
# y_pred is already a probability here, so the loss is computed with from_logits=False
loss = binary_crossentropy(tf.cast(y_true, tf.float32), tf.cast(y_pred, tf.float32), from_logits=False)
...
# Evaluate custom regularization terms in float32 as well
model.add_loss(lambda: tf.keras.regularizers.l2()(tf.cast(w, tf.float32)))

After implementing this, we measured the following benchmark results for a single V100.

Precision | CPU parallelism | Batch size | Samples per second
single    | default         | 8          | 9.8
mixed     | default         | 16         | 19.3
mixed     | optimized       | 16         | 22.4

The impact of switching to mixed precision was that training speed roughly doubled, and the impact of using the optimal CPU parallelism settings discovered by SageMaker was an additional 16% increase.

Implementing these initiatives as we grew resulted in reducing the cost of training a model from $122 to $68, while our dataset grew from 228 thousand samples to 3.1 million, amounting to a 24 times reduction in cost per sample.

Conclusion

This reduction in training time and cost has meant that we can quickly and cheaply adapt to changes in our data distribution. We often encounter new cases that are confounding even for humans, such as the following image.

However, they quickly become standard cases for our models, as shown in the following image.

We aim to continue making training faster by using more devices, and making it more efficient by leveraging SageMaker managed Spot Instances. We also aim to make the training loop tighter by serving SageMaker models that are capable of online learning, so that improved models are available in near-real time. With these in place, we should be well equipped to handle all the variation that agriculture can throw at us. To learn more about Amazon SageMaker, visit the product page.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Author

Michael Malahe is the Head of Data at Aerobotics, a South African startup that builds AI-driven tools for agriculture.

Read More

AWS ML Community showcase: March 2021 edition

In our Community Showcase, Amazon Web Services (AWS) highlights projects created by AWS Heroes and AWS Community Builders. 

Each month, AWS ML Heroes and AWS ML Community Builders bring to life projects and use cases for the full range of machine learning skills, from beginner to expert, through deep dive tutorials, podcasts, videos, and other content that shows how to use AWS Machine Learning (ML) solutions such as Amazon SageMaker, pretrained AI services such as Amazon Rekognition, and AI learning devices such as AWS DeepRacer.

The AWS ML community is a vibrant group of developers, data scientists, researchers, and business decision-makers that dive deep into artificial intelligence and ML concepts, contribute with real-world experiences, and collaborate on building projects together.

Here are a few highlights of externally published getting started guides and tutorials curated by our AWS ML Evangelist team led by Julien Simon.

AWS ML Heroes and AWS ML Community Builder Projects

Making My Toddler’s Dream of Flying Come True with AI Tech (with code samples). In this deep dive tutorial, AWS ML Hero Agustinus Nalwan walks you through how to build an object detection model with Amazon SageMaker JumpStart (a set of solutions for the most common use cases that can be deployed readily with just a few clicks), Torch2trt (a tool to automatically convert PyTorch models into TensorRT), and NVIDIA Jetson AGX Xavier.

How to use Amazon Rekognition Custom Labels to analyze AWS DeepRacer Real World Performance Through Video (with code samples). In this deep dive tutorial, AWS ML Community Builder Pui Kwan Ho shows you how to analyze the path and speed of an AWS DeepRacer device using pretrained computer vision with Amazon Rekognition Custom Labels.

AWS Panorama Appliance Developers Kit: An Unboxing and Walkthrough (with code samples). In this video, AWS ML Hero Mike Chambers shows you how to get started with AWS Panorama, an ML appliance and software development kit (SDK) that allows developers to bring computer vision to on-premises cameras and make predictions locally with high accuracy and low latency.

Improving local food processing with Amazon Lookout for Vision (with code samples). In this deep dive tutorial, AWS ML Hero Olalekan Elesin demonstrates how to use AI to improve the quality of food sorting (using cassava flakes) cost-effectively and with zero AI knowledge.

Conclusion

Whether you’re just getting started with ML, already an expert, or something in between, there is always something to learn. Choose from community-created and ML-focused blogs, videos, eLearning guides, and much more from the AWS ML community.

Are you interested in contributing to the community? Apply to the AWS Community Builders program today!

 

The content and opinions in the preceding linked posts are those of the third-party authors and AWS is not responsible for the content or accuracy of those posts.


About the Author

Cameron Peron is Senior Marketing Manager for AWS Amazon Rekognition and the AWS AI/ML community. He evangelizes how AI/ML innovation solves complex challenges facing community, enterprise, and startups alike. Out of the office, he enjoys staying active with kettlebell-sport, spending time with his family and friends, and is an avid fan of Euro-league basketball.

Read More

Configure Amazon Forecast for a multi-tenant SaaS application

Amazon Forecast is a fully managed service that is based on the same technology used for forecasting at Amazon.com. Forecast uses machine learning (ML) to combine time series data with additional variables to build highly accurate forecasts. Forecast requires no ML experience to get started. You only need to provide historical data and any additional data that may impact forecasts.

Customers are turning toward a software as a service (SaaS) model for the delivery of multi-tenant solutions. You can build SaaS applications with a variety of different architectural models to meet regulatory and compliance requirements. Depending on the SaaS model, resources like Forecast are shared across tenants. Forecast data access, monitoring, and billing need to be considered per tenant when deploying SaaS solutions.

This post outlines how to use Forecast within a multi-tenant SaaS application using Attribute Based Access Control (ABAC) in AWS Identity and Access Management (IAM) to provide these capabilities. ABAC is a powerful approach that you can use to isolate resources across tenants.

In this post, we provide guidance on setting up IAM policies for tenants using ABAC principles and Forecast. To demonstrate the configuration, we set up two tenants, TenantA and TenantB, and show a use case in the context of a SaaS application using Forecast. In our use case, TenantB can’t delete TenantA resources, and vice versa. The following diagram illustrates our architecture.

TenantA and TenantB have services running as microservices within Amazon Elastic Kubernetes Service (Amazon EKS). The tenant application uses Forecast as part of its business flow.

Forecast data ingestion

Forecast imports data from the tenant’s Amazon Simple Storage Service (Amazon S3) bucket to the Forecast managed S3 bucket. Data can be encrypted in transit and at rest automatically using Forecast managed keys or tenant-specific keys through AWS Key Management Service (AWS KMS). The tenant-specific key can be created by the SaaS application as part of onboarding, or the tenant can provide their own customer managed key (CMK) using AWS KMS. Revoking permission on the tenant-specific key prevents Forecast from using the tenant’s data. We recommend using a tenant-specific key and an IAM role per tenant in a multi-tenant SaaS environment. This enables securing data on a tenant-by-tenant basis.

Solution overview

You can partition data on Amazon S3 to segregate tenant access in different ways. For this post, we discuss two strategies:

  • Use one S3 bucket per tenant
  • Use a single S3 bucket and separate tenant data with a prefix

For more information about various strategies, see the Storing Multi-Tenant Data on Amazon S3 GitHub repo.

When using one bucket per tenant, you use an IAM policy to restrict access to a given tenant S3 bucket. For example:

s3://tenant_a    [ Tag tenant = tenant_a ]
s3://tenant_b     [ Tag tenant = tenant_b ]

There is a quota on the number of S3 buckets per account, so a multi-account strategy needs to be considered to scale beyond this limit.

In our second option, tenant data is separated using an S3 prefix in a single S3 bucket. We use an IAM policy to restrict access within a bucket prefix per tenant. For example:

s3://<bucketname>/tenant_a

For this post, we use the second option of assigning S3 prefixes within a single bucket. We encrypt tenant data using CMKs in AWS KMS.

Tenant onboarding

SaaS applications rely on a frictionless model for introducing new tenants into their environment. This often requires orchestrating several components to successfully provision and configure all the elements needed to create a new tenant. This process, in SaaS architecture, is referred to as tenant onboarding. This can be initiated directly by tenants or as part of a provider-managed process. The following diagram illustrates the flow of configuring Forecast per tenant as part of onboarding process.

Resources are tagged with tenant information. For this post, we tag resources with a value for tenant, for example,  tenant_a.

Create a Forecast role

This IAM role is assumed by Forecast per tenant. You should apply the following policy to allow Forecast to interact with Amazon S3 and AWS KMS in the customer account. The role is tagged with the tag tenant. For example, see the following code:

TenantA create role Forecast_TenantA_Role  [ Tag tenant = tenant_a]
TenantB create role Forecast_TenantB_Role [ Tag tenant = tenant_b]
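
For illustration, the following is a minimal boto3 sketch of creating one of these roles and tagging it with the tenant value; the trust policy allows Forecast to assume the role, and the role name matches the example above.

import json
import boto3

iam = boto3.client('iam')

# Allow the Forecast service to assume this per-tenant role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "forecast.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='Forecast_TenantA_Role',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Tags=[{'Key': 'tenant', 'Value': 'tenant_a'}]
)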

Create the policies

In this next step, we create policies for our Forecast role. For this post, we split them into two policies for more readability, but you can create them according to your needs.

Policy 1: Forecast read-only access

The following policy gives privileges to describe, list, and query Forecast resources. This policy restricts Forecast to read-only access. The tenant tag validation condition in the following code makes sure that the tenant tag value matches the principal’s tenant tag.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DescribeQuery",
            "Effect": "Allow",
            "Action": [
                "forecast:GetAccuracyMetrics",
                "forecast:ListTagsForResource",
                "forecast:DescribeDataset",
                "forecast:DescribeForecast",
                "forecast:DescribePredictor",
                "forecast:DescribeDatasetImportJob",
                "forecast:DescribePredictorBacktestExportJob",
                "forecast:DescribeDatasetGroup",
                "forecast:DescribeForecastExportJob",
                "forecast:QueryForecast"
            ],
            "Resource": [
                "arn:aws:forecast:*:<accountid>:dataset-import-job/*",
                "arn:aws:forecast:*:<accountid>:dataset-group/*",
                "arn:aws:forecast:*:<accountid>:predictor/*",
                "arn:aws:forecast:*:<accountid>:forecast/*",
                "arn:aws:forecast:*:<accountid>:forecast-export-job/*",
                "arn:aws:forecast:*:<accountid>:dataset/*",
                "arn:aws:forecast:*:<accountid>:predictor-backtest-export-job/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/tenant":"${aws:PrincipalTag/tenant}"
                }
            }
        },
        {
            "Sid": "List",
            "Effect": "Allow",
            "Action": [
                "forecast:ListDatasetImportJobs",
                "forecast:ListDatasetGroups",
                "forecast:ListPredictorBacktestExportJobs",
                "forecast:ListForecastExportJobs",
                "forecast:ListForecasts",
                "forecast:ListPredictors",
                "forecast:ListDatasets"
            ],
            "Resource": "*"
        }
    ]
}

Policy 2: Amazon S3 and AWS KMS access policy

The following policy gives privileges to AWS KMS and access to the S3 tenant prefix. The tenant tag validation condition in the following code makes sure that the tenant tag value matches the principal’s tenant tag.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "KMS",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:RevokeGrant",
                "kms:GenerateDataKey",
                "kms:DescribeKey",
                "kms:RetireGrant",
                "kms:CreateGrant",
                "kms:ListGrants"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/tenant":"${aws:PrincipalTag/tenant}"
                }
            }
        },
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject", 
                "s3:PutObject",
                "s3:GetObjectVersionTagging",
                "s3:GetObjectAcl",
                "s3:GetObjectVersionAcl",
                "s3:GetBucketPolicyStatus",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListAccessPoints",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::<bucketname>/*",
                "arn:aws:s3:::<bucketname>"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "${aws:PrincipalTag/tenant}",
                        "${aws:PrincipalTag/tenant}/*"
                    ]
                }
            }
        }
    ]
}

Create a tenant-specific key

We now create a tenant-specific key in AWS KMS per tenant and tag it with the tenant tag value. Alternatively, the tenant can bring their own key to AWS KMS. We give the preceding roles (Forecast_TenantA_Role or Forecast_TenantB_Role) access to the tenant-specific key.

For example, the following screenshot shows the key-value pair of tenant and tenant_a.

The following screenshot shows the IAM roles that can use this key.
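
If you script this step instead of using the console, the following is a minimal boto3 sketch that creates and tags the key; the alias name is an illustrative assumption, and you would still grant the preceding Forecast roles access through the key policy or grants.

import boto3

kms = boto3.client('kms')

# Create a CMK for tenant_a and tag it so the ABAC conditions can match it
key = kms.create_key(
    Description='CMK for tenant_a Forecast data',
    Tags=[{'TagKey': 'tenant', 'TagValue': 'tenant_a'}]
)

kms.create_alias(
    AliasName='alias/tenant_a-forecast',
    TargetKeyId=key['KeyMetadata']['KeyId']
)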

Create an application role

The second role we create is assumed by the SaaS application per tenant. You should apply the following policy to allow the application to interact with Forecast, Amazon S3, and AWS KMS. The role is tagged with the tag tenant. See the following code:

TenantA create role TenantA_Application_Role  [ Tag tenant = tenant_a]
TenantB create role TenantB_Application_Role  [ Tag tenant = tenant_b]

Create the policies

We now create policies for the application role. For this post, we split them into two policies for more readability, but you can create them according to your needs.

Policy 1: Forecast access

The following policy gives privileges to create, update, and delete Forecast resources. The policy enforces the tag requirement during creation. In addition, it restricts the list, describe, and delete actions on resources to the respective tenant. This policy also grants iam:PassRole so that the application can pass the per-tenant Forecast role to the service.

The tenant tag validation condition in the following code makes sure that the tenant tag value matches the principal’s tenant tag.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CreateDataSet",
            "Effect": "Allow",
            "Action": [
                "forecast:CreateDataset",
                "forecast:CreateDatasetGroup",
                "forecast:TagResource"
            ],
            "Resource": [
                "arn:aws:forecast:*:<accountid>:dataset-import-job/*",
                "arn:aws:forecast:*:<accountid>:dataset-group/*",
                "arn:aws:forecast:*:<accountid>:predictor/*",
                "arn:aws:forecast:*:<accountid>:forecast/*",
                "arn:aws:forecast:*:<accountid>:forecast-export-job/*",
                "arn:aws:forecast:*:<accountid>:dataset/*",
                "arn:aws:forecast:*:<accountid>:predictor-backtest-export-job/*"
            ],
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:TagKeys": [ "tenant" ]
                },
                "StringEquals": {
                    "aws:RequestTag/tenant": "${aws:PrincipalTag/tenant}"
                }
            }
        },
        {
            "Sid": "CreateUpdateDescribeQueryDelete",
            "Effect": "Allow",
            "Action": [
                "forecast:CreateDatasetImportJob",
                "forecast:CreatePredictor",
                "forecast:CreateForecast",
                "forecast:CreateForecastExportJob",
                "forecast:CreatePredictorBacktestExportJob",
                "forecast:GetAccuracyMetrics",
                "forecast:ListTagsForResource",
                "forecast:UpdateDatasetGroup",
                "forecast:DescribeDataset",
                "forecast:DescribeForecast",
                "forecast:DescribePredictor",
                "forecast:DescribeDatasetImportJob",
                "forecast:DescribePredictorBacktestExportJob",
                "forecast:DescribeDatasetGroup",
                "forecast:DescribeForecastExportJob",
                "forecast:QueryForecast",
                "forecast:DeletePredictorBacktestExportJob",
                "forecast:DeleteDatasetImportJob",
                "forecast:DeletePredictor",
                "forecast:DeleteDataset",
                "forecast:DeleteDatasetGroup",
                "forecast:DeleteForecastExportJob",
                "forecast:DeleteForecast"
            ],
            "Resource": [
                "arn:aws:forecast:*:<accountid>:dataset-import-job/*",
                "arn:aws:forecast:*:<accountid>:dataset-group/*",
                "arn:aws:forecast:*:<accountid>:predictor/*",
                "arn:aws:forecast:*:<accountid>:forecast/*",
                "arn:aws:forecast:*:<accountid>:forecast-export-job/*",
                "arn:aws:forecast:*:<accountid>:dataset/*",
                "arn:aws:forecast:*:<accountid>:predictor-backtest-export-job/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/tenant": "${aws:PrincipalTag/tenant}"
                }
            }
        },
        {
            "Sid": "IAMPassRole",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PassRole"
            ],
            "Resource": "--Provide Resource ARN--"
        },
        {
            "Sid": "ListAccess",
            "Effect": "Allow",
            "Action": [
                "forecast:ListDatasetImportJobs",
                "forecast:ListDatasetGroups",
                "forecast:ListPredictorBacktestExportJobs",
                "forecast:ListForecastExportJobs",
                "forecast:ListForecasts",
                "forecast:ListPredictors",
                "forecast:ListDatasets"
            ],
            "Resource": "*"
        }
    ]
}

Policy 2: Amazon S3, AWS KMS, Amazon CloudWatch, and resource group access

The following policy gives privileges to access Amazon S3 and AWS KMS resources, and also Amazon CloudWatch. It limits access to the tenant-specific S3 prefix and tenant-specific CMK. The tenant tag validation condition in the following code restricts access to resources tagged with the principal’s tenant value.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3Storage",
            "Effect": "Allow",
            "Action": [
                "s3:*" ---> To be modified based on application needs
            ],
            "Resource": [
                "arn:aws:s3:::<bucketname>",
                "arn:aws:s3:::<bucketname>/*"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "${aws:PrincipalTag/tenant}",
                        "${aws:PrincipalTag/tenant}/*"
                    ]
                }
            }
        },
        {
            "Sid": "ResourceGroup",
            "Effect": "Allow",
            "Action": [
                "resource-groups:SearchResources",
                "tag:GetResources",
                "tag:GetTagKeys",
                "tag:GetTagValues",
                "resource-explorer:List*",
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        },
        {
            "Sid": "KMS",
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:CreateGrant",
                "kms:RevokeGrant",
                "kms:RetireGrant",
                "kms:ListGrants",
                "kms:DescribeKey",
                "kms:GenerateDataKey"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/tenant": "${aws:PrincipalTag/tenant}"
                }
            }
        }
    ]
}

Create a resource group

The resource group allows all tagged resources to be queried by the tenant. The following example code uses the AWS Command Line Interface (AWS CLI) to create a resource group for TenantA:

aws resource-groups create-group \
    --name TenantA \
    --tags tenant=tenant_a \
    --resource-query '{"Type":"TAG_FILTERS_1_0","Query":"{\"ResourceTypeFilters\":[\"AWS::AllSupported\"],\"TagFilters\":[{\"Key\":\"tenant\",\"Values\":[\"tenant_a\"]}]}"}'

Forecast application flow

The following diagram illustrates our Forecast application flow. The application service assumes the IAM role for the tenant and invokes the Forecast API as part of its business flow.

Create a predictor for TenantB

Resources created should be tagged with the tenant tag. The following code uses the Python (Boto3) API to create a predictor for TenantB:

# Run under TenantB role TenantB_Application_Role
session = boto3.Session()
forecast = session.client(service_name='forecast')
...
response = forecast.create_dataset(
    Domain="CUSTOM",
    DatasetType='TARGET_TIME_SERIES',
    DatasetName=datasetName,
    DataFrequency=DATASET_FREQUENCY,
    Schema=schema,
    Tags=[{'Key': 'tenant', 'Value': 'tenant_b'}],
    EncryptionConfig={'KMSKeyArn': 'KMS_TenantB_ARN', 'RoleArn': Forecast_TenantB_Role}
)
...
create_predictor_response = forecast.create_predictor(
    ...
    EncryptionConfig={'KMSKeyArn': 'KMS_TenantB_ARN', 'RoleArn': Forecast_TenantB_Role},
    Tags=[{'Key': 'tenant', 'Value': 'tenant_b'}],
    ...
)
predictor_arn = create_predictor_response['PredictorArn']

Create a forecast on the predictor for TenantB

The following code uses the Python (Boto3) API to create a forecast on the predictor you just created:

# Run under TenantB role TenantB_Application_Role
session = boto3.Session()
forecast = session.client(service_name='forecast')
...
create_forecast_response = forecast.create_forecast(
    ForecastName=forecastName,
    PredictorArn=predictor_arn,
    Tags=[{'Key': 'tenant', 'Value': 'tenant_b'}]
)
tenant_b_forecast_arn = create_forecast_response['ForecastArn']

Validate access to Forecast resources

In this section, we confirm that only the respective tenant can access its Forecast resources. Accessing, modifying, or deleting Forecast resources belonging to a different tenant throws an error. The following code uses the Python (Boto3) API to demonstrate TenantA attempting to delete a TenantB Forecast resource:

# Run under TenantA role TenantA_Application_Role
session = boto3.Session()
forecast = session.client(service_name='forecast')
...
forecast.delete_forecast(ForecastArn=tenant_b_forecast_arn)

ClientError: An error occurred (AccessDeniedException) when calling the DeleteForecast operation: User: arn:aws:sts::<accountid>:assumed-role/TenantA_Application_Role/tenant-a-role is not authorized to perform: forecast:DeleteForecast on resource: arn:aws:forecast:<region>:<accountid>:forecast/tenantb_deeparp_algo_forecast

List and monitor predictors

The following example code uses the Python (Boto3) API to query Forecast predictors for TenantA using resource groups:

# Run under TenantA role TenantA_Application_Role
session = boto3.Session()
resourcegroup = session.client(service_name='resource-groups')

# The tenant tag needs to be specified in the query
query = '{"ResourceTypeFilters":["AWS::Forecast::Predictor"],"TagFilters":[{"Key":"tenant","Values":["tenant_a"]}]}'

response = resourcegroup.search_resources(
    ResourceQuery={
        'Type': 'TAG_FILTERS_1_0',
        'Query': query
    },
    MaxResults=20
)

predictor_count=0
for resource in response['ResourceIdentifiers']:
    print(resource['ResourceArn'])
    predictor_count=predictor_count+1

As the AWS Well-Architected Framework explains, it’s important to monitor service quotas (which are also referred to as service limits). Forecast has limits per account; for more information, see Guidelines and Quotas.

The following code is an example of populating a CloudWatch metric with the total number of predictors:

cloudwatch = session.client(service_name='cloudwatch')
cwresponse = cloudwatch.put_metric_data(
    Namespace='TenantA_PredictorCount',
    MetricData=[{
        'MetricName': 'TotalPredictors',
        'Value': predictor_count
    }]
)
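
Building on this, you could alarm when the tenant’s predictor count approaches the quota. The following is a minimal sketch; the threshold and SNS topic ARN are illustrative assumptions.

import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='TenantA_PredictorCount_NearQuota',
    Namespace='TenantA_PredictorCount',
    MetricName='TotalPredictors',
    Statistic='Maximum',
    Period=3600,
    EvaluationPeriods=1,
    Threshold=80,  # for example, 80% of an assumed quota of 100 predictors
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:quota-alerts']
)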

Other considerations

Resource limits and throttling need to be managed by the application across tenants. If you can’t accommodate the Forecast limits, you should consider a multi-account configuration.

The Forecast List API responses and resource group results need to be filtered by the application based on the tenant tag value.
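
Because the List APIs themselves aren’t tag-filtered, one straightforward approach is to check each resource’s tags in the application, as in the following sketch (pagination via NextToken is omitted for brevity).

import boto3

forecast = boto3.client('forecast')

def list_tenant_predictors(tenant_value):
    # Keep only the predictors tagged with this tenant's value
    tenant_predictors = []
    response = forecast.list_predictors()
    for predictor in response['Predictors']:
        tags = forecast.list_tags_for_resource(
            ResourceArn=predictor['PredictorArn'])['Tags']
        if {'Key': 'tenant', 'Value': tenant_value} in tags:
            tenant_predictors.append(predictor['PredictorArn'])
    return tenant_predictors

print(list_tenant_predictors('tenant_a'))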

Conclusion

In this post, we demonstrated how to isolate Forecast access using the ABAC technique in a multi-tenant SaaS application. We showed how to limit access to Forecast by tenant using the tenant tag. You can further customize policies by applying more tags, or apply this strategy to other AWS services.

For more information about using ABAC as an authorization strategy, see What is ABAC for AWS?


About the Authors

Gunjan Garg is a Sr. Software Development Engineer in the AWS Vertical AI team. In her current role at Amazon Forecast, she focuses on engineering problems and enjoys building scalable systems that provide the most value to end users. In her free time, she enjoys playing Sudoku and Minesweeper.

Matias Battaglia is a Technical Account Manager at Amazon Web Services. In his current role, he enjoys helping customers at all stages of their cloud journey. In his free time, he enjoys building AI/ML projects.

Rakesh Ramadas is an ISV Solution Architect at Amazon Web Services. His focus areas include AI/ML and Big Data.

 

Read More

Introducing Amazon Lookout for Metrics: An anomaly detection service to proactively monitor the health of your business

Anomalies are unexpected changes in data, which could point to a critical issue. An anomaly could be a technical glitch on your website, or an untapped business opportunity. It could be a new marketing channel with exceedingly high customer conversions. As businesses produce more data than ever before, detecting these unexpected changes and responding in a timely manner is essential, yet challenging. Delayed responses cost businesses millions of dollars, missed opportunities, and the risk of losing the trust of their customers.

We’re excited to announce the general availability of Amazon Lookout for Metrics, a new service that uses machine learning (ML) to automatically monitor the metrics that are most important to businesses with greater speed and accuracy. The service also makes it easier to diagnose the root cause of anomalies like unexpected dips in revenue, high rates of abandoned shopping carts, spikes in payment transaction failures, increases in new user sign-ups, and many more. Lookout for Metrics goes beyond simple anomaly detection. It allows developers to set up autonomous monitoring for important metrics to detect anomalies and identify their root cause in a matter of few clicks, using the same technology used by Amazon internally to detect anomalies in its metrics—all with no ML experience required.

You can connect Lookout for Metrics to 19 popular data sources, including Amazon Simple Storage Service (Amazon S3), Amazon CloudWatch, Amazon Relational Database Service (Amazon RDS), and Amazon Redshift, as well as software as a service (SaaS) applications like Salesforce, Marketo, and Zendesk, to continuously monitor metrics important to your business. Lookout for Metrics automatically inspects and prepares the data, uses ML to detect anomalies, groups related anomalies together, and summarizes potential root causes. The service also ranks anomalies by severity so you can prioritize which issue to tackle first.

Lookout for Metrics easily connects to notification and event services like Amazon Simple Notification Service (Amazon SNS), Slack, PagerDuty, and AWS Lambda, allowing you to create customized alerts or actions like filing a trouble ticket or removing an incorrectly priced product from a retail website. As the service begins returning results, you can also provide feedback on the relevancy of detected anomalies via the Lookout for Metrics console or the API, and the service uses this input to continuously improve its accuracy over time.

Digitata, a telecommunication analytics provider, intelligently transforms pricing and subscriber engagement for mobile network operators (MNOs), empowering them to make better and more informed business decisions. One of Digitata’s MNO customers had made an erroneous update to their pricing platform, which led to them charging their end customers the maximum possible price for their internet data bundles. Lookout for Metrics immediately identified that this update had led to a drop of over 16% in their active purchases and notified the customer within minutes of the said incident using Amazon SNS. The customer was also able to attribute the drop to the latest updates to the pricing platform using Lookout for Metrics. With a clear and immediate remediation path, the customer was able to deploy a fix within 2 hours of getting notified. Without Lookout for Metrics, it would have taken Digitata approximately a day to identify and triage the issue, and would have led to a 7.5% drop in customer revenue, in addition to the risk of losing the trust of their end customers.

Solution overview

This post demonstrates how you can set up anomaly detection on a sample ecommerce dataset using Lookout for Metrics. The solution allows you to download relevant datasets, set up continuous anomaly detection, and optionally set up alerts to receive notifications in case anomalies occur.

Our sample dataset is designed to detect abnormal changes in revenue and views for the ecommerce website across major supported platforms like pc_web, mobile_web, and mobile_app and marketplaces like US, UK, DE, FR, ES, IT, and JP.

The following diagram shows the architecture of our continuous detection system.

Building this system requires three simple steps:

  1. Create an S3 bucket and upload your sample dataset.
  2. Create a detector for Lookout for Metrics.
  3. Add a dataset and activate the detector to start learning and continuous detection.

Then you can review and analyze the results.

If you’re familiar with Python and Jupyter, you can get started immediately by following along with the GitHub repo; this post walks you through getting started with the service. After you set up the detection system, you can optionally define alerts that notify you when anomalies are found that meet or exceed a specified severity threshold.
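
If you prefer to script the same three steps with the AWS SDK rather than the console, the following is a hedged boto3 sketch; the detector name, bucket paths, role ARN, and column names are illustrative assumptions based on the sample dataset.

import boto3

lookoutmetrics = boto3.client('lookoutmetrics')

# Step 2: create a detector that runs at 1 hour intervals
detector = lookoutmetrics.create_anomaly_detector(
    AnomalyDetectorName='ecommerce-continuous-detector',
    AnomalyDetectorConfig={'AnomalyDetectorFrequency': 'PT1H'}
)
detector_arn = detector['AnomalyDetectorArn']

# Step 3: add the dataset (metric set) backed by the S3 bucket from step 1
lookoutmetrics.create_metric_set(
    AnomalyDetectorArn=detector_arn,
    MetricSetName='ecommerce-metrics',
    MetricSetFrequency='PT1H',
    MetricList=[
        {'MetricName': 'views', 'AggregationFunction': 'SUM'},
        {'MetricName': 'revenue', 'AggregationFunction': 'SUM'},
    ],
    DimensionList=['platform', 'marketplace'],
    TimestampColumn={'ColumnName': 'timestamp', 'ColumnFormat': 'yyyy-MM-dd HH:mm:ss'},
    MetricSource={
        'S3SourceConfig': {
            'RoleArn': 'arn:aws:iam::123456789012:role/LookoutMetricsS3Role',
            'TemplatedPathList': ['s3://example-bucket/ecommerce/live/{{yyyyMMdd}}/{{HHmm}}'],
            'HistoricalDataPathList': ['s3://example-bucket/ecommerce/backtest/'],
            'FileFormatDescriptor': {
                'CsvFormatDescriptor': {'ContainsHeader': True, 'Delimiter': ','}
            }
        }
    }
)

# Activate the detector to start learning and continuous detection
lookoutmetrics.activate_anomaly_detector(AnomalyDetectorArn=detector_arn)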

Create an S3 bucket and upload your sample dataset

Download the sample dataset and save it locally. Then continue through the following steps:

  1. Create an S3 bucket.

This bucket needs to be unique and in the same Region where you’re using Lookout for Metrics. For this post, we use the bucket 059124553121-lookoutmetrics-lab.

  2. After you create the bucket, extract the demo dataset on your local machine.

You should have a folder named ecommerce.

  3. On the Amazon S3 console, open the bucket you created.

  4. Choose Upload.

  5. Upload the ecommerce folder.

It takes a few moments to process the files.

  6. When the files are processed, choose Upload.

Do not navigate away from this page while the upload is still processing. You can move to the next step when your dataset is ready.

Alternatively, you can use the AWS Command Line Interface (AWS CLI) to upload the file in just a few minutes using the following command:

!aws s3 sync {data_dirname}/ecommerce/ s3://{s3_bucket}/ecommerce/ --quiet

Create a detector for Lookout for Metrics

To create your detector, complete the following steps:

  1. On the Lookout for Metrics console, choose Create detector.

  2. For Name, enter a detector name.
  3. For Description, enter a description.
  4. For Interval, choose 1 hour intervals.
  5. Optionally, you can modify encryption settings.
  6. Choose Create.

Add a dataset and activate the detector

You now configure your dataset and metrics.

  1. Choose Add a dataset.

  2. For Name, enter a name for your dataset.
  3. For Description, enter a description.
  4. For Timezone, choose the UTC timezone.
  5. For Datasource, choose Amazon S3.

We use Amazon S3 as our data source in this post, but Lookout for Metrics can connect to 19 popular data sources, including CloudWatch, Amazon RDS, and Amazon Redshift, as well as SaaS applications like Salesforce, Marketo, and Zendesk.

You also have an offset parameter, for when you have data that can take a while to arrive and you want Lookout for Metrics to wait until your data has arrived to start reading. This is helpful for long-running jobs that may feed into Amazon S3.

  6. For Detector mode, select either Backtest or Continuous.

Backtesting allows you to detect anomalies on historical data. This feature is helpful when you want to try out the service on past data or validate against known anomalies that occurred in the past. For this post, we use continuous mode, where you can detect anomalies on live data continuously, as they occur.

  7. For Path to an object in your continuous dataset, enter the value for S3_Live_Data_URI.
  8. For Continuous data S3 path pattern, choose the S3 path with your preferred time format (for this post, we choose the path with {{yyyyMMdd}}/{{/HHmm}}), which is the third option in the drop-down.

The options on the drop-down menu adapt to your data.

  9. For Datasource interval, choose your preferred time interval (for this post, we choose 1 hour intervals).

For continuous mode, you have the option to provide your historical data, if you have any, for the system to proactively learn from. If you don’t have historical data, you can get started with real-time or live data, and Lookout for Metrics learns on the go. For this post, we have historical data for learning, so we select this option in step 10.

  10. For Historical data, select Use historical data.
  11. For S3 location of historical data, enter the Amazon S3 URI for historical data that you collected earlier.

You need to provide the ARN of an AWS Identity and Access Management (IAM) role to allow Lookout for Metrics to read from your S3 bucket. You can use an existing role or create a new one. For this post, we use an existing role.

  12. For Service role, choose Enter the ARN of a role.
  13. Enter the role ARN.
  14. Choose Next.

The service now validates your data.

  15. Choose OK.

On the Map files page, you specify which fields you want to run anomaly detection on. Measures are the key performance indicators on which we want to detect anomalies, and dimensions are the categorical information about the measures. You may want to monitor your data for anomalies in number of views or revenue for every platform, marketplace, and combination of both. You can designate up to five measures and five dimensions per dataset.

  16. For Measures, choose views and revenue.
  17. For Dimensions, choose platform and marketplace.

Lookout for Metrics analyzes each combination of these measures and dimensions. For our example, we have seven unique marketplace values (US, UK, DE, FR, ES, IT, and JP) and three unique platform values (pc_web, mobile_web, and mobile_app), for a total of 21 unique combinations. Each unique combination of measures and dimension values is a metric. In this case, you have 21 unique dimension value combinations and two measures, for a total of 42 time series metrics. Lookout for Metrics detects anomalies at the most granular level so you can pinpoint any unexpected behavior in your data.

  18. For Timestamp, choose your timestamp formatting (for this post, we use the default 24-hour format in Python’s Pandas package, yyyy-MM-dd HH:mm:ss).

  19. Choose Next.

  20. Review your setup and choose Save and activate.

You’re redirected to the detector page, where you can see that the job has started. This process takes 20–25 minutes to complete.

From here, you can optionally define alerts. Lookout for Metrics can automatically send you alerts through channels such as Amazon SNS, Datadog, PagerDuty, webhooks, and Slack, or trigger custom actions using Lambda, reducing your time to resolution.
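
If you want to configure an alert programmatically, the following is a minimal boto3 sketch; the detector ARN, role ARN, SNS topic, and severity threshold of 70 are illustrative assumptions.

import boto3

lookoutmetrics = boto3.client('lookoutmetrics')

lookoutmetrics.create_alert(
    AlertName='ecommerce-revenue-alert',
    AlertSensitivityThreshold=70,  # notify only for anomalies with severity >= 70
    AnomalyDetectorArn='arn:aws:lookoutmetrics:us-east-1:123456789012:AnomalyDetector:ecommerce-continuous-detector',
    Action={
        'SNSConfiguration': {
            'RoleArn': 'arn:aws:iam::123456789012:role/LookoutMetricsSNSRole',
            'SnsTopicArn': 'arn:aws:sns:us-east-1:123456789012:anomaly-alerts'
        }
    }
)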

Congratulations, your first detector is up and running.

Review and analyze the results

When detecting an anomaly, Lookout for Metrics helps you focus on what matters most by assigning a severity score to aid prioritization. To help you find the root cause, it intelligently groups anomalies that may be related to the same incident and summarizes the different sources of impact. In the following screenshot, the anomaly in revenue on March 8 at 22:00 GMT had a severity score of 97, indicating a high severity anomaly that needs immediate attention. The impact analysis also tells you that the Mobile_web platform in Germany (DE) saw the highest impact.

Lookout for Metrics also allows you to provide real-time feedback on the relevance of the detected anomalies, enabling a powerful human-in-the-loop mechanism. This information is fed back to the anomaly detection model to improve its accuracy continuously, in near-real time.
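
The following is a minimal sketch of submitting that feedback through the API; the detector ARN, anomaly group ID, and time series ID are placeholders you would obtain from the console or from the ListAnomalyGroupSummaries and ListAnomalyGroupTimeSeries APIs.

import boto3

lookoutmetrics = boto3.client('lookoutmetrics')

lookoutmetrics.put_feedback(
    AnomalyDetectorArn='arn:aws:lookoutmetrics:us-east-1:123456789012:AnomalyDetector:ecommerce-continuous-detector',
    AnomalyGroupTimeSeriesFeedback={
        'AnomalyGroupId': '11111111-2222-3333-4444-555555555555',  # placeholder group ID
        'TimeSeriesId': 'example-time-series-id',                  # placeholder time series ID
        'IsAnomaly': True
    }
)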

Clean up

To avoid incurring ongoing charges, delete the following resources created in this post:

  • Detector
  • S3 bucket
  • IAM role

Conclusion

Lookout for Metrics is available directly via the AWS Management Console, the AWS SDKs, the AWS CLI, as well as through supporting partners to help you easily implement customized solutions. The service is also compatible with AWS CloudFormation and can be used in compliance with the European Union’s General Data Protection Regulation (GDPR).

As of this writing, Lookout for Metrics is available in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • Europe (Ireland)
  • Europe (Frankfurt)
  • Europe (Stockholm)

To get started building your first detector, adding metrics via various popular data sources, and creating custom alerts and actions via the output channel of your choice, see Amazon Lookout for Metrics Documentation.


About the Authors

Ankita Verma is the Product Lead for Amazon Lookout for Metrics. Her current focus is helping businesses make data-driven decisions using AI/ML. Outside of AWS, she is a fitness enthusiast, and loves mentoring budding product managers and entrepreneurs in her free time. She also publishes a weekly product management newsletter called ‘The Product Mentors’ on Substack.

Chris King is a Senior Solutions Architect in Applied AI with AWS. He has a special interest in launching AI services and helped grow and build Amazon Personalize and Amazon Forecast before focusing on Amazon Lookout for Metrics. In his spare time he enjoys cooking, reading, boxing, and building models to predict the outcome of combat sports.

Read More

Amazon Kendra adds new search connectors from AWS Partner, Perficient, to help customers search enterprise content faster

Today, Amazon Kendra is making nine new search connectors available in the Amazon Kendra connector library developed by Perficient, an AWS Partner. These include search connectors for IBM Case Manager, Adobe Experience Manager, Atlassian Jira and Confluence, and many others.

Improving the Enterprise Search Experience

These days, employees and customers expect an intuitive search experience. Search has evolved from simply being considered a feature to being considered a critical element of how every product works.

Online searches have trained us to expect specific, relevant answers, not a list of choices, when we ask a search engine a question. Furthermore, we expect this level of accuracy everywhere we perform a search, particularly in the workplace. Unfortunately, enterprise search does not typically deliver the desired experience.

Amazon Kendra is an intelligent search service powered by machine learning. Amazon Kendra uses deep learning and reading comprehension to deliver more accurate search results. It also uses natural language understanding (NLU) to deliver a search experience that is more in line with asking an expert a specific question so you can receive a direct answer, quickly.

Amazon Kendra’s intelligent search capabilities improve the search and discovery experience, but enterprises are still faced with the challenge of connecting troves of unstructured data and making that data accessible to search. Content is often unstructured and scattered across multiple repositories, making critical information hard to find and costing employees time and effort to track down the right answer.

Data Source Connectors Make Information Searchable

Due to the diversity of sources, quality of data, and variety of connection protocols for an organization’s data sources, it’s crucial for enterprises to avoid time spent building complex ETL jobs that aggregate data sources.

To achieve this, organizations can use data source connectors to quickly unify content as part of a single, searchable index, without needing to copy or move data from an existing location to a new one. This reduces the time and effort typically associated with creating a new search solution.

In addition to unifying scattered content, data source connectors that have the right source interfaces in place, and the ability to apply rules and enrich content before users begin searching, can make a search application even more accurate.

Perficient Connectors for Amazon Kendra

Perficient has years of experience developing data source connectors for a wide range of enterprise data sources. After countless one-off implementations, they identified patterns that can be repeated for any connection type. They developed their search connector platform, Handshake, and integrated it with Amazon Kendra to deliver the following:

  • Reduced setup time when creating intelligent search applications
    • Handshake’s integration with Amazon Kendra indexes, extracts, and transforms metadata out of the box, making it easy to configure and get started with whatever data source connector you choose.
  • More connectors for many popular enterprise data sources
    • With Handshake, Amazon Kendra customers can license any of Perficient’s existing source connectors, including IBM Case Manager, Adobe Experience Manager, Nuxeo, Atlassian Jira and Confluence, Elastic Path, Elasticsearch, SQL databases, file systems, any REST-based API, and 30 other interfaces ready upon request.
  • Access to the most up-to-date enterprise content
    • Data source connectors can be scheduled to automatically sync an Amazon Kendra index with a data source, to ensure you’re always securely searching through the most up-to-date content.

Additionally, you can develop your own source connectors using Perficient’s framework as an accelerator. Wherever the data lives and whatever its format, Perficient can build hooks to combine, enhance, and index the critical data to augment Amazon Kendra’s intelligent search.

Perficient has found Amazon Kendra’s intelligent search capabilities, customization potential, and APIs to be powerful tools in search experience design. We are thrilled to add our growing list of data source connectors to Amazon Kendra’s connector library.

To get started with Amazon Kendra, visit the Amazon Kendra Essentials+ workshop for an interactive walkthrough. To learn more about other Amazon Kendra data source connectors, visit the Amazon Kendra connector library. For more information about Perficient’s Handshake platform or to contact the team directly, visit their website.


About the Authors

Zach Fischer is a Solution Architect at Perficient with 10 years’ experience delivering Search and ECM implementations to over 40 customers. Recently, he has become the Product Manager for Perficient’s Handshake and Nero products. When not working, he’s playing music or DnD.

Jean-Pierre Dodel leads product management for Amazon Kendra, a new ML-powered enterprise search service from AWS. He brings 15 years of Enterprise Search and ML solutions experience to the team, having worked at Autonomy, HP, and search startups for many years prior to joining Amazon 4 years ago. JP has led the Kendra team from its inception, defining vision, roadmaps, and delivering transformative semantic search capabilities to customers like Dow Jones, Liberty Mutual, 3M, and PwC.

Read More

Enable feature reuse across accounts and teams using Amazon SageMaker Feature Store

Amazon SageMaker Feature Store is a new capability of Amazon SageMaker that helps data scientists and machine learning (ML) engineers securely store, discover, and share curated data used in training and prediction workflows. As organizations build data-driven applications using ML, they’re constantly assembling and moving features between more and more functional teams. This constant movement of data can lead to inconsistencies in features and become a bottleneck when designing ML initiatives spanning multiple teams. For example, an ecommerce company might have several data science and engineering teams working on different aspects of their platform. The Core Search team focuses on query understanding and information retrieval tasks. The Product Success team solves problems involving customer reviews and feedback signals. The Personalization team uses clickstream and session data to create ML models for personalized recommendations. Additionally, data engineering teams like the Data Curation team can curate and validate user-specific information, which is an essential component that other teams can use. A Feature Store works as a unified interface between these teams, enabling one team to leverage the features generated by other teams, which minimizes the operational overhead of replicating and moving features across teams.

Training a production-ready ML model typically involves access to diverse set of features that aren’t always owned and maintained by the team that is building the model. A common practice for organizations that apply ML is to think of these data science teams as individual groups that work independently with limited collaboration. This results in ML workflows with no standardized way to share features across teams, which becomes a crucial limiting factor for data science productivity and makes it harder for data scientists to build new and complex models. With a shared feature store, organizations can achieve economies of scale. As more shared features become available, it becomes easier and cheaper for teams to build and maintain new models. These models can reuse features that are already developed, tested, and offered using a centralized feature store.

This post captures the essential cross-account architecture patterns for Feature Store that can be implemented in an organization with many data engineering and data science teams operating in different AWS accounts. We share how to enable sharing of features between accounts through a step-by-step example, which you can try out yourself with the code in our GitHub repo.

SageMaker Feature Store overview

By default, a SageMaker Feature Store is local to the account in which it is created, but it can also be centralized and shared by many accounts. An organization with multiple teams can have one centralized feature store that is shared across teams, as well as separate feature stores for use by individual teams. The separate stores can either hold feature groups that are of a sensitive nature or that are specific to a unique ML workload.

In this post, you first learn about the centralized feature store pattern. This pattern prescribes a central interface through which teams can create and publish new features, and from which other teams (or systems) can consume features. It also ensures that you have a single source of truth for feature data across your organization and simplifies resource management.

Next, you learn about the combined feature store pattern, which allows teams to maintain their own feature stores local to their account, while still being able to access shared features from the centralized feature store. These local feature stores are usually built for data science experimentation. By combining shared features from the centralized store with local features, teams can derive new enhanced features that can help when building more complex ML models. You can also use the local stores to store sensitive data that can’t be shared across the organization for regulatory and compliance reasons.

Lastly, we briefly cover a less common pattern involving replication of feature data.

Centralized feature store

Organizations can maximize the benefits of a feature store when it’s centralized. The centralized feature store pattern demonstrates how feature pipelines from multiple accounts can write to one centralized feature store, and how multiple other accounts can consume these features. This is a common pattern among mid- to large-sized enterprises where multiple teams manage different types of data or different parts of an application.

The process of hypothesizing, selecting, and transforming data inputs into a usable form suitable for ML models is called feature engineering. A feature pipeline encapsulates all the steps of the feature engineering process needed to convert raw data into useful features that ML models take as input for predictions. Maintaining feature pipelines is an expensive, time-consuming, and error-prone process. Also, replicating feature recipes and transformations across accounts can lead to inconsistencies and skew in feature characteristics. Because a centralized feature store facilitates knowledge sharing, teams don’t have to recreate feature recipes and rewrite pipelines from scratch in every project.
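
To make the feature pipeline concept concrete, the following is a minimal sketch (not part of the original example) that computes a single aggregate feature from raw session events and ingests it with the Feature Store runtime client. The Sessions feature group, feature names, and user ID are hypothetical placeholders, and the sketch assumes the feature group already exists.

import time
import boto3

# Hypothetical example: aggregate raw session events into a per-user feature
# and ingest it into an existing feature group named 'Sessions'.
featurestore_runtime = boto3.client('sagemaker-featurestore-runtime')

raw_events = [
    {'user_id': 'user-123', 'page_views': 3},
    {'user_id': 'user-123', 'page_views': 5},
]

# Feature engineering step: total page views per user
total_page_views = sum(event['page_views'] for event in raw_events)

record = [
    {'FeatureName': 'user_id', 'ValueAsString': 'user-123'},
    {'FeatureName': 'total_page_views', 'ValueAsString': str(total_page_views)},
    {'FeatureName': 'event_time', 'ValueAsString': str(int(time.time()))},
]

featurestore_runtime.put_record(FeatureGroupName='Sessions', Record=record)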

In this pattern, instead of writing features locally to an account-specific feature store, features are written to a centralized feature store. The centralized store serves as the central vault and creates a standardized way to access and maintain features for cross-team collaboration. It acts as an enabler and accelerator for AI adoption, reducing time to market for ML solutions, and allows for centralized governance and access control to ML features. You can grant access to external accounts, users, or roles to read and write individual feature groups in keeping with your data access policies. AWS recommends enforcing least privilege access to only the feature groups that you need for your job function. This is managed by the underlying AWS Identity and Access Management (IAM) policies. You can further refine access control with feature group tags and IAM conditions to decide which principals can perform specific actions. When you’re using a centralized store at scale, it’s important to also implement proper feature governance to ensure feature groups are well designed, have feature pipelines that are documented and supported, and have processes in place to ensure feature quality. This type of governance helps earn the trust required for feature reuse across teams.

Before walking through an example, let’s identify some key feature store concepts. First, feature groups are logical groups of features, typically originating from the same feature pipeline. An offline store contains large volumes of historical feature data used to create training and testing data for model development, or by batch applications for model scoring. The purpose of the online store is to serve these same features in real time with low latency. Unlike the offline store, which is append-only, the goal of the online store is to serve the most recent feature values. Behind the scenes, Feature Store automatically carries out data synchronization between the two stores. If you ingest new feature values into the online store, they’re automatically appended to the offline store. However, you can also create offline and online stores separately if this is a requirement for your team or project.
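
To make the online/offline distinction concrete, the following minimal sketch reads the latest values for one record from the online store and starts an Athena query against the offline store. The Users feature group, record identifier, offline store table name, and results bucket are hypothetical placeholders.

import boto3

# Online store: low-latency lookup of the latest feature values for one record.
featurestore_runtime = boto3.client('sagemaker-featurestore-runtime')
response = featurestore_runtime.get_record(
    FeatureGroupName='Users',
    RecordIdentifierValueAsString='user-123',
)
print(response['Record'])

# Offline store: append-only history queried with Athena to build training datasets.
# The Glue table name is auto-generated when the feature group is created, so look it
# up with describe_feature_group before querying (a placeholder is used here).
athena = boto3.client('athena')
athena.start_query_execution(
    QueryString='SELECT * FROM "sagemaker_featurestore"."<USERS OFFLINE STORE TABLE>" LIMIT 10',
    ResultConfiguration={'OutputLocation': 's3://<ATHENA RESULTS BUCKET>/'},
)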

The following diagram depicts three functional teams, each with its own feature pipeline writing to a feature group in a centralized feature store.

The Personalization account manages user session data collected from a customer-facing application and owns a feature pipeline that produces a feature group called Sessions with features derived from the session data. This pipeline writes the generated feature values to the centralized feature store. Likewise, a feature pipeline in the Product Success account is responsible for producing features in the Reviews feature group, and the Data Curation account produces features in the Users feature group.

The centralized feature store account holds all features received from the three producer accounts, mapped to three feature groups: Sessions, Reviews, and Users. Feature pipelines can write to the centralized feature store by assuming a specific IAM role that is created in the centralized store account. We discuss how to enable this cross-account role later in this post. External accounts can also query features from the feature groups in the centralized store for training or inference, as shown in the preceding architecture diagram. For training, you can assume the IAM role from the centralized store and run cross-account Amazon Athena queries (as shown in the diagram), or initiate an Amazon EMR or SageMaker Processing job to create training datasets. In case of real-time inference, you can read online features directly via the same assumed IAM role for cross-account access.

In this model, the centralized feature store usually resides in a production account. Applications using this store can either live in this account or in other accounts with cross-account access to the centralized feature store. You can replicate this entire structure in lower environments, such as development or staging, for testing infrastructure changes before promoting them to production.

Combined feature store

In this section, we discuss a variant of the centralized feature store pattern called the combined feature store pattern. In feature engineering, a common practice is to combine existing features to derive new features. When teams combine shared features from the centralized store with local features in their own feature store, they can derive new enhanced features to help build more complex data models. We know from the previous section that the centralized store makes it easy for any data science team to access external features and use them with their existing pool of features to compound and evolve new features.

Security and compliance is another use case for teams to maintain a team-specific feature store in addition to accessing features from the centralized store. Many teams require specific access rights that aren’t granted to everyone in the organization. For example, it might not be feasible to publish features that are extracted from sensitive data to a centralized feature store within the organization.

In the following architecture diagram, the centralized feature store is the account that collects and catalogs all the features received from multiple feature pipelines into one central repository. In this example, the account of the combined store belongs to the Core Search team. This account is the consumer of the shareable features from the centralized store. In addition, this account manages user keyword data collected via a customer-facing search application.

This account maintains its own local offline and online stores. These local stores are populated by a feature pipeline set up locally to ingest user query keyword data and generate features. These features are grouped under a feature group named Keywords. Feature Store by default automatically creates an AWS Glue table for this feature group, which is registered in the AWS Glue Data Catalog in this account. The metadata of this table points to the Amazon S3 location of the feature group in this account’s offline store.

The combined store account can also access feature groups Sessions, Reviews, and Users from the centralized store. You can enable cross-account access by role, which we discuss in the next sections. Data scientists and researchers can use Athena to query feature groups created locally and join these internal features with external features derived from the centralized store for data science experiments.
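
As a minimal sketch of the local side of such an experiment, the following looks up the auto-generated Glue table name for the hypothetical Keywords feature group and queries it with Athena; the column names and results bucket are placeholders. Once the shared features from the centralized store have been materialized into this account (for example, through the cross-account Athena queries shown later in this post), the two result sets can be joined into a single training dataset.

import boto3

sagemaker_client = boto3.client('sagemaker')
athena = boto3.client('athena')

# The offline store table name is auto-generated when the feature group is created,
# so retrieve it from the feature group metadata rather than hardcoding it.
keywords_table = sagemaker_client.describe_feature_group(
    FeatureGroupName='Keywords'
)['OfflineStoreConfig']['DataCatalogConfig']['TableName']

# Query the local offline store; results land in an S3 bucket owned by this account.
athena.start_query_execution(
    QueryString=f'SELECT user_id, keyword, event_time FROM "sagemaker_featurestore"."{keywords_table}"',
    ResultConfiguration={'OutputLocation': 's3://<LOCAL ATHENA RESULTS BUCKET>/'},
)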

Cross-account access overview

This section provides an overview of how to enable cross-account access for Feature Store between two accounts using an assumed role via AWS Security Token Service (AWS STS). AWS STS is a web service that enables you to request temporary, limited-privilege credentials for IAM users. AWS STS returns a set of temporary security credentials that you can use to access AWS resources that you might not normally have access to. These temporary credentials consist of an access key ID, secret access key, and security token.

To demonstrate this process, assume we have two accounts, A and B, as shown in the following diagram.

Account B maintains a centralized online and offline feature store. Account A needs access to both online and offline stores contained in Account B. To enable this, we create a role in Account B and let Account A assume that role using AWS STS. This enables Account A to behave like Account B, with permissions to perform specific actions identified by the role. AWS services like SageMaker (processing and training jobs, endpoints) and AWS Lambda used from Account A can assume the IAM role created in Account B by using an AWS STS client (see code block later in this post). This grants them the needed permissions to access resources like Amazon S3, Athena, and the AWS Glue Data Catalog inside Account B. After the services in Account A acquire the necessary permissions to the resources, they can access both the offline and online store in Account B. Depending on the choice of your service, you also need to add the IAM execution role for that service to the trusted policy of the cross-account IAM role in Account B. We discuss this in detail in the following section.

The preceding architecture diagram shows how Account A assumes a role from Account B to read and write to both online and offline stores contained within Account B. The seven steps in the diagram are as follows:

  1. Account B creates a role that can be assumed by others (for our use case, Account A).
  2. Account A assumes the IAM role from Account B using AWS STS. Account A can now generate temporary credentials that can be used to create AWS service clients that behave as if they are inside Account B.
  3. In Account A, SageMaker and other service clients (such as Amazon S3 and Athena) are created using the temporary credentials via the assumed role.
  4. The service clients in Account A can now create feature groups and populate feature values into Account B’s centralized online store using the AWS SDK.
  5. The online store in Account B automatically syncs with the offline store, also in Account B.
  6. The Athena service client inside Account A runs cross-account queries to read, group, and materialize feature sets using Athena tables inside Account B. Because the offline store exists in Account B, the corresponding AWS Glue tables, metadata catalog entries, and S3 objects all reside within Account B. Account A can use the AWS STS assume role to query the offline features (S3 objects) inside Account B, as shown in the sketch after this list.
  7. Athena query results are returned back as feature datasets into Account A’s S3 bucket.
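
The following is a minimal sketch of steps 6 and 7; the role name matches the one created later in this post, and the offline store table name and bucket names are placeholders.

import boto3

# Step 2: assume the cross-account role created in Account B.
sts = boto3.client('sts')
creds = sts.assume_role(
    RoleArn='arn:aws:iam::<ACCOUNT B ID>:role/cross-account-assume-role',
    RoleSessionName='FeatureStoreCrossAccountAthena',
)['Credentials']

# Step 3: an Athena client in Account A that acts with Account B's permissions.
athena = boto3.client('athena',
                      aws_access_key_id=creds['AccessKeyId'],
                      aws_secret_access_key=creds['SecretAccessKey'],
                      aws_session_token=creds['SessionToken'])

# Step 6: query the offline store tables registered in Account B's Glue Data Catalog.
# Step 7: Athena writes the query results to an S3 bucket owned by Account A.
athena.start_query_execution(
    QueryString='SELECT * FROM "sagemaker_featurestore"."<SESSIONS OFFLINE STORE TABLE>"',
    ResultConfiguration={'OutputLocation': 's3://<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>/'},
)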

The temporary credentials returned by the AWS STS AssumeRole call are limited to 1 hour by default. You can extend the effective duration of your session by using RefreshableCredentials, a botocore class that automatically refreshes the credentials, so long-running applications can keep working beyond the 1-hour timeframe. An example notebook demonstrating this is available in our GitHub repo.
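
The following is a minimal sketch of that pattern (the notebook in the repo has the complete version). It wires RefreshableCredentials into a botocore session so that clients transparently re-assume the role before the credentials expire; note that it assigns the private _credentials attribute of the botocore session, which is a common workaround rather than a documented API.

import boto3
from botocore.credentials import RefreshableCredentials
from botocore.session import get_session

ROLE_ARN = 'arn:aws:iam::<ACCOUNT B ID>:role/cross-account-assume-role'

def assume_role_credentials():
    # Re-assume the cross-account role and return credentials in the shape botocore expects.
    creds = boto3.client('sts').assume_role(
        RoleArn=ROLE_ARN,
        RoleSessionName='FeatureStoreCrossAccountAccessDemo',
    )['Credentials']
    return {
        'access_key': creds['AccessKeyId'],
        'secret_key': creds['SecretAccessKey'],
        'token': creds['SessionToken'],
        'expiry_time': creds['Expiration'].isoformat(),
    }

# Credentials object that refreshes itself shortly before the 1-hour expiry.
refreshable = RefreshableCredentials.create_from_metadata(
    metadata=assume_role_credentials(),
    refresh_using=assume_role_credentials,
    method='sts-assume-role',
)

botocore_session = get_session()
botocore_session._credentials = refreshable  # private attribute; see note above
session = boto3.Session(botocore_session=botocore_session)

sagemaker_client = session.client('sagemaker')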

Create cross-account access

This section details all the steps to create the cross-account access roles, policies, and permissions to enable shareability of features between Accounts A and B according to our architecture.

Create a Feature Store access role

From Account B, we create a Feature Store access role. This is the role assumed by AWS services inside Account A to gain access to resources in Account B.

  1. On the IAM console, in the navigation pane, choose Roles.
  2. Choose Create role.
  3. Choose Another AWS account.
  4. For Account ID, enter the 12-digit account ID of Account A (the account that will assume this role).
  5. Choose Next: Permissions.

  6. In the Permissions section, search for and attach the following AWS managed policies:
    1. AmazonSageMakerFullAccess (you can further restrict this to least privilege based on your use case)
    2. AmazonSageMakerFeatureStoreAccess
  7. Create and attach a custom policy to this new role (provide the name of the S3 bucket in Account A to which the results of Athena queries run in Account B are written):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AthenaResultsS3BucketCrossAccountAccessPolicy",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObjectAcl",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>",
        "arn:aws:s3:::<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>/*"
      ]
    }
  ]
}

When an AWS service in Account A assumes this new cross-account role, it can run Athena queries against the offline store content in Account B. The custom policy allows Athena (running in Account B) to write the query results back to a results bucket in Account A. Make sure that this results bucket exists in Account A before you create the preceding policy.

Alternatively, you can let the centralized feature store in Account B maintain all the Athena query results in an S3 bucket. In this case, you have to set up cross-account Amazon S3 read access policies for external accounts to read the saved results (S3 objects).

  8. After you attach the policies, choose Next.
  9. Enter a name for this role (for example, cross-account-assume-role).
  10. On the Summary page for the created role, under Trust relationships, choose Edit trust relationship.
  11. Edit the trust policy document as shown in the following code:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<ACCOUNT A ID>:root"
        ],
        "Service": [
          "sagemaker.amazonaws.com",
          "athena.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}

The preceding code adds SageMaker and Athena as services in the Principal section. If you want more external accounts or roles to assume this role, you can add their corresponding ARNs in this section.

Create a SageMaker notebook instance

From Account A, create a SageMaker notebook instance with an IAM execution role. This role grants the SageMaker notebook in Account A the permissions needed to run actions on the feature store inside Account B. Alternatively, if you're using Lambda instead of a SageMaker notebook, create a role for Lambda with the same attached policies shown in this section.

By default, the following policies are attached when you create a new execution role for a SageMaker notebook:

  • AmazonSageMaker-ExecutionPolicy
  • AmazonSageMakerFullAccess

We need to create and attach two additional custom policies. First, create a custom policy with the following code, which allows the execution role in Account A to perform certain S3 actions needed to interact with the offline store in Account B:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FeatureStoreS3AccessPolicy",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetBucketAcl",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::<OFFLINE STORE BUCKET NAME IN ACCOUNT B>",
        "arn:aws:s3:::<OFFLINE STORE BUCKET NAME IN ACCOUNT B>/*"
      ]
    }
  ]
}

You can also attach the AWS managed policy AmazonSageMakerFeatureStoreAccess if your offline store S3 bucket name contains the SageMaker keyword, because that managed policy scopes its S3 permissions to buckets whose names include it.

Second, create the following custom policy, which allows the SageMaker notebook in Account A to assume the role (cross-account-assume-role) created in Account B:

{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::<ACCOUNT B ID>:role/cross-account-assume-role"
  }
}

At this point, Account A can access the online and offline store in Account B: when Account A assumes the cross-account AWS STS role from Account B, it can run Athena queries inside Account B against that offline store. However, the results of those queries (feature datasets) need to be saved in an S3 bucket in Account A to enable model training. Therefore, we create a bucket in Account A to hold the Athena query results, along with a bucket policy (see the following code) that allows the cross-account AWS STS role to write and read objects in this bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MyStatementSid",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<ACCOUNT B>:role/cross-account-assume-role"
        ]
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>",
        "arn:aws:s3:::<ATHENA RESULTS BUCKET NAME IN ACCOUNT A>/*"
      ]
    }
  ]
}

Modify the trust relationship policy

Because we created an IAM execution role in Account A, we use the ARN of this role to modify the trust relationship policy of the cross-account assume role in Account B:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "ARN OF SAGEMAKER EXECUTION ROLE CREATED IN ACCOUNT A"
        ],
        "Service": [
          "sagemaker.amazonaws.com",
          "athena.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}

Validate the setup process

After you set up all the roles and accompanying policies, you can validate the setup by running the example notebooks in the GitHub repo. The following code block is an excerpt from the example notebook and must be run in a SageMaker notebook running within Account A. It demonstrates how you can assume the cross-account role from Account B using AWS STS via the AssumeRole API call. This call returns a set of temporary credentials that Account A can use to create any service clients. When you use these clients, your code uses the permissions of the assumed role, and acts as if it belongs to Account B. For more information, see assume_role in the AWS SDK for Python (Boto 3) documentation.

import boto3


# Create an STS client in Account A
sts = boto3.client('sts')

# Assume the cross-account role created in Account B
CROSS_ACCOUNT_ASSUME_ROLE = 'arn:aws:iam::<ACCOUNT B ID>:role/cross-account-assume-role'
metadata = sts.assume_role(RoleArn=CROSS_ACCOUNT_ASSUME_ROLE, 
                           RoleSessionName='FeatureStoreCrossAccountAccessDemo')

# Extract the temporary credentials returned by AssumeRole
access_key_id = metadata['Credentials']['AccessKeyId']
secret_access_key = metadata['Credentials']['SecretAccessKey']
session_token = metadata['Credentials']['SessionToken']

region = boto3.Session().region_name
boto_session = boto3.Session(region_name=region)

# Create a SageMaker client that acts with the permissions of the assumed role
sagemaker_client = boto_session.client('sagemaker',
                                       aws_access_key_id=access_key_id,
                                       aws_secret_access_key=secret_access_key,
                                       aws_session_token=session_token)

# Create a SageMaker Feature Store runtime client for reads and writes to the online store
sagemaker_featurestore_runtime_client = boto_session.client(service_name='sagemaker-featurestore-runtime',
                                                            aws_access_key_id=access_key_id,
                                                            aws_secret_access_key=secret_access_key,
                                                            aws_session_token=session_token)
. . .

offline_config = {'OfflineStoreConfig': {'S3StorageConfig': {'S3Uri': f's3://{OFFLINE_STORE_BUCKET}'}}}

sagemaker_client.create_feature_group(FeatureGroupName=FEATURE_GROUP_NAME,
                                    RecordIdentifierFeatureName=record_identifier_feature_name,
                                    EventTimeFeatureName=event_time_feature_name,
                                    FeatureDefinitions=feature_definitions,
                                    Description='< DESCRIPTION >',
                                    Tags='< LIST OF TAGS >',
                                    OnlineStoreConfig={'EnableOnlineStore': True},
                                    RoleArn=CROSS_ACCOUNT_ASSUME_ROLE,
                                    **offline_config)

. . .

sagemaker_featurestore_runtime_client.put_record(FeatureGroupName=FEATURE_GROUP_NAME, 
                                                 Record=record)

After you create the SageMaker clients as per the preceding code example in Account A, you can create feature groups and populate features into Account B’s centralized online and offline store. For more information about how to create, describe, and delete feature groups, see create_feature_group in the Boto3 documentation. You can also use the Feature Store runtime client to put and get feature records to and from feature groups.
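
For example, the following short sketch continues the preceding excerpt and reads back the record that was just written, using the same cross-account runtime client; the record identifier value is a placeholder.

# Read the latest feature values for one record from Account B's online store,
# using the cross-account Feature Store runtime client created earlier.
response = sagemaker_featurestore_runtime_client.get_record(
    FeatureGroupName=FEATURE_GROUP_NAME,
    RecordIdentifierValueAsString='<RECORD IDENTIFIER VALUE>',
)
print(response['Record'])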

Offline store replication

Reproducibility is the ability to recreate an ML model exactly, so that given the same features as input, it returns the same output as the original model. This is essentially what we strive to achieve between the models we develop in a research environment and the models we deploy in a production environment. Replicating feature engineering pipelines across accounts is a complex and time-consuming process that can introduce model discrepancies if not implemented properly. If the feature set used to train a model changes after the training phase, it may be difficult or impossible to reproduce the model.

Applications that reside on AWS usually have several distinct environments and accounts, such as development, testing, staging, and production. To achieve automated deployment of the application across different environments, we use CI/CD pipelines. Organizations often need to maintain isolated work environments and multiple copies of data in the same or different AWS Regions, or across different AWS accounts. In the context of Feature Store, some companies may want to replicate offline feature store data. Offline store replication via Amazon S3 replication can be a useful pattern in this case. This pattern enables isolated environments and accounts to retrain ML models using full feature sets without using cross-account roles or permissions.
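
As a hedged sketch of one way to set this up (not the only approach), the following configures S3 replication from the offline store bucket to a destination bucket using boto3. It assumes versioning is already enabled on both buckets, that a suitable replication IAM role exists, and, for cross-account replication, that the destination bucket policy permits it; bucket names, account IDs, and the role ARN are placeholders.

import boto3

s3 = boto3.client('s3')

# Replicate new offline store objects to a destination bucket, for example in
# another account or Region. Versioning must be enabled on both buckets.
s3.put_bucket_replication(
    Bucket='<OFFLINE STORE BUCKET NAME>',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::<SOURCE ACCOUNT ID>:role/<S3 REPLICATION ROLE>',
        'Rules': [
            {
                'ID': 'offline-store-replication',
                'Status': 'Enabled',
                'Priority': 1,
                'Filter': {},
                'DeleteMarkerReplication': {'Status': 'Disabled'},
                'Destination': {
                    'Bucket': 'arn:aws:s3:::<DESTINATION OFFLINE STORE BUCKET>',
                    'Account': '<DESTINATION ACCOUNT ID>',
                    # Make the destination account the owner of replicated objects
                    'AccessControlTranslation': {'Owner': 'Destination'},
                },
            }
        ],
    },
)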

Conclusion

In this post, we demonstrated various architecture patterns like the centralized feature store, combined feature store, and other design considerations for SageMaker Feature Store that are essential to cross-functional data science collaboration. We also showed how to set up cross-account access using AWS STS.

To learn more about Feature Store capabilities and use cases, see Understanding the key capabilities of Amazon SageMaker Feature Store and Using streaming ingestion with Amazon SageMaker Feature Store to make ML-backed decisions in near-real time.

If you have any comments or questions, please leave them in the comments section.


About the Authors

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.


Mark Roy is a Principal Machine Learning Architect for AWS, helping AWS customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including Insurance, Financial Services, Media and Entertainment, Healthcare, Utilities, and Manufacturing. Mark holds 6 AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for 25+ years, including 19 years in financial services.

Stefan Natu is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.
