Getting started with Amazon SageMaker Feature Store

In a machine learning (ML) journey, one crucial step before building any model is to transform your data and design features so that your data is machine-readable. This step is known as feature engineering, and can include one-hot encoding categorical variables, converting text values to a vectorized representation, aggregating log data to a daily summary, and more. The quality of your features directly influences the predictive power of your model, and it often takes several iterations before a model reaches an ideal level of accuracy. Data scientists and developers can easily spend 60% of their time designing and creating features, and the challenges go beyond writing and testing the feature engineering code: features built at different times and by different teams aren’t consistent, extensive and repetitive feature engineering work is often needed when productionizing new features, versions are hard to track, and up-to-date features aren’t easily accessible.

To address these challenges, Amazon SageMaker Feature Store provides a fully managed central repository for ML features, making it easy to securely store and retrieve features without the heavy lifting of managing the infrastructure. It lets you define groups of features, use batch ingestion and streaming ingestion, and retrieve the latest feature values with low latency.

For an introduction to Feature Store and a basic use case using a credit card transaction dataset for fraud detection, see New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store. For further exploration of its features, see Using streaming ingestion with Amazon SageMaker Feature Store to make ML-backed decisions in near-real time.

For this post, we focus on the integration of Feature Store with other Amazon SageMaker features to help you get started quickly. The associated sample notebook and the following video demonstrate how you can apply these concepts to the development of an ML model to predict the risk of heart failure.

The components of Feature Store

Feature Store is a centralized hub for features and associated metadata. Features are defined and stored in a collection called a feature group. You can visualize a feature group as a table in which each column is a feature and each row is identified by a unique record identifier. A feature group’s definition is composed of the following:

  • Feature definitions – These consist of a name and a data type for each feature.
  • A record identifier name – Each feature group is defined with a record identifier name. This should be a unique ID that identifies each instance of the data, such as a primary key, customer ID, or transaction ID.
  • Configurations for its online and offline store – You can create an online or offline store. The online store is used for low-latency, real-time inference use cases, and the offline store is used for training and batch inference.

The following diagram shows how you can use Feature Store as part of your ML pipeline. First, you read in your raw data and transform it into features ready for exploration and modeling. Then you can create a feature store and configure it as an online store, an offline store, or both. Next, you can ingest data via streaming to the online and offline store, or in batches directly to the offline store. After your feature store is set up, you can create a model using data from your offline store and access the store for real-time inference or batch inference.

For more hands-on experience, follow the notebook example for a step-by-step guide to build a feature store, train a model for fraud detection, and access the feature store for inference.

Export data from Data Wrangler to Feature Store

Because Feature Store can ingest data in batches, you can author features using Amazon SageMaker Data Wrangler, create feature groups in Feature Store, and ingest features in batches using a SageMaker Processing job with a notebook exported from Data Wrangler. This mode allows for batch ingestion into the offline store. It also supports ingestion into the online store if the feature group is configured for both online and offline use.

To start off, after you complete your data transformation steps and analysis, you can conveniently export your data preparation workflow into a notebook with one click. When you export your flow steps, you have the option of exporting your processing code to a notebook that pushes your processed features to Feature Store.

Choose Export step and Feature Store to automatically create your notebook. This notebook recreates the manual steps you created, creates a feature group, and adds features to an offline or online feature store, allowing you to easily rerun your manual steps.

Instead of auto-detecting the data type of each column, the notebook defines the schema explicitly, in the following format:

column_schema = [
    {
        "name": "Height",
        "type": "long"
    },
    {
        "name": "Sum",
        "type": "string"
    },
    {
        "name": "Time",
        "type": "string"
    }
]

For more information on how to load the schema, map it, and add it as a FeatureDefinition that you can use to create the FeatureGroup, see Export to the SageMaker Feature Store.

Additionally, you must specify a record identifier name and event time feature name in the following code:

  • The record_identifier_name is the name of the feature whose value uniquely identifies a record defined in the feature store.
  • An EventTime is a point in time when a new event occurs that corresponds to the creation or update of a record in a feature. All records in the feature group must have a corresponding EventTime.
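
The following is a minimal sketch (the feature group name, role ARN, and sample DataFrame are illustrative assumptions, not the notebook’s exact code) of how these pieces come together when you create a feature group with the SageMaker Python SDK:

import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

# A tiny illustrative DataFrame; in practice this is your processed feature set.
df = pd.DataFrame({
    "record_id": [1, 2],
    "Height": [172, 181],
    "EventTime": [time.time(), time.time()],
})

sagemaker_session = sagemaker.Session()
role_arn = "arn:aws:iam::123456789012:role/SageMakerFeatureStoreRole"  # hypothetical role

your_feature_group = FeatureGroup(
    name="heart-failure-features",            # hypothetical feature group name
    sagemaker_session=sagemaker_session,
)

# Infer feature definitions (names and data types) from the DataFrame.
your_feature_group.load_feature_definitions(data_frame=df)

your_feature_group.create(
    s3_uri=f"s3://{sagemaker_session.default_bucket()}/feature-store",  # offline store location
    record_identifier_name="record_id",       # uniquely identifies each record
    event_time_feature_name="EventTime",      # creation/update time of each record
    role_arn=role_arn,
    enable_online_store=True,                 # create the online store as well
)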

The notebook creates both the offline store and the online store by default, with the following configuration set to True:

online_store_config = {
    "EnableOnlineStore": True
}

You can disable the online store by setting EnableOnlineStore to False in the store configuration.

You can then run the notebook, which creates a feature group and a SageMaker Processing job to ingest the data at scale. The offline store is located in an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account. Because Feature Store is integrated with Amazon SageMaker Studio, you can visualize the feature store by choosing Components and registries in the navigation pane, choosing Feature Store on the drop-down menu, and then finding your feature store on the list. You can check feature definitions, manage feature group tags, and generate queries for the offline store.
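
For smaller datasets, you can also ingest a DataFrame directly with the SageMaker Python SDK instead of running the processing job. The following is a minimal sketch, assuming the your_feature_group and df objects from the earlier example:

# Write each row of the DataFrame as a record; with the online store enabled,
# records are also replicated to the offline store automatically.
your_feature_group.ingest(
    data_frame=df,     # processed features, including the EventTime column
    max_workers=4,     # parallel workers to speed up ingestion
    wait=True,         # block until all records are written
)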

Build a training set from an offline store

Now that you have created a feature store from your processed data, you can build a training dataset from your offline store by using services such as Amazon Athena, AWS Glue, or Amazon EMR. Because Feature Store automatically builds an AWS Glue Data Catalog when you create feature groups, you can easily create a training dataset with feature values from the feature group by querying that catalog, as shown in the following example.

First, create an Athena query for your feature group with the following code. The table_name is the AWS Glue table that is automatically generated by Feature Store.

sample_query = your_feature_group.athena_query()
data_table = sample_query.table_name

You can then write a SQL query against your feature group and run it with the .run() command, specifying the S3 location where the dataset should be saved. You can modify the query to include any operations your data needs, such as joins, filters, and ordering. You can further process the output DataFrame until it’s ready for modeling, then upload it to your S3 bucket so that your SageMaker training job can read the input directly from Amazon S3.

# define your Athena query
query_string = 'SELECT * FROM "'+data_table+'"'

# run Athena query. The output is loaded to a Pandas dataframe.
dataset = pd.DataFrame()
sample_query.run(query_string=query_string, output_location='s3://'+default_s3_bucket_name+'/query_results/')
sample_query.wait()
dataset = sample_query.as_dataframe()

Access your Feature Store for inference

After you build a model from the training set, you can access your online store conveniently to fetch a record and make predictions using the deployed model. Feature Store can be especially useful in supplementing data for inference requests because of the low-latency GetRecord functionality. In this example, you can use the following code to query the online feature group to build an inference request:

selected_id = str(194)

# Helper to parse the feature value from the record.
def get_feature_value(record, feature_name):
    return str(list(filter(lambda r: r['FeatureName'] == feature_name, record))[0]['ValueAsString'])

fs_response = featurestore_runtime.get_record(
                                               FeatureGroupName=your_feature_group_name,
                                               RecordIdentifierValueAsString=selected_id)
selected_record = fs_response['Record']
inference_request = [
    get_feature_value(selected_record, 'feature1'),
    get_feature_value(selected_record, 'feature2'),
    ....
    get_feature_value(selected_record, 'feature10')
]

You can then call the deployed model predictor to generate a prediction for the selected record:

results = predictor.predict(','.join(inference_request), 
                            initial_args = {"ContentType": "text/csv"})
prediction = json.loads(results)

Integrate Feature Store in a SageMaker pipeline

Feature Store also integrates with Amazon SageMaker Pipelines, making it easy to add feature search, discovery, and reuse to your automated ML workflows. The following code shows how to configure the ProcessingOutput to write the output directly to your feature group instead of Amazon S3, so that you can maintain your model features in a feature store:

flow_step_outputs = []
flow_output = sagemaker.processing.ProcessingOutput(
    output_name=customers_output_name,
    feature_store_output=sagemaker.processing.FeatureStoreOutput(
        feature_group_name=your_feature_group_name), 
    app_managed=True)
flow_step_outputs.append(flow_output)

example_flow_step = ProcessingStep(
    name='SampleProcessingStep', 
    processor=flow_processor, # Your flow processor defined at the beginning of your pipeline
    inputs=flow_step_inputs, # Your processing and feature engineering steps, can be Data Wrangler flows
    outputs=flow_step_outputs)

Conclusion

In this post, we explored how Feature Store can be a powerful tool in your ML journey. You can easily export your data processing and feature engineering results to a feature group and build your feature store. After your feature store is all set up, you can explore and build training sets from your offline store, taking advantage of its integration with other AWS analytics services such as Athena, AWS Glue, and Amazon EMR. After you train and deploy a model, you can fetch records from your online store for real-time inference. Lastly, you can add a feature store as a part of a complete SageMaker pipeline in your ML workflow. Feature Store makes it easy to store and retrieve features as needed in ML development.

Give it a try, and let us know what you think!


About the Author

As a data scientist and consultant, Zoe Ma has helped bring the latest tools and technologies and data-driven insights to businesses and enterprises. In her free time, she loves painting and crafting and enjoys all water sports.

Courtney McKay is a consultant. She is passionate about helping customers drive measurable ROI with AI/ML tools and technologies. In her free time, she enjoys camping, hiking and gardening.

Read More

Run ML inference on AWS Snowball Edge with Amazon SageMaker Edge Manager and AWS IoT Greengrass

You can use AWS Snowball Edge devices in locations like cruise ships, oil rigs, and factory floors with limited to no network connectivity for a wide range of machine learning (ML) applications such as surveillance, facial recognition, and industrial inspection. However, given the remote and disconnected nature of these devices, deploying and managing ML models at the edge is often difficult. With AWS IoT Greengrass and Amazon SageMaker Edge Manager, you can perform ML inference on locally generated data on Snowball Edge devices using cloud-trained ML models. You not only benefit from the low latency and cost savings of running local inference, but also reduce the time and effort required to get ML models to production. You can do all this while continuously monitoring and improving model quality across your Snowball Edge device fleet.

In this post, we talk about how you can use AWS IoT Greengrass version 2.0 or higher and Edge Manager to optimize, secure, monitor, and maintain a simple TensorFlow classification model to classify shipping containers (connex) and people.

Getting started

To get started, order a Snowball Edge device with an AWS IoT Greengrass validated AMI on it (for more information, see Creating an AWS Snowball Edge Job).

After you receive the device, you can use AWS OpsHub for Snow Family or the Snowball Edge client to unlock the device. You can start an Amazon Elastic Compute Cloud (Amazon EC2) instance with the latest AWS IoT Greengrass installed or use the commands on AWS OpsHub for Snow Family.

Launch and install an AMI with the following requirements, or provide an AMI reference on the Snowball console before ordering so that the device ships with all libraries and data already in the AMI:

  • The ML framework of your choice, such as TensorFlow, PyTorch, or MXNet
  • Docker (if you intend to use it)
  • AWS IoT Greengrass
  • Any other libraries you may need

Prepare the AMI at the time of ordering the Snowball Edge device on the AWS Snow Family console. For instructions, see Using Amazon EC2 Compute Instances. You also have the option to update the AMI after Snowball is deployed to your edge location.

Install the latest AWS IoT Greengrass on Snowball Edge

To install AWS IoT Greengrass on your device, complete the following steps:

  1. Install the latest AWS IoT Greengrass on your Snowball Edge device. Make sure --deploy-dev-tools true is set so that the Greengrass CLI (ggv2) is installed. See the following code:
sudo -E java -Droot="/greengrass/v2" -Dlog.store=FILE \
  -jar ./MyGreengrassCore/lib/Greengrass.jar \
  --aws-region region \
  --thing-name MyGreengrassCore \
  --thing-group-name MyGreengrassCoreGroup \
  --tes-role-name GreengrassV2TokenExchangeRole \
  --tes-role-alias-name GreengrassCoreTokenExchangeRoleAlias \
  --component-default-user ggc_user:ggc_group \
  --provision true \
  --setup-system-service true \
  --deploy-dev-tools true

We reference the --thing-name you chose here when we set up Edge Manager.

  2. Run the following command to test your installation:
aws greengrassv2 help
  3. On the AWS IoT console, validate that the Snowball Edge device is successfully registered with your AWS IoT Greengrass account.

Optimize ML models with Edge Manager

We use Edge Manager to deploy and manage the model on Snowball Edge.

  1. Install the Edge Manager agent on Snowball Edge using the latest AWS IoT Greengrass.
  2. Train and store your ML model.

You can train your ML model using any framework of your choice and save it to an Amazon Simple Storage Service (Amazon S3) bucket. In the following screenshot, we use TensorFlow to train a multi-label model to classify connex and people in an image. The model used here is saved to an S3 bucket by first creating a model.tar.gz archive.

After the model is saved (TensorFlow Lite in this case), you can start an Amazon SageMaker Neo compilation job of the model and optimize the ML model for Snowball Edge Compute (SBE_C).

  1. On the SageMaker console, under Inference in the navigation pane, choose Compilation jobs.
  2. Choose Create compilation job.
  3. Give your job a name and create a new role or use an existing one.

If you’re creating a new AWS Identity and Access Management (IAM) role, ensure that SageMaker has access to the bucket in which the model is saved.

  4. In the Input configuration section, for Location of model artifacts, enter the path to model.tar.gz where you saved the file (in this case, s3://feidemo/tfconnexmodel/connexmodel.tar.gz).
  5. For Data input configuration, enter the ML model’s input layer (its name and shape). In this case, it’s called keras_layer_input and its shape is [1,224,224,3], so we enter {"keras_layer_input":[1,224,224,3]}.
  6. For Machine learning framework, choose TFLite.
  7. For Target device, choose sbe_c.
  8. Leave Compiler options blank.
  9. For S3 Output location, enter the same location where your model is saved, with the prefix (folder) output. For example, we enter s3://feidemo/tfconnexmodel/output.
  10. Choose Submit to start the compilation job.
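
If you prefer to script this step rather than use the console, the following is a minimal boto3 sketch using the same example values; the job name, role ARN, and stopping condition are illustrative assumptions:

import boto3

sm_client = boto3.client('sagemaker')

sm_client.create_compilation_job(
    CompilationJobName='connex-tflite-sbe',                       # hypothetical job name
    RoleArn='arn:aws:iam::123456789012:role/SageMakerNeoRole',    # hypothetical role
    InputConfig={
        'S3Uri': 's3://feidemo/tfconnexmodel/connexmodel.tar.gz',
        'DataInputConfig': '{"keras_layer_input":[1,224,224,3]}',
        'Framework': 'TFLITE',
    },
    OutputConfig={
        'S3OutputLocation': 's3://feidemo/tfconnexmodel/output',
        'TargetDevice': 'sbe_c',
    },
    StoppingCondition={'MaxRuntimeInSeconds': 900},
)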

Now you create a model deployment package to be used by Edge Manager.

  1. On the SageMaker console, under Edge Manager, choose Edge packaging jobs.
  2. Choose Create Edge packaging job.
  3. In the Job properties section, enter the job details.
  4. In the Model source section, for Compilation job name, enter the name you provided for the Neo compilation job.
  5. Choose Next.

  6. In the Output configuration section, for S3 bucket URI, enter where you want to store the package in Amazon S3.
  7. For Component name, enter a name for your AWS IoT Greengrass component.

This step creates an AWS IoT Greengrass model component where the model is downloaded from Amazon S3 and uncompressed to local storage on Snowball Edge.
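
The console flow above can also be scripted. The following is a hedged boto3 sketch of the packaging call (the job, model, and role names are illustrative assumptions, and the Greengrass component settings shown on the console are omitted here):

import boto3

sm_client = boto3.client('sagemaker')

sm_client.create_edge_packaging_job(
    EdgePackagingJobName='connex-edge-packaging',                 # hypothetical job name
    CompilationJobName='connex-tflite-sbe',                       # the Neo job created earlier
    ModelName='connex-classifier',                                # hypothetical model name
    ModelVersion='1.0',
    RoleArn='arn:aws:iam::123456789012:role/SageMakerEdgeRole',   # hypothetical role
    OutputConfig={'S3OutputLocation': 's3://feidemo/tfconnexmodel/edge-packages/'},
)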

  8. Create a device fleet to manage a group of devices, in this case, just one (SBE).
  9. For IAM role, enter the role generated by AWS IoT Greengrass earlier (--tes-role-name).

Make sure the role has the necessary permissions by going to the IAM console, searching for the role, and attaching the required policies.

  10. Register the Snowball Edge device to the fleet you created.

  11. In the Device source section, enter the device name. The IoT name needs to match the name you used earlier (in this case, --thing-name MyGreengrassCore).

You can register additional Snowball devices on the SageMaker console to add them to the device fleet, which allows you to group and manage these devices together.
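
If you want to script the fleet setup, the following is a hedged boto3 sketch (the fleet, device, and bucket names are illustrative assumptions); the IotThingName must match the --thing-name used when installing AWS IoT Greengrass:

import boto3

sm_client = boto3.client('sagemaker')

# Create a fleet to manage the Snowball Edge device(s).
sm_client.create_device_fleet(
    DeviceFleetName='snowball-edge-fleet',                        # hypothetical fleet name
    RoleArn='arn:aws:iam::123456789012:role/GreengrassV2TokenExchangeRole',
    OutputConfig={'S3OutputLocation': 's3://feidemo/edge-manager/fleet-data/'},
)

# Register the device with the fleet.
sm_client.register_devices(
    DeviceFleetName='snowball-edge-fleet',
    Devices=[{
        'DeviceName': 'sbe-device-01',                            # hypothetical device name
        'IotThingName': 'MyGreengrassCore',
    }],
)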

Deploy ML models to Snowball Edge using AWS IoT Greengrass

In the previous sections, you unlocked and configured your Snowball Edge device. The ML model is now compiled and optimized for performance on Snowball Edge. An Edge Manager package is created with the compiled model and the Snowball device is registered to a fleet. In this section, you look at the steps involved in deploying the ML model for inference to Snowball Edge with the latest AWS IoT Greengrass.

Components

AWS IoT Greengrass allows you to deploy to edge devices as a combination of components and associated artifacts. Components are JSON documents that contain the metadata, the lifecycle, what to deploy when, and what to install. Components also define what operating system to use and what artifacts to use when running on different OS options.

Artifacts

Artifacts can be code files, models, or container images. For example, a component can be defined to install a pandas Python library and run a code file that will transform the data, or to install a TensorFlow library and run the model for inference. The following are example artifacts needed for an inference application deployment:

  • gRPC proto and Python stubs (this can be different based on your model and framework)
  • Python code to load the model and perform inference

These two items are uploaded to an S3 bucket.

Deploy the components

The deployment needs the following components:

  • Edge Manager agent (available in public components at GA)
  • Model
  • Application

Complete the following steps to deploy the components:

  1. On the AWS IoT console, under Greengrass, choose Components, and create the application component.
  2. Find the Edge Manager agent component in the public components list and deploy it.
  3. Deploy a model component created by Edge Manager, which is used as a dependency in the application component.
  4. Deploy the application component to the edge device by going to the list of AWS IoT Greengrass deployments and creating a new deployment.

If you have an existing deployment, you can revise it to add the application component.

Now you can test your component.

  1. In your prediction or inference code deployed with the application component, add logic to read files locally on the Snowball Edge device (for example, from an incoming folder) and move the predictions or processed files to a processed folder.
  2. Log in to the device to see if the predictions have been made.
  3. Set up the code to run in a loop, checking the incoming folder for new files, processing them, and moving them to the processed folder (a minimal sketch follows this list).
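
The following is a hedged sketch of that folder-watching loop; the folder paths are illustrative, and run_inference is a placeholder for your call to the Edge Manager agent:

import shutil
import time
from pathlib import Path

INCOMING = Path('/home/ec2-user/incoming')      # hypothetical folder locations
PROCESSED = Path('/home/ec2-user/processed')

def run_inference(image_path):
    # Placeholder: call the Edge Manager agent (for example, over gRPC) to run
    # the compiled model against the image and return the predicted class.
    return 'connex'

while True:
    for image_path in INCOMING.glob('*.jpg'):
        prediction = run_inference(image_path)
        print(f'{image_path.name}: {prediction}')
        shutil.move(str(image_path), str(PROCESSED / image_path.name))
    time.sleep(5)                                # poll the incoming folder for new files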

The following screenshot is an example setup of files before deployment inside the Snowball Edge.

After deployment, all the test images have classes of interest and therefore are moved to the processed folder.

Clean up

To clean up everything or reimplement this solution from scratch, stop all the EC2 instances by invoking the TerminateInstances API against the EC2-compatible endpoints running on your Snowball Edge device. To return your Snowball Edge device, see Powering Off the Snowball Edge and Returning the Snowball Edge Device.

Conclusion

This post walked you through how to order a Snowball Edge device with an AMI of your choice, compile a model for the edge using SageMaker, package that model using Edge Manager, and create and run components with artifacts to perform ML inference on Snowball Edge using the latest AWS IoT Greengrass. With Edge Manager, you can deploy and update your ML models on a fleet of Snowball Edge devices, and monitor performance at the edge with saved input and prediction data on Amazon S3. You can also run these components as long-running AWS Lambda functions that can spin up a model and wait for data to do inference.

You can also combine several features of AWS IoT Greengrass to create an MQTT client and use a pub/sub model to invoke other services or microservices. The possibilities are endless.

By running ML inference on Snowball Edge with Edge Manager and AWS IoT Greengrass, you can optimize, secure, monitor, and maintain ML models on fleets of Snowball Edge devices. Thanks for reading and please do not hesitate to leave questions or comments in the comments section.

To learn more, see the documentation for AWS Snow Family, AWS IoT Greengrass, and Amazon SageMaker Edge Manager.


About the Authors

Raj Kadiyala is an AI/ML Tech Business Development Manager in AWS WWPS Partner Organization. Raj has over 12 years of experience in Machine Learning and likes to spend his free time exploring machine learning for practical every day solutions and staying active in the great outdoors of Colorado.

Nida Beig is a Sr. Product Manager – Tech at Amazon Web Services where she works on the AWS Snow Family team. She is passionate about understanding customer needs, and using technology as a conductor of transformative thinking to deliver consumer products. Besides work, she enjoys traveling, hiking, and running.

Read More

Run your TensorFlow job on Amazon SageMaker with a PyCharm IDE

As more machine learning (ML) workloads go into production, many organizations must bring ML workloads to market quickly and increase productivity in the ML model development lifecycle. However, the ML model development lifecycle is significantly different from an application development lifecycle. This is due in part to the amount of experimentation required before finalizing a version of a model. Amazon SageMaker, a fully managed ML service, enables organizations to put ML ideas into production faster and improve data scientist productivity by up to 10 times. Your team can quickly and easily train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production-ready environments.

Amazon SageMaker Studio offers an integrated development environment (IDE) for ML. Developers can write code, track experiments, visualize data, and perform debugging and monitoring all within a single, integrated visual interface, which significantly boosts developer productivity. Within Studio, you can also use Studio notebooks, which are collaborative notebooks (the view is an extension of the JupyterLab interface). You can launch quickly because you don’t need to set up compute instances and file storage beforehand. SageMaker Studio provides persistent storage, which enables you to view and share notebooks even if the instances that the notebooks run on are shut down. For more details, see Use Amazon SageMaker Studio Notebooks.

Many data scientists and ML researchers prefer to use a local IDE such as PyCharm or Visual Studio Code for Python code development while still using SageMaker to train the model, tune hyperparameters with SageMaker hyperparameter tuning jobs, compare experiments, and deploy models to production-ready environments. In this post, we show how you can use SageMaker to manage your training jobs and experiments on AWS, using the Amazon SageMaker Python SDK with your local IDE. For this post, we use PyCharm for our IDE, but you can use your preferred IDE with no code changes.

The code used in this post is available on GitHub.

Prerequisites

To run training jobs on a SageMaker managed environment, you need the following:

  • An AWS account configured with the AWS Command Line Interface (AWS CLI) to have sufficient permissions to run SageMaker training jobs
  • Docker configured (SageMaker local mode) and the SageMaker Python SDK installed on your local computer
  • (Optional) Studio set up for experiment tracking and the Amazon SageMaker Experiments Python SDK

Setup

To get started, complete the following steps:

  1. Create a new user with programmatic access that enables an access key ID and secret access key for the AWS CLI.
  2. Attach the permissions AmazonSageMakerFullAccess and AmazonS3FullAccess.
  3. Limit the permissions to specific Amazon Simple Storage Service (Amazon S3) buckets if possible.
  4. Create a SageMaker execution role with the AmazonSageMakerFullAccess and AmazonS3FullAccess permissions. SageMaker uses this role to perform operations on your behalf on the AWS hardware that is managed by SageMaker.
  5. Install the AWS CLI on your local computer and run a quick configuration with aws configure:
$ aws configure
AWS Access Key ID [None]: AKIAI*********EXAMPLE
AWS Secret Access Key [None]: wJal********EXAMPLEKEY
Default region name [None]: eu-west-1
Default output format [None]: json

For more information, see Configuring the AWS CLI.

  6. Install Docker and your preferred local Python IDE. For this post, we use PyCharm.
  7. Make sure that you have all the required Python libraries to run your code locally.
  8. Add the SageMaker Python SDK to your local library. You can use pip install sagemaker or create a virtual environment with venv for your project then install SageMaker within the virtual environment. For more information, see Use Version 2.x of the SageMaker Python SDK.

Develop your ML algorithms on your local computer

Many data scientists use a local IDE for ML algorithm development, such as PyCharm. In this post, the algorithm Python script tf_code/tf_script.py is a simple file that uses TensorFlow Keras to create a feedforward neural network. You can run the Python script locally as you usually do.

Make your TensorFlow code SageMaker compatible

To make your code compatible for SageMaker, you must follow certain rules for reading input data and writing output model and other artifacts. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables. For more information, see SageMaker Toolkits Containers Structure.

The following code shows some important environment variables used by SageMaker for managing the infrastructure.

The following uses the input data location SM_CHANNEL_{channel_name}:

SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_CHANNEL_VALIDATION=/opt/ml/input/data/validation
SM_CHANNEL_TESTING=/opt/ml/input/data/testing

The following code uses the model output location to save the model artifact:

SM_MODEL_DIR=/opt/ml/model

The following code uses the output artifact location to write non-model training artifacts (such as evaluation results):

SM_OUTPUT_DATA_DIR=/opt/ml/output

You can pass these SageMaker environment variables as arguments so you can still run the training script outside of SageMaker:

# SageMaker default SM_MODEL_DIR=/opt/ml/model
if os.getenv("SM_MODEL_DIR") is None:
    os.environ["SM_MODEL_DIR"] = os.getcwd() + '/model'

# SageMaker default SM_OUTPUT_DATA_DIR=/opt/ml/output
if os.getenv("SM_OUTPUT_DATA_DIR") is None:
    os.environ["SM_OUTPUT_DATA_DIR"] = os.getcwd() + '/output'

# SageMaker default SM_CHANNEL_TRAINING=/opt/ml/input/data/training
if os.getenv("SM_CHANNEL_TRAINING") is None:
    os.environ["SM_CHANNEL_TRAINING"] = os.getcwd() + '/data'

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--output_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    # parse_known_args tolerates any additional arguments passed by SageMaker
    args, _ = parser.parse_known_args()

Test your ML algorithms on a local computer with the SageMaker SDK local mode

The SageMaker Python SDK supports local mode, which allows you to create estimators and deploy them to your local environment. This is a great way to test your deep learning scripts before running them in the SageMaker managed training or hosting environments. Local mode is supported for framework images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) and images you supply yourself. See the following code for ./sm_local.py:

sagemaker_role = 'arn:aws:iam::707*******22:role/RandomRoleNameHere'
sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

def sagemaker_estimator(sagemaker_role, code_entry, code_dir, hyperparameters):
    sm_estimator = TensorFlow(entry_point=code_entry,
                              source_dir=code_dir,
                              role=sagemaker_role,
                              instance_type='local',
                              instance_count=1,
                              model_dir='/opt/ml/model',
                              hyperparameters=hyperparameters,
                              output_path='file://{}/model/'.format(os.getcwd()),
                              framework_version='2.2',
                              py_version='py37',
                              script_mode=True)
    return sm_estimator
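
The following is a hedged usage sketch (the data path and hyperparameter values are illustrative) showing how the helper above can be called to train entirely on your machine, assuming the definitions from ./sm_local.py:

hyperparameters = {'epochs': 10, 'batch_size': 64}   # illustrative values

local_estimator = sagemaker_estimator(
    sagemaker_role=sagemaker_role,
    code_entry='tf_script.py',
    code_dir='./tf_code',
    hyperparameters=hyperparameters,
)

# 'file://' URIs keep the data local; the managed TensorFlow image runs in a
# Docker container on your machine instead of a SageMaker training instance.
local_estimator.fit({'training': 'file://./data'})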

With SageMaker local mode, the managed TensorFlow image is downloaded from the service account to your local computer and runs in Docker. This Docker image is the same one used in the SageMaker managed training and hosting environments, so you can debug your code locally and iterate faster.

The following diagram outlines how a Docker image runs in your local machine with SageMaker local mode.

 

The service account TensorFlow Docker image is now running on your local computer.

On newer versions of macOS, when you debug your code with SageMaker local mode, you might need to grant Docker Full Disk Access in System Preferences under Security & Privacy; otherwise, a PermissionError occurs.

Run your ML algorithms on an AWS managed environment with the SageMaker SDK

After you create the training job, SageMaker launches the ML compute instances and uses the training code and the training dataset to train the model. It saves the resulting model artifacts and other output in the S3 bucket you specified for that purpose.
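
As a rough sketch (the instance type, bucket name, and data locations are illustrative assumptions, and sagemaker_role is the execution role created during setup), the same training script can be sent to a managed training job by swapping the local settings for managed ones:

from sagemaker.tensorflow import TensorFlow

sm_estimator = TensorFlow(
    entry_point='tf_script.py',
    source_dir='./tf_code',
    role=sagemaker_role,                                  # execution role created during setup
    instance_type='ml.m5.xlarge',                         # managed instance instead of 'local'
    instance_count=1,
    hyperparameters={'epochs': 10, 'batch_size': 64},
    output_path='s3://your-bucket/tf-training/output/',   # hypothetical bucket
    framework_version='2.2',
    py_version='py37',
)

# Training data is read from Amazon S3, and the model artifact is written to output_path.
sm_estimator.fit({'training': 's3://your-bucket/tf-training/data/'})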

The following diagram outlines how a Docker image runs in an AWS managed environment.

On the SageMaker console, you can see that your training job has launched, together with all the related metadata, including metrics for model accuracy, input data location, output data configuration, and hyperparameters. This helps you manage and track all your SageMaker training jobs.

Deploy your trained ML model on a SageMaker endpoint for real-time inference

For this step, we use the ./sm_deploy.py script.

When your trained model seems satisfactory, you might want to test the real-time inference against an HTTPS endpoint, or with batch prediction. With the SageMaker SDK, you can easily set up the inference environment to test your inference code and assess model performance regarding accuracy, latency, and throughput.
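
The following is a minimal sketch of deploying to a real-time endpoint; the instance type and sample payload are illustrative, and sm_estimator is the trained estimator from the previous section:

predictor = sm_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
)

# Send a sample request to the HTTPS endpoint, then delete it to avoid ongoing charges.
result = predictor.predict({'instances': [[0.1, 0.2, 0.3]]})
print(result)
predictor.delete_endpoint()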

SageMaker provides model hosting services for model deployment, as shown in the following diagram. It provides an HTTPS endpoint where your ML model is available to perform inference.

The persistent endpoint deployed with SageMaker hosting services appears on the SageMaker console.

Organize, track, and compare your ML trainings with Amazon SageMaker Experiments

Finally, if you have lots of experiments with different preprocessing configurations, different hyperparameters, or even different ML algorithms to test, we suggest you use Amazon SageMaker Experiments to help you group and organize your ML iterations.

Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. Experiments is integrated with Studio, providing a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best-performing models.
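
The following is a minimal sketch (experiment, trial, and bucket names are illustrative, and sm_estimator is the estimator from earlier) of grouping training runs with Experiments; it assumes the sagemaker-experiments package is installed:

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

experiment = Experiment.create(
    experiment_name='tf-feedforward-experiments',     # hypothetical experiment name
    description='Compare hyperparameter settings',
)
trial = Trial.create(
    trial_name='epochs-10-batch-64',                  # hypothetical trial name
    experiment_name=experiment.experiment_name,
)

# Associate the training job with the trial so it appears in Studio's experiment views.
sm_estimator.fit(
    {'training': 's3://your-bucket/tf-training/data/'},
    experiment_config={
        'ExperimentName': experiment.experiment_name,
        'TrialName': trial.trial_name,
        'TrialComponentDisplayName': 'Training',
    },
)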

Conclusion

In this post, we showed how you can use SageMaker with your local IDE, such as PyCharm. With SageMaker, data scientists can take advantage of this fully managed service to build, train, and deploy ML models quickly, without having to worry about the underlying infrastructure needs.

To fully achieve operational excellence, your organization needs a well-architected ML workload solution, which includes versioning ML inputs and artifacts, tracking data and model lineage, automating ML deployment pipelines, continuously monitoring and measuring ML workloads, establishing a model retraining strategy, and more. For more information about SageMaker features, see the Amazon SageMaker Developer Guide.

SageMaker is generally available worldwide. For a list of the supported AWS Regions, see the AWS Region Table for all AWS global infrastructure.


About the Author

Yanwei Cui, PhD, is a Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial intelligence powered industrial applications in computer vision, natural language processing and online user behavior prediction. At AWS, he shares the domain expertise and helps customers to unlock business potentials, and to drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Read More

How Cortica used Amazon HealthLake to get deeper insights to improve patient care

This is a guest post by Ernesto DiMarino, who is Head of Enterprise Applications and Data at Cortica.

Cortica is on a mission to revolutionize healthcare for children with autism and other neurodevelopmental differences. Cortica was founded to fix the fragmented journey families typically navigate while seeking diagnoses and therapies for their children. To bring their vision to life, Cortica seamlessly blends neurology, research-based therapies, and technology into comprehensive care programs for the children they serve. This coordinated approach leads to best-in-class member satisfaction and empowers families to achieve long-lasting, transformative results.

In this post, we discuss how Cortica used Amazon HealthLake to create a data analytics hub to store a patient’s medical history, medication history, behavioral assessments, lab reports, and genetic variants in Fast Healthcare Interoperability Resource (FHIR) standard format. They create a composite view of the patient’s health journey and apply advance analytics to understand trends in patient progression with Cortica’s treatment approach.

Unifying our data

The challenges faced by Cortica’s team of three data engineers are no different than those of any other healthcare enterprise. Cortica has two EHRs (electronic health records), six specialties, 420 providers, and a few home-grown data capturing questionnaires, one of which has 842 questions. With multiple vendors providing systems and data solutions, Cortica finds itself in an all-too-common situation in the healthcare industry: volumes of data in multiple formats and complexity in matching patients from system to system. Cortica looked to solve some of this complexity by setting up a data lake on AWS.

Cortica’s team imported all data into an Amazon Simple Storage Service (Amazon S3) data lake using Python extract, transform, and load (ETL) jobs, orchestrated with Apache Airflow. Additionally, they maintain a Kimball-model star schema for financial and operational analytics. The data totals a respectable 16 terabytes. Most of the file formats delivered to the data lake are CSV, PDF, and Parquet, all of which the data lake is well equipped to manage. However, the data lake solution is only part of the story. To truly derive value from the data, Cortica needed a standardized model to deal with healthcare languages and vocabularies, as well as the many industry-standard code sets.

Deriving deeper value from data

Although the data lake and star schema data model work well for some financial and operational analytics, the Cortica team found that it was challenging to dive deeper into the data for meaningful insights to share with patients and their caregivers. Some of the questions they wanted to answer included:

  • How can Cortica present to caregivers a composite view of the patient’s healthcare journey with Cortica?
  • How can they show that patients are getting better over time using data from standardized assessments, medical notes, and goals tracking data?
  • How do patients with specific comorbidities progress to their goals compared to patients without comorbidities?
  • Can Cortica show how patients have better outcomes through the unique multispecialty approach?
  • Can Cortica partner with industry researchers sharing de-identified data to help further treatment for autism and other neurodevelopmental differences?

Before implementing the data lake, staff would read through PDFs, Excel, and vendor systems to create Excel files to capture the data points of interest. Interrogating the EHRs and manually transcribing documents and notes into a large spreadsheet for analysis would take months of work. This process wasn’t scalable and made it difficult to reproduce analytics and insights.

With the data lake, Cortica found that they still lacked the ability to quickly access the volumes of data, as well as join the various datasets together to make complex analysis. Because healthcare data is so driven by medical terminologies, they needed a solution that could help unify data from different healthcare fields to present a clear patient journey through the different specialties Cortica offers. To quickly derive this deeper value, they chose Amazon HealthLake to help provide this added layer of meaning to the data.

Cortica’s solution

Cortica adopted Amazon HealthLake to help standardize data and scale insights. By implementing the FHIR standard, Amazon HealthLake provided a faster path to standardized data with a far less complex maintenance burden. They were able to quickly load a basic set of resources into Amazon HealthLake. This allowed the team to create a proof of concept (POC) to start answering the bigger set of questions focused on their patient population. In a 3-day process, they developed a POC for understanding their patients’ journey from the perspective of their behavior therapy goals and medical comorbidities. Most of that time, roughly two days, was spent fine-tuning the queries in Amazon QuickSight and building visualizations of the data. From data to visuals, the data was ready in hours, not months. The following diagram illustrates their pipeline.

Getting to insights faster

Cortica was able to quickly see, across their patient population, the length of time it took for patients to attain their goals. The team could then break it down by age-phenotype (a designated age grouping for comparing Cortica’s population). They saw the grouping of patients that were meeting their goals at 4-, 6-, 9-, and 12-month intervals. They further sliced and diced the visuals by layering in a variety of categories such as goal status. Until now, staff and clinicians were only able to look at an individual’s data rather than population data, and couldn’t get these types of insights. The manual clinician chart abstraction process for this goal analysis would have taken months to complete.

The following charts show two visualizations of their goals.

As a fast follow to this POC, Cortica wanted to see how medical comorbidities impacted goal attainment. The specific medical comorbidities of interest were seizures, constipation, and sleep disturbances, because these are commonly found within this patient population. Data for the FHIR Condition resource was loaded into the pipeline, and the team was able to identify cohorts by comorbidities and quickly visualize the information. In a few minutes, they had visualizations running and could see the impact that these comorbidities had on goal attainment (see the following example diagram).

With Amazon HealthLake, the Cortica team can spend more time analyzing and understanding data patterns rather than figuring out where data comes from, formatting it, and joining it into a usable state. The value that Amazon brings to any healthcare organization is the ability to quickly move data, conform data, and start visualizing. With FHIR as the data model, a small non-technical team can request that an organization’s integration team provide a flat file feed of the FHIR resources of interest to an S3 bucket. This data is easily loaded into Amazon HealthLake data stores via the AWS Command Line Interface (AWS CLI), AWS Management Console, or API. Next, they can query the data with Amazon Athena to expose it to SQL-based tools and use QuickSight for visualization. Both clinical and non-technical teams can use this solution to start deriving value from data locked within medical records systems.

Conclusion

The tools available through AWS such as Amazon HealthLake, Amazon SageMaker, Athena, Amazon Comprehend Medical, and QuickSight are speeding up the ability to learn more about the patient population Cortica cares for in an actionable timeframe. Analysis that took months to complete can now be completed in days, and in some cases hours. AWS tools can enhance analysis by adding layers of richness to the data in minutes and provide different views of the same analysis. Furthermore, analysis that required chart abstraction can now be done through automated data pipelines, processing hundreds or thousands of documents to derive insights from notes, which were previously only available to a few clinicians.

Cortica is entering a new era of data analytics, one in which the data pipeline and process doesn’t require data engineers and technical staff. What is unknown can be learned from the data, ultimately bringing Cortica closer to its mission of revolutionizing the pediatric healthcare space and empowering families to achieve long-lasting, transformative results.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Ernesto DiMarino is Head of Enterprise Applications and Data at Cortica.

Satadal Bhattacharjee is Sr Manager, Product Management, who leads products at AWS Health AI. He works backwards from healthcare customers to help them make sense of their data by developing services such as Amazon HealthLake and Amazon Comprehend Medical.

Read More

Attendee matchmaking at virtual events with Amazon Personalize

Amazon Personalize enables developers to build applications with the same machine learning (ML) technology used by Amazon.com for real-time personalized recommendations—no ML expertise required. Amazon Personalize makes it easy for developers to build applications capable of delivering a wide array of personalization experiences, including specific product recommendations, personalized product re-ranking, and customized direct marketing. Besides applications in retail and ecommerce, other common use cases for Amazon Personalize include recommending videos, blog posts, or newsfeeds based on users’ activity history.

What if you wanted to recommend users of common interest to connect with each other? As the pandemic pushes many of our normal activities virtual, connecting with people is a greater challenge than ever before. This post discusses how 6Connex turned this challenge into an opportunity by harnessing Amazon Personalize to elevate their “user matchmaking” feature.

6Connex and event AI

6Connex is an enterprise virtual venue and hybrid events system. Their cloud-based product portfolio includes virtual environments, learning management, and webinars. Attendee experience is one of the most important metrics for their success.

Attendees have better experiences when they are engaged not only with the event’s content, organizers, and sponsors, but also when making connections with other attendees. Engagement metrics are measured and reported for each attendee activity on the platform, as well as feedback from post-event surveys. The goal is to make the events system more attendee-centric by not only providing personalized content and activity recommendations, but also making matchmaking suggestions for attendees based on similar interests and activity history. By adding event AI features to their platform, 6Connex fosters more meaningful connections between attendees, and keeps their attendees more engaged with a personalized event journey.

Implementation and solution architecture

6Connex built their matchmaking solution using the related items recipe (SIMS) of Amazon Personalize. The SIMS algorithm uses collaborative filtering to recommend items that are similar to a given item. The novelty of 6Connex’s approach lies in the reverse mapping of users and items: in this solution, event attendees are items in Amazon Personalize terms, and content, meeting rooms, and so on are users in Amazon Personalize terms.

When a platform user joins a meeting room or views a piece of content, an interaction is created. To increase the accuracy of interaction types (also known as event_type), you can add logic to count an interaction only when a user stays in a meeting room for at least a certain amount of time. This eliminates accidental clicks and cases when users join but quickly leave a room due to lack of interest.

As many users interact with the platform during a live event, interactions are streamed in real time from the platform via Amazon Kinesis Data Streams. AWS Lambda functions are used for data transformation before streaming data directly to Amazon Personalize through an event tracker. This mechanism also enables Amazon Personalize to adjust to changing user interest over time, allowing recommendations to adapt in real time.

After a model is trained in Amazon Personalize, a fully managed inference endpoint (campaign) is created to serve real-time recommendations for 6Connex’s platform. To answer the question “for each attendee, who are similar attendees?”, 6Connex’s client-side application queries the GetRecommendations API with the current user (represented as an itemId). The API response provides recommended connections because they have been identified as similar by Amazon Personalize.
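
The following is a minimal sketch of that query with the boto3 Personalize runtime client (the campaign ARN and attendee ID are illustrative); because attendees are modeled as items, the similar items returned for a given attendee are the recommended connections:

import boto3

personalize_runtime = boto3.client('personalize-runtime')

response = personalize_runtime.get_recommendations(
    campaignArn='arn:aws:personalize:us-east-1:123456789012:campaign/attendee-sims',  # hypothetical ARN
    itemId='attendee-42',    # the current attendee, represented as an item
    numResults=10,
)

# Each returned item is another attendee identified as similar by the SIMS model.
recommended_connections = [item['itemId'] for item in response['itemList']]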

Due to its deep learning capabilities, Amazon Personalize requires at least 1,000 interaction data points before training the model. At the start of a live event, there aren’t enough interactions, so a rules engine provides the initial recommendations until those 1,000 data points have been gathered. The following table shows the three main phases of an event where connection recommendations are generated.

Rule-based recommendations
  • Event tracker interaction events < 1,000
  • Use rule engine
  • Cache results for 10 minutes
Amazon Personalize real-time recommendations during live sessions
  • Initial data is loaded and the model is trained
  • Data is ingested in real time via Kinesis Data Streams
  • Regular training occurs across the day
  • Recommendation results are cached
Amazon Personalize batch recommendations for on-demand users
  • Main event live sessions are over, but the event is still open for a period of time
  • Daily batch recommendations are retrieved and loaded into Amazon DynamoDB

For a high-level example architecture, see the following diagram.

The following are the steps involved in the solution architecture:

  1. 6Connex web application calls the GetRecommendations API to retrieve recommended connections.
  2. A matchmaking Lambda function retrieves recommendations.
  3. Until the training threshold of 1,000 interaction data points is reached, the matchmaking function uses a simple rules engine to provide recommendations.
  4. Recommendations are generated from Amazon Personalize and stored in Amazon ElastiCache. The reason for caching recommendations is to improve response performance while reducing the number of queries on the Amazon Personalize API. When new recommendations are requested, or when the cache expires (expiration is set to every 15 minutes), recommendations are pulled from Amazon Personalize.
  5. New user interactions are ingested in real time via Kinesis Data Streams.
  6. A Lambda function consumes data from the data stream, performs data transformation, persists the transformed data to Amazon Simple Storage Service (Amazon S3) and related metadata to Amazon DynamoDB, and sends the records to Amazon Personalize via the PutEvents API (a minimal sketch of this call follows the list).
  7. AWS Step Functions orchestrates the process for creating solutions, training, retraining, and several other workflows. More details on the Step Functions workflow are in the next section.
  8. Amazon EventBridge schedules regular retraining events during the virtual events. We also use EventBridge to trigger batch recommendations after the virtual events are over and when the contents are served to end users on demand.
  9. Recommendations are stored in DynamoDB for use during the on-demand period and also for future analysis.
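
The following is a hedged sketch (the tracking ID, event type, and IDs are illustrative assumptions) of how the consumer function might forward one transformed interaction to the Amazon Personalize event tracker via PutEvents; recall that in this reversed mapping the room or content acts as the user and the attendee is the item:

import json
import time

import boto3

personalize_events = boto3.client('personalize-events')

def send_interaction(attendee_id, room_id):
    # Record that an attendee (the Personalize item) interacted with a meeting
    # room or piece of content (the Personalize user).
    personalize_events.put_events(
        trackingId='your-event-tracker-id',     # hypothetical event tracker ID
        userId=room_id,                         # room/content acts as the user
        sessionId=f'{room_id}-session',
        eventList=[{
            'eventType': 'room_join',           # illustrative event_type
            'properties': json.dumps({'itemId': attendee_id}),
            'sentAt': int(time.time()),
        }],
    )

send_interaction(attendee_id='attendee-42', room_id='keynote-room')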

Adoption of MLOps

It was crucial for 6Connex to quickly shift from a rules-based recommender engine to personalized recommendations using Amazon Personalize. To accelerate this shift and hydrate the interactions dataset, 6Connex infers interactions not only from content engagement, but also from other sources such as pre-event questionnaires. This is an important development that shortened the time until users start receiving ML-based recommendations.

More importantly, the adoption of Amazon Personalize MLOps enabled 6Connex to automate and accelerate the transition from rule-based recommendations to personalized recommendations using Amazon Personalize. After the minimum threshold for data is met, Step Functions loads data into Amazon Personalize and manages the training process.

The following diagram shows the MLOps pipeline for the initial loading of data, training solutions, and deploying campaigns.

6Connex created their MLOps solution based on the Amazon Personalize MLOps reference solution to automate this process. There are several Step Functions workflows that offload long-running processes such as loading batch recommendations into DynamoDB, retraining Amazon Personalize solutions, and cleaning up after an event is complete.

With Amazon Personalize and MLOps pipelines, 6Connex brought an AI solution to market in less than half the time it would have taken to develop and deploy their own ML infrastructure. Moreover, these solutions reduced the cost of acquiring data science and ML expertise. As a result, 6Connex realized a competitive advantage through AI-based personalized recommendations for each individual user.

Based on the success of this engagement, 6Connex plans to expand its usage of Amazon Personalize to provide content-based recommendations in the near future. 6Connex is looking forward to expanding the partnership not only in ML, but also in data analytics and business intelligence to serve the fast-growing hybrid event market.

Conclusion

With a well-designed MLOps pipeline and some creativity, 6Connex built a robust recommendation engine using Amazon Personalize in a short amount of time.

Do you have a use case for a recommendation engine but are short on time or ML expertise? You can get started with Amazon Personalize using the Developer Guide, as well as a myriad of hands-on resources such as the Amazon Personalize Samples GitHub repo.

If you have any questions on this matchmaking solution, please leave a comment!


About the Author

Shu Jackson is a Senior Solutions Architect with AWS. Shu works with startup customers helping them design and build solutions in the cloud, with a focus on AI/ML.

Luis Lopez Soria is a Sr AI/ML specialist solutions architect working with the Amazon Machine Learning team. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys playing sports, traveling around the world, and exploring new foods and cultures.

Read More