Get to production-grade data faster by using new built-in interfaces with Amazon SageMaker Ground Truth Plus

Launched at AWS re:Invent 2021, Amazon SageMaker Ground Truth Plus helps you create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow based on these requirements. From there, an expert workforce that is trained on a variety of machine learning (ML) tasks labels your data. You don’t even need deep ML expertise or knowledge of workflow design and quality management to use Ground Truth Plus.

Today, we are excited to announce the launch of new built-in interfaces on Ground Truth Plus. With this new capability, multiple Ground Truth Plus users can now create a new project and batch, share data, and receive data using the same AWS account through self-serve interfaces. This enables you to accelerate the development of high-quality training datasets by reducing project setup time. Additionally, you can control fine-grained access to your data by scoping your AWS Identity and Access Management (IAM) role permissions to match your individual level of Amazon Simple Storage Service (Amazon S3) access, and you always have the option to revoke access to certain buckets.

Until now, you had to reach out to your Ground Truth Plus operations program manager (OPM) to create new data labeling projects and batches. This process had some restrictions because it allowed only one user to request a new project and batch: if multiple users within the organization were using the same AWS account, only one of them could request a new data labeling project and batch on the Ground Truth Plus console. Additionally, the process created artificial delays in kicking off the labeling process because of multiple manual touchpoints and the troubleshooting required when issues arose. Separately, all projects used the same IAM role for accessing data. Therefore, to run projects and batches that needed access to different data sources, such as different Amazon S3 buckets, you had to rely on your Ground Truth Plus OPM to provide account-specific S3 policies, which you then had to apply to your S3 buckets manually. This entire operation was labor intensive and added operational overhead.

This post walks you through steps to create a new project and batch, share data, and receive data using the new self-serve interfaces to efficiently kickstart the labeling process. This post assumes that you are familiar with Ground Truth Plus. For more information, see Amazon SageMaker Ground Truth Plus – Create Training Datasets Without Code or In-house Resources.

Solution overview

We demonstrate how to do the following:

  • Update existing projects
  • Request a new project
  • Set up a project team
  • Create a batch

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account
  • An IAM user with access to create IAM roles
  • The Amazon S3 URI of the bucket where your labeling objects are stored

Update existing projects

If you created a Ground Truth Plus project before the launch of the new features described in this post (December 9, 2022), you need to create and share an IAM role so that you can use these features with your existing Ground Truth Plus project. If you’re a new user of Ground Truth Plus, you can skip this section.

To create an IAM role, complete the following steps:

  1. On the IAM console, choose Create role.
  2. Select Custom trust policy.
  3. Specify the following trust relationship for the role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "sagemaker-ground-truth-plus.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  4. Choose Next.
  5. Choose Create policy.
  6. On the JSON tab, specify the following policy. Update the Resource property by specifying two entries for each bucket: one with just the bucket ARN, and another with the bucket ARN followed by /*. For example, replace <your-input-s3-arn> with arn:aws:s3:::my-bucket/myprefix/ and <your-input-s3-arn>/* with arn:aws:s3:::my-bucket/myprefix/*.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:GetBucketLocation",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "<your-input-s3-arn>",
                    "<your-input-s3-arn>/*",
                    "<your-output-s3-arn>",
                    "<your-output-s3-arn>/*"
                ]
            }
        ]
    }

  7. Choose Next: Tags and Next: Review.
  8. Enter the name of the policy and an optional description.
  9. Choose Create policy.
  10. Close this tab and go back to the previous tab to create your role.

On the Add permissions tab, you should see the new policy you created (refresh the page if you don’t see it).

  11. Select the newly created policy and choose Next.
  12. Enter a name (for example, GTPlusExecutionRole) and optionally a description of the role.
  13. Choose Create role.
  14. Provide the role ARN to your Ground Truth Plus OPM, who will then update your existing project with this newly created role.
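
If you prefer to script this setup rather than use the IAM console, the following is a minimal boto3 sketch of the same steps. The role name, policy name, and bucket ARNs are placeholders to replace with your own values.

import json

import boto3

iam = boto3.client("iam")

# Trust policy that lets Ground Truth Plus assume the role (same as the console steps above)
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker-ground-truth-plus.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# S3 access policy; replace the ARNs with your input and output bucket ARNs
access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:GetBucketLocation", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-input-bucket", "arn:aws:s3:::my-input-bucket/*",
            "arn:aws:s3:::my-output-bucket", "arn:aws:s3:::my-output-bucket/*",
        ],
    }],
}

role = iam.create_role(
    RoleName="GTPlusExecutionRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Execution role for SageMaker Ground Truth Plus",
)
policy = iam.create_policy(
    PolicyName="GTPlusS3AccessPolicy",
    PolicyDocument=json.dumps(access_policy),
)
iam.attach_role_policy(RoleName="GTPlusExecutionRole", PolicyArn=policy["Policy"]["Arn"])

# Share this ARN with your Ground Truth Plus OPM
print(role["Role"]["Arn"])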

Request a new project

To request a new project, complete the following steps:

  1. On the Ground Truth Plus console, navigate to the Projects section.

This is where all your projects are listed.

  2. Choose Request project.

The Request project page is your opportunity to provide details that will help us schedule an initial consultation call and set up your project.

  3. In addition to specifying general information like the project name and description, you must specify the project’s task type and whether it contains personally identifiable information (PII).

To label your data, Ground Truth Plus needs temporary access to your raw data in an S3 bucket. When the labeling process is complete, Ground Truth Plus delivers the labeling output back to your S3 bucket. This access is granted through an IAM role. You can either create the role yourself on the IAM console (refer to the previous section for instructions) or use the built-in tool to create a new role.

  4. If you created your own role, choose Enter a custom IAM role ARN and enter your IAM role ARN, which is in the format arn:aws:iam::<YourAccountNumber>:role/<RoleName>.
  5. To use the built-in tool instead, choose Create a new role on the drop-down menu under IAM Role.
  6. Specify the bucket location of your labeling data. If you don’t know the location of your labeling data or if you don’t have any labeling data uploaded, select Any S3 bucket, which gives Ground Truth Plus access to all your account’s buckets.
  7. Choose Create to create the role.

Your IAM role will allow Ground Truth Plus, identified as sagemaker-ground-truth-plus.amazonaws.com in the role’s trust policy, to run the following actions on your S3 buckets:

[
    "s3:GetObject",
    "s3:PutObject",
    "s3:GetBucketLocation",
    "s3:ListBucket"
]
  8. Choose Request project to complete the request.

A Ground Truth Plus OPM will schedule an initial consultation call with you to discuss your data labeling project requirements and pricing.

Set up a project team

After you request a project, you need to create a project team to log in to your project portal. A project team gives members of your organization or team access to track projects, view metrics, and review labels. You can use the option Invite new members by email or Import members from existing Amazon Cognito user groups. In this post, we show how to import members from existing Amazon Cognito user groups to add users to your project team.

  1. On the Ground Truth Plus console, navigate to the Project team section.
  2. Choose Create project team.
  3. Choose Import members from existing Amazon Cognito user groups.
  4. Choose an Amazon Cognito user pool.

User pools require a domain and an existing user group.

  5. Choose an app client.

We recommend using a client generated by Amazon SageMaker.

  6. Choose a user group from your pool to import members.
  7. Choose Create project team.

You can add more team members after creating the project team by choosing Invite new members on the Members page of the Ground Truth Plus console.
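
If you don’t yet have an Amazon Cognito user group to import, the following is a minimal boto3 sketch that creates a group in an existing user pool and adds a user to it. The user pool ID, group name, and user are hypothetical placeholders; as noted above, the pool also needs a domain configured.

import boto3

cognito = boto3.client("cognito-idp")

USER_POOL_ID = "us-east-1_examplepool"   # placeholder: your user pool ID
GROUP_NAME = "gt-plus-project-team"      # placeholder: your group name

# Create a user group in the existing user pool
cognito.create_group(
    UserPoolId=USER_POOL_ID,
    GroupName=GROUP_NAME,
    Description="Ground Truth Plus project team",
)

# Create a user (skip this call if the user already exists) and add them to the group
cognito.admin_create_user(
    UserPoolId=USER_POOL_ID,
    Username="jane.doe@example.com",
    UserAttributes=[{"Name": "email", "Value": "jane.doe@example.com"}],
)
cognito.admin_add_user_to_group(
    UserPoolId=USER_POOL_ID,
    Username="jane.doe@example.com",
    GroupName=GROUP_NAME,
)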

Create a batch

After you have successfully submitted the project request and created a project team, you can access the Ground Truth Plus project portal by choosing Open project portal on the Ground Truth Plus console.

You can use the project portal to create batches for a project, but only after the project’s status has changed to Request approved.

  1. View a project’s details and batches by choosing the project name.
    A page titled with the project name opens.
  2. In the Batches section, choose Create batch.
  3. Enter a batch name and optional description.
  4. Enter the S3 locations of the input and output datasets.

To ensure the batch is created successfully, you must meet the following requirements (a validation sketch follows these steps):

    • The S3 bucket and prefix should exist, and the total number of files should be greater than 0
    • The total number of objects should be less than 10,000
    • The size of each object should be less than 2 GB
    • The total size of all objects combined should be less than 100 GB
    • The IAM role provided to create the project must have permission to access the input bucket, output bucket, and the S3 files used to create the batch
    • The files under the provided S3 location for the input datasets must not be encrypted with AWS Key Management Service (AWS KMS)
  5. Choose Submit.
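
Before you choose Submit, you can optionally check most of these requirements with a short script. The following is a minimal boto3 sketch based on the limits listed above; the bucket and prefix are placeholders, and it only validates the S3-side conditions (object count, sizes, and KMS encryption), not the IAM permissions.

import boto3

s3 = boto3.client("s3")

BUCKET = "my-input-bucket"     # placeholder: your input bucket
PREFIX = "labeling-data/"      # placeholder: your input prefix

MAX_OBJECTS = 10_000
MAX_OBJECT_SIZE = 2 * 1024**3    # 2 GB
MAX_TOTAL_SIZE = 100 * 1024**3   # 100 GB

keys, total_size = [], 0
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])
        total_size += obj["Size"]
        assert obj["Size"] < MAX_OBJECT_SIZE, f"{obj['Key']} is 2 GB or larger"

assert 0 < len(keys) < MAX_OBJECTS, "object count must be between 1 and 9,999"
assert total_size < MAX_TOTAL_SIZE, "total size must be less than 100 GB"

# None of the input objects should be encrypted with AWS KMS
for key in keys:
    head = s3.head_object(Bucket=BUCKET, Key=key)
    assert head.get("ServerSideEncryption") != "aws:kms", f"{key} is KMS-encrypted"

print(f"{len(keys)} objects, {total_size / 1024**3:.1f} GiB total: requirements met")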

Your batch status will show as Request submitted. After Ground Truth Plus has temporary access to your data, AWS experts set up the data labeling workflows and operate them on your behalf, which changes the batch status to In-progress. When labeling is complete, the batch status changes from In-progress to Ready for review. If you want to review your labels before accepting them, choose Review batch. From there, you can choose Accept batch to receive your labeled data.

Conclusion

This post showed you how multiple Ground Truth Plus users can now create a new project and batch, share data, and receive data using the same AWS account through new self-serve interfaces. This new capability allows you to kickstart your labeling projects faster and reduces operational overhead. We also demonstrated how you can control fine-grained access to data by scoping your IAM role permissions to match your individual level of access.

We encourage you to try out this new functionality, and connect with the Machine Learning & AI community if you have any questions or feedback!


About the authors

Manish Goel is the Product Manager for Amazon SageMaker Ground Truth Plus. He is focused on building products that make it easier for customers to adopt machine learning. In his spare time, he enjoys road trips and reading books.

Karthik Ganduri is a Software Development Engineer at Amazon AWS, where he works on building ML tools for customers and internal solutions. Outside of work, he enjoys clicking pictures.  

Zhuling Bai is a Software Development Engineer at Amazon AWS. She works on developing large scale distributed systems to solve machine learning problems.

Aatef Baransy is a Frontend engineer at Amazon AWS. He writes fast, reliable, and thoroughly tested software to nurture and grow the industry’s most cutting-edge AI applications.

Mohammad Adnan is a Senior Engineer for AI and ML at AWS. He was part of many AWS service launches, notably Amazon Lookout for Metrics and AWS Panorama. Currently, he is focusing on AWS human-in-the-loop offerings (Amazon SageMaker Ground Truth, Ground Truth Plus, and Amazon Augmented AI). He is a clean code advocate and a subject-matter expert on serverless and event-driven architecture. You can follow him on LinkedIn, mohammad-adnan-6a99a829.

Announcing the updated Salesforce connector (V2) for Amazon Kendra

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.

Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to pull together data across several structured and unstructured repositories to index and search on.

One such data repository is Salesforce. Salesforce is a comprehensive CRM tool for managing support, sales, and marketing teams. It’s an intelligent, proactive, AI-powered platform that empowers employees with the information they need to make the best decisions for every customer. It’s the backbone of the world’s most customer-centered organizations and helps companies put the customer at the center of everything they do.

We’re excited to announce that we have updated the Salesforce connector for Amazon Kendra to add even more capabilities. In this version (V2), we have added support for Salesforce Lightning in addition to Classic. You can now choose to crawl attachments and also bring in identity/ACL information to make your searches more granular. We now support 20 standard entities, and you can choose to index more fields.

You can import the following entities (and attachments for those marked with *):

  • Accounts*
  • Campaign*
  • Partner
  • Pricebook
  • Case*
  • Contact*
  • Contract*
  • Document
  • Group
  • Idea
  • Lead*
  • Opportunity*
  • Product
  • Profile
  • Solution*
  • Task*
  • User*
  • Chatter*
  • Knowledge Articles
  • Custom Objects*

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index a Salesforce repository or folder using the Amazon Kendra connector for Salesforce. The solution consists of the following steps:

  1. Create and configure an app on Salesforce and get the connection details.
  2. Create a Salesforce data source via the Amazon Kendra console.
  3. Index the data in the Salesforce repository.
  4. Run a sample query to get the information.
  5. Filter the query by users or groups.

Prerequisites

To try out the Amazon Kendra connector for Salesforce, you need the following:

Configure a Salesforce app and gather connection details

Before we set up the Salesforce data source, we need a few details about your Salesforce repository. Let’s gather those in advance (refer to Authorization Through Connected Apps and OAuth 2.0 for more details).

  1. Go to https://login.salesforce.com/ and log in with your credentials.

  2. In the navigation pane, choose Setup Home.
  3. Under Apps, choose App Manager.

    This refreshes the right pane.

  4. Choose New Connected App.

  5. Select Enable OAuth Settings to expand the API (Enable OAuth Settings) section.
  6. For Callback URL, enter https://login.salesforce.com/services/oauth2/token.
  7. For Selected OAuth Scopes, choose eclair_api and choose the right arrow icon.
  8. Select Introspect All Tokens.

  9. Choose Save. A warning appears that says “Changes can take up to 10 minutes to take effect.”
  10. Choose Continue to acknowledge.
  11. On the confirmation page, choose Manage Consumer Details.

  12. Copy and save the values for Consumer Key and Consumer Secret to use later when setting up your Amazon Kendra data source.

    Next, we generate a security token.

  13. On the home page, choose the View Profile icon and choose Settings.

  14. In the navigation pane, expand My Personal Information and choose Reset My Security Token.

    The security token is sent to the email you used when configuring the app. The following screenshot shows an example email.

  15. Save the security token to use when you configure the Salesforce connector to Amazon Kendra.
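
Before moving on to the Amazon Kendra console, you can optionally confirm that these values work by requesting an OAuth token directly from Salesforce. The following is a minimal sketch that assumes the OAuth 2.0 username-password flow is allowed for your connected app; all credential values are placeholders.

import requests

TOKEN_URL = "https://login.salesforce.com/services/oauth2/token"

payload = {
    "grant_type": "password",
    "client_id": "<CONSUMER_KEY>",          # from Manage Consumer Details
    "client_secret": "<CONSUMER_SECRET>",   # from Manage Consumer Details
    "username": "<SALESFORCE_USERNAME>",
    # Salesforce expects the security token appended to the password for this flow
    "password": "<PASSWORD>" + "<SECURITY_TOKEN>",
}

response = requests.post(TOKEN_URL, data=payload)
response.raise_for_status()
print(response.json()["access_token"][:12] + "...")  # token retrieved successfully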

Configure the Amazon Kendra connector for Salesforce

To configure the Amazon Kendra connector, complete the following steps:

  1. On the Amazon Kendra console, choose Create an Index.

  2. For Index name, enter a name for the index (for example, my-salesforce-index).
  3. Enter an optional description.
  4. Choose Create a new role.
  5. For Role name, enter an IAM role name.
  6. Configure optional encryption settings and tags.
  7. Choose Next.

  8. In the Configure user access control section, leave the settings at their defaults and choose Next.

  9. Select Developer edition and choose Create.

    This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.

  10. Return to the Amazon Kendra console and choose Data sources in the navigation pane.

  11. Scroll down and locate Salesforce Online connector V2.0, and choose Add connector.

  12. For Data source name, enter a name (for example, my-salesforce-datasourcev2).
  13. Enter an optional description.
  14. Choose Next.

  15. For Salesforce URL, enter the URL at the top of the browser when you log in to Salesforce.
  16. For Configure VPC and security group, leave the default (No VPC).
  17. Keep Identity crawler is on selected. This imports identity/ACL information into the index.
  18. For IAM role, choose Create a new role.
  19. Enter a role name, such as AmazonKendra-salesforce-datasourcev2.
  20. Choose Next.

  21. In the Authentication section, choose Create and add new secret.

  22. Enter the details you gathered while setting up the Salesforce app:
    1. Secret name – The name you gave your secret.
    2. Username – The user name you use to log in to Salesforce.
    3. Password – The password you use to log in to Salesforce.
    4. Security token – The security token you received in your email while going through the setup in Salesforce.
    5. Consumer key – The key generated while going through the setup in Salesforce.
    6. Consumer secret – The secret generated while going through the setup in Salesforce.
    7. Authentication URL – Enter https://login.salesforce.com/services/oauth2/token.
  23. Choose Save.

    The next page is prefilled with the name of the secret.

  24. Choose Next.

  25. Select All standard objects and Include all attachments.
  26. For Sync run schedule, choose Run on demand.
  27. Choose Next.

  28. Keep all the defaults in the Field Mappings section and choose Next.
  29. On the review page, choose Add data source.

  30. Choose Sync now.

This indexes all the content in Salesforce as per your configuration. You will see a success message at the top of the page and also in the sync history.

Test the solution

Now that you have ingested the content from your Salesforce account into your Amazon Kendra index, you can test some queries.

  1. Go to your index and choose Search indexed content in the navigation pane.
  2. Enter a search term and press Enter.

    One of the features of the data source is that it brings in the ACL information along with the contents of Salesforce. You can use this to narrow down your queries by users or groups.

  3. Return to the search page and expand Test query with user name or groups. Choose Apply user name or groups.

  4. For Username, enter your user name and choose Apply.

    A message appears saying Attributes applied.

  5. Enter a new test query and press Enter.

Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from your Salesforce account.
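
You can also run the same kind of ACL-filtered query programmatically. The following is a minimal boto3 sketch; the index ID, query text, and user name are placeholders, and it assumes the crawled principal is the user’s email address.

import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<YOUR_INDEX_ID>",
    QueryText="what is the status of the Acme opportunity",
    # Filter results to documents this user is allowed to see, based on the crawled ACLs
    UserContext={"UserId": "jane.doe@example.com"},
)

for item in response["ResultItems"]:
    print(item["Type"], "-", item["DocumentTitle"]["Text"])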

Conclusion

With the Salesforce connector for Amazon Kendra, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Kendra.

In this post, we introduced you to the basics, but there are many additional features that we didn’t cover. For example:

  • You can enable user-based access control for your Amazon Kendra index and restrict access to users and groups that you configure
  • You can map additional fields to Amazon Kendra index attributes and enable them for faceting, search, and display in the search results
  • You can integrate the Salesforce data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide.


About the author

Ashish Lagwankar is a Senior Enterprise Solutions Architect at AWS. His core interests include AI/ML, serverless, and container technologies. Ashish is based in the Boston, MA, area and enjoys reading, outdoors, and spending time with his family.

Speed ML development using SageMaker Feature Store and Apache Iceberg offline store compaction

Today, companies are establishing feature stores to provide a central repository to scale ML development across business units and data science teams. As feature data grows in size and complexity, data scientists need to be able to efficiently query these feature stores to extract datasets for experimentation, model training, and batch scoring.

Amazon SageMaker Feature Store is a purpose-built feature management solution that helps data scientists and ML engineers securely store, discover, and share curated data used in training and prediction workflows. SageMaker Feature Store now supports Apache Iceberg as a table format for storing features. This accelerates model development by enabling faster query performance when extracting ML training datasets, taking advantage of Iceberg table compaction. Depending on the design of your feature groups and their scale, you can experience training query performance improvements of 10x to 100x by using this new capability.

By the end of this post, you will know how to create feature groups using the Iceberg format, execute Iceberg’s table management procedures using Amazon Athena, and schedule these tasks to run autonomously. If you are a Spark user, you’ll also learn how to execute the same procedures using Spark and incorporate them into your own Spark environment and automation.

SageMaker Feature Store and Apache Iceberg

Amazon SageMaker Feature Store is a centralized store for features and associated metadata, allowing features to be easily discovered and reused by data scientist teams working on different projects or ML models.

SageMaker Feature Store consists of an online and an offline mode for managing features. The online store is used for low-latency real-time inference use cases. The offline store is primarily used for batch predictions and model training. The offline store is an append-only store and can be used to store and access historical feature data. With the offline store, users can store and serve features for exploration and batch scoring and extract point-in-time correct datasets for model training.

The offline store data is stored in an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account. SageMaker Feature Store automatically builds an AWS Glue Data Catalog during feature group creation. Customers can also access offline store data using a Spark runtime and perform big data processing for ML feature analysis and feature engineering use cases.

Table formats provide a way to abstract data files as a table. Over the years, many table formats have emerged to support ACID transaction, governance, and catalog use cases. Apache Iceberg is an open table format for very large analytic datasets. It manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg tracks individual data files in a table instead of in directories. This allows writers to create data files in place (files are not moved or changed) and only add files to the table in an explicit commit. The table state is maintained in metadata files. All changes to the table state create a new metadata file version that atomically replaces the older metadata. The table metadata file tracks the table schema, partitioning configuration, and other properties.

Iceberg has integrations with AWS services. For example, you can use the AWS Glue Data Catalog as the metastore for Iceberg tables, and Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore.

With SageMaker Feature Store, you can now create feature groups with Iceberg table format as an alternative to the default standard Glue format. With that, customers can leverage the new table format to use Iceberg’s file compaction and data pruning features to meet their use case and optimization requirements. Iceberg also lets customers perform deletion, time-travel queries, high-concurrency transactions, and higher-performance queries.

By combining Iceberg as a table format and table maintenance operations such as compaction, customers get faster query performance when working with offline feature groups at scale, letting them more quickly build ML training datasets.

The following diagram shows the structure of the offline store using Iceberg as a table format.

In the next sections, you will learn how to create feature groups using the Iceberg format, execute Iceberg’s table management procedures using AWS Athena and use AWS services to schedule these tasks to run on-demand or on a schedule. If you are a Spark user, you will also learn how to execute the same procedures using Spark.

For step-by-step instructions, we also provide a sample notebook, which can be found in GitHub. In this post, we will highlight the most important parts.

Creating feature groups using Iceberg table format

You first need to select Iceberg as a table format when creating new feature groups. A new optional parameter TableFormat can be set either interactively using Amazon SageMaker Studio or through code using the API or the SDK. This parameter accepts the values ICEBERG or GLUE (for the current AWS Glue format). The following code snippet shows you how to create a feature group using the Iceberg format and FeatureGroup.create API of the SageMaker SDK.

orders_feature_group_iceberg.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True,
    table_format=TableFormatEnum.ICEBERG
)

The table will be created and registered automatically in the AWS Glue Data Catalog.

Now that the orders_feature_group_iceberg is created, you can ingest features using your ingestion pipeline of choice. In this example, we ingest records using the FeatureGroup.ingest() API, which ingests records from a Pandas DataFrame. You can also use the FeatureGroup().put_record API to ingest individual records or to handle streaming sources. Spark users can also ingest Spark dataframes using our Spark Connector.

orders_fg = FeatureGroup(name=orders_feature_group_iceberg_name,
                         sagemaker_session=feature_store_session)
orders_fg.ingest(data_frame=order_data, wait=True)

You can verify that the records have been ingested successfully by running a query against the offline feature store. You can also navigate to the S3 location and see the new folder structure.
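
For example, you can run a quick count query against the offline store using the Feature Store SDK’s built-in Athena helper. This is a minimal sketch: the Athena results location is an assumption you can change, and your default Athena workgroup must run engine version 3 to query Iceberg tables (discussed in the next section). Also note that replication to the offline store can take a few minutes after ingestion.

# Count the records replicated to the offline store
query = orders_fg.athena_query()
query.run(
    query_string=f'SELECT COUNT(*) AS record_count FROM "{query.table_name}"',
    output_location=f"s3://{s3_bucket_name}/athena-query-results/",
)
query.wait()
print(query.as_dataframe())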

Executing Iceberg table management procedures

Amazon Athena is a serverless SQL query engine that natively supports Iceberg management procedures. In this section, you will use Athena to manually compact the offline feature group you created. Note you will need to use Athena engine version 3. For this, you can create a new workgroup, or configure an existing workgroup, and select the recommended Athena engine version 3. For more information and instructions for changing your Athena engine version, refer to Changing Athena engine versions.

As data accumulates into an Iceberg table, queries may gradually become less efficient because of the increased processing time required to open additional files. Compaction optimizes the structural layout of the table without altering table content.

To perform compaction, you use the OPTIMIZE table REWRITE DATA compaction table maintenance command in Athena. The following syntax shows how to optimize the data layout of a feature group stored using the Iceberg table format. The sagemaker_featurestore represents the name of the SageMaker Feature Store database, and orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334 is our feature group table name.

OPTIMIZE sagemaker_featurestore.orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334 REWRITE DATA USING BIN_PACK

After running the optimize command, you use the VACUUM procedure, which performs snapshot expiration and removes orphan files. These actions reduce metadata size and remove files that are not in the current table state and are also older than the retention period specified for the table.

VACUUM sagemaker_featurestore.orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334

Note that table properties are configurable using Athena’s ALTER TABLE. For an example of how to do this, see the Athena documentation. For VACUUM, vacuum_min_snapshots_to_keep and vacuum_max_snapshot_age_seconds can be used to configure snapshot pruning parameters.

Let’s have a look at the performance impact of running compaction on a sample feature group table. For testing purposes, we ingested the same orders feature records into two feature groups, orders-feature-group-iceberg-pre-comp-02-11-03-06-1669979003 and orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334, using a parallelized SageMaker processing job with Scikit-Learn, which results in 49,908,135 objects stored in Amazon S3 and a total size of 106.5 GiB.

We run a query to select the latest snapshot without duplicates and without deleted records on the feature group orders-feature-group-iceberg-pre-comp-02-11-03-06-1669979003. Prior to compaction, the query took 1hr 27mins.

We then run compaction on orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334 using the Athena OPTIMIZE query, which compacted the feature group table to 109,851 objects in Amazon S3 and a total size of 2.5 GiB. If we then run the same query after compaction, its runtime decreased to 1min 13sec.

With Iceberg file compaction, the query execution time improved significantly. For the same query, the run time decreased from 1h 27mins to 1min 13sec, which is 71 times faster.

Scheduling Iceberg compaction with AWS services

In this section, you will learn how to automate the table management procedures to compact your offline feature store. The following diagram illustrates the architecture for creating feature groups in Iceberg table format and a fully automated table management solution, which includes file compaction and cleanup operations.

At a high level, you create a feature group using the Iceberg table format and ingest records into the online feature store. Feature values are automatically replicated from the online store to the historical offline store. Athena is used to run the Iceberg management procedures. To schedule the procedures, you set up an AWS Glue job using a Python shell script and create an AWS Glue job schedule.

AWS Glue Job setup

You use an AWS Glue job to execute the Iceberg table maintenance operations on a schedule. First, you need to create an IAM role for AWS Glue to have permissions to access Amazon Athena, Amazon S3, and CloudWatch.

Next, you need to create a Python script to run the Iceberg procedures. You can find the sample script in GitHub. The script will execute the OPTIMIZE query using boto3.

optimize_sql = f"optimize {database}.{table} rewrite data using bin_pack"

The script has been parametrized using the AWS Glue getResolvedOptions(args, options) utility function, which gives you access to the arguments that are passed to your script when you run a job. In this example, the AWS Region, the Iceberg database and table for your feature group, the Athena workgroup, and the Athena query results output location can be passed as parameters to the job, making this script reusable in your environment.
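
The following is a condensed sketch of such a script, showing how the parameters could be read with getResolvedOptions and how the OPTIMIZE query is submitted to Athena with boto3. The parameter names here are illustrative; refer to the sample script in GitHub for the exact version.

import sys
import time

import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(
    sys.argv, ["region", "database", "table", "workgroup", "output_location"]
)

athena = boto3.client("athena", region_name=args["region"])
optimize_sql = f"optimize {args['database']}.{args['table']} rewrite data using bin_pack"

execution = athena.start_query_execution(
    QueryString=optimize_sql,
    WorkGroup=args["workgroup"],
    ResultConfiguration={"OutputLocation": args["output_location"]},
)

# Poll until the query finishes so that the Glue job status reflects the outcome
while True:
    state = athena.get_query_execution(QueryExecutionId=execution["QueryExecutionId"])[
        "QueryExecution"
    ]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(10)

print(f"Compaction query finished with state {state}")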

Finally, you create the actual AWS Glue job to run the script as a shell in AWS Glue.

  • Navigate to the AWS Glue console.
  • Choose the Jobs tab under AWS Glue Studio.
  • Select Python Shell script editor.
  • Choose Upload and edit an existing script, then choose Create.
  • On the Job details tab, configure the AWS Glue job: select the IAM role you created earlier, and select Python 3.9 or the latest available Python version.
  • On the same tab, you can also define a number of other configuration options, such as Number of retries or Job timeout. In Advanced properties, you can add job parameters to execute the script, as shown in the example screenshot below.
  • Choose Save.

In the Schedules tab, you can define the schedule to run the feature store maintenance procedures. For example, the following screenshot shows you how to run the job on a schedule of every 6 hours.
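
You can also create the schedule programmatically with a scheduled AWS Glue trigger. The following is a minimal boto3 sketch that runs a job named iceberg-compaction-job (a placeholder) every 6 hours.

import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="iceberg-compaction-every-6-hours",
    Type="SCHEDULED",
    Schedule="cron(0 */6 * * ? *)",          # every 6 hours
    Actions=[{"JobName": "iceberg-compaction-job"}],
    StartOnCreation=True,
)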

You can monitor job runs to understand runtime metrics such as completion status, duration, and start time. You can also check the CloudWatch Logs for the AWS Glue job to check that the procedures run successfully.

Executing Iceberg table management tasks with Spark

Customers can also use Spark to manage the compaction jobs and maintenance methods. For more detail on the Spark procedures, see the Spark documentation.

You first need to configure some of the common properties.

%%configure -f
{
  "conf": {
    "spark.sql.catalog.smfs": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.smfs.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.smfs.warehouse": "<YOUR_ICEBERG_DATA_S3_LOCATION>",
    "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.smfs.glue.skip-name-validation": "true"
  }
}

The following code can be used to optimize the feature groups via Spark.

spark.sql(f"""CALL smfs.system.rewrite_data_files(table => '{DATABASE}.`{ICEBERG_TABLE}`')""")

You can then execute the next two table maintenance procedures to remove older snapshots and orphan files that are no longer needed.

spark.sql(f"""CALL smfs.system.expire_snapshots(table => '{DATABASE}.`{ICEBERG_TABLE}`', older_than => TIMESTAMP '{one_day_ago}', retain_last => 1)""")
spark.sql(f"""CALL smfs.system.remove_orphan_files(table => '{DATABASE}.`{ICEBERG_TABLE}`')""")

You can then incorporate the above Spark commands into your Spark environment. For example, you can create a job that performs the optimization above on a desired schedule or in a pipeline after ingestion.

To explore the complete code example, and try it out in your own account, see the GitHub repo.

Conclusion

SageMaker Feature Store provides a purpose-built feature management solution to help organizations scale ML development across data science teams. In this post, we explained how you can leverage Apache Iceberg as a table format and table maintenance operations such as compaction to benefit from significantly faster queries when working with offline feature groups at scale and, as a result, build training datasets faster. Give it a try, and let us know what you think in the comments.


About the authors

Arnaud Lauer is a Senior Partner Solutions Architect in the Public Sector team at AWS. He enables partners and customers to understand how best to use AWS technologies to translate business needs into solutions. He brings more than 17 years of experience in delivering and architecting digital transformation projects across a range of industries, including public sector, energy, and consumer goods. Arnaud holds 12 AWS certifications, including the ML Specialty Certification.

Ioan Catana is an Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He helps customers develop and scale their ML solutions in the AWS Cloud. Ioan has over 20 years of experience mostly in software architecture design and cloud engineering.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

Brandon Chatham is a software engineer with the SageMaker Feature Store team. He’s deeply passionate about building elegant systems that bring big data and machine learning to people’s fingertips.

Announcing the updated ServiceNow connector (V2) for Amazon Kendra

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.

Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to pull together data across several structured and unstructured repositories to index and search on.

One such data repository is ServiceNow. As the foundation for all digital workflows, the ServiceNow Platform® connects people, functions, and systems across your organization. As data accumulates over time, a lot of critical information is stored in service catalogs, knowledge articles, and incidents including attachments for each entry.

We’re excited to announce that we have updated the ServiceNow connector for Amazon Kendra to add even more capabilities. In this version (V2), you can now crawl knowledge articles, service catalog documents, and incidents, and also bring in identity/ACL information to make your searches more granular. The connector also supports ServiceNow versions of Tokyo, Rome, San Diego, and others, and two sync modes: Full Sync mode, which does forced full syncs, and New, Modified, and Deleted mode, which does incremental syncs.

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to index and search across your document repository. For our solution, we demonstrate how to index a ServiceNow repository using the Amazon Kendra connector for ServiceNow. The solution consists of the following steps:

  1. Configure an app on ServiceNow and get the connection details.
  2. Store the details in AWS Secrets Manager.
  3. Create a ServiceNow data source via the Amazon Kendra console.
  4. Index the data in the ServiceNow repository.
  5. Run a sample query to get the information.

Prerequisites

To try out the Amazon Kendra connector for ServiceNow, you need the following:

Configure a ServiceNow app and gather connection details

Before we set up the ServiceNow data source, we need a few details about your ServiceNow repository. Let’s gather those in advance.

  1. Go to https://developer.servicenow.com/.
  2. Sign in with your credentials.
  3. Create a ServiceNow instance by choosing Start Building.
  4. If you’re currently logged in as the App Engine Studio Creator role, choose Change User Role.
  5. Select Admin and choose Change User Role.
  6. Choose Manage Instance Password, and log in to the instance URL using the admin user name and password provided.
  7. Save the displayed instance name, URL, user name, and password for later use.
  8. Log in to the instance using the admin URL and credentials from the previous step.
  9. Choose All and search for Application Registry.
  10. Choose New to create new OAuth credentials.
  11. Choose Create an OAuth API endpoint for external clients.
  12. For Name, enter myKendraConnector and leave the other fields blank. The myKendraConnector OAuth is now created.
  13. Copy and store the client ID and client secret to use when configuring the connector in a later step.

The session token is valid for up to 30 minutes. You have to generate a new session token each time you index the content, or you can configure Access Token Lifespan with a longer time.

Store ServiceNow credentials in Secrets Manager

To store your ServiceNow credentials in Secrets Manager, complete the following steps (a programmatic alternative is sketched after the list):

  1. On the Secrets Manager console, choose Store a new secret.
  2. Choose Other type of secret.
  3. Create six key-value pairs for hostUrl, clientId, clientSecret, userName, password, and authType, and enter the values saved from ServiceNow.
  4. Choose Save.
  5. For Secret name, enter a name (for example, AmazonKendra-ServiceNow-secret).
  6. Enter an optional description.
  7. Choose Next.
  8. In the Configure rotation section, keep all settings at their defaults and choose Next.
  9. On the Review page, choose Store.
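
As an alternative to the console, the following is a minimal boto3 sketch that stores the same six key-value pairs as a secret. The values shown are placeholders, and the authType value is an assumption you should match to what the connector expects.

import json

import boto3

secrets = boto3.client("secretsmanager")

secrets.create_secret(
    Name="AmazonKendra-ServiceNow-secret",
    Description="ServiceNow credentials for the Amazon Kendra connector",
    SecretString=json.dumps({
        "hostUrl": "xxxxx.service-now.com",
        "clientId": "<CLIENT_ID>",
        "clientSecret": "<CLIENT_SECRET>",
        "userName": "<ADMIN_USER>",
        "password": "<ADMIN_PASSWORD>",
        "authType": "OAuth2",              # assumption; use the value the connector expects
    }),
)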

Configure the Amazon Kendra connector for ServiceNow

To configure the Amazon Kendra connector, complete the following steps:

  1. On the Amazon Kendra console, choose Create an Index.
  2. For Index name, enter a name for the index (for example, my-ServiceNow-index).
  3. Enter an optional description.
  4. For Role name, enter an IAM role name.
  5. Configure optional encryption settings and tags.
  6. Choose Next.
  7. In the Configure user access control section, leave the settings at their defaults and choose Next.
  8. For Provisioning editions, select Developer edition.
  9. Choose Create. This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
  10. Choose Data sources in the navigation pane.
  11. Under ServiceNow Index, choose Add connector.
  12. For Data source name, enter a name (for example, my-ServiceNow-connector).
  13. Enter an optional description.
  14. Choose Next.
  15. For ServiceNow host, enter xxxxx.service-now.com (the instance URL from the ServiceNow setup).
  16. For Type of authentication token, select OAuth 2.0 Authentication.
  17. For AWS Secrets Manager secret, choose the secret you created earlier.
  18. For IAM role, choose Create a new role.
  19. For Role name, enter a name (for example, AmazonKendra-ServiceNow-role).
  20. Choose Next.
  21. For Select entities or content types, choose your content types.
  22. For Frequency, choose Run on demand.
  23. Choose Next.
  24. Set any optional field mappings and choose Next.
  25. Choose Review and Create and choose Add data source.
  26. Choose Sync now.
  27. Wait for the sync to complete.

Test the solution

Now that you have ingested the content from your ServiceNow account into your Amazon Kendra index, you can test some queries.

Go to your index and choose Search indexed content. Enter a sample search query and test out your search results (your query will vary based on the contents of your account).

The ServiceNow connector also optionally crawls local identity information from ServiceNow. For users, it sets the user email ID as principal. For groups, it sets the group ID as principal. If you turn off identity crawling, then you need to upload the user and group mapping to the principal store using the PutPrincipalMapping API. To filter search results by users or groups, complete the following steps:

  1. Navigate to the search console.
  2. Expand Test query with user name or groups and choose Apply user name or groups.
  3. Enter the user or group names and choose Apply.
  4. Next, enter the search query and press Enter.

This brings you a filtered set of results based on your criteria.
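
If you turn off identity crawling and need to upload the user and group mapping yourself, the following is a minimal boto3 sketch of the PutPrincipalMapping call mentioned above; the index ID, group ID, and user IDs are placeholders.

import boto3

kendra = boto3.client("kendra")

kendra.put_principal_mapping(
    IndexId="<YOUR_INDEX_ID>",
    GroupId="servicenow-itil-group",
    GroupMembers={
        "MemberUsers": [
            {"UserId": "jane.doe@example.com"},
            {"UserId": "john.roe@example.com"},
        ]
    },
)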

Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from your ServiceNow account.

Clean up

It is good practice to clean up (delete) any resources you no longer want to use. Cleaning up AWS resources prevents your account from incurring any further charges.

  1. On the Amazon Kendra console, choose Indexes in the navigation pane.
  2. Choose the index to delete.
  3. Choose Delete to delete the selected index.

Conclusion

With the ServiceNow connector for Amazon Kendra, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Kendra.

In this post, we introduced you to the basics, but there are many additional features that we didn’t cover. For example:

  • You can enable user-based access control for your Amazon Kendra index and restrict access to users and groups that you configure
  • You can map additional fields to Amazon Kendra index attributes and enable them for faceting, search, and display in the search results
  • You can integrate the ServiceNow data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide.


About the authors

Senthil Ramachandran is an Enterprise Solutions Architect at AWS, supporting customers in the US North East. He is primarily focused on cloud adoption and digital transformation in the financial services industry. Senthil’s area of interest is AI, especially deep learning and machine learning. He focuses on application automation with continuous learning and improving the human enterprise experience. Senthil enjoys watching autosport and soccer, and spending time with his family.

Ashish Lagwankar is a Senior Enterprise Solutions Architect at AWS. His core interests include AI/ML, serverless, and container technologies. Ashish is based in the Boston, MA, area and enjoys reading, outdoors, and spending time with his family.

Power recommendations and search using an IMDb knowledge graph – Part 2

This three-part series demonstrates how to use graph neural networks (GNNs) and Amazon Neptune to generate movie recommendations using the IMDb and Box Office Mojo Movies/TV/OTT licensable data package, which provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.

In Part 1, we discussed the applications of GNNs, and how to transform and prepare our IMDb data for querying. In this post, we discuss the process of using Neptune to generate embeddings used to conduct our out-of-catalog search in Part 3. We also go over Amazon Neptune ML, the machine learning (ML) feature of Neptune, and the code we use in our development process. In Part 3, we walk through how to apply our knowledge graph embeddings to an out-of-catalog search use case.

Solution overview

Large connected datasets often contain valuable information that can be hard to extract using queries based on human intuition alone. ML techniques can help find hidden correlations in graphs with billions of relationships. These correlations can be helpful for recommending products, predicting credit worthiness, identifying fraud, and many other use cases.

Neptune ML makes it possible to build and train useful ML models on large graphs in hours instead of weeks. To accomplish this, Neptune ML uses GNN technology powered by Amazon SageMaker and the Deep Graph Library (DGL) (which is open-source). GNNs are an emerging field in artificial intelligence (for an example, see A Comprehensive Survey on Graph Neural Networks). For a hands-on tutorial about using GNNs with the DGL, see Learning graph neural networks with Deep Graph Library.

In this post, we show how to use Neptune in our pipeline to generate embeddings.

The following diagram depicts the overall flow of IMDb data from download to embedding generation.

We use the following AWS services to implement the solution:

In this post, we walk you through the following high-level steps:

  1. Set up environment variables.
  2. Create an export job.
  3. Create a data processing job.
  4. Submit a training job.
  5. Download embeddings.

Code for Neptune ML commands

We use the following commands as part of implementing this solution:

%%neptune_ml export start
%%neptune_ml export status
%neptune_ml training start
%neptune_ml training status

We use neptune_ml export to check the status or start a Neptune ML export process, and neptune_ml training to start and check the status of a Neptune ML model training job.

For more information about these and other commands, refer to Using Neptune workbench magics in your notebooks.

Prerequisites

To follow along with this post, you should have the following:

  • An AWS account
  • Familiarity with SageMaker, Amazon S3, and AWS CloudFormation
  • Graph data loaded into the Neptune cluster (see Part 1 for more information)

Set up environment variables

Before we begin, you’ll need to set up your environment by setting the following variables: s3_bucket_uri and processed_folder. s3_bucket_uri is the name of the bucket used in Part 1, and processed_folder is the Amazon S3 location for the output from the export job.

# name of s3 bucket
s3_bucket_uri = "<s3-bucket-name>"

# the s3 location you want to store results
processed_folder = f"s3://{s3_bucket_uri}/experiments/neptune-export/"

Create an export job

In Part 1, we created a SageMaker notebook and export service to export our data from the Neptune DB cluster to Amazon S3 in the required format.

Now that our data is loaded and the export service is created, we need to create an export job and start it. To do this, we use NeptuneExportApiUri and create parameters for the export job. In the following code, we use the variables expo and export_params. Set expo to your NeptuneExportApiUri value, which you can find on the Outputs tab of your CloudFormation stack. For export_params, we use the endpoint of your Neptune cluster and provide the value for outputS3Path, which is the Amazon S3 location for the output from the export job.

expo = <NEPTUNE-EXPORT-URI>
export_params = {
    "command": "export-pg",
    "params": {
        "endpoint": neptune_ml.get_host(),
        "profile": "neptune_ml",
        "cloneCluster": True
    },
    "outputS3Path": processed_folder,
    "additionalParams": {
        "neptune_ml": {
            "version": "v2.0"
        }
    },
    "jobSize": "medium"
}

To submit the export job use the following command:

%%neptune_ml export start --export-url {expo} --export-iam --store-to export_results --wait-timeout 1000000                                                              
${export_params}

To check the status of the export job use the following command:

%neptune_ml export status --export-url {expo} --export-iam --job-id {export_results['jobId']} --store-to export_results

After your job is complete, set the processed_folder variable to provide the Amazon S3 location of the processed results:

export_results['processed_location']= processed_folder

Create a data processing job

Now that the export is done, we create a data processing job to prepare the data for the Neptune ML training process. This can be done a few different ways. For this step, you can change the job_name and modelType variables, but all other parameters must remain the same. The main portion of this code is the modelType parameter, which can be set to either heterogeneous graph models (heterogeneous) or knowledge graphs (kge).

The export job also includes training-data-configuration.json. Use this file to add or remove any nodes or edges that you don’t want to provide for training (for example, if you want to predict the link between two nodes, you can remove that link in this configuration file). For this blog post, we use the original configuration file. For additional information, see Editing a training configuration file.

Create your data processing job with the following code:

job_name = neptune_ml.get_training_job_name("link-pred")
processing_params = f"""--config-file-name training-data-configuration.json 
--job-id {job_name}-DP 
--s3-input-uri {export_results['outputS3Uri']}  
--s3-processed-uri {export_results['processed_location']} 
--model-type kge 
--instance-type ml.m5.2xlarge
"""

%neptune_ml dataprocessing start --store-to processing_results {processing_params}

To check the status of the data processing job, use the following command:

%neptune_ml dataprocessing status --job-id {processing_results['id']} --store-to processing_results

Submit a training job

After the processing job is complete, we can begin our training job, which is where we create our embeddings. We recommend an instance type of ml.m5.24xlarge, but you can change this to suit your computing needs. See the following code:

dp_id = processing_results['id']
training_job_name = dp_id + "training"
training_job_name = "".join(training_job_name.split("-"))

training_params = f"""--job-id train-{training_job_name} 
--data-processing-id {dp_id} 
--instance-type ml.m5.24xlarge 
--s3-output-uri s3://{str(s3_bucket_uri)}/training/{training_job_name}/"""

%neptune_ml training start --store-to training_results {training_params}
print(training_results)

We print the training_results variable to get the ID for the training job. Use the following command to check the status of your job:

%neptune_ml training status --job-id {training_results['id']} --store-to training_status_results

Download embeddings

After your training job is complete, the last step is to download your raw embeddings. The following steps show you how to download embeddings created by using KGE (you can use the same process for RGCN).

In the following code, we use neptune_ml.get_mapping() and get_embeddings() to download the mapping file (mapping.info) and the raw embeddings file (entity.npy). Then we need to map the appropriate embeddings to their corresponding IDs.

import pickle

import numpy as np
import pandas as pd
from tqdm import tqdm

# Download the raw embeddings and the ID mapping produced by the training job
neptune_ml.get_embeddings(training_status_results["id"])
neptune_ml.get_mapping(training_status_results["id"])

artifacts_dir = '/home/ec2-user/SageMaker/model-artifacts/' + training_status_results["id"]

with open(artifacts_dir + '/mapping.info', "rb") as f:
    mapping = pickle.load(f)

node2id = mapping['node2id']
localid2globalid = mapping['node2gid']
data = np.load(artifacts_dir + '/embeddings/entity.npy')

# Map each movie node ID to its embedding, one (ITEM_ID, KEY, VALUE) row per dimension
embd_to_sum = mapping["node2id"]
movie_node_ids = list(embd_to_sum["movie"].keys())
ITEM_ID = []
KEY = []
VALUE = []
for node_id in tqdm(movie_node_ids):
    index = localid2globalid['movie'][node2id['movie'][node_id]]
    embedding = data[index]
    ITEM_ID += [node_id] * embedding.shape[0]
    KEY += list(range(embedding.shape[0]))
    VALUE += list(embedding)

meta_df = pd.DataFrame({"ITEM_ID": ITEM_ID, "KEY": KEY, "VALUE": VALUE})
meta_df.to_csv('new_embeddings.csv')

To download RGCN embeddings, follow the same process with a new training job name: process the data with the modelType parameter set to heterogeneous, then train your model with the modelName parameter set to rgcn (see here for more details). When that job is finished, call the get_mapping and get_embeddings functions to download your new mapping.info and entity.npy files. After you have the entity and mapping files, the process to create the CSV file is identical.
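
For illustration, a minimal sketch of that second data processing run might look like the following. It reuses only the magic flags shown earlier in this post; the new job ID and variable names are hypothetical, and the exact option for selecting the rgcn model in the training magic isn't shown here, so check the linked documentation or the magic's help output.

# Hedged sketch: a second data processing run that prepares the graph for RGCN.
# Only flags used earlier in this post appear here; job IDs and variable names are made up.
rgcn_processing_params = f"""
--job-id {job_name}-DP-rgcn
--s3-input-uri {export_results['outputS3Uri']}
--s3-processed-uri {export_results['processed_location']}
--model-type heterogeneous
--instance-type ml.m5.2xlarge
"""

%neptune_ml dataprocessing start --store-to rgcn_processing_results {rgcn_processing_params}

# After this job completes, submit a new training job (as in the previous section)
# with the model set to rgcn, then download the new mapping and embedding files.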

Finally, upload your embeddings to your desired Amazon S3 location:

s3_destination = "s3://"+s3_bucket_uri+"/embeddings/"+"new_embeddings.csv"

!aws s3 cp new_embeddings.csv {s3_destination}

Make sure you remember this S3 location; you will need it in Part 3.

Clean up

When you’re done using the solution, be sure to clean up any resources to avoid ongoing charges.
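
For example, if you provisioned everything with the CloudFormation stack from Part 1, deleting that stack removes the Neptune cluster and the notebook instance it created. The stack name below is a placeholder:

# Replace your-neptune-ml-stack with the name of the stack you deployed in Part 1
!aws cloudformation delete-stack --stack-name your-neptune-ml-stack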

Conclusion

In this post, we discussed how to use Neptune ML to train GNN embeddings from IMDb data.

Related applications of knowledge graph embeddings include out-of-catalog search, content recommendations, targeted advertising, missing link prediction, general search, and cohort analysis. Out-of-catalog search takes a search for content that you don't own and finds or recommends the content in your catalog that is closest to what the user searched for. We dive deeper into out-of-catalog search in Part 3.


About the Authors

Matthew Rhodes is a Data Scientist I working in the Amazon ML Solutions Lab. He specializes in building Machine Learning pipelines that involve concepts such as Natural Language Processing and Computer Vision.

Divya Bhargavi is a Data Scientist and Media and Entertainment Vertical Lead at the Amazon ML Solutions Lab, where she solves high-value business problems for AWS customers using Machine Learning. She works on image/video understanding, knowledge graph recommendation systems, and predictive advertising use cases.

Gaurav Rele is a Data Scientist at the Amazon ML Solutions Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Karan Sindwani is a Data Scientist at Amazon ML Solutions Lab, where he builds and deploys deep learning models. He specializes in the area of computer vision. In his spare time, he enjoys hiking.

Soji Adeshina is an Applied Scientist at AWS where he develops graph neural network-based models for machine learning on graphs tasks with applications to fraud & abuse, knowledge graphs, recommender systems, and life sciences. In his spare time, he enjoys reading and cooking.

Vidya Sagar Ravipati is a Manager at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Read More

Power recommendation and search using an IMDb knowledge graph – Part 1

The IMDb and Box Office Mojo Movies/TV/OTT licensable data package provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.

In this three-part series, we demonstrate how to transform and prepare IMDb data to power out-of-catalog search for your media and entertainment use cases. In this post, we discuss how to prepare IMDb data and load the data into Amazon Neptune for querying. In Part 2, we discuss how to use Amazon Neptune ML to train graph neural network (GNN) embeddings from the IMDb graph. In Part 3, we walk through a demo application for out-of-catalog search that is powered by the GNN embeddings.

Solution overview

In this series, we use the IMDb and Box Office Mojo Movies/TV/OTT licensed data package to show how you can build your own applications using graphs.

This licensable data package consists of JSON files with IMDb metadata for more than 9 million titles (including movies, TV and OTT shows, and video games) and credits for more than 11 million cast, crew, and entertainment professionals. IMDb’s metadata package also includes over 1 billion user ratings, as well as plots, genres, categorized keywords, posters, credits, and more.

IMDb delivers data through AWS Data Exchange, which makes it incredibly simple for you to access data to power your entertainment experiences and seamlessly integrate with other AWS services. IMDb licenses data to a wide range of media and entertainment customers, including pay TV, direct-to-consumer, and streaming operators, to improve content discovery and increase customer engagement and retention. Licensing customers also use IMDb data to enhance in-catalog and out-of-catalog title search and power relevant recommendations.

We use the following services as part of this solution:

  • Amazon Neptune
  • AWS Data Exchange
  • Amazon Simple Storage Service (Amazon S3)
  • AWS Glue
  • Amazon SageMaker
  • AWS CloudFormation

The following diagram depicts the workflow for Part 1 of the three-part blog series.

In this post, we walk through the following high-level steps:

  1. Provision Neptune resources with AWS CloudFormation.
  2. Access the IMDb data from AWS Data Exchange.
  3. Clone the GitHub repo.
  4. Process the data in Neptune Gremlin format.
  5. Load the data into a Neptune cluster.
  6. Query the data using Gremlin Query Language.

Prerequisites

The IMDb data used in this post requires an IMDb content license and paid subscription to the IMDb and Box Office Mojo Movies/TV/OTT licensing package in AWS Data Exchange. To inquire about a license and access sample data, visit developer.imdb.com.

Additionally, to follow along with this post, you should have an AWS account and familiarity with Neptune, the Gremlin query language, and SageMaker.

Provision Neptune resources with AWS CloudFormation

Now that you’ve seen the structure of the solution, you can deploy it into your account to run an example workflow.

You can launch the stack in the us-east-1 AWS Region on the AWS CloudFormation console by choosing Launch Stack.


To launch the stack in a different Region, refer to Using the Neptune ML AWS CloudFormation template to get started quickly in a new DB cluster.

The following screenshot shows the stack parameters to provide.

Stack creation takes approximately 20 minutes. You can monitor the progress on the AWS CloudFormation console.

When the stack is complete, you’re now ready to process the IMDb data. On the Outputs tab for the stack, note the values for NeptuneExportApiUri and NeptuneLoadFromS3IAMRoleArn. Then proceed to the following steps to gain access to the IMDb dataset.
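
If you'd rather read those outputs from a terminal or notebook than the console, the following command works; the stack name is a placeholder you need to replace:

# Replace your-stack-name with the name of the stack you launched
!aws cloudformation describe-stacks --stack-name your-stack-name --query "Stacks[0].Outputs" --output table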

Access the IMDb data

IMDb publishes its dataset once a day on AWS Data Exchange. To use the IMDb data, you first subscribe to the data in AWS Data Exchange, then you can export the data to Amazon Simple Storage Service (Amazon S3). Complete the following steps:

  1. On the AWS Data Exchange console, choose Browse catalog in the navigation pane.
  2. In the search field, enter IMDb.
  3. Subscribe to either IMDb and Box Office Mojo Movie/TV/OTT Data (SAMPLE) or IMDb and Box Office Mojo Movie/TV/OTT Data.
  4. Complete the steps in the following workshop to export the IMDb data from AWS Data Exchange to Amazon S3.
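
As an optional check from the command line (not required for the walkthrough), you can confirm your subscription by listing the data sets your account is entitled to through AWS Data Exchange:

# Lists data sets shared with your account through active AWS Data Exchange subscriptions
!aws dataexchange list-data-sets --origin ENTITLED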

Clone the GitHub repository

Complete the following steps:

  1. Open the SageMaker instance that you created from the CloudFormation template.
  2. Clone the GitHub repository.

Process IMDb data in Neptune Gremlin format

To add the data to Amazon Neptune, we process it into Neptune Gremlin format. From the GitHub repository, we run process_imdb_data.py to process the files. The script creates the CSV files used to load the data into Neptune. Upload the data to an S3 bucket and note the S3 URI location.

Note that for this post, we filter the dataset to include only movies. You need either an AWS Glue job or Amazon EMR to process the full data.

To process the IMDb data using AWS Glue, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, choose Jobs.
  2. On the Jobs page, choose Spark script editor.
  3. Under Options, choose Upload and edit existing script and upload the 1_process_imdb_data.py file.
  4. Choose Create.
  5. On the editor page, choose Job Details.
  6. On the Job Details page, add the following options:
    1. For Name, enter imdb-graph-processor.
    2. For Description, enter processing IMDb dataset and convert to Neptune Gremlin Format.
    3. For IAM role, use an existing AWS Glue role or create an IAM role for AWS Glue. Make sure you give permission to your Amazon S3 location for the raw data and output data path.
    4. For Worker type, choose G.2X.
    5. For Requested number of workers, enter 20.
  7. Expand Advanced properties.
  8. Under Job Parameters, choose Add new parameter and enter the following key-value pair:
    1. For the key, enter --output_bucket_path.
    2. For the value, enter the S3 path where you want to save the files. This path is also used to load the data into the Neptune cluster.
  9. To add another parameter, choose Add new parameter and enter the following key-value pair:
    1. For the key, enter --raw_data_path.
    2. For the value, enter the S3 path where the raw data is stored.
  10. Choose Save and then choose Run.

This job takes about 2.5 hours to complete.
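
The console steps above map directly onto the AWS Glue CreateJob API. If you'd rather script the job creation, a minimal boto3 sketch might look like the following; the role ARN, script location, and S3 paths are placeholders you need to replace:

import boto3

glue = boto3.client("glue")

# All names, paths, and the role ARN below are placeholders
response = glue.create_job(
    Name="imdb-graph-processor",
    Role="arn:aws:iam::111122223333:role/MyGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/1_process_imdb_data.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--raw_data_path": "s3://my-bucket/imdb/raw/",
        "--output_bucket_path": "s3://my-bucket/imdb/processed/",
    },
    WorkerType="G.2X",
    NumberOfWorkers=20,
    GlueVersion="3.0",
)

# Start the job run; this is equivalent to choosing Run on the console
glue.start_job_run(JobName=response["Name"])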

The following table provides details about the nodes in the graph data model.

Description | Label
Principal cast members | Person
Long format movie | Movie
Genre of movies | Genre
Keyword descriptions of movies | Keyword
Shooting locations of movies | Place
Ratings for movies | rating
Awards event where movie received an award | awards

Similarly, the following table shows some of the edges included in the graph. There are 24 edge types in total.

Description | Label | From | To
Movies an actress has acted in | casted-by-actress | Movie | Person
Movies an actor has acted in | casted-by-actor | Movie | Person
Keywords in a movie by character | described-by-character-keyword | Movie | keyword
Genre of a movie | is-genre | Movie | Genre
Place where the movie was shot | Filmed-at | Movie | Place
Composer of a movie | Crewed-by-composer | Movie | Person
Award nomination | Nominated_for | Movie | Awards
Award winner | Has_won | Movie | Awards

Load the data into a Neptune cluster

In the repo, navigate to the graph_creation folder and run the 2_load.ipynb notebook. To load the data into Neptune, use the %load command in the notebook, providing your AWS Identity and Access Management (IAM) role ARN and the Amazon S3 location of your processed data.

role = '<NeptuneLoadFromS3IAMRoleArn>'
%load -l {role} -s <s3_location> --store-to load_id

The following screenshot shows the output of the command.

Note that the data load takes about 1.5 hours to complete. To check the status of the load, use the following command:

%load_status {load_id['payload']['loadId']} --errors --details

When the load is complete, the status displays LOAD_COMPLETED, as shown in the following screenshot.

All the data is now loaded into graphs, and you can start querying the graph.

Fig: Sample knowledge graph representation of movies in the IMDb dataset. The movies "Saving Private Ryan" and "Bridge of Spies" share direct connections, such as a common actor and director, as well as indirect connections through movies like "The Catcher Was a Spy" in the graph network.

Query the data using Gremlin

To access the graph in Neptune, we use the Gremlin query language. For more information, refer to Querying a Neptune Graph.

The graph consists of a rich set of information that can be queried directly using Gremlin. In this section, we show a few examples of questions that you can answer with the graph data. In the repo, navigate to the graph_creation folder and run the 3_queries.ipynb notebook. The following sections go over the queries from the notebook.

Worldwide gross of movies that have been shot in New Zealand, with minimum 7.5 rating

The following query returns the worldwide gross of movies filmed in New Zealand, with a minimum rating of 7.5:

%%gremlin --store-to result

g.V().has('place', 'name', containing('New Zealand')).in().has('movie', 'rating', gt(7.5)).dedup().valueMap(['name', 'gross_worldwide', 'rating', 'studio','id'])

The following screenshot shows the query results.

Top 50 movies that belong to action and drama genres and have Oscar-winning actors

In the following example, we want to find the top 50 movies in two different genres (action and drama) with Oscar-winning actors. We can do this by using three different queries and merging the information using Pandas:

%%gremlin --store-to result_action
g.V().has('genre', 'name', 'Action').in().has('movie', 'rating', gt(8.5)).limit(50).valueMap(['name', 'year', 'poster'])

%%gremlin --store-to result_drama
g.V().has('genre', 'name', 'Drama').in().has('movie', 'rating', gt(8.5)).limit(50).valueMap(['name', 'year', 'poster'])

%%gremlin --store-to result_actors --silent
g.V().has('person', 'oscar_winner', true).in().has('movie', 'rating', gt(8.5)).limit(50).valueMap(['name', 'year', 'poster'])
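
The post doesn't list the Pandas merge itself, so the following is a rough sketch of how the three stored results could be combined. It assumes each stored variable is a list of valueMap dictionaries (with single-element list values), which is what these queries return:

import pandas as pd

# Hedged sketch: flatten the single-element lists that valueMap() returns,
# then intersect the three result sets on the returned keys.
def to_df(result):
    return pd.DataFrame([
        {k: v[0] if isinstance(v, list) else v for k, v in row.items()}
        for row in result
    ])

action_df = to_df(result_action)
drama_df = to_df(result_drama)
actors_df = to_df(result_actors)

top_movies = (
    action_df.merge(drama_df, on=["name", "year", "poster"])
             .merge(actors_df, on=["name", "year", "poster"])
             .head(50)
)
top_movies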

The following screenshot shows our results.

Top movies that have common keywords “tattoo” and “assassin”

The following query returns movies with keywords “tattoo” and “assassin”:

%%gremlin --store result

g.V().has('keyword','name','assassin').in("described-by-plot-related-keyword").where(out("described-by-plot-related-keyword").has('keyword','name','tattoo')).dedup().limit(10).valueMap(['name', 'poster','year'])

The following screenshot shows our results.

Movies that have common actors

In the following query, we find movies that feature both Leonardo DiCaprio and Tom Hanks:

%%gremlin --store result

g.V().has('person', 'name', containing('Leonardo DiCaprio')).in().hasLabel('movie').out().has('person','name', 'Tom Hanks').path().by(valueMap('name', 'poster'))

We get the following results.

Conclusion

In this post, we showed you the power of the IMDb and Box Office Mojo Movies/TV/OTT dataset and how you can use it in various use cases by converting the data into a graph and querying it with Gremlin. In Part 2 of this series, we show you how to create graph neural network models on this data that can be used for downstream tasks.

For more information about Neptune and Gremlin, refer to Amazon Neptune Resources for additional blog posts and videos.


About the Authors

Gaurav Rele is a Data Scientist at the Amazon ML Solutions Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Matthew Rhodes is a Data Scientist I working in the Amazon ML Solutions Lab. He specializes in building Machine Learning pipelines that involve concepts such as Natural Language Processing and Computer Vision.

Divya Bhargavi is a Data Scientist and Media and Entertainment Vertical Lead at the Amazon ML Solutions Lab, where she solves high-value business problems for AWS customers using Machine Learning. She works on image/video understanding, knowledge graph recommendation systems, and predictive advertising use cases.

Karan Sindwani is a Data Scientist at Amazon ML Solutions Lab, where he builds and deploys deep learning models. He specializes in the area of computer vision. In his spare time, he enjoys hiking.

Soji Adeshina is an Applied Scientist at AWS where he develops graph neural network-based models for machine learning on graphs tasks with applications to fraud & abuse, knowledge graphs, recommender systems, and life sciences. In his spare time, he enjoys reading and cooking.

Vidya Sagar Ravipati is a Manager at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Read More