Build an image search engine with Amazon Kendra and Amazon Rekognition

Build an image search engine with Amazon Kendra and Amazon Rekognition

In this post, we discuss a machine learning (ML) solution for complex image searches using Amazon Kendra and Amazon Rekognition. Specifically, we use the example of architecture diagrams for complex images due to their incorporation of numerous different visual icons and text.

With the internet, searching and obtaining an image has never been easier. Most of the time, you can accurately locate your desired images, such as searching for your next holiday getaway destination. Simple searches are often successful, because they’re not associated with many characteristics. Beyond the desired image characteristics, the search criteria typically doesn’t require significant details to locate the required result. For example, if a user tried to search for a specific type of blue bottle, results of many different types of blue bottles will be displayed. However, the desired blue bottle may not be easily found due to generic search terms.

Interpreting search context also contributes to simplification of results. When users have a desired image in mind, they try to frame this into a text-based search query. Understanding the nuances between search queries for similar topics is important to provide relevant results and minimize the effort required from the user to manually sort through results. For example, the search query “Dog owner plays fetch” seeks to return image results showing a dog owner playing a game of fetch with a dog. However, the actual results generated may instead focus on a dog fetching an object without displaying an owner’s involvement. Users may have to manually filter out unsuitable image results when dealing with complex searches.

To address the problems associated with complex searches, this post describes in detail how you can achieve a search engine that is capable of searching for complex images by integrating Amazon Kendra and Amazon Rekognition. Amazon Kendra is an intelligent search service powered by ML, and Amazon Rekognition is an ML service that can identify objects, people, text, scenes, and activities from images or videos.

What images can be too complex to be searchable? One example is architecture diagrams, which can be associated with many search criteria depending on the use case complexity and number of technical services required, which results in significant manual search effort for the user. For example, if users want to find an architecture solution for the use case of customer verification, they will typically use a search query similar to “Architecture diagrams for customer verification.” However, generic search queries would span a wide range of services and across different content creation dates. Users would need to manually select suitable architectural candidates based on specific services and consider the relevance of the architecture design choices according to the content creation date and query date.

The following figure shows an example diagram that illustrates an orchestrated extract, transform, and load (ETL) architecture solution.

For users who are not familiar with the service offerings that are provided on the cloud platform, they may provide different generic ways and descriptions when searching for such a diagram. The following are some examples of how it could be searched:

  • “Orchestrate ETL workflow”
  • “How to automate bulk data processing”
  • “Methods to create a pipeline for transforming data”

Solution overview

We walk you through the following steps to implement the solution:

  1. Train an Amazon Rekognition Custom Labels model to recognize symbols in architecture diagrams.
  2. Incorporate Amazon Rekognition text detection to validate architecture diagram symbols.
  3. Use Amazon Rekognition inside a web crawler to build a repository for searching
  4. Use Amazon Kendra to search the repository.

To easily provide users with a large repository of relevant results, the solution should provide an automated way of searching through trusted sources. Using architecture diagrams as an example, the solution needs to search through reference links and technical documents for architecture diagrams and identify the services present. Identifying keywords such as use cases and industry verticals in these sources also allows the information to be captured and for more relevant search results to be displayed to the user.

Considering the objective of how relevant diagrams should be searched, the image search solution needs to fulfil three criteria:

  • Enable simple keyword search
  • Interpret search queries based on use cases that users provide
  • Sort and order search results

Keyword search is simply searching for “Amazon Rekognition” and being shown architecture diagrams on how the service is used in different use cases. Alternatively, the search terms can be linked indirectly to the diagram through use cases and industry verticals that may be associated with the architecture. For example, searching for the terms “How to orchestrate ETL pipeline” returns results of architecture diagrams built with AWS Glue and AWS Step Functions. Sorting and ordering of search results based on attributes such as creation date would ensure the architecture diagrams are still relevant in spite of service updates and releases. The following figure shows the architecture diagram to the image search solution.

As illustrated in the preceding diagram and in the solution overview, there are two main aspects of the solution. The first aspect is performed by Amazon Rekognition, which can identify objects, people, text, scenes, and activities from images or videos. It consists of pre-trained models that can be applied to analyze images and videos at scale. With its custom labels feature, Amazon Rekognition allows you to tailor the ML service to your specific business needs by labeling images collated from sourcing through architecture diagrams in trusted reference links and technical documents. By uploading a small set of training images, Amazon Rekognition automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. Therefore, users without ML expertise can enjoy the benefits of a custom labels model through an API call, because a significant amount of overhead is reduced. The solution applies Amazon Rekognition Custom Labels to detect AWS service logos on architecture diagrams to allow the architecture diagrams to be searchable with service names. After modeling, detected services of each architecture diagram image and its metadata, like URL origin and image title, are indexed for future search purposes and stored in Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database designed to run high-performance applications.

The second aspect is supported by Amazon Kendra, an intelligent enterprise search service powered by ML that allows you to search across different content repositories. With Amazon Kendra, you can search for results, such as images or documents, that have been indexed. These results can also be stored across different repositories because the search service employs built-in connectors. Keywords, phrases, and descriptions could be used for searching, which allows you to accurately search for diagrams that are related to a particular use case. Therefore, you can easily build an intelligent search service with minimal development costs.

With an understanding of the problem and solution, the subsequent sections dive into how to automate data sourcing through the crawling of architecture diagrams from credible sources. Following this, we walk through the process of generating a custom label ML model with a fully managed service. Lastly, we cover the data ingestion by an intelligent search service, powered by ML.

Create an Amazon Rekognition model with custom labels

Before obtaining any architecture diagrams, we need a tool to evaluate if an image can be identified as an architecture diagram. Amazon Rekognition Custom Labels provides a streamlined process to create an image recognition model that identifies objects and scenes in images that are specific to a business need. In this case, we use Amazon Rekognition Custom Labels to identify AWS service icons, then the images are indexed with the services for a more relevant search using Amazon Kendra. This model doesn’t differentiate whether a picture is an architecture diagram or not; it simply identifies service icons, if any. As such, there may be instances where images that aren’t architecture diagrams end up in the search results. However, such results are minimal.

The following figure shows the steps that this solution takes to create an Amazon Rekognition Custom Labels model.

This process involves uploading the datasets, generating a manifest file that references the uploaded datasets, followed by uploading this manifest file into Amazon Rekognition. A Python script is used to aid in the process of uploading the datasets and generating the manifest file. Upon successfully generating the manifest file, it’s then uploaded into Amazon Rekognition to begin the model training process. For details on the Python script and how to run it, refer to the GitHub repo.

To train the model, in the Amazon Rekognition project, choose Train model, select the project you want to train, then add any relevant tags and choose Train model. For instructions on starting an Amazon Rekognition Custom Labels project, refer to the available video tutorials. The model may take up to 8 hours to train with this dataset.

When the training is complete, you may choose the trained model to view the evaluation results. For more details on the different metrics such as precision, recall, and F1, refer to Metrics for evaluation your model. To use the model, navigate to the Use Model tab, leave the number of inference units at 1, and start the model. Then we can use an AWS Lambda function to send images to the model in base64, and the model returns a list of labels and confidence scores.

Upon successfully training an Amazon Rekognition model with Amazon Rekognition Custom Labels, we can use it to identify service icons in the architecture diagrams that have been crawled. To increase the accuracy of identifying services in the architecture diagram, we use another Amazon Rekognition feature called text detection. To use this feature, we pass in the same picture in base64, and Amazon Rekognition returns the list of text identified in the picture. In the following figures, we compare the original image and what it looks like after the services in the image are identified. The first figure shows the original image.

The following figure shows the original image with detected services.

To ensure scalability, we use a Lambda function, which will be exposed through an API endpoint created using Amazon API Gateway. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Using a Lambda function eliminates a common concern about scaling up when large volumes of requests are made to the API endpoint. Lambda automatically runs the function for the specific API call, which stops when the invocation is complete, thereby reducing cost incurred to the user. Because the request would be directed to the Amazon Rekognition endpoint, having only the Lambda function being scalable is not sufficient. In order for the Amazon Rekognition endpoint to be scalable, you can increase the inference unit of the endpoint. For more details on configuring the inference unit, refer to Inference units.

The following is a code snippet of the Lambda function for the image recognition process:

const AWS = require("aws-sdk");
const axios = require("axios");

// API to retrieve information about individual services
const SERVICE_API = process.env.SERVICE_API;
// ARN of Amazon Rekognition model
const MODEL_ARN = process.env.MODEL_ARN;

const rekognition = new AWS.Rekognition();

exports.handler = async (event) => {
  const body = JSON.parse(event["body"]);
  let base64Binary = "";

  // Checks if the payload contains a url to the image or the image in base64
  if (body.url) {
    const base64Res = await new Promise((resolve) => {
      axios
        .get(body.url, {
          responseType: "arraybuffer",
        })
        .then((response) => {
          resolve(Buffer.from(response.data, "binary").toString("base64"));
        });
    });

    base64Binary = new Buffer.from(base64Res, "base64");
  } else if (body.byte) {
    const base64Cleaned = body.byte.split("base64,")[1];
    base64Binary = new Buffer.from(base64Cleaned, "base64");
  }

  // Pass the contents through the trained Custom Labels model and text detection
  const [labels, text] = await Promise.all([
    detectLabels(rekognition, base64Binary, MODEL_ARN),
    detectText(rekognition, base64Binary),
  ]);
  const texts = text.TextDetections.map((text) => ({
    DetectedText: text.DetectedText,
    ParentId: text.ParentId,
  }));

  // Compare between overlapping labels and retain the label with the highest confidence
  let filteredLabels = removeOverlappingLabels(labels);

  // Sort all the labels from most to least confident
  filteredLabels = sortByConfidence(filteredLabels);

  // Remove duplicate services in the list
  const services = retrieveUniqueServices(filteredLabels, texts);

  // Pass each service into the reference document API to retrieve the URL to the documentation
  const refLinks = await getReferenceLinks(services);

  var responseBody = {
    labels: filteredLabels,
    text: texts,
    ref_links: refLinks,
  };

  console.log("Response: ", response_body);

  const response = {
    statusCode: 200,
    headers: {
      "Access-Control-Allow-Origin": "*", // Required for CORS to work
    },
    body: JSON.stringify(responseBody),
  };
  return response;
};

// Code removed to truncate section

After creating the Lambda function, we can proceed to expose it as an API using API Gateway. For instructions on creating an API with Lambda proxy integration, refer to Tutorial: Build a Hello World REST API with Lambda proxy integration.

Crawl the architecture diagrams

In order for the search feature to work feasibly, we need a repository of architecture diagrams. However, these diagrams must originate from credible sources such as AWS Blog and AWS Prescriptive Guidance. Establishing credibility of data sources ensures the underlying implementation and purpose of the use cases are accurate and well vetted. The next step is to set up a crawler that can help gather many architecture diagrams to feed into our repository. We created a web crawler to extract architecture diagrams and information such as a description of the implementation from the relevant sources. There are multiple ways that you could achieve building such a mechanism; for this example, we use a program that runs on Amazon Elastic Compute Cloud (Amazon EC2). The program first obtains links to blog posts from an AWS Blog API. The response returned from the API contains information of the post such as title, URL, date, and the links to images found in the post.

The following is a code snippet of the JavaScript function for the web crawling process:

import axios from "axios";
import puppeteer from "puppeteer";
import {
  putItemDDB,
  identifyImageHighConfidence,
  getReferenceList,
} from "./utils.js";

/** Global variables */
const blogPostsApi = process.env.BLOG_POSTS_API;
const IMAGE_URL_PATTERN =
  "<pattern in the url that identified as link to image>";
const DDB_Table = process.env.DDB_Table;

// Function that retrieves URLs of records from a public API
function getURLs(blogPostsApi) {
  // Return a list of URLs
  return axios
    .get(blogPostsApi)
    .then((response) => {
      var data = response.data.items;
      console.log("RESPONSE:");
      const blogLists = data.map((blog) => [
        blog.item.additionalFields.link,
        blog.item.dateUpdated,
      ]);
      return blogLists;
    })
    .catch((error) => console.error(error));
}

// Function that crawls content of individual URLs
async function crawlFromUrl(urls) {
  const browser = await puppeteer.launch({
    executablePath: "/usr/bin/chromium-browser",
  });
  // const browser = await puppeteer.launch();

  const page = await browser.newPage();

  let numOfValidArchUrls = 0;

  for (let index = 0; index < urls.length; index++) {
    console.log("index: ", index);
    let blogURL = urls[index][0];
    let dateUpdated = urls[index][1];

    await page.goto(blogURL);
    console.log("blogUrl:", blogURL);
    console.log("date:", dateUpdated);

    // Identify and get image from post based on URL pattern
    const images = await page.evaluate(() =>
      Array.from(document.images, (e) => e.src)
    );
    const filter1 = images.filter((img) => img.includes(IMAGE_URL_PATTERN));
    console.log("all images:", filter1);

    // Validate if image is an architecture diagram
    for (let index_1 = 0; index_1 < filter1.length; index_1++) {
      const imageUrl = filter1[index_1];

      const rekog = await identifyImageHighConfidence(imageUrl);

      if (rekog) {
        if (rekog.labels.size >= 2) {
          console.log("Rekog.labels.size = ", rekog.labels.size);
          console.log("Selected image url  = ", imageUrl);

          let articleSection = [];
          let metadata = await page.$$('span[property="articleSection"]');

          for (let i = 0; i < metadata.length; i++) {
            const element = metadata[i];
            const value = await element.evaluate(
              (el) => el.textContent,
              element
            );
            console.log("value: ", value);
            articleSection.push(value);
          }

          const title = await page.title();
          const allRefLinks = await getReferenceList(
            rekog.labels,
            rekog.textServices
          );

          numOfValidArchUrls = numOfValidArchUrls + 1;

          putItemDDB(
            blogURL,
            dateUpdated,
            imageUrl,
            articleSection.toString(),
            rekog,
            { L: allRefLinks },
            title,
            DDB_Table
          );

          console.log("numOfValidArchUrls = ", numOfValidArchUrls);
        }
      }
      if (rekog && rekog.labels.size >= 2) {
        break;
      }
    }
  }
  console.log("valid arch : ", numOfValidArchUrls);
  await browser.close();
}

async function startCrawl() {
  // Get a list of URLs
  // Extract architecture image from those URLs
  const urls = await getURLs(blogPostsApi);

  if (urls) console.log("Crawling urls completed");
  else {
    console.log("Unable to crawl images");
    return;
  }
  await crawlFromUrl(urls);
}

startCrawl();

With this mechanism, we can easily crawl hundreds and thousands of images from different blogs. However, we need a filter that only accepts images that contain content of an architecture diagram, which in our case are icons of AWS services, to filter out images that are not architecture diagrams.

This is the purpose of our Amazon Rekognition model. The diagrams go through the image recognition process, which identifies service icons and determines if it could be considered as a valid architecture diagram.

The following is a code snippet of the function that sends images to the Amazon Rekognition model:

import axios from "axios";
import AWS from "aws-sdk";

// Configuration
AWS.config.update({ region: process.env.REGION });

/** Global variables */
// API to identify images
const LABEL_API = process.env.LABEL_API;
// API to get relevant documentations of individual services
const DOCUMENTATION_API = process.env.DOCUMENTATION_API;
// Create the DynamoDB service object
const dynamoDB = new AWS.DynamoDB({ apiVersion: "2012-08-10" });

// Function to identify image using an API that calls Amazon Rekognition model
function identifyImageHighConfidence(image_url) {
  return axios
    .post(LABEL_API, {
      url: image_url,
    })
    .then((res) => {
      let data = res.data;
      let rekogLabels = new Set();
      let rekogTextServices = new Set();
      let rekogTextMetadata = new Set();

      data.labels.forEach((element) => {
        if (element.Confidence >= 40) rekogLabels.add(element.Name);
      });

      data.text.forEach((element) => {
        if (
          element.DetectedText.includes("AWS") ||
          element.DetectedText.includes("Amazon")
        ) {
          rekogTextServices.add(element.DetectedText);
        } else {
          rekogTextMetadata.add(element.DetectedText);
        }
      });
      rekogTextServices.delete("AWS");
      rekogTextServices.delete("Amazon");
      return {
        labels: rekogLabels,
        textServices: rekogTextServices,
        textMetadata: Array.from(rekogTextMetadata).join(", "),
      };
    })
    .catch((error) => console.error(error));
}

After passing the image recognition check, the results returned from the Amazon Rekognition model and the information relevant to it are bundled into their own metadata. The metadata is then stored in a DynamoDB table where the record would be used to ingest into Amazon Kendra.

The following is a code snippet of the function that stores the metadata of the diagram in DynamoDB:

// Code removed to truncate section

// Function that PUTS item into Amazon DynamoDB table
function putItemDDB(
  originUrl,
  publishDate,
  imageUrl,
  crawlerData,
  rekogData,
  referenceLinks,
  title,
  tableName
) {
  console.log("WRITE TO DDB");
  console.log("originUrl :   ", originUrl);
  console.log("publishDate:  ", publishDate);
  console.log("imageUrl: ", imageUrl);
  let write_params = {
    TableName: tableName,
    Item: {
      OriginURL: { S: originUrl },
      PublishDate: { S: formatDate(publishDate) },
      ArchitectureURL: {
        S: imageUrl,
      },
      Metadata: {
        M: {
          crawler: {
            S: crawlerData,
          },
          Rekognition: {
            M: {
              labels: {
                S: Array.from(rekogData.labels).join(", "),
              },
              textServices: {
                S: Array.from(rekogData.textServices).join(", "),
              },
              textMetadata: {
                S: rekogData.textMetadata,
              },
            },
          },
        },
      },
      Reference: referenceLinks,
      Title: {
        S: title,
      },
    },
  };

  dynamoDB.putItem(write_params, function (err, data) {
    if (err) {
      console.log("*** DDB Error", err);
    } else {
      console.log("Successfuly inserted in DDB", data);
    }
  });
}

Ingest metadata into Amazon Kendra

After the architecture diagrams go through the image recognition process and the metadata is stored in DynamoDB, we need a way for the diagrams to be searchable while referencing the content in the metadata. The approach to this is to have a search engine that can be integrated with the application and can handle a large amount of search queries. Therefore, we use Amazon Kendra, an intelligent enterprise search service.

We use Amazon Kendra as the interactive component of the solution is because of its powerful search capabilities, particularly with the use of natural language. This adds an additional layer of simplicity when users are searching for diagrams that are closest to what they’re looking for. Amazon Kendra offers a number of data sources connectors for ingesting and connecting contents. This solution uses a custom connector to ingest architecture diagrams’ information from DynamoDB. To configure a data source to an Amazon Kendra index, you can use an existing index or create a new index.

The diagrams crawled then have to be ingested into the Amazon Kendra index that has been created. The following figure shows the flow of how the diagrams are indexed.

First, the diagrams inserted into DynamoDB create a Put event via Amazon DynamoDB Streams. The event triggers the Lambda function that acts as a custom data source for Amazon Kendra and loads the diagrams into the index. For instructions on creating a DynamoDB Streams trigger for a Lambda function, refer to Tutorial: Using AWS Lambda with Amazon DynamoDB Streams

After we integrate the Lambda function with DynamoDB, we need to ingest the records of the diagrams sent to the function into the Amazon Kendra index. The index accepts data from various types of sources, and ingesting items into the index from the Lambda function means that it has to use the custom data source configuration. For instructions on creating a custom data source for your index, refer to Custom data source connector.

The following is a code snippet of the Lambda function for how a diagram could be indexed in a custom manner:

import json
import os
import boto3

KENDRA = boto3.client("kendra")
INDEX_ID = os.environ["INDEX_ID"]
DS_ID = os.environ["DS_ID"]


def lambda_handler(event, context):
    dbRecords = event["Records"]

    # Loop through items from Amazon DynamoDB
    for row in dbRecords:
        rowData = row["dynamodb"]["NewImage"]
        originUrl = rowData["OriginURL"]["S"]
        publishedDate = rowData["PublishDate"]["S"]
        architectureUrl = rowData["ArchitectureURL"]["S"]
        title = rowData["Title"]["S"]

        metadata = rowData["Metadata"]["M"]
        crawlerMetadata = metadata["crawler"]["S"]
        rekognitionMetadata = metadata["Rekognition"]["M"]
        rekognitionLabels = rekognitionMetadata["labels"]["S"]
        rekognitionServices = rekognitionMetadata["textServices"]["S"]

        concatenatedText = (
            f"{crawlerMetadata} {rekognitionLabels} {rekognitionServices}"
        )

        add_document(
            dsId=DS_ID,
            indexId=INDEX_ID,
            originUrl=originUrl,
            architectureUrl=architectureUrl,
            title=title,
            publishedDate=publishedDate,
            text=concatenatedText,
        )

    return


# Function to add the diagram into Kendra index
def add_document(dsId, indexId, originUrl, architectureUrl, title, publishedDate, text):
    document = get_document(
        dsId, indexId, originUrl, architectureUrl, title, publishedDate, text
    )
    documents = [document]
    result = KENDRA.batch_put_document(IndexId=indexId, Documents=documents)
    print("result:" + json.dumps(result))
    return True


# Frame the diagram into a document that Kendra accepts
def get_document(dsId, originUrl, architectureUrl, title, publishedDate, text):
    document = {
        "Id": originUrl,
        "Title": title,
        "Attributes": [
            {"Key": "_data_source_id", "Value": {"StringValue": dsId}},
            {"Key": "_source_uri", "Value": {"StringValue": architectureUrl}},
            {"Key": "_created_at", "Value": {"DateValue": publishedDate}},
            {"Key": "publish_date", "Value": {"DateValue": publishedDate}},
        ],
        "Blob": text,
    }

    return document

The important factor that enables diagrams to be searchable is the Blob key in a document. This is what Amazon Kendra looks into when users provide their search input. In this example code, the Blob key contains a summarized version of the use case of the diagram concatenated with the information detected from the image recognition process. This allows users to search for architecture diagrams based on use cases such as “Fraud Detection” or by service names like “Amazon Kendra.”

To illustrate an example of what the Blob key looks like, the following snippet references the initial ETL diagram that we introduced earlier in this post. It contains a description of the diagram that was obtained when it was crawled, as well as the services that were identified by the Amazon Rekognition model.

{
    ...,
    "Blob": "Build and orchestrate ETL pipelines using Amazon Athena and AWS Step Functions Amazon Athena, AWS Step Functions, Amazon S3, AWS Glue Data Catalog "
}

Search with Amazon Kendra

After we put all the components together, the results of an example search of “real time analytics” look like the following screenshot.

By searching for this use case, it produces different architecture diagrams. Users are provided with these different methods of the specific workload that they’re trying to implement.

Clean up

Complete the steps in this section to clean up the resources you created as part of this post:

  1. Delete the API:
    1. On the API Gateway console, select the API to be deleted.
    2. On the Actions menu, choose Delete.
    3. Choose Delete to confirm.
  2. Delete the DynamoDB table:
    1. On the DynamoDB console, choose Tables in the navigation pane.
    2. Select the table you created and choose Delete.
    3. Enter delete when prompted for confirmation.
    4. Choose Delete table to confirm.
  3. Delete the Amazon Kendra index:
    1. On the Amazon Kendra console, choose Indexes in the navigation pane.
    2. Select the index you created and choose Delete
    3. Enter a reason when prompted for confirmation.
    4. Choose Delete to confirm.
  4. Delete the Amazon Rekognition project:
    1. On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane, then choose Projects.
    2. Select the project you created and choose Delete.
    3. Enter Delete when prompted for confirmation.
    4. Choose Delete associated datasets and models to confirm.
  5. Delete the Lambda function:
    1. On the Lambda console, select the function to be deleted.
    2. On the Actions menu, choose Delete.
    3. Enter Delete when prompted for confirmation.
    4. Choose Delete to confirm.

Summary

In this post, we showed an example of how you can intelligently search information from images. This includes the process of training an Amazon Rekognition ML model that acts as a filter for images, the automation of image crawling, which ensures credibility and efficiency, and querying for diagrams by attaching a custom data source that enables a more flexible manner to index items. To dive deeper into the implementation of the codes, refer to the GitHub repo.

Now that you understand how to deliver the backbone of a centralized search repository for complex searches, try creating your own image search engine. For more information on the core features, refer to Getting started with Amazon Rekognition Custom Labels, Moderating content, and the Amazon Kendra Developer Guide. If you’re new to Amazon Rekognition Custom Labels, try it out using our Free Tier, which lasts 3 months and includes 10 free training hours per month and 4 free inference hours per month.


About the Authors

Ryan See is a Solutions Architect at AWS. Based in Singapore, he works with customers to build solutions to solve their business problems as well as tailor a technical vision to help run more scalable and efficient workloads in the cloud.

James Ong Jia Xiang is a Customer Solutions Manager at AWS. He specializes in the Migration Acceleration Program (MAP) where he helps customers and partners successfully implement large-scale migration programs to AWS. Based in Singapore, he also focuses on driving modernization and enterprise transformation initiatives across APJ through scalable mechanisms. For leisure, he enjoys nature activities like trekking and surfing.

Hang Duong is a Solutions Architect at AWS. Based in Hanoi, Vietnam, she focuses on driving cloud adoption across her country by providing highly available, secure, and scalable cloud solutions for her customers. Additionally, she enjoys building and is involved in various prototyping projects. She is also passionate about the field of machine learning.

Trinh Vo is a Solutions Architect at AWS, based in Ho Chi Minh City, Vietnam. She focuses on working with customers across different industries and partners in Vietnam to craft architectures and demonstrations of the AWS platform that work backward from the customer’s business needs and accelerate the adoption of appropriate AWS technology. She enjoys caving and trekking for leisure.

Wai Kin Tham is a Cloud Architect at AWS. Based in Singapore, his day job involves helping customers migrate to the cloud and modernize their technology stack in the cloud. In his free time, he attends Muay Thai and Brazilian Jiu Jitsu classes.

Read More

Create high-quality datasets with Amazon SageMaker Ground Truth and FiftyOne

Create high-quality datasets with Amazon SageMaker Ground Truth and FiftyOne

This is a joint post co-written by AWS and Voxel51. Voxel51 is the company behind FiftyOne, the open-source toolkit for building high-quality datasets and computer vision models.

A retail company is building a mobile app to help customers buy clothes. To create this app, they need a high-quality dataset containing clothing images, labeled with different categories. In this post, we show how to repurpose an existing dataset via data cleaning, preprocessing, and pre-labeling with a zero-shot classification model in FiftyOne, and adjusting these labels with Amazon SageMaker Ground Truth.

You can use Ground Truth and FiftyOne to accelerate your data labeling project. We illustrate how to seamlessly use the two applications together to create high-quality labeled datasets. For our example use case, we work with the Fashion200K dataset, released at ICCV 2017.

Solution overview

Ground Truth is a fully self-served and managed data labeling service that empowers data scientists, machine learning (ML) engineers, and researchers to build high-quality datasets. FiftyOne by Voxel51 is an open-source toolkit for curating, visualizing, and evaluating computer vision datasets so that you can train and analyze better models by accelerating your use cases.

In the following sections, we demonstrate how to do the following:

  • Visualize the dataset in FiftyOne
  • Clean the dataset with filtering and image deduplication in FiftyOne
  • Pre-label the cleaned data with zero-shot classification in FiftyOne
  • Label the smaller curated dataset with Ground Truth
  • Inject labeled results from Ground Truth into FiftyOne and review labeled results in FiftyOne

Use case overview

Suppose you own a retail company and want to build a mobile application to give personalized recommendations to help users decide what to wear. Your prospective users are looking for an application that tells them which articles of clothing in their closet work well together. You see an opportunity here: if you can identify good outfits, you can use this to recommend new articles of clothing that complement the clothing a customer already owns.

You want to make things as easy as possible for the end-user. Ideally, someone using your application only needs to take pictures of the clothes in their wardrobe, and your ML models work their magic behind the scenes. You might train a general-purpose model or fine-tune a model to each user’s unique style with some form of feedback.

First, however, you need to identify what type of clothing the user is capturing. Is it a shirt? A pair of pants? Or something else? After all, you probably don’t want to recommend an outfit that has multiple dresses or multiple hats.

To address this initial challenge, you want to generate a training dataset consisting of images of various articles of clothing with various patterns and styles. To prototype with a limited budget, you want to bootstrap using an existing dataset.

To illustrate and walk you through the process in this post, we use the Fashion200K dataset released at ICCV 2017. It’s an established and well-cited dataset, but it isn’t directly suited for your use case.

Although articles of clothing are labeled with categories (and subcategories) and contain a variety of helpful tags that are extracted from the original product descriptions, the data is not systematically labeled with pattern or style information. Your goal is to turn this existing dataset into a robust training dataset for your clothing classification models. You need to clean the data, augmenting the labeling schema with style labels. And you want to do so quickly and with as little spend as possible.

Download the data locally

First, download the women.tar zip file and the labels folder (with all of its subfolders) following the instructions provided in the Fashion200K dataset GitHub repository. After you’ve unzipped them both, create a parent directory fashion200k, and move the labels and women folders into this. Fortunately, these images have already been cropped to the object detection bounding boxes, so we can focus on classification, rather than worry about object detection.

Despite the “200K” in its moniker, the women directory we extracted contains 338,339 images. To generate the official Fashion200K dataset, the dataset’s authors crawled more than 300,000 products online, and only products with descriptions containing more than four words made the cut. For our purposes, where the product description isn’t essential, we can use all of the crawled images.

Let’s look at how this data is organized: within the women folder, images are arranged by top-level article type (skirts, tops, pants, jackets, and dresses), and article type subcategory (blouses, t-shirts, long-sleeved tops).

Within the subcategory directories, there is a subdirectory for each product listing. Each of these contains a variable number of images. The cropped_pants subcategory, for instance, contains the following product listings and associated images.

The labels folder contains a text file for each top-level article type, for both train and test splits. Within each of these text files is a separate line for each image, specifying the relative file path, a score, and tags from the product description.

Because we’re repurposing the dataset, we combine all of the train and test images. We use these to generate a high-quality application-specific dataset. After we complete this process, we can randomly split the resulting dataset into new train and test splits.

Inject, view, and curate a dataset in FiftyOne

If you haven’t already done so, install open-source FiftyOne using pip:

pip install fiftyone

A best practice is to do so within a new virtual (venv or conda) environment. Then import the relevant modules. Import the base library, fiftyone, the FiftyOne Brain, which has built-in ML methods, the FiftyOne Zoo, from which we will load a model that will generate zero-shot labels for us, and the ViewField, which lets us efficiently filter the data in our dataset:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

You also want to import the glob and os Python modules, which will help us work with paths and pattern match over directory contents:

from glob import glob
import os

Now we’re ready to load the dataset into FiftyOne. First, we create a dataset named fashion200k and make it persistent, which allows us to save the results of computationally intensive operations, so we only need to compute said quantities once.

dataset = fo.Dataset("fashion200k", persistent=True)

We can now iterate through all subcategory directories, adding all the images within the product directories. We add a FiftyOne classification label to each sample with the field name article_type, populated by the image’s top-level article category. We also add both category and subcategory information as tags:

# Map dir categories to article type labels
labels_map = {
    "dresses": "dress",
    "jackets": "jacket",
    "pants": "pants",
    "skirts": "skirt",
    "tops": "top",
}

dataset_dir = "./fashion200k"

for d in glob(os.path.join(dataset_dir, "women", "*", "*")):
    _, _, category, subcategory = d.split("/")
    subcategory = subcategory.replace("_", " ")
    label = labels_map[category]

    dataset.add_samples(
        [
            fo.Sample(
                    filepath=filepath,
tags=[category, subcategory],   article_type=fo.Classification(label=label),
            )
            for filepath in glob(os.path.join(d, "*", "*"))
        ]
    )

At this point, we can visualize our dataset in the FiftyOne app by launching a session:

session = fo.launch_app(dataset)

We can also print out a summary of the dataset in Python by running print(dataset):

Name:        fashion200k
Media type:  image
Num samples: 338339
Persistent:  True
Tags:        []
Sample fields:
    id:            fiftyone.core.fields.ObjectIdField
    filepath:      fiftyone.core.fields.StringField
    tags:          fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    article_type:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)

We can also add the tags from the labels directory to the samples in our dataset:

working_dir = os.getcwd()

tags = {
f: set(t) 
for f, t in zip(*dataset.values(["filepath", "tags"]))
}


for label_file in glob("fashion200k/labels/*"):
    with open(label_file, 'r') as f:
        for line in f.readlines():
            line_list = line.split()
            fp = os.path.join(
                working_dir, 
                dataset_dir, 
                line_list[0]
            )
          
           # add new tags
          new_tags_for_fp = line_list[2:]
          tags[fp].update(new_tags_for_fp)

# Update tags
dataset.set_values("tags", tags, key_field="filepath")

Looking at the data, a few things become clear:

  • Some of the images are fairly grainy, with low resolution. This is likely because these images were generated by cropping initial images in object detection bounding boxes.
  • Some clothes are worn by a person, and some are photographed on their own. These details are encapsulated by the viewpoint property.
  • A lot of the images of the same product are very similar, so at least initially, including more than one image per product may not add much predictive power. For the most part, the first image of each product (ending in _0.jpeg) is the cleanest.

Initially, we might want to train our clothing style classification model on a controlled subset of these images. To this end, we use high-resolution images of our products, and limit our view to one representative sample per product.

First, we filter out the low-resolution images. We use the compute_metadata() method to compute and store image width and height, in pixels, for each image in the dataset. We then employ the FiftyOne ViewField to filter out images based on the minimum allowed width and height values. See the following code:

dataset.compute_metadata()

min_width = 200
min_height = 300

width_filter = F("metadata.width") > min_width
height_filter = F("metadata.height") > min_height


high_res_view = dataset.match(
    width_filter & height_filter
)

session.view = high_res_view.view()

This high-resolution subset has just under 200,000 samples.

From this view, we can create a new view into our dataset containing only one representative sample (at most) for each product. We use the ViewField once again, pattern matching for file paths that end with _0.jpeg:

representative_view = high_res_view.match(
    F("filepath").ends_with("_0.jpeg")
)

Let’s view a randomly shuffled ordering of images in this subset:

session.view = representative_view.shuffle()

Remove redundant images in the dataset

This view contains 66,297 images, or just over 19% of the original dataset. When we look at the view, however, we see that there are many very similar products. Keeping all of these copies will likely only add cost to our labeling and model training, without noticeably improving performance. Instead, let’s get rid of the near duplicates to create a smaller dataset that still packs the same punch.

Because these images are not exact duplicates, we can’t check for pixel-wise equality. Fortunately, we can use the FiftyOne Brain to help us clean our dataset. In particular, we’ll compute an embedding for each image—a lower-dimensional vector representing the image—and then look for images whose embedding vectors are close to each other. The closer the vectors, the more similar the images.

We use a CLIP model to generate a 512-dimensional embedding vector for each image, and store these embeddings in the field embeddings on the samples in our dataset:

## load model
model = foz.load_zoo_model("clip-vit-base32-torch")
 
## compute embeddings
representative_view.compute_embeddings(
model, 
embeddings_field="embedding"
)

Then we compute the closeness between embeddings, using cosine similarity, and assert that any two vectors whose similarity is greater than some threshold are likely to be near duplicates. Cosine similarity scores lie in the range [0, 1], and looking at the data, a threshold score of thresh=0.5 seems to be about right. Again, this doesn’t need to be perfect. A few near-duplicate images are not likely to ruin our predictive power, and throwing away a few non-duplicate images doesn’t materially impact model performance.

results = fob.compute_similarity(
view,
embeddings="embedding",
brain_key="sim",
metric="cosine"
)

results.find_duplicates(thresh=0.5)

We can view the purported duplicates to verify that they are indeed redundant:

## view the duplicates, paired up, 
## to make sure it is doing what we think it is doing
dup_view = results.duplicates_view()
session = fo.launch_app(dup_view)

When we’re happy with the result and believe these images are indeed near duplicates, we can pick one sample from each set of similar samples to keep, and ignore the others:

## get one image from each group of duplicates
dup_rep_ids = list(results.neighbors_map.keys())

# get ids of non-duplicates
non_dup_ids = representative_view.exclude(
dup_view.values("id")
).values("id")

# ids to keep
ids = dup_rep_ids + non_dup_ids

# create view from ids
non_dup_view = representative_view[ids]

Now this view has 3,729 images. By cleaning the data and identifying a high-quality subset of the Fashion200K dataset, FiftyOne lets us restrict our focus from more than 300,000 images to just under 4,000, representing a reduction by 98%. Using embeddings to remove near-duplicate images alone brought our total number of images under consideration down by more than 90%, with little if any effect on any models to be trained on this data.

Before pre-labeling this subset, we can better understand the data by visualizing the embeddings we have already computed. We can use the FiftyOne Brain’s built-in compute_visualization() method, which employs the uniform manifold approximation (UMAP) technique to project the 512-dimensional embedding vectors into two-dimensional space so we can visualize them:

fob.compute_visualization(
    non_dup_view, 
    embeddings="embedding", 
    brain_key="vis"
)

We open a new Embeddings panel in the FiftyOne app and coloring by article type, and we can see that these embeddings roughly encode a notion of article type (among other things!).

Now we are ready to pre-label this data.

Inspecting these highly unique, high-resolution images, we can generate a decent initial list of styles to use as classes in our pre-labeling zero-shot classification. Our goal in pre-labeling these images is not to necessarily label each image correctly. Rather, our goal is to provide a good starting point for human annotators so we can reduce labeling time and cost.

styles = [
 "graphic", 
 "lettered", 
 "plain", 
 "striped", 
 "polka dot", 
 "floral", 
 "jersey", 
 "checkered", 
 "denim", 
 "plaid",
 "houndstooth",
 "chevron", 
 "paisley", 
 "animal print", 
 "quatrefoil",
 “camouflage”
]

We can then instantiate a zero-shot classification model for this application. We use a CLIP model, which is a general-purpose model trained on both images and natural language. We instantiate a CLIP model with the text prompt “Clothing in the style,” so that given an image, the model will output the class for which “Clothing in the style [class]” is the best fit. CLIP is not trained on retail or fashion-specific data, so this won’t be perfect, but it can save you in labeling and annotation costs.

zero_shot_model = foz.load_zoo_model(
 "clip-vit-base32-torch",
 text_prompt="Clothing in the style ",
 classes=styles,
)

We then apply this model to our reduced subset and store the results in an article_style field:

non_dup_view.apply_model(
zero_shot_model, 
label_field="article_style"
)

Launching the FiftyOne App once again, we can visualize the images with these predicted style labels. We sort by prediction confidence so we view the most confident style predictions first:

high_conf_view = non_dup_view.sort_by(
 "article_style.confidence", reverse=True
)

session.view = high_conf_view

We can see that the highest confidence predictions seem to be for “jersey,” “animal print,” “polka dot,” and “lettered” styles. This makes sense, because these styles are relatively distinct. It also seems like, for the most part, the predicted style labels are accurate.

We can also look at the lowest-confidence style predictions:

low_conf_view = non_dup_view.sort_by(
"article_style.confidence"
)
session.view = low_conf_view

For some of these images, the appropriate style category is in the provided list, and the article of clothing is incorrectly labeled. The first image in the grid, for instance, should clearly be “camouflage” and not “chevron.” In other cases, however, the products don’t fit neatly into the style categories. The dress in the second image in the second row, for example, is not exactly “striped,” but given the same labeling options, a human annotator might also have been conflicted. As we build out our dataset, we need to decide whether to remove edge cases like these, add new style categories, or augment the dataset.

Export the final dataset from FiftyOne

Export the final dataset with the following code:

# The directory to which to write the exported dataset
export_dir = "200kFashionDatasetExportResult"

# The name of the sample field containing the label that you wish to export
# Used when exporting labeled datasets (e.g., classification or detection)
label_field = "article_style"  # for example

# The type of dataset to export
# Any subclass of `fiftyone.types.Dataset` is supported
dataset_type = fo.types.COCODetectionDataset  # for example

# Export the dataset
high_conf_view.export(
    export_dir=export_dir,
    dataset_type=dataset_type,
    label_field=label_field,
)

We can export a smaller dataset, for example, 16 images, to the folder 200kFashionDatasetExportResult-16Images. We create a Ground Truth adjustment job using it:

# The directory to which to write the exported dataset
export_dir = "200kFashionDatasetExportResult-16Images"

# The name of the sample field containing the label that you wish to export
# Used when exporting labeled datasets (e.g., classification or detection)
label_field = "article_style"  # for example

# The type of dataset to export
# Any subclass of `fiftyone.types.Dataset` is supported
dataset_type = fo.types.COCODetectionDataset  # for example

# Export the dataset
high_conf_view.take(16).export(
    export_dir=export_dir,
    dataset_type=dataset_type,
    label_field=label_field,
)

Upload the revised dataset, convert the label format to Ground Truth, upload to Amazon S3, and create a manifest file for the adjustment job

We can convert the labels in the dataset to match the output manifest schema of a Ground Truth bounding box job, and upload the images to an Amazon Simple Storage Service (Amazon S3) bucket to launch a Ground Truth adjustment job:

import json
# open the labels.json file of ground truth bounding box 
#labels from the exported dataset
f = open('200kFashionDatasetExportResult-16Images/labels.json')
data = json.load(f)

# provide your aws s3 bucket name, prefix, and aws credentials
bucket_name = 'sagemaker-your-preferred-s3-bucket'
s3_prefix = 'sagemaker-your-preferred-s3-prefix'

session = boto3.Session(
    aws_access_key_id='<AWS_ACCESS_KEY_ID>',
    aws_secret_access_key='<AWS_SECRET_ACCESS_KEY>'
)
s3 = session.resource('s3')

for image in data['images']:
    file_name = image['file_name']
    file_id = file_name[:-4]
    image_id = image['id']
    
    # upload the image to s3
    s3.meta.client.upload_file('200kFashionDatasetExportResult-16Images/data/'+image['file_name'], bucket_name, s3_prefix+'/'+image['file_name'])
    
    gt_annotations = []
    confidence = 0.00
    
    for annotation in data['annotations']:
        if annotation['image_id'] == image['id']:
            confidence = annotation['score']
            gt_annotation = {
                "class_id": gt_class_array.index(style_category), 
                # convert the original ground_truth bounding box 
                #label to predicted style label
                "left": annotation['bbox'][0],
                "top": annotation['bbox'][1],
                "width": annotation['bbox'][2],
                "height": annotation['bbox'][3]
            }
            
            gt_annotations.append(gt_annotation)
            break
    
    gt_metadata_objects = []
    for gt_annotation in gt_annotations:
        gt_metadata_objects.append({
            "confidence": confidence
        })
    
    gt_label_attribute_metadata = {
        "class-map": gt_class_map,
        "objects": gt_metadata_objects,
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2023-02-19T00:23:25.339582",
        "job-name": "labeling-job/200k-fashion-origin"
    }
    
    gt_output = {
        "source-ref": f"s3://{bucket_name}/{s3_prefix}/{image['file_name']}",
        "200k-fashion-origin": {
            "image_size": [
                {
                    "width": image['width'],
                    "height": image['height'],
                    "depth": 3
                  }
      
            ],
            "annotations": gt_annotations
        },
        "200k-fashion-origin-metadata": gt_label_attribute_metadata
    }
    

    # write to the manifest file    
    with open(200k-fashion-output.manifest', 'a') as output_file:
        output_file.write(json.dumps(gt_output) + "n")

Upload the manifest file to Amazon S3 with the following code:

s3.meta.client.upload_file(200k-fashion-output.manifest', bucket_name, s3_prefix+'/200k-fashion-output.manifest')

Create corrected styled labels with Ground Truth

To annotate your data with style labels using Ground Truth, complete the necessary steps to start a bounding box labeling job by following the procedure outlined in the Getting Started with Ground Truth guide with the dataset in the same S3 bucket.

  1. On the SageMaker console, create a Ground Truth labeling job.
  2. Set the Input dataset location to be the manifest that we created in the preceding steps.
  3. Specify an S3 path for Output dataset location.
  4. For IAM Role, choose Enter a custom IAM role ARN, then enter the role ARN.
  5. For Task category, choose Image and select Bounding box.
  6. Choose Next.
  7. In the Workers section, choose the type of workforce you would like to use.
    You can select a workforce through Amazon Mechanical Turk, third-party vendors, or your own private workforce. For more details about your workforce options, see Create and Manage Workforces.
  8. Expand Existing-labels display options and select I want to display existing labels from the dataset for this job.
  9. For Label attribute name, choose the name from your manifest that corresponds to the labels that you want to display for adjustment.
    You will only see label attribute names for labels that match the task type you selected in the previous steps.
  10. Manually enter the labels for Bounding box labeling tool.
    The labels must contain the same labels used in the public dataset. You can add new labels. The following screenshot shows how you can choose the workers and configure the tool for your labeling job.
  11. Choose Preview to preview the image and original annotations.

We have now created a labeling job in Ground Truth. After our job is complete, we can load the newly generated labeled data into FiftyOne. Ground Truth produces output data in a Ground Truth output manifest. For more details on the output manifest file, see Bounding Box Job Output. The following code shows an example of this output manifest format:

{
    "source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_image.png",
    "bounding-box-attribute-name":
    {
        "image_size": [{ "width": 500, "height": 400, "depth":3}],
        "annotations":
        [
            {"class_id": 0, "left": 111, "top": 134,
                    "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250,
                     "width": 30, "height": 30},
            {"class_id": 5, "left": 20, "top": 20,
                     "width": 30, "height": 30}
        ]
    },
    "bounding-box-attribute-name-metadata":
    {
        "objects":
        [
            {"confidence": 0.8},
            {"confidence": 0.9},
            {"confidence": 0.9}
        ],
        "class-map":
        {
            "0": "jersey",
            "5": "polka dot"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2018-10-18T22:18:13.527256",
        "job-name": "identify-fashion-set"
    },
    "adjusted-bounding-box":
    {
        "image_size": [{ "width": 500, "height": 400, "depth":3}],
        "annotations":
        [
            {"class_id": 0, "left": 110, "top": 135,
                    "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250,
                     "width": 30, "height": 30},
            {"class_id": 5, "left": 10, "top": 10,
                     "width": 30, "height": 30}
        ]
    },
    "adjusted-bounding-box-metadata":
    {
        "objects":
        [
            {"confidence": 0.8},
            {"confidence": 0.9},
            {"confidence": 0.9}
        ],
        "class-map":
        {
            "0": "dog",
            "5": "bone"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2018-11-20T22:18:13.527256",
        "job-name": "adjust-identify-fashion-set",
        "adjustment-status": "adjusted"
    }
 }

Review labeled results from Ground Truth in FiftyOne

After the job is complete, download the output manifest of the labeling job from Amazon S3.

Read the output manifest file:

with open('<path-to-your-output.manifest>', 'r') as fh:
    adjustment_manifest_lines = fh.readlines()

Create a FiftyOne dataset and convert the manifest lines to samples in the dataset:

def get_classification_labels(manifest_line, dataset, attr_name) -> fo.Classifications:
    label_attribute_data = manifest_line.get(attr_name)
    metadata = manifest_line.get(f"{attr_name}-metadata")
 
    annotations = label_attribute_data.get("annotations")
 
    image_data = label_attribute_data.get("image_size")[0]
    width = image_data.get("width")
    height = image_data.get("height")

    predictions = []
    for i, annotation in enumerate(annotations):
        label = metadata.get("class-map").get(str(annotation.get("class_id")))

        confidence = metadata.get("objects")[i].get("confidence")
        
        prediction = fo.Classification(label=label, confidence=confidence)

        predictions.append(prediction)

    return fo.Classifications(classifications=predictions)

def get_bounding_box_labels(manifest_line, dataset, attr_name) -> fo.Detections:
    label_attribute_data = manifest_line.get(attr_name)
    metadata = manifest_line.get(f"{attr_name}-metadata")
 
    annotations = label_attribute_data.get("annotations")
 
    image_data = label_attribute_data.get("image_size")[0]
    width = image_data.get("width")
    height = image_data.get("height")

    detections = []
    for i, annotation in enumerate(annotations):
        label = metadata.get("class-map").get(str(annotation.get("class_id")))

        confidence = metadata.get("objects")[i].get("confidence")

        # Bounding box coordinates should be relative values
        # in [0, 1] in the following format:
        # [top-left-x, top-left-y, width, height]
        bounding_box = [
            annotation.get("left") / width,
            annotation.get("top") / height,
            annotation.get("width") / width,
            annotation.get("height") / height,
        ]

        detection = fo.Detection(
            label=label, bounding_box=bounding_box, confidence=confidence
        )
        
        detections.append(detection)

    return fo.Detections(detections=detections)
    
def get_sample_from_manifest_line(manifest_line, dataset, attr_name):
    """
    For each line in manifest, transform annotations into Fiftyone format
    Args:
        line: manifest line
    Output:
        Fiftyone image sample
    """
    file_name = manifest_line.get("source-ref")[5:].split("/")[-1]
    file_loc = f'200kFashionDatasetExportResult-16Images/data/{file_name}'

    sample = fo.Sample(filepath=file_loc)

    sample['ground_truth'] = get_bounding_box_labels(
        manifest_line=manifest_line, dataset=dataset, attr_name=attr_name
    )
    sample["prediction"] = get_classification_labels(
        manifest_line=manifest_line, dataset=dataset, attr_name=attr_name
    )

    return sample

adjustment_dataset = fo.Dataset("adjustment-job-dataset")

samples = [
            get_sample_from_manifest_line(
                manifest_line=json.loads(manifest_line), dataset=adjustment_dataset, attr_name='smgt-fiftyone-style-adjustment-job'
            )
            for manifest_line in adjustment_manifest_lines
        ]

adjustment_dataset.add_samples(samples)

session = fo.launch_app(adjustment_dataset)

You can now see high-quality labeled data from Ground Truth in FiftyOne.

Conclusion

In this post, we showed how to build high-quality datasets by combining the power of FiftyOne by Voxel51, an open-source toolkit that allows you to manage, track, visualize, and curate your dataset, and Ground Truth, a data labeling service that allows you to efficiently and accurately label the datasets required for training ML systems by providing access to multiple built-in task templates and access to a diverse workforce through Mechanical Turk, third-party vendors, or your own private workforce.

We encourage you to try out this new functionality by installing a FiftyOne instance and using the Ground Truth console to get started. To learn more about Ground Truth, refer to Label Data, Amazon SageMaker Data Labeling FAQs, and the AWS Machine Learning Blog.

Connect with the Machine Learning & AI community if you have any questions or feedback!

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!


About the Authors

Shalendra Chhabra is currently Head of Product Management for Amazon SageMaker Human-in-the-Loop (HIL) Services. Previously, Shalendra incubated and led Language and Conversational Intelligence for Microsoft Teams Meetings, was EIR at Amazon Alexa Techstars Startup Accelerator, VP of Product and Marketing at Discuss.io, Head of Product and Marketing at Clipboard (acquired by Salesforce), and Lead Product Manager at Swype (acquired by Nuance). In total, Shalendra has helped build, ship, and market products that have touched more than a billion lives.

Jacob Marks is a Machine Learning Engineer and Developer Evangelist at Voxel51, where he helps bring transparency and clarity to the world’s data. Prior to joining Voxel51, Jacob founded a startup to help emerging musicians connect and share creative content with fans. Before that, he worked at Google X, Samsung Research, and Wolfram Research. In a past life, Jacob was a theoretical physicist, completing his PhD at Stanford, where he investigated quantum phases of matter. In his free time, Jacob enjoys climbing, running, and reading science fiction novels.

Jason Corso is co-founder and CEO of Voxel51, where he steers strategy to help bring transparency and clarity to the world’s data through state-of-the-art flexible software. He is also a Professor of Robotics, Electrical Engineering, and Computer Science at the University of Michigan, where he focuses on cutting-edge problems at the intersection of computer vision, natural language, and physical platforms. In his free time, Jason enjoys spending time with his family, reading, being in nature, playing board games, and all sorts of creative activities.

Brian Moore is co-founder and CTO of Voxel51, where he leads technical strategy and vision. He holds a PhD in Electrical Engineering from the University of Michigan, where his research was focused on efficient algorithms for large-scale machine learning problems, with a particular emphasis on computer vision applications. In his free time, he enjoys badminton, golf, hiking, and playing with his twin Yorkshire Terriers.

Zhuling Bai is a Software Development Engineer at Amazon Web Services. She works on developing large-scale distributed systems to solve machine learning problems.

Read More

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

The world of artificial intelligence (AI) and machine learning (ML) has been witnessing a paradigm shift with the rise of generative AI models that can create human-like text, images, code, and audio. Compared to classical ML models, generative AI models are significantly bigger and more complex. However, their increasing complexity also comes with high costs for inference and a growing need for powerful compute resources. The high cost of inference for generative AI models can be a barrier to entry for businesses and researchers with limited resources, necessitating the need for more efficient and cost-effective solutions. Furthermore, the majority of generative AI use cases involve human interaction or real-world scenarios, necessitating hardware that can deliver low-latency performance. AWS has been innovating with purpose-built chips to address the growing need for powerful, efficient, and cost-effective compute hardware.

Today, we are excited to announce that Amazon SageMaker supports AWS Inferentia2 (ml.inf2) and AWS Trainium (ml.trn1) based SageMaker instances to host generative AI models for real-time and asynchronous inference. ml.inf2 instances are available for model deployment on SageMaker in US East (Ohio) and ml.trn1 instances in US East (N. Virginia).

You can use these instances on SageMaker to achieve high performance at a low cost for generative AI models, including large language models (LLMs), Stable Diffusion, and vision transformers. In addition, you can use Amazon SageMaker Inference Recommender to help you run load tests and evaluate the price-performance benefits of deploying your model on these instances.

You can use ml.inf2 and ml.trn1 instances to run your ML applications on SageMaker for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more. You can easily get started by specifying ml.trn1 or ml.inf2 instances when configuring your SageMaker endpoint. You can use ml.trn1 and ml.inf2 compatible AWS Deep Learning Containers (DLCs) for PyTorch, TensorFlow, Hugging Face, and large model inference (LMI) to easily get started. For the full list with versions, see Available Deep Learning Containers Images.

In this post, we show the process of deploying a large language model on AWS Inferentia2 using SageMaker, without requiring any extra coding, by taking advantage of the LMI container. We use the GPT4ALL-J, a fine-tuned GPT-J 7B model that provides a chatbot style interaction.

Overview of ml.trn1 and ml.inf2 instances

ml.trn1 instances are powered by the Trainium accelerator, which is purpose built mainly for high-performance deep learning training of generative AI models, including LLMs. However, these instances also support inference workloads for models that are even larger than what fits into Inf2. The largest instance size, trn1.32xlarge instances, features 16 Trainium accelerators with 512 GB of accelerator memory in a single instance delivering up to 3.4 petaflops of FP16/BF16 compute power. 16 Trainium accelerators are connected with ultra-high-speed NeuronLinkv2 for streamlined collective communications.

ml.Inf2 instances are powered by the AWS Inferentia2 accelerator, a purpose built accelerator for inference. It delivers three times higher compute performance, up to four times higher throughput, and up to 10 times lower latency compared to first-generation AWS Inferentia. The largest instance size, Inf2.48xlarge, features 12 AWS Inferentia2 accelerators with 384 GB of accelerator memory in a single instance for a combined compute power of 2.3 petaflops for BF16/FP16. It enables you to deploy up to a 175-billion-parameter model in a single instance. Inf2 is the only inference-optimized instance to offer this interconnect, a feature that is only available in more expensive training instances. For ultra-large models that don’t fit into a single accelerator, data flows directly between accelerators with NeuronLink, bypassing the CPU completely. With NeuronLink, Inf2 supports faster distributed inference and improves throughput and latency.

Both AWS Inferentia2 and Trainium accelerators have two NeuronCores-v2, 32 GB HBM memory stacks, and dedicated collective-compute engines, which automatically optimize runtime by overlapping computation and communication when doing multi-accelerator inference. For more details on the architecture, refer to Trainium and Inferentia devices.

The following diagram shows an example architecture using AWS Inferentia2.

AWS Neuron SDK

AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium based instances. AWS Neuron includes a deep learning compiler, runtime, and tools that are natively integrated into TensorFlow and PyTorch. With Neuron, you can develop, profile, and deploy high-performance ML workloads on ml.trn1 and ml.inf2.

The Neuron Compiler accepts ML models in various formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. The Neuron compiler is invoked within the ML framework, where ML models are sent to the compiler by the Neuron framework plugin. The resulting compiler artifact is called a NEFF file (Neuron Executable File Format) that in turn is loaded by the Neuron runtime to the Neuron device.

The Neuron runtime consists of kernel driver and C/C++ libraries, which provide APIs to access AWS Inferentia and Trainium Neuron devices. The Neuron ML frameworks plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on the NeuronCores. The Neuron runtime loads compiled deep learning models (NEFF) to the Neuron devices and is optimized for high throughput and low latency.

Host NLP models using SageMaker ml.inf2 instances

Before we dive deep into serving LLMs with transformers-neuronx, which is an open-source library to shard the model’s large weight matrices onto multiple NeuronCores, let’s briefly go through the typical deployment flow for a model that can fit onto the single NeuronCore.

Check the list of supported models to ensure the model is supported on AWS Inferentia2. Next, the model needs to be pre-compiled by the Neuron Compiler. You can use a SageMaker notebook or an Amazon Elastic Compute Cloud (Amazon EC2) instance to compile the model. You can use the SageMaker Python SDK to deploy models using popular deep learning frameworks such as PyTorch, as shown in the following code. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support auto scaling.

from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    source_dir="code",
    entry_point="inference.py",
    image_uri=ecr_image
)

predictor = pytorch_model.deploy(
    initial_instance_count=1, 
    instance_type="ml.inf2.xlarge"
)

Refer to Developer Flows for more details on typical development flows of Inf2 on SageMaker with sample scripts.

Host LLMs using SageMaker ml.inf2 instances

Large language models with billions of parameters are often too big to fit on a single accelerator. This necessitates the use of model parallel techniques for hosting LLMs across multiple accelerators. Another crucial requirement for hosting LLMs is the implementation of a high-performance model-serving solution. This solution should efficiently load the model, manage partitioning, and seamlessly serve requests via HTTP endpoints.

SageMaker includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference. For resources to get started with LMI on SageMaker, refer to Model parallelism and large model inference. SageMaker maintains DLCs with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. These specialized DLCs are referred to as SageMaker LMI containers.

SageMaker LMI containers use DJLServing, a model server that is integrated with the transformers-neuronx library to support tensor parallelism across NeuronCores. To learn more about how DJLServing works, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. The DJL model server and transformers-neuronx library serve as core components of the container, which also includes the Neuron SDK. This setup facilitates the loading of models onto AWS Inferentia2 accelerators, parallelizes the model across multiple NeuronCores, and enables serving via HTTP endpoints.

The LMI container supports loading models from an Amazon Simple Storage Service (Amazon S3) bucket or Hugging Face Hub. The default handler script loads the model, compiles and converts it into a Neuron-optimized format, and loads it. To use the LMI container to host LLMs, we have two options:

  • A no-code (preferred) – This is the easiest way to deploy an LLM using an LMI container. In this method, you can use the provided default handler and just pass the model name and the parameters required in serving.properties file to load and host the model. To use the default handler, we provide the entryPoint parameter as djl_python.transformers-neuronx.
  • Bring your own script – In this approach, you have the option to create your own model.py file, which contains the code necessary for loading and serving the model. This file acts as an intermediary between the DJLServing APIs and the transformers-neuronx APIs. To customize the model loading process, you can provide serving.properties with configurable parameters. For a comprehensive list of available configurable parameters, refer to All DJL configuration options. Here is an example of a model.py file.

Runtime architecture

The tensor_parallel_degree property value determines the distribution of tensor parallel modules across multiple NeuronCores. For instance, inf2.24xlarge has six AWS Inferentia2 accelerators. Each AWS Inferentia2 accelerator has two NeuronCores. Each NeuronCore has a dedicated high bandwidth memory (HBM) of 16 GB storing tensor parallel modules. With a tensor parallel degree of 4, the LMI will allocate three model copies of the same model, each utilizing four NeuronCores. As shown in the following diagram, when the LMI container starts, the model will be loaded and traced first in the CPU addressable memory. When the tracing is complete, the model is partitioned across the NeuronCores based on the tensor parallel degree.

LMI uses DJLServing as its model serving stack. After the container’s health check passes in SageMaker, the container is ready to serve the inference request. DJLServing launches multiple Python processes equivalent to the TOTAL NUMBER OF NEURON CORES/TENSOR_PARALLEL_DEGREE. Each Python process contains threads in C++ equivalent to TENSOR_PARALLEL_DEGREE. Each C++ threads holds one shard of the model on one NeuronCore.

Many practitioners (Python process) tend to run inference sequentially when the server is invoked with multiple independent requests. Although it’s easier to set up, it’s usually not the best practice to utilize the accelerator’s compute power. To address this, DJLServing offers the built-in optimizations of dynamic batching to combine these independent inference requests on the server side to form a larger batch dynamically to increase throughput. All the requests reach the dynamic batcher first before entering the actual job queues to wait for inference. You can set your preferred batch sizes for dynamic batching using the batch_size settings in serving.properties. You can also configure max_batch_delay to specify the maximum delay time in the batcher to wait for other requests to join the batch based on your latency requirements. The throughput also depends on the number of model copies and the Python process groups launched in the container. As shown in the following diagram, with the tensor parallel degree set to 4, the LMI container launches three Python process groups, each holding the full copy of the model. This allows you to increase the batch size and get higher throughput.

SageMaker notebook for deploying LLMs

In this section, we provide a step-by-step walkthrough of deploying GPT4All-J, a 6-billion-parameter model that is 24 GB in FP32. GPT4All-J is a popular chatbot that has been trained on a vast variety of interaction content like word problems, dialogs, code, poems, songs, and stories. GPT4all-J is a fine-tuned GPT-J model that generates responses similar to human interactions.

The complete notebook for this example is provided on GitHub. We can use the SageMaker Python SDK to deploy the model to an Inf2 instance. We use the provided default handler to load the model. With this, we just need to provide a servings.properties file. This file has the required configurations for the DJL model server to download and host the model. We can specify the name of the Hugging Face model using the model_id parameter to download the model directly from the Hugging Face repo. Alternatively, you can download the model from Amazon S3 by providing the s3url parameter. The entryPoint parameter is configured to point to the library to load the model. For more details on djl_python.fastertransformer, refer to the GitHub code.

The tensor_parallel_degree property value determines the distribution of tensor parallel modules across multiple devices. For instance, with 12 NeuronCores and a tensor parallel degree of 4, LMI will allocate three model copies, each utilizing four NeuronCores. You can also define the precision type using the property dtype. n_position parameter defines the sum of max input and output sequence length for the model. See the following code:

%%writefile serving.properties# Start writing content here
engine=Python
option.entryPoint=djl_python.transformers-neuronx
#option.model_id=nomic-ai/gpt4all-j
option.s3url = {{s3url}}
option.tensor_parallel_degree=2
option.model_loading_timeout=2400
option.n_positions=512

Construct the tarball containing serving.properties and upload it to an S3 bucket. Although the default handler is used in this example, you can develop a model.py file for customizing the loading and serving process. If there are any packages that need installation, include them in the requirements.txt file. See the following code:

%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

Retrieve the DJL container image and create the SageMaker model:

##Retrieve djl container image
image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_session.region_name,
        version="0.21.0"
    )
image_uri = image_uri.split(":")[0] + ":" + "0.22.1-neuronx-sdk2.9.0"

model = Model(image_uri=image_uri, model_data=code_artifact, env=env, role=role)

Next, we create the SageMaker endpoint with the model configuration defined earlier. The container downloads the model into the /tmp space because SageMaker maps the /tmp to Amazon Elastic Block Store (Amazon EBS). We need to add a volume_size parameter to ensure the /tmp directory has enough space to download and compile the model. We set container_startup_health_check_timeout to 3,600 seconds to ensure the health check starts after the model is ready. We use the ml.inf2.8xlarge instance. See the following code:

instance_type = "ml.inf2.8xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")


model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             container_startup_health_check_timeout=3600,
             volume_size=256
            )

After the SageMaker endpoint has been created, we can make real-time predictions against SageMaker endpoints using the Predictor object:

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

predictor.predict(
    {"inputs": "write a blog on new York", "parameters": {}}
)

Clean up

Delete the endpoints to save costs after you finish your tests:

# - Delete the end point
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()

Conclusion

In this post, we showcased the newly launched capability of SageMaker, which now supports ml.inf2 and ml.trn1 instances for hosting generative AI models. We demonstrated how to deploy GPT4ALL-J, a generative AI model, on AWS Inferentia2 using SageMaker and the LMI container, without writing any code. We also showcased how you can use DJLServing and transformers-neuronx to load a model, partition it, and serve.

Inf2 instances provide the most cost-effective way to run generative AI models on AWS. For performance details, refer to Inf2 Performance.

Check out the GitHub repo for an example notebook. Try it out and let us know if you have any questions!


About the Authors

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with Machine Learning Startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML Inference and low-code ML. He has worked on projects in different domains, including Natural Language Processing and Computer Vision.

Hiroshi Tokoyo is a Solutions Architect at AWS Annapurna Labs. Based in Japan, he joined Annapurna Labs even before the acquisition by AWS and has consistently helped customers with Annapurna Labs technology. His recent focus is on Machine Learning solutions based on purpose-built silicon, AWS Inferentia and Trainium.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.

Varun Syal is a Software Development Engineer with AWS Sagemaker working on critical customer facing features for the ML Inference platform. He is passionate about working in the Distributed Systems and AI space. In his spare time, he likes reading and gardening.

Read More

Automate the deployment of an Amazon Forecast time-series forecasting model

Automate the deployment of an Amazon Forecast time-series forecasting model

Time series forecasting refers to the process of predicting future values of time series data (data that is collected at regular intervals over time). Simple methods for time series forecasting use historical values of the same variable whose future values need to be predicted, whereas more complex, machine learning (ML)-based methods use additional information, such as the time series data of related variables.

Amazon Forecast is an ML-based time series forecasting service that includes algorithms that are based on over 20 years of forecasting experience used by Amazon.com, bringing the same technology used at Amazon to developers as a fully managed service, removing the need to manage resources. Forecast uses ML to learn not only the best algorithm for each item, but also the best ensemble of algorithms for each item, automatically creating the best model for your data.

This post describes how to deploy recurring Forecast workloads (time series forecasting workloads) with no code using AWS CloudFormation, AWS Step Functions, and AWS Systems Manager. The method presented here helps you build a pipeline that allows you to use the same workflow starting from the first day of your time series forecasting experimentation through the deployment of the model into production.

Time series forecasting using Forecast

The workflow for Forecast involves the following common concepts:

  • Importing datasets – In Forecast, a dataset group is a collection of datasets, schema, and forecast results that go together. Each dataset group can have up to three datasets, one of each dataset type: target time series (TTS), related time series (RTS), and item metadata. A dataset is a collection of files that contain data that is relevant for a forecasting task. A dataset must conform to the schema defined within Forecast. For more details, refer to Importing Datasets.
  • Training predictors – A predictor is a Forecast-trained model used for making forecasts based on time series data. During training, Forecast calculates accuracy metrics that you use to evaluate the predictor and decide whether to use the predictor to generate a forecast. For more information, refer to Training Predictors.
  • Generating forecasts – You can then use the trained model for generating forecasts for a future time horizon, known as the forecasting horizon. Forecast provides forecasts at various specified quantiles. For example, a forecast at the 0.90 quantile will estimate a value that is lower than the observed value 90% of the time. By default, Forecast uses the following values for the predictor forecast types: 0.1 (P10), 0.5 (P50), and 0.9 (P90). Forecasts at various quantiles are typically used to provide a prediction interval (an upper and lower bound for forecasts) to account for forecast uncertainty.

You can implement this workflow in Forecast either from the AWS Management Console, the AWS Command Line Interface (AWS CLI), via API calls using Python notebooks, or via automation solutions. The console and AWS CLI methods are best suited for quick experimentation to check the feasibility of time series forecasting using your data. The Python notebook method is great for data scientists already familiar with Jupyter notebooks and coding, and provides maximum control and tuning. However, the notebook-based method is difficult to operationalize. Our automation approach facilitates rapid experimentation, eliminates repetitive tasks, and allows easier transition between various environments (development, staging, production).

In this post, we describe an automation approach to using Forecast that allows you to use your own data and provides a single workflow that you can use seamlessly throughout the lifecycle of the development of your forecasting solution, from the first days of experimentation through the deployment of the solution in your production environment.

Solution overview

In the following sections, we describe a complete end-to-end workflow that serves as a template to follow for automated deployment of time series forecasting models using Forecast. This workflow creates forecasted data points from an open-source input dataset; however, you can use the same workflow for your own data, as long as you can format your data according to the steps outlined in this post. After you upload the data, we walk you through the steps to create Forecast dataset groups, import data, train ML models, and produce forecasted data points on future unseen time horizons from raw data. All of this is possible without having to write or compile code.

The following diagram illustrates the forecasting workflow.

Cyclical forecasting workflow

The solution is deployed using two CloudFormation templates: the dependencies template and the workload template. CloudFormation enables you to perform AWS infrastructure deployments predictably and repeatedly by using templates describing the resources to be deployed. A deployed template is referred to as a stack. We’ve taken care of defining the infrastructure in the solution for you in the two provided templates. The dependencies template defines prerequisite resources used by the workload template, such as an Amazon Simple Storage Service (Amazon S3) bucket for object storage and AWS Identity and Access Management (IAM) permissions for AWS API actions. The resources defined in the dependencies template may be shared by multiple workload templates. The workload template defines the resources used to ingest data, train a predictor, and generate a forecast.

Deployment workflow

Deploy the dependencies CloudFormation template

First, let’s deploy the dependencies template to create our prerequisite resources. The dependencies template deploys an optional S3 bucket, AWS Lambda functions, and IAM roles. Amazon S3 is a low-cost, highly available, resilient, object storage service. We use an S3 bucket in this solution to store source data and trigger the workflow, resulting in a forecast. Lambda is a serverless, event-driven compute service that lets you run code without provisioning or managing servers. The dependencies template includes functions to do things like create a dataset group in Forecast and purge objects within an S3 bucket before deleting the bucket. IAM roles define permissions within AWS for users and services. The dependencies template deploys a role to be used by Lambda and another for Step Functions, a workflow management service that will coordinate the tasks of data ingestion and processing, as well as predictor training and inference using Forecast.

Complete the following steps to deploy the dependencies template:

  1. On the console, select the desired Region supported by Forecast for solution deployment.
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create stack and choose With new resources (standard).
    Create stack
  4. For Template source, select Amazon S3 URL.
  5. Enter the template URL: https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/forecast-mlops-dependency.yaml.
  6. Choose Next.
    Specify template
  7. For Stack name, enter forecast-mlops-dependency.
  8. Under Parameters, choose to use an existing S3 bucket or create a new one, then provide the name of the bucket.
  9. Choose Next.
  10. Choose Next to accept the default stack options.
  11. Select the check box to acknowledge the stack creates IAM resources, then choose Create stack to deploy the template.

You should see the template deploy as the forecast-mlops-dependency stack. When the status changes to CREATE_COMPLETE, you may move to the next step.

Deploy the workload CloudFormation template

Next, let’s deploy the workload template to create our prerequisite resources. The workload template deploys Step Functions state machines for workflow management, AWS Systems Manager Parameter Store parameters to store parameter values from AWS CloudFormation and inform the workflow, an Amazon Simple Notification Service (Amazon SNS) topic for workflow notifications, and an IAM role for workflow service permissions.

The solution creates five state machines:

  • CreateDatasetGroupStateMachine – Creates a Forecast dataset group for data to be imported into.
  • CreateImportDatasetStateMachine – Imports source data from Amazon S3 into a dataset group for training.
  • CreateForecastStateMachine – Manages the tasks required to train a predictor and generate a forecast.
  • AthenaConnectorStateMachine – Enables you to write SQL queries with the Amazon Athena connector to land data in Amazon S3. This is an optional process to obtain historical data in the required format for Forecast by using Athena instead of placing files manually in Amazon S3.
  • StepFunctionWorkflowStateMachine – Coordinates calls out to the other four state machines and manages the overall workflow.

Parameter Store, a capability of Systems Manager, provides secure, hierarchical storage and programmatic retrieval of configuration data management and secrets management. Parameter Store is used to store parameters set in the workload stack as well as other parameters used by the workflow.

Complete the following steps to deploy the workload template:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack and choose With new resources (standard).
  3. For Template source, select Amazon S3 URL.
  4. Enter the template URL: https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/forecast-mlops-solution-guidance.yaml.
  5. Choose Next.
  6. For Stack name, enter a name.
  7. Accept the default values or modify the parameters.

Be sure to enter the S3 bucket name from the dependencies stack for S3 Bucket and a valid email address for SNSEndpoint even if you accept the default parameter values.

The following table describes each parameter.

Parameter Description More Information
DatasetGroupFrequencyRTS The frequency of data collection for the RTS dataset. .
DatasetGroupFrequencyTTS The frequency of data collection for the TTS dataset. .
DatasetGroupName A short name for the dataset group, a self-contained workload. CreateDatasetGroup
DatasetIncludeItem Specify if you want to provide item metadata for this use case. .
DatasetIncludeRTS Specify if you want to provide a related time series for this use case. .
ForecastForecastTypes When a CreateForecast job runs, this declares which quantiles to produce predictions for. You may choose up to five values in this array. Edit this value to include values according to need. CreateForecast
PredictorAttributeConfigs For the target variable in TTS and each numeric field in the RTS datasets, a record must be created for each time interval for each item. This configuration helps determine how missing records are filled in: with 0, NaN, or otherwise. We recommend filing the gaps in the TTS with NaN instead of 0. With 0, the model might learn wrongly to bias forecasts toward 0. NaN is how the guidance is delivered. Consult with your AWS Solutions Architect with any questions on this. CreateAutoPredictor
PredictorExplainPredictor Valid values are TRUE or FALSE. These determine if explainability is enabled for your predictor. This can help you understand how values in the RTS and item metadata influence the model. Explainability
PredictorForecastDimensions You may want to forecast at a finer grain than item. Here, you can specify dimensions such as location, cost center, or whatever your needs are. This needs to agree with the dimensions in your RTS and TTS. Note that if you have no dimension, the correct parameter is null, by itself and in all lowercase. null is a reserved word that lets the system know there is no parameter for the dimension. CreateAutoPredictor
PredictorForecastFrequency Defines the time scale at which your model and predictions will be generated, such as daily, weekly, or monthly. The drop-down menu helps you choose allowed values. This needs to agree with your RTS time scale if you’re using RTS. CreateAutoPredictor
PredictorForecastHorizon The number of time steps that the model predicts. The forecast horizon is also called the prediction length. CreateAutoPredictor
PredictorForecastOptimizationMetric Defines the accuracy metric used to optimize the predictor. The drop-down menu will help you select weighted quantile loss balances for over- or under-forecasting. RMSE is concerned with units, and WAPE/MAPE are concerned with percent errors. CreateAutoPredictor
PredictorForecastTypes When a CreateAutoPredictor job runs, this declares which quantiles are used to train prediction points. You may choose up to five values in this array, allowing you to balance over- and under-forecasting. Edit this value to include values according to need. CreateAutoPredictor
S3Bucket The name of the S3 bucket where input data and output data are written for this workload. .
SNSEndpoint A valid email address to receive notifications when the predictor and Forecast jobs are complete. .
SchemaITEM This defines the physical order, column names, and data types for your item metadata dataset. This is an optional file provided in the solution example. CreateDataset
SchemaRTS This defines the physical order, column names, and data types for your RTS dataset. The dimensions must agree with your TTS. The time-grain of this file governs the time-grain at which predictions can be made. This is an optional file provided in the solution example. CreateDataset
SchemaTTS This defines the physical order, column names, and data types for your TTS dataset, the only required dataset. The file must contain a target value, timestamp, and item at a minimum. CreateDataset
TimestampFormatRTS Defines the timestamp format provided in the RTS file. CreateDatasetImportJob
TimestampFormatTTS Defines the timestamp format provided in the TTS file. CreateDatasetImportJob
  1. Choose Next to accept the default stack options.
  2. Select the check box to acknowledge the stack creates IAM resources, then choose Create stack to deploy the template.

You should see the template deploy as the stack name you chose earlier. When the status changes to CREATE_COMPLETE, you may move to the data upload step.

Upload the data

In the previous section, you provided a stack name and an S3 bucket. This section describes how to deposit the publicly available dataset Food Demand in this bucket. If you’re using your own dataset, refer to Datasets to prepare your dataset in a format the deployment is expecting. The dataset needs to contain at least the target time series, and optionally, the related time series and the item metadata:

  • TTS is the time series data that includes the field that you want to generate a forecast for; this field is called the target field
  • RTS is time series data that doesn’t include the target field, but includes a related field
  • The item data file isn’t time series data, but includes metadata information about the items in the TTS or RTS datasets

Complete the following steps:

  1. If you’re using the provided sample dataset, download the dataset Food Demand to your computer and unzip the file, which creates three files inside three directories (rts, tts, item).
  2. On the Amazon S3 console, navigate to the bucket you created earlier.
  3. Choose Create folder.
  4. Use the same string as your workload stack name for the folder name.
  5. Choose Upload.
  6. Choose the three dataset folders, then choose Upload.

When the upload is complete, you should see something like the following screenshot. For this example, our folder is aiml42.S3 folder structure

Create a Forecast dataset group

Complete the steps in this section to create a dataset group as a one-time event for each workload. Going forward, you should plan on running the import data, create predictor, and create forecast steps as appropriate, as a series, according to your schedule, which could be daily, weekly, or otherwise.

  1. On the Step Functions console, locate the state machine containing Create-Dataset-Group.
  2. On the state machine detail page, choose Start execution.
  3. Choose Start execution again to confirm.

The state machine takes about 1 minute to run. When it’s complete, the value under Execution Status should change from Running to Succeeded Execution status

Import data into Forecast

Follow the steps in this section to import the data set that you uploaded to your S3 bucket into your dataset group:

  1. On the Step Functions console, locate the state machine containing Import-Dataset.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.

The amount of time the state machine takes to run depends on the dataset being processed.

Graph inspector

  1. While this is running, in your browser, open another tab and navigate to the Forecast console.
  2. On the Forecast console, choose View dataset groups and navigate to the dataset group with the name specified for DataGroupName from your workload stack.
  3. Choose View datasets.

You should see the data imports in progress.

Data imports in progress

When the state machine for Import-Dataset is complete, you can proceed to the next step to build your time series data model.

Create AutoPredictor (train a time series model)

This section describes how to train an initial predictor with Forecast. You may choose to create a new predictor (your first, baseline predictor) or retrain a predictor during each production cycle, which could be daily, weekly, or otherwise. You may also elect not to create a predictor each cycle and rely on predictor monitoring to guide you when to create one. The following figure visualizes the process of creating a production-ready Forecast predictor.

Production ready predictor workflow

To create a new predictor, complete the following steps:

  1. On the Step Functions console, locate the state machine containing Create-Predictor.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.
    The amount of runtime can depend on the dataset being processed. This could take up to an hour or more to complete.
  4. While this is running, in your browser, open another tab and navigate to the Forecast console.
  5. On the Forecast console, choose View dataset groups and navigate to the dataset group with the name specified for DataGroupName from your workload stack.
  6. Choose View predictors.

You should see the predictor training in progress (Training status shows “Create in progress…”).

Data imports in progress

When the state machine for Create-Predictor is complete, you can evaluate its performance.

As part of the state machine, the system creates a predictor and also runs a BacktestExport job that writes out time series-level predictor metrics to Amazon S3. These are files located in two S3 folders under the backtest-export folder:

  • accuracy-metrics-values – Provides item-level accuracy metric computations so you can understand the performance of a single time series. This allows you to investigate the spread rather than focusing on the global metrics alone.
  • forecasted-values – Provides step-level predictions for each time series in the backtest window. This enables you to compare the actual target value from a holdout test set to the predicted quantile values. Reviewing this helps formulate ideas on how to provide additional data features in RTS or item metadata to help better estimate future values, further reducing loss. You may download backtest-export files from Amazon S3 or query them in place with Athena.

S3 bucket contents

With your own data, you need to closely inspect the predictor outcomes and ensure the metrics meet your expected results by using the backtest export data. When satisfied, you can begin generating future-dated predictions as described in the next section.

Generate a forecast (inference about future time horizons)

This section describes how to generate forecast data points with Forecast. Going forward, you should harvest new data from the source system, import the data into Forecast, and then generate forecast data points. Optionally, you may also insert a new predictor creation after import and before forecast. The following figure visualizes the process of creating production time series forecasts using Forecast.

Production time series forecast workflow

Complete the following steps:

  1. On the Step Functions console, locate the state machine containing Create-Forecast.
  2. On the state machine detail page, choose Start Execution.
  3. Choose Start execution again to confirm.
    This state machine finishes very quickly because the system isn’t configured to generate a forecast. It doesn’t know which predictor model you have approved for inference.
    Let’s configure the system to use your trained predictor.
  4. On the Forecast console, locate the ARN for your predictor.
  5. Copy the ARN to use in a later step.
    Predictor details
  6. In your browser, open another tab and navigate to the Systems Manager console.
  7. On the Systems Manager console, choose Parameter Store in the navigation pane.
  8. Locate the parameter related to your stack (/forecast/<StackName>/Forecast/PredictorArn).
  9. Enter the ARN you copied for your predictor.
    This is how you associate a trained predictor with the inference function of Forecast.
  10. Locate the parameter /forecast/<StackName>/Forecast/Generate and edit the value, replacing FALSE with TRUE.
    Now you’re ready to run a forecast job for this dataset group.
  11. On the Step Functions console, run the Create-Forecast state machine.

This time, the job runs as expected. As part of the state machine, the system creates a forecast and a ForecastExport job, which writes out time series predictions to Amazon S3. These files are located in the forecast folder

Forecast folder contents

Inside the forecast folder, you will find predictions for your items, located in many CSV or Parquet files, depending on your selection. The predictions for each time step and selected time series exist with all your chosen quantile values per record. You may download these files from Amazon S3, query them in place with Athena, or choose another strategy to use the data.

This wraps up the entire workflow. You can now visualize your output using any visualization tool of your choice, such as Amazon QuickSight. Alternatively, data scientists can use pandas to generate their own plots. If you choose to use QuickSight, you can connect your forecast results to QuickSight to perform data transformations, create one or more data analyses, and create visualizations.

This process provides a template to follow. You will need to adapt the sample to your schema, set the forecast horizon, time resolution, and so forth according to your use case. You will also need to set a recurring schedule where data is harvested from the source system, import the data, and produce forecasts. If desired, you may insert a predictor task between the import and forecast steps.

Retrain the predictor

We have walked through the process of training a new predictor, but what about retraining a predictor? Retraining a predictor is one way to reduce the cost and time involved with training a predictor on the latest available data. Rather than create a new predictor and train it on the entire dataset, we can retrain the existing predictor by providing only the new incremental data made available since the predictor was last trained. Let’s walk through how to retrain a predictor using the automation solution:

  1. On the Forecast console, choose View dataset groups.
  2. Choose the dataset group associated with the predictor you want to retrain.
  3. Choose View predictors, then chose the predictor you want to retrain.
  4. On the Settings tab, copy the predictor ARN.
    We need to update a parameter used by the workflow to identify the predictor to retrain.
  5. On the Systems Manager console, choose Parameter Store in the navigation pane.
  6. Locate the parameter /forecast/<STACKNAME>/Forecast/Predictor/ReferenceArn.
  7. On the parameter detail page, choose Edit.
  8. For Value, enter the predictor ARN.
    This identifies the correct predictor for the workflow to retrain. Next, we need to update a parameter used by the workflow to change the training strategy.
  9. Locate the parameter /forecast/<STACKNAME>/Forecast/Predictor/Strategy.
  10. On the parameter detail page, choose Edit.
  11. For Value, enter RETRAIN.
    The workflow defaults to training a new predictor; however, we can modify that behavior to retrain an existing predictor or simply reuse an existing predictor without retraining by setting this value to NONE. You may want to forego training if your data is relatively stable or you’re using automated predictor monitoring to decide when retraining is necessary.
  12. Upload the incremental training data to the S3 bucket.
  13. On the Step Functions console, locate the state machine <STACKNAME>-Create-Predictor.
  14. On the state machine detail page, choose Start execution to begin the retraining.

When the retraining is complete, the workflow will end and you will receive an SNS email notification to the email address provided in the workload template parameters.

Clean up

When you’re done with this solution, follow the steps in this section to delete related resources.

Delete the S3 bucket

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Select the bucket where data was uploaded and choose Empty to delete all data associated with the solution, including source data.
  3. Enter permanently delete to delete the bucket contents permanently.
  4. On the Buckets page, select the bucket and choose Delete.
  5. Enter the name of the bucket to confirm the deletion and choose Delete bucket.

Delete Forecast resources

  1. On the Forecast console, choose View dataset groups.
  2. Select the dataset group name associated with the solution, then choose Delete.
  3. Enter delete to delete the dataset group and associated predictors, predictor backtest export jobs, forecasts, and forecast export jobs.
  4. Choose Delete to confirm.

Delete the CloudFormation stacks

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the workload stack and choose Delete.
  3. Choose Delete stack to confirm deletion of the stack and all associated resources.
  4. When the deletion is complete, select the dependencies stack and choose Delete.
  5. Choose Delete to confirm.

Conclusion

In this post, we discussed some different ways to get started using Forecast. We walked through an automated forecasting solution based on AWS CloudFormation for a rapid, repeatable solution deployment of a Forecast pipeline from data ingestion to inference, with little infrastructure knowledge required. Finally, we saw how we can use Lambda to automate model retraining, reducing cost and training time.

There’s no better time than the present to start forecasting with Forecast. To start building and deploying an automated workflow, visit Amazon Forecast resources. Happy forecasting!


About the Authors

Aaron Fagan is a Principal Specialist Solutions Architect at AWS based in New York. He specializes in helping customers architect solutions in machine learning and cloud security.

Raju Patil is a Data Scientist in AWS Professional Services. He builds and deploys AI/ML solutions to assist AWS customers in overcoming their business challenges. His AWS engagements have covered a wide range of AI/ML use cases such as computer vision, time-series forecasting, and predictive analytics, etc., across numerous industries, including financial services, telecom, health care, and more. Prior to this, he has led Data Science teams in Advertising Technology, and made significant contributions to numerous research and development initiatives in computer vision and robotics. Outside of work, he enjoys photography, hiking, travel, and culinary explorations.

Read More

Get started with generative AI on AWS using Amazon SageMaker JumpStart

Get started with generative AI on AWS using Amazon SageMaker JumpStart

Generative AI is gaining a lot of public attention at present, with talk around products such as GPT4, ChatGPT, DALL-E2, Bard, and many other AI technologies. Many customers have been asking for more information on AWS’s generative AI solutions. The aim of this post is to address those needs.

This post provides an overview of generative AI with a real customer use case, provides a concise description and outlines its benefits, references an easy-to-follow demo of AWS DeepComposer for creating new musical compositions, and outlines how to get started using Amazon SageMaker JumpStart for deploying GPT2, Stable Diffusion 2.0, and other generative AI models.

Generative AI overview

Generative AI is a specific field of artificial intelligence that focuses on generating new material. It’s one of the most exciting fields in the AI world, with the potential to transform existing businesses and allow completely new business ideas to come to market. You can use generative techniques for:

  • Creating new works of art using a model such as Stable Diffusion 2.0
  • Writing a best-selling book using a model such as GPT2, Bloom, or Flan-T5-XL
  • Composing your next symphony using the Transformers technique in AWS DeepComposer

AWS DeepComposer is an educational tool that helps you understand the key concepts associated with machine learning (ML) through the language of musical composition. To learn more, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Stable Diffusion, GPT2, Bloom, and Flan-T5-XL are all ML models. They are simply mathematical algorithms that need to be trained to identify patterns within data. After the patterns are learned, they’re deployed onto endpoints, ready for a process known as inference. New data that the model hasn’t seen is fed into the inference model, and new creative material is produced.

For example, with image generation models such as Stable Diffusion, we can create stunning illustrations using a few words. With text generation models such as GPT2, Bloom, and Flan-T5-XL, we can generate new literary articles, and potentially books, from a simple human sentence.

Autodesk is an AWS customer using Amazon SageMaker to help their product designers sort through thousands of iterations of visual designs for various use cases and use ML to help choose the optimal design. Specifically, they have worked with Edera Safety to help develop a spinal cord protector that protects riders from accidents while participating in sporting events, such as mountain biking. For more information, check out the video AWS Machine Learning Enables Design Optimization.

To learn more about what AWS customers are doing with generative AI and fashion, refer to Virtual fashion styling with generative AI using Amazon SageMaker.

Now that we understand what generative AI is all about, let’s jump into a JumpStart demonstration to learn how to generate new text or images with AI.

Prerequisites

Amazon SageMaker Studio is the integrated development environment (IDE) within SageMaker that provides us with all the ML features that we need in a single pane of glass. Before we can run JumpStart, we need to set up Studio. You can skip this step if you already have your own version of Studio running.

The first thing we need to do before we can use any AWS services is to make sure we have signed up for and created an AWS account. Next is to create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

The next step is to create a SageMaker domain. A domain sets up all the storage and allows you to add users to access SageMaker. For more information, refer to Onboard to Amazon SageMaker Domain. This demo is created in the AWS Region us-east-1.

Finally, you launch Studio. For this post, we recommend launching a user profile app. For instructions, refer to Launch Amazon SageMaker Studio.

Choose a JumpStart solution

Now we come to the exciting part. You should now be logged in to Studio, and see a page similar to the following screenshot.

In the navigation pane, under SageMaker JumpStart, choose Models, notebooks, solutions.

You’re presented with a range of solutions, foundation models, and other artifacts that can help you get started with a specific model or a specific business problem or use case.

If you want to experiment in a particular area, you can use the search function. Or you can simply browse the artifacts to find the relevant model or business solution for your needs.

For example, if you’re interested in fraud detection solutions, enter fraud detection into the search bar.

Fraud Detection Screenshot

If you’re interested in text generation solutions, enter text generation into the search bar. A good place to start if you want to explore a range of text generation models is to select the Intro to JS – Text Generation notebook.

JS - Text Generation

Let’s dive into a specific demonstration of the GPT-2 model.

JumpStart GPT-2 model demo

GPT 2 is a language model that helps generate human-like text based on a given prompt. We can use this type of transformer model to create new sentences and help us automate writing. This can be used for content creation such as blogs, social media posts, and books.

The GPT 2 model is part of the Generative Pre-Trained Transformer family that was the predecessor to GPT 3. At the time of writing, GPT 3 is used as the foundation for the OpenAI ChatGPT application.

To start exploring the GPT-2 model demo in JumpStart, complete the following steps:

  1. On JumpStart, search for and choose GPT 2.
  2. In the Deploy Model section, expand Deployment Configuration.
  3. For SageMaker hosting instance, choose your instance (for this post, we use ml.c5.2xlarge).

Different machine types have different price points attached. At the time of writing, the ml.c5.2xlarge that we selected incurs under $0.50 per hour. For the most up-to-date pricing, refer to Amazon SageMaker Pricing.

  1. For Endpoint name, enter demo-hf-textgeneration-gpt2.
  2. Choose Deploy.

Endpoint Name & Deploy

Wait for the ML endpoint to deploy (up to 15 minutes).

  1. When the endpoint is deployed, choose Open Notebook.

Endpoint Status

You’ll see a page similar to the following screenshot.
Python Code

The document we’re using to showcase our demonstration is a Jupyter notebook, which encompasses all the necessary Python code. Note that the code in this screenshot maybe be slightly different to the code you have, because AWS is constantly updating these notebooks and making sure they are secure, are free of defects, and provide the best customer experience.

  1. Click into the first cell and choose Ctrl+Enter to run the code block.

Code Block 1

An asterisk (*) appears to the left of the code block and then turns into a number. The asterisk indicates that the code is running and is complete when the number appears.

  1. In the next code block, enter some sample text, then press Ctrl+Enter.

Code Block 2

  1. Choose Ctrl+Enter in the third code block to run it.

After about 30-60 seconds, you will see your inference results.

For the input text “Once upon a time there were 18 sandwiches,” we get the following generated text:

Once upon a time there were 18 sandwiches, four plates with some salad, and three sandwiches with some beef. One restaurant was so nice that the food was made by hand. There were people living at the beginning of the time who were waiting so that

For the input text “And for the final time Peter said to Mary,” we get the following generated text:

And for the final time Peter said to Mary that he was a saint.

11 But Peter said that it was not a blessing, but rather that it would be the death of Peter. And when Mary heard of that Peter said to him,

You can experiment with running this third code block multiple times, and you will notice that the model makes different predictions each time.

To tailor the output using some of the advanced features, scroll down to experiment in the fourth code block.

To learn more about text generation models, refer to Run text generation with Bloom and GPT models on Amazon SageMaker JumpStart.

Clean up resources

Before we move on, don’t forget to delete your endpoint when you’re finished. On the previous tab, under Delete Endpoint, choose Delete.

Delete Endpoint

If you have accidentally closed this notebook, you can also delete your endpoint via the SageMaker console. Under Inference in the navigation pane, choose Endpoints.

Select the endpoint you used and on the Actions menu, choose Delete.

Delete Endpoint

Now that we understand how to use our first JumpStart solution, let’s look at using a Stable Diffusion model.

JumpStart Stable Diffusion model demo

We can use the Stable Diffusion 2 model to generate images from a simple line of text. This can be used to generate content for things like social media posts, promotional material, album covers, or anything that requires creative artwork.

  1. Return to JumpStart, then search for and choose Stable Diffusion 2.

Stable Diffusion 2

  1. In the Deploy Model section, expand Deployment Configuration.
  2. For SageMaker hosting instance, choose your instance (for this post, we use ml.g5.2xlarge).
  3. For Endpoint name, enter demo-stabilityai-stable-diffusion-v2.
  4. Choose Deploy.

Because this is a larger model, it can take up to 25 minutes to deploy. When it’s ready, the endpoint status shows as In Service.

In Service

  1. Choose Open Notebook to open a Jupyter notebook with Python code.

Python Code

  1. Run the first and second code blocks.
  2. In the third code block, change the text prompt, then run the cell.

Code Block 1

Wait about 30–60 seconds for your image to appear. The following image is based on our example text.

Output Picture

Again, you can play with the advanced features in the next code block. The picture it creates is different every time.

Clean up resources

Again, don’t forget to delete your endpoint. This time, we’re using ml.g5.2xlarge, so it incurs slightly higher charges than before. At the time of writing, it was just over $1 per hour.

Finally, let’s move to AWS DeepComposer.

AWS DeepComposer

AWS DeepComposer is a great way to learn about generative AI. It allows you to use built-in melodies in your models to generate new forms of music. The model that you use determines on how the input melody is transformed.

If you’re used to participating in AWS DeepRacer days to help your employees learn about re-enforcement learning, consider augmenting and enhancing the day with AWS DeepComposer to learn about generative AI.

For a detailed explanation and easy-to-follow demonstration of three of the models in this post, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Check out the following cool examples uploaded to SoundCloud using AWS DeepComposer.

We would love to see your experiments, so feel free to reach out via social media (@digitalcolmer) and share your learnings and experiments.

Conclusion

In this post, we talked about the definition of generative AI, illustrated by an AWS customer story. We then stepped you through how to get started with Studio and JumpStart, and showed you how to get started with GPT 2 and Stable Diffusion models. We wrapped up with a brief overview of AWS DeepComposer.

To explore JumpStart more, try using your own data to fine-tune an existing model. For more information, refer to Incremental training with Amazon SageMaker JumpStart. For information about fine-tuning Stable Diffusion models, refer to Fine-tune text-to-image Stable Diffusion models with Amazon SageMaker JumpStart.

To learn more about Stable Diffusion models, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.

We didn’t cover any information on the Flan-T5-XL model, so to learn more, refer to the following GitHub repo. The Amazon SageMaker Examples repo also includes a range of available notebooks on GitHub for the various SageMaker products, including JumpStart, covering a range of different use cases.

To learn more about AWS ML via a range of free digital assets, check out our AWS Machine Learning Ramp-Up Guide. You can also try our free ML Learning Plan to build on your current knowledge or have a clear starting point. To take an instructor-led course, we highly recommend the following courses:

It is truly an exciting time in the AI/ML space. AWS is here to support your ML journey, so please connect with us on social media. We look forward to seeing all your learning, experiments, and fun with the various ML services over the coming months and relish the opportunity to be your instructor on your ML journey.


About the Author

Paul Colmer is a Senior Technical Trainer at Amazon Web Services specializing in machine learning and generative AI. His passion is helping customers, partners, and employees develop and grow through compelling storytelling, shared experiences, and knowledge transfer. With over 25 years in the IT industry, he specializes in agile cultural practices and machine learning solutions. Paul is a Fellow of the London College of Music and Fellow of the British Computer Society.

Read More

Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models

Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models

Generative AI (GenAI) and large language models (LLMs), such as those available soon via Amazon Bedrock and Amazon Titan are transforming the way developers and enterprises are able to solve traditionally complex challenges related to natural language processing and understanding. Some of the benefits offered by LLMs include the ability to create more capable and compelling conversational AI experiences for customer service applications, and improving employee productivity through more intuitive and accurate responses.

For these use cases, however, it’s critical for the GenAI applications implementing the conversational experiences to meet two key criteria: limit the responses to company data, thereby mitigating model hallucinations (incorrect statements), and filter responses according to the end-user content access permissions.

To restrict the GenAI application responses to company data only, we need to use a technique called Retrieval Augmented Generation (RAG). An application using the RAG approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response. LLMs have limitations around the maximum word count for the input prompt, therefore choosing the right passages among thousands or millions of documents in the enterprise, has a direct impact on the LLM’s accuracy.

In designing effective RAG, content retrieval is a critical step to ensure the LLM receives the most relevant and concise context from enterprise content to generate accurate responses. This is where the highly accurate, machine learning (ML)-powered intelligent search in Amazon Kendra plays an important role. Amazon Kendra is a fully managed service that provides out-of-the-box semantic search capabilities for state-of-the-art ranking of documents and passages. You can use the high-accuracy search in Amazon Kendra to source the most relevant content and documents to maximize the quality of your RAG payload, yielding better LLM responses than using conventional or keyword-based search solutions. Amazon Kendra offers easy-to-use deep learning search models that are pre-trained on 14 domains and don’t require any ML expertise, so there’s no need to deal with word embeddings, document chunking, and other lower-level complexities typically required for RAG implementations. Amazon Kendra also comes with pre-built connectors to popular data sources such as Amazon Simple Storage Service (Amazon S3), SharePoint, Confluence, and websites, and supports common document formats such as HTML, Word, PowerPoint, PDF, Excel, and pure text files. To filter responses based on only those documents that the end-user permissions allow, Amazon Kendra offers connectors with access control list (ACL) support. Amazon Kendra also offers AWS Identity and Access Management (IAM) and AWS IAM Identity Center (successor to AWS Single Sign-On) integration for user-group information syncing with customer identity providers such as Okta and Azure AD.

In this post, we demonstrate how to implement a RAG workflow by combining the capabilities of Amazon Kendra with LLMs to create state-of-the-art GenAI applications providing conversational experiences over your enterprise content. After Amazon Bedrock launches, we will publish a follow-up post showing how to implement similar GenAI applications using Amazon Bedrock, so stay tuned.

Solution overview

The following diagram shows the architecture of a GenAI application with a RAG approach.

We use an Amazon Kendra index to ingest enterprise unstructured data from data sources such as wiki pages, MS SharePoint sites, Atlassian Confluence, and document repositories such as Amazon S3. When a user interacts with the GenAI app, the flow is as follows:

  1. The user makes a request to the GenAI app.
  2. The app issues a search query to the Amazon Kendra index based on the user request.
  3. The index returns search results with excerpts of relevant documents from the ingested enterprise data.
  4. The app sends the user request and along with the data retrieved from the index as context in the LLM prompt.
  5. The LLM returns a succinct response to the user request based on the retrieved data.
  6. The response from the LLM is sent back to the user.

With this architecture, you can choose the most suitable LLM for your use case. LLM options include our partners Hugging Face, AI21 Labs, Cohere, and others hosted on an Amazon SageMaker endpoint, as well as models by companies like Anthropic and OpenAI. With Amazon Bedrock, you will be able to choose Amazon Titan, Amazon’s own LLM, or partner LLMs such as those from AI21 Labs and Anthropic with APIs securely without the need for your data to leave the AWS ecosystem. The additional benefits that Amazon Bedrock will offer include a serverless architecture, a single API to call the supported LLMs, and a managed service to streamline the developer workflow.

For the best results, a GenAI app needs to engineer the prompt based on the user request and the specific LLM being used. Conversational AI apps also need to manage the chat history and the context. GenAI app developers can use open-source frameworks such as LangChain that provide modules to integrate with the LLM of choice, and orchestration tools for activities such as chat history management and prompt engineering. We have provided the KendraIndexRetriever class, which implements a LangChain retriever interface, which applications can use in conjunction with other LangChain interfaces such as chains to retrieve data from an Amazon Kendra index. We have also provided a few sample applications in the GitHub repo. You can deploy this solution in your AWS account using the step-by-step guide in this post.

Prerequisites

For this tutorial, you’ll need a bash terminal with Python 3.9 or higher installed on Linux, Mac, or Windows Subsystem for Linux, and an AWS account. We also recommend using an AWS Cloud9 instance or an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Implement a RAG workflow

To configure your RAG workflow, complete the following steps:

  1. Use the provided AWS CloudFormation template to create a new Amazon Kendra index.

This template includes sample data containing AWS online documentation for Amazon Kendra, Amazon Lex, and Amazon SageMaker. Alternately, if you have an Amazon Kendra index and have indexed your own dataset, you can use that. Launching the stack requires about 30 minutes followed by about 15 minutes to synchronize it and ingest the data in the index. Therefore, wait for about 45 minutes after launching the stack. Note the index ID and AWS Region on the stack’s Outputs tab.

  1. For an improved GenAI experience, we recommend requesting an Amazon Kendra service quota increase for maximum DocumentExcerpt size, so that Amazon Kendra provides larger document excerpts to improve semantic context for the LLM.
  2. Install the AWS SDK for Python on the command line interface of your choice.
  3. If you want to use the sample web apps built using Streamlit, you first need to install Streamlit. This step is optional if you want to only run the command line versions of the sample applications.
  4. Install LangChain.
  5. The sample applications used in this tutorial require you to have access to one or more LLMs from Flan-T5-XL, Flan-T5-XXL, Anthropic Claud-V1, and OpenAI-text-davinci-003.
    1. If you want to use Flan-T5-XL or Flan-T5-XXL, deploy them to an endpoint for inference using Amazon SageMaker Studio Jumpstart.
    2. If you want to work with Anthropic Claud-V1 or OpenAI-da-vinci-003, acquire the API keys for your LLMs of your interest from https://www.anthropic.com/ and https://openai.com/, respectively.
  6. Follow the instructions in the GitHub repo to install the KendraIndexRetriever interface and sample applications.
  7. Before you run the sample applications, you need to set environment variables with the Amazon Kendra index details and API keys of your preferred LLM or the SageMaker endpoints of your deployments for Flan-T5-XL or Flan-T5-XXL. The following is a sample script to set the environment variables:
    export AWS_REGION="<YOUR-AWS-REGION>"
    export KENDRA_INDEX_ID="<YOUR-KENDRA-INDEX-ID>"
    export FLAN_XL_ENDPOINT="<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XL>"
    export FLAN_XXL_ENDPOINT="<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XXL>"
    export OPENAI_API_KEY="<YOUR-OPEN-AI-API-KEY>"
    export ANTHROPIC_API_KEY="<YOUR-ANTHROPIC-API-KEY>"

  8. In a command line window, change to the samples subdirectory of where you have cloned the GitHub repository. You can run the command line apps from the command line as python <sample-file-name.py>. You can run the streamlit web app by changing the directory to samples and running streamlit run app.py <anthropic|flanxl|flanxxl|openai>.
  9. Open the sample file kendra_retriever_flan_xxl.py in an editor of your choice.

Observe the statement result = run_chain(chain, "What's SageMaker?"). This is the user query (“What’s SageMaker?”) that’s being run through the chain that uses Flan-T-XXL as the LLM and Amazon Kendra as the retriever. When this file is run, you can observe the output as follows. The chain sent the user query to the Amazon Kendra index, retrieved the top three result excerpts, and sent them as the context in a prompt along with the query, to which the LLM responded with a succinct answer. It has also provided the sources, (the URLs to the documents used in generating the answer).

~. python3 kendra_retriever_flan_xxl.py
Amazon SageMaker is a machine learning service that lets you train and deploy models in the cloud.
Sources:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-whatis.html
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
  1. Now let’s run the web app app.py as streamlit run app.py flanxxl. For this specific run, we are using a Flan-T-XXL model as the LLM.

It opens a browser window with the web interface. You can enter a query, which in this case is “What is Amazon Lex?” As seen in the following screenshot, the application responds with an answer, and the Sources section provides the URLs to the documents from which the excerpts were retrieved from the Amazon Kendra index and sent to the LLM in the prompt as the context along with the query.

  1. Now let’s run app.py again and get a feel of the conversational experience using streamlit run app.py anthropic. Here the underlying LLM used is Anthropic Claud-V1.

As you can see in the following video, the LLM provides a detailed answer to the user’s query based on the documents it retrieved from the Amazon Kendra index and then supports the answer with the URLs to the source documents that were used to generate the answer. Note that the subsequent queries don’t explicitly mention Amazon Kendra; however, the ConversationalRetrievalChain (a type of chain that’s part of the LangChain framework and provides an easy mechanism to develop conversational application-based information retrieved from retriever instances, used in this LangChain application), manages the chat history and the context to get an appropriate response.

Also note that in the following screenshot, Amazon Kendra finds the extractive answer to the query and shortlists the top documents with excerpts. Then the LLM is able to generate a more succinct answer based on these retrieved excerpts.

In the following sections, we explore two use cases for using Generative AI with Amazon Kendra.

Use case 1: Generative AI for financial service companies

Financial organizations create and store data across various data repositories, including financial reports, legal documents, and whitepapers. They must adhere to strict government regulations and oversight, which means employees need to find relevant, accurate, and trustworthy information quickly. Additionally, searching and aggregating insights across various data sources is cumbersome and error prone. With Generative AI on AWS, users can quickly generate answers from various data sources and types, synthesizing accurate answers at enterprise scale.

We chose a solution using Amazon Kendra and AI21 Lab’s Jurassic-2 Jumbo Instruct LLM. With Amazon Kendra, you can easily ingest data from multiple data sources such as Amazon S3, websites, and ServiceNow. Then Amazon Kendra uses AI21 Lab’s Jurassic-2 Jumbo Instruct LLM to carry out inference activities on enterprise data such as data summarization, report generation, and more. Amazon Kendra augments LLMs to provide accurate and verifiable information to the end-users, which reduces hallucination issues with LLMs. With the proposed solution, financial analysts can make faster decisions using accurate data to quickly build detailed and comprehensive portfolios. We plan to make this solution available as an open-source project in near future.

Example

Using the Kendra Chatbot solution, financial analysts and auditors can interact with their enterprise data (financial reports and agreements) to find reliable answers to audit-related questions. Kendra ChatBot provides answers along with source links and has the capability to summarize longer answers. The following screenshot shows an example conversation with Kendra ChatBot.

Architecture overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. Financial documents and agreements are stored on Amazon S3, and ingested to an Amazon Kendra index using the S3 data source connector.
  2. The LLM is hosted on a SageMaker endpoint.
  3. An Amazon Lex chatbot is used to interact with the user via the Amazon Lex web UI.
  4. The solution uses an AWS Lambda function with LangChain to orchestrate between Amazon Kendra, Amazon Lex, and the LLM.
  5. When users ask the Amazon Lex chatbot for answers from a financial document, Amazon Lex calls the LangChain orchestrator to fulfill the request.
  6. Based on the query, the LangChain orchestrator pulls the relevant financial records and paragraphs from Amazon Kendra.
  7. The LangChain orchestrator provides these relevant records to the LLM along with the query and relevant prompt to carry out the required activity.
  8. The LLM processes the request from the LangChain orchestrator and returns the result.
  9. The LangChain orchestrator gets the result from the LLM and sends it to the end-user through the Amazon Lex chatbot.

Use case 2: Generative AI for healthcare researchers and clinicians

Clinicians and researchers often analyze thousands of articles from medical journals or government health websites as part of their research. More importantly, they want trustworthy data sources they can use to validate and substantiate their findings. The process requires hours of intensive research, analysis, and data synthesis, lengthening the time to value and innovation. With Generative AI on AWS, you can connect to trusted data sources and run natural language queries to generate insights across these trusted data sources in seconds. You can also review the sources used to generate the response and validate its accuracy.

We chose a solution using Amazon Kendra and Flan-T5-XXL from Hugging Face. First, we use Amazon Kendra to identify text snippets from semantically relevant documents in the entire corpus. Then we use the power of an LLM such as Flan-T5-XXL to use the text snippets from Amazon Kendra as context and obtain a succinct natural language answer. In this approach, the Amazon Kendra index functions as the passage retriever component in the RAG mechanism. Lastly, we use Amazon Lex to power the front end, providing a seamless and responsive experience to end-users. We plan to make this solution available as an open-source project in the near future.

Example

The following screenshot is from a web UI built for the solution using the template available on GitHub. The text in pink are responses from the Amazon Kendra LLM system, and the text in blue are the user questions.

Architecture overview

The architecture and solution workflow for this solution are similar to that of use case 1.

Clean up

To save costs, delete all the resources you deployed as part of the tutorial. If you launched the CloudFormation stack, you can delete it via the AWS CloudFormation console. Similarly, you can delete any SageMaker endpoints you may have created via the SageMaker console.

Conclusion

Generative AI powered by large language models is changing how people acquire and apply insights from information. However, for enterprise use cases, the insights must be generated based on enterprise content to keep the answers in-domain and mitigate hallucinations, using the Retrieval Augmented Generation approach. In the RAG approach, the quality of the insights generated by the LLM depends on the semantic relevance of the retrieved information on which it is based, making it increasingly necessary to use solutions such as Amazon Kendra that provide high-accuracy semantic search results out of the box. With its comprehensive ecosystem of data source connectors, support for common file formats, and security, you can quickly start using Generative AI solutions for enterprise use cases with Amazon Kendra as the retrieval mechanism.

For more information on working with Generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS. You can start experimenting and building RAG proofs of concept (POCs) for your enterprise GenAI apps, using the method outlined in this blog. As mentioned earlier, once Amazon Bedrock is available, we will publish a follow up blog showing how you can build RAG using Amazon Bedrock.


About the authors

Abhinav JawadekarAbhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

Jean-Pierre Dodel is the Principal Product Manager for Amazon Kendra and leads key strategic product capabilities and roadmap prioritization. He brings extensive Enterprise Search and ML/AI experience to the team, with prior leading roles at Autonomy, HP, and search startups prior to joining Amazon 7 years ago.

Mithil Shah is an ML/AI Specialist at AWS. Currently he helps public sector customers improve lives of citizens by building Machine Learning solutions on AWS.

Firaz Akmal is a Sr. Product Manager for Amazon Kendra at AWS. He is a customer advocate, helping customers understand their search and generative AI use-cases with Kendra on AWS. Outside of work Firaz enjoys spending time in the mountains of the PNW or experiencing the world through his daughter’s perspective.

Abhishek Maligehalli Shivalingaiah is a Senior AI Services Solution Architect at AWS with focus on Amazon Kendra. He is passionate about building applications using Amazon Kendra ,Generative AI and NLP. He has around 10 years of experience in building Data & AI solutions to create value for customers and enterprises. He has built a (personal) chatbot for fun to answers questions about his career and professional journey. Outside of work he enjoys making portraits of family & friends, and loves creating artworks.

Read More

Implement backup and recovery using an event-driven serverless architecture with Amazon SageMaker Studio

Implement backup and recovery using an event-driven serverless architecture with Amazon SageMaker Studio

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for ML. It provides a single, web-based visual interface where you can perform all machine learning (ML) development steps required to build, train, tune, debug, deploy, and monitor models. It gives data scientists all the tools you need to take ML models from experimentation to production without leaving the IDE. Moreover, as of November 2022, Studio supports shared spaces to accelerate real-time collaboration and multiple Amazon SageMaker domains in a single AWS Region for each account.

There are two prevailing use cases for Studio domain backup and recovery. The first use case involves a customer business unit and project wanting a functionality to replicate data scientists’ artifacts and data files to any target domains and profiles at will. The second use case involves the replication only when the domain and profile are deleted due to conditions such as the change from a customer-managed key to an AWS-managed key or a change of onboarding from AWS Identity and Access Management (IAM) authentication (see Onboard to Amazon SageMaker Domain Using IAM) to AWS IAM Identity Center (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

This post mainly covers the second use case by presenting how to back up and recover users’ work when the user and space profiles are deleted and recreated, but we also provide the Python script to support the first use case.

When the user and space profiles are recreated in the existing Studio domain, a new ID of the profile directory will be created within the Studio Amazon Elastic File System (Amazon EFS) volume. As a result, the Studio users could lose access to the model artifacts and data files stored in their previous profile directory if they are deleted. Additionally, Studio domains don’t currently support mounting custom or additional EFS volumes. We recommend keeping the previous Studio EFS volume as a backup using RetentionPolicy in Studio.

Therefore, a proper recovery solution needs to be implemented to access the data from the previous directory in case of profile deletion or to recover files from a detached volume in case of domain deletion. Data scientists can minimize the potential impacts of deleting the domain and profiles if they frequently commit their code to the repository and utilize external storage for data access. However, having the capability to back up and recover the data scientist’s workspace is another layer to ensure their continuity of work, which may increase their productivity. Moreover, if you have tens and hundreds of Studio users, consider how to automate the recovery process to avoid mistakes and save costs and time. To solve this problem, we provide a solution to supplement Studio domain recovery.

This post explains the backup and recovery module and one approach to automate the process using an event-driven architecture. First, we demonstrate how to perform backup and recovery if you create a new Studio domain, user, and space profiles using AWS CloudFormation templates. Next, we explain the required steps to test our recovery solution using the existing domain and profiles without using our CloudFormation templates (you can use your own templates). Although this post focuses on a single domain setting, our solution works for multiple Studio domains as well. Finally, we have automated the provisioning of all resources using the AWS Serverless Application Model (AWS SAM), an open-source framework for building serverless applications.

Solution overview

The following diagram illustrates the high-level workflow of Studio domain backup and recovery with an event-driven architecture.

technical architecture

The event-driven app includes the following steps:

  1. An Amazon CloudWatch events rule uses AWS CloudTrail to trackCreateUserProfile and CreateSpace API calls, trigger the rule, and invoke the AWS Lambda function.
  2. The function updates the user table and appends items in the history table in Amazon DynamoDB. In addition, the database layer keeps track of the domain and profile name and file system mapping.

The following image shows the DynamoDB tables structure. The partition key and sort key in the studioUser table consist of the profile and domain name. The replication column holds the replication flag with true as the default value. In addition, bytes_written, bytes_file_transferred, total_duration_ms, and replication_status fields are populated when the replication completes successfully.

table schema

The database layer can be replaced by other services, such as Amazon Relational Database Service (Amazon RDS) or Amazon Simple Storage Service (Amazon S3). However, we chose DynamoDB because of the Amazon DynamoDB Streams feature.

  1. DynamoDB Streams is enabled on the user table, and the Lambda function is set as a trigger and synchronously invoked when new stream records are available.
  2. Another Lambda function triggers the process to restore the files using the user and space files restore tools.

The backup and recovery workflow includes the following steps:

  1. The backup and recovery workflow consists of AWS Step Functions, integrated with other AWS services, including AWS DataSync, to orchestrate the recovery of the user and space files from the previous directory to a new directory between the same Studio domain EFS volume (profile recreation) or a new domain EFS volume (domain recreation). With the Step Functions Workflow Studio, the workflow can be implemented with no code (such as in this case) or low code for a more customized solution. The Step Functions state machine is invoked when the event-driven app detects the profile creation event. For each profile, the Step Functions state machine runs the DataSync task to copy all files from their previous directories to the new directory.

The following image is the actual graph of the Step Functions state machine. Note that the ListApp* step ensures the profile directories are populated in the Studio EFS volume before proceeding. Also, we implemented retry with exponential backoff to handle API throttle for DataSync CreateLocationEfs and CreateTask API calls.

step functions diagram

  1. When the users open their Studio, all the files from the respective directories from the previous directory will be available to continue their work. The DataSync job replicating one gigabyte of data from our experiment took approximately 1 minute.

The following are services that will be used as part of the solution:

Prerequisites

To implement this solution, you must have the following prerequisites:

  • An AWS account if you don’t already have one. The IAM user that you use must have sufficient permissions to make the necessary AWS service calls and manage AWS resources.
  • The AWS SAM CLI installed and configured.
  • Your AWS credentials set up.
  • Git installed.
  • Python 3.9.
  • A Studio profile and domain name combination that is unique across all Studio domains within a Region and account.
  • You need to use the existing Amazon VPC and S3 bucket to follow the deployment step.
  • Also, be aware of the service quota for the maximum number of DataSync tasks per account per Region (default is 100). You can request a quota increase to meet the number of replication tasks for your use case.

Refer to the AWS Regional Services List for service availability based on Region. Additionally, review Amazon SageMaker endpoints and quotas.

Set up a Studio profile recovery infrastructure

The following diagram shows the logical steps for a SageMaker administrator to set up the Studio user and space recovery infrastructure, which a single command can complete with our automated solution.

logical flow 1

To set up the environment, clone the GitHub repo in the terminal:

git clone https://github.com/aws-samples/sagemaker-studio-efs-recovery-serverless.git && cd sagemaker-studio-efs-recovery-serverless

The following code shows the deployment script usage:

bash deploy.sh -h

Usage: deploy.sh [-n <stack_name>] [-v <vpc_id>] [-s <subnet_id>] [-b <s3_bucket>] [-r <aws_region>] [-d] Options: -n: specify stack name -v: specify your vpc id -s: specify subnet -b: specify s3 bucket name to store artifacts -r: specify aws region -d: whether to skip a creation of a new SageMaker Studio Domain (default: no)

To create a new Amazon SageMaker domain, run the following command. You need to specify which Amazon VPC and subnet you want to use. We use VPC only mode for the Studio deployment. If you don’t have any preference, you can use the default VPC and subnet. Also, specify any stack name, AWS Region, and S3 bucket name for AWS SAM to deploy the Lambda function:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

If you want to use an existing Studio domain, run the following command. Option -d yes will skip creating a new Studio domain:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

For the existing domains, the SageMaker administrator must also update the source and target Studio EFS security groups to allow connection to the user and space file restore tool. For example, to run the following command, you need to specify HomeEfsFileSystemId, the EFS file system ID, and SecurityGroupId used by the user and space file restore tool (we discuss this in more detail later in the post):

python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>

User and space recovery logical flow

The following diagram shows the logical user and space recovery flow diagram for a SageMaker administrator to understand how the solution works, and no additional setup is required. If the profile (user or space) and domain are accidentally deleted, the EFS volume is detached but not deleted. A possible scenario is that we may want to revert the deletion by recreating a new domain and profiles. If the same profiles are being onboarded again, they may wish to access the files from their respective workspace in the detached volume. The recovery process is almost entirely automated; the only action required by the SageMaker administrator is to recreate the Studio domain and profiles using the same CloudFormation template. The rest of the steps are automated.

logical flow 2

Optionally, if the SageMaker admin wants control over replication, run the following command to turn off replication for specific domains and profiles. This script updates the replication field given the domain and profile name in the table. Note that you need to run the script for the same user each time they get recreated.

python3 src/update-replication-flag.py --profile-name <profile_name> --domain-name <domain_name> --region <aws_region> --no-replication

The following optional step provides the solution for the first use case to allow replication to take place between the specified source file system to any target domain and profile name. If the SageMaker admin wants to replicate particular profile data to a different domain and a profile that doesn’t exist yet, run the following command. The script inserts the new domain and profile name with the specified source file system information. The subsequent profile creation will trigger the replication task. Note that you need to run add-security-group.py from the previous step to allow connection to the file restore tool.

python3 src/add-replication-target.py --src-profile-name <profile_name> --src-domain-name <domain_name> --target-profile-name <profile_name> --target-domain-name <domain_name> --region <aws_region>

In the following sections, we test two scenarios to confirm that the solution works as expected.

Create a new Studio domain

Our first test scenario assumes you are starting from scratch and want to create a new Studio domain and profiles in your environment using our templates. Then we deploy the Studio domain, user and space, backup and recovery workflow, and event app. The purpose of the first scenario is to confirm that the profile file is recovered in the new home directory automatically when the profile is deleted and recreated within the same Studio domain.

Complete the following steps:

  1. To deploy the application, run the following command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

  1. On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
    1. <stack_name>-DemoBootstrap-*
    2. <stack_name>-StepFunction-*
    3. <stack_name>-EventApp-*
    4. <stack_name>-StudioDomain-*
    5. <stack_name>-StudioUser1-*
    6. <stack_name>-StudioSpace-*

cloud formation console

If the deployment failed in any stacks, check the error and resolve the issues. Then, proceed to the next step only if the problems are resolved.

  1. On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
  2. Select studioUser and choose Explore table items to confirm that items for user1 and space1 are populated in the table.
  3. On the SageMaker console, choose Domains in the navigation pane.
  4. Choose demo-myapp-dev-studio-domain.
  5. On the User profiles tab, select user1 and choose Launch, and choose Studio to open the Studio for the user.

Note that Studio may take 10-15 minutes to load for the first time.

  1. On the File menu, choose Terminal to launch a new terminal within Studio.
  2. Run the following command in the terminal to create a file for testing:
    echo "i don't want to lose access to this file" > user1.txt

  1. Repeat these steps for space1 (choose Spaces in Step 7). Feel free to create a file of your choice.
  2. Delete the Studio user user1 and space1 by removing the nested stacks <stack_name>-StudioUser1-* and <stack_name>-StudioSpace-* from the parent. Delete the stacks by commenting out the following code blocks from the AWS SAM template file, template.yaml. Make sure to save the file after the edit:
    StudioUser1:
      Type: AWS::Serverless::Application
      Condition: CreateDomainCond
      DependsOn: StudioDomain
      Properties:
        Location: Infrastructure/Templates/sagemaker-studio-user.yaml
        Parameters:
          LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
          StudioUserProfileName: !Ref StudioUserProfileName1
          UID: !Ref UID
          Env: !Ref Env
          AppName: !Ref AppName
    StudioSpace:
      Type: AWS::Serverless::Application
      Condition: CreateDomainCond
      DependsOn: StudioDomain
      Properties:
        Location: Infrastructure/Templates/sagemaker-studio-space.yaml
        Parameters:
          LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
          StudioSpaceName: !Ref StudioSpaceName
          UID: !Ref UID
          Env: !Ref Env
          AppName: !Ref AppName

  1. Run the following command to deploy the stack with this change:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

  2. Recreate the Studio profiles by adding back the stack back to the parent. Uncomment the code block from the previous step, save the file, and run the same command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

After a successful deployment, you can check the results.

  1. On the AWS CloudFormation console, choose the stack <stack_name>-StepFunction-*
  2. In the stack, choose the value for Physical ID of StepFunction in the Resources section.
  3. Choose the most recent run and confirm its status in Graph view.

It should look like the following screenshot for the user profile replication. You can also check the other run to ensure the same for the space profile.

step functions complete

  1. If you completed Steps 5–10, open the Studio domain for user1 and confirm that the user1.txt file is copied to the newly created directory.

It should not be visible in space1 directory, keeping the same file ownership.

  1. Repeat this step for space1.
  2. On the DataSync console, choose the most recent task ID.
  3. Choose History and the most recent run ID.

This is another way to inspect the configurations and the run status of the DataSync task. As an example, the following screenshot shows the task result for user1 directory replication.

datasync complete

We only covered profile recreation in this scenario. However, our solution works in the same way for Studio domain recreation, and it can be tested by deleting and recreating the domain.

Use an existing Studio domain

Our second test scenario assumes you want to use the existing SageMaker domain and profiles in the environment. Therefore, we only deploy the backup and recovery workflow and the event app. Again, you can use your own Studio CloudFormation template or create them through the AWS CloudFormation console to follow along. Because we’re using the existing Studio domain, the solution will list the current user and space for all domains within the Region, which we call seeding.

Complete the following steps:

  1. To deploy the application, run the following command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

  2. On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
    1. <stack_name>-DemoBootstrap-*
    2. <stack_name>-StepFunction-*
    3. <stack_name>-EventApp-*

If the deployment failed in any stacks, check the error and resolve the issues. Then, proceed to the next step only if the problems are resolved.

  1. Verify the initial data seed has completed.
  2. On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
  3. Choose studioUser and choose Explore table items to confirm that items for the existing Studio domain are populated in the table.

Proceed to the next step only if the seed has completed successfully. If the tables aren’t populated, check the CloudWatch logs of the corresponding Lambda function. On the AWS CloudFormation console, choose the stack <stack_name>-EventApp-*, and choose the physical ID of DDBSeedLambda in the Resources section. Under Monitor, choose View CloudWatch Logs and check the logs for the most recent run to troubleshoot.

  1. To update the EFS security group, first get the SecurityGroupId. We use the security group created by the CloudFormation template, which allows all traffic in the outbound connection. Run the following command:
    echo "SecurityGroupId:" $(aws ssm get-parameter --name /network/vpc/sagemaker/securitygroups --region
    
    <aws_region> --query 'Parameter.Value')

  1. Get the HomeEfsFileSystemId, which is the ID of the Studio home EFS volume. Run the following command:
    echo "HomeEfsFileSystemId:" $(aws sagemaker describe-domain --domain-id <domain_id> --region <aws_region> --query 'HomeEfsFileSystemId')

  2. Finally, update the EFS security group by allowing inbounds from the security group shared with the DataSync task using port 2049. Run the following command:
    python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>

  3. Delete and recreate the Studio profiles of your choice using the same profile name.
  4. Confirm the run status of the Step Functions state machine and recovery of the Studio profile directory by following the steps from the first scenario.

You can also test the Step Functions workflow manually with your choice of source and target inputs for replication (more details found in README.md in the GitHub repository).

Clean up

Run the following commands to clean up your resources:

sam delete --region <aws_region> --no-prompts --stack-name <stack_name>

Manually delete the SageMakerSecurityGroup after 20 minutes or so. Deletion of the Elastic Network Interface (ENI) can make the stack show as DELETE_IN_PROGRESS for some time, so we intentionally set the security group to be retained. Also, you need to disassociate that security group from the security group managed by SageMaker before you can delete it.

Conclusion

Studio is a powerful IDE that allows data scientists to quickly develop, train, test, and deploy models. This post discusses how to back up and recover the files stored in a data scientist’s home and shared space directory. We also demonstrated how an event-driven architecture can help automate the recovery process.

Our solution can help improve the resiliency of data scientists’ artifacts within Studio, leading to operational efficiency on the AWS Cloud. Also, the solution is modular, so you can use the necessary components and update them for your usage. For instance, the enhancement to this solution might be a cross-account replication. We hope that what we demonstrated in the post will be a helpful resource to support those ideas.

To get started with Studio, check out Amazon SageMaker for Data Scientists. Please send us feedback on the AWS forum for SageMaker or through your AWS support contacts. You can find other Studio examples in our GitHub repository.


About the Authors

Kenny Sato is a Machine Learning Engineer at AWS, guiding customers in architecting and implementing machine learning solutions. He received his master’s in Computer Engineering from Virginia Tech and is pursuing a PhD in Computer Science. In his spare time, you can find him in his backyard or out somewhere playing with his lovely daughters.

Gautam Nambiar is a DevOps Consultant with AWS. He is particularly interested in architecting and building automated solutions, MLOps pipelines, and creating reusable and secure DevOps best practice patterns. In his spare time, he likes playing and watching soccer.

Read More