Research Focus: Week of March 6, 2023

Microsoft Research Focus 11 edition, week of March 06, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Hide and seek with Spectres

Attack methods like Spectre exploit speculative execution, one of the key performance optimizations of modern CPUs. Microsoft researchers are working on a novel testing tool that can automatically detect speculative leaks in commercial (black-box) CPUs. However, until now, the testing process has been slow, which has hindered in-depth testing campaigns and the discovery of new classes of leakage.

In a new paper: Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing, researchers from Microsoft and academic collaborators identify the root causes of the performance limitations in existing approaches—and propose techniques to overcome them. These techniques improve the testing speed over the state of the art by up to two orders of magnitude.

These improvements enabled the researchers to run a testing campaign of unprecedented depth on Intel and AMD CPUs. In the process, they discovered two types of previously unknown speculative leaks (affecting string comparison and division) that have escaped previous manual and automatic analyses.

The paper describing these novel techniques will appear at the 2023 IEEE Symposium on Security and Privacy.

PODCAST

Microsoft Research’s Philipp Witte on improving carbon sequestration with AI

Reducing carbon dioxide in the atmosphere could play an important role in minimizing climate change. Carbon sequestration – the process of locking carbon dioxide in deep underground reservoirs – is a developing technology that could make a meaningful contribution if it were deployed at scale. Deep learning AI technologies can improve the models required to develop these reservoirs, which could help scale up sequestration projects to a meaningful level.

Philipp Witte, a researcher with Microsoft Research for Industry, recently chatted with Fixing the Future from IEEE Spectrum about how AI can help improve carbon sequestration. For example, AI can facilitate computationally intensive simulations and manage complex modeling requirements more efficiently than conventional methods.

Tune in to the podcast for a lively discussion at the intersection of AI and decarbonization.


OPPORTUNITY

Microsoft Research Data Science Summer School – Apply now

Microsoft Research New York City’s Data Science Summer School (DS3) is an intensive hands-on introduction to data science for local undergraduate students interested in attending graduate school in computer science and related fields. The curriculum includes coursework in data science and group research projects. This year’s program runs from May 30 to June 23, and applications will be accepted through April 11.

The course is taught by leading scientists at Microsoft Research New York City and each participant will receive a laptop and $3,000 stipend. Sessions will cover tools and techniques for acquiring, cleaning, and utilizing real-world data for research purposes. The course also serves as an introduction to problems in applied statistics and machine learning, and will cover the theory behind simple but effective methods for supervised and unsupervised learning.

Applicants must be currently enrolled in an undergraduate program in the New York City area. We strongly encourage people from diverse, non-traditional, and under-represented backgrounds in STEM to apply.

Check out the program details and apply today.


Ready for Its Closeup: NVIDIA Powers 15 Years of Oscar-Worthy Visual Effects

The Academy Award nominations are in — and for the 15th year in a row, NVIDIA technologies worked behind the scenes of every film nominated for Best Visual Effects.

The five VFX contenders for the 95th annual Academy Awards, taking place on Sunday, March 12, include:

  • All Quiet on the Western Front
  • Avatar: The Way of Water
  • The Batman
  • Black Panther: Wakanda Forever
  • Top Gun: Maverick

For over a decade, filmmakers and VFX studios around the world have used NVIDIA technologies to power the most advanced, visually rich movies ever made. Today, creators and artists are transforming VFX using advanced capabilities in graphics, like real-time ray tracing, simulation, AI and virtual production — all powered by NVIDIA RTX technologies.

Diving Into Natural Wonders With Cutting-Edge Graphics

Award-winning studio Wētā FX created the stunning visuals for director James Cameron’s much-anticipated sequel, Avatar: The Way of Water. The film is one of Wētā’s largest VFX projects to date. The team created 3,240 shots — which is 98% of the total shots in the film, more than two-thirds of which featured water.

In computer graphics (CG), making water look natural and realistic — from how it moves off a character’s skin to how it drips from clothing — is one of the biggest challenges for visual effects artists. But for this film, Wētā developed and implemented a new water toolset that advanced their capabilities across simulation, rendering and more.

The team started with pre-production and performance capture using a real-time, GPU-based ocean spectrum deformer, which served as a consistent, physically based starting point for water on set. From there, Wētā created a new suite of water solvers — many of them within Loki, the studio’s in-house multiphysics simulation framework. Loki allows coupling of multiple solvers in any configuration. For example, hair, cloth, air and water can all be simulated together.

Other key innovations from Wētā centered on both dry and wet performance capture, new deep learning models to process stereo camera images and generate depth maps for compositing, and neural networks to assist with facial animation and muscle systems.

Creating Captivating Car Chases Through Gritty Gotham

Wētā FX was also behind the cinematic visuals for The Batman. The team, led by VFX supervisor Anders Langlands, worked on the gripping highway chase between Batman and the infamous villain, the Penguin. As they race through the city of Gotham under heavy rainfall, the Penguin sets off a sequence of car crashes and explosions.

To create a feeling of danger and exhilaration, the team put the car chase scene together through heavily enhanced live action and completely CG shots. Rendering the proper lighting; simulating realistic raindrops colliding with multiple surfaces, hydroplaning and wheel spray; and illuminating rain through headlights and streetlights all added to the complexity of these shots. Wētā also worked on background environments for scenes in the Batcave and Gotham’s City Hall.

Taking CGI to the Sky

The practical effects and cinematography behind Top Gun: Maverick were an instant highlight of this heart-pounding Hollywood blockbuster. But to add more layers of realism to those outstanding aerial shots, VFX Supervisor Ryan Tudhope and the team at Method Studios partnered with the camera department, aerial coordinators and the United States Navy to film extensive air-to-air and ground-to-air footage of real jets. They captured over 800 hours of aerial stunts, mounts and plates to provide their team with a practical foundation for the visual effects work.

The Top Gun: Maverick team implemented various VFX techniques, creating a surprising 2,400 VFX shots for the movie. The visual effects included creating and adding CG planes in scenes, as well as adding missiles, smoke and explosions in various action sequences. The invisible nature of the visual effects in Top Gun: Maverick makes it a top contender for the Academy Award for Best Visual Effects.

A New Swimlane for Underwater Worlds

In Black Panther: Wakanda Forever, Wētā FX further demonstrated its leadership in creating photorealistic underwater sequences. Chris White, visual effects supervisor for the film, was tasked with creating the Mesoamerican-inspired Talokan underwater kingdom.

To get a realistic look for the characters in this undersea world, Wētā used a combination of live-action sequences shot in water tanks and dry-for-wet shots that helped capture realistic underwater motion for the characters, clothes and hair.

Wētā also reflected how various skin tones would react to light with the added complexity of a murky underwater environment. The bar for realistic water simulation has once again been raised by Wētā FX in Black Panther: Wakanda Forever.

All Action on the VFX Front

Movie magic is made when visual effects are so seamless that the audience remains completely immersed in the story, not realizing that what they’re seeing is an effect. This is how VFX supervisor Markus Frank and production company Cine Chromatix earned their Best Visual Effects nomination for All Quiet on the Western Front.

To authentically tell the story of two young soldiers during World War I, Cine Chromatix and the film’s visual effects teams focused on the fine details needed to craft VFX that are hidden in plain sight.

The result is stunning. Even after watching Cine Chromatix’s VFX breakdown reel for the film, viewers may find themselves scrubbing back and forth to decipher fact from fiction.

See How Oscar-Nominated VFX Are Created at GTC

NVIDIA congratulates all of this year’s nominees for the Academy Award for Best Visual Effects.

Learn more about visual effects, AI, virtual production and animation at NVIDIA GTC, a global technology conference taking place online March 20-23. Register for free and hear from industry luminaries creating stunning visuals in film and TV. Check out all the media and entertainment sessions at GTC.

Featured image courtesy of 20th Century Studios.

Hosting YOLOv8 PyTorch models on Amazon SageMaker Endpoints

Deploying models at scale can be a cumbersome task for many data scientists and machine learning engineers. However, Amazon SageMaker endpoints provide a simple solution for deploying and scaling your machine learning (ML) model inferences. Our last blog post and GitHub repo on hosting a YOLOv5 TensorFlowModel on Amazon SageMaker Endpoints sparked a lot of interest from our readers, many of whom also wanted to learn how to host the YOLOv5 model using PyTorch. To address this interest, and with the recent release of the YOLOv8 model from Ultralytics, we present this post on how to host a YOLOv8 PyTorchModel on SageMaker endpoints. The YOLOv8 model, distributed under the GNU GPL3 license, is a popular object detection model known for its runtime efficiency as well as its detection accuracy. Amazon SageMaker endpoints provide an easily scalable and cost-optimized solution for model deployment.

Solution overview

The following image outlines the AWS services used to host the YOLOv8 model using a SageMaker endpoint and invoke the endpoint as a user. The solution uses AWS CloudFormation to automate the creation of a SageMaker instance and clone our GitHub repository to the instance. The SageMaker notebook accesses and downloads a YOLOv8 PyTorch model and stores the custom inference code along with the model in an Amazon Simple Storage Service (Amazon S3) bucket. The steps within the notebook highlight the creation of the SageMaker endpoint that hosts the YOLOv8 PyTorch model and the custom inference code. The notebook also demonstrates how to test the endpoint and plot the results. The solution consists of the following steps:

  1. We have created a GitHub repository with two notebooks 1_DeployEndpoint.ipynb and 2_TestEndpoint.ipynb, under the sm-notebook/ directory.
  2. An AWS CloudFormation template runs, creates a SageMaker notebook instance, and then clones the GitHub repository onto it.
  3. The notebook 1_DeployEndpoint.ipynb is used to download the YOLOv8 model.
  4. The YOLOv8 model and inference code are stored as model.tar.gz in Amazon S3.
  5. A SageMaker endpoint is created by hosting the model.tar.gz.
  6. The notebook 2_TestEndpoint.ipynb is used to test the endpoint and gather results.

Prerequisites

An AWS account with AWS Identity and Access Management (IAM) roles that provide access to:

  • AWS CloudFormation
  • Amazon SageMaker
  • Amazon S3

1. Host YOLOv8 on a SageMaker endpoint

Ultralytics has multiple YOLOv8 models with different capabilities. They are subdivided into the following:

  • Object Detection (yolov8l.pt, yolov8m.pt, yolov8n.pt, yolov8s.pt, yolov8x.pt, yolov8x6.pt)
  • Segmentation (yolov8l-seg.pt, yolov8m-seg.pt, yolov8n-seg.pt, yolov8s-seg.pt, yolov8x-seg.pt)
  • Classification (yolov8l-cls.pt, yolov8m-cls.pt, yolov8n-cls.pt, yolov8s-cls.pt, yolov8x-cls.pt)
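
Before packaging anything for SageMaker, any of these checkpoints can be exercised locally with the Ultralytics Python API (the same library the inference code imports later). The following is a rough sketch only; the checkpoint is downloaded automatically on first use, and bus.jpg is a stand-in for any local image.

from ultralytics import YOLO

# Local sanity check of a checkpoint (not part of the SageMaker deployment)
model = YOLO("yolov8l.pt")      # any of the checkpoints listed above
results = model("bus.jpg")      # run object detection on a local image
print(results[0].boxes)         # detected boxes, classes, and confidences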

In this blog, we focus on object detection using the yolov8l.pt PyTorch model. To host the YOLOv8 model and the custom inference code on a SageMaker endpoint, they need to be compressed together into a single model.tar.gz with the following structure:

model.tar.gz
        ├─ code/
        │    ├── inference.py
        │    └── requirements.txt
        └── yolov8l.pt

The model weights file yolov8l.pt must sit outside the code/ directory, and the main inference Python script inference.py, which contains the functions needed for loading the model, parsing the input, running inference, and post-processing the output, should reside under the code/ directory. Further details on inference.py are presented in the following section.

1.1. Custom inference code

Depending on your pipeline and code workflow, the inputs to and outputs from SageMaker endpoints can vary. In this post, we present a workflow for passing a NumPy array to the endpoint and processing the results. However, the inputs to the endpoint can be JSON or text as well. Depending on your workflow, you must modify the functions in inference.py to accommodate different inputs and outputs. In addition, with the recent release of YOLOv8, the Ultralytics team released their Python API, which allows us to install the YOLO library directly through requirements.txt and import the model in inference.py.
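
For illustration, the following is a minimal sketch (not part of the repository's inference.py) of how input_fn could be adapted to accept a JSON payload carrying a base64-encoded image. The field name "image" and the content type check are assumptions for this example only.

import base64, json
import cv2
import numpy as np

def input_fn(request_body, request_content_type):
    # Hypothetical JSON variant: expects {"image": "<base64-encoded JPEG bytes>"}
    if request_content_type == "application/json":
        payload = json.loads(request_body)
        jpg_as_np = np.frombuffer(base64.b64decode(payload["image"]), dtype=np.uint8)
        return cv2.imdecode(jpg_as_np, flags=-1)
    raise ValueError("Unsupported content type: " + str(request_content_type))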

1.1.1. Contents of code/inference.py:

import numpy as np
import torch, os, json, io, cv2, time
from ultralytics import YOLO

def model_fn(model_dir):
    print("Executing model_fn from inference.py ...")
    env = os.environ
    model = YOLO("/opt/ml/model/code/" + env['YOLOV8_MODEL'])
    return model

def input_fn(request_body, request_content_type):
    print("Executing input_fn from inference.py ...")
    if request_content_type:
        # The request body is a NumPy-serialized buffer holding JPEG-encoded image bytes
        jpg_original = np.load(io.BytesIO(request_body), allow_pickle=True)
        jpg_as_np = np.frombuffer(jpg_original, dtype=np.uint8)
        img = cv2.imdecode(jpg_as_np, flags=-1)
    else:
        raise Exception("Unsupported content type: " + str(request_content_type))
    return img

def predict_fn(input_data, model):
    print("Executing predict_fn from inference.py ...")
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    with torch.no_grad():
        result = model(input_data)
    return result

def output_fn(prediction_output, content_type):
    print("Executing output_fn from inference.py ...")
    infer = {}
    for result in prediction_output:
        if result.boxes:
            infer['boxes'] = result.boxes.numpy().data.tolist()
        if result.masks:
            infer['masks'] = result.masks.numpy().data.tolist()
        if result.probs:
            infer['probs'] = result.probs.numpy().data.tolist()
    return json.dumps(infer)

1.1.2. Contents of code/requirements.txt:

opencv-python
torchvision
seaborn
ultralytics
omegaconf==2.3.0

Once all the file contents for model.tar.gz are finalized, run the following command to create a tar ball:

$ tar -czvf model.tar.gz code/ yolov8l.pt

1.2. Host model.tar.gz on a SageMaker endpoint:

This involves a few steps: the model.tar.gz is first uploaded to an S3 bucket, the uploaded artifact is used to create a SageMaker PyTorchModel, and finally this PyTorchModel is used to deploy the model to a SageMaker endpoint.

1.2.1. Upload model and inference code to S3:

from sagemaker import s3

bucket = "s3://NAME_OF_BUCKET"
prefix = "yolov8/demo-custom-endpoint"
model_data = s3.S3Uploader.upload("model.tar.gz", bucket + "/" + prefix)

1.2.2. Create SageMaker PyTorchModel:

from sagemaker.pytorch import PyTorchModel

model_name = 'yolov8l.pt'

# role is the SageMaker execution role and sess is the SageMaker session,
# both created earlier in the notebook
model = PyTorchModel(entry_point='inference.py',
                     model_data=model_data,
                     framework_version='1.12',
                     py_version='py38',
                     role=role,
                     env={'TS_MAX_RESPONSE_SIZE':'20000000', 'YOLOV8_MODEL': model_name},
                     sagemaker_session=sess)

1.2.3. Compile and host the model to an endpoint:

from datetime import datetime

from sagemaker.deserializers import JSONDeserializer

INSTANCE_TYPE = 'ml.m5.4xlarge'
ENDPOINT_NAME = 'yolov8-pytorch-' + str(datetime.utcnow().strftime('%Y-%m-%d-%H-%M-%S-%f'))

predictor = model.deploy(initial_instance_count=1,
                         instance_type=INSTANCE_TYPE,
                         deserializer=JSONDeserializer(),
                         endpoint_name=ENDPOINT_NAME)

2. Test the SageMaker endpoint

Once the endpoint is successfully hosted, it can be used to run inference. In this step, we first read an image, convert it to bytes, and run inference by passing the bytes as input to the endpoint. The results contain bounding boxes, masks, or confidence scores, depending on the type of YOLOv8 model used for hosting, and the output can be plotted accordingly.

2.1.1. Generate inference results and plot output:

import cv2, random
import numpy as np
import matplotlib.pyplot as plt

orig_image = cv2.imread('bus.jpg')

image_height, image_width, _ = orig_image.shape
model_height, model_width = 300, 300
x_ratio = image_width/model_width
y_ratio = image_height/model_height

resized_image = cv2.resize(orig_image, (model_height, model_width))
payload = cv2.imencode('.jpg', resized_image)[1].tobytes()
result = predictor.predict(payload)

if 'boxes' in result:
    for idx,(x1,y1,x2,y2,conf,lbl) in enumerate(result['boxes']):
        # Draw Bounding Boxes
        x1, x2 = int(x_ratio*x1), int(x_ratio*x2)
        y1, y2 = int(y_ratio*y1), int(y_ratio*y2)
        color = (random.randint(10,255), random.randint(10,255), random.randint(10,255))
        cv2.rectangle(orig_image, (x1,y1), (x2,y2), color, 4)
        cv2.putText(orig_image, f"Class: {int(lbl)}", (x1,y1-40), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2, cv2.LINE_AA)
        cv2.putText(orig_image, f"Conf: {int(conf*100)}", (x1,y1-10), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2, cv2.LINE_AA)
        if 'masks' in result:
            # Draw Masks
            mask = cv2.resize(np.asarray(result['masks'][idx]), dsize=(image_width, image_height), interpolation=cv2.INTER_CUBIC)
            for c in range(3):
                orig_image[:,:,c] = np.where(mask>0.5, orig_image[:,:,c]*(0.5)+0.5*color[c], orig_image[:,:,c])

if 'probs' in result:
    # Find Class
    lbl = result['probs'].index(max(result['probs']))
    color = (random.randint(10,255), random.randint(10,255), random.randint(10,255))
    cv2.putText(orig_image, f"Class: {int(lbl)}", (20,20), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2, cv2.LINE_AA)

plt.imshow(cv2.cvtColor(orig_image, cv2.COLOR_BGR2RGB))
plt.show()

2.1.2. Results:

The output of object detection and segmentation YOLOv8 models is shown in the following images:

3. Clean up

Deleting the CloudFormation stack removes all the resources it originally created. However, the stack is not currently configured to automatically remove the endpoint, endpoint configuration, and model. If the hosted endpoint is no longer being used, it is good practice to remove it to save costs. This can be done as follows:

import boto3

sm_client = boto3.client(service_name="sagemaker")

# endpoint_name refers to the hosted endpoint (ENDPOINT_NAME from the deployment step)
response = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_name)
print(response)
endpoint_config_name = response['EndpointConfigName']

# Delete Endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

# Delete Endpoint Configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

# Delete Model
for prod_var in response['ProductionVariants']:
    model_name = prod_var['ModelName']
    sm_client.delete_model(ModelName=model_name)

Conclusion

In this post, we demonstrated how to host a pre-trained YOLOv8 PyTorchModel on a SageMaker endpoint and test the inference results by invoking the endpoint. The detailed code is available on GitHub, and the template CloudFormation stack is available on GitHub as well.

To learn more about SageMaker endpoints, please check out Create your endpoint and deploy your model and Use PyTorch with Amazon SageMaker, which highlights using PyTorchModel on SageMaker. The process can be automated using CloudFormation support for SageMaker.


About the authors

Kevin Song is a Data Scientist at AWS Professional Services. He holds a PhD in Biophysics and has more than five years of industry experience in building computer vision and machine learning solutions.

Romil Shah is an IoT Edge Data Scientist at AWS Professional Services. Romil has more than six years of industry experience in computer vision, machine learning, and IoT edge devices. He is involved in helping customers optimize and deploy their machine learning models for edge devices in an industrial setup.

Four approaches to manage Python packages in Amazon SageMaker Studio notebooks

This post presents and compares options and recommended practices on how to manage Python packages and virtual environments in Amazon SageMaker Studio notebooks. A public GitHub repo provides hands-on examples for each of the presented approaches.

Amazon SageMaker Studio is a web-based, integrated development environment (IDE) for machine learning (ML) that lets you build, train, debug, deploy, and monitor your ML models. Studio provides all the tools you need to take your models from data preparation to experimentation to production while boosting your productivity.

Studio notebooks are collaborative Jupyter notebooks that you can launch quickly because you don’t need to set up compute instances and file storage beforehand. When you open a notebook in Studio, you are prompted to set up your environment by choosing a SageMaker image, a kernel, an instance type, and, optionally, a lifecycle configuration script that runs on image startup.

For more details on Studio notebook concepts and other aspects of the architecture, refer to Dive deep into Amazon SageMaker Studio Notebooks architecture.

Studio notebooks are designed to support you in all phases of your ML development, for example, ideation, experimentation, and operationalization of an ML workflow. Studio comes with pre-built images that include the latest Amazon SageMaker Python SDK and, depending on the image type, other specific packages and resources, such as Spark, MXNet, or PyTorch framework libraries, and their required dependencies. Each image can host one or multiple kernels, which can be different virtual environments for development.

To ensure the best fit for your development process and phases, access to specific or latest ML frameworks, or to fulfil data access and governance requirements, you can customize the pre-built notebook environments or create new environments using your own images and kernels.

This post considers the following approaches for customizing Studio environments by managing packages and creating Python virtual environments in Studio notebooks:

  • Use a custom Studio KernelGateway app image
  • Use Studio notebook lifecycle configurations
  • Use the Studio Amazon Elastic File System (Amazon EFS) volume to persist Conda environments
  • Use pip install

Studio KernelGateway apps and notebooks kernels

One of the main differences of Studio notebooks architecture compared to SageMaker notebook instances is that Studio notebook kernels run in a Docker container, called a SageMaker image container, rather than hosted directly on Amazon Elastic Compute Cloud (Amazon EC2) instances, which is the case with SageMaker notebook instances.

The following diagram shows the relations between KernelGateway, notebook kernels, and SageMaker images. (For more information, see Use Amazon SageMaker Studio Notebooks.)

Because of this difference, there are some specifics of how you create and manage virtual environments in Studio notebooks, for example usage of Conda environments or persistence of ML development environments between kernel restarts.

The following sections explain each of four environment customization approaches in detail, provide hands-on examples, and recommend use cases for each option.

Prerequisites

To get started with the examples and try the customization approaches on your own, you need an active SageMaker domain and at least one user profile in the domain. If you don’t have a domain, refer to the instructions in Onboard to Amazon SageMaker Domain.

Studio KernelGateway custom app images

A Studio KernelGateway app image is a Docker container that identifies the kernels, language packages, and other dependencies required to run a Jupyter notebook in Studio. You use these images to create environments that you then run Jupyter notebooks on. Studio provides many built-in images for you to use.

If you need different functionality, specific frameworks, or library packages, you can bring your own custom images (BYOI) to Studio.

You can create app images and image versions, attach image versions to your domain, and make an app available for all domain users or for specific user profiles. You can manage app images via the SageMaker console, the AWS SDK for Python (Boto3), and the AWS Command Line Interface (AWS CLI). The custom image needs to be stored in an Amazon Elastic Container Registry (Amazon ECR) repository.

The main benefits of this approach are a high level of version control and reproducibility of an ML runtime environment and immediate availability of library packages because they’re installed in the image. You can implement comprehensive tests, governance, security guardrails, and CI/CD automation to produce custom app images. Having snapshots of development environments facilitates and enforces your organization’s guardrails and security practices.

The provided notebook implements an app image creation process for Conda-based environments. The notebook demonstrates how you can create multi-environment images so the users of the app can have a selection of kernels they can run their notebooks on.

Configure a custom app image

You must run this notebook on a SageMaker notebook instance, which allows you to use Docker locally and run Docker commands in the notebook. As an alternative to using notebook instances or shell scripts, you can use the Studio Image Build CLI to work with Docker in Studio. The Studio Image Build CLI lets you build SageMaker-compatible Docker images directly from your Studio environments by using AWS CodeBuild.

If you don’t have a SageMaker notebook instance, follow the instructions in Create an Amazon SageMaker Notebook Instance to get started.

You must also ensure that the execution role you use for a notebook instance has the required permissions for Amazon ECR and SageMaker domain operations:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:CompleteLayerUpload",
                "ecr:GetAuthorizationToken",
                "ecr:UploadLayerPart",
                "ecr:InitiateLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:PutImage",
                "ecr:CreateRepository",
                "ecr:ListImages"
            ],
            "Resource": "arn:aws:ecr:<REGION>:<ACCOUNT ID>:repository/<YOUR REPOSITORY NAME>"
        }
    ]
}

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateDomain"
            ],
            "Resource": "arn:aws:sagemaker:<REGION>:<ACCOUNT ID>:domain/<YOUR DOMAIN ID>"
        }
    ]
}

To create a custom image with two kernels, each with their own Conda virtual environment, the notebook implements the following steps:

  1. Define the Conda environments. The Conda environment must have a Jupyter kernel package installed, for example, ipykernel for Python kernel.
  2. Define a Dockerfile. Consider the custom SageMaker image specifications when creating your own image.
  3. Build a Docker image compatible with Studio and push the image into the ECR repository.
  4. Create a SageMaker image with the Docker image from the ECR repository and create an initial image version. Every time you update the image in Amazon ECR, a new image version must be created.
  5. Update an existing SageMaker domain to use this image. For this operation, the execution role needs the UpdateDomain permission. The image is immediately available to all user profiles of the domain. If you want to make the image available only for a specific user profile, you can use the UpdateUserProfile API call instead of UpdateDomain. (A minimal Boto3 sketch of steps 4 and 5 follows this list.)
  6. Launch the custom image in Studio. Start a new notebook and choose the new image on the image selection drop-down menu.
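
As a minimal Boto3 sketch of steps 4 and 5, the following could be run by a role with the permissions shown earlier. The image name, ECR URI, role ARN, app image config name, domain ID, and kernel spec name are placeholders; the kernel spec must match the kernel actually defined in your image.

import boto3

sm_client = boto3.client("sagemaker")

IMAGE_NAME = "custom-conda-image"                                           # placeholder
ECR_IMAGE_URI = "<account>.dkr.ecr.<region>.amazonaws.com/<repo>:latest"    # placeholder
ROLE_ARN = "arn:aws:iam::<account>:role/<sagemaker-execution-role>"         # placeholder
APP_IMAGE_CONFIG_NAME = "custom-conda-image-config"                         # placeholder
DOMAIN_ID = "d-xxxxxxxxxxxx"                                                # placeholder

# Step 4: create the SageMaker image and an initial image version from the ECR image
sm_client.create_image(ImageName=IMAGE_NAME, RoleArn=ROLE_ARN)
sm_client.create_image_version(ImageName=IMAGE_NAME, BaseImage=ECR_IMAGE_URI)

# Describe the kernels (Conda environments) that Studio should expose for this image
sm_client.create_app_image_config(
    AppImageConfigName=APP_IMAGE_CONFIG_NAME,
    KernelGatewayImageConfig={
        "KernelSpecs": [{"Name": "conda-env-custom-py", "DisplayName": "Custom Conda (Python 3)"}],
        "FileSystemConfig": {"MountPath": "/root", "DefaultUid": 0, "DefaultGid": 0},
    },
)

# Step 5: attach the custom image to the existing domain so all user profiles can use it
sm_client.update_domain(
    DomainId=DOMAIN_ID,
    DefaultUserSettings={
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {"ImageName": IMAGE_NAME, "AppImageConfigName": APP_IMAGE_CONFIG_NAME}
            ]
        }
    },
)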

Studio automatically recognizes the Conda environments in your image as corresponding kernels in the kernel selection drop-down menu in the Set up notebook environment widget.

Refer to these sample notebooks for more examples and use cases on custom app image implementation.

Clean up

To avoid charges, you must stop the active SageMaker notebook instances. For instructions, refer to Clean up.

Implement an automated image authoring process

As already mentioned, you can use the Studio Image Build CLI to implement an automated CI/CD process of app image creation and deployment with CodeBuild and sm-docker CLI. It abstracts the setup of your Docker build environments by automatically setting up the underlying services and workflow necessary for building Docker images.

Recommended use cases

The custom app image approach is a good fit for the following scenarios when using a Studio notebook environment:

  • Stable and controlled environments for production or sensitive development use
  • Environments without internet access, where you want to pre-package all needed resources and libraries into the image
  • High environment reuse ratio and low rate of changes in the environments
  • High scale of data science operations, dozens or hundreds of developers or teams who need access to standardized custom environments
  • Use libraries that can’t be configured on the SageMaker first-party images
  • Requirements to use custom images for a different OS or different programming language
  • Centralized governance and environment development using automated CI/CD pipelines

Limitations of this approach

This approach requires a multi-step image creation process including tests, which might be overkill for smaller or very dynamic environments. Furthermore, consider the following limitations of the approach:

  • An upfront effort is needed to add new packages or create new versions of an image. As mitigation, you can customize the existing custom image with pip, even if it’s not persistent.
  • Attaching a new custom image or adding a new version to the domain requires the UpdateDomain permission, which isn’t normally attached to the user profile execution role. We recommend using an automated pipeline with a dedicated execution role to perform this operation or give the permission to update a domain to a dedicated admin user or role.
  • A high manual effort for image authoring is involved. We recommend implementing an automated pipeline if you produce and update custom images frequently.
  • If you use Conda environments, you might encounter issues with it in Docker environment. For an example, refer to Activating a Conda environment in your Dockerfile. Not all Conda commands may work in the notebook virtual environment. However, this Studio customization approach is not limited to Conda-based environments.
  • You can’t manually switch between Conda environments in the notebook; you must switch kernels in the notebook environment setup widget.

Also consider that there are default quotas of 30 custom images per domain and 5 images per user profile. These are soft limits and can be increased.

The next sections describe more lightweight approaches that may be a better fit for other use cases.

Studio notebook lifecycle configurations

Studio lifecycle configurations define a shell script that runs at each restart of the kernel gateway application and can install the required packages. The main benefit is that a data scientist can choose which script to run to customize the container with new packages. This option doesn’t require rebuilding the container and in most cases doesn’t require a custom image at all because you can customize the pre-built ones.

Set up a lifecycle configuration process

This process takes around 5 minutes to complete. This post demonstrates how to use lifecycle configurations via the SageMaker console, and the provided notebook shows how to implement the same steps programmatically using Boto3.
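
For reference, a minimal Boto3 sketch of that programmatic flow might look like the following; the configuration name, script content, and domain ID are placeholders, and the console walkthrough below remains the path demonstrated in this post.

import base64

import boto3

sm_client = boto3.client("sagemaker")

script = """#!/bin/bash
set -eux
pip install --upgrade pyarrow
"""

# Lifecycle configuration content must be base64 encoded
response = sm_client.create_studio_lifecycle_config(
    StudioLifecycleConfigName="install-pip-package",                                  # placeholder name
    StudioLifecycleConfigContent=base64.b64encode(script.encode("utf-8")).decode("utf-8"),
    StudioLifecycleConfigAppType="KernelGateway",
)

# Attach the configuration to the domain so all user profiles can select it
sm_client.update_domain(
    DomainId="d-xxxxxxxxxxxx",                                                        # placeholder domain ID
    DefaultUserSettings={
        "KernelGatewayAppSettings": {
            "LifecycleConfigArns": [response["StudioLifecycleConfigArn"]]
        }
    },
)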

  1. On the SageMaker console, choose Lifecycle configurations in the navigation pane.
  2. On the Studio tab, choose Create configuration.

The first step to create the lifecycle configuration is to select the type.

  1. For this use case of installing dependencies each time a Jupyter kernel gateway app is created, choose Jupyter kernel gateway app and choose Next.
  2. For Name, enter a name for the configuration.
  3. In the Scripts section, define the script to be run when the kernel starts. For this example, the PyArrow library will be installed with the following script:
    #!/bin/bash
    # This script installs a single pip package on a SageMaker Studio kernel gateway app
    set -eux
    # PARAMETERS
    PACKAGE=pyarrow
    pip install --upgrade $PACKAGE

  4. Choose Create Configuration.

Now that the configuration has been created, it needs to be attached to a domain or user profile. When attached to the domain, all user profiles in that domain inherit it, whereas when attached to a user profile, it is scoped to that specific profile. For this walkthrough, we use the Studio domain route.

  1. Choose Domains in the navigation pane and open your existing domain.
  2. On the Environment tab, in the Lifecycle configurations for personal Studio apps section, choose Attach.
  3. For Source, select Existing configuration.
  4. Select the lifecycle configuration you created and choose Attach to domain.

Now that all the configuration is done, it’s time to test the script within Studio.

  1. Launch Studio and on the Launcher tab, locate the Notebooks and compute resources section, and choose Change environment to select the lifecycle configuration you created.
  2. For Start-up script, choose the lifecycle configuration you created, then choose Select.
  3. Choose Create notebook.

You can also set the Lifecycle configuration to be run by default in the Lifecycle configurations for personal Studio apps section of the Domain page.

Within the new notebook, the dependencies installed in the startup script will be available.

Recommended use cases

This approach is lightweight but also powerful because it allows you to control the setup of your notebook environment via shell scripts. The use cases that best fit this approach are the following:

  • Integrating package installations in the notebook lifecycle configuration that must run at each kernel start.
  • Environments without internet access. Use lifecycle configurations to set up an environment that accesses local or secure artifact and package repositories, such as AWS CodeArtifact.
  • If you already use lifecycle configurations, you can extend them to include package install.
  • Installation of a few additional packages on top of built-in or custom app images.
  • When you need a shorter time to market than with custom app images.

Limitations of this approach

The main limitations are a high effort to manage lifecycle configuration scripts at scale and slow installation of packages. Depending on how many packages are installed and how large they are, the lifecycle script might even timeout. There are also limited options for ad hoc script customization by users, such as data scientists or ML engineers, due to permissions of the user profile execution role.

Refer to SageMaker Studio Lifecycle Configuration Samples for more samples and use cases.

Persist Conda environments to the Studio EFS volume

SageMaker domains and Studio use an EFS volume as a persistent storage layer. You can save your Conda environments on this EFS volume. These environments are persistent between kernel, app, or Studio restart. Studio automatically picks up all environments as KernelGateway kernels.

This is a straightforward process for a data scientist, but there is a 1-minute delay for the environment to appear in the list of selectable kernels. There also might be issues with using environments for kernel gateway apps that have different compute requirements, for example a CPU-based environment on a GPU-based app.

Refer to Custom Conda environments on SageMaker Studio for detailed instructions. The post’s GitHub repo also contains a notebook with the step-by-step guide.

Create persistent Conda environments on a Studio EFS volume

This walkthrough should take around 10 minutes.

  1. On Studio, choose Home in the navigation pane.
  2. Choose Open Launcher.
  3. Within the Launcher, locate the Notebooks and compute resources section.
  4. Check that the SageMaker image selected is a Conda-supported first-party kernel image such as “Data Science.”
  5. Choose Open image terminal to open a terminal window with a new kernel.

A message displays saying “Starting image terminal…” and after a few moments, the new terminal will open in a new tab.

  1. Within the terminal, run the following commands:
    mkdir -p ~/.conda/envs
    conda create --yes -p ~/.conda/envs/custom
    conda activate ~/.conda/envs/custom
    conda install -y ipykernel
    conda config --add envs_dirs ~/.conda/envs

These commands will take about 3 minutes to run and will create a directory on the EFS volume to store the Conda environments, create the new Conda environment and activate it, install the ipykernel dependencies (without this dependency this solution will not work), and finally create a Conda configuration file (.condarc), which contains the reference to the new Conda environment directory. Because this is a new Conda environment, no additional dependencies are installed. To install other dependencies, you can modify the conda install line or wait for the following commands to finish and install any additional dependencies while inside the Conda environment.

  1. For this example, we install the NumPy library by running the following command in the terminal window:
    conda install -y numpy
    python -c "import numpy; print(numpy.version.version)"

Now that the Conda environment is created and the dependencies are installed, you can create a notebook that uses this Conda environment persisted on Amazon EFS.

  1. On the Studio Launcher, choose Create notebook.
  2. From the new notebook, choose the “Python 3 (Data Science)” kernel.
  3. For Kernel, choose the newly created Conda environment, then choose Select.

If at first there is no option for the new Conda environment, this could be because it takes a few minutes to propagate.

Back within the notebook, the kernel name will have changed in the top right-hand corner, and within a cell you can test that the dependencies installed are available.

Recommended use cases

The following use cases are the best fit for this approach:

  • Environments without internet access, with all dependencies pre-installed in the persisted Conda environments
  • Ad hoc environments that need persistence between kernel sessions
  • Testing of custom SageMaker images in Studio before creating a Docker image and pushing to Amazon ECR

Limitations of this approach

Although this approach has practical uses, consider the following limitations:

  • There might be performance issues with Amazon EFS on many small files, which is very common when managing Python packages.
  • It may be challenging to share persistent environments between Studio user profiles.
  • It may be challenging to reuse persistent environments.
  • It may be challenging to address management at scale.
  • The approach works only with specific Conda-based first-party SageMaker images, for example “Data Science,” “Data Science 2.0,” and “Data Science 3.0.” For a list of all available images, refer to Available Amazon SageMaker Images.

Pip install

You can install packages directly into the default Conda environment or the default Python environment.

Create a setup.py or requirements.txt file with all required dependencies and run %pip install -r requirements.txt. You have to run this command every time you restart the kernel or recreate an app.
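
As a minimal illustration (package names and versions below are placeholders), the install runs from a notebook cell and must be repeated after each kernel restart:

# Run inside a Studio notebook cell; packages installed this way do not persist,
# so re-run after each kernel restart or app recreation.
%pip install -r requirements.txt

# Or install individual packages directly (names and versions are illustrative)
%pip install "pandas==1.5.3" "scikit-learn==1.2.2"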

This approach is recommended for ad hoc experimentation because these environments are not persistent.

For more details about using the pip install command and limitations, refer to Install External Libraries and Kernels in Amazon SageMaker Studio.

Recommended use cases

This approach is a standard way to install packages to customize your notebook environment. The recommended use cases are limited to non-production use for ad hoc experimentation in a notebook:

  • Ad hoc experimentation in Studio notebooks
  • Non-productive and non-sensitive environments, sandbox environments
  • Environments with internet access

Limitations of this approach

The main limitations of this approach are:

  • Some enterprise environments block all egress and ingress internet connections, so you can't use pip install to pull Python packages, or you need to configure an offline mode
  • Lower reproducibility of environments
  • Need to wait until packages are downloaded and installed
  • No persistence between image restarts

Conclusion

SageMaker Studio offers a broad range of possible customization of development environments. Each user role such as a data scientist; an ML, MLOps, or DevOps engineer; and an administrator can choose the most suitable approach based on their needs, place in the development cycle, and enterprise guardrails.

The following table summarizes the presented approaches along with their preferred use cases and main limitations.

Approach: Bring your own image
Persistence: Permanent, transferrable between user profiles and domains
Best fit use cases:
  • Need for a stable, reproducible, shareable, and centrally managed ML runtime
  • Reuse the same image for Studio development, and SageMaker processing and training jobs
  • Enterprise ML runtime golden images with built-in security controls and guardrails
Limitations:
  • Multi-step manual authoring process, or needs an automated build and test pipeline

Approach: Lifecycle configurations
Persistence: Permanent, transferrable between user profiles and domains
Best fit use cases:
  • Need for a centrally managed, reproducible, and shareable environment
  • Need for installation of a few additional packages on top of an existing environment
Limitations:
  • Time limit for environment installation
  • Effort and challenges for managing at scale

Approach: Conda environments on the Studio EFS volume
Persistence: Permanent, not transferrable between user profiles or domains
Best fit use cases:
  • Fast experimentation in a notebook with a need for persistence, reuse, and reproducibility of environments
  • Single-user self-managed environments
Limitations:
  • Works only with some kernels
  • Performance issues with many small files
  • Can't share environments between users

Approach: Pip install
Persistence: Transient, no persistence between image or Studio restarts, not transferrable between user profiles or domains
Best fit use cases:
  • Fast experimentation in a notebook
  • Single-user self-managed environments
  • Non-productive environments
Limitations:
  • Low reproducibility of environments
  • Potentially long package download and installation times
  • No persistence

It’s still Day 1. Real-world virtual environment and Python package management is far more complex than these four approaches, but this post helps you take the first steps toward developing your own use case.

You can find more use cases, details, and hands-on examples in the following resources:


About the authors

Yevgeniy Ilyin is a Solutions Architect at Amazon Web Services (AWS). He has over 20 years of experience working at all levels of software development and solutions architecture and has used programming languages from COBOL and Assembler to .NET, Java, and Python. He develops and codes cloud native solutions with a focus on big data, analytics, and data engineering.

Alex Grace is a Solutions Architect at Amazon Web Services (AWS) who looks after Fintech Digital Native Businesses. Based in London, Alex works with a few of the UK’s leading Fintechs and enjoys supporting their use of AWS to solve business problems and fuel future growth. Previously, Alex has worked as a software developer and tech lead at Fintech startups in London and has more recently been specialising in AWS’ machine learning solutions.

AI/ML-driven actionable insights and themes for Amazon third-party sellers using AWS

The Amazon International Seller Growth (ISG) team runs the CSBA (Customer Service by Amazon) program that supports over 200,000 third-party Merchant Fulfilled Network (MFN) sellers. Amazon call centers facilitate hundreds of thousands of phone calls, chats, and emails going between the consumers and Amazon MFN sellers. The large volume of contacts creates a challenge for CSBA to extract key information from the transcripts that helps sellers promptly address customer needs and improve customer experience. Therefore, it’s critical to automatically discover insights from these transcripts, perform theme detection to analyze multiple customer conversations, and automatically present a set of themes that indicate the top reasons for customer contact, so that the customer problems are addressed in the right way and as soon as possible.

This post presents a solution that uses a workflow and AWS AI and machine learning (ML) services to provide actionable insights based on those transcripts. We use multiple AWS AI/ML services, such as Contact Lens for Amazon Connect and Amazon SageMaker, and utilize a combined architecture. This solution is tested with ISG using a small volume of data samples. In this post, we discuss the thought process, building this solution, and the outcome from the test. We believe the lessons learned and our journey presented here may help you on your own journey.

Operational landscape and business workflow

The following figure shows the recommended operational landscape with stakeholders and business workflow for ISG so that sellers can stay close to their customers anytime, anywhere. The consumer contacts Customer Service through a contact center platform and engages with the Customer Service Associate (CSA). Then the transcripts of contacts become available to CSBA to extract actionable insights through millions of customer contacts for the sellers, and the data is stored in the Seller Data Lake. Sellers use the Amazon Seller Central portal to access the analytics outcomes and take action to quickly and effectively address customer problems.

Solution overview

The following diagram shows the architecture reflecting the workflow operations into AI/ML and ETL (extract, transform, and load) services.

[Figure: solution architecture]

The workflow steps are as follows:

  1. We use Amazon Connect as a cloud contact center for consumer-CSA interactions. Contact Lens for Amazon Connect generates call and chat transcripts; derives contact summary, analytics, categorization of associate-customer interaction, and issue detection; and measures customer sentiments.
  2. Contact Lens then stores analytics data into an Amazon Simple Storage Service (Amazon S3) bucket for long-term retention.
  3. Amazon Kinesis Data Streams collects and transfers the high-throughput analytics data, processed by AWS Lambda, and injects and stores the data into an intermediate S3 bucket. At this stage, the data contains call and chat transcripts, sentiment scores, detected issues, and categories.
  4. It triggers Lambda functions to ingest the data stream, extract the requested data fields, and trigger inference of custom ML analyses by AWS AI/ML services, on top of the Contact Lens results. In this analysis, Contact Lens provides accurate sentiment scores measuring customer satisfaction on consumer-CSA interactions, and Contact Lens rules help us categorize known issues in the contact center. At this stage, ISG wanted to provide additional insights to the seller by detecting themes: discovering previously unknown issues in seller-specific calls, performed resolutions, and specific key phrases. Here, a non-deep learning model was trained and run on SageMaker, the details of which are explained in the following section. (A minimal sketch of the Lambda function follows this list.)
  5. After the AI/ML-based analytics, all actionable insights are generated and then stored in the Seller Data Lake. The insights are shared on the Seller Central Portal for the international sellers to pinpoint the root cause and take prompt action.
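
The post does not include the Lambda code itself; as a rough sketch of steps 3 and 4, under the assumption that each Kinesis record carries a Contact Lens analytics document as JSON, the function might look like the following. The bucket, key, and field names are illustrative only.

import base64
import json

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "intermediate-contact-analytics-bucket"   # placeholder


def handler(event, context):
    for record in event["Records"]:
        # Kinesis record data arrives base64 encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Keep only the fields needed for the downstream ML analyses
        extracted = {
            "ContactId": payload.get("ContactId"),
            "Transcript": payload.get("Transcript"),
            "Sentiment": payload.get("Sentiment"),
            "Categories": payload.get("Categories"),
        }

        key = f"contact-lens/{extracted['ContactId']}.json"
        s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=json.dumps(extracted))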

In the following sections, we dive deeper into the AI/ML solution and its components.

Data labeling

In this section, we describe our approach for data labeling to identify the contact reason and resolution, and our methodology for keywords extraction for the sellers to perform root cause analysis.

Contact reason and resolution labeling

To detect the contact reason from transcripts with ML, we utilized seven Standardized Issue Codes (SICs) as the data labels from the sample data provided by the ISG team:

  • Contacted seller to request cancelation
  • Tracking shows delivered but shipment not received
  • Shipment undeliverable
  • Shipment not delivered past delivery date
  • Shipment in transit to customer
  • Request Return Mailing Label (RML)
  • Item non-returnable

The contact reason labels can be further extended by adding previously unknown issues that are relevant to the seller but had not been defined in the SICs. Unlike the contact reason, the contact resolution doesn't have a label associated with the transcripts. The resolution categories were specified by the ISG team, and the resolutions needed to be labeled based on these categories. Therefore, we utilized Amazon SageMaker Ground Truth to create or update labels for each contact.

Ground Truth provides a data labeling service that makes it easy to label data, and gives you the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce. For this solution, the ISG team defined four categories for contact resolution in over 140 transcript documents, which were labeled by Amazon Mechanical Turk contractors:

  • Full refund – 69 records
  • Contact seller – 52 records
  • Partial refund – 15 records
  • Other – 4 records

It only took a couple of hours for the contractors to complete the multi-label text classification contact center resolution labeling for the 140 documents, and have them reviewed by the customer. In the next step, we build the multi-class classification models, then predict the contact reason and resolution from the new call and chat transcripts coming from the customer service.

Keywords for the root cause analysis

Another challenge is to extract the keywords from the transcripts that can guide the MFN sellers on specific actions. For this example, the seller needs to capture the key information such as product information, critical timeline, problem details, and refund offered by the CSA, which may not be clear. Here we built a custom key phrases extraction model in SageMaker using the RAKE (Rapid Automatic Keyword Extraction) algorithm, following the process shown in the following figure. RAKE is a domain-independent keyword extraction algorithm that determines key phrases by analyzing the frequency of word appearance and its co-occurrence with other words in the text.

[Figure: keywords extraction process]

After standard document preprocessing, RAKE detects the most relevant keywords and phrases from the transcript documents. The output is listed as follows:

[('im amazons chat helper .. im', 0.08224299065420558),
('jun 23 .. could', 0.041588785046728964), <== timeline
('original payment method please', 0.04112149532710279), <== resolution: refund
('amazon gift card balance', 0.04112149532710279), <== resolution: refund
('previous conversation .. let', 0.04018691588785046),
('faulty pieces would like', 0.036448598130841114), <== call reason: faulty piece
('nice day !:)..', 0.025233644859813078),
('dual fuel gas', 0.025233644859813078), <== call reason: product info
('customer service hub', 0.025233644859813078),
('5 business days', 0.025233644859813078), <== timeline
('try .. got', 0.02476635514018691),
('right place ..', 0.023364485981308407),
('item .. let', 0.023364485981308407),
('youd like help', 0.02242990654205607),
('think would help', 0.02242990654205607),
('search help pages', 0.02242990654205607),
('gbc1793w ). order', 0.02242990654205607), <== call reason: product info
('moment .. ok', 0.021962616822429903),
('charcoal combo grill', 0.021028037383177565), <== call reason: product info
('call back ..', 0.021028037383177565),
('yes please anything', 0.020093457943925228),
('chat ..', 0.014953271028037382),
('would love', 0.014018691588785043),
('looks like', 0.014018691588785043),
('bent pieces', 0.013084112149532708), <== call reason: faulty details

This method captured key phrases with high relevance scores on the critical information such as timeline (“June 23”), refund resolution (“Amazon gift card,” “in 5 business days”), product information (“charcoal combo grill,” “dual fuel gas,” “gbc1793w”) and problem details (“faulty piece,” “bent pieces”). These insights not only tell the seller that this customer has been taken care of by getting a refund, but also guide the seller to further investigate the gas grill product defect and avoid having similar issues for other customers.
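
The key phrase model in this post is a custom RAKE implementation hosted on SageMaker. As a rough local illustration of the algorithm only, the open-source rake-nltk package (an assumption here, not the package used in the post) produces similarly ranked phrases; its scores and normalization differ from the output shown above.

# pip install rake-nltk  (also requires nltk.download("stopwords") and nltk.download("punkt"))
from rake_nltk import Rake

transcript = (
    "Im Amazons chat helper. The faulty pieces of the charcoal combo grill can be returned, "
    "and a refund to the original payment method or an Amazon gift card balance "
    "will be issued in 5 business days."
)  # illustrative snippet, not an actual transcript

rake = Rake()                              # uses NLTK English stopwords by default
rake.extract_keywords_from_text(transcript)

# Returns (score, phrase) pairs, highest-scoring phrases first
for score, phrase in rake.get_ranked_phrases_with_scores():
    print(f"{score:.2f}  {phrase}")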

Text classification model training

Contact Lens generated transcripts, contact summaries, and sentiments for the call and chat samples collected from ISG Customer Service. Throughout the testing, the transcriptions and sentiment scores were as accurate as expected. Along with known issues, the ISG team also looks to detect unknown issues from transcripts to meet seller-specific needs, such as delivery problems, product defects, the resolutions provided by the contact, and issues or key phrases leading to a return or refund.

To address this challenge, we extended our tests with custom models on SageMaker. Given the size of the dataset and the samples, our experience pointed us toward "bag-of-words" based, more conventional (non-deep learning) models on SageMaker.

We performed the contact reason classification modeling following the three steps on SageMaker as shown in the following figure.

[Figure: text classification process]

The steps are as follows:

  1. Preprocessing – We used the NLTK library to lower the cases; remove punctuation, tags, markups, and white space trailing; and filter single letters, numeric values, and customized lists of stop words.
  2. Vectorization – We used the TF-IDF (Term Frequency-Inverse Document Frequency) method to convert the processed document into a matrix of TF-IDF features. The method quantifies the importance and relevance of words and phrases in a document with a collection of documents (corpus), then generates the features in numeric values to represent how important a word is to the document in the corpus. For this solution, we tested with specifying 750 and 1,500 features.
  3. Multi-class classification – We generated a seven-class classification model using a vectorized feature list and SIC labels. We utilized 90% of the samples for training and 10% for validation.
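
The following is a minimal sketch of steps 1 and 2 under stated assumptions: the transcripts live in a local transcripts.csv file with transcript and label columns (hypothetical names), preprocessing uses the NLTK English stop words plus a small custom list, and vectorization uses scikit-learn's TfidfVectorizer with 750 features.

import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)

# Hypothetical file and column names; the real transcript layout may differ
df = pd.read_csv("transcripts.csv")               # columns: "transcript", "label"
custom_stop_words = set(stopwords.words("english")) | {"amazon", "chat", "helper"}

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                               # strip tags and markup
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    tokens = [t for t in text.split()
              if len(t) > 1 and not t.isnumeric() and t not in custom_stop_words]
    return " ".join(tokens)

docs = df["transcript"].map(preprocess)

# Step 2: TF-IDF features (we tested with 750 and 1,500 features)
vectorizer = TfidfVectorizer(max_features=750)
features = vectorizer.fit_transform(docs)
print(features.shape)   # (number of transcripts, 750)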

We tested three algorithms aiming to obtain the best-performing model:

  • First, we used the SageMaker Linear Learner algorithm with default hyperparameters and performed 10 epochs, and reached 71% accuracy for the testing set.
  • Next, we used the SageMaker built-in XGBoost algorithm, and ran automatic hyperparameter optimization (HPO) tuning on four parameters (eta, alpha, min_child_weight, max_depth), which gave us 71% accuracy for the testing set (a tuning sketch follows this list).
  • Finally, we worked with AutoGluon’s AutoML framework on SageMaker, which performs automatic modeling and hyperparameter selection with multi-model ensembling and multi-layer stacking. The framework trained 13 models and generated a final ensemble model yielding 74% accuracy for the testing set. We also tested increasing the number of TF-IDF vectorizer features to 1,500; with the AutoGluon model, the classification accuracy on the testing set improved further to 82%.
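
As a rough illustration of the XGBoost HPO run described above, the following sketch uses the SageMaker Python SDK's HyperparameterTuner over the four parameters we tuned. The bucket paths, instance type, objective metric, parameter ranges, and job counts are assumptions for illustration, not the exact configuration we used, and the code assumes it runs in a SageMaker environment where get_execution_role() resolves.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()
xgb_image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://your-bucket/xgb-output",   # placeholder bucket
    sagemaker_session=session,
)
# Seven contact reason classes, as in our multi-class model
xgb.set_hyperparameters(objective="multi:softmax", num_class=7, num_round=200)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:merror",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.5),
        "alpha": ContinuousParameter(0, 100),
        "min_child_weight": ContinuousParameter(1, 10),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({
    "train": TrainingInput("s3://your-bucket/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://your-bucket/validation.csv", content_type="text/csv"),
})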

For our model training through AutoGluon, we used the MultilabelPredictor method from the AutoGluon library. This predictor performs multi-label prediction for tabular data. We used the sample notebook from AWS samples on GitHub, starting with importing the AutoGluon libraries and defining the MultilabelPredictor() class. To save space, we don’t show those lines in the following code snippet; you can copy them from the sample notebook. We ran the training on the file train.csv in our S3 bucket (your_path_to_s3/train.csv), specified the column used as the label, and performed model training through MultilabelPredictor.

train_data = TabularDataset('your_path_to_s3/train.csv')
subsample_size = 106                                                    # the sample size for training
train_data = train_data.sample(n=subsample_size, random_state=0)
labels = ['label']                                                      # column to predict based on the others
problem_types = ['multiclass']                                          # type of each prediction problem
save_path = 'your_save_path_to_results'                                 # the path to your S3 bucket for storing results
time_limit = 60                                                         # number of seconds to train the TabularPredictor for each label

multi_predictor = MultilabelPredictor(labels=labels, problem_types=problem_types, path=save_path)
multi_predictor.fit(train_data, time_limit=time_limit)
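
After training, the predictor can be scored on held-out transcripts. The following is a small sketch assuming a test.csv file with the same schema as train.csv (the path is a placeholder); predict() and evaluate() are the methods exposed by the MultilabelPredictor class defined in the sample notebook.

# Score the trained predictor on held-out transcripts
test_data = TabularDataset('your_path_to_s3/test.csv')

predictions = multi_predictor.predict(test_data)   # one prediction column per label
metrics = multi_predictor.evaluate(test_data)      # accuracy per label
print(predictions.head())
print(metrics)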

The following table lists the AI/ML services and models, and summarizes the accuracy.

Sample set        Transcripts   Features   Linear Learner   XGBoost with HPO   AutoGluon
Validation set    11            750        0.91             0.82               0.82
Validation set    11            1,500      0.82             0.82               0.91
Testing set       34            750        0.71             0.71               0.74
Testing set       34            1,500      0.65             0.65               0.82

The following charts summarize the accuracy for the sample sets based on the number of features.

Charts: text classification accuracy with 750 features and with 1,500 features.

In the following charts, we observed that decision tree-based ensemble models, such as LGB, XGBoost, and Random Forest, were better choices for this type of problem for both the 750-feature and 1,500-feature models. The neural network model ranked lower among the 13 models, which confirmed our expectation that deep learning might not be suitable for our case.

Charts: model score and time to train.
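
The per-model comparison in these charts comes from AutoGluon's leaderboard. As a sketch, assuming the MultilabelPredictor class from the sample notebook (which exposes a get_predictor() helper for the underlying per-label TabularPredictor), you can pull the leaderboard for the label column as follows.

# Inspect the individual models AutoGluon trained for the 'label' column
label_predictor = multi_predictor.get_predictor('label')
leaderboard = label_predictor.leaderboard(test_data)
print(leaderboard[['model', 'score_test', 'score_val', 'fit_time']])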

Conclusion

With AWS AI/ML services, we can provide accurate and efficient contact reason and contact resolution detection and other actionable insights for Amazon International Seller Growth. MFN sellers can use these insights to better understand consumer problems, and take effective actions to resolve Amazon consumers’ issues, while also optimizing their process and costs.

You can tailor the solution for your contact center by developing your own custom model in SageMaker, and feeding the call and chat transcripts for training and inference. You could also apply this solution for general theme detection to analyze customer conversations in your contact center.


About the Authors

Yunfei Bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei holds a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Burak Gozluklu is a Principal ML Specialist Solutions Architect located in Boston, MA. Burak has more than 15 years of industry experience in simulation modeling, data science, and ML technology. He helps global customers adopt AWS technologies, and especially AI/ML solutions, to achieve their business objectives. Burak holds a PhD in Aerospace Engineering from METU and an MS in Systems Engineering, and completed a postdoc on system dynamics at MIT in Cambridge, MA. Burak is passionate about yoga and meditation.

Chelsea Cai is a Senior Product Manager at Amazon’s International Seller Growth (ISG) organization, where she works on the Customer Service by Amazon (CSBA) service, helping 3P sellers improve their customer service/CX through Amazon CS technology and worldwide organizations. In her spare time, she likes philosophy, psychology, swimming, hiking, good food, and spending time with her family and friends.

Abhishek Kumar is a Senior Product Manager at Amazon’s International Seller Growth (ISG) organization, where he develops software platforms and applications to help global 3P sellers manage their Amazon business. In his free time, Abhishek enjoys traveling, learning Italian, and exploring European cultures and cuisines with his extended Italian family.


Announcing the Yammer connector for Amazon Kendra

Yammer is a social networking platform designed for open and dynamic communications and collaborations within organizations. It allows you to build communities of interest, gather ideas and feedback, and keep everyone informed. It’s available via browser or mobile app, and provides a variety of common social networking features such as private and public communities, news feeds, groups of interest, instant messaging, and more. Each of these features creates a huge amount of unstructured data that is collected over time and stored in multiple repositories. Searching through these fragmented repositories poses an enormous challenge to users, which is where Amazon Kendra comes in.

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides. Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to pull together data across several structured and unstructured repositories to index and search on.

We’re excited to announce that you can now use the Amazon Kendra connector for Yammer to search information stored in Yammer. In this post, we show how to index information stored in Yammer and use Amazon Kendra intelligent search to find answers to your questions accurately and quickly. In addition, the ML-powered intelligent search can accurately find information from unstructured documents containing natural language narrative content, for which keyword search isn’t very effective.

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to index and search across your document repository. For our solution, we demonstrate how to index a Yammer repository using the Amazon Kendra connector for Yammer. The solution consists of the following steps:

  1. Configure the Yammer app API connector on Azure and get the connection details.
  2. Create an Amazon Kendra index.
  3. Create a Yammer data source.
  4. Run a sample query to get information.

Prerequisites

To try out the Amazon Kendra connector for Yammer, you need the following:

  • An AWS account with permissions to create an Amazon Kendra index and the associated IAM roles
  • Access to Yammer, including an Azure global admin account to register the connector app

Configure the Yammer app API connector and gather connection details

Before we set up the Yammer data source, we need a few details about your Yammer repository. Let’s gather those in advance.

  1. Log in to the Azure portal using your global admin user account and choose Next.
  2. Enter your password and choose Sign in.
  3. On the Azure welcome page, choose App registrations.

Alternatively, you can search for “App Registrations” in the search bar.

  4. Choose New registration.
  5. Enter a name for the app (for example, my-yammer-connector) and choose Register.
  6. Note down the tenant ID (you need it when setting up the data source for Amazon Kendra).
  7. Next to Client credentials, choose Add a certificate or secret.
  8. Enter a description (for example, Yammer Connector Client Credentials).
  9. Choose an expiration period (for this post, 6 months).
  10. Choose Add.
  11. Save the client ID and secret ID for AWS Secrets Manager configuration.
  12. In the navigation pane, choose API permissions.

This is where you can add or remove admin permissions.

  13. Choose Add a permission and choose Yammer for Request API permissions.
  14. Choose Delegated permissions and select user_impersonation.
  15. Choose Add permissions.

Now the Yammer connector application is configured in the Azure portal. Let’s switch over to the Amazon Kendra console to complete our setup.

Create an Amazon Kendra index

You can create an Amazon Kendra index or use an existing index. For this post, we create a new index called my-yammer-index. For instructions, refer to Creating an index.
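
If you prefer to script index creation rather than use the console, a minimal boto3 sketch might look like the following; the index name, edition, and role ARN are placeholders you would replace with your own values.

import boto3

kendra = boto3.client('kendra')

# Placeholders: supply your own index name and an IAM role Amazon Kendra can assume
response = kendra.create_index(
    Name='my-yammer-index',
    Edition='DEVELOPER_EDITION',
    RoleArn='arn:aws:iam::111122223333:role/KendraIndexRole',
)
print(response['Id'])   # index ID used in later steps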

Create a Yammer data source

Complete the following steps to create your data source:

  1. On the Amazon Kendra console, choose Data sources in the navigation pane.
  2. Under Microsoft Yammer connector, choose Add connector.
  3. For Data source name, enter a name (for example, my-yammer-datasource).
  4. Enter an optional description.
  5. Choose Next.

You have the choice of creating credentials in Secrets Manager in advance. For this post, we create a secret on-demand.

  6. Configure a Secrets Manager secret with the user name, password, client ID, and secret ID you collected earlier.
  7. Choose Save.
  8. For IAM role, choose Create a new role.
  9. For Role name, choose AmazonKendra-my-yammer-iam-role.
  10. Choose Next.
  11. In the Configure sync settings section, you can optionally configure the contents to sync, the communities to include, and the date from which to sync.
  12. Choose Sync mode and Sync run schedule.

You can choose how you want to update your index when your data source content changes. Amazon Kendra provides three types of sync modes:

  • Full sync – Amazon Kendra will sync all contents in all entities, regardless of the previous sync status
  • New or modified content sync – Amazon Kendra will only sync new or modified content
  • New, modified, or deleted content sync – Amazon Kendra will only sync new, modified, or deleted content
  13. For this post, select Full sync.
  14. For Frequency, choose Run on demand.
  15. Choose Next.
  16. You can optionally set field mappings, which associate Yammer data fields with the Amazon Kendra index fields.
  17. Choose Next.
  18. Review the configuration and choose Add data source.
  19. Choose Sync now.

The sync can take anywhere from minutes to hours, depending on the size of the repository Amazon Kendra is indexing.
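
You can also trigger and monitor the sync programmatically. A minimal boto3 sketch, with placeholder data source and index IDs, might look like this.

import boto3

kendra = boto3.client('kendra')

# Placeholders: use the IDs of your Yammer data source and Amazon Kendra index
sync = kendra.start_data_source_sync_job(Id='your-data-source-id', IndexId='your-index-id')
print('Started sync:', sync['ExecutionId'])

# Check the status of recent sync jobs
jobs = kendra.list_data_source_sync_jobs(Id='your-data-source-id', IndexId='your-index-id')
for job in jobs['History']:
    print(job['ExecutionId'], job['Status'])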

Test the solution

Now that you have ingested the content from Yammer into your Amazon Kendra index, you can test some queries.

  1. On the Amazon Kendra console, navigate to your index and choose Search indexed content.
  2. Enter a sample search query and test out your search results (your query will vary based on the contents of your account).

The Yammer connector also crawls local identity information from Yammer. When a document is indexed into Amazon Kendra, a corresponding Access Control List (ACL) is ingested for most documents.

The ACL specifies which user names and group names are allowed or denied access to the document. Documents without an ACL are public documents. You can use this feature to narrow down your query by user.

You can use the user ID (email) to filter search results based on the user or their group access to documents. When you issue a query, Amazon Kendra checks the user and group information and runs the query. All the documents relevant to the query that the user has access to, including public documents, are returned.

  3. To use this feature, go back to the search results page.
  4. Expand Test query with user name or groups and choose Apply user name or groups.

For Yammer, we don’t import groups; we only import user names. User names are email IDs in this case.

  5. Enter the user ID (email) of your user and choose Apply.

The following screenshot shows the updated search results.

When fronting Amazon Kendra with an application, such as one built using Experience Builder, you can pass the user identity (in the form of the email ID) to Amazon Kendra to ensure that each user only sees content specific to their user ID. Alternatively, you can use AWS IAM Identity Center (successor to AWS Single Sign-On) to control the user context being passed to Amazon Kendra and limit queries by user.
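
At the API level, this corresponds to passing the user context in the Query call. The following is a small boto3 sketch with placeholder index ID, query text, and user email.

import boto3

kendra = boto3.client('kendra')

# Placeholders: your index ID, query text, and the user's email ID
response = kendra.query(
    IndexId='your-index-id',
    QueryText='vacation policy announcement',
    UserContext={'UserId': 'user@example.com'},
)
for item in response['ResultItems'][:3]:
    print(item['Type'], item.get('DocumentTitle', {}).get('Text'))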

Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from your Yammer account.

Limitations

This solution has the following limitations:

  • Only the export API is available to fetch all communities. API support for fetching event details, votes about polls, and update messages is not available as of this writing.
  • Deleted entities such as messages, attachments, communities, and users are not crawled in change log crawl mode. You need to run another full crawl to get the updated information on deletion of all the entities.
  • For communities, the following are not part of indexing:
    • Community insight details
    • Community information
    • Related communities for that community
    • Files uploaded directly into the community without any attachment to a message
  • Yammer has rate limits that govern the speed of ingestion. For more information, refer to Limits in Yammer.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Yammer, delete that data source.

Conclusion

With the Yammer connector for Amazon Kendra, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Kendra.

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data from Yammer, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.


About the authors

Senthil Ramachandran is an Enterprise Solutions Architect at AWS, supporting customers in the US Northeast. He is primarily focused on cloud adoption and digital transformation in the financial services industry. Senthil’s areas of interest are AI, especially deep learning and machine learning. He focuses on application automation with continuous learning to improve the human enterprise experience. Senthil enjoys watching autosport and soccer, and spending time with his family.


Announcing the ICDAR 2023 Competition on Hierarchical Text Detection and Recognition

The last few decades have witnessed the rapid development of Optical Character Recognition (OCR) technology, which has evolved from an academic benchmark task used in early breakthroughs of deep learning research to tangible products available in consumer devices and to third party developers for daily use. These OCR products digitize and democratize the valuable information that is stored in paper or image-based sources (e.g., books, magazines, newspapers, forms, street signs, restaurant menus) so that they can be indexed, searched, translated, and further processed by state-of-the-art natural language processing techniques.

Research in scene text detection and recognition (or scene text spotting) has been the major driver of this rapid development through adapting OCR to natural images that have more complex backgrounds than document images. These research efforts, however, focus on the detection and recognition of each individual word in images, without understanding how these words compose sentences and articles.

Layout analysis is another relevant line of research that takes a document image and extracts its structure, i.e., title, paragraphs, headings, figures, tables and captions. These layout analysis efforts are parallel to OCR and have been largely developed as independent techniques that are typically evaluated only on document images. As such, the synergy between OCR and layout analysis remains largely under-explored. We believe that OCR and layout analysis are mutually complementary tasks that enable machine learning to interpret text in images and, when combined, could improve the accuracy and efficiency of both tasks.

With this in mind, we announce the Competition on Hierarchical Text Detection and Recognition (the HierText Challenge), hosted as part of the 17th annual International Conference on Document Analysis and Recognition (ICDAR 2023). The competition is hosted on the Robust Reading Competition website, and represents the first major effort to unify OCR and layout analysis. In this competition, we invite researchers from around the world to build systems that can produce hierarchical annotations of text in images using words clustered into lines and paragraphs. We hope this competition will have a significant and long-term impact on image-based text understanding with the goal to consolidate the research efforts across OCR and layout analysis, and create new signals for downstream information processing tasks.

The concept of hierarchical text representation.

Constructing a hierarchical text dataset

In this competition, we use the HierText dataset that we published at CVPR 2022 with our paper “Towards End-to-End Unified Scene Text Detection and Layout Analysis”. It’s the first real-image dataset that provides hierarchical annotations of text, containing word, line, and paragraph level annotations. Here, “words” are defined as sequences of textual characters not interrupted by spaces. “Lines” are then interpreted as “space“-separated clusters of “words” that are logically connected in one direction, and aligned in spatial proximity. Finally, “paragraphs” are composed of “lines” that share the same semantic topic and are geometrically coherent.
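
To make the hierarchy concrete, the following hypothetical snippet shows one way such word, line, and paragraph annotations could be nested; the field names are illustrative only and are not the dataset's actual schema.

# Hypothetical nesting of the word / line / paragraph hierarchy (illustrative field names only)
annotation = {
    "image_id": "example_0001",
    "paragraphs": [
        {
            "lines": [
                {
                    "text": "HIERARCHICAL TEXT",
                    "words": [
                        {"text": "HIERARCHICAL", "vertices": [[10, 12], [210, 12], [210, 48], [10, 48]]},
                        {"text": "TEXT", "vertices": [[220, 12], [300, 12], [300, 48], [220, 48]]},
                    ],
                },
            ],
        },
    ],
}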

To build this dataset, we first annotated images from the Open Images dataset using the Google Cloud Platform (GCP) Text Detection API. We filtered through these annotated images, keeping only images rich in text content and layout structure. Then, we worked with our third-party partners to manually correct all transcriptions and to label words, lines and paragraph composition. As a result, we obtained 11,639 transcribed images, split into three subsets: (1) a train set with 8,281 images, (2) a validation set with 1,724 images, and (3) a test set with 1,634 images. As detailed in the paper, we also checked the overlap between our dataset, TextOCR, and Intel OCR (both of which also extracted annotated images from Open Images), making sure that the test images in the HierText dataset were not also included in the TextOCR or Intel OCR training and validation splits and vice versa. Below, we visualize examples using the HierText dataset and demonstrate the concept of hierarchical text by shading each text entity with different colors. We can see that HierText has a diversity of image domain, text layout, and high text density.

Samples from the HierText dataset. Left: illustration of each word entity. Middle: illustration of line clustering. Right: illustration of paragraph clustering.

Dataset with highest density of text

In addition to the novel hierarchical representation, HierText represents a new domain of text images. We note that HierText is currently the most dense publicly available OCR dataset. Below we summarize the characteristics of HierText in comparison with other OCR datasets. HierText identifies 103.8 words per image on average, which is more than 3x the density of TextOCR and 25x more dense than ICDAR-2015. This high density poses unique challenges for detection and recognition, and as a consequence HierText is used as one of the primary datasets for OCR research at Google.

Dataset       Training split   Validation split   Testing split   Words per image
ICDAR-2015    1,000            0                  500             4.4
TextOCR       21,778           3,124              3,232           32.1
Intel OCR     191,059          16,731             0               10.0
HierText      8,281            1,724              1,634           103.8

Comparing several OCR datasets to the HierText dataset.

Spatial distribution

We also find that text in the HierText dataset has a much more even spatial distribution than other OCR datasets, including TextOCR, Intel OCR, IC19 MLT, COCO-Text and IC19 LSVT. These previous datasets tend to have well-composed images, where text is placed in the middle of the images, and are thus easier to identify. On the contrary, text entities in HierText are broadly distributed across the images. It’s proof that our images are from more diverse domains. This characteristic makes HierText uniquely challenging among public OCR datasets.

Spatial distribution of text instances in different datasets.

The HierText challenge

The HierText Challenge represents a novel task with unique challenges for OCR models. We invite researchers to participate in this challenge and join us at ICDAR 2023 this year in San Jose, CA. We hope this competition will spark research community interest in OCR models with rich information representations that are useful for novel downstream tasks.

Acknowledgements

The core contributors to this project are Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii and Michalis Raptis. Ashok Popat and Jake Walker provided valuable advice. We also thank Dimosthenis Karatzas and Sergi Robles from Autonomous University of Barcelona for helping us set up the competition website.
