Create high-quality datasets with Amazon SageMaker Ground Truth and FiftyOne

This is a joint post co-written by AWS and Voxel51. Voxel51 is the company behind FiftyOne, the open-source toolkit for building high-quality datasets and computer vision models.

A retail company is building a mobile app to help customers buy clothes. To create this app, they need a high-quality dataset containing clothing images, labeled with different categories. In this post, we show how to repurpose an existing dataset via data cleaning, preprocessing, and pre-labeling with a zero-shot classification model in FiftyOne, and then how to adjust these labels with Amazon SageMaker Ground Truth.

You can use Ground Truth and FiftyOne to accelerate your data labeling project. We illustrate how to seamlessly use the two applications together to create high-quality labeled datasets. For our example use case, we work with the Fashion200K dataset, released at ICCV 2017.

Solution overview

Ground Truth is a fully self-served and managed data labeling service that empowers data scientists, machine learning (ML) engineers, and researchers to build high-quality datasets. FiftyOne by Voxel51 is an open-source toolkit for curating, visualizing, and evaluating computer vision datasets so that you can train and analyze better models by accelerating your use cases.

In the following sections, we demonstrate how to do the following:

  • Visualize the dataset in FiftyOne
  • Clean the dataset with filtering and image deduplication in FiftyOne
  • Pre-label the cleaned data with zero-shot classification in FiftyOne
  • Label the smaller curated dataset with Ground Truth
  • Inject labeled results from Ground Truth into FiftyOne and review labeled results in FiftyOne

Use case overview

Suppose you own a retail company and want to build a mobile application to give personalized recommendations to help users decide what to wear. Your prospective users are looking for an application that tells them which articles of clothing in their closet work well together. You see an opportunity here: if you can identify good outfits, you can use this to recommend new articles of clothing that complement the clothing a customer already owns.

You want to make things as easy as possible for the end-user. Ideally, someone using your application only needs to take pictures of the clothes in their wardrobe, and your ML models work their magic behind the scenes. You might train a general-purpose model or fine-tune a model to each user’s unique style with some form of feedback.

First, however, you need to identify what type of clothing the user is capturing. Is it a shirt? A pair of pants? Or something else? After all, you probably don’t want to recommend an outfit that has multiple dresses or multiple hats.

To address this initial challenge, you want to generate a training dataset consisting of images of various articles of clothing with various patterns and styles. To prototype with a limited budget, you want to bootstrap using an existing dataset.

To illustrate and walk you through the process in this post, we use the Fashion200K dataset released at ICCV 2017. It’s an established and well-cited dataset, but it isn’t directly suited for your use case.

Although articles of clothing are labeled with categories (and subcategories) and contain a variety of helpful tags that are extracted from the original product descriptions, the data is not systematically labeled with pattern or style information. Your goal is to turn this existing dataset into a robust training dataset for your clothing classification models. You need to clean the data, augmenting the labeling schema with style labels. And you want to do so quickly and with as little spend as possible.

Download the data locally

First, download the women.tar archive and the labels folder (with all of its subfolders) following the instructions provided in the Fashion200K dataset GitHub repository. After you've extracted them both, create a parent directory fashion200k, and move the labels and women folders into it. Fortunately, these images have already been cropped to the object detection bounding boxes, so we can focus on classification, rather than worry about object detection.

Despite the “200K” in its moniker, the women directory we extracted contains 338,339 images. To generate the official Fashion200K dataset, the dataset’s authors crawled more than 300,000 products online, and only products with descriptions containing more than four words made the cut. For our purposes, where the product description isn’t essential, we can use all of the crawled images.

Let’s look at how this data is organized: within the women folder, images are arranged by top-level article type (skirts, tops, pants, jackets, and dresses), and article type subcategory (blouses, t-shirts, long-sleeved tops).

Within the subcategory directories, there is a subdirectory for each product listing. Each of these contains a variable number of images. The cropped_pants subcategory, for instance, contains the following product listings and associated images.

The labels folder contains a text file for each top-level article type, for both train and test splits. Within each of these text files is a separate line for each image, specifying the relative file path, a score, and tags from the product description.
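As a minimal sketch of that line layout, the following function splits one line into its three parts. The example line is made up for illustration, and treating the score as a float is an assumption; check a real label file before relying on either.

```python
def parse_label_line(line):
    """Split one label-file line into (relative_path, score, tags)."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"expected at least a path and a score: {line!r}")
    # Assumption: the second column is a numeric score
    return parts[0], float(parts[1]), parts[2:]


# Hypothetical line, not copied from the dataset
example = "women/pants/cropped_pants/12345678/12345678_0.jpeg 0.98 black cropped pants"
rel_path, score, tags = parse_label_line(example)
print(rel_path, score, tags)
```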

Because we’re repurposing the dataset, we combine all of the train and test images. We use these to generate a high-quality application-specific dataset. After we complete this process, we can randomly split the resulting dataset into new train and test splits.
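The final random split can be sketched in plain Python as follows. The function name, the 80/20 ratio, and the seed are illustrative assumptions rather than part of the workflow in this post.

```python
import random


def random_split(items, test_fraction=0.2, seed=51):
    """Shuffle a copy of items and return (train, test) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder file names standing in for the curated images
filepaths = [f"image_{i}.jpeg" for i in range(10)]
train, test = random_split(filepaths)
print(len(train), len(test))
```

Fixing the seed means that rerunning the split after relabeling keeps each image in the same partition.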

Inject, view, and curate a dataset in FiftyOne

If you haven’t already done so, install open-source FiftyOne using pip:

pip install fiftyone

A best practice is to do so within a new virtual (venv or conda) environment. Then import the relevant modules. Import the base library, fiftyone, the FiftyOne Brain, which has built-in ML methods, the FiftyOne Zoo, from which we will load a model that will generate zero-shot labels for us, and the ViewField, which lets us efficiently filter the data in our dataset:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

You also want to import the glob and os Python modules, which will help us work with paths and pattern match over directory contents:

from glob import glob
import os

Now we’re ready to load the dataset into FiftyOne. First, we create a dataset named fashion200k and make it persistent, which allows us to save the results of computationally intensive operations, so we only need to compute said quantities once.

dataset = fo.Dataset("fashion200k", persistent=True)

We can now iterate through all subcategory directories, adding all the images within the product directories. We add a FiftyOne classification label to each sample with the field name article_type, populated by the image’s top-level article category. We also add both category and subcategory information as tags:

# Map dir categories to article type labels
labels_map = {
    "dresses": "dress",
    "jackets": "jacket",
    "pants": "pants",
    "skirts": "skirt",
    "tops": "top",
}

dataset_dir = "./fashion200k"

for d in glob(os.path.join(dataset_dir, "women", "*", "*")):
    # The last two path components are the category and subcategory
    category, subcategory = os.path.normpath(d).split(os.sep)[-2:]
    subcategory = subcategory.replace("_", " ")
    label = labels_map[category]

    dataset.add_samples(
        [
            fo.Sample(
                filepath=filepath,
                tags=[category, subcategory],
                article_type=fo.Classification(label=label),
            )
            for filepath in glob(os.path.join(d, "*", "*"))
        ]
    )

At this point, we can visualize our dataset in the FiftyOne app by launching a session:

session = fo.launch_app(dataset)

We can also print out a summary of the dataset in Python by running print(dataset):

Name:        fashion200k
Media type:  image
Num samples: 338339
Persistent:  True
Tags:        []
Sample fields:
    id:            fiftyone.core.fields.ObjectIdField
    filepath:      fiftyone.core.fields.StringField
    tags:          fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    article_type:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)

We can also add the tags from the labels directory to the samples in our dataset:

tags = {
    f: set(t)
    for f, t in zip(*dataset.values(["filepath", "tags"]))
}

for label_file in glob(os.path.join(dataset_dir, "labels", "*")):
    with open(label_file, "r") as f:
        for line in f:
            line_list = line.split()
            if not line_list:
                continue

            # FiftyOne stores absolute filepaths, so build a matching key
            fp = os.path.abspath(os.path.join(dataset_dir, line_list[0]))

            # add new tags, skipping images that are not in the dataset
            if fp in tags:
                tags[fp].update(line_list[2:])

# Update tags (converted from sets to lists for storage)
dataset.set_values(
    "tags",
    {f: list(t) for f, t in tags.items()},
    key_field="filepath",
)

Looking at the data, a few things become clear:

  • Some of the images are fairly grainy, with low resolution. This is likely because these images were generated by cropping the original images to object detection bounding boxes.
  • Some clothes are worn by a person, and some are photographed on their own. These details are encapsulated by the viewpoint property.
  • A lot of the images of the same product are very similar, so at least initially, including more than one image per product may not add much predictive power. For the most part, the first image of each product (ending in _0.jpeg) is the cleanest.

Initially, we might want to train our clothing style classification model on a controlled subset of these images. To this end, we use high-resolution images of our products, and limit our view to one representative sample per product.

First, we filter out the low-resolution images. We use the compute_metadata() method to compute and store image width and height, in pixels, for each image in the dataset. We then employ the FiftyOne ViewField to filter out images based on the minimum allowed width and height values. See the following code:

dataset.compute_metadata()

min_width = 200
min_height = 300

width_filter = F("metadata.width") > min_width
height_filter = F("metadata.height") > min_height


high_res_view = dataset.match(
    width_filter & height_filter
)

session.view = high_res_view.view()

This high-resolution subset has just under 200,000 samples.

From this view, we can create a new view into our dataset containing only one representative sample (at most) for each product. We use the ViewField once again, pattern matching for file paths that end with _0.jpeg:

representative_view = high_res_view.match(
    F("filepath").ends_with("_0.jpeg")
)

Let’s view a randomly shuffled ordering of images in this subset:

session.view = representative_view.shuffle()

Remove redundant images in the dataset

This view contains 66,297 images, or just over 19% of the original dataset. When we look at the view, however, we see that there are many very similar products. Keeping all of these copies will likely only add cost to our labeling and model training, without noticeably improving performance. Instead, let’s get rid of the near duplicates to create a smaller dataset that still packs the same punch.

Because these images are not exact duplicates, we can’t check for pixel-wise equality. Fortunately, we can use the FiftyOne Brain to help us clean our dataset. In particular, we’ll compute an embedding for each image—a lower-dimensional vector representing the image—and then look for images whose embedding vectors are close to each other. The closer the vectors, the more similar the images.
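To make "close to each other" concrete, here is a minimal plain-Python sketch of cosine similarity and the corresponding cosine distance. The three-dimensional vectors are toy values chosen for illustration; real CLIP embeddings have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """0 when the vectors point the same way, larger as they diverge."""
    return 1.0 - cosine_similarity(u, v)


a = [1.0, 0.0, 1.0]
b = [2.0, 0.0, 2.0]  # same direction as a, different length
c = [0.0, 1.0, 0.0]  # orthogonal to a

print(cosine_distance(a, b), cosine_distance(a, c))
```

Because the measure depends only on direction, scaling an embedding does not change how similar it is to another.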

We use a CLIP model to generate a 512-dimensional embedding vector for each image, and store these embeddings in the field embedding on the samples in our dataset:

## load model
model = foz.load_zoo_model("clip-vit-base32-torch")

## compute embeddings
representative_view.compute_embeddings(
    model,
    embeddings_field="embedding",
)

Then we compute the closeness between embeddings using the cosine metric, and assert that any two vectors that are closer than some threshold are likely to be near duplicates. Cosine similarity lies in the range [-1, 1], and the corresponding cosine distance (1 minus the similarity) is 0 for vectors that point in the same direction and grows as they diverge. The thresh argument of find_duplicates() is interpreted as a distance, so a smaller value is stricter. Looking at the data, a threshold of thresh=0.5 seems to be about right. Again, this doesn't need to be perfect. A few near-duplicate images are not likely to ruin our predictive power, and throwing away a few non-duplicate images doesn't materially impact model performance.

results = fob.compute_similarity(
    representative_view,
    embeddings="embedding",
    brain_key="sim",
    metric="cosine",
)

results.find_duplicates(thresh=0.5)
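The idea behind threshold-based duplicate detection can be sketched with a simple greedy pass. This is an illustrative simplification and not the algorithm the FiftyOne Brain uses; the function name, the one-dimensional toy "embeddings," and the absolute-difference distance are all assumptions made for the example.

```python
def find_near_duplicates(ids, dist, thresh):
    """Keep an item unless it lies within thresh of an item already kept.

    Returns (kept_ids, neighbors), where neighbors maps each kept id
    to the ids that were dropped as its near duplicates.
    """
    kept = []
    neighbors = {}
    for i in ids:
        match = next((k for k in kept if dist(i, k) <= thresh), None)
        if match is None:
            kept.append(i)
        else:
            neighbors.setdefault(match, []).append(i)
    return kept, neighbors


# Toy 1-D "embeddings" keyed by sample id
points = {"a": 0.0, "b": 0.1, "c": 1.0, "d": 1.05, "e": 3.0}


def dist(i, j):
    return abs(points[i] - points[j])


kept, neighbors = find_near_duplicates(list(points), dist, thresh=0.5)
print(kept, neighbors)
```

Lowering the threshold keeps more samples, and raising it merges more of them into each group.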

We can view the purported duplicates to verify that they are indeed redundant:

## view the duplicates, paired up, 
## to make sure it is doing what we think it is doing
dup_view = results.duplicates_view()
session = fo.launch_app(dup_view)

When we’re happy with the result and believe these images are indeed near duplicates, we can pick one sample from each set of similar samples to keep, and ignore the others:

## get one image from each group of duplicates
dup_rep_ids = list(results.neighbors_map.keys())

# get ids of non-duplicates
non_dup_ids = representative_view.exclude(
    dup_view.values("id")
).values("id")

# ids to keep
ids = dup_rep_ids + non_dup_ids

# create view from ids
non_dup_view = representative_view.select(ids)

Now this view has 3,729 images. By cleaning the data and identifying a high-quality subset of the Fashion200K dataset, FiftyOne lets us restrict our focus from more than 300,000 images to just under 4,000, a reduction of nearly 99%. Using embeddings to remove near-duplicate images alone brought our total number of images under consideration down by more than 90%, with little if any effect on any models to be trained on this data.
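The percentages follow directly from the three image counts reported in this post, as this short calculation shows:

```python
total_images = 338_339          # all crawled images
representative_images = 66_297  # high-resolution, one per product
final_images = 3_729            # after near-duplicate removal

representative_share = representative_images / total_images
overall_reduction = 1 - final_images / total_images
dedup_reduction = 1 - final_images / representative_images

print(f"representative share: {representative_share:.1%}")
print(f"overall reduction:    {overall_reduction:.1%}")
print(f"dedup-only reduction: {dedup_reduction:.1%}")
```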

Before pre-labeling this subset, we can better understand the data by visualizing the embeddings we have already computed. We can use the FiftyOne Brain's built-in compute_visualization() method, which employs the uniform manifold approximation and projection (UMAP) technique to project the 512-dimensional embedding vectors into two-dimensional space so we can visualize them:

fob.compute_visualization(
    non_dup_view, 
    embeddings="embedding", 
    brain_key="vis"
)

When we open a new Embeddings panel in the FiftyOne app and color by article type, we can see that these embeddings roughly encode a notion of article type (among other things!).

Now we are ready to pre-label this data.

Inspecting these highly unique, high-resolution images, we can generate a decent initial list of styles to use as classes in our pre-labeling zero-shot classification. Our goal in pre-labeling these images is not to necessarily label each image correctly. Rather, our goal is to provide a good starting point for human annotators so we can reduce labeling time and cost.

styles = [
 "graphic", 
 "lettered", 
 "plain", 
 "striped", 
 "polka dot", 
 "floral", 
 "jersey", 
 "checkered", 
 "denim", 
 "plaid",
 "houndstooth",
 "chevron", 
 "paisley", 
 "animal print", 
 "quatrefoil",
 "camouflage",
]

We can then instantiate a zero-shot classification model for this application. We use a CLIP model, which is a general-purpose model trained on both images and natural language. We instantiate a CLIP model with the text prompt “Clothing in the style,” so that given an image, the model will output the class for which “Clothing in the style [class]” is the best fit. CLIP is not trained on retail or fashion-specific data, so this won’t be perfect, but it can save you in labeling and annotation costs.

zero_shot_model = foz.load_zoo_model(
 "clip-vit-base32-torch",
 text_prompt="Clothing in the style ",
 classes=styles,
)
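To illustrate what happens conceptually, the following sketch builds one prompt per class and turns image-to-prompt similarity scores into a prediction with a softmax. It is a simplification: the similarity scores are made-up numbers, whereas a real CLIP model derives them from image and text embeddings and scales them before the softmax. The helper names are hypothetical.

```python
import math


def build_prompts(text_prompt, classes):
    """One candidate caption per class, e.g. 'Clothing in the style striped'."""
    return [f"{text_prompt}{c}" for c in classes]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_predict(similarities, classes):
    """Pick the class whose prompt best matches the image."""
    probs = softmax(similarities)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best], probs[best]


classes = ["plain", "striped", "floral"]
prompts = build_prompts("Clothing in the style ", classes)

# Made-up similarity scores between one image and each prompt
label, confidence = zero_shot_predict([0.2, 1.4, 0.3], classes)
print(prompts, label, round(confidence, 3))
```

The confidence here is relative to the supplied class list, so an image that fits none of the styles still receives one of them as its label.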

We then apply this model to our reduced subset and store the results in an article_style field:

non_dup_view.apply_model(
    zero_shot_model,
    label_field="article_style",
)

Launching the FiftyOne App once again, we can visualize the images with these predicted style labels. We sort by prediction confidence so we view the most confident style predictions first:

high_conf_view = non_dup_view.sort_by(
 "article_style.confidence", reverse=True
)

session.view = high_conf_view

We can see that the highest confidence predictions seem to be for “jersey,” “animal print,” “polka dot,” and “lettered” styles. This makes sense, because these styles are relatively distinct. It also seems like, for the most part, the predicted style labels are accurate.

We can also look at the lowest-confidence style predictions:

low_conf_view = non_dup_view.sort_by(
    "article_style.confidence"
)
session.view = low_conf_view

For some of these images, the appropriate style category is in the provided list, and the article of clothing is incorrectly labeled. The first image in the grid, for instance, should clearly be “camouflage” and not “chevron.” In other cases, however, the products don’t fit neatly into the style categories. The dress in the second image in the second row, for example, is not exactly “striped,” but given the same labeling options, a human annotator might also have been conflicted. As we build out our dataset, we need to decide whether to remove edge cases like these, add new style categories, or augment the dataset.
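One way to act on these observations is to route only the uncertain predictions to closer review. The sketch below works on plain dictionaries rather than FiftyOne samples; the records, the field names, and the 0.5 cutoff are illustrative assumptions.

```python
from collections import Counter


def split_by_confidence(predictions, threshold):
    """Separate predictions into (confident, uncertain) lists."""
    confident = [p for p in predictions if p["confidence"] >= threshold]
    uncertain = [p for p in predictions if p["confidence"] < threshold]
    return confident, uncertain


# Made-up predictions standing in for the pre-labeled samples
predictions = [
    {"id": "img1", "label": "striped", "confidence": 0.92},
    {"id": "img2", "label": "chevron", "confidence": 0.18},
    {"id": "img3", "label": "floral", "confidence": 0.75},
    {"id": "img4", "label": "striped", "confidence": 0.31},
]

confident, uncertain = split_by_confidence(predictions, threshold=0.5)

# Styles that dominate the uncertain set may need new or clearer categories
uncertain_labels = Counter(p["label"] for p in uncertain)
print([p["id"] for p in uncertain], uncertain_labels)
```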

Export the final dataset from FiftyOne

Export the final dataset with the following code:

# The directory to which to write the exported dataset
export_dir = "200kFashionDatasetExportResult"

# The name of the sample field containing the label that you wish to export
# Used when exporting labeled datasets (e.g., classification or detection)
label_field = "article_style"  # for example

# The type of dataset to export
# Any subclass of `fiftyone.types.Dataset` is supported
dataset_type = fo.types.COCODetectionDataset  # for example

# Export the dataset
high_conf_view.export(
    export_dir=export_dir,
    dataset_type=dataset_type,
    label_field=label_field,
)

We can export a smaller dataset, for example, 16 images, to the folder 200kFashionDatasetExportResult-16Images. We create a Ground Truth adjustment job using it:

# The directory to which to write the exported dataset
export_dir = "200kFashionDatasetExportResult-16Images"

# The name of the sample field containing the label that you wish to export
# Used when exporting labeled datasets (e.g., classification or detection)
label_field = "article_style"  # for example

# The type of dataset to export
# Any subclass of `fiftyone.types.Dataset` is supported
dataset_type = fo.types.COCODetectionDataset  # for example

# Export the dataset
high_conf_view.take(16).export(
    export_dir=export_dir,
    dataset_type=dataset_type,
    label_field=label_field,
)

Upload the revised dataset, convert the label format to Ground Truth, upload to Amazon S3, and create a manifest file for the adjustment job

We can convert the labels in the dataset to match the output manifest schema of a Ground Truth bounding box job, and upload the images to an Amazon Simple Storage Service (Amazon S3) bucket to launch a Ground Truth adjustment job:

import json

import boto3

# open the labels.json file of ground truth bounding box
# labels from the exported dataset
with open('200kFashionDatasetExportResult-16Images/labels.json') as f:
    data = json.load(f)

# provide your aws s3 bucket name, prefix, and aws credentials
bucket_name = 'sagemaker-your-preferred-s3-bucket'
s3_prefix = 'sagemaker-your-preferred-s3-prefix'

# named boto_session so it doesn't overwrite the FiftyOne app session
boto_session = boto3.Session(
    aws_access_key_id='<AWS_ACCESS_KEY_ID>',
    aws_secret_access_key='<AWS_SECRET_ACCESS_KEY>'
)
s3 = boto_session.resource('s3')

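# Assumption: class IDs follow the order of the styles list defined earlier
gt_class_array = styles
gt_class_map = {str(i): name for i, name in enumerate(gt_class_array)}

# Look up COCO category names by ID
category_names = {c['id']: c['name'] for c in data['categories']}

for image in data['images']:
    file_name = image['file_name']
    file_id = file_name[:-4]
    image_id = image['id']

    # upload the image to s3
    s3.meta.client.upload_file('200kFashionDatasetExportResult-16Images/data/'+image['file_name'], bucket_name, s3_prefix+'/'+image['file_name'])

    gt_annotations = []
    confidence = 0.00

    for annotation in data['annotations']:
        if annotation['image_id'] == image['id']:
            confidence = annotation['score']
            style_category = category_names[annotation['category_id']]
            gt_annotation = {
                "class_id": gt_class_array.index(style_category), 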
                # convert the original ground_truth bounding box 
                #label to predicted style label
                "left": annotation['bbox'][0],
                "top": annotation['bbox'][1],
                "width": annotation['bbox'][2],
                "height": annotation['bbox'][3]
            }
            
            gt_annotations.append(gt_annotation)
            break
    
    gt_metadata_objects = []
    for gt_annotation in gt_annotations:
        gt_metadata_objects.append({
            "confidence": confidence
        })
    
    gt_label_attribute_metadata = {
        "class-map": gt_class_map,
        "objects": gt_metadata_objects,
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2023-02-19T00:23:25.339582",
        "job-name": "labeling-job/200k-fashion-origin"
    }
    
    gt_output = {
        "source-ref": f"s3://{bucket_name}/{s3_prefix}/{image['file_name']}",
        "200k-fashion-origin": {
            "image_size": [
                {
                    "width": image['width'],
                    "height": image['height'],
                    "depth": 3
                  }
      
            ],
            "annotations": gt_annotations
        },
        "200k-fashion-origin-metadata": gt_label_attribute_metadata
    }
    

    # write to the manifest file    
    with open('200k-fashion-output.manifest', 'a') as output_file:
        output_file.write(json.dumps(gt_output) + "\n")
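Before uploading, it can be worth checking that every line of the manifest parses as standalone JSON and carries the expected keys. The following is a minimal sketch of such a check. The function name is hypothetical, the demo writes a throwaway file instead of touching your real manifest, and the three required keys simply mirror the structure built above rather than the full Ground Truth schema.

```python
import json
import os
import tempfile


def validate_manifest(path, label_attribute):
    """Return a list of (line_number, problem) tuples; empty means no issues."""
    problems = []
    required = ("source-ref", label_attribute, f"{label_attribute}-metadata")
    with open(path) as fh:
        for n, line in enumerate(fh, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((n, f"invalid JSON: {e.msg}"))
                continue
            for key in required:
                if key not in record:
                    problems.append((n, f"missing key: {key}"))
    return problems


# Demo with a temporary two-line manifest: one complete record, one incomplete
good = {
    "source-ref": "s3://bucket/prefix/a.jpg",
    "200k-fashion-origin": {},
    "200k-fashion-origin-metadata": {},
}
bad = {"source-ref": "s3://bucket/prefix/b.jpg"}

with tempfile.NamedTemporaryFile("w", suffix=".manifest", delete=False) as tmp:
    tmp.write(json.dumps(good) + "\n")
    tmp.write(json.dumps(bad) + "\n")

problems = validate_manifest(tmp.name, "200k-fashion-origin")
os.remove(tmp.name)
print(problems)
```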

Upload the manifest file to Amazon S3 with the following code:

s3.meta.client.upload_file('200k-fashion-output.manifest', bucket_name, s3_prefix+'/200k-fashion-output.manifest')

Create corrected styled labels with Ground Truth

To annotate your data with style labels using Ground Truth, complete the necessary steps to start a bounding box labeling job by following the procedure outlined in the Getting Started with Ground Truth guide with the dataset in the same S3 bucket.

  1. On the SageMaker console, create a Ground Truth labeling job.
  2. Set the Input dataset location to be the manifest that we created in the preceding steps.
  3. Specify an S3 path for Output dataset location.
  4. For IAM Role, choose Enter a custom IAM role ARN, then enter the role ARN.
  5. For Task category, choose Image and select Bounding box.
  6. Choose Next.
  7. In the Workers section, choose the type of workforce you would like to use.
    You can select a workforce through Amazon Mechanical Turk, third-party vendors, or your own private workforce. For more details about your workforce options, see Create and Manage Workforces.
  8. Expand Existing-labels display options and select I want to display existing labels from the dataset for this job.
  9. For Label attribute name, choose the name from your manifest that corresponds to the labels that you want to display for adjustment.
    You will only see label attribute names for labels that match the task type you selected in the previous steps.
  10. Manually enter the labels for Bounding box labeling tool.
    The labels must contain the same labels used in the public dataset. You can add new labels. The following screenshot shows how you can choose the workers and configure the tool for your labeling job.
  11. Choose Preview to preview the image and original annotations.

We have now created a labeling job in Ground Truth. After our job is complete, we can load the newly generated labeled data into FiftyOne. Ground Truth produces output data in a Ground Truth output manifest. For more details on the output manifest file, see Bounding Box Job Output. The following code shows an example of this output manifest format:

{
    "source-ref": "s3://AWSDOC-EXAMPLE-BUCKET/example_image.png",
    "bounding-box-attribute-name":
    {
        "image_size": [{ "width": 500, "height": 400, "depth":3}],
        "annotations":
        [
            {"class_id": 0, "left": 111, "top": 134,
                    "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250,
                     "width": 30, "height": 30},
            {"class_id": 5, "left": 20, "top": 20,
                     "width": 30, "height": 30}
        ]
    },
    "bounding-box-attribute-name-metadata":
    {
        "objects":
        [
            {"confidence": 0.8},
            {"confidence": 0.9},
            {"confidence": 0.9}
        ],
        "class-map":
        {
            "0": "jersey",
            "5": "polka dot"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2018-10-18T22:18:13.527256",
        "job-name": "identify-fashion-set"
    },
    "adjusted-bounding-box":
    {
        "image_size": [{ "width": 500, "height": 400, "depth":3}],
        "annotations":
        [
            {"class_id": 0, "left": 110, "top": 135,
                    "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250,
                     "width": 30, "height": 30},
            {"class_id": 5, "left": 10, "top": 10,
                     "width": 30, "height": 30}
        ]
    },
    "adjusted-bounding-box-metadata":
    {
        "objects":
        [
            {"confidence": 0.8},
            {"confidence": 0.9},
            {"confidence": 0.9}
        ],
        "class-map":
        {
            "0": "jersey",
            "5": "polka dot"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2018-11-20T22:18:13.527256",
        "job-name": "adjust-identify-fashion-set",
        "adjustment-status": "adjusted"
    }
 }
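Given a manifest line in this format, a quick way to see what the annotators changed is to compare the original and adjusted annotations. The sketch below pairs annotations by position, which is an assumption: if workers add or delete boxes, the lists can differ in length and a more careful matching is needed. The record is a trimmed copy of the example above.

```python
def changed_annotations(record, original_attr, adjusted_attr):
    """Return the indices of annotations that differ after adjustment."""
    before = record[original_attr]["annotations"]
    after = record[adjusted_attr]["annotations"]
    # Assumption: the i-th adjusted annotation corresponds to the i-th original
    return [i for i, (b, a) in enumerate(zip(before, after)) if b != a]


record = {
    "bounding-box-attribute-name": {
        "annotations": [
            {"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250, "width": 30, "height": 30},
            {"class_id": 5, "left": 20, "top": 20, "width": 30, "height": 30},
        ]
    },
    "adjusted-bounding-box": {
        "annotations": [
            {"class_id": 0, "left": 110, "top": 135, "width": 61, "height": 128},
            {"class_id": 5, "left": 161, "top": 250, "width": 30, "height": 30},
            {"class_id": 5, "left": 10, "top": 10, "width": 30, "height": 30},
        ]
    },
}

changed = changed_annotations(
    record, "bounding-box-attribute-name", "adjusted-bounding-box"
)
print(changed)
```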

Review labeled results from Ground Truth in FiftyOne

After the job is complete, download the output manifest of the labeling job from Amazon S3.

Read the output manifest file:

with open('<path-to-your-output.manifest>', 'r') as fh:
    adjustment_manifest_lines = fh.readlines()

Create a FiftyOne dataset and convert the manifest lines to samples in the dataset:

def get_classification_labels(manifest_line, dataset, attr_name) -> fo.Classifications:
    label_attribute_data = manifest_line.get(attr_name)
    metadata = manifest_line.get(f"{attr_name}-metadata")
 
    annotations = label_attribute_data.get("annotations")
 
    image_data = label_attribute_data.get("image_size")[0]
    width = image_data.get("width")
    height = image_data.get("height")

    predictions = []
    for i, annotation in enumerate(annotations):
        label = metadata.get("class-map").get(str(annotation.get("class_id")))

        confidence = metadata.get("objects")[i].get("confidence")
        
        prediction = fo.Classification(label=label, confidence=confidence)

        predictions.append(prediction)

    return fo.Classifications(classifications=predictions)

def get_bounding_box_labels(manifest_line, dataset, attr_name) -> fo.Detections:
    label_attribute_data = manifest_line.get(attr_name)
    metadata = manifest_line.get(f"{attr_name}-metadata")
 
    annotations = label_attribute_data.get("annotations")
 
    image_data = label_attribute_data.get("image_size")[0]
    width = image_data.get("width")
    height = image_data.get("height")

    detections = []
    for i, annotation in enumerate(annotations):
        label = metadata.get("class-map").get(str(annotation.get("class_id")))

        confidence = metadata.get("objects")[i].get("confidence")

        # Bounding box coordinates should be relative values
        # in [0, 1] in the following format:
        # [top-left-x, top-left-y, width, height]
        bounding_box = [
            annotation.get("left") / width,
            annotation.get("top") / height,
            annotation.get("width") / width,
            annotation.get("height") / height,
        ]

        detection = fo.Detection(
            label=label, bounding_box=bounding_box, confidence=confidence
        )
        
        detections.append(detection)

    return fo.Detections(detections=detections)
    
def get_sample_from_manifest_line(manifest_line, dataset, attr_name):
    """
    For each line in manifest, transform annotations into Fiftyone format
    Args:
        line: manifest line
    Output:
        Fiftyone image sample
    """
    file_name = manifest_line.get("source-ref")[5:].split("/")[-1]
    file_loc = f'200kFashionDatasetExportResult-16Images/data/{file_name}'

    sample = fo.Sample(filepath=file_loc)

    sample['ground_truth'] = get_bounding_box_labels(
        manifest_line=manifest_line, dataset=dataset, attr_name=attr_name
    )
    sample["prediction"] = get_classification_labels(
        manifest_line=manifest_line, dataset=dataset, attr_name=attr_name
    )

    return sample

adjustment_dataset = fo.Dataset("adjustment-job-dataset")

samples = [
            get_sample_from_manifest_line(
                manifest_line=json.loads(manifest_line), dataset=adjustment_dataset, attr_name='smgt-fiftyone-style-adjustment-job'
            )
            for manifest_line in adjustment_manifest_lines
        ]

adjustment_dataset.add_samples(samples)

session = fo.launch_app(adjustment_dataset)

You can now see high-quality labeled data from Ground Truth in FiftyOne.
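The coordinate conversion used when loading the manifest is worth checking by hand, because Ground Truth reports boxes in pixels whereas the loading code above divides them by the image size to get relative values in [0, 1]. Here is a small self-contained sketch of the conversion and its inverse. The function names and the sample box are illustrative.

```python
def to_relative_bbox(annotation, image_width, image_height):
    """Pixel box {left, top, width, height} -> [x, y, w, h] in [0, 1]."""
    return [
        annotation["left"] / image_width,
        annotation["top"] / image_height,
        annotation["width"] / image_width,
        annotation["height"] / image_height,
    ]


def to_pixel_bbox(bounding_box, image_width, image_height):
    """Relative [x, y, w, h] -> pixel box, rounded to whole pixels."""
    x, y, w, h = bounding_box
    return {
        "left": round(x * image_width),
        "top": round(y * image_height),
        "width": round(w * image_width),
        "height": round(h * image_height),
    }


pixel_box = {"left": 100, "top": 100, "width": 50, "height": 200}
relative_box = to_relative_bbox(pixel_box, image_width=500, image_height=400)
round_trip = to_pixel_bbox(relative_box, image_width=500, image_height=400)
print(relative_box, round_trip)
```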

Conclusion

In this post, we showed how to build high-quality datasets by combining FiftyOne by Voxel51 and Ground Truth. FiftyOne is an open-source toolkit that allows you to manage, track, visualize, and curate your dataset. Ground Truth is a data labeling service that allows you to efficiently and accurately label the datasets required for training ML systems by providing access to multiple built-in task templates and to a diverse workforce through Mechanical Turk, third-party vendors, or your own private workforce.

We encourage you to try out this new functionality by installing a FiftyOne instance and using the Ground Truth console to get started. To learn more about Ground Truth, refer to Label Data, Amazon SageMaker Data Labeling FAQs, and the AWS Machine Learning Blog.

Connect with the Machine Learning & AI community if you have any questions or feedback!

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!


About the Authors

Shalendra Chhabra is currently Head of Product Management for Amazon SageMaker Human-in-the-Loop (HIL) Services. Previously, Shalendra incubated and led Language and Conversational Intelligence for Microsoft Teams Meetings, was EIR at Amazon Alexa Techstars Startup Accelerator, VP of Product and Marketing at Discuss.io, Head of Product and Marketing at Clipboard (acquired by Salesforce), and Lead Product Manager at Swype (acquired by Nuance). In total, Shalendra has helped build, ship, and market products that have touched more than a billion lives.

Jacob Marks is a Machine Learning Engineer and Developer Evangelist at Voxel51, where he helps bring transparency and clarity to the world’s data. Prior to joining Voxel51, Jacob founded a startup to help emerging musicians connect and share creative content with fans. Before that, he worked at Google X, Samsung Research, and Wolfram Research. In a past life, Jacob was a theoretical physicist, completing his PhD at Stanford, where he investigated quantum phases of matter. In his free time, Jacob enjoys climbing, running, and reading science fiction novels.

Jason Corso is co-founder and CEO of Voxel51, where he steers strategy to help bring transparency and clarity to the world’s data through state-of-the-art flexible software. He is also a Professor of Robotics, Electrical Engineering, and Computer Science at the University of Michigan, where he focuses on cutting-edge problems at the intersection of computer vision, natural language, and physical platforms. In his free time, Jason enjoys spending time with his family, reading, being in nature, playing board games, and all sorts of creative activities.

Brian Moore is co-founder and CTO of Voxel51, where he leads technical strategy and vision. He holds a PhD in Electrical Engineering from the University of Michigan, where his research was focused on efficient algorithms for large-scale machine learning problems, with a particular emphasis on computer vision applications. In his free time, he enjoys badminton, golf, hiking, and playing with his twin Yorkshire Terriers.

Zhuling Bai is a Software Development Engineer at Amazon Web Services. She works on developing large-scale distributed systems to solve machine learning problems.


Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker


The world of artificial intelligence (AI) and machine learning (ML) has been witnessing a paradigm shift with the rise of generative AI models that can create human-like text, images, code, and audio. Compared to classical ML models, generative AI models are significantly bigger and more complex. However, their increasing complexity also comes with high costs for inference and a growing need for powerful compute resources. The high cost of inference for generative AI models can be a barrier to entry for businesses and researchers with limited resources, necessitating the need for more efficient and cost-effective solutions. Furthermore, the majority of generative AI use cases involve human interaction or real-world scenarios, necessitating hardware that can deliver low-latency performance. AWS has been innovating with purpose-built chips to address the growing need for powerful, efficient, and cost-effective compute hardware.

Today, we are excited to announce that Amazon SageMaker supports AWS Inferentia2 (ml.inf2) and AWS Trainium (ml.trn1) based SageMaker instances to host generative AI models for real-time and asynchronous inference. ml.inf2 instances are available for model deployment on SageMaker in US East (Ohio) and ml.trn1 instances in US East (N. Virginia).

You can use these instances on SageMaker to achieve high performance at a low cost for generative AI models, including large language models (LLMs), Stable Diffusion, and vision transformers. In addition, you can use Amazon SageMaker Inference Recommender to help you run load tests and evaluate the price-performance benefits of deploying your model on these instances.

You can use ml.inf2 and ml.trn1 instances to run your ML applications on SageMaker for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more. You can easily get started by specifying ml.trn1 or ml.inf2 instances when configuring your SageMaker endpoint. You can use ml.trn1 and ml.inf2 compatible AWS Deep Learning Containers (DLCs) for PyTorch, TensorFlow, Hugging Face, and large model inference (LMI) to easily get started. For the full list with versions, see Available Deep Learning Containers Images.

In this post, we show the process of deploying a large language model on AWS Inferentia2 using SageMaker, without requiring any extra coding, by taking advantage of the LMI container. We use GPT4All-J, a fine-tuned GPT-J 6B model that provides chatbot-style interaction.

Overview of ml.trn1 and ml.inf2 instances

ml.trn1 instances are powered by the Trainium accelerator, which is purpose built mainly for high-performance deep learning training of generative AI models, including LLMs. However, these instances also support inference workloads for models that are even larger than what fits into Inf2. The largest instance size, trn1.32xlarge, features 16 Trainium accelerators with 512 GB of accelerator memory in a single instance, delivering up to 3.4 petaflops of FP16/BF16 compute power. The 16 Trainium accelerators are connected with ultra-high-speed NeuronLinkv2 for streamlined collective communications.

ml.inf2 instances are powered by the AWS Inferentia2 accelerator, a purpose-built accelerator for inference. It delivers three times higher compute performance, up to four times higher throughput, and up to 10 times lower latency compared to first-generation AWS Inferentia. The largest instance size, inf2.48xlarge, features 12 AWS Inferentia2 accelerators with 384 GB of accelerator memory in a single instance for a combined compute power of 2.3 petaflops for BF16/FP16. It enables you to deploy up to a 175-billion-parameter model in a single instance. For ultra-large models that don’t fit into a single accelerator, data flows directly between accelerators with NeuronLink, bypassing the CPU completely. Inf2 is the only inference-optimized instance to offer this interconnect, a feature that is otherwise only available in more expensive training instances. With NeuronLink, Inf2 supports faster distributed inference and improves throughput and latency.

Both AWS Inferentia2 and Trainium accelerators have two NeuronCores-v2, 32 GB HBM memory stacks, and dedicated collective-compute engines, which automatically optimize runtime by overlapping computation and communication when doing multi-accelerator inference. For more details on the architecture, refer to Trainium and Inferentia devices.

The following diagram shows an example architecture using AWS Inferentia2.

AWS Neuron SDK

AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium based instances. AWS Neuron includes a deep learning compiler, runtime, and tools that are natively integrated into TensorFlow and PyTorch. With Neuron, you can develop, profile, and deploy high-performance ML workloads on ml.trn1 and ml.inf2.

The Neuron Compiler accepts ML models in various formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. The Neuron compiler is invoked within the ML framework, where ML models are sent to the compiler by the Neuron framework plugin. The resulting compiler artifact is called a NEFF file (Neuron Executable File Format) that in turn is loaded by the Neuron runtime to the Neuron device.

The Neuron runtime consists of kernel driver and C/C++ libraries, which provide APIs to access AWS Inferentia and Trainium Neuron devices. The Neuron ML frameworks plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on the NeuronCores. The Neuron runtime loads compiled deep learning models (NEFF) to the Neuron devices and is optimized for high throughput and low latency.

Host NLP models using SageMaker ml.inf2 instances

Before we dive deep into serving LLMs with transformers-neuronx, an open-source library that shards the model’s large weight matrices onto multiple NeuronCores, let’s briefly go through the typical deployment flow for a model that fits onto a single NeuronCore.

Check the list of supported models to ensure the model is supported on AWS Inferentia2. Next, the model needs to be pre-compiled by the Neuron Compiler. You can use a SageMaker notebook or an Amazon Elastic Compute Cloud (Amazon EC2) instance to compile the model. You can use the SageMaker Python SDK to deploy models using popular deep learning frameworks such as PyTorch, as shown in the following code. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support auto scaling.

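from sagemaker.pytorch.model import PyTorchModel

# s3_model_uri, role, and ecr_image are placeholders for your compiled model
# artifact in Amazon S3, your SageMaker execution role, and the container image URI.
pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    source_dir="code",
    entry_point="inference.py",
    image_uri=ecr_image,
)

# Deploy the model to a real-time endpoint on an Inferentia2 instance.
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
)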

Refer to Developer Flows for more details on typical development flows of Inf2 on SageMaker with sample scripts.

Host LLMs using SageMaker ml.inf2 instances

Large language models with billions of parameters are often too big to fit on a single accelerator. This necessitates the use of model parallel techniques for hosting LLMs across multiple accelerators. Another crucial requirement for hosting LLMs is the implementation of a high-performance model-serving solution. This solution should efficiently load the model, manage partitioning, and seamlessly serve requests via HTTP endpoints.

SageMaker includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference. For resources to get started with LMI on SageMaker, refer to Model parallelism and large model inference. SageMaker maintains DLCs with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. These specialized DLCs are referred to as SageMaker LMI containers.

SageMaker LMI containers use DJLServing, a model server that is integrated with the transformers-neuronx library to support tensor parallelism across NeuronCores. To learn more about how DJLServing works, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. The DJL model server and transformers-neuronx library serve as core components of the container, which also includes the Neuron SDK. This setup facilitates the loading of models onto AWS Inferentia2 accelerators, parallelizes the model across multiple NeuronCores, and enables serving via HTTP endpoints.

The LMI container supports loading models from an Amazon Simple Storage Service (Amazon S3) bucket or Hugging Face Hub. The default handler script loads the model, compiles and converts it into a Neuron-optimized format, and loads it. To use the LMI container to host LLMs, we have two options:

  • No-code (preferred) – This is the easiest way to deploy an LLM using an LMI container. In this method, you use the provided default handler and just pass the model name and the required parameters in the serving.properties file to load and host the model. To use the default handler, we set the entryPoint parameter to djl_python.transformers-neuronx.
  • Bring your own script – In this approach, you have the option to create your own model.py file, which contains the code necessary for loading and serving the model. This file acts as an intermediary between the DJLServing APIs and the transformers-neuronx APIs. To customize the model loading process, you can provide serving.properties with configurable parameters. For a comprehensive list of available configurable parameters, refer to All DJL configuration options. Here is an example of a model.py file.

Runtime architecture

The tensor_parallel_degree property value determines the distribution of tensor parallel modules across multiple NeuronCores. For instance, inf2.24xlarge has six AWS Inferentia2 accelerators. Each AWS Inferentia2 accelerator has two NeuronCores, for 12 NeuronCores in total. Each NeuronCore has a dedicated high bandwidth memory (HBM) of 16 GB storing tensor parallel modules. With a tensor parallel degree of 4, the LMI container allocates three copies of the same model, each utilizing four NeuronCores. As shown in the following diagram, when the LMI container starts, the model is first loaded and traced in CPU-addressable memory. When the tracing is complete, the model is partitioned across the NeuronCores based on the tensor parallel degree.

LMI uses DJLServing as its model serving stack. After the container’s health check passes in SageMaker, the container is ready to serve inference requests. DJLServing launches a number of Python processes equal to the total number of NeuronCores divided by TENSOR_PARALLEL_DEGREE. Each Python process contains a number of C++ threads equal to TENSOR_PARALLEL_DEGREE. Each C++ thread holds one shard of the model on one NeuronCore.
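The relationship between NeuronCores, the tensor parallel degree, and the processes launched can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the LMI container: the function name lmi_process_layout is hypothetical, and the default of two NeuronCores per accelerator is taken from the description above.

```python
def lmi_process_layout(num_accelerators, tensor_parallel_degree, cores_per_accelerator=2):
    """Estimate how many model copies (Python processes) fit on an instance.

    Illustrative arithmetic only; the helper name is hypothetical and the
    container itself decides the actual layout at startup.
    """
    total_cores = num_accelerators * cores_per_accelerator
    if not 1 <= tensor_parallel_degree <= total_cores:
        raise ValueError("tensor_parallel_degree must be between 1 and the total NeuronCore count")
    return {
        "total_neuron_cores": total_cores,
        # One Python process per model copy.
        "python_processes": total_cores // tensor_parallel_degree,
        # One C++ thread (and one model shard) per NeuronCore in the group.
        "threads_per_process": tensor_parallel_degree,
    }


# inf2.24xlarge: six accelerators, tensor parallel degree of 4
print(lmi_process_layout(6, 4))
```

For the inf2.24xlarge example, this reports 12 NeuronCores, three Python processes, and four threads per process, matching the three model copies described above.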

Many practitioners tend to run inference sequentially in a single Python process when the server is invoked with multiple independent requests. Although this is easier to set up, it usually doesn’t make the best use of the accelerator’s compute power. To address this, DJLServing offers built-in dynamic batching, which combines these independent inference requests on the server side into a larger batch to increase throughput. All requests reach the dynamic batcher first before entering the actual job queues to wait for inference. You can set your preferred batch size for dynamic batching using the batch_size setting in serving.properties. You can also configure max_batch_delay to specify the maximum time the batcher waits for other requests to join the batch, based on your latency requirements. The throughput also depends on the number of model copies and the Python process groups launched in the container. As shown in the following diagram, with the tensor parallel degree set to 4, the LMI container launches three Python process groups, each holding a full copy of the model. This allows you to increase the batch size and get higher throughput.
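To make the interplay between batch_size and max_batch_delay more concrete, the following is a simplified simulation of dynamic batching. It is a sketch under stated assumptions, not DJLServing’s implementation: the function dynamic_batches is hypothetical, arrival times are assumed to be sorted and given in milliseconds, and a batch is assumed to close either when it is full or when the next request arrives after the delay window measured from the first request in the batch.

```python
def dynamic_batches(arrival_ms, batch_size, max_batch_delay_ms):
    """Group request indices into batches (simplified model of a dynamic batcher).

    arrival_ms must be sorted in ascending order. This is an illustration,
    not the actual DJLServing batching logic.
    """
    batches = []
    current = []
    window_start = None
    for idx, t in enumerate(arrival_ms):
        # Close the open batch if it is full or its delay window has expired.
        if current and (len(current) == batch_size or t - window_start > max_batch_delay_ms):
            batches.append(current)
            current = []
        if not current:
            window_start = t  # the first request of a batch opens the window
        current.append(idx)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 10, 20, 500, 510]
print(dynamic_batches(arrivals, batch_size=4, max_batch_delay_ms=100))
print(dynamic_batches(arrivals, batch_size=2, max_batch_delay_ms=100))
```

In this toy model, a larger batch size lets the three closely spaced requests share one batch, whereas a batch size of 2 splits them. The delay window keeps the late requests from holding up the early ones.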

SageMaker notebook for deploying LLMs

In this section, we provide a step-by-step walkthrough of deploying GPT4All-J, a 6-billion-parameter model that is 24 GB in FP32. GPT4All-J is a popular chatbot that has been trained on a vast variety of interaction content like word problems, dialogs, code, poems, songs, and stories. GPT4All-J is a fine-tuned GPT-J model that generates responses similar to human interactions.
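The 24 GB figure follows from the parameter count and the number of bytes per parameter. The following minimal sketch reproduces that arithmetic and shows the per-NeuronCore share of the weights when they are sharded evenly. The helper names are hypothetical, the estimate covers weights only (no activations or key-value cache), and GB is taken as 10^9 bytes.

```python
# Bytes needed to store one parameter for a few common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_size_gb(num_params, dtype):
    """Approximate size of the model weights in GB (10**9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def weights_per_core_gb(num_params, dtype, tensor_parallel_degree):
    """Weights held by each NeuronCore, assuming an even split across the group."""
    return weight_size_gb(num_params, dtype) / tensor_parallel_degree


params = 6_000_000_000  # 6-billion-parameter model, as stated above
print(weight_size_gb(params, "fp32"))          # full model in FP32
print(weight_size_gb(params, "fp16"))          # full model in FP16
print(weights_per_core_gb(params, "fp32", 2))  # share per core with a tensor parallel degree of 2
```

Under these assumptions, a tensor parallel degree of 2 places about 12 GB of FP32 weights on each NeuronCore, which is below the 16 GB of HBM per NeuronCore mentioned earlier. Real memory use is higher because of runtime buffers.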

The complete notebook for this example is provided on GitHub. We can use the SageMaker Python SDK to deploy the model to an Inf2 instance. We use the provided default handler to load the model. With this, we just need to provide a serving.properties file. This file has the required configurations for the DJL model server to download and host the model. We can specify the name of the Hugging Face model using the model_id parameter to download the model directly from the Hugging Face repo. Alternatively, you can download the model from Amazon S3 by providing the s3url parameter. The entryPoint parameter is configured to point to the library that loads the model. For more details on djl_python.transformers-neuronx, refer to the GitHub code.

The tensor_parallel_degree property value determines the distribution of tensor parallel modules across multiple devices. For instance, with 12 NeuronCores and a tensor parallel degree of 4, LMI will allocate three model copies, each utilizing four NeuronCores. You can also define the precision type using the dtype property. The n_positions parameter defines the sum of the maximum input and output sequence lengths for the model. See the following code:

%%writefile serving.properties
# Start writing content here
engine=Python
option.entryPoint=djl_python.transformers-neuronx
#option.model_id=nomic-ai/gpt4all-j
option.s3url = {{s3url}}
option.tensor_parallel_degree=2
option.model_loading_timeout=2400
option.n_positions=512

Construct the tarball containing serving.properties and upload it to an S3 bucket. Although the default handler is used in this example, you can develop a model.py file for customizing the loading and serving process. If there are any packages that need installation, include them in the requirements.txt file. See the following code:

%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")
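If you prefer to stay in Python rather than use a shell cell, the packaging step can be sketched with the standard library tarfile module. This is an alternative under stated assumptions, not the notebook’s code: the helper names build_model_tarball and list_tarball are hypothetical, and the archive layout (serving.properties under a mymodel/ directory) mirrors the shell commands above.

```python
import tarfile
from pathlib import Path


def build_model_tarball(properties_file, tarball="mymodel.tar.gz", top_dir="mymodel"):
    """Package serving.properties under top_dir/ in a gzip-compressed tarball.

    Hypothetical helper that mirrors the mkdir/mv/tar shell steps.
    """
    properties_file = Path(properties_file)
    with tarfile.open(str(tarball), "w:gz") as tar:
        # arcname controls the path stored inside the archive.
        tar.add(str(properties_file), arcname=f"{top_dir}/serving.properties")
    return str(tarball)


def list_tarball(tarball):
    """Return the member names stored in the tarball, for a quick sanity check."""
    with tarfile.open(str(tarball), "r:gz") as tar:
        return tar.getnames()
```

For example, build_model_tarball("serving.properties") writes mymodel.tar.gz in the current directory, and list_tarball("mymodel.tar.gz") lets you confirm the layout before uploading. Unlike the shell version, this sketch leaves the original serving.properties file in place.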

Retrieve the DJL container image and create the SageMaker model:

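from sagemaker import Model, image_uris

# Retrieve the DJL container image
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.21.0",
)
# Replace the image tag with the Neuron build of the LMI container
image_uri = image_uri.split(":")[0] + ":" + "0.22.1-neuronx-sdk2.9.0"

# env (container environment variables) and role are assumed to be defined earlier in the notebook
model = Model(image_uri=image_uri, model_data=code_artifact, env=env, role=role)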

Next, we create the SageMaker endpoint with the model configuration defined earlier. The container downloads the model into the /tmp space because SageMaker maps /tmp to Amazon Elastic Block Store (Amazon EBS). We need to add a volume_size parameter to ensure the /tmp directory has enough space to download and compile the model. We set container_startup_health_check_timeout to 3,600 seconds to ensure the health check starts after the model is ready. We use the ml.inf2.8xlarge instance. See the following code:

instance_type = "ml.inf2.8xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")


model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             container_startup_health_check_timeout=3600,
             volume_size=256
            )

After the SageMaker endpoint has been created, we can make real-time predictions against SageMaker endpoints using the Predictor object:

from sagemaker import serializers, deserializers

# Requests and responses are in JSON format, so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

predictor.predict(
    {"inputs": "write a blog on New York", "parameters": {}}
)

Clean up

Delete the endpoints to save costs after you finish your tests:

# Delete the endpoint, the endpoint configuration, and the model
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()

Conclusion

In this post, we showcased the newly launched capability of SageMaker, which now supports ml.inf2 and ml.trn1 instances for hosting generative AI models. We demonstrated how to deploy GPT4All-J, a generative AI model, on AWS Inferentia2 using SageMaker and the LMI container, without writing any code. We also showcased how you can use DJLServing and transformers-neuronx to load a model, partition it, and serve it.

Inf2 instances provide the most cost-effective way to run generative AI models on AWS. For performance details, refer to Inf2 Performance.

Check out the GitHub repo for an example notebook. Try it out and let us know if you have any questions!


About the Authors

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with Machine Learning Startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML Inference and low-code ML. He has worked on projects in different domains, including Natural Language Processing and Computer Vision.

Hiroshi Tokoyo is a Solutions Architect at AWS Annapurna Labs. Based in Japan, he joined Annapurna Labs even before the acquisition by AWS and has consistently helped customers with Annapurna Labs technology. His recent focus is on Machine Learning solutions based on purpose-built silicon, AWS Inferentia and Trainium.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Qing Lan is a Software Development Engineer at AWS. He has worked on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.

Varun Syal is a Software Development Engineer with AWS SageMaker working on critical customer-facing features for the ML Inference platform. He is passionate about working in the Distributed Systems and AI space. In his spare time, he likes reading and gardening.


MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks


Vision-language foundational models are built on the premise of a single pre-training followed by subsequent adaptation to multiple downstream tasks. Two main and disjoint training scenarios are popular: CLIP-style contrastive learning and next-token prediction. Contrastive learning trains the model to predict if image-text pairs correctly match, effectively building visual and text representations for the corresponding image and text inputs, whereas next-token prediction predicts the most likely next text token in a sequence, thus learning to generate text, according to the required task. Contrastive learning enables image-text and text-image retrieval tasks, such as finding the image that best matches a certain description, and next-token learning enables text-generative tasks, such as Image Captioning and Visual Question Answering (VQA). While both approaches have demonstrated powerful results, when a model is pre-trained contrastively, it typically does not fare well on text-generative tasks and vice versa. Furthermore, adaptation to other tasks is often done with complex or inefficient methods. For example, in order to extend a vision-language model to videos, some models need to do inference for each video frame separately. This limits the size of the videos that can be processed to only a few frames and does not fully take advantage of motion information available across frames.

Motivated by this, we present “A Simple Architecture for Joint Learning for MultiModal Tasks”, called MaMMUT, which is able to train jointly for these competing objectives and which provides a foundation for many vision-language tasks either directly or via simple adaptation. MaMMUT is a compact, 2B-parameter multimodal model that trains across contrastive, text generative, and localization-aware objectives. It consists of a single image encoder and a text decoder, which allows for a direct reuse of both components. Furthermore, a straightforward adaptation to video-text tasks requires only using the image encoder once and can handle many more frames than prior work. In line with recent language models (e.g., PaLM, GLaM, GPT3), our architecture uses a decoder-only text model and can be thought of as a simple extension of language models. While modest in size, our model outperforms the state of the art or achieves competitive performance on image-text and text-image retrieval, video question answering (VideoQA), video captioning, open-vocabulary detection, and VQA.

The MaMMUT model enables a wide range of tasks such as image-text/text-image retrieval (top left and top right), VQA (middle left), open-vocabulary detection (middle right), and VideoQA (bottom).

Decoder-only model architecture

One surprising finding is that a single language-decoder is sufficient for all these tasks, which obviates the need for both complex constructs and training procedures presented before. For example, our model (presented to the left in the figure below) consists of a single visual encoder and single text-decoder, connected via cross attention, and trains simultaneously on both contrastive and text-generative types of losses. Comparatively, prior work is either not able to handle image-text retrieval tasks, or applies only some losses to only some parts of the model. To enable multimodal tasks and fully take advantage of the decoder-only model, we need to jointly train both contrastive losses and text-generative captioning-like losses.

MaMMUT architecture (left) is a simple construct consisting of a single vision encoder and a single text decoder. Compared to other popular vision-language models — e.g., PaLI (middle) and ALBEF, CoCa (right) — it trains jointly and efficiently for multiple vision-language tasks, with both contrastive and text-generative losses, fully sharing the weights between the tasks.

Decoder two-pass learning

Decoder-only models for language learning show clear advantages in performance with smaller model size (almost half the parameters). The main challenge for applying them to multimodal settings is to unify the contrastive learning (which uses unconditional sequence-level representation) with captioning (which optimizes the likelihood of a token conditioned on the previous tokens). We propose a two-pass approach to jointly learn these two conflicting types of text representations within the decoder. During the first pass, we utilize cross attention and causal masking to learn the caption generation task — the text features can attend to the image features and predict the tokens in sequence. On the second pass, we disable the cross-attention and causal masking to learn the contrastive task. The text features will not see the image features but can attend bidirectionally to all text tokens at once to produce the final text-based representation. Completing this two-pass approach within the same decoder allows for accommodating both types of tasks that were previously hard to reconcile. While simple, we show that this model architecture is able to provide a foundation for multiple multimodal tasks.

MaMMUT decoder-only two-pass learning enables both contrastive and generative learning paths by the same model.
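The difference between the two decoder passes comes down to which attention paths are enabled. The following is an illustrative sketch only, not the authors’ implementation: the function names are hypothetical, and masks are represented as plain nested lists in which 1 means a text token (row) may attend to another text token (column).

```python
def causal_mask(n):
    """Lower-triangular mask: token i attends only to tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full mask: every token attends to every other token."""
    return [[1] * n for _ in range(n)]


def two_pass_settings(num_text_tokens):
    """Attention settings for the two passes through the same text decoder."""
    return {
        # First pass (captioning): text attends to image features via
        # cross attention and to earlier text tokens only.
        "generative": {
            "cross_attention": True,
            "text_mask": causal_mask(num_text_tokens),
        },
        # Second pass (contrastive): cross attention is disabled and the
        # text tokens attend to each other bidirectionally.
        "contrastive": {
            "cross_attention": False,
            "text_mask": bidirectional_mask(num_text_tokens),
        },
    }


settings = two_pass_settings(3)
for name, cfg in settings.items():
    print(name, cfg["cross_attention"], cfg["text_mask"])
```

Because only the mask and the cross-attention switch change between passes, the same decoder weights can serve both objectives, which is the point of the two-pass design described above.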

Another advantage of our architecture is that, since it is trained for these disjoint tasks, it can be seamlessly applied to multiple applications such as image-text and text-image retrieval, VQA, and captioning.

Moreover, MaMMUT easily adapts to video-language tasks. Previous approaches used a vision encoder to process each frame individually, which required applying it multiple times. This is slow and restricts the number of frames the model can handle, typically to only 6–8. With MaMMUT, we use sparse video tubes for lightweight adaptation directly via the spatio-temporal information from the video. Furthermore, adapting the model to Open-Vocabulary Detection is done by simply training to detect bounding-boxes via an object-detection head.

Adaptation of the MaMMUT architecture to video tasks (left) is simple and fully reuses the model. This is done by generating a video “tubes” feature representation, similar to image patches, that are projected to lower dimensional tokens and run through the vision encoder. Unlike prior approaches (right) that need to run multiple individual images through the vision encoder, we use it only once.

Results

Our model achieves excellent zero-shot results on image-text and text-image retrieval without any adaptation, outperforming all previous state-of-the-art models. The results on VQA are competitive with state-of-the-art results, which are achieved by much larger models. The PaLI model (17B parameters) and the Flamingo model (80B) have the best performance on the VQA2.0 dataset, but MaMMUT (2B) has the same accuracy as the 15B PaLI.

MaMMUT outperforms the state of the art (SOTA) on Zero-Shot Image-Text (I2T) and Text-Image (T2I) retrieval on both MS-COCO (top) and Flickr (bottom) benchmarks.

Performance on the VQA2.0 dataset is competitive but does not outperform large models such as Flamingo-80B and PaLI-17B. Performance is evaluated in the more challenging open-ended text generation setting.

MaMMUT also outperforms the state-of-the-art on VideoQA, as shown below on the MSRVTT-QA and MSVD-QA datasets. Note that we outperform much bigger models such as Flamingo, which is specifically designed for image+video pre-training and is pre-trained with both image-text and video-text data.

MaMMUT outperforms the SOTA models on VideoQA tasks (MSRVTT-QA dataset, top; MSVD-QA dataset, bottom), including much larger models such as the 5B GIT2 or Flamingo, which uses 80B parameters and is pre-trained for both image-language and video-language tasks.

Our results outperform the state-of-the-art on open-vocabulary detection fine-tuning as is also shown below.

MaMMUT open-vocabulary detection results on the LVIS dataset compared to state-of-the-art methods. We report the average precision for rare classes (APr), as previously adopted in the literature.

Key ingredients

We show that joint training of both contrastive and text-generative objectives is not an easy task, and in our ablations we find that these tasks are served better by different design choices. We see that fewer cross-attention connections are better for retrieval tasks, but more are preferred by VQA tasks. Yet, while this shows that our model’s design choices might be suboptimal for individual tasks, our model is more effective than more complex, or larger, models.

Ablation studies showing that fewer cross-attention connections (1-2) are better for retrieval tasks (top), whereas more connections favor text-generative tasks such as VQA (bottom).

Conclusion

We presented MaMMUT, a simple and compact vision-encoder language-decoder model that jointly trains a number of conflicting objectives to reconcile contrastive-like and text-generative tasks. Our model also serves as a foundation for many more vision-language tasks, achieving state-of-the-art or competitive performance on image-text and text-image retrieval, VideoQA, video captioning, open-vocabulary detection, and VQA. We hope it can be further used for more multimodal applications.

Acknowledgements

The work described is co-authored by: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, and Anelia Angelova. We would like to thank Mojtaba Seyedhosseini, Vijay Vasudevan, Priya Goyal, Jiahui Yu, Zirui Wang, Yonghui Wu, Runze Li, Jie Mei, Radu Soricut, Qingqing Huang, Andy Ly, Nan Du, Yuxin Wu, Tom Duerig, Paul Natsev, and Zoubin Ghahramani for their help and support.


Automate the deployment of an Amazon Forecast time-series forecasting model


Time series forecasting refers to the process of predicting future values of time series data (data that is collected at regular intervals over time). Simple methods for time series forecasting use historical values of the same variable whose future values need to be predicted, whereas more complex, machine learning (ML)-based methods use additional information, such as the time series data of related variables.

Amazon Forecast is an ML-based time series forecasting service that includes algorithms that are based on over 20 years of forecasting experience used by Amazon.com, bringing the same technology used at Amazon to developers as a fully managed service, removing the need to manage resources. Forecast uses ML to learn not only the best algorithm for each item, but also the best ensemble of algorithms for each item, automatically creating the best model for your data.

This post describes how to deploy recurring Forecast workloads (time series forecasting workloads) with no code using AWS CloudFormation, AWS Step Functions, and AWS Systems Manager. The method presented here helps you build a pipeline that allows you to use the same workflow starting from the first day of your time series forecasting experimentation through the deployment of the model into production.

Time series forecasting using Forecast

The workflow for Forecast involves the following common concepts:

  • Importing datasets – In Forecast, a dataset group is a collection of datasets, schema, and forecast results that go together. Each dataset group can have up to three datasets, one of each dataset type: target time series (TTS), related time series (RTS), and item metadata. A dataset is a collection of files that contain data that is relevant for a forecasting task. A dataset must conform to the schema defined within Forecast. For more details, refer to Importing Datasets.
  • Training predictors – A predictor is a Forecast-trained model used for making forecasts based on time series data. During training, Forecast calculates accuracy metrics that you use to evaluate the predictor and decide whether to use the predictor to generate a forecast. For more information, refer to Training Predictors.
  • Generating forecasts – You can then use the trained model to generate forecasts for a future time horizon, known as the forecast horizon. Forecast provides forecasts at various specified quantiles. For example, a forecast at the 0.90 quantile estimates a value that the observed value is expected to fall below 90% of the time. By default, Forecast uses the following values for the predictor forecast types: 0.1 (P10), 0.5 (P50), and 0.9 (P90). Forecasts at various quantiles are typically used to provide a prediction interval (an upper and lower bound for forecasts) to account for forecast uncertainty.
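To make the quantile interpretation concrete, the following minimal sketch computes empirical P10, P50, and P90 values with Python's standard library. It runs on synthetic demand values (made-up data, not Forecast output), and the helper names are hypothetical. About 90% of the samples fall at or below the P90 value, and the P10 to P90 range acts as a prediction interval.

```python
import random
import statistics


def empirical_quantiles(samples, quantiles=(0.1, 0.5, 0.9)):
    """Return the empirical value at each requested quantile."""
    # quantiles(n=100) yields the 99 percentile cut points P1..P99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {q: cuts[round(q * 100) - 1] for q in quantiles}


def share_at_or_below(samples, threshold):
    """Fraction of samples that do not exceed the threshold."""
    return sum(x <= threshold for x in samples) / len(samples)


random.seed(42)  # fixed seed so the sketch is repeatable
demand = [random.gauss(100, 15) for _ in range(10_000)]  # synthetic data

levels = empirical_quantiles(demand)
for q, value in levels.items():
    print(f"P{round(q * 100)}: {value:.1f} "
          f"({share_at_or_below(demand, value):.0%} of samples at or below)")
```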

You can implement this workflow in Forecast from the AWS Management Console, from the AWS Command Line Interface (AWS CLI), via API calls using Python notebooks, or via automation solutions. The console and AWS CLI methods are best suited for quick experimentation to check the feasibility of time series forecasting using your data. The Python notebook method is great for data scientists already familiar with Jupyter notebooks and coding, and provides maximum control and tuning. However, the notebook-based method is difficult to operationalize. Our automation approach facilitates rapid experimentation, eliminates repetitive tasks, and allows easier transition between various environments (development, staging, production).

In this post, we describe an automation approach to using Forecast that allows you to use your own data and provides a single workflow that you can use seamlessly throughout the lifecycle of the development of your forecasting solution, from the first days of experimentation through the deployment of the solution in your production environment.

Solution overview

In the following sections, we describe a complete end-to-end workflow that serves as a template to follow for automated deployment of time series forecasting models using Forecast. This workflow creates forecasted data points from an open-source input dataset; however, you can use the same workflow for your own data, as long as you can format your data according to the steps outlined in this post. After you upload the data, we walk you through the steps to create Forecast dataset groups, import data, train ML models, and produce forecasted data points on future unseen time horizons from raw data. All of this is possible without having to write or compile code.

The following diagram illustrates the forecasting workflow.

Cyclical forecasting workflow

The solution is deployed using two CloudFormation templates: the dependencies template and the workload template. CloudFormation enables you to perform AWS infrastructure deployments predictably and repeatedly by using templates describing the resources to be deployed. A deployed template is referred to as a stack. We’ve taken care of defining the infrastructure in the solution for you in the two provided templates. The dependencies template defines prerequisite resources used by the workload template, such as an Amazon Simple Storage Service (Amazon S3) bucket for object storage and AWS Identity and Access Management (IAM) permissions for AWS API actions. The resources defined in the dependencies template may be shared by multiple workload templates. The workload template defines the resources used to ingest data, train a predictor, and generate a forecast.

Deployment workflow

Deploy the dependencies CloudFormation template

First, let’s deploy the dependencies template to create our prerequisite resources. The dependencies template deploys an optional S3 bucket, AWS Lambda functions, and IAM roles. Amazon S3 is a low-cost, highly available, resilient, object storage service. We use an S3 bucket in this solution to store source data and trigger the workflow, resulting in a forecast. Lambda is a serverless, event-driven compute service that lets you run code without provisioning or managing servers. The dependencies template includes functions to do things like create a dataset group in Forecast and purge objects within an S3 bucket before deleting the bucket. IAM roles define permissions within AWS for users and services. The dependencies template deploys a role to be used by Lambda and another for Step Functions, a workflow management service that will coordinate the tasks of data ingestion and processing, as well as predictor training and inference using Forecast.

Complete the following steps to deploy the dependencies template:

  1. On the console, select the desired Region supported by Forecast for solution deployment.
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create stack and choose With new resources (standard).
    Create stack
  4. For Template source, select Amazon S3 URL.
  5. Enter the template URL: https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/forecast-mlops-dependency.yaml.
  6. Choose Next.
    Specify template
  7. For Stack name, enter forecast-mlops-dependency.
  8. Under Parameters, choose to use an existing S3 bucket or create a new one, then provide the name of the bucket.
  9. Choose Next.
  10. Choose Next to accept the default stack options.
  11. Select the check box to acknowledge the stack creates IAM resources, then choose Create stack to deploy the template.

You should see the template deploy as the forecast-mlops-dependency stack. When the status changes to CREATE_COMPLETE, you may move to the next step.
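If you prefer to script the console steps above, the following is a minimal sketch that assumes boto3 is installed and AWS credentials are configured. The helper names are hypothetical. The template's parameter keys aren't listed in this post, so look them up in the template and pass them through the parameters argument. Passing both IAM capabilities is an assumption that mirrors the console acknowledgment check box.

```python
def build_create_stack_kwargs(stack_name, template_url, parameters=None):
    """Assemble keyword arguments for CloudFormation's create_stack call."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # CloudFormation expects a list of ParameterKey/ParameterValue pairs.
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in (parameters or {}).items()
        ],
        # Equivalent of the console check box acknowledging IAM resources.
        "Capabilities": ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    }


def deploy_stack(kwargs, region_name=None):
    """Create the stack and block until CloudFormation reports completion."""
    import boto3  # imported lazily so the helper above works without boto3

    cfn = boto3.client("cloudformation", region_name=region_name)
    cfn.create_stack(**kwargs)
    cfn.get_waiter("stack_create_complete").wait(StackName=kwargs["StackName"])


DEPENDENCY_TEMPLATE_URL = (
    "https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/"
    "ml_ops/forecast-mlops-dependency.yaml"
)

request = build_create_stack_kwargs(
    "forecast-mlops-dependency", DEPENDENCY_TEMPLATE_URL
)
# deploy_stack(request)  # uncomment to deploy; requires AWS credentials
```

The same two helpers can be reused for the workload template by swapping in its URL and parameters.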

Deploy the workload CloudFormation template

Next, let’s deploy the workload template to create the workflow resources. The workload template deploys Step Functions state machines for workflow management, AWS Systems Manager Parameter Store parameters to store parameter values from AWS CloudFormation and inform the workflow, an Amazon Simple Notification Service (Amazon SNS) topic for workflow notifications, and an IAM role for workflow service permissions.

The solution creates five state machines:

  • CreateDatasetGroupStateMachine – Creates a Forecast dataset group for data to be imported into.
  • CreateImportDatasetStateMachine – Imports source data from Amazon S3 into a dataset group for training.
  • CreateForecastStateMachine – Manages the tasks required to train a predictor and generate a forecast.
  • AthenaConnectorStateMachine – Enables you to write SQL queries with the Amazon Athena connector to land data in Amazon S3. This is an optional process to obtain historical data in the required format for Forecast by using Athena instead of placing files manually in Amazon S3.
  • StepFunctionWorkflowStateMachine – Coordinates calls out to the other four state machines and manages the overall workflow.

Parameter Store, a capability of Systems Manager, provides secure, hierarchical storage and programmatic retrieval of configuration data and secrets. Parameter Store is used to store parameters set in the workload stack as well as other parameters used by the workflow.

Complete the following steps to deploy the workload template:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack and choose With new resources (standard).
  3. For Template source, select Amazon S3 URL.
  4. Enter the template URL: https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/forecast-mlops-solution-guidance.yaml.
  5. Choose Next.
  6. For Stack name, enter a name.
  7. Accept the default values or modify the parameters.

Be sure to enter the S3 bucket name from the dependencies stack for S3 Bucket and a valid email address for SNSEndpoint even if you accept the default parameter values.

The following table describes each parameter.

Parameter | Description | More Information
DatasetGroupFrequencyRTS | The frequency of data collection for the RTS dataset. | -
DatasetGroupFrequencyTTS | The frequency of data collection for the TTS dataset. | -
DatasetGroupName | A short name for the dataset group, a self-contained workload. | CreateDatasetGroup
DatasetIncludeItem | Specify if you want to provide item metadata for this use case. | -
DatasetIncludeRTS | Specify if you want to provide a related time series for this use case. | -
ForecastForecastTypes | When a CreateForecast job runs, this declares which quantiles to produce predictions for. You may choose up to five values in this array. Edit this value to include values according to need. | CreateForecast
PredictorAttributeConfigs | For the target variable in TTS and each numeric field in the RTS datasets, a record must be created for each time interval for each item. This configuration helps determine how missing records are filled in: with 0, NaN, or otherwise. We recommend filling the gaps in the TTS with NaN instead of 0. With 0, the model might wrongly learn to bias forecasts toward 0. NaN is the default in this guidance. Consult with your AWS Solutions Architect with any questions on this. | CreateAutoPredictor
PredictorExplainPredictor | Valid values are TRUE or FALSE. These determine if explainability is enabled for your predictor. This can help you understand how values in the RTS and item metadata influence the model. | Explainability
PredictorForecastDimensions | You may want to forecast at a finer grain than item. Here, you can specify dimensions such as location, cost center, or whatever your needs are. This needs to agree with the dimensions in your RTS and TTS. Note that if you have no dimension, the correct parameter is null, by itself and in all lowercase. null is a reserved word that lets the system know there is no parameter for the dimension. | CreateAutoPredictor
PredictorForecastFrequency | Defines the time scale at which your model and predictions will be generated, such as daily, weekly, or monthly. The drop-down menu helps you choose allowed values. This needs to agree with your RTS time scale if you’re using RTS. | CreateAutoPredictor
PredictorForecastHorizon | The number of time steps that the model predicts. The forecast horizon is also called the prediction length. | CreateAutoPredictor
PredictorForecastOptimizationMetric | Defines the accuracy metric used to optimize the predictor. The drop-down menu helps you choose one: weighted quantile loss balances over- and under-forecasting, RMSE is concerned with units, and WAPE/MAPE are concerned with percent errors. | CreateAutoPredictor
PredictorForecastTypes | When a CreateAutoPredictor job runs, this declares which quantiles are used to train prediction points. You may choose up to five values in this array, allowing you to balance over- and under-forecasting. Edit this value to include values according to need. | CreateAutoPredictor
S3Bucket | The name of the S3 bucket where input data and output data are written for this workload. | -
SNSEndpoint | A valid email address to receive notifications when the predictor and Forecast jobs are complete. | -
SchemaITEM | This defines the physical order, column names, and data types for your item metadata dataset. This is an optional file provided in the solution example. | CreateDataset
SchemaRTS | This defines the physical order, column names, and data types for your RTS dataset. The dimensions must agree with your TTS. The time-grain of this file governs the time-grain at which predictions can be made. This is an optional file provided in the solution example. | CreateDataset
SchemaTTS | This defines the physical order, column names, and data types for your TTS dataset, the only required dataset. The file must contain a target value, timestamp, and item at a minimum. | CreateDataset
TimestampFormatRTS | Defines the timestamp format provided in the RTS file. | CreateDatasetImportJob
TimestampFormatTTS | Defines the timestamp format provided in the TTS file. | CreateDatasetImportJob

  8. Choose Next to accept the default stack options.
  9. Select the check box to acknowledge the stack creates IAM resources, then choose Create stack to deploy the template.

You should see the template deploy as the stack name you chose earlier. When the status changes to CREATE_COMPLETE, you may move to the data upload step.
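The PredictorAttributeConfigs recommendation in the parameter table (fill gaps in the TTS with NaN rather than 0) can be illustrated with a small sketch. This is plain Python on made-up numbers with hypothetical helper names, not a description of how Forecast handles missing values internally. It only shows why counting gaps as zeros pulls an average toward 0.

```python
import math


def fill_gaps(series, fill_value):
    """Replace missing observations (None) with fill_value."""
    return [fill_value if x is None else x for x in series]


def mean_ignoring_nan(series):
    """Average the observed values, skipping NaN placeholders."""
    observed = [x for x in series if not math.isnan(x)]
    return sum(observed) / len(observed)


# Hypothetical daily sales with two days where no record was captured.
sales = [20, 22, None, 21, None, 23]

zero_filled = fill_gaps(sales, 0.0)
nan_filled = fill_gaps(sales, math.nan)

print(mean_ignoring_nan(zero_filled))  # gaps counted as real zero demand
print(mean_ignoring_nan(nan_filled))   # gaps treated as unknown
```

With zero filling, the two missing days are treated as days with no demand, so the average drops well below the level of the observed days.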

Upload the data

In the previous section, you provided a stack name and an S3 bucket. This section describes how to deposit the publicly available dataset Food Demand in this bucket. If you’re using your own dataset, refer to Datasets to prepare your dataset in a format the deployment is expecting. The dataset needs to contain at least the target time series, and optionally, the related time series and the item metadata:

  • TTS is the time series data that includes the field that you want to generate a forecast for; this field is called the target field
  • RTS is time series data that doesn’t include the target field, but includes a related field
  • The item data file isn’t time series data, but includes metadata information about the items in the TTS or RTS datasets

Complete the following steps:

  1. If you’re using the provided sample dataset, download the dataset Food Demand to your computer and unzip the file, which creates three files inside three directories (rts, tts, item).
  2. On the Amazon S3 console, navigate to the bucket you created earlier.
  3. Choose Create folder.
  4. Use the same string as your workload stack name for the folder name.
  5. Choose Upload.
  6. Choose the three dataset folders, then choose Upload.

When the upload is complete, you should see something like the following screenshot. For this example, our folder is aiml42.

S3 folder structure
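If you're preparing your own data, the folder layout described above can be sketched in code. The example below is a rough illustration that assumes a header-less CSV with the column order timestamp, item_id, target_value. These column names, the file name, and the bucket name are hypothetical, and the real order must agree with your SchemaTTS parameter. The upload helper assumes boto3 and AWS credentials.

```python
import csv
import io
from datetime import date, timedelta


def build_tts_csv(item_id, start, daily_values):
    """Render one item's daily history as header-less CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for offset, value in enumerate(daily_values):
        day = start + timedelta(days=offset)
        # Assumed column order: timestamp, item_id, target_value.
        writer.writerow([day.isoformat(), item_id, value])
    return buffer.getvalue()


def s3_key(stack_name, dataset_type, filename):
    """Key layout from the steps above: <stack name>/<tts|rts|item>/<file>."""
    if dataset_type not in {"tts", "rts", "item"}:
        raise ValueError(f"unknown dataset type: {dataset_type}")
    return f"{stack_name}/{dataset_type}/{filename}"


def upload_text(bucket, key, text):
    """Upload CSV text to Amazon S3 (needs boto3 and AWS credentials)."""
    import boto3  # lazy import keeps the helpers above usable offline

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))


csv_text = build_tts_csv("item_001", date(2023, 1, 1), [12, 15, 9])
key = s3_key("aiml42", "tts", "tts.csv")
print(key)
print(csv_text)
# upload_text("my-forecast-bucket", key, csv_text)  # hypothetical bucket name
```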

Create a Forecast dataset group

Complete the steps in this section to create a dataset group as a one-time event for each workload. Going forward, you should plan on running the import data, create predictor, and create forecast steps as appropriate, as a series, according to your schedule, which could be daily, weekly, or otherwise.

  1. On the Step Functions console, locate the state machine containing Create-Dataset-Group.
  2. On the state machine detail page, choose Start execution.
  3. Choose Start execution again to confirm.

The state machine takes about 1 minute to run. When it’s complete, the value under Execution Status should change from Running to Succeeded.

Execution status
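Starting a state machine and waiting for it to finish can also be scripted. The following is a minimal sketch that assumes a boto3 Step Functions client. The helper names are hypothetical, and the name fragment matches the Create-Dataset-Group state machine from the steps above.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def find_state_machine_arn(sfn, name_fragment):
    """Return the ARN of the single state machine whose name contains the fragment."""
    paginator = sfn.get_paginator("list_state_machines")
    matches = [
        machine["stateMachineArn"]
        for page in paginator.paginate()
        for machine in page["stateMachines"]
        if name_fragment in machine["name"]
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected one match for {name_fragment!r}, found {len(matches)}"
        )
    return matches[0]


def run_and_wait(sfn, state_machine_arn, poll_seconds=15):
    """Start an execution and poll until it reaches a terminal status."""
    execution = sfn.start_execution(stateMachineArn=state_machine_arn)
    while True:
        status = sfn.describe_execution(
            executionArn=execution["executionArn"]
        )["status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)


def main():
    import boto3  # requires AWS credentials and a deployed workload stack

    sfn = boto3.client("stepfunctions")
    arn = find_state_machine_arn(sfn, "Create-Dataset-Group")
    print(run_and_wait(sfn, arn))
```

Calling main() runs the dataset group step. Passing a different name fragment, such as Import-Dataset, would reuse the same helpers for the later steps.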

Import data into Forecast

Follow the steps in this section to import the dataset that you uploaded to your S3 bucket into your dataset group:

  1. On the Step Functions console, locate the state machine containing Import-Dataset.
  2. On the state machine detail page, choose Start execution.
  3. Choose Start execution again to confirm.

The amount of time the state machine takes to run depends on the dataset being processed.

Graph inspector

  4. While this is running, in your browser, open another tab and navigate to the Forecast console.
  5. On the Forecast console, choose View dataset groups and navigate to the dataset group with the name specified for DatasetGroupName from your workload stack.
  6. Choose View datasets.

You should see the data imports in progress.

Data imports in progress

When the state machine for Import-Dataset is complete, you can proceed to the next step to build your time series data model.

Create AutoPredictor (train a time series model)

This section describes how to train an initial predictor with Forecast. You may choose to create a new predictor (your first, baseline predictor) or retrain a predictor during each production cycle, which could be daily, weekly, or otherwise. You may also elect not to create a predictor each cycle and rely on predictor monitoring to guide you when to create one. The following figure visualizes the process of creating a production-ready Forecast predictor.

Production ready predictor workflow

To create a new predictor, complete the following steps:

  1. On the Step Functions console, locate the state machine containing Create-Predictor.
  2. On the state machine detail page, choose Start execution.
  3. Choose Start execution again to confirm.
    The runtime depends on the dataset being processed and can be an hour or more.
  4. While this is running, in your browser, open another tab and navigate to the Forecast console.
  5. On the Forecast console, choose View dataset groups and navigate to the dataset group with the name specified for DatasetGroupName from your workload stack.
  6. Choose View predictors.

You should see the predictor training in progress (Training status shows “Create in progress…”).

Predictor training in progress

When the state machine for Create-Predictor is complete, you can evaluate its performance.

As part of the state machine, the system creates a predictor and also runs a BacktestExport job that writes out time series-level predictor metrics to Amazon S3. These are files located in two S3 folders under the backtest-export folder:

  • accuracy-metrics-values – Provides item-level accuracy metric computations so you can understand the performance of a single time series. This allows you to investigate the spread rather than focusing on the global metrics alone.
  • forecasted-values – Provides step-level predictions for each time series in the backtest window. This enables you to compare the actual target value from a holdout test set to the predicted quantile values. Reviewing this helps formulate ideas on how to provide additional data features in RTS or item metadata to help better estimate future values, further reducing loss. You may download backtest-export files from Amazon S3 or query them in place with Athena.

S3 bucket contents

With your own data, you need to closely inspect the predictor outcomes and ensure the metrics meet your expected results by using the backtest export data. When satisfied, you can begin generating future-dated predictions as described in the next section.
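When you inspect the backtest export, it can help to recompute a metric by hand for a single time series. The sketch below uses the usual definitions of WAPE and weighted quantile loss (wQL) on made-up numbers. Forecast already reports these metrics in the export, and the export's column names aren't shown in this post, so the sketch works on plain lists with hypothetical function names.

```python
def wape(actuals, predictions):
    """Weighted absolute percentage error: sum|y - y_hat| / sum|y|."""
    total_error = sum(abs(y - p) for y, p in zip(actuals, predictions))
    return total_error / sum(abs(y) for y in actuals)


def weighted_quantile_loss(actuals, quantile_predictions, tau):
    """Weighted quantile loss at quantile tau (between 0 and 1)."""
    loss = sum(
        # Under-forecasts are weighted by tau, over-forecasts by 1 - tau.
        tau * max(y - q, 0) + (1 - tau) * max(q - y, 0)
        for y, q in zip(actuals, quantile_predictions)
    )
    return 2 * loss / sum(abs(y) for y in actuals)


# Made-up holdout values and backtest predictions for one time series.
actual = [10, 20, 30]
p50 = [12, 18, 30]
p90 = [15, 25, 28]

print(f"WAPE (P50): {wape(actual, p50):.4f}")
print(f"wQL[0.5]:   {weighted_quantile_loss(actual, p50, 0.5):.4f}")
print(f"wQL[0.9]:   {weighted_quantile_loss(actual, p90, 0.9):.4f}")
```

At the 0.5 quantile, the two penalties are equal, so wQL[0.5] matches WAPE. At the 0.9 quantile, under-forecasting is penalized nine times more heavily than over-forecasting.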

Generate a forecast (inference about future time horizons)

This section describes how to generate forecast data points with Forecast. Going forward, you should harvest new data from the source system, import the data into Forecast, and then generate forecast data points. Optionally, you may also insert a new predictor creation after import and before forecast. The following figure visualizes the process of creating production time series forecasts using Forecast.

Production time series forecast workflow

Complete the following steps:

  1. On the Step Functions console, locate the state machine containing Create-Forecast.
  2. On the state machine detail page, choose Start execution.
  3. Choose Start execution again to confirm.
    This state machine finishes very quickly because the system isn’t configured to generate a forecast. It doesn’t know which predictor model you have approved for inference.
    Let’s configure the system to use your trained predictor.
  4. On the Forecast console, locate the ARN for your predictor.
  5. Copy the ARN to use in a later step.
    Predictor details
  6. In your browser, open another tab and navigate to the Systems Manager console.
  7. On the Systems Manager console, choose Parameter Store in the navigation pane.
  8. Locate the parameter related to your stack (/forecast/<StackName>/Forecast/PredictorArn).
  9. Enter the ARN you copied for your predictor.
    This is how you associate a trained predictor with the inference function of Forecast.
  10. Locate the parameter /forecast/<StackName>/Forecast/Generate and edit the value, replacing FALSE with TRUE.
    Now you’re ready to run a forecast job for this dataset group.
  11. On the Step Functions console, run the Create-Forecast state machine.

This time, the job runs as expected. As part of the state machine, the system creates a forecast and a ForecastExport job, which writes out time series predictions to Amazon S3. These files are located in the forecast folder.

Forecast folder contents

Inside the forecast folder, you will find predictions for your items, located in many CSV or Parquet files, depending on your selection. The predictions for each time step and selected time series exist with all your chosen quantile values per record. You may download these files from Amazon S3, query them in place with Athena, or choose another strategy to use the data.
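The Parameter Store changes in the preceding steps (setting the predictor ARN and switching Generate to TRUE) can also be applied programmatically. The following is a minimal sketch that assumes a boto3 Systems Manager client and the parameter paths shown above. The helper names are hypothetical.

```python
def parameter_name(stack_name, *parts):
    """Build a Parameter Store name such as /forecast/<StackName>/Forecast/Generate."""
    return "/".join(["", "forecast", stack_name, *parts])


def enable_forecast_generation(ssm, stack_name, predictor_arn):
    """Point the workflow at an approved predictor and switch generation on."""
    updates = {
        parameter_name(stack_name, "Forecast", "PredictorArn"): predictor_arn,
        parameter_name(stack_name, "Forecast", "Generate"): "TRUE",
    }
    for name, value in updates.items():
        # Overwrite=True updates the existing parameter created by the stack.
        ssm.put_parameter(Name=name, Value=value, Overwrite=True)
    return updates


def main(stack_name, predictor_arn):
    import boto3  # requires AWS credentials

    enable_forecast_generation(boto3.client("ssm"), stack_name, predictor_arn)
```

Because the predictor ARN is passed in explicitly, a person still decides which trained predictor is approved for inference.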

This wraps up the entire workflow. You can now visualize your output using any visualization tool of your choice, such as Amazon QuickSight. Alternatively, data scientists can use pandas to generate their own plots. If you choose to use QuickSight, you can connect your forecast results to QuickSight to perform data transformations, create one or more data analyses, and create visualizations.
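As a starting point for your own analysis, the following sketch parses forecast rows with Python's standard library and summarizes the spread between P10 and P90 per item. The sample text is made up, and the column names (item_id, date, p10, p50, p90) are assumptions. Check the header of your exported files and adjust accordingly.

```python
import csv
import io

# Made-up sample standing in for one exported forecast file.
SAMPLE_EXPORT = """item_id,date,p10,p50,p90
item_001,2023-02-01T00:00:00Z,8.0,12.0,17.5
item_001,2023-02-02T00:00:00Z,9.5,13.0,18.0
item_002,2023-02-01T00:00:00Z,1.0,2.5,4.0
"""


def load_forecast_rows(text):
    """Parse export text into dicts with numeric quantile columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column in ("p10", "p50", "p90"):
            row[column] = float(row[column])
        rows.append(row)
    return rows


def interval_width_by_item(rows):
    """Average P90 - P10 spread per item, a rough view of forecast uncertainty."""
    widths = {}
    for row in rows:
        widths.setdefault(row["item_id"], []).append(row["p90"] - row["p10"])
    return {item: sum(values) / len(values) for item, values in widths.items()}


rows = load_forecast_rows(SAMPLE_EXPORT)
print(interval_width_by_item(rows))
```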

This process provides a template to follow. You will need to adapt the sample to your schema and set the forecast horizon, time resolution, and so forth according to your use case. You will also need to set a recurring schedule that harvests data from the source system, imports the data, and produces forecasts. If desired, you may insert a predictor task between the import and forecast steps.

Retrain the predictor

We have walked through the process of training a new predictor, but what about retraining a predictor? Retraining a predictor is one way to reduce the cost and time involved with training a predictor on the latest available data. Rather than create a new predictor and train it on the entire dataset, we can retrain the existing predictor by providing only the new incremental data made available since the predictor was last trained. Let’s walk through how to retrain a predictor using the automation solution:

  1. On the Forecast console, choose View dataset groups.
  2. Choose the dataset group associated with the predictor you want to retrain.
  3. Choose View predictors, then choose the predictor you want to retrain.
  4. On the Settings tab, copy the predictor ARN.
    We need to update a parameter used by the workflow to identify the predictor to retrain.
  5. On the Systems Manager console, choose Parameter Store in the navigation pane.
  6. Locate the parameter /forecast/<STACKNAME>/Forecast/Predictor/ReferenceArn.
  7. On the parameter detail page, choose Edit.
  8. For Value, enter the predictor ARN.
    This identifies the correct predictor for the workflow to retrain. Next, we need to update a parameter used by the workflow to change the training strategy.
  9. Locate the parameter /forecast/<STACKNAME>/Forecast/Predictor/Strategy.
  10. On the parameter detail page, choose Edit.
  11. For Value, enter RETRAIN.
    The workflow defaults to training a new predictor; however, setting this value to RETRAIN retrains an existing predictor, and setting it to NONE reuses an existing predictor without retraining. You may want to forgo training if your data is relatively stable or you’re using automated predictor monitoring to decide when retraining is necessary.
  12. Upload the incremental training data to the S3 bucket.
  13. On the Step Functions console, locate the state machine <STACKNAME>-Create-Predictor.
  14. On the state machine detail page, choose Start execution to begin the retraining.

When the retraining is complete, the workflow will end and you will receive an SNS email notification to the email address provided in the workload template parameters.

Clean up

When you’re done with this solution, follow the steps in this section to delete related resources.

Delete the S3 bucket

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Select the bucket where data was uploaded and choose Empty to delete all data associated with the solution, including source data.
  3. Enter permanently delete to delete the bucket contents permanently.
  4. On the Buckets page, select the bucket and choose Delete.
  5. Enter the name of the bucket to confirm the deletion and choose Delete bucket.

Delete Forecast resources

  1. On the Forecast console, choose View dataset groups.
  2. Select the dataset group name associated with the solution, then choose Delete.
  3. Enter delete to delete the dataset group and associated predictors, predictor backtest export jobs, forecasts, and forecast export jobs.
  4. Choose Delete to confirm.

Delete the CloudFormation stacks

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the workload stack and choose Delete.
  3. Choose Delete stack to confirm deletion of the stack and all associated resources.
  4. When the deletion is complete, select the dependencies stack and choose Delete.
  5. Choose Delete to confirm.
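The cleanup steps above can be sketched as a script. This is a rough illustration that assumes boto3 and AWS credentials. The function names are hypothetical, and these calls are destructive, so review them before running anything. The Forecast deletion is asynchronous, and the sketch doesn't wait for it to finish.

```python
def run_cleanup(s3_bucket, forecast, cfn, dataset_group_arn, stack_names):
    """Delete solution resources in the order used in the steps above."""
    # Empty, then delete, the bucket. Versioned buckets also need their
    # object versions removed, which this sketch does not handle.
    s3_bucket.objects.all().delete()
    s3_bucket.delete()

    # Remove the dataset group together with its child resources.
    forecast.delete_resource_tree(ResourceArn=dataset_group_arn)

    # Delete each stack in the given order and wait for it to finish.
    waiter = cfn.get_waiter("stack_delete_complete")
    for name in stack_names:
        cfn.delete_stack(StackName=name)
        waiter.wait(StackName=name)


def main(bucket_name, dataset_group_arn, workload_stack, dependency_stack):
    import boto3  # requires AWS credentials; these calls are destructive

    run_cleanup(
        boto3.resource("s3").Bucket(bucket_name),
        boto3.client("forecast"),
        boto3.client("cloudformation"),
        dataset_group_arn,
        # Workload stack first, then the dependencies stack.
        [workload_stack, dependency_stack],
    )
```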

Conclusion

In this post, we discussed some different ways to get started using Forecast. We walked through an automated forecasting solution based on AWS CloudFormation for a rapid, repeatable solution deployment of a Forecast pipeline from data ingestion to inference, with little infrastructure knowledge required. Finally, we saw how we can use Lambda to automate model retraining, reducing cost and training time.

There’s no better time than the present to start forecasting with Forecast. To start building and deploying an automated workflow, visit Amazon Forecast resources. Happy forecasting!


About the Authors

Aaron Fagan is a Principal Specialist Solutions Architect at AWS based in New York. He specializes in helping customers architect solutions in machine learning and cloud security.

Raju Patil is a Data Scientist in AWS Professional Services. He builds and deploys AI/ML solutions to assist AWS customers in overcoming their business challenges. His AWS engagements have covered a wide range of AI/ML use cases, such as computer vision, time-series forecasting, and predictive analytics, across numerous industries, including financial services, telecom, health care, and more. Prior to this, he led Data Science teams in Advertising Technology and made significant contributions to numerous research and development initiatives in computer vision and robotics. Outside of work, he enjoys photography, hiking, travel, and culinary explorations.

Get started with generative AI on AWS using Amazon SageMaker JumpStart

Generative AI is gaining a lot of public attention at present, with talk around products such as GPT4, ChatGPT, DALL-E2, Bard, and many other AI technologies. Many customers have been asking for more information on AWS’s generative AI solutions. The aim of this post is to address those needs.

This post provides an overview of generative AI with a real customer use case, describes the technology concisely and outlines its benefits, references an easy-to-follow demo of AWS DeepComposer for creating new musical compositions, and outlines how to get started using Amazon SageMaker JumpStart for deploying GPT2, Stable Diffusion 2.0, and other generative AI models.

Generative AI overview

Generative AI is a specific field of artificial intelligence that focuses on generating new material. It’s one of the most exciting fields in the AI world, with the potential to transform existing businesses and allow completely new business ideas to come to market. You can use generative techniques for:

  • Creating new works of art using a model such as Stable Diffusion 2.0
  • Writing a best-selling book using a model such as GPT2, Bloom, or Flan-T5-XL
  • Composing your next symphony using the Transformers technique in AWS DeepComposer

AWS DeepComposer is an educational tool that helps you understand the key concepts associated with machine learning (ML) through the language of musical composition. To learn more, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Stable Diffusion, GPT2, Bloom, and Flan-T5-XL are all ML models. They are mathematical algorithms that need to be trained to identify patterns within data. After the patterns are learned, the models are deployed onto endpoints, ready for a process known as inference. New data that the model hasn’t seen is fed into the deployed model, and new creative material is produced.

For example, with image generation models such as Stable Diffusion, we can create stunning illustrations using a few words. With text generation models such as GPT2, Bloom, and Flan-T5-XL, we can generate new literary articles, and potentially books, from a simple human sentence.

Autodesk is an AWS customer using Amazon SageMaker to help their product designers sort through thousands of iterations of visual designs for various use cases and use ML to help choose the optimal design. Specifically, they have worked with Edera Safety to help develop a spinal cord protector that protects riders from accidents while participating in sporting events, such as mountain biking. For more information, check out the video AWS Machine Learning Enables Design Optimization.

To learn more about what AWS customers are doing with generative AI and fashion, refer to Virtual fashion styling with generative AI using Amazon SageMaker.

Now that we understand what generative AI is all about, let’s jump into a JumpStart demonstration to learn how to generate new text or images with AI.

Prerequisites

Amazon SageMaker Studio is the integrated development environment (IDE) within SageMaker that provides us with all the ML features that we need in a single pane of glass. Before we can run JumpStart, we need to set up Studio. You can skip this step if you already have your own version of Studio running.

The first thing we need to do before we can use any AWS services is to make sure we have signed up for and created an AWS account. Next is to create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

The next step is to create a SageMaker domain. A domain sets up all the storage and allows you to add users to access SageMaker. For more information, refer to Onboard to Amazon SageMaker Domain. This demo is created in the AWS Region us-east-1.

Finally, you launch Studio. For this post, we recommend launching a user profile app. For instructions, refer to Launch Amazon SageMaker Studio.

Choose a JumpStart solution

Now we come to the exciting part. You should now be logged in to Studio, and see a page similar to the following screenshot.

In the navigation pane, under SageMaker JumpStart, choose Models, notebooks, solutions.

You’re presented with a range of solutions, foundation models, and other artifacts that can help you get started with a specific model or a specific business problem or use case.

If you want to experiment in a particular area, you can use the search function. Or you can simply browse the artifacts to find the relevant model or business solution for your needs.

For example, if you’re interested in fraud detection solutions, enter fraud detection into the search bar.

Fraud Detection Screenshot

If you’re interested in text generation solutions, enter text generation into the search bar. A good place to start if you want to explore a range of text generation models is to select the Intro to JS – Text Generation notebook.

JS - Text Generation

Let’s dive into a specific demonstration of the GPT-2 model.

JumpStart GPT-2 model demo

GPT-2 is a language model that generates human-like text based on a given prompt. We can use this type of transformer model to create new sentences and help us automate writing. This can be used for content creation such as blogs, social media posts, and books.

The GPT-2 model belongs to the Generative Pre-Trained Transformer family and is the predecessor of GPT-3. At the time of writing, GPT-3 is used as the foundation for the OpenAI ChatGPT application.

To start exploring the GPT-2 model demo in JumpStart, complete the following steps:

  1. On JumpStart, search for and choose GPT 2.
  2. In the Deploy Model section, expand Deployment Configuration.
  3. For SageMaker hosting instance, choose your instance (for this post, we use ml.c5.2xlarge).

Different instance types have different prices. At the time of writing, the ml.c5.2xlarge that we selected costs under $0.50 per hour. For the most up-to-date pricing, refer to Amazon SageMaker Pricing.

  4. For Endpoint name, enter demo-hf-textgeneration-gpt2.
  5. Choose Deploy.


Wait for the ML endpoint to deploy (up to 15 minutes).

  6. When the endpoint is deployed, choose Open Notebook.


You’ll see a page similar to the following screenshot.

The document we’re using for this demonstration is a Jupyter notebook, which contains all the necessary Python code. Note that the code in this screenshot may be slightly different from the code you have, because AWS is constantly updating these notebooks to make sure they are secure, free of defects, and provide the best customer experience.

  7. Click into the first cell and press Ctrl+Enter to run the code block.


An asterisk (*) appears to the left of the code block and then turns into a number. The asterisk indicates that the code is running; the run is complete when the number appears.

  8. In the next code block, enter some sample text, then press Ctrl+Enter.


  9. Press Ctrl+Enter in the third code block to run it.

After about 30-60 seconds, you will see your inference results.

For the input text “Once upon a time there were 18 sandwiches,” we get the following generated text:

Once upon a time there were 18 sandwiches, four plates with some salad, and three sandwiches with some beef. One restaurant was so nice that the food was made by hand. There were people living at the beginning of the time who were waiting so that

For the input text “And for the final time Peter said to Mary,” we get the following generated text:

And for the final time Peter said to Mary that he was a saint.

11 But Peter said that it was not a blessing, but rather that it would be the death of Peter. And when Mary heard of that Peter said to him,

You can experiment with running this third code block multiple times, and you will notice that the model makes different predictions each time.
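You can also call the deployed endpoint from your own Python code rather than from the notebook. The following is a minimal sketch, not the notebook's exact code. It assumes the endpoint accepts a JSON-encoded prompt with the `application/x-text` content type and returns a JSON object with a `generated_text` key; check the notebook that opens for your model version, because the payload format may differ. The helper names `build_request`, `parse_generated_text`, and `generate` are illustrative.

```python
import json

# Endpoint name entered during deployment earlier in this post.
ENDPOINT_NAME = "demo-hf-textgeneration-gpt2"


def build_request(prompt):
    """Return keyword arguments for the SageMaker runtime invoke_endpoint call.

    Assumption: the model expects a JSON-encoded string sent as
    application/x-text. Verify this against your notebook.
    """
    return {
        "EndpointName": ENDPOINT_NAME,
        "ContentType": "application/x-text",
        "Accept": "application/json",
        "Body": json.dumps(prompt).encode("utf-8"),
    }


def parse_generated_text(raw_body):
    """Extract the generated text from the JSON response body.

    Assumption: the response is a JSON object with a "generated_text" key.
    """
    return json.loads(raw_body)["generated_text"]


def generate(prompt):
    """Call the live endpoint. Requires AWS credentials and a deployed endpoint."""
    import boto3  # imported here so the helpers above work without boto3

    runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
    response = runtime.invoke_endpoint(**build_request(prompt))
    return parse_generated_text(response["Body"].read())
```

With the endpoint in service, `generate("Once upon a time there were 18 sandwiches,")` would return one completion. As in the notebook, repeated calls can return different text.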

To tailor the output using some of the advanced features, scroll down to experiment in the fourth code block.

To learn more about text generation models, refer to Run text generation with Bloom and GPT models on Amazon SageMaker JumpStart.

Clean up resources

Before we move on, don’t forget to delete your endpoint when you’re finished. On the previous tab, under Delete Endpoint, choose Delete.


If you have accidentally closed this notebook, you can also delete your endpoint via the SageMaker console. Under Inference in the navigation pane, choose Endpoints.

Select the endpoint you used and on the Actions menu, choose Delete.
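If you prefer to clean up from code, the same deletion can be scripted. The following is a minimal sketch that assumes you have AWS credentials configured and permission to manage SageMaker endpoints. `list_endpoints` and `delete_endpoint` are operations of the boto3 SageMaker client; the wrapper function names are illustrative. The listing helper reads only the first page of results.

```python
def list_in_service_endpoints(sagemaker_client):
    """Return the names of endpoints currently in service (first page only)."""
    response = sagemaker_client.list_endpoints(StatusEquals="InService")
    return [endpoint["EndpointName"] for endpoint in response["Endpoints"]]


def delete_endpoint(sagemaker_client, endpoint_name):
    """Delete one endpoint by name and return the name that was deleted."""
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    return endpoint_name


def main():
    """Example run against a real account; not called automatically."""
    import boto3

    client = boto3.client("sagemaker", region_name="us-east-1")
    for name in list_in_service_endpoints(client):
        print("In service:", name)
    delete_endpoint(client, "demo-hf-textgeneration-gpt2")
```

Listing the endpoints first is a quick way to confirm that nothing is still running. Endpoint configurations and models are separate resources, so check the console if you also want to remove those.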


Now that we understand how to use our first JumpStart solution, let’s look at using a Stable Diffusion model.

JumpStart Stable Diffusion model demo

We can use the Stable Diffusion 2 model to generate images from a simple line of text. This can be used to generate content for things like social media posts, promotional material, album covers, or anything that requires creative artwork.

  1. Return to JumpStart, then search for and choose Stable Diffusion 2.


  2. In the Deploy Model section, expand Deployment Configuration.
  3. For SageMaker hosting instance, choose your instance (for this post, we use ml.g5.2xlarge).
  4. For Endpoint name, enter demo-stabilityai-stable-diffusion-v2.
  5. Choose Deploy.

Because this is a larger model, it can take up to 25 minutes to deploy. When it’s ready, the endpoint status shows as In Service.


  6. Choose Open Notebook to open a Jupyter notebook with Python code.


  7. Run the first and second code blocks.
  8. In the third code block, change the text prompt, then run the cell.


Wait about 30–60 seconds for your image to appear. The following image is based on our example text.


Again, you can play with the advanced features in the next code block. The picture it creates is different every time.

Clean up resources

Again, don’t forget to delete your endpoint. This time, we’re using ml.g5.2xlarge, so it incurs slightly higher charges than before. At the time of writing, it was just over $1 per hour.
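To see why cleanup matters, it helps to estimate what a forgotten endpoint costs. The following is a rough sketch. The hourly rates are placeholders taken from the approximate figures quoted in this post (under $0.50 and just over $1), not current prices, and the 48-hour duration is a hypothetical weekend.

```python
def endpoint_cost(hourly_rate_usd, hours):
    """Return the cost in USD of keeping one endpoint running for `hours`."""
    return round(hourly_rate_usd * hours, 2)


# Placeholder rates based on the approximate figures in this post.
# Look up Amazon SageMaker Pricing for the real numbers in your Region.
GPT2_RATE_USD = 0.50               # ml.c5.2xlarge: "under $0.50 per hour"
STABLE_DIFFUSION_RATE_USD = 1.00   # ml.g5.2xlarge: "just over $1 per hour"

FORGOTTEN_HOURS = 48  # hypothetical: endpoint left running over a weekend

print(endpoint_cost(GPT2_RATE_USD, FORGOTTEN_HOURS))
print(endpoint_cost(STABLE_DIFFUSION_RATE_USD, FORGOTTEN_HOURS))
```

Under these assumptions, a weekend of idle time comes to roughly $24 for the GPT-2 endpoint and $48 for the Stable Diffusion endpoint.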

Finally, let’s move to AWS DeepComposer.

AWS DeepComposer

AWS DeepComposer is a great way to learn about generative AI. It allows you to use built-in melodies in your models to generate new forms of music. The model that you use determines how the input melody is transformed.

If you’re used to participating in AWS DeepRacer days to help your employees learn about reinforcement learning, consider augmenting and enhancing the day with AWS DeepComposer to learn about generative AI.

For a detailed explanation and easy-to-follow demonstration of three of the models in this post, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Check out the following cool examples uploaded to SoundCloud using AWS DeepComposer.

We would love to see your experiments, so feel free to reach out via social media (@digitalcolmer) and share your learnings and experiments.

Conclusion

In this post, we talked about the definition of generative AI, illustrated by an AWS customer story. We then walked you through how to get started with Studio and JumpStart, and showed you how to get started with GPT-2 and Stable Diffusion models. We wrapped up with a brief overview of AWS DeepComposer.

To explore JumpStart more, try using your own data to fine-tune an existing model. For more information, refer to Incremental training with Amazon SageMaker JumpStart. For information about fine-tuning Stable Diffusion models, refer to Fine-tune text-to-image Stable Diffusion models with Amazon SageMaker JumpStart.

To learn more about Stable Diffusion models, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.

We didn’t cover any information on the Flan-T5-XL model, so to learn more, refer to the following GitHub repo. The Amazon SageMaker Examples repo also includes a range of available notebooks on GitHub for the various SageMaker products, including JumpStart, covering a range of different use cases.

To learn more about AWS ML via a range of free digital assets, check out our AWS Machine Learning Ramp-Up Guide. You can also try our free ML Learning Plan to build on your current knowledge or have a clear starting point. To take an instructor-led course, we highly recommend the following courses:

It is truly an exciting time in the AI/ML space. AWS is here to support your ML journey, so please connect with us on social media. We look forward to seeing all your learning, experiments, and fun with the various ML services over the coming months and relish the opportunity to be your instructor on your ML journey.


About the Author

Paul Colmer is a Senior Technical Trainer at Amazon Web Services specializing in machine learning and generative AI. His passion is helping customers, partners, and employees develop and grow through compelling storytelling, shared experiences, and knowledge transfer. With over 25 years in the IT industry, he specializes in agile cultural practices and machine learning solutions. Paul is a Fellow of the London College of Music and Fellow of the British Computer Society.


Modeling Spoken Information Queries for Virtual Assistants: Open Problems, Challenges and Opportunities



Virtual assistants are becoming increasingly important speech-driven Information Retrieval platforms that assist users with various tasks. We discuss open problems and challenges with respect to modeling spoken information queries for virtual assistants, and list opportunities where Information Retrieval methods and research can be applied to improve the quality of virtual assistant speech recognition. We discuss how query domain classification, knowledge graphs and user interaction data, and query personalization can be helpful in improving the accurate recognition of spoken information…

Apple Machine Learning Research

Using generative AI to imitate human behavior


This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

Figure 1: Overview of our method, with a side-by-side comparison of text-to-image diffusion and observation-to-action diffusion. On the right are diagrams of the different denoising architectures tested, as well as an illustration of the sampling schemes explored.

Diffusion models have emerged as a powerful class of generative AI models. They have been used to generate photorealistic images and short videos, compose music, and synthesize speech. And their uses don’t stop there. In our new paper, Imitating Human Behaviour with Diffusion Models, we explore how they can be used to imitate human behavior in interactive environments.

This capability is valuable in many applications. For instance, it could help automate repetitive manipulation tasks in robotics, or it could be used to create humanlike AI in video games, which could lead to exciting new game experiences—a goal particularly dear to our team.

We follow a machine learning paradigm known as imitation learning (more specifically behavior cloning). In this paradigm, we are provided with a dataset containing observations a person saw, and the actions they took, when acting in an environment, which we would like an AI agent to mimic. In interactive environments, at each time step, an observation \(o_t\) is received (e.g. a screenshot of a video game), and an action \(a_t\) is then selected (e.g. the mouse movement). With this dataset of many \(o\)’s and \(a\)’s performed by some demonstrator, a model \(\pi\) could try to learn this mapping of observation-to-action, \(\pi(o) \to a\).


When the actions are continuous, training a model to learn this mapping introduces some interesting challenges. In particular, what loss function should be used? A simple choice is mean squared error, as often used in supervised regression tasks. In an interactive environment, this objective encourages an agent to learn the average of all the behaviors in the dataset.

If the goal of the application is to generate diverse human behaviors, the average might not be very useful. After all, humans are stochastic (they act on whims) and multimodal creatures (different humans might make different decisions). Figure 2 depicts the failure of mean squared error to mimic the true action distribution (marked in yellow) when it is multimodal. It also includes several other popular choices for the loss function when doing behavior cloning.
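The averaging effect described above can be seen in a few lines of Python. The following is a toy sketch, not the setup from the paper. It assumes a hypothetical one-dimensional action space in which half of the demonstrators move left (-1) and half move right (+1), and it searches for the single constant prediction with the lowest mean squared error.

```python
def mse(prediction, actions):
    """Mean squared error of one constant prediction against all actions."""
    return sum((a - prediction) ** 2 for a in actions) / len(actions)


# Hypothetical bimodal demonstrations: 50 "left" and 50 "right" actions.
actions = [-1.0] * 50 + [1.0] * 50

# Candidate predictions from -1.0 to 1.0 in steps of 0.1.
candidates = [i / 10 for i in range(-10, 11)]

# The candidate that minimizes mean squared error over the dataset.
best = min(candidates, key=lambda c: mse(c, actions))

print("MSE-optimal action:", best)
print("Was it ever demonstrated?", best in actions)
```

The error is lowest at the mean of the data, which here lies between the two modes. No demonstrator ever took that action, which is the failure that Figure 2 illustrates for multimodal data.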

Figure 2: This toy example (based on an arcade claw game) shows an action space with two continuous action dimensions. Here the demonstration distribution is marked in yellow—it is both multimodal and has correlations between action dimensions. Diffusion models offer a good imitation of the full diversity in the dataset.

Ideally, we’d like our models to learn the full variety of human behaviors. And this is where generative models help. Diffusion models are a specific class of generative model that are both stable to train and easy to sample from. They have been very successful in the text-to-image domain, which shares this one-to-many challenge—a single text caption might be matched by multiple different images.

Our work adapts ideas that have been developed for text-to-image diffusion models to this new paradigm of observation-to-action diffusion. Figure 1 highlights some differences. One obvious point is that the object we are generating is now a low-dimensional action vector (rather than an image). This calls for a new design for the denoising network architecture. In image generation, heavy convolutional U-Nets are in vogue, but these are less applicable to low-dimensional vectors. Instead, we designed and tested three different architectures, shown in Figure 1.

In observation-to-action models, sampling a single bad action during an episode can throw an agent off course, and hence we were motivated to develop sampling schemes that would more reliably return good action samples (also shown in Figure 1). This problem is less severe in text-to-image models, since users often have the luxury of selecting a single image from among several generated samples and ignoring any bad images. Figure 3 shows an example of this, where a user might cherry-pick their favorite, while ignoring the one with nonsensical text.

Figure 3: Four samples from a text-to-image diffusion model from Bing (note this is not our own work), using the prompt “A cartoon style picture of people playing with arcade claw machine”.

We tested our diffusion agents in two different environments. The first, a simulated kitchen environment, is a challenging high-dimensional continuous control problem where a robotic arm must manipulate various objects. The demonstration dataset is collected from a variety of humans performing various tasks in differing orders. Hence there is rich multimodality in the dataset.

We found that diffusion agents outperformed baselines in two aspects. 1) The diversity of behaviors they learned was broader, and closer to the human demonstrations. 2) The rate of task completion (a proxy for reward) was better.

The videos below highlight the ability of diffusion to capture multimodal behavior: starting from the same initial conditions, we roll out the diffusion agent eight times, and each time it selects a different sequence of tasks to complete.

Eight short clips, each showing a robotic arm interacting with a kitchen environment and performing a specific task.

The second environment tested was a modern 3D video game, Counter-Strike. We refer interested readers to the paper for results.

In summary, our work has demonstrated how exciting recent advances in generative modeling can be leveraged to build agents that can behave in humanlike ways in interactive environments. We’re excited to continue exploring this direction – watch this space for future work.

For more detail on our work, please see our paper and code repo.

The post Using generative AI to imitate human behavior appeared first on Microsoft Research.


Inferring rewards through interaction


This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

A diagram in which five newspaper icons are lined up in the middle, the first of which is labeled a. An arrow points from the newspaper to an icon of a person above it. The person is labeled x and has a mouse click icon next to it and a thought bubble with the words “I like this!” that’s labeled r. An arrow points from the mouse click icon to a box labeled “recommender system” under the newspapers.

Reinforcement learning (RL) hinges on the power of rewards, driving agents—or the models doing the learning—to explore and learn valuable actions. The feedback received through rewards shapes their behavior, culminating in effective policies. Yet, crafting reward functions is a complex, laborious task, even for experts. A more appealing option, particularly for the people ultimately using systems that learn from feedback over time, is an agent that can automatically infer a reward function. The interaction-grounded learning (IGL) paradigm from Microsoft Research enables agents to infer rewards through the very process of interaction, utilizing diverse feedback signals rather than explicit numeric rewards. Despite the absence of a clear reward signal, the feedback relies on a binary latent reward through which the agent masters a policy that maximizes this unseen latent reward using environmental feedback.

In our paper “Personalized Reward Learning with Interaction-Grounded Learning,” which we’re presenting at the 2023 International Conference on Learning Representations (ICLR), we propose a novel approach to solve for the IGL paradigm: IGL-P. IGL-P is the first IGL strategy for context-dependent feedback, the first use of inverse kinematics as an IGL objective, and the first IGL strategy for more than two latent states. This approach provides a scalable alternative to current personalized agent learning methods, which can require expensive high-dimensional parameter tuning, handcrafted rewards, and/or extensive and costly user studies.

IGL-P in the recommender system setting

IGL-P is particularly useful for interactive learning applications such as recommender systems. Recommender systems help people navigate increasing volumes of content offerings by providing personalized content suggestions. However, without explicit feedback, recommender systems can’t detect for certain whether a person enjoyed the displayed content. To accommodate, modern recommender systems equate implicit feedback signals with user satisfaction. Despite the popularity of this approach, implicit feedback is not the true reward. Even the click-through rate (CTR) metric, the gold standard for recommender systems, is an imperfect reward, and its optimization naturally promotes clickbait.

Interaction-grounded learning (IGL) for the recommender system setting. The recommender system receives features describing a person (x), recommends an item (a), and observes implicit user feedback (y), which is dependent on the latent reward (r) but not r itself, to learn how to better recommend personalized content to the individual.

This problem has led to the handcrafting of reward functions with various implicit feedback signals in modern recommender systems. Recommendation algorithms will use hand-defined weights for different user interactions, such as replying to or liking content, when deciding how to recommend content to different people. This fixed weighting of implicit feedback signals might not generalize across a wide variety of people, and thus a personalized learning method can improve user experience by recommending content based on user preferences.
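To make the limitation of fixed weights concrete, here is a small sketch of a hand-crafted reward. It is an illustration only, not IGL-P and not any production system: the signal names, the weights, and the two users are all hypothetical.

```python
# Hypothetical hand-defined weights for implicit feedback signals.
HAND_WEIGHTS = {"click": 1.0, "like": 2.0, "reply": 3.0}


def handcrafted_reward(feedback, weights=HAND_WEIGHTS):
    """Score an interaction as a fixed weighted sum of feedback counts."""
    return sum(weights.get(signal, 0.0) * count
               for signal, count in feedback.items())


# Two hypothetical users who enjoyed the same item equally
# but express that satisfaction in different ways.
frequent_clicker = {"click": 3}
quiet_liker = {"like": 1}

print(handcrafted_reward(frequent_clicker))
print(handcrafted_reward(quiet_liker))
```

The same weights give the two users different scores even though their satisfaction is assumed to be identical, so a recommender optimizing this score would treat them differently. This is the kind of mismatch that motivates learning a personalized reward instead.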


The choice of reward function is further complicated by differences in how people interact with recommender systems. A growing body of work shows that recommender systems don’t provide consistently good recommendations across demographic groups. Previous research suggests that this inconsistency has its roots in user engagement styles. In other words, a reward function that might work well for one type of user might (and often does) perform poorly for another type of user who interacts with the platform differently. For example, older adults have been found to click on clickbait more often. If the CTR is used as an objective, this group of users will receive significantly more clickbait recommendations than the general public, resulting in higher rates of negative user experiences and leading to user distrust in the recommender system.

IGL-P provides a novel approach to optimize content for latent user satisfaction—that is, rewards that a model doesn’t have direct access to—by learning personalized reward functions for different people rather than requiring a fixed, human-designed reward function. IGL-P learns representations of diverse user communication modalities and how these modalities depend on the underlying user satisfaction. It assumes that people may communicate their feedback in different ways but a given person expresses (dis)satisfaction or indifference to all content in the same way. This enables the use of inverse kinematics toward a solution for recovering the latent reward. With additional assumptions that rewards are rare when the agent acts randomly and some negatively labeled interactions are directly accessible to the agent, IGL-P recovers the latent reward function and leverages that to learn a personalized policy.

IGL-P successes

The success of IGL-P is demonstrated with experiments using simulations, as well as with real-world production traces. IGL-P is evaluated in three different settings:

  • A simulation using a supervised classification dataset shows that IGL-P can learn to successfully distinguish between different communication modalities.
  • A simulation for online news recommendation based on publicly available data from Facebook users shows that IGL-P leverages insights about different communication modalities to learn better policies and achieve consistent performance among diverse user groups (the dataset, created in 2016, consists of public posts from the official Facebook pages of news companies from 2012 to 2016 and aggregated user reactions; because of this aggregation, identifying information can’t be extracted).
  • A real-world experiment deployed in the Microsoft image recommendation product Windows Spotlight showcases that the proposed method outperforms the hand-engineered reward baseline and succeeds in a practical application serving millions of people.

The post Inferring rewards through interaction appeared first on Microsoft Research.


Meet the Maker: Software Developer Builds Fully Functional Superhero Helmet


Kris Kersey

Kris Kersey is an embedded software developer with over 20 years of experience, an educational YouTuber with 30,000+ subscribers, and a lifelong lover of comics and cosplay.

These interests and expertise came together in his first-ever project using the NVIDIA Jetson platform for edge AI and robotics when he created a fully functional superhero helmet as portrayed in one of his favorite Marvel Comic films, Iron Man.

The 3D-printed helmet comes complete with computer-vision capabilities in a heads-up display (HUD) that presents information wherever the user’s looking, just like in the movie.

The NVIDIA Jetson platform processes data from two cameras — one by each eye slot — that see what the helmet’s wearer is seeing. The HUD then presents information including the current temperature, humidity, altitude and GPS location. It can also classify what’s in the user’s view based on deep neural networks for object detection.

To let others join in on the fun, Kersey shared his entire workflow on his popular YouTube channel, Kersey Fabrications.

Superhero films and science fiction remind Kersey that cutting-edge technology requires collaboration across disciplines, he said.

“Often, as with this project, artists and storytellers use their imaginations to come up with brilliant ideas — then, it’s up to scientists and engineers to make them real,” the developer said.

About the Maker

Kersey, who studied computer science at Southern Polytechnic State University — now part of Kennesaw State University — in Georgia, has broad experience working with embedded microprocessors and architectures. He specializes in the Linux operating system, which is compatible with the NVIDIA Jetson platform.

“Writing software on the Jetson platform didn’t require that I learn a new programming language or operating system, which made it very easy for me,” the maker said.

By day, he’s a software engineer at an Atlanta-based startup. By night, he’s working on projects in his personal makerspace.

“I’ve never used my garage for cars,” he said.

Instead, it’s full of tools, boards and other equipment that enable his marvelous projects.

Kersey emphasized that what’s important to him most of all, however, is his family, with whom he likes to play board games, watch movies and go on hikes.

His Inspiration

Kersey’s fascination with technology stemmed from his childhood. His mother was a teacher focused on computer-aided drafting and mechanical design.

“From a very early age, I could tinker with computers that she had access to, which always fascinated me,” he said. “My cousin also once gave me an old 8-bit computer, but there wasn’t much I could do with it, so I remember pulling out the manual and reading the whole thing — that taught me basic programming.”

More recently, Kersey got into 3D printing while helping his son with a project for Science Olympiad.

“From that moment on, I got really into 3D printing as a hobby — my son never really took to it a whole lot,” he mused.

In 2018, Kersey created his YouTube channel with a focus on 3D printing as a way to delve deeper into the maker community while teaching others what he’s learned along the way.

A Jetson-Powered Superhero Project

Kersey’s 3D-printed, fully functional, wireless Iron Man helmet — which he even sanded and painted himself — could be straight out of the iconic films.

The prototype used the NVIDIA Jetson Xavier NX developer kit as the core powering its HUD.

“For this whole experience to feel as awesome as Iron Man’s tech, it has to be real time, low latency, high resolution and high frame rate,” Kersey said. “It also needs to display a lot of information on screen, which requires a powerful graphics processor — that’s why I chose the Jetson platform.”

Jetson developer kits are equipped with a powerful, onboard NVIDIA GPU and AI capabilities to supercharge embedded applications.

Kersey also tapped the NVIDIA TensorRT software development kit to enable high-performance deep-learning inference with low latency and high throughput for the project.

For the next generation of the helmet’s HUD — a project that’s “not finished till it’s finished,” according to the maker — Kersey used the NVIDIA Jetson Orin Nano developer kit. Launched in September, the kit has set a new standard for creating entry-level AI-powered robots, intelligent cameras and more.

It only took Kersey two hours to get from opening the Orin Nano box to having the software deployed and running, he said.

He’s now looking to upgrade the project with the Jetson Orin NX 16GB system-on-module, as well as build a full suit beyond the headgear, starting with prototype aluminum repulsors.

And the developer will soon make the project’s code open source, so others can easily turn themselves into superheroes, too.

Kersey plans to wear the upgraded superhero gear at Dragon Con — the world’s largest multimedia, popular culture convention — taking place in August. Plus, at this month’s MomoCon in Atlanta, he’ll present on a panel titled Making It Real: High Tech in Cosplay.

Asked if Iron Man is his favorite superhero, Kersey said with a smile: “He is right now.”

Check out Kersey Fabrications on YouTube and learn more about the NVIDIA Jetson platform.


GeForce NOW Makes May-hem With 16 New Games, Including ‘The Lord of the Rings: Gollum’


What has it got in its pocketses? More games coming in May, that’s what.

GFN Thursday gets the summer started early with two newly supported games this week and 16 more coming later this month — including The Lord of the Rings: Gollum.

Don’t forget to take advantage of the limited-time discount on six-month Priority memberships. Priority members get faster access to cloud gaming servers, as well as support for RTX ON in supported games — all for 40% off the normal price. But hurry, this offer ends Sunday, May 21.

And the fun in May won’t stop there.

Stay tuned for more news on Xbox games joining the GeForce NOW library soon.

How Precious

No need to be sneaky about it — The Lord of the Rings: Gollum from Daedalic Entertainment comes to GeForce NOW when it releases on Thursday, May 25.

The action-adventure game and epic interactive experience takes place in parallel to the events described in The Fellowship of the Ring. Play as the enigmatic Gollum on his perilous journey and find out how he outwitted the most powerful characters in Middle-earth.

Climb the mountains of Mordor, sneak around Mirkwood and make difficult choices. Who will gain the upper hand: the cunning Gollum or the innocent Smeagol? Priority and Ultimate members can experience the epic story with support for RTX ray tracing and DLSS technology for AI-powered high-quality graphics, streaming across nearly any device with up to eight-hour sessions. Go Ultimate today with the one cloud gaming membership that rules them all.

May-Day Game-Day

It’s gonna be May, and that means more of the best games joining the GeForce NOW library.

Welcome to a new Age of Wonders.

Age of Wonders 4 is the long-awaited sequel from Paradox Interactive. The game blends 4X strategy with turn-based combat: members can explore new magical realms and rule over a faction of their own design that grows as their empire expands. Battle through each chapter and guide your empire to greatness.

It leads two new games joining the cloud this week:

  • Age of Wonders 4 (New release on Steam)
  • Showgunners (New release on Steam)

Then check out the rest of the titles on their way in May:

  • Occupy Mars: The Game (New release on Steam, May 10)
  • TT Isle of Man: Ride on the Edge 3 (New release on Steam, May 11)
  • Far Cry 6 (New release on Steam, May 11)
  • Tin Hearts (New release on Steam, May 16)
  • The Outlast Trials (New release on Steam, May 18)
  • Warhammer 40,000: Boltgun (New release on Steam, May 23)
  • Blooming Business: Casino (New release on Steam, May 23)
  • Railway Empire 2 (New release on Steam, May 25)
  • The Lord of the Rings: Gollum (New release on Steam, May 25)
  • Above Snakes (New release on Steam, May 25)
  • System Shock (New release on Steam, May 30)
  • Patch Quest (Steam)
  • The Ascent (Steam)
  • Lawn Mowing Simulator (Steam)
  • Conqueror’s Blade (Steam)

April Additions

There were 23 announced games in April, plus another eight that joined the GeForce NOW library of over 1,600 games:

Poker Club unfortunately couldn’t be added in April due to technical issues. Tin Hearts also didn’t make it in April, but is included in the May list due to a shift in its release date.

With so many titles streaming from the cloud, what device will you be streaming on? Let us know in the comments below, or on Twitter or Facebook.
