Combining low-rank approximation, a residual binary autoencoder, and a new loss function enables a fivefold increase in compression ratio.
Simpleperf case study: Fast initialization of TFLite’s Memory Arena
Posted by Alan Kelly, Software Engineer
One of our previous articles, Optimizing TensorFlow Lite Runtime Memory, discusses how TFLite’s memory arena minimizes memory usage by sharing buffers between tensors. This means we can run models on even smaller edge devices. In today’s article, I will describe the performance optimization of the memory arena initialization so that our users get the benefit of low memory usage with little additional overhead.
ML is normally deployed on-device as part of a larger pipeline. TFLite is used because it’s fast and lightweight, but the rest of the pipeline must also be fast. Profiling on the target device with representative data lets us identify the slowest parts of the pipeline so that we can optimize the most important part of the code.
In this article, I will describe the profiling and optimization of TFLite’s memory arena with instructions on how to use Simpleperf and visualize the results. Sample commands are given. It is assumed that the Android NDK is installed and that you have a development device that you can connect to using adb.
Simpleperf
Simpleperf comes with some scripts to make it easier to use. run_simpleperf_on_device.py pushes simpleperf to the device and runs your binary with the given arguments.
/usr/lib/android-ndk/simpleperf/run_simpleperf_on_device.py record --call-graph fp /data/local/tmp/my_binary arg0 arg1 …
This will generate the output file perf.data which you must then copy back to your computer.
adb pull /data/local/tmp/perf.data
You then generate the binary cache which contains all the information needed later to generate a useful profile.
/usr/lib/android-ndk/simpleperf/binary_cache_builder.py -lib /your/binaries/folder -i perf.data
And generate the proto buffer used for visualization:
/usr/lib/android-ndk/simpleperf/pprof_proto_generator.py --ndk_path=/path/to/android-ndk -i perf.data -o profile.proto
You can then display the results using pprof:
pprof -http :8888 profile.proto
And open localhost:8888 in your browser to view the profile. I find flame graphs to be the most useful.
Optimizing TFLite’s Memory Arena
ArenaPlanner::ExecuteAllocations accounts for 54.3% of the runtime of this model. I was expecting ML operators such as fully connected layers or convolutions to be the bottleneck of this model, not runtime overhead. This is a particularly bad case: the memory arena overhead isn't this severe for every model, but improvements here will benefit all models. This model has variable input sizes and many dynamic tensors, whose output size isn't known until operator evaluation, and these trigger frequent tensor re-allocations. This really is as bad as it gets. Let's zoom in on the profile.
InterpreterInfo::num_tensors() accounts for 10.4% of the runtime. This function is expensive because it is a virtual function that calls another function, and it is called within a loop. I would never have suspected this.
for (int i = 0; i < static_cast<int>(graph_info_->num_tensors()); ++i) {
The arena planner does not create or destroy tensors, so the number of tensors is constant. Let's cache it.
const int num_tensors = static_cast<int>(graph_info_->num_tensors());
Our next piece of low-hanging fruit is InterpreterInfo::tensor(unsigned long), another virtual function that does bounds checking and then returns a pointer to a tensor. Tensors are stored in an array, so let's add a function to get a pointer to this array. The commits are here and here.
After these simple changes, the runtime of this model is reduced by 25% and the overhead of the memory allocator is cut in half. Simpleperf made identifying these inefficiencies easy! Time to profile again to measure the impact of these changes.
ArenaPlanner::CalculateAllocations is now the most expensive function at 12.7%. This calls two functions: SimpleMemoryArena::Allocate and ArenaPlanner::CreateTensorAllocationVector.
Although ArenaPlanner::CreateTensorAllocationVector is the cheaper of the two, its code is far simpler, so it might be easier to optimize. This function identifies which tensors are allocated between two nodes in the graph and then sorts them by size, because a greedy-by-size allocation algorithm is used in which the largest tensors are allocated first. The structure of the graph is constant, so we can store a map of the tensors allocated at each node. Instead of checking each tensor in the model to see if it is allocated between the two nodes, we can identify the tensors to be allocated by iterating through the map. The cost of ArenaPlanner::CreateTensorAllocationVector has gone from 4.8% to 0.8% of the runtime. The code can be seen here. The sort does not appear in the profile, so we ignore it.
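To illustrate the idea (the actual implementation is C++ and linked above; the names below are hypothetical), a minimal Python sketch builds the node-to-tensors map once and reuses it instead of scanning every tensor for every node pair:

from collections import defaultdict

def build_allocation_map(tensor_first_use):
    # tensor_first_use: tensor id -> node at which the tensor is first needed.
    alloc_at_node = defaultdict(list)
    for tensor_id, node in tensor_first_use.items():
        alloc_at_node[node].append(tensor_id)
    return alloc_at_node

def tensors_to_allocate(alloc_at_node, first_node, last_node, tensor_sizes):
    # Collect only the tensors allocated between the two nodes, then sort
    # them largest-first for the greedy-by-size allocator.
    ids = []
    for node in range(first_node, last_node + 1):
        ids.extend(alloc_at_node.get(node, []))
    return sorted(ids, key=lambda t: tensor_sizes[t], reverse=True)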
The next function to look at is ArenaPlanner::ResolveTensorAllocation which is 10.9% of the runtime after the previous optimizations. This function resets each tensor’s data pointer after allocation. However, these pointers don’t always change. How about keeping track of and only updating the ones which change? After this change, ArenaPlanner::ResolveTensorAllocation doesn’t appear in the profile anymore.
Let's now take a look at allocation and deallocation. SimpleMemoryArena::Allocate accounts for 7% and SimpleMemoryArena::Deallocate for 6.8% of the runtime. Records of all allocations in the arena are stored in a vector ordered by their offsets within the memory arena. Entries in this sorted data structure are inserted, removed and searched, and these operations are all O(N) in a vector. Could an std::multimap with the offset as the key be better? A multimap is needed because the records are ordered by their offsets and there may be multiple tensors with the same offset. Removal and insertion would be O(log N), but search would still be O(N), as we are searching for the tensor ID and not the offset. The best way to find out is to test and profile.
Replacing the vector with a multimap actually slows down the arena code: it is almost three times slower than using a vector! While this goes against intuition, it is commonly found when optimizing code. Operations on a set or a map have linear or logarithmic complexities; however, there is also a constant factor in the complexity, and this factor is higher than the constant factor for a vector. We also iterate through the records, which is much cheaper for a vector than for a list or multimap. A list was also tested, coming in at twice as slow as a vector.
Deallocation can still be improved though. SimpleMemoryArena::Deallocate iterates through the records and, when it finds the record to deallocate, removes it from the vector. This has O(N²) complexity. The memcpy seen in the profile above comes from the frequent calls to std::vector::erase. It is much more efficient to mark records to be erased and then erase them in one pass using std::remove_if. The second optimization here is to look at how tensors are typically deallocated: ArenaPlanner::ResetAllocationsAfter deallocates all tensors from a node until the end of the graph. To address this, SimpleMemoryArena::DeallocateAfter(int32_t node) was added, which iterates once through all the records, marking those which are allocated after the node. SimpleMemoryArena::ResolveDeallocations erases these in one pass, making deallocation O(N). After these changes, ResetAllocationsAfter no longer appears in the profile! Commits are here and here.
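A minimal Python sketch of the mark-then-erase idea follows (the real code is C++ and uses std::remove_if; the record fields here are hypothetical):

def deallocate_after(records, node):
    # Single pass: only mark records whose tensor is allocated after `node`.
    for rec in records:
        if rec["first_node"] > node:
            rec["erase"] = True

def resolve_deallocations(records):
    # A second single pass removes all marked records at once, the Python
    # analogue of std::remove_if followed by vector::erase.
    records[:] = [rec for rec in records if not rec.get("erase")]

Both passes are O(N), so resetting all allocations after a node no longer degrades to O(N²).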
The profile now looks very different, with the overhead of tensor allocation gone from 49.9% of the runtime to 11%. This profile already looks much more reasonable. SimpleMemoryArena::Allocate is the last function left to optimize. For each tensor, this function iterates through the vector of records trying to find space for the current allocation, which makes the whole pass O(N²). This is a fundamental limitation of the greedy-by-size algorithm: efficient use of memory comes at the cost of increased overhead. Although the complexity can't be reduced, N can. We process nodes in the order in which they are executed. Allocation information for tensors which have been deallocated on already executed nodes is not needed anymore; it only slows things down. Records are purged periodically so that only active records are considered. On a large model, this significantly reduces N. ArenaPlanner::ExecuteAllocations is no longer the most expensive function! It has gone from 11% to 6%, and a fully connected operator is now the most expensive function in the profile, which is what we expect when profiling neural network inference.
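Before moving on, here is a simplified Python sketch of a greedy first-fit scan over only the active records; the field names and purge interval are hypothetical, and the real arena also handles alignment and buffer reuse:

class ArenaSketch:
    def __init__(self, purge_every=64):
        self.records = []          # active allocations: offset, size, lifetime
        self.allocs = 0
        self.purge_every = purge_every

    def allocate(self, size, first_node, last_node):
        self.allocs += 1
        if self.allocs % self.purge_every == 0:
            # Drop records whose tensors were deallocated on nodes that have
            # already executed, so the scan below sees a smaller N.
            self.records = [r for r in self.records
                            if r["last_node"] >= first_node]
        # First-fit search through the records ordered by offset.
        offset = 0
        for rec in sorted(self.records, key=lambda r: r["offset"]):
            if rec["offset"] - offset >= size:
                break              # the gap before this record is large enough
            offset = max(offset, rec["offset"] + rec["size"])
        self.records.append({"offset": offset, "size": size,
                             "first_node": first_node, "last_node": last_node})
        return offset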
This is what a neural network profile should look like. Time should be spent running your model’s operators, not in the inference runtime.
The optimized memory arena is now publicly available as part of TensorFlow 2.13.
Next Steps
Today's post walked you through an example of Simpleperf helping to find easy-to-fix inefficiencies in TFLite's memory arena that would never have been found by just looking at the code. Pprof can display annotated source code, disassembly and graphs, making it easy to find the bottlenecks in your on-device pipelines.
Generate creative advertising using generative AI deployed on Amazon SageMaker
Creative advertising has the potential to be revolutionized by generative AI (GenAI). You can now create a wide variety of novel images, such as product shots, by retraining a GenAI model and providing a few inputs into the model, such as textual prompts (sentences describing the scene and objects to be produced by the model). This technique has shown promising results starting in 2022 with the explosion of a new class of foundation models (FMs) called latent diffusion models such as Stable Diffusion, Midjourney, and Dall-E-2. However, to use these models in production, the generation process requires constant refining to generate consistent outputs. This often means creating a large number of sample images of the product and clever prompt engineering, which makes the task difficult at scale.
In this post, we explore how this transformative technology can be harnessed to generate captivating and innovative advertisements at scale, especially when dealing with large catalogs of images. By using the power of GenAI, specifically through the technique of inpainting, we can seamlessly create image backgrounds, resulting in visually stunning and engaging content and reducing unwanted image artifacts (termed model hallucinations). We also delve into the practical implementation of this technique by utilizing Amazon SageMaker endpoints, which enable efficient deployment of the GenAI models driving this creative process.
We use inpainting as the key technique within GenAI-based image generation because it offers a powerful solution for replacing missing elements in images. However, this presents certain challenges. For instance, precise control over the positioning of objects within the image can be limited, leading to potential issues such as image artifacts, floating objects, or unblended boundaries, as shown in the following example images.
To overcome this, we propose in this post to strike a balance between creative freedom and efficient production by generating a multitude of realistic images using minimal supervision. To scale the proposed solution for production and streamline the deployment of AI models in the AWS environment, we demonstrate it using SageMaker endpoints.
In particular, we propose to split the inpainting process into a set of layers, each one potentially with a different set of prompts. The process can be summarized as the following steps:
- First, we prompt for a general scene (for example, “park with trees in the back”) and randomly place the object on that background.
- Next, we add a layer in the lower mid-section of the object by prompting where the object lies (for example, “picnic on grass, or wooden table”).
- Finally, we add a layer similar to the background layer on the upper mid-section of the object using the same prompt as the background.
The benefit of this process is improved realism: the object is perceived with better scaling and positioning relative to the background environment, which matches human expectations. The following figure shows the steps of the proposed solution.
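In addition to the figure, the following is a minimal sketch of that three-step layering, assuming a hypothetical inpaint(image, mask, prompt) helper that wraps the Stable Diffusion Inpainting endpoint and fills the nonzero region of the mask; it is not the notebook code shown later:

import numpy as np

def layered_inpainting(object_img, object_mask, inpaint,
                       background_prompt, surface_prompt):
    # object_img: HxWx3 image with the product already placed on a canvas.
    # object_mask: HxW array, 1 where the product is, 0 elsewhere.
    h, _ = object_mask.shape
    # Step 1: fill everything except the product with the general scene.
    scene = inpaint(object_img, 1 - object_mask, background_prompt)
    # Step 2: re-prompt the lower mid-section so the product sits on a surface.
    lower = np.zeros_like(object_mask)
    lower[h // 2:, :] = 1 - object_mask[h // 2:, :]
    scene = inpaint(scene, lower, surface_prompt)
    # Step 3: blend the upper mid-section back toward the background prompt.
    upper = np.zeros_like(object_mask)
    upper[:h // 2, :] = 1 - object_mask[:h // 2, :]
    return inpaint(scene, upper, background_prompt)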
Solution overview
To accomplish these tasks, the following data flow is used:
- Segment Anything Model (SAM) and Stable Diffusion Inpainting models are hosted in SageMaker endpoints.
- A background prompt is used to create a generated background image using the Stable Diffusion model.
- A base product image is passed through SAM to generate a mask. The inverse of the mask is called the anti-mask.
- The generated background image and the mask, along with foreground prompts and negative prompts, are used as input to the Stable Diffusion Inpainting model to produce the generated intermediate background image.
- Similarly, the generated background image and the anti-mask, along with foreground prompts and negative prompts, are used as input to the Stable Diffusion Inpainting model to produce the generated intermediate foreground image.
- The final generated product image is obtained by combining the generated intermediate foreground image and the generated intermediate background image, as sketched below.
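One simple way to combine the two intermediate images is alpha compositing with the product mask. This is only a sketch, not the notebooks' postprocessing code, and it assumes the mask is a grayscale PIL image that is white over the product:

import numpy as np
from PIL import Image

def composite(intermediate_foreground, intermediate_background, mask):
    # Keep the generated foreground where the product mask is set and the
    # generated background everywhere else (the anti-mask region).
    fg = np.asarray(intermediate_foreground, dtype=np.float32)
    bg = np.asarray(intermediate_background, dtype=np.float32)
    alpha = np.asarray(mask.convert("L"), dtype=np.float32)[..., None] / 255.0
    out = alpha * fg + (1.0 - alpha) * bg
    return Image.fromarray(out.astype(np.uint8))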
Prerequisites
We have developed an AWS CloudFormation template that will create the SageMaker notebooks used to deploy the endpoints and run inference.
You will need an AWS account with AWS Identity and Access Management (IAM) roles that provide access to the following:
- AWS CloudFormation
- SageMaker
- Although SageMaker endpoints provide instances to run ML models, heavy workloads like generative AI models require GPU-enabled SageMaker endpoints. Refer to Amazon SageMaker Pricing for more information about pricing.
- We use the NVIDIA A10G-enabled instance ml.g5.2xlarge to host the models.
- Amazon Simple Storage Service (Amazon S3)
For more details, check out the GitHub repository and the CloudFormation template.
Mask the area of interest of the product
In general, we need to provide an image of the object that we want to place and a mask delineating the contour of the object. This can be done using tools such as Amazon SageMaker Ground Truth. Alternatively, we can automatically segment the object using an AI tool such as the Segment Anything Model (SAM), assuming that the object is in the center of the image.
Use SAM to generate a mask
With SAM, an advanced image segmentation model, we can effortlessly generate high-quality masks for various objects within images. SAM uses deep learning models trained on extensive datasets to accurately identify and segment objects of interest, providing precise boundaries and pixel-level masks. This breakthrough technology revolutionizes image processing workflows by automating the time-consuming and labor-intensive task of manually creating masks. With SAM, businesses and individuals can now rapidly generate masks for object recognition, image editing, computer vision tasks, and more, unlocking a world of possibilities for visual analysis and manipulation.
Host the SAM model on a SageMaker endpoint
We use the notebook 1_HostGenAIModels.ipynb to create SageMaker endpoints and host the SAM model. We use the inference code in inference_sam.py and package it into a code.tar.gz file, which we use to create the SageMaker endpoint. The code downloads the SAM model, hosts it on an endpoint, and provides an entry point to run inference and generate output:
# `bucket`, `role`, `sess`, and INSTANCE_TYPE are defined earlier in the notebook.
from datetime import datetime
from sagemaker import s3
from sagemaker.pytorch import PyTorchModel
from sagemaker.deserializers import JSONDeserializer

SAM_ENDPOINT_NAME = 'sam-pytorch-' + str(datetime.utcnow().strftime('%Y-%m-%d-%H-%M-%S-%f'))
prefix_sam = "SAM/demo-custom-endpoint"

# Upload the packaged inference code and create the PyTorch model object.
model_data_sam = s3.S3Uploader.upload("code.tar.gz", f's3://{bucket}/{prefix_sam}')
model_sam = PyTorchModel(entry_point='inference_sam.py',
                         model_data=model_data_sam,
                         framework_version='1.12',
                         py_version='py38',
                         role=role,
                         env={'TS_MAX_RESPONSE_SIZE': '2000000000', 'SAGEMAKER_MODEL_SERVER_TIMEOUT': '300'},
                         sagemaker_session=sess,
                         name='model-' + SAM_ENDPOINT_NAME)

# Deploy the model to a GPU-backed endpoint.
predictor_sam = model_sam.deploy(initial_instance_count=1,
                                 instance_type=INSTANCE_TYPE,
                                 deserializer=JSONDeserializer(),
                                 endpoint_name=SAM_ENDPOINT_NAME)
Invoke the SAM model and generate a mask
The following code is part of the 2_GenerateInPaintingImages.ipynb notebook, which is used to run the endpoints and generate results:
raw_image = Image.open("images/speaker.png").convert("RGB")
predictor_sam = PyTorchPredictor(endpoint_name=SAM_ENDPOINT_NAME,
deserializer=JSONDeserializer())
output_array = predictor_sam.predict(raw_image, initial_args={'Accept': 'application/json'})
mask_image = Image.fromarray(np.array(output_array).astype(np.uint8))
# save the mask image using PIL Image
mask_image.save('images/speaker_mask.png')
The following figure shows the resulting mask obtained from the product image.
Use inpainting to create a generated image
By combining the power of inpainting with the mask generated by SAM and the user’s prompt, we can create remarkable generated images. Inpainting utilizes advanced generative AI techniques to intelligently fill in the missing or masked regions of an image, seamlessly blending them with the surrounding content. With the SAM-generated mask as guidance and the user’s prompt as a creative input, inpainting algorithms can generate visually coherent and contextually appropriate content, resulting in stunning and personalized images. This fusion of technologies opens up endless creative possibilities, allowing users to transform their visions into vivid, captivating visual narratives.
Host a Stable Diffusion Inpainting model on a SageMaker endpoint
As with the SAM model, we use the notebook 1_HostGenAIModels.ipynb to create SageMaker endpoints and host the Stable Diffusion Inpainting model. We use the inference code in inference_inpainting.py and package it into a code.tar.gz file, which we use to create the SageMaker endpoint. The code downloads the Stable Diffusion Inpainting model, hosts it on an endpoint, and provides an entry point to run inference and generate output:
INPAINTING_ENDPOINT_NAME = 'inpainting-pytorch-' + str(datetime.utcnow().strftime('%Y-%m-%d-%H-%M-%S-%f'))
prefix_inpainting = "InPainting/demo-custom-endpoint"
model_data_inpainting = s3.S3Uploader.upload("code.tar.gz", f"s3://{bucket}/{prefix_inpainting}")
model_inpainting = PyTorchModel(entry_point='inference_inpainting.py',
model_data=model_data_inpainting,
framework_version='1.12',
py_version='py38',
role=role,
env={'TS_MAX_RESPONSE_SIZE':'2000000000', 'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '300'},
sagemaker_session=sess,
name='model-'+INPAINTING_ENDPOINT_NAME)
predictor_inpainting = model_inpainting.deploy(initial_instance_count=1,
instance_type=INSTANCE_TYPE,
serializer=JSONSerializer(),
deserializer=JSONDeserializer(),
endpoint_name=INPAINTING_ENDPOINT_NAME,
volume_size=128)
Invoke the Stable Diffusion Inpainting model and generate a new image
Similarly to the step to invoke the SAM model, the notebook 2_GenerateInPaintingImages.ipynb is used to run the inference on the endpoints and generate results:
raw_image = Image.open("images/speaker.png").convert("RGB")
mask_image = Image.open('images/speaker_mask.png').convert('RGB')
prompt_fr = "table and chair with books"
prompt_bg = "window and couch, table"
negative_prompt = "longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, letters"
inputs = {}
inputs["image"] = np.array(raw_image)
inputs["mask"] = np.array(mask_image)
inputs["prompt_fr"] = prompt_fr
inputs["prompt_bg"] = prompt_bg
inputs["negative_prompt"] = negative_prompt
predictor_inpainting = PyTorchPredictor(endpoint_name=INPAINTING_ENDPOINT_NAME,
serializer=JSONSerializer(),
deserializer=JSONDeserializer())
output_array = predictor_inpainting.predict(inputs, initial_args={'Accept': 'application/json'})
gai_image = Image.fromarray(np.array(output_array[0]).astype(np.uint8))
gai_background = Image.fromarray(np.array(output_array[1]).astype(np.uint8))
gai_mask = Image.fromarray(np.array(output_array[2]).astype(np.uint8))
post_image = Image.fromarray(np.array(output_array[3]).astype(np.uint8))
# save the generated image using PIL Image
post_image.save('images/speaker_generated.png')
The following figure shows the refined mask, generated background, generated product image, and postprocessed image.
The generated product image uses the following prompts:
- Background generation – “chair, couch, window, indoor”
- Inpainting – “besides books”
Clean up
In this post, we use two GPU-enabled SageMaker endpoints, which account for the majority of the cost. These endpoints should be turned off to avoid extra cost when they are not being used. We have provided a notebook, 3_CleanUp.ipynb, which can assist in cleaning up the endpoints. We also use a SageMaker notebook to host the models and run inference. Therefore, it's good practice to stop the notebook instance if it's not being used.
Conclusion
Generative AI models are generally large-scale ML models that require specific resources to run efficiently. In this post, we demonstrated, using an advertising use case, how SageMaker endpoints offer a scalable and managed environment for hosting generative AI models such as the text-to-image foundation model Stable Diffusion. We demonstrated how two models can be hosted and run as needed, and multiple models can also be hosted from a single endpoint. This eliminates the complexities associated with infrastructure provisioning, scalability, and monitoring, enabling organizations to focus solely on deploying their models and serving predictions to solve their business challenges. With SageMaker endpoints, organizations can efficiently deploy and manage multiple models within a unified infrastructure, achieving optimal resource utilization and reducing operational overhead.
The detailed code is available on GitHub. The code demonstrates the use of AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK) to automate the process of creating SageMaker notebooks and other required resources.
About the authors
Fabian Benitez-Quiroz is an IoT Edge Data Scientist in AWS Professional Services. He holds a PhD in Computer Vision and Pattern Recognition from The Ohio State University. Fabian is involved in helping customers run their machine learning models with low latency on IoT devices and in the cloud across various industries.
Romil Shah is a Sr. Data Scientist at AWS Professional Services. Romil has more than 6 years of industry experience in computer vision, machine learning, and IoT edge devices. He is involved in helping customers optimize and deploy their machine learning models for edge devices and on the cloud. He works with customers to create strategies for optimizing and deploying foundation models.
Han Man is a Senior Data Science & Machine Learning Manager with AWS Professional Services based in San Diego, CA. He has a PhD in Engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today, he is passionately working with key customers from a variety of industry verticals to develop and implement ML and GenAI solutions on AWS.
AdaTape: Foundation model with adaptive computation and dynamic read-and-write
Adaptive computation refers to the ability of a machine learning system to adjust its behavior in response to changes in the environment. While conventional neural networks have a fixed function and computation capacity, i.e., they spend the same number of FLOPs for processing different inputs, a model with adaptive and dynamic computation modulates the computational budget it dedicates to processing each input, depending on the complexity of the input.
Adaptive computation in neural networks is appealing for two key reasons. First, the mechanism that introduces adaptivity provides an inductive bias that can play a key role in solving some challenging tasks. For instance, enabling different numbers of computational steps for different inputs can be crucial in solving arithmetic problems that require modeling hierarchies of different depths. Second, it gives practitioners the ability to tune the cost of inference through greater flexibility offered by dynamic computation, as these models can be adjusted to spend more FLOPs processing a new input.
Neural networks can be made adaptive by using different functions or computation budgets for various inputs. A deep neural network can be thought of as a function that outputs a result based on both the input and its parameters. To implement adaptive function types, a subset of parameters are selectively activated based on the input, a process referred to as conditional computation. Adaptivity based on the function type has been explored in studies on mixture-of-experts, where the sparsely activated parameters for each input sample are determined through routing.
Another area of research in adaptive computation involves dynamic computation budgets. Unlike in standard neural networks, such as T5, GPT-3, PaLM, and ViT, whose computation budget is fixed for different samples, recent research has demonstrated that adaptive computation budgets can improve performance on tasks where transformers fall short. Many of these works achieve adaptivity by using dynamic depth to allocate the computation budget. For example, the Adaptive Computation Time (ACT) algorithm was proposed to provide an adaptive computational budget for recurrent neural networks. The Universal Transformer extends the ACT algorithm to transformers by making the computation budget dependent on the number of transformer layers used for each input example or token. Recent studies, like PonderNet, follow a similar approach while improving the dynamic halting mechanisms.
In the paper "Adaptive Computation with Elastic Input Sequence", we introduce a new model that utilizes adaptive computation, called AdaTape. This model is a Transformer-based architecture that uses a dynamic set of tokens to create elastic input sequences, providing a unique perspective on adaptivity in comparison to previous works. AdaTape uses an adaptive tape reading mechanism to determine a varying number of tape tokens that are added to each input based on the input's complexity. AdaTape is simple to implement, provides an effective knob to increase accuracy when needed, and is much more efficient than other adaptive baselines because it directly injects adaptivity into the input sequence instead of the model depth. Finally, AdaTape offers better performance on standard tasks, like image classification, as well as algorithmic tasks, while maintaining a favorable quality and cost tradeoff.
Adaptive computation transformer with elastic input sequence
AdaTape uses both the adaptive function types and a dynamic computation budget. Specifically, for a batch of input sequences after tokenization (e.g., a linear projection of non-overlapping patches from an image in the vision transformer), AdaTape uses a vector representing each input to dynamically select a variable-sized sequence of tape tokens.
AdaTape uses a bank of tokens, called a “tape bank”, to store all the candidate tape tokens that interact with the model through the adaptive tape reading mechanism. We explore two different methods for creating the tape bank: an input-driven bank and a learnable bank.
The general idea of the input-driven bank is to extract a bank of tokens from the input while employing a different approach than the original model tokenizer for mapping the raw input to a sequence of input tokens. This enables dynamic, on-demand access to information from the input that is obtained using a different point of view, e.g., a different image resolution or a different level of abstraction.
In some cases, tokenization at a different level of abstraction is not possible, so an input-driven tape bank is not feasible, for example when it's difficult to further split each node in a graph transformer. To address this issue, AdaTape offers a more general approach for generating the tape bank by using a set of trainable vectors as tape tokens. This approach is referred to as the learnable bank and can be viewed as an embedding layer from which the model can dynamically retrieve tokens based on the complexity of the input example. The learnable bank enables AdaTape to generate a more flexible tape bank, providing it with the ability to dynamically adjust its computation budget based on the complexity of each input example; for example, more complex examples retrieve more tokens from the bank, which lets the model not only use the knowledge stored in the bank, but also spend more FLOPs processing it, since the input is now larger.
Finally, the selected tape tokens are appended to the original input and fed to the following transformer layers. For each transformer layer, the same multi-head attention is used across all input and tape tokens. However, two different feed-forward networks (FFN) are used: one for all tokens from the original input and the other for all tape tokens. We observed slightly better quality by using separate feed-forward networks for input and tape tokens.
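To make these two ideas concrete (variable-length tape selection and separate feed-forward networks), here is a rough NumPy sketch; it is only illustrative and not the paper's implementation, which uses an adaptive tape reading mechanism with a learned halting criterion:

import numpy as np

def select_tape_tokens(x, tape_bank, k):
    # x: (n, d) input tokens for one example; tape_bank: (B, d) candidate tokens.
    # Score every bank token against a summary of the input and append the
    # top-k; k can differ from example to example.
    query = x.mean(axis=0)
    scores = tape_bank @ query
    picked = np.argsort(scores)[-k:]
    return np.concatenate([x, tape_bank[picked]], axis=0), x.shape[0]

def split_ffn(tokens, n_input, ffn_input, ffn_tape):
    # Shared attention (not shown) runs over all tokens; the feed-forward step
    # uses one network for the input tokens and another for the tape tokens.
    out = np.empty_like(tokens)
    out[:n_input] = ffn_input(tokens[:n_input])
    out[n_input:] = ffn_tape(tokens[n_input:])
    return out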
AdaTape provides helpful inductive bias
We evaluate AdaTape on parity, a very challenging task for the standard Transformer, to study the effect of inductive biases in AdaTape. In the parity task, given a sequence of 1s, 0s, and -1s, the model has to predict the evenness or oddness of the number of 1s in the sequence. Parity is the simplest non-counter-free or periodic regular language, but, perhaps surprisingly, the task is unsolvable by the standard Transformer.
Evaluation on the parity task. The standard Transformer and Universal Transformer were unable to perform this task, both showing performance at the level of a random guessing baseline.
Despite being evaluated on short, simple sequences, both the standard Transformer and the Universal Transformer were unable to perform the parity task as they are unable to maintain a counter within the model. However, AdaTape outperforms all baselines, as it incorporates a lightweight recurrence within its input selection mechanism, providing an inductive bias that enables the implicit maintenance of a counter, which is not possible in standard Transformers.
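For reference, parity data is trivial to generate and to solve with a running counter, which is exactly the counter a standard Transformer cannot maintain internally. The sketch below is illustrative, not the paper's experimental setup:

import numpy as np

def parity_example(length, rng=np.random.default_rng(0)):
    # A sequence of 1s, 0s and -1s, labeled by the parity of the count of 1s.
    seq = rng.choice([-1, 0, 1], size=length)
    return seq, int((seq == 1).sum() % 2)

def parity_with_counter(seq):
    # The trivial counter solution; AdaTape's lightweight recurrence can
    # maintain such a counter implicitly.
    count = 0
    for token in seq:
        count += int(token == 1)
    return count % 2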
Evaluation on image classification
We also evaluate AdaTape on the image classification task. To do so, we trained AdaTape on ImageNet-1K from scratch. The figure below shows the accuracy of AdaTape and the baseline methods, including A-ViT and the Universal Transformer ViT (UViT and U2T), versus their speed (measured as the number of images processed per second). In terms of quality and cost tradeoff, AdaTape performs much better than the alternative adaptive transformer baselines. In terms of efficiency, larger AdaTape models (in terms of parameter count) are faster than smaller baselines. Such results are consistent with findings from previous work showing that adaptive model-depth architectures are not well suited to many accelerators, like the TPU.
We evaluate AdaTape by training on ImageNet from scratch. For A-ViT, we not only report their results from the paper but also re-implement A-ViT by training from scratch, i.e., A-ViT(Ours).
A study of AdaTape’s behavior
In addition to its performance on the parity task and ImageNet-1K, we also evaluated the token selection behavior of AdaTape with an input-driven bank on the JFT-300M validation set. To better understand the model’s behavior, we visualized the token selection results on the input-driven bank as heatmaps, where lighter colors mean that position is more frequently selected. The heatmaps reveal that AdaTape more frequently picks the central patches. This aligns with our prior knowledge, as central patches are typically more informative — especially in the context of datasets with natural images, where the main object is in the middle of the image. This result highlights the intelligence of AdaTape, as it can effectively identify and prioritize more informative patches to improve its performance.
We visualize the tape token selection heatmap of AdaTape-B/32 (left) and AdaTape-B/16 (right). The hotter / lighter color means the patch at this position is more frequently selected.
Conclusion
AdaTape is characterized by elastic sequence lengths generated by the adaptive tape reading mechanism. This also introduces a new inductive bias that enables AdaTape to have the potential to solve tasks that are challenging for both standard transformers and existing adaptive transformers. By conducting comprehensive experiments on image recognition benchmarks, we demonstrate that AdaTape outperforms standard transformers and adaptive architecture transformers when computation is held constant.
Acknowledgments
One of the authors of this post, Mostafa Dehghani, is now at Google DeepMind.
How AI is helping airlines mitigate the climate impact of contrails
We used AI to help airlines choose routes that are less likely to cause contrails, minimizing the environmental impact of flights.
Recent honors and awards for Amazon scientists
Researchers honored for their contributions to the scientific community.
SIGGRAPH Special Address: NVIDIA CEO Brings Generative AI to LA Show
As generative AI continues to sweep an increasingly digital, hyperconnected world, NVIDIA founder and CEO Jensen Huang made a thunderous return to SIGGRAPH, the world’s premier computer graphics conference.
“The generative AI era is upon us, the iPhone moment if you will,” Huang told an audience of thousands Tuesday during an in-person special address in Los Angeles.
News highlights include the next-generation GH200 Grace Hopper Superchip platform, NVIDIA AI Workbench — a new unified toolkit that introduces simplified model tuning and deployment on NVIDIA AI platforms — and a major upgrade to NVIDIA Omniverse with generative AI and OpenUSD.
The announcements are about bringing all of the past decade’s innovations — AI, virtual worlds, acceleration, simulation, collaboration and more — together.
“Graphics and artificial intelligence are inseparable, graphics needs AI, and AI needs graphics,” Huang said, explaining that AI will learn skills in virtual worlds, and that AI will help create virtual worlds.
Fundamental to AI, Real-Time Graphics
Five years ago at SIGGRAPH, NVIDIA reinvented graphics by bringing AI and real-time ray tracing to GPUs. But “while we were reinventing computer graphics with artificial intelligence, we were reinventing the GPU altogether for artificial intelligence,” Huang said.
The result: increasingly powerful systems such as the NVIDIA HGX H100, which harnesses eight GPUs — and a total of 1 trillion transistors — that offer dramatic acceleration over CPU-based systems.
“This is the reason why the world’s data centers are rapidly transitioning to accelerated computing,” Huang told the audience. “The more you buy, the more you save.”
To continue AI’s momentum, NVIDIA created the Grace Hopper Superchip, the NVIDIA GH200, which combines a 72-core Grace CPU with a Hopper GPU, and which went into full production in May.
Huang announced that NVIDIA GH200, which is already in production, will be complemented with an additional version with cutting-edge HBM3e memory.
He followed up on that by announcing the next-generation GH200 Grace Hopper superchip platform with the ability to connect multiple GPUs for exceptional performance and easily scalable server design.
Built to handle the world’s most complex generative workloads, spanning large language models, recommender systems and vector databases, the new platform will be available in a wide range of configurations.
The dual configuration — which delivers up to 3.5x more memory capacity and 3x more bandwidth than the current generation offering — comprises a single server with 144 Arm Neoverse cores, eight petaflops of AI performance, and 282GB of the latest HBM3e memory technology.
Leading system manufacturers are expected to deliver systems based on the platform in the second quarter of 2024.
NVIDIA AI Workbench Speeds Adoption of Custom Generative AI
To speed custom adoption of generative AI for the world’s enterprises, Huang announced NVIDIA AI Workbench. It provides developers with a unified, easy-to-use toolkit to quickly create, test and fine-tune generative AI models on a PC or workstation — then scale them to virtually any data center, public cloud or NVIDIA DGX Cloud.
AI Workbench removes the complexity of getting started with an enterprise AI project. Accessed through a simplified interface running on a local system, it allows developers to fine-tune models from popular repositories such as Hugging Face, GitHub and NGC using custom data. The models can then be shared easily across multiple platforms.
While hundreds of thousands of pretrained models are now available, customizing them with the many open-source tools available can be challenging and time consuming.
“In order to democratize this ability, we have to make it possible to run pretty much everywhere,” Huang said.
With AI Workbench, developers can customize and run generative AI in just a few clicks. It allows them to pull together all necessary enterprise-grade models, frameworks, software development kits and libraries into a unified developer workspace.
“Everybody can do this,” Huang said.
Leading AI infrastructure providers — including Dell Technologies, Hewlett Packard Enterprise, HP Inc., Lambda, Lenovo and Supermicro — are embracing AI Workbench for its ability to bring enterprise generative AI capability to wherever developers want to work — including a local device.
Huang also announced a partnership between NVIDIA and startup Hugging Face, which has 2 million users, that will put generative AI supercomputing at the fingertips of millions of developers building large language models and other advanced AI applications.
Developers will be able to access NVIDIA DGX Cloud AI supercomputing within the Hugging Face platform to train and tune advanced AI models.
“This is going to be a brand new service to connect the world’s largest AI community to the world’s best training and infrastructure,” Huang said.
In a video, Huang showed how AI Workbench and ChatUSD bring it all together: allowing a user to start a project on a GeForce RTX 4090 laptop and scale it seamlessly to a workstation, or to the data center, as it grows more complex.
Using Jupyter Notebook, a user can prompt the model to generate a picture of Toy Jensen in space. When the model provides a result that doesn’t work, because it’s never seen Toy Jensen, the user can fine-tune the model with eight images of Toy Jensen and then prompt it again to get a correct result.
Then with AI Workbench, the new model can be deployed to an enterprise application.
New NVIDIA AI Enterprise 4.0 Software Advances AI Deployment
In a further step to accelerate the adoption of generative AI, NVIDIA announced the latest version of its enterprise software suite, NVIDIA AI Enterprise 4.0.
NVIDIA AI Enterprise gives businesses access to the tools needed to adopt generative AI, while also offering the security and API stability required for large-scale enterprise deployments.
Major Omniverse Release Converges Generative AI, OpenUSD for Industrial Digitalization
Offering new foundation applications and services for developers and industrial enterprises to optimize and enhance their 3D pipelines with the OpenUSD framework and generative AI, Huang announced a major release of NVIDIA Omniverse, an OpenUSD-native development platform for building, simulating, and collaborating across tools and virtual worlds.
He also announced NVIDIA’s contributions to OpenUSD, the framework and universal interchange for describing, simulating and collaborating across 3D tools.
Updates to the Omniverse platform include advancements to Omniverse Kit — the engine for developing native OpenUSD applications and extensions — as well as to the NVIDIA Omniverse Audio2Face foundation app and spatial-computing capabilities.
Cesium, Convai, Move AI, SideFX Houdini and Wonder Dynamics are now connected to Omniverse via OpenUSD.
And expanding their collaboration across Adobe Substance 3D, generative AI and OpenUSD initiatives, Adobe and NVIDIA announced plans to make Adobe Firefly — Adobe’s family of creative generative AI models — available as APIs in Omniverse.
Omniverse users can now build content, experiences and applications that are compatible with other OpenUSD-based spatial computing platforms such as ARKit and RealityKit.
Huang announced a broad range of frameworks, resources and services for developers and companies to accelerate the adoption of Universal Scene Description, known as OpenUSD, including contributions such as geospatial data models, metrics assembly and simulation-ready, or SimReady, specifications for OpenUSD.
Huang also announced four new Omniverse Cloud APIs built by NVIDIA for developers to more seamlessly implement and deploy OpenUSD pipelines and applications.
- ChatUSD — Assisting developers and artists working with OpenUSD data and scenes, ChatUSD is a large language model (LLM) agent for generating Python-USD code scripts from text and answering USD knowledge questions.
- RunUSD — a cloud API that translates OpenUSD files into fully path-traced rendered images by checking compatibility of the uploaded files against versions of OpenUSD releases, and generating renders with Omniverse Cloud.
- DeepSearch — an LLM agent enabling fast semantic search through massive databases of untagged assets.
- USD-GDN Publisher — a one-click service that enables enterprises and software makers to publish high-fidelity, OpenUSD-based experiences to the Omniverse Cloud Graphics Delivery Network (GDN) from an Omniverse-based application such as USD Composer, as well as stream in real time to web browsers and mobile devices.
These contributions are an evolution of last week’s announcement of NVIDIA’s co-founding of the Alliance for OpenUSD along with Pixar, Adobe, Apple and Autodesk.
Powerful New Desktop Systems, Servers
Providing more computing power for all of this, Huang said NVIDIA and global workstation manufacturers are announcing powerful new RTX workstations for development and content creation in the age of generative AI and digitization.
The systems, including those from BOXX, Dell Technologies, HP and Lenovo, are based on NVIDIA RTX 6000 Ada Generation GPUs and incorporate NVIDIA AI Enterprise and NVIDIA Omniverse Enterprise software.
Separately, NVIDIA released three new desktop workstation Ada Generation GPUs — the NVIDIA RTX 5000, RTX 4500 and RTX 4000 — to deliver the latest AI, graphics and real-time rendering technology to professionals worldwide.
Huang also detailed how, together with global data center system manufacturers, NVIDIA is continuing to supercharge generative AI and industrial digitization with new NVIDIA OVX featuring the new NVIDIA L40S GPU, a powerful, universal data center processor design.
The powerful new systems will accelerate the most compute-intensive, complex applications, including AI training and inference, 3D design and visualization, video processing and industrial digitalization with the NVIDIA Omniverse platform.
NVIDIA Research Bringing New Capabilities
More innovations are coming, thanks to NVIDIA Research.
At the show’s Real Time Live Event, NVIDIA researchers will demonstrate a generative AI workflow that helps artists rapidly create and iterate on materials for 3D scenes, using text or image prompts to generate custom textured materials faster and with finer creative control.
And NVIDIA Research also demo’d how AI can take video conferencing to the next level with new 3D features. NVIDIA Research recently published a paper demonstrating how AI could power a 3D video-conferencing system with minimal capture equipment.
The production version of Maxine, now available in NVIDIA AI Enterprise, allows professionals, teams, creators and others to tap into the power of AI to create high-quality audio and video effects, even using standard microphones and webcams.
Watch Huang's full special address at NVIDIA's SIGGRAPH event site, where there are also details of labs, presentations and more happening throughout the show.
Startup Pens Generative AI Success Story With NVIDIA NeMo
Machine learning helped Waseem Alshikh plow through textbooks in college. Now he’s putting generative AI to work, creating content for hundreds of companies.
Born and raised in Syria, Alshikh spoke no English, but he was fluent in software, a talent that served him well when he arrived at college in Lebanon.
“The first day they gave me a stack of textbooks, each one a thousand pages thick, and all of it in English,” he recalled.
So, he wrote a program — a crude but effective statistical classifier that summarized the books — then he studied the summaries.
From Concept to Company
In 2014, he shared his story with May Habib, an entrepreneur he met while working in Dubai. They agreed to create a startup that could help marketing departments — which are always pressured to do more with less — use machine learning to quickly create copy for their web pages, blogs, ads and more.
“Initially, the tech was not there, until transformer models were announced — that was something we could build on,” said Alshikh, the startup’s CTO.
“We found a few engineers and spent almost six months building our first model, a neural network that barely worked and had about 128 million parameters,” an often-used measure of an AI model’s capability.
Along the way, the young company won some business, changed its name to Writer and connected with NVIDIA.
A Startup Accelerated
“Once we got introduced to NVIDIA NeMo, we were able to build industrial-strength models with three, then 20 and now 40 billion parameters, and we’re still scaling,” he said.
NeMo is an application framework that helps companies curate their training datasets, build and customize large language models (LLMs), and run them in production at scale. Organizations everywhere from Korea to Sweden are using it to customize LLMs for their local languages and industries.
“Before NeMo, it took us four and a half months to build a new billion-parameter model. Now we can do it in 16 days — this is mind blowing,” Alshikh said.
Models Make Opportunities
In the first six months of this year, the startup’s team of fewer than 20 AI engineers used NeMo to develop 10 models, each with 30 billion parameters or more.
That translates into big opportunities. Hundreds of businesses now use Writer’s models that NeMo customized for finance, healthcare, retail and other vertical markets.
The startup’s customer list includes household names like Deloitte, L’Oreal, Intuit, Uber and many Fortune 500 companies.
Writer’s success with NeMo is just the start of the story. Dozens of other companies have already downloaded NeMo.
The software will be available soon for anyone to use. It’s part of NVIDIA AI Enterprise, full-stack software optimized to accelerate generative AI workloads and backed by enterprise-grade support, security and application programming interface stability.
A Trillion API Calls a Month
Some customers run Writer’s models on their own systems or cloud services. Others ask Writer to host the models, or they use Writer’s API.
“Our cloud infrastructure, managed basically by two people, hosts a trillion API calls a month — we’re generating 90,000 words a second,” Alshikh said. “We’re delivering high-quality models that compete with products from companies with larger teams and bigger budgets.”
Writer uses the Triton Inference Server that’s packaged with NeMo to run models in production for its customers. Alshikh reports that Triton, used by many companies running LLMs, enables lower latency and greater throughput than alternative programs.
“This means you can run a service for $20,000, instead of $100,000, so we can invest more in building meaningful features,” he said.
A Wide Horizon
Writer is also a member of NVIDIA Inception, a program that nurtures cutting-edge startups. “Thanks to Inception, we got early access to NeMo and some amazing people who guided us through the process of finding and using the tools we need,” he said.
Now that Writer’s text products are getting traction, Alshikh, who splits his time between homes in Florida and California, is searching the horizon for what’s next. In today’s broad frontier of generative AI, he sees opportunities in images, audio, video, 3D — maybe all of the above.
“We see multimodality as the future,” he said.
Check out this page to get started with NeMo. And learn about the early access program for multimodal NeMo here.
And if you enjoyed this story, let folks on social networks know using the following, a summary suggested by Writer:
“Learn how startup Writer uses NVIDIA NeMo software to generate content for hundreds of companies and rack up impressive revenues with a small staff and budget.”
NVIDIA Makes Extended-Reality Streaming More Scalable, Customizable for Enterprises and Developers
Organizations across industries are using extended reality (XR) to redesign workflows and boost productivity, whether for immersive training or collaborative design reviews.
With the growing use of all-in-one (AIO) headsets, more teams have adopted and integrated XR. While easing XR use, AIO headsets have modest compute and rendering power that can limit the graphics quality of streaming experiences.
NVIDIA is enabling more enterprises and developers to adopt high-quality XR with its CloudXR Suite. Built to greatly simplify streaming, CloudXR enables anyone with an AIO headset or mobile XR device to experience high-fidelity, immersive environments from any location.
CloudXR Suite combines the power of NVIDIA RTX GPUs and NVIDIA RTX Virtual Workstation (vWS) software to stream high-fidelity XR applications to Android and iOS devices. By dynamically adjusting to network conditions, CloudXR maximizes image quality and frame rates to power next-level, wireless augmented-reality and virtual-reality experiences.
With CloudXR, enterprises can gain the flexibility to effectively orchestrate and scale XR workloads, and developers can use the advanced platform to create custom XR products for their users. The suite offers high-quality streaming across both public and private networks.
Ericsson and VMware are among the first companies to use CloudXR.
Taking XR Workflows to the Next Level
CloudXR Suite offers performance that’s comparable to tethered VR experiences.
It comprises three components, including several updates:
- CloudXR Essentials, the suite’s underlying streaming layer, brings new improvements such as 5G L4S optimizations, QoS algorithms and enhanced logging tools. Essentials also includes the SteamVR plug-in, along with sample clients and a new server-side application programming interface.
- CloudXR Server Extensions improves server-side interfaces with a source-code addition to the Monado OpenXR runtime. The new CloudXR Server API contained in CloudXR Essentials and the OpenXR API represent the gateway to scaling XR distribution for orchestration partners.
- CloudXR Client Extensions include, as a first offering, a CloudXR plug-in built for the Unity Editor. This lets developers build custom CloudXR client applications using already-familiar Unity development tools. Plus, Unity app developers can more easily build applications with branded custom interfaces and lobbies before connecting to their CloudXR streaming server using the plug-in.
Teams can tap into the power of NVIDIA RTX GPUs to achieve ultimate graphics performance on mobile devices. Enterprises can scale to data center and edge networks, and stream to concurrent users with NVIDIA RTX vWS software.
In addition, users can stream stunning XR content from any OpenVR or OpenXR application at the edge using high-bandwidth, low-latency 5G signals.
Partners Experience Enterprise-Grade XR Streaming
Organizations across industries use XR streaming to advance their workflows.
To provide optimal streaming performance, NVIDIA is working with leading companies like Ericsson to implement low-latency, low-loss scalable throughput (L4S) in NVIDIA CloudXR. L4S helps reduce lag in interactive, cloud-based video streaming, so CloudXR users will be able to experience photorealistic XR environments on high-bandwidth, low-latency networks.
“At Ericsson, we believe innovations like L4S are fundamental building blocks to enable latency-critical applications,” said Sibel Tombaz, head of product line for 5G Radio Access Network at Ericsson. “As a key part of Ericsson’s Time-Critical Communication capabilities, L4S will significantly improve user experience for use-cases like cloud gaming, and it’s great news that NVIDIA is making L4S a production element of CloudXR. We’re excited to be working with NVIDIA to further enhance the XR experience for enterprises, developers and consumers.”
More professionals can elevate XR streaming from the cloud with VMware Workspace ONE XR Hub, which includes an integration of CloudXR.
Workspace ONE XR Hub enhances user experiences with VR headsets through advanced authentication and customization options. Combined with the streaming capabilities of CloudXR, Workspace ONE XR Hub allows teams across industries to quickly, securely access complex immersive environments using AIO headsets.
“With this new integration, access to high-fidelity immersive experiences is even easier because streaming lets users tap into the power of RTX GPUs from anywhere,” said Matt Coppinger, director of product management for end-user computing at VMware. “Workspace ONE XR Hub and CloudXR will allow our customers to stream rich XR content, and more teams can boost productivity and integrate realistic, virtual experiences into their workflows.”
Availability
CloudXR Suite will be available to download soon, so users can stream a wide range of XR applications over the network without worrying about demanding graphics requirements.
For example, independent software vendors (ISVs) can create a single, high-quality version of their application that’s built to take advantage of powerful GPUs. And with CloudXR streaming, ISVs can target users with mobile XR devices.
Mobile-device manufacturers can also offer their ISV partners and end users access to high-performance GPU acceleration for unparalleled graphics experiences.
In addition, cloud service providers, orchestrators and system integrators can extend their GPU services with interactive graphics to support next-generation XR applications.
Learn more about NVIDIA CloudXR Suite.
Extended Cut: NVIDIA Expands Maxine for Video Editing, Showcases 3D Virtual Conferencing Research
Professionals, teams, creators and others can tap into the power of AI to create high-quality audio and video effects — even using standard microphones and webcams — with the help of NVIDIA Maxine.
The suite of GPU-accelerated software development kits and cloud-native microservices lets users deploy AI features that enhance audio, video and augmented-reality effects for real-time communications services and platforms. Maxine will also expand features for video editing, enabling teams to reach new heights in video communication.
Plus, an NVIDIA Research demo at this week’s SIGGRAPH conference displays how AI can take video conferencing to the next level with 3D features.
NVIDIA Maxine Features Expand to Video Editing
Wireless connectivity has enabled people to join virtual meetings from more locations than ever. Typically, audio and video quality are heavily impacted when a caller is on the move or in a location with poor connectivity.
Advanced, real-time Maxine features — such as Background Noise Removal, Super Resolution and Eye Contact — allow remote users to enhance interpersonal communication experiences.
In addition, Maxine can now be used for video editing. NVIDIA partners are transforming this professional workflow with the same Maxine features that elevate video conferencing. The goal when editing a video, whether a sales pitch or a webinar, is to engage the broadest audience possible. Using Maxine, professionals can tap into AI features that enhance audio and video signals.
With Maxine, a spokesperson can look away from the screen to reference notes or a script while their gaze remains as if looking directly into the camera. Users can also film videos in low resolution and enhance the quality later. Plus, Maxine lets people record videos in several different languages and export the video in English.
Maxine features to be released in early access this year include:
- Interpreter: Translates from simplified Chinese, Russian, French, German and Spanish to English while animating the user’s image to show them speaking English.
- Voice Font: Enables users to apply characteristics of a speaker’s voice and map it to the audio output.
- Audio Super Resolution: Improves audio quality by increasing the temporal resolution of the audio signal and extending bandwidth. It currently supports upsampling from 8,000Hz to 16,000Hz as well as from 16,000Hz to 48,000Hz. This feature is also updated with more than 50% reduction in latency and up to 2x better throughput.
- Maxine Client: Brings the AI capabilities of Maxine’s microservices to video-conferencing sessions on PCs. The application is optimized for low-latency streaming and will use the cloud for all of its GPU compute requirements. Thin Client will be available on Windows this fall, with additional OS support to follow.
Maxine can be deployed in the cloud, on premises or at the edge, meaning quality communication can be accessible from nearly anywhere.
Taking Video Conferencing to New Heights
Many partners and customers are experiencing high-quality video conferencing and editing with Maxine. Two features of Maxine — Eye Contact and Live Portrait — are now available in production releases on the NVIDIA AI Enterprise software platform. Eye Contact simulates direct eye contact with the camera by estimating and aligning the user’s gaze with the camera. And Live Portrait animates a person’s portrait photo through their live video feed.
Software company Descript aims to make video a staple of every communicator’s toolkit, alongside docs and slides. With NVIDIA Maxine, professionals and beginners who use Descript can access AI features that improve their video-content workflows.
“With the NVIDIA Maxine Eye Contact feature, users no longer have to worry about memorizing scripts or doing tedious video retakes,” said Jay LeBoeuf, head of business and corporate development at Descript. “They can maintain a perfect on-screen presence while nailing their script every time.”
Reincubate’s Camo app aims to broaden access to great video by taking advantage of the hardware and devices people already own. It does this by giving users greater control over their image and by implementing a powerful, efficient processing pipeline for video effects and transformation. Using technologies enabled by NVIDIA Maxine, Camo can offer users an easier way to achieve incredible video creation.
“Integrating NVIDIA Maxine into Camo couldn’t have been easier, and it’s enabled us to get high performance from users’ RTX GPUs right out of the box,” said Aidan Fitzpatrick, founder and CEO of Reincubate. “With Maxine, the team’s been able to move faster and with more confidence.”
Quicklink’s Cre8 is a powerful video production platform for creating professional, on-brand productions, virtual and hybrid live events. The user-friendly interface combines an intuitive design with all the tools needed to build, edit and customize a professional-looking production. Cre8 incorporates NVIDIA Maxine technology to maximize productivity and the quality of video productions, offering complete control to the operator.
“Quicklink Cre8 now offers the most advanced video production platform on the planet,” said Richard Rees, CEO of Quicklink. “With NVIDIA Maxine, we were able to add advanced features, including Auto Framing, Video Noise Removal, Noise and Echo Cancellation, and Eye Contact Simulation.”
Los Angeles-based company gemelo.ai provides a platform for creating AI twins that can scale a user’s voice, content and interactions. Using Maxine’s Live Portrait feature, the gemelo.ai team can unlock new opportunities for scaled, personalized content and one-on-one interactions.
“The realism of Live Portrait has been a game-changer, unlocking new realms of potential for our AI twins,” said Paul Jaski, CEO of gemelo.ai. “Our customers can now design and deploy incredibly realistic digital twins with the superpowers of unlimited scalability in content production and interaction across apps, websites and mixed-reality experiences.”
NVIDIA Research Shows How 3D Video Enhances Immersive Communication
In addition to powering the advanced features of Maxine, NVIDIA AI enhances video communication with 3D. NVIDIA Research recently published a paper demonstrating how AI could power a 3D video-conferencing system with minimal capture equipment.
3D telepresence systems are typically expensive, require a large space or production studio, and use high-bandwidth, volumetric video streaming — all of which limits the technology’s accessibility. NVIDIA Research shared a new method, which runs on a novel VisionTransformer-based encoder, that takes 2D video input from a standard webcam and turns it into a 3D video representation. Instead of requiring 3D data to be passed back and forth between the participants in a conference, AI enables bandwidth requirements for the call to stay the same as for a 2D conference.
The technology takes a user’s 2D video and automatically creates a 3D representation called a neural radiance field, or NeRF, using volumetric rendering. As a result, participants can stream 2D videos, like they would for traditional video conferencing, while decoding high-quality 3D representations that can be rendered in real time. And with Maxine’s Live Portrait, users can bring their portraits to life in 3D.
AI-mediated 3D video conferencing could significantly reduce the cost for 3D capture, provide a high-fidelity 3D representation, accommodate photorealistic or stylized avatars, and enable mutual eye contact in video conferencing. Related research projects show how AI can help elevate communications and virtual interactions, as well as inform future NVIDIA technologies for video conferencing.
See the system in action below. SIGGRAPH attendees can visit the Emerging Technologies booth, where groups will be able to simultaneously view the live demo on a 3D display designed by New York-based company Looking Glass.
Availability
Learn more about NVIDIA Maxine, which is now available on NVIDIA AI Enterprise.
And see more of the research behind the 3D video conference project.
Featured image courtesy of NVIDIA Research.