Automatically identify languages in multi-lingual audio using Amazon Transcribe

If you operate in a country with multiple official languages or across multiple regions, your audio files can contain different languages. Participants may speak entirely different languages or switch between them mid-conversation. Consider a customer service call reporting a problem in an area with a substantial multi-lingual population. Although the conversation might begin in one language, the customer could switch to another language to describe the problem, depending on their language comfort or preference. Similarly, the customer care representative might transition between languages while giving operating or troubleshooting instructions.

With a minimum of 3 seconds of audio, Amazon Transcribe can automatically identify and efficiently generate transcripts in the languages spoken in the audio without needing humans to specify the languages. This applies to various use cases such as transcribing customer calls, converting voicemails to text, capturing meeting interactions, tracking user forum communications, or monitoring media content production and localization workflows.

This post walks through the steps for transcribing a multi-language audio file using Amazon Transcribe. We discuss how to make audio files available to Amazon Transcribe and enable transcription of multi-lingual audio files when calling Amazon Transcribe APIs.

Solution overview

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy to add speech-to-text functionality to any application. With Amazon Transcribe, you can ingest audio input, produce clear transcripts that are easy to read and review, improve accuracy with customization, and filter content to protect customer privacy.

The solution also uses Amazon Simple Storage Service (Amazon S3), an object storage service built to store and retrieve any amount of data from anywhere. It offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost. When you store data in Amazon S3, you work with resources known as buckets and objects. A bucket is a container for objects; an object is a file and any metadata that describes the file.

In this post, we walk you through the following steps to implement a multi-lingual audio transcription solution:

  1. Create an S3 bucket.
  2. Upload your audio file to the bucket.
  3. Create the transcription job.
  4. Review the job output.

Prerequisites

For this walkthrough, you need an AWS account with permissions to use Amazon Transcribe and Amazon S3.

Amazon Transcribe provides the option to store transcribed output in either a service-managed or a customer-managed S3 bucket. For this post, we have Amazon Transcribe write the results to a service-managed S3 bucket.

Note that Amazon Transcribe is a Regional service; the Amazon Transcribe API endpoint you call must be in the same Region as the S3 bucket that holds your audio.

Create an S3 bucket to store your audio input files

To create your S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. For Bucket name, enter a globally unique name for the bucket.
  3. For AWS Region, choose the same Region as your Amazon Transcribe API endpoints.
  4. Leave all defaults as is.
  5. Choose Create bucket.
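
If you prefer to script this step, the following minimal boto3 sketch creates the bucket. The bucket name is a placeholder, and in us-east-1 you would omit CreateBucketConfiguration:

import boto3

s3 = boto3.client('s3')

# Bucket names must be globally unique -- this one is a placeholder
s3.create_bucket(
    Bucket='my-transcribe-input-bucket-example',
    # Use the same Region as your Amazon Transcribe API endpoint
    CreateBucketConfiguration={'LocationConstraint': 'us-west-2'},
)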

Upload your audio file to the S3 bucket

Upload your multi-lingual audio file to the S3 bucket in your AWS account. For this exercise, we use a sample multi-lingual audio file that captures a customer support call conducted in English and Spanish.

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket you created previously for storing the input audio files.
  3. Choose Upload.
  4. Choose Add files.
  5. Choose the audio file you want to transcribe from your local computer.
  6. Choose Upload.

Your audio file will shortly be available in the S3 bucket.
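
You can also upload the file with the AWS SDK. A minimal boto3 sketch, using placeholder file and bucket names:

import boto3

s3 = boto3.client('s3')

# Placeholder local file and bucket names -- replace with your own
s3.upload_file(
    Filename='customer-support-call.wav',
    Bucket='my-transcribe-input-bucket-example',
    Key='customer-support-call.wav',
)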

Create the transcription job

With the audio file uploaded, we now create a transcription job.

  1. On the Amazon Transcribe console, choose Transcription jobs in the navigation pane.
  2. Choose Create job.
  3. For Name, enter a unique name for the job.
    This will also be the name of the output transcript file.
  4. For Language settings, select Automatic multiple languages identification.
    This feature enables Amazon Transcribe to automatically identify and transcribe all languages spoken in the audio file.
  5. For Language options for automatic language identification, leave this unselected.
    Amazon Transcribe automatically identifies and transcribes all languages spoken in the audio. To improve transcription accuracy, you can optionally select two or more languages you know were spoken in the audio.
  6. For Model type, only the General model option is available at the time of writing this post.
  7. For Input data, choose Browse S3.
  8. Choose the audio source file we uploaded previously.
  9. For Output data, you can select either Service-managed S3 bucket or Customer specified S3 bucket. For this post, select Service-managed S3 bucket.
  10. Choose Next.
  11. Choose Create job.
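
If you prefer to automate this step, you can create the same job with the AWS SDK. The following minimal boto3 sketch uses placeholder job, bucket, and file names; the IdentifyMultipleLanguages parameter enables the automatic multiple languages identification described above, and omitting OutputBucketName writes the results to a service-managed S3 bucket:

import boto3

transcribe = boto3.client('transcribe')

# Placeholder job name and S3 URI -- replace with your own
transcribe.start_transcription_job(
    TranscriptionJobName='multi-language-demo-job',
    Media={'MediaFileUri': 's3://my-transcribe-input-bucket-example/customer-support-call.wav'},
    # Identify and transcribe all languages spoken in the audio
    IdentifyMultipleLanguages=True,
    # Optionally restrict identification to languages you know are present
    # LanguageOptions=['en-US', 'es-US'],
)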

Review the job output

When the transcription job is complete, open it to review the results.

Scroll down to the Transcription preview section. The audio transcription is displayed on the Text tab. The transcription includes both the English and Spanish portions of the conversation.

You can optionally download a copy of the transcript as a JSON file, which you could use for further post-call analytics.
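
You can also retrieve the transcript programmatically for post-call analytics. The following minimal sketch assumes the job name used earlier; for a service-managed output bucket, the transcript URI is a pre-signed HTTPS URL:

import json
import urllib.request

import boto3

transcribe = boto3.client('transcribe')

# Placeholder job name -- use the name of your completed transcription job
job = transcribe.get_transcription_job(TranscriptionJobName='multi-language-demo-job')
transcript_uri = job['TranscriptionJob']['Transcript']['TranscriptFileUri']

with urllib.request.urlopen(transcript_uri) as response:
    transcript = json.loads(response.read())

# The full transcript text, covering all identified languages
print(transcript['results']['transcripts'][0]['transcript'])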

Clean up

To avoid incurring future charges, empty and delete the S3 bucket you created for storing the input audio file. Make sure you have the files stored elsewhere, because this permanently removes all objects in the bucket. On the Amazon Transcribe console, select and delete the transcription job you created.

Conclusion

In this post, we created an end-to-end workflow to automate identification and transcription of multi-lingual audio files, without writing any code. We used the new functionality in Amazon Transcribe to automatically identify different languages in an audio file and transcribe each language correctly.

For more information, refer to Language identification with batch transcription jobs.


About the Authors

Murtuza Bootwala is a Senior Solutions Architect at AWS with an interest in AI/ML technologies. He enjoys working with customers to help them achieve their business outcomes. Outside of work, he enjoys outdoor activities and spending time with family.

Victor Rojo is passionate about AI/ML and software development. He helped get Amazon Alexa up and running in the US and Mexico. He also brought Amazon Textract to AWS Partners and got AWS Contact Center Intelligence (CCI) off the ground. He’s currently the Global Tech Leader for Conversational AI Partners.

Babu Srinivasan is an AWS Sr. Specialist SA (Language AI Services) based out of Chicago. He focuses on Amazon Transcribe (speech to text), helping our customers use AI services to solve business problems. Outside of work, he enjoys woodworking and performing magic shows.


Translate multiple source language documents to multiple target languages using Amazon Translate

Enterprises need to translate business-critical content such as marketing materials, instruction manuals, and product catalogs across multiple languages to communicate with a global audience of customers, partners, and stakeholders. Identifying the source language in each document before calling a translate job creates complexities and adds another step to your workflow. For example, an international product company with its customer support operations located in its corporate office requires agents to translate emails or documents to support customer requests. Previously, they had to set up workflows to identify the dominant language in each document, group the documents by language, and set up a batch translate job for each source language. Now, Amazon Translate's automatic language detection feature for batch translation jobs allows you to translate a batch of documents in various languages with a single translate job, removing the need to orchestrate a workflow for dominant language identification and grouping. Amazon Translate also supports up to 10 target languages per job, so a single translation job can translate documents into multiple target languages. This eliminates the need to create separate batch jobs for individual target languages, and you can now create documentation in multiple languages with a single API call.

In this post, we demonstrate how to translate documents into multiple target languages in a batch translation job.

Solution overview

Automatic detection of source language for batch translate jobs allows you to translate documents written in various supported languages in a single operation. You can also provide up to 10 languages as targets. The job processes each document, identifies the dominant source language, and translates it into each target language. Amazon Translate uses Amazon Comprehend to determine the dominant language in each of your source documents, and uses it as the source language.

In the following sections, we demonstrate how to create a batch translation job via the AWS Management Console or the AWS SDK.

Create a batch translation job via console

In this example, we configure Amazon Translate batch translation to automatically detect the source language and translate it to English and Hindi, using the input and output Amazon Simple Storage Service (Amazon S3) bucket locations provided.

create translation job

Next, we create an AWS Identity and Access Management (IAM) role that gets provisioned as part of the configuration. The role is given access to the input and output S3 buckets.

After the job is created, you can monitor the progress of the batch translation job in the Translation jobs section.

translation jobs section

When the translation job is complete, you can navigate to the output S3 bucket location and observe that the documents have been translated to their target language. Our input consisted of two files, sample-doc.txt and sample-doc-2.txt, in two different languages. Each document was translated into two target languages, for a total of four documents.

output S3 bucket

Create a batch translation job via the AWS SDK

The following Python Boto3 code uses the batch translation call to translate documents in your source S3 bucket. Specify the following parameters:

  • InputDataConfig – The S3 location of your input documents
  • OutputDataConfig – The S3 location for your translated output documents
  • DataAccessRoleArn – The ARN of an IAM role that gives Amazon Translate permission to access your input and output S3 buckets
  • SourceLanguageCode – Use auto to detect the dominant source language automatically
  • TargetLanguageCodes – Up to 10 target language codes

import boto3

client = boto3.client('translate')


def lambda_handler(event, context):
    # Start an asynchronous batch translation job with automatic source
    # language detection and two target languages (English and Hindi)
    response = client.start_text_translation_job(
        JobName='auto-translate-multi-language-sdk',
        InputDataConfig={
            'S3Uri': 's3://<<REPLACE-WITH-YOUR-INPUT-BUCKET>>/input-sdk',
            'ContentType': 'text/plain'
        },
        OutputDataConfig={
            'S3Uri': 's3://<<REPLACE-WITH-YOUR-OUTPUT-BUCKET>>/output-sdk',
        },
        DataAccessRoleArn='<<REPLACE-WITH-THE-IAM-ROLE-ARN>>',
        SourceLanguageCode='auto',
        TargetLanguageCodes=[
            'en', 'hi'
        ]
    )

    # Return the job metadata (including the JobId) to the caller
    return response
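
The call returns the job details, including a JobId. You can poll the job status with the describe_text_translation_job call, as in the following minimal sketch (the job ID shown is a placeholder):

import time

import boto3

client = boto3.client('translate')

# Placeholder job ID -- use the JobId returned by start_text_translation_job
job_id = '<<REPLACE-WITH-YOUR-JOB-ID>>'

while True:
    job = client.describe_text_translation_job(JobId=job_id)
    status = job['TextTranslationJobProperties']['JobStatus']
    if status in ('COMPLETED', 'COMPLETED_WITH_ERROR', 'FAILED', 'STOPPED'):
        break
    time.sleep(30)

print(f'Batch translation job finished with status: {status}')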

Clean up

To clean up after using this solution, complete the following steps:

  1. Delete the S3 buckets that you created.
  2. Delete IAM roles that you set up.
  3. Delete any other resources that you set up for this post.

Conclusion

With today's need to reach a global audience with limited resources, Amazon Translate helps you simplify your multi-language processing workflows. With automatic detection of the dominant language in your source documents for batch translation jobs, and translation to up to 10 target languages, you can focus on your business logic rather than the operational burden of sorting documents and managing multiple batch translation jobs.

We strive to add features to our service that make it easier for our customers to innovate. Try this solution and let us know how it helped simplify your document processing workloads.


About the authors

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Sid Padgaonkar is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him playing squash and exploring the food scene in the Pacific NW.


Who Said What? Recorder’s On-device Solution for Labeling Speakers

In 2019 we launched Recorder, an audio recording app for Pixel phones that helps users create, manage, and edit audio recordings. It leverages recent developments in on-device machine learning to transcribe speech, recognize audio events, suggest tags for titles, and help users navigate transcripts.

Nonetheless, some Recorder users found it difficult to navigate long recordings that have multiple speakers because it’s not clear who said what. During the Made By Google event this year, we announced the “speaker labels” feature for the Recorder app. This opt-in feature annotates a recording transcript with unique and anonymous labels for each speaker (e.g., “Speaker 1”, “Speaker 2”, etc.) in real time during the recording. It significantly improves the readability and usability of the recording transcripts. This feature is powered by Google’s new speaker diarization system named Turn-to-Diarize, which was first presented at ICASSP 2022.

Left: Recorder transcript without speaker labels. Right: Recorder transcript with speaker labels.

System Architecture

Our speaker diarization system leverages several highly optimized machine learning models and algorithms to allow diarizing hours of audio in a real-time streaming fashion with limited computational resources on mobile devices. The system mainly consists of three components: a speaker turn detection model that detects a change of speaker in the input speech, a speaker encoder model that extracts voice characteristics from each speaker turn, and a multi-stage clustering algorithm that annotates speaker labels to each speaker turn in a highly efficient way. All components run fully on the device.

Architecture of the Turn-to-Diarize system.

Detecting Speaker Turns

The first component of our system is a speaker turn detection model based on a Transformer Transducer (T-T), which converts the acoustic features into text transcripts augmented with a special token <st> representing a speaker turn. Unlike preceding customized systems that use role-specific tokens (e.g., <doctor> and <patient>) for conversations, this model is more generic and can be trained on and deployed to various application domains.

In most applications, the output of a diarization system is not shown directly to users but is combined with a separate automatic speech recognition (ASR) system that is trained to have lower word error rates. Therefore, for the diarization system we are relatively more tolerant of word token errors than of errors on the <st> token. Based on this intuition, we propose a new token-level loss function that allows us to train a small speaker turn detection model with high accuracy on predicted <st> tokens. Combined with edit-based minimum Bayes risk (EMBR) training, this new loss function significantly improved the interval-based F1 score on seven evaluation datasets.

Extracting Voice Characteristics

Once the audio recording has been segmented into homogeneous speaker turns, we use a speaker encoder model to extract an embedding vector (i.e., d-vector) to represent the voice characteristics of each speaker turn. This approach has several advantages over prior work that extracts embedding vectors from small fixed-length segments. First, it avoids extracting an embedding from a segment containing speech from multiple speakers. At the same time, each embedding covers a relatively large time range that contains sufficient signals from the speaker. It also reduces the total number of embeddings to be clustered, thus making the clustering step less expensive. These embeddings are processed entirely on-device until speaker labeling of the transcript is completed, and then deleted.
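
To make the idea concrete, the following sketch (not the production model) pools hypothetical frame-level embeddings from one speaker turn into a single L2-normalized d-vector, so that the similarity between two turns reduces to a dot product:

import numpy as np

def turn_dvector(frame_embeddings: np.ndarray) -> np.ndarray:
    """Pool frame-level embeddings of one speaker turn into a single d-vector.

    frame_embeddings has shape (num_frames, embedding_dim).
    """
    pooled = frame_embeddings.mean(axis=0)
    # L2-normalize so that cosine similarity becomes a simple dot product
    return pooled / np.linalg.norm(pooled)

# Random stand-in data: two turns, each with 50 frames of 256-dimensional embeddings
rng = np.random.default_rng(0)
turn_a = turn_dvector(rng.normal(size=(50, 256)))
turn_b = turn_dvector(rng.normal(size=(50, 256)))
print(float(turn_a @ turn_b))  # similarity of the two turns' voice characteristics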

Multi-Stage Clustering

After the audio recording is represented by a sequence of embedding vectors, the last step is to cluster these embedding vectors, and assign a speaker label to each. However, since audio recordings from the Recorder app can be as short as a few seconds, or as long as up to 18 hours, it is critical for the clustering algorithm to handle sequences of drastically different lengths.

For this we propose a multi-stage clustering strategy to leverage the benefits of different clustering algorithms. First, we use the speaker turn detection outputs to determine whether there are at least two different speakers in the recording. For short sequences, we use agglomerative hierarchical clustering (AHC) as the fallback algorithm. For medium-length sequences, we use spectral clustering as our main algorithm, and use the eigen-gap criterion for accurate speaker count estimation. For long sequences, we reduce computational cost by using AHC to pre-cluster the sequence before feeding it to the main algorithm. During the streaming, we keep a dynamic cache of previous AHC cluster centroids that can be reused for future clustering calls. This mechanism allows us to enforce an upper bound on the entire system with constant time and space complexity.
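
The sketch below illustrates the length-dependent dispatch described above. The thresholds, distance settings, and speaker-count heuristic are illustrative assumptions rather than the production values, and the speaker-turn embeddings are assumed to be unit-normalized:

import numpy as np
from sklearn.cluster import AgglomerativeClustering, SpectralClustering

# Illustrative thresholds -- not the values used in the production system
SHORT_MAX = 20    # use AHC alone below this many speaker turns
LONG_MIN = 2000   # pre-cluster with AHC above this many speaker turns


def estimate_num_speakers(affinity: np.ndarray, max_speakers: int = 8) -> int:
    """Simple eigen-gap heuristic on the affinity matrix."""
    eigvals = np.linalg.eigvalsh(affinity)[::-1]  # descending order
    gaps = eigvals[:max_speakers - 1] - eigvals[1:max_speakers]
    return int(np.argmax(gaps)) + 1


def cluster_turns(embeddings: np.ndarray) -> np.ndarray:
    """Assign a speaker label to each (unit-normalized) speaker-turn embedding."""
    n = len(embeddings)
    if n <= SHORT_MAX:
        # Short recordings: agglomerative hierarchical clustering (AHC) only
        return AgglomerativeClustering(
            n_clusters=None, distance_threshold=0.5).fit_predict(embeddings)

    pre = None
    if n > LONG_MIN:
        # Long recordings: AHC pre-clustering caps the cost of the main step
        pre = AgglomerativeClustering(n_clusters=LONG_MIN).fit_predict(embeddings)
        embeddings = np.stack(
            [embeddings[pre == c].mean(axis=0) for c in range(LONG_MIN)])

    # Medium-length (or pre-clustered) sequences: spectral clustering
    affinity = np.clip(embeddings @ embeddings.T, 0.0, None)  # cosine similarity
    k = estimate_num_speakers(affinity)
    labels = SpectralClustering(
        n_clusters=k, affinity='precomputed').fit_predict(affinity)
    # Map centroid labels back to the original turns if we pre-clustered
    return labels if pre is None else labels[pre]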

This multi-stage clustering strategy is a critical optimization for on-device applications where the budget for CPU, memory, and battery is very small, and allows the system to run in a low power mode even after diarizing hours of audio. As a tradeoff between quality and efficiency, the upper bound of the computational cost can be flexibly configured for devices with different computational resources.

Diagram of the multi-stage clustering strategy.

Correction and Customization

In our real-time streaming speaker diarization system, as the model consumes more audio input, it accumulates confidence on predicted speaker labels, and may occasionally make corrections to previously predicted low-confidence speaker labels. The Recorder app automatically updates the speaker labels on the screen during recording to reflect the latest and most accurate predictions.

At the same time, the Recorder app’s UI allows the user to rename the anonymous speaker labels (e.g., “Speaker 2”) to customized labels (e.g., “car dealer”) for better readability and easier memorization for the user within each recording.

Recorder allows the user to rename the speaker labels for better readability.

Future Work

Currently, our diarization system mostly runs on the CPU block of Google Tensor, Google’s custom-built chip that powers more recent Pixel phones. We are working on delegating more computations to the TPU block, which will further reduce the overall power consumption of the diarization system. Another future work direction is to leverage multilingual capabilities of speaker encoder and speech recognition models to expand this feature to more languages.

Acknowledgments

The work described in this post represents joint efforts from multiple teams within Google. Contributors include Quan Wang, Yiling Huang, Evan Clark, Qi Cao, Han Lu, Guanlong Zhao, Wei Xia, Hasim Sak, Alvin Zhou, Jason Pelecanos, Luiza Timariu, Allen Su, Fan Zhang, Hugh Love, Kristi Bradford, Vincent Peng, Raff Tsai, Richard Chou, Yitong Lin, Ann Lu, Kelly Tsai, Hannah Bowman, Tracy Wu, Taral Joglekar, Dharmesh Mokani, Ajay Dudani, Ignacio Lopez Moreno, Diego Melendo Casado, Nino Tasca, Alex Gruenstein.


Introducing Amazon SageMaker Data Wrangler’s new embedded visualizations

Manually inspecting data quality and cleaning data is a painful and time-consuming process that can take a huge chunk of a data scientist’s time on a project. According to a 2020 survey of data scientists conducted by Anaconda, data scientists spend approximately 66% of their time on data preparation and analysis tasks, including loading (19%), cleaning (26%), and visualizing data (21%). Amazon SageMaker offers a range of data preparation tools to meet different customer needs and preferences. For users who prefer a GUI-based interactive interface, SageMaker Data Wrangler offers 300+ built-in visualizations, analyses, and transformations to efficiently process data backed by Spark without writing a single line of code.

Data visualization in machine learning (ML) is an iterative process that requires continuous visualization of the dataset for discovery, investigation, and validation. Putting data into perspective entails examining each column to identify possible data errors, missing values, wrong data types, misleading or incorrect data, outliers, and more.

In this post, we’ll show you how Amazon SageMaker Data Wrangler automatically generates key visualizations of data distribution, detects data quality issues, and surfaces data insights such as outliers for each feature without writing a single line of code. It helps improve the data grid experience with automatic quality warnings (for example, missing values or invalid values). The automatically-generated visualizations are also interactive. For example, you can show a tabulation of the top five most frequent items ordered by percent, and hover over the bar to switch between count and percentage.

Prerequisites

Amazon SageMaker Data Wrangler is a SageMaker feature available within SageMaker Studio. You can follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS Identity and Access Management (IAM) Identity Center (successor to AWS Single Sign-On) for authentication (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

Solution Walkthrough

Start your SageMaker Studio environment and create a new Data Wrangler flow. You can either import your own dataset or use a sample dataset (Titanic), as seen in the following image. Both nodes (the source node and the data type node) are clickable; when you double-click one of them, Data Wrangler displays the table.

In our case, let's right-click the Data types icon and choose Add a transform:

You should now see visualizations on top of each column. Allow some time for the charts to load; the latency depends on the size of the dataset (for the Titanic dataset, it should take 1–2 seconds on the default instance).

Hover over the horizontal bar at the top of each column to see a tooltip. Now that the charts have loaded, you can see the data distribution, invalid values, and missing values. Outliers and missing values are characteristics of erroneous data, and it's critical to identify them because they could affect your results; if your data comes from an unrepresentative sample, your findings may not generalize to situations outside of your study. In the charts at the bottom, valid values are represented in white, invalid values in blue, and missing values in purple. You can also see outliers depicted as blue dots to the left or right of a chart.

All the visualizations come in the form of histograms. For non-categorical data, a bucket set is defined for each bin; for categorical data, each unique value is treated as a bin. On top of the histogram, a bar chart shows the invalid and missing values. We can view the ratio of valid values for Numeric, Categorical, Binary, Text, and Datetime types, the ratio of missing values based on the total null and empty cells, and finally the ratio of invalid values. Let's look at some examples using Data Wrangler's pre-loaded sample Titanic dataset.

Example 1 – We can look at the 20% missing values for the AGE feature/column. It's crucial to deal with missing data, either by removing the affected rows or by imputing the missing values with an estimate.


You can process missing values using the Handle missing values transform group. Use the Impute missing transform to generate imputed values where missing values were found in the input column. The configuration depends on your data type.

In this example, the AGE column has a numeric data type. For the imputing strategy, we can choose to impute either the mean or the approximate median over the values that are present in the dataset.

Now that we have added the transformation, we can see that the AGE column no longer has missing values.
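
If you prefer to work in code rather than the Data Wrangler UI, a minimal pandas sketch of the same mean imputation might look like the following. The file path and lowercase column name are assumptions; your copy of the dataset may differ:

import pandas as pd

# Hypothetical local copy of the Titanic dataset
df = pd.read_csv('titanic.csv')

# Impute missing ages with the column mean (use median() for a more robust choice)
df['age'] = df['age'].fillna(df['age'].mean())

print(df['age'].isna().sum())  # 0 missing values remain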

Example 2 – We can look at the 27% invalid values for the TICKET feature/column, which is of the STRING type. Invalid data can produce biased estimates, which can reduce a model's accuracy and result in false conclusions. Let's explore some transforms that we can use to handle the invalid data in the TICKET column.

Looking at the screenshot, we see that some of the inputs contain letters before the numerals, such as “PC 17318”, while others are just numerals, such as “11769”.

We can choose to apply a transform that searches for specific patterns within strings, such as “PC”, and replaces them. Next, we can cast our string column to a new type, such as Long, for ease of use.

This still leaves us with 19% missing values in the TICKET feature. Similar to example 1, we can now impute the missing values using the mean or approximate median. The TICKET feature should no longer have invalid or missing values, as shown in the following image.
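
A rough pandas equivalent of this sequence of transforms is sketched below (again, the file path and lowercase column name are assumptions):

import pandas as pd

df = pd.read_csv('titanic.csv')  # hypothetical local copy of the dataset

# Remove alphabetic prefixes such as "PC " so only the numeric part remains
df['ticket'] = df['ticket'].astype(str).str.replace(r'[^0-9]', '', regex=True)

# Cast to numeric; values that cannot be parsed become missing (NaN)
df['ticket'] = pd.to_numeric(df['ticket'], errors='coerce')

# Impute the remaining missing values, as in the previous example
df['ticket'] = df['ticket'].fillna(df['ticket'].mean())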

Clean Up

To make sure that you don’t incur charges after following this tutorial, make sure that you shut down the Data Wrangler app.

Conclusion 

In this post, we presented the new Amazon SageMaker Data Wrangler widget, which helps remove undifferentiated heavy lifting for end users during data preparation by automatically surfacing visualizations and data profiling insights for each feature. This widget makes it easy to visualize data (for example, categorical or non-categorical histograms), detect data quality issues (for example, missing and invalid values), and surface data insights (for example, outliers and top N items).

You can start using this capability today in all of the regions where SageMaker Studio is available. Give it a try, and let us know what you think. We’re always looking forward to your feedback, either through your usual AWS support contacts, or on the AWS Forum for SageMaker.


About the Authors

Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS Enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and environmental sustainability.

Parth Patel is a Solutions Architect at AWS in the San Francisco Bay Area. Parth guides customers to accelerate their journey to the cloud and helps them adopt the AWS Cloud successfully. He focuses on ML and application modernization.


2023 Predictions: AI That Bends Reality, Unwinds the Golden Screw and Self-Replicates

After three years of uncertainty caused by the pandemic and its post-lockdown hangover, enterprises in 2023 — even with recession looming and uncertainty abounding — face the same imperatives as before: lead, innovate and problem solve.

AI is becoming the common thread in accomplishing these goals. On average, 54% of enterprise AI projects made it from pilot to production, according to a recent Gartner survey of nearly 700 enterprises in the U.S., U.K. and Germany. A whopping 80% of executives in the survey said automation can be applied to any business decision, and that they’re shifting away from tactical to strategic uses of AI.

The mantra for 2023? Do more with less. Some of NVIDIA’s experts in AI predict businesses will prioritize scaling their AI projects amid layoffs and skilled worker shortages by using cloud-based integrated software and hardware offerings that can be purchased and customized to any enterprise, application or budget.

Cost-effective AI development also is a recurring theme among our expert predictions for 2023. With Moore’s law running up against the laws of physics, installing on-premises compute power is getting more expensive and less energy efficient. And the Golden Screw search for critical components is speeding the shift to the cloud for developing AI applications as well as for finding data-driven solutions to supply chain issues.

Here’s what our experts have to say about the year ahead in AI:

Anima Anandkumar headshot

ANIMA ANANDKUMAR
Director of ML Research, and Bren Professor at Caltech

Digital Twins Get Physical: We will see large-scale digital twins of physical processes that are complex and multi-scale, such as weather and climate models, seismic phenomena and material properties. This will accelerate current scientific simulations as much as a million-x, and enable new scientific insights and discoveries.

Generalist AI Agents: AI agents will solve open-ended tasks with natural language instructions and large-scale reinforcement learning, while harnessing foundation models — those large AI models trained on a vast quantity of unlabeled data at scale — to enable agents that can parse any type of request and adapt to new types of questions over time.

Manuvir Das headshot

MANUVIR DAS
Vice President, Enterprise Computing

Software Advances End AI Silos: Enterprises have long had to choose between cloud computing and hybrid architectures for AI research and development — a practice that can stifle developer productivity and slow innovation. In 2023, software will enable businesses to unify AI pipelines across all infrastructure types and deliver a single, connected experience for AI practitioners. This will allow enterprises to balance costs against strategic objectives, regardless of project size or complexity, and provide access to virtually unlimited capacity for flexible development.

Generative AI Transforms Enterprise Applications: The hype about generative AI becomes reality in 2023. That’s because the foundations for true generative AI are finally in place, with software that can transform large language models and recommender systems into production applications that go beyond images to intelligently answer questions, create content and even spark discoveries. This new creative era will fuel massive advances in personalized customer service, drive new business models and pave the way for breakthroughs in healthcare.

Kimberly Powell headshot

KIMBERLY POWELL
Vice President, Healthcare

Biology Becomes Information Science: Breakthroughs in large language models and the fortunate ability to describe biology in a sequence of characters are giving researchers the ability to train a new class of AI models for chemistry and biology. The capabilities of these new AI models give drug discovery teams the ability to generate, represent and predict the properties and interactions of molecules and proteins — all in silicon. This will accelerate our ability to explore the essentially infinite space of potential therapies.

Surgery 4.0 Is Here: Flight simulators serve to train pilots and research new aircraft control. The same is now true for surgeons and robotic surgery device makers. Digital twins that can simulate at every scale, from the operating room environment to the medical robot and patient anatomy, are breaking new ground in personalized surgical rehearsals and designing AI-driven human and machine interactions. Long residencies won’t be the only way to produce an experienced surgeon. Many will become expert operators when they perform their first robot-assisted surgery on a real patient.

DANNY SHAPIRO
Vice President, Automotive

Training Autonomous Vehicles in the Metaverse: The more than 250 auto and truck makers, startups, transportation and mobility-as-a-service providers developing autonomous vehicles are tackling one of the most complex AI challenges of our time. It’s simply not possible to encounter every scenario they must be able to handle by testing on the road, so much of the industry in 2023 will turn to the virtual world to help.

On-road data collection will be supplemented by virtual fleets that generate data for training and testing new features before deployment. High-fidelity simulation will run autonomous vehicles through a virtually infinite range of scenarios and environments. We’ll also see the continued deployment of digital twins for vehicle production to improve manufacturing efficiencies, streamline operations and improve worker ergonomics and safety.

Moving to the Cloud: 2023 will bring more software-as-a-service (SaaS) and infrastructure-as-a-service offerings to the transportation industry. Developers will be able to access a comprehensive suite of cloud services to design, deploy and experience metaverse applications anywhere. Teams will design and collaborate on 3D workflows — such as AV development simulation, in-vehicle experiences, cloud gaming and even car configurators delivered via the web or in showrooms.

Your In-Vehicle Concierge: Advances in conversational AI, natural language processing, gesture detection and avatar animation are making their way to next-generation vehicles in the form of digital assistants. This AI concierge can make reservations, access vehicle controls and provide alerts using natural language understanding. Using interior cameras, deep neural networks and multimodal interaction, vehicles will be able to ensure that driver attention is on the road and ensure no passenger or pet is left behind when the journey is complete.

Rev Lebaredian headshot

REV LEBAREDIAN
Vice President, Omniverse and Simulation Technology

The Metaverse Universal Translator: Just as HTML is the standard language of the 2D web, Universal Scene Description is set to become the most powerful, extensible, open language for the 3D web. As the 3D standard for describing virtual worlds in the metaverse, USD will allow enterprises and even consumers to move between different 3D worlds using various tools, viewers and browsers in the most seamless and consistent fashion.

Bending Reality With Digital Twins: A new class of true-to-reality digital twins of goods, services and locations is set to offer greater windfalls than their real-world counterparts. Imagine selling many virtual pairs of sneakers in partnership with a gaming company that are simply undergoing design testing — long before sending the pattern to manufacturing. Companies also stand to benefit by saving on waste, increasing operational efficiencies and boosting accuracy.

Ronnie Vasishta

RONNIE VASISHTA
Senior Vice President, Telecoms

Cutting the Cord on AR/VR Over 5G Networks: While many businesses will move to the cloud for hardware and software development, edge design and collaboration also will grow as 5G networks become more fully deployed around the world. Automotive designers, for instance, can don augmented reality headsets and stream the same content they see over wireless networks to colleagues around the world, speeding collaborative changes and developing innovative solutions at record speeds. 5G also will lead to accelerated deployments of connected robots across industries — used for restocking store shelves, cleaning floors, delivering pizzas and picking and packing goods in factories.

RAN in the Cloud: Network operators around the world are rolling out software-defined virtual radio access network 5G to save time and money as they seek faster returns on their multibillion-dollar investments. Now, they’re shifting away from bespoke L1 accelerators to 100% software-defined and full-stack, 5G-baseband acceleration that includes L2, RIC, Beamforming and FH offerings. This shift will lead to an increase in the utilization of RAN systems by enabling multi-tenancy between RAN and AI workloads.

BOB PETTE
Vice President, Professional Visualization 

An Industrial Revolution via Simulation: Everything built in the physical world will first be simulated in a virtual world that obeys the laws of physics. These digital twins — including large-scale environments such as factories, cities and even the entire planet — and the industrial metaverse are set to become critical components of digital transformation initiatives. Examples already abound: Siemens is taking industrial automation to a new level. BMW is simulating entire factory floors to optimally plan manufacturing processes. Lockheed Martin is simulating the behavior of forest fires to anticipate where and when to deploy resources. DNEG, SONY Pictures, WPP and others are boosting productivity through globally distributed art departments that enable creators, artists and designers to iterate on scenes virtually in real time.

Rethinking of Enterprise IT Architecture: Just as many businesses scrambled to adapt their culture and technologies to meet the challenges of hybrid work, the new year will bring a re-architecting of many companies’ entire IT infrastructure. Companies will seek powerful client devices capable of tackling the ever-increasing demands of applications and complex datasets. And they’ll embrace flexibility, moving to burst to the cloud for exponential scaling. The adoption of distributed computing software platforms will enable a globally dispersed workforce to collaborate and stay productive under the most disparate working environments.

Similarly, complex AI model development and training will require powerful compute infrastructure in the data center and the desktop. Businesses will look at curated AI software stacks for different industrial use cases to make it easy for them to bring AI into their workflows and deliver higher quality products and services to customers faster.

Azita Martin

AZITA MARTIN
Vice President, AI for Retail and Consumer Products Group

Tackling Shrinkage: Brick-and-mortar retailers perennially struggle with a commonplace problem: shrinkage, the industry parlance for theft. As more and more adopt AI-based services for contactless checkout, they’ll seek sophisticated software that combines computer vision with store analytics data to make sure what a shopper rings up is actually the item being purchased. The adoption of smart self-tracking technology will aid in the development of fully automated store experiences and help solve for labor shortages and lost income.

AI to Optimize Supply Chains: Even the most sophisticated retailers and e-commerce companies had trouble the past two years balancing supply with demand. Consumers embraced home shopping during the pandemic and then flocked back into brick-and-mortar stores after lockdowns were lifted. After inflation hit, they changed their buying habits once again, giving supply chain managers fits. AI will enable more frequent and more accurate forecasting, ensuring the right product is at the right store at the right time. Also, retailers will embrace route optimization software and simulation technology to provide a more holistic view of opportunities and pitfalls.

Malcolm DeMayo

MALCOLM DEMAYO
Vice President, Financial Services

Better Risk Management: Firms will look for opportunities like accelerated compute to drive efficiencies. The simulation techniques used to value risk in derivatives trading are computationally intensive and typically consume large swaths of data center space, power and cooling. What runs all night on traditional compute will run over a lunch break or faster on accelerated compute. A real-time value of sensitivities will enable firms to better manage risk and improve the value they deliver to their investors.

Cloud-First for Financial Services: Banks have a new imperative: get agile fast. Facing increasing competition from non-traditional financial institutions, changing customer expectations rising from their experiences in other industries and saddled with legacy infrastructure, banks and other institutions will embrace a cloud-first AI approach. But as a highly regulated industry that requires operational resiliency, an industry term that means your systems can absorb and survive shocks (like a pandemic), banks will look for open, portable, hardened, hybrid solutions. As a result, banks are obligated to purchase support agreements when available.

Charlie Boyle headshot

CHARLIE BOYLE
Vice President, DGX systems

AI Becomes Cost-Effective With Energy-Efficient Computing: In 2023, inefficient, x86-based legacy computing architectures that can’t support parallel processing will give way to accelerated computing solutions that deliver the computational performance, scale and efficiency needed to build language models, recommenders and more.

Amidst economic headwinds, enterprises will seek out AI solutions that can deliver on objectives, while streamlining IT costs and boosting efficiency. New platforms that use software to integrate workflows across infrastructure will deliver computing performance breakthroughs — with lower total cost of ownership, reduced carbon footprint and faster return on investment on transformative AI projects — displacing more wasteful, older architectures.

DAVID REBER
Chief Security Officer

Data Scientists Are Your New Cyber Asset: Traditional cyber professionals can no longer effectively defend against the most sophisticated threats because the speed and complexity of attacks and defense have effectively exceeded human capacities. Data scientists and other human analysts will use AI to look at all of the data objectively and discover threats. Breaches are going to happen, so data science techniques using AI and humans will help find the needle in the haystack and respond quickly.

AI Cybersecurity Gets Customized: Just like recommender systems serve every consumer on the planet, AI cybersecurity systems will accommodate every business. Tailored solutions will become the No. 1 need for enterprises’ security operations centers as identity-based attacks increase. Cybersecurity is everyone’s problem, so we’ll see more transparency and sharing of various types of cybersecurity architectures. Democratizing AI enables everyone to contribute to the solution. As a result, the collective defense of the ecosystem will move faster to counter threats.

Kari Briski headshot

KARI BRISKI
Vice President, AI and HPC Software

The Rise of LLM Applications: Research on large language models will lead to new types of practical applications that can transform languages, text and even images into useful insights that can be used across a multitude of diverse organizations by everyone from business executives to fine artists. We’ll also see rapid growth in demand for the ability to customize models so that LLM expertise spreads to languages and dialects far beyond English, as well as across business domains, from generating catalog descriptions to summarizing medical notes.

Unlabeled Data Finds Its Purpose: Large language models and structured data will also extend to the reams of photos, audio recordings, tweets and more to find hidden patterns and clues to support healthcare breakthroughs, advancements in science, better customer engagements and even major advances in self-driving transportation. In 2023, adding all this unstructured data to the mix will help develop neural networks that can, for instance, generate synthetic profiles to mimic the health records they’ve learned from. This type of unsupervised machine learning is set to become as important as supervised machine learning.

The New Call Center: Keep an eye on the call center in 2023, where adoption of more and more easily implemented speech AI workflows will provide business flexibility at every step of the customer interaction pipeline — from modifying model architectures to fine-tuning models on proprietary data and customizing pipelines. As the accessibility of speech AI workflows broadens, we’ll see a widening of enterprise adoption and giant increase in call center productivity by speeding time to resolution. AI will help agents pull the right information out of a massive knowledge base at the right time, minimizing wait times for customers.

Kevin Deierling

KEVIN DEIERLING
Vice President, Networking

Moore’s Law on Life Support: As CPU design runs up against the laws of physics and struggles to keep up with Moore’s law — the postulation that roughly every two years the number of transistors on microchips would double and create faster, more efficient processing — enterprises increasingly will turn to accelerated computing. They’ll use custom combinations of CPUs, GPUs, DPUs and more in scalable data centers to innovate faster while becoming more cloud oriented and energy efficient.

The Network as the New Computing Platform: Just as personal computers combined software, hardware and storage into productivity-generating tools for everyone, the cloud is fast becoming the new computing tool for AI and the network is what enables the cloud. Enterprises will use third-party software, or bring their own, to develop AI applications and services that run both on-prem and in the cloud. They’ll use cloud services operators to purchase the capacity they need when they need it, working across CPUs, GPUs, DPUs and intelligent switches to optimize compute, storage and the network for their different workloads. What’s more, with zero-trust security being rapidly adopted by cloud service providers, the cloud will deliver computing as secure as on-prem solutions.

DEEPU TALLA
Vice President, Embedded and Edge Computing

Robots Get a Million Lives: More robots will be trained in virtual worlds as photorealistic rendering and accurate physics modeling combine with the ability to simulate in parallel millions of instances of a robot on GPUs in the cloud. Generative AI techniques will make it easier to create highly realistic 3D simulation scenarios and further accelerate the adoption of simulation and synthetic data for developing more capable robots.

Expanding the Horizon: Most robots operate in constrained environments where there is limited to no human activity. Advancements in edge computing and AI will enable robots to have multi-modal perception for better semantic understanding of their environment. This will drive increased adoption of robots operating in brownfield facilities and public spaces such as retail stores, hospitals and hotels.

Marc Spieler

MARC SPIELER
Senior Director, Energy

AI-Powered Energy Grid: As the grid becomes more complex due to the unprecedented rate of distributed energy resources being added, electric utility companies will require edge AI to improve operational efficiency, enhance functional safety, increase accuracy of load and demand forecasting, and accelerate the connection time of renewable energy, like solar and wind. AI at the edge will increase grid resiliency, while reducing energy waste and cost.

More Accurate Extreme Weather Forecasting: A combination of AI and physics can help better predict the world’s atmosphere using a technique called Fourier Neural Operator. The FourCastNet system is able to predict a precise path of a hurricane and can also make weather predictions in advance and provide real-time updates as climate conditions change. Using this information will allow energy companies to better plan for renewable energy expenditures, predict generation capacity and prepare for severe weather events.


Start your successful journey with time series forecasting with Amazon Forecast

Organizations of all sizes are striving to grow their business, improve efficiency, and serve their customers better than ever before. Even though the future is uncertain, a data-driven, science-based approach can help anticipate what lies ahead to successfully navigate through a sea of choices.

Every industry uses time series forecasting to address a variety of planning needs.

In this post, we outline five best practices to get started with Amazon Forecast, and apply the power of highly-accurate machine learning (ML) forecasting to your business.

Why Amazon Forecast

AWS offers a fully managed time series forecasting service called Amazon Forecast that allows you to generate and maintain ongoing automated time series forecasts without requiring ML expertise. In addition, you can build and deploy repeatable forecasting operations without the need to write code, build ML models, or manage infrastructure.

The capabilities of Forecast allow it to serve a wide range of customer roles, from analysts and supply chain managers to developers and ML experts. There are several reasons why customers favor Forecast: it offers high accuracy, repeatable results, and the ability to self-serve without waiting on specialized technical resource availability. Forecast is also selected by data science experts because it provides highly accurate results, based on an ensemble of self-tuned models, and the flexibility to experiment quickly without having to deploy or manage clusters of any particular size. Its ML models also make it easier to support forecasts for a large number of items, and can generate accurate forecasts for cold-start items with no history.

Five best practices when getting started with Forecast

Forecast provides high accuracy and quick time-to-market for developers and data scientists. Although developing highly accurate time series models has been made easy, this post provides best practices to speed up your onboarding and time to value. A little rigor and perhaps a couple of rounds of experimentation must be applied to reach success. A successful forecasting journey depends on multiple factors, some subtle.

These are some key items you should consider when starting to work with Forecast.

Start simple

As shown in the following flywheel, consider beginning with a simple model that uses a target time series dataset to develop a baseline as you propose your first set of input data. Subsequent experiments can add in other temporal features and static metadata with a goal of improving model accuracy. Each time a change is made, you can measure and learn how much the change has helped, if at all. Depending on your assessment, you may decide to keep the new set of features provided, or pivot and try another option.

Time-series predictor flywheel

Focus on the outliers

With Forecast, you can obtain accuracy statistics for the entire dataset. It’s important to recognize that although this top-level statistic is interesting, it should be viewed as being only directionally correct. You should concentrate on item-level accuracy statistics rather than top-level statistics. Consider the following scatterplot as a guide. Some of the items in the dataset will have high accuracy; for these no action is required.

Evaluating forecast outliers

While building a model, you should explore some of the points labeled as “exploratory time-series.” In these exploratory cases, determine how to improve accuracy by incorporating more input data, such as price variations, promotional spend, explicit seasonality features, and the inclusion of local, market, global, and other real-world events and conditions.

Review predictor accuracy before creating forecasts

Don’t create future dated forecasts with Forecast until you have reviewed prediction accuracy during the backtest period. The preceding scatterplot illustrates time series level accuracy, which is your best indication for what future dated predictions will look like, all other things being the same. If this period isn’t providing your required level of accuracy, don’t proceed with the future dated forecast operation, because this may lead to inefficient spend. Instead, focus on augmenting your input data and trying another round at the innovation flywheel, as discussed earlier.

Reduce training time

You can reduce training time through two mechanisms. First, use Forecast's retrain function, which applies transfer learning to shorten training time. Second, use predictor monitoring to detect model drift and retrain only when necessary.
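
A minimal boto3 sketch of both mechanisms is shown below; the predictor name, monitor name, and ARN are placeholders:

import boto3

forecast = boto3.client('forecast')

# Retrain an existing predictor (transfer learning) and enable predictor
# monitoring so that you only retrain when accuracy actually degrades
forecast.create_auto_predictor(
    PredictorName='demand_predictor_retrained',
    ReferencePredictorArn='arn:aws:forecast:us-east-1:123456789012:predictor/demand_predictor',
    MonitorConfig={'MonitorName': 'demand_predictor_monitor'},
)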

Build repeatable processes

We encourage you not to build Forecast workflows through the AWS Management Console or using APIs from scratch until you have at least evaluated our AWS samples GitHub repo. Our mission with GitHub samples is to help remove friction and expedite your time-to-market with repeatable workflows that have already been thoughtfully designed. These workflows are serverless and can be scheduled to run on a regular schedule.

Visit our official GitHub repo, where you can quickly deploy our solution guidance by following the steps provided. As shown in the following figure, the workflow provides a complete end-to-end pipeline that can retrieve historical data, import it, build models, and produce inference against the models—all without needing to write code.

End-to-end pipeline workflow to retrieve historical data, import it, build models, and produce inference against the models.

The following figure offers a deeper view into just one module, which is able to harvest historical data for model training from a myriad of database sources that are supported by Amazon Athena Federated Query.

Get started today

You can implement a fully automated production workflow in a matter of days to weeks, especially when paired with our workflow orchestration pipeline available at our GitHub sample repository.

This re:Invent video highlights a use case of a customer who automated their workflow using this GitHub model:

Forecast has many built-in capabilities to help you achieve your business goals through highly accurate ML-based forecasting. We encourage you to contact your AWS account team if you have any questions and let them know that you would like to speak with a time series specialist in order to provide guidance and direction. We can also offer workshops to assist you in learning how to use Forecast.

We are here to support you and your organization as you endeavor to automate and improve demand forecasting in your company. A more accurate forecast can result in higher sales, a significant reduction in waste, a reduction in idle inventory, and ultimately higher levels of customer service.

Take action today; there is no better time than the present to begin creating a better tomorrow.


About the Author

Charles Laughlin is a Principal AI/ML Specialist Solution Architect and works inside the Time Series ML team at AWS. He helps shape the Amazon Forecast service roadmap and collaborates daily with diverse AWS customers to help transform their businesses using cutting-edge AWS technologies and thought leadership. Charles holds a M.S. in Supply Chain Management and has spent the past decade working in the consumer packaged goods industry.

Dan Sinnreich is a Sr. Product Manager for Amazon Forecast. He is focused on democratizing low-code/no-code machine learning and applying it to improve business outcomes. Outside of work, he can be found playing hockey, trying to improve his tennis serve, scuba diving, and reading science fiction.


Chronomics detects COVID-19 test results with Amazon Rekognition Custom Labels

Chronomics is a tech-bio company that uses biomarkers—quantifiable information taken from the analysis of molecules—alongside technology to democratize the use of science and data to improve the lives of people. Their goal is to analyze biological samples and give actionable information to help you make decisions—about anything where knowing more about the unseen is important. Chronomics’s platform enables providers to seamlessly implement at-home diagnostics at scale—all without sacrificing efficiency or accuracy. It has already processed millions of tests through this platform and delivers a high-quality diagnostics experience.

During the COVID-19 pandemic, Chronomics sold lateral flow tests (LFT) for detecting COVID-19. Users register the test on the platform by uploading a picture of the test cassette and entering a manual reading of the test (positive, negative, or invalid). With the increase in the number of tests and users, it quickly became impractical to manually verify whether the reported result matched the result in the picture of the test. Chronomics wanted to build a scalable solution that uses computer vision to verify the results.

In this post, we share how Chronomics used Amazon Rekognition to automatically detect the results of a COVID-19 lateral flow test.

Preparing the data

The following image shows the picture of a test cassette uploaded by a user. The dataset consists of images like this one. These images are to be classified as positive, negative, or invalid, corresponding to the outcome of a COVID-19 test.

Sample image of a COVID-19 test cassette

The main challenges with the dataset were the following:

  • Imbalanced dataset – The dataset was extremely skewed. More than 90% of the samples were from the negative class.
  • Unreliable user inputs – Readings that were manually reported by the users were not reliable. Around 40% of the readings didn’t match the actual result from the picture.

To create a high-quality training dataset, Chronomics engineers decided to follow these steps:

  • Manual annotation – Manually select and label 1,000 images to ensure that the three classes are evenly represented
  • Image augmentation – Augment the labeled images to increase the number to 10,000

Image augmentation was performed using Albumentations, an open-source Python library. A number of transformations like rotation, rescale, and brightness were performed to generate 9,000 synthetic images. These synthetic images were added to the original images to create a high-quality dataset.
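
The exact augmentation parameters used by Chronomics aren't published; the following Albumentations sketch shows what such a pipeline could look like, with illustrative settings and a hypothetical input image:

import cv2
import albumentations as A

# Illustrative transforms and parameters
transform = A.Compose([
    A.Rotate(limit=25, p=0.7),
    A.RandomScale(scale_limit=0.1, p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
])

image = cv2.imread('test_cassette.jpg')  # hypothetical input image

# Generate several synthetic variants that inherit the original image's label
synthetic_images = [transform(image=image)['image'] for _ in range(9)]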

Building a custom computer vision model with Amazon Rekognition

Chronomics’s engineers turned towards Amazon Rekognition Custom Labels, a feature of Amazon Rekognition with AutoML capabilities. After training images are provided, it can automatically load and inspect the data, select the right algorithms, train a model, and provide model performance metrics. This significantly accelerates the process of training and deploying a computer vision model, making it the primary reason for Chronomics to adopt Amazon Rekognition. With Amazon Rekognition, we were able to get a highly accurate model in 3–4 weeks as opposed to spending 4 months trying to build a custom model to achieve the desired performance.

The following diagram illustrates the model training pipeline. The annotated images were first preprocessed using an AWS Lambda function. This preprocessing step ensured that the images were in the appropriate file format and also performed additional operations such as resizing the image and converting it from RGB to grayscale. It was observed that this improved the performance of the model.

Architecture diagram of the training pipeline
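
A simplified sketch of such a preprocessing step is shown below; the target size and file paths are illustrative, not the values used by Chronomics:

from PIL import Image

def preprocess(input_path: str, output_path: str, size=(512, 512)) -> None:
    """Resize an image and convert it from RGB to grayscale."""
    with Image.open(input_path) as img:
        img.convert('L').resize(size).save(output_path)

preprocess('raw/test_cassette.jpg', 'processed/test_cassette.png')  # hypothetical paths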

After the model has been trained, it can be deployed for inference using just a single click or API call.

Model performance and fine-tuning

The model yielded an accuracy of 96.5% and an F1 score of 97.9% on a set of out-of-sample images. The F1 score is a measure that uses both precision and recall to evaluate a classifier. The DetectCustomLabels API is used to detect the labels of a supplied image during inference. The API also returns the confidence that Rekognition Custom Labels has in the accuracy of the predicted label. The following chart shows the distribution of the confidence scores of the predicted labels: the x-axis represents the confidence score multiplied by 100, and the y-axis is the count of the predictions in log scale.

By setting a threshold on the confidence score, we can filter out predictions that have lower confidence. A threshold of 0.99 resulted in an accuracy of 99.6%, with 5% of the predictions discarded. A threshold of 0.999 resulted in an accuracy of 99.87%, with 27% of the predictions discarded. To deliver the right business value, Chronomics picked a threshold of 0.99 to balance higher accuracy against the number of rejected predictions. For more information, see Analyzing an image with a trained model.
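
A minimal sketch of calling DetectCustomLabels with such a threshold follows; the project version ARN and S3 location are placeholders, and MinConfidence is expressed on Rekognition's 0–100 scale:

import boto3

rekognition = boto3.client('rekognition')

response = rekognition.detect_custom_labels(
    ProjectVersionArn='arn:aws:rekognition:us-east-1:123456789012:project/lft-results/version/lft-results.2023-01-01T00.00.00/1',
    Image={'S3Object': {'Bucket': 'my-lft-uploads-bucket', 'Name': 'uploads/test_cassette.jpg'}},
    MinConfidence=99,  # corresponds to the 0.99 threshold discussed above
)

labels = response['CustomLabels']
if labels:
    print(labels[0]['Name'], labels[0]['Confidence'])
else:
    print('No prediction above the threshold; route the image to human review')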

The discarded predictions can also be routed to a human in the loop using Amazon Augmented AI (Amazon A2I) for manually processing the image. For more information on how to do this, refer to Use Amazon Augmented AI with Amazon Rekognition.

The following image is an example where the model has correctly identified the test as invalid with a confidence of 0.999.

Conclusion

In this post, we showed how Chronomics quickly built and deployed a scalable computer vision-based solution that uses Amazon Rekognition to detect the result of a COVID-19 lateral flow test. The Amazon Rekognition API makes it easy for practitioners to accelerate the process of building computer vision models.

Learn about how you can train computer vision models for your specific business use case by visiting Getting started with Amazon Rekognition Custom Labels and by reviewing the Amazon Rekognition Custom Labels Guide.


About the Authors

Mattia Spinelli is a Senior Machine Learning Engineer at Chronomics, a biomedical company. Chronomics’s platform enables providers to seamlessly implement at-home diagnostics at scale—all without sacrificing efficiency or accuracy.

Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or catching up with sports.

Author-JayRaoJay Rao is a Principal Solutions Architect at AWS. He enjoys providing technical and strategic guidance to customers and helping them design and implement solutions on AWS.

Pashmeen Mistry is a Senior Product Manager at AWS. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.
