AlphaGeometry: An Olympiad-level AI system for geometry
From Embers to Algorithms: How DigitalPath’s AI is Revolutionizing Wildfire Detection
DigitalPath is igniting change in the Golden State — using computer vision, generative adversarial networks and a network of thousands of cameras to detect signs of fire in real time.
In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with DigitalPath System Architect Ethan Higgins about the company’s role in the ALERTCalifornia initiative, a collaboration between California’s wildfire fighting agency CAL FIRE and the University of California, San Diego.
DigitalPath built computer vision models to process images collected from network cameras — anywhere from 8 million to 16 million a day — intelligently identifying signs of fire like smoke.
“One of the things we realized early on, though, is that it’s not necessarily a problem about just detecting a fire in a picture,” Higgins said. “It’s a process of making a manageable amount of data to handle.”
That’s because, he explained, it’s unlikely that humans will be entirely out of the loop in the detection process for the foreseeable future.
The company uses various AI algorithms to classify images based on whether they should be reviewed or acted upon — if so, an alert is sent out to a CAL FIRE command center.
One of the downsides to using computer vision to detect wildfires is that extinguishing more fires means a greater buildup of natural fuel and the potential for larger wildfires in the long term. DigitalPath and UCSD are exploring the use of high-resolution lidar data to identify where those fuels can be reduced through prescribed burns.
Looking ahead, Higgins foresees the field tapping generative AI to accelerate new simulation tools and using AI models to analyze the output of other models to doubly improve wildfire prediction and detection.
“AI is not perfect, but when you couple multiple models together, it can get really close,” he said.
Host the Whisper Model on Amazon SageMaker: exploring inference options
OpenAI Whisper is an advanced automatic speech recognition (ASR) model with an MIT license. ASR technology finds utility in transcription services, voice assistants, and enhancing accessibility for individuals with hearing impairments. This state-of-the-art model is trained on a vast and diverse dataset of multilingual and multitask supervised data collected from the web. Its high accuracy and adaptability make it a valuable asset for a wide array of voice-related tasks.
In the ever-evolving landscape of machine learning and artificial intelligence, Amazon SageMaker provides a comprehensive ecosystem. SageMaker empowers data scientists, developers, and organizations to develop, train, deploy, and manage machine learning models at scale. Offering a wide range of tools and capabilities, it simplifies the entire machine learning workflow, from data pre-processing and model development to effortless deployment and monitoring. SageMaker’s user-friendly interface makes it a pivotal platform for unlocking the full potential of AI, establishing it as a game-changing solution in the realm of artificial intelligence.
In this post, we embark on an exploration of SageMaker’s capabilities, specifically focusing on hosting Whisper models. We’ll dive deep into two methods for doing this: one utilizing the Whisper PyTorch model and the other using the Hugging Face implementation of the Whisper model. Additionally, we’ll conduct an in-depth examination of SageMaker’s inference options, comparing them across parameters such as speed, cost, payload size, and scalability. This analysis empowers users to make informed decisions when integrating Whisper models into their specific use cases and systems.
Solution overview
The following diagram shows the main components of this solution.
- In order to host the model on Amazon SageMaker, the first step is to save the model artifacts. These artifacts refer to the essential components of a machine learning model needed for various applications, including deployment and retraining. They can include model parameters, configuration files, pre-processing components, as well as metadata, such as version details, authorship, and any notes related to its performance. It’s important to note that Whisper models for PyTorch and Hugging Face implementations consist of different model artifacts.
- Next, we create custom inference scripts. Within these scripts, we define how the model should be loaded and specify the inference process. This is also where we can incorporate custom parameters as needed. Additionally, you can list the required Python packages in a requirements.txt file. During the model’s deployment, these Python packages are automatically installed in the initialization phase.
- Then we select either the PyTorch or Hugging Face deep learning containers (DLCs) provided and maintained by AWS. These containers are pre-built Docker images with deep learning frameworks and other necessary Python packages. For more information, you can check this link.
- With the model artifacts, custom inference scripts and selected DLCs, we’ll create Amazon SageMaker models for PyTorch and Hugging Face respectively.
- Finally, the models can be deployed on SageMaker and used with the following options: real-time inference endpoints, batch transform jobs, and asynchronous inference endpoints. We’ll dive into these options in more detail later in this post.
The example notebook and code for this solution are available on this GitHub repository.
Walkthrough
Hosting the Whisper Model on Amazon SageMaker
In this section, we’ll explain the steps to host the Whisper model on Amazon SageMaker, using PyTorch and Hugging Face Frameworks, respectively. To experiment with this solution, you need an AWS account and access to the Amazon SageMaker service.
PyTorch framework
- Save model artifacts
The first option to host the model is to use the official Whisper Python package, which can be installed with pip install openai-whisper. This package provides a PyTorch model. When saving the model artifacts in the local repository, the first step is to save the model’s learnable parameters, such as the weights and biases of each layer in the neural network, as a ‘pt’ file. You can choose from different model sizes, including ‘tiny,’ ‘base,’ ‘small,’ ‘medium,’ and ‘large.’ Larger model sizes offer higher accuracy, but come at the cost of longer inference latency. Additionally, you need to save the model state dictionary and dimension dictionary, which together hold a Python dictionary mapping each layer or parameter of the PyTorch model to its corresponding learnable parameters, along with other metadata and custom configurations. The code below shows how to save the Whisper PyTorch artifacts.
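As a rough sketch of what that saving step can look like with the openai-whisper package (the ‘base’ size and output filename here are illustrative choices, not requirements):

```python
import torch
import whisper

# Load one of the available model sizes ('tiny', 'base', 'small', 'medium', 'large')
model = whisper.load_model("base")

# Persist the learnable parameters plus the dimension metadata as a single 'pt' file
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "dims": model.dims.__dict__,
    },
    "base.pt",
)
```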
- Select DLC
The next step is to select the pre-built DLC from this link. Be careful to choose the correct image by considering the following settings: framework (PyTorch), framework version, task (inference), Python version, and hardware (i.e., GPU). It is recommended to use the latest versions of the framework and Python whenever possible, as this results in better performance and addresses known issues and bugs from previous releases.
- Create Amazon SageMaker models
Next, we utilize the SageMaker Python SDK to create PyTorch models. It’s important to remember to add environment variables when creating a PyTorch model, because by default TorchServe can only process file sizes up to 6 MB, regardless of the inference type used. A code sketch follows the table below.
The following table shows the settings for different PyTorch versions:
Framework | Environment variables
PyTorch 1.8 (based on TorchServe) | 'TS_MAX_REQUEST_SIZE': '100000000', 'TS_MAX_RESPONSE_SIZE': '100000000', 'TS_DEFAULT_RESPONSE_TIMEOUT': '1000'
PyTorch 1.4 (based on MMS) | 'MMS_MAX_REQUEST_SIZE': '1000000000', 'MMS_MAX_RESPONSE_SIZE': '1000000000', 'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'
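A minimal sketch of creating the SageMaker PyTorch model with these environment variables follows; the S3 path, role ARN, source directory, and framework versions are placeholders rather than values from the example notebook:

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel

sagemaker_session = sagemaker.Session()

whisper_pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/whisper/model.tar.gz",        # placeholder: packaged artifacts from step 1
    role="arn:aws:iam::111122223333:role/MySageMakerRole",   # placeholder execution role
    entry_point="inference.py",
    source_dir="code",
    framework_version="2.0.1",   # selects the corresponding pre-built PyTorch DLC
    py_version="py310",
    sagemaker_session=sagemaker_session,
    env={
        "TS_MAX_REQUEST_SIZE": "100000000",
        "TS_MAX_RESPONSE_SIZE": "100000000",
        "TS_DEFAULT_RESPONSE_TIMEOUT": "1000",
    },
)
```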
- Define the model loading method in inference.py
In the custom inference.py
script, we first check for the availability of a CUDA-capable GPU. If such a GPU is available, then we assign the 'cuda'
device to the DEVICE
variable; otherwise, we assign the 'cpu'
device. This step ensures that the model is placed on the available hardware for efficient computation. We load the PyTorch model using the Whisper Python package.
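A minimal sketch of that loading logic, assuming the artifacts were saved as in the earlier snippet (the filename base.pt is the same illustrative choice):

```python
# inference.py (PyTorch) -- model loading portion only
import torch
import whisper
from whisper.model import ModelDimensions, Whisper

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def model_fn(model_dir):
    """Rebuild the Whisper model from the saved checkpoint and move it to DEVICE."""
    checkpoint = torch.load(f"{model_dir}/base.pt", map_location=DEVICE)
    model = Whisper(ModelDimensions(**checkpoint["dims"]))
    model.load_state_dict(checkpoint["model_state_dict"])
    return model.to(DEVICE)
```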
Hugging Face framework
- Save model artifacts
The second option is to use Hugging Face’s Whisper implementation. The model can be loaded using the AutoModelForSpeechSeq2Seq transformers class. The learnable parameters are saved in a binary (bin) file using the save_pretrained method. The tokenizer and preprocessor also need to be saved separately to ensure the Hugging Face model works properly. Alternatively, you can deploy a model on Amazon SageMaker directly from the Hugging Face Hub by setting two environment variables: HF_MODEL_ID and HF_TASK. For more information, please refer to this webpage.
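A minimal sketch of saving the Hugging Face artifacts (the checkpoint name and output directory are illustrative):

```python
from transformers import AutoModelForSpeechSeq2Seq, WhisperProcessor

model_id = "openai/whisper-base"
save_dir = "hf_model_artifacts"

# Save the weights (bin file) plus the tokenizer/feature extractor via the processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
model.save_pretrained(save_dir)
processor.save_pretrained(save_dir)
```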
- Select DLC
Similar to the PyTorch framework, you can choose a pre-built Hugging Face DLC from the same link. Make sure to select a DLC that supports the latest Hugging Face transformers and includes GPU support.
- Create Amazon SageMaker models
Similarly, we utilize the SageMaker Python SDK to create Hugging Face models. The Hugging Face Whisper model has a default limitation where it can only process audio segments up to 30 seconds. To address this limitation, you can include the chunk_length_s parameter in the environment variables when creating the Hugging Face model, and later pass this parameter into the custom inference script when loading the model. Lastly, set the environment variables to increase the payload size and response timeout for the Hugging Face container (a code sketch follows the table below).
Framework | Environment variables
HuggingFace Inference Container (based on MMS) | 'MMS_MAX_REQUEST_SIZE': '2000000000', 'MMS_MAX_RESPONSE_SIZE': '2000000000', 'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'
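A minimal sketch of creating the Hugging Face model with these settings; the S3 path, role ARN, and framework versions are placeholders and should match the DLC you selected:

```python
from sagemaker.huggingface import HuggingFaceModel

whisper_hf_model = HuggingFaceModel(
    model_data="s3://my-bucket/whisper-hf/model.tar.gz",     # placeholder: packaged Hugging Face artifacts
    role="arn:aws:iam::111122223333:role/MySageMakerRole",   # placeholder execution role
    entry_point="inference.py",
    source_dir="code",
    transformers_version="4.28",   # align these versions with the chosen DLC
    pytorch_version="2.0",
    py_version="py310",
    env={
        "chunk_length_s": "30",
        "MMS_MAX_REQUEST_SIZE": "2000000000",
        "MMS_MAX_RESPONSE_SIZE": "2000000000",
        "MMS_DEFAULT_RESPONSE_TIMEOUT": "900",
    },
)
```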
- Define the model loading method in inference.py
When creating the custom inference script for the Hugging Face model, we utilize a pipeline, which allows us to pass chunk_length_s as a parameter. This parameter enables the model to efficiently process long audio files during inference.
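A minimal sketch of that loading logic using the transformers pipeline (reading chunk_length_s from the environment variable set above):

```python
# inference.py (Hugging Face) -- model loading portion only
import os
import torch
from transformers import pipeline

def model_fn(model_dir):
    """Build an ASR pipeline; chunk_length_s lets it process audio longer than 30 seconds."""
    chunk_length_s = int(os.environ.get("chunk_length_s", "30"))
    return pipeline(
        "automatic-speech-recognition",
        model=model_dir,
        chunk_length_s=chunk_length_s,
        device=0 if torch.cuda.is_available() else -1,
    )
```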
Exploring different inference options on Amazon SageMaker
The steps for selecting inference options are the same for both PyTorch and Hugging Face models, so we won’t differentiate between them below. However, it’s worth noting that, at the time of writing this post, the serverless inference option from SageMaker doesn’t support GPUs, and as a result, we exclude this option for this use-case.
We can deploy the model as a real-time endpoint, providing responses in milliseconds. However, it’s important to note that this option is limited to processing inputs under 6 MB. We define the serializer as an audio serializer, which is responsible for converting the input data into a suitable format for the deployed model. We utilize a GPU instance for inference, allowing for accelerated processing of audio files. The inference input is an audio file from the local repository.
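A sketch of deploying and invoking a real-time endpoint, using the PyTorch model object from the earlier snippet (the Hugging Face model deploys the same way); the instance type, content type, and file name are illustrative:

```python
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer

audio_serializer = DataSerializer(content_type="audio/x-audio")

predictor = whisper_pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",   # GPU instance for accelerated inference
    serializer=audio_serializer,
    deserializer=JSONDeserializer(),
)

# Send a local audio file (must stay under the 6 MB real-time payload limit)
with open("sample_audio.wav", "rb") as f:
    transcription = predictor.predict(f.read())
print(transcription)
```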
The second inference option is the batch transform job, which can process input payloads of up to 100 MB. However, this method may incur a few minutes of latency: each instance handles only one batch request at a time, and instance startup and shutdown also take a few minutes. The inference results are saved in an Amazon Simple Storage Service (Amazon S3) bucket upon completion of the batch transform job.
When configuring the batch transformer, be sure to include max_payload = 100 to handle larger payloads effectively. The inference input should be the Amazon S3 path to an audio file or an Amazon S3 bucket folder containing a list of audio files, each with a size smaller than 100 MB.
Batch Transform partitions the Amazon S3 objects in the input by key and maps Amazon S3 objects to instances. For example, when you have multiple audio files, one instance might process input1.wav, and another instance might process the file named input2.wav to enhance scalability. Batch Transform allows you to configure max_concurrent_transforms to increase the number of HTTP requests made to each individual transformer container. However, it’s important to note that the value of (max_concurrent_transforms * max_payload) must not exceed 100 MB.
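A sketch of configuring and running such a batch transform job (paths and instance type are placeholders):

```python
transformer = whisper_pytorch_model.transformer(
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    output_path="s3://my-bucket/whisper-batch-output/",
    max_payload=100,               # allow request payloads up to 100 MB
    max_concurrent_transforms=1,   # max_concurrent_transforms * max_payload must stay within 100 MB
)

transformer.transform(
    data="s3://my-bucket/audio-inputs/",  # S3 prefix holding the audio files
    content_type="audio/x-audio",
)
transformer.wait()  # results land in the S3 output path when the job completes
```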
Finally, Amazon SageMaker Asynchronous Inference is ideal for processing multiple requests concurrently, offering moderate latency and supporting input payloads of up to 1 GB. This option provides excellent scalability, enabling the configuration of an autoscaling group for the endpoint. When a surge of requests occurs, it automatically scales up to handle the traffic, and once all requests are processed, the endpoint scales down to 0 to save costs.
Using asynchronous inference, the results are automatically saved to an Amazon S3 bucket. In the AsyncInferenceConfig, you can configure notifications for successful or failed completions. The input path points to an Amazon S3 location of the audio file. For additional details, please refer to the code on GitHub.
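A sketch of deploying an asynchronous endpoint and submitting a request by S3 reference (the output path, SNS topics, and input location are placeholders; the serializer is reused from the real-time sketch):

```python
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://my-bucket/whisper-async-output/",
    # notification_config={"SuccessTopic": "arn:aws:sns:...", "ErrorTopic": "arn:aws:sns:..."},
)

async_predictor = whisper_pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    async_inference_config=async_config,
    serializer=audio_serializer,
)

# The audio is referenced by its S3 location instead of being sent inline
response = async_predictor.predict_async(
    input_path="s3://my-bucket/audio-inputs/long_recording.wav"
)
```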
Optional: As mentioned earlier, we have the option to configure an autoscaling group for the asynchronous inference endpoint, which allows it to handle a sudden surge in inference requests. A code example is provided in this GitHub repository. In the following diagram, you can observe a line chart displaying two metrics from Amazon CloudWatch: ApproximateBacklogSize and ApproximateBacklogSizePerInstance. Initially, when 1,000 requests were triggered, only one instance was available to handle the inference. For three minutes, the backlog size consistently exceeded three (please note that these numbers can be configured), and the autoscaling group responded by spinning up additional instances to efficiently clear out the backlog. This resulted in a significant decrease in ApproximateBacklogSizePerInstance, allowing backlog requests to be processed much faster than during the initial phase.
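A sketch of how such a backlog-based autoscaling policy can be registered with the Application Auto Scaling API; the capacities, target value, and cooldowns here are illustrative rather than the exact settings used in the repository:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{async_predictor.endpoint_name}/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,   # scale down to zero when idle
    MaxCapacity=5,
)

autoscaling.put_scaling_policy(
    PolicyName="whisper-async-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 3.0,  # scale out when the per-instance backlog exceeds ~3 requests
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": async_predictor.endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 120,
        "ScaleOutCooldown": 120,
    },
)
```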
Comparative analysis for the inference options
The comparisons for different inference options are based on common audio processing use cases. Real-time inference offers the fastest inference speed but restricts payload size to 6 MB. This inference type is suitable for audio command systems, where users control or interact with devices or software using voice commands or spoken instructions. Voice commands are typically small in size, and low inference latency is crucial to ensure that transcribed commands can promptly trigger subsequent actions. Batch Transform is ideal for scheduled offline tasks, when each audio file’s size is under 100 MB, and there is no specific requirement for fast inference response times. Asynchronous inference allows for uploads of up to 1 GB and offers moderate inference latency. This inference type is well-suited for transcribing movies, TV series, and recorded conferences where larger audio files need to be processed.
Both real-time and asynchronous inference options provide autoscaling capabilities, allowing the endpoint instances to automatically scale up or down based on the volume of requests. In cases with no requests, autoscaling removes unnecessary instances, helping you avoid costs associated with provisioned instances that aren’t actively in use. However, for real-time inference, at least one persistent instance must be retained, which could lead to higher costs if the endpoint operates continuously. In contrast, asynchronous inference allows instance volume to be reduced to 0 when not in use. When configuring a batch transform job, it’s possible to use multiple instances to process the job and adjust max_concurrent_transforms to enable one instance to handle multiple requests. Therefore, all three inference options offer great scalability.
Cleaning up
Once you have finished using the solution, be sure to remove the SageMaker endpoints to prevent incurring additional costs. You can use the provided code to delete the real-time and asynchronous inference endpoints, respectively.
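For example, assuming the predictor objects from the earlier sketches are still in scope:

```python
# Remove the real-time and asynchronous endpoints created above
predictor.delete_endpoint()
async_predictor.delete_endpoint()
```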
Conclusion
In this post, we took the Whisper model as an example to demonstrate how to host open-source ASR models on Amazon SageMaker using the PyTorch or Hugging Face approach, a task that has become increasingly essential as audio processing models are adopted across industries. The exploration encompassed various inference options on Amazon SageMaker, offering insights into efficiently handling audio data, making predictions, and managing costs effectively. This post aims to provide knowledge for researchers, developers, and data scientists interested in leveraging the Whisper model for audio-related tasks and making informed decisions on inference strategies.
For more detailed information on deploying models on SageMaker, please refer to this Developer guide. Additionally, the Whisper model can be deployed using SageMaker JumpStart. For additional details, kindly check the Whisper models for automatic speech recognition now available in Amazon SageMaker JumpStart post.
Feel free to check out the notebook and code for this project on GitHub and share your comments with us.
About the Author
Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her primary areas of interest encompass Deep Learning, with a focus on GenAI, Computer Vision, NLP, and time series data prediction. In her spare time, she relishes spending quality moments with her family, immersing herself in novels, and hiking in the national parks of the UK.
How Best Take makes your group photos better
Learn more about how Google built Google Photos’ Best Take tool.
GHDDI and Microsoft Research use AI technology to achieve significant progress in discovering new drugs to treat global infectious diseases
The Global Health Drug Discovery Institute (GHDDI) (opens in new tab) and Microsoft Research recently achieved significant progress in accelerating drug discovery for the treatment of global infectious diseases. Working in close collaboration, the joint team successfully used generative AI and foundation models to design several small molecule inhibitors for essential target proteins of Mycobacterium tuberculosis and coronaviruses. These new inhibitors show outstanding bioactivities, comparable to or surpassing the best-known lead compounds.
This breakthrough is a testament to the team’s combined efforts in generative AI, molecular physicochemical modeling, and iterative feedback loops between scientists and AI technologies. Normally, the discovery and in vitro confirmation of such molecules could take up to several years, but with the acceleration of AI, the joint team achieved these new results in just five months. This research also shows the tremendous potential of AI for helping scientists discover or create the building blocks needed to develop effective treatments for infectious diseases that continue to threaten the health and lives of people around the world.
Since 2019, for example, there have been more than 772 million confirmed cases of COVID-19 worldwide and nearly 7 million deaths from the virus, according to the World Health Organization (WHO), the Centers for Disease Control, and various other sources. Although vaccines have reduced the incidence and deadliness of the disease, the coronavirus continues to mutate and evolve, making it a serious ongoing threat to global health. Meanwhile, the WHO reports that tuberculosis continues to be a leading cause of death among infectious diseases, second only to COVID-19 in 2022, when 10.6 million people worldwide fell ill with TB and the disease killed 1.3 million (the most recent figures currently available).
Laying the foundation for new infectious disease treatments
Microsoft Research has rich experience in developing and pre-training large AI models specialized for proteins and molecules, demonstrated in both property prediction and molecular generation. Based on those experiences, Microsoft Research developed and maintains ownership of an AI model for molecule generation tailored for specific protein targets. The generated compounds were virtually screened and further optimized by data scientists and medicinal chemists from GHDDI, followed by compound synthesis and wet-lab experiments to quantify bioactivities. The experimental results were then fed back to the research team at Microsoft for AI model improvement and new compound generation.
This AI-expert-experiment integrated pipeline enabled the successful generation of novel compounds for protein targets in Mycobacterium tuberculosis and the coronavirus SARS-CoV-2. In less than five months, the joint team designed several chemical compounds that are effective in inhibiting these pathogens’ essential target proteins, accelerating the structure-based drug discovery process.
One distinguishing feature of AI-generated molecules is their novel scaffold structures, which are important because they create the potential for these molecules to be developed into a new class of drug candidates. These novel structures offer the possibility of more effective treatments, and also help to address the escalating challenge of antimicrobial resistance (AMR), a major hurdle in treating infectious diseases like tuberculosis and COVID-19.
“In the current landscape of scientific research, we encounter unparalleled challenges but also have unprecedented opportunities,” said Dr. Sheng Ding, institute director of GHDDI. “Innovation stands as the central catalyst for scientific advancement and a crucial element in addressing global health challenges. I’m excited about our collaboration with Microsoft Research and gratified with the progress we’ve jointly achieved. Without a doubt, our combined efforts will enhance R&D efficiency and expedite the process of drug discovery.”
“This represents a collaboration that transcends disciplines and boundaries,” he noted. “Our combined strengths will advance pharmaceutical research, paving new avenues in scientific exploration. Going forward, we anticipate deploying such cutting-edge technologies in uncharted realms of life sciences. This will enable us to offer more comprehensive, profound, and practical solutions for global health issues.”
Using AI to improve global health
Embracing the principle of open innovation, the collaboration between GHDDI and Microsoft Research is dedicated to harnessing AI technology to expedite drug discovery. The goal is to contribute to global health equity through the development of lifesaving medications and the prompt delivery of safer and more effective drug solutions that are accessible to everyone. The collaboration focuses on infectious diseases that pose a threat to global health, including but not limited to tuberculosis, viral infections, and malaria. Both parties are committed to a deep integration of generative AI, foundational models, high-throughput virtual screening, and expert knowledge to tackle these challenges.
“Successful AI-driven drug discovery necessitates a tight-knit collaboration between AI specialists and medicinal experts,” said Dr. Tie-Yan Liu, distinguished scientist at Microsoft Research AI4Science. “In recent years, our globally recognized team at Microsoft Research has been deeply engaged in interdisciplinary research between AI and natural science. To complement this, GHDDI experts bring to the table a wealth of industry experience and profound domain knowledge. Their experimental facilities not only allow for testing but also help provide invaluable feedback for training AI models. Because of our close collaboration, we look forward to producing groundbreaking research outcomes with the potential to redefine the future of healthcare through AI technology innovation.”
Accelerating drug discovery
Commenting on the research into Mycobacterium tuberculosis and coronaviruses, Dr. Rumin Zhang, chief scientific officer at GHDDI, noted that the application of AI technology by the collaborative team managed to considerably reduce the traditionally lengthy drug discovery process. The team was able to design and validate highly effective small molecule inhibitors for the pathogens in just five months.
“This is an exceptional accomplishment that underscores the immense potential of AI in efficient de novo drug design. It also vividly illustrates the team’s exceptional innovative capacity and professional prowess,” he said. “We are excited about this innovative R&D strategy leading to more groundbreaking advancements in a broader spectrum of future drug discovery projects.”
“This work is all about pushing the boundaries of AI technology for application in new drug R&D,” said Dr. Tao Qin, senior principal researcher at Microsoft Research AI4Science. “We aim to leverage AI innovations to enhance human health, tackle worldwide health issues, and ensure the advantages of AI technology are accessible to all.”
“We plan to intensify and broaden our collaboration, further advancing the use of AI technology in the realm of life sciences,” said Dr. Jinjiang Guo, head of the Data Science Department at GHDDI. “This will yield novel insights that will enrich researchers’ understanding of mechanisms underlying diseases and life, thus paving the way for the development of innovative treatment strategies and providing more effective solutions for diseases that have long affected human health. We are highly optimistic about the potential of this collaboration and are confident that it will have a substantial impact on the future of the healthcare field.”
Next steps
In the next phase, Microsoft Research and GHDDI will collaborate to optimize the discovered hit compounds, enhance ADMET properties, progress toward preclinical studies, and initiate a broader range of drug-discovery projects.
Māori Speech AI Model Helps Preserve and Promote New Zealand Indigenous Language
Indigenous languages are under threat. Some 3,000 — three-quarters of the total — could disappear before the end of the century, or one every two weeks, according to UNESCO.
As part of a movement to protect such languages, New Zealand’s Te Hiku Media, a broadcaster focused on the Māori people’s indigenous language, known as te reo, is using trustworthy AI to help preserve and revitalize the tongue.
Using ethical, transparent methods of speech data collection and analysis to maintain data sovereignty for the Māori people, Te Hiku Media is developing automatic speech recognition (ASR) models for te reo, which is a Polynesian language.
Built using the open-source NVIDIA NeMo toolkit for ASR and NVIDIA A100 Tensor Core GPUs, the speech-to-text models transcribe te reo with 92% accuracy. They can also transcribe bilingual speech using English and te reo with 82% accuracy. They’re pivotal tools, made by and for the Māori people, that are helping preserve and amplify their stories.
“There’s immense value in using NVIDIA’s open-source technologies to build the tools we need to ultimately achieve our mission, which is the preservation, promotion and revitalization of te reo Māori,” said Keoni Mahelona, chief technology officer at Te Hiku Media, who leads a team of data scientists and developers, as well as Māori language experts and data curators, working on the project.
“We’re also helping guide the industry on ethical ways of using data and technologies to ensure they’re used for the empowerment of marginalized communities,” added Mahelona, a Native Hawaiian now living in New Zealand.
Building a ‘House of Speech’
Te Hiku Media began more than three decades ago as a radio station aiming to ensure te reo had space on the airwaves. Over the years, the organization incorporated television broadcasting and, with the rise of the internet, it convened a meeting in 2013 with the community’s elders to form a strategy for sharing content in the digital era.
“The elders agreed that we should make the stories accessible online for our community members — rather than just keeping our archives on cassettes in boxes — but once we had that objective, the challenge was how to do this correctly, in alignment with our strong roots in valuing sovereignty,” Mahelona said.
Instead of uploading its video and audio sources to popular, global platforms — which, in their terms and conditions of use, require signing over certain rights related to the content — Te Hiku Media decided to build its own content distribution platform.
Called Whare Kōrero — meaning “house of speech” — the platform now holds more than 30 years’ worth of digitized, archival material featuring about 1,000 hours of te reo native speakers, some of whom were born in the late 19th century, as well as more recent content from second-language learners and bilingual Māori people.
Now, around 20 Māori radio stations use and upload their content to Whare Kōrero. Community members can access the content through an app.
“It’s an invaluable resource of acoustic data,” Mahelona said.
Turning to Trustworthy AI
Such a trove held incredible value for those working to revitalize the language, the Te Hiku Media team quickly realized, but manual transcription required pulling lots of time and effort from limited resources. So began the organization’s trustworthy AI efforts, in 2016, to accelerate its work using ASR.
“No one would have a clue that there are eight NVIDIA A100 GPUs in our derelict, rundown, musty-smelling building in the far north of New Zealand — training and building Māori language models,” Mahelona said. “But the work has been game-changing for us.”
To collect speech data in a transparent, ethically compliant, community-oriented way, Te Hiku Media began by explaining its cause to elders, garnering their support and asking them to come to the station to read phrases aloud.
“It was really important that we had the support of the elders and that we recorded their voices, because that’s the sort of content we want to transcribe,” Mahelona said. “But eventually these efforts didn’t scale — we needed second-language learners, kids, middle-aged people and a lot more speech data in general.”
So, the organization ran a crowdsourcing campaign, Kōrero Māori, to collect highly labeled speech samples according to the Kaitiakitanga license, which ensures Te Hiku Media uses the data only for the benefit of the Māori people.
In just 10 days, more than 2,500 people signed up to read 200,000+ phrases, providing over 300 hours of labeled speech data, which was used to build and train the te reo Māori ASR models.
In addition to other open-source trustworthy AI tools, Te Hiku Media now uses the NVIDIA NeMo toolkit’s ASR module for speech AI throughout its entire pipeline. The NeMo toolkit comprises building blocks called neural modules and includes pretrained models for language model development.
“It’s been absolutely amazing — NVIDIA’s open-source NeMo enabled our ASR models to be bilingual and added automatic punctuation to our transcriptions,” Mahelona said.
Te Hiku Media’s ASR models are the engines running behind Kaituhi, a te reo Māori transcription service now available online.
The efforts have spurred similar ASR projects now underway by Native Hawaiians and the Mohawk people in southeastern Canada.
“It’s indigenous-led work in trustworthy AI that’s inspiring other indigenous groups to think: ‘If they can do it, we can do it, too,’” Mahelona said.
Learn more about NVIDIA-powered trustworthy AI, the NVIDIA NeMo toolkit and how it enabled a Telugu language speech AI breakthrough.
Starstruck: 3D Artist Brellias Brings Curiosity to Light This Week ‘In the NVIDIA Studio’
Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.
Curiosity leads the way for this week’s featured In the NVIDIA Studio 3D artist, Brellias.
It’s what inspired the native Chilean’s latest artwork Estrellitas, which in English translates to “little stars.” The scene expresses the mixture of emotions that comes with curiosity, depicting a young girl holding little stars in her hand with a conflicted expression.
“She’s excited to learn about them, but she’s also a little scared,” Brellias explained.
The striking visual piece, rich with vibrant colors and expertly executed textures, underscores that while curiosity can invoke various emotions — both joyful and painful — it is always a source of change and growth.
A Sky Full of Stars
To start, Brellias first visualized and reworked an existing 3D scene of a woman in Blender. He used Blender’s built-in multi-resolution modifier for sculpting and added some shape keys to achieve the desired modifications.
He also created a custom shader for the character’s skin — a stylistic choice to lend its appearance a galactic hue.
Next, Brellias tapped Blender’s OptiX GPU-accelerated viewport denoising, powered by his GeForce RTX GPU.
“The technology helps reduce noise and improve the quality of the viewport image more quickly, allowing me to make decisions and iterate on the render faster,” he said.
Next, Brellias animated the scene using a base model from Daz Studio, a free media design software developed by Daz 3D. Daz features an AI denoiser for high-performance interactive rendering that can also be accelerated by RTX GPUs.
In addition, rig tools in Blender made the animation process easy, eliminating the need to modify file formats.
To animate the character’s face, Brellias tied drivers to shape keys using empties, enabling greater fluidity and control over facial expressions.
Brellias then used geometry nodes in Blender to animate the character’s hair, giving it a magical floating effect. To light the scene, Brellias added some ambient light behind the character’s face and between its hands. His RTX GPU accelerated OptiX ray tracing in Blender’s Cycles for the fastest final-frame renders.
Finally, he moved to Blackmagic Design’s DaVinci Resolve to denoise and deflicker the scene for the smoothest-looking animation.
Here, Brellias’ RTX GPU accelerated the color grading, video editing and color scoping processes, dramatically speeding his creative workflow. Other RTX-accelerated AI features, including facial recognition for automatically tagging clips and the tracking of effects, were available for his use.
Estrellitas was partially inspired by Brellias’ own curiosity in exploring NVIDIA and GeForce RTX GPU technologies to power content creation workflows — a venture that provided rewarding results.
“Every step of my creative process involves GPU acceleration or AI in some way or another,” said Brellias. “I can’t imagine creating without a powerful GPU at my disposal.”
His curiosity in AI extends to productivity. He recently installed the NVIDIA Broadcast app, which can transform any room into a home studio.
The app has enhanced Brellias’ microphone performance by canceling external noise and echo — especially useful given his urban surroundings.
Download the Broadcast beta and explore the rest of the Studio suite of apps, including Canvas, which uses AI to turn simple brushstrokes into realistic landscape images, and RTX Remix, which allows modders to create AI-powered RTX remasters of classic games. The apps are all free for RTX GPU owners.
Check out Brellias’ portfolio on Instagram.
Follow NVIDIA Studio on Instagram, X and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.
Accelerating Triton Dequantization Kernels for GPTQ
TL;DR
Leveraging a first-principles approach, we showcase a step-by-step process undertaken to accelerate the current Triton GPTQ kernels by 3x (core GPTQ) and 6x (AutoGPTQ) — for example, from 275us to 47us on a typical Llama-style inference input. The goal is to provide a helpful template for accelerating any given Triton kernel. We provide background on Triton and the GPTQ quantization and dequantization process, showcase the impact of coalesced memory access in improving shared and global memory throughput, highlight changes made to reduce warp stalling to improve total throughput, and give an overview of integrating Triton kernels into PyTorch code. Longer term, we hope to surpass the existing CUDA native GPTQ kernel with our Triton kernel.
Fig 1: Performance benchmarking the optimized AutoGPTQ kernel vs the current AutoGPTQ kernel on H100
Fig 2: Performance benchmarking the newly optimized AutoGPTQ kernel vs the current AutoGPTQ kernel on A100
Fig 3: Even with these improvements, there remains a gap between our optimized Triton kernel and the CUDA native AutoGPTQ kernel on A100. More to come…
1.0 Introduction to Triton
The Triton framework provides a hardware agnostic way of programming and targeting GPUs, currently supporting both NVIDIA and AMD, with support for additional hardware vendors in progress. Triton is now a mainstay for PyTorch 2.0 as torch.compile decomposes eager PyTorch and re-assembles it into a high percentage of Triton kernels with PyTorch connecting code.
As Triton becomes more widely adopted, it will be essential that programmers understand how to systematically step through the Triton stack (from the high level Python down to the low-level SASS) to address performance bottlenecks in order to optimize GPU efficiency for algorithms that go beyond torch.compile generated kernels.
In this post, we will introduce some core concepts of the Triton programming language, how to identify common performance limiters in GPU kernels, and in parallel, tune a quantization kernel used in AutoGPTQ that can be used for high throughput inference applications.
Intro to GPTQ Quantization and Dequantization
GPTQ is a quantization algorithm that is able to compress ultra-large (175B+) LLMs efficiently to int4 bit representation, via approximate second order information (Hessian inverse). AutoGPTQ is a framework built on GPTQ, allowing for rapid dequantization and inference/serving of LLMs that have been quantized with GPTQ.
As part of the AutoGPTQ stack, they provide a Triton GPTQ kernel to handle the dequantization of a model for inference.
The basic process for INT quantization is shown below and involves determining the scale and zero point, and then computing the quantized 4bit Weight using the Scale and Zero point:
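As a rough sketch of that scale/zero-point computation for one group of weights (GPTQ’s actual algorithm additionally uses Hessian-based error compensation, which is omitted here):

```python
import torch

def quantize_int4(w: torch.Tensor):
    """Min-max asymmetric quantization of a weight group to 4-bit integers."""
    qmax = 2**4 - 1                              # 4-bit range: 0..15
    scale = (w.max() - w.min()) / qmax           # step size for this group
    zero_point = torch.round(-w.min() / scale)   # integer that represents 0.0
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax).to(torch.uint8)
    return q, scale, zero_point
```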
We thus store the 4 Bit weights along with the meta information of Scale and ZeroPoint for each group of weights.
To ‘dequant’ these weights, we do the following:
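A correspondingly minimal dequantization sketch:

```python
def dequantize_int4(q: torch.Tensor, scale, zero_point) -> torch.Tensor:
    """Recover an fp16 approximation of the original weights."""
    return ((q.to(torch.float32) - zero_point) * scale).to(torch.float16)
```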
And then proceed to Matrix Multiply the dequantized weights with the dense input feature matrix for this linear layer.
2.0 Identify the Bottlenecks – Optimizing Matrix Multiplication
As it turns out, making a fast matrix multiplication kernel is not trivial. A naively implemented matrix multiply will rarely reach peak throughput performance on highly parallel machines like GPUs. So we need to tackle the compute and memory subsystems of our GPU in a hierarchical fashion to make sure we are maximally utilizing each resource.
We start our optimization process, by running the unoptimized Triton Kernel, through the Nvidia Nsight Compute tool and taking a note of some important metrics and warnings:
Fig: Nsight Compute metrics and warnings for the unoptimized kernel
We notice first that both compute and memory throughput are low, at 7.40% and 21.19% respectively (see the figure above). Knowing that for typical inference matrix problem sizes we are in the memory-bound regime, we will attempt to optimize the kernel by applying code changes that target the memory subsystem of our A100 GPU.
The three topics this post will cover are:
- L2 Optimization
- Vectorized Load
- Warp Stalling
Let’s walk through each topic, make the appropriate changes, and see its corresponding impact on our Triton kernel. This Triton kernel is a fused dequantization kernel that dequantizes a packed int32 weight tensor (we will refer to this as the B matrix) into int4 weights, performs matrix multiplication with the activation tensor (referred to as the A matrix) in FP16 mode, and then stores the results back to a matrix C.
The above is referred to as W4A16 quantization. Keep in mind that the process we describe can and should be used for the development of any GPU kernel, as these are common bottlenecks in any unoptimized kernel.
3.0 L2 Optimization
This optimization already exists in the AutoGPTQ kernel, but we’d like to dedicate a section to this to help readers better understand how mapping and execution order of thread blocks is handled in Triton. Thus, we will step through a naive mapping and then a more optimal mapping to see its corresponding impact.
Let’s build up our kernel naively, starting with a “linear” load from global memory and then compare it to a more optimized “swizzled” load. Linear vs swizzled determines the execution order of our grid of work on the GPU. Let’s take a look at the hints that the Nvidia Nsight Compute tool provides regarding our kernel’s shared memory access pattern in the naive case:
To tackle this issue we can use an approach referred to as “tile-swizzling.” The idea of this method is to launch our thread blocks in a more L2 cache friendly order.
Let’s take a step back and familiarize ourselves with some Triton semantics and make a simple CUDA analogy to understand the concept better. Triton kernels launch “programs”. These so-called programs map to the concept of a thread block in CUDA and are the basic unit of parallelism in a Triton kernel. Every program has an associated “pid”, and all the threads in a program are guaranteed to be executing the same instruction.
The Triton programs will be distributed onto your SMs in a naive way if you do a simple linear mapping of “pid” to a 2D grid location of your output matrix C.
This 2D grid location is determined by pid_m and pid_n in Triton. We would like to exploit data and cache locality in the L2 cache of our GPU, when we distribute our grid of work. To do this in Triton we can make the following changes:
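The snippet below is a sketch of the two mappings in the style of the standard Triton grouped-ordering idiom, rather than the exact AutoGPTQ code; M, N and the block/group sizes are kernel parameters:

```python
import triton
import triton.language as tl

@triton.jit
def swizzled_pid_demo(M, N,
                      BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
                      GROUP_SIZE_M: tl.constexpr):
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)

    # Naive "linear" ordering (the unoptimized mapping):
    #   pid_m = pid // num_pid_n
    #   pid_n = pid % num_pid_n

    # "Swizzled" (grouped) ordering: walk GROUP_SIZE_M rows of output tiles
    # together so neighbouring programs reuse A/B tiles already resident in L2.
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m
    # ... pid_m / pid_n then drive the block pointers for the A, B and C tiles
```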
The commented-out mapping in the sketch is the naive “linear” tile ordering, and the active mapping is the “swizzled” tile ordering. This type of launch promotes a sense of locality. Here is a visual to help understand this better.
After incorporating this change, the profiler no longer complains about uncoalesced memory accesses. Let’s take a look at how our memory throughput has changed:
This change was tested on a simple load-store kernel. Looking at the GPU Speed of Light statistics section in the profiler, we also see a 112.07% increase in the memory throughput of the simple load kernel, which is what we were after with this optimization. Again, this optimization already exists in the AutoGPTQ kernel, but it is the boilerplate logic that every Triton kernel programmer will have to write at the beginning of their kernel, before any of the exciting dequantization or matrix multiply logic. It is thus important to understand that:
- This mapping is not unique
- Triton does not automatically handle this kind of optimization for the programmer, and careful thought must be taken to ensure your kernel is optimally handling shared memory accesses
These are not obvious for those new to Triton, as much of the shared memory access optimization is handled by the Triton compiler. However, in the cases where these are not handled by the compiler, it is important to be able to understand what tools and methods are available to us to be able to influence memory behavior.
4.0 Vectorized Load
Now, back to the original complaints of our unoptimized kernel. We want to optimize the global memory access pattern of our kernel. From the details page of the Nvidia Nsight compute tool, we see the following note, where the profiler is complaining about uncoalesced global memory accesses.
Let’s dig deeper and take a look at the SASS (Assembly) Code load for an unoptimized memory read:
This load operation resulted in 32 global load operations that are 16 bit wide. This is not optimal.
We would like to do our global memory loads in a vectorized way so that it results in the least amount of load instructions. To combat this we can give the Triton Compiler some help.
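One common way to express this kind of hint in Triton (not necessarily the exact lines from the kernel; the pointer and block-size names below are placeholders taken from a fragment inside a @triton.jit kernel) is:

```python
# Inside a @triton.jit kernel: assert that the offsets are contiguous and aligned
# so the compiler can widen the loads into coalesced 128-bit transactions.
offs_k = tl.arange(0, BLOCK_SIZE_K)
offs_k = tl.max_contiguous(tl.multiple_of(offs_k, BLOCK_SIZE_K), BLOCK_SIZE_K)
b = tl.load(b_ptrs + offs_k)  # b_ptrs: placeholder for the kernel's pointer arithmetic
```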
These lines act as a compiler hint: they tell the compiler that the elements are contiguous in memory and that this load operation can be coalesced.
Let’s see the effect in assembly after adding these lines.
The load is now performed in 4 global load operations that are each 128 bits wide, instead of 32 global load operations that are 16 bits wide. This means 28 fewer memory fetch instructions and, importantly, a coalesced memory access. This can be seen from the fact that a single thread no longer issues separate loads to consecutive memory addresses, which was the behavior without the compiler hint.
The resulting effect is a 73x speedup in an isolated load operation, and after incorporating it in the full dequantization kernel we were able to see another 6% speedup. Another step in the right direction!
5.0 Warp Stalling
Now putting all the changes back into our full dequantization kernel, we see the following performance limiter, warp stalling.
These warp stalls are mostly caused by ‘Long Scoreboard’ stalls, accounting for 92.63% of the total.
At a high level, long scoreboard stalls happen when a warp requires data that is not yet ready for it to be in the “issued” state. In other words, GPUs are throughput machines, and we need to hide the latency of load instructions with compute instructions. By loading more data and rearranging where the load instructions sit in the kernel, we can take care of this problem.
In an ideal scenario, each warp scheduler would be able to issue 1 instruction every clock cycle. Note – Every SM on an A100 GPU has 4 warp schedulers.
However, our kernel has bottlenecks and is spending 4.4 cycles in the stall state with the block size that the AutoGPTQ Triton kernel deems optimal given its presets.
How do we improve this?
We want to be able to increase our memory throughput so that we can increase the chance that when a warp issues an instruction, we won’t be waiting for loads to be stored in SRAM so that they can be used for computation. We played around with multiple parameters (such as number of pipeline stages, and number of warps) and the one that had the biggest impact was increasing the block size by a factor of 2 in the k dimension.
These changes yield an immediate impact on both compute and memory throughput.
We also see the long scoreboard wait time at the step where we shift and scale the quantized weights drop significantly, which is what we identified as the original bottleneck in the source code. While there are still stalls at this line, only 68% of them are caused by long scoreboard stalls, compared to the original 92%. Ideally, we would not observe any stalls, so there is still work to be done here, but the reduction in the number of long scoreboard stalls tells us that our data is now ready, in L1TEX memory, for an instruction that a warp wants to execute at a higher frequency than in the original kernel.
The corresponding impact is a 1.4x speedup in the execution time of our kernel.
6.0 Results
By tackling all these problem areas methodically our resulting kernel is 6x faster on the Nvidia A100 GPU than if you were to use the Triton kernel AutoGPTQ provides out-of-the-box.
Taking a relevant Llama inference sample data point, the Triton kernel we’ve developed takes 47us to perform dequantization and matrix multiplication compared to the 275us taken by the AutoGPTQ kernel for the same matrix size.
By replicating this step-by-step approach it should be possible to get similar speedups in other kernels, and help build understanding on common GPU bottlenecks and how to tackle them.
It is important to note that while strides have been made in improving the performance of the AutoGPTQ Triton Kernel, we have still not closed the gap on the current exllamaV2 CUDA kernels found in AutoGPTQ.
More research is required to understand how we can further optimize this kernel to match equivalent custom CUDA kernel performance.
Summary and Future work
Triton extends PyTorch by allowing low level GPU optimizations to be done at a higher level of abstraction than CUDA programming, with the net result that adding optimized Triton kernels can help PyTorch models run faster.
Our goal in this post was to show an example of accelerating the GPTQ dequant kernel and provide a template workflow for how the accelerations were achieved.
For future work, SplitK work decomposition for the matrix multiplication is a potential speed up we’ll investigate.
Integrating custom Triton Kernels into PyTorch
Given the acceleration shown above, a common question is how to actually use a custom kernel in a given PyTorch codebase.
A Triton kernel has at least two parts. The first is the Triton kernel code itself, which is compiled by the Triton compiler.
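As a minimal sketch (a toy elementwise kernel rather than the actual fast_qlinear dequant kernel):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each Triton "program" handles one BLOCK_SIZE-wide slice of the tensors
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```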
The second is a Python wrapper, which may or may not subclass the PyTorch autograd Function, depending on whether it needs to support a backward pass (i.e., for training rather than inference-only use).
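And a sketch of the corresponding inference-only wrapper, which is the function your PyTorch code actually imports and calls:

```python
def triton_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Allocate the output, compute the launch grid, and dispatch the kernel
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```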
You simply import the Python wrapper into your PyTorch code where you want to use it, much like any other Python/PyTorch function.
In this case, simply importing and then using ‘fast_qlinear’ would invoke the underlying Triton kernel with the speed-ups we’ve shown above applied to your PyTorch model.
Acknowledgements
Thanks to Jamie Yang and Hao Yu from IBM Research for their technical guidance in the collection of these results.
Democratic inputs to AI grant program: lessons learned and implementation plans
We funded 10 teams from around the world to design ideas and tools to collectively govern AI. We summarize the innovations, outline our learnings, and call for researchers and engineers to join us as we continue this work.
Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization
Recent advances in deep learning and automatic speech recognition have boosted the accuracy of end-to-end speech recognition to a new level. However, recognition of personal content such as contact names remains a challenge. In this work, we present a personalization solution for an end-to-end system based on connectionist temporal classification. Our solution uses a class-based language model, in which a general language model provides modeling of the context for named entity classes, and personal named entities are compiled in a separate finite state transducer. We further introduce a…