Personalize your search results with Amazon Personalize and Amazon OpenSearch Service integration

Amazon Personalize has launched a new integration with Amazon OpenSearch Service that enables you to personalize search results for each user and helps predict their search needs. The Amazon Personalize Search Ranking plugin within OpenSearch Service helps you improve end-user engagement and conversion from your website and app search by taking advantage of the deep learning capabilities offered by Amazon Personalize. This feature is also available with self-managed OpenSearch.

Search is crucial in engaging users because it brings high-intent traffic from individuals seeking specific products or categories. Previously, customers found it challenging to capitalize on this traffic and provide relevant search results to their users due to infrastructure limitations or lack of ML expertise. This led to increased instances of users failing to find the items they were searching for. With the Amazon Personalize Search Ranking plugin, customers of OpenSearch Service version 2.9.0 or later can go beyond the traditional keyword matching approach and boost relevant items in an individual user’s search results based on their interests, context, and past interactions in real time. You can also fine-tune the level of personalization for every search query to ensure flexibility and control over the search experience.

AWS Partners like Cognizant are excited by the personalization possibilities that the Amazon Personalize Search Ranking plugin will unlock for their media and retail customers.

“Amazon Personalize has been proven to be highly impactful for many businesses with its cost-effective and streamlined implementation. With the release of the new Amazon Personalize Search Ranking plugin within Amazon OpenSearch Service, we can now rapidly deploy and implement real-time user personalization to search results. We are highly confident that it will deliver improved customer experience and satisfaction as well as increase conversion and clickthrough rates by two to three times. Personalized search is a differentiator, especially for media and retail platforms. We are really excited to be a launch partner with AWS on this release and are looking forward to helping businesses deliver personalized search solutions powered by Amazon Personalize.”

– Andy Huang, Head of AI/ML at Cognizant Servian.

In this post, we show you how search results are personalized for a given user and how they vary as you adjust the personalization weight: a value closer to 0 places less emphasis on personalization, while a value closer to 1 re-ranks the search results with a higher level of personalization.
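
To make the weight concrete, the following Python sketch (using the requests library) shows roughly how a search pipeline with the plugin’s response processor might be created and queried. The endpoint, index, field names, credentials, and ARNs are placeholders tied to this post’s example dataset, and the exact processor parameters should be verified against the plugin documentation for your OpenSearch version.

import requests

DOMAIN = "https://my-opensearch-domain.example.com"  # placeholder endpoint
AUTH = ("admin", "password")                         # placeholder; use SigV4 or your own auth

# Create a search pipeline whose response processor asks Amazon Personalize
# to re-rank the results returned by OpenSearch. All ARNs are placeholders.
pipeline = {
    "response_processors": [
        {
            "personalized_search_ranking": {
                "campaign_arn": "arn:aws:personalize:us-west-2:111122223333:campaign/my-ranking-campaign",
                "recipe": "aws-personalized-ranking",
                "item_id_field": "Item_ID",
                "weight": 0.3,  # 0.0 = no personalization, 1.0 = fully personalized
                "iam_role_arn": "arn:aws:iam::111122223333:role/my-opensearch-personalize-role",
                "aws_region": "us-west-2",
            }
        }
    ]
}
requests.put(f"{DOMAIN}/_search/pipeline/personalize_pipeline", json=pipeline, auth=AUTH)

# Run the "Grooming" query through the pipeline, passing the user ID so the
# plugin can fetch that user's personalized ranking from Amazon Personalize.
query = {
    "query": {"multi_match": {"query": "Grooming", "fields": ["Item_Name", "Description"]}},
    "ext": {"personalize_request_parameters": {"user_id": "example-user-id"}},
}
response = requests.post(
    f"{DOMAIN}/products/_search",
    params={"search_pipeline": "personalize_pipeline"},
    json=query,
    auth=AUTH,
)
for hit in response.json()["hits"]["hits"][:5]:
    print(hit["_source"]["Item_Name"])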

Example use cases

To explore the impact of this new feature in greater detail, let’s review an example using a dataset from the Retail Demo Store.

First, we use OpenSearch Service to get search results for the search query “Grooming.” When the personalization weight is set to 0.0, no personalization takes place. As shown in the following table, the top five search results from OpenSearch Service show the grooming items with a higher gender affinity towards women (refer to the Gender_Affinity column, where M stands for male and F stands for female).

Rank | Item_ID | Item_Name | Description | Gender_Affinity
1 | 1bcb66c4-ee9d-4c0c-ba53-168cb243569f | Women’s Grooming Kit | A must-have in every bathroom | F
2 | f91ec34f-a08e-4408-8bb0-592bdd09375c | Besto Hairbrush for Women | Soft brush for everyday use | F
3 | 4296626c-fbb0-42b4-9a50-b6c6c16095f3 | Makeup Brush Kit | This nifty makeup brush kit is essential in ev… | F
4 | 09920b2e-4e07-41f7-aca6-47744777a2a7 | Trendy Razor | A must-have in every bathroom | F
5 | 39945ad0-57c9-4c28-a69c-532d5d167202 | Makeup Brushes | Makeup brushes for every bathroom | F
6 | 1bfbe5c7-6f02-4465-82f1-6083a4b302c0 | Premium Men’s Razor | Razor for every bathroom | M
7 | 6d5b3f03-ade6-42f7-969d-acd1f2162332 | 5-Blade Razor for Men | Razor for every bathroom | M
8 | 83095a08-2968-4275-a375-4fab404df7ac | Fusion5 Razers for Men | Razor for every bathroom | M
9 | afdd9c41-2762-45bf-b6a7-e3fb8f1b34ba | Minimalistic Razor | A must-have in every bathroom | M
10 | 5dbc7cb7-39c5-4795-9064-d1655d78b3ca | Razor Brand for Men | Razor for every bathroom | M

Let’s suppose that a user with gender M (male) performs a search using the same query for “Grooming.” When the personalization weight is set to 0.3, the items with a gender affinity towards men get a subtle boost in ranking. In this example, Premium Men’s Razor, which was originally ranked number 6 in the previous table by OpenSearch Service, gets boosted to rank 2 in the updated table. Similarly, Razor Brand for Men shows up higher in position (rank 6) despite being the lowest-ranked item in the previous table.

Rank | Item_ID | Item_Name | Description | Gender_Affinity
1 | 1bcb66c4-ee9d-4c0c-ba53-168cb243569f | Women’s Grooming Kit | A must-have in every bathroom | F
2 | 1bfbe5c7-6f02-4465-82f1-6083a4b302c0 | Premium Men’s Razor | Razor for every bathroom | M
3 | f91ec34f-a08e-4408-8bb0-592bdd09375c | Besto Hairbrush for Women | Soft brush for everyday use | F
4 | 4296626c-fbb0-42b4-9a50-b6c6c16095f3 | Makeup Brush Kit | This nifty makeup brush kit is essential in ev… | F
5 | 09920b2e-4e07-41f7-aca6-47744777a2a7 | Trendy Razor | A must-have in every bathroom | F
6 | 5dbc7cb7-39c5-4795-9064-d1655d78b3ca | Razor Brand for Men | Razor for every bathroom | M
7 | 39945ad0-57c9-4c28-a69c-532d5d167202 | Makeup Brushes | Makeup brushes for every bathroom | F
8 | afdd9c41-2762-45bf-b6a7-e3fb8f1b34ba | Minimalistic Razor | A must-have in every bathroom | M
9 | 83095a08-2968-4275-a375-4fab404df7ac | Fusion5 Razers for Men | Razor for every bathroom | M
10 | 6d5b3f03-ade6-42f7-969d-acd1f2162332 | 5-Blade Razor for Men | Razor for every bathroom | M

Next, we fine-tune the personalization weight to a value of 0.8 to get more personalized search results for “Grooming.” In the following table, the top four items in the search results are highly suited for men. Premium Men’s Razor and Razor Brand for Men shoot up further in rank. We also see other grooming items such as Minimalistic Razor and Fusion5 Razers for Men surfaced at the top of the search results even though they had a lower ranking in our first query.

Rank | Item_ID | Item_Name | Description | Gender_Affinity
1 | 1bfbe5c7-6f02-4465-82f1-6083a4b302c0 | Premium Men’s Razor | Razor for every bathroom | M
2 | 5dbc7cb7-39c5-4795-9064-d1655d78b3ca | Razor Brand for Men | Razor for every bathroom | M
3 | afdd9c41-2762-45bf-b6a7-e3fb8f1b34ba | Minimalistic Razor | A must-have in every bathroom | M
4 | 83095a08-2968-4275-a375-4fab404df7ac | Fusion5 Razers for Men | Razor for every bathroom | M
5 | 1bcb66c4-ee9d-4c0c-ba53-168cb243569f | Women’s Grooming Kit | A must-have in every bathroom | F
6 | f91ec34f-a08e-4408-8bb0-592bdd09375c | Besto Hairbrush for Women | Soft brush for everyday use | F
7 | 6d5b3f03-ade6-42f7-969d-acd1f2162332 | 5-Blade Razor for Men | Razor for every bathroom | M
8 | 09920b2e-4e07-41f7-aca6-47744777a2a7 | Trendy Razor | A must-have in every bathroom | F
9 | 39945ad0-57c9-4c28-a69c-532d5d167202 | Makeup Brushes | Makeup brushes for every bathroom | F
10 | 4296626c-fbb0-42b4-9a50-b6c6c16095f3 | Makeup Brush Kit | This nifty makeup brush kit is essential in ev… | F

For more details on how to implement personalized search with OpenSearch Service, refer to Personalizing search results from OpenSearch.

Conclusion

With the new Amazon Personalize Search Ranking plugin, customers of both self-managed OpenSearch and OpenSearch Service v2.9 and above can boost relevant items in their search results by including signals from each user’s history, context, and preferences. The plugin enables you to exercise greater control over the level of personalization for each user and query type, and improve the overall search experience for your users.

For more details on Amazon Personalize, refer to the Amazon Personalize Developer Guide.


About the Authors


Shreeya Sharma is a Sr. Technical Product Manager working with AWS AI/ML on the Amazon Personalize team. She has a background in computer science engineering, technology consulting, and data analytics.

Ketan Kulkarni is a Software Development Engineer with the Amazon Personalize team focused on building AI-powered recommender systems at scale. In his spare time, he enjoys reading and traveling.

Prashant Mishra is a Software Development Engineer on the Amazon Personalize team.

Branislav Kveton is a Principal Scientist at AWS AI Labs. He proposes, analyzes, and applies algorithms that learn incrementally, run in real time, and converge to near optimal solutions as the number of observations increases.

Read More

People of AI: Season 2

Posted by Ashley Oldacre

If you are joining us for the first time, you can binge listen to our amazing 8 episodes from Season 1 wherever you get your podcasts.

We are back for another season of People of AI with a new lineup of incredible guests! I am so excited to introduce my new co-host Luiz Gustavo Martins as we meet inspiring people with interesting stories in the field of Artificial Intelligence.

Last season we focused on the incredible journeys that our guests took to get into the field of AI. Through our stories, we highlighted that no matter who you are, what your interests are, or what you work on, there is a place for anyone to get into this field. We also explored how much more accessible the technology has become over the years, as well as the importance of building AI-related products responsibly and ethically. It is easier than ever to use tools, platforms and services powered by machine learning to leverage the benefits of AI, and break down the barrier to entry.

For season 2, we will feature amazing conversations, focusing on Generative AI! Specifically, we will be discussing the explosive growth of Generative AI tools and the major technology shift that has happened in recent months. We will dive into various topics to explore areas where Generative AI can contribute tremendous value, as well as boost both productivity and economic growth. We will also continue to explore the personal paths and career development of this season’s guests as they share how their interest in technology was sparked, how they worked hard to get to where they are today, and explore what it is that they are currently working on.

Starting today, we will release one new episode of season 2 per week. Listen to the first episode on the People of AI site or wherever you get your podcasts. And stay tuned for later in the season when we premiere our first video podcasts as well!

  • Episode 1: meet your hosts, Ashley and Gus, and learn about Generative AI, Bard, and the big shift that has dramatically changed the industry. 
  • Episode 2: meet Sunita Verma, a long-time Googler, as she shares her personal journey from Engineering to CS, and into Google. As an early pioneer of AI and Google Ads, we will talk about the evolution of AI and how Generative AI will transform the way we work. 
  • Episode 3: meet Sayak Paul, a Google Developer Expert (GDE) as we explore what it means to be a GDE and how to leverage the power of your community through community contributions. 
  • Episode 4: meet Crispin Velez, the lead for Cloud’s Vertex AI as we dig into his experience in Cloud working with customers and partners on how to integrate and deploy AI. We also learn how he grew his AI developer community in LATAM from scratch. 
  • Episode 5: meet Joyce Shen, venture capital/private equity investor. She shares her fascinating career in AI and how she has worked with businesses to spot AI talent, incorporate AI technology into workflows and implement responsible AI into their products. 
  • Episode 6: meet Anne Simonds and Brian Gary, founders of Muse https://www.museml.com. Join us as we talk about their recent journeys into AI and their new company which uses the power of Generative AI to spark creativity. 
  • Episode 7: meet Tulsee Doshi, product lead for Google’s Responsible AI efforts, as we discuss the development of Google-wide resources and best practices for developing more inclusive, diverse, and ethical algorithm-driven products. 
  • Episode 8: meet Jeanine Banks, Vice President and General Manager of Google Developer X and Head of Developer Relations. Join us as we debunk AI and get down to what Generative AI really is, how it has changed over the past few months and will continue to change the developer landscape. 
  • Episode 9: meet Simon Tokumine, Director of Product Management at Google. We will talk about how AI has brought us into the era of task-oriented products and is fueling a new community of makers.

Listen now to the first episode of Season 2. We can’t wait to share the stories of these exceptional People of AI with you!

This podcast is sponsored by Google. Any remarks made by the speakers are their own and are not endorsed by Google.

Read More

Goal Representations for Instruction Following

A longstanding goal of the field of robot learning has been to create generalist agents that can perform tasks for humans. Natural language has the potential to be an easy-to-use interface for humans to specify arbitrary tasks, but it is difficult to train robots to follow language instructions. Approaches like language-conditioned behavioral cloning (LCBC) train policies to directly imitate expert actions conditioned on language, but require humans to annotate all training trajectories and generalize poorly across scenes and behaviors. Meanwhile, recent goal-conditioned approaches perform much better at general manipulation tasks, but do not enable easy task specification for human operators. How can we reconcile the ease of specifying tasks through LCBC-like approaches with the performance improvements of goal-conditioned learning?

Striking Performance: Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows

Generative AI is one of the most important trends in the history of personal computing, bringing advancements to gaming, creativity, video, productivity, development and more.

And GeForce RTX and NVIDIA RTX GPUs, which are packed with dedicated AI processors called Tensor Cores, are bringing the power of generative AI natively to more than 100 million Windows PCs and workstations.

Today, generative AI on PC is getting up to 4x faster via TensorRT-LLM for Windows, an open-source library that accelerates inference performance for the latest AI large language models, like Llama 2 and Code Llama. This follows the announcement of TensorRT-LLM for data centers last month.

NVIDIA has also released tools to help developers accelerate their LLMs, including scripts that optimize custom models with TensorRT-LLM, TensorRT-optimized open-source models and a developer reference project that showcases both the speed and quality of LLM responses.

TensorRT acceleration is now available for Stable Diffusion in Automatic1111’s popular Web UI distribution. It speeds up the generative AI diffusion model by up to 2x over the previous fastest implementation.

Plus, RTX Video Super Resolution (VSR) version 1.5 is available as part of today’s Game Ready Driver release — and will be available in the next NVIDIA Studio Driver, releasing early next month.

Supercharging LLMs With TensorRT

LLMs are fueling productivity — engaging in chat, summarizing documents and web content, drafting emails and blogs — and are at the core of new pipelines of AI and other software that can automatically analyze data and generate a vast array of content.

TensorRT-LLM, a library for accelerating LLM inference, gives developers and end users the benefit of LLMs that can now operate up to 4x faster on RTX-powered Windows PCs.

At higher batch sizes, this acceleration significantly improves the experience for more sophisticated LLM use — like writing and coding assistants that output multiple, unique auto-complete results at once. The result is accelerated performance and improved quality that lets users select the best of the bunch.

TensorRT-LLM acceleration is also beneficial when integrating LLM capabilities with other technology, such as in retrieval-augmented generation (RAG), where an LLM is paired with a vector library or vector database. RAG enables the LLM to deliver responses based on a specific dataset, like user emails or articles on a website, to provide more targeted answers.
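
The mechanics behind that pairing are straightforward: embed the reference documents, retrieve the ones closest to the question, and pass them to the model alongside the question. The sketch below illustrates only the pattern; the embed and generate functions are stand-ins for whatever embedding model and TensorRT-LLM-accelerated LLM you actually deploy.

import numpy as np

def embed(texts):
    # Placeholder embedding function; swap in a real sentence-embedding model.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def generate(prompt):
    # Placeholder LLM call; swap in your TensorRT-LLM-accelerated Llama 2 endpoint.
    return f"(model response to a {len(prompt)}-character prompt)"

# 1) Index: embed the knowledge base (e.g., recent GeForce news articles).
docs = [
    "Placeholder article text about NVIDIA ACE and how it generates responses.",
    "Placeholder article text about TensorRT-LLM acceleration on RTX GPUs.",
]
doc_vectors = embed(docs)

# 2) Retrieve: find the documents closest to the question.
question = "How does NVIDIA ACE generate emotional responses?"
q = embed([question])[0]
scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
top_docs = [docs[i] for i in np.argsort(-scores)[:2]]

# 3) Generate: ground the answer in the retrieved context.
prompt = "Answer using only this context:\n" + "\n".join(top_docs) + f"\n\nQuestion: {question}"
print(generate(prompt))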

To show this in practical terms, when the question “How does NVIDIA ACE generate emotional responses?” was asked of the Llama 2 base model, it returned an unhelpful response.

Better responses, faster.

Conversely, using RAG with recent GeForce news articles loaded into a vector library and connected to the same Llama 2 model not only returned the correct answer — using NeMo SteerLM — but did so much quicker with TensorRT-LLM acceleration. This combination of speed and proficiency gives users smarter solutions.

TensorRT-LLM will soon be available to download from the NVIDIA Developer website. TensorRT-optimized open source models and the RAG demo with GeForce news as a sample project are available at ngc.nvidia.com and GitHub.com/NVIDIA.

Automatic Acceleration

Diffusion models, like Stable Diffusion, are used to imagine and create stunning, novel works of art. Image generation is an iterative process that can take hundreds of cycles to achieve the perfect output. When done on an underpowered computer, this iteration can add up to hours of wait time.

TensorRT is designed to accelerate AI models through layer fusion, precision calibration, kernel auto-tuning and other capabilities that significantly boost inference efficiency and speed. This makes it indispensable for real-time applications and resource-intensive tasks.

And now, TensorRT doubles the speed of Stable Diffusion.

Compatible with the most popular distribution, WebUI from Automatic1111, Stable Diffusion with TensorRT acceleration helps users iterate faster and spend less time waiting on the computer, delivering a final image sooner. On a GeForce RTX 4090, it runs 7x faster than the top implementation on Macs with an Apple M2 Ultra. The extension is available for download today.

The TensorRT demo of a Stable Diffusion pipeline provides developers with a reference implementation on how to prepare diffusion models and accelerate them using TensorRT. This is the starting point for developers interested in turbocharging a diffusion pipeline and bringing lightning-fast inferencing to applications.

Video That’s Super

AI is improving everyday PC experiences for all users. Streaming video — from nearly any source, like YouTube, Twitch, Prime Video, Disney+ and countless others — is among the most popular activities on a PC. Thanks to AI and RTX, it’s getting another update in image quality.

RTX VSR is a breakthrough in AI pixel processing that improves the quality of streamed video content by reducing or eliminating artifacts caused by video compression. It also sharpens edges and details.

Available now, RTX VSR version 1.5 further improves visual quality with updated models, de-artifacts content played in its native resolution and adds support for RTX GPUs based on the NVIDIA Turing architecture — both professional RTX and GeForce RTX 20 Series GPUs.

Retraining the VSR AI model helped it learn to accurately identify the difference between subtle details and compression artifacts. As a result, AI-enhanced images more accurately preserve details during the upscaling process. Finer details are more visible, and the overall image looks sharper and crisper.

RTX Video Super Resolution v1.5 improves detail and sharpness.

New with version 1.5 is the ability to de-artifact video played at the display’s native resolution. The original release only enhanced video when it was being upscaled. Now, for example, 1080p video streamed to a 1080p resolution display will look smoother as heavy artifacts are reduced.

RTX VSR now de-artifacts video played at its native resolution.

RTX VSR 1.5 is available today for all RTX users in the latest Game Ready Driver. It will be available in the upcoming NVIDIA Studio Driver, scheduled for early next month.

RTX VSR is among the NVIDIA software, tools, libraries and SDKs — like those mentioned above, plus DLSS, Omniverse, AI Workbench and others — that have helped bring over 400 AI-enabled apps and games to consumers.

The AI era is upon us. And RTX is supercharging at every step in its evolution.

Read More

NVIDIA RTX Video Super Resolution Update Enhances Video Quality, Detail Preservation and Expands to GeForce RTX 20 Series GPUs

NVIDIA today announced an update to RTX Video Super Resolution (VSR) that delivers greater overall graphical fidelity with preserved details, upscaling for native videos and support for GeForce RTX 20 Series desktop and laptop GPUs.

For AI assists from RTX VSR and more — from enhanced creativity and productivity to blisteringly fast gaming — check out the RTX for AI page.

Plus, this week In the NVIDIA Studio, Twitch personality Runebee shares her inspiration, streaming tips and how she uses AI and RTX GPU acceleration.

And don’t forget to join the #SeasonalArtChallenge by submitting spooky Halloween-themed art in October and harvest- and fall-themed pieces in November. For inspiration, check out the hauntingly adorable work of artists like iryna.blender3d on Twitter.

The Super RTX VSR Update 1.5

RTX VSR’s AI model has been retrained to more accurately identify the difference between subtle details and compression artifacts to better preserve image details during the upscaling process. Finer details are more visible, and the overall image looks sharper and crisper than before.

RTX VSR v1.5 improves detail and sharpness.

RTX VSR version 1.5 will also de-artifact videos played at their native resolution — prior, only upscaled video could be enhanced. Providing a leap in graphical fidelity for laptop owners with 1080p screens, the updated RTX VSR makes 1080p resolution, which is popular for content and displays, look smoother at its native resolution, even with heavy artifacts.

RTX VSR now de-artifacts video played at native resolution.

And with expanded RTX VSR support, owners of GeForce RTX 20 Series GPUs can benefit from the same AI-enhanced video as those using RTX 30 and 40 Series GPUs.

RTX VSR 1.5 is available as part of the latest Game Ready Driver, available for download today. Content creators downloading NVIDIA Studio Drivers — designed to enhance features, reduce repetitiveness and dramatically accelerate creative workflows — can install the driver with RTX VSR releasing in early November.

Runebee-lievable Streaming

Runebee has been livestreaming for over 10 years, providing a space for viewers to hang out and talk about games, movies or whatever else is going on in life. Over the years, she’s realized how common a desire for escapism is.

“Things aren’t always sunshine and rainbows, so it’s nice to have some company that can help take your mind off things,” said Runebee.

Runebee has amassed over 100K followers on Twitch, YouTube and Instagram, crediting her success to thorough preparation of her setup. Her technology-forward approach ensures efficiency and reliability — allowing her focus to be on performance.

“There’s a lot of planning involved in streaming, but at the end of the day, hitting the ‘start streaming’ button is the most important step, and NVIDIA GPU-acceleration is a massive factor in allowing it to go as smoothly as it does,” said Runebee.

“I never thought I’d have this smooth of a stream just by upgrading to a GeForce RTX 40 Series GPU.” – Runebee

OBS is Runebee’s preferred open-source software for video recording and livestreaming on Twitch. For maximum efficiency, Runebee deploys her GeForce RTX 4080 GPU, taking advantage of the eighth-generation NVIDIA encoder, NVENC, to independently encode video, which frees up the graphics card to focus on livestreaming.

“Streaming games and running OBS used to kill my CPU, and NVENC has taken so much stress off,” said Runebee. “I was hardly even able to stream PC games until I switched to NVENC.”

For livestreamers, RTX 40 Series GPUs can offer support for real-time AV1 hardware encoding, providing a 40% efficiency boost compared to H.264 and delivering higher quality than competing GPUs.

“As I started building more PCs with NVIDIA GPUs, I never had a reason to switch!” – Runebee

Runebee can export recordings of her livestreams with Adobe Premiere Pro in half the normally required time thanks to GeForce RTX 40 Series dual encoders working together, dividing the work evenly to double output.

They’re capable of recording up to 8K, 60 frames per second content in real time via GeForce Experience and OBS Studio.

Always looking to improve her livestreaming process, Runebee plans on experimenting with the NVIDIA Broadcast app, which transforms any room into a home studio by upgrading standard webcams, microphones and speakers into premium smart devices using the power of AI.

Runebee encourages those interested in livestreaming to at least give their potential passion project a shot. “It’s a great way to meet tons of new friends, become more articulate at describing the things you love — be it games or movies — and cultivate a community to share your passions with,” she said.

Twitch livestreamer Runebee’s setup.

Follow Runebee on Twitch.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. See notice regarding software product information.

Read More

Compiling NumPy code into C++ or CUDA via torch.compile

Quansight engineers have implemented support for tracing through NumPy code via
torch.compile in PyTorch 2.1. This feature leverages PyTorch’s compiler to
generate efficient fused vectorized code without having to modify your original
NumPy code. Even more, it also allows for executing NumPy code on CUDA
just by running it through torch.compile under torch.device("cuda")!

In this post, we go over how to use this feature and give a few tips and tricks
to make the most out of it.

Compiling NumPy code into Parallel C++

We take as our running example one step in a K-Means algorithm.
This piece of code is borrowed from this NumPy book

import numpy as np

def kmeans(X, means):
    return np.argmin(np.linalg.norm(X - means[:, None], axis=2), axis=0)

We create a synthetic dataset with 20M random 2-D points. We can see that,
given that the means are chosen appropriately, the function returns the correct
cluster for all of them

npts = 10_000_000
X = np.repeat([[5, 5], [10, 10]], [npts, npts], axis=0)
X = X + np.random.randn(*X.shape)  # 2 distinct "blobs"
means = np.array([[5, 5], [10, 10]])
np_pred = kmeans(X, means)

Benchmarking this function gives us a baseline of 1.26s on an AMD 3970X CPU.

Compiling this function is now as easy as wrapping it with torch.compile and
executing it with the example inputs

import torch

compiled_fn = torch.compile(kmeans)
compiled_pred = compiled_fn(X, means)
assert np.allclose(np_pred, compiled_pred)

The compiled function yields a 9x speed-up when running it on 1 core. Even better, as opposed to NumPy, our generated code does take advantage of all the cores in a processor. As such, when we run it on 32 cores, we get a 57x speed-up. Note that PyTorch always uses all the available cores unless explicitly restricted, so this is the default behavior you get when using torch.compile.
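
To reproduce rough timings on your own hardware (exact numbers will of course differ), a simple wall-clock comparison of the eager and compiled functions, reusing kmeans, compiled_fn, X, and means from above, might look like this

import time

def bench(fn, *args, reps=5):
    fn(*args)  # warm-up call so compilation time is not measured
    start = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - start) / reps

print(f"eager NumPy:   {bench(kmeans, X, means):.3f}s")
print(f"torch.compile: {bench(compiled_fn, X, means):.3f}s")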

We may inspect the generated C++ code by running the script with the
environment variable TORCH_LOGS=output_code. When doing so, we can see that
torch.compile was able to compile the broadcasting and the two reductions
into just one for-loop, and parallelize it using OpenMP

extern "C" void kernel(const double* in_ptr0, const long* in_ptr1, long* out_ptr0) {
    #pragma omp parallel num_threads(32)
    #pragma omp for
    for(long i0=0L; i0<20000000L; i0+=1L) {
        auto tmp0 = in_ptr0[2L*i0];
        auto tmp1 = in_ptr1[0L];
        auto tmp5 = in_ptr0[1L + (2L*i0)];
        auto tmp6 = in_ptr1[1L];
        // Rest of the kernel omitted for brevity

Compiling NumPy code into CUDA

Compiling our code so that it runs on CUDA is as simple as setting the
default device to be CUDA

with torch.device("cuda"):
    cuda_pred = compiled_fn(X, means)
assert np.allclose(np_pred, cuda_pred)

By inspecting the generated code via TORCH_LOGS=output_code, we see that,
rather than generating CUDA code directly, torch.compile generates rather
readable triton code

def triton_(in_ptr0, in_ptr1, out_ptr0, XBLOCK : tl.constexpr):
    xnumel = 20000000
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (2*x0), xmask)
    tmp1 = tl.load(in_ptr1 + (0))
    # Rest of the kernel omitted for brevity

Running this small snippet on an RTX 2060 gives an 8x speed-up over the
original NumPy code. This is something, but it is not particularly impressive,
given the speed-ups we have seen on CPU. Let’s have a look into how to squeeze
the most out of our GPU via a couple minor changes.

float64 vs float32. Many GPUs, in particular consumer-grade ones, are
rather sluggish when running operations on float64. For this reason, changing
the data generation to float32 makes the original NumPy code just a bit
faster, by about 9%, but our CUDA code gets 40% faster, yielding an 11x
speed-up over the plain NumPy code.

torch.compile, by default, respects the NumPy semantics, and as such, it uses
np.float64 as its default dtype for all its creation ops. As discussed, this
can hinder performance, so it is possible to change this default by setting

from torch._dynamo import config
config.numpy_default_float = "float32"
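
Alternatively, for this example you can simply cast the inputs themselves to single precision before calling the compiled function, which is what “changing the data generation to float32” amounts to here (a small sketch reusing X, means, and np_pred from above)

X32 = X.astype(np.float32)
means32 = means.astype(np.float32)
compiled_pred32 = compiled_fn(X32, means32)
assert np.allclose(np_pred, compiled_pred32)  # the predicted clusters should still match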

CPU <> CUDA copies. An 11x speed-up is good, but it is not even close to
the CPU numbers. This is caused by a small transformation that torch.compile
does behind the scenes. The code above takes NumPy arrays and returns NumPy
arrays. All of these arrays are on CPU, but the computations are performed on
the GPU. This means that every time the function is called, torch.compile has
to copy all these arrays from CPU to the GPU, and then copy the result back to
CPU to preserve the original semantics. There is no native solution to this
issue in NumPy, as NumPy does not have the notion of a device. That being
said, we can work around it by creating a wrapper to this function so that it
accepts PyTorch tensors and returns PyTorch tensors.

@torch.compile
def tensor_fn(X, means):
    X, means = X.numpy(), means.numpy()
    ret = kmeans(X, means)
    return torch.from_numpy(ret)

def cuda_fn(X, means):
    with torch.device("cuda"):
        return tensor_fn(X, means)

This function now takes tensors in CUDA memory and returns tensors in CUDA
memory, but the function itself is written in NumPy! torch.compile uses the
numpy() and the from_numpy() calls as hints, and optimizes them away, and
internally it simply works with PyTorch tensors without moving the memory at
all. When we keep the tensors in CUDA and perform the computations in
float32, we see a 200x speed-up over the initial NumPy implementation on
float32 arrays.
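
For completeness, here is a sketch of how the wrapper might be called with the data already resident on the GPU, reusing X, means, and np_pred from above

# Move the data to the GPU once, in float32, and keep it there across calls.
X_t = torch.from_numpy(X).to(device="cuda", dtype=torch.float32)
means_t = torch.from_numpy(means).to(device="cuda", dtype=torch.float32)

cuda_pred = cuda_fn(X_t, means_t)  # input and output stay in CUDA memory
assert np.allclose(np_pred, cuda_pred.cpu().numpy())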

Mixing NumPy and PyTorch. In this example, we had to write a small adaptor
to convert tensors to ndarrays and then back to tensors. In programs that mix
PyTorch and NumPy, converting a tensor into an ndarray is often implemented as
x.detach().cpu().numpy(), or simply x.numpy(force=True). Since we can run
NumPy code in CUDA when running under torch.compile, we can implement this
conversion pattern as a call to x.numpy(), as we did above. Doing so and
running the resulting code under device("cuda") will generate efficient CUDA
code from the original NumPy calls without copying the data from CUDA to CPU at
all. Note that the resulting code does not run without torch.compile. For it
to run in eager mode, one would need to roll back to x.numpy(force=True).

Further Speed-up tricks

General advice. The CUDA code we have shown is already quite efficient, but
it is true that the running example is rather short. When dealing with larger
programs, we may need to tweak parts of it to make it more efficient. A good
place to start is the multiple tutorials and FAQs for torch.compile.
These showcase a number of ways to inspect the tracing process and how to
identify problematic code that may cause slowdowns.

Advice when compiling NumPy code. NumPy, even if rather similar to PyTorch,
is often used very differently. It is rather common to perform computations in
NumPy and then do an if/else depending on values within the array, or perform
operations in-place, perhaps via boolean masks. These constructions, while
supported by torch.compile, hamper its performance. Changes like writing the
code in a branchless way to avoid graph breaks, or avoiding in-place ops can go
a long way.
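
As a small, hypothetical illustration of the branchless advice, the two functions below compute the same result, but the first one branches on a value inside the array (forcing a graph break), while the second expresses the same logic with array operations only

import numpy as np
import torch

@torch.compile
def clamp_and_scale_branchy(x):
    # Data-dependent Python branch: torch.compile must break the graph here.
    if (x < 0).any():
        x = np.where(x < 0, 0.0, x)
    return x * 2

@torch.compile
def clamp_and_scale_branchless(x):
    # Same result, expressed with array ops only, so it traces into a single graph.
    return np.where(x < 0, 0.0, x) * 2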

To write fast NumPy code, it is best to avoid loops, but sometimes they are
unavoidable. When tracing through a loop, torch.compile will try to fully
unroll it. This is sometimes desirable, but sometimes it may not even be
possible, like when we have a dynamic stopping condition, like in a while loop.
In these cases, it may be best to just compile the body of the loop, perhaps a
few iterations at a time (loop unrolling).
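
For instance, with a data-dependent while loop, one option is to keep the loop itself in eager Python and compile only its body, as in this small sketch

import numpy as np
import torch

@torch.compile
def newton_step(x):
    # One iteration of Newton's method for sqrt(2); only this body is compiled.
    return 0.5 * (x + 2.0 / x)

def solve(x):
    # The dynamic stopping condition stays in eager mode.
    while np.abs(x * x - 2.0).max() > 1e-8:
        x = newton_step(x)
    return x

print(solve(np.ones(4)))  # converges to sqrt(2) in a handful of iterations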

Debugging NumPy code. Debugging is rather tricky when a compiler is
involved. To figure out whether an error you are hitting is a torch.compile
error or an error from the program, you can execute your NumPy program without
torch.compile by replacing the NumPy import with import torch._numpy as np.
This should only be used for debugging purposes and is in no way a
replacement for the PyTorch API, as it is much slower and, as a private API,
may change without notice. See also this FAQ for other tricks.

Differences between NumPy and torch.compile NumPy

NumPy scalars. NumPy returns NumPy scalars in almost any case where PyTorch
would return a 0-D tensor (e.g. from np.sum). Under torch.compile, NumPy
scalars are treated as 0-D arrays. This is just fine in most cases. The only
case when their behavior diverges is when NumPy scalars are implicitly used as
Python scalars. For example,

>>> np.asarray(2) * [1, 2, 3]  # 0-D array is an array-like
array([2, 4, 6])
>>> u = np.int32(2)
>>> u * [1, 2, 3]              # scalar decays into a Python int
[1, 2, 3, 1, 2, 3]
>>> torch.compile(lambda: u * [1, 2, 3])()
array([2, 4, 6])               # acts as a 0-D array, not as a scalar ?!?!

If we compile the first two lines, we see that torch.compile treats u as a
0-D array. To recover the eager semantics, we just need to make the casting
explicit

>>> torch.compile(lambda: int(u) * [1, 2, 3])()
[1, 2, 3, 1, 2, 3]

Type promotion and versioning. NumPy’s type promotion rules may be, at
times, a bit surprising

>>> np.zeros(1, dtype=np.int8) + 127
array([127], dtype=int8)
>>> np.zeros(1, dtype=np.int8) + 128
array([128], dtype=int16)

NumPy 2.0 is changing these rules to follow others that are closer to those of
PyTorch. The relevant technical document is NEP 50.
torch.compile went ahead and implemented NEP 50 rather than the about-to-be-deprecated rules.

In general, NumPy within torch.compile follows NumPy 2.0 pre-release.

Beyond NumPy: SciPy and scikit-learn

In parallel to this effort of making torch.compile understand NumPy code,
other Quansight engineers have designed and proposed a way to support PyTorch
tensors within scikit-learn and SciPy. This was received enthusiastically by
other maintainers from these libraries, as it was shown that using PyTorch as a
backend would often yield considerable speed-ups. Both projects have now merged
initial support for PyTorch tensors across a number of APIs and submodules.

This lays a stepping stone towards a future where PyTorch tensors can
be used within other libraries in the Python data ecosystem. Even more, it
will enable running these other libraries on GPUs and even compiling code
that mixes these libraries and PyTorch, similar to what we have discussed in
this post.

If you want to learn more about this effort, how to use it, or how to help
move it forward, see this other blogpost.

Conclusion

PyTorch has committed since its inception to be a framework compatible with the
rest of the Python ecosystem. Enabling compiling NumPy programs, and
establishing the tools necessary to do the same for other prominent libraries
are two more steps in this direction. Quansight and Meta continue working hand
in hand, improving the compatibility between PyTorch and the rest of the
ecosystem.

From Quansight, we would like to thank Mengwei, Voz, and Ed for their
invaluable help in integrating our work with torch.compile. We would also
like to thank Meta for funding this project as well as previous work on
improving NumPy compatibility within PyTorch, and the project that led to
supporting PyTorch within scikit-learn and SciPy. These are giant leaps towards
consolidating PyTorch as the framework of choice within the open source Python
data ecosystem.

Read More

Huawei Joins the PyTorch Foundation as a Premier Member

Today, the PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, announced that Huawei has joined as a premier member.

Huawei has been a long-standing supporter of and contributor to the PyTorch ecosystem, and through its progressive work on diverse computing it has made the PyTorch ecosystem easier for more hardware vendors to access. By joining as a premier member, Huawei will continue to optimize PyTorch to fully unleash Ascend computing capabilities.

“We are delighted to join the PyTorch Foundation, and hope to further collaborate with other member companies and expand the community to a wider audience,” said Zhang Dixuan, President of Huawei Ascend Computing Business. “This move benefits Huawei, PyTorch, and the wider AI ecosystem. It also aligns with our long-held beliefs in openness, innovation, collaboration, and shared success, and we are confident that it will spur new innovations in the global AI community.”

Huawei unveiled the All Intelligence strategy to accelerate intelligence across all industries. To meet the demand for AI computing, Huawei invests in system-level technologies, an approach centered on open hardware and software that enables partners and fosters talent. This strategy aligns with the PyTorch Foundation’s mission to develop AI as part of a sustainable open source ecosystem and to produce inclusive technological advances.

PyTorch Foundation Executive Director Ibrahim Haddad said, “We are delighted to welcome Huawei to the PyTorch Foundation. Huawei is a leading body in researching computer vision, natural language processing, speech recognition, and other emerging areas, and has proven experience in the field of foundation models. We have no doubt that we will benefit from their support and guidance.”

As a premier member, Huawei is granted one seat on the PyTorch Foundation Governing Board, and will help set policies, bylaws, and mission and vision statements that define the overarching scope of the PyTorch Foundation’s initiatives, technical vision, and direction.

The Board welcomes Huawei representative Fred Li, Head of Computing Open Source Development Team at Huawei. Fred leads an active and creative team in R&D and operations projects under the principle of “upstream first”, which aims to make diverse computing power ubiquitous.

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

About Huawei

Founded in 1987, Huawei is a leading global provider of information and communications technology (ICT) infrastructure and smart devices. We have 207,000 employees and operate in over 170 countries and regions, serving more than three billion people around the world. We are committed to bringing digital to every person, home and organization for a fully connected, intelligent world.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page. Linux is a registered trademark of Linus Torvalds.



Read More