<Incoming Transmission> Epic Games is bringing a new Fortnite reward to GeForce NOW, available to all members. Drop from the Battle Bus in Fortnite on GeForce NOW between today and Thursday, Aug. 4, to earn “The Dish-stroyer Pickaxe” in game for free.
<Transmission continues> Members can earn this item by streaming Fortnite on GeForce NOW on their PCs, Macs, Chromebooks, SHIELD TVs and with an optimized touch experience on iOS Safari and Android mobile devices. Thanks to the power of Epic and GeForce servers, all GeForce NOW members can take the action wherever they go.
Plus, nine new titles arrive on GeForce NOW this week, joining Fortnite and 1,300+ other games streaming at GeForce quality.
<bZZZt bZZZt> Whoops, sorry there, almost lost the signal. It’s coming through loud and clear now for a jam-packed GFN Thursday.
Dish It Out With This In-Game Reward
Bring home the big wins with the free “Dish-stroyer Pickaxe,” a reward available to GeForce NOW members who stream Fortnite any time between today at noon Eastern and Thursday, Aug. 4, 11:59 p.m. Eastern. Rewards will appear in accounts starting Thursday, Aug. 11. Check out this FAQ from Epic for more details on how to link your Epic account to GFN.
Fortnite fans can try out GeForce NOW for free to obtain this reward, and play Fortnite across all compatible GeForce NOW devices, including on mobile with intuitive touch controls, Windows PC, macOS, iOS Safari, Android phones and tablets, Android TV, SHIELD TV, 2022 Samsung TVs and select LG TV models.
All members are eligible for this in-game reward, regardless of membership tier. For members with an RTX 3080 membership, taking out opponents with “The Dish-stroyer” will feel even more victorious — with ultra-low latency, eight-hour gaming sessions and streaming in 4K resolution at 60 frames per second, or 1440p at 120 FPS on the PC and Mac apps.
With 120 FPS streaming now broadly available on 120Hz Android devices, RTX 3080 members can stream Fortnite at higher frame rates to more phones and tablets for an even more responsive experience.
Keep the Victories Rolling
This week also adds nine new games, including 3D platformer Hell Pie. GFN members can now see “Nate the demon” and “Nugget the angel” in all their fearsome glory, powered by ray-tracing technology for more vibrant gameplay. Check out the other titles now available to stream:
Thanks to earbuds you can have calls anywhere while doing anything. The problem: those on the other end of the call hear it all, too, from your roommate’s vacuum cleaner to background conversations at the cafe you’re working from.
Now, work by a trio of University of Washington graduate students, who spent the pandemic cooped up together in a noisy apartment, lets those on the other end of the call hear just you, rather than all the stuff going on around you.
Users found that the system, dubbed “ClearBuds” and presented last month at the ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), suppressed background noise far better than a commercially available alternative.
“You’re removing your audio background the same way you can remove your visual background on a video call,” explained Vivek Jayaram, a doctoral student in the Paul G. Allen School of Computer Science & Engineering.
Outlined in a paper co-authored by the three roommates, all computer science and engineering graduate students at the University of Washington — Maruchi Kim, Ishan Chatterjee, and Jayaram — ClearBuds are different from other wireless earbuds in two big ways.
First, ClearBuds use two microphones, one in each earbud.
While most earbuds use two microphones on the same earbud, ClearBuds uses a microphone on each earbud and creates two audio streams.
This creates higher spatial resolution for the system to better separate sounds coming from different directions, Kim explained. In other words, it makes it easier for the system to pick out the earbud wearer’s voice.
Second, the team created a neural network algorithm that can run on a mobile phone to process the audio streams to identify which sounds should be enhanced and which should be suppressed.
The researchers relied on two separate neural networks to do this.
The first neural network suppresses everything that isn’t a human voice.
The second enhances the speaker’s voice. The speaker can be identified because it’s coming from microphones in both earbuds at the same time.
Together, they effectively mask background noise and ensure the earbud wearer is heard loud and clear.
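The paper’s architecture is not reproduced here, but a minimal PyTorch sketch can make the two-stage idea concrete. Everything below (the module names, layer sizes and simple 1D convolutions) is an illustrative assumption, not the authors’ released code:
import torch
import torch.nn as nn

class NoiseSuppressionNet(nn.Module):
    """Stage 1 (illustrative): keep speech-like content, suppress everything else."""
    def __init__(self, channels=2):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, channels, kernel_size=9, padding=4), nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.mask(x)  # learned mask applied to both earbud streams

class SpeakerEnhancementNet(nn.Module):
    """Stage 2 (illustrative): boost the voice that appears in both streams at once."""
    def __init__(self, channels=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=9, padding=4),
        )
    def forward(self, x):
        return self.net(x)  # single enhanced mono signal sent to the call

streams = torch.randn(1, 2, 16000)  # one second of audio from the two earbud microphones at 16 kHz
clean_voice = SpeakerEnhancementNet()(NoiseSuppressionNet()(streams))
print(clean_voice.shape)  # torch.Size([1, 1, 16000])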
While the software the researchers created was lightweight enough to run on a mobile device, they relied on an NVIDIA TITAN desktop GPU to train the neural networks. They used both synthetic audio samples and real audio. Training took less than a day.
And the results, users reported, were dramatically better than those from commercially available earbuds, results that are winning recognition industrywide.
The team took second place for best paper at last month’s ACM MobiSys 2022 conference. In addition to Kim, Chatterjee and Jayaram, the paper’s co-authors included Ira Kemelmacher-Shlizerman, an associate professor at the Allen School; Shwetak Patel, a professor in both the Allen School and the electrical and computer engineering department; and Shyam Gollakota and Steven Seitz, both professors in the Allen School.
To be sure, the system outlined in the paper can’t be adopted instantly. While many earbuds have two microphones per earbud, they only stream audio from one earbud. Industry standards are just catching up to the idea of processing multiple audio streams from earbuds.
Nevertheless, the researchers are hopeful their work, which is open source, will inspire others to couple neural networks and microphones to provide better quality audio calls.
The ideas could also be useful for isolating and enhancing conversations taking place over smart speakers by harnessing them as ad hoc microphone arrays, Kim said, and even for tracking robot locations or supporting search and rescue missions.
Sounds good to us.
Featured image credit: Raymond Smith, University of Washington
Typical educational robotics approaches rely on imperative programming for robot navigation. However, with the increasing presence of AI in everyday life, these approaches miss an opportunity to introduce machine learning (ML) techniques grounded in an authentic and engaging learning context. Furthermore, the needs for costly specialized equipment and ample physical space are barriers that limit access to robotics experiences for all learners. We propose ARtonomous, a relatively low-cost, virtual alternative to physical, programming-only robotics kits. With ARtonomous, students employ…
As new data privacy regulations like the General Data Protection Regulation (GDPR) have come into effect, customers are under increased pressure to monetize media assets while abiding by the new rules. Monetizing media while respecting privacy regulations requires the ability to automatically extract granular metadata from assets like text, images, video, and audio files at internet scale. It also requires a scalable way to map media assets to industry taxonomies that facilitate discovery and monetization of content. This use case is particularly significant for the advertising industry as data privacy rules cause a shift from behavioral targeting using third-party cookies.
Third-party cookies help enable personalized ads for web users, and allow advertisers to reach their intended audience. A traditional solution to serve ads without third-party cookies is contextual advertising, which places ads on webpages based on the content published on the pages. However, contextual advertising poses the challenge of extracting context from media assets at scale, and likewise using that context to monetize the assets.
In this post, we discuss how you can build a machine learning (ML) solution that we call Contextual Intelligence Taxonomy Mapper (CITM) to extract context from digital content and map it to standard taxonomies in order to generate value. Although we apply this solution to contextual advertising, you can use it to solve other use cases. For example, education technology companies can use it to map their content to industry taxonomies in order to facilitate adaptive learning that delivers personalized learning experiences based on students’ individual needs.
Solution overview
The solution comprises two components: AWS Media Intelligence (AWS MI) capabilities for context extraction from content on web pages, and CITM for intelligent mapping of content to an industry taxonomy. You can access the solution’s code repository for a detailed view of how we implement its components.
AWS Media Intelligence
AWS MI capabilities enable automatic extraction of metadata that provides contextual understanding of a webpage’s content. You can combine ML techniques like computer vision, speech to text, and natural language processing (NLP) to automatically generate metadata from text, videos, images, and audio files for use in downstream processing. Managed AI services such as Amazon Rekognition, Amazon Transcribe, Amazon Comprehend, and Amazon Textract make these ML techniques accessible using API calls. This eliminates the overhead needed to train and build ML models from scratch. In this post, you see how using Amazon Comprehend and Amazon Rekognition for media intelligence enables metadata extraction at scale.
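As a small, hedged illustration of how these services are called from code (the sample text, region, and client setup below are placeholders, not part of the original solution), a few lines of boto3 are enough to pull entities and sentiment from a page’s text:
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

page_text = "Lightweight running shoes and trail gear for your next marathon."  # placeholder webpage text

entities = comprehend.detect_entities(Text=page_text, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=page_text, LanguageCode="en")

print([e["Text"] for e in entities["Entities"]])
print(sentiment["Sentiment"])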
Contextual Intelligence Taxonomy Mapper
After you extract metadata from media content, you need a way to map that metadata to an industry taxonomy in order to facilitate contextual targeting. To do this, you build Contextual Intelligence Taxonomy Mapper (CITM), which is powered by a BERT sentence transformer from Hugging Face.
The BERT sentence transformer enables CITM to categorize web content with contextually related keywords. For example, it can categorize a web article about healthy living with keywords from the industry taxonomy, such as “Healthy Cooking and Eating,” “Running and Jogging,” and more, based on the text written and the images used within the article. CITM also provides the ability to choose the mapped taxonomy terms to use for your ad bidding process based on your criteria.
The following diagram illustrates the conceptual view of the architecture with CITM.
The IAB (Interactive Advertising Bureau) Content Taxonomy
For this post, we use the IAB Tech Lab’s Content Taxonomy as the industry standard taxonomy for the contextual advertising use case. By design, the IAB taxonomy helps content creators more accurately describe their content, and it provides a common language for all parties in the programmatic advertising process. The use of a common terminology is crucial because the selection of ads for a webpage a user visits has to happen within milliseconds. The IAB taxonomy serves as a standardized way to categorize content from various sources while also being an industry protocol that real-time bidding platforms use for ad selection. It has a hierarchical structure, which provides granularity of taxonomy terms and enhanced context for advertisers.
Solution workflow
The following diagram illustrates the solution workflow.
Amazon S3 stores the IAB content taxonomy and the text and images extracted from web content.
Amazon Comprehend performs topic modeling to extract common themes from the collection of articles.
The Amazon Rekognition object label API detects labels in images.
CITM maps content to a standard taxonomy.
Optionally, you can store content to taxonomy mapping in a metadata store.
In the following sections, we walk through each step in detail.
Amazon S3 stores the IAB content taxonomy and extracted web content
We store extracted text and images from a collection of web articles in an S3 bucket. We also store the IAB content taxonomy. As a first step, we concatenate different tiers on the taxonomy to create combined taxonomy terms. This approach helps maintain the taxonomy’s hierarchical structure when the BERT sentence transformer creates embeddings for each keyword. See the following code:
def prepare_taxonomy(taxonomy_df):
    """
    Concatenate IAB Tech Lab content taxonomy tiers and prepare keywords for BERT embedding.
    Use this function as-is if using the IAB Content Taxonomy
    Parameters (input):
    ----------
    taxonomy_df : Content taxonomy dataframe
    Returns (output):
    -------
    df_clean : Content taxonomy with tiers in the taxonomy concatenated
    keyword_list: List of concatenated content taxonomy keywords
    ids: List of ids for the content taxonomy keywords
    """
    df = taxonomy_df[['Unique ID ','Parent','Name','Tier 1','Tier 2','Tier 3']]
    df_str = df.astype({"Unique ID ": 'str', "Parent": 'str', "Tier 1": 'str', "Tier 2": 'str', "Tier 3": 'str'})
    df_clean = df_str.replace('nan','')

    #create a column that concatenates all tiers for each taxonomy keyword
    df_clean['combined'] = df_clean[df_clean.columns[2:6]].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

    #turn taxonomy keywords into a list of strings to prep for encoding with the BERT sentence transformer
    keyword_list = df_clean['combined'].to_list()

    #get list of taxonomy ids
    ids = df_clean['Unique ID '].to_list()

    return df_clean, keyword_list, ids
taxonomy_df, taxonomy_terms, taxonomy_ids = prepare_taxonomy(read_taxonomy)
The following diagram illustrates the IAB context taxonomy with combined tiers.
Amazon Comprehend performs topic modeling to extract common themes from the collection of articles
With the Amazon Comprehend topic modeling API, you analyze all the article texts using the Latent Dirichlet Allocation (LDA) model. The model examines each article in the corpus and groups keywords into the same topic based on the context and frequency in which they appear across the entire collection of articles. To ensure the LDA model detects highly coherent topics, you perform a preprocessing step prior to calling the Amazon Comprehend API. You can use the gensim library’s CoherenceModel to determine the optimal number of topics to detect from the collection of articles or text files. See the following code:
def compute_coherence_scores(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute coherence scores for various numbers of topics for your topic model.
    Adjust the parameters below based on your data
    Parameters (input):
    ----------
    dictionary : Gensim dictionary created earlier from input texts
    corpus : Gensim corpus created earlier from input texts
    texts : List of input texts
    limit : The maximum number of topics to test. Amazon Comprehend can detect up to 100 topics in a collection
    Returns (output):
    -------
    models : List of LDA topic models
    coherence_scores : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_scores = []
    models = []
    for num_topics in range(start, limit, step):
        #use the function arguments rather than module-level variables
        model = gensim.models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        models.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherencemodel.get_coherence())
    return models, coherence_scores
models, coherence_scores = compute_coherence_scores(dictionary=id2word, corpus=corpus_tdf, texts=corpus_words, start=2, limit=100, step=3)
After you get the optimal number of topics, you use that value for the Amazon Comprehend topic modeling job. Providing different values for the NumberOfTopics parameter in the Amazon Comprehend StartTopicsDetectionJob operation results in a variation in the distribution of keywords placed in each topic group. An optimized value for the NumberOfTopics parameter represents the number of topics that provide the most coherent grouping of keywords with higher contextual relevance. You can store the topic modeling output from Amazon Comprehend in its raw format in Amazon S3.
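For reference, a sketch of starting such a job with an explicit NumberOfTopics value is shown below; the bucket paths, role ARN, and topic count are placeholders rather than values from this solution:
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

response = comprehend.start_topics_detection_job(
    JobName="citm-topic-modeling",  # placeholder job name
    NumberOfTopics=40,  # set this to the optimal value found with the coherence scores above
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",  # placeholder role
    InputDataConfig={
        "S3Uri": "s3://my-citm-bucket/article-text/",  # placeholder input location
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://my-citm-bucket/topic-output/"},  # placeholder output location
)
print(response["JobId"])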
The Amazon Rekognition object label API detects labels in images
You analyze each image extracted from all webpages using the Amazon Rekognition DetectLabels operation. For each image, the operation provides a JSON response with all labels detected within the image, coupled with a confidence score for each. For our use case, we arbitrarily select a confidence score of 60% or higher as the threshold for object labels to use in the next step. You store object labels in their raw format in Amazon S3. See the following code:
"""
Create a function to extract object labels from a given image using Amazon Rekognition
"""
def get_image_labels(image_loc):
labels = []
with fs.open(image_loc, "rb") as im:
response = rekognition_client.detect_labels(Image={"Bytes": im.read()})
for label in response["Labels"]:
if label["Confidence"] >= 60: #change to desired confidence score threshold, value between [0,100]:
object_label = label["Name"]
labels.append(object_label)
return labels
CITM maps content to a standard taxonomy
CITM compares extracted content metadata (topics from text and labels from images) with keywords on the IAB taxonomy, and then maps the content metadata to keywords from the taxonomy that are semantically related. For this task, CITM completes the following three steps:
Generate neural embeddings for the content taxonomy, topic keywords, and image labels using Hugging Face’s BERT sentence transformer. We access the sentence transformer model from Amazon SageMaker. In this post, we use the paraphrase-MiniLM-L6-v2 model, which maps keywords and labels to a 384-dimensional dense vector space.
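A minimal sketch of this step, using the sentence-transformers library directly instead of a SageMaker endpoint, is shown below; taxonomy_terms and topic_keywords come from the earlier steps, and image_labels stands in for the flattened list of Rekognition labels:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

#encode taxonomy keywords, topic keywords, and image labels into 384-dimensional dense vectors
taxonomy_embeddings = model.encode(taxonomy_terms, convert_to_tensor=True)
keyword_embeddings = model.encode(topic_keywords, convert_to_tensor=True)
label_embeddings = model.encode(image_labels, convert_to_tensor=True)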
Compute the cosine similarity score between taxonomy keywords and topic keywords using their embeddings. CITM also computes the cosine similarity between the taxonomy keywords and the image object labels. We use cosine similarity as a scoring mechanism to find semantically similar matches between the content metadata and the taxonomy. See the following code:
def compute_similarity(entity_embeddings, entity_terms, taxonomy_embeddings, taxonomy_terms):
    """
    Compute cosine scores between entity embeddings and taxonomy embeddings
    Parameters (input):
    ----------
    entity_embeddings : Embeddings for either topic keywords from Amazon Comprehend or image labels from Amazon Rekognition
    entity_terms : Terms for topic keywords or image labels
    taxonomy_embeddings : Embeddings for the content taxonomy
    taxonomy_terms : Terms for the taxonomy keywords
    Returns (output):
    -------
    mapping_df : Dataframe that matches each entity keyword to each taxonomy keyword and their cosine similarity score
    """
    #calculate cosine score, pairing each entity embedding with each taxonomy keyword embedding
    cosine_scores = util.pytorch_cos_sim(entity_embeddings, taxonomy_embeddings)
    pairs = []
    for i in range(cosine_scores.shape[0]):
        for j in range(cosine_scores.shape[1]):
            pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

    #Sort cosine similarity scores in decreasing order
    pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
    rows = []
    for pair in pairs:
        i, j = pair['index']
        rows.append([entity_terms[i], taxonomy_terms[j], pair['score']])

    #move sorted values to a dataframe
    mapping_df = pd.DataFrame(rows, columns=["term", "taxonomy_keyword", "cosine_similarity"])
    mapping_df['cosine_similarity'] = mapping_df['cosine_similarity'].astype('float')
    mapping_df = mapping_df.sort_values(by=['term','cosine_similarity'], ascending=False)

    #keep only the highest-scoring taxonomy keyword for each entity term
    drop_dups = mapping_df.drop_duplicates(subset=['term'], keep='first')
    mapping_df = drop_dups.sort_values(by=['cosine_similarity'], ascending=False).reset_index(drop=True)
    return mapping_df
#compute cosine_similarity score between topic keywords and content taxonomy keywords using BERT embeddings
text_taxonomy_mapping=compute_similarity(keyword_embeddings, topic_keywords, taxonomy_embeddings, taxonomy_terms)
Identify pairings with similarity scores that are above a user-defined threshold and use them to map the content to semantically related keywords on the content taxonomy. In our test, we select all keywords from pairings that have a cosine similarity score of 0.5 or higher. See the following code:
#merge text and image keywords mapped to content taxonomy
rtb_keywords=pd.concat([text_taxonomy_mapping[["term","taxonomy_keyword","cosine_similarity"]],image_taxonomy_mapping]).sort_values(by='cosine_similarity',ascending=False).reset_index(drop=True)
#select keywords with a cosine_similarity score greater than your desired threshold ( the value should be from 0 to 1)
rtb_keywords[rtb_keywords["cosine_similarity"] > 0.5]  #change to desired threshold for cosine score, value between [0,1]
A common challenge when working with internet-scale language representation (such as in this use case) is that you need a model that can fit most of the content—in this case, words in the English language. Hugging Face’s BERT transformer has been pre-trained using a large corpus of English-language Wikipedia articles to represent the semantic meaning of words in relation to one another. You fine-tune the pre-trained model using your specific dataset of topic keywords, image labels, and taxonomy keywords. When you place all embeddings in the same feature space and visualize them, you see that BERT logically represents semantic similarity between terms.
The following example visualizes IAB content taxonomy keywords for the class Automotive represented as vectors using BERT. BERT places Automotive keywords from the taxonomy close to semantically similar terms.
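The figure itself is not reproduced here, but a hypothetical way to produce this kind of plot is to project the keyword embeddings onto two dimensions and label the points:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

#project the 384-dimensional taxonomy embeddings onto two principal components
coords = PCA(n_components=2).fit_transform(taxonomy_embeddings.cpu().numpy())

plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for (x, y), term in zip(coords, taxonomy_terms):
    plt.annotate(term, (x, y), fontsize=7)
plt.title("IAB taxonomy keywords in BERT embedding space (2D projection)")
plt.show()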
The feature vectors allow CITM to compare the metadata labels and taxonomy keywords in the same feature space. In this feature space, CITM calculates cosine similarity between each feature vector for taxonomy keywords and each feature vector for topic keywords. In a separate step, CITM compares taxonomy feature vectors and feature vectors for image labels. Pairings with cosine scores closest to 1 are identified as semantically similar. Note that a pairing can either be a topic keyword and a taxonomy keyword, or an object label and a taxonomy keyword.
The following screenshot shows example pairings of topic keywords and taxonomy keywords using cosine similarity calculated with BERT embeddings.
To map content to taxonomy keywords, CITM selects keywords from pairings with cosine scores that meet a user-defined threshold. These are the keywords that will be used on real-time bidding platforms to select ads for the webpage’s inventory. The result is a rich mapping of online content to the taxonomy.
Optionally store content to taxonomy mapping in a metadata store
After you identify contextually similar taxonomy terms from CITM, you need a way for low-latency APIs to access this information. In programmatic bidding for advertisements, low response time and high concurrency play an important role in monetizing the content. The schema for the data store needs to be flexible to accommodate additional metadata when needed to enrich bid requests. Amazon DynamoDB can match the data access patterns and operational requirements for such a service.
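A hedged sketch of such a store is shown below; the table name, key, and attribute layout are illustrative choices, not part of the solution’s code repository:
import boto3
from decimal import Decimal

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("citm-content-taxonomy-mapping")  # placeholder table keyed on page_url

def store_mapping(page_url, mapped_keywords):
    """Persist the taxonomy keywords CITM mapped to a webpage, with their similarity scores."""
    table.put_item(
        Item={
            "page_url": page_url,
            "taxonomy_keywords": [
                {"keyword": k, "score": Decimal(str(round(s, 4)))}  # DynamoDB stores numbers as Decimal, not float
                for k, s in mapped_keywords
            ],
        }
    )

store_mapping("https://example.com/healthy-living-article",
              [("Healthy Cooking and Eating", 0.71), ("Running and Jogging", 0.63)])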
Conclusion
In this post, you learned how to build a taxonomy-based contextual targeting solution using Contextual Intelligence Taxonomy Mapper (CITM). You learned how to use Amazon Comprehend and Amazon Rekognition to extract granular metadata from your media assets. Then, using CITM you mapped the assets to an industry standard taxonomy to facilitate programmatic ad bidding for contextually related ads. You can apply this framework to other use cases that require use of a standard taxonomy to enhance the value of existing media assets.
To experiment with CITM, you can access its code repository and use it with a text and image dataset of your choice.
Aramide Kehinde is a Sr. Partner Solution Architect at AWS in Machine Learning and AI. Her career journey has spanned the areas of Business Intelligence and Advanced Analytics across multiple industries. She works to enable partners to build solutions with AWS AI/ML services that serve customers’ needs for innovation. She also enjoys building the intersection of AI and creative arenas and spending time with her family.
Anuj Gupta is a Principal Solutions Architect working with hyper-growth companies on their cloud native journey. He is passionate about using technology to solve challenging problems and has worked with customers to build highly distributed and low latency applications. He contributes to open-source Serverless and Machine Learning solutions. Outside of work, he loves traveling with his family and writing poems and philosophical blogs.
We’ll invite 1 million people from our waitlist over the coming weeks. Users can create with DALL·E using free credits that refill every month, and buy additional credits in 115-generation increments for $15.
DALL·E, the AI system that creates realistic images and art from a description in natural language, is now available in beta. Today we’re beginning the process of inviting 1 million people from our waitlist over the coming weeks.
Every DALL·E user will receive 50 free credits during their first month of use and 15 free credits every subsequent month. Each credit can be used for one original DALL·E prompt generation — returning four images — or an edit or variation prompt, which returns three images.
A powerful creative tool
DALL·E allows users to create quickly and easily, and artists and creative professionals are using DALL·E to inspire and accelerate their creative processes. We’ve already seen people use DALL·E to make music videos for young cancer patients, create magazine covers, and bring novel concepts to life.
Other features include:
Edit allows users to make realistic and context-aware edits to images they generate with DALL·E or images they upload using a natural language description.
Variations can take an image generated by DALL·E or an image uploaded by a user and create different variations of it inspired by the original.
My Collection allows users to save generations right in the DALL·E platform.
Pricing
In this first phase of the beta, users can buy additional DALL·E credits in 115-credit increments (460 images[1]) for $15 on top of their free monthly credits. One credit is applied each time a prompt is entered and a user hits “generate” or “variations.”
As we learn more and gather user feedback, we plan to explore other options that will align with users’ creative processes.
Using DALL·E for commercial projects
Starting today, users get full usage rights to commercialize the images they create with DALL·E, including the right to reprint, sell, and merchandise. This includes images they generated during the research preview.
Users have told us that they are planning to use DALL·E images for commercial projects, like illustrations for children’s books, art for newsletters, concept art and characters for games, moodboards for design consulting, and storyboards for movies.
Safety
Prior to making DALL·E available in beta, we’ve worked with researchers, artists, developers, and other users to learn about risks and have taken steps to improve our safety systems based on learnings from the research preview, including:
Curbing misuse: To minimize the risk of DALL·E being misused to create deceptive content, we reject image uploads containing realistic faces and attempts to create the likeness of public figures, including celebrities and prominent political figures. We also used advanced techniques to prevent photorealistic generations of real individuals’ faces.
Preventing harmful images: We’ve made our content filters more accurate so that they are more effective at blocking images that violate our content policy — which does not allow users to generate violent, adult, or political content, among other categories — while still allowing creative expression. We also limited DALL·E’s exposure to these concepts by removing the most explicit content from its training data.
Reducing bias: We implemented a new technique so that DALL·E generates images of people that more accurately reflect the diversity of the world’s population. This technique is applied at the system level when DALL·E is given a prompt about an individual that does not specify race or gender, like “CEO.”
Monitoring: We will continue to have automated and human monitoring systems to help guard against misuse.
Subsidized access for qualifying artists
We hope to make DALL·E as accessible as possible. Artists who are in need of financial assistance will be able to apply for subsidized access. Please fill out this interest form if you’d like to be notified once more details are available.
We are excited to see what people create with DALL·E and look forward to users’ feedback during this beta period.
Large-scale models are revolutionizing deep learning and AI research, driving major improvements in language understanding, generating creative texts, multi-lingual translation and many more. But despite their remarkable capabilities, the models’ large size creates latency and cost constraints that hinder the deployment of applications on top of them. In particular, increased inference time and memory consumption inhibit deployment of models on latency-sensitive and resource-constrained applications on both server and client devices.
To address these deployment challenges, the DeepSpeed team, as part of Microsoft’s AI at Scale initiative, has been exploring innovations in system optimization and model compression. On the former, we released the DeepSpeed inference system, which consists of a diverse set of optimizations, such as highly optimized CUDA kernels and inference-adapted parallelism to accelerate model inference speed, as well as ZeRO-Inference, which breaks the GPU memory wall and fits large models across heterogeneous memories to address hardware accessibility limitations. These optimizations target improving the inference system efficiency while preserving the model sizes, the amount of computation, and model accuracy: the total work remains the same, but the processing capability and speed are higher.
On the latter, the emerging compression algorithms show great potential in reducing model size and inference computation. These algorithms use condensed format to represent, store, communicate, and compute DNN models, reducing the total work needed for inference with little or no loss in accuracy. System optimizations and model compression are very much complementary, and they can be synergistically combined to provide a multiplicative reduction on inference latency and cost. Motivated by combining the best of both worlds, we are proud to announce DeepSpeed Compression—a composable library that combines novel compression technologies and highly efficient system optimizations to make DL model size smaller and inference speed faster, all with much lowered compression cost.
Challenges of compressing large deep learning models
Although there have been numerous efforts to compress model sizes and reduce inference computation, applying existing compression techniques to large-scale models still poses many challenges in practice:
Complex pipeline for achieving high compression ratio. Various strategies have been proposed to overcome optimization difficulty and accuracy degradation when compressing large models. However, no systematic study on best practices for extreme compression exists, such as using aggressive quantization methods and layer reduction. This leaves the underlying question unanswered: do we really need those ad-hoc tricks to recover the accuracy loss or do simpler yet more effective methods exist?
High compression cost. Existing methods for compressing large models incur high training costs. For example, popular compression methods such as quantization-aware training (QAT) and multi-stage distillation lead to long training times and large hardware resource requirements as model sizes grow to multiple billions of parameters or beyond, making compressing these models costly and difficult. For instance, the 20B GPT-NeoX model was pre-trained using 96 NVIDIA A100 GPUs in three months. Performing QAT even with 10% of training samples would still require large amounts of computational resources, which many practitioners cannot afford.
Lack of tailored system optimizations for compressed models. To maximize the benefits of compressed models, specialized system optimizations are often required, e.g., quantized and sparsified models need optimized low-bit arithmetic computation and sparse matrix multiplication to boost the inference speed on commodity hardware. Existing methods often focus on reducing theoretical computation overhead but miss the opportunities to offer the best inference latency reduction via tailored system optimizations for the compressed models.
Limited composability. Existing methods have limited composability from two aspects. First, there is limited composability among multiple compression methods. Although well-performing compression solutions have been proposed independently, combining multiple methods together for the best outcome is still a laborious process, requiring building a complex compression pipeline. Second, there is a lack of composability between compression techniques and system optimizations. As we just mentioned, compressed models require specialized system optimizations to maximize latency and cost reduction. However, few existing methods take an end-to-end approach of composing compressions with system optimizations, as it requires significant efforts to bring modeling, algorithm, and system areas of deep learning to work synergistically together.
DeepSpeed Compression overcomes these challenges by offering novel state-of-the-art compression techniques, such as XTC for 32x smaller model size and ZeroQuant for a 5000x reduction in compression cost. It also takes an end-to-end approach to improve the computation efficiency of compressed models via a highly optimized inference engine. Furthermore, our library has multiple built-in state-of-the-art compression methods and supports synergistic composition of these methods together with the system optimizations, offering the best of both worlds while allowing a seamless and easy-to-use pipeline for efficient DL model inference. Each of these features is explained further below.
Smaller model size: 32x smaller transformer models via simple yet effective binarized extreme compression
Reducing the size of large models is critical when deploying them on both servers and client devices. In DeepSpeed Compression, we provide extreme compression techniques to reduce model size by 32x with almost no accuracy loss or to achieve 50x model size reduction while retaining 97% of the accuracy. We do this through two main techniques: extreme quantization and layer reduction. Extreme quantization via ternarization/binarization reduces the model size significantly but is considered a particularly challenging task due to the large quantization error resulting in model performance degradation. To improve the accuracy of binarized/ternarized models, existing methods often adopt complicated and computationally expensive compression pipelines, such as multi-stage distillation. However, it remains unclear how different components in extreme quantization affect the resulting performance. To tease apart their effects, we perform a systematic study on the impacts of various techniques currently used for extreme compression.
In this process, we have identified several best practices for extreme compression:
A longer training iteration with learning rate decay is highly preferred for closing the accuracy gap of extreme quantization;
Single-stage knowledge distillation with more training budgets is sufficient to match or even exceed accuracy from multi-stage ones;
Training without data augmentation hurts downstream performance across various compression settings, especially on smaller tasks;
Lightweight layer reduction matches or even exceeds expensive pre-training distillation for task-specific compression.
Based on these findings, we greatly simplify the procedure of extreme compression and propose a new extreme compression technique, XTC, that compresses a model to its limit with lightweight layer reduction and robust binarization. XTC produces models with little loss in accuracy yet up to 50x model size reduction, as shown in Figure 1. XTC reduces the model size by 32x with almost no loss in the average score on the GLUE tasks via a simple yet effective binarization technique. By combining extreme quantization and lightweight layer reduction, we can further improve the binarized model, achieving 50x model size reduction while retaining 97% of the accuracy. Given that transformers are becoming the standard architecture choice for AI, we believe the investigation and the proposed solution could be highly impactful to power large-scale models on resource-constrained devices. If you are interested in XTC, you can also find more details in our technical report “Extreme Compression for Pre-trained Transformers Made Simple and Efficient.”
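As a rough illustration of the binarization idea (a generic sign-plus-scale scheme, not XTC’s exact recipe), each weight matrix can be replaced by a single scaling factor and a matrix of ±1 values:
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Generic 1-bit weight approximation: W ~ alpha * sign(W), with alpha = mean(|W|)."""
    alpha = w.abs().mean()
    return alpha * torch.sign(w)

w = torch.randn(768, 768)  # e.g., one transformer projection matrix
w_bin = binarize_weights(w)
print(w_bin.unique().numel())  # 2: every entry is either +alpha or -alpha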
Lower compression cost: Quantizing models with >5000x compression cost reduction and no training data
Large-scale transformer models with hundreds of billions of parameters are usually challenging to quantize due to the lack of training resources and/or data access. To resolve those issues, we propose a method called ZeroQuant, which quantizes large-scale models with little or no fine-tuning cost on limited resources. Under the hood, ZeroQuant contains two major parts: 1) a hardware friendly fine-grained quantization scheme that allows us to quantize weights and activations into low-bit values with minimal errors while still empowering fast inference speed on commodity hardware with low quantization/dequantization cost; and 2) a layer-by-layer knowledge distillation pipeline, which fine-tunes the quantized model to close the accuracy gap from low-precision (e.g., INT4) quantization.
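ZeroQuant’s actual kernels live in the DeepSpeed library, but the fine-grained (group-wise) quantization idea can be sketched roughly as follows; the group size and the symmetric INT8 scheme here are illustrative assumptions:
import torch

def groupwise_int8_quantize(w: torch.Tensor, group_size: int = 64):
    """Quantize a weight matrix to INT8 with one scale per group of group_size values."""
    groups = w.reshape(-1, group_size)  # assumes w.numel() is divisible by group_size
    scales = groups.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(groups / scales), -127, 127).to(torch.int8)
    return q.reshape(w.shape), scales

def dequantize(q: torch.Tensor, scales: torch.Tensor, group_size: int = 64):
    return (q.reshape(-1, group_size).float() * scales).reshape(q.shape)

w = torch.randn(768, 768)
q, scales = groupwise_int8_quantize(w)
print((w - dequantize(q, scales)).abs().max())  # small per-group quantization error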
The benefits of ZeroQuant are threefold: First, unlike previous quantization-aware training that requires expensive retraining and parameter tuning, ZeroQuant enables quantizing BERT and GPT-style models from FP32/FP16 into INT8 weight and activations to retain accuracy without incurring any retraining cost, as shown in Figure 2. Second, by loading only one layer for low-precision (e.g., INT4) quantization at a time, the maximum memory footprint required to quantize the model depends solely on the size of an individual layer instead of the entire model, allowing one to quantize gigantic models with as little as one GPU. Third, our quantization method is data-free, which means that it does not require the original training data of the model to obtain a quantized model. This is especially useful when the data is not available due to privacy-related reasons, for example.
We demonstrated the scalability of ZeroQuant on a GPT-3-style model with 1.3B parameters (GPT-3-1.3B) and one of the largest open-source language models, GPT-NeoX (20B). Particularly, thanks to the fine-grained quantization scheme, ZeroQuant can convert GPT-3-1.3B (trained with 128 NVIDIA A100 GPUs for five days) and GPT-NeoX (trained with 96 A100 GPUs for three months) to INT8 without any cost or training data while delivering comparable accuracy. Furthermore, with the lightweight layer-by-layer knowledge distillation, ZeroQuant can quantize GPT-3-1.3B with mixed INT4/INT8 precision in three hours on a single GPU, which leads to 5000x compression cost reduction compared to quantization-aware training. To find more details about ZeroQuant, refer to “ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”.
Faster inference speed: Latency reduction via highly optimized DeepSpeed Inference system
System optimizations play a key role in efficiently utilizing the available hardware resources and unleashing their full capability through inference optimization libraries like ONNX runtime and DeepSpeed. We build our work on top of DeepSpeed inference, which provides high-performance model serving with inference optimized kernels, parallelism, and memory optimizations, covering a wide variety of models for both latency sensitive and throughput-oriented applications. Besides leveraging these, we also extend the inference capability to support models in compressed formats. For example, we developed variations of efficient low-bit computation such as INT8 GeMM kernels. These kernels load INT8 parameters and activations from GPU device memory to the registers and use the customized INT8 GeMM implemented on top of CUTLASS tuned for different batch sizes to deliver faster GeMM computation. The kernels also fuse quantization and dequantization operations before and after GeMM, further reducing the kernel invocation overhead and improving the memory bandwidth utilization. Furthermore, our inference engine supports many-GPU transformer layers for serving transformer models across GPUs using inference-adapted parallelism strategies. For compressed models that have a smaller memory footprint, the inference engine can automatically shrink the number of GPUs required to serve a model, leading to reduced cross-GPU communication and hardware cost. For example, DeepSpeed compression leverages INT8 for GPT-NeoX (20B) and reduces the GPU requirement of serving the model from two to one, reducing latency from 65ms to 25ms, and achieving a 5.2x cost reduction. As shown in Figure 3, DeepSpeed INT8 kernels can boost performance by up to 2x compared to our own FP16 kernels, and they achieve 2.8-5.2x latency cost reduction compared to the baseline FP16 in PyTorch, significantly reducing the latency and cost of large-scale model inference.
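For context, serving a Hugging Face checkpoint through the DeepSpeed inference engine looks roughly like the sketch below; the model name is a placeholder, and the additional configuration needed to load a compressed INT8 checkpoint varies by DeepSpeed release, so it is omitted here:
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Inject DeepSpeed's optimized transformer kernels; mp_size shards the model across GPUs
engine = deepspeed.init_inference(model, mp_size=1, dtype=torch.half, replace_with_kernel_inject=True)

inputs = tokenizer("DeepSpeed Compression makes inference", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))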
A library that synergistically composes compression algorithms and system optimizations
DeepSpeed Compression proposes a seamless pipeline to address the compression composability challenges, as shown in Figure 4. The core piece of DeepSpeed Compression is a component called compression composer, which includes several significant features:
It offers multiple cutting-edge compression methods, as shown in Table 1, including extreme quantization, head/row/channel pruning, and knowledge distillation, that can effectively reduce model size and inference cost. The list will expand as we continually integrate more state-of-the-art compression methods.
Category        | Methods                                | Targets
Quantization    | INT8/INT4                              | Activations
                | INT8/INT4/Ternary/Binary               | Weights
Sparsification  | Head pruning                           | Attention head (Transformer)
                | Sparse/Row pruning                     | Weights
                | Channel pruning                        | Conv2D weights
Layer Reduction | Arbitrary subset of network layers     | Layers
Distillation    | Output logits, feature map, attn. map  | Layers
Table 1: Compression techniques supported in DeepSpeed Compression composer.
It offers an easy-to-use API that automatically takes care of the complexities of assembling different compression techniques to deliver the compound benefits of multiple compression methods. For example, XTC requires composition of lightweight layer reduction, binarization, and knowledge distillation. However, composing them together is non-trivial. With our compression composer, applying extreme compression is as easy as adding two new API calls to enable compression and clean the compressed model.
It is designed in a modular way so that it will be easy for users to add new compression schemes. For example, additional compression methods can be added through custom compression layers and, by registering them with the compression composer, the new methods can be composed with existing methods that are already managed by the composer.
It seamlessly works with the existing DeepSpeed library. This has two benefits. First, DeepSpeed Compression can be specified and enabled the same way as DeepSpeed training and inference via a JSON file, where enabling different combinations of compression techniques only requires a few lines of modification in the JSON file. Second, once the compression schemes have been configured, the compression composer automatically modifies the model layers and training to enable the compression process and does not require additional changes from the user to the model structure or the training procedure.
After the DNN model has been compressed, DeepSpeed Compression replaces the compressed layers with highly optimized kernels in the DeepSpeed Inference engine to maximize hardware efficiency. Together, the compression composer and inference engine achieve the best of both worlds of compression and system optimization, delivering a compound effect of inference cost reduction.
Use Cases of DeepSpeed Compression
Although we started DeepSpeed Compression quite recently, we have successfully leveraged it to optimize several large-scale open-source models and Microsoft production workloads. It delivers significant latency and cost reductions and is widely applicable across various NLP and CV tasks.
We applied INT8 quantization of DeepSpeed Compression to optimize two large-scale open-source models in GPT-3 style: GPT-J (6B) and GPT-NeoX (20B) on the Azure AI platform. As shown in Figure 5, our quantized models achieve similar accuracy as the original models on 19 zero-shot evaluation tasks and WikiText, while achieving 3.67x and 5.2x inference cost savings, respectively, compared with PyTorch FP16 baseline on ND A100 v4 Azure instances. Very importantly, we quantize these models without requiring any training data, expensive compression time or GPU resources, bringing huge training cost savings compared with QAT!
Beyond open-source models, DeepSpeed Compression has also demonstrated its effectiveness to optimize production workloads in Microsoft:
It reduces the Microsoft Turing Image Super Resolution (T-ISR) model size by 3.1x together with a 1.85x latency reduction by composing different compression schemes like pruning and distillation with efficient system optimizations. The model has also been deployed in Bing Maps and Microsoft Edge, where it automatically derives high-resolution images from lower-resolution images, which can be seen in this blog post.
It also successfully compresses the Microsoft Relevance Fusion models—a Transformer-based ranking model used in Bing’s core search stack. Without DeepSpeed Compression, it took three days to quantize the model using QAT. With DeepSpeed Compression, we can quantize the model in a few minutes with improved accuracy and reduced latency compared to QAT.
DeepSpeed Compression release plan
DeepSpeed Compression is still at an early stage and under active development, but we’d like to share the results and tools with DeepSpeed users as soon as possible. In this first release, we open-source the core DeepSpeed Compression components, including the compression composer, which supports various compression methods consisting of INT8/INT4/Ternary/Binary quantization, lightweight layer reduction, pre-training and task-specific knowledge distillation, head pruning, row pruning, and channel pruning, for compressing both NLP and computer vision models. Together with the compression composer, we are releasing the two novel technologies XTC and ZeroQuant introduced in this blog as part of the library.
We hope you will try DeepSpeed Compression. Please find the code, tutorials, and documentation at the DeepSpeed GitHub and website. We highly value your feedback and comments, so let us know what you think and how we can improve. As for the next steps, we plan to extend our offerings with more compression methods, an extended coverage of specialized kernels for compressed models, and an optimization module that automatically finds the best compression schemes. We believe that our composable library and new innovations will help close the gap between what is possible in AI and what is deployable, as well as making DL inference faster, cheaper, and simpler.
We are a group of system and modeling researchers—Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Conglong Li, Reza Yazdani Aminabadi, Elton Zheng, Samyam Rajbhandari, Ammar Ahmad Awan, Jeff Rasley, Cheng Li, Olatunji Ruwase, Shaden Smith, Du Li, Michael Wyatt, Arash Bakhtiari, Guanhua Wang, Connor Holmes, Sam Ade Jacobs, Martin Cai, Yuxiong He (team lead)—who are enthusiastic about performance optimization of large-scale systems. We have recently focused on deep learning systems, optimizing deep learning’s speed to train, speed to convergence, and speed to develop.
AI and electric vehicle technology breakthroughs are transforming the automotive industry. These developments pave the way for new innovators, attracting technical prowess and design philosophies from Silicon Valley.
Mike Bell, senior vice president of digital at Lucid Motors, sees continuous innovation coupled with over-the-air updates as key to designing sustainable, award-winning intelligent vehicles that provide seamless automated driving experiences.
NVIDIA’s Katie Burke Washabaugh spoke with Bell on the latest AI Podcast episode, covering what it takes to stay ahead in the software-defined vehicle space.
Bell touched on future technology and its implications for the mass adoption of sustainable, AI-powered EVs — as well as what Lucid’s Silicon Valley roots bring to the intersection of innovation and transportation.
Driver’s Ed: How Waabi Uses AI, Simulation to Teach Autonomous Vehicles to Drive
Teaching the AI brains of autonomous vehicles to understand the world as humans do requires billions of miles of driving experience. The road to achieving this astronomical level of driving leads to the virtual world. Learn how Waabi uses powerful high-fidelity simulations to train and develop production-level autonomous vehicles.
Polestar’s Dennis Nobelius on the Sustainable Performance Brand’s Plans
Driving enjoyment and autonomous driving capabilities can complement one another in intelligent, sustainable vehicles. Learn about the automaker’s plans to unveil its third vehicle, the Polestar 3, the tech inside it, and what the company’s racing heritage brings to the intersection of smarts and sustainability.
GANTheftAuto: Harrison Kinsley on AI-Generated Gaming Environments
Humans playing games against machines is nothing new, but now computers can develop their own games for people to play. Programming enthusiast and social media influencer Harrison Kinsley created GANTheftAuto, an AI-based neural network that generates a playable chunk of the classic video game Grand Theft Auto V.
Subscribe to the AI Podcast: Now Available on Amazon Music
For workers who use machine-learning models to help them make decisions, knowing when to trust a model’s predictions is not always an easy task, especially since these models are often so complex that their inner workings remain a mystery.
Users sometimes employ a technique, known as selective regression, in which the model estimates its confidence level for each prediction and will reject predictions when its confidence is too low. Then a human can examine those cases, gather additional information, and make a decision about each one manually.
But while selective regression has been shown to improve the overall performance of a model, researchers at MIT and the MIT-IBM Watson AI Lab have discovered that the technique can have the opposite effect for underrepresented groups of people in a dataset. As the model’s confidence increases with selective regression, its chance of making the right prediction also increases, but this does not always happen for all subgroups.
For instance, a model suggesting loan approvals might make fewer errors on average, but it may actually make more wrong predictions for Black or female applicants. One reason this can occur is that the model’s confidence measure is trained on overrepresented groups and may not be accurate for underrepresented groups.
Once they had identified this problem, the MIT researchers developed two algorithms that can remedy the issue. Using real-world datasets, they show that the algorithms reduce performance disparities that had affected marginalized subgroups.
“Ultimately, this is about being more intelligent about which samples you hand off to a human to deal with. Rather than just minimizing some broad error rate for the model, we want to make sure the error rate across groups is taken into account in a smart way,” says senior MIT author Greg Wornell, the Sumitomo Professor in Engineering in the Department of Electrical Engineering and Computer Science (EECS) who leads the Signals, Information, and Algorithms Laboratory in the Research Laboratory of Electronics (RLE) and is a member of the MIT-IBM Watson AI Lab.
Joining Wornell on the paper are co-lead authors Abhin Shah, an EECS graduate student, and Yuheng Bu, a postdoc in RLE; as well as Joshua Ka-Wing Lee SM ’17, ScD ’21 and Subhro Das, Rameswar Panda, and Prasanna Sattigeri, research staff members at the MIT-IBM Watson AI Lab. The paper will be presented this month at the International Conference on Machine Learning.
To predict or not to predict
Regression is a technique that estimates the relationship between a dependent variable and independent variables. In machine learning, regression analysis is commonly used for prediction tasks, such as predicting the price of a home given its features (number of bedrooms, square footage, etc.). With selective regression, the machine-learning model can make one of two choices for each input — it can make a prediction or abstain from a prediction if it doesn’t have enough confidence in its decision.
When the model abstains, it reduces the fraction of samples it is making predictions on, which is known as coverage. By only making predictions on inputs that it is highly confident about, the overall performance of the model should improve. But this can also amplify biases that exist in a dataset, which occur when the model does not have sufficient data from certain subgroups. This can lead to errors or bad predictions for underrepresented individuals.
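As a toy numerical sketch of this coverage trade-off (synthetic data, not the paper’s datasets or method), one can abstain on the least-confident fraction of samples and compare the error overall and for a minority subgroup:
import numpy as np

rng = np.random.default_rng(0)
n = 10000
group = rng.integers(0, 2, size=n)  # 0: majority subgroup, 1: minority subgroup
y_true = rng.normal(size=n)
y_pred = y_true + rng.normal(scale=np.where(group == 0, 0.5, 1.5))  # noisier predictions for the minority

# Toy confidence score: informative for the majority subgroup, uninformative for the minority one
err = np.abs(y_pred - y_true)
confidence = np.where(group == 0, -err, -rng.uniform(0, 2, size=n))

for coverage in [1.0, 0.8, 0.5]:
    keep = confidence >= np.quantile(confidence, 1 - coverage)  # abstain on the least-confident samples
    overall_mse = np.mean((y_pred[keep] - y_true[keep]) ** 2)
    minority = keep & (group == 1)
    minority_mse = np.mean((y_pred[minority] - y_true[minority]) ** 2)
    print(f"coverage={coverage:.1f}  overall MSE={overall_mse:.2f}  minority MSE={minority_mse:.2f}")
With a miscalibrated confidence score like this toy one, lowering coverage improves the overall error while leaving the minority subgroup’s error roughly unchanged, which mirrors the disparity described above.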
The MIT researchers aimed to ensure that, as the overall error rate for the model improves with selective regression, the performance for every subgroup also improves. They call this monotonic selective risk.
“It was challenging to come up with the right notion of fairness for this particular problem. But by enforcing this criteria, monotonic selective risk, we can make sure the model performance is actually getting better across all subgroups when you reduce the coverage,” says Shah.
Focus on fairness
The team developed two neural network algorithms that impose this fairness criteria to solve the problem.
One algorithm guarantees that the features the model uses to make predictions contain all information about the sensitive attributes in the dataset, such as race and sex, that is relevant to the target variable of interest. Sensitive attributes are features that may not be used for decisions, often due to laws or organizational policies. The second algorithm employs a calibration technique to ensure the model makes the same prediction for an input, regardless of whether any sensitive attributes are added to that input.
The researchers tested these algorithms by applying them to real-world datasets that could be used in high-stakes decision making. One, an insurance dataset, is used to predict total annual medical expenses charged to patients using demographic statistics; another, a crime dataset, is used to predict the number of violent crimes in communities using socioeconomic information. Both datasets contain sensitive attributes for individuals.
When they implemented their algorithms on top of a standard machine-learning method for selective regression, they were able to reduce disparities by achieving lower error rates for the minority subgroups in each dataset. Moreover, this was accomplished without significantly impacting the overall error rate.
“We see that if we don’t impose certain constraints, in cases where the model is really confident, it could actually be making more errors, which could be very costly in some applications, like health care. So if we reverse the trend and make it more intuitive, we will catch a lot of these errors. A major goal of this work is to avoid errors going silently undetected,” Sattigeri says.
The researchers plan to apply their solutions to other applications, such as predicting house prices, student GPA, or loan interest rate, to see if the algorithms need to be calibrated for those tasks, says Shah. They also want to explore techniques that use less sensitive information during the model training process to avoid privacy issues.
And they hope to improve the confidence estimates in selective regression to prevent situations where the model’s confidence is low, but its prediction is correct. This could reduce the workload on humans and further streamline the decision-making process, Sattigeri says.
This research was funded, in part, by the MIT-IBM Watson AI Lab and its member companies Boston Scientific, Samsung, and Wells Fargo, and by the National Science Foundation.