DP-Auditorium: A flexible library for auditing differential privacy

Differential privacy (DP) is a property of randomized mechanisms that limits the influence of any individual user’s information while processing and analyzing data. DP offers a robust solution to address growing concerns about data protection, enabling technologies across industries and government applications (e.g., the US census) without compromising individual user identities. As its adoption increases, it’s important to identify the potential risks of developing mechanisms with faulty implementations. Researchers have recently found errors both in the mathematical proofs of private mechanisms and in their implementations. For example, researchers compared six sparse vector technique (SVT) variations and found that only two of the six actually met the asserted privacy guarantee. Even when mathematical proofs are correct, the code implementing the mechanism is vulnerable to human error.

However, practical and efficient DP auditing is challenging, primarily due to the inherent randomness of the mechanisms and the probabilistic nature of the tested guarantees. In addition, a range of guarantee types exists (e.g., pure DP, approximate DP, Rényi DP, and concentrated DP), and this diversity contributes to the complexity of formulating the auditing problem. Further, debugging mathematical proofs and code bases is an intractable task given the volume of proposed mechanisms. While ad hoc testing techniques exist under specific assumptions about the mechanisms, few efforts have been made to develop an extensible tool for testing DP mechanisms.

To that end, in “DP-Auditorium: A Large Scale Library for Auditing Differential Privacy”, we introduce an open source library for auditing DP guarantees with only black-box access to a mechanism (i.e., without any knowledge of the mechanism’s internal properties). DP-Auditorium is implemented in Python and provides a flexible interface that allows contributions to continuously improve its testing capabilities. We also introduce new testing algorithms that perform divergence optimization over function spaces for Rényi DP, pure DP, and approximate DP. We demonstrate that DP-Auditorium can efficiently identify DP guarantee violations, and suggest which tests are most suitable for detecting particular bugs under various privacy guarantees.

DP guarantees

The output of a DP mechanism M on a dataset D is a sample drawn from a probability distribution, M(D), that satisfies a mathematical property ensuring the privacy of user data. A DP guarantee is thus tightly related to properties of pairs of probability distributions. A mechanism is differentially private if the probability distributions determined by M on dataset D and a neighboring dataset D’, which differ by only one record, are indistinguishable under a given divergence metric.

For example, the classical approximate DP definition states that a mechanism is approximately DP with parameters (ε, δ) if the hockey-stick divergence of order e^ε between M(D) and M(D’) is at most δ. Pure DP is a special instance of approximate DP where δ = 0. Finally, a mechanism is considered Rényi DP with parameters (α, ε) if the Rényi divergence of order α between M(D) and M(D’) is at most ε (where ε is a small positive value). In these three definitions, ε is not interchangeable but intuitively conveys the same concept: larger values of ε imply larger divergences between the two distributions, or less privacy, since the two distributions are easier to distinguish.
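For reference, the two divergences named above have the following standard forms. This assumes M(D) and M(D’) admit densities p and q with respect to a common measure; the symbols p, q, and x are notation introduced here and do not appear in the post.

```latex
% Hockey-stick divergence of order e^{\varepsilon}; approximate DP bounds it by \delta
D_{e^{\varepsilon}}\bigl(M(D)\,\|\,M(D')\bigr)
  = \int \max\bigl(0,\; p(x) - e^{\varepsilon} q(x)\bigr)\,dx \;\le\; \delta

% Rényi divergence of order \alpha > 1; Rényi DP bounds it by \varepsilon
D_{\alpha}\bigl(M(D)\,\|\,M(D')\bigr)
  = \frac{1}{\alpha - 1}\,\log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\,dx \;\le\; \varepsilon
```

Setting δ = 0 in the first inequality recovers pure DP: p(x) can exceed e^ε q(x) only on a set of measure zero.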

DP-Auditorium

DP-Auditorium comprises two main components: property testers and dataset finders. Property testers take samples from a mechanism evaluated on specific datasets as input and aim to identify privacy guarantee violations in the provided datasets. Dataset finders suggest datasets where the privacy guarantee may fail. By combining both components, DP-Auditorium enables (1) automated testing of diverse mechanisms and privacy definitions, and (2) detection of bugs in privacy-preserving mechanisms. We implement various private and non-private mechanisms, including simple mechanisms that compute the mean of records and more complex mechanisms, such as different SVT and gradient descent mechanism variants.

Property testers determine whether evidence exists to reject the hypothesis that a given divergence between two probability distributions, P and Q, is bounded by a prespecified budget determined by the DP guarantee being tested. They compute a lower bound on the divergence from samples of P and Q, and reject the property if the lower bound exceeds the budget. If the lower bound does not exceed the budget, the test is inconclusive: it provides no guarantee that the divergence is actually bounded. To test for a range of privacy guarantees, DP-Auditorium introduces three novel testers: (1) HockeyStickPropertyTester, (2) RényiPropertyTester, and (3) MMDPropertyTester. Unlike other approaches, these testers don’t depend on explicit histogram approximations of the tested distributions. They rely on variational representations of the hockey-stick divergence, Rényi divergence, and maximum mean discrepancy (MMD) that enable the estimation of divergences through optimization over function spaces. As a baseline, we implement HistogramPropertyTester, a commonly used approximate DP tester. While our three testers follow a similar approach, for brevity, we focus on the HockeyStickPropertyTester in this post.
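To make the histogram baseline concrete, here is a minimal sketch of a plug-in histogram estimate of the hockey-stick divergence for scalar outputs. It is not the library's HistogramPropertyTester. The function names, the fixed bin width, and the two Laplace examples are assumptions made for illustration. Note that a plug-in estimate of this kind is biased upward by sampling noise, so on its own it is not a high-probability lower bound.

```python
import math
import random
from collections import Counter


def histogram_hockey_stick(samples_p, samples_q, epsilon, bin_width):
    """Plug-in estimate of the hockey-stick divergence of order e^epsilon.

    Samples are bucketed into fixed-width bins, and the divergence is
    approximated by summing max(0, p_i - e^epsilon * q_i) over the bins.
    """
    counts_p = Counter(math.floor(x / bin_width) for x in samples_p)
    counts_q = Counter(math.floor(x / bin_width) for x in samples_q)
    total = 0.0
    for b, count in counts_p.items():  # bins with p_i = 0 contribute nothing
        p_i = count / len(samples_p)
        q_i = counts_q.get(b, 0) / len(samples_q)
        total += max(0.0, p_i - math.exp(epsilon) * q_i)
    return total


def laplace(rng, loc, scale):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return loc + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


rng = random.Random(0)
n = 20000
# Outputs on two neighboring datasets whose true answers differ by 1.
# Laplace noise with scale 1 satisfies pure DP with epsilon = 1 here.
private = ([laplace(rng, 0.0, 1.0) for _ in range(n)],
           [laplace(rng, 1.0, 1.0) for _ in range(n)])
# Hypothetical faulty variant that adds ten times less noise than required.
faulty = ([laplace(rng, 0.0, 0.1) for _ in range(n)],
          [laplace(rng, 1.0, 0.1) for _ in range(n)])

est_private = histogram_hockey_stick(*private, epsilon=1.0, bin_width=0.25)
est_faulty = histogram_hockey_stick(*faulty, epsilon=1.0, bin_width=0.25)
print(f"private: {est_private:.3f}  faulty: {est_faulty:.3f}")
```

The estimate for the faulty variant should come out close to 1, while the estimate for the correctly calibrated mechanism stays near 0. The result depends on the choice of bins, which is one reason the testers above avoid explicit histograms.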

Given two neighboring datasets, D and D’, the HockeyStickPropertyTester finds a lower bound, δ̂, for the hockey-stick divergence between M(D) and M(D’) that holds with high probability. Under an approximate DP guarantee, a bound on the hockey-stick divergence enforces that the two distributions M(D) and M(D’) are close. Therefore, if a privacy guarantee claims that the hockey-stick divergence is at most δ, and δ̂ > δ, then with high probability the divergence is higher than what was promised on D and D’, and the mechanism cannot satisfy the given approximate DP guarantee. The lower bound δ̂ is computed as an empirical and tractable counterpart of a variational formulation of the hockey-stick divergence (see the paper for more details). The accuracy of δ̂ increases with the number of samples drawn from the mechanism, but decreases as the variational formulation is simplified. We balance these factors in order to ensure that δ̂ is both accurate and easy to compute.
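The idea of a variational lower bound can be sketched with a deliberately small function class. The hockey-stick divergence of order e^ε equals the supremum of E_P[f] − e^ε E_Q[f] over functions f with values in [0, 1], so any fixed f gives a lower bound. The code below is an illustration under stated assumptions, not the library's HockeyStickPropertyTester: it restricts f to threshold indicators on scalar outputs, picks the threshold on one half of the samples, and subtracts a Hoeffding confidence term on the other half. All names and parameters are hypothetical.

```python
import math
import random


def laplace(rng, loc, scale):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return loc + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def objective(threshold, xs_p, xs_q, epsilon):
    """Empirical E_P[f] - e^epsilon * E_Q[f] for f(x) = 1 if x <= threshold."""
    mean_p = sum(x <= threshold for x in xs_p) / len(xs_p)
    mean_q = sum(x <= threshold for x in xs_q) / len(xs_q)
    return mean_p - math.exp(epsilon) * mean_q


def hockey_stick_lower_bound(samples_p, samples_q, epsilon, failure_prob=0.05):
    """Lower bound on the divergence that holds with prob. >= 1 - failure_prob."""
    half = min(len(samples_p), len(samples_q)) // 2
    train_p, test_p = samples_p[:half], samples_p[half:]
    train_q, test_q = samples_q[:half], samples_q[half:]

    # Step 1: choose f on the training half (about 50 candidate thresholds).
    candidates = sorted(train_p)[::max(1, half // 50)]
    best_t = max(candidates, key=lambda t: objective(t, train_p, train_q, epsilon))

    # Step 2: evaluate the fixed f on held-out samples and subtract a
    # Hoeffding term covering both empirical means (f takes values in [0, 1]).
    n = min(len(test_p), len(test_q))
    slack = (1 + math.exp(epsilon)) * math.sqrt(math.log(2 / failure_prob) / (2 * n))
    return objective(best_t, test_p, test_q, epsilon) - slack


rng = random.Random(0)
n = 4000
# Hypothetical faulty mechanism: noise scale 0.1 where scale 1 is required.
faulty_bound = hockey_stick_lower_bound(
    [laplace(rng, 0.0, 0.1) for _ in range(n)],
    [laplace(rng, 1.0, 0.1) for _ in range(n)],
    epsilon=1.0,
)
# Correctly calibrated mechanism: the true divergence is 0 for epsilon = 1.
private_bound = hockey_stick_lower_bound(
    [laplace(rng, 0.0, 1.0) for _ in range(n)],
    [laplace(rng, 1.0, 1.0) for _ in range(n)],
    epsilon=1.0,
)
print(f"faulty: {faulty_bound:.3f}  private: {private_bound:.3f}")
```

Comparing the returned value against a claimed δ reproduces the decision rule described above: a bound above δ is evidence of a violation, while a bound at or below δ is inconclusive. Threshold functions keep the sketch short. A richer function class tightens the bound at the cost of a harder optimization, which is the trade-off the paragraph above describes.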

Dataset finders use black-box optimization to find datasets D and D’ that maximize δ̂, a lower bound on the divergence value δ. Note that black-box optimization techniques are specifically designed for settings where deriving gradients for an objective function may be impractical or even impossible. These optimization techniques oscillate between exploration and exploitation phases to estimate the shape of the objective function and predict areas where the objective can have optimal values. In contrast, a full exploration algorithm, such as the grid search method, searches over the full space of neighboring datasets D and D’. DP-Auditorium implements different dataset finders through the open sourced black-box optimization library Vizier.
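As a rough illustration of what a dataset finder does, the sketch below runs a plain random search over pairs of neighboring datasets and stops once an estimated divergence clears the claimed δ by a margin. It does not use Vizier or any DP-Auditorium interface. The buggy mechanism, the crude estimator, the record range [0, 10], and the margin are all assumptions chosen to keep the example small.

```python
import math
import random


def faulty_sum(data, n_samples, rng, epsilon=1.0):
    """Hypothetical buggy mechanism: Laplace noise is calibrated for records
    in [0, 1], but the records are never clipped to that range."""
    total = sum(data)
    scale = 1.0 / epsilon
    return [total + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
            for _ in range(n_samples)]


def estimated_divergence(xs_p, xs_q, epsilon):
    """Crude hockey-stick estimate using a single midpoint threshold."""
    mean_p = sum(xs_p) / len(xs_p)
    mean_q = sum(xs_q) / len(xs_q)
    t = (mean_p + mean_q) / 2
    # Indicator of the side of the threshold where P has more mass.
    if mean_p <= mean_q:
        in_set = lambda x: x <= t
    else:
        in_set = lambda x: x > t
    frac_p = sum(map(in_set, xs_p)) / len(xs_p)
    frac_q = sum(map(in_set, xs_q)) / len(xs_q)
    return frac_p - math.exp(epsilon) * frac_q


def random_search(mechanism, epsilon, delta, max_trials, rng,
                  n_records=5, n_samples=2000, margin=0.1):
    """Returns (trials_used, best_estimate, d, d_prime).

    trials_used is None if no violation is found. `margin` is a rough
    stand-in for the confidence term a real property tester would use.
    """
    best = (-math.inf, None, None)
    for trial in range(1, max_trials + 1):
        d = [rng.uniform(0, 10) for _ in range(n_records)]
        d_prime = list(d)
        # Neighboring dataset: exactly one record is replaced.
        d_prime[rng.randrange(n_records)] = rng.uniform(0, 10)
        est = estimated_divergence(mechanism(d, n_samples, rng),
                                   mechanism(d_prime, n_samples, rng), epsilon)
        if est > best[0]:
            best = (est, d, d_prime)
        if est - margin > delta:
            return (trial, *best)
    return (None, *best)


rng = random.Random(0)
trials_used, best_estimate, d, d_prime = random_search(
    faulty_sum, epsilon=1.0, delta=1e-5, max_trials=50, rng=rng)
print(trials_used, round(best_estimate, 3))
```

Each trial is one call to the finder in the sense used later in the post: one candidate pair (D, D’) is proposed and scored. An exploration/exploitation method would use earlier scores to choose the next pair instead of sampling it blindly.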

Running existing components on a new mechanism only requires defining the mechanism as a Python function that takes an array of data D and a desired number of samples n to be output by the mechanism computed on D. In addition, we provide flexible wrappers for testers and dataset finders that allow practitioners to implement their own testing and dataset search algorithms.
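A mechanism in the shape described above (data in, n samples out) might look like the following sketch. The function names, the clipping range [0, 1], and the specific bug in the second function are assumptions for illustration. They are not the library's NonDPLaplaceMean or any other DP-Auditorium code.

```python
import random

_rng = random.Random(0)


def _laplace(scale):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return _rng.expovariate(1.0 / scale) - _rng.expovariate(1.0 / scale)


def laplace_mean(data, n, epsilon=1.0):
    """Returns n independent outputs of a private mean over records in [0, 1]."""
    clipped = [min(max(x, 0.0), 1.0) for x in data]
    # Replacing one record moves the mean of clipped data by at most 1/len(data).
    scale = (1.0 / len(clipped)) / epsilon
    true_mean = sum(clipped) / len(clipped)
    return [true_mean + _laplace(scale) for _ in range(n)]


def faulty_laplace_mean(data, n, epsilon=1.0):
    """Hypothetical buggy variant: same noise scale, but no clipping, so a
    single out-of-range record can shift the output by more than assumed."""
    scale = (1.0 / len(data)) / epsilon
    true_mean = sum(data) / len(data)
    return [true_mean + _laplace(scale) for _ in range(n)]


samples = laplace_mean([0.2, 0.4, 0.9], n=20000)
print(len(samples), round(sum(samples) / len(samples), 2))
```

Because the tester only needs samples, it can treat both functions identically and never inspects how the noise is generated.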

Key results

We assess the effectiveness of DP-Auditorium on five private and nine non-private mechanisms with diverse output spaces. For each property tester, we repeat the test ten times on fixed datasets using different values of ε, and report the number of times each tester identifies privacy bugs. While no tester consistently outperforms the others, we identify bugs that would be missed by previous techniques (HistogramPropertyTester). Note that the HistogramPropertyTester is not applicable to SVT mechanisms.

Number of times each property tester finds the privacy violation for the tested non-private mechanisms. NonDPLaplaceMean and NonDPGaussianMean mechanisms are faulty implementations of the Laplace and Gaussian mechanisms for computing the mean.

We also analyze the implementation of a DP gradient descent algorithm (DP-GD) in TensorFlow that computes gradients of the loss function on private data. To preserve privacy, DP-GD employs a clipping mechanism to bound the l2-norm of the gradients by a value G, followed by the addition of Gaussian noise. This implementation incorrectly assumes that the noise added has a scale of G, while in reality, the scale is sG, where s is a positive scalar. This discrepancy leads to an approximate DP guarantee that holds only for values of s greater than or equal to 1.
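The clip-then-add-noise step can be sketched as follows. This is a plain-Python illustration, not the TensorFlow implementation that was audited. The function names and the choice to sum per-example gradients are assumptions. The parameter `noise_scale` plays the role of sG in the text.

```python
import math
import random


def clip_l2(gradient, max_norm):
    """Rescales the gradient so its l2-norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in gradient]


def noisy_clipped_gradient(per_example_grads, clip_norm, noise_scale, rng):
    """Sum of clipped per-example gradients plus Gaussian noise.

    noise_scale is the standard deviation that is actually added. The bug
    described above corresponds to noise_scale = s * clip_norm with s < 1
    while the privacy accounting assumes s = 1.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip_l2(grad, clip_norm)):
            total[i] += g
    return [t + rng.gauss(0.0, noise_scale) for t in total]


rng = random.Random(0)
G = 1.0
grads = [[3.0, 4.0], [0.3, 0.4], [-2.0, 0.0]]
as_assumed = noisy_clipped_gradient(grads, G, noise_scale=1.0 * G, rng=rng)
as_implemented = noisy_clipped_gradient(grads, G, noise_scale=0.5 * G, rng=rng)
print(as_assumed, as_implemented)
```

With s = 0.5, the added noise has half the standard deviation that the accounting assumes, so outputs on neighboring datasets are easier to tell apart than the stated guarantee allows.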

We evaluate the effectiveness of property testers in detecting this bug and show that HockeyStickPropertyTester and RényiPropertyTester exhibit superior performance in identifying privacy violations, outperforming MMDPropertyTester and HistogramPropertyTester. Notably, these testers detect the bug even for values of s as high as 0.6. It is worth highlighting that s = 0.5 corresponds to a common error in the literature that involves missing a factor of two when accounting for the privacy budget ε. DP-Auditorium successfully captures this bug as shown below. For more details, see section 5.6 of the paper.

Estimated divergences and test thresholds for different values of s when testing DP-GD with the HistogramPropertyTester (left) and the HockeyStickPropertyTester (right).

Estimated divergences and test thresholds for different values of s when testing DP-GD with the RényiPropertyTester (left) and the MMDPropertyTester (right).

To test dataset finders, we compute the number of datasets explored before finding a privacy violation. On average, the majority of bugs are discovered in fewer than 10 calls to dataset finders. Randomized and exploration/exploitation methods are more efficient at finding datasets than grid search. For more details, see the paper.

Conclusion

DP is one of the most powerful frameworks for data protection. However, proper implementation of DP mechanisms can be challenging and prone to errors that cannot be easily detected using traditional unit testing methods. A unified testing framework can help auditors, regulators, and academics ensure that private mechanisms are indeed private.

DP-Auditorium is a new approach to testing DP via divergence optimization over function spaces. Our results show that this type of function-based estimation consistently outperforms previous black-box access testers. Finally, we demonstrate that these function-based estimators allow for a better discovery rate of privacy bugs compared to histogram estimation. By open sourcing DP-Auditorium, we aim to establish a standard for end-to-end testing of new differentially private algorithms.

Acknowledgements

The work described here was done jointly with Andrés Muñoz Medina, William Kong and Umar Syed. We thank Chris Dibak and Vadym Doroshenko for helpful engineering support and interface suggestions for our library.

GraphRAG: Unlocking LLM discovery on narrative private data

Perhaps the greatest challenge – and opportunity – of LLMs is extending their powerful capabilities to solve problems beyond the data on which they have been trained, and to achieve comparable results with data the LLM has never seen.  This opens new possibilities in data investigation, such as identifying themes and semantic concepts with context and grounding on datasets.  In this post, we introduce GraphRAG, created by Microsoft Research, as a significant advance in enhancing the capability of LLMs.

Retrieval-Augmented Generation (RAG) is a technique to search for information based on a user query and provide the results as reference for an AI answer to be generated. This technique is an important part of most LLM-based tools, and the majority of RAG approaches use vector similarity as the search technique. GraphRAG uses LLM-generated knowledge graphs to provide substantial improvements in question-and-answer performance when conducting document analysis of complex information. This builds upon our recent research, which points to the power of prompt augmentation when performing discovery on private datasets. Here, we define a private dataset as data that the LLM is not trained on and has never seen before, such as an enterprise’s proprietary research, business documents, or communications. Baseline RAG¹ was created to help solve this problem, but we observe situations where baseline RAG performs very poorly. For example:

  • Baseline RAG struggles to connect the dots.  This happens when answering a question requires traversing disparate pieces of information through their shared attributes in order to provide new synthesized insights.
  • Baseline RAG performs poorly when being asked to holistically understand summarized semantic concepts over large data collections or even singular large documents.

To address this, the tech community is working to develop methods that extend and enhance RAG (e.g., LlamaIndex).  Microsoft Research’s new approach, GraphRAG, uses the LLM to create a knowledge graph based on the private dataset.  This graph is then used alongside graph machine learning to perform prompt augmentation at query time.  GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.

Applying RAG to private datasets

To demonstrate the effectiveness of GraphRAG, let’s start with an investigation using the Violent Incident Information from News Articles (VIINA) dataset.  This dataset was chosen due to its complexity and the presence of differing opinions and partial information.  It is a messy real-world test case that was recent enough not to be included in the LLM base model’s training.

For this research, we use thousands of news articles from both Russian and Ukrainian news sources for the month of June 2023, translated into English, to create a private dataset on which we will perform our LLM-based retrieval.  The dataset is far too large to fit into an LLM context window, thus demanding a RAG approach.

We start with an exploratory query, which we pose to both a baseline RAG system and to our new approach, GraphRAG:

Query: “What is Novorossiya?”

Baseline RAG:
The term “Novorossiya” is not mentioned in the provided context. However, to provide some information, Novorossiya, or New Russia, is a historical term referring to a region in the south of Ukraine that was conquered by the Russian Empire in the 18th century. In recent years, the term has been revived by Russian-backed separatists in Eastern Ukraine to refer to the territories they claim, but it is not officially recognized as a separate entity by international bodies.

GraphRAG:
Novorossiya, also known as New Russia, is a term with historical and contemporary significance. Historically, it referred to a region in the southern part of the Russian Empire that is now part of Ukraine. In the contemporary context, it has been used by pro-Russian separatists to refer to certain areas of Ukraine [Data: Entities (6494, 912, 6493)].

The term Novorossiya has been used in the context of a political movement in Ukraine, particularly by Russian nationalists. This movement has been associated with efforts to destabilize Ukraine and create a so-called “Novorossiya” [Data: Entities (912); Relationships (15211, 15206)]. The term has also been used to denote a separatist state in Ukraine, known as the Federal Republic of Novorossiya (FRN) [Data: Entities (7690)].

In these results, we can see both systems perform well – highlighting a class of query on which baseline RAG performs well.  Let’s try a query that requires connecting the dots:

Query: “What has Novorossiya done?”

Baseline RAG:
The text does not provide specific information on what Novorossiya has done.

GraphRAG:
Novorossiya, a political movement in Ukraine, has been involved in a series of destructive activities, particularly targeting various entities in Ukraine [Entities (6494, 912)]. The movement has been linked to plans to destroy properties of several Ukrainian entities, including Rosen, the Odessa Canning Factory, the Odessa Regional Radio Television Transmission Center, and the National Television Company of Ukraine [Relationships (15207, 15208, 15209, 15210)]. In addition to these, Novorossiya has also been associated with plans to destroy properties of Roshen, the Odessa Conservatory Factory, the Odesa Regional Radio Television Transmission Center, and the National Telecommunications Company of Ukraine [Relationships (8329, 8330, 8331, 8332)]. Furthermore, Novorossiya has been implicated in plans to blow up ATMs of PrivatBank, the largest state bank in Ukraine [Relationships (8335, 8335)].

The Office of the General Prosecutor in Ukraine has reported on the creation of Novorossiya, indicating the government’s awareness and potential concern over the activities of this movement [Relationships (8264, 8264)]. The destructive activities of Novorossiya have had significant impacts on the targeted entities and have likely contributed to the ongoing tensions and conflicts in the region.

In summary, Novorossiya has been involved in a series of destructive activities targeting various entities in Ukraine. These activities have had significant impacts on the targeted entities and have likely contributed to the ongoing tensions and conflicts in the region. The Ukrainian government, through the Office of the General Prosecutor, has acknowledged the existence and activities of Novorossiya, indicating a level of concern over the movement’s actions.

Baseline RAG fails to answer this question.  Looking at the source documents inserted into the context window (Figure 1), none of the text segments discuss Novorossiya, resulting in this failure.

Figure 1: Baseline RAG (LangChain Q&A) retrieved context. A table entitled “Relevant chunks of source documents” with 10 rows of text segments pulled from the VIINA dataset. Each text segment mentions a news event happening in Ukraine and Russia. None include the term ‘Novorossiya’.

In comparison, the GraphRAG approach discovered an entity in the query, Novorossiya.  This allows the LLM to ground itself in the graph and results in a superior answer that contains provenance through links to the original supporting text.  For example, Figure 2 below shows the exact content the LLM used for the LLM-generated statement, “Novorossiya has been implicated in plans to blow up ATMs.” We see the snippet from the raw source documents (after English translation) that the LLM used to support the assertion that a specific bank was a target for Novorossiya via the relationship that exists between the two entities in the graph. 

Figure 2: GraphRAG provenance. An image of the GraphRAG system displaying a table of the VIINA source text used to ground the connection between Novorossiya and PrivatBank. The table has three columns for source, date, and text. There is a single row of content shown. The row shows the source is from ‘interfaxua’, the date of publication is June 8, 2023, and the text box contains a paragraph taken from the source document. In summary, the text describes the creation of Novorossiya with intent to commit acts of terrorism targeting PrivatBank, the Regional Radio and Television Broadcasting Center, and other targets. It describes recruitment of residents of Odessa. Highlighted in the text box are two separate strings of text. The first is the word ‘Novorossiya’ and the second is the text ‘criminal blew up buildings of military commissariats, ATMs’.

By using the LLM-generated knowledge graph, GraphRAG vastly improves the “retrieval” portion of RAG, populating the context window with higher relevance content, resulting in better answers and capturing evidence provenance. 

Being able to trust and verify LLM-generated results is always important.  We care that the results are factually correct, coherent, and accurately represent content found in the source material. GraphRAG provides the provenance, or source grounding information, as it generates each response.  It demonstrates that an answer is grounded in the dataset.  Having the cited source for each assertion readily available also enables a human user to quickly and accurately audit the LLM’s output directly against the original source material.   

However, this isn’t all that’s possible using GraphRAG. 

Whole dataset reasoning 

Baseline RAG struggles with queries that require aggregation of information across the dataset to compose an answer. Queries such as “What are the top 5 themes in the data?” perform terribly because baseline RAG relies on a vector search of semantically similar text content within the dataset. There is nothing in the query to direct it to the correct information. 
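The failure mode can be reproduced with a toy retriever. The sketch below ranks chunks by cosine similarity between bag-of-words vectors, which stands in for the learned embeddings a real baseline RAG system would use. The three chunks are invented for illustration and are not drawn from the VIINA dataset.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. A real system would use a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k):
    """Returns the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Made-up chunks: two about the conflict, one that merely shares query words.
chunks = [
    "Shelling was reported near the front line overnight.",
    "Officials discussed the top themes of the urban development forum.",
    "Troops moved toward the occupied city.",
]
top = retrieve("What are the top 5 themes in the data?", chunks, k=1)
print(top[0])
```

The retriever returns the chunk that shares the words “top” and “themes” with the query, even though most of the collection is about something else. Nothing in the similarity score reflects what the dataset as a whole contains.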

However, with GraphRAG we can answer such questions, because the structure of the LLM-generated knowledge graph tells us about the structure (and thus themes) of the dataset as a whole.  This allows the private dataset to be organized into meaningful semantic clusters that are pre-summarized.  The LLM uses these clusters to summarize these themes when responding to a user query. 

We illustrate whole-dataset reasoning abilities by posing the following question to the two systems: 

Query: “What are the top 5 themes in the data?”

Baseline RAG:
1. Improving the quality of life in cities and urban development.
2. The meeting between Vladimir Putin and the President of the United Arab Emirates, Mohammed bin Zayed Al Nahyan.
3. The National Rating of the Investment Climate in the Regions of Russia.
4. The state of the Russian economy and the growth of domestic brands.
5. The discussion on the meaning of life and the role of science in understanding it.

GraphRAG:
The data presents a wide range of themes, but the top five most prevalent themes can be identified as follows:

1. Conflict and Military Activity: A significant portion of the data revolves around the ongoing conflict in Ukraine, with numerous reports detailing military activities, occupations, and attacks by various entities. This theme includes the actions of the Ukrainian military, Russian forces, and various other military units and entities [Data: Reports (513, 241, 151, 493, 343, +more)].

2. Political and Governmental Entities: Many reports focus on political figures and governmental entities, highlighting their roles, relationships, and actions in the context of the current situation. This includes entities like the Kremlin, the President of Ukraine, the US State Department, and various regional administrations [Data: Reports (98, 493, 265, 129, 196, +more)].

3. Infrastructure and Environmental Concerns: Several reports highlight potential environmental impacts and infrastructure threats. These include potential environmental damage due to industrial activities, flooding crises, and threats to infrastructure due to military conflicts [Data: Reports (308, 154, 201, 348, 216, +more)].

4. Community Analysis and Threat Assessment: Several reports provide detailed analyses of specific communities, often centered around a particular location or entity. These analyses typically discuss the key entities within the community, their relationships, and any associated threats. Many reports also include assessments of potential threats associated with the entities or situations they discuss [Data: Reports (323, 596, 518, 583, 179, +more)].

5. Health, Humanitarian Concerns, and Emergency Situations: Some reports focus on health and humanitarian issues, such as the Shereshevsky-Turner Syndrome community and the potential risk of intestinal diseases in the Odessa region. The data also covers emergency situations, including natural disasters, accidents, and attacks [Data: Reports (204, 149, 372, 154, 532, +more)].

These themes provide a comprehensive overview of the current situation, highlighting the complexity and interconnectedness of the issues at hand.

Looking at the results from baseline RAG, we see that none of the listed themes has much to do with the war between the two countries.  As anticipated, the vector search retrieved irrelevant text, which was inserted into the LLM’s context window.  Results that were included were likely keying on the word “theme,” resulting in a less than useful assessment of what is going on in the dataset. 

Observing the results from GraphRAG, we can clearly see that the results are far more aligned with what is going on in the dataset as a whole.  The answer provides the five main themes as well as supporting details that are observed in the dataset.  The referenced reports are pre-generated by the LLM for each semantic cluster in GraphRAG and, in turn, provide provenance back to original source material.

Creating LLM-generated knowledge graphs

We note the basic flow that underpins GraphRAG, which builds upon our prior research and repositories using graph machine learning:

  • The LLM processes the entire private dataset, creating references to all entities and relationships within the source data, which are then used to create an LLM-generated knowledge graph. 
  • This graph is then used to create a bottom-up clustering that organizes the data hierarchically into semantic clusters (indicated by using color in Figure 3 below).  This partitioning allows for pre-summarization of semantic concepts and themes, which aids in holistic understanding of the dataset. 
  • At query time, both of these structures are used to provide materials for the LLM context window when answering a question. 

An example visualization of the graph is shown in Figure 3.  Each circle is an entity (e.g., a person, place, or organization), with the entity size representing the number of relationships that entity has, and the color representing groupings of similar entities.  The color partitioning is a bottom-up clustering method built on top of the graph structure, which enables us to answer questions at varying levels of abstraction.
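The graph side of this flow can be sketched in a few lines. In the sketch below, the LLM extraction step is replaced by a hand-written list of (entity, relation, entity) triples loosely based on entities named in the examples above, with relation labels invented for illustration. Connected components stand in for the hierarchical bottom-up clustering, which is a simplification and not the method GraphRAG uses.

```python
from collections import defaultdict

# Hand-written stand-in for LLM-extracted triples (hypothetical relation labels).
triples = [
    ("Novorossiya", "targets", "PrivatBank"),
    ("Novorossiya", "targets", "Roshen"),
    ("Office of the General Prosecutor", "reports on", "Novorossiya"),
    ("Vladimir Putin", "met", "Mohammed bin Zayed Al Nahyan"),
]


def build_graph(triples):
    """Undirected adjacency sets keyed by entity name."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def connected_components(graph):
    """Groups entities that are reachable from one another."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(graph[node] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters


graph = build_graph(triples)
# Number of relationships per entity, the quantity shown as circle size.
degree = {entity: len(neighbors) for entity, neighbors in graph.items()}
clusters = connected_components(graph)
print(degree["Novorossiya"], len(clusters))
```

Each cluster could then be summarized once, ahead of time, and those summaries retrieved at query time alongside the entities matched in the question.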

Figure 3: LLM-generated knowledge graph built from a private dataset using GPT-4 Turbo. A knowledge graph visualization represented by a collection of circles of varying sizes and colors in 3D space, projected onto a 2D image. The circles are grouped together in space by color, and within each color area the larger circles are surrounded by many smaller circles. Each circle represents an entity within the knowledge graph.

Result metrics

The illustrative examples above are representative of GraphRAG’s consistent improvement across multiple datasets in different subject domains.  We assess this improvement by performing an evaluation using an LLM grader to determine a pairwise winner between GraphRAG and baseline RAG.  We use a set of qualitative metrics, including comprehensiveness (completeness within the framing of the implied context of the question), human enfranchisement (provision of supporting source material or other contextual information), and diversity (provision of differing viewpoints or angles on the question posed). Initial results show that GraphRAG consistently outperforms baseline RAG on these metrics.  

In addition to relative comparisons, we also use SelfCheckGPT to perform an absolute measurement of faithfulness to help ensure factual, coherent results grounded in the source material. Results show that GraphRAG achieves a similar level of faithfulness to baseline RAG. We are currently developing an evaluation framework to measure performance on the class of problems above.  This will include more robust mechanisms for generating question-answer test sets as well as additional metrics, such as accuracy and context relevance.

Next steps

By combining LLM-generated knowledge graphs and graph machine learning, GraphRAG enables us to answer important classes of questions that we cannot attempt with baseline RAG alone.  We have seen promising results after applying this technology to a variety of scenarios, including social media, news articles, workplace productivity, and chemistry.  Looking forward, we plan to work closely with customers on a variety of new domains as we continue to apply this technology while working on metrics and robust evaluation. We look forward to sharing more as our research continues.


¹ As baseline RAG in this comparison, we use LangChain’s Q&A, a well-known representative example of this class of RAG tools in widespread use today.

The post GraphRAG: Unlocking LLM discovery on narrative private data appeared first on Microsoft Research.

How BigBasket improved AI-enabled checkout at their physical stores using Amazon SageMaker

This post is co-written with Santosh Waddi and Nanda Kishore Thatikonda from BigBasket.

BigBasket is India’s largest online food and grocery store. They operate in multiple ecommerce channels such as quick commerce, slotted delivery, and daily subscriptions. You can also buy from their physical stores and vending machines. They offer a large assortment of over 50,000 products across 1,000 brands, and are operating in more than 500 cities and towns. BigBasket serves over 10 million customers.

In this post, we discuss how BigBasket used Amazon SageMaker to train their computer vision model for Fast-Moving Consumer Goods (FMCG) product identification, which helped them reduce training time by approximately 50% and save costs by 20%.

Customer challenges

Today, most supermarkets and physical stores in India provide manual checkout at the checkout counter. This has two issues:

  • It requires additional manpower, weight stickers, and repeated training for the in-store operational team as they scale.
  • In most stores, the checkout counter is different from the weighing counters, which adds to the friction in the customer purchase journey. Customers often lose the weight sticker and have to go back to the weighing counters to collect one again before proceeding with the checkout process.

Self-checkout process

BigBasket introduced an AI-powered checkout system in their physical stores that uses cameras to distinguish items uniquely. The following figure provides an overview of the checkout process.

Self-Checkout

The BigBasket team was running open source, in-house ML algorithms for computer vision object recognition to power AI-enabled checkout at their Fresho (physical) stores. They faced the following challenges in operating their existing setup:

  • With the continuous introduction of new products, the computer vision model needed to continuously incorporate new product information. The system needed to handle a large catalog of over 12,000 Stock Keeping Units (SKUs), with new SKUs being continually added at a rate of over 600 per month.
  • To keep pace with new products, a new model was produced each month using the latest training data. It was costly and time consuming to train the models frequently to adapt to new products.
  • BigBasket also wanted to reduce the training cycle time to improve the time to market. As the number of SKUs grew, model training time increased linearly. Because retraining was frequent and each run took a long time, this delayed their time to market.
  • Data augmentation for model training and manually managing the complete end-to-end training cycle was adding significant overhead. BigBasket was running this on a third-party platform, which incurred significant costs.

Solution overview

We recommended that BigBasket rearchitect their existing FMCG product detection and classification solution using SageMaker to address these challenges. Before moving to full-scale production, BigBasket tried a pilot on SageMaker to evaluate performance, cost, and convenience metrics.

Their objective was to fine-tune an existing computer vision machine learning (ML) model for SKU detection. We used a convolutional neural network (CNN) architecture with ResNet152 for image classification. A sizable dataset of around 300 images per SKU was estimated for model training, resulting in over 4 million total training images. For certain SKUs, we augmented data to encompass a broader range of environmental conditions.

The following diagram illustrates the solution architecture.

Architecture

The complete process can be summarized into the following high-level steps:

  1. Perform data cleansing, annotation, and augmentation.
  2. Store data in an Amazon Simple Storage Service (Amazon S3) bucket.
  3. Use SageMaker and Amazon FSx for Lustre for efficient data augmentation.
  4. Split data into train, validation, and test sets. We used FSx for Lustre and Amazon Relational Database Service (Amazon RDS) for fast parallel data access.
  5. Use a custom PyTorch Docker container including other open source libraries.
  6. Use SageMaker Distributed Data Parallelism (SMDDP) for accelerated distributed training.
  7. Log model training metrics.
  8. Copy the final model to an S3 bucket.

BigBasket used SageMaker notebooks to train their ML models and were able to easily port their existing open source PyTorch and other open source dependencies to a SageMaker PyTorch container and run the pipeline seamlessly. This was the first benefit seen by the BigBasket team, because there were hardly any changes needed to the code to make it compatible to run on a SageMaker environment.

The model network consists of a ResNet152 architecture followed by fully connected layers. We froze the low-level feature layers and retained the weights acquired through transfer learning from the ImageNet model. The model had 66 million parameters in total, of which 23 million were trainable. This transfer learning-based approach helped them use fewer images at the time of training, and also enabled faster convergence and reduced the total training time.

Building and training the model within Amazon SageMaker Studio provided an integrated development environment (IDE) with everything needed to prepare, build, train, and tune models. Augmenting the training data using techniques like cropping, rotating, and flipping images helped improve the model training data and model accuracy.
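The augmentation techniques named above (cropping, rotating, and flipping) can be illustrated with a minimal sketch. It assumes a grayscale image stored as nested Python lists; the function names are hypothetical, and a production pipeline would normally rely on an image library rather than this hand-rolled code.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def augment(img: Image) -> List[Image]:
    """Return the original image plus one flipped, one rotated, and one cropped variant."""
    h, w = len(img), len(img[0])
    return [
        img,
        flip_horizontal(img),
        rotate_90_clockwise(img),
        crop(img, 0, 0, max(1, h - 1), max(1, w - 1)),
    ]
```

Each source image yields several training samples, which is how augmentation widens the range of conditions the model sees without collecting new photos.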

Model training was accelerated by 50% through the use of the SMDDP library, which includes optimized communication algorithms designed specifically for AWS infrastructure. To improve data read/write performance during model training and data augmentation, we used FSx for Lustre for high-performance throughput.

Their starting training data size was over 1.5 TB. We used two Amazon Elastic Compute Cloud (Amazon EC2) p4d.24xlarge instances, each with 8 GPUs and 40 GB of memory per GPU. For SageMaker distributed training, the instances need to be in the same AWS Region and Availability Zone. Also, training data stored in an S3 bucket needs to be in the same Region as the training job. This architecture also allows BigBasket to change to other instance types or add more instances to the current architecture to cater to any significant data growth or achieve further reduction in training time.

How the SMDDP library helped reduce training time, cost, and complexity

In traditional distributed data training, the training framework assigns ranks to GPUs (workers) and creates a replica of your model on each GPU. During each training iteration, the global data batch is divided into pieces (batch shards) and a piece is distributed to each worker. Each worker then proceeds with the forward and backward pass defined in your training script on each GPU. Finally, model weights and gradients from the different model replicas are synced at the end of the iteration through a collective communication operation called AllReduce. After each worker and GPU has a synced replica of the model, the next iteration begins.
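The iteration described above can be simulated in plain Python. This is a toy sketch, not the SMDDP library: it assumes a one-parameter linear model, equally sized shards, and an AllReduce that averages gradients, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def shard_batch(batch: Sequence[Sample], num_workers: int) -> List[List[Sample]]:
    """Split the global batch into one equally sized shard per worker."""
    if len(batch) % num_workers != 0:
        raise ValueError("global batch must divide evenly across workers")
    size = len(batch) // num_workers
    return [list(batch[i * size:(i + 1) * size]) for i in range(num_workers)]


def local_gradient(w: float, shard: Sequence[Sample]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: Sequence[float]) -> List[float]:
    """Toy AllReduce: every worker receives the mean of all workers' values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train_step(weights: List[float], batch: Sequence[Sample], lr: float) -> List[float]:
    """One data-parallel iteration; weights[i] is the model replica on worker i."""
    shards = shard_batch(batch, len(weights))
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]  # backward pass per worker
    synced = all_reduce_mean(grads)                                  # collective communication
    return [w - lr * g for w, g in zip(weights, synced)]             # identical update everywhere
```

Because every replica applies the same averaged gradient, the replicas stay identical after each step, and with equal shards the update matches what a single worker would compute on the whole global batch.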

The SMDDP library is a collective communication library that improves the performance of this distributed data parallel training process. The SMDDP library reduces the communication overhead of the key collective communication operations such as AllReduce. Its implementation of AllReduce is designed for AWS infrastructure and can speed up training by overlapping the AllReduce operation with the backward pass. This approach achieves near-linear scaling efficiency and faster training speed by optimizing kernel operations between CPUs and GPUs.

Note the following calculations:

  • The size of the global batch is (number of nodes in a cluster) * (number of GPUs per node) * (batch shard size)
  • A batch shard (small batch) is a subset of the dataset assigned to each GPU (worker) per iteration
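As a worked instance of the calculation above, the following sketch computes the global batch size. The two nodes and eight GPUs per node match the cluster described in this post; the shard size of 32 is a hypothetical value chosen only for illustration.

```python
def global_batch_size(num_nodes: int, gpus_per_node: int, batch_shard_size: int) -> int:
    """Global batch = nodes in the cluster * GPUs per node * samples per GPU shard."""
    for name, value in (
        ("num_nodes", num_nodes),
        ("gpus_per_node", gpus_per_node),
        ("batch_shard_size", batch_shard_size),
    ):
        if value <= 0:
            raise ValueError(f"{name} must be positive")
    return num_nodes * gpus_per_node * batch_shard_size


# 2 nodes with 8 GPUs each, and an assumed shard of 32 images per GPU.
example = global_batch_size(num_nodes=2, gpus_per_node=8, batch_shard_size=32)
```

Under these assumptions, each iteration processes 512 images across 16 workers.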

BigBasket used the SMDDP library to reduce their overall training time. With FSx for Lustre, we improved data read/write throughput during model training and data augmentation. With data parallelism, BigBasket achieved almost 50% faster and 20% cheaper training compared to other alternatives, delivering the best performance on AWS. SageMaker automatically shuts down the training pipeline post-completion. The project completed successfully with 50% faster training time on AWS (4.5 days on AWS vs. 9 days on their legacy platform).

At the time of writing this post, BigBasket has been running the complete solution in production for more than 6 months, scaling the system to serve new cities and adding new stores every month.

“Our partnership with AWS on migration to distributed training using their SMDDP offering has been a great win. Not only did it cut down our training times by 50%, it was also 20% cheaper. In our entire partnership, AWS has set the bar on customer obsession and delivering results—working with us the whole way to realize promised benefits.”

– Keshav Kumar, Head of Engineering at BigBasket.

Conclusion

In this post, we discussed how BigBasket used SageMaker to train their computer vision model for FMCG product identification. The implementation of an AI-powered automated self-checkout system delivers an improved retail customer experience through innovation, while eliminating human errors in the checkout process. Accelerating new product onboarding by using SageMaker distributed training reduces SKU onboarding time and cost. Integrating FSx for Lustre enables fast parallel data access for efficient model retraining with hundreds of new SKUs monthly. Overall, this AI-based self-checkout solution provides an enhanced shopping experience devoid of frontend checkout errors. The automation and innovation have transformed their retail checkout and onboarding operations.

SageMaker provides end-to-end ML development, deployment, and monitoring capabilities such as a SageMaker Studio notebook environment for writing code, data acquisition, data tagging, model training, model tuning, deployment, monitoring, and much more. If your business is facing any of the challenges described in this post and wants to reduce time to market and cost, reach out to the AWS account team in your Region and get started with SageMaker.


About the Authors

Santosh Waddi is a Principal Engineer at BigBasket and brings over a decade of expertise in solving AI challenges. With a strong background in computer vision, data science, and deep learning, he holds a postgraduate degree from IIT Bombay. Santosh has authored notable IEEE publications and, as a seasoned tech blog author, he has also made significant contributions to the development of computer vision solutions during his tenure at Samsung.

Nanda Kishore Thatikonda is an Engineering Manager leading the Data Engineering and Analytics at BigBasket. Nanda has built multiple applications for anomaly detection and has a patent filed in a similar space. He has worked on building enterprise-grade applications, building data platforms in multiple organizations and reporting platforms to streamline decisions backed by data. Nanda has over 18 years of experience working in Java/J2EE, Spring technologies, and big data frameworks using Hadoop and Apache Spark.

Sudhanshu Hate is a Principal AI & ML Specialist with AWS and works with clients to advise them on their MLOps and generative AI journey. In his previous role, he conceptualized, created, and led teams to build a ground-up, open source-based AI and gamification platform, and successfully commercialized it with over 100 clients. Sudhanshu has to his credit a couple of patents; has written 2 books, several papers, and blogs; and has presented his point of view in various forums. He has been a thought leader and speaker, and has been in the industry for nearly 25 years. He has worked with Fortune 1000 clients across the globe and most recently is working with digital native clients in India.

Ayush Kumar is a Solutions Architect at AWS. He works with a wide variety of AWS customers, helping them adopt the latest modern applications and innovate faster with cloud-native technologies. You’ll find him experimenting in the kitchen in his spare time.

Amazon SageMaker Feature Store now supports cross-account sharing, discovery, and access

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics. Features are used repeatedly by multiple teams, and feature quality is critical to ensure a highly accurate model. Also, when features used to train models offline in batch are made available for real-time inference, it’s hard to keep the two feature stores synchronized. SageMaker Feature Store provides a secured and unified store to process, standardize, and use features at scale across the ML lifecycle.

SageMaker Feature Store now makes it effortless to share, discover, and access feature groups across AWS accounts. This new capability promotes collaboration and minimizes duplicate work for teams involved in ML model and application development, particularly in enterprise environments with multiple accounts spanning different business units or functions.

With this launch, account owners can grant access to select feature groups by other accounts using AWS Resource Access Manager (AWS RAM). After they’re granted access, users of those accounts can conveniently view all of their feature groups, including the shared ones, through Amazon SageMaker Studio or SDKs. This enables teams to discover and utilize features developed by other teams, fostering knowledge sharing and efficiency. Additionally, usage details of shared resources can be monitored with Amazon CloudWatch and AWS CloudTrail. For a deep dive, refer to Cross account feature group discoverability and access.

In this post, we discuss the why and how of a centralized feature store with cross-account access. We show how to set it up and run a sample demonstration, as well as the benefits you can get by using this new capability in your organization.

Who needs a cross-account feature store

Organizations need to securely share features across teams to build accurate ML models, while preventing unauthorized access to sensitive data. SageMaker Feature Store now allows granular sharing of features across accounts via AWS RAM, enabling collaborative model development with governance.

SageMaker Feature Store provides purpose-built storage and management for ML features used during training and inferencing. With cross-account support, you can now selectively share features stored in one AWS account with other accounts in your organization.

For example, the analytics team may curate features like customer profile, transaction history, and product catalogs in a central management account. These need to be securely accessed by ML developers in other departments like marketing, fraud detection, and so on to build models.

The following are key benefits of sharing ML features across accounts:

  • Consistent and reusable features – Centralized sharing of curated features improves model accuracy by providing consistent input data to train on. Teams can discover and directly consume features created by others instead of duplicating them in each account.
  • Feature group access control – You can grant access to only the specific feature groups required for an account’s use case. For example, the marketing team may only get access to the customer profile feature group needed for recommendation models.
  • Collaboration across teams – Shared features allow disparate teams like fraud, marketing, and sales to collaborate on building ML models using the same reliable data instead of creating siloed features.
  • Audit trail for compliance – Administrators can monitor feature usage by all accounts centrally using CloudTrail event logs. This provides an audit trail required for governance and compliance.

Delineating producers from consumers in cross-account feature stores

In the realm of machine learning, the feature store acts as a crucial bridge, connecting those who supply data with those who harness it. This dichotomy can be effectively managed using a cross-account setup for the feature store. Let’s demystify this using the following personas and a real-world analogy:

  • Data and ML engineers (owners and producers) – They lay the groundwork by feeding data into the feature store
  • Data scientists (consumers) – They extract and utilize this data to craft their models

Data engineers serve as architects sketching the initial blueprint. Their task is to construct and oversee efficient data pipelines. Drawing data from source systems, they mold raw data attributes into discernable features. Take “age” for instance. Although it merely represents the span between now and one’s birthdate, its interpretation might vary across an organization. Ensuring quality, uniformity, and consistency is paramount here. Their aim is to feed data into a centralized feature store, establishing it as the undisputed reference point.

ML engineers refine these foundational features, tailoring them for mature ML workflows. In the context of banking, they might deduce statistical insights from account balances, identifying trends and flow patterns. The hurdle they often face is redundancy. It’s common to see repetitive feature creation pipelines across diverse ML initiatives.

Imagine data scientists as gourmet chefs scouting a well-stocked pantry, seeking the best ingredients for their next culinary masterpiece. Their time should be invested in crafting innovative data recipes, not in reassembling the pantry. The hurdle at this juncture is discovering the right data. A user-friendly interface, equipped with efficient search tools and comprehensive feature descriptions, is indispensable.

In essence, a cross-account feature store setup meticulously segments the roles of data producers and consumers, ensuring efficiency, clarity, and innovation. Whether you’re laying the foundation or building atop it, knowing your role and tools is pivotal.

The following diagram shows two different data scientist teams, from two different AWS accounts, who share and use the same central feature store to select the best features needed to build their ML models. The central feature store is located in a different account managed by data engineers and ML engineers, where the data governance layer and data lake are usually situated.

Cross-account feature group controls

With SageMaker Feature Store, you can share feature group resources across accounts. The resource owner account shares resources with the resource consumer accounts. There are two distinct categories of permissions associated with sharing resources:

  • Discoverability permissions – Discoverability means being able to see feature group names and metadata. When you grant discoverability permission, all feature group entities in the account that you share from (resource owner account) become discoverable by the accounts that you are sharing with (resource consumer accounts). For example, if you make the resource owner account discoverable by the resource consumer account, then principals of the resource consumer account can see all feature groups contained in the resource owner account. This permission is granted to resource consumer accounts by using the SageMaker catalog resource type.
  • Access permissions – When you grant an access permission, you do so at the feature group resource level (not the account level). This gives you more granular control over granting access to data. The type of access permissions that can be granted are read-only, read/write, and admin. For example, you can select only certain feature groups from the resource owner account to be accessible by principals of the resource consumer account, depending on your business needs. This permission is granted to resource consumer accounts by using the feature group resource type and specifying feature group entities.

The following example diagram visualizes sharing the SageMaker catalog resource type granting the discoverability permission vs. sharing a feature group resource type entity with access permissions. The SageMaker catalog contains all of your feature group entities. When granted a discoverability permission, the resource consumer account can search and discover all feature group entities within the resource owner account. A feature group entity contains your ML data. When granted an access permission, the resource consumer account can access the feature group data, with access determined by the relevant access permission.

Solution overview

Complete the following steps to securely share features between accounts using SageMaker Feature Store:

  1. In the source (owner) account, ingest datasets and prepare normalized features. Organize related features into logical groups called feature groups.
  2. Create a resource share to grant cross-account access to specific feature groups. Define allowed actions like get and put, and restrict access only to authorized accounts.
  3. In the target (consumer) accounts, accept the AWS RAM invitation to access shared features. Review the access policy to understand permissions granted.

Developers in target accounts can now retrieve shared features using the SageMaker SDK, join with additional data, and use them to train ML models. The source account can monitor access to shared features by all accounts using CloudTrail event logs. Audit logs provide centralized visibility into feature usage.

With these steps, you can enable teams across your organization to securely use shared ML features for collaborative model development.

Prerequisites

We assume that you have already created feature groups and ingested the corresponding features inside your owner account. For more information about getting started, refer to Get started with Amazon SageMaker Feature Store.

Grant discoverability permissions

First, we demonstrate how to share our SageMaker Feature Store catalog in the owner account. Complete the following steps:

  1. In the owner account of the SageMaker Feature Store catalog, open the AWS RAM console.
  2. Under Shared by me in the navigation pane, choose Resource shares.
  3. Choose Create resource share.
  4. Enter a resource share name and choose SageMaker Resource Catalogs as the resource type.
  5. Choose Next.
  6. For discoverability-only access, enter AWSRAMPermissionSageMakerCatalogResourceSearch for Managed permissions.
  7. Choose Next.
  8. Enter your consumer account ID and choose Add. You may add several consumer accounts.
  9. Choose Next and complete your resource share.

Now the shared SageMaker Feature Store catalog should show up on the Resource shares page.

You can achieve the same result by using the AWS Command Line Interface (AWS CLI) with the following command (provide your AWS Region, owner account ID, and consumer account ID):

aws ram create-resource-share \
  --name MyCatalogFG \
  --resource-arns arn:aws:sagemaker:REGION:OWNERACCOUNTID:sagemaker-catalog/DefaultFeatureGroupCatalog \
  --principals CONSACCOUNTID \
  --permission-arns arn:aws:ram::aws:permission/AWSRAMPermissionSageMakerCatalogResourceSearch
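The same request can be sketched in Python with the boto3 AWS RAM client. The helper names below are hypothetical; the sketch assumes that boto3 is installed and that credentials for the owner account are configured before create_share() is called. Building the request is kept separate from sending it so that the payload can be inspected first.

```python
from typing import Dict, Iterable, List

# Managed permission for discoverability-only access, as used in the CLI command.
CATALOG_SEARCH_PERMISSION = (
    "arn:aws:ram::aws:permission/AWSRAMPermissionSageMakerCatalogResourceSearch"
)


def catalog_arn(region: str, owner_account_id: str) -> str:
    """ARN of the default SageMaker Feature Store catalog in the owner account."""
    return (
        f"arn:aws:sagemaker:{region}:{owner_account_id}"
        ":sagemaker-catalog/DefaultFeatureGroupCatalog"
    )


def build_catalog_share_request(
    name: str, region: str, owner_account_id: str, consumer_account_ids: Iterable[str]
) -> Dict[str, object]:
    """Keyword arguments for the RAM create_resource_share call."""
    principals: List[str] = list(consumer_account_ids)
    return {
        "name": name,
        "resourceArns": [catalog_arn(region, owner_account_id)],
        "principals": principals,
        "permissionArns": [CATALOG_SEARCH_PERMISSION],
    }


def create_share(request: Dict[str, object], ram_client=None):
    """Send the request; boto3 is imported lazily so the builders work without it."""
    if ram_client is None:
        import boto3  # assumption: boto3 is installed and credentials are configured
        ram_client = boto3.client("ram")
    return ram_client.create_resource_share(**request)
```

For example, build_catalog_share_request("MyCatalogFG", "us-east-1", "111122223333", ["444455556666"]) produces the payload for one consumer account; the Region and account IDs here are placeholders.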

Accept the resource share invite

To accept the resource share invite, complete the following steps:

  1. In the target (consumer) account, open the AWS RAM console.
  2. Under Shared with me in the navigation pane, choose Resource shares.
  3. Choose the new pending resource share.
  4. Choose Accept resource share.

You can achieve the same result using the AWS CLI with the following command:

aws ram get-resource-share-invitations

From the output of the preceding command, retrieve the value of resourceShareInvitationArn and then accept the invitation with the following command:

aws ram accept-resource-share-invitation \
  --resource-share-invitation-arn RESOURCESHAREINVITATIONARN
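The two CLI calls above can be combined in a short Python sketch. It assumes a boto3 AWS RAM client created in the consumer account is passed in as ram_client, and the helper names are hypothetical. For brevity it reads only the first page of invitations and does not follow pagination tokens.

```python
from typing import Dict, List


def pending_invitation_arns(response: Dict) -> List[str]:
    """Extract ARNs of invitations that are still pending from a
    get_resource_share_invitations response."""
    return [
        invitation["resourceShareInvitationArn"]
        for invitation in response.get("resourceShareInvitations", [])
        if invitation.get("status") == "PENDING"
    ]


def accept_pending_invitations(ram_client) -> List[str]:
    """Accept every pending invitation visible to the consumer account.

    ram_client is assumed to be boto3.client("ram") in the consumer account.
    Returns the ARNs of the invitations that were accepted.
    """
    response = ram_client.get_resource_share_invitations()
    arns = pending_invitation_arns(response)
    for arn in arns:
        ram_client.accept_resource_share_invitation(resourceShareInvitationArn=arn)
    return arns
```

In practice you might accept only invitations whose sender account ID matches the expected owner account, rather than accepting everything that is pending.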

The workflow is the same for sharing feature groups with another account via AWS RAM.

After you share some feature groups with the target account, you can inspect the SageMaker Feature Store, where you can observe that the new catalog is available.

Grant access permissions

With access permissions, we can grant permissions at the feature group resource level. Complete the following steps:

  1. In the owner account of the SageMaker Feature Store catalog, open the AWS RAM console.
  2. Under Shared by me in the navigation pane, choose Resource shares.
  3. Choose Create resource share.
  4. Enter a resource share name and choose SageMaker Feature Groups as the resource type.
  5. Select one or more feature groups to share.
  6. Choose Next.
  7. For read/write access, enter AWSRAMPermissionSageMakerFeatureGroupReadWrite for Managed permissions.
  8. Choose Next.
  9. Enter your consumer account ID and choose Add. You may add several consumer accounts.
  10. Choose Next and complete your resource share.

Now the shared catalog should show up on the Resource shares page.

You can achieve the same result by using the AWS CLI with the following command (provide your Region, owner account ID, consumer account ID, and feature group name):

aws ram create-resource-share \
  --name MyCatalogFG \
  --resource-arns arn:aws:sagemaker:REGION:OWNERACCOUNTID:feature-group/FEATUREGROUPNAME \
  --principals CONSACCOUNTID \
  --permission-arns arn:aws:ram::aws:permission/AWSRAMPermissionSageMakerFeatureGroupReadWrite

There are three types of access that you can grant to feature groups:

  • AWSRAMPermissionSageMakerFeatureGroupReadOnly – The read-only privilege allows resource consumer accounts to read records in the shared feature groups and view details and metadata
  • AWSRAMPermissionSageMakerFeatureGroupReadWrite – The read/write privilege allows resource consumer accounts to write records to, and delete records from, the shared feature groups, in addition to read permissions
  • AWSRAMPermissionSagemakerFeatureGroupAdmin – The admin privilege allows the resource consumer accounts to update the description and parameters of features within the shared feature groups and update the configuration of the shared feature groups, in addition to read/write permissions
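The three access levels can be mapped to their managed permission names when building a resource share request programmatically. The following is a sketch with hypothetical helper names. It assumes the read-only and admin permissions use the same arn:aws:ram::aws:permission/ prefix as the read/write permission shown in the CLI command, and that the resulting dictionary is passed to the boto3 RAM create_resource_share call.

```python
from typing import Dict, Iterable

# Managed permission names, copied from the list above.
ACCESS_PERMISSIONS = {
    "read-only": "AWSRAMPermissionSageMakerFeatureGroupReadOnly",
    "read-write": "AWSRAMPermissionSageMakerFeatureGroupReadWrite",
    "admin": "AWSRAMPermissionSagemakerFeatureGroupAdmin",
}


def permission_arn(access_level: str) -> str:
    """Managed permission ARN for one of: read-only, read-write, admin."""
    if access_level not in ACCESS_PERMISSIONS:
        raise ValueError(f"unknown access level: {access_level!r}")
    return f"arn:aws:ram::aws:permission/{ACCESS_PERMISSIONS[access_level]}"


def build_feature_group_share_request(
    name: str,
    region: str,
    owner_account_id: str,
    feature_group_names: Iterable[str],
    consumer_account_ids: Iterable[str],
    access_level: str = "read-only",
) -> Dict[str, object]:
    """Keyword arguments for sharing specific feature groups through AWS RAM."""
    return {
        "name": name,
        "resourceArns": [
            f"arn:aws:sagemaker:{region}:{owner_account_id}:feature-group/{fg}"
            for fg in feature_group_names
        ],
        "principals": list(consumer_account_ids),
        "permissionArns": [permission_arn(access_level)],
    }
```

Defaulting to read-only is a design choice in this sketch: broader access has to be requested explicitly, which keeps accidental write or admin grants unlikely.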

Accept the resource share invite

To accept the resource share invite, complete the following steps:

  1. In the target (consumer) account, open the AWS RAM console.
  2. Under Shared with me in the navigation pane, choose Resource shares.
  3. Choose the new pending resource share.
  4. Choose Accept resource share.

The process of accepting the resource share using the AWS CLI is the same as for the previous discoverability section, with the get-resource-share-invitations and accept-resource-share-invitation commands.

Sample notebooks showcasing this new capability

Two notebooks were added to the SageMaker Feature Store Workshop GitHub repository in the folder 09-module-security/09-03-cross-account-access:

  • m9_03_nb1_cross-account-admin.ipynb – This needs to be launched on your admin or owner AWS account
  • m9_03_nb2_cross-account-consumer.ipynb – This needs to be launched on your consumer AWS account

The first notebook shows how to create the discoverability resource share for existing feature groups at the admin or owner account and share it with another consumer account programmatically using the AWS RAM API create_resource_share(). It also shows how to grant access permissions to existing feature groups at the owner account and share these with another consumer account using AWS RAM. You need to provide your consumer AWS account ID before running the notebook.

The second notebook accepts the AWS RAM invitations to discover and access the cross-account feature groups shared by the owner account. It then shows how to discover the cross-account feature groups that are on the owner account and list them on the consumer account. It also shows how to access, with read/write permissions, the cross-account feature groups on the owner account and perform the following operations from the consumer account: describe(), get_record(), ingest(), and delete_record().
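As a rough illustration of the consumer-side read and write operations, the following sketch uses the lower-level boto3 sagemaker-featurestore-runtime client rather than the SageMaker SDK methods used in the notebooks. It assumes the shared feature group is addressed by its full ARN from the consumer account; the helper names are hypothetical.

```python
from typing import Dict, List


def feature_group_arn(region: str, owner_account_id: str, name: str) -> str:
    """ARN of a feature group that lives in the owner account."""
    return f"arn:aws:sagemaker:{region}:{owner_account_id}:feature-group/{name}"


def to_record(features: Dict[str, object]) -> List[Dict[str, str]]:
    """Convert a plain dict into the Record structure expected by put_record."""
    return [{"FeatureName": k, "ValueAsString": str(v)} for k, v in features.items()]


def read_record(runtime_client, fg_arn: str, record_id: object) -> Dict[str, str]:
    """Fetch one record from the shared group and return it as a plain dict.

    runtime_client is assumed to be boto3.client("sagemaker-featurestore-runtime")
    created in the consumer account.
    """
    response = runtime_client.get_record(
        FeatureGroupName=fg_arn,
        RecordIdentifierValueAsString=str(record_id),
    )
    return {f["FeatureName"]: f["ValueAsString"] for f in response.get("Record", [])}


def write_record(runtime_client, fg_arn: str, features: Dict[str, object]) -> None:
    """Write one record to the shared group; requires read/write or admin access."""
    runtime_client.put_record(FeatureGroupName=fg_arn, Record=to_record(features))
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub before pointing them at a real shared feature group.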

Conclusion

The SageMaker Feature Store cross-account capability offers several compelling benefits. Firstly, it facilitates seamless collaboration by enabling sharing of feature groups across multiple AWS accounts. This enhances data accessibility and utilization, allowing teams in different accounts to use shared features for their ML workflows.

Additionally, the cross-account capability enhances data governance and security. With controlled access and permissions through AWS RAM, organizations can maintain a centralized feature store while ensuring that each account has tailored access levels. This not only streamlines data management, but also strengthens security measures by limiting access to authorized users.

Furthermore, the ability to share feature groups across accounts simplifies the process of building and deploying ML models in a collaborative environment. It fosters a more integrated and efficient workflow, reducing redundancy in data storage and facilitating the creation of robust models with shared, high-quality features. Overall, the Feature Store’s cross-account capability optimizes collaboration, governance, and efficiency in ML development across diverse AWS accounts. Give it a try, and let us know what you think in the comments.


About the Authors

Ioan Catana is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He helps customers develop and scale their ML solutions in the AWS Cloud. Ioan has over 20 years of experience, mostly in software architecture design and cloud engineering.

Philipp Kaindl is a Senior Artificial Intelligence and Machine Learning Solutions Architect at AWS. With a background in data science and mechanical engineering, his focus is on empowering customers to create lasting business impact with the help of AI. Outside of work, Philipp enjoys tinkering with 3D printers, sailing, and hiking.

Dhaval Shah is a Senior Solutions Architect at AWS, specializing in machine learning. With a strong focus on digital native businesses, he empowers customers to use AWS and drive their business growth. As an ML enthusiast, Dhaval is driven by his passion for creating impactful solutions that bring positive change. In his leisure time, he indulges in his love for travel and cherishes quality moments with his family.

Mizanur Rahman is a Senior Software Engineer for Amazon SageMaker Feature Store with over 10 years of hands-on experience specializing in AI and ML. With a strong foundation in both theory and practical applications, he holds a Ph.D. in Fraud Detection using Machine Learning, reflecting his dedication to advancing the field. His expertise spans a broad spectrum, encompassing scalable architectures, distributed computing, big data analytics, microservices, and cloud infrastructures for organizations.

Say What? Chat With RTX Brings Custom Chatbot to NVIDIA RTX AI PCs

Chatbots are used by millions of people around the world every day, powered by NVIDIA GPU-based cloud servers. Now, these groundbreaking tools are coming to Windows PCs powered by NVIDIA RTX for local, fast, custom generative AI.

Chat with RTX, now free to download, is a tech demo that lets users personalize a chatbot with their own content, accelerated by a local NVIDIA GeForce RTX 30 Series GPU or higher with at least 8GB of video random access memory, or VRAM.

Ask Me Anything

Chat with RTX uses retrieval-augmented generation (RAG), NVIDIA TensorRT-LLM software and NVIDIA RTX acceleration to bring generative AI capabilities to local, GeForce-powered Windows PCs. Users can quickly, easily connect local files on a PC as a dataset to an open-source large language model like Mistral or Llama 2, enabling queries for quick, contextually relevant answers.

Rather than searching through notes or saved content, users can simply type queries. For example, one could ask, “What was the restaurant my partner recommended while in Las Vegas?” and Chat with RTX will scan local files the user points it to and provide the answer with context.
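The retrieval step of RAG can be illustrated with a toy sketch. It is not how Chat with RTX is implemented: it assumes simple keyword overlap as a stand-in for a real retriever, and the function names, file names, and file contents are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: Dict[str, str], top_k: int = 1) -> List[Tuple[str, int]]:
    """Rank documents by how many query tokens they share; drop documents with none."""
    query_tokens = tokenize(query)
    scored = [(name, len(query_tokens & tokenize(body))) for name, body in documents.items()]
    scored = [item for item in scored if item[1] > 0]
    scored.sort(key=lambda item: (-item[1], item[0]))  # best score first, then by name
    return scored[:top_k]


def build_prompt(query: str, documents: Dict[str, str], top_k: int = 1) -> str:
    """Prepend the retrieved passages to the question before it is sent to an LLM."""
    hits = retrieve(query, documents, top_k)
    context = "\n".join(f"[{name}] {documents[name]}" for name, _ in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical local files and their contents.
local_files = {
    "vegas.txt": "My partner recommended the restaurant Example Bistro in Las Vegas.",
    "todo.txt": "Buy milk and renew the car insurance.",
}
prompt = build_prompt(
    "What was the restaurant my partner recommended while in Las Vegas?", local_files
)
```

The language model then answers from the retrieved context instead of relying only on what it learned during training, which is what keeps the answer grounded in the user's own files.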

The tool supports various file formats, including .txt, .pdf, .doc/.docx and .xml. Point the application at the folder containing these files, and the tool will load them into its library in just seconds.

Users can also include information from YouTube videos and playlists. Adding a video URL to Chat with RTX allows users to integrate this knowledge into their chatbot for contextual queries. For example, ask for travel recommendations based on content from favorite influencer videos, or get quick tutorials and how-tos based on top educational resources.

Chat with RTX can integrate knowledge from YouTube videos into queries.

Since Chat with RTX runs locally on Windows RTX PCs and workstations, the provided results are fast — and the user’s data stays on the device. Rather than relying on cloud-based LLM services, Chat with RTX lets users process sensitive data on a local PC without the need to share it with a third party or have an internet connection.

In addition to a GeForce RTX 30 Series GPU or higher with a minimum 8GB of VRAM, Chat with RTX requires Windows 10 or 11, and the latest NVIDIA GPU drivers.

Develop LLM-Based Applications With RTX

Chat with RTX shows the potential of accelerating LLMs with RTX GPUs. The app is built from the TensorRT-LLM RAG developer reference project, available on GitHub. Developers can use the reference project to develop and deploy their own RAG-based applications for RTX, accelerated by TensorRT-LLM. Learn more about building LLM-based applications.

Enter a generative AI-powered Windows app or plug-in to the NVIDIA Generative AI on NVIDIA RTX developer contest, running through Friday, Feb. 23, for a chance to win prizes such as a GeForce RTX 4090 GPU, a full, in-person conference pass to NVIDIA GTC and more.

Learn more about Chat with RTX.


Resource-constrained Stereo Singing Voice Cancellation

We study the problem of stereo singing voice cancellation, a subtask of music source separation, whose goal is to estimate an instrumental background from a stereo mix. We explore how to achieve performance similar to large state-of-the-art source separation networks starting from a small, efficient model for real-time speech separation. Such a model is useful when memory and compute are limited and singing voice processing has to run with limited look-ahead. In practice, this is realised by adapting an existing mono model to handle stereo input. Improvements in quality are obtained by tuning…

Apple Machine Learning Research

How Booking.com modernized its ML experimentation framework with Amazon SageMaker


This post is co-written with Kostia Kofman and Jenny Tokar from Booking.com.

As a global leader in the online travel industry, Booking.com is always seeking innovative ways to enhance its services and provide customers with tailored and seamless experiences. The Ranking team at Booking.com plays a pivotal role in ensuring that the search and recommendation algorithms are optimized to deliver the best results for their users.

Sharing in-house resources with other internal teams, the Ranking team machine learning (ML) scientists often encountered long wait times to access resources for model training and experimentation – challenging their ability to rapidly experiment and innovate. Recognizing the need for a modernized ML infrastructure, the Ranking team embarked on a journey to use the power of Amazon SageMaker to build, train, and deploy ML models at scale.

Booking.com collaborated with AWS Professional Services to build a solution to accelerate the time-to-market for improved ML models through the following improvements:

  • Reduced wait times for resources for training and experimentation
  • Integration of essential ML capabilities such as hyperparameter tuning
  • A reduced development cycle for ML models

Reduced wait times would mean that the team could quickly iterate and experiment with models, gaining insights at a much faster pace. Using SageMaker on-demand available instances allowed for a tenfold wait time reduction. Essential ML capabilities such as hyperparameter tuning and model explainability were lacking on premises. The team’s modernization journey introduced these features through Amazon SageMaker Automatic Model Tuning and Amazon SageMaker Clarify. Finally, the team’s aspiration was to receive immediate feedback on each change made in the code, reducing the feedback loop from minutes to an instant, and thereby reducing the development cycle for ML models.

In this post, we delve into the journey undertaken by the Ranking team at Booking.com as they harnessed the capabilities of SageMaker to modernize their ML experimentation framework. By doing so, they not only overcame their existing challenges, but also improved their search experience, ultimately benefiting millions of travelers worldwide.

Approach to modernization

The Ranking team consists of several ML scientists who each need to develop and test their own model offline. When a model is deemed successful according to the offline evaluation, it can be moved to production A/B testing. If it shows online improvement, it can be deployed to all the users.

The goal of this project was to create a user-friendly environment for ML scientists to easily run customizable Amazon SageMaker Model Building Pipelines to test their hypotheses without the need to code long and complicated modules.

One of the several challenges faced was adapting the existing on-premises pipeline solution for use on AWS. The solution involved two key components:

  • Modifying and extending existing code – The first part of our solution involved the modification and extension of our existing code to make it compatible with AWS infrastructure. This was crucial in ensuring a smooth transition from on-premises to cloud-based processing.
  • Client package development – A client package was developed that acts as a wrapper around SageMaker APIs and the previously existing code. This package combines the two, enabling ML scientists to easily configure and deploy ML pipelines without coding.

SageMaker pipeline configuration

Customizability is key to the model building pipeline, and it was achieved through config.ini, an extensive configuration file. This file serves as the control center for all inputs and behaviors of the pipeline.

Available configurations inside config.ini include:

  • Pipeline details – The practitioner can define the pipeline’s name, specify which steps should run, determine where outputs should be stored in Amazon Simple Storage Service (Amazon S3), and select which datasets to use
  • AWS account details – You can decide which Region the pipeline should run in and which role should be used
  • Step-specific configuration – For each step in the pipeline, you can specify details such as the number and type of instances to use, along with relevant parameters

The following code shows an example configuration file:

[BUILD]
pipeline_name = ranking-pipeline
steps = DATA_TRANSFORM, TRAIN, PREDICT, EVALUATE, EXPLAIN, REGISTER, UPLOAD
train_data_s3_path = s3://...
...
[AWS_ACCOUNT]
region = eu-central-1
...
[DATA_TRANSFORM_PARAMS]
input_data_s3_path = s3://...
compression_type = GZIP
....
[TRAIN_PARAMS]
instance_count = 3
instance_type = ml.g5.4xlarge
epochs = 1
enable_sagemaker_debugger = True
...
[PREDICT_PARAMS]
instance_count = 3
instance_type = ml.g5.4xlarge
...
[EVALUATE_PARAMS]
instance_type = ml.m5.8xlarge
batch_size = 2048
...
[EXPLAIN_PARAMS]
check_job_instance_type = ml.c5.xlarge
generate_baseline_with_clarify = False
....

config.ini is a version-controlled file managed by Git, representing the minimal configuration required for a successful training pipeline run. During development, local configuration files that are not version-controlled can be utilized. These local configuration files only need to contain settings relevant to a specific run, introducing flexibility without complexity. The pipeline creation client is designed to handle multiple configuration files, with the latest one taking precedence over previous settings.
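The layering described above can be sketched with Python's standard `configparser`. This is a minimal illustration rather than the team's client package: the `load_layered_config` helper and the contents of the local override are hypothetical, and only the section and key names are taken from the example file.

```python
import configparser

# Stand-in for the version-controlled config.ini (abbreviated).
BASE_CONFIG = """
[BUILD]
pipeline_name = ranking-pipeline

[TRAIN_PARAMS]
instance_count = 3
instance_type = ml.g5.4xlarge
epochs = 1
"""

# Hypothetical local file: holds only the settings that differ for one run.
LOCAL_OVERRIDE = """
[TRAIN_PARAMS]
epochs = 5
"""


def load_layered_config(*sources: str) -> configparser.ConfigParser:
    """Read config sources in order; later sources override earlier ones."""
    parser = configparser.ConfigParser()
    for source in sources:
        # Each read merges into the parser, replacing keys that already exist.
        parser.read_string(source)
    return parser


config = load_layered_config(BASE_CONFIG, LOCAL_OVERRIDE)
print(config["TRAIN_PARAMS"]["epochs"])          # value from the local override
print(config["TRAIN_PARAMS"]["instance_count"])  # value inherited from the base
```

With files on disk, passing a list of paths to `ConfigParser.read` gives the same precedence, because files are read in order and later values replace earlier ones.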

SageMaker pipeline steps

The pipeline is divided into the following steps:

  • Train and test data preparation – Terabytes of raw data are copied to an S3 bucket and processed with AWS Glue jobs running Spark, producing data that is structured and formatted for compatibility.
  • Train – The training step uses the TensorFlow estimator for SageMaker training jobs. Training occurs in a distributed manner using Horovod, and the resulting model artifact is stored in Amazon S3. For hyperparameter tuning, a hyperparameter optimization (HPO) job can be initiated, selecting the best model based on the objective metric.
  • Predict – In this step, a SageMaker Processing job uses the stored model artifact to make predictions. This process runs in parallel on available machines, and the prediction results are stored in Amazon S3.
  • Evaluate – A PySpark processing job evaluates the model using a custom Spark script. The evaluation report is then stored in Amazon S3.
  • Condition – After evaluation, a decision is made regarding the model’s quality. This decision is based on a condition metric defined in the configuration file. If the evaluation is positive, the model is registered as approved; otherwise, it’s registered as rejected. In both cases, the evaluation and explainability report, if generated, are recorded in the model registry.
  • Package model for inference – Using a processing job, if the evaluation results are positive, the model is packaged, stored in Amazon S3, and made ready for upload to the internal ML portal.
  • Explain – SageMaker Clarify generates an explainability report.
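The Condition step's logic can be sketched in plain Python. This is an illustrative assumption, not the team's pipeline code: the function name, the `higher_is_better` flag, and the threshold value are hypothetical, and in the real pipeline the decision is made by a pipeline step that reads the condition metric from the configuration file.

```python
from dataclasses import dataclass


@dataclass
class RegistrationDecision:
    approval_status: str         # status recorded in the model registry
    package_for_inference: bool  # whether the packaging step should run


def decide_registration(metric_value: float, threshold: float,
                        higher_is_better: bool = True) -> RegistrationDecision:
    """Approve the model if the evaluation metric meets the threshold."""
    if higher_is_better:
        passed = metric_value >= threshold
    else:
        passed = metric_value <= threshold
    # The model is registered in both cases; only approved models are packaged.
    status = "Approved" if passed else "Rejected"
    return RegistrationDecision(approval_status=status,
                                package_for_inference=passed)


good = decide_registration(metric_value=0.82, threshold=0.80)
bad = decide_registration(metric_value=0.75, threshold=0.80)
print(good, bad)
```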

Two distinct repositories are used. The first repository contains the definition and build code for the ML pipeline, and the second repository contains the code that runs inside each step, such as processing, training, prediction, and evaluation. This dual-repository approach allows for greater modularity, and enables science and engineering teams to iterate independently on ML code and ML pipeline components.

The following diagram illustrates the solution workflow.

Automatic model tuning

Training ML models requires an iterative approach of multiple training experiments to build a robust and performant final model for business use. The ML scientists have to select the appropriate model type, build the correct input datasets, and adjust the set of hyperparameters that control the model learning process during training.

The selection of appropriate values for hyperparameters for the model training process can significantly influence the final performance of the model. However, there is no unique or defined way to determine which values are appropriate for a specific use case. Most of the time, ML scientists will need to run multiple training jobs with slightly different sets of hyperparameters, observe the model training metrics, and then try to select more promising values for the next iteration. This process of tuning model performance is also known as hyperparameter optimization (HPO), and can at times require hundreds of experiments.

The Ranking team used to perform HPO manually in their on-premises environment because they could only launch a very limited number of training jobs in parallel. Therefore, they had to run HPO sequentially, test and select different combinations of hyperparameter values manually, and regularly monitor progress. This prolonged the model development and tuning process and limited the overall number of HPO experiments that could run in a feasible amount of time.

With the move to AWS, the Ranking team was able to use the automatic model tuning (AMT) feature of SageMaker. AMT enables Ranking ML scientists to automatically launch hundreds of training jobs within hyperparameter ranges of interest to find the best performing version of the final model according to the chosen metric. The Ranking team is now able to choose between four different automatic tuning strategies for their hyperparameter selection:

  • Grid search – AMT will expect all hyperparameters to be categorical values, and it will launch training jobs for each distinct categorical combination, exploring the entire hyperparameter space.
  • Random search – AMT will randomly select combinations of hyperparameter values within the provided ranges. Because there is no dependency between different training jobs and parameter value selection, multiple parallel training jobs can be launched with this method, speeding up the optimal parameter selection process.
  • Bayesian optimization – AMT uses a Bayesian optimization implementation to guess the best set of hyperparameter values, treating the search as a regression problem. It considers previously tested hyperparameter combinations and their impact on the training jobs when selecting new parameters, aiming for smarter parameter selection with fewer experiments. However, it launches training jobs only sequentially so that it can always learn from previous runs.
  • Hyperband – AMT will use intermediate and final results of the training jobs it’s running to dynamically reallocate resources towards training jobs with hyperparameter configurations that show more promising results while automatically stopping those that underperform.
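To make the difference between the first two strategies concrete, the following sketch generates candidate hyperparameter sets for grid search and random search in plain Python. It is not how AMT is implemented; the function names, hyperparameter names, and ranges are assumptions chosen for illustration.

```python
import itertools
import random


def grid_search_candidates(space: dict) -> list:
    """Enumerate every combination of the categorical values in `space`."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def random_search_candidates(ranges: dict, n_trials: int, seed: int = 0) -> list:
    """Sample `n_trials` independent points from continuous (low, high) ranges."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        for _ in range(n_trials)
    ]


# Hypothetical search spaces.
grid = grid_search_candidates({"batch_size": [256, 512, 1024],
                               "optimizer": ["adam", "sgd"]})
samples = random_search_candidates({"learning_rate": (1e-4, 1e-1),
                                    "dropout": (0.0, 0.5)}, n_trials=8)

print(len(grid))     # one candidate per distinct combination
print(len(samples))  # one candidate per requested trial
```

Because every candidate in both lists is chosen without looking at earlier results, all of them could be trained in parallel. Bayesian optimization differs in that each new candidate depends on the outcomes of previous jobs.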

AMT on SageMaker enabled the Ranking team to reduce the time spent on the hyperparameter tuning process for their model development by enabling them for the first time to run multiple parallel experiments, use automatic tuning strategies, and perform double-digit training job runs within days, something that wasn’t feasible on premises.

Model explainability with SageMaker Clarify

Model explainability enables ML practitioners to understand the nature and behavior of their ML models by providing valuable insights for feature engineering and selection decisions, which in turn improves the quality of the model predictions. The Ranking team wanted to evaluate their explainability insights in two ways: understand how feature inputs affect model outputs across their entire dataset (global interpretability), and also be able to discover input feature influence for a specific model prediction on a data point of interest (local interpretability). With this data, Ranking ML scientists can make informed decisions on how to further improve their model performance and account for the challenging prediction results that the model would occasionally provide.

SageMaker Clarify enables you to generate model explainability reports using Shapley Additive exPlanations (SHAP) when training your models on SageMaker, supporting both global and local model interpretability. In addition to model explainability reports, SageMaker Clarify supports running analyses for pre-training bias metrics, post-training bias metrics, and partial dependence plots. The job will be run as a SageMaker Processing job within the AWS account and it integrates directly with the SageMaker pipelines.

The global interpretability report will be automatically generated in the job output and displayed in the Amazon SageMaker Studio environment as part of the training experiment run. If this model is then registered in SageMaker model registry, the report will be additionally linked to the model artifact. Using both of these options, the Ranking team was able to easily track back different model versions and their behavioral changes.

To explore input feature impact on a single prediction (local interpretability values), the Ranking team enabled the parameter save_local_shap_values in the SageMaker Clarify jobs and was able to load them from the S3 bucket for further analyses in the Jupyter notebooks in SageMaker Studio.
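The relationship between local and global interpretability can be illustrated with exact Shapley values on a toy model. This sketch is an assumption-laden example, not SageMaker Clarify's implementation: the model, the all-zero baseline, and the data rows are invented, and exhaustive enumeration over feature orderings is only feasible for a handful of features.

```python
import itertools
import math


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings in which features switch from baseline to x."""
    n = len(x)
    phi = [0.0] * n
    for order in itertools.permutations(range(n)):
        current = list(baseline)
        previous_output = predict(current)
        for i in order:
            current[i] = x[i]
            new_output = predict(current)
            phi[i] += new_output - previous_output
            previous_output = new_output
    n_orderings = math.factorial(n)
    return [value / n_orderings for value in phi]


def toy_model(features):
    # Hypothetical linear scorer standing in for a trained model.
    return 2.0 * features[0] - 1.0 * features[1] + 0.5 * features[2]


rows = [[1.0, 2.0, 3.0], [0.0, 1.0, 4.0]]
baseline = [0.0, 0.0, 0.0]

# Local interpretability: one attribution vector per prediction.
local = [shapley_values(toy_model, row, baseline) for row in rows]

# Global interpretability: mean absolute attribution per feature.
global_importance = [
    sum(abs(attribution[j]) for attribution in local) / len(local)
    for j in range(len(baseline))
]
print(local)
print(global_importance)
```

For each row, the attributions sum to the difference between the model's output on that row and its output on the baseline, which is the property that makes the per-feature values additive.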

The preceding images show an example of what a model explainability report would look like for an arbitrary ML model.

Training optimization

The rise of deep learning (DL) has led to ML becoming increasingly reliant on computational power and vast amounts of data. ML practitioners commonly face the hurdle of efficiently using resources when training these complex models. When you run training on large compute clusters, various challenges arise in optimizing resource utilization, including issues like I/O bottlenecks, kernel launch delays, memory constraints, and underutilized resources. If the configuration of the training job is not fine-tuned for efficiency, these obstacles can result in suboptimal hardware usage, prolonged training durations, or even incomplete training runs. These factors increase project costs and delay timelines.

Profiling of CPU and GPU usage helps understand these inefficiencies, determine the hardware resource consumption (time and memory) of the various TensorFlow operations in your model, resolve performance bottlenecks, and, ultimately, make the model run faster.

The Ranking team used the framework profiling feature of Amazon SageMaker Debugger (now deprecated in favor of Amazon SageMaker Profiler) to optimize these training jobs. This allows you to track all activities on CPUs and GPUs, such as CPU and GPU utilizations, kernel runs on GPUs, kernel launches on CPUs, sync operations, memory operations across GPUs, latencies between kernel launches and corresponding runs, and data transfer between CPUs and GPUs.

The Ranking team also used the TensorFlow Profiler feature of TensorBoard, which further helped profile the TensorFlow model training. SageMaker is now further integrated with TensorBoard and brings the visualization tools of TensorBoard to SageMaker, integrated with SageMaker training and domains. TensorBoard allows you to perform model debugging tasks using the TensorBoard visualization plugins.

With the help of these two tools, the Ranking team identified bottlenecks in their TensorFlow model and reduced the average training step time from 350 milliseconds to 140 milliseconds on CPU and from 170 milliseconds to 70 milliseconds on GPU, reductions of 60% and 59%, respectively.
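The percentages quoted for the step times are relative reductions in time per step, which can be checked with a few lines of Python. The helper name is hypothetical; the step times are the ones reported above.

```python
def step_time_improvement(before_ms: float, after_ms: float) -> tuple:
    """Return (percent reduction in step time, speedup factor)."""
    reduction_pct = 100.0 * (before_ms - after_ms) / before_ms
    speedup_factor = before_ms / after_ms
    return reduction_pct, speedup_factor


cpu_reduction, cpu_speedup = step_time_improvement(350, 140)
gpu_reduction, gpu_speedup = step_time_improvement(170, 70)

print(f"CPU: {cpu_reduction:.0f}% less time per step, {cpu_speedup:.2f}x faster")
print(f"GPU: {gpu_reduction:.0f}% less time per step, {gpu_speedup:.2f}x faster")
```

Expressed as throughput rather than time saved, the same measurements correspond to roughly 2.5x and 2.4x more training steps per second.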

Business outcomes

The migration efforts centered around enhancing availability, scalability, and elasticity, which collectively brought the ML environment towards a new level of operational excellence, exemplified by the increased model training frequency and decreased failures, optimized training times, and advanced ML capabilities.

Model training frequency and failures

The number of monthly model training jobs increased fivefold, leading to significantly more frequent model optimizations. Furthermore, the new ML environment led to a reduction in the failure rate of pipeline runs, dropping from approximately 50% to 20%. The failed job processing time decreased drastically, from over an hour on average to a negligible 5 seconds. This has strongly increased operational efficiency and decreased resource wastage.

Optimized training time

The migration brought with it efficiency increases through SageMaker-based GPU training. This shift decreased model training time to a fifth of its previous duration. Previously, the training processes for deep learning models consumed around 60 hours on CPU; this was streamlined to approximately 12 hours on GPU. This improvement not only saves time but also expedites the development cycle, enabling faster iterations and model improvements.

Advanced ML capabilities

Central to the migration’s success is the use of the SageMaker feature set, encompassing hyperparameter tuning and model explainability. Furthermore, the migration allowed for seamless experiment tracking using Amazon SageMaker Experiments, enabling more insightful and productive experimentation.

Most importantly, the new ML experimentation environment supported the successful development of a new model that is now in production. This model is deep learning rather than tree-based and has introduced noticeable improvements in online model performance.

Conclusion

This post provided an overview of the AWS Professional Services and Booking.com collaboration that resulted in the implementation of a scalable ML framework and successfully reduced the time-to-market of ML models of their Ranking team.

The Ranking team at Booking.com learned that migrating to the cloud and SageMaker has proved beneficial, and that adopting machine learning operations (MLOps) practices allows their ML engineers and scientists to focus on their craft and increase development velocity. The team is sharing the learnings and work done with the entire ML community at Booking.com, through talks and dedicated sessions with ML practitioners where they share the code and capabilities. We hope this post can serve as another way to share the knowledge.

AWS Professional Services is ready to help your team develop scalable and production-ready ML in AWS. For more information, see AWS Professional Services or reach out through your account manager to get in touch.


About the Authors

Laurens van der Maas is a Machine Learning Engineer at AWS Professional Services. He works closely with customers building their machine learning solutions on AWS, specializes in distributed training, experimentation and responsible AI, and is passionate about how machine learning is changing the world as we know it.

Daniel Zagyva is a Data Scientist at AWS Professional Services. He specializes in developing scalable, production-grade machine learning solutions for AWS customers. His experience extends across different areas, including natural language processing, generative AI and machine learning operations.

Kostia Kofman is a Senior Machine Learning Manager at Booking.com, leading the Search Ranking ML team, overseeing Booking.com’s most extensive ML system. With expertise in Personalization and Ranking, he thrives on leveraging cutting-edge technology to enhance customer experiences.

Jenny Tokar is a Senior Machine Learning Engineer at Booking.com’s Search Ranking team. She specializes in developing end-to-end ML pipelines characterized by efficiency, reliability, scalability, and innovation. Jenny’s expertise empowers her team to create cutting-edge ranking models that serve millions of users every day.

Aleksandra Dokic is a Senior Data Scientist at AWS Professional Services. She enjoys supporting customers to build innovative AI/ML solutions on AWS and she is excited about business transformations through the power of data.

Luba Protsiva is an Engagement Manager at AWS Professional Services. She specializes in delivering Data and GenAI/ML solutions that enable AWS customers to maximize their business value and accelerate speed of innovation.


NVIDIA CEO: Every Country Needs Sovereign AI


Every country needs to own the production of its own intelligence, NVIDIA founder and CEO Jensen Huang told attendees Monday at the World Governments Summit in Dubai.

Huang, who spoke as part of a fireside chat with the UAE’s Minister of AI, His Excellency Omar Al Olama, described sovereign AI — which emphasizes a country’s ownership over its data and the intelligence it produces — as an enormous opportunity for the world’s leaders.

“It codifies your culture, your society’s intelligence, your common sense, your history – you own your own data,” Huang told Al Olama during their conversation, a highlight of an event attended by more than 4,000 delegates from 150 countries.

“We completely subscribe to that vision,” Al Olama said. “That’s why the UAE is moving aggressively on creating large language models and mobilizing compute.”

Huang’s appearance in the UAE comes as the Gulf State is moving rapidly to transform itself from an energy powerhouse into a global information technology hub.

Dubai is the latest stop for Huang in a global tour that has included meetings with leaders in Canada, France, India, Japan, Malaysia, Singapore and Vietnam over the past six months.

The Middle East is poised to reap significant benefits from AI, with PwC projecting a $320 billion boost to the region’s economy by 2030.

At Monday’s summit, Huang urged leaders not to be “mystified” by AI. AI’s unprecedented ability to take directions from ordinary humans makes it critical for countries to embrace AI, infusing it with local languages and expertise.

In response to Al Olama’s question about how he might approach AI if he were the leader of a developing nation, Huang emphasized the importance of building infrastructure.

“It’s not that costly, it is also not that hard,” Huang said. “The first thing that I would do, of course, is I would codify the language, the data of your culture into your own large language model.”

And as AI and accelerated computing have developed, NVIDIA GPUs have become a platform for one innovation after another.

“NVIDIA GPU is the only platform that’s available to everybody on any platform,” Huang said. “This ubiquity has not only democratized AI but facilitated a wave of innovation that spans from cloud computing to autonomous systems and beyond.”

All of this promises to unleash new kinds of innovations that go beyond what’s traditionally been thought of as information technology.

Huang even countered advice offered by many visionaries over the years who urged young people to study computer science in order to compete in the information age. No longer.

“In fact, it’s almost exactly the opposite,” Huang said. “It is our job to create computing technologies that nobody has to program and that the programming language is human: everybody in the world is now a programmer — that is the miracle.”

In a move that further underscores the regional momentum behind AI, Moro Hub, a subsidiary of Digital DEWA, the digital arm of the Dubai Electricity and Water Authority, focused on providing cloud services, cybersecurity and smart city solutions, announced Monday it has agreed to build a green data center with NVIDIA.

In addition to the fireside chat, the summit featured panels on smart mobility, sustainable development and more, showcasing the latest in AI advancements. Later in the evening, Huang and Al Olama took the stage at the ‘Get Inspired’ ecosystem event, organized by the UAE’s AI Office, featuring 280 attendees including developers, start-ups, and others.
