Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users
Introducing GraphRAG 1.0
Microsoft debuted the pre-release version of GraphRAG in July 2024 to advance AI use in complex domains. Since that time, we’ve seen incredible adoption and community engagement (over 20k stars and 2k forks on GitHub as of this writing), with numerous fixes and improvements by the core team and community contributors. We’re deeply grateful for the contributions and feedback we’ve received and are excited to share a number of major ergonomic and structural improvements that culminate in the official release of GraphRAG 1.0.
Ergonomic refactors
Easier setup for new projects
When we first launched GraphRAG, most configuration was done using environment variables, which could be daunting given the many options available. We’ve reduced setup friction by adding an init command that generates a simplified starter settings.yml file with all core required config already set. We recommend developers start here to ensure they get the clearest initial config. With this update, a quick setup no longer requires expertise with GraphRAG; only an OpenAI API key in the environment is needed.
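As a rough sketch of what that setup can look like from the command line (the project directory name is illustrative, and GRAPHRAG_API_KEY assumes the OpenAI key setup described above):

```bash
# Generate a starter settings file and .env in a new project directory
graphrag init --root ./myproject

# Provide the OpenAI API key referenced by the generated starter config
# (typically placed in ./myproject/.env rather than exported ad hoc)
export GRAPHRAG_API_KEY="<your-openai-api-key>"
```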
New and expanded command line interface
We expanded the functionality and ease of use of the command line interface (CLI) and adopted Typer to provide better inline documentation and a richer CLI experience. The original CLI was intended as a starter demo for users to try GraphRAG on a sample dataset. We’ve since learned from the community that most people actually want to use this as their primary interaction mode for GraphRAG, so as part of this milestone release, we’ve incorporated enhancements that result in a more streamlined experience. From this work, CLI startup times dropped from an average of 148 seconds to 2 seconds.
Consolidated API layer
In August 2024, we introduced a standalone API layer to simplify developer usage. The original CLI contained all the code required to instantiate and execute basic indexing and query commands, which users often needed to replicate. The API layer is still considered provisional as we gather feedback, but it is intended to be the primary entry point for developers who wish to integrate GraphRAG functionality into their own applications without deep pipeline or query class customization. In fact, the CLI and Accelerator are built entirely on top of the API layer, acting as a documented example of how to interact with the API. We have also added examples of how to use this API to our notebook collection, which we will continue to update as we iterate in future releases.
Simplified data model
GraphRAG creates several output artifacts to store the indexed knowledge model. The initial model contained a large number of files, fields, and cross-references based on experimental ideas from the early research, which could be overwhelming for both new and routine users. We performed a comprehensive review of the data model and incorporated fixes to add clarity and consistency, remove redundant or unused fields, reduce storage space, and simplify the data model. Previously, the output lacked standardization, and relevant outputs could easily be confused with non-critical intermediary output files. With GraphRAG 1.0, the output includes only relevant artifacts that are easily readable and traceable.
Streamlined vector stores
Embeddings and their vector stores are some of the primary drivers of GraphRAG’s storage needs. Our original data model stored all embeddings within the parquet output files after data ingestion and indexing. This made the files portable, which was convenient for early research, but for many users it became unnecessary as they configured their own vector stores and the scale of data ingestion grew. We have updated the GraphRAG pipeline to create a default vector store during indexing, so no post-processing is needed, and the query library shares this configuration for seamless use. The benefit of this change is that those vectors (which can be quite large) no longer need to be loaded when the output files are read from disk, saving read time and memory during every query. Coupled with the simplified data model, this resulted in output parquet disk savings of 80%, and total disk space (including embeddings in the vector store) reduction of 43%. GraphRAG supports LanceDB and Azure AI Search out-of-the-box for vector stores. For simple startup, LanceDB is used as the default, and is written to a local database alongside the knowledge model artifacts.
Flatter, clearer code structure
A key initiative on the road to version 1.0 has been to simplify the codebase so it is easier to maintain and more approachable for third-party users. We’ve flattened much of the deeply nested code organization to make it easier to browse, and co-located code that our own usage patterns indicated did not need to live in separate functional areas.
We have also found that very few users need the declarative configuration that the underlying DataShaper engine provides, so we collapsed these 88 verbose workflow definitions into a smaller set of 11 workflows that operate in a functional rather than composed manner. This makes the pipeline easier to understand, is a step toward an architecture better suited to our future research plans, and improves performance across the board. By collapsing workflows, we now have fewer unused output artifacts, reduced data duplication, and fewer disk I/O operations. This streamlining has also reduced the in-memory footprint of the pipeline, enabling users to index and analyze larger datasets with GraphRAG.
Incremental ingest
Until now, an evolving dataset needed complete re-indexing every time new information was acquired in order to re-generate the knowledge model. In GraphRAG 1.0 we are including a new update command in the CLI that computes the deltas between an existing index and newly added content and intelligently merges the updates to minimize re-indexing. GraphRAG uses an LLM caching mechanism to save as much cost as possible when re-indexing, so re-runs over a dataset are often significantly faster and cheaper than an initial run. Adding brand new content can alter the community structure such that much of an index needs to be re-computed – the update command resolves this while also improving answer quality.
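A minimal sketch of the incremental flow, assuming the same illustrative project layout as above:

```bash
# After adding new documents to the project's input folder, merge them
# into the existing index instead of rebuilding it from scratch
graphrag update --root ./myproject
```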
Availability
GraphRAG version 1.0 is now available on GitHub and published to PyPI. Check out the Getting Started guide to use GraphRAG 1.0 today.
Migrating
We recommend users migrate to GraphRAG 1.0, which offers a streamlined experience including multiple improvements for both users and developers. However, because of the breadth of its updates, version 1.0 is not backwards compatible. If you’ve used GraphRAG prior to version 1.0 and have existing indexes, there are a handful of breaking changes that need to be addressed, but this should be a straightforward process. To support the community in this migration, we’ve created a migration guide in the repository with more information.
Future directions
We recently posted about a brand-new approach to GraphRAG called LazyGraphRAG, which performs minimal up-front indexing to avoid LLM usage until user queries are executed. This avoids LLM-based summarization of large volumes of content that may not be interesting to users – and therefore never explored even after expensive processing. This approach shows strong performance at a fraction of the cost of GraphRAG, and will be added to the core GraphRAG codebase in the near future as a new option for users.
Additionally, Microsoft has been active in exploring how GraphRAG can advance the rate of scientific progress, and is in the process of building relevant GraphRAG capabilities to align with our broader work in AI-enabled scientific discovery.
We continue to refine the codebase and investigate architectural changes that will enable users to bring their own language model APIs, storage providers, and vector stores. We’re excited about this major milestone and the foundation that this refactoring lays for our continued research in the GraphRAG space.
Economics Nobelist on causal inference
In a keynote address at the latest Amazon Machine Learning Conference, Amazon Visiting Academic, Stanford professor, and recent Nobel laureate Guido Imbens offered insights on the estimation of causal effects in “panel data” settings.
Multi-tenant RAG with Amazon Bedrock Knowledge Bases
Organizations are continuously seeking ways to use their proprietary knowledge and domain expertise to gain a competitive edge. With the advent of foundation models (FMs) and their remarkable natural language processing capabilities, a new opportunity has emerged to unlock the value of their data assets.
As organizations strive to deliver personalized experiences to customers using generative AI, it becomes paramount to specialize the behavior of FMs using their own—and their customers’—data. Retrieval Augmented Generation (RAG) has emerged as a simple yet effective approach to achieve a desired level of specialization.
Amazon Bedrock Knowledge Bases is a fully managed capability that simplifies the management of the entire RAG workflow, empowering organizations to give FMs and agents contextual information from a company’s private data sources to deliver more relevant and accurate responses tailored to their specific needs.
For organizations developing multi-tenant products, such as independent software vendors (ISVs) creating software as a service (SaaS) offerings, the ability to personalize experiences for each of their customers (tenants in their SaaS application) is particularly significant. This personalization can be achieved by implementing a RAG approach that selectively uses tenant-specific data.
In this post, we discuss and provide examples of how to achieve personalization using Amazon Bedrock Knowledge Bases. We focus particularly on addressing the multi-tenancy challenges that ISVs face, including data isolation, security, tenant management, and cost management. We focus on scenarios where the RAG architecture is integrated into the ISV application and not directly exposed to tenants. Although the specific implementations presented in this post use Amazon OpenSearch Service as a vector database to store tenants’ data, the challenges and architecture solutions proposed can be extended and tailored to other vector store implementations.
Multi-tenancy design considerations
When architecting a multi-tenanted RAG system, organizations need to take several considerations into account:
- Tenant isolation – One crucial consideration in designing multi-tenanted systems is the level of isolation between the data and resources related to each tenant. These resources include data sources, ingestion pipelines, vector databases, and the RAG client application. The level of isolation is typically governed by the security, performance, and scalability requirements of your solution, together with your regulatory requirements. For example, you may need to encrypt the data related to each of your tenants using a different encryption key. You may also need to make sure that high activity generated by one of the tenants doesn’t affect other tenants.
- Tenant variability – A similar yet distinct consideration is the level of variability of the features provided to each tenant. In the context of RAG systems, tenants might have varying requirements for data ingestion frequency, document chunking strategy, or vector search configuration.
- Tenant management simplicity – Multi-tenant solutions need a mechanism for onboarding and offboarding tenants. This dimension determines the degree of complexity for this process, which might involve provisioning or tearing down tenant-specific infrastructure, such as data sources, ingestion pipelines, vector databases, and RAG client applications. This process could also involve adding or deleting tenant-specific data in its data sources.
- Cost-efficiency – The operating costs of a multi-tenant solution depend on the way it provides the isolation mechanism for tenants, so designing a cost-efficient architecture for the solution is crucial.
These four considerations need to be carefully balanced and weighted to suit the needs of the specific solution. In this post, we present a model to simplify the decision-making process. Using the core isolation concepts of silo, pool, and bridge defined in the SaaS Tenant Isolation Strategies whitepaper, we propose three patterns for implementing a multi-tenant RAG solution using Amazon Bedrock Knowledge Bases, Amazon Simple Storage Service (Amazon S3), and OpenSearch Service.
A typical RAG solution using Amazon Bedrock Knowledge Bases is composed of several components, as shown in the following figure:
- A data source, such as an S3 bucket
- A knowledge base including a data source
- A vector database such as an Amazon OpenSearch Serverless collection and index or other supported vector databases
- A RAG client application
The main challenge in adapting this architecture for multi-tenancy is determining how to provide isolation between tenants for each of the components. We propose three prescriptive patterns that cater to different use cases and offer varying levels of isolation, variability, management simplicity, and cost-efficiency. The following figure illustrates the trade-offs between these three architectural patterns in terms of achieving tenant isolation, variability, cost-efficiency, and ease of tenant management.
Multi-tenancy patterns
In this section, we describe the implementation of these three different multi-tenancy patterns in a RAG architecture based on Amazon Bedrock Knowledge Bases, discussing their use cases as well as their pros and cons.
Silo
The silo pattern, illustrated in the following figure, offers the highest level of tenant isolation, because the entire stack is deployed and managed independently for each single tenant.
In the context of the RAG architecture implemented by Amazon Bedrock Knowledge Bases, this pattern prescribes the following:
- A separate data source per tenant – In this post, we consider the scenario in which tenant documents to be vectorized are stored in Amazon S3, therefore a separate S3 bucket is provisioned per tenant. This allows for per-tenant AWS Key Management Service (AWS KMS) encryption keys, as well as per-tenant S3 lifecycle policies to manage object expiration, and object versioning policies to maintain multiple versions of objects. Having separate buckets per tenant provides isolation and allows for customized configurations based on tenant requirements.
- A separate knowledge base per tenant – This allows for a separate chunking strategy per tenant, and it’s particularly useful if you envision the document base of your tenants to be different in nature. For example, one of your tenants might have a document base composed of flat text documents, which can be treated with fixed-size chunking, whereas another tenant might have a document base with explicit sections, for which semantic chunking would be better suited. Having a different knowledge base per tenant also lets you decide on different embedding models, giving you the possibility to choose different vector dimensions, balancing accuracy, cost, and latency. You can choose a different KMS key per tenant for the transient data stores, which Amazon Bedrock uses for end-to-end per-tenant encryption. You can also choose per-tenant data deletion policies to control whether your vectors are deleted from the vector database when a knowledge base is deleted. Separate knowledge bases also mean that you can have different ingestion schedules per tenant, allowing you to agree to different data freshness standards with your customers.
- A separate OpenSearch Serverless collection per tenant – Having a separate OpenSearch Serverless collection per tenant allows you to have a separate KMS encryption key per tenant, maintaining per-tenant end-to-end encryption. For each tenant-specific collection, you can create a separate vector index, therefore choosing for each tenant the distance metric between Euclidean and dot product, so that you can choose how much importance to give to the document length. You can also choose the specific settings for the HNSW algorithm per tenant to control memory consumption, cost, and indexing time. Each vector index, in conjunction with the setup of metadata mappings in your knowledge base, can have a different metadata set per tenant, which can be used to perform filtered searches. Metadata filtering can be used in the silo pattern to restrict the search to a subset of documents with a specific characteristic. For example, one of your tenants might be uploading dated documents and wants to filter documents pertaining to a specific year, whereas another tenant might be uploading documents coming from different company divisions and wants to filter over the documentation of a specific company division.
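As an illustration of the per-tenant index tuning this enables, the following sketch creates one tenant's vector index with tenant-specific HNSW parameters using the opensearch-py client; the endpoint, region, index name, field names, and parameter values are all placeholders, not prescribed values:

```python
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Sign requests to the tenant's OpenSearch Serverless collection endpoint
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, "us-east-1", "aoss")

client = OpenSearch(
    hosts=[{"host": "<TENANT_COLLECTION_ENDPOINT>", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Per-tenant vector index with tenant-specific distance metric and HNSW settings
client.indices.create(
    index="tenant-1-index",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "vector": {
                    "type": "knn_vector",
                    "dimension": 1024,  # must match the tenant's embedding model
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "innerproduct",  # or "l2" for Euclidean
                        "parameters": {"ef_construction": 512, "m": 16},
                    },
                },
                "text": {"type": "text"},
                "metadata": {"type": "text"},
            }
        },
    },
)
```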
Because the silo pattern offers tenant architectural independence, onboarding and offboarding a tenant means creating and destroying the RAG stack for that tenant, composed of the S3 bucket, knowledge base, and OpenSearch Serverless collection. You would typically do this using infrastructure as code (IaC). Depending on your application architecture, you may also need to update the log sinks and monitoring systems for each tenant.
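As a rough sketch of what offboarding could look like with the AWS SDK for Python (Boto3), the following tears down the per-tenant resources; the identifiers are placeholders captured at onboarding time, and in practice you would drive both provisioning and teardown from your IaC tool rather than an ad hoc script:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")
aoss = boto3.client("opensearchserverless")
s3 = boto3.resource("s3")

def offboard_tenant(tenant: dict) -> None:
    """Tear down a tenant's silo stack; `tenant` holds identifiers captured at onboarding."""
    # Delete the tenant's knowledge base
    bedrock_agent.delete_knowledge_base(knowledgeBaseId=tenant["knowledge_base_id"])

    # Delete the tenant's OpenSearch Serverless collection (and its vector index)
    aoss.delete_collection(id=tenant["collection_id"])

    # Empty and then delete the tenant's S3 bucket
    bucket = s3.Bucket(tenant["bucket_name"])
    bucket.objects.all().delete()
    bucket.delete()
```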
Although the silo pattern offers the highest level of tenant isolation, it is also the most expensive to implement, mainly due to creating a separate OpenSearch Serverless collection per tenant for the following reasons:
- Minimum capacity charges – Each OpenSearch Serverless collection encrypted with a separate KMS key has a minimum of 2 OpenSearch Compute Units (OCUs) charged hourly. These OCUs are charged independently from usage, meaning that you will incur charges for dormant tenants if you choose to have a separate KMS encryption key per tenant.
- Scalability overhead – Each collection separately scales OCUs depending on usage, in steps of 6 GB of memory, and associated vCPUs and fast access storage. This means that resources might not be fully and optimally utilized across tenants.
When choosing the silo pattern, note that a maximum of 100 knowledge bases are supported in each AWS account. This makes the silo pattern favorable for your largest tenants with specific isolation requirements. Having a separate knowledge base per tenant also reduces the impact of quotas on concurrent ingestion jobs (maximum one concurrent job per KB, five per account), job size (100 GB per job), and data sources (maximum of 5 million documents per data source). It also improves the performance fairness as perceived by your tenants.
Deleting a knowledge base when offboarding a tenant might be time-consuming, depending on the size of the data sources and the synchronization process. To mitigate this, you can set the data deletion policy in your tenants’ knowledge bases to RETAIN. This way, the knowledge base deletion process will not delete your tenants’ data from the OpenSearch Service index. You can delete the index by deleting the OpenSearch Serverless collection.
Pool
In contrast with the silo pattern, in the pool pattern, illustrated in the following figure, the whole end-to-end RAG architecture is shared by your tenants, making it particularly suitable to accommodate many small tenants.
The pool pattern prescribes the following:
- Single data source – The tenants’ data is stored within the same S3 bucket. This implies that the pool model supports a shared KMS key for encryption at rest, not offering the possibility of per-tenant encryption keys. To identify tenant ownership downstream for each document uploaded to Amazon S3, a corresponding JSON metadata file has to be generated and uploaded. The metadata file generation process can be asynchronous, or even batched for multiple files, because Amazon Bedrock Knowledge Bases requires an explicit triggering of the ingestion job. The metadata file must use the same name as its associated source document file, with .metadata.json appended to the end of the file name, and must be stored in the same folder or location as the source file in the S3 bucket. The following code is an example of the format:
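(A minimal sketch; the metadataAttributes wrapper is the structure Amazon Bedrock Knowledge Bases expects for metadata files, and the tenant value shown is illustrative.)

```json
{
    "metadataAttributes": {
        "tenantId": "tenant_1"
    }
}
```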
In the preceding JSON structure, the key tenantId has been deliberately chosen, and can be changed to a key you want to use to express tenancy. The tenancy field will be used at runtime to filter documents belonging to a specific tenant, therefore the filtering key at runtime must match the metadata key in the JSON used to index the documents. Additionally, you can include other metadata keys to perform further filtering that isn’t based on tenancy. If you don’t upload the object’s .metadata.json file, the client application won’t be able to find the document using metadata filtering.
- Single knowledge base – A single knowledge base is created to handle the data ingestion for your tenants. This means that your tenants will share the same chunking strategy and embedding model, and share the same encryption at-rest KMS key. Moreover, because ingestion jobs are triggered per data source per KB, you will be restricted to offering your tenants the same data freshness standards.
- Single OpenSearch Serverless collection and index – Your tenant data is pooled in a single OpenSearch Service vector index, therefore your tenants share the same KMS encryption key for vector data, and the same HNSW parameters for indexing and query. Because tenant data isn’t physically segregated, it’s crucial that the query client be able to filter results for a single tenant. This can be efficiently achieved using either the Amazon Bedrock Knowledge Bases Retrieve or RetrieveAndGenerate APIs, expressing the tenant filtering condition as part of the retrievalConfiguration (for more details, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy). If you want to restrict the vector search to return results for tenant_1, the following is an example client implementation performing RetrieveAndGenerate based on the AWS SDK for Python (Boto3):
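(A minimal sketch using the bedrock-agent-runtime client; the filter key mirrors the tenantId metadata attribute above, the angle-bracket placeholders are explained next, and session, error, and retry handling are omitted.)

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

text = "<USER_QUERY>"  # the original user query coming from your application

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": text},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<YOUR_KNOWLEDGEBASE_ID>",
            "modelArn": "<FM_ARN>",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    # Only return chunks whose metadata marks them as tenant_1's documents
                    "filter": {
                        "equals": {
                            "key": "tenantId",
                            "value": "tenant_1",
                        }
                    }
                }
            },
        },
    },
)

print(response["output"]["text"])
```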
text contains the original user query that needs to be answered, taking into account the document base. <YOUR_KNOWLEDGEBASE_ID> needs to be substituted with the identifier of the knowledge base used to pool your tenants, and <FM_ARN> needs to be substituted with the Amazon Bedrock model Amazon Resource Name (ARN) you want to use to reply to the user query. The client presented in the preceding code has been streamlined to present the tenant filtering functionality. In a production case, we recommend implementing session and error handling, logging and retry logic, and separating the tenant filtering logic from the client invocation to make it inaccessible to developers.
Because the end-to-end architecture is pooled in this pattern, onboarding and offboarding a tenant doesn’t require you to create new physical or logical constructs: it’s as simple as starting or stopping the upload of a specific tenant’s documents to Amazon S3. This also implies that there is no AWS managed API that can be used to offboard and fully forget a specific tenant end to end. To delete the historical documents belonging to a specific tenant, you can just delete the relevant objects in Amazon S3. Typically, customers will have an external application that maintains the list of available tenants and their status, facilitating the onboarding and offboarding process.
Sharing the monitoring system and logging capabilities in this pattern reduces the complexity of operations with a large number of tenants. However, it requires you to collect the tenant-specific metrics from the client side to perform specific tenant attribution.
The pool pattern optimizes the end-to-end cost of your RAG architecture, because sharing OCUs across tenants maximizes the use of each OCU and minimizes the tenants’ idle time. Sharing the same pool of OCUs across tenants means that this pattern doesn’t offer performance isolation at the vector store level, so the largest and most active tenants might impact the experience of other tenants.
When choosing the pool pattern for your RAG architecture, you should be aware that a single ingestion job can ingest or delete a maximum of 100 GB. Additionally, the data source can have a maximum of 5 million documents. If the solution has many tenants that are geographically distributed, consider triggering the ingestion job multiple times a day so you don’t hit the ingestion job size limit. Also, depending on the number and size of your documents to be synchronized, the time for ingestion will be determined by the embedding model invocation rate. For example, consider the following scenario:
- Number of tenants to be synchronized = 10
- Average number of documents per tenant = 100
- Average size per document = 2 MB, containing roughly 200,000 tokens divided into 220 chunks of 1,000 tokens to allow for overlap
- Using Amazon Titan Embeddings v2 on demand, allowing for 2,000 RPM and 300,000 TPM
This would result in the following:
- Total embeddings requests = 10*100*220 = 220,000
- Total tokens to process = 220,000 chunks × 1,000 tokens per chunk = 220,000,000
- Total time taken to embed: the 2,000 RPM quota alone would allow 220,000/2,000 = 110 minutes (1 hour, 50 minutes), but at 1,000 tokens per chunk the 300,000 TPM quota is the tighter limit, giving roughly 220,000,000/300,000 ≈ 733 minutes, or just over 12 hours
Because a full synchronization can occupy a large part of the day under these on-demand quotas, spread ingestion jobs throughout the day (for example, synchronizing subsets of tenants in separate jobs) rather than ingesting all tenants’ data in one batch. This calculation is a best-case scenario and doesn’t account for the latency introduced by the FM when creating the vector from each chunk. If you expect to synchronize a large number of tenants at the same time, consider using provisioned throughput to decrease the time it takes to create vector embeddings. This approach will also help distribute the load on the embedding models, limiting throttling of the Amazon Bedrock runtime API calls.
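If you orchestrate these periodic synchronizations yourself, a sketch of triggering the pooled data source's ingestion job with Boto3 could look like the following; the identifiers are placeholders, and the schedule would typically come from Amazon EventBridge or your own scheduler:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Trigger a sync of the pooled data source; run this on a schedule
# (for example, several times a day) to stay within the per-job size limit.
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="<YOUR_KNOWLEDGEBASE_ID>",
    dataSourceId="<YOUR_DATASOURCE_ID>",
    description="Scheduled multi-tenant sync",
)
print(job["ingestionJob"]["status"])
```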
Bridge
The bridge pattern, illustrated in the following figure, strikes a balance between the silo and pool patterns, offering a middle ground that balances tenant data isolation and security.
The bridge pattern delivers the following characteristics:
- Separate data source per tenant in a common S3 bucket – Tenant data is stored in the same S3 bucket, but prefixed by a tenant identifier. Although having a different prefix per tenant doesn’t offer the possibility of using per-tenant encryption keys, it does create a logical separation that can be used to segregate data downstream in the knowledge bases.
- Separate knowledge base per tenant – This pattern prescribes creating a separate knowledge base per tenant similar to the silo pattern. Therefore, the considerations in the silo pattern apply. Applications built using the bridge pattern usually share query clients across tenants, so they need to identify the specific tenant’s knowledge base to query. They can identify the knowledge base by storing the tenant-to-knowledge base mapping in an external database, which manages tenant-specific configurations. The following example shows how to store this tenant-specific information in an Amazon DynamoDB table:
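One possible sketch with the AWS SDK for Python (Boto3); the table name tenantKbConfig and the attribute names knowledgebaseId and modelARN come from the surrounding discussion, while the tenantId partition key and the placeholder values are assumptions:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical table holding the tenant-to-knowledge base mapping
table = dynamodb.Table("tenantKbConfig")

table.put_item(
    Item={
        "tenantId": "tenant_1",                        # partition key
        "knowledgebaseId": "<TENANT_KNOWLEDGEBASE_ID>",
        "modelARN": "<FM_ARN>",
    }
)
```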
In a production setting, your application will store tenant-specific parameters belonging to other functionality in your data stores. Depending on your application architecture, you might choose to store knowledgebaseId and modelARN alongside the other tenant-specific parameters, or create a separate data store (for example, the tenantKbConfig table) specifically for your RAG architecture. This mapping can then be used by the client application by invoking the RetrieveAndGenerate API. The following is an example implementation:
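A sketch of how the query client could look up the mapping and call RetrieveAndGenerate; the helper function and table layout are illustrative and mirror the DynamoDB sketch above:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def answer_for_tenant(tenant_id: str, text: str) -> str:
    # Resolve the tenant's knowledge base and model from the mapping table
    config = dynamodb.Table("tenantKbConfig").get_item(
        Key={"tenantId": tenant_id}
    )["Item"]

    # Query only that tenant's knowledge base; no metadata filter is needed
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": text},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": config["knowledgebaseId"],
                "modelArn": config["modelARN"],
            },
        },
    )
    return response["output"]["text"]
```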
- Separate OpenSearch Service index per tenant – You store data within the same OpenSearch Serverless collection, but you create a vector index per tenant. This implies your tenants share the same KMS encryption key and the same pool of OCUs, optimizing OpenSearch Service resource usage for indexing and querying. The separation into vector indexes gives you the flexibility of choosing different HNSW parameters per tenant, letting you tailor the performance of your k-NN indexing and querying for your different tenants.
The bridge pattern supports up to 100 tenants, and onboarding and offboarding a tenant requires the creation and deletion of a knowledge base and OpenSearch Service vector index. To delete the data pertaining to a particular tenant, you can delete the created resources and use the tenant-specific prefix as a logical parameter in your Amazon S3 API calls. Unlike the silo pattern, the bridge pattern doesn’t allow for per-tenant end-to-end encryption; it offers the same level of tenant customization offered by the silo pattern while optimizing costs.
Summary of differences
The following figure and table provide a consolidated view for comparing the characteristics of the different multi-tenant RAG architecture patterns. This comprehensive overview highlights the key attributes and trade-offs associated with the pool, bridge, and silo patterns, enabling informed decision-making based on specific requirements.
The following figure illustrates the mapping of design characteristics to components of the RAG architecture.
The following table summarizes the characteristics of the multi-tenant RAG architecture patterns.
| Characteristic | Attribute of | Pool | Bridge | Silo |
| --- | --- | --- | --- | --- |
| Per-tenant chunking strategy | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
| Customer managed key for encryption of transient data and at rest | Amazon Bedrock Knowledge Base Data Source | No | No | Yes |
| Per-tenant distance measure | Amazon OpenSearch Service Index | No | Yes | Yes |
| Per-tenant ANN index configuration | Amazon OpenSearch Service Index | No | Yes | Yes |
| Per-tenant data deletion policies | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
| Per-tenant vector size | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
| Tenant performance isolation | Vector database | No | No | Yes |
| Tenant onboarding and offboarding complexity | Overall solution | Simplest, requires management of new tenants in existing infrastructure | Medium, requires minimal management of end-to-end infrastructure | Hardest, requires management of end-to-end infrastructure |
| Query client implementation | Original Data Source | Medium, requires dynamic filtering | Hardest, requires external tenant mapping table | Simplest, same as single-tenant implementation |
| Amazon S3 tenant management complexity | Amazon S3 buckets and objects | Hardest, need to maintain tenant-specific metadata files for each object | Medium, each tenant needs a different S3 path | Simplest, each tenant requires a different S3 bucket |
| Cost | Vector database | Lowest | Medium | Highest |
| Per-tenant FM used to create vector embeddings | Amazon Bedrock Knowledge Base | No | Yes | Yes |
Conclusion
This post explored three distinct patterns for implementing a multi-tenant RAG architecture using Amazon Bedrock Knowledge Bases and OpenSearch Service. The silo, pool, and bridge patterns offer varying levels of tenant isolation, variability, management simplicity, and cost-efficiency, catering to different use cases and requirements. By understanding the trade-offs and considerations associated with each pattern, organizations can make informed decisions and choose the approach that best aligns with their needs.
Get started with Amazon Bedrock Knowledge Bases today.
About the Authors
Emanuele Levi is a Solutions Architect in the Enterprise Software and SaaS team, based in London. Emanuele helps UK customers on their journey to refactor monolithic applications into modern microservices SaaS architectures. Emanuele is mainly interested in event-driven patterns and designs, especially when applied to analytics and AI, where he has expertise in the fraud-detection industry.
Mehran Nikoo is a Generative AI Go-To-Market Specialist at AWS. He leads the generative AI go-to-market strategy for UK and Ireland.
Dani Mitchell is a Generative AI Specialist Solutions Architect at AWS. He is focused on computer vision use cases and helps AWS customers in EMEA accelerate their machine learning and generative AI journeys with Amazon SageMaker and Amazon Bedrock.
NeurIPS 2024: AI for Science with Chris Bishop
The Microsoft Research Podcast offers its audience a unique view into the technical advances being pursued at Microsoft through the insights and personal experiences of the people committed to those pursuits.
In this special edition of the podcast, Technical Fellow and Microsoft Research AI for Science Director Chris Bishop joins guest host Eliza Strickland of IEEE Spectrum at the 38th annual Conference on Neural Information Processing Systems (NeurIPS) to talk about deep learning’s potential to improve the speed and scale at which scientific advancements can be made. Bishop discusses the factors considered when choosing which scientific challenges to tackle with AI; the impact foundation models are having right now in areas such as drug discovery and weather forecasting; and the work at NeurIPS that he’s excited about.
Learn more:
From forecasting storms to designing molecules: How new AI foundation models can speed up scientific discovery
Microsoft Source blog, October 2024
Introducing Aurora: The first large-scale foundation model of the atmosphere
Microsoft Research blog, June 2024
GHDDI and Microsoft Research use AI technology to achieve significant progress in discovering new drugs to treat global infectious diseases
Microsoft Research blog, January 2024
AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop
Microsoft Research Podcast, December 2023
AI4Science to empower the fifth paradigm of scientific discovery
Microsoft Research blog, July 2022
Novartis empowers scientists with AI to speed the discovery and development of breakthrough medicines
Microsoft Source, November 2021
Bringing together deep bioscience and AI to help patients worldwide: Novartis and Microsoft work to reinvent treatment discovery and development
Official Microsoft Blog, October 2019
Transcript
[MUSIC]

ELIZA STRICKLAND: Welcome to the Microsoft Research Podcast, where Microsoft’s leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them.
I’m Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special edition of the podcast.
[MUSIC FADES]

Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Chris Bishop. Chris is a Microsoft technical fellow and the director of Microsoft Research AI for Science. Chris is with me for one of our two on-site conversations that we’re having here at the conference.
Chris, welcome to the podcast.
CHRIS BISHOP: Thanks, Eliza. Really great to join you.
STRICKLAND: How did your long career in machine learning lead you to this focus on AI for Science, and were there any pivotal moments when you started to think that, hey, this deep learning thing, it’s going to change the way scientific discovery happens?
BISHOP: Oh, that’s such a great question. I think this is like my career coming full circle, really. I started out studying physics at Oxford, and then I did a PhD in quantum field theory. And then I moved into the fusion program. I wanted to do something of practical value, [LAUGHTER] so I worked on nuclear fusion for about seven or eight years doing theoretical physics, and then that was about the time that Geoff Hinton published his backprop paper. And it really caught my imagination as an exciting approach to artificial intelligence that might actually yield some progress. So that was, kind of, 35 years ago, and I moved into the field of machine learning. And, actually, the way I made that transition was by applying neural networks to fusion. I was working at the JET experiment, which was the world’s largest fusion experiment. It was sort of big data in its day. And so I had to, first of all, teach myself to program.
STRICKLAND: [LAUGHS] Right.
BISHOP: I was a pencil-and-paper theoretician up to that point. Persuade my boss to buy me a workstation and then started to play with these neural nets. So right from the get-go, I was applying machine learning 35 years ago to data from science experiments. And that was a great on-ramp for me. And then, eventually, I just got so distracted, I decided I wanted to build my career in machine learning. Spent a few years as a research professor and then joined Microsoft 27 years ago, when Microsoft opened its first research lab outside the US in Cambridge, UK, and have been there very happily ever since. Went on to become lab director. But about three or four years ago, I realized that not only was deep learning transforming so many different things, but I felt it was especially relevant to scientific discovery. And so I had an opportunity to pitch to our chief technology officer to go start a new team. And he was very excited by this. So just over two and a half years ago now, we set up Microsoft Research AI for Science, and it’s a global team, and it, sort of, does what it says on the tin.
STRICKLAND: So you’ve said that AI could usher in a fifth paradigm of scientific discovery, which builds upon the ideas of Turing Award–winner Jim Gray, who described four stages in the evolution of science. Can you briefly explain the four prior paradigms and then tell us about what makes this stage different?
BISHOP: Yeah, sure. So it was a nice insight by Jim. He said, well, of course, the first paradigm of scientific discovery was really the empirical one. I tend to think of some cave dweller picking up a big rock and a small rock and letting go of them at the same time and thinking the big rock will hit the ground first …
STRICKLAND: [LAUGHS] Right …
BISHOP: … discovering they land together. And this is interesting. They’ve discovered a, sort of, pattern irregularity in nature, and even today, the first paradigm is in a sense the prime paradigm. It’s the most important one because at the end of the day, it’s experimental results that determine the truth, if you like. So that’s the first paradigm. And it continues to be of critical importance today. And then the second paradigm really emerged in the 17th century. When Newton discovered the laws of motion and the law of gravity, and not only did he discover the equations but this, sort of, remarkable fact that nature can even be described by equations, right. It’s not obvious that this would be true, but it turns out that, you know, the world around us can be described by very simple equations that you can write on a T-shirt. And so in the 19th century, James Clerk Maxwell discovered some simple equations that describe the whole of electricity and magnetism, electromagnetic waves, and so on. And then very importantly, the beginning of the 20th century, we had this remarkable breakthrough in quantum physics. So again down at the molecular—the atomic—level, the world is described with exquisite precision by Schrödinger’s equation. And so this was the second paradigm, the theoretical. That the world is described with incredible precision of a huge range of length and time by very simple equations.
But of course, there’s a catch, which is those equations are very hard to solve. And so the third paradigm really began, I guess, sort of, in the ’50s and ’60s, the development of digital computers. And, actually, the very first use of digital computers was to simulate physics, and it’s been at the core of digital computing right up to the present day. And so what you’re doing there is using a computer to go with a numerical algorithm to solve those very simple equations but solve them in a practical setting. And so that’s, I’ll refer to that as simulation. That’s the third paradigm. And that’s proven to be tremendously powerful. If you look up the weather forecast on your phone today, it’s done by numerical weather forecasting, solving in those case Navier-Stokes equations using big numerical simulators. What Jim Gray observed, though, really emerging at the beginning of the 21st century was what he called the fourth paradigm, or data-intensive scientific discovery. So this is the era of big data. Think of particle physics at the CERN accelerator, for example, generating colossal amounts of data in real time. And that data can then be processed and filtered. We can do statistics on it. But of course, we can do machine learning on that data. And so machine learning feeds off large data. And so the fourth paradigm really is dominated today by machine learning. And again that remains tremendously important.
What I noticed, though, is that there’s again another framework. We call it the fifth paradigm. Again, it goes back to those fundamental equations. But again, it’s driven by computation, and it’s the idea that we can train machine learning systems not using the empirical data of the fourth paradigm but instead using the results of simulation. So the output of the third paradigm. So think of it this way. You want to predict the property of some molecule, let’s say. You could in principle solve Schrödinger’s equation on a digital computer; it’d be very expensive. And let’s say you want to screen hundreds of millions of molecules. That’s going to get far too costly. So instead, what you can do is have a mindset shift. You can think of that simulator not as a tool to predict the molecule’s properties directly but instead as a way of generating synthetic training data. And then you use that training data to train a deep learning system to give what I like to call an emulator, an emulator of the simulator. Once it’s trained, that emulator is fast. It’s usually three to four orders of magnitude faster than the simulator. So if you’re going to do something over and over again, that three-to-four-order-of-magnitude acceleration is tremendously disruptive. And what’s really interesting is we see that fifth paradigm occur in many, many different places. The idea goes back a long way. The, actually, the last project that I worked on before I left the fusion program was to do what was the world’s first-ever real-time control of a tokamak fusion plasma using a neural net and the computers of the day. But the processors were just far too slow, long before GPUs, and so on. And so it wasn’t possible to solve the equations. In that case, it was called the Grad-Shafranov equation. Again, a simple differential equation you could write on a T-shirt, but solving it was expensive on a computer. We were about a million times too slow to solve it directly in real time. And so instead, we generated lots and lots of solutions. We used those solutions to train a very simple neural network, not a deep network, just a simple two-layer network back in the day, and then we implemented that in special hardware and did real-time feedback control. So that was an example of the fifth paradigm from, you know, a quarter of a century ago. But of course, deep learning just tremendously expands the range of applicability. So today we’re using the fifth paradigm in many, many different scenarios. And time and time again, we see these four-orders-of-magnitude acceleration. So I think it’s worthy of thinking of that as a new paradigm because it’s so pervasive and so ubiquitous.
STRICKLAND: So how do you identify fields of science and particular problems that are amenable to this kind of AI assistance? Is it all about availability of data or the need for that kind of speed up?
BISHOP: So there are lots of factors that go into this. And when I think about AI for Science actually, the space of opportunity is colossal because science is, science is really just understanding more about the world around us. And so the range of possibilities is daunting really. So in choosing what to work on, I think there are several factors. Yes, of course, data is important, but very interestingly, we can use experimental data or we can generate synthetic data by running simulators. So we’re a big fan of the fifth paradigm. But I think another factor—and this is particularly at Microsoft—is thinking about, how can we have real-world impact at scale? Because that’s our job, is to make the world a better place and to do so at a planetary scale. And so we’ve settled on, for the most part, working at the molecular level. So if you think about the number of different ways of combining atoms together to make new stable configurations of atoms, it’s gargantuan. I mean, the number of just small molecules, small organic molecules, that are potential drug candidates is about 10^60. It’s about the same as the number of atoms in the solar system. The number of proteins, maybe the fourth power of the number of atoms in the universe, or something crazy. So you’ve got this gargantuan space to search, and within that space, for sure, there’ll be all sorts of interesting molecules, materials, new drugs, new therapies, new materials for carbon capture, new kinds of batteries, new photovoltaics. The list is endless because everything around us is made of atoms, including our own bodies. So the potential just in the molecular space is gargantuan. And so that’s why we focus there.
STRICKLAND: It’s a big focus. [LAUGHTER]
BISHOP: It’s a broad focus, still, yes.
STRICKLAND: So let’s take one of these case studies then. In a project on drug discovery, you worked with the Global Health Drug Discovery Institute on molecules that would interact with tuberculosis and coronaviruses, I think. And you found, I think, candidate molecules in five months instead of several years. Can you talk about what models you used in this work and how they helped you get this vastly sped up process?
BISHOP: Sure. Yes. We’re very proud of this project. We’re working with the Gates Foundation and the Global Health Drug Discovery Institute to look at particularly diseases that affect low-income countries like tuberculosis. And in terms of the models we use, I think we’re all familiar with a large language model. We train it on a sequence of words or sequence of word tokens, and it’s trained to predict the next token. We can do a similar thing, but instead of learning the language of humans, we can learn the language of nature. So in particular, what we’re looking for here is a small organic molecule that we could synthesize in a laboratory that will bind with a particular target protein. It’s called ClpP. And by interfering with that protein, we can arrest the process of tuberculosis. So the goal is to search that space of 10^60 molecules and find a new one that has the right properties. Now, the way we do this is to train something that’s essentially a transformer. So it looks like a language model, but the language it’s trained on is a thing called SMILES strings. It’s an idea that’s been around in chemistry for a long time. It’s just a way of taking a three-dimensional molecule and representing it as a one-dimensional sequence of characters. So this is perfect for feeding into a language model. So we take a transformer and we train it on a large database of small organic molecules that are, sort of, typical of the kinds of things you might see in the space of drug molecules. Once that’s been trained, we can now run it generatively. And it will output new molecules. Now, we don’t just want to generate molecules at random because that doesn’t help. We want to generate molecules that bind to this particular binding site on this particular protein. So the next step is we have to tell the model about the protein and the protein binding site. And we do that by giving it information about not actually—well, we do tell it about the whole protein, but we especially give it information about the three-dimensional geometry of the binding site. So we tell about the locations of the atoms that are in the binding site. And we do this in a way that satisfies certain physics constraints, sort of, equivariance properties, it’s called. So if you think about a molecule, if I rotate the molecule in space, the positions of all the atoms change in a complicated way. But it’s the same molecule; it has the same energy and other properties and so on. So we need the right kind of representation. That’s then fed into this transformer using a technique called cross-attention. So internally, the transformer uses self-attention to look at the history of tokens, but it can now use cross-attention to look at another model that understands the proteins. But even that’s not enough. Because in discovering drugs and exploring this gargantuan space and looking for these needles in a haystack, what typically happens [is] you find a hit, a molecule that binds, but now you want to optimize it. You want to make lots of small variations of that molecule in order to make it better and better at binding. So the third piece of the architecture is another module, a thing called a variational autoencoder, that again uses deep learning. But this time, it can take as input an organic molecule that is already known, a hit that’s already known to bind to the site, and that again is fed in through cross-attention.
And now the SMILES autoregressive model can now generate a molecule that’s an improvement on the starting molecule and knows about the protein binding. And so what we do is, we start off with the state-of-the-art molecule. And the best example we found is one that’s more than two orders of magnitude stronger binding affinity to the binding pocket, which is a tremendous advance; it’s the state of the art in addressing tuberculosis. And of course, the exciting thing is that this is tested in the laboratory. So this is not just a computer experiment in some sort of benchmark or whatever. We sent a description of the molecule to the laboratories at GHDDI. They synthesized a molecule, characterized it, measured its binding property, and said, well, hey, this is a new state of the art for this target protein. So we’re continuing to work with them to further refine this. There are obviously quite a few more steps. If you know about the drug discovery process, there’s a lot of hurdles you have to get through, including, of course, very important clinical trials, before you have something that can actually be used in humans. But we’re already hugely excited about the fact that we were able to make such a big advance so quickly, in such a short amount of time, compared to the usual drug discovery process.
STRICKLAND: And while you were looking for that molecule that had the proper characteristics, were you also determining whether it could be manufactured easily, like trying to think about practical realities of bringing this thing out of the computer and into the lab?
BISHOP: Great question. I mean, you’re hinting there at the fact the discovery process, of course, is a long pipeline. You start with the protein. You have to find a molecule that binds. You then refine the molecule. Now you have to look at ADMET, you know, the absorption, metabolism, and excretion and so on of the molecule. Also make sure that it’s not toxic. But then you need to be able to synthesize it. It’s no good if nobody can make this molecule. So you have to look at that. So, actually, in the AI for Science team, we look at all of these aspects of that drug discovery process. And we find particular areas, especially where there’s, sort of, low-hanging fruit where we can see that deep learning can make a big impact. It doesn’t necessarily help much to take a very easy, fast piece of the pipeline and go work on that. You want to understand, what are the bottlenecks, and can we really unlock those with deep learning? So we’re very interested in that whole process. It’s a fascinating problem. You’ve got a gargantuan search space, and yet you have so many different constraints that need to be met. And deep learning just feels like the perfect tool to go after this problem.
STRICKLAND: When you talk to the scientists that you collaborate with, is AI changing the kinds of questions that they are able to ask? That they want to ask?
BISHOP: Oh, for sure. And it’s really empowering. It’s enabling those working in the drug discovery space to, I think, to think in a much more expansive way. If you think about just the kind of acceleration that I talked about from the fifth paradigm, if you go to four-order-of-magnitude acceleration, OK, it may not sound like much of a dent onto the 1060 space, but now when you’re exploring variants of molecules and so on, the ability to explore that space orders of magnitude faster allows you to think much more creatively, allows you to think in a more expansive way about how much of that space you can explore and how efficiently you can explore it. So I think it really is opening up new horizons, and certainly, we have an exciting partnership with Novartis. We’ve been working with them for the last five years, and they’ve been deploying some of our techniques and models in practice for their drug discovery pipeline. We get a lot of great feedback from them about how exciting they’re finding these techniques to use in practice because it is changing the way they go about doing the drug discovery process.
STRICKLAND: To jump to one other case study, we don’t have to go into great detail on it, but I’m very curious about your Project Aurora, this foundation model for state-of-the-art weather forecasting that, I believe, is 5,000 times faster than traditional physics-based methods. Can you talk a little bit about how that project is evolving, how you imagine these AI forecasting models working with traditional forecasting models, perhaps, or replacing them?
BISHOP: Yes. So I said most of what we do is down at the molecular level. So this is one of the exceptions. So this is really at the global level, the planetary level. Again, it’s a beautiful example of the fifth paradigm because the way forecasting has been done for a number of decades now and the way most forecasting is done at the moment is through what’s called numerical weather prediction. So again, you have these simple equations. It’s no longer Schrödinger’s equation of atomic physics. It’s now Navier–Stokes equations of fluid flows and a whole bunch of other equations that describe moisture in the atmosphere and the weather and so on. And those equations are solved on a supercomputer. And again, we can think of that numerical simulator now not just as the way you’re going to do the forecasting but actually as the way to generate training data for a deep learning emulator. So several groups have been exploring this over the last couple of years. And again, we see this very robust three-to-four-order-of-magnitude acceleration. But what’s really interesting about Aurora, it’s the world’s first foundation model, so instead of just building an emulator of a particular numerical weather simulator, which is already very interesting, we trained Aurora on a much more diverse set of data and really trying to force it not just to emulate a particular simulator but really, as it were, understand or model the fundamental equations of fluid flows in the Earth’s atmosphere. And then the reason we want to do this is because we now want to take that foundation model and fine-tune it to other downstream applications where there’s much less data. So one example would be pollution flow. So obviously the flow of pollution around the atmosphere is extremely important. But the data is far more sparse. There are far fewer sensors for pollution than there are for, sort of, wind and rain and temperature and so on. And so we were able to achieve state-of-the-art performance in modeling the flow of pollution by leveraging huge data and building this foundation model and then using relatively little data, our pollution monitoring, to build that downstream fine-tuned model. So beautiful example of a foundation model.
STRICKLAND: That is a cool example. And finally, just to wrap up, what have you seen or heard at NeurIPS that’s gotten you excited? What kind of trends are in the air? What’s the buzz?
BISHOP: Oh, that’s a great question. I mean, it’s such a huge conference. There’s something like 17,000 people or so here this year, I’ve heard. I think, you know, one of the things that’s happened so far that’s actually given me an enormous amount of energy wasn’t just a technical talk. It was actually an event we had on the first day called Women in Machine Learning. And I was a mentor on one of the mentorship tables, and I found it very energizing just to meet so many people, early-career-stage people, who were very excited about AI for Science and realizing that, you know, it’s not just that I think AI for Science is important. A lot of people are moving into this field now. It is a big frontier for AI. I’m a little biased, perhaps. I think that it’s the most important application area. Intellectually, it’s very exciting because we get to deal with science as well as machine learning. But also if you think about [it], science is really about learning more about the world. And once we learn more about the world, we can then develop aquaculture; we can develop the steam engine; we can develop silicon chips; we can change the world. We can save lives and make the world a better place. And so I think it’s the most fundamental undertaking we have in AI for Science and the thing I loved about the Women in Machine Learning event is that the AI for Science table was just completely swamped with all of these people at early stages of their career, either already working in this field and doing PhDs or wanting to get into it. That was very exciting.
STRICKLAND: That is really exciting and inspiring, and it gives me a lot of hope. Well, Chris Bishop, thank you so much for joining us today and thanks for a great conversation.
BISHOP: Thank you. I really appreciate it.
[MUSIC]
STRICKLAND: And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at microsoft.com/research. Until next time.
[MUSIC FADES]
How Amazon trains sequential ensemble models at scale with Amazon SageMaker Pipelines
Amazon SageMaker Pipelines includes features that allow you to streamline and automate machine learning (ML) workflows. This allows scientists and model developers to focus on model development and rapid experimentation rather than infrastructure management.
Pipelines offers the ability to orchestrate complex ML workflows with a simple Python SDK and to visualize those workflows through SageMaker Studio. This helps with data preparation and feature engineering tasks, as well as model training and deployment automation. Pipelines also integrates with Amazon SageMaker Automatic Model Tuning, which can automatically find the hyperparameter values that result in the best-performing model, as determined by your chosen metric.
Ensemble models are becoming popular within the ML community. They generate more accurate predictions by combining the predictions of multiple models. Pipelines can quickly be used to create an end-to-end ML pipeline for ensemble models. This enables developers to build highly accurate models while maintaining efficiency and reproducibility.
In this post, we provide an example of an ensemble model that was trained and deployed using Pipelines.
Use case overview
Sales representatives generate new leads and create opportunities within Salesforce to track them. The following application is an ML approach that uses unsupervised learning to automatically identify use cases in each opportunity based on various text information, such as name, description, details, and product service group.
Preliminary analysis showed that use cases vary by industry, that different use cases have very different distributions of annualized revenue, and that they can help with segmentation. Hence, the use case is an important predictive feature that can optimize analytics and improve sales recommendation models.
We can treat use case identification as a topic identification problem, so we explored different topic identification models such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and BERTopic. In both LSA and LDA, each document is treated as a collection of words only, and the order of the words and their grammatical roles don’t matter, which may cause some information loss in determining the topic. Moreover, both require a predetermined number of topics, which was hard to determine in our dataset. Because BERTopic overcomes these problems, we used it to identify the use cases.
The approach uses three sequential BERTopic models to generate the final clustering in a hierarchical method.
Each BERTopic model consists of four parts (a minimal composition sketch follows this list):
- Embedding – Different embedding methods can be used in BERTopic. In this scenario, input data comes from various areas and is usually inputted manually. As a result, we use sentence embedding to ensure scalability and fast processing.
- Dimension reduction – We use Uniform Manifold Approximation and Projection (UMAP), which is an unsupervised and nonlinear dimension reduction method, to reduce high dimension text vectors.
- Clustering – We use the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) method to form different use case clusters.
- Keyword identification – We use class-based TF-IDF to extract the most representative words from each cluster.
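The following is a minimal sketch of how these four components can be composed with the open source BERTopic library. The embedding model choice, the cluster count, and the load_opportunity_texts helper are illustrative assumptions, not details taken from this post.
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import Birch
from umap import UMAP

# 1. Embedding: sentence embeddings for scalability (model choice is illustrative)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Dimension reduction: UMAP to reduce the high-dimensional text vectors
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)

# 3. Clustering: BIRCH, passed in place of BERTopic's default HDBSCAN model
cluster_model = Birch(n_clusters=20)  # the post targets 15-25 clusters; 20 is illustrative

# 4. Keyword identification: class-based TF-IDF
ctfidf_model = ClassTfidfTransformer()

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    ctfidf_model=ctfidf_model,
)

# load_opportunity_texts() is a hypothetical helper that returns the combined
# text fields (name, description, details, product service group) per opportunity.
docs = load_opportunity_texts()
topics, probs = topic_model.fit_transform(docs)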
Sequential ensemble model
There is no predetermined number of topics, so we set the input for the number of clusters to 15–25 topics. Upon observation, some of the topics are wide and general, so another layer of the BERTopic model is applied to them individually. After combining all of the newly identified topics from the second-layer model with the original topics from the first-layer results, postprocessing is performed manually to finalize topic identification. Lastly, a third layer is used for some of the clusters to create subtopics.
To enable the second- and third-layer models to work effectively, you need a mapping file to map results from previous models to specific words or phrases. This helps make sure that the clustering is accurate and relevant.
We use Bayesian optimization for hyperparameter tuning and cross-validation to reduce overfitting. The dataset contains features such as opportunity name, opportunity details, needs, associated product name, product details, and product groups. The models are evaluated using a customized loss function, and the best embedding model is selected.
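The post doesn’t name the tuning library, so the following is only a hedged sketch of Bayesian-style hyperparameter search using Optuna’s default TPE sampler over the UMAP and BIRCH hyperparameters. The search ranges are illustrative, the embeddings are synthetic stand-ins, and negative silhouette score stands in for the team’s customized loss and cross-validation setup.
import numpy as np
import optuna
from sklearn.cluster import Birch
from sklearn.metrics import silhouette_score
from umap import UMAP

# Stand-in embeddings; in practice these would be the sentence embeddings
# of the combined opportunity text fields.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))

def objective(trial):
    n_neighbors = trial.suggest_int("umap_n_neighbors", 5, 50)
    n_components = trial.suggest_int("umap_n_components", 5, 20)
    n_clusters = trial.suggest_int("birch_n_clusters", 15, 25)

    reduced = UMAP(n_neighbors=n_neighbors, n_components=n_components,
                   random_state=42).fit_transform(embeddings)
    labels = Birch(n_clusters=n_clusters).fit_predict(reduced)

    # Negative silhouette score as a simple stand-in for a customized loss
    # (lower is better for the study below).
    return -silhouette_score(reduced, labels)

study = optuna.create_study(direction="minimize")  # TPE sampler by default
study.optimize(objective, n_trials=20)
print(study.best_params)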
Challenges and considerations
Here are some of the challenges and considerations of this solution:
- The pipeline’s data preprocessing capability is crucial for enhancing model performance. By preprocessing incoming data prior to training, we can make sure that our models are fed high-quality data. Some of the preprocessing and data cleaning steps include converting all text columns to lowercase; removing template elements, contractions, URLs, and emails; removing non-relevant NER labels; and lemmatizing the combined text. The result is more accurate and reliable predictions.
- We need a compute environment that is highly scalable so that we can effortlessly handle and train millions of rows of data. This allows us to perform large-scale data processing and modeling tasks with ease and reduces development time and costs.
- Because every step of the ML workflow requires varying resource requirements, a flexible and adaptable pipeline is essential for efficient resource allocation. We can reduce the overall processing time, resulting in faster model development and deployment, by optimizing resource usage for each step.
- Running custom scripts for data processing and model training requires that the necessary frameworks and dependencies are available.
- Coordinating the training of multiple models can be challenging, especially when each subsequent model depends on the output of the previous one. The process of orchestrating the workflow between these models can be complex and time-consuming.
- Following each training layer, it’s necessary to revise a mapping that reflects the topics produced by the model and use it as an input for the subsequent model layer.
Solution overview
In this solution, the entry point is Amazon SageMaker Studio, which is a web-based integrated development environment (IDE) provided by AWS that enables data scientists and ML developers to build, train, and deploy ML models at scale in a collaborative and efficient manner.
The following diagram illustrates the high-level architecture of the solution.
As part of the architecture, we’re using the following SageMaker pipeline steps:
- SageMaker Processing – This step allows you to preprocess and transform data before training. One benefit of this step is the ability to use built-in algorithms for common data transformations and automatic scaling of resources. You can also use custom code for complex data preprocessing, and it allows you to use custom container images.
- SageMaker Training – This step allows you to train ML models using SageMaker-built-in algorithms or custom code. You can use distributed training to accelerate model training.
- SageMaker Callback – This step allows you to run custom code during the ML workflow, such as sending notifications or triggering additional processing steps. You can run external processes and resume the pipeline workflow on completion in this step.
- SageMaker Model – This step allows you to create or register a model with Amazon SageMaker.
Implementation walkthrough
First, we set up the SageMaker pipeline:
import boto3
import sagemaker
from sagemaker.workflow.pipeline_context import PipelineSession

# create a Session with a custom region (e.g. us-east-1); will be None if not specified
region = "<your-region-name>"
# allocate a default S3 bucket for the SageMaker session; will be None if not specified
default_bucket = "<your-s3-bucket>"
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client("sagemaker")

# Initialize a SageMaker session
sagemaker_session = sagemaker.session.Session(boto_session=boto_session, sagemaker_client=sagemaker_client, default_bucket=default_bucket)

# Set the SageMaker execution role for the session
role = sagemaker.session.get_execution_role(sagemaker_session)

# Manage interactions under a pipeline context
pipeline_session = PipelineSession(boto_session=boto_session, sagemaker_client=sagemaker_client, default_bucket=default_bucket)

# Define the base image for the scripts to run on
account_id = role.split(":")[4]
# a base image that takes care of dependencies
ecr_repository_name = "<your-base-image-to-run-script>"
tag = "latest"
container_image_uri = "{0}.dkr.ecr.{1}.amazonaws.com/{2}:{3}".format(account_id, region, ecr_repository_name, tag)
The following is a detailed explanation of the workflow steps:
- Preprocess the data – This involves cleaning and preparing the data for feature engineering and splitting the data into train, test, and validation sets.
import os
BASE_DIR = os.path.dirname(os.path.realpath(__file__))
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import (
ProcessingInput,
ProcessingOutput,
ScriptProcessor,
)
processing_instance_type = ParameterString(
name="ProcessingInstanceType",
# choose an instance type suitable for the job
default_value="ml.m5.4xlarge"
)
script_processor = ScriptProcessor(
image_uri=container_image_uri,
command=["python"],
instance_type=processing_instance_type,
instance_count=1,
role=role,
)
# define the data preprocess job
step_preprocess = ProcessingStep(
name="DataPreprocessing",
processor=script_processor,
inputs=[
ProcessingInput(source=BASE_DIR, destination="/opt/ml/processing/input/code/")
],
outputs=[
ProcessingOutput(output_name="data_train", source="/opt/ml/processing/data_train"), # output data and dictionaries etc for later steps
],
code=os.path.join(BASE_DIR, "preprocess.py"),
)
- Train layer 1 BERTopic model – A SageMaker training step is used to train the first layer of the BERTopic model using an Amazon Elastic Container Registry (Amazon ECR) image and a custom training script.
base_job_prefix="OppUseCase"
from sagemaker.workflow.steps import TrainingStep
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
training_instance_type = ParameterString(
name="TrainingInstanceType",
default_value="ml.m5.4xlarge"
)
# create an estimator for training job
estimator_first_layer = Estimator(
image_uri=container_image_uri,
instance_type=training_instance_type,
instance_count=1,
output_path=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer", # S3 bucket where the training output will be stored
role=role,
entry_point = "train_first_layer.py"
)
# create training job for the estimator based on inputs from data-preprocess step
step_train_first_layer = TrainingStep(
name="TrainFirstLayerModel",
estimator = estimator_first_layer,
inputs={
# The channel name is illustrative; it must match what train_first_layer.py reads
"data_train": TrainingInput(
s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
),
},
)
- Use a callback step – This involves sending a message to an Amazon Simple Queue Service (Amazon SQS) queue, which triggers an AWS Lambda function. The Lambda function updates the mapping file in Amazon Simple Storage Service (Amazon S3) and sends a success token back to the pipeline to resume its run (a sketch of such a handler follows the code below).
from sagemaker.workflow.callback_step import CallbackStep, CallbackOutput, CallbackOutputTypeEnum
first_sqs_queue_to_use = ParameterString(
name="FirstSQSQueue",
default_value= <first_queue_url>, # add queue url
)
first_callback_output = CallbackOutput(output_name="s3_mapping_first_update", output_type=CallbackOutputTypeEnum.String)
step_first_mapping_update = CallbackStep(
name="FirstMappingUpdate",
sqs_queue_url= first_sqs_queue_to_use,
# Input arguments that will be provided in the SQS message
inputs={
"input_location": f"s3://{default_bucket}/{base_job_prefix}/mapping",
"output_location": f"s3://{default_bucket}/{base_job_prefix}/ mapping_first_update "
},
outputs=[
first_callback_output,
],
)
step_first_mapping_update.add_depends_on([step_train_first_layer]) # call back is run after the step_train_first_layer
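On the receiving side, the Lambda function triggered by this queue must hand control back to the pipeline. The following is a minimal sketch of such a handler, not the code from this post: the message field names (token, arguments) are assumptions to verify against an actual message, and the mapping-file update itself is omitted. send_pipeline_execution_step_success is the SageMaker API that resumes a callback step.
import json
import boto3

sagemaker_client = boto3.client("sagemaker")

def handler(event, context):
    # Triggered by the SQS queue used in the CallbackStep above
    for record in event["Records"]:
        body = json.loads(record["body"])
        token = body["token"]          # callback token injected by the pipeline (assumed field name)
        arguments = body["arguments"]  # the `inputs` dict defined on the CallbackStep (assumed field name)
        input_location = arguments["input_location"]
        output_location = arguments["output_location"]

        # ... read the mapping from input_location, update it, and write it to output_location ...

        # Resume the pipeline and surface the declared output parameter
        sagemaker_client.send_pipeline_execution_step_success(
            CallbackToken=token,
            OutputParameters=[
                {"Name": "s3_mapping_first_update", "Value": output_location}
            ],
        )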
- Train layer 2 BERTopic model – Another SageMaker TrainingStep is used to train the second layer of the BERTopic model using an ECR image and a custom training script.
estimator_second_layer = Estimator(
image_uri=container_image_uri,
instance_type=training_instance_type, # same instance type as the first training layer
instance_count=1,
output_path=f"s3://{bucket}/{base_job_prefix}/train_second_layer", # S3 bucket where the training output be stored
role=role,
entry_point = "train_second_layer.py"
)
# create training job for the estimator based on inputs from preprocessing, output of previous call back step and first train layer step
step_train_second_layer = TrainingStep(
name="TrainSecondLayerModel",
estimator = estimator_second_layer,
inputs={
# Channel names are illustrative; they must match what train_second_layer.py reads
"data_train": TrainingInput(
s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
),
# Output of the previous callback step
"mapping": TrainingInput(
s3_data=step_first_mapping_update.properties.Outputs["s3_mapping_first_update"],
),
# Output of the first training layer
"first_layer": TrainingInput(
s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer"
),
},
)
- Use a callback step – Similar to Step 3, this involves sending a message to an SQS queue which triggers a Lambda function. The Lambda function updates the mapping file in Amazon S3 and sends a success token back to the pipeline to resume its run.
second_sqs_queue_to_use = ParameterString(
name="SecondSQSQueue",
default_value= <second_queue_url>, # add queue url
)
second_callback_output = CallbackOutput(output_name="s3_mapping_second_update", output_type=CallbackOutputTypeEnum.String)
step_second_mapping_update = CallbackStep(
name="SecondMappingUpdate",
sqs_queue_url= second_sqs_queue_to_use,
# Input arguments that will be provided in the SQS message
inputs={
"input_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_first_update ",
"output_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_second_update "
},
outputs=[
second_callback_output,
],
)
step_second_mapping_update.add_depends_on([step_train_second_layer]) # call back is run after the step_train_second_layer
- Train layer 3 BERTopic model – This involves fetching the mapping file from Amazon S3 and training the third layer of the BERTopic model using an ECR image and a custom training script.
estimator_third_layer = Estimator(
image_uri=container_image_uri,
instance_type=training_instance_type, # same instance type as the previous two training layers
instance_count=1,
output_path=f"s3://{default_bucket}/{base_job_prefix}/train_third_layer", # S3 bucket where the training output be stored
role=role,
entry_point = "train_third_layer.py"
)
# create training job for the estimator based on inputs from preprocess step, second callback step and outputs of previous two train layers
step_train_third_layer = TrainingStep(
name="TrainThirdLayerModel",
estimator = estimator_third_layer,
inputs={
# Channel names are illustrative; they must match what train_third_layer.py reads
"data_train": TrainingInput(
s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
),
# Output of the previous callback step
"mapping": TrainingInput(
s3_data=step_second_mapping_update.properties.Outputs["s3_mapping_second_update"],
),
# Outputs of the first two training layers
"first_layer": TrainingInput(
s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer"
),
"second_layer": TrainingInput(
s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_second_layer"
),
},
)
- Register the model – A SageMaker model step is used to register the model in the SageMaker model registry. When the model is registered, you can use the model through a SageMaker inference pipeline.
from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep
model = Model(
image_uri=container_image_uri,
model_data=step_train_third_layer.properties.ModelArtifacts.S3ModelArtifacts,
sagemaker_session=sagemaker_session,
role=role,
)
register_args = model.register(
content_types=["text/csv"],
response_types=["text/csv"],
inference_instances=["ml.c5.9xlarge", "ml.m5.xlarge"],
model_package_group_name=model_package_group_name,
approval_status=model_approval_status,
)
step_register = ModelStep(name="OppUseCaseRegisterModel", step_args=register_args)
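As a rough sketch (the pipeline name is illustrative), the steps defined above can then be assembled into a single pipeline definition and executed as follows:
from sagemaker.workflow.pipeline import Pipeline

# Assemble the steps defined above into one pipeline definition
pipeline = Pipeline(
    name="OppUseCaseEnsemblePipeline",
    parameters=[
        processing_instance_type,
        training_instance_type,
        first_sqs_queue_to_use,
        second_sqs_queue_to_use,
    ],
    steps=[
        step_preprocess,
        step_train_first_layer,
        step_first_mapping_update,
        step_train_second_layer,
        step_second_mapping_update,
        step_train_third_layer,
        step_register,
    ],
    sagemaker_session=pipeline_session,
)

# Create or update the pipeline, then start an execution
pipeline.upsert(role_arn=role)
execution = pipeline.start()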
To effectively train a BERTopic model with the BIRCH and UMAP methods, you need a custom training image that provides the additional dependencies and frameworks required to run the algorithm. For a working sample of a custom Docker image, refer to Create a custom Docker container Image for SageMaker.
Conclusion
In this post, we explained how you can use the wide range of steps offered by SageMaker Pipelines with custom images to train an ensemble model. For more information on how to get started with Pipelines using an existing ML Operations (MLOps) template, refer to Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines.
About the Authors
Bikramjeet Singh is an Applied Scientist on the AWS Sales Insights, Analytics and Data Science (SIADS) team, responsible for building GenAI platforms and AI/ML infrastructure solutions for ML scientists within SIADS. Prior to working as an Applied Scientist, Bikram worked as a Software Development Engineer within SIADS and Alexa AI.
Rahul Sharma is a Senior Specialist Solutions Architect at AWS, helping AWS customers build ML and Generative AI solutions. Prior to joining AWS, Rahul has spent several years in the finance and insurance industries, helping customers build data and analytics platforms.
Sachin Mishra is a seasoned professional with 16 years of industry experience in technology consulting and software leadership roles. Sachin led the Sales Strategy Science and Engineering function at AWS. In this role, he was responsible for scaling cognitive analytics for sales strategy, leveraging advanced AI/ML technologies to drive insights and optimize business outcomes.
Nada Abdalla is a research scientist at AWS. Her work and expertise span multiple science areas in statistics and ML including text analytics, recommendation systems, Bayesian modeling and forecasting. She previously worked in academia and obtained her M.Sc and PhD from UCLA in Biostatistics. Through her work in academia and industry she published multiple papers at esteemed statistics journals and applied ML conferences. In her spare time she enjoys running and spending time with her family.
Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience
Amazon SageMaker HyperPod is designed to support large-scale machine learning (ML) operations, providing a robust environment for training foundation models (FMs) over extended periods. Multiple users — such as ML researchers, software engineers, data scientists, and cluster administrators — can work concurrently on the same cluster, each managing their own jobs and files without interfering with others.
When using HyperPod, you can use familiar orchestration options such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS). This blog post specifically applies to HyperPod clusters using Slurm as the orchestrator. In these clusters, the concept of login nodes is available, which cluster administrators can add to facilitate user access. These login nodes serve as the entry point through which users interact with the cluster’s computational resources. By using login nodes, users can separate their interactive activities, such as browsing files, submitting jobs, and compiling code, from the cluster’s head node. This separation helps prevent any single user’s activities from affecting the overall performance of the cluster.
However, although HyperPod provides the capability to use login nodes, it doesn’t provide an integrated mechanism for load balancing user activity across these nodes. As a result, users manually select a login node, which can lead to imbalances where some nodes are overutilized while others remain underutilized. This not only affects the efficiency of resource usage but can also lead to uneven performance experiences for different users.
In this post, we explore a solution for implementing load balancing across login nodes in Slurm-based HyperPod clusters. By distributing user activity evenly across all available nodes, this approach provides more consistent performance, better resource utilization, and a smoother experience for all users. We guide you through the setup process, providing practical steps to achieve effective load balancing in your HyperPod clusters.
Solution overview
In HyperPod, login nodes serve as access points for users to interact with the cluster’s computational resources so they can manage their tasks without impacting the head node. Although the default method for accessing these login nodes is through AWS Systems Manager, there are cases where direct Secure Shell (SSH) access is more suitable. SSH provides a more traditional and flexible way of managing interactions, especially for users who require specific networking configurations or need features such as TCP load balancing, which Systems Manager doesn’t support.
Given that HyperPod is typically deployed in a virtual private cloud (VPC) using private subnets, direct SSH access to the login nodes requires secure network connectivity into the private subnet. There are several options to achieve this:
- AWS Site-to-Site VPN – Establishes a secure connection between your on-premises network and your VPC, suitable for enterprise environments
- AWS Direct Connect – Offers a dedicated network connection for high-throughput and low-latency needs
- AWS VPN Client – A software-based solution that remote users can use to securely connect to the VPC, providing flexible and easy access to the login nodes
This post demonstrates how to use the AWS VPN Client to establish a secure connection to the VPC. We set up a Network Load Balancer (NLB) within the private subnet to evenly distribute SSH traffic across the available login nodes and use the VPN connection to connect to the NLB in the VPC. The NLB ensures that user sessions are balanced across the nodes, preventing any single node from becoming a bottleneck and thereby improving overall performance and resource utilization.
For environments where VPN connectivity might not be feasible, an alternative option is to deploy the NLB in a public subnet to allow direct SSH access from the internet. In this configuration, the NLB can be secured by restricting access through a security group that allows SSH traffic only from specified, trusted IP addresses. As a result, authorized users can connect directly to the login nodes while maintaining some level of control over access to the cluster. However, this public-facing method is outside the scope of this post and isn’t recommended for production environments, as exposing SSH access to the internet can introduce additional security risks.
The following diagram provides an overview of the solution architecture.
Prerequisites
Before following the steps in this post, make sure you have the foundational components of a HyperPod cluster setup in place. This includes the core infrastructure for the HyperPod cluster and the network configuration required for secure access. Specifically, you need:
- HyperPod cluster – This post assumes you already have a HyperPod cluster deployed. If not, refer to Getting started with SageMaker HyperPod and the HyperPod workshop for guidance on creating and configuring your cluster.
- VPC, subnets, and security group – Your HyperPod cluster should reside within a VPC with associated subnets. To deploy a new VPC and subnets, follow the instructions in the Own Account section of the HyperPod workshop. This process includes deploying an AWS CloudFormation stack to create essential resources such as the VPC, subnets, security group, and an Amazon FSx for Lustre volume for shared storage.
Setting up login nodes for cluster access
Login nodes are dedicated access points that users can use to interact with the HyperPod cluster’s computational resources without impacting the head node. By connecting through login nodes, users can browse files, submit jobs, and compile code independently, promoting a more organized and efficient use of the cluster’s resources.
If you haven’t set up login nodes yet, refer to the Login Node section of the HyperPod Workshop, which provides detailed instructions on adding these nodes to your cluster configuration.
Each login node in a HyperPod cluster has an associated network interface within your VPC. A network interface, also known as an elastic network interface, represents a virtual network card that connects each login node to your VPC, allowing it to communicate over the network. These interfaces have assigned IPv4 addresses, which are essential for routing traffic from the NLB to the login nodes.
To proceed with the load balancer setup, you need to obtain the IPv4 addresses of each login node. You can obtain these addresses from the AWS Management Console or by invoking a command on your HyperPod cluster’s head node.
Using the AWS Management Console
To set up login nodes for cluster access using the AWS Management Console, follow these steps:
- On the Amazon EC2 console, select Network interfaces in the navigation pane
- In the Search bar, select VPC ID = (Equals) and choose the VPC id of the VPC containing the HyperPod cluster
- In the Search bar, select Description : (Contains) and enter the name of the instance group that includes your login nodes (typically, this is login-group)
For each login node, you will find an entry in the list, as shown in the following screenshot. Note down the IPv4 addresses for all login nodes of your cluster.
Using the HyperPod head node
Alternatively, you can also retrieve the IPv4 addresses by entering the following command on your HyperPod cluster’s head node:
Create a Network Load Balancer
The next step is to create an NLB to manage traffic across your cluster’s login nodes.
For the NLB deployment, you need the IPv4 addresses of the login nodes collected earlier and the appropriate security group configurations. If you deployed your cluster using the HyperPod workshop instructions, a security group that permits communication between all cluster nodes should already be in place.
This security group can be applied to the load balancer, as demonstrated in the following instructions. Alternatively, you can opt to create a dedicated security group that grants access specifically to the login nodes.
Create target group
First, we create the target group that will be used by the NLB.
- On the Amazon EC2 console, select Target groups in the navigation pane
- Choose Create target group
- Create a target group with the following parameters:
- For Choose a target type, choose IP addresses
- For Target group name, enter smhp-login-node-tg
- For Protocol : Port, choose TCP and enter 22
- For IP address type, choose IPv4
- For VPC, choose SageMaker HyperPod VPC (which was created with the CloudFormation template)
- For Health check protocol, choose TCP
- Choose Next, as shown in the following screenshot
- In the Register targets section, register the login node IP addresses as the targets
- For Ports, enter 22 and choose Include as pending below, as shown in the following screenshot
- The login node IPs will appear as targets with Pending health status. Choose Create target group, as shown in the following screenshot
Create load balancer
To create the load balancer, follow these steps:
- On the Amazon EC2 console, select Load Balancers in the navigation pane
- Choose Create load balancer
- Choose Network Load Balancer and choose Create, as shown in the following screenshot
- Provide a name (for example, smhp-login-node-lb) and choose Internal as Scheme
- For network mapping, select the VPC that contains your HyperPod cluster and an associated private subnet, as shown in the following screenshot
- Select a security group that allows access on port 22 to the login nodes. If you deployed your cluster using the HyperPod workshop instructions, you can use the security group from this deployment.
- Select the Target group that you created before and choose TCP as Protocol and 22 for Port, as shown in the following screenshot
- Choose Create load balancer
After the load balancer has been created, you can find its DNS name on the load balancer’s detail page, as shown in the following screenshot.
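If you prefer to script this setup instead of using the console, the following boto3 sketch creates an equivalent target group, internal Network Load Balancer, and TCP listener. The region, VPC, subnet, security group, and IP values are placeholders for the resources and addresses collected earlier.
import boto3

elbv2 = boto3.client("elbv2", region_name="<your-region>")

vpc_id = "<hyperpod-vpc-id>"
subnet_id = "<private-subnet-id>"
security_group_id = "<login-node-security-group-id>"
login_node_ips = ["<login-node-1-ip>", "<login-node-2-ip>"]

# Target group for SSH (TCP/22) with IP targets
tg = elbv2.create_target_group(
    Name="smhp-login-node-tg",
    Protocol="TCP",
    Port=22,
    VpcId=vpc_id,
    TargetType="ip",
    HealthCheckProtocol="TCP",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# Register the login node IP addresses as targets
elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": ip, "Port": 22} for ip in login_node_ips],
)

# Internal Network Load Balancer in the cluster's private subnet
lb = elbv2.create_load_balancer(
    Name="smhp-login-node-lb",
    Type="network",
    Scheme="internal",
    Subnets=[subnet_id],
    SecurityGroups=[security_group_id],
)
lb_arn = lb["LoadBalancers"][0]["LoadBalancerArn"]

# TCP listener on port 22 that forwards to the target group
elbv2.create_listener(
    LoadBalancerArn=lb_arn,
    Protocol="TCP",
    Port=22,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)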
Making sure host keys are consistent across login nodes
When using multiple login nodes in a load-balanced environment, it’s crucial to maintain consistent SSH host keys across all nodes. SSH host keys are unique identifiers that each server uses to prove its identity to connecting clients. If each login node has a different host key, users will encounter “WARNING: SSH HOST KEY CHANGED” messages whenever they connect to a different node, causing confusion and potentially leading users to question the security of the connection.
To avoid these warnings, configure the same SSH host keys on all login nodes in the load balancing rotation. This setup makes sure that users won’t receive host key mismatch alerts when routed to a different node by the load balancer.
You can enter the following script on the cluster’s head node to copy the SSH host keys from the first login node to the other login nodes in your HyperPod cluster:
Create AWS Client VPN endpoint
Because the NLB has been created with Internal scheme, it’s only accessible from within the HyperPod VPC. To access the VPC and send requests to the NLB, we use AWS Client VPN in this post.
AWS Client VPN is a managed client-based VPN service that enables secure access to your AWS resources and resources in your on-premises network.
We’ll set up an AWS Client VPN endpoint that provides clients with access to the HyperPod VPC and uses mutual authentication. With mutual authentication, Client VPN uses certificates to perform authentication between clients and the Client VPN endpoint.
To deploy a client VPN endpoint with mutual authentication, you can follow the steps outlined in Get started with AWS Client VPN. When configuring the client VPN to access the HyperPod VPC and the login nodes, keep these adaptations to the following steps in mind:
- Step 2 (create a Client VPN endpoint) – By default, all client traffic is routed through the Client VPN tunnel. To allow internet access without routing traffic through the VPN, you can enable the option Enable split-tunnel when creating the endpoint. When this option is enabled, only traffic destined for networks matching a route in the Client VPN endpoint route table is routed through the VPN tunnel. For more details, refer to Split-tunnel on Client VPN endpoints.
- Step 3 (target network associations) – Select the VPC and private subnet used by your HyperPod cluster, which contains the cluster login nodes.
- Step 4 (authorization rules) – Choose the Classless Inter-Domain Routing (CIDR) range associated with the HyperPod VPC. If you followed the HyperPod workshop instructions, the CIDR range is 10.0.0.0/16.
- Step 6 (security groups) – Select the security group that you previously used when creating the NLB.
Connecting to the login nodes
After the AWS Client VPN is configured, clients can establish a VPN connection to the HyperPod VPC. With the VPN connection in place, clients can use SSH to connect to the NLB, which will route them to one of the login nodes.
ssh -i /path/to/your/private-key.pem user@<NLB-IP-or-DNS>
To allow SSH access to the login nodes, you must create user accounts on the cluster and add their public keys to the authorized_keys file on each login node (or on all nodes, if necessary). For detailed instructions on managing multi-user access, refer to the Multi-User section of the HyperPod workshop.
In addition to using the AWS Client VPN, you can also access the NLB from other AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, if they meet the following requirements:
- VPC connectivity – The EC2 instances must be either in the same VPC as the NLB or able to access the HyperPod VPC through a peering connection or similar network setup.
- Security group configuration – The EC2 instance’s security group must allow outbound connections on port 22 to the NLB security group. Likewise, the NLB security group should be configured to accept inbound SSH traffic on port 22 from the EC2 instance’s security group.
Clean up
To remove deployed resources, you can clean them up in the following order:
- Delete the Client VPN endpoint
- Delete the Network Load Balancer
- Delete the target group associated with the load balancer
If you also want to delete the HyperPod cluster, follow these additional steps:
- Delete the HyperPod cluster
- Delete the CloudFormation stack, which includes the VPC, subnets, security group, and FSx for Lustre volume
Conclusion
In this post, we explored how to implement login node load balancing for SageMaker HyperPod clusters. By using a Network Load Balancer to distribute user traffic across login nodes, you can optimize resource utilization and enhance the overall multi-user experience, providing seamless access to cluster resources for each user.
This approach represents only one way to customize your HyperPod cluster. Because of the flexibility of SageMaker HyperPod you can adapt configurations to your unique needs while benefiting from a managed, resilient environment. Whether you need to scale foundation model workloads, share compute resources across different tasks, or support long-running training jobs, SageMaker HyperPod offers a versatile solution that can evolve with your requirements.
For more details on making the most of SageMaker HyperPod, dive into the HyperPod workshop and explore further blog posts covering HyperPod.
About the Authors
Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in domains such as autonomous driving.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
How Clearwater Analytics is revolutionizing investment management with generative AI and Amazon SageMaker JumpStart
This post was written with Darrel Cherry, Dan Siddall, and Rany ElHousieny of Clearwater Analytics.
As global trading volumes rise rapidly each year, capital markets firms are facing the need to manage large and diverse datasets to stay ahead. These datasets aren’t just expansive in volume; they’re critical in driving strategy development, enhancing execution, and streamlining risk management. The explosion of data creation and utilization, paired with the increasing need for rapid decision-making, has intensified competition and unlocked opportunities within the industry. To remain competitive, capital markets firms are adopting Amazon Web Services (AWS) Cloud services across the trade lifecycle to rearchitect their infrastructure, remove capacity constraints, accelerate innovation, and optimize costs.
Generative AI, AI, and machine learning (ML) are playing a vital role for capital markets firms to speed up revenue generation, deliver new products, mitigate risk, and innovate on behalf of their customers. A great example of such innovation is our customer Clearwater Analytics and their use of large language models (LLMs) hosted on Amazon SageMaker JumpStart, which has propelled asset management productivity and delivered AI-powered investment management productivity solutions to their customers.
In this post, we explore Clearwater Analytics’ foray into generative AI, how they’ve architected their solution with Amazon SageMaker, and dive deep into how Clearwater Analytics is using LLMs to take advantage of more than 18 years of experience within the investment management domain while optimizing model cost and performance.
About Clearwater Analytics
Clearwater Analytics (NYSE: CWAN) stands at the forefront of investment management technology. Founded in 2004 in Boise, Idaho, Clearwater has grown into a global software-as-a-service (SaaS) powerhouse, providing automated investment data reconciliation and reporting for over $7.3 trillion in assets across thousands of accounts worldwide. With a team of more than 1,600 professionals and a long-standing relationship with AWS dating back to 2008, Clearwater has consistently pushed the boundaries of financial technology innovation.
In May 2023, Clearwater embarked on a journey into the realm of generative AI, starting with a private, secure generative AI chat-based assistant for their internal workforce, enhancing client inquiries through Retrieval Augmented Generation (RAG). As a result, Clearwater was able to increase assets under management (AUM) over 20% without increasing operational headcount. By September of the same year, Clearwater unveiled its generative AI customer offerings at the Clearwater Connect User Conference, marking a significant milestone in their AI-driven transformation.
About SageMaker JumpStart
Amazon SageMaker JumpStart is an ML hub that can help you accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select foundation models (FMs) quickly based on predefined quality and responsibility metrics to perform tasks such as article summarization and image generation. Pre-trained models are fully customizable for your use case with your data, and you can effortlessly deploy them into production with the user interface or AWS SDK. You can also share artifacts, including models and notebooks, within your organization to accelerate model building and deployment, and admins can control which models are visible to users within their organization.
Clearwater’s generative AI solution architecture
Clearwater Analytics’ generative AI architecture supports a wide array of vertical solutions by merging extensive functional capabilities through the LangChain framework, domain knowledge through RAG, and customized LLMs hosted on Amazon SageMaker. This integration has resulted in a potent asset for both Clearwater customers and their internal teams.
The following image illustrates the solution architecture.
As of September 2024, the AI solution supports three core applications:
- Clearwater Intelligent Console (CWIC) – Clearwater’s customer-facing AI application. This assistant framework is built upon three pillars:
- Knowledge awareness – Using RAG, CWIC compiles and delivers comprehensive knowledge that is crucial for customers from intricate calculations of book value to period-end reconciliation processes.
- Application awareness – Transforming novice users into power users instantly, CWIC guides clients to inquire about Clearwater’s applications and receive direct links to relevant investment reports. For instance, if a client needs information on their yuan exposure, CWIC employs its tool framework to identify and provide links to the appropriate currency exposure reports.
- Data awareness – Digging deep into portfolio data, CWIC adeptly manages complex queries, such as validating book yield tie-outs, by accessing customer-specific data and performing real-time calculations. The following image shows a snippet of the generative AI assistance within the CWIC.
- Crystal – Clearwater’s advanced AI assistant with expanded capabilities that empower internal teams’ operations. Crystal shares CWIC’s core functionalities but benefits from broader data sources and API access. Enhancements driven by Crystal have achieved efficiency gains between 25% and 43%, improving Clearwater’s ability to manage substantial increases in AUM without increases in staffing.
- CWIC Specialists – Their most recent solution CWIC Specialists are domain-specific generative AI agents equipped to handle nuanced investment tasks, from accounting to regulatory compliance. These agents can work in single or multi-agentic workflows to answer questions, perform complex operations, and collaborate to solve various investment-related tasks. These specialists assist both internal teams and customers in domain specific areas, such as investment accounting, regulatory requirements, and compliance information. Each specialist is underpinned by thousands of pages of domain documentation, which feeds into the RAG system and is used to train smaller, specialized models with Amazon SageMaker JumpStart. This approach enhances cost-effectiveness and performance to promote high-quality interactions.
In the next sections, we dive deep into how Clearwater Analytics is using Amazon SageMaker JumpStart to fine-tune models for productivity improvement and to deliver new AI services.
Clearwater’s use of LLMs hosted on Amazon SageMaker JumpStart
Clearwater employs a two-pronged strategy for using LLMs. This approach addresses both high-complexity scenarios requiring powerful language models and domain-specific applications demanding rapid response times.
- Advanced foundation models – For tasks involving intricate reasoning or creative output, Clearwater uses state-of-the-art pre-trained models such as Anthropic’s Claude or Meta’s Llama. These models excel in handling complex queries and generating innovative solutions.
- Fine-tuned models for specialized knowledge – In cases where domain-specific expertise or swift responses are crucial, Clearwater uses fine-tuned models. These customized LLMs are optimized for industries or tasks that require accuracy and efficiency.
Fine-tuned models through domain adaptation with Amazon SageMaker JumpStart
Although general LLMs are powerful, their accuracy can be put to the test in specialized domains. This is where domain adaptation, also known as continued pre-training, comes into play. Domain adaptation is a sophisticated form of transfer learning that allows a pre-trained model to be fine-tuned for optimal performance in a different, yet related, target domain. This approach is particularly valuable when there’s a scarcity of labeled data in the target domain but an abundance in a related source domain.
These are some of the key benefits for domain adaptation:
- Cost-effectiveness – Creating a curated set of questions and answers for instruction fine-tuning can be prohibitively expensive and time-consuming. Domain adaptation eliminates the need for thousands of manually created Q&As.
- Comprehensive learning – Unlike instruction tuning, which only learns from provided questions, domain adaptation extracts information from entire documents, resulting in a more thorough understanding of the subject matter.
- Efficient use of expertise – Domain adaptation frees up human experts from the time-consuming task of generating questions so they can focus on their primary responsibilities.
- Faster deployment – With domain adaptation, specialized AI models can be developed and deployed more quickly, accelerating time to market for AI-powered solutions.
AWS has been at the forefront of domain adaptation, creating a framework to allow creating powerful, specialized AI models. Using this framework, Clearwater has been able to train smaller, faster models tailored to specific domains without the need for extensive labeled datasets. This innovative approach allows Clearwater to power digital specialists with a finely tuned model trained on a particular domain. The result? More responsive LLMs that form the backbone of their cutting-edge generative AI services.
The evolution of fine-tuning with Amazon SageMaker JumpStart
Clearwater is collaborating with AWS to enhance their fine-tuning processes, and Amazon SageMaker JumpStart offered them a framework for domain adaptation. Over the past year, Clearwater has seen significant improvements in the user interface and in the ease of fine-tuning with SageMaker JumpStart.
For instance, the code required to set up and fine-tune a GPT-J-6B model has been drastically streamlined. Previously, it required a data scientist to write over 100 lines of code within an Amazon SageMaker notebook to identify and retrieve the proper image, set the right training script, and import the right hyperparameters. Now, using SageMaker JumpStart and advancements in the field, the process has been streamlined to a few lines of code:
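A minimal sketch of this streamlined flow with the JumpStartEstimator is shown below. The S3 prefix is a placeholder, hyperparameters are left at the JumpStart defaults, and the training channel name follows the common JumpStart convention, so verify it against the model’s documentation.
import sagemaker
from sagemaker.jumpstart.estimator import JumpStartEstimator

role = sagemaker.get_execution_role()

# GPT-J 6B text-generation model ID, as referenced later in this post
model_id = "huggingface-textgeneration1-gpt-j-6b"

# Placeholder S3 prefix for the assembled domain documents
train_data_location = "s3://<your-bucket>/domain-adaptation/training-data/"

estimator = JumpStartEstimator(model_id=model_id, role=role)
# "training" is the conventional channel name; confirm it for your model version
estimator.fit({"training": train_data_location})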
A fine-tuning example: Clearwater’s approach
For Clearwater’s AI, the team successfully fine-tuned a GPT-J-6B model (huggingface-textgeneration1-gpt-j-6b) with domain adaptation using Amazon SageMaker JumpStart. The following are the concrete steps used for the fine-tuning process, which can serve as a blueprint for others implementing similar strategies; a short deploy-and-test sketch follows the list. A detailed tutorial can be found in the amazon-sagemaker-examples repo.
- Document assembly – Gather all relevant documents that will be used for training. This includes help content, manuals, and other domain-specific text. The data Clearwater used for training this model is public help content, which contains no client data. Clearwater exclusively uses client data, with the client’s collaboration and approval, to fine-tune a model dedicated solely to that specific client. Curation, cleaning, and de-identification of data are necessary for training and subsequent tuning operations.
- Test set creation – Develop a set of questions and answers that will be used to evaluate the model’s performance before and after fine-tuning. Clearwater has implemented a sophisticated model evaluation system for additional assessment of performance for open source and commercial models. This is covered more in the Model evaluation and optimization section later in this post.
- Pre-trained model deployment – Deploy the original, pre-trained GPT-J-6B model.
- Baseline testing – Use the question set to test the pre-trained model, establishing a performance baseline.
- Pre-trained model teardown – Remove the pre-trained model to free up resources.
- Data preparation – Upload the assembled documents to an S3 bucket, making sure they’re in a format suitable for the fine-tuning process.
- Fine-tuning – Train the new model using the uploaded documents, adjusting hyperparameters as needed.
- Fine-tuned model testing – Evaluate the fine-tuned model using the same question set used for the baseline.
- Fine-tuned model teardown – If not immediately needed, tear down the fine-tuned model to optimize resource usage.
- RAG comparison – Test a RAG-based system using the same question set for an additional point of comparison.
- Performance evaluation – Analyze the results from all tests to assess the effectiveness of the fine-tuning process.
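For steps 3 through 5 (and again for the fine-tuned model in steps 8 and 9), a minimal deploy-and-test sketch with the SageMaker JumpStart SDK might look like the following. The payload keys are assumptions to check against the model’s inference documentation, and the default instance type chosen by JumpStart is used.
from sagemaker.jumpstart.model import JumpStartModel

# Deploy the pre-trained model behind a real-time endpoint for baseline testing.
# For the fine-tuned model, estimator.deploy() on the trained JumpStartEstimator
# can be used instead.
model = JumpStartModel(model_id="huggingface-textgeneration1-gpt-j-6b")
predictor = model.deploy()

# Ask one of the evaluation questions; the payload schema is an assumption
response = predictor.predict({"text_inputs": "What does the Clearwater Compliance module do?"})
print(response)

# Tear down the endpoint once the answers are collected (steps 5 and 9)
predictor.delete_endpoint()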
This is how the results look before tuning an open source model (GPT-J-6B), after tuning, and using a commercial model with a larger context window (Claude-3.5-Sonnet).
Prompt
What does Clearwater Compliance module do?
Reference Answer
The Clearwater Compliance module supports client investment policies, providing daily updates, including compliance warnings and violations. Clearwater’s Compliance module monitors various portfolio characteristics, including asset and security type allocation, sector exposure, investment time horizon, and credit quality. The Compliance module enables users to establish and update investment policies, monitor changes to accounts, and provide audit reporting. Rules provide flexibility to evaluate data from the individual security-level to aggregate values at the portfolio-level. User-identified investments can also be prohibited. Multiple investment policies can be applied to individual accounts and aggregates of accounts.
GPT-J-6B pre-trained
The Clearwater Compliance module is an integrated set of controls you can use to manage your compliance with the federal and state requirements for the handling and transportation of hazardous materials.
GPT-J-6B fine-tuned
The Clearwater Compliance module is a suite of tools designed to help clients monitor and manage their compliance policies and procedures.
- Compliance Overview dashboard – Provides a high-level view of compliance policies and a history of compliance policy changes.
- Compliance Settings tool – Users can update compliance settings, such as the frequency at which compliance policies are checked and the level of compliance required to access a compliance-restricted area of the website.
- Compliance Audit report – Provides a detailed list of policies and rules that weren’t met. The report includes a breakdown of rule violations and the actions required to bring the policies into compliance.
- Compliance Notifications tool – Sends notifications to designated recipients, such as the client’s compliance officer, when a compliance rule is about to expire or has been violated.
- Compliance Documents tool – Users can upload and manage compliance policies and rules.
Claude-3.5-Sonnet with RAG
The Clearwater Compliance module is a sophisticated tool designed to support and monitor client investment policies. Here’s what it does:
- Daily Monitoring – Provides daily updates, including compliance warnings and violations, to foster ongoing adherence to investment policies.
- Portfolio characteristic tracking – Monitors various aspects of portfolios, including asset and security type allocation, sector exposure, investment time horizon, and credit quality.
- Policy management – Users can establish, update, and manage investment policies through the module.
- Flexible rule application – Allows for the evaluation of compliance rules at both individual security and portfolio levels.
- Custom policy creation – Clearwater works with clients to create compliance rules based on their specific investment policies and interpretations.
- Multi-account support – Multiple investment policies can be applied to individual accounts or aggregates of accounts.
- Reporting and auditing – Generates audit reports and provides tools for monitoring account changes.
- Risk mitigation – Offers integrated, customized daily reports on the Clearwater Reporting Site, to proactively manage risk.
- Regulatory compliance – Includes external and regulatory policies when provided by the client, providing broader compliance coverage.
- Violation tracking – Identifies compliance rule violations, tracks the number of days in violation, and allows for resolution statements.
Although the Compliance module is comprehensive, it has some limitations. It only supports objective rules using data available within the Clearwater system and doesn’t cover areas such as pre-trade compliance, investments without publicly available data, or scenario testing.
Overall, the Clearwater Compliance module serves as a robust tool for maintaining investment policy compliance, offering daily monitoring, customization, and reporting features to support effective risk management and regulatory adherence.
Model evaluation and optimization
Clearwater employs a sophisticated evaluation system to assess the performance of new models available on Amazon SageMaker JumpStart, so that only models demonstrating superior capabilities are integrated into the production environment.
Clearwater’s LLM operations (LLMOps) pipeline plays a crucial role in this process, automating the evaluation and seamless integration of new models. This commitment to using the most effective LLMs for each unique task with cutting-edge technology and optimal performance is the cornerstone of Clearwater’s approach.
The evaluation phase is crucial for determining the success of the fine-tuning process. As you determine the evaluation process and framework to use, make sure they fit the criteria for your domain. At Clearwater, we designed our own internal evaluation framework to meet the specific needs of our investment management and accounting domains.
Here are key considerations:
- Performance comparison – The fine-tuned model should outperform the pre-trained model on domain-specific tasks. If it doesn’t, it might indicate that the pre-trained model already had significant knowledge in this area.
- RAG benchmark – Compare the fine-tuned model’s performance against a RAG system using a pre-trained model. If the fine-tuned model doesn’t at least match RAG performance, troubleshooting is necessary.
- Troubleshooting checklist:
- Data format suitability for fine-tuning
- Completeness of the training dataset
- Hyperparameter optimization
- Potential overfitting or underfitting
- Cost-benefit analysis – Estimate the operational costs of using a RAG system with a pre-trained model (for example, Claude-3.5 Sonnet) compared with deploying the fine-tuned model at production scale.
- Advanced considerations:
- Iterative fine-tuning – Consider multiple rounds of fine-tuning, gradually introducing more specific or complex data.
- Multi-task learning – If applicable, fine-tune the model on multiple related domains simultaneously to improve its versatility.
- Continual learning – Implement strategies to update the model with new information over time without full retraining.
Conclusion
For businesses and organizations seeking to harness the power of AI in specialized domains, domain adaptation presents significant opportunities. Whether you’re in healthcare, finance, legal services, or any other specialized field, adapting LLMs to your specific needs can provide a significant competitive advantage.
By following this comprehensive approach with Amazon SageMaker, organizations can effectively adapt LLMs to their specific domains, achieving better performance and potentially more cost-effective solutions than generic models with RAG systems. However, the process requires careful monitoring, evaluation, and optimization to achieve the best results.
As we’ve observed with Clearwater’s success, partnering with an experienced AI company such as AWS can help navigate the complexities of domain adaptation and unlock its full potential. By embracing this technology, you can create AI solutions that are not just powerful, but also truly tailored to your unique requirements and expertise.
The future of AI isn’t just about bigger models, but smarter, more specialized ones. Domain adaptation is paving the way for this future, and those who harness its power will emerge as leaders in their respective industries.
Get started with Amazon SageMaker JumpStart on your LLM fine-tuning journey today.
About the Authors
Darrel Cherry is a Distinguished Engineer with over 25 years of experience leading organizations to create solutions for complex business problems. With a passion for emerging technologies, he has architected large cloud and data processing solutions, including machine learning and deep learning AI applications. Darrel holds 19 US patents and has contributed to various industry publications. In his current role at Clearwater Analytics, Darrel leads technology strategy for AI solutions, as well as Clearwater’s overall enterprise architecture. Outside the professional sphere, he enjoys traveling, auto racing, and motorcycling, while also spending quality time with his family.
Dan Siddall, a Staff Data Scientist at Clearwater Analytics, is a seasoned expert in generative AI and machine learning, with a comprehensive understanding of the entire ML lifecycle from development to production deployment. Recognized for his innovative problem-solving skills and ability to lead cross-functional teams, Dan leverages his extensive software engineering background and strong communication abilities to bridge the gap between complex AI concepts and practical business solutions.
Rany ElHousieny is an Engineering Leader at Clearwater Analytics with over 30 years of experience in software development, machine learning, and artificial intelligence. He has held leadership roles at Microsoft for two decades, where he led the NLP team at Microsoft Research and Azure AI, contributing to advancements in AI technologies. At Clearwater, Rany continues to leverage his extensive background to drive innovation in AI, helping teams solve complex challenges while maintaining a collaborative approach to leadership and problem-solving.
Pablo Redondo is a Principal Solutions Architect at Amazon Web Services. He is a data enthusiast with over 18 years of FinTech and healthcare industry experience and is a member of the AWS Analytics Technical Field Community (TFC). Pablo has been leading the AWS Gain Insights Program to help AWS customers achieve better insights and tangible business value from their data analytics and AI/ML initiatives. In his spare time, Pablo enjoys quality time with his family and plays pickleball in his hometown of Petaluma, CA.
Prashanth Ganapathy is a Senior Solutions Architect in the Small Medium Business (SMB) segment at AWS. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them. Outside of work, Prashanth enjoys photography, travel, and trying out different cuisines.
How Twitch used agentic workflow with RAG on Amazon Bedrock to supercharge ad sales
Twitch, the world’s leading live-streaming platform, has over 105 million average monthly visitors. Because Twitch is part of Amazon, Twitch advertising is handled by the ad sales organization at Amazon. New ad products across diverse markets involve a complex web of announcements, training, and documentation, making it difficult for sales teams to find precise information quickly. In early 2024, Amazon launched a major push to harness the power of Twitch for advertisers globally. This necessitated ramping up Twitch knowledge across all of Amazon ad sales, a task that was especially challenging for internal sales support teams. With a ratio of over 30 sellers per specialist, questions posed in public channels often took an average of 2 hours for an initial reply, and 20% of questions were never answered at all. All in all, the entire process from an advertiser’s request to the first campaign launch could stretch up to 7 days.
In this post, we demonstrate how we innovated to build a Retrieval Augmented Generation (RAG) application with agentic workflow and a knowledge base on Amazon Bedrock. We implemented the RAG pipeline in a Slack chat-based assistant to empower the Amazon Twitch ads sales team to move quickly on new sales opportunities. We discuss the solution components to build a multimodal knowledge base, drive agentic workflow, use metadata to address hallucinations, and also share the lessons learned through the solution development using multiple large language models (LLMs) and Amazon Bedrock Knowledge Bases.
Solution overview
A RAG application combines an LLM with a specialized knowledge base to help answer domain-specific questions. We developed an agentic workflow with RAG solution that revolves around a centralized knowledge base that aggregates Twitch internal marketing documentation. This content is then transformed into a vector database optimized for efficient information retrieval. In the RAG pipeline, the retriever taps into this vector database to surface relevant information, and the LLM generates tailored responses to Twitch user queries submitted through a Slack assistant. The solution architecture is presented in the following diagram.
The key architectural components driving this solution include:
- Data sources – A centralized repository containing marketing data aggregated from various sources such as wikis and slide decks, using web crawlers and periodic refreshes
- Vector database – The marketing content is first embedded into vector representations using Amazon Titan Multimodal Embeddings G1 on Amazon Bedrock, which can handle both text and image data. These embeddings are then stored in an Amazon Bedrock knowledge base.
- Agentic workflow – The agent acts as an intelligent dispatcher. It evaluates each user query to determine the appropriate course of action, whether refusing to answer off-topic queries, tapping into the LLM, or invoking APIs and data sources such as the vector database. The agent uses chain-of-thought (CoT) reasoning, which breaks down complex tasks into a series of smaller steps then dynamically generates prompts for each subtask, combines the results, and synthesizes a final coherent response.
- Slack integration – A message processor was implemented to interface with users through a Slack assistant using an AWS Lambda function, providing a seamless conversational experience.
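To illustrate the retrieval step, here is a minimal sketch of querying an Amazon Bedrock knowledge base with boto3. The knowledge base ID, AWS Region, and query text are placeholders; in the production solution this call is made by the agent’s knowledge base tool rather than invoked directly.

```python
import boto3

# Placeholder Region and knowledge base ID; substitute your own values.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",
    retrievalQuery={"text": "What ad formats does Twitch offer for live streams?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

# Each result carries the matched text chunk, a relevance score, and source metadata.
for result in response["retrievalResults"]:
    print(result["score"], result["content"]["text"][:120])
```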
Lessons learned and best practices
The process of designing, implementing, and iterating a RAG application with agentic workflow and a knowledge base on Amazon Bedrock produced several valuable lessons.
Processing multimodal source documents in the knowledge base
An early problem we faced was that Twitch documentation is scattered across the Amazon internal network. Not only is there no centralized data store, but there is also no consistency in the data format. Internal wikis contain a mixture of images and text, and training materials for sales agents are often PowerPoint presentations. To make our chat assistant as effective as possible, we needed to coalesce all of this information into a single repository the LLM could understand.
The first step was building a wiki crawler that uploaded all the relevant Twitch wikis and PowerPoint slide decks to Amazon Simple Storage Service (Amazon S3). We used that bucket as the source to create a knowledge base on Amazon Bedrock. To handle the combination of images and text in our data source, we used the Amazon Titan Multimodal Embeddings G1 model. For documents containing specific information such as demographic context, we summarized multiple slides so that this information is included in the final context passed to the LLM.
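The managed knowledge base performs this embedding automatically during ingestion. For reference, the following is a sketch of what invoking Amazon Titan Multimodal Embeddings G1 directly looks like; the file name, Region, and caption text are assumptions.

```python
import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# A slide exported as a PNG plus a short caption (both hypothetical).
with open("audience_demographics_slide.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "inputText": "Twitch audience demographics summary slide",
    "inputImage": image_b64,
}

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-image-v1",  # Titan Multimodal Embeddings G1
    body=json.dumps(body),
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # dimensionality of the multimodal embedding vector
```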
In total, our knowledge base contains over 200 documents. Amazon Bedrock knowledge bases are easy to amend, and we routinely add and delete documents as wikis and slide decks change. Our knowledge base is queried intermittently throughout the day, and metrics, dashboards, and alarms are natively supported in Amazon Web Services (AWS) through Amazon CloudWatch. These tools provide complete transparency into the health of the system and allow fully hands-off operation.
Agentic workflow for a wide range of user queries
As we observed our users interact with our chat assistant, we noticed that there were some questions the standard RAG application couldn’t answer. Some of these questions were overly complex, with multiple questions combined, some asked for deep insights into Twitch audience demographics, and some had nothing to do with Twitch at all.
Because the standard RAG solution could only answer simple questions and couldn’t handle all these scenarios gracefully, we invested in an agentic workflow with RAG. In this solution, an agent breaks down the process of answering questions into multiple steps and uses different tools to answer different types of questions. We implemented an XML agent in LangChain, choosing XML because the Anthropic Claude models available in Amazon Bedrock are extensively trained on XML data. In addition, we engineered our prompts to instruct the agent to adopt a specialized persona with domain expertise in advertising and the Twitch business realm. The agent breaks down queries, gathers relevant information, analyzes context, and weighs potential solutions. The flow for our chat agent is shown in the following diagram. In this flow, when the agent reads a user question, the first step is to decide whether the question is related to Twitch; if it isn’t, the agent politely refuses to answer. If the question is related to Twitch, the agent ‘thinks’ about which tool is best suited to answer it. For instance, if the question is related to audience forecasting, the agent invokes the internal Amazon Audience Forecasting API. If the question is related to Twitch advertisement products, the agent invokes its advertisement knowledge base. Once the agent fetches the results from the appropriate tool, it considers whether it now has enough information to answer the question. If it doesn’t, it invokes its toolkit again (a maximum of three attempts) to gain more context. Once it’s finished gathering information, the agent generates a final response and sends it to the user.
One of the chief benefits of agentic AI is the ability to integrate with multiple data sources. In our case, we use an internal forecasting API to fetch data about the available Amazon and Twitch audience supply. We also use Amazon Bedrock Knowledge Bases to help with questions about static data, such as features of Twitch ad products. This greatly increased the scope of questions our chatbot could answer beyond what the initial RAG solution supported. The agent is intelligent enough to know which tool to use based on the query; you only need to provide high-level instructions about each tool’s purpose, and the agent invokes the LLM to make the decision, as sketched in the following example.
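A LangChain tool definition only needs a name, a callable, and a short description of when to use it. The sketch below is illustrative rather than our production code: the tool names, descriptions, and wrapper functions are assumptions, and the import path may differ slightly depending on your LangChain version.

```python
from langchain.agents import Tool

# Hypothetical wrappers; the real implementations call the internal forecasting API
# and the Amazon Bedrock knowledge base retriever shown earlier.
def forecast_audience(query: str) -> str:
    return "TODO: call the audience forecasting API"

def search_ads_knowledge_base(query: str) -> str:
    return "TODO: query the Twitch ads knowledge base"

tools = [
    Tool(
        name="audience_forecasting",
        func=forecast_audience,
        description="Use for questions about available Amazon and Twitch audience supply or forecasts.",
    ),
    Tool(
        name="twitch_ads_knowledge_base",
        func=search_ads_knowledge_base,
        description="Use for questions about Twitch advertising products, formats, and sales processes.",
    ),
]
# When deciding which tool to invoke, the agent sees only these names and descriptions.
```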
Even better, LangChain logs the agent’s thought process to CloudWatch, so we can inspect the log statements the agent emits when it decides which tool to use.
The agent helps keep our RAG flexible. Looking towards the future, we plan to onboard additional APIs, build new vector stores, and integrate with chat assistants in other Amazon organizations. This is critical to helping us expand our product, maximizing its scope and impact.
Contextual compression for LLM invocation
During document retrieval, we found that our internal wikis varied greatly in size. This meant that a wiki would often contain hundreds or even thousands of lines of text, but only a small paragraph was relevant to answering the question. To reduce the context size and the number of input tokens sent to the LLM, we used another LLM to perform contextual compression and extract the relevant portions of the returned documents. Initially, we used Anthropic Claude Haiku because of its superior speed. However, we found that Anthropic Claude Sonnet boosted result accuracy while being only 20% slower than Haiku (from 8 seconds to 10 seconds). As a result, we chose Sonnet for our use case, because providing the best-quality answers to our users is the most important factor. We’re willing to accept an additional 2 seconds of latency compared to the 2-day turnaround time of the traditional manual process.
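As a reference point, contextual compression can be wired up in LangChain roughly as follows. This is a sketch under stated assumptions: the knowledge base ID and model ID are placeholders, and depending on your LangChain version the retriever and chat model classes may live in langchain_community rather than langchain_aws.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_aws import AmazonKnowledgeBasesRetriever, ChatBedrock

# Base retriever over the Amazon Bedrock knowledge base (placeholder ID).
base_retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="YOUR_KB_ID",
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

# A second LLM extracts only the passages relevant to the query from each
# retrieved document before they reach the answering model.
compressor_llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0")
compressor = LLMChainExtractor.from_llm(compressor_llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

docs = compression_retriever.invoke("What are the premium video ad formats on Twitch?")
```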
Address hallucinations by document metadata
As with any RAG solution, our chat assistant occasionally hallucinated incorrect answers. Although this is a well-recognized problem with LLMs, it was particularly pronounced in our system because of the complexity of the Twitch advertising domain. Because our users relied on the chatbot’s responses to interact with their clients, they were reluctant to trust it, even though most of its answers were correct.
We increased the users’ trust by showing them where the LLM was getting its information from for each statement made. This way, if a user is skeptical of a statement, they can check the references the LLM used and read through the authoritative documentation themselves. We achieved this by adding the source URL of the retrieved documents as metadata in our knowledge base, which Amazon Bedrock directly supports. We then instructed the LLM to read the metadata and append the source URLs as clickable links in its responses.
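In practice, source attributes for an Amazon Bedrock knowledge base are supplied as a sidecar file stored next to each source document, named with a .metadata.json suffix. The sketch below shows one way to upload a document and its metadata to Amazon S3; the bucket name, object key, and URL are placeholders.

```python
import json

import boto3

s3 = boto3.client("s3")
bucket = "twitch-ads-knowledge-base"        # placeholder bucket name
doc_key = "wikis/premium-video-ads.html"    # placeholder document key

# Sidecar metadata file recognized by Amazon Bedrock Knowledge Bases:
# same key as the source document, with a ".metadata.json" suffix.
metadata = {
    "metadataAttributes": {
        "source_url": "https://example.internal/wiki/premium-video-ads",  # placeholder URL
        "doc_type": "wiki",
    }
}

s3.upload_file("premium-video-ads.html", bucket, doc_key)
s3.put_object(
    Bucket=bucket,
    Key=f"{doc_key}.metadata.json",
    Body=json.dumps(metadata).encode("utf-8"),
)
```

The answering LLM is then prompted to surface the source_url attribute from the retrieved chunks as clickable links in its responses.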
In one example question and answer with citations, the LLM responded with two sources: the first from a sales training PowerPoint slide deck, and the second from an internal wiki. For the slide deck, the LLM can provide the exact slide number it pulled the information from, which is especially useful because some decks contain over 100 slides.
After adding citations, our user feedback score noticeably increased. Our favorable feedback rate increased by 40% and overall assistant usage increased by 20%, indicating that users gained more trust in the assistant’s responses due to the ability to verify the answers.
Human-in-the-loop feedback collection
When we launched our chat assistant in Slack, we included a feedback form that users could fill out. The form asked several questions to rate aspects of the chat assistant on a 1–5 scale. Although the data was rich, hardly anyone used it. After switching to much simpler thumbs-up and thumbs-down buttons that a user can select effortlessly (the buttons are appended to each chatbot answer), our feedback rate increased eightfold.
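For reference, a thumbs-up/thumbs-down prompt like ours can be expressed as a Slack Block Kit actions block appended to each answer. The block below is a sketch; the action IDs and the commented slack_sdk call are assumptions about how the message is posted.

```python
# Block Kit payload appended to each chatbot answer: a single thumbs-up /
# thumbs-down prompt instead of the original multi-question form.
feedback_blocks = [
    {
        "type": "actions",
        "block_id": "answer_feedback",
        "elements": [
            {
                "type": "button",
                "text": {"type": "plain_text", "text": ":thumbsup:", "emoji": True},
                "action_id": "feedback_up",
                "value": "up",
            },
            {
                "type": "button",
                "text": {"type": "plain_text", "text": ":thumbsdown:", "emoji": True},
                "action_id": "feedback_down",
                "value": "down",
            },
        ],
    }
]

# Example (assumed slack_sdk WebClient setup):
# client.chat_postMessage(channel=channel_id, text=answer_text, blocks=feedback_blocks)
```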
Conclusion
Moving fast is important in the AI landscape, especially because the technology changes so rapidly. Often engineers will have an idea about a new technique in AI and want to test it out quickly. Using AWS services helped us learn fast about what technologies are effective and what aren’t. We used Amazon Bedrock to test multiple foundation models (FMs), including Anthropic Claude Haiku and Sonnet, Meta Llama 3, Cohere embedding models, and Amazon Titan Multimodal Embeddings. Amazon Bedrock Knowledge Bases helped us implement RAG with agentic workflow efficiently without building custom integrations to our various multimodal data sources and data flows. Using dynamic chunking and metadata filtering let us retrieve the needed contents more accurately. All these together allowed us to spin up a working prototype in a few days instead of months. After we deployed the changes to our customers, we continued to adopt Amazon Bedrock and other AWS services in the application.
Since the Twitch Sales Bot launch in February 2024, we have answered over 11,000 questions about the Twitch sales process. In addition, Amazon sellers who used our generative AI solution delivered 25% more Twitch revenue year-to-date compared with sellers who didn’t, and delivered 120% more revenue compared with self-service accounts. We will continue expanding our chat assistant’s agentic capabilities, using Amazon Bedrock along with other AWS services, to solve new problems for our users and increase Twitch’s bottom line. We plan to incorporate distinct knowledge bases across Amazon’s portfolio of first-party (1P) publishers, such as Prime Video, Alexa, and IMDb, as a fast, accurate, and comprehensive generative AI solution to supercharge ad sales.
For your own project, you can follow our architecture and adopt a similar solution to build an AI assistant to address your own business challenge.
About the Authors
Bin Xu is a Senior Software Engineer at Amazon Twitch Advertising and holds a Master’s degree in Data Science from Columbia University. As the visionary creator behind TwitchBot, Bin successfully introduced the proof of concept in 2023. Bin currently leads a team in Twitch Ads Monetization, focusing on optimizing video ad delivery, improving sales workflows, and enhancing campaign performance. He also leads efforts to integrate AI-driven solutions that further improve the efficiency and impact of Twitch ad products. Outside of his professional endeavors, Bin enjoys playing video games and tennis.
Nick Mariconda is a Software Engineer at Amazon Advertising, focused on enhancing the advertising experience on Twitch. He holds a Master’s degree in Computer Science from Johns Hopkins University. When not staying up to date with the latest in AI advancements, he enjoys getting outdoors for hiking and connecting with nature.
Frank Zhu is a Senior Product Manager at Amazon Advertising, located in New York City. With a background in programmatic ad-tech, Frank helps connect the business needs of advertisers and Amazon publishers through innovative advertising products. Frank has a BS in finance and marketing from New York University and outside of work enjoys electronic music, poker theory, and video games.
Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Cathy Willcock is a Principal Technical Business Development Manager located in Seattle, WA. Cathy leads the AWS technical account team supporting Amazon Ads adoption of AWS cloud technologies. Her team works across Amazon Ads enabling discovery, testing, design, analysis, and deployments of AWS services at scale, with a particular focus on innovation to shape the landscape across the AdTech and MarTech industry. Cathy has led engineering, product, and marketing teams and is an inventor of ground-to-air calling (1-800-RINGSKY).
Acknowledgments
We would also like to acknowledge and express our gratitude to our leadership team: Abhoy Bhaktwatsalam (VP, Amazon Publisher Monetization), Carl Petersen (Director, Twitch, Audio & Podcast Monetization), Cindy Barker (Senior Principal Engineer, Amazon Publisher Insights & Analytics), and Timothy Fagan (Principal Engineer, Twitch Monetization), for their invaluable insights and support. Their expertise and backing were instrumental for the successful development and implementation of this innovative solution.
Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang
Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.
In this episode, Jindong Wang, a senior researcher at Microsoft Research, and Steven Euijong Whang, a tenured associate professor at Korea Advanced Institute of Science and Technology (KAIST), join host Gretchen Huizinga to discuss the paper “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” a spotlight session at this year’s Conference on Neural Information Processing Systems (NeurIPS). ERBench leverages the integrity constraints of relational databases to create LLM benchmarks that can verify model rationale via keywords as well as check for answer correctness.
Transcript
[MUSIC]GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.
[MUSIC FADES]Today I’m talking to Jindong Wang, a senior researcher at Microsoft Research, and Steven Whang, a tenured associate professor at the Korea Advanced Institute of Science and Technology. Jindong and Steven are coauthors of a paper called “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” and this paper is a spotlight at this year’s conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC, this week. Jindong and Steven, thanks for joining us on Abstracts!
JINDONG WANG: Thank you. Nice to be here.
STEVEN EUIJONG WHANG: It’s great to be here.
HUIZINGA: So, Jindong, I’ll start with you. In just a few sentences, tell us what problem your research addresses and why people should care about it.
JINDONG WANG: OK, everybody knows that with the widespread usage of large language models, hallucination has become a crucial factor of concern. Hallucination occurs when models generate false or nonexistent information. In particular, factual hallucination greatly undermines the reliability of the large language models. To correctly evaluate the hallucination, evaluating the model’s rationale is also important. Up to date, when the paper, you know, was submitted, there were no works dealing with automatic rationale evaluation systematically because, you know, most of them focused on manual evaluation or just using GPT-judge. ERBench is the first one to generate a large language model evaluation benchmark utilizing relational databases. Relational databases are based on the relational data model assuming a fixed schema. The fixed schema enables relational databases to have data integrity that are based on database design theories, so that integrity constraints in relational databases allows better evaluation of the large language models. Functional dependencies allow automatic rationale evaluation using the functional dependency inferred keywords, and foreign key constraints also allow for easy generation of the multi-hop questions, which are usually very complicated to generate with other techniques. So that’s basically what we want to do. So in one sentence, we try to build an automatic evaluation benchmark for evaluation of the hallucination.
HUIZINGA: Steven, give us a quick overview of your research methodology and findings. How did you conduct your research, and what were your major takeaways?
STEVEN EUIJONG WHANG: Sure. So this was a collaboration between our group at KAIST, and Dr. Xing Xie’s group at MSRA (Microsoft Research Asia). KAIST is Korea Advanced Institute of Science and Technology. So we had the privilege to closely work with our LLM expert, Dr. Jindong Wang, here. We also acknowledge the Microsoft Accelerating Foundation Models Research, or AFMR, program for using Azure quota for our experiments. So we had some biweekly meetings for maybe over a year, and at some point, we figured that relational databases could be really important for LLM evaluation. I personally have a background in databases, which I studied at Stanford University as a PhD student. So relational databases have integrity constraints that can be used to better construct complex, in-depth questions and verify answers. So the first ingredient is functional dependencies. So these are constraints where, given a few attributes, you can determine another attribute. So I’ll just give an example because I think that helps the understanding. So suppose that you have, like, a movie table, and in a movie, you have the title of the movie, the year of production, and the director of the movie, and the length of the movie, and so on and so forth. So if you know the title and year of the movie, that pretty much identifies the movie, and you can actually determine the director of the movie, as well. So, for example, if you know that there’s a movie called Star Wars, which is a very popular movie produced in 1977, that determines the director. We know it’s George Lucas, right. So, basically, it’s like a function. It receives the Star Wars 1977 and determines, gives the output, George Lucas. So that’s the first ingredient. Now, the reason this is important is that we can use these functional dependencies to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values. For example, we may ask the LLM, is there a director of a movie called Star Wars produced in 1977? And the LLM can say yes. And it is the right answer, but we’d like to know if the LLM is knowing what it’s saying, right. And so we look at the rationale. That’s why looking at the rationale is important. We just can’t say it’s doing the correct thing. So if the LLM mentions George Lucas, bingo, that’s a great answer. However, if the LLM mentions some other director, like Steven Spielberg, that’s not a correct rationale. So that’s exactly what we’re trying to evaluate. Functional dependency is key to being able to do that kind of verification.
The second ingredient is foreign key constraints. So foreign key constraint is where one of the attributes in one table can intuitively link to another attribute of another table. So in our movie table, we had the director attribute. Now we may also have a separate table called the director table, and maybe we might have some more information about the director in that table, like the director name, the director’s age, all sorts of information about the director. So foreign key constraint basically requires that if there is some director mentioned in the movie table, it has to be one of the directors in the director table. So this basically links a table to another table. It’s very useful. So using this, what we can do is we can join the two tables, right. So now we can join the movie and director table and generate a bigger table. The reason this is useful is that we can also chain together functional dependencies that I just mentioned into longer functional dependencies. So what this enables is us to construct more complex questions, arbitrarily, that are multi-hop. So using these integrity constraints, we can basically convert any relational database into an LLM benchmark, and this supports continuous evaluation as the database changes. We can also support multimodal questions and also support various prompt engineering techniques.
HUIZINGA: Well, I would ask you to, kind of, drill in on what you found in how ERBench compares to other benchmark tests.
STEVEN EUIJONG WHANG: So we evaluated our benchmark on five domains and performed comprehensive analyses in terms of answer and rationale accuracies and hallucination rates using single, multi-hop, and multimodal questions and also performed prompt engineering and fine-tuning. And what we found is that some LLMs, like GPT-4, are relatively aggressive and good at answering lots of questions. Other LLMs, like Gemini, tend to be a bit more conservative and do not answer as many questions but instead hallucinate less as a result. So the key conclusion is that no LLM, like, totally subsumes the other in all aspects, which is the reason why we use multiple measures. And the key message we want to make is that overall, ERBench is effective in evaluating any LLM’s thought process by pinpointing critical keywords within the rationale.
HUIZINGA: Well, Jindong, back to you. Research settings are one thing, but tell us how your work is significant in real-world settings, and who does this impact most and how?
JINDONG WANG: Relational databases, you know, they are everywhere across various domains. Anyone can easily get access from Google or from Kaggle or even create them targeting the domain or subject that one wants to test the model on. So taking into account that ERBench is the first work to utilize the relational database for generating large language model hallucination benchmarks … so this work will lead a new research direction of integrating database design theories and techniques, a long-studied field—you know, database is very traditional, old, and classic, but, you know, they’re still operating right now—into the large language model field, a recently emerging area.
HUIZINGA: Right. Well, Steven, as we close, I assume there are still a few unanswered questions or unsolved problems in the field. What do you propose to do about those, and what’s next on your research agenda?
STEVEN EUIJONG WHANG: Sure, so the big picture is that we basically proposed the first work to properly evaluate the rationale of LLMs, right. This is very important because LLMs are being used in our everyday lives, and everyone has the question, is the LLM suitable for my task? Can I benefit from the LLM? So it’s very important to verify if the LLM knows what it’s saying. So I just mentioned that we use functional dependencies to pinpoint critical keywords in the rationale. And we believe that’s just the first step. It’s very effective, by the way. So you may have the question, is it enough to just look at, like, the George Lucas within the long rationale? And it turns out 95% of the cases, it is actually effective, so we did human studies and also used GPT-judge to verify that. But these are factual questions and there could be various other questions that require long answers, right. Long rationales. And so the important question is, can we also verify all the rest of the rationales, the complicated rationales, as well? And so in order to properly do that, we need a lot of technology. So first we need to understand the rationales using NLP techniques, and we need to know if it’s properly answering the question, and so on and so forth. And so we believe that there’s a lot of opportunity to expand from that. So we basically, you know, proposed an initial work towards this direction, but we believe that there are many more interesting challenges that remain.
HUIZINGA: Well, Jindong Wang and Steven Whang, thanks for joining us today, and to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts.
[MUSIC]You can also find it on arXiv and on the NeurIPS website. And if you’re at the NeurIPS conference this week, go to the poster session and talk to the authors! See you next time on Abstracts!
[MUSIC FADES]