Build a read-through semantic cache with Amazon OpenSearch Serverless and Amazon Bedrock

In the field of generative AI, latency and cost pose significant challenges. The commonly used large language models (LLMs) often process text sequentially, predicting one token at a time in an autoregressive manner. This approach can introduce delays, resulting in less-than-ideal user experiences. Additionally, the growing demand for AI-powered applications has led to a high volume of calls to these LLMs, potentially exceeding budget constraints and creating financial pressures for organizations.

This post presents a strategy for optimizing LLM-based applications. Given the increasing need for efficient and cost-effective AI solutions, we present a serverless read-through caching blueprint that takes advantage of repeated query patterns. With this cache, developers can store responses to previously seen prompts and retrieve them for semantically similar ones, enhancing their systems’ efficiency and response times. The proposed cache solution uses Amazon OpenSearch Serverless and Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Solution overview

The cache in this solution acts as a buffer, intercepting prompts—requests to the LLM expressed in natural language—before they reach the main model. The semantic cache functions as a memory bank that stores previously encountered prompts and their responses, and is designed to swiftly match a user’s prompt with its closest semantic counterparts. In a practical cache system, however, it’s crucial to refine the definition of similarity, striking a balance between two key factors: increasing cache hits and reducing cache collisions. A cache hit occurs when a requested prompt is found in the cache, so the system doesn’t need to send it to the LLM for a new generation. Conversely, a cache collision happens when multiple distinct prompts map to the same cached response because their semantic features are similar. To better understand these concepts, let’s examine a couple of examples.

Imagine a concierge AI assistant powered by an LLM, specifically designed for a travel company. It excels at providing personalized responses drawn from a pool of past interactions, making sure that each reply is relevant and tailored to travelers’ needs. Here, we might prioritize high recall, meaning we’d rather have more cached responses even if it occasionally leads to overlapping prompts.

Now, consider a different scenario: an AI assistant, designed to assist back-office agents at this travel company, uses an LLM to translate natural language queries into SQL commands. This enables the agents to generate reports from invoices and other financial data, applying filters such as dates and total amounts to streamline report creation. Precision is key here. We need every user request mapped accurately to its corresponding SQL command, leaving no room for error. In this case, we’d opt for a tighter similarity threshold, making sure that cache collisions are kept to an absolute minimum.

In essence, the read-through semantic cache isn’t just a go-between; it’s a strategic tool for optimizing system performance based on the specific demands of different applications. Whether it’s prioritizing recall for a chatbot or precision for a query parser, the adjustable similarity feature makes sure that the cache operates at peak efficiency, enhancing the overall user experience.

A semantic cache system operates at its core as a database storing numerical vector embeddings of text queries. Before being stored, each natural language query is transformed into a corresponding embedding vector. With Amazon Bedrock, you have the flexibility to select from various managed embedding models, including Amazon’s proprietary Amazon Titan embedding model or third-party alternatives like Cohere. These embedding models are designed to map semantically similar natural language queries to nearby vectors (small Euclidean or cosine distance), making vector distance a practical proxy for semantic similarity. With OpenSearch Serverless, you can establish a vector database suitable for setting up a robust cache system.
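Whether a lookup counts as a hit comes down to comparing embedding similarity against a tunable threshold. The following minimal sketch illustrates the decision in plain Python; the three-dimensional vectors are made-up stand-ins for real Bedrock embeddings, and the 0.75 threshold mirrors the value used later in this post:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec, cached_vec, threshold=0.75):
    """A cached entry counts as a hit when similarity clears the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold

# Toy vectors: in practice these come from an embedding model on Amazon Bedrock.
query = [0.9, 0.1, 0.3]
similar_cached = [0.85, 0.15, 0.35]   # near-duplicate prompt
unrelated_cached = [0.1, 0.9, -0.2]   # different topic

print(is_cache_hit(query, similar_cached))    # expect a hit
print(is_cache_hit(query, unrelated_cached))  # expect a miss
```

Raising the threshold trades recall for precision, which is exactly the knob the travel assistant and SQL generator examples above would tune differently.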

By harnessing these technologies, developers can build a semantic cache that efficiently stores and retrieves semantically related queries, improving the performance and responsiveness of their systems. In this post, we demonstrate how to use various AWS technologies to establish a serverless semantic cache system. This setup allows for quick lookups to retrieve available responses, bypassing the time-consuming LLM calls. The result is not only faster response times, but also a notable reduction in price.

The solution presented in this post can be deployed through an AWS CloudFormation template. It uses the following AWS services:

  • Amazon Bedrock
  • Amazon OpenSearch Serverless
  • AWS Lambda
  • AWS CloudFormation
  • Amazon Simple Storage Service (Amazon S3)

The following architecture shows a serverless read-through semantic cache pattern you can use to integrate into an LLM-based solution.

Illustration of Semantic Cache

In this architecture, examples of a cache miss and a cache hit are shown in red and green, respectively. In the miss scenario (red path), the client sends a query, which is semantically compared to previously seen queries. Because no cached entry clears the similarity threshold, the Lambda function, acting as the cache manager, prompts the LLM for a new generation. The new generation is then sent to the client and used to update the vector database. In the case of a cache hit (green path), the response previously generated for a semantically similar query is returned to the client immediately.
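The cache manager’s control flow can be sketched as follows. This is a minimal, self-contained illustration: the embedding model, the vector index, and the LLM are replaced with in-memory stand-ins (a toy `embed` function, a plain list, and a canned `generate`), so the Amazon Bedrock and OpenSearch Serverless calls from the real solution are only indicated in comments.

```python
import math

def embed(text):
    # Stand-in for a Bedrock embedding call (Amazon Titan or Cohere).
    # Here: a normalized letter-frequency vector, so identical and very
    # similar texts map to nearby unit vectors.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def generate(prompt):
    # Stand-in for a Bedrock text-generation call (e.g., Anthropic Claude).
    return f"LLM response for: {prompt}"

class ReadThroughSemanticCache:
    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.entries = []  # in place of an OpenSearch Serverless k-NN index

    def _best_match(self, vec):
        best_sim, best_response = -1.0, None
        for cached_vec, response in self.entries:
            # Vectors are unit-normalized, so the dot product is cosine similarity.
            sim = sum(a * b for a, b in zip(vec, cached_vec))
            if sim > best_sim:
                best_sim, best_response = sim, response
        return best_sim, best_response

    def query(self, prompt):
        vec = embed(prompt)
        sim, response = self._best_match(vec)
        if sim >= self.threshold:
            return response, "hit"     # green path: skip the LLM entirely
        response = generate(prompt)    # red path: cache miss, call the LLM
        self.entries.append((vec, response))
        return response, "miss"
```

Querying `"Who was the first US president?"` twice yields a miss followed by a hit; the second call returns the stored response without touching the LLM.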

The following table summarizes the response latency for this short query, “Who was the first US president?”, tested on Anthropic’s Claude V2.

Query Under Test Without Cache Hit With Cache Hit
Who was the first US president? 2 seconds Under 0.5 seconds

Prerequisites

Amazon Bedrock users need to request access to FMs before they are available for use. This is a one-time action and takes less than a minute. For this solution, you’ll need one of the embedding models such as Cohere Embed-English on Amazon Bedrock or Amazon Titan Text Embedding. For text generation, you can choose from Anthropic’s Claude models. For a complete list of text generation models, refer to Amazon Bedrock.

Bedrock Model Access

Deploy the solution

This solution entails setting up a Lambda layer that includes dependencies to interact with services like OpenSearch Serverless and Amazon Bedrock. A pre-built layer is compiled and added to a public Amazon Simple Storage Service (Amazon S3) prefix, available in the provided CloudFormation template. You have the option to build your own layer with other libraries; for more information, refer to the following GitHub repo.

You can deploy this solution, along with the required IAM roles, by using the provided CloudFormation template.

This solution uses the following input parameters:

  • Embedding model
  • LLM
  • Similarity threshold

After a successful deployment (which takes about 2 minutes), you can get your Lambda function name and start experimenting. You can find the Lambda function name on the Outputs tab of your CloudFormation stack, as shown in the following screenshot.

Stack Outputs

You can invoke the Lambda function from the Lambda console or through the AWS Command Line Interface (AWS CLI):

aws lambda invoke \
  --function-name YOUR_LAMBDA_FUNCTION_NAME \
  --invocation-type RequestResponse \
  --cli-binary-format raw-in-base64-out \
  --payload '{"prompt": "your question here"}' \
  output.txt

Your payload can include additional options to control caching and generation:

{
  "prompt": string, // Required. The natural language question.
  "max_tokens_to_sample": int, // Optional. Controls the length of the LLM generation on a cache miss. Default is 1000.
  "reset": bool // Optional. Set to true to reset the cache.
}
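If you prefer to invoke the function programmatically, the following sketch builds the same payload with the AWS SDK for Python (Boto3); the function name is a placeholder for the value on your CloudFormation stack’s Outputs tab, and the actual invocation is commented out because it requires AWS credentials:

```python
import json

def build_payload(prompt, max_tokens_to_sample=None, reset=None):
    """Assemble the JSON payload expected by the cache-manager Lambda function."""
    payload = {"prompt": prompt}
    if max_tokens_to_sample is not None:
        payload["max_tokens_to_sample"] = max_tokens_to_sample
    if reset is not None:
        payload["reset"] = reset
    return json.dumps(payload)

payload = build_payload("Who was the first US president?", max_tokens_to_sample=500)

# Invoking the function requires AWS credentials and the deployed stack:
# import boto3
# lambda_client = boto3.client("lambda")
# response = lambda_client.invoke(
#     FunctionName="YOUR_LAMBDA_FUNCTION_NAME",
#     InvocationType="RequestResponse",
#     Payload=payload,
# )
# print(response["Payload"].read().decode())
```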

Test the solution

To test your cache using the Lambda console, open the Functions page. Navigate to the function you retrieved from the output of your stack. Set up a test event as illustrated in the following screenshot.

Lambda Test Event

Upon a successful run, the result appears similar to the following screenshot.

Lambda Test Event Result

Let’s try another experiment to take a closer look into the functioning of our cache. For this experiment, we use the Cohere embedding model on Amazon Bedrock and a similarity threshold of 0.75. We asked the following question, and it took about 15 seconds for the model to generate the complete response with Anthropic’s Claude V2:

Prompt: What are the potential long-term impacts of artificial intelligence and blockchain technology on the global financial industry?

Generation: Here are some potential long-term impacts of AI and blockchain on the global financial industry:

Automation of routine tasks – AI and machine learning can automate many routine financial tasks like loan processing, fraud detection, client onboarding etc. This could lead to lower operational costs and improved efficiency for financial firms.

Better data analytics and insights – AI can analyze large volumes of financial data more efficiently than humans. This can lead to better insights, personalized recommendations for clients and improved risk management by firms.

Improved trading and investment platforms – AI algorithms can potentially execute trades faster and more optimally than human traders. They can also enable robo-advisors that provide automated investment advice and portfolio management.

Enhanced security and fraud detection – Machine learning models can detect patterns in data to identify potential fraud, money laundering or cybersecurity threats. Blockchain provides a decentralized, encrypted record of transactions that is harder to tamper with.

When we tried these prompt variations, the cache system successfully returned the previously generated response (under 0.5 seconds):

Similar Prompt 1: What are the potential future implications of AI and blockchain on the worldwide financial sector?

Similar Prompt 2: Discuss the possible effects of artificial intelligence and blockchain advancements on the global finance landscape over the coming decades.

Let’s replace the term “financial sector” with “healthcare sector”; because the topic has changed, we expect no cache collision—that is, a cache miss and a new generation:

Prompt: Could you outline the future implications of AI and blockchain on the worldwide healthcare sector?

Generation: Here are some potential long-term impacts AI could have on the healthcare sector:

Personalized medicine – AI can analyze huge datasets about genetics, lifestyle, medical history etc. to provide personalized treatment plans tailored to each individual patient’s needs and risks. This could improve outcomes and reduce waste in the system.

Early disease detection – Algorithms are getting very good at analyzing medical images and data to identify potential diseases and conditions much earlier than human doctors. This could lead to earlier treatment and prevention of serious illness.

Reduced costs – AI has the potential to automate and streamline many processes in healthcare leading to greater efficiency and lower costs. For example, AI chatbots and virtual assistants could provide some basic services at a fraction of the cost of human staff.

The following table summarizes the query latency test results without and with cache hit tested on Anthropic’s Claude V2.

Query Under Test Without Cache Hit With Cache Hit
Could you outline the future implications of AI and blockchain on the worldwide healthcare sector? 15 seconds Under 0.5 seconds

In addition to latency, you can also save costs for your LLM system. Typically, embedding models are more cost-efficient than generation models. For example, Amazon Titan Text Embedding V2 costs $0.00002 per 1,000 input tokens, whereas Anthropic’s Claude V2 costs $0.008 per 1,000 input tokens and $0.024 per 1,000 output tokens. Even accounting for the additional cost of OpenSearch Serverless, which depends on the scale of the cached data, the cache system can be cost-efficient for many use cases.
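As a back-of-the-envelope check, the following sketch compares per-query cost with and without the cache, using the prices quoted above. The 50-token prompt, 300-token response, and 80% hit rate are illustrative assumptions, not measurements, and the OpenSearch Serverless cost is deliberately left out:

```python
# Prices quoted in this post, in dollars per 1,000 tokens.
EMBED_IN = 0.00002   # Amazon Titan Text Embedding V2, input
CLAUDE_IN = 0.008    # Anthropic Claude V2, input
CLAUDE_OUT = 0.024   # Anthropic Claude V2, output

def llm_cost(input_tokens, output_tokens):
    """Cost of one uncached generation call."""
    return input_tokens / 1000 * CLAUDE_IN + output_tokens / 1000 * CLAUDE_OUT

def cached_cost(input_tokens, output_tokens, hit_rate):
    """Every query pays for one embedding; only misses pay for a generation."""
    embed = input_tokens / 1000 * EMBED_IN
    return embed + (1 - hit_rate) * llm_cost(input_tokens, output_tokens)

# Illustrative assumptions: 50-token prompts, 300-token responses, 80% hits.
without = llm_cost(50, 300)
with_cache = cached_cost(50, 300, hit_rate=0.8)
print(f"without cache: ${without:.6f} per query")
print(f"with cache:    ${with_cache:.6f} per query")
```

Under these assumptions the embedding cost is negligible next to the generation cost, so savings scale almost directly with the cache hit rate.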

Clean up

After you are done experimenting with the Lambda function, you can quickly delete all the resources you used to build this semantic cache, including your OpenSearch Serverless collection and Lambda function. To do so, locate your CloudFormation stack on the AWS CloudFormation console and delete it.

Make sure that the status of your stack changes from Delete in progress to Deleted.

Conclusion

In this post, we walked you through the process of setting up a serverless read-through semantic cache. By implementing the pattern outlined here, you can reduce the latency of your LLM-based applications while simultaneously optimizing costs and enriching the user experience. Our solution allows for experimentation with embedding models of varying sizes, conveniently hosted on Amazon Bedrock. Moreover, it enables fine-tuning of similarity thresholds to strike the right balance between cache hit and cache collision rates. Embrace this approach to unlock enhanced efficiency and effectiveness within your projects.

For more information, refer to the Amazon Bedrock User Guide and Amazon OpenSearch Serverless Developer Guide.


About the Authors

Kamran Razi is a Data Scientist at the Amazon Generative AI Innovation Center. With a passion for delivering cutting-edge generative AI solutions, Kamran helps customers unlock the full potential of AWS AI/ML services to solve real-world business challenges. Leveraging over a decade of experience in software development, he specializes in building AI-driven solutions, including chatbots, document processing, and retrieval-augmented generation (RAG) pipelines. Kamran holds a PhD in Electrical Engineering from Queen’s University.

Sungmin Hong is a Senior Applied Scientist at Amazon Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading and cooking.

Yash Shah is a Science Manager in the AWS Generative AI Innovation Center. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Anila Joshi has more than a decade of experience building AI solutions. As a Senior Manager, Applied Science at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services with customers by helping customers ideate, identify, and implement secure generative AI solutions.

Read More

Taste of Success: Zordi Plants AI and Robotics to Grow Flavorful Strawberries Indoors

With startup Zordi, founder Gilwoo Lee’s enthusiasm for robotics, healthy eating, better produce and sustainable farming has taken root.

Lee hadn’t even finished her Ph.D. in AI and robotics at the University of Washington when investors seeded her ambitious plans for autonomous agriculture. Since researcher-turned-entrepreneur Lee founded Zordi in 2020 with Casey Call, formerly head grower at vertical farming startup Plenty, the robotic grower of strawberries has landed its fruits in Wegmans and is now expanding with partner farms in New Jersey and California.

“The most rewarding part is that the fruits you get taste amazing,” said Lee. “You’re able to consistently do that throughout the cycle of the plant because you are constantly optimizing.”

The company has two types of robots within its hydroponic operations. One is a scouting robot for gathering information on the health of plants using foundational models. The other is a harvesting robot for delicately picking and placing fruits and handling other tasks.

Zordi, whose engineering team is based outside Boston, has farms in southern New Jersey and western New York. The company uses NVIDIA GPUs in the cloud and on desktops for training everything from crop health models to those for fruit picking and assessing fruit quality.

Lee aims to deploy autonomous greenhouse systems globally to support regional markets, cutting down on the carbon footprint for transportation as well as providing fresher, better-tasting fruits grown more sustainably.

Having operated greenhouses in New York and New Jersey for two years, the company recently formed partnerships with greenhouse farms in New Jersey and California to meet growing demand.

Zordi is bringing NVIDIA accelerated AI automation to indoor growing that in many ways is parallel to developments in manufacturing and fulfillment operations.

Adopting Jetson for Sustainable Farming, Energy Efficiency

Zordi is building AI models and robots to enable sustainable farming at scale. It uses NVIDIA Jetson AGX Orin modules for testing out gathering sensor data and running its models to recognize the health of plants, flowers and fruits, early pest and disease symptoms, and the needs for hydration and nutrition, as well as light and temperature management.

Jetson’s energy efficiency and the availability of low-cost, high-performance cameras from NVIDIA partners are attractive attributes for Zordi, said Lee. The company runs several cameras on each of its robots to collect data.

“Jetson opens up a lot of lower-cost cameras,” said Lee. “It lets us play with different cameras and gives us better battery management.”

With its scouting and harvesting robots, Zordi also aims to address a big issue farms worldwide complain about: a labor shortage that affects operations, leaving fruits and vegetables sometimes unattended and unharvested altogether.

Zordi is planning to scale up its growing operations to meet consumer demand. The company expects that it can do more with AI and robotic automation despite labor challenges.

“We want our harvesting robots to do more dexterous tasks, like pruning leaves, with the help of simulation,” said Lee.

Omniverse Isaac Sim and Digital Twins to Boost Operations 

Zordi is looking at how to boost its indoor growing operations with AI much like industrial manufacturers do, using Isaac Sim in Omniverse for simulations and digital twins to optimize operations.

The company’s software platform for viewing all the data collected from its robots’ sensors provides a live dashboard with a spatial map. It offers a real-time view of every plant in its facilities so that it’s easy to monitor the production remotely.

What’s more, it analyzes plant health and makes crop-care recommendations using foundational models so that inexperienced farm operators can manage farms like experts.

“We’re literally one step away from putting this all into Isaac Sim and Omniverse,” said Lee, whose Ph.D. dissertation covered reinforcement learning and sim-to-real.

Zordi is working on gripping simulations for strawberries as well as for cucumbers and tomatoes to expand into other categories.

“With strawberries or any other crops, if you can handle them delicately, then it helps with longer shelf life,” Lee said.

Lee is optimistic that the simulations in Isaac Sim will not only boost Zordi’s performance in harvest, but also let it do other manipulation tasks in other scenarios.

Big picture, Zordi aims to create a fully autonomous farming system that makes farming easy and profitable, with AI recommending sustainable crop-care decisions and robots doing the hard work.

“What’s really important for us is how do we automate this, and how do we have a thinking AI that is actually making decisions for the farm with a lot of automations,” said Lee.

Read More

Rad AI reduces real-time inference latency by 50% using Amazon SageMaker

This post is co-written with Ken Kao and Hasan Ali Demirci from Rad AI.

Rad AI has reshaped radiology reporting, developing solutions that streamline the most tedious and repetitive tasks, and saving radiologists’ time. Since 2018, using state-of-the-art proprietary and open source large language models (LLMs), our flagship product—Rad AI Impressions—has significantly reduced the time radiologists spend dictating reports by generating Impression sections.

The Impression section serves as the conclusion of a radiology report, including summarization, follow-up recommendations, and highlights of significant findings. It stands as the primary result for the clinician who ordered the study, influencing the subsequent course of the patient’s treatment. Given its pivotal role, accuracy and clarity in this section are paramount. Traditionally, radiologists dictated every word of the impressions section, creating it from scratch for each report. This time-consuming process led to fatigue and burnout, and involved redundant manual dictation in many studies.

The automation provided by Rad AI Impressions not only reduces burnout, but also safeguards against errors arising from manual repetition. It increases the capacity to generate reports, reducing health system turnaround times and making high-quality care available to more patients. Impressions are meticulously customized to each radiologist’s preferred language and style. Radiologists review and revise the output as they see fit, maintaining exact control over the final report, and Rad AI also helps radiologists catch and fix a wide variety of errors in their reports. This improves the overall quality of patient care.

Today, by executing abstractive summarization tasks at scale, Rad AI’s language models generate impressions for millions of radiology studies every month, assisting thousands of radiologists at more than 40% of all US health systems and 9 of the 10 largest US radiology practices. Based on years of working with customers, we estimate that our solutions save 1 hour for every 9-hour radiology shift.

Operating within the real-time radiology workflow, our product functions online around the clock, adhering to strict latency requirements. For years, Rad AI has been a reliable partner to radiology practices and health systems, consistently delivering high availability and generating complete results seamlessly in 0.5–3 seconds, with minimal latency. This efficiency empowers radiologists to achieve optimal results in their studies.

In this post, we share how Rad AI reduced real-time inference latency by 50% using Amazon SageMaker.

Challenges in deploying advanced ML models in healthcare

Rad AI, being an AI-first company, integrates machine learning (ML) models across various functions—from product development to customer success, from novel research to internal applications. AI models are ubiquitous within Rad AI, enhancing multiple facets of the organization. It might seem straightforward to integrate ML models into healthcare workflows, but the challenges are many and interconnected.

Healthcare applications make some of the usual AI complexities more challenging. Although any AI solution has to balance speed against accuracy, radiologists rely on the timeliness of our impressions to care for patients, and expect our clinical accuracy to always improve. This constant innovation requires new kinds of models and demands continually improving specialized software and hardware. As inference logic becomes more complex and results are composed from multiple models (each seeing regular releases), a streamlined and reproducible process for orchestration and management becomes paramount. Even diagnosing basic issues, at this level of complexity, requires a deliberate and methodical approach.

Rad AI’s ML organization tackles this challenge on two fronts. First, it enhances researcher productivity by providing the necessary processes and automation, positioning them to deliver high-quality models with regularity. Second, it navigates operational requirements by making strategic infrastructure choices and partnering with vendors that offer both computational resources and managed services. By enhancing both researcher productivity and operational efficiency, Rad AI creates an environment that fosters ML innovation.

To succeed in this environment, Rad AI takes advantage of the availability and consistency offered by SageMaker real-time endpoints, a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By integrating Amazon Elastic Container Service (Amazon ECS) and SageMaker, Rad AI’s ML system forms a complex server-side architecture with numerous online components. This infrastructure enables Rad AI to navigate the complexities of real-time model deployment, so radiologists receive timely and accurate impressions.

With focused effort and strategic planning, Rad AI continues to enhance its systems and processes, ultimately improving outcomes for patients and clinicians alike.

Let’s transition to exploring solutions and architectural strategies.

Approaches to researcher productivity

To translate our strategic planning into action, we developed approaches focused on refining our processes and system architectures. By improving our deployment pipelines and enhancing collaboration between researchers and MLOps engineers, we streamlined the integration of models into our healthcare workflows. In this section, we discuss the practices that have enabled us to optimize our operations and advance our ML capabilities.

To enable researchers to work at full capacity while minimizing synchronization with MLOps engineers, we recognized the need to normalize our deployment processes. The pipeline begins when researchers manage tags and metadata on the corresponding model artifact. This approach abstracts away the complexity beneath the surface and eliminates the usual ceremony involved in deploying models. By centralizing model registration and aligning practices across team members, we standardize the entry point for model deployment. This allows us to build additional tooling as we identify bottlenecks or areas for improvement.

Instead of frequent synchronization between MLOps and research teams, we observe practices and identify needs as they arise. Under the hood, we employ an in-house tool combined with modular, reusable infrastructure as code to automate the creation of pull requests. No one writes any code manually. The protocol between researchers and engineers is reduced to pull request reviews, eliminating the need for circulating documents or holding alignment meetings. The declarative nature of the infrastructure code, coupled with intuitive design, answers most questions that MLOps engineers would typically ask researchers—all within the file added to the repository and pull requested.

These approaches, combined with the power and streamlining offered by SageMaker, have reduced the model deployment process to a matter of minutes after a model artifact is ready. Deploying a new model to a target environment now requires minimal effort. Only when dealing with peculiar characteristics of an architecture or specific configurations—such as adjustments for tensor parallelism—do additional considerations arise. By minimizing the complexity and time involved in deployment, we enable researchers to concentrate on innovation rather than operational hurdles.

Architectural strategies

In our architectural strategies, we aimed to achieve high performance and scalability while effectively deploying ML models. The need for low latency in inference tasks—especially critical in healthcare settings where delays can impact patient care—required architectures capable of efficiently handling both GPU-bound and CPU-bound workloads. Additionally, straightforward configuration options that allow us to quickly generate benchmarks became essential. This capability enables us to swiftly evaluate different backend engines, a necessity in latency-bound environments.

In addition to process improvements, we implemented architectural strategies to address the technical aspects. As previously mentioned, real-world inference systems often combine GPU-bound and CPU-bound inference tasks, along with the need to compose results from multiple services. This complexity is typically required for an ML organization to provide product-side functionality. We use AWS Fargate to run CPU inferences and other supporting components, usually alongside a comprehensive frontend API. This setup implements a classic architecture consisting of a frontend API and backend application services. GPU inferences are served through SageMaker real-time inference endpoints. An illustration of this architecture is provided in the following diagram.

We standardized on using SageMaker Large Model Inference (LMI) containers, maintained and offered from public Amazon repositories. These containers support several optimization frameworks and provide simple configuration delivery options. This setup is straightforward for researchers to interpret and spares them the unnecessary hassle of dealing with dependencies and compatibility issues among various ML libraries and managing the underlying container layers.

Diving deeper into our architecture, we consider one of the deployment strategies used in our online inference system. On a single instance, we employ a server that schedules inference tasks with DJL Serving as the model server. This approach allows us to select from and experiment with multiple backend engines, including popular frameworks such as TensorRT-LLM and vLLM. The abstractions and built-in integration with SageMaker real-time endpoints, along with support for multi-GPU inference and tensor parallelism, enable us to quickly evaluate different backends for a given task.
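For illustration, backend selection in an LMI container is typically driven by a small configuration file rather than code. The following `serving.properties` sketch shows the general shape of such a configuration; the exact keys and supported values depend on the container version, and the model ID and parallelism degree below are placeholders rather than our production settings:

```properties
# Select the inference backend (for example, a vLLM-backed rolling batch).
option.rolling_batch=vllm
# Model artifact to load (placeholder ID).
option.model_id=my-org/my-model
# Shard the model across the instance's GPUs via tensor parallelism.
option.tensor_parallel_degree=4
# Cap the number of requests the model server batches together.
option.max_rolling_batch_size=32
```

Swapping backends for a benchmark then amounts to editing one line and redeploying, which is what makes quick backend comparisons practical in a latency-bound environment.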

As Rad AI has matured, our architectural solutions have evolved. Initially, we relied on custom components, managing our own container images and running NVIDIA Triton Server directly on instances provided by Amazon ECS. However, by migrating to SageMaker managed hosting and using instance types ranging from 1–8 GPUs of various kinds, we implemented the architectural strategies discussed earlier. Removing the undifferentiated heavy lifting involved in building and optimizing model hosting infrastructure reduced our total cost of ownership by 50%. Optimizing the instance types and container parameters decreased latency by the same margin.

When deploying models with SageMaker Inference, consider the following key best practices:

  • It’s important to build a robust model deployment pipeline that automates the process of registering, testing, and promoting models to production. This can involve integrating SageMaker with continuous integration and delivery (CI/CD) tools to streamline the model release process.
  • In terms of infrastructure choices, it’s important to right-size your SageMaker endpoints to match the expected traffic and model complexity, using features like auto scaling to dynamically adjust capacity.
  • Performance optimization techniques like model optimization and inference container parameter tuning can help improve latency and reduce costs.
  • Comprehensive monitoring and logging of model performance in production is critical to quickly identify and address any issues that arise.

Conclusion

One of the enduring challenges in healthcare is enhancing patient care on a global scale. Rad AI is committed to meeting this challenge by transforming the field of radiology. By refining our processes and implementing strategic architectural solutions, we have enhanced both researcher productivity and operational efficiency.

Our deliberate approach to model deployment and infrastructure management has streamlined workflows and significantly reduced costs and latency. Every additional second saved not only increases bandwidth and reduces fatigue for the radiologists we serve, but also improves patient outcomes and benefits healthcare organizations in a variety of ways. Our inference systems are instrumental in realizing these objectives, using SageMaker’s scalability and flexibility to integrate ML models seamlessly into healthcare settings. As we continue to evolve, our commitment to innovation and excellence positions Rad AI at the forefront of AI-driven healthcare solutions.

Share your thoughts and questions in the comments.

About the authors

Ken Kao is an executive leader with 12+ years leading engineering and product across early, mid-stage startups and public companies. He is currently the VP of Engineering at Rad AI, pushing the frontier of applying generative AI to healthcare to help make physicians more efficient and improve patient outcomes. Prior to that, he was at Meta driving VR device performance, emulation, and development tooling and infrastructure. He has also previously held engineering leadership roles at Airbnb, Flatiron Health, and Palantir. Ken holds M.S. and B.S. degrees in Electrical Engineering from Stanford University.

Hasan Ali Demirci is a Staff Engineer at Rad AI, specializing in software and infrastructure for machine learning. Since joining as an early engineer hire in 2019, he has steadily worked on the design and architecture of Rad AI’s online inference systems. He is certified as an AWS Certified Solutions Architect and holds a bachelor’s degree in mechanical engineering from Boğaziçi University in Istanbul and a graduate degree in finance from the University of California, Santa Cruz.

Karan Jain is a Senior Machine Learning Specialist at AWS, where he leads the worldwide Go-To-Market strategy for Amazon SageMaker Inference. He helps customers accelerate their generative AI and ML journey on AWS by providing guidance on deployment, cost-optimization, and GTM strategy. He has led product, marketing, and business development efforts across industries for over 10 years, and is passionate about mapping complex service features to customer solutions.

Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in financial services industry.

Read graphs, diagrams, tables, and scanned pages using multimodal prompts in Amazon Bedrock

Large language models (LLMs) have come a long way from being able to read only text to now being able to read and understand graphs, diagrams, tables, and images. In this post, we discuss how to use LLMs from Amazon Bedrock to not only extract text, but also understand information available in images.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. It also provides a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Solution overview

In this post, we demonstrate how to use models on Amazon Bedrock to retrieve information from images, tables, and scanned documents. We provide the following examples:

  • Performing object classification and object detection tasks
  • Reading and querying graphs
  • Reading flowcharts and architecture diagrams (such as an AWS architecture diagram) and converting them to text

We also explore specific use cases where Amazon Bedrock can help you generate information from an image:

  • Identify the action in the image
  • Name a product and extract metadata to generate a tagline and description
  • Generate a real estate listing for a property
  • Generate a recipe from an image of a dish

Perform object classification and object detection tasks using LLM

At the time of writing, the best algorithms for object classification and detection are based on convolutional neural networks. However, LLMs such as Anthropic’s Claude 3 Sonnet on Amazon Bedrock can also perform these tasks using zero-shot prompting, a technique in which the model is given a task without any examples or task-specific training. In the following example, we show how to classify an image based on the classes that we pass with the prompt. The following image is a sample Australian Medicare card.

We use the following prompt with Anthropic’s Claude 3 Sonnet to classify the image:

This is one of the documents that a customer seeking a loan has uploaded. The image can belong to one of the following four classes : ID, bank statement, application form or utility bill. Identify what the class of this document is and give me just the class name. I don’t need any other information for now

We get the following response from the LLM:

Based on the image provided, the class of this document appears to be an ID card or identification document. The image shows what seems to be a Medicare card or similar government-issued ID card displaying names and an identification number.

It has correctly identified the class; however, a response in JSON would make it more straightforward to process that response. We add the following to the end of the prompt:

provide the response in json format with the key as “class” and the value as the class of the document

We get the following response:

{
"class": "ID"
}

You can now read the JSON response using a library of your choice, such as the Python JSON library.
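To sketch what this looks like in code, the following helpers assume the Anthropic Messages API format on Amazon Bedrock. The request-building glue and file names are illustrative, and the actual `invoke_model` call (shown commented out) requires AWS credentials and Bedrock model access:

```python
import base64
import json


def build_image_request(prompt: str, image_bytes: bytes, media_type: str = "image/png") -> str:
    """Build an Anthropic Messages API request body for an image + text prompt."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64.b64encode(image_bytes).decode("utf-8"),
                }},
                {"type": "text", "text": prompt},
            ],
        }],
    })


def extract_class(model_text: str) -> str:
    """Pull the "class" value out of the model's JSON answer, tolerating extra prose."""
    start, end = model_text.find("{"), model_text.rfind("}") + 1
    return json.loads(model_text[start:end])["class"]


# The actual call needs AWS credentials and Bedrock model access:
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.invoke_model(
#     modelId="anthropic.claude-3-sonnet-20240229-v1:0",
#     body=build_image_request(classification_prompt, open("document.png", "rb").read()),
# )
# answer = json.loads(response["body"].read())["content"][0]["text"]

print(extract_class('{\n"class": "ID"\n}'))  # → ID
```

Tolerating surrounding prose in `extract_class` is useful because, without an explicit instruction, the model often wraps its JSON answer in a sentence or two.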

In this next example, we use Meta’s Llama 3.2 90B Vision model to identify an object in an image. The following image is of a gearbox.

We use the following prompt to identify the gearbox from the image:

identify the entire machine part in this image. Give just the name of the machine part in a JSON format with the key as "name" and the value as the name of the machine part

The following screenshot shows the prompt and the answer the model gives.

Read and query graphs

In this next example, we show how to read a graph and then interpolate and extrapolate values in the graph. The following graph shows the population growth for a few countries (source).

We pass this image to Anthropic’s Claude 3 Sonnet along with the following prompt:

From this graph create a table for the population of India, with the first column as the year and the second column as the population in millions. Create a row for every 5 years starting from 1950 to 2025.

We ask it to create a row for every 5 years, so the model must interpolate values. We also ask it to extend the table until 2025, and because the data is only until 2021, the model will have to extrapolate the values. We get the following response:

Here is a table for the population of India, with the first column as the year and the second column as the population in millions, for every 5 years starting from 1950 to 2025:

| Year | Population (in millions) |
|------|--------------------------|
| 1950 | 357.6 |
| 1955 | 390.5 |
| 1960 | 427.8 |
| 1965 | 468.0 |
| 1970 | 510.9 |
| 1975 | 556.9 |
| 1980 | 605.7 |
| 1985 | 658.0 |
| 1990 | 713.8 |
| 1995 | 773.0 |
| 2000 | 835.6 |
| 2005 | 901.7 |
| 2010 | 971.5 |
| 2015 | 1045.0 |
| 2020 | 1122.0 |
| 2025 | 1202.4 |

The extrapolation may not be accurate, and this is where a more traditional machine learning (ML) approach would be helpful.
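As a quick illustration of what a simple statistical baseline would say, an ordinary least-squares line fit to the 1950–2020 values above can be extrapolated to 2025 (this sketch is illustrative and not part of the original experiment):

```python
# India population (millions) for 1950-2020 in 5-year steps, from the table above.
years = list(range(1950, 2025, 5))
pop = [357.6, 390.5, 427.8, 468.0, 510.9, 556.9, 605.7, 658.0,
       713.8, 773.0, 835.6, 901.7, 971.5, 1045.0, 1122.0]


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for a straight-line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


slope, intercept = linear_fit(years, pop)
pred_2025 = slope * 2025 + intercept
print(f"linear trend predicts roughly {pred_2025:.0f}M in 2025")
```

Because the growth curve is convex, a straight-line fit lands somewhat below the LLM's 1202.4M figure, which is a reminder that the choice of model matters as much as the data read off the graph.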

You can also ask the model to combine its knowledge with the knowledge from the graph. The following prompt asks for the population for countries in this graph for 2021, along with the country’s capital:

Can you create a table for 2021 with the first column as the country name, the second column as population in millions and third column as the capital of that country?

The following screenshot shows the response.

The LLM has filled in the table based on the graph and its own knowledge about the capital of each country.

You can try out something harder as well. The following image is a weather map showing precipitation across Australia.

You can ask the LLM a specific question, such as the following:

This is a precipitation map for Australia. Is it raining in Sydney right now?

You don’t have to tell the LLM where Sydney is or that the image is for rainfall.

We get the following response:

Based on the precipitation map shown, there does not appear to be any significant rainfall occurring over the Sydney area, which is located on the eastern coast of the Australian continent. The map uses contour lines and colors to represent precipitation levels, but the region around Sydney has no shaded areas or contour lines indicating rainfall at this particular timeframe depicted by the map. Therefore, the available data suggests that it is likely not raining in Sydney at the time represented by this precipitation map.

There are a couple of interesting observations from this use case:

  • Anthropic’s Claude 3 Sonnet can read maps
  • The model is able to read overlays on a map
  • Phrases such as “region around Sydney” show that the model doesn’t need to work with exact information but can use an approximation, just as humans do

Read flowcharts and architecture diagrams

You can read AWS architecture diagrams using the Meta Llama 3.2 90B Vision model. The following is an example architecture diagram for modernizing applications with microservices using Amazon Elastic Kubernetes Service (Amazon EKS).

We use the following prompt to read this diagram:

The steps in this diagram are explained using numbers 1 to 11. The numbers are shown in blue squares. Can you explain the diagram using the numbers 1 to 11 and an explanation of what happens at each of those steps?

The following screenshot shows the response that we get from the LLM (truncated for brevity).

Furthermore, you can use this diagram to ask follow-up questions:

Why do we need a network load balancer in this architecture

The following screenshot shows the response from the model.

As you can see, the LLM acts as your advisor now for questions related to this architecture.

However, we’re not limited to using generative AI for only software engineering. You can also read diagrams and images from engineering, architecture, and healthcare.

For this example, we use a process diagram taken from Wikipedia.

To find out what this process diagram is for and to describe the process, you can use the following prompt:

Can you name the process shown in the example. Also describe the process using numbered steps and go from left to right.

The following screenshot shows the response.

The LLM has done a good job figuring out that the diagram is for the Haber process to produce ammonia. It also describes the steps of the process.

Identify actions in an image

You can identify and classify the actions taking place in the image. The model’s ability to accurately identify actions is further enhanced by its capacity to analyze contextual information, such as the surrounding objects, environments, and the positions of individuals or entities within the image. By combining these visual cues and contextual elements, Anthropic’s Claude 3 Sonnet can make informed decisions about the nature of the actions being performed, providing a comprehensive understanding of the scene depicted in the image.

The following is an example where we can not only classify the action of the player but also provide feedback to the player comparing the action to a professional player.

We provide the model the following image of a tennis player. The image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock.

The following screenshot shows the prompt and the model’s response.

Name a product and extract metadata to generate a tagline and description

In the field of marketing and product development, coming up with a perfect product name and creative promotional content can be challenging. With the image-to-text capabilities of Anthropic’s Claude 3 Sonnet, you can upload the image of the product and the model can generate a unique product name and craft taglines to suit the target audience.

For this example, we provide the following image of a sneaker to the model (the image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock).

The following screenshot shows the prompt.

The following screenshot shows the model’s response.

In the retail and ecommerce domain, you can also use Anthropic’s Claude 3 Sonnet to extract detailed product information from the images for inventory management.

For example, we use the prompt shown in the following screenshot.

The following screenshot shows the model’s response.

Create a real estate listing for a property

You can upload images of a property floor plan and pictures of the interior and exterior of the house and then get a description to use in a real estate listing. This can increase the creativity and productivity of real estate agents advertising properties. Architects could also use this mechanism to explain the floor plan to customers.

We provide the following example floor plan to the model.

The following screenshot shows the prompt.

The following screenshot shows the response.

Generate a recipe from the image of a dish

You can also use Anthropic’s Claude 3 Sonnet to create a recipe based on a picture of a dish. However, out of the box, the model can identify only the dishes that are included in the dataset used for the model training. Factors such as ingredient substitutions, cooking techniques, and cultural variations in cuisine can pose significant challenges.

For example, we provide the following image of a cake to the model to extract the recipe. The image was generated using the Stability AI model (SDXL 1.0) on Amazon Bedrock.

The following screenshot shows the prompt.

The model successfully identifies the dish as Black Forest cake and creates a detailed recipe. The recipe may not produce the exact cake shown in the figure, but it does get close to a Black Forest cake.

Conclusion

FMs such as Anthropic’s Claude 3 Sonnet and Meta Llama 3.2 90B Vision model, available on Amazon Bedrock, have demonstrated impressive capabilities in image processing. These FMs unlock a range of powerful features, including image classification, optical character recognition (OCR), and the ability to interpret complex visuals such as graphs and architectural blueprints. Such innovations offer novel solutions to challenging problems, from searching through scanned document archives to generating image-inspired text content and converting visual information into structured data.

To start using these capabilities for your specific needs, we recommend exploring the chat playground feature on Amazon Bedrock, which allows you to interact with and extract information from images.


About the Authors

Mithil Shah is a Principal AI/ML Solution Architect at Amazon Web Services. He helps commercial and public sector customers use AI/ML to achieve their business outcome. He is currently helping customers build chat bots and search functionality using LLM agents and RAG.

Santosh Kulkarni is a Senior Solutions Architect at Amazon Web Services specializing in AI/ML. He is passionate about generative AI and is helping customers unlock business potential and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

How Crexi achieved ML models deployment on AWS at scale and boosted efficiency

This post is co-written with Isaac Smothers and James Healy-Mirkovich from Crexi. 

With the current demand for AI and machine learning (AI/ML) solutions, the processes to train and deploy models and scale inference are crucial to business success. Although AI/ML, and especially generative AI, is progressing rapidly, machine learning operations (MLOps) tooling is continuously evolving to keep pace. Customers are looking for success stories about how best to adopt the culture and the new operational solutions that support their data scientists. Solutions should be straightforward to adopt, allow seamless integration with other systems, and provide a path to automating MLOps using AWS services and third-party tools, as we’ll explore in this post with Pulumi and Datadog. This framework helps achieve operational excellence not only in the DevOps space, but also allows stakeholders to optimize tools such as infrastructure as code (IaC) automation and DevOps research and assessment (DORA) observability of pipelines for MLOps.

Commercial Real Estate Exchange, Inc. (Crexi), is a digital marketplace and platform designed to streamline commercial real estate transactions. It allows brokers to manage the entire process from listing to closing on one platform, including digital letters of intent, best and final offer negotiations, and transaction management tools. Its data and research features allow investors and other commercial real estate stakeholders to conduct due diligence and proactively connect with other professionals ahead of the transaction process.

In this post, we will review how Crexi achieved its business needs and developed a versatile and powerful framework for AI/ML pipeline creation and deployment. This customizable and scalable solution allows its ML models to be efficiently deployed and managed to meet diverse project requirements.

Datadog is a monitoring service for cloud-scale applications, bringing together data from servers, databases, tools, and services to present a unified view of your entire stack. It is a SaaS-based data analytics platform that enables development and operations teams to work collaboratively to avoid downtime, resolve performance problems, and help make sure that development and deployment cycles finish on time.

Pulumi’s modern infrastructure as code (IaC) platform empowers teams to manage cloud resources using their favorite languages including Python, JavaScript, TypeScript, Go, and C#. Pulumi’s open source SDK integrates with its free and commercial software as a service (SaaS) to simplify infrastructure provisioning, delivery, architecture, policy, and testing on a cloud.

Solution overview

Central to Crexi’s infrastructure are boilerplate AWS Lambda triggers that call Amazon SageMaker endpoints, executing any given model’s inference logic asynchronously. This modular approach supports complex pipeline pathways, with final results directed to Amazon Simple Storage Service (Amazon S3) and Amazon Data Firehose for seamless integration into other systems. One of the SageMaker endpoints also uses Amazon Textract, but any model can be used.

ML pipeline engineering requirements

The engineering requirements for the ML pipeline, whose goal is to build a robust infrastructure for model deployments, are:

  • Rapid deployment of ML models: Model pipeline deployments should be managed through a continuous integration and continuous deployment (CI/CD) infrastructure, facilitating model pipeline rollbacks, regression testing, and click deploys. This automated CI/CD deployment process is used to automatically test and deploy pipeline changes, minimizing the risk of errors and downtime.
  • Distinct separation of concerns for production and development ML pipelines: This requirement prevents ongoing model experiments in the development environment from affecting the production environment, thereby maintaining the stability and reliability of the production models.
  • Model pipeline health monitoring: Health monitoring allows for proactive identification and resolution of potential issues in model pipelines before they impact downstream engineering teams and users.
  • Readily accessible models: Model pipelines should be accessible across engineering teams and straightforward to integrate into new and existing products.

The goal is to build reliable, efficient ML pipelines that can be used by other engineering teams with confidence.

Technical overview

The ML pipeline infrastructure is an amalgamation of various AWS products, designed to seamlessly invoke and retrieve output from ML models. This infrastructure is deployed using Pulumi, a modern IaC tool that allows Crexi to handle the orchestration of AWS products in a streamlined and efficient manner.

The AWS products managed by Pulumi in the infrastructure include:

  • AWS Lambda functions that trigger model invocations
  • Amazon SageMaker asynchronous inference endpoints
  • Amazon S3 buckets for model artifacts, inputs, and outputs
  • Amazon Data Firehose delivery streams for downstream integration
  • AWS Identity and Access Management (IAM) roles and permissions

To protect the robustness and reliability of the infrastructure, Crexi uses Datadog for pipeline log monitoring, which allows the team to keep a close eye on the pipeline’s performance and quickly identify and address issues that might arise.

Lastly, Crexi uses GitHub Actions to run Pulumi scripts in a CI/CD fashion for ML pipeline deployments, updates, and teardowns. These workflows keep the infrastructure reproducible and sufficiently hardened against code regression.

Pipeline as code

Pulumi-managed ML pipelines are coded as YAML files that data scientists can quickly create and deploy. Deploying IaC using YAML files that data scientists can write has three key advantages:

  • Increased efficiency and speed: A streamlined deployment process allows data scientists to write and deploy their own models. Enabling data scientists in this way reduces delivery time by not requiring additional data engineering or ops personnel (that is, it reduces cross-functional dependencies) for deployments.
  • Flexibility and customization: YAML files allow data scientists to specify the necessary configurations such as instance types, model images, and additional permissions. This level of customization helps the team to optimize the deployed models for specific use cases.
  • Simplicity and readability: YAML files are human-readable, facilitating the evaluation, review, and auditing of infrastructure and deployment configurations.
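As a rough illustration, a deployment workflow could run sanity checks over the parsed pipeline definition before handing it to Pulumi. The required keys below mirror the YAML fields shown later in this post, but the validation logic itself is a hypothetical sketch, not Crexi's actual tooling:

```python
# Keys every model entry is expected to carry (mirrors the YAML example in this post).
REQUIRED_MODEL_KEYS = {"name", "async", "count", "instanceType", "image", "s3Path"}


def validate_pipeline(pipeline: dict) -> list:
    """Return a list of problems found in a parsed pipeline definition (empty if clean)."""
    problems = []
    names = set()
    for model in pipeline.get("models", []):
        missing = REQUIRED_MODEL_KEYS - model.keys()
        if missing:
            problems.append(f"{model.get('name', '?')}: missing {sorted(missing)}")
        names.add(model.get("name"))
    for model in pipeline.get("models", []):
        # Verify that every declared dependency refers to a model defined in this pipeline.
        for dep in model.get("dependsOn", {}).get("models", []):
            if dep not in names:
                problems.append(f"{model['name']}: depends on undefined model {dep}")
    return problems
```

A check like this catches typos and dangling dependencies at review time, before any cloud resources are touched.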

Implementation

Now, let’s look at the implementation details of the ML pipeline.

The pipeline contains three SageMaker endpoints named model-a, model-b, and model-c. Each endpoint is asynchronous and has a specified number of running instances. Each has a specified Docker image to run the model hosted on the endpoint, a specified location of the model.tar.gz file that the endpoint will host, and a specified type of machine instance to run the endpoint on. The model-b and model-c endpoints depend on the output from model-a.

The model-a endpoint has access to input Amazon S3 objects in the Crexi AWS account and depends on the crexi-model-input-dev bucket for input. Lastly, the model-c endpoint also has access to input S3 objects in the Crexi AWS account in addition to Amazon Textract.

After a new version of an input is uploaded to the crexi-model-input-dev S3 bucket, a Lambda function passes it to the model-a SageMaker endpoint. After results are ready and delivered to the model-a-model-output bucket, the relevant Lambda functions execute model-b and model-c SageMaker endpoints accordingly.
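The trigger step described above can be sketched as a Lambda handler. The endpoint name and event shape here are illustrative; SageMaker asynchronous inference reads its payload from S3 and writes results to the endpoint's configured output location:

```python
import json


def s3_input_uri(event: dict) -> str:
    """Extract the uploaded object's S3 URI from an S3 event notification."""
    record = event["Records"][0]["s3"]
    return f"s3://{record['bucket']['name']}/{record['object']['key']}"


def handler(event, context):
    # Hypothetical endpoint name; the async endpoint pulls its input from S3
    # and delivers results to its configured output bucket.
    import boto3  # deferred import so the pure helper above is testable offline

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint_async(
        EndpointName="model-a",
        InputLocation=s3_input_uri(event),
        ContentType="application/json",
    )
    return {"statusCode": 202,
            "body": json.dumps({"outputLocation": response["OutputLocation"]})}
```

Because the invocation is asynchronous, the handler returns immediately; downstream Lambda functions react to the result object landing in the output bucket.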

The visualization that follows depicts the pipeline flow.

Solution architecture overview

To automate changes in the resources and new models, the Crexi team manages infrastructure using Pulumi and defines resources using YAML. SageMakerPipelineExample.yaml creates a stack of AWS resources that deploy service models to production. The AWS stack contains the necessary Lambda functions, S3 buckets, SageMaker endpoints, IAM permissions, and so on. As an example, the following is part of the YAML files that define the SageMaker endpoints.

team: Mlops

identifier: SagemakerPipelineExample

data_dev: 
  buckets:
    - name: "crexi-model-storage-dev" 
      additionalWriters:
        - "arn:aws:iam::<aws_account_id>:role/DataDevelopers"
    - name: "crexi-model-input-dev"

sagemakerPipelines:
  - name: "Infrared"
    models:
      - name: model-a
        async: true
        count: 4
        instanceType: ml.c5.4xlarge
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference-with-t
        s3Path: "crexi-model-storage-dev/model-a.tar.gz"
        access:
          filesCrexiAccess: true
        dependsOn:
          s3Buckets:
            - bucketName: "crexi-model-input-dev"
              prefix: "manifests/"
              suffix: ".json"
      - name: model-b 
        async: true 
        count: 1
        instanceType: ml.m5.xlarge
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.1.0
        s3Path: "crexi-model-storage-dev/model-b.tar.gz"
        dependsOn: 
          models:
            - "model-a"
      - name: model-c 
        async: true 
        count: 1
        instanceType: ml.m5.large
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.1.0
        s3Path: "crexi-model-storage-dev/model-c.tar.gz" 
        access:
          filesCrexiAccess: true
        textract: true 
        dependsOn:
          models:
            - "model-a"

Pipeline deployment

ML pipelines can be quickly deployed, modified, and destroyed using a continuous delivery GitHub workflow named Deploy self-service infrastructure that has been set up in a Crexi repository. After new models are tested and everything is ready in the repository, the GitHub workflow triggers deployment using Pulumi and a YAML file with the resources defined in the previous section of this post.

The Deploy self-service infrastructure workflow takes four arguments:

  1. branch
    • Description: GitHub branch to source the pipeline YAML file from
    • Input (options)
      • GitHub branch (for example, main)
  2. action
    • Description: Specifies the type of Pulumi action to run
    • Input (options):
      • up: Create or update resources
      • destroy: Tear down resources
      • preview: Preview changes without applying them
  3. environment
    • Description: Defines the environment against which the action will be executed
    • Input (options):
      • data_dev: Development environment
      • data_prod: Production environment
  4. YAML
    • Description: Path to the infrastructure YAML file that defines the resources to be managed
    • Input (string)
      • Filename of SageMaker model pipeline YAML file to deploy, modify, or destroy
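Because Deploy self-service infrastructure is a standard workflow_dispatch workflow, it can also be triggered programmatically through the GitHub REST API. The following sketch builds the dispatch request body from the four arguments above; the repository and workflow file names in the comment are hypothetical:

```python
import json


def build_dispatch_payload(branch: str, action: str, environment: str, yaml_file: str) -> str:
    """Build the request body for GitHub's workflow_dispatch REST endpoint."""
    if action not in {"up", "destroy", "preview"}:
        raise ValueError(f"unsupported Pulumi action: {action}")
    if environment not in {"data_dev", "data_prod"}:
        raise ValueError(f"unknown environment: {environment}")
    return json.dumps({
        "ref": branch,
        "inputs": {"branch": branch, "action": action,
                   "environment": environment, "YAML": yaml_file},
    })


# POST this body (with a token that has workflow scope) to, for example:
# https://api.github.com/repos/<org>/<repo>/actions/workflows/deploy-self-service.yml/dispatches
```

Validating the `action` and `environment` values client-side mirrors the option lists the workflow itself exposes, so a typo fails fast instead of starting a run.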

The following screenshot shows GitHub workflow parameters and history.

GitHub workflow

Pipeline monitoring

Pipeline monitoring for Pulumi-deployed ML pipelines uses a comprehensive Datadog dashboard (shown in the following figure) that offers extensive logging capabilities and visualizations. Key metrics and logs are collected and visualized to facilitate real-time monitoring and historical analysis. Pipeline monitoring has dramatically simplified the assessment of a given pipeline’s health status, allowing for the rapid detection of potential bottlenecks and bugs, thereby improving operation of the ML pipelines.

Monitoring dashboard

The dashboard offers several core features:

  • Error tracking: The dashboard tracks 4xx and 5xx errors in aggregate, correlating errors to specific logged events within the model pipelines, which aids in quick and effective diagnosis by providing insights into the frequency and distribution of these errors.
  • Invocation metrics for SageMaker models: The dashboard aggregates data on instance resource utilization, invocation latency, invocation failures, and endpoint backlog for the SageMaker models deployed through Pulumi, giving a detailed view of performance bottlenecks and latencies.
  • Lambda function monitoring: The dashboard monitors the success and failure rates of invocations for triggerable Lambda functions, thus delivering a holistic view of the system’s performance.

Conclusion

The ML pipeline deployment framework explored here offers a robust, scalable, and highly customizable solution that addresses Crexi’s AI/ML needs. With the power to rapidly build and deploy pipelines, experiments and new ML techniques can be tested at scale with minimal effort. The framework separates the development workflow of models from production deployments and allows teams to proactively monitor for issues. Additionally, routing model outputs to Amazon S3 supports seamless integration with Snowflake, facilitating storage and accessibility of data. This interconnected ecosystem does more than just improve current operations; it lays the groundwork for continuous innovation. The data housed in Snowflake serves as a rich resource for training new models that can be deployed quickly with new ML pipelines, enabling a cycle of improvement and experimentation that propels Crexi’s projects forward.

If you have any thoughts or questions, leave them in the comments section.

About the authors


Isaac Smothers is a Senior DevOps Engineer at Crexi. Isaac focuses on automating the creation and maintenance of robust, secure cloud infrastructure with built-in observability. Based in San Luis Obispo, he is passionate about providing self-service solutions that enable developers to build, configure, and manage their services independently, without requiring cloud or DevOps expertise. In his free time, he enjoys hiking, video editing, and gaming.

James Healy-Mirkovich is a principal data scientist at Crexi in Los Angeles. Passionate about making data actionable and impactful, he develops and deploys customer-facing AI/ML solutions and collaborates with product teams to explore the possibilities of AI/ML. Outside work, he unwinds by playing guitar, traveling, and enjoying music and movies.

Marina Novikova is a Senior Partner Solution Architect at AWS. Marina works on the technical co-enablement of AWS ISV Partners in the DevOps and Data and Analytics segments to enrich partner solutions and solve complex challenges for AWS customers. Outside of work, Marina spends time climbing high peaks around the world.

Deploy Meta Llama 3.1 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

We’re excited to announce the availability of Meta Llama 3.1 8B and 70B inference support on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. Meta Llama 3.1 multilingual large language models (LLMs) are a collection of pre-trained and instruction tuned generative models. Trainium and Inferentia, enabled by the AWS Neuron software development kit (SDK), offer high performance and lower the cost of deploying Meta Llama 3.1 by up to 50%.

In this post, we demonstrate how to deploy Meta Llama 3.1 on Trainium and Inferentia instances in SageMaker JumpStart.

What is the Meta Llama 3.1 family?

The Meta Llama 3.1 multilingual LLMs are a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text and code out). All models support a long context length (128,000 tokens) and are optimized for inference with support for grouped query attention (GQA). The Meta Llama 3.1 instruction tuned text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.

At its core, Meta Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Architecturally, the core LLM for Meta Llama 3 and Meta Llama 3.1 is the same dense architecture.

Meta Llama 3.1 also offers instruct variants, and the instruct model is fine-tuned for tool use. The model has been trained to generate calls for a few specific tools for capabilities like search, image generation, code execution, and mathematical reasoning. In addition, the model also supports zero-shot tool use.

The responsible use guide from Meta can assist you in additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.

What is SageMaker JumpStart?

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models are provisioned on dedicated SageMaker Inference instances, including Trainium and Inferentia powered instances, and are isolated within your virtual private cloud (VPC). This provides data security and compliance, because the models operate under your own VPC controls, rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of SageMaker, including SageMaker Inference for deploying models and container logs for improved observability. With SageMaker, you can streamline the entire model deployment process.

Solution overview

SageMaker JumpStart provides FMs through two primary interfaces: Amazon SageMaker Studio and the SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive interactive development environment (IDE) that offers a unified, web-based interface for performing all aspects of the machine learning (ML) development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process. In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment on SageMaker Inference.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart on the Home page.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use JumpStart models. This approach allows for greater flexibility and integration with existing AI and ML workflows and pipelines. By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI and ML development efforts, regardless of your preferred interface or workflow.

In the following sections, we demonstrate how to deploy Meta Llama 3.1 on Trainium instances using SageMaker JumpStart in SageMaker Studio for a one-click deployment and the Python SDK.

Prerequisites

To try out this solution using SageMaker JumpStart, you need the following prerequisites:

  • An AWS account that will contain all your AWS resources.
  • An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, refer to Identity and Access Management for Amazon SageMaker.
  • Access to SageMaker Studio, a SageMaker notebook instance, or an IDE such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
  • One instance of ml.trn1.32xlarge for SageMaker hosting.

Deploy Meta Llama 3.1 using the SageMaker JumpStart UI

From the SageMaker JumpStart landing page, you can browse for models, notebooks, and other resources. You can find the Meta Llama 3.1 Neuron models by searching for “3.1” or by browsing the Meta hub.

If you don’t see Meta Llama 3.1 Neuron models in SageMaker Studio Classic, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps.

In SageMaker JumpStart, you can access the Meta Llama 3.1 Neuron models listed in the following table.

Model Card Description Key Capabilities
Meta Llama 3.1 8B Neuron Llama-3.1-8B is a state-of-the-art openly accessible model that excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation supported in 10 languages. Multilingual support and stronger reasoning capabilities, enabling advanced use cases like long-form text summarization and multilingual conversational agents.
Meta Llama 3.1 8B Instruct Neuron Llama-3.1-8B-Instruct is an update to Meta-Llama-3-8B-Instruct, an assistant-like chat model, that includes an expanded 128,000 context length, multilinguality, and improved reasoning capabilities. Able to follow instructions and tasks, improved reasoning and understanding of nuances and context, and multilingual translation.
Meta Llama 3.1 70B Neuron Llama-3.1-70B is a state-of-the-art openly accessible model that excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation in 10 languages. Multilingual support and stronger reasoning capabilities, enabling advanced use cases like long-form text summarization and multilingual conversational agents.
Meta Llama 3.1 70B Instruct Neuron Llama-3.1-70B-Instruct is an update to Meta-Llama-3-70B-Instruct, an assistant-like chat model, that includes an expanded 128,000 context length, multilinguality, and improved reasoning capabilities. Able to follow instructions and tasks, improved reasoning and understanding of nuances and context, and multilingual translation.

You can choose the model card to view details about the model such as license, data used to train, and how to use.

You can also find two buttons on the model details page, Deploy and Preview notebooks, which help you use the model.

When you choose Deploy, a pop-up will show the end-user license agreement and acceptable use policy for you to acknowledge.

After you acknowledge the terms and choose Deploy, model deployment will start.

Deploy Meta Llama 3.1 using the Python SDK

Alternatively, you can deploy through the example notebook available from the model page by choosing Preview notebooks. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. For example, you can deploy a Meta Llama 3.1 70B Instruct model through SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgenerationneuron-llama-3-1-70b-instruct")
predictor = model.deploy(accept_eula=True)

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "inputs": "The color of the sky is blue but sometimes it can also be ",
    "parameters": {"max_new_tokens":256, "top_p":0.9, "temperature":0.6}
}
response = predictor.predict(payload)

The following table lists all the Meta Llama 3.1 Neuron models available in SageMaker JumpStart, along with the model_id, default instance type, and supported instance types for each model.

Model Card Model ID Default Instance Type Supported Instance Types
Meta Llama 3.1 8B Neuron meta-textgenerationneuron-llama-3-1-8b ml.inf2.48xlarge ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge
Meta Llama 3.1 8B Instruct Neuron meta-textgenerationneuron-llama-3-1-8b-instruct ml.inf2.48xlarge ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge
Meta Llama 3.1 70B Neuron meta-textgenerationneuron-llama-3-1-70b ml.trn1.32xlarge ml.trn1.32xlarge, ml.trn1n.32xlarge, ml.inf2.48xlarge
Meta Llama 3.1 70B Instruct Neuron meta-textgenerationneuron-llama-3-1-70b-instruct ml.trn1.32xlarge ml.trn1.32xlarge, ml.trn1n.32xlarge, ml.inf2.48xlarge

If you want more control of the deployment configurations, such as context length, tensor parallel degree, and maximum rolling batch size, you can modify them using environment variables. The underlying Deep Learning Container (DLC) of the deployment is the Large Model Inference (LMI) NeuronX DLC. Refer to the LMI user guide for the supported environment variables.

SageMaker JumpStart provides pre-compiled Neuron graphs for a variety of configurations of the preceding parameters to avoid runtime compilation. The configurations of the pre-compiled graphs are listed in the following table. As long as the environment variables match one of the following configurations, compilation of Neuron graphs will be skipped.

Meta Llama 3.1 8B and Meta Llama 3.1 8B Instruct
OPTION_N_POSITIONS OPTION_MAX_ROLLING_BATCH_SIZE OPTION_TENSOR_PARALLEL_DEGREE OPTION_DTYPE
8192 8 2 bf16
8192 8 4 bf16
8192 8 8 bf16
8192 8 12 bf16
8192 8 24 bf16
8192 8 32 bf16
Meta Llama 3.1 70B and Meta Llama 3.1 70B Instruct
OPTION_N_POSITIONS OPTION_MAX_ROLLING_BATCH_SIZE OPTION_TENSOR_PARALLEL_DEGREE OPTION_DTYPE
8192 8 24 bf16
8192 8 32 bf16
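
The skip condition described above amounts to a lookup against the pre-compiled configurations. The following is a minimal sketch using the values from the preceding tables; the actual JumpStart logic is internal to the service, and the function name is ours:

```python
# Pre-compiled (n_positions, batch_size, tp_degree, dtype) combinations,
# taken from the tables above.
PRECOMPILED = {
    "llama-3-1-8b": {(8192, 8, tp, "bf16") for tp in (2, 4, 8, 12, 24, 32)},
    "llama-3-1-70b": {(8192, 8, tp, "bf16") for tp in (24, 32)},
}

def compilation_skipped(model: str, env: dict) -> bool:
    """Return True if the env config matches a pre-compiled Neuron graph."""
    key = (
        int(env["OPTION_N_POSITIONS"]),
        int(env["OPTION_MAX_ROLLING_BATCH_SIZE"]),
        int(env["OPTION_TENSOR_PARALLEL_DEGREE"]),
        env["OPTION_DTYPE"],
    )
    return key in PRECOMPILED[model]
```

For example, `OPTION_TENSOR_PARALLEL_DEGREE=24` with the other defaults matches a pre-compiled graph for the 70B model, so deployment avoids runtime compilation.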

The following is an example of deploying Meta Llama 3.1 70B Instruct and setting all the available configurations:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgenerationneuron-llama-3-1-70b-instruct"

model = JumpStartModel(
    model_id=model_id,
    env={
        "OPTION_DTYPE": "bf16",
        "OPTION_N_POSITIONS": "8192",
        "OPTION_TENSOR_PARALLEL_DEGREE": "24",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
    },
    instance_type="ml.trn1.32xlarge"
)

pretrained_predictor = model.deploy(accept_eula=True)

Now that you have deployed the Meta Llama 3.1 70B Instruct model, you can run inference with it by invoking the endpoint. The following code snippet demonstrates using the supported inference parameters to control text generation:

payload = {
    "inputs": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAlways answer with Haiku<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "parameters": {
        "max_new_tokens": 256,
        "top_p": 0.9,
        "temperature": 0.6,
        "stop": "<|eot_id|>"
    }
}

response = pretrained_predictor.predict(payload)

We get the following output:

{'generated_text': "Eiffel's iron lace\nRiver Seine's gentle flow by\nMontmartre's charm calls<|eot_id|>"}

For more information on the parameters in the payload, refer to Parameters.
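Rather than hand-writing the special tokens, you can assemble the prompt with a small helper. The following sketch reproduces the Llama 3.1 chat format used in the payload above; the helper name is ours, not part of the SageMaker SDK:

```python
def format_llama31_prompt(system: str, user: str) -> str:
    """Build a Llama 3.1 chat prompt from a system and a user message.

    The special tokens follow the format shown in the payload above;
    verify against the official Meta Llama 3.1 prompt format for your
    model version.
    """
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = format_llama31_prompt(
    "Always answer with Haiku",
    "I am going to Paris, what should I see?",
)
```

The resulting string can be passed directly as the "inputs" field of the payload.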

Clean up

To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model with the following code:

pretrained_predictor.delete_predictor()

Conclusion

The deployment of Meta Llama 3.1 Neuron models on SageMaker demonstrates a significant advancement in managing and optimizing large-scale generative AI models, reducing costs by up to 50% compared to GPU-based instances. These models, including the Meta Llama 3.1 8B and 70B variants, use Neuron for efficient inference on Inferentia and Trainium based instances, enhancing their performance and scalability.

The ability to deploy these models through the SageMaker JumpStart UI and Python SDK offers flexibility and ease of use. The Neuron SDK, with its support for popular ML frameworks and high-performance capabilities, enables efficient handling of these large models.

For more information on deploying and fine-tuning pre-trained Meta Llama 3.1 models on GPU-based instances, refer to Llama 3.1 models are now available in Amazon SageMaker JumpStart and Fine-tune Meta Llama 3.1 models for generative AI inference using Amazon SageMaker JumpStart.


About the authors

Sharon Yu is a Software Development Engineer with Amazon SageMaker based in New York City.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.


AWS achieves ISO/IEC 42001:2023 Artificial Intelligence Management System accredited certification

Amazon Web Services (AWS) is excited to be the first major cloud service provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. ISO/IEC 42001 is an international management system standard that outlines requirements and controls for organizations to promote the responsible development and use of AI systems.

Responsible AI is a long-standing commitment at AWS. From the outset, AWS has prioritized responsible AI innovation and developed rigorous methodologies to build and operate our AI services with consideration for fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency.

AWS is an active stakeholder working with global standard-setting organizations to develop guidelines that play an important role in our industry by improving clarity, definitions and scope, establishing benchmarks for responsible AI practices, and focusing industry efforts on effective options for addressing risk. Our goal is to contribute to and improve AI standards across several critical areas, including risk management, data quality, unwanted bias mitigation, and explainability.

Technical standards, such as ISO/IEC 42001, are significant because they provide a common framework for responsible AI development and deployment, fostering trust and interoperability in an increasingly global and AI-driven technological landscape. Achieving ISO/IEC 42001 certification means that an independent third party has validated that AWS is taking proactive steps to manage risks and opportunities associated with AI development, deployment, and operation. This independent validation enables our customers to gain further assurances around AWS’s commitment to responsible AI and their ability to build and operate AI applications responsibly using AWS Services.

“At Snowflake, delivering AI capabilities to our customers is a top priority. Our product teams need to build and deploy AI responsibly, and have to depend upon our suppliers to do the same, despite the technical complexity. This is why ISO 42001 is important to us. Having ISO 42001 certification means a company has implemented a thoughtful AI management system. Knowing that AWS has ISO 42001 certified services gives us confidence in their commitment to the responsible development and deployment of their services, and builds trust with our own customers’ confidence in our products,” said Tim Tutt, VP US Public Sector, Snowflake.

An accredited certification, like ISO/IEC 42001, is issued by a certification body that has been recognized by a national or international accreditation authority. This demonstrates that the certification is credible, trustworthy, and based on independent verification. In this case, Schellman Compliance, LLC, an ISO certification body accredited by the ANSI National Accreditation Board (ANAB), granted the certification.


Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?

This paper was accepted at the Ninth Conference on Machine Translation (WMT24) at EMNLP 2024.
The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but…

Accelerating drug discovery with TamGen: A generative AI approach to target-aware molecule generation

Accelerating drug discovery with TamGen: A generative AI approach to target-aware molecule generation


The Global Health Drug Discovery Institute (GHDDI) and Microsoft Research have reached a milestone in tuberculosis (TB) drug research with TamGen, an open-source, transformer-based chemical language model for developing target-specific drug compounds. Working in collaboration, the joint team successfully identified several promising inhibitors for a TB protease, with the most effective compound showing significant bioactivity. Research shows that TamGen can also optimize existing molecules by designing target-aware molecule fragments, potentially enabling the discovery of novel compounds that build on a known molecular core structure.

Generative AI helps overcome limitations in drug discovery

Generative AI is opening new avenues for scientific exploration by allowing computers to autonomously learn and produce original content. TamGen offers a new approach to drug discovery by applying the principles of generative AI to molecular design. Unlike traditional methods, which depend on systematically screening known compounds—a process that is long, complex, and costly due to its reliance on empirical knowledge and the time-consuming task of exploring a vast chemical library—generative AI provides opportunities for designing entirely new chemical structures.  

TamGen goes beyond analyzing existing data by generating chemically diverse compounds that conventional approaches might miss. Figure 1 shows that generative AI expands chemical exploration, allowing for a deeper and more comprehensive search for therapeutic solutions compared to traditional methods.

Figure 1. Compared with the traditional screening-based approach to drug discovery, a generative AI-based approach enables the discovery of novel compounds. 

TamGen workflow 

TamGen’s workflow uses generative AI to design target-specific chemical compounds. Building on the success of large language models (LLMs), we adapted a similar approach for molecular generation, using a training method like that of GPT models, which involves next-token prediction. Molecules were first converted into a simplified molecular input line entry system (SMILES)—a notation representing molecular structures as symbol sequences, similar to text. We then developed a protein encoder to process information about proteins, including their 3D structure.  
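
To make the SMILES idea concrete, here is a toy sketch of turning a molecule's SMILES string into the (context, next token) pairs used for next-token prediction. The character-level tokenizer is a deliberate simplification of what a real chemical language model would use:

```python
def next_token_pairs(smiles: str):
    """Yield (prefix, next_token) training pairs from a SMILES string.

    A real model would use a chemistry-aware tokenizer (e.g. treating
    'Cl' or ring-closure digits as single tokens); character-level
    tokenization here is a simplification for illustration.
    """
    tokens = list(smiles)
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Aspirin's SMILES string, as an illustration.
pairs = next_token_pairs("CC(=O)Oc1ccccc1C(=O)O")
```

A GPT-style model is then trained to predict each next token given its prefix, exactly as in text language modeling.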

A contextual encoder combines insights from medical professionals with data on the protein target and existing compounds that have proven to be effective or promising. Using expert knowledge and computational analysis, this encoder guides the compound generator to produce new molecules that are more likely to bind to a given protein. This workflow is illustrated in Figure 2. 

Figure 2. TamGen’s workflow 

Evaluating TamGen computationally 

To evaluate TamGen’s performance, we compared it to five other common methods used to create 3D shapes of molecules intended to bind to certain proteins. We evaluated these methods using the CrossDocked benchmark, a dataset used in AI research to assess the quality of molecule generation conditioned on a target protein.

Evaluation metrics: 

  • Docking score: Measures how well a molecule binds to a target protein. 
  • Quantitative estimate of drug-likeness (QED): Assesses how good a candidate a molecule is for a drug. 
  • Synthesis accessibility score (SAS): Measures how easy or difficult it is to synthesize a particular chemical compound in a lab. 
  • Ro5 (Lipinski’s rule of five): Determines how likely a compound can be developed into an oral drug.  
  • LogP: Tests a compound’s ability to move between water and fats. 
  • Diversity: Measures the range of different molecular structures and properties in a collection of compounds.  
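
As an illustration of one of these metrics, Lipinski's rule of five can be checked directly from basic molecular properties. The thresholds below are the standard Ro5 cutoffs; computing the properties themselves would require a cheminformatics toolkit such as RDKit:

```python
def passes_ro5(mol_weight: float, logp: float,
               h_donors: int, h_acceptors: int) -> bool:
    """Count Lipinski rule-of-five violations; at most one violation is
    conventionally considered acceptable for an oral drug candidate."""
    violations = sum([
        mol_weight > 500,   # molecular weight <= 500 Da
        logp > 5,           # octanol-water partition coefficient <= 5
        h_donors > 5,       # hydrogen-bond donors <= 5
        h_acceptors > 10,   # hydrogen-bond acceptors <= 10
    ])
    return violations <= 1

# Aspirin: MW ~180 Da, logP ~1.2, 1 donor, 4 acceptors -> passes
aspirin_ok = passes_ro5(180.2, 1.2, 1, 4)
```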

The findings, illustrated in Figure 3, show TamGen’s overall performance. While other methods may produce compounds that bind more strongly, they often include multiple interconnected ring structures. Research indicates that more of these structures can lower synthesis accessibility (SAS) and increase cellular toxicity, making these compounds harder to develop. We believe that molecular pretraining of the model contributed to the overall effectiveness of the compounds TamGen generated.

Figure 3. Results from TamGen’s computational performance verification, comparing TamGen against Pocket2Mol, ResGen, TargetDiff, 3D-AR, and LiGAN on docking score, QED, Lipinski, SAS, logP, and diversity; TamGen achieved the best overall performance.

Experimental lab verification 

To ensure real-world applicability, we also validated our findings in a hands-on lab environment. Here, we focused on the ClpP protease in Mycobacterium tuberculosis as the target because it plays a significant role in the bacterium’s survival under stress conditions. We proposed the Design-Refine-Test pipeline to effectively identify molecular compounds for TB drug discovery.

Design stage: We began by using TamGen to analyze the binding pocket of the protease, where molecules can attach and influence its function. TamGen generated about 2,600 potential compounds that could fit into this pocket. We assessed these compounds based on how well they could attach to the protease and their predicted biological effects, narrowing it down to four promising candidates. 

Refine stage: Next, we entered the four compounds into TamGen, along with three molecular fragments that had been validated in previous lab experiments. This generated a total of 8,600 new compounds, which we screened again using the same criteria, eventually narrowing the selection to 296 compounds.

Test stage: Because synthesizing all 296 compounds wasn’t feasible, we identified similar compounds available in commercial libraries and tested their initial activity against TB. Five compounds showed promising results. We then synthesized one of the originals and two variants of another. Additionally, we categorized the generated compounds into clusters, selected the top 10% from each cluster based on docking scores, and after manual review, synthesized eight more compounds. 
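
The selection step in the test stage (cluster the compounds, then take the top 10% of each cluster by docking score) can be sketched as follows. The compound data here is hypothetical, and lower docking scores are treated as better, as is conventional:

```python
from collections import defaultdict

def select_top_per_cluster(compounds, fraction=0.10):
    """Pick the best `fraction` of each cluster by docking score
    (lower is better). `compounds` is a list of
    (compound_id, cluster_id, docking_score) tuples."""
    clusters = defaultdict(list)
    for cid, cluster, score in compounds:
        clusters[cluster].append((score, cid))
    selected = []
    for members in clusters.values():
        members.sort()  # ascending: best (lowest) scores first
        keep = max(1, int(len(members) * fraction))
        selected.extend(cid for _, cid in members[:keep])
    return selected

# Hypothetical example: two small clusters of generated compounds.
compounds = [("c1", "A", -9.2), ("c2", "A", -7.5), ("c3", "A", -8.1),
             ("c4", "B", -6.9), ("c5", "B", -10.3)]
picked = select_top_per_cluster(compounds)
```

With small clusters the `max(1, ...)` guard keeps at least one candidate per cluster, which is why the example selects the single best compound from each.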

The Microsoft Research team generated the compounds with TamGen, and the GHDDI team conducted binding analysis, structure–activity relationship studies, and lab experiments to verify the compounds’ inhibitory effect on the ClpP protease, measuring their capacity to interfere with or reduce its activity. Lower IC50 values signify greater potency. Out of the 16 compounds tested, 14 showed strong inhibitory activity measuring under 40 µM, indicating high potential. The most effective compound had a measured IC50 value of 1.88 µM.

Figure 4. The hands-on lab verification process  

From molecule to fragment generation 

In addition to generating new molecules, TamGen can optimize existing ones by designing smaller molecular fragments. In this fragment generation process, TamGen builds on a given protein target and a molecular core structure to design new compounds around that core. By incorporating information about the target protein, it generates fragments that are highly specific to the target. This approach moves beyond traditional methods that rely on pre-existing databases, which often limit both novelty and effectiveness of molecular fragments.

For fragment generation, we adjusted the input to TamGen’s compound generator. We modified the SMILES string to ensure it ended at the desired growth site. This was done by specifying the fragment we wanted to retain and its connection point for further growth. The tailored SMILES string was then fed into the compound generator to extend the molecule. 

We evaluated this method by targeting the ClpP protease for TB, achieving a more than tenfold improvement in the binding affinity of the generated compound compared to the original. Some compounds also demonstrated slow binding, indicating potential for prolonged action and improved selectivity for the target protein.

AI’s potential in drug discovery 

TamGen showcases the transformative potential of generative AI in drug design, combining advanced molecular modeling with researcher-AI collaboration. Tasks that once took years can now be accomplished in a fraction of the time. This research underscores AI’s expanding role in drug discovery and its promise for developing effective treatments against persistent infectious diseases like TB. 

Looking ahead, we plan to integrate advanced techniques into TamGen, including diffusion models for generating 3D structures, reinforcement learning to apply physical constraints, and molecular dynamics simulations to capture proteins’ shifting shapes. These enhancements aim to improve how well generated compounds bind to target proteins, increase their ability to be synthesized, and strengthen other critical drug properties.

