Oracle Cloud Infrastructure Offers New NVIDIA GPU-Accelerated Compute Instances

With generative AI and large language models (LLMs) driving groundbreaking innovations, the computational demands for training and inference are skyrocketing.

These modern-day generative AI applications demand full-stack accelerated compute, starting with state-of-the-art infrastructure that can handle massive workloads with speed and accuracy. To help meet this need, Oracle Cloud Infrastructure today announced general availability of NVIDIA H100 Tensor Core GPUs on OCI Compute, with NVIDIA L40S GPUs coming soon.

NVIDIA H100 Tensor Core GPU Instance on OCI

The OCI Compute bare-metal instances with NVIDIA H100 GPUs, powered by the NVIDIA Hopper architecture, enable an order-of-magnitude leap for large-scale AI and high-performance computing, with unprecedented performance, scalability and versatility for every workload.

Organizations using NVIDIA H100 GPUs obtain up to a 30x increase in AI inference performance and a 4x boost in AI training compared with the NVIDIA A100 Tensor Core GPU. The H100 GPU is designed for resource-intensive computing tasks, including training LLMs and running inference with them.

The BM.GPU.H100.8 OCI Compute shape includes eight NVIDIA H100 GPUs, each with 80GB of HBM2 GPU memory. Between the eight GPUs, 3.2TB/s of bisection bandwidth enables each GPU to communicate directly with all seven other GPUs via NVIDIA NVSwitch and NVLink 4.0 technology. The shape also includes 16 local NVMe drives with a capacity of 3.84TB each, 4th Gen Intel Xeon processors with 112 cores, and 2TB of system memory.
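As a quick sanity check after provisioning such an instance, a short script (a hedged sketch assuming a CUDA-enabled PyTorch build is installed) can confirm that all eight GPUs are visible:

import torch

# Expect 8 devices on a BM.GPU.H100.8 shape; names and memory come from the driver.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
print(f"Visible GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")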

In a nutshell, this shape is optimized for organizations’ most challenging workloads.

Depending on timelines and sizes of workloads, OCI Supercluster allows organizations to scale their NVIDIA H100 GPU usage from a single node to up to tens of thousands of H100 GPUs over a high-performance, ultra-low-latency network.

NVIDIA L40S GPU Instance on OCI

The NVIDIA L40S GPU, based on the NVIDIA Ada Lovelace architecture, is a universal GPU for the data center, delivering breakthrough multi-workload acceleration for LLM inference and training, visual computing and video applications. The OCI Compute bare-metal instances with NVIDIA L40S GPUs will be available for early access later this year, with general availability coming early in 2024.

These instances will offer an alternative to the NVIDIA H100 and A100 GPU instances for tackling smaller- to medium-sized AI workloads, as well as for graphics and video compute tasks. The NVIDIA L40S GPU achieves up to a 20% performance boost for generative AI workloads and as much as a 70% improvement in fine-tuning AI models compared with the NVIDIA A100.

The BM.GPU.L40S.4 OCI Compute shape includes four NVIDIA L40S GPUs, along with the latest-generation Intel Xeon CPU with up to 112 cores, 1TB of system memory, 15.36TB of low-latency NVMe local storage for caching data and 400GB/s of cluster network bandwidth. This instance was created to tackle a wide range of use cases, ranging from LLM training, fine-tuning and inference to NVIDIA Omniverse workloads and industrial digitalization, 3D graphics and rendering, video transcoding and FP32 HPC.

NVIDIA and OCI: Enterprise AI

This collaboration between OCI and NVIDIA will enable organizations of all sizes to join the generative AI revolution by providing them with state-of-the-art NVIDIA H100 and L40S GPU-accelerated infrastructure.

Access to NVIDIA GPU-accelerated instances may not be enough, however. Unlocking the maximum potential of NVIDIA GPUs on OCI Compute means having an optimal software layer. NVIDIA AI Enterprise streamlines the development and deployment of enterprise-grade accelerated AI software with open-source containers and frameworks optimized for the underlying NVIDIA GPU infrastructure, all with the help of support services.

To learn more, join NVIDIA at Oracle CloudWorld in the AI Pavilion, attend this session on the new OCI instances on Wednesday, Sept. 20, and visit these web pages on Oracle Cloud Infrastructure, OCI Compute, how Oracle approaches AI and the NVIDIA AI Platform.

Read More

Meet the Omnivore: Industrial Designer Blends Art and OpenUSD to Create 3D Assets for AI Training

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use NVIDIA Omniverse and OpenUSD to accelerate their 3D workflows and create virtual worlds.

As a student at the Queensland University of Technology (QUT) in Australia, Emily Boehmer was torn between pursuing the creative arts or science.

And then she discovered industrial design, which allowed her to dive into research and coding while exploring visualization workflows like sketching, animation and 3D modeling.

Now, Boehmer is putting her skills to practice as a design intern at BMW Group’s Technology Office in Munich. The team uses NVIDIA Omniverse, a platform for developing and connecting 3D tools and applications, and Universal Scene Description — aka OpenUSD — to enhance its synthetic data generation pipelines.

Boehmer creates realistic 3D assets that can be used with SORDI.ai, short for Synthetic Object Recognition Dataset for Industries. Published by BMW Group, Microsoft and NVIDIA, SORDI.ai helps developers and researchers streamline and accelerate the training of AI for production. To automate image generation, the team developed an extension based on Omniverse Replicator, a software development kit for creating custom synthetic data generation tools.

As part of the SORDI.ai team, Boehmer uses Blender and Adobe Substance Painter to design 3D assets with high levels of physical accuracy and photorealism, helping ensure that synthetic data can be used to efficiently train AI models.

All the assets Boehmer creates are used to test and simulate autonomous robots on the NVIDIA Isaac Sim platform, which provides developers a suite of synthetic data generation capabilities that can power photorealistic, physically accurate virtual environments.

Creating Realistic 3D Assets for Training AI 

As a design intern, Boehmer’s main tasks are animation and 3D modeling. The process starts with taking photos of target objects. Then, she uses the 2D photos as references by lining them up with the 3D models in Blender.

3D objects can consist of thousands of polygons, so Boehmer creates two versions of the asset — one with a low number of polygons and one with a higher polygon count. The details of the high-poly version can be baked onto the low-poly model, helping maintain more details so the asset looks realistic.

Once the 3D assets are created, Boehmer uses the models to start assembling scenes. Her favorite aspect of the Omniverse platform is the flexibility of USD, because it allows her to easily make changes to 3D models.

USD workflows have enabled the BMW Group’s design teams to create many different scenes using the same components, as they can easily access all the USD files stored on Omniverse Nucleus. When creating portions of a scene, Boehmer pulls from dozens of USD models from SORDI.ai and adds them into scenes that will be used by other designers to assemble larger factory scenes.

Boehmer only has to update the USD file of the original asset to automatically apply changes to all reference files containing it.
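To illustrate this referencing workflow, the following is a minimal sketch using the OpenUSD Python API (pxr); the file names and prim paths are hypothetical:

from pxr import Usd, UsdGeom

# A scene references the asset instead of copying it.
stage = Usd.Stage.CreateNew("factory_scene.usda")
asset_prim = stage.DefinePrim("/Factory/Bin", "Xform")
asset_prim.GetReferences().AddReference("sordi_bin.usd")
UsdGeom.XformCommonAPI(asset_prim).SetTranslate((2.0, 0.0, 0.5))
stage.GetRootLayer().Save()

# Any later edit saved to sordi_bin.usd is picked up automatically by every
# stage that references it, which is the behavior described above.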

“It’s great to see USD support for both Blender and Substance Painter,” she said. “When I create 3D assets using USD, I can be confident that they’ll look and behave as expected in the scenes they’ll be placed in.”

Emily Boehmer’s creative process starts with photographing the object, then using that image as a reference to build and texture 3D models.

Building Factory Scenes With Synthetic Data

The Isaac Sim platform is a key part of the SORDI.ai team’s workflow. It’s used to develop pipelines that use generative AI and procedural algorithms for 3D scene generation. The team also developed an extension based on Omniverse Replicator that automates randomization within a scene when generating synthetic images.

“The role of design interns like me is to realistically model and texture the assets used for scenes built in Isaac Sim,” Boehmer said. “The more realistic the assets are, the more realistic the synthetic images can be and the more effective they are for training AI models for real scenarios.”

Data annotation — the process of labeling data like images, text, audio or video with relevant tags — makes it easier for AI to understand the data, but the manual process can be incredibly time-consuming, especially for large quantities of content. SORDI.ai addresses these challenges by using synthetic data to train AI.

When importing assets into Omniverse and creating USD versions of the files, Boehmer tags them with the appropriate data label. Once these assets have been put together in a scene, she can use Omniverse Replicator to generate images that are automatically annotated using the original labels.

And using SORDI.ai, designers can set up scenes and generate thousands of annotated images with just one click.
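For a rough idea of how that generation step looks in code, the following sketch uses the Omniverse Replicator API to load a labeled USD asset, randomize its pose, and write annotated renders; the asset path, label, and output directory are placeholders, and the exact calls can vary by Replicator version:

import omni.replicator.core as rep

with rep.new_layer():
    # Load a USD asset and attach the semantic label used for annotation.
    bin_asset = rep.create.from_usd(
        "omniverse://localhost/Projects/sordi_bin.usd",
        semantics=[("class", "bin")],
    )
    camera = rep.create.camera(position=(0, 2, 2), look_at=bin_asset)
    render_product = rep.create.render_product(camera, (1024, 1024))

    # Randomize the asset pose on every generated frame.
    with rep.trigger.on_frame(num_frames=100):
        with bin_asset:
            rep.modify.pose(
                position=rep.distribution.uniform((-1, -1, 0), (1, 1, 0)),
                rotation=rep.distribution.uniform((0, 0, 0), (0, 0, 360)),
            )

    # Write RGB images plus 2D bounding boxes derived from the semantic labels.
    writer = rep.WriterRegistry.get("BasicWriter")
    writer.initialize(output_dir="_sordi_output", rgb=True, bounding_box_2d_tight=True)
    writer.attach([render_product])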

Boehmer will be a guest on an Omniverse livestream on Wednesday, Sept. 20, where she’ll demonstrate how she uses Blender and Substance Painter in Omniverse for synthetic image generation pipelines.

Join In on the Creation

Anyone can build their own Omniverse extension or Connector to enhance their 3D workflows and tools. Creators and developers can download Omniverse for free, and enterprise teams can use the platform for their 3D projects.

Check out artwork from other “Omnivores” and submit projects in the gallery. See how creators are using OpenUSD to accelerate a variety of 3D workflows in the latest OpenUSD All Stars. And connect workflows to Omniverse with software from Adobe, Autodesk, Blender, Epic Games, Reallusion and more.

Get started with NVIDIA Omniverse by downloading the standard license free, or learn how Omniverse Enterprise can connect your team. Developers can get started with Omniverse resources and learn about OpenUSD. Explore the growing ecosystem of 3D tools connected to Omniverse.

Stay up to date on the platform by subscribing to the newsletter, and follow NVIDIA Omniverse on Instagram, Medium and Twitter. For more, join the Omniverse community and check out the Omniverse forums, Discord server, Twitch and YouTube channels.

Read More

Generative AI and multi-modal agents in AWS: The key to unlocking new value in financial markets

Multi-modal data is a valuable component of the financial industry, encompassing market, economic, customer, news and social media, and risk data. Financial organizations generate, collect, and use this data to gain insights into financial operations, make better decisions, and improve performance. However, there are challenges associated with multi-modal data due to the complexity and lack of standardization in financial systems and data formats and quality, as well as the fragmented and unstructured nature of the data. Financial clients have frequently described the operational overhead of gaining financial insights from multi-modal data, which necessitates complex extraction and transformation logic, leading to bloated effort and costs. Technical challenges with multi-modal data further include the complexity of integrating and modeling different data types, the difficulty of combining data from multiple modalities (text, images, audio, video), and the need for advanced computer science skills and sophisticated analysis tools.

One of the ways to handle multi-modal data that is gaining popularity is the use of multi-modal agents. Multi-modal agents are AI systems that can understand and analyze data in multiple modalities using the right tools in their toolkit. They are able to connect insights across these diverse data types to gain a more comprehensive understanding and generate appropriate responses. Multi-modal agents, in conjunction with generative AI, are finding widespread application in financial markets. The following are a few popular use cases:

  • Smart reporting and market intelligence – AI can analyze various sources of financial information to generate market intelligence reports, aiding analysts, investors, and companies to stay updated on trends. Multi-modal agents can summarize lengthy financial reports quickly, saving analysts significant time and effort.
  • Quantitative modeling and forecasting – Generative models can synthesize large volumes of financial data to train machine learning (ML) models for applications like stock price forecasting, portfolio optimization, risk modeling, and more. Multi-modal models that understand diverse data sources can provide more robust forecasts.
  • Compliance and fraud detection – This solution can be extended to include monitoring tools that analyze communication channels like calls, emails, chats, access logs, and more to identify potential insider trading or market manipulation. Detecting fraudulent collusion across data types requires multi-modal analysis.

A multi-modal agent with generative AI boosts the productivity of a financial analyst by automating repetitive and routine tasks, freeing time for analysts to focus on high-value work. Multi-modal agents can amplify an analyst’s ability to gain insights by assisting with research and analysis. Multi-modal agents can also generate enhanced quantitative analysis and financial models, enabling analysts to work faster and with greater accuracy.

Implementing a multi-modal agent with AWS consolidates key insights from diverse structured and unstructured data on a large scale. Multi-modal agents can easily combine the power of generative AI offerings from Amazon Bedrock and Amazon SageMaker JumpStart with the data processing capabilities from AWS Analytics and AI/ML services to provide agile solutions that enable financial analysts to efficiently analyze and gather insights from multi-modal data in a secure and scalable manner within AWS. Amazon offers a suite of AI services that enable natural language processing (NLP), speech recognition, text extraction, and search:

  • Amazon Comprehend is an NLP service that can analyze text for key phrases and analyze sentiment
  • Amazon Textract is an intelligent document processing service that can accurately extract text and data from documents
  • Amazon Transcribe is an automatic speech recognition service that can convert speech to text
  • Amazon Kendra is an enterprise search service powered by ML to find the information across a variety of data sources, including documents and knowledge bases

In this post, we showcase a scenario where a financial analyst interacts with the organization’s multi-modal data, residing on purpose-built data stores, to gather financial insights. In the interaction, we demonstrate how multi-modal agents plan and run the user query and retrieve the results from the relevant data sources. All this is achieved using AWS services, thereby increasing the financial analyst’s efficiency to analyze multi-modal financial data (text, speech, and tabular data) holistically.

The following screenshot shows an example of the UI.

Solution overview

The following diagram illustrates the conceptual architecture to use generative AI with multi-modal data using agents. The steps involved are as follows:

  1. The financial analyst poses questions via a platform such as chatbots.
  2. The platform uses a framework to determine the most suitable multi-modal agent tool to answer the question.
  3. Once identified, the platform runs the code that is linked to the previously identified tool.
  4. The tool generates an analysis of the financial data as requested by the financial analyst.
  5. The large language models within the agent summarize the results and report them back to the financial analyst.

Technical architecture

The multi-modal agent orchestrates various tools based on natural language prompts from business users to generate insights. For unstructured data, the agent uses AWS Lambda functions with AI services such as Amazon Textract for document analysis, Amazon Transcribe for speech recognition, Amazon Comprehend for NLP, and Amazon Kendra for intelligent search. For structured data, the agent uses the SQL Connector and SQLAlchemy to analyze databases, which includes Amazon Athena. The agent also utilizes Python in Lambda and the Amazon SageMaker SDK for computations and quantitative modeling. The agent also has long-term memory for storing prompts and results in Amazon DynamoDB. The multi-modal agent resides in a SageMaker notebook and coordinates these tools based on English prompts from business users in a Streamlit UI.
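For the structured-data path, a minimal sketch of pointing SQLAlchemy at Athena (using the PyAthena driver; the bucket, database, and Region here are placeholders) looks like the following:

import pandas as pd
from sqlalchemy import create_engine

# PyAthena provides the "awsathena+rest" SQLAlchemy dialect; s3_staging_dir is
# where Athena writes query results.
engine = create_engine(
    "awsathena+rest://@athena.us-east-1.amazonaws.com:443/"
    "blog-stock-prices-db?s3_staging_dir=s3://<athena-results-bucket>/"
)

stock_prices = pd.read_sql("SELECT date, aaaa, www, ddd FROM stock_prices LIMIT 10", engine)
print(stock_prices.head())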

The key components of the technical architecture are as follows:

  • Data storage and analytics – The quarterly financial earning recordings as audio files, financial annual reports as PDF files, and S&P stock data as CSV files are hosted on Amazon Simple Storage Service (Amazon S3). Data exploration on stock data is done using Athena.
  • Large language models – The large language models (LLMs) are available via Amazon Bedrock, SageMaker JumpStart, or an API.
  • Agents – We use LangChain agents, which make a non-predetermined chain of calls to LLMs and other tools depending on the user input. In these types of chains, there is an agent with access to a suite of tools, each built for a specific task. Depending on the user input, the agent decides which tool or combination of tools to call to answer the question. We created the following purpose-built agent tools for our scenario:
    • Stocks Querying Tool – To query S&P stocks data using Athena and SQLAlchemy.
    • Portfolio Optimization Tool – To build a portfolio based on the chosen stocks.
    • Financial Information Lookup Tool – To search for financial earnings information stored in multi-page PDF files using Amazon Kendra.
    • Python Calculation Tool – To use for mathematical calculations.
    • Sentiment Analysis Tool – To identify and score sentiments on a topic using Amazon Comprehend.
    • Detect Phrases Tool – To find key phrases in recent quarterly reports using Amazon Comprehend.
    • Text Extraction Tool – To convert the PDF versions of quarterly reports to text files using Amazon Textract.
    • Transcribe Audio Tool – To convert audio recordings to text files using Amazon Transcribe.

The agent memory that holds the chain of user interactions with the agent is saved in DynamoDB.

The following sections explain some of the primary steps with associated code. To dive deeper into the solution and code for all the steps shown here, refer to the GitHub repo.

Prerequisites

To run this solution, you must have an API key to an LLM such as Anthropic Claude 2, or have access to Amazon Bedrock foundation models.

To generate responses from structured and unstructured data using LLMs and LangChain, you need access to LLMs through either Amazon Bedrock, SageMaker JumpStart, or API keys, and databases that are compatible with SQLAlchemy. AWS Identity and Access Management (IAM) policies are also required, the details of which you can find in the GitHub repo.

Key components of a multi-modal agent

There are a few key components of the multi-modal agent:

  • Functions defined for tools of the multi-modal agent
  • Tools defined for the multi-modal agent
  • Long-term memory for the multi-modal agent
  • Planner-executor based multi-modal agent (defined with tools, LLMs, and memory)

In this section, we illustrate the key components with associated code snippets.

Functions defined for tools of the multi-modal agent

The multi-modal agent needs to use various AI services to process different types of data—text, speech, images, and more. Some of these functions may need to call AWS AI services like Amazon Comprehend to analyze text, Amazon Textract to analyze images and documents, and Amazon Transcribe to convert speech to text. These functions can either be called locally within the agent or deployed as Lambda functions that the agent can invoke. The Lambda functions internally call the relevant AWS AI services and return the results to the agent. This approach modularizes the logic and makes the agent more maintainable and extensible.

The following function defines how to calculate the optimized portfolio based on the chosen stocks. One way to convert a Python-based function to an LLM tool is to use the BaseTool wrapper.

from typing import List

import boto3
from langchain.tools import BaseTool


class OptimizePortfolio(BaseTool):

    name = "Portfolio Optimization Tool"
    description = """
    use this tool when you need to build optimal portfolio or for optimization of stock price.
    The stock_ls should be a list of stock symbols, such as ['WWW', 'AAA', 'GGGG'].
    """

    def _run(self, stock_ls: List):
        # Query the historical prices for the requested stocks from Athena.
        session = boto3.Session(region_name=region_name)
        athena_client = session.client('athena')

        database = database_name
        table = table_Name
        ...

The following is the code for Lambda calling the AWS AI service (Amazon Comprehend, Amazon Textract, Amazon Transcribe) APIs:

import json

import boto3


def SentimentAnalysis(inputString):
    print(inputString)
    lambda_client = boto3.client('lambda')
    # Build the event payload expected by the Lambda function.
    lambda_payload = {"inputString": inputString}
    response = lambda_client.invoke(
        FunctionName='FSI-SentimentDetecttion',
        InvocationType='RequestResponse',
        Payload=json.dumps(lambda_payload),
    )
    print(response['Payload'].read())
    return response
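On the Lambda side, a handler along these lines (a hedged sketch; the function deployed by the repo may differ) calls Amazon Comprehend and returns the sentiment:

import json

import boto3

comprehend = boto3.client("comprehend")


def lambda_handler(event, context):
    # The agent passes the text to analyze in the invocation payload.
    text = event.get("inputString", "")
    result = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    return {
        "statusCode": 200,
        "body": json.dumps(
            {"Sentiment": result["Sentiment"], "SentimentScore": result["SentimentScore"]}
        ),
    }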

Tools defined for the multi-modal agent

The multi-modal agent has access to various tools to enable its functionality. It can query a stocks database to answer questions on stocks. It can optimize a portfolio using a dedicated tool. It can retrieve information from Amazon Kendra, Amazon’s enterprise search service. A Python REPL tool allows the agent to run Python code. An example of the structure of the tools, including their names and descriptions, is shown in the following code. The actual toolbox for this post has eight tools: Stocks Querying Tool, Portfolio Optimization Tool, Financial Information Lookup Tool, Python Calculation Tool, Sentiment Analysis Tool, Detect Phrases Tool, Text Extraction Tool, and Transcribe Audio Tool.

tools = [
    Tool(
        name="Financial Information Lookup Tool",
        func=run_chain,
        description="""
        Useful for when you need to look up financial information using Kendra.
        """,
    ),
    Tool(
        name="Sentiment Analysis Tool",
        func=SentimentAnalysis,
        description="""
        Useful for when you need to analyze the sentiment of a topic.
        """,
    ),
    Tool(
        name="Detect Phrases Tool",
        func=DetectKeyPhrases,
        description="""
        Useful for when you need to detect key phrases in recent quarterly reports.
        """,
    ),
    ...
]

Long-term memory for the multi-modal agent

The following code illustrates the configuration of long-term memory for the multi-modal agent. In this code, a DynamoDB table is added as memory to store prompts and answers for future reference.

chat_history_table = dynamodb_table_name

chat_history_memory = DynamoDBChatMessageHistory(
    table_name=chat_history_table, session_id=chat_session_id
)
memory = ConversationBufferMemory(
    memory_key="chat_history",
    chat_memory=chat_history_memory,
    return_messages=True,
)

Planner-executor based multi-modal agent

The planner-executor based multi-modal agent architecture has two main components: a planner and an executor. The planner generates a high-level plan with steps required to run and answer the prompt question. The executor then runs this plan by generating appropriate system responses for each plan step using the language model with necessary tools. See the following code:

llm = ChatAnthropic(temperature=0, anthropic_api_key=ANTHROPIC_API_KEY, max_tokens_to_sample = 512)
model = llm

planner = load_chat_planner(model)

system_message_prompt = SystemMessagePromptTemplate.from_template(combo_template)
human_message_prompt = planner.llm_chain.prompt.messages[1]
planner.llm_chain.prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

executor = load_agent_executor(model, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor, verbose=True, max_iterations=2)
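With the planner and executor wired together, the agent can be invoked with a natural language question; a minimal usage sketch:

# The agent plans the steps and decides which tools to call for each one.
question = (
    "What are the closing prices of stocks AAAA, WWW, DDD in year 2018? "
    "Can you build an optimized portfolio using these three stocks?"
)
response = agent.run(question)
print(response)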

Example scenarios based on questions asked by a financial analyst

In this section, we explore two example scenarios that illustrate the end-to-end steps performed by the multi-modal agent based on questions asked by a financial analyst.

Scenario 1: Questions by financial analyst related to structured data

In this scenario, the financial analyst asks a question in English related to companies’ stocks to the multi-modal agent. The multi-modal LangChain agent comes up with a multi-step plan and decides what tools to use for each step. The following diagram illustrates an example workflow with the following steps:

  1. The financial analyst asks a financial question in English through the UI to the multi-modal agent.
  2. The agent identifies that it requires the database tool to answer the question. It generates a SQL query using an LLM based on the question and queries the Athena database.
  3. Athena runs the SQL query, retrieves the relevant result (stock price time series of the five companies), and passes the result with relevant data to the agent.
  4. The agent identifies that it requires a second tool to answer the question. It passes the retrieved data to the Python tool for portfolio optimization.
  5. The Python tool calculates the required optimal portfolio, including weights, and passes the answer to the LLM within the agent.
  6. The agent compiles the required information and calculations using the LLM and answers the financial analyst via the UI in English.

The financial analyst asks the following question:

What are the closing prices of stocks AAAA, WWW, DDD in year 2018? Can you build an optimized portfolio using these three stocks?

The following code shows the steps identified by the planner of the multi-modal agent:

steps=[Step(value='Use Stocks Querying Tool to find closing stock prices of AAAA, WWW and DDD in year 2018.'),
Step(value='Use Portfolio Optimization Tool to build an optimized portfolio using the three stocks.'),
Step(value='Provide the closing stock prices and optimized portfolio allocation.')]

The agent translated the question prompt into the following SQL query:

SELECT date, aaaa, www, ddd
FROM "blog-stock-prices-db"."stock_prices"
WHERE date BETWEEN '2018-01-01' AND '2018-12-31'
ORDER BY date DESC

The following code shows the final answer by the multi-modal agent based on the question:

The closing stock prices for AAAA, WWW and DDD in 2018 are:

AAAA: $172.44
WWW: $85.91
DDD: $9.82

The optimized portfolio allocation for stocks AAAA, WWW and DDD is:

AAAA: 55%
WWW: 45%
DDD: 0%

Scenario 2: Questions by financial analyst related to unstructured data

In this scenario, a financial analyst asks the multi-modal agent multiple questions about a company’s financial health. The agent invokes two tools, Amazon Kendra over the unstructured data and a Python compute tool, to gather information and perform the necessary calculations. The agent then compiles the information from the tools using its language model and provides the full answer to the analyst. The key aspect of this scenario is that the agent remembers the questions and answers from previous turns and incorporates that context into the conversation. The following diagram illustrates the workflow, which consists of the following steps:

  1. The financial analyst asks a financial question in English through the UI to the multi-modal agent.
  2. The agent identifies that it requires the Amazon Kendra tool to answer the question and invokes the Amazon Kendra agent tool.
  3. Amazon Kendra searches the unstructured data corpus in Amazon S3 and retrieves the relevant document containing the answer.
  4. Using Retrieval Augmented Generation (RAG), Amazon Kendra passes the answer from the relevant document to the LLM within the agent (a minimal sketch of this retrieval step follows the list).
  5. The agent identifies that it requires a second tool to answer the second part of the question. It passes information from the first tool to the Python compute tool.
  6. The Python tool calculates the required ratio and passes the answer to the LLM within the agent.
  7. The agent compiles the required information and calculations using the LLM and answers the financial analyst via the UI.
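The following is a minimal, hedged sketch of the retrieval step above, using LangChain's AmazonKendraRetriever with a RetrievalQA chain; the index ID is a placeholder and the repo's actual implementation may differ:

from langchain.chains import RetrievalQA
from langchain.retrievers import AmazonKendraRetriever

# Retrieve passages from the Kendra index that holds the financial filings.
retriever = AmazonKendraRetriever(index_id="<kendra-index-id>", region_name=region_name)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,                # the same LLM used by the agent
    chain_type="stuff",     # pass retrieved passages directly into the prompt
    retriever=retriever,
    return_source_documents=True,
)

result = qa_chain("What is Amazon's net loss for fiscal year ending in December 2022?")
print(result["result"])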

The following are questions and answers between the analyst and multi-modal agent.

The agent answers the question “What is Amazon’s net loss for fiscal year ending in December 2022?” with “Amazon reported total net sales of $514.0 billion and a net loss of $2.7 billion for fiscal year 2022.” The source is the Amazon 10K report for the fiscal year ending December 2022, retrieved using the Retrieve_Kendra tool.

For the question “What is the net loss for the same year?” the agent answers “$2.7 billion.” The source is the Amazon 10K report for the fiscal year ending December 2022, retrieved using the Retrieve_Kendra tool. Note that although this question is similar to the previous one, it didn’t specify the company name or the year. The agent was still able to use the correct company name and year based on the context in its conversation memory.

For the question “What is the formula for Net Profit Margin Ratio which is a type of profitability ratio ?” the agent answers “Amazon reported total net sales of $514.0 billion and a net loss of $2.7 billion for fiscal year 2022. The Net Profit Margin Ratio for Amazon for fiscal year 2022 is -0.5%.” The source is Retrieve_Kendra and PythonREPLTool().

Dive deeper into the solution

To dive deeper into the solution and the code shown in this post, check out the GitHub repo.

In the GitHub repo, you will be able to find a Python notebook that has the end-to-end solution, an AWS CloudFormation template for provisioning the infrastructure, unstructured data (earnings reports PDF files, earnings call audio files), and structured data (stocks time series).

The appendix at the end lists different questions asked by the financial analyst, the agent tools invoked, and the answers from the multi-modal agent.

Clean up

After you run the multi-modal agent, make sure to clean up any resources that won’t be used. Delete the Athena databases and tables that were created. In addition, delete the data in Amazon S3 and stop any SageMaker Studio notebook instances to avoid incurring further charges. Delete the Lambda functions and the DynamoDB tables used for long-term memory if they are no longer needed. If you used SageMaker JumpStart to deploy an LLM as a SageMaker real-time endpoint, delete the endpoint through either the SageMaker console or SageMaker Studio.

Conclusion

This post demonstrated the wide range of AWS storage, AI/ML, and compute services that you can use to build an advanced multi-modal AI solution along with the LangChain framework and generative AI. By integrating NLP, speech recognition, and ML technologies, the multi-modal agent developed in this scenario is able to analyze diverse financial data modalities (structured and unstructured) and respond to prompts in natural language (English). This boosts the productivity and efficiency of financial analysts, allowing them to focus on high-value, strategic work. Multi-modal AI solutions powered by AWS have the potential to transform financial markets by automating repetitive tasks, enhancing human capabilities, and generating data-driven insights efficiently. For structured data, a modern data architecture with multiple databases can be incorporated with generative AI and LLM. For more details, refer to Reinventing the data experience: Use generative AI and modern data architecture to unlock insights. Also, refer to Amazon Bedrock for use cases on generative AI, foundation models, and large language models.

Appendix

The following examples show questions a financial analyst might ask the multi-modal agent based on structured and unstructured data, along with the agent tools invoked, the primary AWS services used by those tools, and the agent’s answers.

Example 1

  • Question purpose: Prompt the LangChain agent to preprocess and extract information from the collected data
  • Question asked through the UI: Convert Amazon's recent quarters 10K and 10Q files to text format
  • Agent tools invoked: Text Extraction Tool
  • Primary AWS services used: Amazon Textract, AWS Lambda, Amazon S3
  • Answer from the multi-modal agent: The text files containing Amazon's financial filings have been cleaned and are ready for analysis.

Example 2

  • Question purpose: Prompt the LangChain agent to transcribe the audio recordings to text
  • Question asked through the UI: Can you transcribe the most recent quarterly audio recording of the earnings call to text format?
  • Agent tools invoked: Transcribe Audio Tool
  • Primary AWS services used: Amazon Transcribe, AWS Lambda, Amazon S3
  • Answer from the multi-modal agent: I have the earnings call transcripts, now I can extract key phrases.

Example 3

  • Question purpose: Prompt the LangChain agent to analyze the company’s financial health and performance
  • Question asked through the UI: Analyze Amazon's financial health based on the 10K report. Calculate important financial ratios. Please limit your answer to 5 sentences.
  • Agent tools invoked: Financial Information Lookup Tool, Python Calculation Tool
  • Primary AWS services used: Amazon Kendra, AWS Lambda, Amazon S3
  • Answer from the multi-modal agent: Here is a summary of Amazon’s financial health in 5 sentences: Amazon has a low profit margin indicating low profitability relative to revenue. However, Amazon has a high return on assets indicating efficient use of assets to generate profits. Amazon has a low debt-to-equity ratio indicating low financial leverage and risk. Amazon has a high current ratio indicating good short-term liquidity position and ability to pay off short-term obligations. Overall, Amazon seems to be in a good financial position with efficient operations and low risk.

Example 4

  • Question purpose: Prompt the LangChain agent to analyze the company’s financial performance from the company’s earnings reports
  • Question asked through the UI: What were the main drivers of the company’s revenue growth in the quarter? Please limit your answers to 5 sentences.
  • Agent tools invoked: Detect Phrases Tool, Financial Information Lookup Tool
  • Primary AWS services used: Amazon Comprehend, Amazon Kendra, AWS Lambda, Amazon S3
  • Answer from the multi-modal agent: Here is a summary of the main drivers of revenue growth in 5 sentences: Amazon's revenue grew due to an increase in subscription and support revenue from new customers. Existing customer accounts also expanded, contributing to revenue growth. Higher net sales drove increased revenue for the company. Although I could not calculate the exact revenue growth percentage, the key phrases and information extracted point to revenue growth being driven primarily by new subscriptions, existing customer expansion, and greater net sales.

Example 5

  • Question purpose: Prompt the LangChain agent to analyze the company’s management team and their plans for the future from the company’s earnings reports
  • Question asked through the UI: What are the company's top priorities for the coming year? What are the biggest risks facing the company? Please limit your answers to 5 sentences.
  • Agent tools invoked: Sentiment Analysis Tool, Detect Phrases Tool, Financial Information Lookup Tool
  • Primary AWS services used: Amazon Comprehend, Amazon Kendra, AWS Lambda, Amazon S3
  • Answer from the multi-modal agent: Here is a summary response in 5 sentences: Amazon's priorities are investing in growth and expanding into new markets. However, risks like changes in consumer demand and competition threaten their business. By analyzing key phrases in financial reports, their focus on fulfillment, technology, and new products emerges, as do risks around suppliers, exchange rates, and retail competition. Though facing pressures, Amazon aims to innovate, improve customer experience, and strengthen their position. Continuous progress on price, speed, and selection will drive their competitive advantage.

Example 6

  • Question purpose: Prompt the LangChain agent to analyze the company’s strategy from the company’s earnings reports
  • Question asked through the UI: What are the company’s financial and business goals for the upcoming financial year/quarter? Please limit your answers to 5 sentences.
  • Agent tools invoked: Financial Information Lookup Tool, Detect Phrases Tool
  • Primary AWS services used: Amazon Comprehend, Amazon Kendra, AWS Lambda, Amazon S3
  • Answer from the multi-modal agent: Here is a summary of Amazon's main goals in 5 sentences: Amazon seeks to grow by improving infrastructure and entering new markets. They will invest in fulfillment and technology to enable expansion. By moving into groceries, healthcare, and entertainment, Amazon can diversify their offerings. Focusing on the customer through competitive pricing, selection, and innovation is key. If Amazon achieves operational excellence and continuous progress, they will strengthen their position.

Example 7

  • Question purpose: Prompt the LangChain agent to perform sentiment analysis on the earnings call recordings
  • Question asked through the UI: What is the sentiment around inflation in Amazon's earnings call?
  • Agent tools invoked: Sentiment Analysis Tool
  • Primary AWS services used: Amazon Comprehend, AWS Lambda, Amazon S3
  • Answer from the multi-modal agent: The sentiment of mentions of 'inflation' in the Amazon earnings call transcript is negative. Justification: Quote 1: "We're seeing inflationary pressures, especially in fuel and transportation costs." Quote 2: "Inflation is a concern and we're seeing cost inflation, which is an unusual thing for us."

Example 8

  • Question purpose: Prompt the LangChain agent to build an optimal portfolio using the collected data
  • Question asked through the UI: What are the closing prices of stocks AAAA, WWW, DDD in year 2018? Can you build an optimized portfolio using these three stocks?
  • Agent tools invoked: Portfolio Optimization Tool, Stocks Querying Tool
  • Primary AWS services used: Amazon SageMaker, Amazon Athena
  • Answer from the multi-modal agent: The closing stock prices for AAAA, WWW and DDD in 2018 are AAAA: $172.44, WWW: $85.91, DDD: $9.82. The optimized portfolio allocation for stocks AAAA, WWW and DDD is AAAA: 55%, WWW: 45%, DDD: 0%.

About the Authors

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. Sovik has published articles and holds a patent in ML model monitoring. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Mohan Musti is a Senior Technical Account Manager based out of Dallas. Mohan helps customers architect and optimize applications on AWS. Mohan holds a degree in Computer Science and Engineering from JNT University, India. In his spare time, he enjoys spending time with his family and camping.

Jia (Vivian) Li is a Senior Solutions Architect at AWS, specializing in AI/ML. She currently supports customers in the financial industry. Prior to joining AWS in 2022, she had 7 years of experience helping enterprise customers use AI/ML in the cloud to drive business results. Vivian has a BS from Peking University and a PhD from the University of Southern California. In her spare time, she enjoys water activities and hiking in the beautiful mountains of her home state, Colorado.

Uchenna Egbe is an AI/ML Solutions Architect who enjoys building reusable AI/ML solutions. Uchenna has an MS from the University of Alaska Fairbanks. He spends his free time researching herbs, teas, and superfoods and how to incorporate them into his daily diet.

Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.

Praful Kava is a Sr. Specialist Solutions Architect at AWS. He guides customers to design and engineer Cloud scale Analytics pipelines on AWS. Outside work, he enjoys travelling with his family and exploring new hiking trails.

Read More

How VirtuSwap accelerates their pandas-based trading simulations with an Amazon SageMaker Studio custom container and AWS GPU instances

This post is written in collaboration with Dima Zadorozhny and Fuad Babaev from VirtuSwap.

VirtuSwap is a startup company developing innovative technology for decentralized exchange of assets on blockchains. VirtuSwap’s technology provides more efficient trading for assets that don’t have a direct pair between them. The absence of a direct pair leads to costly indirect trading, meaning that two or more trades are required to complete a desired swap, leading to double or triple trading costs. VirtuSwap’s Reserve-based Virtual Pools technology solves the problem by making every trade direct, saving up to 50% of trading costs. Read more at virtuswap.io.

In this post, we share how VirtuSwap used the bring-your-own-container feature in Amazon SageMaker Studio to build a robust environment to host their GPU-intensive simulations to solve linear optimization problems.

The challenge

The VirtuSwap Minerva engine creates recommendations for the optimal distribution of liquidity between different liquidity pools, while taking into account multiple parameters, such as trading volumes, current market liquidity, and volatilities of traded assets, constrained by the total amount of liquidity available for distribution. To provide these recommendations, VirtuSwap Minerva uses thousands of historical trading pairs to simulate their runs through various liquidity configurations and find the optimal distribution of liquidity, pool fees, and more.

The initial implementation was coded using pandas DataFrames. However, as the simulation data grew, the runtime nearly quadrupled along with the size of the problem. As a result, iterations slowed down, and running higher-dimensionality tasks became almost impossible. VirtuSwap realized that they needed GPU instances for the simulation to obtain results faster.

VirtuSwap needed a GPU-compatible pandas-like library to run their simulation and chose cuDF, a GPU DataFrame library by Rapids. cuDF is used for loading, joining, aggregating, filtering, and otherwise manipulating data, in a pandas-like API that accelerates the work on dataframes, using CUDA for significantly faster performance than pandas.
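As a rough illustration (the file and column names here are placeholders), moving from pandas to cuDF typically requires little more than changing the import:

import cudf  # GPU DataFrame library with a pandas-like API

# Load simulated trade data onto the GPU and aggregate it, just as with pandas.
trades = cudf.read_csv("trades.csv")  # columns: pool, asset, volume, fee
per_pool = trades.groupby("pool").agg({"volume": "sum", "fee": "mean"}).reset_index()
joined = per_pool.merge(trades[["pool", "asset"]].drop_duplicates(), on="pool", how="left")
print(joined.head())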

Solution overview

VirtuSwap chose SageMaker Studio for end-to-end development, starting with iterative, interactive development in notebooks. Due to the flexibility of SageMaker Studio, they decided to use it for their simulation as well, taking advantage of Amazon SageMaker custom images, which allow VirtuSwap to bring their own custom libraries and software needed, such as cuDF. The following diagram illustrates the solution workflow.

In the following sections, we share the step-by-step instructions to build and use a Rapids cuDF image in SageMaker.

Prerequisites

To run this step-by-step guide, you need an AWS account with permissions to SageMaker, Amazon Elastic Container Registry (Amazon ECR), AWS Identity and Access Management (IAM), and AWS CodeBuild. In addition, you need to have a SageMaker domain ready.

Create IAM roles and policies

For the build process of SageMaker custom notebooks, we used AWS CloudShell, which provides all the required packages to build the custom image. In CloudShell, we used SageMaker Docker Build, a CLI for building Docker images for and in SageMaker Studio. The CLI can create the repository in Amazon ECR and build the container using CodeBuild. For that, we need to provide the tool an IAM role with proper permissions. Complete the following steps:

  1. Sign in to the AWS Management Console and open the IAM console.
  2. In the navigation pane on the left, choose Policies.
  3. Create a policy named sm-build-policy with the following permissions:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "codebuild:DeleteProject",
                    "codebuild:CreateProject",
                    "codebuild:BatchGetBuilds",
                    "codebuild:StartBuild"
                ],
                "Resource": "arn:aws:codebuild:*:*:project/sagemaker-studio*"
            },
            {
                "Effect": "Allow",
                "Action": "logs:CreateLogStream",
                "Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:GetLogEvents",
                    "logs:PutLogEvents"
                ],
                "Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*:log-stream:*"
            },
            {
                "Effect": "Allow",
                "Action": "logs:CreateLogGroup",
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "ecr:CreateRepository",
                    "ecr:BatchGetImage",
                    "ecr:CompleteLayerUpload",
                    "ecr:DescribeImages",
                    "ecr:DescribeRepositories",
                    "ecr:UploadLayerPart",
                    "ecr:ListImages",
                    "ecr:InitiateLayerUpload",
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:PutImage"
                ],
                "Resource": "arn:aws:ecr:*:*:repository/sagemaker-studio*"
            },
            {
                "Sid": "ReadAccessToPrebuiltAwsImages",
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchGetImage",
                    "ecr:GetDownloadUrlForLayer"
                ],
                "Resource": [
                    "arn:aws:ecr:*:763104351884:repository/*",
                    "arn:aws:ecr:*:217643126080:repository/*",
                    "arn:aws:ecr:*:727897471807:repository/*",
                    "arn:aws:ecr:*:626614931356:repository/*",
                    "arn:aws:ecr:*:683313688378:repository/*",
                    "arn:aws:ecr:*:520713654638:repository/*",
                    "arn:aws:ecr:*:462105765813:repository/*"
                ]
            },
            {
                "Sid": "EcrAuthorizationTokenRetrieval",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetAuthorizationToken"
                ],
                "Resource": [
                    "*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:DeleteObject",
                    "s3:PutObject"
                ],
                "Resource": "arn:aws:s3:::sagemaker-*/*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:CreateBucket"
                ],
                "Resource": "arn:aws:s3:::sagemaker*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "iam:GetRole",
                    "iam:ListRoles"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": "arn:aws:iam::*:role/*",
                "Condition": {
                    "StringLikeIfExists": {
                        "iam:PassedToService": "codebuild.amazonaws.com"
                    }
                }
            },
            {
                "Effect": "Allow",
                "Action": [
                    "ecr:CreateRepository",
                    "ecr:BatchGetImage",
                    "ecr:CompleteLayerUpload",
                    "ecr:DescribeImages",
                    "ecr:DescribeRepositories",
                    "ecr:UploadLayerPart",
                    "ecr:ListImages",
                    "ecr:InitiateLayerUpload",
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:PutImage"
                ],
                "Resource": "arn:aws:ecr:*:*:repository/*"
            }
        ]
    }

The permissions provide the ability to utilize the utility in full: create repositories, create a CodeBuild job, use Amazon Simple Storage Service (Amazon S3), and send logs to Amazon CloudWatch.

  1. Create a role named sm-build-role with the following trust policy, and add the policy sm-build-policy that you created earlier:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "codebuild.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

Now, let’s review the steps in CloudShell.

Create a cuDF Docker image in CloudShell

For our purposes, we needed a Rapids CUDA image, which also includes an ipykernel, so that the image can be used in a SageMaker Studio notebook.

We use an existing CUDA image by RapidsAI that is available in the official Rapids AI Docker hub, and add the ipykernel installation.

In a CloudShell terminal, run the following command:

printf "FROM nvcr.io/nvidia/rapidsai/rapidsai:0.16-cuda10.1-base-ubuntu18.04
RUN pip install ipykernel && 
python -m ipykernel install --sys-prefix &&  
useradd --create-home --shell /bin/bash --gid 100 --uid 1000 sagemaker-user
USER sagemaker-user" > Dockerfile

This will create the Dockerfile that will build our custom Docker image for SageMaker.

Build and push the image to a repository

As mentioned, we used the SageMaker Docker Build library, which allows data scientists and developers to easily build custom container images. For more information, refer to Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks.

The following command creates an ECR repository (if the repository doesn’t exist). sm-docker will create it, and build and push the new Docker image to the created repository:

sm-docker build . --repository rapids:v1 --role sm-build-role

In case you are missing sm-docker in your CloudShell, run the following code:

pip3 install sagemaker-studio-image-build

On completion, the ECR image URI will be returned.

Create a SageMaker custom image

After you have created a custom Docker image and pushed it to your container repository (Amazon ECR), you can configure SageMaker to use that custom Docker image. Complete the following steps:

  1. On the SageMaker console, choose Images in the navigation pane.
  2. Choose Create image.
  3. Enter the image URI output from the previous section, then choose Next.
  4. For Image name and Image display name, enter rapids.
  5. For Description, enter a description.
  6. For IAM role, choose the proper IAM role for your SageMaker domain.
  7. For EFS mount path, enter /home/sagemaker-user (default).
  8. Expand Advanced configuration.
  9. For User ID, enter 1000.
  10. For Group ID, enter 100.

  11. In the Image type section, select SageMaker Studio Image.
  12. Choose Add kernel.
  13. For Kernel name, enter conda-env-rapids-py.
  14. For Kernel display name, enter rapids.
  15. Choose Submit to create the SageMaker image.

Attach the new image to your SageMaker Studio domain

Now that you have created the custom image, you need to make it available to use by attaching the image to your domain. Complete the following steps:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose your domain. This step is optional; you can create and attach the custom image directly from the domain and skip this step.
  3. On the domain details page, choose the Environment tab, then choose Attach image.
  4. Select Existing image and select the new image (rapids) from the list.
  5. Choose Next.
  6. Review the custom image configuration and make sure to set Image type as SageMaker Studio Image, as in the previous step, with the same kernel name and kernel display name.
  7. Choose Submit.

The custom image is now available in SageMaker Studio and ready for use.

Create a new notebook with the image

For instructions to launch a new notebook, refer to Launch a custom SageMaker image in Amazon SageMaker Studio. Complete the following steps:

  1. On the SageMaker Studio console, choose Open launcher.
  2. Choose Change environment.
  3. For Image, choose the newly created image, rapids v1.
  4. For Kernel, choose rapids.
  5. For Instance type, choose your instance.

SageMaker Studio provides the option to customize your computing power by choosing an instance from the AWS accelerated compute, general purpose compute, compute optimized, or memory optimized families. This flexibility allows you to seamlessly transition between CPUs and GPUs, as well as dynamically scale instance sizes up or down as needed. For our notebook, we used the ml.g4dn.2xlarge instance type to test cuDF performance while utilizing the GPU accelerator.

  6. Choose Select.
  7. Select your environment and choose Create notebook, then wait until the notebook kernel becomes ready.

Validate your custom image

To validate that your custom image was launched and cuDF is ready to use, create a new cell, enter import cudf, and run it.

Clean up

Power off the Jupyter instance running the test notebook in SageMaker Studio by choosing Running Terminals and Kernels and powering off the running instance.

Runtime comparison results

We conducted a runtime comparison of our code using both CPU and GPU on SageMaker g4dn.2xlarge instances, with a time complexity of O(N). The results, as shown in the following figure, reveal the efficiency of using GPUs over CPUs.

The main advantage of GPUs lies in their ability to perform parallel processing. As we increase the value of N, the runtime on CPUs increases at a rate of 3N. On the other hand, with GPUs, the rate of increase can be described as 2N, as illustrated in the preceding figure. The larger the problem size, the more efficient the GPU becomes. In our case, using a GPU was at least 20 times faster than using a CPU. This highlights the growing importance of GPUs in modern computing, especially for tasks that require large amounts of data to be processed quickly.
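A minimal benchmarking sketch along these lines (synthetic data and hypothetical sizes) compares the same aggregation in pandas and cuDF:

import time

import cudf
import numpy as np
import pandas as pd


def make_frame(n):
    # Synthetic stand-in for the simulation data.
    return pd.DataFrame({
        "pool": np.random.randint(0, 1_000, n),
        "volume": np.random.rand(n),
    })


for n in (10**6, 10**7, 10**8):  # hypothetical problem sizes
    pdf = make_frame(n)
    gdf = cudf.from_pandas(pdf)  # copy the data to GPU memory

    start = time.perf_counter()
    pdf.groupby("pool")["volume"].mean()
    cpu_s = time.perf_counter() - start

    start = time.perf_counter()
    gdf.groupby("pool")["volume"].mean()
    gpu_s = time.perf_counter() - start

    print(f"N={n:>11,}  CPU {cpu_s:.3f}s  GPU {gpu_s:.3f}s  speedup {cpu_s / gpu_s:.1f}x")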

With SageMaker GPU instances, VirtuSwap is able to dramatically increase the dimensionality of the solved problems and find solutions faster.

Conclusion

In this post, we showed how VirtuSwap customized SageMaker Studio by using a custom image to solve a complex problem. With the ability to easily change the run environment and switch between different instances, sizes, and kernels, VirtuSwap was able to experiment fast and speed up the runtime by 15x and deliver a scalable solution.

As a next step, VirtuSwap is considering broadening their usage of SageMaker and running their processing in Amazon SageMaker Processing to process the massive data they’re collecting from various blockchains into their platform.


About the Authors

Adir Sharabi is a Principal Solutions Architect with Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud. He is also passionate about Data and helping customers to get the most out of it.

Omer Haim is a Senior Startup Solutions Architect at Amazon Web Services. He helps startups with their cloud journey, and is passionate about containers and ML. In his spare time, Omer likes to travel, and occasionally game with his son.

Dmitry Zadorozhny is a data analyst at virtuswap.io. He is responsible for data mining, processing, and storage, as well as integrating cloud services such as AWS. Prior to joining VirtuSwap, he worked in the data science field and was an analytics ambassador lead at the dYdX Foundation. Dima has an M.Sc. in Computer Science. Dima enjoys playing computer games in his spare time.

Fuad Babaev serves as a Data Science Specialist at Virtuswap (virtuswap.io). He brings expertise in tackling complex optimization challenges, crafting simulations, and architecting models for trade processes. Outside of his professional career Fuad has a passion in playing chess.

Read More

Unlock ML insights using the Amazon SageMaker Feature Store Feature Processor

Amazon SageMaker Feature Store provides an end-to-end solution to automate feature engineering for machine learning (ML). For many ML use cases, raw data like log files, sensor readings, or transaction records need to be transformed into meaningful features that are optimized for model training.

Feature quality is critical to ensure a highly accurate ML model. Transforming raw data into features using aggregation, encoding, normalization, and other operations is often needed and can require significant effort. Engineers must manually write custom data preprocessing and aggregation logic in Python or Spark for each use case.

This undifferentiated heavy lifting is cumbersome, repetitive, and error-prone. The SageMaker Feature Store Feature Processor reduces this burden by automatically transforming raw data into aggregated features suitable for batch training ML models. It lets engineers provide simple data transformation functions, then handles running them at scale on Spark and managing the underlying infrastructure. This enables data scientists and data engineers to focus on the feature engineering logic rather than implementation details.

In this post, we demonstrate how a car sales company can use the Feature Processor to transform raw sales transaction data into features in three steps:

  1. Local runs of data transformations.
  2. Remote runs at scale using Spark.
  3. Operationalization via pipelines.

We show how SageMaker Feature Store ingests the raw data, runs feature transformations remotely using Spark, and loads the resulting aggregated features into a feature group. These engineered features can then be used to train ML models.

For this use case, we see how SageMaker Feature Store helps convert the raw car sales data into structured features. These features are subsequently used to gain insights like:

  • Average and maximum price of red convertibles from 2010
  • Models with best mileage vs. price
  • Sales trends of new vs. used cars over the years
  • Differences in average MSRP across locations

We also see how SageMaker Feature Store pipelines keep the features updated as new data comes in, enabling the company to continually gain insights over time.

Solution overview

We work with the dataset car_data.csv, which contains specifications such as model, year, status, mileage, price, and MSRP for used and new cars sold by the company. The following screenshot shows an example of the dataset.

"Image displaying a table of car data, including car model, year, mileage, price, and MSRP for various vehicles."

The solution notebook feature_processor.ipynb contains the following main steps, which we explain in this post:

  1. Create two feature groups: one called car-data for raw car sales records and another called car-data-aggregated for aggregated car sales records.
  2. Use the @feature_processor decorator to load data into the car-data feature group from Amazon Simple Storage Service (Amazon S3).
  3. Run the @feature_processor code remotely as a Spark application to aggregate the data.
  4. Operationalize the feature processor via SageMaker pipelines and schedule runs.
  5. Explore the feature processing pipelines and lineage in Amazon SageMaker Studio.
  6. Use aggregated features to train an ML model.

Prerequisites

To follow this tutorial, you need the following:

For this post, we refer to the following notebook, which demonstrates how to get started with Feature Processor using the SageMaker Python SDK.

Create feature groups

To create the feature groups, complete the following steps:

  1. Create a feature group definition for car-data as follows:
    # Feature Group - Car Sales
    CAR_SALES_FG_NAME = "car-data"
    CAR_SALES_FG_ARN = f"arn:aws:sagemaker:{region}:{aws_account_id}:feature-group/{CAR_SALES_FG_NAME}"
    CAR_SALES_FG_ROLE_ARN = offline_store_role
    CAR_SALES_FG_OFFLINE_STORE_S3_URI = f"s3://{s3_bucket}/{s3_offline_store_prefix}"
    CAR_SALES_FG_FEATURE_DEFINITIONS = [
        FeatureDefinition(feature_name="id", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="model", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="model_year", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="status", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="mileage", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="price", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="msrp", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="ingest_time", feature_type=FeatureTypeEnum.FRACTIONAL),
    ]

The features correspond to each column in the car_data.csv dataset (Model, Year, Status, Mileage, Price, and MSRP).

  2. Add the record identifier id and event time ingest_time to the feature group:
CAR_SALES_FG_RECORD_IDENTIFIER_NAME = "id"
CAR_SALES_FG_EVENT_TIME_FEATURE_NAME = "ingest_time"
  3. Create a feature group definition for car-data-aggregated as follows:
# Feature Group - Aggregated Car Sales
AGG_CAR_SALES_FG_NAME = "car-data-aggregated"
AGG_CAR_SALES_FG_ARN = (
    f"arn:aws:sagemaker:{region}:{aws_account_id}:feature-group/{AGG_CAR_SALES_FG_NAME}"
)
AGG_CAR_SALES_FG_ROLE_ARN = offline_store_role
AGG_CAR_SALES_FG_OFFLINE_STORE_S3_URI = f"s3://{s3_bucket}/{s3_offline_store_prefix}"
AGG_CAR_SALES_FG_FEATURE_DEFINITIONS = [
    FeatureDefinition(feature_name="model_year_status", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="avg_mileage", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="max_mileage", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="avg_price", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="max_price", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="avg_msrp", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="max_msrp", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="ingest_time", feature_type=FeatureTypeEnum.FRACTIONAL),
]

For the aggregated feature group, the features are model year status, average mileage, max mileage, average price, max price, average MSRP, max MSRP, and ingest time. We add the record identifier model_year_status and event time ingest_time to this feature group.
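The create() call for this feature group (shown in a later step) also references record identifier and event time constants; mirroring the definitions used for the raw feature group, they would be set as follows:

AGG_CAR_SALES_FG_RECORD_IDENTIFIER_NAME = "model_year_status"
AGG_CAR_SALES_FG_EVENT_TIME_FEATURE_NAME = "ingest_time"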

  4. Now, create the car-data feature group:
# Create Feature Group - Car sale records.
car_sales_fg = FeatureGroup(
    name=CAR_SALES_FG_NAME,
    feature_definitions=CAR_SALES_FG_FEATURE_DEFINITIONS,
    sagemaker_session=sagemaker_session,
)

create_car_sales_fg_resp = car_sales_fg.create(
        record_identifier_name=CAR_SALES_FG_RECORD_IDENTIFIER_NAME,
        event_time_feature_name=CAR_SALES_FG_EVENT_TIME_FEATURE_NAME,
        s3_uri=CAR_SALES_FG_OFFLINE_STORE_S3_URI,
        enable_online_store=True,
        role_arn=CAR_SALES_FG_ROLE_ARN,
    )
  5. Create the car-data-aggregated feature group:
# Create Feature Group - Aggregated car sales records.
agg_car_sales_fg = FeatureGroup(
    name=AGG_CAR_SALES_FG_NAME,
    feature_definitions=AGG_CAR_SALES_FG_FEATURE_DEFINITIONS,
    sagemaker_session=sagemaker_session,
)

create_agg_car_sales_fg_resp = agg_car_sales_fg.create(
        record_identifier_name=AGG_CAR_SALES_FG_RECORD_IDENTIFIER_NAME,  
        event_time_feature_name=AGG_CAR_SALES_FG_EVENT_TIME_FEATURE_NAME,
        s3_uri=AGG_CAR_SALES_FG_OFFLINE_STORE_S3_URI,
        enable_online_store=True,
        role_arn=AGG_CAR_SALES_FG_ROLE_ARN,
    )

You can navigate to the SageMaker Feature Store option under Data on the SageMaker Studio Home menu to see the feature groups.

Image from Sagemaker Feature store with headers Feature group name and description

Use the @feature_processor decorator to load data

In this section, we locally transform the raw input data (car_data.csv) from Amazon S3 into the car-data feature group using the Feature Store Feature Processor. This initial local run allows us to develop and iterate before running remotely, and could be done on a sample of the data if desired for faster iteration.

With the @feature_processor decorator, your transformation function runs in a Spark runtime environment where the input arguments provided to your function and its return value are Spark DataFrames.

  1. Install the Feature Processor SDK from the SageMaker Python SDK and its extras using the following command:
pip install sagemaker[feature-processor]

The number of input parameters in your transformation function must match the number of inputs configured in the @feature_processor decorator. In this case, the @feature_processor decorator has car_data.csv as input and the car-data feature group as output, indicating this is a batch operation with target_stores set to OfflineStore:

from sagemaker.feature_store.feature_processor import (
    feature_processor,
    FeatureGroupDataSource,
    CSVDataSource,
)

@feature_processor(
    inputs=[CSVDataSource(RAW_CAR_SALES_S3_URI)],
    output=CAR_SALES_FG_ARN,
    target_stores=["OfflineStore"],
)
  2. Define the transform() function to transform the data. This function performs the following actions:
    • Convert column names to lowercase.
    • Add the event time to the ingest_time column.
    • Remove punctuation and replace missing values with NA.
def transform(raw_s3_data_as_df):
    """Load data from S3, perform basic feature engineering, store it in a Feature Group"""
    from pyspark.sql.functions import regexp_replace
    from pyspark.sql.functions import lit
    import time

    transformed_df = (
        # Strip the dollar sign from the Price column
        raw_s3_data_as_df.withColumn("Price", regexp_replace("Price", "\$", ""))
        # Rename Columns
        .withColumnRenamed("Id", "id")
        .withColumnRenamed("Model", "model")
        .withColumnRenamed("Year", "model_year")
        .withColumnRenamed("Status", "status")
        .withColumnRenamed("Mileage", "mileage")
        .withColumnRenamed("Price", "price")
        .withColumnRenamed("MSRP", "msrp")
        # Add Event Time
        .withColumn("ingest_time", lit(int(time.time())))
        # Remove punctuation and fluff; replace with NA
        .withColumn("mileage", regexp_replace("mileage", "(,)|(mi\.)", ""))
        .withColumn("mileage", regexp_replace("mileage", "Not available", "NA"))
        .withColumn("price", regexp_replace("price", ",", ""))
        .withColumn("msrp", regexp_replace("msrp", "(^MSRP\s\$)|(,)", ""))
        .withColumn("msrp", regexp_replace("msrp", "Not specified", "NA"))
        .withColumn("msrp", regexp_replace("msrp", "\$\d+[a-zA-Z\s]+", "NA"))
        .withColumn("model", regexp_replace("model", "^\d\d\d\d\s", ""))
    )
    transformed_df.show()
    return transformed_df
  3. Call the transform() function to store the data in the car-data feature group:
# Execute the FeatureProcessor
transform()

The output shows that the data is ingested successfully into the car-data feature group.

The output of transformed_df.show() is as follows:

INFO:sagemaker:Ingesting transformed data to arn:aws:sagemaker:us-west-2:416578662734:feature-group/car-data with target_stores: ['OfflineStore']

+---+--------------------+----------+------+-------+--------+-----+-----------+
| id|               model|model_year|status|mileage|   price| msrp|ingest_time|
+---+--------------------+----------+------+-------+--------+-----+-----------+
|  0|    Acura TLX A-Spec|      2022|   New|     NA|49445.00|49445| 1686627154|
|  1|    Acura RDX A-Spec|      2023|   New|     NA|50895.00|   NA| 1686627154|
|  2|    Acura TLX Type S|      2023|   New|     NA|57745.00|   NA| 1686627154|
|  3|    Acura TLX Type S|      2023|   New|     NA|57545.00|   NA| 1686627154|
|  4|Acura MDX Sport H...|      2019|  Used| 32675 |40990.00|   NA| 1686627154|
|  5|    Acura TLX A-Spec|      2023|   New|     NA|50195.00|50195| 1686627154|
|  6|    Acura TLX A-Spec|      2023|   New|     NA|50195.00|50195| 1686627154|
|  7|    Acura TLX Type S|      2023|   New|     NA|57745.00|   NA| 1686627154|
|  8|    Acura TLX A-Spec|      2023|   New|     NA|47995.00|   NA| 1686627154|
|  9|    Acura TLX A-Spec|      2022|   New|     NA|49545.00|   NA| 1686627154|
| 10|Acura Integra w/A...|      2023|   New|     NA|36895.00|36895| 1686627154|
| 11|    Acura TLX A-Spec|      2023|   New|     NA|48395.00|48395| 1686627154|
| 12|Acura MDX Type S ...|      2023|   New|     NA|75590.00|   NA| 1686627154|
| 13|Acura RDX A-Spec ...|      2023|   New|     NA|55345.00|   NA| 1686627154|
| 14|    Acura TLX A-Spec|      2023|   New|     NA|50195.00|50195| 1686627154|
| 15|Acura RDX A-Spec ...|      2023|   New|     NA|55045.00|   NA| 1686627154|
| 16|    Acura TLX Type S|      2023|   New|     NA|56445.00|   NA| 1686627154|
| 17|    Acura TLX A-Spec|      2023|   New|     NA|47495.00|47495| 1686627154|
| 18|   Acura TLX Advance|      2023|   New|     NA|52245.00|52245| 1686627154|
| 19|    Acura TLX A-Spec|      2023|   New|     NA|50595.00|50595| 1686627154|
+---+--------------------+----------+------+-------+--------+-----+-----------+
only showing top 20 rows

We have successfully transformed the input data and ingested it in the car-data feature group.

Run the @feature_processor code remotely

In this section, we demonstrate running the feature processing code remotely as a Spark application using the @remote decorator described earlier. We run the feature processing remotely using Spark to scale to large datasets. Spark provides distributed processing on clusters to handle data that is too big for a single machine. The @remote decorator runs the local Python code as a single or multi-node SageMaker training job.
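As a quick, standalone orientation before applying it to feature processing, here is a minimal illustration of the @remote decorator from the SageMaker Python SDK; the function and instance type are arbitrary choices for this sketch:

from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge")
def multiply(a: int, b: int) -> int:
    # Runs inside a SageMaker training job; the return value comes back locally.
    return a * b

print(multiply(6, 7))  # submits a job, waits for it, then prints 42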

  1. Use the @remote decorator along with the @feature_processor decorator as follows:
@remote(spark_config=SparkConfig(), instance_type = "ml.m5.xlarge", ...)
@feature_processor(inputs=[FeatureGroupDataSource(CAR_SALES_FG_ARN)],
                   output=AGG_CAR_SALES_FG_ARN, target_stores=["OfflineStore"], enable_ingestion=False )

The spark_config parameter indicates this is run as a Spark application. The SparkConfig instance configures the Spark configuration and dependencies.

  2. Define the aggregate() function to aggregate the data using PySpark SQL and user-defined functions (UDFs). This function performs the following actions:
    • Concatenate model, year, and status to create model_year_status.
    • Take the average of price to create avg_price.
    • Take the max value of price to create max_price.
    • Take the average of mileage to create avg_mileage.
    • Take the max value of mileage to create max_mileage.
    • Take the average of msrp to create avg_msrp.
    • Take the max value of msrp to create max_msrp.
    • Group by model_year_status.
def aggregate(source_feature_group, spark):
    """
    Aggregate the data using a SQL query and UDF.
    """
    import time
    from pyspark.sql.types import StringType
    from pyspark.sql.functions import udf

    @udf(returnType=StringType())
    def custom_concat(*cols, delimeter: str = ""):
        return delimeter.join(cols)

    spark.udf.register("custom_concat", custom_concat)

    # Execute SQL string.
    source_feature_group.createOrReplaceTempView("car_data")
    aggregated_car_data = spark.sql(
        f"""
        SELECT
            custom_concat(model, "_", model_year, "_", status) as model_year_status,
            AVG(price) as avg_price,
            MAX(price) as max_price,
            AVG(mileage) as avg_mileage,
            MAX(mileage) as max_mileage,
            AVG(msrp) as avg_msrp,
            MAX(msrp) as max_msrp,
            "{int(time.time())}" as ingest_time
        FROM car_data
        GROUP BY model_year_status
        """
    )

    aggregated_car_data.show()

    return aggregated_car_data
  3. Run the aggregate() function, which creates a SageMaker training job to run the Spark application:
# Execute the aggregate function
aggregate()

As a result, SageMaker creates a training job to run the Spark application defined earlier. It creates a Spark runtime environment using the sagemaker-spark-processing image.

We use SageMaker Training jobs here to run our Spark feature processing application. With SageMaker Training, you can reduce startup times to 1 minute or less by using warm pooling, which is unavailable in SageMaker Processing. This makes SageMaker Training better optimized for short batch jobs like feature processing where startup time is important.
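If you want to opt in to warm pools for repeated feature processing runs, the keep-alive setting can be requested directly on the decorator. The sketch below assumes the keep_alive_period_in_seconds parameter of @remote as documented in the SageMaker Python SDK; the value is illustrative:

from sagemaker.remote_function import remote

# Keep the provisioned instance warm for 10 minutes after the job finishes,
# so back-to-back runs skip most of the startup time (assumed parameter).
@remote(instance_type="ml.m5.xlarge", keep_alive_period_in_seconds=600)
def quick_step(x: int) -> int:
    return x * 2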

  4. To view the details, on the SageMaker console, choose Training jobs under Training in the navigation pane, then choose the job with the name aggregate-<timestamp>.

Image shows the Sagemaker training job

The output of the aggregate() function includes telemetry from the training job. Within that output, you will see the aggregated data as follows:

+--------------------+------------------+---------+------------------+-----------+--------+--------+-----------+
|   model_year_status|         avg_price|max_price|       avg_mileage|max_mileage|avg_msrp|max_msrp|ingest_time|
+--------------------+------------------+---------+------------------+-----------+--------+--------+-----------+
|Acura CL 3.0_1997...|            7950.0|  7950.00|          100934.0|    100934 |    null|      NA| 1686634807|
|Acura CL 3.2 Type...|            6795.0|  7591.00|          118692.5|    135760 |    null|      NA| 1686634807|
|Acura CL 3_1998_Used|            9899.0|  9899.00|           63000.0|     63000 |    null|      NA| 1686634807|
|Acura ILX 2.0L Te...|         14014.125| 18995.00|         95534.875|     89103 |    null|      NA| 1686634807|
|Acura ILX 2.0L Te...|           15008.2| 16998.00|           94935.0|     88449 |    null|      NA| 1686634807|
|Acura ILX 2.0L Te...|           16394.6| 19985.00|           97719.4|     80000 |    null|      NA| 1686634807|
|Acura ILX 2.0L w/...|14567.181818181818| 16999.00| 96624.72727272728|     98919 |    null|      NA| 1686634807|
|Acura ILX 2.0L w/...|           16673.4| 18995.00|           84848.6|     96637 |    null|      NA| 1686634807|
|Acura ILX 2.0L w/...|12580.333333333334| 14546.00|100207.33333333333|     95782 |    null|      NA| 1686634807|
|Acura ILX 2.0L_20...|         14565.375| 17590.00|         92941.125|     81842 |    null|      NA| 1686634807|
|Acura ILX 2.0L_20...|           14877.9|  9995.00|           99739.5|     89252 |    null|      NA| 1686634807|
|Acura ILX 2.0L_20...|           15659.5| 15660.00|           82136.0|     89942 |    null|      NA| 1686634807|
|Acura ILX 2.0L_20...|17121.785714285714| 20990.00| 78278.14285714286|     96067 |    null|      NA| 1686634807|
|Acura ILX 2.4L (A...|           17846.0| 21995.00|          101558.0|     85974 |    null|      NA| 1686634807|
|Acura ILX 2.4L Pr...|           16327.0| 16995.00|           85238.0|     95356 |    null|      NA| 1686634807|
|Acura ILX 2.4L w/...|           12846.0| 12846.00|           75209.0|     75209 |    null|      NA| 1686634807|
|Acura ILX 2.4L_20...|           18998.0| 18998.00|           51002.0|     51002 |    null|      NA| 1686634807|
|Acura ILX 2.4L_20...|17908.615384615383| 19316.00| 74325.38461538461|     89116 |    null|      NA| 1686634807|
|Acura ILX 4DR SDN...|           18995.0| 18995.00|           37017.0|     37017 |    null|      NA| 1686634807|
|Acura ILX 8-SPD_2...|           24995.0| 24995.00|           22334.0|     22334 |    null|      NA| 1686634807|
+--------------------+------------------+---------+------------------+-----------+--------+--------+-----------+
only showing top 20 rows

When the training job is complete, you should see the following output:

06-13 05:40 smspark-submit INFO     spark submit was successful. primary node exiting.
Training seconds: 153
Billable seconds: 153

Operationalize the feature processor via SageMaker pipelines

In this section, we demonstrate how to operationalize the feature processor by promoting it to a SageMaker pipeline and scheduling runs.

  1. First, upload the file containing the feature processing logic (car-data-ingestion.py) to Amazon S3:
car_data_s3_uri = s3_path_join("s3://", sagemaker_session.default_bucket(),
                               'transformation_code', 'car-data-ingestion.py')
S3Uploader.upload(local_path='car-data-ingestion.py', desired_s3_uri=car_data_s3_uri)
print(car_data_s3_uri)
  2. Next, create a Feature Processor pipeline car_data_pipeline using the .to_pipeline() function:
car_data_pipeline_name = f"{CAR_SALES_FG_NAME}-ingestion-pipeline"
car_data_pipeline_arn = fp.to_pipeline(pipeline_name=car_data_pipeline_name,
                                      step=transform,
                                      transformation_code=TransformationCode(s3_uri=car_data_s3_uri) )
print(f"Created SageMaker Pipeline: {car_data_pipeline_arn}.")
  3. To run the pipeline, use the following code:
car_data_pipeline_execution_arn = fp.execute(pipeline_name=car_data_pipeline_name)
print(f"Started an execution with execution arn: {car_data_pipeline_execution_arn}")
  4. Similarly, you can create a pipeline for aggregated features called car_data_aggregated_pipeline and start a run.
  5. Schedule the car_data_aggregated_pipeline to run every 24 hours:
fp.schedule(pipeline_name=car_data_aggregated_pipeline_name,
           schedule_expression="rate(24 hours)", state="ENABLED")
print(f"Created a schedule.")

In the output section, you will see the ARN of the pipeline, the pipeline execution role, and the schedule details:

{'pipeline_arn': 'arn:aws:sagemaker:us-west-2:416578662734:pipeline/car-data-aggregated-ingestion-pipeline',
 'pipeline_execution_role_arn': 'arn:aws:iam::416578662734:role/service-role/AmazonSageMaker-ExecutionRole-20230612T120731',
 'schedule_arn': 'arn:aws:scheduler:us-west-2:416578662734:schedule/default/car-data-aggregated-ingestion-pipeline',
 'schedule_expression': 'rate(24 hours)',
 'schedule_state': 'ENABLED',
 'schedule_start_date': '2023-06-13T06:05:17Z',
 'schedule_role': 'arn:aws:iam::416578662734:role/service-role/AmazonSageMaker-ExecutionRole-20230612T120731'}
  6. To get all the Feature Processor pipelines in this account, use the list_pipelines() function on the Feature Processor:
fp.list_pipelines()

The output will be as follows:

[{'pipeline_name': 'car-data-aggregated-ingestion-pipeline'},
 {'pipeline_name': 'car-data-ingestion-pipeline'}]

We have successfully created SageMaker Feature Processor pipelines.

Explore feature processing pipelines and ML lineage

In SageMaker Studio, complete the following steps:

  1. On the SageMaker Studio console, on the Home menu, choose Pipelines.

Image of Sagemaker Studio home tab highlighting pipelines option

You should see two pipelines created: car-data-ingestion-pipeline and car-data-aggregated-ingestion-pipeline.

Image of Sagemaker Studio pipelines with the list of pipelines

  2. Choose the car-data-ingestion-pipeline.

It shows the run details on the Executions tab.

Image of Sagemaker Studio of the car data ingestion pipeline

  3. To view the feature group populated by the pipeline, choose Feature Store under Data and choose car-data.

Image of Sagemaker Studio home highlighting Data

You will see the two feature groups we created in the previous steps.

Image of Sagemaker Studio with feature groups created

  4. Choose the car-data feature group.

You will see the features details on the Features tab.

Image of Sagemaker Studio with feature group and the features in the group

View pipeline runs

To view the pipeline runs, complete the following steps:

  1. On the Pipeline Executions tab, select car-data-ingestion-pipeline.

This will show all the runs.

Image shows the Sagemaker Feature group tab of the pipeline executions

  2. Choose one of the links to see the details of the run.

Image shows the sagemaker UI with the pipelines in execution

  3. To view lineage, choose Lineage.

The full lineage for car-data shows the input data source car_data.csv and upstream entities. The lineage for car-data-aggregated shows the input car-data feature group.

Image of Sagemaker UI of the feature group of car data

  4. Choose Load features and then choose Query upstream lineage on car-data and car-data-ingestion-pipeline to see all the upstream entities.

The full lineage for car-data feature group should look like the following screenshot.

Image shows the Sagemaker feature store with car lineage

Similarly, the lineage for the car-data-aggregated feature group should look like the following screenshot.

Image shows the aggregated feature group from Sagemaker Feature Store UI

SageMaker Studio provides a single environment to track scheduled pipelines, view runs, explore lineage, and view the feature processing code.

The aggregated features such as average price, max price, average mileage, and more in the car-data-aggregated feature group provide insight into the nature of the data. You can also use these features as a dataset to train a model to predict car prices, or for other operations. However, training the model is out of scope for this post, which focuses on demonstrating the SageMaker Feature Store capabilities for feature engineering.
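If you want to inspect those aggregated features yourself, the following sketch pulls a few of them out of the offline store using the SageMaker Python SDK's Athena integration; the S3 output location is a placeholder you would replace with your own bucket:

import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

sagemaker_session = sagemaker.Session()
agg_fg = FeatureGroup(name="car-data-aggregated", sagemaker_session=sagemaker_session)

# Query the offline store via Athena and load the result into a pandas DataFrame.
query = agg_fg.athena_query()
query_string = (
    'SELECT model_year_status, avg_price, max_price, avg_mileage '
    f'FROM "{query.table_name}" ORDER BY avg_price DESC LIMIT 10'
)
query.run(query_string=query_string, output_location="s3://<your-bucket>/athena-results/")  # placeholder bucket
query.wait()
print(query.as_dataframe())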

Clean up

Don’t forget to clean up the resources created as part of this post to avoid incurring ongoing charges.

  1. Disable the scheduled pipeline via the fp.schedule() method with the state parameter as Disabled:
# Disable the scheduled pipeline
fp.schedule(
    pipeline_name=car_data_aggregated_pipeline_name,
    schedule_expression="rate(24 hours)",
    state="DISABLED",
)
  2. Delete both feature groups:
# Delete feature groups
car_sales_fg.delete()
agg_car_sales_fg.delete()

The data residing in the S3 bucket and offline feature store can incur costs, so you should delete them to avoid any charges.

  3. Delete the S3 objects (see the sketch after this list).
  4. Delete the records from the feature store.
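For step 3, a small boto3 sketch like the following removes the offline store objects written under the prefix used earlier in the notebook (s3_bucket and s3_offline_store_prefix are the same variables defined there). Deleting objects is irreversible, so double-check the prefix first:

import boto3

s3 = boto3.resource("s3")
# Delete every object under the offline store prefix (irreversible).
s3.Bucket(s3_bucket).objects.filter(Prefix=s3_offline_store_prefix).delete()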

Conclusion

In this post, we demonstrated how a car sales company used SageMaker Feature Store Feature Processor to gain valuable insights from their raw sales data by:

  • Ingesting and transforming batch data at scale using Spark
  • Operationalizing feature engineering workflows via SageMaker pipelines
  • Providing lineage tracking and a single environment to monitor pipelines and explore features
  • Preparing aggregated features optimized for training ML models

By following these steps, the company was able to transform previously unusable data into structured features that could then be used to train a model to predict car prices. SageMaker Feature Store enabled them to focus on feature engineering rather than the underlying infrastructure.

We hope this post helps you unlock valuable ML insights from your own data using SageMaker Feature Store Feature Processor!

For more information on this, refer to Feature Processing and the SageMaker example on Amazon SageMaker Feature Store: Feature Processor Introduction.


About the Authors


Dhaval Shah is a Senior Solutions Architect at AWS, specializing in Machine Learning. With a strong focus on digital native businesses, he empowers customers to leverage AWS and drive their business growth. As an ML enthusiast, Dhaval is driven by his passion for creating impactful solutions that bring positive change. In his leisure time, he indulges in his love for travel and cherishes quality moments with his family.

Ninad Joshi is a Senior Solutions Architect at AWS, helping global AWS customers design secure, scalable, and cost effective solutions in cloud to solve their complex real-world business challenges. His work in Machine Learning (ML) covers a wide range of AI/ML use cases, with a primary focus on End-to-End ML, Natural Language Processing, and Computer Vision. Prior to joining AWS, Ninad worked as a software developer for 12+ years. Outside of his professional endeavors, Ninad enjoys playing chess and exploring different gambits.

Read More

Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies

Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies

DeepSpeed4Science Initiative - graphic with 6 icons

Introduction 

In the next decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. In line with Microsoft’s mission to empower every person and every organization on the planet to achieve more, the DeepSpeed (opens in new tab) team at Microsoft is responding to this opportunity by launching a new initiative called DeepSpeed4Science (opens in new tab), aiming to build unique capabilities through AI system technology innovations to help domain experts to unlock today’s biggest science mysteries.

The DeepSpeed (opens in new tab) system is an industry leading open-source AI system framework, developed by Microsoft, that enables unprecedented scale and speed for deep learning training and inference on a wide range of AI hardware. Figure 1 demonstrates our basic approach to this new initiative. By leveraging DeepSpeed’s current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). We work closely with internal and external teams who own AI-driven science models that represent key science missions, to identify and address general domain-specific AI system challenges. This includes climate science, drug design, biological understanding, molecular dynamics simulation, cancer diagnosis and surveillance, catalyst/material discovery, and other domains.

Figure 1: DeepSpeed4Science approach: developing a new set of AI system technologies that are beyond generic large language model support, tailored for accelerating scientific discoveries and addressing their complexity.

Our long-term vision is to develop DeepSpeed4Science into a new platform and a unified repository for sharing advanced AI system technologies that support scientific discoveries. DeepSpeed4Science is designed to be inclusive, echoing Microsoft’s AI for Good commitment. That is reflected in the initiative’s support for a diverse group of signature science models, representing some of the most critical AI for science investments. In this blog, we showcase how DeepSpeed4Science helps address two of their critical system challenges in structural biology research: (1) eliminating memory explosion problems for scaling Evoformer-centric protein-structure prediction models, and (2) enabling very-long sequence support for better understanding the evolutionary landscape of pandemic-causing viruses.


Our launch and key collaborators 

The new system technologies enabled by DeepSpeed4Science can empower AI-driven scientific discoveries using signature models that represent a wide spectrum of efforts pushing the boundaries of science. Currently, DeepSpeed4Science is honored to support several key science models from Microsoft Research AI4Science (opens in new tab), Microsoft WebXT/Bing (opens in new tab) and U.S. DoE National Labs (opens in new tab).

Current Microsoft internal partnerships 

Scientific Foundation Model (SFM), Microsoft Research AI4Science

Figure 2: Scientific foundation model (SFM) and its current exploration: Distributional Graphormer.

Scientific foundation model (SFM) aims to create a unified large-scale foundation model to empower natural scientific discovery by supporting diverse inputs, multiple scientific domains (e.g., drugs, materials, biology, health, etc.) and computational tasks. The DeepSpeed4Science partnership will provide new training and inference technologies to empower the SFM team’s continuous research on projects like Microsoft’s new generative AI methods, such as Distributional Graphormer.

ClimaX, MSR AI4Science

Figure 3: ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks.

Our changing climate is producing more frequent extreme weather events. To mitigate the negative effects, it is increasingly important to predict where these events will occur. ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks. It can absorb many different datasets with different variables and resolutions, potentially improving weather forecasting. DeepSpeed4Science is creating new system supports and acceleration strategies for ClimaX for efficiently pretraining/finetuning bigger foundation models while handling very large high-resolution image data (e.g., tens to hundreds of petabytes) with long sequences.

AI Powered Ab Initio Molecular Dynamics (AI2MD), MSR AI4Science

Figure 4: One million steps of molecular dynamics simulation: RBD-protein interacts with protein inhibitor.

This project simulates the dynamics of large (million-atom) molecular systems with near ab initio accuracy using AI-powered force field models while maintaining the efficiency and scalability of classical molecular dynamics. The simulations are efficient enough to generate trajectories long enough to observe chemically significant events. Typically, millions or even billions of inference steps are required for this process. This poses a significant challenge in optimizing the inference speed of graph neural network (GNN) + LLM models, for which DeepSpeed4Science will provide new acceleration strategies.

Weather from Microsoft Start, Microsoft WebXT/Bing

Figure 5: Microsoft Start precipitation nowcast (every 4 minutes for the next 4 hours).

Weather from Microsoft Start (opens in new tab) provides precise weather information to help users make better decisions for their lifestyles, health, jobs and activities (opens in new tab) – including accurate 10-day global weather forecasts updated multiple times every hour.  Previously, Weather from Microsoft Start benefited from DeepSpeed technologies to accelerate their multi-GPU training environments. Currently, DeepSpeed4Science is working with the WebXT weather team to further enhance Microsoft Weather services with cutting-edge features and improvements.

Current external collaborators 

DeepSpeed4Science’s journey started with two pioneering LLM-based AI models for structural biology research: OpenFold (opens in new tab) from Columbia University, an open-sourced high-fidelity protein structure prediction model; and GenSLMs (opens in new tab) from Argonne National Laboratory (opens in new tab), an award-winning genome-scale language model (opens in new tab) for learning the evolutionary landscape of SARS-CoV-2 (COVID-19) genomes. As the featured showcases for this release, they represent two common AI system challenges facing today’s AI-driven structural biology research. We will discuss how DeepSpeed4Science empowered their scientific discovery in the next section.  

Additionally, DeepSpeed4Science has recently expanded its scope to support a more diverse range of science models. For example, in our work with Argonne on training a trillion-parameter science model on Aurora Exascale system (opens in new tab), DeepSpeed4Science technologies will help them reach the performance requirements and scalability needed for this critical mission. Furthermore, by collaborating with Oak Ridge National Lab (opens in new tab) and National Cancer Institute (NCI) (opens in new tab) on cancer surveillance, DeepSpeed4Science will help enable high-fidelity extraction and classification of information from unstructured clinical texts for the MOSSAIC project (opens in new tab).  DeepSpeed4Science technologies will also be adopted by Brookhaven National Laboratory (opens in new tab) to support development of a large digital twin model for clean energy research by using LLMs to produce more realistic simulation data. You can find more detailed information about our external colleagues and their science missions at DeepSpeed4Science (opens in new tab).

Partnership showcases 

Showcase (I): DeepSpeed4Science eliminates memory explosion problems for scaling Evoformer-centric structural biology models via DS4Sci_EvoformerAttention

Figure 6: OpenFold predictions for PDB chain 7B3A_A as the model trains.

OpenFold (opens in new tab) is a community reproduction of DeepMind’s AlphaFold2 (opens in new tab) that makes it possible to train or finetune AlphaFold2 on new datasets. Researchers have used it to retrain AlphaFold2 from scratch to produce new sets of model parameters, studied the early training phase of AlphaFold2 (Figure 6), and developed new protein folding systems.

Figure 7: Peak memory requirement for training variants of the multiple sequence alignment (MSA) attention kernels (with bias) with the maximum possible training sample dimension in OpenFold. (Left) The original OpenFold implementation with EvoformerAttention used in AlphaFold2. The memory explosion problems in training/inference for these types of protein structure prediction models are common. Particularly, state-of-the-art FlashAttention cannot effectively support such science attention variants. (Right) A new solution from DeepSpeed4Science called DS4Sci_EvoformerAttention significantly reduces OpenFold’s peak memory requirement for training by 13X without accuracy loss.

While OpenFold does include performance and memory optimizations using state-of-the-art system technologies, training AlphaFold2 from scratch is still computationally expensive. The model at the current stage is small in absolute terms, with just 93 million parameters, but it contains several custom attention variants that manifest unusually large activations. During the “finetuning” phase of a standard AlphaFold2 training run, the logit tensor produced in just one of these variants–one designed to attend over the deep protein MSAs fed to the model as input–is in excess of 12GB in half precision alone, dwarfing the peak memory requirements of comparably sized language models. Even with techniques like activation checkpointing and DeepSpeed ZeRO optimizations, this memory explosion problem heavily constrains the sequence lengths and MSA depths on which the model can be trained. Furthermore, approximation strategies can significantly affect the model accuracy and convergence, while still resulting in memory explosion, shown as the left bar (orange) in Figure 7.  
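To make the memory explosion concrete, the following back-of-the-envelope sketch computes the size of an MSA row-attention logit tensor for an illustrative (assumed, not OpenFold-specific) set of dimensions. The logit tensor grows with MSA depth times the number of attention heads times the square of the sequence length, which is how a 93-million-parameter model can produce multi-gigabyte activations:

# Illustrative dimensions only; not OpenFold's exact finetuning configuration.
msa_depth = 512      # aligned sequences fed to the model
seq_len = 1024       # residues per sequence
num_heads = 8        # attention heads in the MSA attention block
bytes_per_elem = 2   # half precision (fp16/bf16)

logit_elems = msa_depth * num_heads * seq_len * seq_len
print(f"logit tensor: {logit_elems * bytes_per_elem / 1024**3:.1f} GiB")  # 8.0 GiB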

To address this common system challenge in structural biology research (e.g., protein structure prediction and equilibrium distribution prediction), DeepSpeed4Science tackles this memory inefficiency by designing customized exact attention kernels for the attention variants (i.e., EvoformerAttention) that widely appear in this category of science models. Specifically, a set of highly memory-efficient DS4Sci_EvoformerAttention kernels, enabled by sophisticated fusion/tiling strategies and on-the-fly memory reduction methods, is released to the broader community as high-quality machine learning primitives. Incorporated into OpenFold, they provide a substantial speedup during training and dramatically reduce the model’s peak memory requirement for training and inference. This allows OpenFold to experiment with bigger and more complex models and longer sequences, and to train on a wider spectrum of hardware. Detailed information about this technology can be found at DeepSpeed4Science (opens in new tab).

Showcase (II): DeepSpeed4Science enables very-long sequence support via both systematic and algorithmic approaches for genome-scale foundation models (e.g., GenSLMs)

Figure 8: GenSLMs: 2022 ACM Gordon Bell winning COVID model (a 25B/33B dense model based on GPT-NeoX), used to learn the latent space that describes biologically meaningful properties of SARS-CoV-2 genomes. The animation visualizes an important protein family, malate dehydrogenase, projecting the latent space colored by features such as sequence length and GC content (the ratio of guanine and cytosine to adenine and thymine, which measures a DNA strand's ability to withstand heat).

GenSLMs (opens in new tab), a 2022 ACM Gordon Bell award (opens in new tab) winning genome-scale language model from Argonne National Lab, can learn the evolutionary landscape of SARS-CoV-2 (COVID-19) genomes by adapting large language models (LLMs) for genomic data. It is designed to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. GenSLM represents one of the first whole genome-scale foundation models which can generalize to other prediction tasks. A good understanding of the latent space can help GenSLMs tackle new domains beyond just viral sequences and expand their ability to model bacterial pathogens and even eukaryotic organisms, e.g., to understand things such as function, pathway membership, and evolutionary relationships. To achieve this scientific goal, GenSLMs and similar models require very long sequence support for both training and inference that is beyond generic LLMs’ long-sequence strategies like FlashAttention (opens in new tab). Through DeepSpeed4Science’s new designs, scientists can now build and train models with significantly longer context windows, allowing them to explore relationships that were previously inaccessible.

Figure 9: Maximum sequence lengths of GenSLM models supported by different frameworks at different scales. The hardware profiled here are NVIDIA DGX nodes with eight 40G A100 GPUs per node.

Specifically, at system level, we release the newest Megatron-DeepSpeed (opens in new tab) framework for very-long sequence support along with other new optimizations (opens in new tab). Scientists can now train their large science models like GenSLMs with much longer sequences via a synergetic combination of our newly added memory optimization techniques on attention mask and position embedding, tensor parallelism, pipeline parallelism, sequence parallelism, ZeRO-style data parallelism and model state offloading. Figure 9 demonstrates that our new release enables the longest sequence length for GenSLMs’ 25B and 33B models by up to 12X and 14X, respectively, over the previous Megatron-DeepSpeed. In terms of supported sequence lengths, this new framework also significantly outperforms NVIDIA’s Megatron-LM by up to 9.8X and 9.1X for the 25B and 33B models, respectively. For example, GenSLMs’ 25B model can now be trained with a 512K sequence of nucleotides, compared to the Argonne team’s original 42K sequence length on 64 GPUs. This drastically improves model quality and scientific discovery scope with no accuracy loss. Additional support for domain scientists who prefer algorithmic strategies like relative position embedding techniques is also integrated in this new release (opens in new tab).

Summary and roadmap 

We are very proud and excited to announce the DeepSpeed4Science initiative along with several R&D highlights and achievements. Starting today, we will host our new initiative at DeepSpeed4Science (opens in new tab), including information about our external colleagues, and current and future DeepSpeed4Science technology releases. One of our high-level goals is to generalize AI system technologies that broadly address the major system pain points for large-scale scientific discoveries. We hope scientists around the world will enjoy the new capabilities unlocked by DeepSpeed4Science through open-sourced software. We are looking forward to better understanding the AI system design challenges that block your discovery progress. We sincerely welcome your participation to help us build a promising AI4Science future. Please email us at deepspeed-info@microsoft.com (opens in new tab). We encourage you to report issues, contribute PRs, and join discussions on our DeepSpeed GitHub (opens in new tab) page.

Acknowledgements 

Core DeepSpeed4Science Team:  

Shuaiwen Leon Song (DeepSpeed4Science lead), Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Xiaoxia (Shirley) Wu, Masahiro Tanaka, Martin Cai, Adam Graham, Charlie Zhou, Yuxiong He (DeepSpeed team lead)

Our Founding Collaborators (in alphabetical order):

Argonne National Lab team: Rick Stevens, Cristina Negri, Rao Kotamarthi, Venkatram Vishwanath, Arvind Ramanathan, Sam Foreman, Kyle Hippe, Troy Arcomano, Romit Maulik, Maxim Zvyagin, Alexander Brace, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, Carla M. Mann, Michael Irvin, J. Gregory Pauloski, Logan Ward, Valerie Hayot, Murali Emani, Zhen Xie, Diangen Lin, Maulik Shukla, Weili Nie, Josh Romero, Christian Dallago, Arash Vahdat, Chaowei Xiao, Thomas Gibbs, Ian Foster, James J. Davis, Michael E. Papka, Thomas Brettin, Anima Anandkumar

AMD: Ivo Bolsen, Micheal Schulte, Bo Begole, Angela Dalton, Steve Reinhart, Ashwin Aji, Jalal Mahmud, Mahesh Balashibramanian 

Brookhaven National Lab team: Adolfy Hoisie, Shinjae Yoo, Yihui Ren. 

Columbia University OpenFold team: Mohammed AlQuraishi, Gustaf Ahdritz 

Microsoft Research AI4Science team: Christopher Bishop, Bonnie Kruft, Max Welling, Tie-Yan Liu, Christian Bodnar, Johannes Brandsetter, Wessel Bruinsma, Chan Cao, Yuan-Jyue Chen, Peggy Dai, Patrick Garvan, Liang He, Elizabeth Heider, PiPi Hu, Peiran Jin, Fusong Ju, Yatao Li, Chang Liu, Renqian Luo, Qi Meng, Frank Noe, Tao Qin, Janwei Zhu, Bin Shao, Yu Shi, Wenlei Shi, Gregor Simm, Megan Stanley, Lixin Sun, Yue Wang, Tong Wang, Zun Wang, Lijun Wu, Yingce Xia, Leo Xia, Shufang Xie, Shuxin Zheng, Jianwei Zhu  

Oak Ridge National Lab team: Prassana Balaprakash, Georgia Tourass 

Princeton University: William Tang, Kyle Felker, Alexey Svyatkovskiy (Microsoft liaison) 

Rutgers University: Hang Liu

WebXT Weather team: Pete Luferenko, Divya Kumar, Jonathan Weyn, Ruixiong Zhang, Sylwester Klocek, Volodymyr Vragov 

The post Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies appeared first on Microsoft Research.

Read More

OpenAI Red Teaming Network

We’re announcing an open call for the OpenAI Red Teaming Network and invite domain experts interested in improving the safety of OpenAI’s models to join our efforts. OpenAI Blog