Get started with the open-source Amazon SageMaker Distribution
Data scientists need a secure, consistent, and reproducible environment for machine learning (ML) and data science workloads that makes it straightforward to manage dependencies. AWS Deep Learning Containers already provides pre-built Docker images for training and serving models in common frameworks such as TensorFlow, PyTorch, and MXNet. To improve this experience, we announced a public beta of the SageMaker open-source distribution at JupyterCon 2023. It provides a unified end-to-end ML experience for ML developers of varying levels of expertise. Developers no longer need to switch between different framework containers for experimentation, or as they move from local JupyterLab environments and SageMaker notebooks to production jobs on SageMaker. The open-source SageMaker Distribution supports the most common packages and libraries for data science, ML, and visualization, such as TensorFlow, PyTorch, Scikit-learn, Pandas, and Matplotlib. The container is available from the Amazon ECR Public Gallery starting today.
In this post, we show you how you can use the SageMaker open-source distribution to quickly experiment in your local environment and easily promote your notebooks to jobs on SageMaker.
Solution overview
For our example, we showcase training an image classification model using PyTorch. We use the KMNIST dataset available publicly on PyTorch. We train a neural network model, test the model’s performance, and finally print the training and test loss. The full notebook for this example is available in the SageMaker Studio Lab examples repository. We start experimentation on a local laptop using the open-source distribution, move it to Amazon SageMaker Studio for using a larger instance, and then schedule the notebook as a notebook job.
Prerequisites
You need the following prerequisites:
- Docker installed.
- An active AWS account with administrator permissions.
- An environment with the AWS Command Line Interface (AWS CLI) and Docker installed.
- An existing SageMaker domain. To create a domain, refer to Onboard to Amazon SageMaker Domain.
Set up your local environment
You can directly start using the open-source distribution on your local laptop. To start JupyterLab, run the following commands in your terminal. You can replace `ECR_IMAGE_ID` with any of the image tags available in the Amazon ECR Public Gallery, or choose the `latest-gpu` tag if you are using a machine that supports GPU.
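A minimal sketch of such a command, assuming the public gallery registry path and an explicit `jupyter-lab` entrypoint (the port and flags here are illustrative, not the post's exact invocation):

```shell
# Pick a tag from the Amazon ECR Public Gallery (e.g. latest-cpu or latest-gpu).
export ECR_IMAGE_ID=latest-cpu

# Run the image and expose JupyterLab on localhost:8888.
docker run -it -p 8888:8888 \
    public.ecr.aws/sagemaker/sagemaker-distribution:$ECR_IMAGE_ID \
    jupyter-lab --no-browser --ip=0.0.0.0
```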
This command will start JupyterLab and provide a URL on the terminal, like `http://127.0.0.1:8888/lab?token=<token>`. Copy the link and enter it in your preferred browser to start JupyterLab.
Set up Studio
Studio is an end-to-end integrated development environment (IDE) for ML that lets developers and data scientists build, train, deploy, and monitor ML models at scale. Studio provides an extensive list of first-party images with common frameworks and packages, such as Data Science, TensorFlow, PyTorch, and Spark. These images make it simple for data scientists to get started with ML by simply choosing a framework and a compute instance type.
You can now use the SageMaker open-source distribution on Studio using Studio’s bring your own image feature. To add the open-source distribution to your SageMaker domain, complete the following steps:
- Add the open-source distribution to your account’s Amazon Elastic Container Registry (Amazon ECR) repository by running the following commands on your terminal:
- Create a SageMaker image and attach the image to the Studio domain:
- On the SageMaker console, launch Studio by choosing your domain and existing user profile.
- Optionally, restart Studio by following the steps in Shut down and update SageMaker Studio.
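The ECR push in the first step could look like the following sketch; the account ID, Region, and repository name are placeholders you would replace with your own:

```shell
# Placeholders: replace with your account ID and Region.
ACCOUNT_ID=<YOUR_ACCOUNT_ID>
REGION=us-east-1
REPO=sagemaker-runtime

# Pull the public image and tag it for your private registry.
docker pull public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
docker tag public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu \
    $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest-cpu

# Create the repository if needed, authenticate, and push.
aws ecr create-repository --repository-name $REPO --region $REGION
aws ecr get-login-password --region $REGION | \
    docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest-cpu
```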
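The second step, creating a SageMaker image and attaching it to the Studio domain, can be sketched with the AWS CLI as follows. The image name, role, and domain ID are placeholders, and the kernel spec shown is an assumption:

```shell
# Placeholders: replace with your account, Region, domain ID, and execution role.
ACCOUNT_ID=<YOUR_ACCOUNT_ID>
REGION=us-east-1
DOMAIN_ID=<YOUR_DOMAIN_ID>
ROLE_ARN=arn:aws:iam::$ACCOUNT_ID:role/<YOUR_SAGEMAKER_EXECUTION_ROLE>

# Register the image and a version pointing at the private ECR copy.
aws sagemaker create-image --image-name sagemaker-runtime \
    --role-arn $ROLE_ARN --region $REGION
aws sagemaker create-image-version --image-name sagemaker-runtime \
    --base-image $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/sagemaker-runtime:latest-cpu \
    --region $REGION

# Create an app image config and attach the image to the Studio domain.
aws sagemaker create-app-image-config \
    --app-image-config-name sagemaker-runtime-config \
    --kernel-gateway-image-config '{"KernelSpecs": [{"Name": "python3", "DisplayName": "Python 3"}]}' \
    --region $REGION
aws sagemaker update-domain --domain-id $DOMAIN_ID --region $REGION \
    --default-user-settings '{"KernelGatewayAppSettings": {"CustomImages": [{"ImageName": "sagemaker-runtime", "AppImageConfigName": "sagemaker-runtime-config"}]}}'
```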
Download the notebook
Download the sample notebook locally from the GitHub repo.
Open the notebook in your choice of IDE and add a cell to the beginning of the notebook to install `torchsummary`. The `torchsummary` package is not part of the distribution, and installing this on the notebook will ensure the notebook runs end to end. We recommend using `conda` or `micromamba` to manage environments and dependencies. Add the following cell to the notebook and save the notebook:
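The cell itself can be as simple as a `pip` install (run in a notebook cell, prefix the command with `!` or use the `%pip` magic):

```shell
# Install torchsummary into the running kernel's environment.
pip install torchsummary
```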
Experiment on the local notebook
Upload the notebook to the JupyterLab UI you launched by choosing the upload icon as shown in the following screenshot.
When it’s uploaded, launch the `cv-kmnist.ipynb` notebook. You can start running the cells immediately, without having to install any dependencies such as torch, matplotlib, or ipywidgets.
If you followed the preceding steps, you can see that you can use the distribution locally from your laptop. In the next step, we use the same distribution on Studio to take advantage of Studio’s features.
Move the experimentation to Studio (optional)
Optionally, let’s promote the experimentation to Studio. One of the advantages of Studio is that the underlying compute resources are fully elastic, so you can easily dial the available resources up or down, and the changes take place automatically in the background without interrupting your work. If you want to run the same notebook from earlier on a larger dataset and compute instance, you can migrate to Studio.
Navigate to the Studio UI you launched earlier and choose the upload icon to upload the notebook.
After you launch the notebook, you will be prompted to choose the image and instance type. On the kernel launcher, choose `sagemaker-runtime` as the image and an `ml.t3.medium` instance, then choose Select.
You can now run the notebook end to end without needing any changes on the notebook from your local development environment to Studio notebooks!
Schedule the notebook as a job
When you’re done with your experimentation, SageMaker provides multiple options to productionize your notebook, such as training jobs and SageMaker pipelines. One such option is to directly run the notebook itself as a non-interactive, scheduled notebook job using SageMaker notebook jobs. For example, you might want to retrain your model periodically, or get inferences on incoming data periodically and generate reports for consumption by your stakeholders.
From Studio, choose the notebook job icon to launch the notebook job. If you have installed the notebook jobs extension locally on your laptop, you can also schedule the notebook directly from your laptop. See Installation Guide to set up the notebook jobs extension locally.
The notebook job automatically uses the ECR image URI of the open-source distribution, so you can directly schedule the notebook job.
Choose Run on schedule, choose a schedule, for example every week on Saturday, and choose Create. You can also choose Run now if you’d like to view the results immediately.
When the first notebook job is complete, you can view the notebook outputs directly from the Studio UI by choosing Notebook under Output files.
Additional considerations
In addition to using the publicly available ECR image directly for ML workloads, the open-source distribution offers the following advantages:
- The Dockerfile used to build the image is available publicly for developers to explore and build their own images. You can also inherit this image as the base image and install your custom libraries to have a reproducible environment.
- If you’re not used to Docker and prefer to use Conda environments in your JupyterLab environment, we provide an `env.out` file for each of the published versions. You can use the instructions in the file to create your own Conda environment that will mimic the same environment. For example, see the CPU environment file cpu.env.out.
- You can use the GPU versions of the image to run GPU-compatible workloads such as deep learning and image processing.
Clean up
Complete the following steps to clean up your resources:
- If you have scheduled your notebook to run on a schedule, pause or delete the schedule on the Notebook Job Definitions tab to avoid paying for future jobs.
- Shut down all Studio apps to avoid paying for unused compute usage. See Shut down and Update Studio Apps for instructions.
- Optionally, delete the Studio domain if you created one.
Conclusion
Maintaining a reproducible environment across different stages of the ML lifecycle is one of the biggest challenges for data scientists and developers. With the SageMaker open-source distribution, we provide an image with mutually compatible versions of the most common ML frameworks and packages. The distribution is also open source, providing developers with transparency into the packages and build processes, making it easier to customize their own distribution.
In this post, we showed you how to use the distribution on your local environment, on Studio, and as the container for your training jobs. This feature is currently in public beta. We encourage you to try this out and share your feedback and issues on the public GitHub repository!
About the authors
Durga Sury is an ML Solutions Architect on the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and long walks with her 5-year-old husky.
Ketan Vijayvargiya is a Senior Software Development Engineer in Amazon Web Services (AWS). His focus areas are machine learning, distributed systems and open source. Outside work, he likes to spend his time self-hosting and enjoying nature.
Exploring Generative AI in conversational experiences: An Introduction with Amazon Lex, Langchain, and SageMaker Jumpstart
Customers expect quick and efficient service from businesses in today’s fast-paced world. But providing excellent customer service can be significantly challenging when the volume of inquiries outpaces the human resources employed to address them. However, businesses can meet this challenge while providing personalized and efficient customer service with the advancements in generative artificial intelligence (generative AI) powered by large language models (LLMs).
Generative AI chatbots have gained attention for their ability to imitate human intellect. However, unlike task-oriented bots, these bots use LLMs for text analysis and content generation. LLMs are based on the Transformer architecture, a deep learning neural network introduced in June 2017 that can be trained on a massive corpus of unlabeled text. This approach creates a more human-like conversation experience and accommodates several topics.
As of this writing, companies of all sizes want to use this technology but need help figuring out where to start. If you are looking to get started with generative AI and the use of LLMs in conversational AI, this post is for you. We have included a sample project to quickly deploy an Amazon Lex bot that consumes a pre-trained open-source LLM. The code also includes the starting point to implement a custom memory manager. This mechanism allows an LLM to recall previous interactions to keep the conversation’s context and pace. Finally, it’s essential to highlight the importance of experimenting with fine-tuning prompts and LLM randomness and determinism parameters to obtain consistent results.
Solution overview
The solution integrates an Amazon Lex bot with a popular open-source LLM from Amazon SageMaker JumpStart, accessible through an Amazon SageMaker endpoint. We also use LangChain, a popular framework that simplifies LLM-powered applications. Finally, we use a QnABot to provide a user interface for our chatbot.
We start by describing each component in the preceding diagram:
- JumpStart offers pre-trained open-source models for various problem types, which helps you get started with machine learning (ML) quickly. It includes the FLAN-T5-XL model, an LLM deployed into a deep learning container. It performs well on various natural language processing (NLP) tasks, including text generation.
- A SageMaker real-time inference endpoint enables fast, scalable deployment of ML models for real-time predictions. With the ability to integrate with Lambda functions, the endpoint allows for building custom applications.
- The AWS Lambda function uses the requests from the Amazon Lex bot or the QnABot to prepare the payload to invoke the SageMaker endpoint using LangChain. LangChain is a framework that lets developers create applications powered by LLMs.
- The Amazon Lex V2 bot has the built-in `AMAZON.FallbackIntent` intent type. It is triggered when a user’s input doesn’t match any intents in the bot.
- The QnABot is an open-source AWS solution to provide a user interface to Amazon Lex bots. We configured it with a Lambda hook function for a `CustomNoMatches` item, and it triggers the Lambda function when QnABot can’t find an answer. We assume you have already deployed it and included the steps to configure it in the following sections.
The solution is described at a high level in the following sequence diagram.
Major tasks performed by the solution
In this section, we look at the major tasks performed in our solution. This solution’s entire project source code is available for your reference in this GitHub repository.
Handling chatbot fallbacks
The Lambda function handles the “don’t know” answers via `AMAZON.FallbackIntent` in Amazon Lex V2 and the `CustomNoMatches` item in QnABot. When triggered, this function looks at the request for a session and the fallback intent. If there is a match, it hands off the request to a Lex V2 dispatcher; otherwise, the QnABot dispatcher uses the request. See the following code:
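A minimal sketch of that routing logic, assuming the standard Lex V2 event shape (`sessionState.intent.name`); the dispatcher names and return values here are placeholders, not the solution's actual classes:

```python
# Sketch of the fallback-handling Lambda entry point. The event shapes are
# assumptions based on the Lex V2 and QnABot request formats.

def is_lex_v2_fallback(event: dict) -> bool:
    """Return True when the event is a Lex V2 request hitting the fallback intent."""
    intent = event.get("sessionState", {}).get("intent", {})
    return intent.get("name") == "FallbackIntent"

def dispatch(event: dict) -> str:
    """Route the request to the appropriate dispatcher."""
    if is_lex_v2_fallback(event):
        return "lex_v2_dispatcher"   # hand off to the Lex V2 dispatcher
    return "qnabot_dispatcher"       # otherwise the QnABot dispatcher takes it

def lambda_handler(event, context):
    dispatcher = dispatch(event)
    # The real handler would invoke the chosen dispatcher and return its response.
    return {"dispatcher": dispatcher}
```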
Providing memory to our LLM
To preserve the LLM memory in a multi-turn conversation, the Lambda function includes a LangChain custom memory class mechanism that uses the Amazon Lex V2 Sessions API to keep track of the session attributes with the ongoing multi-turn conversation messages and to provide context to the conversational model via previous interactions. See the following code:
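A sketch of what such a memory class might look like, written as a plain Python class so it runs standalone; the real implementation would subclass LangChain's memory interface, and the method and attribute names here are assumptions:

```python
# Illustrative LangChain-style memory class backed by a Lex V2 session attribute.

class LexConversationalMemory:
    """Tracks chat history carried in Lex V2 session attributes."""

    def __init__(self, lex_conv_context: str = "", memory_key: str = "chat_history"):
        self.memory_key = memory_key
        # Seed the buffer from the session attribute carried by the Lex request.
        self.chat_history = lex_conv_context

    def load_memory_variables(self, inputs: dict) -> dict:
        """Expose the accumulated history under the prompt's memory key."""
        return {self.memory_key: self.chat_history}

    def save_context(self, inputs: dict, outputs: dict) -> None:
        """Append the latest human/AI turn to the buffer."""
        turn = f"\nHuman: {inputs.get('input', '')}\nAI: {outputs.get('response', '')}"
        self.chat_history += turn
```

An instance of this class can then be handed to a LangChain chain via its `memory` parameter, so each turn is written back to the Lex session.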
The following is the sample code we created for introducing the custom memory class in a LangChain ConversationChain:
Prompt definition
A prompt for an LLM is a question or statement that sets the tone for the generated response. Prompts function as a form of context that helps direct the model toward generating relevant responses. See the following code:
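A hedged example of such a prompt, using `{chat_history}` and `{input}` placeholders; the exact wording and variable names are assumptions, and plain `str.format` stands in for LangChain's `PromptTemplate` so the snippet runs standalone:

```python
# Illustrative conversational prompt with slots for memory and the new user turn.
PROMPT_TEMPLATE = """The following is a friendly conversation between a human and an AI.
If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{chat_history}

Human: {input}
AI:"""

def build_prompt(chat_history: str, user_input: str) -> str:
    """Fill the template with the running history and the new user turn."""
    return PROMPT_TEMPLATE.format(chat_history=chat_history, input=user_input)
```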
Using an Amazon Lex V2 session for LLM memory support
Amazon Lex V2 initiates a session when a user interacts with a bot. A session persists over time unless manually stopped or timed out. A session stores metadata and application-specific data known as session attributes. Amazon Lex updates client applications when the Lambda function adds or changes session attributes. The QnABot includes an interface to set and get session attributes on top of Amazon Lex V2.
In our code, we used this mechanism to build a custom memory class in LangChain to keep track of the conversation history and enable the LLM to recall short-term and long-term interactions. See the following code:
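A sketch of how session attributes might carry the history; `ConversationContext` is an assumed attribute name, and in Lambda the dict would come from `event["sessionState"]["sessionAttributes"]` and be written back in the response:

```python
# Pack and unpack conversation history through Lex V2 session attributes,
# which Lex persists across turns of the session.

def read_history(session_attributes: dict) -> str:
    """Pull the running conversation out of the session attributes."""
    return session_attributes.get("ConversationContext", "")

def write_history(session_attributes: dict, chat_history: str) -> dict:
    """Return an updated copy of the attributes carrying the new history."""
    updated = dict(session_attributes)
    updated["ConversationContext"] = chat_history
    return updated
```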
Prerequisites
To get started with the deployment, you need to fulfill the following prerequisites:
- Access to the AWS Management Console via a user who can launch AWS CloudFormation stacks
- Familiarity navigating the Lambda and Amazon Lex consoles
Deploy the solution
To deploy the solution, proceed with the following steps:
- Choose Launch Stack to launch the solution in the `us-east-1` Region.
- For Stack name, enter a unique stack name.
- For HFModel, we use the `Hugging Face Flan-T5-XL` model available on JumpStart.
- For HFTask, enter `text2text`.
- Keep S3BucketName as is.
These are used to find Amazon Simple Storage Service (Amazon S3) assets needed to deploy the solution and may change as updates to this post are published.
- Acknowledge the capabilities.
- Choose Create stack.
There should be four successfully created stacks.
Configure the Amazon Lex V2 bot
No configuration is needed for the Amazon Lex V2 bot; our CloudFormation template already did the heavy lifting.
Configure the QnABot
We assume you already have an existing QnABot deployed in your environment. But if you need help, follow these instructions to deploy it.
- On the AWS CloudFormation console, navigate to the main stack that you deployed.
- On the Outputs tab, make a note of the `LambdaHookFunctionArn` because you need to insert it in the QnABot later.
- Log in to the QnABot Designer User Interface (UI) as an administrator.
- In the Questions UI, add a new question.
- Enter the following values:
  - ID – `CustomNoMatches`
  - Question – `no_hits`
  - Answer – Any default answer for “don’t know”
- Choose Advanced and go to the Lambda Hook section.
- Enter the Amazon Resource Name (ARN) of the Lambda function you noted previously.
- Scroll down to the bottom of the section and choose Create.
You get a window with a success message.
Your question is now visible on the Questions page.
Test the solution
Let’s proceed with testing the solution. First, it’s worth mentioning that we deployed the FLAN-T5-XL model provided by JumpStart without any fine-tuning. This may have some unpredictability, resulting in slight variations in responses.
Test with an Amazon Lex V2 bot
This section helps you test the Amazon Lex V2 bot integration with the Lambda function that calls the LLM deployed in the SageMaker endpoint.
- On the Amazon Lex console, navigate to the bot entitled `Sagemaker-Jumpstart-Flan-LLM-Fallback-Bot`. This bot has been configured to call the Lambda function that invokes the SageMaker endpoint hosting the LLM as a fallback intent when no other intents are matched.
- Choose Intents in the navigation pane.
On the top right, a message reads, “English (US) has not built changes.”
- Choose Build.
- Wait for it to complete.
Finally, you get a success message, as shown in the following screenshot.
- Choose Test.
A chat window appears where you can interact with the model.
We recommend exploring the built-in integrations between Amazon Lex bots and Amazon Connect, as well as messaging platforms (Facebook, Slack, Twilio SMS) and third-party contact centers such as Amazon Chime SDK and Genesys Cloud.
Test with a QnABot instance
This section tests the QnABot on AWS integration with the Lambda function that calls the LLM deployed in the SageMaker endpoint.
- Open the tools menu in the top left corner.
- Choose QnABot Client.
- Choose Sign In as Admin.
- Enter any question in the user interface.
- Evaluate the response.
Clean up
To avoid incurring future charges, delete the resources created by our solution by following these steps:
- On the AWS CloudFormation console, select the stack named `SagemakerFlanLLMStack` (or the custom name you set for the stack).
- Choose Delete.
- If you deployed the QnABot instance for your tests, select the QnABot stack.
- Choose Delete.
Conclusion
In this post, we explored the addition of open-domain capabilities to a task-oriented bot that routes the user requests to an open-source large language model.
We encourage you to:
- Save the conversation history to an external persistence mechanism – For example, you can save the conversation history to Amazon DynamoDB or an S3 bucket and retrieve it in the Lambda function hook. In this way, you don’t need to rely on the internal non-persistent session attributes management offered by Amazon Lex.
- Experiment with summarization – In multiturn conversations, it’s helpful to generate a summary that you can use in your prompts to add context and limit the usage of conversation history. This helps to prune the bot session size and keep the Lambda function memory consumption low.
- Experiment with prompt variations – Modify the original prompt description that matches your experimentation purposes.
- Adapt the language model for optimal results – You can do this by fine-tuning the advanced LLM parameters such as randomness (`temperature`) and determinism (`top_p`) according to your applications. We demonstrated a sample integration using a pre-trained model with sample values, but have fun adjusting the values for your use cases.
In our next post, we plan to help you discover how to fine-tune pre-trained LLM-powered chatbots with your own data.
Are you experimenting with LLM chatbots on AWS? Tell us more in the comments!
Resources and references
- Companion source code for this post
- Amazon Lex V2 Developer Guide
- AWS Solutions Library: QnABot on AWS
- Text2Text Generation with FLAN T5 models
- LangChain – Building applications with LLMs
- Amazon SageMaker Examples with Jumpstart Foundation Models
- Amazon Bedrock – The easiest way to build and scale generative AI applications with foundation models
- Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models
About the Authors
Marcelo Silva is an experienced tech professional who excels in designing, developing, and implementing cutting-edge products. Starting off his career at Cisco, Marcelo worked on various high-profile projects including deployments of the first ever carrier routing system and the successful rollout of ASR9000. His expertise extends to cloud technology, analytics, and product management, having served as senior manager for several companies like Cisco, Cape Networks, and AWS before joining GenAI. Currently working as a Conversational AI/GenAI Product Manager, Marcelo continues to excel in delivering innovative solutions across industries.
Victor Rojo is a highly experienced technologist who is passionate about the latest in AI, ML, and software development. With his expertise, he played a pivotal role in bringing Amazon Alexa to the US and Mexico markets while spearheading the successful launch of Amazon Textract and AWS Contact Center Intelligence (CCI) to AWS Partners. As the current Principal Tech Leader for the Conversational AI Competency Partners program, Victor is committed to driving innovation and bringing cutting-edge solutions to meet the evolving needs of the industry.
Justin Leto is a Sr. Solutions Architect at Amazon Web Services with a specialization in machine learning. His passion is helping customers harness the power of machine learning and AI to drive business growth. Justin has presented at global AI conferences, including AWS Summits, and lectured at universities. He leads the NYC machine learning and AI meetup. In his spare time, he enjoys offshore sailing and playing jazz. He lives in New York City with his wife and baby daughter.
Ryan Gomes is a Data & ML Engineer with the AWS Professional Services Intelligence Practice. He is passionate about helping customers achieve better outcomes through analytics and machine learning solutions in the cloud. Outside work, he enjoys fitness, cooking, and spending quality time with friends and family.
Mahesh Birardar is a Sr. Solutions Architect at Amazon Web Services with specialization in DevOps and Observability. He enjoys helping customers implement cost-effective architectures that scale. Outside work, he enjoys watching movies and hiking.
Kanjana Chandren is a Solutions Architect at Amazon Web Services (AWS) who is passionate about Machine Learning. She helps customers in designing, implementing and managing their AWS workloads. Outside of work she loves travelling, reading and spending time with family and friends.
Introducing popularity tuning for Similar-Items in Amazon Personalize
Amazon Personalize now enables popularity tuning for its Similar-Items recipe (`aws-similar-items`). Similar-Items generates recommendations that are similar to the item that a user selects, helping users discover new items in your catalog based on the previous behavior of all users and item metadata. Previously, this capability was only available for SIMS, the other `Related_Items` recipe within Amazon Personalize.
Every customer’s item catalog and the way that users interact with it are unique to their business. When recommending similar items, some customers may want to place more emphasis on popular items because they increase the likelihood of user interaction, while others may want to de-emphasize popular items to surface recommendations that are more similar to the selected item but are less widely known. This launch gives you more control over the degree to which popularity influences Similar-Items recommendations, so you can tune the model to meet your particular business needs.
In this post, we show you how to tune popularity for the Similar-Items recipe. We specify a value closer to 0 to include more popular items, and a value closer to 1 to place less emphasis on popularity.
Example use cases
To explore the impact of this new feature in greater detail, let’s review two examples. [1]
First, we used the Similar-Items recipe to find recommendations similar to Disney’s 1994 movie The Lion King (IMDB record). When the popularity discount is set to 0, Amazon Personalize recommends movies that have a high frequency of occurrence (are popular). In this example, the movie Seven (a.k.a. Se7en), which occurred 19,295 times in the dataset, is recommended at rank 3.0.
By tuning the popularity discount to a value of 0.4 for The Lion King recommendations, we see that the rank of the movie Seven drops to 4.0. We also see movies from the Children genre like Babe, Beauty and the Beast, Aladdin, and Snow White and the Seven Dwarfs get recommended at a higher rank despite their lower overall popularity in the dataset.
Let’s explore another example. We used the Similar-Items recipe to find recommendations similar to Disney and Pixar’s 1995 movie Toy Story (IMDB record). When the popularity discount is set to 0, Amazon Personalize recommends movies that have a high frequency occurrence in the dataset. In this example, we see that the movie Twelve Monkeys (a.k.a. 12 Monkeys), which occurred 6,678 times in the dataset, is recommended at rank 5.0.
By tuning the popularity discount to a value of 0.4 for Toy Story recommendations, we see that Twelve Monkeys is no longer recommended in the top 10. We also see movies from the Children genre like Aladdin, Toy Story 2, and A Bug’s Life get recommended at a higher rank despite their lower overall popularity in the dataset.
Placing greater emphasis on more popular content can help increase likelihood that users will engage with item recommendations. Reducing emphasis on popularity may surface recommendations that seem more relevant to the queried item, but may be less popular with users. You can tune the degree of importance placed on popularity to meet your business needs for a specific personalization campaign.
Implement popularity tuning
To tune popularity for the Similar-Items recipe, configure the `popularity_discount_factor` hyperparameter via the AWS Management Console, the AWS SDKs, or the AWS Command Line Interface (AWS CLI).
The following is sample code setting the popularity discount factor to 0.5 via the AWS SDK:
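A sketch with boto3; the solution name and dataset group ARN are placeholders, and the payload structure follows the `create_solution` API's `solutionConfig.algorithmHyperParameters` field:

```python
def build_solution_config(popularity_discount: float = 0.5) -> dict:
    """Assemble the solutionConfig payload carrying the popularity tuning."""
    return {
        "algorithmHyperParameters": {
            # Hyperparameter values are passed to Personalize as strings.
            "popularity_discount_factor": str(popularity_discount)
        }
    }

def create_tuned_solution(dataset_group_arn: str):
    """Create a Similar-Items solution with a popularity discount of 0.5."""
    import boto3  # AWS SDK for Python

    personalize = boto3.client("personalize")
    return personalize.create_solution(
        name="similar-items-popularity-tuned",  # placeholder name
        recipeArn="arn:aws:personalize:::recipe/aws-similar-items",
        datasetGroupArn=dataset_group_arn,
        solutionConfig=build_solution_config(0.5),
    )
```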
The following screenshot shows setting the popularity discount factor to 0.3 on the Amazon Personalize console.
Conclusion
With popularity tuning, you can now further refine the Similar-Items recipe within Amazon Personalize to control the degree to which popularity influences item recommendations. This gives you greater control over defining the end-user experience and what is included or excluded in your Similar-Items recommendations.
For more details on how to implement popularity tuning for the Similar-Items recipe, refer to the documentation.
References
[1] Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

About the Authors
Julia McCombs Clark is a Sr. Technical Product Manager on the Amazon Personalize team.
Nihal Harish is a Software Development Engineer on the Amazon Personalize team.
Introducing Google’s Secure AI Framework
Today Google released the Secure AI Framework to help collaboratively secure AI technology.
Link-credible: Get in the Game Faster With Steam, Epic Games Store and Ubisoft Account Linking on GeForce NOW
Get into your favorite games faster by linking GeForce NOW to Steam, Epic Games Store and Ubisoft accounts.
And get a peek at more games coming to GeForce NOW later this year by tuning in to Ubisoft Forward on Monday, June 12, when the game publisher will reveal its latest news and announcements.
Plus, two new games are available to stream from the cloud this week, as well as the newest season for Tom Clancy’s The Division 2 from Ubisoft.
Linked In
GeForce NOW makes gaming convenient and easy for members by enabling them to link their accounts from Steam, Epic and, most recently, Ubisoft, directly to the service. Instead of signing into their accounts for each play session, members can be automatically signed in across their devices after linking them up just once.

Starting today, launching Ubisoft Connect games requires members to link their Ubisoft accounts in the app. Once that’s completed, members can effortlessly play hit Ubisoft games, including Rainbow Six Siege, Far Cry 6 and The Division 2.
Members also have the benefit of library account syncing, which automatically syncs supported GeForce NOW games from Ubisoft Connect and Steam libraries — helping members find their Ubisoft games instantly.
For an even more streamlined experience, upgrade to an Ultimate or Priority membership to skip ahead of free members in the waiting lines and get into gaming faster.
The Mission: More Games

“Season 1: Broken Wings” is the newest season for Tom Clancy’s The Division 2, kicking off Year Five for the hit game from Ubisoft. It introduces a new game mode — Descent — a rogue-lite for teams of up to four players. Begin each match without any gear, perks or specializations and unlock them through game progression to work up through the ranks. The rest of the year will bring more seasons, each with their own manhunts, events and leagues. Stream “Broken Wings” on GeForce NOW today.
And take a look at the two new games available to stream this week:
Before the weekend arrives, let’s take things back with our question of the week. Let us know your answer on Twitter or in the comments below.
What’s the first @ubisoft game you ever played?
— NVIDIA GeForce NOW (@NVIDIAGFN) June 7, 2023
AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma

Episode 140 | June 8, 2023
Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.
In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.
This episode features Senior Principal Researcher Emre Kiciman and Principal Researcher Amit Sharma, whose paper “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality” examines the causal capabilities of large language models (LLMs) and their implications. Kiciman and Sharma break down the study of cause and effect; recount their respective ongoing journeys with GPT-3.5 and GPT-4—from their preconceptions to where they are now—and share their views of a future in which LLMs help bring together different modes of reasoning in the practice of causal inference and make causal methods easier to adopt.
Learn more:
- Causal Reasoning and Large Language Models: Opening a New Frontier for Causality – Publication, April 2023
- The AI Revolution in Medicine: GPT-4 and Beyond by Peter Lee – Book, April 2023
- AI and Microsoft Research – Learn more about the breadth of AI research at Microsoft
Transcript
[MUSIC PLAYS]
ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more fortunate to work in the field than at this moment. The development of increasingly powerful large-scale models like GPT-4 is accelerating the advancement of AI. These models are exhibiting surprising new abilities like reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’ll share conversations with fellow researchers about our impressions of GPT-4, the work we’re doing to understand its capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.
Today we’re talking with Emre Kiciman and Amit Sharma, two Microsoft researchers who have been studying causal reasoning with AI for many years. Determining cause and effect relationships is critically important across many domains such as law, medicine, and the advancement of science itself. Emre and Amit recently published a paper that explores how large language models can advance the research and application of causal reasoning with AI. Emre joins us from our lab in Redmond, Washington, and Amit is on the line from Microsoft Research India, in Bangalore.
[MUSIC FADES]
Emre, Amit, let’s jump right in. I’m so excited to speak with you both about causal reasoning. And this is such a timely conversation because we’re living through the rise of generative pretrained models, specifically large language models. And when I’ve engaged with GPT-4 in dialogue, depending on what I ask, it can appear to be doing something resembling causal reasoning. And as a machine learning person myself, I have to say this is not something that I’d expected to see from a neural network that works based on analyzing and generating statistical patterns. Um, you know, this is something that before this time last year, I thought of as a uniquely human skill as I think maybe many others have, as well. Now, both of you do this for a living. You study causal reasoning for a living. Um, and so where I’d like to start is with your first reactions to GPT-4, your first contact. What did you find surprising, and how did you feel, uh, as a researcher in this area? I want to go to Emre first on this.
EMRE KICIMAN: Sure. Well, um, yeah, I think I went through a process. Um, right now, I am surprised how much I’m depending on functionality from GPT-4 and how much I expect it to work. And yet, I also don’t quite believe that it can do the things that it’s doing. It’s really, um, a weird mind space to be in. I think the, the moment when I was a bit astounded by, like, what might be possible was actually before I got my hands on GPT-4 directly. You know, I’ve been hearing that people were very impressed with what it was doing. But the thing that made me reconsider my preconceptions was actually some of the academic research looking at, um, how transformer models and architectures could actually represent Turing machines, Turing-complete computational machines. And once I saw that the transformer architecture could represent that type of program, that type of thing, then I figured, well, all bets are off. We don’t know whether it’s learning this or not, but if it can represent it, now there really is a chance that it could, that it might be learning that. And so we have to really keep an open mind.
The second moment when I changed my mind again about what GPT-4 might be doing … so I’ll give a little background. So once I saw some of the work that we’ll talk about here, uh, coming into play, where we’re seeing GPT do some sorts of, you know, very interesting causal-related tasks, um, I was like, OK, this is great. We have our causal processes; we’re just going to run through them and this fits in. Someone will come with their causal question; we’ll run through and run our, our causal analysis. And I thought that, you know, this all makes sense. We can do things that we want, what we’ve wanted to do for so, for so long. And it was actually reading, uh, some of the vignettes in Peter Lee’s book where he was quizzing, uh, GPT-4 to diagnose a patient based on their electronic health records, explain counterfactual scenarios, um, think through why someone might have made a misdiagnosis. And, and here, all of a sudden, I realized our conceptualizations of causal tasks that we’ve worked on in the academic fields are kind of boxes where we say we’re doing effect inference or we’re doing attribution or we’re doing discovery. These like very well-circumscribed tasks are, are not enough; they’re not flexible enough. Once you have this natural language interface, you can ask so many more things, so many more interesting questions. And we need to make sure that we can formally answer those … correctly answer those questions. And, and this GPT-4 is basically a bridge to expressing and, you know, meeting people where they want to be. That really opened my eyes the second time.
LLORENS: Thanks, Emre. Amit, first impressions.
AMIT SHARMA: Yeah, my experience was back in December—I think it was when a lot of people were talking about ChatGPT—and me, thinking that I worked in causality, uh, I was quite smug, right. I knew that causality requires you to have interventional data. Language models are only built on some observations. So I was quite happy to think that I would beat this topic, right. But it was just that every day, I would see, perhaps on Twitter, people expressing new things that ChatGPT can do that one day, I thought, OK, let me just try it, right. So the first query I thought was an easy query for, uh, GPT models. I just asked it, does smoking cause lung cancer, right? And I was surprised when it gave the right answer. But then I thought maybe, oh, this is just too common. Let me ask the opposite. Does lung cancer cause smoking? Uh, it gave the right answer. No. Uh, and then I was literally struck, and I, and I thought, what else can I test, right? And then I thought of the all the causal relationships that we typically talk about in our field, and I started doing them one by one. And what I found was that the accuracy was just astounding. And it was not just the accuracy, but also the explanation that it gives would sort of almost make you believe that as if it is a causal agent, as if it is doing, uh, something causal. So, so to me, I think those few days in December with slightly sleepless nights on what exactly is going on with these models and what I might add … what am I going to do as a researcher now? [LAUGHS] I think that was, sort of, my initial foray into this. And, and I think the logical next step was then to study it more deeply.
LLORENS: And stemming from both of your reactions, you began collaborating on a paper, which you’ve recently released, called “Causal Reasoning [and] Large Language Models,” um, and I’ve had the, you know, the pleasure of spending some time with that over these last few days and, and a week here. And one of the things you do in the paper is you provide what I think of as a helpful overview of the different kinds of causality. And so, Emre, I want to go back to you. What is causality, and how can we think about the space of different, you know, kinds of causal reasoning?
KICIMAN: Causality … it’s the study of cause-and-effect relationships, of the mechanisms that, that drive, you know, what we see happening in the world around us. You know, why do things happen? What made something happen? And this is a study that spread out across so many disciplines—computer science, economics, health, statistics. Like, everyone cares about, about causality, to some degree. And so this means that there’s many different kinds of, you know, tools and languages to talk about causality, um, that are appropriate for different kinds of tasks. So that’s one of the first things that we thought we had to lay out in the paper, was kind of a very broad landscape about what causality is. And so we talk about a couple of different axes. One is data-driven causal analysis, and the other is logic-based causal reasoning. These are two very different ways of, of, of thinking about causality. And then the second major axis is whether we’re talking about causal relationships in general, in the abstract, like, uh, does smoking normally cause … or often cause cancer? Versus causality in a very specific context— that’s called actual causality. And this is something like Bob smoked; Bob got lung cancer. Was Bob’s lung cancer caused by Bob’s smoking? It’s a very specific question in this very, you know, in, in a specific instance. And so those are the two axes: data-driven versus logic and then general causality versus actual causality.
LLORENS: Amit, I want to go to you now, and I want to dwell on this topic of actual causality. And I actually learned this phrase from your paper. But I think this is a kind of causal reasoning that people do quite often, maybe even it’s the thing they think about when they think about causal reasoning. So, Amit, you know, let’s go deeper into what actual causality is. Maybe you can illustrate with some examples. And then I want to get into experiments you’ve conducted in this area with GPT-4.
SHARMA: Sure. So interestingly, actual causality in research is sort of the less talked about. As Emre was saying, I think most researchers in health sciences, economics often talk about general phenomena. But actual causality talks about events and what might have caused them, right. So think about something happens in the real world. So let’s say … I’ll take an example of, let’s say, you catch a ball and you prevent it from falling down, right. And I think people would reasonably argue that your catching the ball was the cause of preventing it from falling onto the ground. But very quickly, these kinds of determinations become complex because what could have been happening is that there could be multiple other factors at play, uh, and there could also be questions about how exactly you’re even thinking about what is a cause. Should, should you be thinking about necessary causes, or should you be thinking about sufficient causes, and so on. So, so I think actual causality before sort of these language models was kind of a paradox in the sense that the applications were kind of everywhere, going from everyday life to even thinking about computer systems. So if your computer system fails, you want to understand why this failure occurred, right. You’re not really interested in why computer systems fail in general; you’re just interested in answering the specific failure’s causes. And the paradox is that even though these sort of questions were so common, I think what research had to offer, uh, was not immediately systemizable or deployable, uh, because you would often sort of tie yourself in knots in defining exactly what you mean by the cause and also sort of how do you even get that framing without sort of just having a formal representation, right. Most of these tasks were in English, right, or in the case of computer systems, you would just get a debug log. 
So I think one of the hardest problems was how do you take something in vague language, human language, and convert it into sort of logical framing or logical systems?
LLORENS: In the paper, you explore briefly, you know, kind of actual causality that deals with responsibility or faults. And, you know, this connects with things like, you know, reasoning in the, in the legal domain. And so I just want to, I want to explore that with you. And I know I’ve jumped to the back of the paper. I just find these particular set … this particular set of topics pretty fascinating. And so tell me about the experiments that you’ve conducted where you ask, you know, the, the algorithm … the model to do this kind of actual causal reasoning around assigning blame or responsibility for something?
SHARMA: So one of the important challenges in actual causality is determining what’s a necessary cause and what’s a sufficient cause for an event, right. Now if you’re familiar with logic, you can break this down into sort of simple predicates. What we are asking is if an event happened, was some action necessary? It means that if that action did not happen, then that event would not happen, right. So we have a nice ”but for” relationship. Sufficiency, on the other hand, is kind of the complement. So there you’re saying if this action happens, the event will always happen, irrespective of whatever else happens in the world, right. And so, so far, in actual causality, people would use logic-based methods to think about what’s the right answer for any kind of event. So what we did was we looked at all the sort of vignettes or these examples that causality researchers had collected over the past decade. All of these are very challenging examples of situations in English language. And I think their purpose was to kind of elucidate the different kinds of sort of gotchas you get when you try to sort of just use the simple concept for real-world applications. So let me take you through one example in our dataset that we studied and how we’re finding that LLMs are somehow able to take this very vague, ambiguous information in an English-language vignette and directly go from sort of that language to an answer in English, right. So in a sense, they’re kind of sidestepping the logical reasoning, but maybe in the future we can also combine logical reasoning and LLMs.
So let’s take an example. Uh, it’s like Alice catches a ball. The next part on … the next destination on the ball’s trajectory was a brick wall, which would have stopped it, and beyond that there was a window. So as humans, we would immediately think that Alice was not a cause, right, because even if she had not stopped the ball, it would have hit the brick, and so if you’re asking if Alice was the cause of the window being safe, an intuitive answer might be no. But when you analyze it through the necessary and sufficient lens, you would find that Alice was obviously not a necessary cause because the brick wall would have stopped it, but Alice was a sufficient cause, meaning that if Alice had stopped the ball, even if the brick wall collapsed, even if other things happened in the world, the window would still be safe right. So these are the kind of sort of interesting examples that we tried out. And what we found was GPT-3.5, which is ChatGPT, does not do so well. I think it actually fails to identify correctly these causes, but GPT-4 somehow is able to do that. So it gets about 86 percent accuracy on, on this task. And one of the interesting things we were worried about was maybe it’s just memorizing. Again, these are very popular examples in textbooks, right? So we did this fun thing. We just created our own dataset. So, so now instead of Alice catching a ball, Alice could be, I don’t know, dropping a test tube in a lab, right? So we created this sort of a lab setup—a completely new dataset—and we again found the same results that GPT-4 is able to infer these causes.
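The necessity and sufficiency tests Sharma walks through for the Alice example can be sketched as counterfactual queries against a toy structural model. This is an illustrative sketch, not code from the paper; the function and variable names are invented for the example:

```python
def window_safe(alice_catches: bool, wall_intact: bool) -> bool:
    # Toy structural model: the window stays safe if either Alice
    # catches the ball or the brick wall stops it.
    return alice_catches or wall_intact

# Actual world: Alice caught the ball and the brick wall was standing.
actual_wall = True

# Necessity ("but for"): had Alice not caught the ball, with everything
# else as it actually was, would the window have been broken?
necessary = not window_safe(alice_catches=False, wall_intact=actual_wall)

# Sufficiency: does Alice catching the ball keep the window safe no
# matter what else happens (e.g., even if the wall collapses)?
sufficient = all(
    window_safe(alice_catches=True, wall_intact=wall)
    for wall in (True, False)
)

print(necessary, sufficient)  # False True
```

As in the vignette, the but-for test says Alice was not a necessary cause (the wall would have stopped the ball anyway), while the sufficiency test says she was a sufficient one.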
LLORENS: Now you’re, you’re getting into experimental results, and that’s great because one of the things that I think required some creativity here was how you actually even structure, you know, a rigorous set of experiments. And so, Emre, can you take … take us through the experiment setup and how you had to approach that with this, you know, kind of unique, unique way of assessing causal reasoning?
KICIMAN: Well, one of the things that we wanted to make sure we had when we were running these experiments is, uh, construct validity to really make sure that the experiments that we were running were testing what we thought they were testing, or at least that we understood what they actually were testing. Um, and so most of these types of, uh, tests over large language models work with benchmark questions, and the biggest issue with the, with many of these benchmark questions is that often the large language models have seen them before. And there’s a concern that rather than thinking through to get the right answer, they’ve really only memorized the specific answers to these, to these specific questions.
And so what we did was, uh, we actually ran a memorization test to see whether the underlying dataset had been memorized by the large language model before. We developed … some of our benchmark datasets we developed, uh, as novel datasets that, you know, had never been written before so clearly had not been seen or memorized. And then we ran additional tests to help us understand what was triggering the specific answers. Like we would redact words from our question, uh, to see what would lead the LLM to make a mistake. So, for example, if we remove the key word from the question, we would expect the LLM to be confused, right. That’s, that’s fine. If we removed an unimportant word, maybe, you know, a participle or something, then we would expect that, that, that, that should be something that the LLM should recover from. And so this was able to give us a better understanding of what the LLM was, was paying attention to. This led us, for example, to be very clear in our paper that in, for example, our causal discovery experiments—where we are specifically asking the LLM to go back to its learned knowledge and tell us whether it knows something from common sense or domain knowledge, whether it’s memorized that, you know, some, uh, some cause, uh, has a particular effect—we are very clear in our experiments that we are not able to tell you what the odds are that the LLM has memorized any particular fact. But what we can say is, given that it’s seen that fact, is it able to transform it, you know, and combine it somehow into the correct answer in a particular context. And so it’s just, it’s really important to, to know what, uh, what these experiments really are testing. So I, I really appreciated the opportunity to go a little bit deeper into these studies.
LLORENS: I find this concept of construct validity pretty fascinating here, and it’s, you know, you, you stressed the importance of it for doing this kind of black-box testing, where you don’t actually have an explicit model for how the, well, the model is doing what it’s doing. And, you know, you talked about memorization as one important test where you’re, you know, you want to, you want to have a valid construct. But I think even deeper than that, there’s, there’s an aspect of your mental model, your beliefs about, you know, what the algorithm is doing and how relevant the testing you’re doing would be to future performance or performance on future tasks. And so I wonder if we can dwell on this notion of construct validity a little bit, maybe even one level deeper than the memorization, you know, you and your mental model of what’s happening there and why that’s important.
KICIMAN: My mental model of what the large language model is giving us is that it’s read so much of the text out on the internet that it’s captured the common sense and domain knowledge that we would normally expect only a human to do. And through some process—maybe it’s, maybe it’s probabilistic; maybe it’s some more sophisticated reasoning—it’s able to identify, like Amit said, the most important or relevant relationships for a particular scenario. So it knows that, you know, when we’re talking about a doctor washing his or her hands with soap or not, that infection, uh, in a patient is the next … is something that’s really critical. And maybe if we weren’t talking about a doctor, this would not be, you know, the most important consideration. So it is starting from capturing this knowledge, remembering it somehow in its model, and then recognizing the right moment to recall that fact and put it back out there as part of its answer. Um, that’s, that’s my mental model of what I think it’s doing, and we are able to demonstrate with our, you know, experiments that it is transforming from many different input data formats into, you know, answers to our natural language questions. So we, we have data we think it’s seen that’s in tabular format or in graphical formats. Um, and, you know, it’s, it’s impressive to see that it’s able to generate answers to our questions in various natural language forms.
LLORENS: I want to go now to a different kind of causality, causal discovery, which you describe in your paper as dealing with variables and their effect on each other. Emre, we’ll stick with you. And I also think that this is a, a kind of causal reasoning that maybe is closer to your day job and closer to the kinds of models maybe that you construct in the problems that you deal with. And so tell me about causal discovery and, you know, what you’re seeing in terms of the capabilities of GPT-4 and your, your experimentation.
KICIMAN: Yeah. So causal discovery is about looking at data, observational data, where you’re not necessarily intervening on the system—you’re just watching—and then from that, trying to figure out what relationships … uh, what the causal relationships are among the factors that you’re observing. And this is something that usually is done in the context of general causality, so trying to learn general relationships, uh, between factors, and it’s usually done in a, in a databased way—looking at the covariances, statistical covariances, between your observations. And, uh, there’s causal discovery algorithms out there. Uh, there are … this is something that’s been studied for decades. And there’s essentially, uh, testing statistical independence relationships that, you know, if something isn’t causing something else, then if you hold everything constant, there should be statistical independence between those two factors or different kinds of statistical independence relationships depending on what type of causal structures you see in, uh, among the relationships. And what these algorithms are able to do, the classical algorithms, is they can get you down to, um, a set of, a set of plausible relationships, but there’s always some point at which they can’t solve … uh, they can’t distinguish things based on data alone. They can, you know … there’s going to be a couple of relationships in your dataset where they might not know whether A is causing B or B is causing A, vice versa. And this is where a human comes in with their domain knowledge and has to make a declaration of what they think the right answer is based on their understanding of system mechanics. So there’s always this reliance on a human coming in with domain knowledge. And what, what we’re, uh, seeing now, I think, with LLMs is for the first time, we have some sort of programmatic access to this common sense and domain knowledge, just like in the actual causality setting. 
We have it provided to us again, uh, in the causal discovery setting. And we can push on this further. We don’t have … we can, if we want, run our data analysis first, then look at the LLM to, um, to disambiguate the last couple of things that we couldn’t get out of data. But we can also start from scratch and just ask, uh, the LLM to orient all of these causal edges and identify the right mechanisms from the beginning, just solely based on common sense and domain knowledge.
And so that’s what we did in our experiments here. We went through, uh, lists of edges and then larger graph structures to see how much we could re-create from, uh, just the common sense or domain knowledge that’s captured inside the LLM. And it did, it did quite well, beating the state of the art of the data-oriented approaches. Now, to be clear, it’s not doing the same task. If you have some data about a phenomenon that’s never been studied before, it’s not well understood, it’s never been named, the large language model is not going to be able to tell you—I don’t think it’s going to be able to tell you—what that causal relationship is. But for the many things that we do already know, it, it beats, you know, looking at the data. It’s, it’s quite impressive that way. So we think this is super exciting because it really removes this burden that we’ve really put on to the human analyst before, and now, now we can run these analyses, these … this whole data-driven process can be, uh, uh, built off of common sense it’s already captured without having to ask a user, a human, to type it all up correctly.
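The statistical-independence idea Kiciman describes can be illustrated with a small simulation (not from the paper) of a causal chain A → B → C: A and C are correlated, but become nearly independent once B is held fixed, which is exactly the kind of relationship a constraint-based discovery algorithm tests for:

```python
import math
import random

random.seed(0)

# Simulate a causal chain A -> B -> C with additive Gaussian noise.
n = 5000
a = [random.gauss(0, 1) for _ in range(n)]
b = [ai + random.gauss(0, 0.5) for ai in a]
c = [bi + random.gauss(0, 0.5) for bi in b]

def corr(x, y):
    # Plain Pearson correlation, written out with the standard library.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(vx * vy)

r_ab, r_bc, r_ac = corr(a, b), corr(b, c), corr(a, c)

# Partial correlation of A and C given B: a value near zero indicates
# A is independent of C once B is controlled for, so the data support
# a chain structure with no direct A -> C edge.
partial_ac_given_b = (r_ac - r_ab * r_bc) / math.sqrt(
    (1 - r_ab**2) * (1 - r_bc**2)
)

print(round(r_ac, 2), round(partial_ac_given_b, 3))
```

Tests like this can narrow the graph down to an equivalence class, but as Kiciman notes, they cannot orient every edge from data alone; that is the gap the LLM's domain knowledge is being asked to fill.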
LLORENS: Amit, one of the things I found fascinating about the set of experiments that you, that you ran here was the prompt engineering and just the effect on the experimental results of different ways of prompting the model. Take us through that experience and, and please do get specific on the particular prompts that you used and their effects on the outcome.
SHARMA: Sure, yeah, this was an iterative exercise for us, as well. So as I was mentioning [to] you, when I started in December, um, the prompt I used was pretty simple: does changing A cause a change in B, right? So if you’re thinking of, let’s say, the relationship between altitude and temperature, it would just translate to a single sentence: does changing the altitude change the temperature? As we sort of moved into working for our paper and as we saw many different prompt strategies from other works, we started experimenting, right, and one of the most surprising things—actually shocking for us—was that if you just add … in these GPT-3.5 and 4 class of models, there’s a system prompt which sort of you can give some meta instructions to, to the model, and we just added a single line saying that “you are an expert in causal reasoning.” And it was quite shocking that just that thing gave us a 5-percentage point boost in the accuracy on the datasets that we were testing. So there’s something there about sort of prompting or kind of conditioning the model to be generating text more attuned with causality, which we found as interesting. It also sort of suggests that maybe the language model is not the model here; maybe it’s the prompt plus a language model, uh, meaning that GPT-4 with a great prompt could give you great answers, but sort of there’s a question of robustness of the prompt, as well. And I think finally, the prompt that we went for was an iteration on this, where instead of asking two questions—because for each pair we can ask, does A cause B or does B cause A—we thought of just making it one prompt and asking it, here are two variables, let’s say, altitude and temperature. Which direction is more likely? And so we just gave it two options or three options in the case of no direction exists. And there were two benefits to this. 
So, one, I think somehow this was, uh, increasing the accuracy even more, perhaps because choosing between options becomes easier now; you can compare which one is more likely. But also we could ask the LLM now to explain its reasoning. So we would ask it literally, explain it step by step going from the chain of thought reasoning. And its answers would be very instructive. So for example, some of the domains we tested, uh, we don’t know anything about it, right. So there was one neuropathic pain dataset, which has nodes called radiculopathy, DLS, lumbago. We have no idea, right. But just looking at the responses from the LLM, you can both sort of get a peek into what it’s doing at some high level maybe, but also understand the concepts and think for yourself whether those sorts of things, the reasoning, is making sense or not, right. And of course, we are not experts, so we may be fooled. We might think this is doing something. But imagine a doctor using it or imagine some expert using it. I think they can both get some auxiliary insight but also these explanations help them debug it. So if the explanation seems to be off or it doesn’t make sense, uh, that’s also a nice way of sort of knowing when to trust the model or not.
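The prompt strategy Sharma describes — a system message that conditions the model, plus a single multiple-choice question per variable pair — can be sketched as follows. The exact wording used in the paper may differ, and the model call itself is omitted here:

```python
# System prompt: Sharma notes that this one line alone gave roughly a
# 5-percentage-point accuracy boost in their experiments.
SYSTEM_PROMPT = "You are an expert in causal reasoning."

def pairwise_direction_prompt(var_a: str, var_b: str) -> str:
    # One question per pair with explicit options, rather than two
    # separate yes/no queries; step-by-step reasoning is requested so
    # the model's explanation can be inspected and debugged.
    return (
        f"Here are two variables: {var_a} and {var_b}. "
        "Which causal direction is more likely?\n"
        f"(A) {var_a} causes {var_b}\n"
        f"(B) {var_b} causes {var_a}\n"
        "(C) Neither causes the other\n"
        "Explain your reasoning step by step, then answer A, B, or C."
    )

print(SYSTEM_PROMPT)
print(pairwise_direction_prompt("altitude", "temperature"))
```

In practice, the system prompt and the per-pair question would be sent together to the chat model, and the lettered answer parsed from its response.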
KICIMAN: One of the things that we noticed with these prompts is that, you know, there’s more to do in this space, too. Like the kinds of mistakes that it’s making right now are things that we think might be resolved at least, you know, in some part with additional prompting or thinking strategies. For example, one of the mistakes was, um, about … when we asked about the relationship between ozone and levels in radiation levels, and it answered wrong. It didn’t answer what, what was expected in the benchmark. But it turns out it’s because there’s ambiguity in the question. The relationship between ozone and radiation, uh, is one direction if you’re talking about ozone at ground level in a city, and it’s the other direction if you’re talking about ozone in the stratosphere. And so you can ask it, is there any ambiguity here? Is there any additional information you would need that would change the direction of the causal mechanism that you’re, you know, suggesting? And it’ll tell you; it’ll say, if we’re talking about in the stratosphere, it’s this; if it’s on the ground, it’s this. And so there’s really … I think we’re going to see some really fun strategies for improving the performance further by digging into these types of interrogations.
LLORENS: You know, the model is a kind of generalist in a way that most people are not or—I’m just going to go for it—in a way that no person is. You know, with all this knowledge of law and culture and economics and so many other … code, you know, so many other things, and I could imagine showing up and, yeah, a little bit of a primer on, a briefing on, well, here’s why you’re here and what you’re doing … I mean, that’s helpful for a person. And I imagine … and as we see, it’s helpful for these generalist, you know, general-purpose reasoners. And of course, mechanistically, what we’re doing is through the context, we’re inducing a different probability distribution over the tokens. And so I guess that’s … no, that’s what’s happening here. This is the primer that it gets before it steps into the room and, and does the Q&A or gives the talk, you know, as, as, as we do. But I want to get into a little bit now about where you see this going from here—for the field and for you as a researcher in the field. Let’s, let’s stick with you, Emre. Where do we go from here? What are some of the exciting frontiers?
KICIMAN: What I’m most excited about is this opportunity I think that’s opening up right now to fluidly, flexibly go back and forth between these different modes of causality. Going from logic-based reasoning to data-based reasoning and going beyond the kind of set tasks that we have well-defined for, for us in our field right now. So there’s a fun story that I heard when I was visiting a university a couple of months ago. We were talking about actual causality and connections to, to database causality, and this person brought up this scenario where they were an expert witness in a case where a hedge fund was suing a newspaper. The newspaper had run an exposé of some kind on the hedge fund, scared off all of their investors, and the hedge fund went belly-up. And the hedge fund was blaming the newspaper and wanted, you know, compensation for this, right. But at the same time, this was in the middle of a financial crisis. And so there’s this question of wouldn’t the hedge fund have failed anyway? A lot of other hedge funds did. Plus there’s the question of, you know, how much of an effect do newspaper stories like this usually have? Could it possibly have killed the hedge fund? And then there’s all the, you know, questions of normality and, you know, morality and stuff of maybe this is what the newspaper is supposed to be doing anyway. It’s not their fault, um, what the consequences were. So now you can imagine asking this question, starting off in this logical, you know, framing of the problem; then when you get down to this sub-element of what happened to all the other hedge funds—what would have happened to this hedge fund if, um, if the newspaper hadn’t written a story?—we can go look at the data of what happened to all the other hedge funds, and we can run the data analysis, and we can come back. We can go back and forth so much. I think that kind of flexibility is something I’m really going to be excited to see us, you know, able to automate in some fashion.
LLORENS: Amit, what do you think? Where do we go from here?
SHARMA: Yeah, I think I’m also excited about the practical aspects of how this might transform the causal practice. So, for example, what Emre and I have worked a lot on, this problem of estimating the causal effect, and one of the challenges that has been in the field for a long time is that we have great methods for estimating the causal effect once we have the graph established, but getting that graph often is a really challenging process, and you need to get domain expertise, human involvement, and often that means that a lot of the causal analysis does not get done just because the upfront cost of building a graph is just too much or it’s too complex. And the flipside is also that it’s also hard to verify. So suppose you assume a graph and then you do your analysis; you get some effect like this policy is better, let’s say. It’s very hard to evaluate how good your graph was and how maybe there are some checks you can do, robustness checks, to, to validate that, right.
And so what I feel the opportunity here is that the LLMs are really being complementary to what we are already good at in causal inference, right? So we’re only good at, given a graph, getting you an estimate using statistics. What the LLMs can come in and do is help domain experts build the graph much, much faster. So now instead of sort of thinking about, “Oh, what is my system? What do I need to do?” Maybe there’s a documentation of your system somewhere that you just feed into an LLM, and it provides you a candidate graph to start with. And at the same time, on the backend, once you have estimated something, a hard challenge that researchers like us face is what might be good robustness checks, right. So often these are … one example is a negative control, where you try to think of what is something that would definitely not cause the outcome. I know it from my domain knowledge. Let me run my analysis through assuming if that was the action variable, and then my analysis should always give an answer of zero. But again, like sort of figuring out what such variables are is more of an art than science. And I think in the preliminary experiments that we are doing, the LLMs could also help you there; you could again sort of give your graph and your data … and your sort of data description, and the LLMs can suggest to you, “Hey, these might be the variables that you can use for your robustness check.” So I’m most excited about this possibility of sort of more and more adoption of causal methods because now the LLMs can substitute or at least help people to stand up these analyses much faster.
LLORENS: Thank you both for this fascinating discussion. Understanding cause-and-effect relationships is such a fundamental part of how we apply human intelligence across so many different domains. I’m really looking forward to tracking your research, and the possibilities for more powerful causal reasoning with AI.
The post AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma appeared first on Microsoft Research.
Less Is More: A Unified Architecture for Device-Directed Speech Detection with Multiple Invocation Types
Suppressing unintended invocation of the device caused by speech that sounds like the wake word, or by accidental button presses, is critical for a good user experience, and is referred to as False-Trigger-Mitigation (FTM). When there are multiple invocation options, the traditional approach to FTM is to use invocation-specific models, or a single model for all invocations. Both approaches are sub-optimal: the memory cost of the former grows linearly with the number of invocation options, which is prohibitive for on-device deployment, and it does not take advantage of shared training data;…
Apple Machine Learning Research
Accelerate PyTorch with DeepSpeed to train large language models with Intel Habana Gaudi-based DL1 EC2 instances
Training large language models (LLMs) with billions of parameters can be challenging. In addition to designing the model architecture, researchers need to set up state-of-the-art distributed training techniques like mixed precision support, gradient accumulation, and checkpointing. With large models, the training setup is even more challenging because the available memory in a single accelerator device bounds the size of models that can be trained using only data parallelism, and model parallel training requires an additional level of modification to the training code. Libraries such as DeepSpeed (an open-source deep learning optimization library for PyTorch) address some of these challenges, and can help accelerate model development and training.
In this post, we set up training on the Intel Habana Gaudi-based Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances and quantify the benefits of using a scaling framework such as DeepSpeed. We present scaling results for an encoder-type transformer model (BERT with 340 million to 1.5 billion parameters). For the 1.5-billion-parameter model, we achieved a scaling efficiency of 82.7% across 128 accelerators (16 dl1.24xlarge instances) using DeepSpeed ZeRO stage 1 optimizations. The optimizer states were partitioned by DeepSpeed to train large models using the data parallel paradigm. This approach has been extended to train a 5-billion-parameter model using data parallelism. We also used Gaudi’s native support of the BF16 data type for reduced memory size and increased training performance compared to using the FP32 data type. As a result, we achieved pre-training (phase 1) model convergence within 16 hours (our target was to train a large model within a day) for the BERT 1.5-billion-parameter model using the wikicorpus-en dataset.
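To see why partitioning optimizer states matters at this scale, some back-of-the-envelope arithmetic helps (an illustrative sketch only; exact footprints depend on the optimizer and mixed-precision setup). With Adam, each parameter typically carries two FP32 optimizer states (momentum and variance), and ZeRO stage 1 shards those states across the data-parallel workers:

```python
# Illustrative estimate of optimizer-state memory under ZeRO stage 1.
# Assumes Adam with two FP32 states (momentum, variance) per parameter;
# real footprints vary with the optimizer and precision configuration.

def zero1_optimizer_state_gb(num_params: int, num_workers: int,
                             states_per_param: int = 2,
                             bytes_per_state: int = 4) -> float:
    """Optimizer-state memory per worker, in GB, with states sharded."""
    total_bytes = num_params * states_per_param * bytes_per_state
    return total_bytes / num_workers / 1e9

total_gb = zero1_optimizer_state_gb(1_500_000_000, 1)        # unsharded
per_worker_gb = zero1_optimizer_state_gb(1_500_000_000, 64)  # sharded over 64 accelerators
print(total_gb, per_worker_gb)  # 12.0 0.1875
```

Unsharded, the 1.5B-parameter model's optimizer states alone would occupy roughly 12 GB on every accelerator; sharded across 64 workers, each holds well under 1 GB, freeing memory for parameters, gradients, and activations.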
Training setup
We provisioned a managed compute cluster comprising 16 dl1.24xlarge instances using AWS Batch. We developed an AWS Batch workshop that illustrates the steps to set up the distributed training cluster with AWS Batch. Each dl1.24xlarge instance has eight Habana Gaudi accelerators, each with 32 GB of memory and a full mesh RoCE network between cards, with a total bi-directional interconnect bandwidth of 700 Gbps per card (see Amazon EC2 DL1 instances Deep Dive for more information). The dl1.24xlarge cluster also used four AWS Elastic Fabric Adapters (EFA), with a total of 400 Gbps of interconnect bandwidth between nodes.
The workshop shows the distributed training setup using AWS Batch, and in particular the multi-node parallel jobs feature, to launch large-scale containerized training jobs on fully managed clusters. More specifically, a fully managed AWS Batch compute environment is created with DL1 instances. The containers are pulled from Amazon Elastic Container Registry (Amazon ECR) and launched automatically on the instances in the cluster based on the multi-node parallel job definition. The workshop concludes by running a multi-node, multi-HPU data parallel training of a BERT (340 million to 1.5 billion parameters) model using PyTorch and DeepSpeed.
BERT 1.5B pre-training with DeepSpeed
Habana SynapseAI v1.5 and v1.6 support DeepSpeed ZeRO1 optimizations. The Habana fork of the DeepSpeed GitHub repository includes the modifications necessary to support the Gaudi accelerators. There is full support of distributed data parallel (multi-card, multi-instance), ZeRO1 optimizations, and BF16 data types.
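A minimal DeepSpeed configuration enabling these features might look like the following. This is a sketch, not the reference setup from the repository: the micro-batch and accumulation values mirror the numbers reported later in this post, but the exact keys Habana's fork expects may differ.

```python
import json

# Hypothetical DeepSpeed config enabling ZeRO stage 1 and BF16.
# Micro-batch of 16 with 24 accumulation steps gives an effective
# per-accelerator batch of 384, matching the setup described below.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 24,
    "zero_optimization": {"stage": 1},  # partition optimizer states only
    "bf16": {"enabled": True},          # use Gaudi's native BF16 support
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The config file is then passed to the training script via DeepSpeed's standard `--deepspeed_config` mechanism.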
All these features are enabled in the BERT 1.5B model reference repository, which introduces a 48-layer, 1,600-hidden-dimension, 25-head bi-directional encoder model derived from a BERT implementation. The repository also contains the baseline BERT Large model implementation: a 24-layer, 1,024-hidden, 16-head, 340-million-parameter neural network architecture. The pre-training modeling scripts are derived from the NVIDIA Deep Learning Examples repository to download the wikicorpus_en data, preprocess the raw data into tokens, and shard the data into smaller h5 datasets for distributed data parallel training. You can adopt this generic approach to train your own PyTorch model architectures with your own datasets on DL1 instances.
Pre-training (phase 1) scaling results
For pre-training large models at scale, we mainly focused on two aspects of the solution: training performance, as measured by the time to train, and cost-effectiveness of arriving at a fully converged solution. Next, we dive deeper into these two metrics with BERT 1.5B pre-training as an example.
Scaling performance and time to train
We start by measuring the performance of the BERT Large implementation as a baseline for scalability. The following table lists the measured throughput of sequences per second from 1-8 dl1.24xlarge instances (with eight accelerator devices per instance). Using the single-instance throughput as baseline, we measured the efficiency of scaling across multiple instances, which is an important lever to understand the price-performance training metric.
Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
1 | 8 | 1,379.76 | 172.47 | 100.0% |
2 | 16 | 2,705.57 | 169.10 | 98.04% |
4 | 32 | 5,291.58 | 165.36 | 95.88% |
8 | 64 | 9,977.54 | 155.90 | 90.39% |
The following figure illustrates the scaling efficiency.
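The scaling-efficiency column can be reproduced from the throughput numbers alone: it is the measured aggregate throughput divided by what perfect linear scaling of the single-instance baseline would give. A small sketch using the values from the table:

```python
# Scaling efficiency = measured throughput / (n_instances * single-instance baseline).
BASELINE = 1379.76  # sequences/second on 1 instance (8 accelerators)

def scaling_efficiency(n_instances: int, seq_per_sec: float) -> float:
    return 100.0 * seq_per_sec / (n_instances * BASELINE)

for n, tput in [(1, 1379.76), (2, 2705.57), (4, 5291.58), (8, 9977.54)]:
    print(f"{n} instances: {scaling_efficiency(n, tput):.2f}%")
# 1 instances: 100.00%
# 2 instances: 98.04%
# 4 instances: 95.88%
# 8 instances: 90.39%
```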
For BERT 1.5B, we modified the hyperparameters for the model in the reference repository to guarantee convergence. The effective batch size per accelerator was set to 384 (for maximum memory utilization), with micro-batches of 16 per step and 24 steps of gradient accumulation. Learning rates of 0.0015 and 0.003 were used for 8 and 16 nodes, respectively. With these configurations, we achieved convergence of the phase 1 pre-training of BERT 1.5B across 8 dl1.24xlarge instances (64 accelerators) in approximately 25 hours, and 15 hours across 16 dl1.24xlarge instances (128 accelerators). The following figure shows the average loss as a function of the number of training epochs, as we scale up the number of accelerators.
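The batch-size bookkeeping above is worth making explicit: the effective per-accelerator batch is the micro-batch multiplied by the gradient-accumulation steps, and the global batch additionally multiplies in the number of accelerators (the global figure is derived here, not stated in the post). A quick check of the 64-accelerator configuration:

```python
micro_batch = 16       # samples processed per accelerator per forward/backward step
grad_accum_steps = 24  # an optimizer step is taken every 24 micro-batches
accelerators = 64      # 8 dl1.24xlarge instances x 8 Gaudi cards

effective_batch_per_accel = micro_batch * grad_accum_steps
global_batch = effective_batch_per_accel * accelerators
print(effective_batch_per_accel, global_batch)  # 384 24576
```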
With the configuration described earlier, we obtained 85% strong scaling efficiency with 64 accelerators and 83% with 128 accelerators, from a baseline of 8 accelerators in a single instance. The following table summarizes the parameters.
Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
1 | 8 | 276.66 | 34.58 | 100.0% |
8 | 64 | 1,883.63 | 29.43 | 85.1% |
16 | 128 | 3,659.15 | 28.59 | 82.7% |
The following figure illustrates the scaling efficiency.
Conclusion
In this post, we evaluated support for DeepSpeed by Habana SynapseAI v1.5/v1.6 and how it helps scale LLM training on Habana Gaudi accelerators. Pre-training of a 1.5-billion-parameter BERT model took 16 hours to converge on a cluster of 128 Gaudi accelerators, with 82.7% strong scaling efficiency. We encourage you to take a look at the architecture demonstrated in the AWS workshop and consider adopting it to train custom PyTorch model architectures using DL1 instances.
About the authors
Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in the area of physics-infused deep learning, building, and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from the Massachusetts Institute of Technology and has over 25 patents and publications to his credit.
RJ is an engineer in the Search M5 team, leading efforts to build large-scale deep learning systems for training and inference. Outside of work, he explores different cuisines and plays racquet sports.
Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.
Abhinandan Patni is a Senior Software Engineer at Amazon Search. He focuses on building systems and tooling for scalable distributed deep learning training and real time inference.
Pierre-Yves Aquilanti is Head of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry’s best cloud-based ML framework solutions. His background is in high performance computing; prior to joining AWS, Pierre-Yves worked in the oil and gas industry. Pierre-Yves is originally from France and holds a Ph.D. in Computer Science from the University of Lille.
Evaluating speech synthesis in many languages with SQuId
Previously, we presented the 1,000 languages initiative and the Universal Speech Model with the goal of making speech and language technologies available to billions of users around the world. Part of this commitment involves developing high-quality speech synthesis technologies, which build upon projects such as VDTTS and AudioLM, for users that speak many different languages.
After developing a new model, one must evaluate whether the speech it generates is accurate and natural: the content must be relevant to the task, the pronunciation correct, the tone appropriate, and there should be no acoustic artifacts such as cracks or signal-correlated noise. Such evaluation is a major bottleneck in the development of multilingual speech systems.
The most popular method to evaluate the quality of speech synthesis models is human evaluation: a text-to-speech (TTS) engineer produces a few thousand utterances from the latest model, sends them for human evaluation, and receives results a few days later. This evaluation phase typically involves listening tests, during which dozens of annotators listen to the utterances one after the other to determine how natural they sound. While humans are still unbeaten at detecting whether a piece of speech sounds natural, this process can be impractical — especially in the early stages of research projects, when engineers need rapid feedback to test and restrategize their approach. Human evaluation is expensive, time-consuming, and may be limited by the availability of raters for the languages of interest.
Another barrier to progress is that different projects and institutions typically use various ratings, platforms and protocols, which makes apples-to-apples comparisons impossible. In this regard, speech synthesis technologies lag behind text generation, where researchers have long complemented human evaluation with automatic metrics such as BLEU or, more recently, BLEURT.
In “SQuId: Measuring Speech Naturalness in Many Languages“, to be presented at ICASSP 2023, we introduce SQuId (Speech Quality Identification), a 600M parameter regression model that describes to what extent a piece of speech sounds natural. SQuId is based on mSLAM (a pre-trained speech-text model developed by Google), fine-tuned on over a million quality ratings across 42 languages and tested in 65. We demonstrate how SQuId can be used to complement human ratings for evaluation of many languages. This is the largest published effort of this type to date.
Evaluating TTS with SQuId
The main hypothesis behind SQuId is that training a regression model on previously collected ratings can provide us with a low-cost method for assessing the quality of a TTS model. The model can therefore be a valuable addition to a TTS researcher’s evaluation toolbox, providing a near-instant, albeit less accurate alternative to human evaluation.
SQuId takes an utterance as input and an optional locale tag (i.e., a localized variant of a language, such as “Brazilian Portuguese” or “British English”). It returns a score between 1 and 5 that indicates how natural the waveform sounds, with a higher value indicating a more natural waveform.
Internally, the model includes three components: (1) an encoder, (2) a pooling / regression layer, and (3) a fully connected layer. First, the encoder takes a spectrogram as input and embeds it into a smaller 2D matrix that contains 3,200 vectors of size 1,024, where each vector encodes a time step. The pooling / regression layer aggregates the vectors, appends the locale tag, and feeds the result into a fully connected layer that returns a score. Finally, we apply application-specific post-processing that rescales or normalizes the score so it is within the [1, 5] range, which is common for naturalness human ratings. We train the whole model end-to-end with a regression loss.
The encoder is by far the largest and most important piece of the model. We used mSLAM, a pre-existing 600M-parameter Conformer pre-trained on both speech (51 languages) and text (101 languages).
*The SQuId model.*
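The three-component structure can be sketched with plain NumPy. This is a simplified illustration with random weights, not the actual mSLAM encoder or trained regression head; only the tensor shapes follow the description above, and the locale-embedding size is an invented placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Stand-in for the encoder output: 3,200 time-step vectors of size 1,024.
encoder_out = rng.standard_normal((3200, 1024))

# (2) Pooling / regression layer: mean-pool over time, append a locale embedding.
pooled = encoder_out.mean(axis=0)                # shape (1024,)
locale_emb = rng.standard_normal(16)             # toy locale-tag embedding (size is arbitrary)
features = np.concatenate([pooled, locale_emb])  # shape (1040,)

# (3) Fully connected layer producing a raw score.
w = rng.standard_normal(features.shape[0]) / np.sqrt(features.shape[0])
raw_score = float(features @ w)

# Post-processing: squash and rescale the score into the [1, 5] naturalness range.
score = 1.0 + 4.0 / (1.0 + np.exp(-raw_score))
print(round(score, 3))
```

The sigmoid rescaling guarantees the output stays in the [1, 5] range regardless of the raw regression value, mirroring the post-processing step described above.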
To train and evaluate the model, we created the SQuId corpus: a collection of 1.9 million rated utterances across 66 languages, collected for over 2,000 research and product TTS projects. The SQuId corpus covers a diverse array of systems, including concatenative and neural models, for a broad range of use cases, such as driving directions and virtual assistants. Manual inspection reveals that SQuId is exposed to a vast range of TTS errors, such as acoustic artifacts (e.g., cracks and pops), incorrect prosody (e.g., questions without rising intonations in English), text normalization errors (e.g., verbalizing “7/7” as “seven divided by seven” rather than “July seventh”), or pronunciation mistakes (e.g., verbalizing “tough” as “toe”).
A common issue that arises when training multilingual systems is that the training data may not be uniformly available for all the languages of interest. SQuId was no exception. The following figure illustrates the size of the corpus for each locale. We see that the distribution is largely dominated by US English.
*Locale distribution in the SQuId dataset.*
How can we provide good performance for all languages when there are such variations? Inspired by previous work on machine translation, as well as past work from the speech literature, we decided to train one model for all languages, rather than using separate models for each language. The hypothesis is that if the model is large enough, then cross-locale transfer can occur: the model’s accuracy on each locale improves as a result of jointly training on the others. As our experiments show, cross-locale transfer proves to be a powerful driver of performance.
Experimental results
To understand SQuId’s overall performance, we compare it to a custom Big-SSL-MOS model (described in the paper), a competitive baseline inspired by MOS-SSL, a state-of-the-art TTS evaluation system. Big-SSL-MOS is based on w2v-BERT and was trained on the VoiceMOS’22 Challenge dataset, the most popular dataset at the time of evaluation. We experimented with several variants of the model, and found that SQuId is up to 50.0% more accurate.
*SQuId versus state-of-the-art baselines. We measure agreement with human ratings using the Kendall Tau, where a higher value represents better accuracy.*
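Kendall Tau rewards the metric for ranking pairs of utterances in the same order as human raters do. A minimal pure-Python version (the tau-a variant, assuming no tied scores; the rating lists below are invented for illustration):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if (xi - xj) * (yi - yj) > 0:
            concordant += 1   # pair ranked in the same order by both
        else:
            discordant += 1   # pair ranked in opposite orders
    return (concordant - discordant) / (concordant + discordant)

human  = [4.5, 3.0, 2.0, 4.0, 1.5]  # toy human naturalness ratings
metric = [4.2, 3.1, 2.5, 3.0, 1.0]  # toy automatic scores
print(kendall_tau(human, metric))   # → 0.8
```

A value of 1.0 means the metric orders every pair exactly as the human raters did; 0.0 means no agreement beyond chance.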
To understand the impact of cross-locale transfer, we run a series of ablation studies. We vary the amount of locales introduced in the training set and measure the effect on SQuId’s accuracy. In English, which is already over-represented in the dataset, the effect of adding locales is negligible.
*SQuId’s performance on US English, using 1, 8, and 42 locales during fine-tuning.*
However, cross-locale transfer is much more effective for most other locales:
*SQuId’s performance on four selected locales (Korean, French, Thai, and Tamil), using 1, 8, and 42 locales during fine-tuning. For each locale, we also provide the training set size.*
To push transfer to its limit, we held 24 locales out during training and used them for testing exclusively. Thus, we measure to what extent SQuId can deal with languages that it has never seen before. The plot below shows that although the effect is not uniform, cross-locale transfer works.
*SQuId’s performance on four “zero-shot” locales, using 1, 8, and 42 locales during fine-tuning.*
When does cross-locale operate, and how? We present many more ablations in the paper, and show that while language similarity plays a role (e.g., training on Brazilian Portuguese helps European Portuguese) it is surprisingly far from being the only factor that matters.
Conclusion and future work
We introduce SQuId, a 600M-parameter regression model that leverages the SQuId dataset and cross-locale learning to estimate how natural a piece of speech sounds. We demonstrate that SQuId can complement human raters in the evaluation of many languages. Future work includes accuracy improvements, expanding the range of languages covered, and tackling new error types.
Acknowledgements
The author of this post is now part of Google DeepMind. Many thanks to all authors of the paper: Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, and Jason Riesa.