Today Google released the Secure AI Framework to help collaboratively secure AI technology.
Link-credible: Get in the Game Faster With Steam, Epic Games Store and Ubisoft Account Linking on GeForce NOW
Get into your favorite games faster by linking GeForce NOW to Steam, Epic Games Store and Ubisoft accounts.
And get a peek at more games coming to GeForce NOW later this year by tuning in to Ubisoft Forward on Monday, June 12, when the game publisher will reveal its latest news and announcements.
Plus, two new games are available to stream from the cloud this week, as well as the newest season for Tom Clancy’s The Division 2 from Ubisoft.
Linked In
GeForce NOW makes gaming convenient and easy for members by enabling them to link their accounts from Steam, Epic and, most recently, Ubisoft, directly to the service. Instead of signing into their accounts for each play session, members can be automatically signed in across their devices after linking them up just once.
Starting today, launching Ubisoft Connect games requires members to link their Ubisoft accounts in the app. Once that’s completed, members can effortlessly play hit Ubisoft games, including Rainbow Six Siege, Far Cry 6 and The Division 2.
Members also have the benefit of library account syncing, which automatically syncs supported GeForce NOW games from Ubisoft Connect and Steam libraries — helping members find their Ubisoft games instantly.
For an even more streamlined experience, upgrade to an Ultimate or Priority membership to skip the waiting lines that free members face and get into gaming faster.
The Mission: More Games
“Season 1: Broken Wings” is the newest season for Tom Clancy’s The Division 2, kicking off Year Five for the hit game from Ubisoft. It introduces a new game mode — Descent — a rogue-lite for teams of up to four players. Begin each match without any gear, perks or specializations and unlock them through game progression to work up through the ranks. The rest of the year will bring more seasons, each with their own manhunts, events and leagues. Stream “Broken Wings” on GeForce NOW today.
And take a look at the two new games available to stream this week:
Before the weekend arrives, let’s take things back with our question of the week. Let us know your answer on Twitter or in the comments below.
What’s the first @ubisoft game you ever played?
— NVIDIA GeForce NOW (@NVIDIAGFN) June 7, 2023
AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma
Episode 140 | June 8, 2023
Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.
In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.
This episode features Senior Principal Researcher Emre Kiciman and Principal Researcher Amit Sharma, whose paper “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality” examines the causal capabilities of large language models (LLMs) and their implications. Kiciman and Sharma break down the study of cause and effect; recount their respective ongoing journeys with GPT-3.5 and GPT-4—from their preconceptions to where they are now—and share their views of a future in which LLMs help bring together different modes of reasoning in the practice of causal inference and make causal methods easier to adopt.
Learn more:
- Causal Reasoning and Large Language Models: Opening a New Frontier for Causality (Publication, April 2023)
- The AI Revolution in Medicine: GPT-4 and Beyond by Peter Lee (Book, April 2023)
- AI and Microsoft Research: Learn more about the breadth of AI research at Microsoft
Transcript
[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more fortunate to work in the field than at this moment. The development of increasingly powerful large-scale models like GPT-4 is accelerating the advancement of AI. These models are exhibiting surprising new abilities like reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’ll share conversations with fellow researchers about our impressions of GPT-4, the work we’re doing to understand its capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.
Today we’re talking with Emre Kiciman and Amit Sharma, two Microsoft researchers who have been studying causal reasoning with AI for many years. Determining cause and effect relationships is critically important across many domains such as law, medicine, and the advancement of science itself. Emre and Amit recently published a paper that explores how large language models can advance the research and application of causal reasoning with AI. Emre joins us from our lab in Redmond, Washington, and Amit is on the line from Microsoft Research India, in Bangalore.
[MUSIC FADES]
Emre, Amit, let’s jump right in. I’m so excited to speak with you both about causal reasoning. And this is such a timely conversation because we’re living through the rise of generative pretrained models, specifically large language models. And when I’ve engaged with GPT-4 in dialogue, depending on what I ask, it can appear to be doing something resembling causal reasoning. And as a machine learning person myself, I have to say this is not something that I’d expected to see from a neural network that works based on analyzing and generating statistical patterns. Um, you know, this is something that before this time last year, I thought of as a uniquely human skill as I think maybe many others have, as well. Now, both of you do this for a living. You study causal reasoning for a living. Um, and so where I’d like to start is with your first reactions to GPT-4, your first contact. What did you find surprising, and how did you feel, uh, as a researcher in this area? I want to go to Emre first on this.
EMRE KICIMAN: Sure. Well, um, yeah, I think I went through a process. Um, right now, I am surprised how much I’m depending on functionality from GPT-4 and how much I expect it to work. And yet, I also don’t quite believe that it can do the things that it’s doing. It’s really, um, a weird mind space to be in. I think the, the moment when I was a bit astounded by, like, what might be possible was actually before I got my hands on GPT-4 directly. You know, I’ve been hearing that people were very impressed with what it was doing. But the thing that made me reconsider my preconceptions was actually some of the academic research looking at, um, how transformer models and architectures could actually represent Turing machines, Turing-complete computational machines. And once I saw that the transformer architecture could represent that type of program, that type of thing, then I figured, well, all bets are off. We don’t know whether it’s learning this or not, but if it can represent it, now there really is a chance that it could, that it might be learning that. And so we have to really keep an open mind.
The second moment when I changed my mind again about what GPT-4 might be doing … so I’ll give a little background. So once I saw some of the work that we’ll talk about here, uh, coming into play, where we’re seeing GPT do some sorts of, you know, very interesting causal-related tasks, um, I was like, OK, this is great. We have our causal processes; we’re just going to run through them and this fits in. Someone will come with their causal question; we’ll run through and run our, our causal analysis. And I thought that, you know, this all makes sense. We can do things that we want, what we’ve wanted to do for so, for so long. And it was actually reading, uh, some of the vignettes in Peter Lee’s book where he was quizzing, uh, GPT-4 to diagnose a patient based on their electronic health records, explain counterfactual scenarios, um, think through why someone might have made a misdiagnosis. And, and here, all of a sudden, I realized our conceptualizations of causal tasks that we’ve worked on in the academic fields are kind of boxes where we say we’re doing effect inference or we’re doing attribution or we’re doing discovery. These like very well-circumscribed tasks are, are not enough; they’re not flexible enough. Once you have this natural language interface, you can ask so many more things, so many more interesting questions. And we need to make sure that we can formally answer those … correctly answer those questions. And, and this GPT-4 is basically a bridge to expressing and, you know, meeting people where they want to be. That really opened my eyes the second time.
LLORENS: Thanks, Emre. Amit, first impressions.
AMIT SHARMA: Yeah, my experience was back in December—I think it was when a lot of people were talking about ChatGPT—and me, thinking that I worked in causality, uh, I was quite smug, right. I knew that causality requires you to have interventional data. Language models are only built on some observations. So I was quite happy to think that I would beat this topic, right. But it was just that every day, I would see, perhaps on Twitter, people expressing new things that ChatGPT can do that one day, I thought, OK, let me just try it, right. So the first query I thought was an easy query for, uh, GPT models. I just asked it, does smoking cause lung cancer, right? And I was surprised when it gave the right answer. But then I thought maybe, oh, this is just too common. Let me ask the opposite. Does lung cancer cause smoking? Uh, it gave the right answer. No. Uh, and then I was literally struck, and I, and I thought, what else can I test, right? And then I thought of the all the causal relationships that we typically talk about in our field, and I started doing them one by one. And what I found was that the accuracy was just astounding. And it was not just the accuracy, but also the explanation that it gives would sort of almost make you believe that as if it is a causal agent, as if it is doing, uh, something causal. So, so to me, I think those few days in December with slightly sleepless nights on what exactly is going on with these models and what I might add … what am I going to do as a researcher now? [LAUGHS] I think that was, sort of, my initial foray into this. And, and I think the logical next step was then to study it more deeply.
LLORENS: And stemming from both of your reactions, you began collaborating on a paper, which you’ve recently released, called “Causal Reasoning [and] Large Language Models,” um, and I’ve had the, you know, the pleasure of spending some time with that over these last few days and, and a week here. And one of the things you do in the paper is you provide what I think of as a helpful overview of the different kinds of causality. And so, Emre, I want to go back to you. What is causality, and how can we think about the space of different, you know, kinds of causal reasoning?
KICIMAN: Causality … it’s the study of cause-and-effect relationships, of the mechanisms that, that drive, you know, what we see happening in the world around us. You know, why do things happen? What made something happen? And this is a study that spread out across so many disciplines—computer science, economics, health, statistics. Like, everyone cares about, about causality, to some degree. And so this means that there’s many different kinds of, you know, tools and languages to talk about causality, um, that are appropriate for different kinds of tasks. So that’s one of the first things that we thought we had to lay out in the paper, was kind of a very broad landscape about what causality is. And so we talk about a couple of different axes. One is data-driven causal analysis, and the other is logic-based causal reasoning. These are two very different ways of, of, of thinking about causality. And then the second major axis is whether we’re talking about causal relationships in general, in the abstract, like, uh, does smoking normally cause … or often cause cancer? Versus causality in a very specific context— that’s called actual causality. And this is something like Bob smoked; Bob got lung cancer. Was Bob’s lung cancer caused by Bob’s smoking? It’s a very specific question in this very, you know, in, in a specific instance. And so those are the two axes: data-driven versus logic and then general causality versus actual causality.
LLORENS: Amit, I want to go to you now, and I want to dwell on this topic of actual causality. And I actually learned this phrase from your paper. But I think this is a kind of causal reasoning that people do quite often, maybe even it’s the thing they think about when they think about causal reasoning. So, Amit, you know, let’s go deeper into what actual causality is. Maybe you can illustrate with some examples. And then I want to get into experiments you’ve conducted in this area with GPT-4.
SHARMA: Sure. So interestingly, actual causality in research is sort of the less talked about. As Emre was saying, I think most researchers in health sciences, economics often talk about general phenomena. But actual causality talks about events and what might have caused them, right. So think about something happens in the real world. So let’s say … I’ll take an example of, let’s say, you catch a ball and you prevent it from falling down, right. And I think people would reasonably argue that your catching the ball was the cause of preventing it from falling onto the ground. But very quickly, these kinds of determinations become complex because what could have been happening is that there could be multiple other factors at play, uh, and there could also be questions about how exactly you’re even thinking about what is a cause. Should, should you be thinking about necessary causes, or should you be thinking about sufficient causes, and so on. So, so I think actual causality before sort of these language models was kind of a paradox in the sense that the applications were kind of everywhere, going from everyday life to even thinking about computer systems. So if your computer system fails, you want to understand why this failure occurred, right. You’re not really interested in why computer systems fail in general; you’re just interested in answering the specific failure’s causes. And the paradox is that even though these sort of questions were so common, I think what research had to offer, uh, was not immediately systemizable or deployable, uh, because you would often sort of tie yourself in knots in defining exactly what you mean by the cause and also sort of how do you even get that framing without sort of just having a formal representation, right. Most of these tasks were in English, right, or in the case of computer systems, you would just get a debug log. So I think one of the hardest problems was how do you take something in vague language, human language, and convert it into sort of logical framing or logical systems?
LLORENS: In the paper, you explore briefly, you know, kind of actual causality that deals with responsibility or faults. And, you know, this connects with things like, you know, reasoning in the, in the legal domain. And so I just want to, I want to explore that with you. And I know I’ve jumped to the back of the paper. I just find these particular set … this particular set of topics pretty fascinating. And so tell me about the experiments that you’ve conducted where you ask, you know, the, the algorithm … the model to do this kind of actual causal reasoning around assigning blame or responsibility for something?
SHARMA: So one of the important challenges in actual causality is determining what’s a necessary cause and what’s a sufficient cause for an event, right. Now if you’re familiar with logic, you can break this down into sort of simple predicates. What we are asking is if an event happened, was some action necessary? It means that if that action did not happen, then that event would not happen, right. So we have a nice ”but for” relationship. Sufficiency, on the other hand, is kind of the complement. So there you’re saying if this action happens, the event will always happen, irrespective of whatever else happens in the world, right. And so, so far, in actual causality, people would use logic-based methods to think about what’s the right answer for any kind of event. So what we did was we looked at all the sort of vignettes or these examples that causality researchers had collected over the past decade. All of these are very challenging examples of situations in English language. And I think their purpose was to kind of elucidate the different kinds of sort of gotchas you get when you try to sort of just use the simple concept for real-world applications. So let me take you through one example in our dataset that we studied and how we’re finding that LLMs are somehow able to take this very vague, ambiguous information in an English-language vignette and directly go from sort of that language to an answer in English, right. So in a sense, they’re kind of sidestepping the logical reasoning, but maybe in the future we can also combine logical reasoning and LLMs.
So let’s take an example. Uh, it’s like Alice catches a ball. The next part on … the next destination on the ball’s trajectory was a brick wall, which would have stopped it, and beyond that there was a window. So as humans, we would immediately think that Alice was not a cause, right, because even if she had not stopped the ball, it would have hit the brick, and so if you’re asking if Alice was the cause of the window being safe, an intuitive answer might be no. But when you analyze it through the necessary and sufficient lens, you would find that Alice was obviously not a necessary cause because the brick wall would have stopped it, but Alice was a sufficient cause, meaning that if Alice had stopped the ball, even if the brick wall collapsed, even if other things happened in the world, the window would still be safe right. So these are the kind of sort of interesting examples that we tried out. And what we found was GPT-3.5, which is ChatGPT, does not do so well. I think it actually fails to identify correctly these causes, but GPT-4 somehow is able to do that. So it gets about 86 percent accuracy on, on this task. And one of the interesting things we were worried about was maybe it’s just memorizing. Again, these are very popular examples in textbooks, right? So we did this fun thing. We just created our own dataset. So, so now instead of Alice catching a ball, Alice could be, I don’t know, dropping a test tube in a lab, right? So we created this sort of a lab setup—a completely new dataset—and we again found the same results that GPT-4 is able to infer these causes.
LLORENS: Now you’re, you’re getting into experimental results, and that’s great because one of the things that I think required some creativity here was how you actually even structure, you know, a rigorous set of experiments. And so, Emre, can you take … take us through the experiment setup and how you had to approach that with this, you know, kind of unique, unique way of assessing causal reasoning?
KICIMAN: Well, one of the things that we wanted to make sure we had when we were running these experiments is, uh, construct validity to really make sure that the experiments that we were running were testing what we thought they were testing, or at least that we understood what they actually were testing. Um, and so most of these types of, uh, tests over large language models work with benchmark questions, and the biggest issue with the, with many of these benchmark questions is that often the large language models have seen them before. And there’s a concern that rather than thinking through to get the right answer, they’ve really only memorized the specific answers to these, to these specific questions.
And so what we did was, uh, we actually ran a memorization test to see whether the underlying dataset had been memorized by the large language model before. We developed … some of our benchmark datasets we developed, uh, as novel datasets that, you know, had never been written before so clearly had not been seen or memorized. And then we ran additional tests to help us understand what was triggering the specific answers. Like we would redact words from our question, uh, to see what would lead the LLM to make a mistake. So, for example, if we remove the key word from the question, we would expect the LLM to be confused, right. That’s, that’s fine. If we removed an unimportant word, maybe, you know, a participle or something, then we would expect that, that, that, that should be something that the LLM should recover from. And so this was able to give us a better understanding of what the LLM was, was paying attention to. This led us, for example, to be very clear in our paper that in, for example, our causal discovery experiments—where we are specifically asking the LLM to go back to its learned knowledge and tell us whether it knows something from common sense or domain knowledge, whether it’s memorized that, you know, some, uh, some cause, uh, has a particular effect—we are very clear in our experiments that we are not able to tell you what the odds are that the LLM has memorized any particular fact. But what we can say is, given that it’s seen that fact, is it able to transform it, you know, and combine it somehow into the correct answer in a particular context. And so it’s just, it’s really important to, to know what, uh, what these experiments really are testing. So I, I really appreciated the opportunity to go a little bit deeper into these studies.
LLORENS: I find this concept of construct validity pretty fascinating here, and it’s, you know, you, you stressed the importance of it for doing this kind of black-box testing, where you don’t actually have an explicit model for how the, well, the model is doing what it’s doing. And, you know, you talked about memorization as one important test where you’re, you know, you want to, you want to have a valid construct. But I think even deeper than that, there’s, there’s an aspect of your mental model, your beliefs about, you know, what the algorithm is doing and how relevant the testing you’re doing would be to future performance or performance on future tasks. And so I wonder if we can dwell on this notion of construct validity a little bit, maybe even one level deeper than the memorization, you know, you and your mental model of what’s happening there and why that’s important.
KICIMAN: My mental model of what the large language model is giving us is that it’s read so much of the text out on the internet that it’s captured the common sense and domain knowledge that we would normally expect only a human to do. And through some process—maybe it’s, maybe it’s probabilistic; maybe it’s some more sophisticated reasoning—it’s able to identify, like Amit said, the most important or relevant relationships for a particular scenario. So it knows that, you know, when we’re talking about a doctor washing his or her hands with soap or not, that infection, uh, in a patient is the next … is something that’s really critical. And maybe if we weren’t talking about a doctor, this would not be, you know, the most important consideration. So it is starting from capturing this knowledge, remembering it somehow in its model, and then recognizing the right moment to recall that fact and put it back out there as part of its answer. Um, that’s, that’s my mental model of what I think it’s doing, and we are able to demonstrate with our, you know, experiments that it is transforming from many different input data formats into, you know, answers to our natural language questions. So we, we have data we think it’s seen that’s in tabular format or in graphical formats. Um, and, you know, it’s, it’s impressive to see that it’s able to generate answers to our questions in various natural language forms.
LLORENS: I want to go now to a different kind of causality, causal discovery, which you describe in your paper as dealing with variables and their effect on each other. Emre, we’ll stick with you. And I also think that this is a, a kind of causal reasoning that maybe is closer to your day job and closer to the kinds of models maybe that you construct in the problems that you deal with. And so tell me about causal discovery and, you know, what you’re seeing in terms of the capabilities of GPT-4 and your, your experimentation.
KICIMAN: Yeah. So causal discovery is about looking at data, observational data, where you’re not necessarily intervening on the system—you’re just watching—and then from that, trying to figure out what relationships … uh, what the causal relationships are among the factors that you’re observing. And this is something that usually is done in the context of general causality, so trying to learn general relationships, uh, between factors, and it’s usually done in a, in a databased way—looking at the covariances, statistical covariances, between your observations. And, uh, there’s causal discovery algorithms out there. Uh, there are … this is something that’s been studied for decades. And there’s essentially, uh, testing statistical independence relationships that, you know, if something isn’t causing something else, then if you hold everything constant, there should be statistical independence between those two factors or different kinds of statistical independence relationships depending on what type of causal structures you see in, uh, among the relationships. And what these algorithms are able to do, the classical algorithms, is they can get you down to, um, a set of, a set of plausible relationships, but there’s always some point at which they can’t solve … uh, they can’t distinguish things based on data alone. They can, you know … there’s going to be a couple of relationships in your dataset where they might not know whether A is causing B or B is causing A, vice versa. And this is where a human comes in with their domain knowledge and has to make a declaration of what they think the right answer is based on their understanding of system mechanics. So there’s always this reliance on a human coming in with domain knowledge. And what, what we’re, uh, seeing now, I think, with LLMs is for the first time, we have some sort of programmatic access to this common sense and domain knowledge, just like in the actual causality setting. We have it provided to us again, uh, in the causal discovery setting. And we can push on this further. We don’t have … we can, if we want, run our data analysis first, then look at the LLM to, um, to disambiguate the last couple of things that we couldn’t get out of data. But we can also start from scratch and just ask, uh, the LLM to orient all of these causal edges and identify the right mechanisms from the beginning, just solely based on common sense and domain knowledge.
And so that’s what we did in our experiments here. We went through, uh, lists of edges and then larger graph structures to see how much we could re-create from, uh, just the common sense or domain knowledge that’s captured inside the LLM. And it did, it did quite well, beating the state of the art of the data-oriented approaches. Now, to be clear, it’s not doing the same task. If you have some data about a phenomenon that’s never been studied before, it’s not well understood, it’s never been named, the large language model is not going to be able to tell you—I don’t think it’s going to be able to tell you—what that causal relationship is. But for the many things that we do already know, it, it beats, you know, looking at the data. It’s, it’s quite impressive that way. So we think this is super exciting because it really removes this burden that we’ve really put on to the human analyst before, and now, now we can run these analyses, these … this whole data-driven process can be, uh, uh, built off of common sense it’s already captured without having to ask a user, a human, to type it all up correctly.
LLORENS: Amit, one of the things I found fascinating about the set of experiments that you, that you ran here was the prompt engineering and just the effect on the experimental results of different ways of prompting the model. Take us through that experience and, and please do get specific on the particular prompts that you used and their effects on the outcome.
SHARMA: Sure, yeah, this was an iterative exercise for us, as well. So as I was mentioning [to] you, when I started in December, um, the prompt I used was pretty simple: does changing A cause a change in B, right? So if you’re thinking of, let’s say, the relationship between altitude and temperature, it would just translate to a single sentence: does changing the altitude change the temperature? As we sort of moved into working for our paper and as we saw many different prompt strategies from other works, we started experimenting, right, and one of the most surprising things—actually shocking for us—was that if you just add … in these GPT-3.5 and 4 class of models, there’s a system prompt which sort of you can give some meta instructions to, to the model, and we just added a single line saying that “you are an expert in causal reasoning.” And it was quite shocking that just that thing gave us a 5-percentage point boost in the accuracy on the datasets that we were testing. So there’s something there about sort of prompting or kind of conditioning the model to be generating text more attuned with causality, which we found as interesting. It also sort of suggests that maybe the language model is not the model here; maybe it’s the prompt plus a language model, uh, meaning that GPT-4 with a great prompt could give you great answers, but sort of there’s a question of robustness of the prompt, as well. And I think finally, the prompt that we went for was an iteration on this, where instead of asking two questions—because for each pair we can ask, does A cause B or does B cause A—we thought of just making it one prompt and asking it, here are two variables, let’s say, altitude and temperature. Which direction is more likely? And so we just gave it two options or three options in the case of no direction exists. And there were two benefits to this. So, one, I think somehow this was, uh, increasing the accuracy even more, perhaps because choosing between options becomes easier now; you can compare which one is more likely. But also we could ask the LLM now to explain its reasoning. So we would ask it literally, explain it step by step going from the chain of thought reasoning. And its answers would be very instructive. So for example, some of the domains we tested, uh, we don’t know anything about it, right. So there was one neuropathic pain dataset, which has nodes called radiculopathy, DLS , lumbago. We have no idea, right. But just looking at the responses from the LLM, you can both sort of get a peek into what it’s doing at some high level maybe, but also understand the concepts and think for yourself whether those sorts of things, the reasoning, is making sense or not, right. And of course, we are not experts, so we may be fooled. We might think this is doing something. But imagine a doctor using it or imagine some expert using it. I think they can both get some auxiliary insight but also these explanations help them debug it. So if the explanation seems to be off or it doesn’t make sense, uh, that’s also a nice way of sort of knowing when to trust the model or not.
KICIMAN: One of the things that we noticed with these prompts is that, you know, there’s more to do in this space, too. Like the kinds of mistakes that it’s making right now are things that we think might be resolved at least, you know, in some part with additional prompting or thinking strategies. For example, one of the mistakes was, um, about … when we asked about the relationship between ozone levels and radiation levels, and it answered wrong. It didn’t answer what, what was expected in the benchmark. But it turns out it’s because there’s ambiguity in the question. The relationship between ozone and radiation, uh, is one direction if you’re talking about ozone at ground level in a city, and it’s the other direction if you’re talking about ozone in the stratosphere. And so you can ask it, is there any ambiguity here? Is there any additional information you would need that would change the direction of the causal mechanism that you’re, you know, suggesting? And it’ll tell you; it’ll say, if we’re talking about in the stratosphere, it’s this; if it’s on the ground, it’s this. And so there’s really … I think we’re going to see some really fun strategies for improving the performance further by digging into these types of interrogations.
LLORENS: You know, the model is a kind of generalist in a way that most people are not or—I’m just going to go for it—in a way that no person is. You know, with all this knowledge of law and culture and economics and so many other … code, you know, so many other things, and I could imagine showing up and, yeah, a little bit of a primer on, a briefing on, well, here’s why you’re here and what you’re doing … I mean, that’s helpful for a person. And I imagine … and as we see, it’s helpful for these generalist, you know, general-purpose reasoners. And of course, mechanistically, what we’re doing is through the context, we’re inducing a different probability distribution over the tokens. And so I guess that’s … no, that’s what’s happening here. This is the primer that it gets before it steps into the room and, and does the Q&A or gives the talk, you know, as, as, as we do. But I want to get into a little bit now about where you see this going from here—for the field and for you as a researcher in the field. Let’s, let’s stick with you, Emre. Where do we go from here? What are some of the exciting frontiers?
KICIMAN: What I’m most excited about is this opportunity I think that’s opening up right now to fluidly, flexibly go back and forth between these different modes of causality. Going from logic-based reasoning to data-based reasoning and going beyond the kind of set tasks that we have well-defined for, for us in our field right now. So there’s a fun story that I heard when I was visiting a university a couple of months ago. We were talking about actual causality and connections to, to database causality, and this person brought up this scenario where they were an expert witness in a case where a hedge fund was suing a newspaper. The newspaper had run an exposé of some kind on the hedge fund, scared off all of their investors, and the hedge fund went belly-up. And the hedge fund was blaming the newspaper and wanted, you know, compensation for this, right. But at the same time, this was in the middle of a financial crisis. And so there’s this question of wouldn’t the hedge fund have failed anyway? A lot of other hedge funds did. Plus there’s the question of, you know, how much of an effect do newspaper stories like this usually have? Could it possibly have killed the hedge fund? And then there’s all the, you know, questions of normality and, you know, morality and stuff of maybe this is what the newspaper is supposed to be doing anyway. It’s not their fault, um, what the consequences were. So now you can imagine asking this question, starting off in this logical, you know, framing of the problem; then when you get down to this sub-element of what happened to all the other hedge funds—what would have happened to this hedge fund if, um, if the newspaper hadn’t written a story?—we can go look at the data of what happened to all the other hedge funds, and we can run the data analysis, and we can come back. We can go back and forth so much. I think that kind of flexibility is something I’m really going to be excited to see us, you know, able to automate in some fashion.
LLORENS: Amit, what do you think? Where do we go from here?
SHARMA: Yeah, I think I’m also excited about the practical aspects of how this might transform the causal practice. So, for example, what Emre and I have worked a lot on, this problem of estimating the causal effect, and one of the challenges that has been in the field for a long time is that we have great methods for estimating the causal effect once we have the graph established, but getting that graph often is a really challenging process, and you need to get domain expertise, human involvement, and often that means that a lot of the causal analysis does not get done just because the upfront cost of building a graph is just too much or it’s too complex. And the flipside is also that it’s also hard to verify. So suppose you assume a graph and then you do your analysis; you get some effect like this policy is better, let’s say. It’s very hard to evaluate how good your graph was and how maybe there are some checks you can do, robustness checks, to, to validate that, right.
And so what I feel the opportunity here is that the LLMs are really being complementary to what we are already good at in causal inference, right? So we’re only good at, given a graph, getting you an estimate using statistics. What the LLMs can come in and do is help domain experts build the graph much, much faster. So now instead of sort of thinking about, “Oh, what is my system? What do I need to do?” Maybe there’s a documentation of your system somewhere that you just feed into an LLM, and it provides you a candidate graph to start with. And at the same time, on the backend, once you have estimated something, a hard challenge that researchers like us face is what might be good robustness checks, right. So often these are … one example is a negative control, where you try to think of what is something that would definitely not cause the outcome. I know it from my domain knowledge. Let me run my analysis through assuming if that was the action variable, and then my analysis should always give an answer of zero. But again, like sort of figuring out what such variables are is more of an art than science. And I think in the preliminary experiments that we are doing, the LLMs could also help you there; you could again sort of give your graph and your data … and your sort of data description, and the LLMs can suggest to you, “Hey, these might be the variables that you can use for your robustness check.” So I’m most excited about this possibility of sort of more and more adoption of causal methods because now the LLMs can substitute or at least help people to stand up these analyses much faster.
LLORENS: Thank you both for this fascinating discussion. Understanding cause-and-effect relationships is such a fundamental part of how we apply human intelligence across so many different domains. I’m really looking forward to tracking your research, and the possibilities for more powerful causal reasoning with AI.
The post AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma appeared first on Microsoft Research.
Less Is More: A Unified Architecture for Device-Directed Speech Detection with Multiple Invocation Types
Suppressing unintended invocations of the device caused by speech that sounds like the wake word, or by accidental button presses, is critical for a good user experience, and is referred to as False-Trigger-Mitigation (FTM). When there are multiple invocation options, the traditional approach to FTM is to use invocation-specific models, or a single model for all invocations. Both approaches are sub-optimal: the memory cost of the former grows linearly with the number of invocation options, which is prohibitive for on-device deployment, and it does not take advantage of shared training data; …

Apple Machine Learning Research
Accelerate PyTorch with DeepSpeed to train large language models with Intel Habana Gaudi-based DL1 EC2 instances
Training large language models (LLMs) with billions of parameters can be challenging. In addition to designing the model architecture, researchers need to set up state-of-the-art distributed training techniques like mixed precision support, gradient accumulation, and checkpointing. With large models, the training setup is even more challenging because the available memory in a single accelerator device bounds the size of models trained using only data parallelism, and using model parallel training requires an additional level of modification to the training code. Libraries such as DeepSpeed (an open-source deep learning optimization library for PyTorch) address some of these challenges and can help accelerate model development and training.
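As a rough illustration of how DeepSpeed slots into a PyTorch training script, here is a minimal, hedged sketch; MyBertModel, dataloader, and the config path are placeholders rather than names from the reference repository used in this post.

```python
# Minimal sketch of a DeepSpeed-managed training step (placeholder names, not the
# actual reference-repository code).
import deepspeed

model = MyBertModel()  # placeholder for the BERT implementation used in this post

# deepspeed.initialize wraps the model and optimizer according to the config file.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # DeepSpeed configuration (placeholder path)
)

for batch in dataloader:  # placeholder dataloader yielding tokenized batches
    loss = model_engine(**batch)     # assumes the model returns the training loss
    model_engine.backward(loss)      # gradient accumulation is handled by the engine
    model_engine.step()              # optimizer step and learning-rate schedule
```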
In this post, we set up training on the Intel Habana Gaudi-based Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances and quantify the benefits of using a scaling framework such as DeepSpeed. We present scaling results for an encoder-type transformer model (BERT with 340 million to 1.5 billion parameters). For the 1.5-billion-parameter model, we achieved a scaling efficiency of 82.7% across 128 accelerators (16 dl1.24xlarge instances) using DeepSpeed ZeRO stage 1 optimizations. The optimizer states were partitioned by DeepSpeed to train large models using the data parallel paradigm. This approach has been extended to train a 5-billion-parameter model using data parallelism. We also used Gaudi’s native support of the BF16 data type for reduced memory size and increased training performance compared to using the FP32 data type. As a result, we achieved pre-training (phase 1) model convergence within 16 hours (our target was to train a large model within a day) for the BERT 1.5-billion-parameter model using the wikicorpus-en dataset.
Training setup
We provisioned a managed compute cluster comprised of 16 dl1.24xlarge instances using AWS Batch. We developed an AWS Batch workshop that illustrates the steps to set up the distributed training cluster with AWS Batch. Each dl1.24xlarge instance has eight Habana Gaudi accelerators, each with 32 GB of memory and a full mesh RoCE network between cards with a total bi-directional interconnect bandwidth of 700 Gbps each (see Amazon EC2 DL1 instances Deep Dive for more information). The dl1.24xlarge cluster also used four AWS Elastic Fabric Adapters (EFA), with a total of 400 Gbps interconnect between nodes.
The distributed training workshop illustrates the steps to set up the distributed training cluster. The workshop shows the distributed training setup using AWS Batch and in particular, the multi-node parallel jobs feature to launch large-scale containerized training jobs on fully managed clusters. More specifically, a fully managed AWS Batch compute environment is created with DL1 instances. The containers are pulled from Amazon Elastic Container Registry (Amazon ECR) and launched automatically into the instances in the cluster based on the multi-node parallel job definition. The workshop concludes by running a multi-node, multi-HPU data parallel training of a BERT (340 million to 1.5 billion parameters) model using PyTorch and DeepSpeed.
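To make the mechanics concrete, the following is a hedged boto3 sketch of registering and submitting a multi-node parallel job; the job definition name, queue, container image, and command are placeholders rather than values from the workshop.

```python
# Hedged sketch of an AWS Batch multi-node parallel (MNP) job for 16 DL1 instances.
# All names, the image URI, and the command are placeholders.
import boto3

batch = boto3.client("batch")

job_def = batch.register_job_definition(
    jobDefinitionName="bert-deepspeed-dl1",          # placeholder
    type="multinode",
    nodeProperties={
        "numNodes": 16,                               # 16 x dl1.24xlarge
        "mainNode": 0,
        "nodeRangeProperties": [{
            "targetNodes": "0:15",
            "container": {
                "image": "<account>.dkr.ecr.<region>.amazonaws.com/bert-habana:latest",  # placeholder ECR image
                "vcpus": 96,
                "memory": 700000,                     # MiB, below the 768 GiB on dl1.24xlarge
                "command": ["bash", "run_pretraining.sh"],  # placeholder entry point
            },
        }],
    },
)

job = batch.submit_job(
    jobName="bert-1-5b-phase1",                       # placeholder
    jobQueue="dl1-training-queue",                    # placeholder queue backed by the DL1 compute environment
    jobDefinition=job_def["jobDefinitionArn"],
)
print(job["jobId"])
```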
BERT 1.5B pre-training with DeepSpeed
Habana SynapseAI v1.5 and v1.6 support DeepSpeed ZeRO1 optimizations. The Habana fork of the DeepSpeed GitHub repository includes the modifications necessary to support the Gaudi accelerators. There is full support of distributed data parallel (multi-card, multi-instance), ZeRO1 optimizations, and BF16 data types.
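A hedged example of what such a DeepSpeed configuration might look like, expressed here as a Python dictionary; the batch-size values mirror the hyperparameters reported later in this post and may differ from the reference repository.

```python
# Hedged DeepSpeed configuration enabling ZeRO stage 1 and BF16 on Gaudi.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,   # micro-batch per accelerator (as used below)
    "gradient_accumulation_steps": 24,      # 16 x 24 = 384 effective batch per accelerator
    "zero_optimization": {"stage": 1},      # partition optimizer states across data-parallel ranks
    "bf16": {"enabled": True},              # Gaudi-native BF16 for lower memory use
}
```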
All these features are enabled in the BERT 1.5B model reference repository, which introduces a 48-layer, 1,600-hidden-dimension, 25-head bi-directional encoder model derived from a BERT implementation. The repository also contains the baseline BERT Large model implementation: a 24-layer, 1024-hidden, 16-head, 340-million-parameter neural network architecture. The pre-training modeling scripts are derived from the NVIDIA Deep Learning Examples repository to download the wikicorpus_en data, preprocess the raw data into tokens, and shard the data into smaller h5 datasets for distributed data parallel training. You can adopt this generic approach to train your custom PyTorch model architectures using your own datasets on DL1 instances.
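The sketch below illustrates, under assumptions, what sharding tokenized data into smaller HDF5 files can look like; the dataset key and shapes are illustrative, not the exact format produced by the preprocessing scripts.

```python
# Hedged illustration of sharding tokenized sequences into HDF5 files for
# distributed data parallel training (illustrative format only).
import numpy as np
import h5py

def write_shards(token_ids, shard_size=100_000, prefix="wikicorpus_en_shard"):
    """token_ids: 2D array of tokenized sequences (num_sequences x seq_len)."""
    for shard_idx, start in enumerate(range(0, len(token_ids), shard_size)):
        shard = np.asarray(token_ids[start:start + shard_size], dtype=np.int32)
        with h5py.File(f"{prefix}_{shard_idx:04d}.h5", "w") as f:
            f.create_dataset("input_ids", data=shard, compression="gzip")
```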
Pre-training (phase 1) scaling results
For pre-training large models at scale, we mainly focused on two aspects of the solution: training performance, as measured by the time to train, and cost-effectiveness of arriving at a fully converged solution. Next, we dive deeper into these two metrics with BERT 1.5B pre-training as an example.
Scaling performance and time to train
We start by measuring the performance of the BERT Large implementation as a baseline for scalability. The following table lists the measured throughput of sequences per second from 1-8 dl1.24xlarge instances (with eight accelerator devices per instance). Using the single-instance throughput as baseline, we measured the efficiency of scaling across multiple instances, which is an important lever to understand the price-performance training metric.
| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
| --- | --- | --- | --- | --- |
| 1 | 8 | 1,379.76 | 172.47 | 100.0% |
| 2 | 16 | 2,705.57 | 169.10 | 98.04% |
| 4 | 32 | 5,291.58 | 165.36 | 95.88% |
| 8 | 64 | 9,977.54 | 155.90 | 90.39% |
The following figure illustrates the scaling efficiency.
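For reference, the scaling-efficiency column can be reproduced from the throughput numbers in the table above with a few lines of Python:

```python
# Scaling efficiency = per-accelerator throughput at N accelerators divided by the
# per-accelerator throughput of the single-instance (8-accelerator) baseline.
throughput = {8: 1379.76, 16: 2705.57, 32: 5291.58, 64: 9977.54}  # sequences/second (BERT Large)
baseline_per_acc = throughput[8] / 8  # 172.47 sequences/second/accelerator

for n_acc, seqs_per_sec in throughput.items():
    efficiency = (seqs_per_sec / n_acc) / baseline_per_acc
    print(f"{n_acc:>3} accelerators: {100 * efficiency:.2f}% scaling efficiency")
# Prints ~100.00%, 98.04%, 95.88%, 90.39%, matching the table.
```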
For BERT 1.5B, we modified the hyperparameters for the model in the reference repository to guarantee convergence. The effective batch size per accelerator was set to 384 (for maximum memory utilization), with micro-batches of 16 per step and 24 steps of gradient accumulation. Learning rates of 0.0015 and 0.003 were used for 8 and 16 nodes, respectively. With these configurations, we achieved convergence of the phase 1 pre-training of BERT 1.5B across 8 dl1.24xlarge instances (64 accelerators) in approximately 25 hours, and 15 hours across 16 dl1.24xlarge instances (128 accelerators). The following figure shows the average loss as a function of number of training epochs, as we scale up the number of accelerators.
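As a quick sanity check on the batch-size arithmetic above (only the global batch size is derived here; the other values are stated in this post):

```python
# Effective and global batch sizes implied by the hyperparameters above.
micro_batch = 16            # micro-batch per accelerator per step
grad_accum_steps = 24       # gradient accumulation steps
accelerators = 128          # 16 dl1.24xlarge instances x 8 Gaudi accelerators

per_accelerator_batch = micro_batch * grad_accum_steps   # 384, as stated
global_batch = per_accelerator_batch * accelerators      # 49,152 sequences per optimizer step
print(per_accelerator_batch, global_batch)
```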
With the configuration described earlier, we obtained 85% strong scaling efficiency with 64 accelerators and 83% with 128 accelerators, from a baseline of 8 accelerators in a single instance. The following table summarizes the parameters.
| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
| --- | --- | --- | --- | --- |
| 1 | 8 | 276.66 | 34.58 | 100.0% |
| 8 | 64 | 1,883.63 | 29.43 | 85.1% |
| 16 | 128 | 3,659.15 | 28.59 | 82.7% |
The following figure illustrates the scaling efficiency.
Conclusion
In this post, we evaluated support for DeepSpeed by Habana SynapseAI v1.5/v1.6 and how it helps scale LLM training on Habana Gaudi accelerators. Pre-training of a 1.5-billion-parameter BERT model took approximately 15 hours to converge on a cluster of 128 Gaudi accelerators (16 dl1.24xlarge instances), with 83% strong scaling efficiency. We encourage you to take a look at the architecture demonstrated in the AWS workshop and consider adopting it to train custom PyTorch model architectures using DL1 instances.
About the authors
Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in the area of physics-infused deep learning, building, and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from the Massachusetts Institute of Technology and has over 25 patents and publications to his credit.
RJ is an engineer on the Search M5 team, leading efforts to build large-scale deep learning systems for training and inference. Outside of work, he explores different cuisines and plays racquet sports.
Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.
Abhinandan Patni is a Senior Software Engineer at Amazon Search. He focuses on building systems and tooling for scalable distributed deep learning training and real time inference.
Pierre-Yves Aquilanti is Head of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry’s best cloud-based ML frameworks solutions. His background is in high-performance computing, and prior to joining AWS, Pierre-Yves worked in the oil and gas industry. Pierre-Yves is originally from France and holds a Ph.D. in Computer Science from the University of Lille.
Evaluating speech synthesis in many languages with SQuId
Previously, we presented the 1,000 languages initiative and the Universal Speech Model with the goal of making speech and language technologies available to billions of users around the world. Part of this commitment involves developing high-quality speech synthesis technologies, which build upon projects such as VDTTS and AudioLM, for users that speak many different languages.
After developing a new model, one must evaluate whether the speech it generates is accurate and natural: the content must be relevant to the task, the pronunciation correct, the tone appropriate, and there should be no acoustic artifacts such as cracks or signal-correlated noise. Such evaluation is a major bottleneck in the development of multilingual speech systems.
The most popular method to evaluate the quality of speech synthesis models is human evaluation: a text-to-speech (TTS) engineer produces a few thousand utterances from the latest model, sends them for human evaluation, and receives results a few days later. This evaluation phase typically involves listening tests, during which dozens of annotators listen to the utterances one after the other to determine how natural they sound. While humans are still unbeaten at detecting whether a piece of text sounds natural, this process can be impractical — especially in the early stages of research projects, when engineers need rapid feedback to test and restrategize their approach. Human evaluation is expensive, time consuming, and may be limited by the availability of raters for the languages of interest.
Another barrier to progress is that different projects and institutions typically use various ratings, platforms and protocols, which makes apples-to-apples comparisons impossible. In this regard, speech synthesis technologies lag behind text generation, where researchers have long complemented human evaluation with automatic metrics such as BLEU or, more recently, BLEURT.
In “SQuId: Measuring Speech Naturalness in Many Languages“, to be presented at ICASSP 2023, we introduce SQuId (Speech Quality Identification), a 600M parameter regression model that describes to what extent a piece of speech sounds natural. SQuId is based on mSLAM (a pre-trained speech-text model developed by Google), fine-tuned on over a million quality ratings across 42 languages and tested in 65. We demonstrate how SQuId can be used to complement human ratings for evaluation of many languages. This is the largest published effort of this type to date.
Evaluating TTS with SQuId
The main hypothesis behind SQuId is that training a regression model on previously collected ratings can provide us with a low-cost method for assessing the quality of a TTS model. The model can therefore be a valuable addition to a TTS researcher’s evaluation toolbox, providing a near-instant, albeit less accurate alternative to human evaluation.
SQuId takes an utterance as input and an optional locale tag (i.e., a localized variant of a language, such as “Brazilian Portuguese” or “British English”). It returns a score between 1 and 5 that indicates how natural the waveform sounds, with a higher value indicating a more natural waveform.
Internally, the model includes three components: (1) an encoder, (2) a pooling / regression layer, and (3) a fully connected layer. First, the encoder takes a spectrogram as input and embeds it into a smaller 2D matrix that contains 3,200 vectors of size 1,024, where each vector encodes a time step. The pooling / regression layer aggregates the vectors, appends the locale tag, and feeds the result into a fully connected layer that returns a score. Finally, we apply application-specific post-processing that rescales or normalizes the score so it is within the [1, 5] range, which is common for naturalness human ratings. We train the whole model end-to-end with a regression loss.
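As a rough mental model only, the head described above might be sketched in PyTorch as follows; the pooling choice, locale embedding, and rescaling are assumptions standing in for the actual implementation.

```python
# Hedged PyTorch-style sketch of the SQuId scoring head (not the actual implementation).
import torch
import torch.nn as nn

class SQuIdHead(nn.Module):
    def __init__(self, encoder, hidden_dim=1024, num_locales=65):
        super().__init__()
        self.encoder = encoder                              # e.g., a pre-trained mSLAM Conformer
        self.locale_embedding = nn.Embedding(num_locales, hidden_dim)
        self.regressor = nn.Linear(2 * hidden_dim, 1)       # fully connected layer -> raw score

    def forward(self, spectrogram, locale_id):
        frames = self.encoder(spectrogram)                  # (batch, 3200, 1024) per the description above
        pooled = frames.mean(dim=1)                         # simple pooling over time steps (assumption)
        features = torch.cat([pooled, self.locale_embedding(locale_id)], dim=-1)
        raw = self.regressor(features).squeeze(-1)
        return 1.0 + 4.0 * torch.sigmoid(raw)               # rescale into the [1, 5] range (assumption)
```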
The encoder is by far the largest and most important piece of the model. We used mSLAM, a pre-existing 600M-parameter Conformer pre-trained on both speech (51 languages) and text (101 languages).
The SQuId model.
To train and evaluate the model, we created the SQuId corpus: a collection of 1.9 million rated utterances across 66 languages, collected for over 2,000 research and product TTS projects. The SQuId corpus covers a diverse array of systems, including concatenative and neural models, for a broad range of use cases, such as driving directions and virtual assistants. Manual inspection reveals that SQuId is exposed to a vast range of TTS errors, such as acoustic artifacts (e.g., cracks and pops), incorrect prosody (e.g., questions without rising intonations in English), text normalization errors (e.g., verbalizing “7/7” as “seven divided by seven” rather than “July seventh”), or pronunciation mistakes (e.g., verbalizing “tough” as “toe”).
A common issue that arises when training multilingual systems is that the training data may not be uniformly available for all the languages of interest. SQuId was no exception. The following figure illustrates the size of the corpus for each locale. We see that the distribution is largely dominated by US English.
Locale distribution in the SQuId dataset.
How can we provide good performance for all languages when there are such variations? Inspired by previous work on machine translation, as well as past work from the speech literature, we decided to train one model for all languages, rather than using separate models for each language. The hypothesis is that if the model is large enough, then cross-locale transfer can occur: the model’s accuracy on each locale improves as a result of jointly training on the others. As our experiments show, cross-locale transfer proves to be a powerful driver of performance.
Experimental results
To understand SQuId’s overall performance, we compare it to a custom Big-SSL-MOS model (described in the paper), a competitive baseline inspired by MOS-SSL, a state-of-the-art TTS evaluation system. Big-SSL-MOS is based on w2v-BERT and was trained on the VoiceMOS’22 Challenge dataset, the most popular dataset at the time of evaluation. We experimented with several variants of the model, and found that SQuId is up to 50.0% more accurate.
SQuId versus state-of-the-art baselines. We measure agreement with human ratings using the Kendall Tau, where a higher value represents better accuracy.
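For readers who want to compute the same kind of agreement statistic on their own ratings, SciPy’s Kendall tau implementation is one option; the values below are made up for illustration.

```python
# Rank agreement between human ratings and model scores via Kendall's tau.
from scipy.stats import kendalltau

human_ratings = [4.2, 3.1, 4.8, 2.5, 3.9]   # illustrative mean opinion scores
model_scores  = [4.0, 3.3, 4.6, 2.9, 3.7]   # illustrative predicted naturalness scores

tau, p_value = kendalltau(human_ratings, model_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```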
To understand the impact of cross-locale transfer, we run a series of ablation studies. We vary the number of locales included in the training set and measure the effect on SQuId’s accuracy. In English, which is already over-represented in the dataset, the effect of adding locales is negligible.
SQuId’s performance on US English, using 1, 8, and 42 locales during fine-tuning.
However, cross-locale transfer is much more effective for most other locales:
SQuId’s performance on four selected locales (Korean, French, Thai, and Tamil), using 1, 8, and 42 locales during fine-tuning. For each locale, we also provide the training set size.
To push transfer to its limit, we held 24 locales out during training and used them for testing exclusively. Thus, we measure to what extent SQuId can deal with languages that it has never seen before. The plot below shows that although the effect is not uniform, cross-locale transfer works.
SQuId’s performance on four “zero-shot” locales, using 1, 8, and 42 locales during fine-tuning.
When does cross-locale transfer occur, and how? We present many more ablations in the paper and show that while language similarity plays a role (e.g., training on Brazilian Portuguese helps European Portuguese), it is surprisingly far from being the only factor that matters.
Conclusion and future work
We introduce SQuId, a 600M parameter regression model that leverages the SQuId dataset and cross-locale learning to evaluate speech quality and describe how natural it sounds. We demonstrate that SQuId can complement human raters in the evaluation of many languages. Future work includes accuracy improvements, expanding the range of languages covered, and tackling new error types.
Acknowledgements
The author of this post is now part of Google DeepMind. Many thanks to all authors of the paper: Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, and Jason Riesa.
Retrain ML models and automate batch predictions in Amazon SageMaker Canvas using updated datasets
You can now retrain machine learning (ML) models and automate batch prediction workflows with updated datasets in Amazon SageMaker Canvas, making it easier to continuously improve model performance and drive efficiency. An ML model’s effectiveness depends on the quality and relevance of the data it’s trained on. As time progresses, the underlying patterns, trends, and distributions in the data may change. By updating the dataset, you ensure that the model learns from the most recent and representative data, thereby improving its ability to make accurate predictions. Canvas now supports updating datasets automatically and manually, enabling you to use the latest version of the tabular, image, and document dataset for training ML models.
After the model is trained, you may want to run predictions on it. Running batch predictions on an ML model enables processing multiple data points simultaneously instead of making predictions one by one. Automating this process provides efficiency, scalability, and timely decision-making. After the predictions are generated, they can be further analyzed, aggregated, or visualized to gain insights, identify patterns, or make informed decisions based on the predicted outcomes. Canvas now supports setting up an automated batch prediction configuration and associating a dataset to it. When the associated dataset is refreshed, either manually or on a schedule, a batch prediction workflow will be triggered automatically on the corresponding model. Results of the predictions can be viewed inline or downloaded for later review.
In this post, we show how to retrain ML models and automate batch predictions using updated datasets in Canvas.
Overview of solution
For our use case, we play the part of a business analyst for an ecommerce company. Our product team wants us to determine the most critical metrics that influence a shopper’s purchase decision. For this, we train an ML model in Canvas with a customer website online session dataset from the company. We evaluate the model’s performance and, if needed, retrain the model with additional data to see whether it improves the performance of the existing model. To do so, we use the auto update dataset capability in Canvas and retrain our existing ML model with the latest version of the training dataset. Then we configure automatic batch prediction workflows—when the corresponding prediction dataset is updated, it automatically triggers the batch prediction job on the model and makes the results available for us to review.
The workflow steps are as follows:
- Upload the downloaded customer website online session data to Amazon Simple Storage Service (Amazon S3) and create a new training dataset in Canvas. For the full list of supported data sources, refer to Importing data in Amazon SageMaker Canvas.
- Build ML models and analyze their performance metrics. Refer to the steps on how to build a custom ML Model in Canvas and evaluate a model’s performance.
- Set up auto update on the existing training dataset and upload new data to the Amazon S3 location backing this dataset. Upon completion, it should create a new dataset version.
- Use the latest version of the dataset to retrain the ML model and analyze its performance.
- Set up automatic batch predictions on the better performing model version and view the prediction results.
You can perform these steps in Canvas without writing a single line of code.
Overview of data
The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period. The following table outlines the data schema.
| Column Name | Data Type | Description |
| --- | --- | --- |
| Administrative | Numeric | Number of pages visited by the user for user account management-related activities. |
| Administrative_Duration | Numeric | Amount of time spent in this category of pages. |
| Informational | Numeric | Number of pages of this type (informational) that the user visited. |
| Informational_Duration | Numeric | Amount of time spent in this category of pages. |
| ProductRelated | Numeric | Number of pages of this type (product related) that the user visited. |
| ProductRelated_Duration | Numeric | Amount of time spent in this category of pages. |
| BounceRates | Numeric | Percentage of visitors who enter the website through that page and exit without triggering any additional tasks. |
| ExitRates | Numeric | Average exit rate of the pages visited by the user. This is the percentage of people who left your site from that page. |
| Page Values | Numeric | Average page value of the pages visited by the user. This is the average value for a page that a user visited before landing on the goal page or completing an ecommerce transaction (or both). |
| SpecialDay | Binary | The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (such as Mother’s Day or Valentine’s Day) in which the sessions are more likely to be finalized with a transaction. |
| Month | Categorical | Month of the visit. |
| OperatingSystems | Categorical | Operating system of the visitor. |
| Browser | Categorical | Browser used by the user. |
| Region | Categorical | Geographic region from which the session has been started by the visitor. |
| TrafficType | Categorical | Traffic source through which the user has entered the website. |
| VisitorType | Categorical | Whether the customer is a new user, returning user, or other. |
| Weekend | Binary | If the customer visited the website on the weekend. |
| Revenue | Binary | If a purchase was made. |
Revenue is the target column, which will help us predict whether or not a shopper will purchase a product.
The first step is to download the dataset that we will use. Note that this dataset is courtesy of the UCI Machine Learning Repository.
Prerequisites
For this walkthrough, complete the following prerequisite steps:
- Split the downloaded CSV that contains 20,000 rows into multiple smaller chunk files.
This is so that we can showcase the dataset update functionality. Ensure all the CSV files have the same headers, otherwise you may run into schema mismatch errors while creating a training dataset in Canvas.
- Create an S3 bucket and upload online_shoppers_intentions1-3.csv to the S3 bucket.
- Set aside 1,500 rows from the downloaded CSV to run batch predictions on after the ML model is trained.
- Remove the Revenue column from these files so that when you run batch prediction on the ML model, that is the value your model will be predicting.
Ensure all the predict*.csv files have the same headers, otherwise you may run into schema mismatch errors while creating a prediction (inference) dataset in Canvas. A code sketch of these data-prep steps appears after this list.
- Perform the necessary steps to set up a SageMaker domain and Canvas app.
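The following is a minimal sketch of the data-preparation prerequisites above using pandas and boto3; the bucket name, local file names, and S3 key layout are placeholders rather than values required by Canvas.

```python
# Sketch of the prerequisite data prep; bucket and key names are placeholders.
import boto3
import pandas as pd

df = pd.read_csv("online_shoppers_intention.csv")  # the downloaded UCI CSV

# Hold out 1,500 rows for batch prediction and drop the Revenue target column.
predict_df = df.sample(n=1500, random_state=42).drop(columns=["Revenue"])
train_df = df.drop(predict_df.index)

# Split the remaining rows into six chunk files with identical headers.
chunks = [train_df.iloc[i::6] for i in range(6)]
for i, chunk in enumerate(chunks, start=1):
    chunk.to_csv(f"online_shoppers_intentions{i}.csv", index=False)

# Prediction file(s) must share the same headers (predict*.csv).
predict_df.to_csv("predict1.csv", index=False)

# Upload the first three chunks now; chunks 4-6 are kept for the
# dataset auto-update step later in this walkthrough.
s3 = boto3.client("s3")
bucket = "canvas-demo-bucket"  # placeholder bucket name
for i in range(1, 4):
    s3.upload_file(
        f"online_shoppers_intentions{i}.csv",
        bucket,
        f"training/online_shoppers_intentions{i}.csv",
    )
```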
Create a dataset
To create a dataset in Canvas, complete the following steps:
- In Canvas, choose Datasets in the navigation pane.
- Choose Create and choose Tabular.
- Give your dataset a name. For this post, we call our training dataset OnlineShoppersIntentions.
- Choose Create.
- Choose your data source (for this post, our data source is Amazon S3).
Note that as of this writing, the dataset update functionality is only supported for Amazon S3 and locally uploaded data sources.
- Select the corresponding bucket and upload the CSV files for the dataset.
You can now create a dataset with multiple files.
- Preview all the files in the dataset and choose Create dataset.
We now have version 1 of the OnlineShoppersIntentions dataset with three files created.
- Choose the dataset to view the details.
The Data tab shows a preview of the dataset.
- Choose Dataset details to view the files that the dataset contains.
The Dataset files pane lists the available files.
- Choose the Version History tab to view all the versions for this dataset.
We can see our first dataset version has three files. Any subsequent version will include all the files from previous versions and will provide a cumulative view of the data.
Train an ML model with version 1 of the dataset
Let’s train an ML model with version 1 of our dataset.
- In Canvas, choose My models in the navigation pane.
- Choose New model.
- Enter a model name (for example, OnlineShoppersIntentionsModel), select the problem type, and choose Create.
- Select the dataset. For this post, we select the OnlineShoppersIntentions dataset.
By default, Canvas will pick up the most current dataset version for training.
- On the Build tab, choose the target column to predict. For this post, we choose the Revenue column.
- Choose Quick build.
The model training will take 2–5 minutes to complete. In our case, the trained model gives us a score of 89%.
Set up automatic dataset updates
Let’s update our dataset using the auto update functionality to bring in more data and see if the model performance improves with the new version of the dataset. Datasets can be manually updated as well.
- On the Datasets page, select the OnlineShoppersIntentions dataset and choose Update dataset.
- You can either choose Manual update, which is a one-time update option, or Automatic update, which allows you to automatically update your dataset on a schedule. For this post, we showcase the automatic update feature.
You’re redirected to the Auto update tab for the corresponding dataset. We can see that Enable auto update is currently disabled.
- Toggle Enable auto update to on and specify the data source (as of this writing, Amazon S3 data sources are supported for auto updates).
- Select a frequency and enter a start time.
- Save the configuration settings.
An auto update dataset configuration has been created. It can be edited at any time. When a corresponding dataset update job is triggered on the specified schedule, the job will appear in the Job history section.
- Next, let’s upload the online_shoppers_intentions4.csv, online_shoppers_intentions5.csv, and online_shoppers_intentions6.csv files to our S3 bucket.
We can view our files in the dataset-update-demo S3 bucket.
The dataset update job will get triggered at the specified schedule and create a new version of the dataset.
When the job is complete, dataset version 2 will have all the files from version 1 and the additional files processed by the dataset update job. In our case, version 1 has three files and the update job picked up three additional files, so the final dataset version has six files.
We can view the new version that was created on the Version history tab.
The Data tab contains a preview of the dataset and provides a list of all the files in the latest version of the dataset.
Retrain the ML model with an updated dataset
Let’s retrain our ML model with the latest version of the dataset.
- On the My models page, choose your model.
- Choose Add version.
- Select the latest dataset version (v2 in our case) and choose Select dataset.
- Keep the target column and build configuration similar to the previous model version.
When the training is complete, let’s evaluate the model performance. The following screenshot shows that adding additional data and retraining our ML model has helped improve our model performance.
Create a prediction dataset
With an ML model trained, let’s create a dataset for predictions and run batch predictions on it.
- On the Datasets page, create a tabular dataset.
- Enter a name and choose Create.
- In our S3 bucket, upload one file with 500 rows to predict.
Next, we set up auto updates on the prediction dataset.
- Toggle Enable auto update to on and specify the data source.
- Select the frequency and specify a starting time.
- Save the configuration.
Automate the batch prediction workflow on an auto updated predictions dataset
In this step, we configure our auto batch prediction workflows.
- On the My models page, navigate to version 2 of your model.
- On the Predict tab, choose Batch prediction and Automatic.
- Choose Select dataset to specify the dataset to generate predictions on.
- Select the predict dataset that we created earlier and choose Choose dataset.
- Choose Set up.
We now have an automatic batch prediction workflow. This will be triggered when the predict dataset is automatically updated.
Now let’s upload more CSV files to the predict S3 folder.
This operation will trigger an auto update of the predict dataset.
This will in turn trigger the automatic batch prediction workflow and generate predictions for us to view.
We can view all automations on the Automations page.
Thanks to the automatic dataset update and automatic batch prediction workflows, we can use the latest version of the tabular, image, and document dataset for training ML models, and build batch prediction workflows that get automatically triggered on every dataset update.
Clean up
To avoid incurring future charges, log out of Canvas. Canvas bills you for the duration of the session, and we recommend logging out of Canvas when you’re not using it. Refer to Logging out of Amazon SageMaker Canvas for more details.
Conclusion
In this post, we discussed how we can use the new dataset update capability to build new dataset versions and train our ML models with the latest data in Canvas. We also showed how we can efficiently automate the process of running batch predictions on updated data.
To start your low-code/no-code ML journey, refer to the Amazon SageMaker Canvas Developer Guide.
Special thanks to everyone who contributed to the launch.
About the Authors
Janisha Anand is a Senior Product Manager on the SageMaker No/Low-Code ML team, which includes SageMaker Canvas and SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.
Prashanth is a Software Development Engineer at Amazon SageMaker and mainly works with SageMaker low-code and no-code products.
Esha Dutta is a Software Development Engineer at Amazon SageMaker. She focuses on building ML tools and products for customers. Outside of work, she enjoys the outdoors, yoga, and hiking.
Expedite the Amazon Lex chatbot development lifecycle with Test Workbench
Amazon Lex is excited to announce Test Workbench, a new bot testing solution that provides tools to simplify and automate the bot testing process. During bot development, testing is the phase where developers check whether a bot meets the specific requirements, needs and expectations by identifying errors, defects, or bugs in the system before scaling. Testing helps validate bot performance on several fronts such as conversational flow (understanding user queries and responding accurately), intent overlap handling, and consistency across modalities. However, testing is often manual, error-prone, and non-standardized. Test Workbench standardizes automated test management by allowing chatbot development teams to generate, maintain, and execute test sets with a consistent methodology and avoid custom scripting and ad-hoc integrations. In this post, you will learn how Test Workbench streamlines automated testing of a bot’s voice and text modalities and provides accuracy and performance measures for parameters such as audio transcription, intent recognition, and slot resolution for both single utterance inputs and multi-turn conversations. This allows you to quickly identify bot improvement areas and maintain a consistent baseline to measure accuracy over time and observe any accuracy regression due to bot updates.
Amazon Lex is a fully managed service for building conversational voice and text interfaces. Amazon Lex helps you build and deploy chatbots and virtual assistants on websites, contact center services, and messaging channels. Amazon Lex bots help increase interactive voice response (IVR) productivity, automate simple tasks, and drive operational efficiencies across the organization. Test Workbench for Amazon Lex standardizes and simplifies the bot testing lifecycle, which is critical to improving bot design.
Features of Test Workbench
Test Workbench for Amazon Lex includes the following features:
- Generate test datasets automatically from a bot’s conversation logs
- Upload manually built test set baselines
- Perform end-to-end testing of single input or multi-turn conversations
- Test both audio and text modalities of a bot
- Review aggregated and drill-down metrics for bot dimensions:
- Speech transcription
- Intent recognition
- Slot resolution (including multi-valued slots or composite slots)
- Context tags
- Session attributes
- Request attributes
- Runtime hints
- Time delay in seconds
Prerequisites
To test this feature, you should have the following:
- An AWS account with administrator access
- A sample retail bot imported via the Amazon Lex console (for more information, refer to importing a bot)
- A test set source, either from:
- Conversation logs enabled for the bot to store bot interactions, or
- A sample retail test set that can be imported following the instructions provided in this post
In addition, you should have knowledge and understanding of the following services and features:
Create a test set
To create your test set, complete the following steps:
- On the Amazon Lex console, under Test workbench in the navigation pane, choose Test sets.
You can review a list of existing test sets, including basic information such as name, description, number of test inputs, modality, and status. In the following steps, you can choose between generating a test set from the conversation logs associated with the bot or uploading an existing manually built test set in a CSV file format.
- Choose Create test set.
- Generating test sets from conversation logs allows you to do the following:
- Include real multi-turn conversations from the bot’s logs in CloudWatch
- Include audio logs and conduct tests that account for real speech nuances, background noises, and accents
- Speed up the creation of test sets
- Uploading a manually built test set allows you to do the following:
- Test new bots for which there is no production data
- Perform regression tests on existing bots for any new or modified intents, slots, and conversation flows
- Test carefully crafted and detailed scenarios that specify session attributes and request attributes
To generate a test set, complete the following steps. To upload a manually built test set, skip to step 7.
- Choose Generate a baseline test set.
- Choose your options for Bot name, Bot alias, and Language.
- For Time range, set a time range for the logs.
- For Existing IAM role, choose a role.
Ensure that the IAM role is able to grant you access to retrieve information from the conversation logs. Refer to Creating IAM roles to create an IAM role with the appropriate policy.
- If you prefer to use a manually created test set, select Upload a file to this test set.
- For Upload a file to this test set, choose from the following options:
- Select Upload from S3 bucket to upload a CSV file from an Amazon Simple Storage Service (Amazon S3) bucket.
- Select Upload a file to this test set to upload a CSV file from your computer.
You can use the sample test set provided in this post. For more information about templates, choose the CSV Template link on the page.
- For Modality, select the modality of your test set, either Text or Audio.
Test Workbench provides testing support for audio and text input formats.
- For S3 location, enter the S3 bucket location where the results will be stored.
- Optionally, choose an AWS Key Management Service (AWS KMS) key to encrypt output transcripts.
- Choose Create.
Your newly created test set will be listed on the Test sets page with one of the following statuses:
- Ready for annotation – For test sets generated from Amazon Lex bot conversation logs, the annotation step serves as a manual gating mechanism to ensure quality test inputs. By annotating values for expected intents and expected slots for each test line item, you indicate the “ground truth” for that line. The test results from the bot run are collected and compared against the ground truth to mark test results as pass or fail. This line level comparison then allows for creating aggregated measures.
- Ready for testing – This indicates that the test set is ready to be executed against an Amazon Lex bot.
- Validation error – Uploaded test files are checked for errors such as exceeding maximum supported length, invalid characters in intent names, or invalid Amazon S3 links containing audio files. If the test set is in the Validation error state, download the file showing the validation details to see test input issues or errors on a line-by-line basis. Once they are addressed, you can manually upload the corrected test set CSV into the test set.
Executing a test set
A test set is decoupled from a bot. The same test set can be executed against a different bot or bot alias in the future as your business use case evolves. To report performance metrics of a bot against the baseline test data, complete the following steps:
- Import the sample bot definition and build the bot (refer to Importing a bot for guidance).
- On the Amazon Lex console, choose Test sets in the navigation pane.
- Choose your validated test set.
Here you can review basic information about the test set and the imported test data.
- Choose Execute test.
- Choose the appropriate options for Bot name, Bot alias, and Language.
- For Test type, select Audio or Text.
- For Endpoint selection, select either Streaming or Non-streaming.
- Choose Validate discrepancy to validate your test dataset.
Before executing a test set, you can validate test coverage, including identifying intents and slots present in the test set but not in the bot. This early warning serves to set tester expectation for unexpected test failures. If discrepancies between your test dataset and your bot are detected, the Execute test page will update with the View details button.
Intents and slots found in the test data set but not in the bot alias are listed as shown in the following screenshots.
- After you validate the discrepancies, choose Execute to run the test.
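The same execution can also be started programmatically. The following is a hedged sketch using the boto3 lexv2-models client against a test set that is already in the Ready for testing state; all IDs are placeholders, and the parameter names should be checked against the current boto3 documentation.

```python
# Hedged sketch: start a Test Workbench execution via the API.
# All IDs below are placeholders; verify parameter names in the boto3 docs.
import boto3

lex = boto3.client("lexv2-models")

response = lex.start_test_execution(
    testSetId="EXAMPLETESTSETID",       # placeholder test set ID
    target={
        "botAliasTarget": {
            "botId": "EXAMPLEBOTID",    # placeholder bot ID
            "botAliasId": "TSTALIASID", # placeholder bot alias ID
            "localeId": "en_US",
        }
    },
    apiMode="NonStreaming",             # or "Streaming"
    testExecutionModality="Text",       # or "Audio"
)
print(response["testExecutionId"])      # assumed response field; see the boto3 docs
```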
Review results
The performance measures generated after executing a test set help you identify areas of bot design that need improvements and are useful for expediting bot development and delivery to support your customers. Test Workbench provides insights on intent classification and slot resolution in end-to-end conversation and single-line input level. The completed test runs are stored with timestamps in your S3 bucket, and can be used for future comparative reviews.
- On the Amazon Lex console, choose Test results in the navigation pane.
- Choose the test result ID for the results you want to review.
On the next page, the test results will include a breakdown of results organized in four main tabs: Overall results, Conversation results, Intent and slot results, and Detailed results.
Overall results
The Overall results tab contains three main sections:
- Test set input breakdown — A chart showing the total number of end-to-end conversations and single input utterances in the test set.
- Single input breakdown — A chart showing the number of passed or failed single inputs.
- Conversation breakdown — A chart showing the number of passed or failed multi-turn inputs.
For test sets run in audio modality, speech transcription charts are provided to show the number of passed or failed speech transcriptions on both single input and conversation types. In audio modality, a single input or multi-turn conversation could pass the speech transcription test, yet fail the overall end-to-end test. This can be caused, for instance, by a slot resolution or an intent recognition issue.
Conversation results
Test Workbench helps you drill down into conversation failures that can be attributed to specific intents or slots. The Conversation results tab is organized into three main areas, covering all intents and slots used in the test set:
- Conversation pass rates — A table used to visualize which intents and slots are responsible for possible conversation failures.
- Conversation intent failure metrics — A bar graph showing the top five worst performing intents in the test set, if any.
- Conversation slot failure metrics — A bar graph showing the top five worst performing slots in the test set, if any.
Intent and slot results
The Intent and slot results tab provides drill-down metrics for bot dimensions such as intent recognition and slot resolution.
- Intent recognition metrics — A table showing the intent recognition success rate.
- Slot resolution metrics — A table showing the slot resolution success rate, by slot.
Detailed results
You can access a detailed report of the executed test run on the Detailed results tab. A table is displayed to show the actual transcription, output intent, and slot values in a test set. The report can be downloaded as a CSV for further analysis.
The line-level output provides insights to help improve the bot design and boost accuracy. For instance, misrecognized or missed speech inputs such as branded words can be added to custom vocabulary of an intent or as utterances under an intent.
In order to further improve conversation design, you can refer to this post, outlining best practices on using ML to create a bot that will delight your customers by accurately understanding them.
Conclusion
In this post, we presented the Test Workbench for Amazon Lex, a native capability that standardizes a chatbot automated testing process and allows developers and conversation designers to streamline and iterate quickly through bot design and development.
We look forward to hearing how you use this new functionality of Amazon Lex and welcome feedback! For any questions, bugs, or feature requests, please reach us through AWS re:Post for Amazon Lex or your AWS Support contacts.
To learn more, see Amazon Lex FAQs and the Amazon Lex V2 Developer Guide.
About the authors
Sandeep Srinivasan is a Product Manager on the Amazon Lex team. As a keen observer of human behavior, he is passionate about customer experience. He spends his waking hours at the intersection of people, technology, and the future.
Grazia Russo Lassner is a Senior Consultant with the AWS Professional Services Natural Language AI team. She specializes in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Outside of work, she enjoys beach weekends, reading the latest fiction books, and family.
Taking AI to School: A Conversation With MIT’s Anant Agarwal
In the latest episode of NVIDIA’s AI Podcast, Anant Agarwal, founder of edX and chief platform officer at 2U, shared his vision for the future of online education and how AI is revolutionizing the learning experience.
Agarwal, a strong advocate for massive open online courses, or MOOCs, discussed the importance of accessibility and quality in education. The MIT professor and renowned edtech pioneer also highlighted the implementation of AI-powered features in the edX platform, including the ChatGPT plug-in and edX Xpert, an AI-powered learning assistant.
You Might Also Like
Jules Anh Tuan Nguyen Explains How AI Lets Amputee Control Prosthetic Hand, Video Games
A postdoctoral researcher at the University of Minnesota discusses his efforts to allow amputees to control their prosthetic limb — right down to the finger motions — with their minds.
Overjet’s Wardah Inam on Bringing AI to Dentistry
Overjet, a member of NVIDIA Inception, is moving fast to bring AI to dentists’ offices. Dr. Wardah Inam, CEO of the company, discusses using AI to improve patient care.
Immunai CTO and Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs
Luis Voloch, co-founder and chief technology officer of Immunai, talks about tackling the challenges of the immune system with a machine learning and data science mindset.
Subscribe to the AI Podcast: Now Available on Amazon Music
The AI Podcast is now available through Amazon Music.
In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.
Make the AI Podcast better. Have a few minutes to spare? Fill out this listener survey.
Announcing enhanced table extractions with Amazon Textract
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. Amazon Textract has a Tables feature within the AnalyzeDocument API that offers the ability to automatically extract tabular structures from any document. In this post, we discuss the improvements made to the Tables feature and how it makes it easier to extract information in tabular structures from a wide variety of documents.
Tabular structures in documents such as financial reports, paystubs, and certificate of analysis files are often formatted in a way that enables easy interpretation of information. They often also include information such as table title, table footer, section title, and summary rows within the tabular structure for better readability and organization. For a similar document prior to this enhancement, the Tables feature within AnalyzeDocument would have identified those elements as cells, and it didn’t extract titles and footers that are present outside the bounds of the table. In such cases, custom postprocessing logic to identify such information or extract it separately from the API’s JSON output was necessary. With this announcement of enhancements to the Tables feature, the extraction of various aspects of tabular data becomes much simpler.
In April 2023, Amazon Textract introduced the ability to automatically detect titles, footers, section titles, and summary rows present in documents via the Tables feature. In this post, we discuss these enhancements and give examples to help you understand and use them in your document processing workflows. We walk through how to use these improvements through code examples to use the API and process the response with the Amazon Textract Textractor library.
Overview of solution
The following image shows that the updated model not only identifies the table in the document but all corresponding table headers and footers. This sample financial report document contains table title, footer, section title, and summary rows.
The Tables feature enhancement adds support for four new elements in the API response that allow you to extract each of these table elements with ease, and adds the ability to distinguish the type of table.
Table elements
Amazon Textract can identify several components of a table such as table cells and merged cells. These components, known as Block objects, encapsulate the details related to the component, such as the bounding geometry, relationships, and confidence score. A Block represents items that are recognized in a document within a group of pixels close to each other. The following are the new Table Blocks introduced in this enhancement:
- Table title – A new Block type called TABLE_TITLE that enables you to identify the title of a given table. Titles can be one or more lines, which are typically above a table or embedded as a cell within the table.
- Table footers – A new Block type called TABLE_FOOTER that enables you to identify the footers associated with a given table. Footers can be one or more lines that are typically below the table or embedded as a cell within the table.
- Section title – A new Block type called TABLE_SECTION_TITLE that enables you to identify if the cell detected is a section title.
- Summary cells – A new Block type called TABLE_SUMMARY that enables you to identify if the cell is a summary cell, such as a cell for totals on a paystub.
Types of tables
When Amazon Textract identifies a table in a document, it extracts all the details of the table into a top-level Block type of TABLE. Tables can come in various shapes and sizes. For example, documents often contain tables that may or may not have a discernible table header. To help distinguish these types of tables, we added two new entity types for a TABLE Block: SEMI_STRUCTURED_TABLE and STRUCTURED_TABLE. These entity types help you distinguish between a structured versus a semistructured table.
Structured tables are tables that have clearly defined column headers. But with semi-structured tables, data might not follow a strict structure. For example, data may appear in a tabular structure that isn’t a table with defined headers. The new entity types offer the flexibility to choose which tables to keep or remove during post-processing. The following image shows an example of STRUCTURED_TABLE and SEMI_STRUCTURED_TABLE.
Analyzing the API output
In this section, we explore how you can use the Amazon Textract Textractor library to postprocess the API output of AnalyzeDocument with the Tables feature enhancements. This allows you to extract relevant information from tables.
Textractor is a library created to work seamlessly with Amazon Textract APIs and utilities to subsequently convert the JSON responses returned by the APIs into programmable objects. You can also use it to visualize entities on the document and export the data in formats such as comma-separated values (CSV) files. It’s intended to aid Amazon Textract customers in setting up their postprocessing pipelines.
In our examples, we use the following sample page from a 10-K SEC filing document.
The following code can be found within our GitHub repository. To process this document, we make use of the Textractor library, importing it to postprocess the API outputs and visualize the data.
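A minimal setup sketch, assuming the amazon-textract-textractor package is installed and AWS credentials are configured; the profile name is a placeholder:

```python
from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Initialize the Textractor client; "default" is a placeholder AWS profile name.
extractor = Textractor(profile_name="default")
```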
The first step is to call Amazon Textract AnalyzeDocument with the Tables feature, denoted by the features=[TextractFeatures.TABLES] parameter, to extract the table information. Note that this method invokes the real-time (or synchronous) AnalyzeDocument API, which supports single-page documents. However, you can use the asynchronous StartDocumentAnalysis API to process multi-page documents (with up to 3,000 pages).
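A sketch of that call; the local file name is a placeholder for the sample 10-K page:

```python
# Synchronous AnalyzeDocument call with the Tables feature enabled.
document = extractor.analyze_document(
    file_source="10k_sample_page.png",   # placeholder local file name
    features=[TextractFeatures.TABLES],
    save_image=True,  # keep the page image so entities can be visualized later
)
# For multi-page documents, Textractor also wraps the asynchronous
# StartDocumentAnalysis API via extractor.start_document_analysis(...).
```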
The document object contains metadata about the document that can be reviewed. Notice that it recognizes one table along with other entities in the document:
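Printing the document object gives a quick summary of what was detected; the counts shown in the comment are illustrative only:

```python
print(document)
# Illustrative output shape (actual counts depend on the page), e.g.:
#   Pages - 1, Words - ..., Lines - ..., Tables - 1
```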
Now that we have the API output containing the table information, we visualize the different elements of the table using the response structure discussed previously:
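One way to render this with Textractor, assuming the document was analyzed with save_image=True and that the entity list visualize() helper is available in the installed version:

```python
# Overlay the detected table entities (title, footers, section titles,
# summary cells) on the page image; requires save_image=True above.
document.tables.visualize()
```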
The Textractor library highlights the various entities within the detected table with a different color code for each table element. Let’s dive deeper into how we can extract each element. The following code snippet demonstrates extracting the title of the table:
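A sketch of that extraction, assuming the installed Textractor version surfaces the new TABLE_TITLE block as a title attribute on the table object:

```python
table = document.tables[0]
# Assumes a `title` attribute is exposed for the new TABLE_TITLE block;
# prints None if no title was detected.
print(table.title)
```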
Similarly, we can use the following code to extract the footers of the table. Notice that table_footers is a list, which means that there can be one or more footers associated with the table. We can iterate over this list to see all the footers present, and as shown in the following code snippet, the output displays three footers:
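A sketch along those lines, assuming the footers are exposed as a list attribute on the table object:

```python
# Footers are returned as a list because a table can have several footer lines.
table_footers = table.footers
for footer in table_footers:
    print(footer.text)
```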
Generating data for downstream ingestion
The Textractor library also helps you simplify the ingestion of table data into downstream systems or other workflows. For example, you can export the extracted table data into a human readable Microsoft Excel file. At the time of this writing, this is the only format that supports merged tables.
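A sketch of that export, assuming the document-level Excel export helper is available in the installed Textractor version; the output path is a placeholder:

```python
# Write all detected tables to an Excel workbook; the path is a placeholder.
document.export_tables_to_excel("tables.xlsx")
```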
We can also convert it to a Pandas DataFrame. DataFrame is a popular choice for data manipulation, analysis, and visualization in programming languages such as Python and R.
In Python, DataFrame is a primary data structure in the Pandas library. It’s flexible and powerful, and is often the first choice for data analysis professionals for various data analysis and ML tasks. The following code snippet shows how to convert the extracted table information into a DataFrame with a single line of code:
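For example, using the table detected earlier (the head() call just previews the result):

```python
df = document.tables[0].to_pandas()
print(df.head())  # preview the extracted table as a DataFrame
```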
Lastly, we can convert the table data into a CSV file. CSV files are often used to ingest data into relational databases or data warehouses. See the following code:
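A sketch of that conversion, assuming to_csv() returns the table as a CSV-formatted string; the output file name is a placeholder:

```python
# Convert the table to CSV text and write it to disk for downstream ingestion.
csv_text = document.tables[0].to_csv()
with open("table.csv", "w") as f:
    f.write(csv_text)
```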
Conclusion
The introduction of these new block and entity types (TABLE_TITLE, TABLE_FOOTER, TABLE_SECTION_TITLE, TABLE_SUMMARY, STRUCTURED_TABLE, and SEMI_STRUCTURED_TABLE) marks a significant advancement in the extraction of tabular structures from documents with Amazon Textract.
These tools provide a more nuanced and flexible approach, catering to both structured and semistructured tables and making sure that no important data is overlooked, regardless of its location in a document.
This means we can now handle diverse data types and table structures with enhanced efficiency and accuracy. As we continue to embrace the power of automation in document processing workflows, these enhancements will no doubt pave the way for more streamlined workflows, higher productivity, and more insightful data analysis. For more information on AnalyzeDocument and the Tables feature, refer to AnalyzeDocument.
About the authors
Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM) and Machine Learning infrastructure and operations projects (MLOps).
Anjan Biswas is a Senior AI Services Solutions Architect with focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand, and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations and is actively helping customers get started and scale on AWS AI services.
Lalita Reddi is a Senior Technical Product Manager with the Amazon Textract team. She is focused on building machine learning-based services for AWS customers. In her spare time, Lalita likes to play board games, and go on hikes.