Ideas: The journey to DNA data storage

Outlined illustrations of Karin Strauss, Jake Smith, Bichlien Nguyen, and Sergey Yekhanin for the Microsoft Research Podcast, Ideas series.

Behind every emerging technology is a great idea propelling it forward. In the Microsoft Research Podcast series Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets.

Accommodating the increasing amounts of digital data the world is producing requires out-of-the-box thinking. In this episode, guest host Karin Strauss, an innovation strategist and senior principal research manager at Microsoft, brings together members of her team to explore a more sustainable, more cost-effective alternative for archival data storage: synthetic DNA. Strauss, Principal Researcher Bichlien Nguyen, Senior Researcher Jake Smith, and Partner Research Manager Sergey Yekhanin discuss how Microsoft Research’s contributions have helped bring “science fiction,” as Strauss describes it, closer to reality, including its role in establishing the DNA Data Storage Alliance to foster collaboration in developing the technology and to establish specifications for interoperability. They also talk about the scope of collaboration with other fields, such as the life sciences and electrical and mechanical engineering, and the coding theory behind the project, including the group’s most powerful algorithm for DNA error correction, Trellis BMA, which is now open source. 

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

JAKE SMITH: This really starts from the fundamental data production–data storage gap, where we produce way more data nowadays than we could ever have imagined years ago. And it’s more than we can practically store in magnetic media. And so we really need a denser medium on the other side to contain that. DNA is extremely dense. It holds far, far more information per unit volume, per unit mass than any storage media that we have available today. This, along with the fact that DNA is itself a relatively rugged molecule—it lives in our body; it lives outside our body for thousands and thousands of years if we, you know, leave it alone to do its thing—makes it a very attractive media.

BICHLIEN NGUYEN: It’s such a futuristic technology, right? When you begin to work on the tech, you realize how many disciplines and domains you actually have to reach in and leverage. It’s really interesting, this multidisciplinarity, because we’re, in a way, bridging software with wetware with hardware. And so you, kind of, need all the different disciplines to actually get you to where you need to go. 

SERGEY YEKHANIN: We all work for Microsoft; we are all Microsoft researchers. Microsoft isn’t a startup. But that team, the team that drove the DNA Data Storage Project, it did feel like a startup, and it was something unusual and exciting for me.

SERIES INTRO: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward.


[MUSIC FADES]

GUEST HOST KARIN STRAUSS: I’m your guest host Karin Strauss, a senior principal research manager at Microsoft. For nearly a decade, my colleagues and I—along with a fantastic and talented group of collaborators from academia and industry—have been working together to help close the data creation–data storage gap. We’re producing far more digital information than we can possibly store. One solution we’ve explored uses synthetic DNA as a medium, and over the years, we’ve contributed to steady and promising progress in the area. We’ve helped push the boundaries of how much data a DNA writer can simultaneously store, shown that full automation is possible, and helped create an ecosystem for the commercial success of DNA data storage. And just this week, we’ve made one of our most advanced tools for encoding and decoding data in DNA open source. Joining me today to discuss the state of DNA data storage and some of our contributions are several members of the DNA Data Storage Project at Microsoft Research: Principal Researcher Bichlien Nguyen, Senior Researcher Jake Smith, and Partner Research Manager Sergey Yekhanin. Bichlien, Jake, and Sergey, welcome to the podcast.

BICHLIEN NGUYEN: Thanks for having us, Karin.

SERGEY YEKHANIN: Thank you so much.

JAKE SMITH: Yes, thank you.

STRAUSS: So before getting into the details of DNA data storage and our work, I’d like to talk about the big idea behind the work and how we all got here. I’ve often described the DNA Data Storage Project as turning science fiction into reality. When we started the project in 2015, though, the idea of using DNA for archival storage was already out there and had been for over five decades. Still, when I talked about the work in the area, people were pretty skeptical in the beginning, and I heard things like, “Wow, why are you thinking about that? It’s so far off.” So, first, please share a bit of your research backgrounds and then how you came to work on this project. Where did you first encounter this idea, what do you remember about your initial impressions—or the impressions of others—and what made you want to get involved? Sergey, why don’t you start.

YEKHANIN: Thanks so much. So I’m a coding theorist by training, so, like, my core areas of research have been error-correcting codes and also computational complexity theory. So I joined the project probably, like, within half a year of the time that it was born, and thanks, Karin, for inviting me to join. So, like, that was roughly the time when I moved from a different lab, from the Silicon Valley lab in California to the Redmond lab, and actually, it just so happened that at that moment, I was thinking about what to do next. Like, in California, I was mostly working on coding for distributed storage, and when I joined here, that effort kept going. But I had some free cycles, and that was the moment when Karin came just to my office and told me about the project. So, indeed, initially, it did feel a lot like science fiction. Because, I mean, we are used to coding for digital storage media, like for magnetic storage media, and here, like, this is biology, and, like, why exactly these kinds of molecules? There are so many different molecules. Like, why that? But honestly, like, I didn’t try to pretend to be a biologist and make conclusions about whether this is the right medium or the wrong medium. So I tried to look into these kinds of questions from a technical standpoint, and there were a lot of, kind of, deep, interesting coding questions, and that was the main attraction for me. At the same time, I wasn’t convinced that we would get as far as we actually got, and I wasn’t immediately convinced about the future of the field, but, kind of, just the depth and the richness of the, what I’ll call, technical problems, that’s what made it appealing for me, and I, kind of, enthusiastically joined. And also, I guess, the culture of the team. So, like, it did feel like a startup. Like, we all work for Microsoft; we’re all Microsoft researchers. Microsoft isn’t a startup.
But that team, the team that drove the DNA Data Storage Project, it did feel like a startup, and it was something unusual and exciting for me.

NGUYEN: Oh, I love that, Sergey. So my background is in organic chemistry, and Karin had reached out to me, and I interviewed not knowing what Karin wanted. Actually … so I took the job kind of blind because I was like, “Hmm, Microsoft Research? … DNA biotech? …” I was very, very curious, and then when she told me that this project was about DNA data storage, I was like, this is a crazy, crazy idea. I definitely was not sold on it, but I was like, well, look, I get to meet and work with so many interesting people from different backgrounds that, one, even if it doesn’t work out, I’m going to learn something, and, two, I think it could work, like it could work. And so I think that’s really what motivated me to join.

SMITH: The first thing that you think when you hear that we’re going to take what is our hard drive and we’re going to turn that into DNA is that this is nuts. But, you know, it didn’t take very long after that. I come from a chemistry, biotech-type background where I’ve been working on designing drugs, and there, DNA is this thing off in the nethers, you know. You look at it every now and then to see what information it can tell you about, you know, what maybe your drug might be hitting on the target side, and it’s, you know, that connection—that the DNA contains the information in the living systems, the DNA contains the information in our assays, so why could the DNA not contain the information that we, you know, think more about every day, that information that lives in our computers—that struck me as an extremely cool idea.

STRAUSS: Through our work, we’ve had years to wrap our heads around DNA data storage. But, Jake, could you tell us a little bit about how DNA data storage works and why we’re interested in looking into the technology?

SMITH: So you mentioned it earlier, Karin, that this really starts from the fundamental data production–data storage gap, where we produce way more data nowadays than we could ever have imagined years ago. And it’s more than we can practically store in magnetic media. This is a problem because, you know, we have data; we have recognized the value of data with the rise of large language models and these other big generative models. The data that we do produce, our video has gone from, you know, substantially small, down at 480 resolution, all the way up to things at 8K resolution that now take orders of magnitude more storage. And so we really need a denser medium on the other side to contain that. DNA is extremely dense. It holds far, far more information per unit volume, per unit mass than any storage media that we have available today. This, along with the fact that DNA is itself a relatively rugged molecule—it lives in our body; it lives outside our body for thousands and thousands of years if we, you know, leave it alone to do its thing—makes it a very attractive media, particularly compared to the traditional magnetic media, which has lower density and a much shorter lifetime on the, you know, scale of decades at most.

So how does DNA data storage actually work? Well, at a very high level, we start out in the digital domain, where we have our information represented as ones and zeros, and we need to convert that into a series of A’s, C’s, T’s, and G’s that we could then actually produce, and this is really the domain of Sergey. He’ll tell us much more about how this works later on. For now, let’s just assume we’ve done this. And now our information, you know, lives in the DNA base domain. It’s still in the digital world. It’s just represented as A’s, C’s, T’s, and G’s, and we now need to make this physical so that we can store it. This is accomplished through large-scale DNA synthesis. Once the DNA has been synthesized with the sequences that we specified, we need to store it. There’s a lot of ways we can think about storing it. Bichlien’s done great work looking at DNA encapsulation, as well as, you know, other more raw just DNA-on-glass-type techniques. And we’ve done some work looking at the susceptibility of DNA stored in this unencapsulated form to things like atmospheric humidity, to temperature changes and, most excitingly, to things like neutron radiation. So we’ve stored our data in this physical form, we’ve archived it, and coming back to it, likely many years in the future because the properties of DNA match up very well with archival storage, we need to convert it back into the digital domain. And this is done through a technique called DNA sequencing. What this does is it puts the molecules through some sort of machine, and on the other side of the machine, we get out, you know, a noisy representation of what the actual sequence of bases in the molecules were. We have one final step. We need to take this series of noisy sequences and convert it back into ones and zeros. Once we do this, we return to our original data and we’ve completed, let’s call it, one DNA data storage cycle.
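The first step Jake describes—converting ones and zeros into A’s, C’s, T’s, and G’s—can be illustrated with a toy mapping. This sketch is purely hypothetical and is not the project’s actual encoder, which, as Sergey explains later, also adds error-correcting redundancy and must cope with sequences that are hard to synthesize or read:

```python
# Illustrative only: a naive 2-bits-per-base mapping, NOT the actual
# encoding used by the DNA Data Storage Project (real schemes add
# redundancy and avoid problem motifs such as long homopolymer runs).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(bits: str) -> str:
    """Map a binary string (even length) to a DNA sequence."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(seq: str) -> str:
    """Map a DNA sequence back to its binary string."""
    return "".join(BASE_TO_BITS[base] for base in seq)

data = "0110001011"
strand = encode(data)            # "CGAGT"
assert decode(strand) == data    # round trip recovers the original bits
```

Even at this naive two-bits-per-base density, the physical compactness of the molecule is what gives DNA its enormous advantage over magnetic media.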

STRAUSS: We’ll get into this in more detail later, but maybe, Sergey, we can dig a little bit into the encoding-decoding end of things and how DNA is different as a medium from other types of media.

YEKHANIN: Sure. So, like, I mean, coding is an important aspect of this whole idea of DNA data storage because we have to deal with errors—it’s a new medium—but talking about error-correcting codes in the context of DNA data storage, so, I mean, usually, like … what are error-correcting codes about? Like, on the very high level, right, I mean, you have some data—think of it as a binary string—you want to store it, but there are errors. So usually, like, in most, kind of, forms of media, the errors are bit flips. Like, you store a 0; you get a 1. Or you store a 1; you get a 0. So these are called substitution errors. The field of error-correcting codes, it started, like, in the 1950s, so, like, it’s 70 years old at least. So we, kind of, we understand how to deal with this kind of error reasonably well, so with substitution errors. In DNA data storage, the way you store your data is that given, like, some large amount of digital data, you have the freedom of choosing which short DNA molecules to generate. So in a DNA molecule, it’s a sequence of the bases A, G, C, and T, and you have the freedom to decide, like, which of the short molecules you need to generate, and then those molecules get stored, and then during the storage, some of them are lost; some of them can be damaged. There can be insertions and deletions of bases on every molecule. Like, we call them strands. So you need redundancy, and there are two forms of redundancy. There’s redundancy that goes across strands, and there is redundancy on the strand. And so, yeah, so, kind of, from the error-correcting side of things, like, we get to decide what kind of redundancy we want to introduce—across strands, on the strand—and then, like, we want to make sure that our encoding and decoding algorithms are efficient. So that’s the coding theory angle on the field.
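As a toy illustration of the “redundancy across strands” Sergey mentions (this is only a sketch of the idea, not the project’s actual coding scheme): treat each base as a 2-bit value and append one parity strand that is the elementwise XOR of the data strands, so a single lost strand can be recomputed from the survivors.

```python
from functools import reduce

# Toy across-strand redundancy: one XOR parity strand over equal-length
# data strands. Real schemes are far more sophisticated, also adding
# redundancy ON each strand to handle insertions, deletions, and
# substitutions within a molecule.
VAL = {"A": 0, "C": 1, "G": 2, "T": 3}      # each base carries 2 bits
BASE = {v: b for b, v in VAL.items()}

def parity_strand(strands):
    """Elementwise XOR of the strands' 2-bit base values."""
    return "".join(
        BASE[reduce(lambda a, b: a ^ b, (VAL[s[i]] for s in strands))]
        for i in range(len(strands[0]))
    )

def recover_missing(surviving, parity):
    """XORing the parity with all surviving strands restores the lost one."""
    return parity_strand(list(surviving) + [parity])

data_strands = ["ACGT", "GGCA", "TTAC"]
parity = parity_strand(data_strands)                      # "CATG"
# Suppose "GGCA" is lost during storage:
assert recover_missing(["ACGT", "TTAC"], parity) == "GGCA"
```

This captures only the simplest failure mode Sergey lists—whole strands disappearing—while the per-strand errors he describes require the heavier machinery discussed later in the episode.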

NGUYEN: Yeah, and then, you know, from there, once you have that data encoded into DNA, the question is how do you make that data on a scale that’s compatible with digital data storage? And so that’s where a lot of the work came in for really automating the synthesis process and also the reading process, as well. So synthesis is what we consider the writing process of DNA data storage. And so, you know, we came up with some unique ideas there. We made a chip that enabled us to get to the densities that we needed. And then on the reading side, we used different sequencing technologies. And it was great to see that we could actually just, kind of, pull sequencing technologies off the shelf because people are so interested in reading biological DNA. So we explored the Illumina technologies and also Oxford Nanopore, which is a new technology coming on the horizon. And then preservation, too, because we have to make sure that the data that’s stored in the DNA doesn’t get damaged and that we can recover it using the error-correcting codes.

STRAUSS: Yeah, absolutely. And it’s clear that—and it’s also been our experience that—DNA data storage and projects like this require more than just a team of computer scientists. Bichlien, you’ve had the opportunity to collaborate with many people in all different disciplines. So do you want to talk a little bit about that? What kind of expertise, you know, other disciplines that are relevant to bringing DNA data storage to reality?

NGUYEN: Yeah, well, it’s such a futuristic technology, right? When you begin to work on the tech, you realize how many disciplines and domains you actually have to reach in and leverage. One concrete example is that in order to fabricate an electronic chip to synthesize DNA, we really had to pull in a lot of materials science research because there are different capabilities that are needed when trying to use liquid on a chip. We, you know, have to think about DNA data storage itself. And that’s a very different beast than, you know, the traditional storage media. And so we worked with teams who literally create, you know, these little tiny micro- or nanocapsules in glass so that we could store the DNA there. It’s really interesting, this multidisciplinarity, because we’re, in a way, bridging software with wetware with hardware. And so you, kind of, need all the different disciplines to actually get you to where you need to go.

STRAUSS: Yeah, absolutely. And, you know, building on, you know, collaborators, I think one area that was super interesting, as well, and was pretty early on in the project was building that first end-to-end system, which we collaborated with the University of Washington, the Molecular Information Systems Lab there, to build. And really, at that point, you know, there had been work suggesting that DNA data storage was viable, but nobody had really shown an end-to-end system, and in fact, my manager at the time, Doug Carmean, used to call it the “bubble gum and shoestring” system. But it was a crucial first step because it showed it was possible to really fully automate the process. And there were several interesting challenges there in the system, but we noticed that one particularly challenging one was synthesis. That first system that we built was capable of storing the word “hello,” and that was all we could store. So it wasn’t a very high-capacity system. But in order to be able to store much larger volumes of data instead of a simple word, we really needed much more advanced synthesis systems. And this is what both Bichlien and Jake ended up working on, so do you want to talk a little bit about that and the importance of that particular work?

SMITH: Yeah, absolutely. As you said, Karin, the amount of DNA that is required to store the massive amount of data we spoke about earlier is far beyond the amount of DNA that’s needed for any, air quotes, traditional applications of synthetic DNA, whether it’s your gene construction or it’s your primer synthesis or such. And so we really had to rethink how you make DNA at scale and think about how this could actually scale to meet the demand. And so Bichlien started out looking at a thing called a microelectrode array, where you have this big checkerboard of small individual reaction sites, and in each reaction site, we used electrochemistry in order to control base by base—A, C, T, or G by A, C, T, or G—the sequence that was growing at that particular reaction site. We got this down to the nanoscale. And so what this means practically is that on one of these chips, we could synthesize at any given time on the order of hundreds of millions of individual strands. So we had the synthesis working with the traditional chemistry, where each base is added in using a mixture of chemicals applied to the individual spots as they’re activated. But each coupling happens due to energy prestored in the synthesis of your reagents, and this makes the synthesis of those reagents costly and itself a bottleneck.
And so taking, you know, a look forward at what else was happening in the synthetic biology world, the, you know, next big word in DNA synthesis was and still is enzymatic synthesis, where rather than having to, you know, spend a lot of energy to chemically pre-activate reagents that will go in to make your actual DNA strands, we capitalize on nature’s synthetic robots—enzymes—to start with less-activated, less-expensive-to-get-to, cheaply-produced-through-natural-processes substrates, and we use the enzymes themselves, toggling their activity over each of the individual chips, or each of the individual spots on our checkerboard, to construct DNA strands. And so we got a little bit into this project. You know, we successfully showed that we could put down selectively one base at a given time. We hope that others will, kind of, take up the work that we’ve put out there, you know, particularly our wonderful collaborators at Ansa who helped us design the enzymatic system. And one day we will see, you know, a truly parallelized, in this fashion, enzymatic DNA system that can achieve the scales necessary.

NGUYEN: It’s interesting to note that even though it’s DNA and we’re still storing data in these DNA strands, chemical synthesis and enzymatic synthesis produce different errors that you see in the actual files, right, in the DNA files. And so I know that we talked to Sergey about how we deal with these new types of errors and also the new capabilities that you can have, for example, if you don’t control the DNA synthesis base by base.

YEKHANIN: This whole field of DNA data storage, like, the technologies on the biology side are advancing rapidly, right. And there are different approaches to synthesis. There are different approaches to sequencing. And, presumably, the way the storage is actually done, like, is also progressing, right, and we had works on that. So there is, kind of, this very general, kind of, high-level error profile that you can say that these are the type of errors that you encounter in DNA data storage. Like, in DNA molecules—just the sequence of these bases, A, G, C, T, in maybe a length of, like, 200 or so and you store a very, very large number of them—the errors that you see is that some of these strands, kind of, will disappear. Some of these strands can be torn apart like, let’s say, in two pieces, maybe even more. And then on every strand, you also encounter these errors—insertions, deletions, substitutions—with different rates. Like, the likelihood of all kinds of these errors may differ very significantly across different technologies that you use on the biology side. And also there can be error bursts somehow. Maybe you can get an insertion of, I don’t know, 10 A’s, like, in a row, or you can lose, like, you know, 10 bases in a row. So if you don’t, kind of, quantify, like, what are the likelihoods of all these bad events happening, then I think this still, kind of, fits at least the majority of approaches to DNA data storage, maybe not exactly all of them, but it fits the majority. So when we design coding schemes, we are trying also, kind of, to look ahead in the sense that, like, we don’t know what these error profiles will look like in five years.
So the technologies that we develop on the error-correction side, we try to keep them very flexible, so whether it’s enzymatic synthesis, whether it’s Nanopore technology, whether it’s Illumina technology that is being used, the error-correction algorithms would be able to adapt and would still be useful. But, I mean, this makes also coding aspect harder because, [LAUGHTER] kind of, you want to keep all this flexibility in mind.

STRAUSS: So, Sergey, we are at an interesting moment now because you’re open sourcing the Trellis BMA piece of code, right, that you published a few years ago. Can you talk a little bit about that specific problem of trace reconstruction and then the paper specifically and how it solves it?

YEKHANIN: Absolutely, yeah, so this Trellis BMA paper, for which we are releasing the source code right now, this is, kind of, the latest in our sequence of publications on error correction for DNA data storage. And I should say that, like, we already discussed that the project is, kind of, very interdisciplinary. So, like, we have experts from all kinds of fields. But really even within, like, within this coding theory, like, within computer science/information theory, coding theory, in our algorithms, we use ideas from very different branches. I mean, there are some core ideas from, like, core algorithm space, and I won’t go into these, but let me just focus, kind of, on two aspects. So when we just faced this problem of coding for DNA data storage and we were thinking about, OK, so how to exactly design the coding scheme and what are the algorithms that we’ll be using for error correction, so, I mean, we’re always studying the literature, and we came upon this problem called trace reconstruction that was pretty popular—I mean, somewhat popular, I would say—in computer science and in statistics. It didn’t have much motivation, but very strong mathematicians had been looking at it. And the problem is as follows. So, like, there is a long binary string picked at random, and then it’s transmitted over a deletion channel, so some bits—some zeros and some ones—at certain coordinates get deleted and you get to see, kind of, the shortened version of the string. But you get to see it multiple times. And the question is, like, how many times do you need to see it so that you can get a reasonably accurate estimate of the original string that was transmitted? So that was called trace reconstruction, and we took a lot of motivation—we took a lot of inspiration—from the problem, I would say, because really, in DNA data storage, if we think about a single strand, like, a single strand which is being stored, after we read it, we usually get multiple reads of this strand.
And, well, the errors there are not just deletions. There are insertions, substitutions, and other errors, but still we could rely on this literature in computer science that already had some ideas. So there was an algorithm called BMA, Bitwise Majority Alignment. We extended it—we adapted it, kind of, for the needs of DNA data storage—and it became, kind of, one of the tools in our toolbox for error correction.
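For illustration, a much-simplified Bitwise Majority Alignment for the deletion-only setting Sergey describes might look like the sketch below. This is only a toy version of the classic BMA idea; the Trellis BMA algorithm the team open sourced also handles insertions and substitutions and combines BMA with Trellis-based decoding:

```python
# Simplified Bitwise Majority Alignment (BMA) sketch for the
# deletion-only trace reconstruction problem: each trace is the original
# string with some symbols deleted. A pointer walks along each trace; at
# every position we take a majority vote over the symbols under the
# pointers, and only the agreeing pointers advance (a disagreeing trace
# is assumed to have deleted the current symbol).
from collections import Counter

def bma(traces, original_length):
    pointers = [0] * len(traces)
    estimate = []
    for _ in range(original_length):
        # Symbols currently under each still-active trace's pointer.
        current = [t[p] for t, p in zip(traces, pointers) if p < len(t)]
        if not current:
            break
        symbol = Counter(current).most_common(1)[0][0]
        estimate.append(symbol)
        # Advance only the traces that agree with the majority.
        pointers = [
            p + 1 if p < len(t) and t[p] == symbol else p
            for t, p in zip(traces, pointers)
        ]
    return "".join(estimate)

# Four traces of "ACGTACGT", each with one base deleted:
traces = ["CGTACGT", "ACGACGT", "ACGTACG", "ACGTCGT"]
assert bma(traces, 8) == "ACGTACGT"
```

With enough traces, the majority at each step is dominated by reads that did not delete that position, which is why a handful of noisy reads per strand can suffice.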

So we also started to use ideas from literature on electrical engineering, what are called convolutional error-correcting codes and a certain, kind of, class of algorithms for decoding errors in these convolutional error-correcting codes called, like, I mean, Trellis is the main data structure, like, Trellis-based algorithms for decoding convolutional codes, like, Viterbi algorithm or BCJR algorithm. Convolutional codes allow you to introduce redundancy on the string. So, like, with algorithms kind of similar to BMA, like, they were good for doing error correction when there was no redundancy on the strand itself. Like, when there is redundancy on the strand, kind of, we could do some things, but really it was very limited. With Trellis-based approaches, like, again inspired by the literature in electrical engineering, we had an approach to introduce redundancy on the strand, so that allowed us to have more powerful error-correction algorithms. And then in the end, we have this algorithm, which we call Trellis BMA, which, kind of, combines ideas from both fields. So it’s based on Trellis, but it’s also more efficient than standard Trellis-based algorithms because it uses ideas from BMA from computer science literature. So this is, kind of, this is a mix of these two approaches. And, yeah, that’s the paper that we wrote about three years ago. And now we’re open sourcing it. So it is the most powerful algorithm for DNA error correction that we developed in the group. We’re really happy that now we are making it publicly available so that anybody can experiment with the source code. Because, again, the field has expanded a lot, and now there are multiple groups around the globe that work just specifically on error correction apart from all other aspects, so, yeah, so we are really happy that it’s become publicly available to hopefully further advance the field.

STRAUSS: Yeah, absolutely, and I’m always amazed by, you know, how, it is really about building on other people’s work. Jake and Bichlien, you recently published a paper in Nature Communications. Can you tell us a little bit about what it was, what you exposed the DNA to, and what it was specifically about?

NGUYEN: Yeah. So that paper was on the effects of neutron radiation on DNA data storage. So, you know, when we started the DNA Data Storage Project, it was really a comparison, right, between the different storage medias that exist today. And one of the issues that have come up through the years of development of those technologies was, you know, hard errors and soft errors that were induced by radiation. So we wanted to know, does that maybe happen in DNA? We know that DNA, in humans at least, is affected by radiation from cosmic rays. And so that was really the motivation for this type of experiment. So what we did was we essentially took our DNA files and dried them and threw them in a neutron accelerator, which was fantastic. It was so exciting. That’s, kind of, the merge of, you know, sci fi with sci fi at the same time. [LAUGHS] It was fantastic. And we irradiated for over 80 million years—

STRAUSS: The equivalent of …

NGUYEN: The equivalent of 80 million years.

STRAUSS: Yes, because it’s a lot of radiation all at the same time, …

NGUYEN: It’s a lot of radiation …

STRAUSS: … and it’s accelerated radiation exposure?

NGUYEN: Yeah, I would say it’s accelerated aging with radiation. It’s an insane amount of radiation. And it was surprising that even though we irradiated our DNA files with that much radiation, there wasn’t that much damage. And that’s surprising because, you know, we know that humans, if we were to be irradiated like that, it would be disastrous. But in, you know, DNA, our files were able to be recovered with zero bit errors.

STRAUSS: And why that difference?

NGUYEN: Well, we think there’s a few reasons. One is that when you look at the interaction between a neutron and the actual elemental composition of DNA—which is basically carbons, oxygens, and hydrogens, maybe a phosphorus—the neutrons don’t interact with the DNA much. And if it did interact, we would have, for example, a strand break, which based on the error-correcting codes, we can recover from. So essentially, there’s not much … one, there’s not much interaction between neutrons and DNA, and second, we have error-correcting codes that would prevent any data loss.

STRAUSS: Awesome, so yeah, this is another milestone that contributes towards the technology becoming a reality. There are also other conditions that are needed for a technology to be brought to the market. And one thing I’ve worked on is to, you know, create the DNA Data Storage Alliance; this is something Microsoft co-founded with Illumina, Twist Bioscience, and Western Digital. And the goal there was to essentially provide the right conditions for the technology to thrive commercially. We did bring together multiple universities and companies that were interested in the technology. And one thing that we’ve seen with storage technologies that’s been pretty important is standardization and making sure that the technology’s interoperable. And, you know, we’ve seen stalemate situations like Blu-ray versus HD DVD, where, you know, really the industry couldn’t decide on a standard, and it took a while for the technology to be picked up. The intent of the DNA Data Storage [Alliance] is to provide an ecosystem of companies, universities, and groups interested in making sure that this time, it’s an interoperable technology from the get-go, and that increases the chances of commercial adoption. As a group, we often talk about how amazing it is to work for a company that empowers us to do this kind of research. And for me, one of Microsoft Research’s unique strengths, particularly in this project, is the opportunity to work with such a diverse set of collaborators on such a multidisciplinary project as we have. How do you all think where you’ve done this work has impacted how you’ve gone about it and the contributions you’ve been able to make?

NGUYEN: I’m going to start with if we look around this table and we see who’s sitting at it, which is two chemists, a computer architect, and a coding theorist, and we come together and we’re like, what can we make that would be super, super impactful? I think that’s the answer right there, is that being at Microsoft and being in a culture that really fosters this type of interdisciplinary collaboration is the key to getting a project like this off the ground.

SMITH: Yeah, absolutely. And we should acknowledge the gigantic contributions made by our collaborators at the University of Washington. Many of them don’t fall into any of these three categories. They’re electrical engineers, they’re mechanical engineers, they’re pure biologists that we worked with. And each of them brought their own perspective, and particularly when you talk about going to a true end-to-end system, those perspectives were invaluable as we were trying to fit all the puzzle pieces together.

STRAUSS: Yeah, absolutely. We’ve had great collaborations over time—University of Washington, ETH Zürich, Los Alamos National Lab, ChipIr, Twist Bioscience, Ansa Biotechnologies. Yeah, it’s been really great, and a great set of different disciplines, all the way from coding theory to molecular biology and chemistry to electrical and mechanical engineering. One of the great things about research is there’s never a shortage of interesting questions to pursue, and for us, this particular work has opened the door to research in adjacent domains, including sustainability fields. DNA data storage requires small amounts of materials to accommodate large amounts of data, and early on, we wanted to understand if DNA data storage was, as it seemed, a more sustainable way to store information. And we learned a lot. Bichlien and Jake, you had experience in green chemistry when you came to Microsoft. What new findings did we make, and what sustainability benefits do we get with DNA data storage? And, finally, what new sustainability work has the project led to?

NGUYEN: As a part of this project, if we’re going to bring new technologies to the forefront, you know, to the world, we should make sure that they have a lower carbon footprint, for example, than previous technologies. And so we ran a life cycle assessment—which is a way to systematically evaluate the environmental impacts of anything of interest—and we did this on DNA data storage and compared it to electronic storage media[1], and we noticed that if we were able to store all of our digital information in DNA, that we would have benefits associated with carbon emissions. We would be able to reduce that because we don’t need as much infrastructure compared to the traditional storage methods. And there would be an energy reduction, as well, because this is a passive way of archival data storage. So those were, you know, the main takeaways that we had. But that also, kind of, led us to think about other technologies that would be beneficial beyond data storage and how we could use the same kind of life cycle thinking towards that.

SMITH: This design approach that you’ve, you know, talked about us stumbling on, not inventing but seeing other people doing in the literature and trying to implement ourselves on the DNA Data Storage Project, you know, is something that can be much bigger than any single material. And where we think there’s a, you know, chance for folks like ourselves at Microsoft Research to make a real impact on this sustainability-focused design is through the application of machine learning, artificial intelligence—the new tools that will allow us to look at much bigger design spaces than we could previously to evaluate sustainability metrics that were not possible when everything was done manually and to ultimately, you know, at the end of the day, take a sustainability-first look at what a material should be composed of. And so we’ve tried to prototype this with a few projects. We had another wonderful collaboration with the University of Washington where we looked at recyclable circuit boards and a novel material called a vitrimer that they could possibly be made out of[2]. We’ve had another great collaboration with the University of Michigan, where we’ve looked at the design of charge-carrying molecules in these things called flow batteries that have good potential for energy smoothing in, you know, renewables production, trying to get us out of that day-night, boom-bust cycle[3]. And we had one more project, you know, this time with collaborators at the University of Berkeley, where we looked at, you know, design of a class of materials called a metal organic framework, which have great promise in low-energy-cost gas separation, such as pulling CO2 out of the, you know, plume of a smokestack or, you know, ideally out of the air itself[4].

STRAUSS: For me, the DNA work has made me much more open to projects outside my own research area—as Bichlien mentioned, my core research area is computer architecture, but we’ve ventured in quite a bit of other areas here—and going way beyond my own comfort zone and really made me love interdisciplinary projects like this and try, really try, to do the most important work I can. And this is what attracted me to these other areas of environmental sustainability that Bichlien and Jake covered, where there’s absolutely no lack of problems. Like them, I’m super interested in using AI to solve many of them. So how do each of you think working on the DNA Data Storage Project has influenced your research approach more generally and how you think about research questions to pursue next?

YEKHANIN: It definitely expanded the horizons a lot, like, just, kind of, having these interactions with people whose core areas of research are so different from my own, and also a lot of learning even within my own field that we had to do to, kind of, carry this project out. So, I mean, it was a great and rewarding experience.

NGUYEN: Yeah, for me, it’s kind of the opposite of Karin, right. I started as an organic chemist and then now really, one, appreciate the breadth and depth of going from a concept to a real end-to-end prototype and all the requirements that you need to get there. And then also, really the importance of having, you know, a background in computer science and really being able to understand the lingo that is used in multidisciplinary projects because you might say something and someone else interprets it very differently, and it’s because you’re not speaking the same language. And so that understanding that you have to really be … you have to learn a little bit of vocabulary from each person and understand how they contribute and then how your ideas can contribute to their ideas has been really impactful in my career here.

SMITH: Yeah, I think the key change in approach that I took away—and I think many of us took away from the DNA Data Storage Project—was rather than starting with an academic question, we started with a vision of what we wanted to happen, and then we derived the research questions from analyzing what would need to happen in the world—what are the bottlenecks that need to be solved in order for us to achieve, you know, that goal? And this is something that we’ve taken with us into the sustainability-focused research and, you know, something that I think will affect all the research I do going forward.

STRAUSS: Awesome. As we close, let’s reflect a bit on what a world in which DNA data storage is widely used might look like. If everything goes as planned, what do you hope the lasting impact of this work will be? Sergey, why don’t you lead us off.

YEKHANIN: Sure, I remember that, like, when … in the early days when I started working on this project actually, you, Karin, told me that you were taking an Uber ride somewhere and you were talking to the taxi driver, and the taxi driver—I don’t know if you remember that—but the taxi driver mentioned that he has a camera which is recording everything that’s happening in the car. And then you had a discussion with him about, like, how long does he keep the data, how long does he keep the videos. And he told you that he keeps it for about a couple of days because it’s too expensive. But otherwise, like, if it weren’t that expensive, he would keep it for much, much longer because, like, he wants to have these recordings if later somebody is upset about the ride and, I don’t know, he is getting sued or something. So this is, like, this is one small, narrow application area that DNA data storage, kind of, if it happens, will clearly solve. Because then, kind of, this long-term archival storage will become very cheap, available to everybody; it would become a commodity, basically. There are many things that will be enabled, like this helping the Uber drivers, for instance. But also one has to think, of course, like, about, kind of, the broader implications so that we don’t get into something negative, because again, this power of recording everything and storing everything can also lead to some use cases that might be, kind of, morally wrong. So, again, hopefully by the time that we get to, like, really wide deployments of this technology, the regulation will also be catching up and, like, we will have great use cases and we won’t have bad ones. I mean, that’s how I think of it. But definitely there are lots of, kind of, great scenarios that this can enable.

SMITH: Yeah. I’ll grab onto the word you use there, which is making DNA a commodity. And one of the things that I hope comes out of this project, you know, besides all the great benefits of DNA data storage itself is spillover benefits into the field of health—where if we make DNA synthesis at large scale truly a commodity thing, which I hope some of the work that we’ve done to really accelerate the throughput of synthesis will do—then this will open new doors in what we can do in terms of gene synthesis, in terms of, like, fundamental biotech research that will lead to that next set of drugs and, you know, give us medications or treatments that we could not have thought possible if we were not able to synthesize DNA and related molecules at that scale.

NGUYEN: So much information gets lost because of just time. And so I think being able, in the future, to recover really ancient history that humans wrote is something that I really hope could be achieved, because we’re so information rich, but in the course of time, we become information poor. And so I would like for our future generations to be able to understand the life of, you know, an everyday 21st-century person.

STRAUSS: Well, Bichlien, Jake, Sergey, it’s been fun having this conversation with you today and collaborating with you on this amazing project [MUSIC] and all the research we’ve done together. Thank you so much.

YEKHANIN: Thank you, Karin.

SMITH: Thank you.

NGUYEN: Thanks.

[MUSIC FADES]


[1] The team presented the findings from their life cycle assessment of DNA data storage in the paper Architecting Datacenters for Sustainability: Greener Data Storage using Synthetic DNA.

[2] For more information, check out the podcast episode Collaborators: Sustainable electronics with Jake Smith and Aniruddh Vashisth and the paper Recyclable vitrimer-based printed circuit boards for sustainable electronics.

[3] For more information, check out the podcast episode Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi.

[4] For more information, check out the paper MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design.

The post Ideas: The journey to DNA data storage appeared first on Microsoft Research.


NVIDIA and Microsoft Showcase Blackwell Preview, Omniverse Industrial AI and RTX AI PCs at Microsoft Ignite


NVIDIA and Microsoft today unveiled product integrations designed to advance full-stack NVIDIA AI development on Microsoft platforms and applications.

At Microsoft Ignite, Microsoft announced the launch of the first cloud private preview of the Azure ND GB200 V6 VM series, based on the NVIDIA Blackwell platform. The Azure ND GB200 V6 will be a new AI-optimized virtual machine (VM) series that combines the NVIDIA GB200 NVL72 rack design with NVIDIA Quantum InfiniBand networking.

In addition, Microsoft revealed that Azure Container Apps now supports NVIDIA GPUs, enabling simplified and scalable AI deployment. Plus, the NVIDIA AI platform on Azure includes new reference workflows for industrial AI and an NVIDIA Omniverse Blueprint for creating immersive, AI-powered visuals.

At Ignite, NVIDIA also announced multimodal small language models (SLMs) for RTX AI PCs and workstations, enhancing digital human interactions and virtual assistants with greater realism.

NVIDIA Blackwell Powers Next-Gen AI on Microsoft Azure

Microsoft’s new Azure ND GB200 V6 VM series will harness the powerful performance of NVIDIA GB200 Grace Blackwell Superchips, coupled with advanced NVIDIA Quantum InfiniBand networking. This offering is optimized for large-scale deep learning workloads to accelerate breakthroughs in natural language processing, computer vision and more.

The Blackwell-based VM series complements previously announced Azure AI clusters with ND H200 V5 VMs, which provide increased high-bandwidth memory for improved AI inferencing. The ND H200 V5 VMs are already being used by OpenAI to enhance ChatGPT.

Azure Container Apps Enables Serverless AI Inference With NVIDIA Accelerated Computing

Serverless computing provides AI application developers increased agility to rapidly deploy, scale and iterate on applications without worrying about underlying infrastructure. This enables them to focus on optimizing models and improving functionality while minimizing operational overhead.

The Azure Container Apps serverless containers platform simplifies deploying and managing microservices-based applications by abstracting away the underlying infrastructure.

Azure Container Apps now supports NVIDIA-accelerated workloads with serverless GPUs, allowing developers to use the power of accelerated computing for real-time AI inference applications in a flexible, consumption-based, serverless environment. This capability simplifies AI deployments at scale while improving resource efficiency and application performance without the burden of infrastructure management.

Serverless GPUs allow development teams to focus more on innovation and less on infrastructure management. With per-second billing and scale-to-zero capabilities, customers pay only for the compute they use, helping ensure resource utilization is both economical and efficient. NVIDIA is also working with Microsoft to bring NVIDIA NIM microservices to serverless NVIDIA GPUs in Azure to optimize AI model performance.

NVIDIA Unveils Omniverse Reference Workflows for Advanced 3D Applications

NVIDIA announced reference workflows that help developers build 3D simulation and digital twin applications on NVIDIA Omniverse and Universal Scene Description (OpenUSD) — accelerating industrial AI and advancing AI-driven creativity.

A reference workflow for 3D remote monitoring of industrial operations is coming soon to enable developers to connect physically accurate 3D models of industrial systems to real-time data from Azure IoT Operations and Power BI.

These two Microsoft services integrate with applications built on NVIDIA Omniverse and OpenUSD to provide solutions for industrial IoT use cases. This helps remote operations teams accelerate decision-making and optimize processes in production facilities.

The Omniverse Blueprint for precise visual generative AI enables developers to create applications that let nontechnical teams generate AI-enhanced visuals while preserving brand assets. The blueprint supports models like SDXL and Shutterstock Generative 3D to streamline the creation of on-brand, AI-generated images.

Leading creative groups, including Accenture Song, Collective, GRIP, Monks and WPP, have adopted this NVIDIA Omniverse Blueprint to personalize and customize imagery across markets.

Accelerating Gen AI for Windows With RTX AI PCs 

NVIDIA’s collaboration with Microsoft extends to bringing AI capabilities to personal computing devices.

At Ignite, NVIDIA announced its new multimodal SLM, NVIDIA Nemovision-4B Instruct, for understanding visual imagery in the real world and on screen. It’s coming soon to RTX AI PCs and workstations and will pave the way for more sophisticated and lifelike digital human interactions.

Plus, updates to NVIDIA TensorRT Model Optimizer (ModelOpt) offer Windows developers a path to optimize a model for ONNX Runtime deployment. TensorRT ModelOpt enables developers to create AI models for PCs that are faster and more accurate when accelerated by RTX GPUs. This enables large models to fit within the constraints of PC environments, while making it easy for developers to deploy across the PC ecosystem with ONNX Runtime.

RTX AI-enabled PCs and workstations offer enhanced productivity tools, creative applications and immersive experiences powered by local AI processing.

Full-Stack Collaboration for AI Development

NVIDIA’s extensive ecosystem of partners and developers brings a wealth of AI and high-performance computing options to the Azure platform.

SoftServe, a global IT consulting and digital services provider, today announced the availability of SoftServe Gen AI Industrial Assistant, based on the NVIDIA AI Blueprint for multimodal PDF data extraction, on the Azure marketplace. The assistant addresses critical challenges in manufacturing by using AI to enhance equipment maintenance and improve worker productivity.

At Ignite, AT&T will showcase how it’s using NVIDIA AI and Azure to enhance operational efficiency, boost employee productivity and drive business growth through retrieval-augmented generation and autonomous assistants and agents.

Learn more about NVIDIA and Microsoft’s collaboration and sessions at Ignite.

See notice regarding software product information.


Microsoft and NVIDIA Supercharge AI Development on RTX AI PCs


Generative AI-powered laptops and PCs are unlocking advancements in gaming, content creation, productivity and development. Today, over 600 Windows apps and games are already running AI locally on more than 100 million GeForce RTX AI PCs worldwide, delivering fast, reliable and low-latency performance.

At Microsoft Ignite, NVIDIA and Microsoft announced tools to help Windows developers quickly build and optimize AI-powered apps on RTX AI PCs, making local AI more accessible. These new tools enable application and game developers to harness powerful RTX GPUs to accelerate complex AI workflows for applications such as AI agents, app assistants and digital humans.

RTX AI PCs Power Digital Humans With Multimodal Small Language Models

Meet James, an interactive digital human knowledgeable about NVIDIA and its products. James uses a collection of NVIDIA NIM microservices, NVIDIA ACE and ElevenLabs digital human technologies to provide natural and immersive responses.

NVIDIA ACE is a suite of digital human technologies that brings life to agents, assistants and avatars. To achieve a higher level of understanding so that they can respond with greater context-awareness, digital humans must be able to visually perceive the world like humans do.

Enhancing digital human interactions with greater realism demands technology that enables perception and understanding of their surroundings with greater nuance. To achieve this, NVIDIA developed multimodal small language models that can process both text and imagery, excel in role-playing and are optimized for rapid response times.

The NVIDIA Nemovision-4B-Instruct model, soon to be available, uses the latest NVIDIA VILA and NVIDIA NeMo framework for distilling, pruning and quantizing to become small enough to perform on RTX GPUs with the accuracy developers need.

The model enables digital humans to understand visual imagery in the real world and on the screen to deliver relevant responses. Multimodality serves as the foundation for agentic workflows and offers a sneak peek into a future where digital humans can reason and take action with minimal assistance from a user.

NVIDIA is also introducing the Mistral NeMo Minitron 128k Instruct family, a suite of large-context small language models designed for optimized, efficient digital human interactions, coming soon. Available in 8B-, 4B- and 2B-parameter versions, these models offer flexible options for balancing speed, memory usage and accuracy on RTX AI PCs. They can handle large datasets in a single pass, eliminating the need for data segmentation and reassembly. Built in the GGUF format, these models enhance efficiency on low-power devices and support compatibility with multiple programming languages.

Turbocharge Gen AI With NVIDIA TensorRT Model Optimizer for Windows 

When bringing models to PC environments, developers face the challenge of limited memory and compute resources for running AI locally. And they want to make models available to as many people as possible, with minimal accuracy loss.

Today, NVIDIA announced updates to NVIDIA TensorRT Model Optimizer (ModelOpt) to offer Windows developers an improved way to optimize models for ONNX Runtime deployment.

With the latest updates, TensorRT ModelOpt enables models to be optimized into an ONNX checkpoint for deploying the model within ONNX runtime environments — using GPU execution providers such as CUDA, TensorRT and DirectML.

TensorRT ModelOpt includes advanced quantization algorithms, such as INT4 activation-aware weight quantization (AWQ). Compared to other tools such as Olive, the new method reduces the memory footprint of the model and improves throughput performance on RTX GPUs.

During deployment, the models can have up to 2.6x reduced memory footprint compared to FP16 models. This results in faster throughput, with minimal accuracy degradation, allowing them to run on a wider range of PCs.

Learn more about how developers on Microsoft systems, from Windows RTX AI PCs to NVIDIA Blackwell-powered Azure servers, are transforming how users interact with AI on a daily basis.


Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications

This paper was accepted at the Machine Learning and Compression Workshop at NeurIPS 2024.
Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks. In this work, we dive into how compression damages LLMs’ inherent knowledge and the possible remedies. We start by proposing two conjectures on the nature of the damage: one is certain knowledge being forgotten (or erased) after LLM compression, hence necessitating the compressed model to (re)learn from data with additional parameters; the other presumes that knowledge is internally… (Apple Machine Learning Research)

Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length (concat-and-chunk). Recent attention implementations mask cross-document attention, reducing the effective length of a chunk of tokens. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length… (Apple Machine Learning Research)

Towards Low-Bit Communication for Tensor Parallel LLM Inference

This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) Workshop at NeurIPS 2024.
Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem is with quantization, but current methods for LLMs tend to avoid quantizing the features that tensor parallelism needs to communicate. Taking advantage… (Apple Machine Learning Research)

Build cost-effective RAG applications with Binary Embeddings in Amazon Titan Text Embeddings V2, Amazon OpenSearch Serverless, and Amazon Bedrock Knowledge Bases


Today, we are happy to announce the availability of Binary Embeddings for Amazon Titan Text Embeddings V2 in Amazon Bedrock Knowledge Bases and Amazon OpenSearch Serverless. With support for binary embeddings in Amazon Bedrock and a binary vector store in OpenSearch Serverless, you can use binary embeddings and a binary vector store to build Retrieval Augmented Generation (RAG) applications in Amazon Bedrock Knowledge Bases, reducing memory usage and overall costs.

Amazon Bedrock is a fully managed service that provides a single API to access and use various high-performing foundation models (FMs) from leading AI companies. Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock Knowledge Bases, FMs and agents can retrieve contextual information from your company’s private data sources for RAG. RAG helps FMs deliver more relevant, accurate, and customized responses.

Amazon Titan Text Embeddings models generate meaningful semantic representations of documents, paragraphs, and sentences. Amazon Titan Text Embeddings takes a body of text as input and generates a 1,024- (default), 512-, or 256-dimensional vector. Amazon Titan Text Embeddings are offered through latency-optimized endpoint invocation for faster search (recommended during the retrieval step) and throughput-optimized batch jobs for faster indexing. With Binary Embeddings, Amazon Titan Text Embeddings V2 represents data as binary vectors, with each dimension encoded as a single binary digit (0 or 1). This binary representation converts high-dimensional data into a more efficient format for storage and computation.
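To build intuition for what "one bit per dimension" means: a common binarization scheme thresholds each float dimension by sign and then packs 8 bits per byte. This sketch is illustrative only (the service computes and returns the binary embedding itself, so you don't implement this; Titan's internal scheme may differ):

```python
import numpy as np

def binarize(embedding: np.ndarray) -> np.ndarray:
    """Threshold each dimension to one bit: 1 if positive, else 0."""
    return (embedding > 0).astype(np.uint8)

# Stand-in for a 1024-dimensional float embedding from a model.
float_embedding = np.random.randn(1024).astype(np.float32)

bits = binarize(float_embedding)   # array of 0s and 1s, one per dimension
packed = np.packbits(bits)         # 8 bits per byte: 1024 bits -> 128 bytes

print(float_embedding.nbytes, "bytes as FP32,", packed.nbytes, "bytes packed binary")
```

The packed form is what makes binary vectors cheap to store and fast to compare.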

Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service, a fully managed service that makes it simple to perform interactive log analytics, real-time application monitoring, website search, and vector search with its k-nearest neighbor (kNN) plugin. It supports exact and approximate nearest-neighbor algorithms and multiple storage and matching engines. It makes it simple for you to build modern machine learning (ML) augmented search experiences, generative AI applications, and analytics workloads without having to manage the underlying infrastructure.

The OpenSearch Serverless kNN plugin now supports 16-bit (FP16) and binary vectors, in addition to 32-bit floating point vectors (FP32). You can store the binary embeddings generated by Amazon Titan Text Embeddings V2 for lower costs by setting the kNN vector field type to binary. The vectors can be stored and searched in OpenSearch Serverless using PUT and GET APIs.
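For readers curious what a binary kNN field looks like, the index body below sketches the shape of such a mapping. Field names are hypothetical, and the exact options (`data_type`, `space_type`, engine support, dimension constraints) should be verified against the OpenSearch k-NN documentation for your collection's version:

```python
# Illustrative index body for a binary kNN vector field searched by
# Hamming distance. Verify parameter names/values against the OpenSearch
# k-NN docs before use; this is a sketch, not a tested configuration.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "chunk_text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,      # in bits; 128 bytes per packed vector
                "data_type": "binary",
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "hamming",
                },
            },
        }
    },
}

print(index_body["mappings"]["properties"]["embedding"]["data_type"])  # binary
```

In the managed Amazon Bedrock Knowledge Bases flow described later in this post, this index is created for you.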

This post summarizes the benefits of this new binary vector support across Amazon Titan Text Embeddings, Amazon Bedrock Knowledge Bases, and OpenSearch Serverless, and gives you information on how you can get started. The following diagram is a rough architecture diagram with Amazon Bedrock Knowledge Bases and Amazon OpenSearch Serverless.

You can lower latency and reduce storage costs and memory requirements in OpenSearch Serverless and Amazon Bedrock Knowledge Bases with minimal reduction in retrieval quality.

We ran the Massive Text Embedding Benchmark (MTEB) retrieval data set with binary embeddings. On this data set, we reduced storage while observing a 25-times improvement in latency. Binary embeddings maintained 98.5% of the retrieval accuracy with re-ranking, and 97% without re-ranking, compared to the results we got using full-precision (float32) embeddings. In end-to-end RAG benchmark comparisons with full-precision embeddings, Binary Embeddings with Amazon Titan Text Embeddings V2 retained 99.1% of the full-precision answer correctness (98.6% without re-ranking). We encourage customers to run their own benchmarks using Amazon OpenSearch Serverless and Binary Embeddings for Amazon Titan Text Embeddings V2.

OpenSearch Serverless benchmarks using the Hierarchical Navigable Small Worlds (HNSW) algorithm with binary vectors have shown a 50% reduction in search OpenSearch Compute Units (OCUs), translating to cost savings for users. The use of binary indexes has resulted in significantly faster retrieval times. Traditional search methods often rely on computationally intensive calculations such as L2 and cosine distances. In contrast, binary indexes in Amazon OpenSearch Serverless operate on Hamming distances, a more efficient approach that accelerates search queries.
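The Hamming-versus-cosine point can be made concrete: on packed binary vectors, distance is just XOR followed by a bit count, with no multiplies or square roots. A minimal sketch (the vector sizes and brute-force ranking are illustrative; OpenSearch performs this inside its HNSW index):

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two bit-packed uint8 vectors: XOR + popcount."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
query = np.packbits(rng.integers(0, 2, 1024, dtype=np.uint8))  # packed 1024-bit query
docs = [np.packbits(rng.integers(0, 2, 1024, dtype=np.uint8)) for _ in range(5)]

# Rank documents by Hamming distance to the query (smaller = more similar).
ranked = sorted(range(len(docs)), key=lambda i: hamming_distance(query, docs[i]))
print(ranked)
```

Because each comparison touches only 128 bytes and uses bitwise operations, this is far cheaper per candidate than a 1024-dimension float dot product.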

In the following sections, we’ll discuss the how-to for Binary Embeddings with Amazon Titan Text Embeddings, binary vectors (and FP16) for the vector engine, and the binary embedding option for Amazon Bedrock Knowledge Bases. To learn more about Amazon Bedrock Knowledge Bases, visit Knowledge Bases now delivers fully managed RAG experience in Amazon Bedrock.

Generate Binary Embeddings with Amazon Titan Text Embeddings V2

Amazon Titan Text Embeddings V2 now supports Binary Embeddings and is optimized for retrieval performance and accuracy across different dimension sizes (1024, 512, 256), with text support for more than 100 languages. By default, Amazon Titan Text Embeddings models produce embeddings at 32-bit floating point (FP32) precision. Although using a 1024-dimension vector of FP32 embeddings helps achieve better accuracy, it also leads to large storage requirements and related costs in retrieval use cases.
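The storage gap is easy to quantify: a 1024-dimension FP32 vector occupies 4 KB, while the same vector stored as one bit per dimension occupies 128 bytes, a 32-times reduction in raw vector size (index structures and metadata will change real-world totals; the corpus size below is an arbitrary example):

```python
dims = 1024
fp32_bytes = dims * 4      # 4 bytes per 32-bit float dimension
binary_bytes = dims // 8   # 1 bit per dimension, 8 dimensions per byte

print(fp32_bytes, "bytes vs", binary_bytes, "bytes per vector")  # 4096 bytes vs 128 bytes
print(fp32_bytes // binary_bytes, "x smaller")                   # 32 x smaller

# At corpus scale the gap dominates: e.g., 10 million document chunks.
n_chunks = 10_000_000
print(n_chunks * fp32_bytes / 1e9, "GB of FP32 vectors")     # ~41 GB
print(n_chunks * binary_bytes / 1e9, "GB of binary vectors") # ~1.3 GB
```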

To generate binary embeddings in code, add the right embeddingTypes parameter in your invoke_model API request to Amazon Titan Text Embeddings V2:

import json
import boto3
import numpy as np
rt_client = boto3.client("bedrock-runtime")

response = rt_client.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps(
        {
            "inputText": "What is Amazon Bedrock?",
            "embeddingTypes": ["binary", "float"],
        }
    ),
)["body"].read()

embedding = np.array(json.loads(response)["embeddingsByType"]["binary"], dtype=np.int8)

As in the request above, we can ask for the binary embedding alone or for both binary and float embeddings. The resulting embedding is a 1024-length binary vector similar to:

array([0, 1, 1, ..., 0, 0, 0], dtype=int8)

For more information and sample code, refer to Amazon Titan Embeddings Text.

Configure Amazon Bedrock Knowledge Bases with Binary Vector Embeddings

You can use Amazon Bedrock Knowledge Bases to take advantage of Binary Embeddings with Amazon Titan Text Embeddings V2, and of binary vectors and 16-bit floating point (FP16) support in the Amazon OpenSearch Serverless vector engine, without writing a single line of code. Follow these steps:

  1. On the Amazon Bedrock console, create a knowledge base. Provide the knowledge base details, including name and description, and create a new or use an existing service role with the relevant AWS Identity and Access Management (IAM) permissions. For information on creating service roles, refer to Service roles. Under Choose data source, choose Amazon S3, as shown in the following screenshot. Choose Next.
  2. Configure the data source. Enter a name and description. Define the source S3 URI. Under Chunking and parsing configurations, choose Default. Choose Next to continue.
  3. Complete the knowledge base setup by selecting an embeddings model. For this walkthrough, select Titan Text Embeddings V2. Under Embeddings type, choose Binary vector embeddings. Under Vector dimensions, choose 1024. Choose Quick create a new vector store. This option configures a new Amazon OpenSearch Serverless store that supports the binary data type.

You can check the knowledge base details after creation to monitor the data source sync status. After the sync is complete, you can test the knowledge base and check the FM’s responses.
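Besides testing in the console, you can query the knowledge base programmatically through the Bedrock Agent Runtime Retrieve API. The following is a sketch (the knowledge base ID is a hypothetical placeholder) that builds the request parameters; the actual call requires AWS credentials:

```python
import json

def build_retrieve_request(kb_id: str, query: str, top_k: int = 5) -> dict:
    """Build keyword arguments for the Bedrock Agent Runtime Retrieve API."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }

# "KB1234567890" is a hypothetical knowledge base ID
params = build_retrieve_request("KB1234567890", "What is Amazon Bedrock?")
print(json.dumps(params, indent=2))

# With credentials configured, the call would look like:
# client = boto3.client("bedrock-agent-runtime")
# results = client.retrieve(**params)
```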

Conclusion

As we’ve explored throughout this post, Binary Embeddings are an option in the Amazon Titan Text Embeddings V2 model available in Amazon Bedrock, complemented by the binary vector store in OpenSearch Serverless. These features significantly reduce memory and disk needs in Amazon Bedrock and OpenSearch Serverless, resulting in fewer OCUs for the RAG solution. You’ll also see better performance and lower latency, though with some impact on the accuracy of the results compared to using the full float data type (FP32). Although the drop in accuracy is minimal, you have to decide whether it suits your application. The specific benefits will vary based on factors such as the volume of data, search traffic, and storage requirements, but the examples discussed in this post illustrate the potential value.

Binary Embeddings support in Amazon OpenSearch Serverless, Amazon Bedrock Knowledge Bases, and Amazon Titan Text Embeddings V2 is available today in all AWS Regions where these services are already available. Check the Region list for details and future updates. To learn more about Amazon Bedrock Knowledge Bases, visit the Amazon Bedrock Knowledge Bases product page. For more information regarding Amazon Titan Text Embeddings, visit Amazon Titan in Amazon Bedrock. For more information on Amazon OpenSearch Serverless, visit the Amazon OpenSearch Serverless product page. For pricing details, review the Amazon Bedrock pricing page.

Give the new feature a try in the Amazon Bedrock console today. Send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS contacts and engage with the generative AI builder community at community.aws.


About the Authors

Shreyas Subramanian is a principal data scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Ron Widha is a Senior Software Development Manager with Amazon Bedrock Knowledge Bases, helping customers easily build scalable RAG applications.

Satish Nandi is a Senior Product Manager with Amazon OpenSearch Service. He is focused on OpenSearch Serverless and has years of experience in networking, security and AI/ML. He holds a bachelor’s degree in computer science and an MBA in entrepreneurship. In his free time, he likes to fly airplanes and hang gliders and ride his motorcycle.

Vamshi Vijay Nakkirtha is a Senior Software Development Manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems.

Read More

Automate cloud security vulnerability assessment and alerting using Amazon Bedrock


Cloud technologies are progressing at a rapid pace. Businesses are adopting new innovations and technologies to create cutting-edge solutions for their customers. However, security is a big risk when adopting the latest technologies. Enterprises often rely on reactive security monitoring and notification techniques, but those techniques might not be sufficient to safeguard your enterprises from vulnerable assets and third-party attacks. You need to establish proper security guardrails in the cloud environment and create a proactive monitoring practice to strengthen your cloud security posture and maintain required compliance standards.

To address this challenge, this post demonstrates a proactive approach for security vulnerability assessment of your accounts and workloads, using Amazon GuardDuty, Amazon Bedrock, and other AWS serverless technologies. This approach aims to identify potential vulnerabilities proactively and provide your users with timely alerts and recommendations, avoiding reactive escalations and other damages. By implementing a proactive security monitoring and alerting system, users can receive personalized notifications in preferred channels like email, SMS, or push notifications. These alerts concisely summarize the identified security issues and provide succinct troubleshooting steps to fix the problem promptly, without the need for escalation.

GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior across your AWS environment. GuardDuty combines machine learning (ML), anomaly detection, and malicious file discovery, using both AWS and industry-leading third-party sources, to help protect AWS accounts, workloads, and data. GuardDuty integrates with Amazon EventBridge, emitting an event for each newly generated vulnerability finding. This solution uses a GuardDuty findings notification through EventBridge to invoke AWS Step Functions, a serverless orchestration engine, which runs a state machine. The Step Functions state machine invokes AWS Lambda functions to get a findings summary and remediation steps through Amazon Bedrock.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

By using generative AI FMs on Amazon Bedrock, users can quickly analyze vast amounts of security data to identify patterns and anomalies that may indicate potential threats or breaches. Furthermore, by recognizing patterns in network traffic, user behavior, or system logs, such FMs can help identify suspicious activities or security vulnerabilities. Generative AI can make predictions about future security threats or attacks by analyzing historical security data and trends. This can help organizations proactively implement security measures to prevent breaches before they occur. This form of automation can help improve efficiency and reduce the response time to security threats.

Solution overview

The solution uses the built-in integration between GuardDuty and EventBridge to raise an event notification for any new vulnerability findings in your AWS accounts or workloads. You can configure the EventBridge rule to filter the findings based on severity so that only high-severity findings are prioritized first. The EventBridge rule invokes a Step Functions workflow. The workflow invokes a Lambda function and passes the GuardDuty findings details. The Lambda function calls Anthropic’s Claude 3 Sonnet model through Amazon Bedrock APIs with the input request. The API returns the finding summarization and mitigation steps. The Step Functions workflow sends findings and remediation notifications to the subscribers or users using Amazon Simple Notification Service (Amazon SNS). In this post, we use email notification, but you can extend the solution to send mobile text or push notifications.

The solution uses the following key services:

  • Amazon Bedrock – The solution integrates with Anthropic’s Claude 3 Sonnet model to provide summarized visibility into security vulnerabilities and troubleshooting steps.
  • Amazon EventBridge – EventBridge is a serverless event bus that helps you receive, filter, transform, route, and deliver events.
  • Amazon GuardDuty – The solution uses the threat detection capabilities of GuardDuty to identify and respond to threats.
  • IAM – With AWS Identity and Access Management (IAM), you can specify who or what can access services and resources in AWS, centrally manage fine-grained permissions, and analyze access to refine permissions across AWS. Follow the principle of least privilege to safeguard your workloads.
  • AWS Lambda – Lambda is a compute service that runs your code in response to events and automatically manages the compute resources, making it the fastest way to turn an idea into a modern, production, serverless application.
  • Amazon SNS – Amazon SNS is a managed service that provides message delivery from publishers to subscribers.
  • AWS Step Functions – Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and ML pipelines.

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. GuardDuty invokes an EventBridge rule. The rule can filter the findings based on severity.
    1. The findings are also exported to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The EventBridge rule invokes a Step Functions workflow.
  3. The Step Functions workflow calls a Lambda function to get the details of the vulnerability findings.
  4. The Lambda function creates a prompt with the vulnerability details and passes it to Anthropic’s Claude 3 using Amazon Bedrock APIs. The function returns the response to the Step Functions workflow.
  5. The Step Functions workflow calls an SNS topic with the findings details to send an email notification to subscribers. You can use your support or operations team as the subscriber for this use case.
  6. Amazon SNS sends the email to the subscribers.
  7. The Step Functions workflow and Lambda function logs are stored in Amazon CloudWatch. For more details, see Configure logging in the Step Functions console to store logs in CloudWatch. By default, CloudWatch logs use server-side encryption for the log data at rest.

Solution benefits

The solution provides the following benefits for end-users:

  • Real-time visibility – The intuitive omnichannel support solution provides a comprehensive view of your cloud environment’s security posture
  • Actionable insights – You can drill down into specific security alerts and vulnerabilities generated using generative AI to prioritize and respond effectively
  • Proactive customizable reporting – You can troubleshoot various errors before escalation by retrieving a summary of reports with action recommendations

Prerequisites

Complete the following prerequisite steps:

  1. Enable GuardDuty in your account to generate findings.
  2. Provision least privilege IAM permissions for AWS resources like Step Functions and Lambda functions to perform desired actions:
    1. The Step Functions IAM role should have IAM policies to invoke the Lambda function and publish to the SNS topic.
    2. The Lambda function needs the AWSLambdaBasicExecutionRole managed policy to publish logs and the bedrock:InvokeModel permission to call Amazon Bedrock.
    3. Edit the access policy of the SNS topic to only allow Step Functions to publish messages to the topic.
  3. Request access to Anthropic’s Claude 3 on Amazon Bedrock.
  4. Turn on encryption at the SNS topic to enable server-side encryption.

Deploy the solution

Complete the following steps to deploy the solution:

  1. On the EventBridge console, create a new rule for GuardDuty findings notifications.

The example rule in the following screenshot filters high-severity findings at severity level 8 and above. For a complete list of GuardDuty findings, refer to the GetFindings API.
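The screenshot itself isn't reproduced here, but an event pattern along these lines (a sketch; adjust the severity threshold to your needs) matches GuardDuty findings with severity 8 and above:

```python
import json

# Sketch of an EventBridge event pattern for high-severity GuardDuty findings;
# paste the printed JSON into the rule's event pattern editor.
event_pattern = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 8]}]},
}
print(json.dumps(event_pattern, indent=2))
```

The `numeric` comparison operator lets the rule match a severity range rather than enumerating every value.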

  2. On the Lambda console, create a Lambda function that will take the findings as the input and call the Amazon Bedrock API to get the summarization and mitigation steps from Anthropic’s Claude 3.

You need to provide proper IAM permissions to your Lambda function to call Amazon Bedrock APIs. You can configure parameters in the environment variables in the Lambda function. The following function uses three configuration parameters:

  • modelId is set as anthropic.claude-3-sonnet-20240229-v1:0
  • findingDetailType is set as GuardDuty finding to filter the payload
  • source is set as guardduty to only evaluate GuardDuty findings
import json
import boto3
import urllib.parse
import os

region = os.environ['AWS_REGION']
model_Id = os.environ['modelId']
finding_detail_type = os.environ['findingDetailType']
finding_source = os.environ['source']

# Bedrock client used to interact with APIs around models
bedrock = boto3.client(service_name='bedrock', region_name= region)

# Bedrock Runtime client used to invoke and question the models
bedrock_runtime = boto3.client(service_name='bedrock-runtime', region_name= region)

evaluator_response = []
max_tokens=512
top_p=1
temp=0.5
system = ""

def lambda_handler(event, context):
    message = ""
    try:
        file_body = json.loads(json.dumps(event))
        print(finding_detail_type)
        print(finding_source)
        if file_body['detail-type'] == finding_detail_type and file_body['source'] == finding_source and file_body['detail']:
            print(f"File contents: {file_body['detail']}")
            description = file_body["detail"]["description"]
            finding_arn = file_body["detail"]["arn"]
            try:
                body= createBedrockRequest(description)
                message = invokeModel(body)
                print(message)
                evaluator_response.append(message)
                evaluator_response.append(finding_arn)
            except Exception as e:
                print(e)
                print('Error calling model')
        else:
            message = "Invalid finding source"
    except Exception as e:
        print(e)
        print('Error getting finding id from the guard duty record')
        raise e
    return message

def createBedrockRequest(description):
    prompt = "You are an expert in troubleshooting AWS logs and sharing details with the user via an email draft as stated in <description>. Do NOT provide any preamble. Draft a professional email summary of details as stated in description. Write the recipient as - User in the email and sender in the email should be listed as - Your Friendly Troubleshooter. Skip the preamble and directly start with subject. Also, provide detailed troubleshooting steps in the email draft." + "<description>" + description + "</description>"
    messages = [{ "role":'user', "content":[{'type':'text','text': prompt}]}]
    body=json.dumps(
             {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": messages,
                "temperature": temp,
                "top_p": top_p,
                "system": system
            } 
        )
    return body

def invokeModel(body):
    response = bedrock_runtime.invoke_model(body= body, modelId = model_Id)
    response_body = json.loads(response.get('body').read())
    message = response_body.get('content')[0].get("text")
    return message

It’s crucial to perform prompt engineering and follow prompting best practices in order to avoid hallucinations or non-coherent responses from the LLM. In our solution, we created the following prompt to generate responses from Anthropic’s Claude 3 Sonnet:

Prompt = ```You are an expert in troubleshooting AWS logs and sharing details with the user via an email draft as stated in <description>. Do NOT provide any preamble. Draft a professional email summary of details as stated in description. Write the recipient as - User in the email and sender in the email should be listed as - Your Friendly Troubleshooter. Skip the preamble and directly start with subject. Also, provide detailed troubleshooting steps in the email draft." + "<description>" + description + "</description>```

The prompt makes sure the description of the issue under consideration is categorized appropriately within XML tags. Further emphasis has been provided upon jumping directly into generating the answer and skipping any additional information that may be generated from the model.

  3. On the Amazon SNS console, create an SNS topic to send notifications and add the emails of the subscribers.

The following screenshot shows the topic details with some test subscribers.

Now you can create the Step Functions state machine and integrate the Lambda and Amazon SNS calls in the workflow.

  4. On the Step Functions console, create a new state machine and add the Lambda and Amazon SNS optimized integration.

You need to provide appropriate IAM permissions to the Step Functions role so it can call Lambda and Amazon SNS.

The following diagram illustrates the Step Functions state machine.

The following sample code shows how to use the Step Functions optimized integration with Lambda and Amazon SNS.
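The sample code itself isn't reproduced here; the following Amazon States Language sketch (the function name and topic ARN are placeholders, and the field names are illustrative) shows the general shape of the optimized Lambda and SNS integrations:

```json
{
  "Comment": "Sketch: summarize the GuardDuty finding, then notify subscribers",
  "StartAt": "SummarizeFinding",
  "States": {
    "SummarizeFinding": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "YOUR_SUMMARIZER_FUNCTION",
        "Payload.$": "$"
      },
      "ResultSelector": { "summary.$": "$.Payload" },
      "Next": "NotifySubscribers"
    },
    "NotifySubscribers": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "YOUR_SNS_TOPIC_ARN",
        "Message.$": "$.summary"
      },
      "End": true
    }
  }
}
```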

  5. On the EventBridge console, add the Step Functions state machine as the target of the EventBridge rule created earlier.

As seen in the following screenshot, the rule needs to have proper IAM permission to invoke the Step Functions state machine.

Test the solution

You can test the setup by generating sample findings on the GuardDuty console. A test email is triggered for each sample finding that matches the EventBridge rule.

Based on a sample generation, the following screenshot shows an email from Amazon SNS about a potential security risk in an Amazon Elastic Container Service (Amazon ECS) cluster. The email contains the vulnerability summary and a few mitigation steps to remediate the issue.

The following screenshot is a sample email notification about a potential Bitcoin IP address communication.

This proactive approach enables users to take immediate action and remediate vulnerabilities before they escalate, reducing the risk of data breaches or security incidents. It empowers users to maintain a secure environment within their AWS accounts, fostering a culture of proactive security awareness and responsibility. Furthermore, a proactive security vulnerability assessment and remediation system can streamline the resolution process, minimizing the time and effort required to address security concerns.

Clean up

To avoid incurring unnecessary costs, complete the following steps:

  1. Delete the following AWS resources associated with this solution:
    1. Step Functions state machine
    2. Lambda functions
    3. SNS topic
  2. You can disable GuardDuty if you’re no longer using it to avoid S3 bucket storage cost.

By cleaning up the resources created for this solution, you can prevent any ongoing charges to your AWS account.

Conclusion

By providing users with clear and actionable recommendations, they can swiftly implement the necessary fixes, reducing the likelihood of untracked or lost tickets and enabling swift resolution. Adopting this proactive approach not only enhances the overall security posture of AWS accounts, but also promotes a collaborative and efficient security practice within the organization, fostering a sense of ownership and accountability among users.

You can deploy this solution and integrate it with other services to create a holistic omnichannel solution. To learn more about Amazon Bedrock and AWS generative AI services, refer to the AWS workshops available on these topics.


About the Authors

Shikhar Kwatra is a Sr. Partner Solutions Architect at Amazon Web Services, working with leading Global System Integrators. He has earned the title of one of the Youngest Indian Master Inventors, with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports GSI partners in building strategic industry solutions on AWS.

Rajdeep Banerjee is a Senior Partner Solutions Architect at AWS, helping strategic partners and clients in their AWS cloud migration and digital transformation journeys. Rajdeep focuses on working with partners to provide technical guidance on AWS, collaborating with them to understand their technical requirements and designing solutions to meet their specific needs. He is a member of the Serverless technical field community. Rajdeep is based out of Richmond, Virginia.

Read More

DXC transforms data exploration for their oil and gas customers with LLM-powered tools


This post is co-written with Kaustubh Kambli from DXC Technology.

DXC Technology is an IT services leader with more than 130,000 employees supporting 6,000 customers in over 70 countries. DXC builds offerings across industry verticals to deliver accelerated value to their customers.

One of the sectors DXC has deep expertise in is energy. The oil and gas industry relies on discovering new drilling sites to drive growth. Data-driven insights can accelerate the process of identifying potential locations and improve decision-making. For the largest companies in the sector, shaving even a single day off the time to first oil can impact operational costs and revenue by millions of dollars.

In this post, we show you how DXC and AWS collaborated to build an AI assistant using large language models (LLMs), enabling users to access and analyze different data types from a variety of data sources. The AI assistant is powered by an intelligent agent that routes user questions to specialized tools that are optimized for different data types such as text, tables, and domain-specific formats. It uses the LLM’s ability to understand natural language, write code, and reason about conversational context.

Data plays a key role in identifying sites for oil exploration and in accelerating the time to extract oil from those sites, but data in this industry is scattered, non-standard, and of various types. These companies have remote sites, offshore drilling locations, branch offices, and corporate offices. Relevant data is also in various formats, ranging from spreadsheets to complex datasets like satellite images and GIS data. Moreover, there are industry-specific data formats like Log ASCII standard (LAS).

The assistant architecture consists of several key components powered by Anthropic’s Claude on Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In this project, Amazon Bedrock enabled us to seamlessly switch between different variants of Anthropic’s Claude models. We used smaller, faster models for basic tasks such as routing, and more powerful models for complex processes such as code generation. Additionally, we took advantage of Amazon Bedrock Knowledge Bases, a managed service that enhances the LLM’s knowledge by integrating external documents. This service seamlessly integrates with FMs on Bedrock and can be set up through the console in a few minutes.

Solution overview

The solution is composed of several components powered by Anthropic’s Claude models on Bedrock:

  • Router – The router analyzes the user query and routes it to the appropriate tool
  • Custom built tools – These tools are optimized for different data sources such as file systems, tables, and LAS files
  • Conversational capabilities – These capabilities enable the model to understand context and rewrite queries when needed

We also use an Amazon Simple Storage Service (Amazon S3) bucket to store the data. The data is indexed by relevant tools when deploying the solution, and we use signed S3 URLs to provide access to the relevant data sources in the UI.

The following diagram illustrates the solution architecture.

In the following sections, we go over these components in detail. The examples presented in this post use the Teapot dome dataset, which describes geophysical and well log data from the Rocky Mountain Oilfield Testing Center (RMOTC) related to the Tensleep Formation and is available for educational and research purposes.

LLM-powered router

The types of questions that the chatbot can be asked can be broken down into distinct categories:

  • File name questions – For example, “How many 3D seg-y files do we have?” For these questions, we don’t need to look at the file content; we only need to filter by file extension and count.
  • File content questions – For example, “What can you say about the geology of teapot dome?” For these questions, we need to do semantic search on the file content.
  • Production questions – For example, “How much oil did API 490251069400 produce on March 2001?” For these questions, we need to filter the production Excel sheet (here, match on API number) and make operations on the columns (here, sum on the oil production column).
  • Directional survey questions – For example, “What APIs have a total depth of more than 6000 ft?” For these questions, we need to filter the directional survey Excel sheet. The process is similar to the production questions, but the data differs.
  • LAS files questions – For example, “What log measurements were acquired for API 490251069400?” For these questions, we need to open and process the LAS file to look at the measurements present in the file.

The way to deal with each of these questions requires different processing steps. We can design LLM-powered tools that can address each question type, but the first step upon receiving the user query is to route it to the right tool. For this, we use Anthropic’s Claude v2.1 on Amazon Bedrock with the following prompt:

routing_prompt = """

Human: You are an AI assistant that is an expert in Oil and Gas exploration.

    Use the following information as background to categorize the question

    - An API well number or API# can have up to 14 digits, sometimes divided
    by dashes.
    - There can be different unique identifiers for wells other than API #.
    - .las or .LAS refers to Log ASCII file format. It is a standard file
    format for storing well log data, which is crucial in the exploration and production of oil and gas. Well logs are detailed records of the geological formations penetrated by a borehole, and they are essential for understanding the subsurface conditions.


    Determine the right category of the question to route it to the appropriate service by following the instructions below

    - Respond with a single word (the category name).
    - Use the information provided in <description> to determine the category of
    the question.
    - If you are unable to categorize the question or it is not related to one of
    the below categories then return "unknown".
    - Use the category names provided below. Do not make up new categories.
    - If the category is ambiguous then output all the categories that are relevant
    as a comma separated list.


    <categories>
        <category>
            <name>filename</name>
            <description>The questions about searching for files or objects or
            those related to counting files of specific types such as .pdf, .las, .xls, .sgy etc.
            </description>
        </category>

        <category>
            <name>production</name>
            <description>well production related information. This can correlate API#, Well, Date of production, Amount Produces, Formation, Section </description>
        </category>

        <category>
            <name>las</name>
            <description>related to log data or .las or .LAS or Log ASCII files.
            Except questions related
            to searching or counting the files with .las extension.
            Those belong to filesystem category. </description>
        </category>

        <category>
            <name>directional_survey</name>
            <description>directional survey contains information about multiple
            wells and associates API, Measured Depth, Inclination and Azimuth
            </description>
        </category>

        <category>
            <name>knowledge_base</name>
            <description>related to oil and gas exploration but does not fit in any of the categories above, include seismic, logging and core analysis related questions.
            </description>
        </category>

        <category>
            <name>unrelated</name>
            <description> Question does not belong to one of the above categories and it is not related to oil and gas exploration in general. </description>
        </category>

    </categories>

    Here is the question

    <question>
    {query}
    </question>


    Return your answer in the following format

    <answer>
        <reason>$REASON_JUSTIFYING_CATEGORY</reason>
        <labels>$COMMA_SEPARATED_LABELS</labels>
    </answer>
"""

Using XML tags in the output allows you to parse out the right category for the question. You can then pass the query down to the relevant tool for further processing. Note that with the release of new, more powerful Anthropic models, you could also use Anthropic’s Claude 3 Haiku on Amazon Bedrock to improve latency for the routing.
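Extracting the category from the model's XML-tagged answer reduces to a simple regular expression. The following is a minimal sketch (the sample answer string is illustrative, and falling back to "unknown" mirrors the prompt's instructions):

```python
import re

def parse_labels(answer: str) -> list:
    """Extract the comma-separated category labels from the <labels> tag."""
    match = re.search(r"<labels>(.*?)</labels>", answer, re.DOTALL)
    if not match:
        return ["unknown"]  # no tag found: treat as uncategorized
    return [label.strip() for label in match.group(1).split(",") if label.strip()]

# Illustrative model answer following the prompt's output format
sample = """<answer>
    <reason>The question asks about log measurements in a LAS file.</reason>
    <labels>las</labels>
</answer>"""
print(parse_labels(sample))  # ['las']
```

Because the prompt allows a comma-separated list for ambiguous questions, the parser returns a list rather than a single label.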

The prompt also includes guardrails to make sure queries not pertaining to oil and gas data are gently dismissed.

LLM-powered tools

To optimally handle the variety of tasks for the chatbot, we built specialized tools. The tools that we built are data-type specific (text, tables, and LAS), except for the file search tool, which is task specific.

File search tool

When searching for files and information, we identified two distinct types of search. One type pertains to identifying files based on the name or extension; the other requires analyzing the contents of the file to answer a question. We call the first type file name-based search and the second semantic content-based search.

File name-based search

For this tool, we don’t need to look at the contents of the file; only at the file name. To initialize the tool, we first crawl the S3 bucket containing the data to get a list of the available files. Then for each query, the steps are as follows:

  1. LLM call to extract file extension and keywords from the query. When searching for relevant files to answer a query, we can look for specific file extensions or keywords that might be present in the content of the files. Our approach is to first use an LLM to analyze the user’s query and extract potential file extensions and keywords that could be relevant. We then search through the list of available files, looking for matches to those file extensions and keywords identified by the LLM. This allows us to narrow down the set of files that might contain information pertinent to answering the user’s question. Because we’re working with Anthropic’s Claude models, we ask the LLM to format its answer using XML tags. This structured output allows us to parse and extract file extensions and keywords from the answer. For instance, if the question is “Do we have any *.SGY files,” the LLM response should be <file-extension>.sgy</file-extension> <keywords></keywords> because there are no keywords. On the other hand, if the question is “Can you show me the 2Dseismic base map,” the response should be <file-extension></file-extension> <keywords>2D, seismic, base, map</keywords>.
  2. Retrieve files that match the extension or keywords identified by the LLM. Retrieval is done by doing simple string matching over the list of available files. If a file extension is extracted by the LLM, simple string matching is done on the end of the file name; if keywords are extracted by the LLM, string matching is done for each of the keywords.
  3. LLM call to confirm that retrieved files match the user query, and provide a final answer. To reach the final answer, we build a prompt containing the user query and the files retrieved at Step 2. We also give specific output formatting instructions to the LLM. Similar to Step 1, we rely on an XML structure to parse and extract desired information. In this step, the desired outputs are as follows:
    1. <success> – Whether the search was successful.
    2. <resources> – The list of Amazon S3 locations that the LLM deems relevant to the user query.
    3. <answer> – The final answer from the LLM.

To be mindful of the limited number of input and output tokens, we also implement controls to make sure the retrieved context isn’t too large and that the LLM’s answer isn’t cut off, which can happen if there are too many resources that match the user query.
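The first two steps can be sketched as follows. The tag-parsing helper, file list, and LLM answer are illustrative assumptions (the LLM call itself is omitted):

```python
import re

def parse_tag(answer: str, tag: str) -> str:
    """Extract the text inside an XML-style tag from the LLM's answer."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", answer, re.DOTALL)
    return match.group(1).strip() if match else ""

def match_files(files, extension: str, keywords: str):
    """Return files matching the extension (suffix) or any keyword (substring)."""
    keys = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    matches = []
    for f in files:
        name = f.lower()
        if extension and name.endswith(extension.lower()):
            matches.append(f)
        elif keys and any(k in name for k in keys):
            matches.append(f)
    return matches

# Example with a hypothetical LLM answer and file list
llm_answer = "<file-extension>.sgy</file-extension> <keywords></keywords>"
files = ["seismic/line_01.sgy", "maps/2D_seismic_base_map.pdf", "notes.txt"]
ext = parse_tag(llm_answer, "file-extension")
kw = parse_tag(llm_answer, "keywords")
print(match_files(files, ext, kw))  # ['seismic/line_01.sgy']
```

The matched files then go into the Step 3 prompt for the LLM to confirm relevance and produce the final answer.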

The following screenshot is an example of a query that’s answered using that tool. It shows the query “show me the .sgy files” and the tool’s response, which includes a list of files.

Semantic content-based search

The implementation for semantic content-based search relies on Amazon Bedrock Knowledge Bases, which provides a seamless way to implement semantic search by pointing the service to an S3 bucket containing the data. The managed service takes care of the processing, chunking, and data management when files are added to or deleted from the bucket. For setup instructions, see Knowledge Bases now delivers fully managed RAG experience in Amazon Bedrock.

For a given user query that’s passed to the tool, the steps are as follows:

  1. Use the retrieve API from Amazon Bedrock Knowledge Bases to retrieve documents semantically close to the query.
  2. Construct a prompt with the retrieved documents formatted with XML tags—<content> for text content and <location> for the corresponding Amazon S3 location.
  3. Call Anthropic’s Claude v2.1 model with the prompt to get the final answer. Similarly to the file name-based search, we instruct the LLM to use <success>, <answer>, and <resources> tags in the answer.

Using the retrieve_and_generate API instead of the retrieve API would produce the answer in a single step, but the two-step approach gives us more flexibility in prompting to get output in the desired format.
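The prompt construction in step 2 might look like the following sketch. The retrieval-result shape mirrors what the Bedrock retrieve API returns (`content.text` and `location.s3Location.uri`), but the helper itself is our own illustration:

```python
def build_prompt(query: str, retrieval_results: list) -> str:
    """Format retrieved chunks with <content>/<location> tags, as in step 2."""
    docs = []
    for r in retrieval_results:
        text = r["content"]["text"]
        location = r["location"]["s3Location"]["uri"]
        docs.append(f"<document><content>{text}</content>"
                    f"<location>{location}</location></document>")
    return (
        "Answer the question using only the documents below.\n"
        "Reply with <success>, <answer> and <resources> tags.\n\n"
        + "\n".join(docs)
        + f"\n\nQuestion: {query}"
    )

# Example with a single hypothetical retrieval result
results = [{"content": {"text": "Teapot Dome is an oilfield in Wyoming."},
            "location": {"s3Location": {"uri": "s3://bucket/geology.pdf"}}}]
prompt = build_prompt("What is Teapot Dome?", results)
```

The resulting prompt is then passed to the model in step 3.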

The following screenshot is an example of a question answered using the semantic search tool. It shows the query “what information do you have about the geology of teapot dome?” and the tool’s response.

Tables tool

This tool is designed to filter tables and compute certain metrics from the information they contain. It uses the LLM’s ability to write Python code for data analysis. We implemented a generic tables tool that takes the following as input:

  • An Excel or CSV file
  • A description of the table (optional)
  • Table-specific instructions (optional)

In practice, with every new CSV or Excel file, we create a standalone tool for the router. This means that the tables tool applied to the production Excel sheet constitutes the production tool, whereas the tables tool coupled with the directional survey Excel sheet constitutes the directional survey tool.
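One lightweight way to implement this per-dataset pattern is a small factory that binds the generic tables tool to a specific file, so each dataset registers as its own tool for the router. The class and names here are our own illustration, not the solution’s actual code:

```python
import pandas as pd
from dataclasses import dataclass

@dataclass
class TablesTool:
    """Generic tables tool bound to one CSV or Excel file."""
    name: str
    path: str
    description: str = ""
    instructions: str = ""

    def load(self) -> pd.DataFrame:
        # Dispatch on file type; the loaded DataFrame is what the
        # code-writing agent operates on.
        if self.path.endswith((".xlsx", ".xls")):
            return pd.read_excel(self.path)
        return pd.read_csv(self.path)

def register_table_tools(datasets: dict) -> dict:
    """Create one standalone tool per dataset for the router."""
    return {name: TablesTool(name=name, path=path)
            for name, path in datasets.items()}

tools = register_table_tools({
    "production": "data/production.csv",
    "directional_survey": "data/directional_survey.xlsx",
})
```

With this setup, the router sees "production" and "directional_survey" as two distinct tools even though they share the same implementation.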

Some out-of-the-box data analysis tools, such as LangChain’s Pandas agent, are available in open source libraries. These agents use an LLM to generate Python code, execute the code, and send the result back to the LLM to generate a final response. However, for certain data analysis tasks, it’s preferable to directly output the result of the Python code: having the LLM generate the response as an extra step after code execution introduces both latency and a risk of hallucination.

For example, many sample questions require filtering a DataFrame and potentially returning dozens of entries, so the ability to filter a DataFrame and return the filtered results directly was essential for our use case. To address this limitation, we built a custom agent that has the LLM generate code we can run to obtain the desired output directly. Our custom agent can also self-correct if the generated code outputs an error. The main difference from traditional code-writing agents is that after the code is run, we return the output as-is, whereas traditional agents pass this output back to the agent to generate the final response. In our example of filtering and returning a large DataFrame, passing the DataFrame back to the agent would have the LLM rewrite it, risking either exceeding the context window or hallucinating some of the data.
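A minimal sketch of such a custom agent is shown below: the LLM-generated code runs in a namespace containing the DataFrame, the `result` variable is returned directly rather than being sent back to the LLM, and error messages are fed back for a retry. The `llm` callable is a stand-in for the real model call:

```python
import re
import pandas as pd

def run_tables_agent(llm, df: pd.DataFrame, question: str, max_retries: int = 2):
    """Execute LLM-generated code and return `result` directly,
    retrying with the error message if the code fails."""
    prompt = (f"Write Python code using `df` to answer: {question}\n"
              "Store the output in `result`, inside <code> tags.")
    for _ in range(max_retries):
        reply = llm(prompt)
        code = re.search(r"<code>(.*?)</code>", reply, re.DOTALL).group(1)
        namespace = {"df": df, "pd": pd}
        try:
            exec(code, namespace)
            return namespace["result"]  # returned as-is, not sent back to the LLM
        except Exception as e:
            prompt += f"\nYour code raised: {e}. Please fix it."
    raise RuntimeError("Agent failed to produce runnable code")

# Example with a stub LLM that returns fixed code
df = pd.DataFrame({"Well": ["A", "B"], "Total Depth": [5000, 7000]})
stub_llm = lambda p: "<code>result = df[df['Total Depth'] > 6000]</code>"
deep_wells = run_tables_agent(stub_llm, df, "Which wells are deeper than 6000 ft?")
print(deep_wells["Well"].tolist())  # ['B']
```

Returning the filtered DataFrame directly avoids the extra LLM generation pass described above.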

The following screenshot is an example of a question answered using the production data tool, which is the tables tool applied to the production data CSV file. It shows the query “What were the top 3 oil producing wells in March 2024” and the tool’s response. The response includes a DataFrame with the top 3 oil producing wells as well as the logic behind how the DataFrame was obtained.

The following is an example of the LLM response generated in the background for a query about wells with total depth greater than 6,000 feet; you can see in <code> tags the code that is run to get the result shown in the data section of the UI. We explicitly prompt the LLM to store the final result in a variable called result so we can send it back to the user in the frontend.

<scratchpad>
To find wells with total depth greater than 6000 feet, I need to:

1. Filter the df dataframe on the 'Total Depth' column
2. Check if the value is greater than 6000
3. Handle any NaN values so they are not included
</scratchpad>

<code>
import pandas as pd
import numpy as np

result = df[df['Total Depth'].replace({np.nan: 0}) > 6000]
</code>

<answer>
I filtered the dataframe on the 'Total Depth' column, replaced any NaN values with 0 so they would not be included, and checked for values greater than 6000 feet. The final filtered dataframe is stored in the variable 'result'.
</answer>

LAS tool

The LAS tool is built to analyze Log ASCII Standard (LAS) files, which contain a wide variety of log curves and data to fully characterize the drilled formations and support petroleum engineering analysis. This file format, which is in essence a structured text file, can be efficiently queried and analyzed using the lasio Python library.

Similarly to the tables tool, we use the code-writing abilities of the LLM to generate Python code using the lasio library. We use the XML agent implemented in LangChain, along with the Python REPL tool, which allows the agent to run Python code.

Because lasio isn’t a common Python library, we give detailed instructions in the prompt on how to use it properly. The following is an excerpt from the larger prompt:

""" ...
If the question is to display a particular log data, you can use the `lasio` package
 in Python to load every LAS file into dataframe and display them iteratively as
follows.

    import lasio
   #Extract the las file names as a list from the dataframe df
    las_files = df[df['API Number'] == api_num]['las_files'].values[0]
    print(las_files)

    for las_file in las_files: #enumerate the las file name in the list            
        las = lasio.read(las_file) # read a single LAS file
        las_df = las.df() # convert the LAS data into dataframe
        print(las_df) # print the dataframe 
 ... """

The following screenshot shows an example of a question answered using the LAS tool. It shows the query “What log measurements were acquired for API #490251107500” and the tool’s response, which includes processed data from the LAS files.

The following code is the LLM’s response in the backend, which contains the relevant Python code:

<tool>Python_REPL</tool><tool_input>
import pandas as pd
import lasio

df = pd.read_parquet("/home/ec2-user/SageMaker/AmazonSageMaker-DXC-GAIIC-Opibus-Repo/data/processed/teapot_dome/las_index.parq")

api_num = 490251107500  
las_files = df[df['API Number'] == api_num]['las_files'].values[0]

print(f"LAS files for API #{api_num}: {las_files}")

log_measurements = []
for las_file in las_files:
    las = lasio.read(las_file)
    for curve in las.curves:
        if curve.mnemonic not in log_measurements:
            log_measurements.append(curve.mnemonic)
            
print(f"Log measurements acquired: {log_measurements}")
</tool_input>

The following is the output of the code run by the Python REPL tool, which is returned to the LLM to generate its final answer:

"""
LAS files for API #490251107500: ['/home/ec2-user/SageMaker/AmazonSageMaker-DXC-GAIIC-Opibus-Repo/data/raw/teapot_dome/Well Log Data/LAS_log_files/Shallow_LAS_files/49025110750000_480593.LAS']
Log measurements acquired: ['DEPT', 'GRR', 'RILD', 'RILM', 'RLL', 'SPR']
"""

Lastly, we have the LLM’s final answer in the backend, before it’s parsed and returned to the UI:

 <final_answer>
    <data>
        <log_measurements>
            DEPT, GRR, RILD, RILM, RLL, SPR
        </log_measurements>
    </data>
    <resources>
        <resource>
            <location>/home/ec2-user/SageMaker/AmazonSageMaker-DXC-GAIIC-Opibus-Repo/data/raw/teapot_dome/Well Log Data/LAS_log_files/Shallow_LAS_files/49025110750000_480593.LAS</location>
        </resource>
    </resources>
</final_answer>

Conversational capabilities

The basic router handles a single user query and isn’t aware of chat history. However, conversational context is an essential part of the user experience. For instance, when a user asks “What API produced the most oil in 2010,” a natural follow-up question would be “What measurements do we have for that API,” in which case we need to recall the API number from the previous context, or “What about 2011,” in which case we need to recall that the question is to find the API that produced the most oil. To support such follow-up questions, we added a layer before the router that takes the user query and the conversation history and rewrites the query with any context it’s missing. We can also use this query-rewriting layer to directly translate or summarize previous responses without going to the router, which saves time for simple queries.

The following is the sample prompt for context-aware query rewriting. We give the LLM two choices: either directly reply to the question if it’s a translation or summarization of a previous interaction, because this doesn’t require the use of tools, or rewrite the query to forward it to an expert (the router plus the tool framework). To differentiate between the options, the LLM can use either <answer> tags or <new_query> tags in its reply. In both cases, we ask the LLM to start by using <think> tags to reason about which option is more appropriate. If the <answer> tag is present in the LLM’s reply, we directly forward that answer to the user. Otherwise, if the <new_query> tag is present, we forward that new query to the router for appropriate tool use. We also added few-shot examples to the prompt to make the query rewriting process more explicit for the LLM; in our experience, they were instrumental to the success of query rewriting.

query_rewriting_prompt = """
You are an AI assistant that helps a human answer oil and gas questions.
You only know how to translate or rewrite previous interactions.
If the human asks for oil and gas specific knowledge, or to count and find specific
files, you should rewrite the query so it can be forwarded to an expert.
If the human tries to ask a question that is not related to oil and gas,
you should politely tell them that only oil and gas related questions are supported.

Here is the conversation between the human and the expert so far.
H is the human and E is the expert:
<history>
{history}
</history>

Here is the new query
<query>
{query}
</query>

If you can answer the question, your answer should be formatted as follows.
In the example, H is the human, E is the expert and A is you, the assistant.
<example>
H: How many wells are in section 24?
E: There are 42 wells
H: Can you rewrite that in French?
A: <think> This is a translation, I can answer.</think>
<answer>Il y a 42 puits.</answer>
</example>

<example>
H: Can you summarize that in one sentence?
A: <think> This is just rewriting, I can summarize the previous reply and
answer directly.</think>
<answer>Il y a 42 puits.</answer>
</example>

<example>
H: Who's the queen of England?
A: <think>This is unrelated, I can tell the user I can't answer.</think>
<answer>I am sorry but I can only answer questions about your files.</answer>
</example>

If you need to forward the question to the expert, your answer should be as follows
<example>
H: What is the depth of well x250?
E: It's 2000 ft
H : What about well y890?
A: <think>This requires expert knowledge and some context, and I need to rewrite
the query before I ask the expert.</think>
<new_query>What is the depth of well y890?</new_query>
</example>

<example>
H: How many pdf files do I have?
A: <think>This requires to look into the file system,
I need to forward the question to the expert.</think>
<new_query>How many pdf files do I have?</new_query>
</example>

Remember, you only know how to translate or rewrite previous interactions.
If the human asks for anything other than rewriting or translating,
you should rewrite the query so it can be forwarded to an expert.
If the query needs context from previous questions, rewrite the query so
the expert can understand it, otherwise, forward the query as-is.
If the human tries to ask a question that is not related to oil and gas,
you should politely tell them that only oil and gas related questions are supported.
"""

This query rewriting step adds an extra second in terms of latency, and could be toggled off, but it greatly enhances user experience because it enables follow-up questions. Another way to handle this would have been to combine the query rewriting and the router in a single prompt, but we find that it’s more straightforward for LLMs to perform tasks separately, because they can get overwhelmed when faced with too many instructions.
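The routing logic around this prompt reduces to checking which tag the LLM produced. Here is a sketch with stubs in place of the real LLM and router:

```python
import re

def handle_query(llm, router, prompt_template: str, history: str, query: str) -> str:
    """Run the rewriting prompt, then either answer directly or forward to the router."""
    reply = llm(prompt_template.format(history=history, query=query))
    answer = re.search(r"<answer>(.*?)</answer>", reply, re.DOTALL)
    if answer:  # translation / summarization: no tool call needed
        return answer.group(1).strip()
    new_query = re.search(r"<new_query>(.*?)</new_query>", reply, re.DOTALL)
    return router(new_query.group(1).strip())  # context-aware query goes to the router

# Example with stubs standing in for the real LLM and router
template = "History:\n{history}\nQuery:\n{query}"
stub_llm = lambda p: ("<think>Needs context.</think>"
                      "<new_query>How much oil did API #490251107500 "
                      "produce in November 2003?</new_query>")
stub_router = lambda q: f"Routing: {q}"
out = handle_query(stub_llm, stub_router, template, "H: ...", "What about November?")
print(out)
```

A reply containing an <answer> tag short-circuits the router entirely, which is what makes simple translations and summaries fast.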

The following is an example of a conversational flow. The user first asks about the measurements for a given API, which requires the LAS tool. Then they follow up by asking production questions about that API, all using the production tool. Each of these questions builds on previous questions, which highlights the need for query rewriting. The table shows the initial user query and corresponding LLM-rewritten query, which accounts for previous context.

| User Query | LLM Rewritten Query (Context Aware) |
| --- | --- |
| What log measurements were acquired for API #490251107500 | What log measurements were acquired for API #490251107500? |
| How much oil did this API produce in September 2003? | How much oil did API #490251107500 produce in September 2003? |
| What about November? | For API #490251107500, how much oil did it produce in November 2003? |
| What month had the highest production that year? | What steps would you take to analyze the production data for API #490251107500 and determine which month had the highest oil production in the year 2003? |
| Get me a table of the monthly productions for that API for that year, include the monthly production and the months in the table | Please provide a table of the monthly oil productions for API #490251107500 for the year 2003. This API number and year were referenced in our previous discussion. |

The following screenshots show the corresponding flow in the UI and demonstrate that the tool is able to respond based on previous context.

Conclusion

In this post, we presented an AI assistant for efficient data exploration in the oil and gas industry powered by LLMs and optimized tools. The router uses the language understanding abilities of LLMs to route queries to specialized tools. We built custom tools optimized for major data types such as text, tables, and domain-specific formats like LAS. Conversational capabilities enable clarification and context-aware follow-up questions. The end-to-end solution showcases how LLMs can transform data exploration workflows through the use of specialized tools and conversational interfaces. Data exploration tasks that took hours can now be achieved in just a few minutes, dramatically reducing time to first oil for DXC’s customers.

In addition to the tools presented here, you can create additional generative AI tools to query SQL databases or analyze other industry-specific formats. Additionally, instead of creating separate table tools for each CSV dataset, the selection of the relevant dataset could be part of the tables tool itself, further reducing the need for preprocessing when onboarding the solution.

If you’re interested in building a similar AI assistant that can use multiple tools, you can get started with Amazon Bedrock Agents, a fully managed AWS solution that helps orchestrate complex tasks.


About the authors

Aude Genevay is a Senior Applied Scientist at the Generative AI Innovation Center, where she helps customers tackle critical business challenges and create value using generative AI. She holds a PhD in theoretical machine learning and enjoys turning cutting-edge research into real-world solutions.

Asif Fouzi is a Principal Solutions Architect leading a team of seasoned technologists supporting Global Service Integrators (GSI) helping GSIs such as DXC in their cloud journey. When he is not innovating on behalf of users, he likes to play guitar, travel and spend time with his family.

Kaustubh Kambli is a Senior Manager responsible for Generative AI and Cloud Analytics Delivery at DXC. His team drives innovation and AI-powered solutions to meet client needs across multiple industries in AMS region. When he’s not focused on advancing AI technologies, Kaustubh enjoys exploring new places, engaging in creative pursuits and spending quality time with his loved ones.

Anveshi Charuvaka is a Senior Applied Scientist at the Generative AI Innovation Center, where he develops Generative AI-driven solutions for customers’ critical business challenges. With a PhD in Machine Learning and over a decade of experience, he specializes in applying innovative machine learning and generative AI techniques to address complex real-world problems.

Mofijul Islam is an Applied Scientist II at the AWS Generative AI Innovation Center, where he helps customers tackle customer-centric research and business challenges using generative AI, large language models (LLM), multi-agent learning, and multimodal learning. He holds a PhD in machine learning from the University of Virginia, where his work focused on multimodal machine learning, multilingual NLP, and multitask learning. His research has been published in top-tier conferences like NeurIPS, ICLR, AISTATS, and AAAI, as well as IEEE and ACM Transactions.

Yingwei Yu is an Applied Science Manager at Generative AI Innovation Center, AWS, where he leverages machine learning and generative AI to drive innovation across industries. With a PhD in Computer Science from Texas A&M University and years of working experience in Oil&Gas industry, Yingwei brings extensive expertise in applying cutting-edge technologies to real-world applications. Outside of work, he enjoys swimming, painting, MIDI composing, and spending time with family and friends.
