KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs

This paper was accepted at the Workshop Towards Knowledgeable Language Models 2024.
Large Language Models (LLMs) might hallucinate facts, while curated Knowledge Graphs (KGs) are typically factually reliable, especially with domain-specific knowledge. Measuring the alignment between KGs and LLMs can effectively probe the factualness and identify the knowledge blind spots of LLMs. However, verifying the LLMs over extensive KGs can be expensive. In this paper, we present KGLens, a Thompson-sampling-inspired framework aimed at effectively and efficiently measuring the alignment between KGs and…Apple Machine Learning Research

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the…Apple Machine Learning Research

Build an end-to-end RAG solution using Knowledge Bases for Amazon Bedrock and AWS CloudFormation

Retrieval Augmented Generation (RAG) is a state-of-the-art approach to building question answering systems that combines the strengths of retrieval and foundation models (FMs). RAG models first retrieve relevant information from a large corpus of text and then use an FM to synthesize an answer based on the retrieved information.

An end-to-end RAG solution involves several components, including a knowledge base, a retrieval system, and a generation system. Building and deploying these components can be complex and error-prone, especially when dealing with large-scale data and models.

This post demonstrates how to seamlessly automate the deployment of an end-to-end RAG solution using Knowledge Bases for Amazon Bedrock and AWS CloudFormation, enabling organizations to quickly and effortlessly set up a powerful RAG system.

Solution overview

The solution provides an automated end-to-end deployment of a RAG workflow using Knowledge Bases for Amazon Bedrock. We use AWS CloudFormation to set up the necessary resources, including:

  1. An AWS Identity and Access Management (IAM) role
  2. An Amazon OpenSearch Serverless collection and index
  3. A knowledge base with its associated data source

The RAG workflow enables you to use your document data stored in an Amazon Simple Storage Service (Amazon S3) bucket and integrate it with the powerful natural language processing capabilities of FMs provided in Amazon Bedrock. The solution simplifies the setup process, allowing you to quickly deploy and start querying your data using the selected FM.

Prerequisites

To implement the solution provided in this post, you should have the following:

  • An active AWS account and familiarity with FMs, Amazon Bedrock, and OpenSearch Serverless.
  • An S3 bucket where your documents are stored in a supported format (.txt, .md, .html, .doc/docx, .csv, .xls/.xlsx, .pdf).
  • The Amazon Titan Embeddings G1-Text model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If the Amazon Titan Embeddings G1-Text model is enabled, the access status will show as Access granted, as shown in the following screenshot.

Set up the solution

When the prerequisite steps are complete, you’re ready to set up the solution:

  1. Clone the GitHub repository containing the solution files:
git clone https://github.com/aws-samples/amazon-bedrock-samples.git
  2. Navigate to the solution directory:
cd knowledge-bases/features-examples/04-infrastructure/e2e-rag-deployment-using-bedrock-kb-cfn
  3. Run the deploy.sh script, which creates the deployment bucket, prepares the CloudFormation templates, and uploads the prepared templates and required artifacts to the deployment bucket:
bash deploy.sh

While running deploy.sh, if you provide a bucket name as an argument to the script, it will create a deployment bucket with the specified name. Otherwise, it will use the default name format: e2e-rag-deployment-${ACCOUNT_ID}-${AWS_REGION}

As shown in the following screenshot, if you complete the preceding steps in an Amazon SageMaker notebook instance, you can run bash deploy.sh in the terminal, which creates the deployment bucket in your account (the account number has been redacted).

  4. After the script is complete, note the S3 URL of the main-template-out.yml.

  5. On the AWS CloudFormation console, create a new stack.
  6. For Template source, select Amazon S3 URL and enter the URL you copied earlier.
  7. Choose Next.

  8. Provide a stack name and specify the RAG workflow details according to your use case, and then choose Next.

  9. Leave everything else as default and choose Next on the following pages.
  10. Review the stack details and select the acknowledgement check boxes.

  11. Choose Submit to start the deployment process.

You can monitor the stack deployment progress on the AWS CloudFormation console.
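
If you prefer to script this step rather than use the console, the same stack can be created with the AWS SDK for Python (Boto3). The following is a minimal sketch, not part of the solution repository; the stack name and template URL are placeholders you would replace with the S3 URL noted earlier, and any template parameters your use case requires would be added as well.

import boto3

cfn = boto3.client("cloudformation")

# Placeholder values: use the S3 URL of main-template-out.yml that you noted earlier.
cfn.create_stack(
    StackName="e2e-rag-stack",
    TemplateURL="https://<deployment-bucket>.s3.amazonaws.com/main-template-out.yml",
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates an IAM role, so IAM capabilities must be acknowledged
)

# Block until the deployment finishes (typically 7-10 minutes).
cfn.get_waiter("stack_create_complete").wait(StackName="e2e-rag-stack")
print("Stack deployed")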

Test the solution

When the deployment is successful (which may take 7–10 minutes to complete), you can start testing the solution.

  1. On the Amazon Bedrock console, navigate to the created knowledge base.
  2. Choose Sync to initiate the data ingestion job.

  3. After data synchronization is complete, select the FM you want to use for retrieval and generation (model access must be granted for this FM in Amazon Bedrock before use).

  4. Start querying your data using natural language queries.

That’s it! You can now interact with your documents using the RAG workflow powered by Amazon Bedrock.
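
If you'd rather query programmatically than through the console, the Bedrock agent runtime API exposes the same RAG workflow. The sketch below is illustrative only; the knowledge base ID is a placeholder you can read from the stack outputs or the Amazon Bedrock console, and the model ARN is an example of any FM you have been granted access to.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Summarize the key points in my documents."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<your-knowledge-base-id>",  # placeholder: from the stack outputs
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",  # example FM
        },
    },
)

print(response["output"]["text"])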

Clean up

To avoid incurring future charges, delete the resources used in this solution:

  1. On the Amazon S3 console, manually delete the contents inside the bucket you created for template deployment, then delete the bucket.
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane, select the main stack, and choose Delete.

Your created knowledge base will be deleted when you delete the stack.

Conclusion

In this post, we introduced an automated solution for deploying an end-to-end RAG workflow using Knowledge Bases for Amazon Bedrock and AWS CloudFormation. By using the power of AWS services and the preconfigured CloudFormation templates, you can quickly set up a powerful question answering system without the complexities of building and deploying individual components for RAG applications. This automated deployment approach not only saves time and effort, but also provides a consistent and reproducible setup, enabling you to focus on utilizing the RAG workflow to extract valuable insights from your data.

Try it out and see firsthand how it can streamline your RAG workflow deployment and enhance efficiency. Please share your feedback with us!


About the Authors

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. With a keen interest in exploring new frontiers in the field, she continuously strives to push boundaries. Outside of work, she loves traveling, working out, and exploring new things.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Read More

Faster LLMs with speculative decoding and AWS Inferentia2

In recent years, we have seen a big increase in the size of large language models (LLMs) used to solve natural language processing (NLP) tasks such as question answering and text summarization. Larger models with more parameters, on the order of hundreds of billions at the time of writing, tend to produce better results. For example, Llama-3-70B scores better than its smaller 8B-parameter version on metrics like reading comprehension (SQuAD 85.6 compared to 76.4). Thus, customers often experiment with larger and newer models to build ML-based products that bring value.

However, the larger the model, the more computationally demanding it is, and the higher the cost to deploy. For example, on AWS Trainium, Llama-3-70B has a median per-token latency of 21.4 ms, while Llama-3-8B takes 4.7 ms. Similarly, Llama-2-70B has a median per-token latency of 20.6 ms, while Llama-2-7B takes 3.7 ms. Customers have to consider performance to ensure they meet their users’ needs. In this blog post, we will explore how speculative sampling can help make large language model inference more compute efficient and cost-effective on AWS Inferentia and Trainium. This technique improves LLM inference throughput and time per output token (TPOT).

Introduction

Modern language models are based on the transformer architecture. The input prompts are processed first using a technique called context encoding, which runs fast because it is parallelizable. Next, we perform auto-regressive token generation where the output tokens are generated sequentially. Note that we cannot generate the next token until we know the previous one, as depicted in Figure 1. Therefore, to generate N output tokens we need N serial runs through the decoder. A run takes longer through a larger model, like Llama-3-70B, than through a smaller model, like Llama-3-8B.

Figure 1: Sequential token generation in LLMs

From a computational perspective, token generation in LLMs is a memory bandwidth-bound process. The larger the model, the more likely it is that we will wait on memory transfers. This results in underutilizing the compute units and not fully benefiting from the floating-point operations (FLOPS) available.

Speculative sampling

Speculative sampling is a technique that improves the computational efficiency of running inference with LLMs while maintaining accuracy. It works by using a smaller, faster draft model to generate multiple tokens, which are then verified by a larger, slower target model. This verification step processes multiple tokens in a single pass and is therefore more compute efficient than processing tokens sequentially. Increasing the number of tokens processed in parallel increases the compute intensity because a larger number of tokens can be multiplied with the same weight tensor. This provides better performance compared with the non-speculative run, which is usually memory bandwidth-bound, and thus leads to better hardware resource utilization.

The speculative process involves an adjustable window k, where the target model provides one guaranteed correct token, and the draft model speculates on the next k-1 tokens. If the draft model’s tokens are accepted, the process speeds up. If not, the target model takes over, ensuring accuracy.

Figure 2: Case when all speculated tokens are accepted

Figure 2 illustrates a case where all speculated tokens are accepted, resulting in faster processing. The target model provides a guaranteed output token, and the draft model runs multiple times to produce a sequence of possible output tokens. These are verified by the target model and subsequently accepted by a probabilistic method.

Figure 3: Case when some speculated tokens are rejected

On the other hand, Figure 3 shows a case where some of the tokens are rejected. The time it takes to run this speculative sampling loop is the same as in Figure 2, but we obtain fewer output tokens. This means we will be repeating this process more times to complete the response, resulting in slower overall processing.

By adjusting the window size k and understanding when the draft and target models are likely to produce similar results, we can maximize the benefits of speculative sampling.
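
To make the accept/reject step concrete, here is a toy sketch of the probabilistic acceptance rule from the speculative sampling literature, written against made-up probability arrays rather than real models. It illustrates the logic behind Figures 2 and 3; it is not the Neuron implementation.

import numpy as np

rng = np.random.default_rng(0)

def speculative_window(draft_probs, target_probs, draft_tokens):
    """One speculative window: the draft proposes k-1 tokens, the target verifies
    them in a single pass. draft_probs has k-1 rows, target_probs has k rows
    (one extra row for the position after the last draft token)."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p_target, p_draft = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p_target / p_draft):
            accepted.append(tok)  # draft token accepted
        else:
            # Rejected: resample from max(0, target - draft), renormalized,
            # and stop the window (the case in Figure 3).
            adjusted = np.maximum(target_probs[i] - draft_probs[i], 0)
            accepted.append(rng.choice(len(adjusted), p=adjusted / adjusted.sum()))
            return accepted
    # All draft tokens accepted: the target also contributes one more token (Figure 2).
    accepted.append(rng.choice(len(target_probs[-1]), p=target_probs[-1]))
    return accepted

# Toy distributions over a 4-token vocabulary, with k = 4 (3 draft tokens per window).
draft_probs = rng.dirichlet(np.ones(4), size=3)
target_probs = rng.dirichlet(np.ones(4), size=4)
draft_tokens = [int(p.argmax()) for p in draft_probs]
print(speculative_window(draft_probs, target_probs, draft_tokens))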

A Llama-2-70B/7B demonstration

We will show how speculative sampling works on Inferentia2-powered Amazon EC2 Inf2 instances and Trainium-powered EC2 Trn1 instances. We will be using a sample where we generate text faster with Llama-2-70B by using a Llama-2-7B model as a draft model. The example walk-through is based on Llama-2 models, but you can follow a similar process for Llama-3 models as well.

Loading models

You can load the Llama-2 models using data type bfloat16. The draft model is loaded in the standard way, as in the example below. The parameter n_positions is adjustable and represents the maximum sequence length you want to allow for generation. The only batch_size we support for speculative sampling at the time of writing is 1. We will explain tp_degree later in this section.

draft_model = LlamaForSampling.from_pretrained('Llama-2-7b', n_positions=128, batch_size=1, tp_degree=32, amp='bf16')

The target model should be loaded in a similar way, but with speculative sampling functionality enabled. The value k was described previously.

target_model = LlamaForSampling.from_pretrained('Llama-2-70b', n_positions=128, batch_size=1, tp_degree=32, amp='bf16')
target_model.enable_speculative_decoder(k)

Combined, the two models need almost 200 GB of device memory for the weights, with additional memory on the order of GBs needed for key-value (KV) caches. If you prefer to use the models with float32 parameters, they will need around 360 GB of device memory. Note that the KV caches grow linearly with sequence length (input tokens + tokens yet to be generated). Use neuron-top to see the memory utilization live. To accommodate these memory requirements, we’ll need either the largest Inf2 instance (inf2.48xlarge) or the largest Trn1 instance (trn1.32xlarge).

Because of the size of the models, their weights need to be distributed amongst the NeuronCores using a technique called tensor parallelism. Notice that in the sample provided, tp_degree is used per model to specify how many NeuronCores that model should use. This, in turn, affects the memory bandwidth utilization, which is critical for token generation performance. A higher tp_degree can lead to better bandwidth utilization and improved throughput. The topology for Trn1 requires that tp_degree is set to 1, 2, 8, 16, or a multiple of 32. For Inf2, it must be 1 or a multiple of 2.

The order in which you load the models also matters. After a set of NeuronCores has been initialized and allocated for one model, you cannot use the same NeuronCores for another model unless it’s the exact same set. If you try to use only some of the NeuronCores that were previously initialized, you will get an nrt_load_collectives - global nec_comm is already init'd error.

Let’s go through two examples on trn1.32xlarge (32 NeuronCores) to understand this better. We will calculate how many NeuronCores we need per model. The formula is the model size observed in memory (using neuron-top) divided by 16 GB, the device memory per NeuronCore; a small helper sketching this calculation follows the list below.

  1. If we run the models using bfloat16, we need more than 10 NeuronCores for Llama-2-70B and more than 2 NeuronCores for Llama-2-7B. Because of topology constraints, it means we need at least tp_degree=16 for Llama-2-70B. We can use the remaining 16 NeuronCores for Llama-2-7B. However, because both models fit in memory across 32 NeuronCores, we should set tp_degree=32 for both to speed up inference for each.
  2. If we run the models using float32, we need more than 18 NeuronCores for Llama-2-70B and more than 3 NeuronCores for Llama-2-7B. Because of topology constraints, we have to set tp_degree=32 for Llama-2-70B. That means Llama-2-7B needs to re-use the same set of NeuronCores, so you need to set tp_degree=32 for Llama-2-7B too.
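
The helper below sketches that calculation for Trn1. The memory figures passed in at the bottom are hypothetical observed sizes used only for illustration; read the real values from neuron-top.

import math

TRN1_VALID_TP = [1, 2, 8, 16, 32]  # tp_degree values supported by the Trn1 topology

def min_tp_degree(observed_model_gb, core_mem_gb=16, valid=TRN1_VALID_TP):
    """Smallest valid tp_degree whose NeuronCores can hold the model,
    using the rule of thumb above (observed size / 16 GB per core)."""
    cores_needed = math.ceil(observed_model_gb / core_mem_gb)
    for degree in valid:
        if degree >= cores_needed:
            return degree
    raise ValueError("model does not fit on a single trn1.32xlarge")

# Hypothetical neuron-top readings for the bfloat16 case described above.
print(min_tp_degree(170))  # Llama-2-70B -> 16 (32 is still preferred for bandwidth)
print(min_tp_degree(40))   # Llama-2-7B  -> 8  (the remaining cores can also be used)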

Walkthrough

The decoder we’ll use from transformers-neuronx is LlamaForSampling, which is suitable for loading and running Llama models. You can also use NeuronAutoModelForCausalLM which will attempt to auto-detect which decoder to use. To perform speculative sampling, we need to create a speculative generator first which takes two models and the value k described previously.

spec_gen = SpeculativeGenerator(draft_model, target_model, k)

We invoke the inferencing process by calling the following function:

spec_gen.sample(input_ids=input_token_ids, sequence_length=total_output_length)

During sampling, there are several hyper-parameters (for example: temperature, top_p, and top_k) that affect whether the output is deterministic across multiple runs. At the time of writing, the speculative sampling implementation sets default values for these hyper-parameters. With these values, expect randomness in results when you run a model multiple times, even with the same prompt. This is normal, intended behavior for LLMs because it improves the quality of their responses.

When you run the sample, you will use the default token acceptor, based on the DeepMind paper that introduced speculative sampling, which uses a probabilistic method to accept tokens. However, you can also implement a custom token acceptor, which you can pass as part of the acceptor parameter when you initialize the SpeculativeGenerator. You would do this if you wanted more deterministic responses, for example. See the implementation of the DefaultTokenAcceptor class in transformers-neuronx to understand how to write your own.
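
Putting the pieces together, a minimal generation loop might look like the following sketch. It assumes the draft_model, target_model, and spec_gen objects created earlier have been loaded and compiled, and that a matching Llama-2 tokenizer is available locally under the same path as the draft model; the exact shape of the returned token IDs may differ between transformers-neuronx versions, so adjust the decoding step as needed.

from transformers import AutoTokenizer

# The two Llama-2 models share a tokenizer.
tokenizer = AutoTokenizer.from_pretrained('Llama-2-7b')

prompt = "Explain why speculative sampling speeds up LLM inference."
input_token_ids = tokenizer(prompt, return_tensors="pt").input_ids

# sequence_length bounds the total length (prompt tokens + generated tokens) and
# must not exceed the n_positions used when loading the models (128 above).
output_ids = spec_gen.sample(
    input_ids=input_token_ids,
    sequence_length=128,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))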

Conclusion

As more developers look to incorporate LLMs into their applications, they’re faced with a choice: use larger, more costly, and slower models that deliver higher-quality results, or use smaller, less expensive, and faster models that might reduce the quality of answers. Now, with AWS artificial intelligence (AI) chips and speculative sampling, developers don’t have to make that choice. They can take advantage of the high-quality outputs of larger models and the speed and responsiveness of smaller models.

In this blog post, we have shown that we can accelerate the inference of large models, such as Llama-2-70B, by using a new feature called speculative sampling.

To try it yourself, check out the speculative sampling example, and tweak the input prompt and k parameter to see the results you get. For more advanced use cases, you can develop your own token acceptor implementation. To learn more about running your models on Inferentia and Trainium instances, see the AWS Neuron documentation. You can also visit the AWS Neuron channel on repost.aws to discuss your experiments with the AWS Neuron community and share ideas.


About the Authors

Syl Taylor is a Specialist Solutions Architect for Efficient Compute. She advises customers across EMEA on Amazon EC2 cost optimization and improving application performance using AWS-designed chips. Syl previously worked in software development and AI/ML for AWS Professional Services, designing and implementing cloud native solutions. She’s based in the UK and loves spending time in nature.

Emir Ayar is a Senior Tech Lead Solutions Architect with the AWS Prototyping team. He specializes in assisting customers with building ML and generative AI solutions, and implementing architectural best practices. He supports customers in experimenting with solution architectures to achieve their business objectives, emphasizing agile innovation and prototyping. He lives in Luxembourg and enjoys playing synthesizers.

Read More

Catalog, query, and search audio programs with Amazon Transcribe and Knowledge Bases for Amazon Bedrock

Information retrieval systems have powered the information age through their ability to crawl and sift through massive amounts of data and quickly return accurate and relevant results. These systems, such as search engines and databases, typically work by indexing on keywords and fields contained in data files.

However, much of our data in the digital age also comes in non-text format, such as audio and video files. Finding relevant content usually requires searching through text-based metadata such as timestamps, which need to be manually added to these files. This can be hard to scale as the volume of unstructured audio and video files continues to grow.

Fortunately, the rise of artificial intelligence (AI) solutions that can transcribe audio and provide semantic search capabilities now offer more efficient solutions for querying content from audio files at scale. Amazon Transcribe is an AWS AI service that makes it straightforward to convert speech to text. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In this post, we show how Amazon Transcribe and Amazon Bedrock can streamline the process to catalog, query, and search through audio programs, using an example from the AWS re:Think podcast series.

Solution overview

The following diagram illustrates how you can use AWS services to deploy a solution for cataloging, querying, and searching through content stored in audio files.

Architecture Diagram of Amazon Bedrock and related AWS Services

In this solution, audio files stored in mp3 format are first uploaded to Amazon Simple Storage Service (Amazon S3) storage. Video files (such as mp4) that contain audio in supported languages can also be uploaded to Amazon S3 as part of this solution. Amazon Transcribe will then transcribe these files and store the entire transcript in JSON format as an object in Amazon S3.

To catalog these files, each JSON file in Amazon S3 should be tagged with the corresponding episode title. This allows us to later retrieve the episode title for each query result.

Next, we use Amazon Bedrock to create numerical representations of the content inside each file. These numerical representations are also called embeddings, and they’re stored as vectors inside a vector database that we can later query.

Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API. Included with Amazon Bedrock is Knowledge Bases for Amazon Bedrock. As a fully managed service, Knowledge Bases for Amazon Bedrock makes it straightforward to set up a Retrieval Augmented Generation (RAG) workflow.

With Knowledge Bases for Amazon Bedrock, we first set up a vector database on AWS. Knowledge Bases for Amazon Bedrock can then automatically split the data files stored in Amazon S3 into chunks and then create embeddings of each chunk using Amazon Titan on Amazon Bedrock. Amazon Titan is a family of high-performing FMs from Amazon. Included with Amazon Titan is Amazon Titan Text Embeddings, which we use to create the numerical representation of the text inside each chunk and store them in a vector database.

When a user queries the contents of the audio files through a generative AI application or AWS Lambda function, it makes an API call to Knowledge Bases for Amazon Bedrock. Knowledge Bases for Amazon Bedrock then orchestrates a call to the vector database to perform a semantic search, which returns the most relevant results. Next, Knowledge Bases for Amazon Bedrock augments the user’s original query with these results to form a prompt, which is sent to the large language model (LLM). The LLM returns results that are more accurate and relevant to the user query.

Let’s walk through an example of how you can catalog, query, and search through a library of audio files using these AWS AI services. For this post, we use episodes of the re:Think podcast series, which has over 20 episodes. Each episode is an audio program recorded in mp3 format. As we continue to add new episodes, we will want to use AI services to make the task of querying and searching for specific content more scalable without the need to manually add metadata for each episode.

Prerequisites

In addition to having access to AWS services through the AWS Management Console, you need a few other resources to deploy this solution.

First, you need a library of audio files to catalog, query, and search. For this post, we use episodes of the AWS re:Think podcast series.

To make API calls to Amazon Bedrock from our generative AI application, we use Python version 3.11.4 and the AWS SDK for Python (Boto3).

Transcribe audio files

The first task is to transcribe each mp3 file using Amazon Transcribe. For instructions on transcribing with the AWS Management Console or AWS CLI, refer to the Amazon Transcribe Developer guide. Amazon Transcribe can create a transcript for each episode and store it as an S3 object in JSON format.
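
As a reference, the following Boto3 sketch starts one such transcription job. The job name and the source audio location are hypothetical; only the transcript destination matches the S3 object referenced later in this post.

import boto3

transcribe = boto3.client("transcribe")

# Hypothetical source location for the episode audio; the transcript destination
# matches the S3 object used in the examples below.
transcribe.start_transcription_job(
    TranscriptionJobName="rethink-episode-20",
    Media={"MediaFileUri": "s3://rethinkpodcast/audio/AI-Accelerators.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName="rethinkpodcast",
    OutputKey="text/transcripts/AI-Accelerators.json",
)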

Catalog audio files using tagging

To catalog each episode, we tag the S3 object for each episode with the corresponding episode title. For instructions on tagging objects in S3, refer to the Amazon Simple Storage Service User Guide. For example, for the S3 object AI-Accelerators.json, we tag it with key = “title” and value = “Episode 20: AI Accelerators in the Cloud.”

Edit Tags in S3

The title is the only metadata we need to manually add for each audio file. There is no need to manually add timestamps for each chapter or section in order to later search for specific content.
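
Tagging can also be done programmatically. A minimal Boto3 sketch for the example above:

import boto3

s3 = boto3.client("s3")

# Tag the transcript object with its episode title (values from the example above).
s3.put_object_tagging(
    Bucket="rethinkpodcast",
    Key="text/transcripts/AI-Accelerators.json",
    Tagging={"TagSet": [
        {"Key": "title", "Value": "Episode 20: AI Accelerators in the Cloud"}
    ]},
)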

Set up a vector database using Knowledge Bases for Amazon Bedrock

Next, we set up our fully managed RAG workflow using Knowledge Bases for Amazon Bedrock. For instructions on creating a knowledge base, refer to the Amazon Bedrock User Guide. We begin by specifying a data source. In our case, we choose the S3 bucket location where our transcripts in JSON format are stored.

Configure data source for Knowledge Base

Next, we select an embedding model. The embedding model will convert each chunk of our transcript into embeddings. Embeddings are numbers, and the meaning of each embedding depends on the model. In our example, we select Titan Text Embeddings v2 with a dimension size of 1024.

Select embeddings model and configure vector store for Knowledge Base

The embeddings are stored as vectors in a vector database. You can either specify an existing vector database you have already created or have Knowledge Bases for Amazon Bedrock create one for you. For our example, we have Knowledge Bases for Amazon Bedrock create a vector database using Amazon OpenSearch Serverless.

Create a new vector store

Before you can query the vector database, you must first sync it with the data source. During each sync operation, Knowledge Bases for Amazon Bedrock will split the data source into chunks and then use the selected embedding model to embed each chunk as a vector. Knowledge Bases for Amazon Bedrock will then store these vectors in the vector database.

The sync operation as well as other Amazon Bedrock operations described so far can be performed either using the console or API calls.
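
For example, a sync can be started with a single Boto3 call; the knowledge base and data source IDs below are placeholders you can read from the console or from the create-knowledge-base response.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Kick off a sync (ingestion job) that chunks, embeds, and indexes the transcripts.
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="<knowledge-base-id>",
    dataSourceId="<data-source-id>",
)
print(job["ingestionJob"]["status"])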

Query the audio files

Now we’re ready to query and search for specific content from our library of podcast episodes. In episode 20, titled “AI Accelerators in the Cloud,” our guest Matthew McClean, a senior manager from AWS’s Annapurna team, shared why AWS decided to buy Annapurna Labs in 2015. For our first query, we ask, “Why did AWS acquire Annapurna Labs?”

We entered this query into Knowledge Bases for Amazon Bedrock using Anthropic Claude and got the following response:

“AWS acquired Annapurna Labs in 2015 because Annapurna was providing AWS with nitro cards that offloaded virtualization, security, networking and storage from EC2 instances to free up CPU resources.”

This is an exact quote from Matthew McClean in the podcast episode. You wouldn’t get this quote if you had entered the same prompt into other publicly available generative AI chatbots because they don’t have the vector database with embeddings of the podcast transcript to provide more relevant context.

Retrieve an episode title

Now let’s suppose that in addition to getting more relevant responses, we also want to retrieve the correct podcast episode title that was relevant to this query from our catalog of podcast episodes.

To retrieve the episode title, we start from the most relevant data chunk for the query. Whenever Knowledge Bases for Amazon Bedrock responds to a query, it also returns the chunks of data it retrieved from the vector database, ordered by relevance, as JSON documents. We can take the first chunk that was returned. Nested inside the JSON is the S3 location of the transcript object. In our example, the S3 location is s3://rethinkpodcast/text/transcripts/AI-Accelerators.json.

The first words in the chunk text are: “Yeah, sure. So maybe I can start with the history of Annapurna…”

Because we have already tagged this transcript object in Amazon S3 with the episode title, we can retrieve the title by retrieving the value of the tag where key = “title”. In this case, the title is “Episode 20: AI Accelerators in the Cloud.”
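
The sketch below shows one way to do this with Boto3: call the Retrieve API to get the top chunk and its S3 location, then read the title tag from that object. The knowledge base ID is a placeholder; the S3 URI and tag key follow the example above.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")
s3 = boto3.client("s3")

# 1. Retrieve the most relevant chunk for the query.
result = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="<knowledge-base-id>",
    retrievalQuery={"text": "Why did AWS acquire Annapurna Labs?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 1}},
)
top_chunk = result["retrievalResults"][0]
chunk_text = top_chunk["content"]["text"]
s3_uri = top_chunk["location"]["s3Location"]["uri"]  # e.g., s3://rethinkpodcast/text/transcripts/AI-Accelerators.json

# 2. Read the episode title from the transcript object's tags.
bucket, key = s3_uri.removeprefix("s3://").split("/", 1)
tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
title = next(tag["Value"] for tag in tags if tag["Key"] == "title")
print(title)  # Episode 20: AI Accelerators in the Cloud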

Search the start time

What if we also want to search and find the start time inside the episode where the relevant content begins? We want to do so without having to manually read through the transcript or listen to the episode from the beginning, and without manually adding timestamps for every chapter.

We can find the start time much faster by having our generative AI application make a few more API calls. We start by treating the chunk text as a substring of the entire transcript. We then search for the start time of the first word in the chunk text.

In our example, the first words returned were “Yeah, sure. So maybe I can start with the history of Annapurna…” We now need to search the entire transcript for the start time of the word “Yeah.”

Amazon Transcribe outputs the start time of every word in the transcript. However, any word can appear more than once. The word “Yeah” occurs 28 times in the transcript, and each occurrence has its own start time. So how do we determine the correct start time for “Yeah” in our example?

There are multiple approaches an application developer can use to find the correct start time. For our example, we use the Python string find() method to find the position of the chunk text within the entire transcript.

For the chunk text that begins with “Yeah, sure. So maybe I can start with the history of Annapurna…” the find() method returned the position as 2047. If we treat the transcript as one long text string, the chunk “Yeah, sure. So maybe…” starts at character position 2047.

Finding the start time now becomes a matter of counting the character position of each word in the transcript and using it to look up the correct start time from the transcript file generated by Amazon Transcribe. This may be tedious for a person to do manually, but trivial for a computer.

In our example Python code, we loop through an array that contains the start time of each token while tracking the character position at which each token starts. As we loop through the tokens, we build a new array that stores the start time for each character position.

In this example query, the start time for the word “Yeah” at position 2047 is 160 seconds, or 2 minutes and 40 seconds into the podcast. You can check the recording starting at 2 minutes 40 seconds.
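
A simplified version of that code is sketched below. It assumes the standard Amazon Transcribe output layout (a results.transcripts[0].transcript string plus a results.items list with per-word start times) and a simple spacing rule when rebuilding the text, so it may need adjustment for transcripts with unusual punctuation.

import json

def build_start_time_index(transcript_path):
    """Map the character position where each word starts in the full transcript
    to that word's start time (in seconds)."""
    with open(transcript_path) as f:
        results = json.load(f)["results"]

    full_text = results["transcripts"][0]["transcript"]
    start_times = {}  # character position -> start time in seconds
    rebuilt = ""
    for item in results["items"]:
        token = item["alternatives"][0]["content"]
        if item["type"] == "pronunciation":
            if rebuilt:
                rebuilt += " "  # words are separated by spaces
            start_times[len(rebuilt)] = float(item["start_time"])
            rebuilt += token
        else:
            rebuilt += token    # punctuation attaches to the previous word
    return full_text, start_times

full_text, start_times = build_start_time_index("AI-Accelerators.json")
chunk = "Yeah, sure. So maybe I can start with the history of Annapurna"
position = full_text.find(chunk)  # 2047 in our example
print(start_times[position])      # ~160 seconds, i.e., 2 minutes 40 seconds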

Clean up

This solution incurs charges based on the services you use:

  • Amazon Transcribe operates under a pay-as-you-go pricing model. For more details, see Amazon Transcribe Pricing.
  • Amazon Bedrock uses an on-demand quota, so you only pay for what you use. For more information, refer to Amazon Bedrock pricing.
  • With OpenSearch Serverless, you only pay for the resources consumed by your workload.
  • If you’re using Knowledge Bases for Amazon Bedrock with other vector databases besides OpenSearch Serverless, you may continue to incur charges even when not running any queries. It is recommended you delete your knowledge base and its associated vector store along with audio files stored in Amazon S3 to avoid unnecessary costs when you’re done testing this solution.

Conclusion

Cataloging, querying, and searching through large volumes of audio files can be difficult to scale. In this post, we showed how Amazon Transcribe and Knowledge Bases for Amazon Bedrock can help automate and make the process of retrieving relevant information from audio files more scalable.

You can begin transcribing your own library of audio files with Amazon Transcribe. To learn more on how Knowledge Bases for Amazon Bedrock can then orchestrate a RAG workflow for your transcripts with vector stores, refer to Knowledge Bases now delivers fully managed RAG experience in Amazon Bedrock.

With the help of these AI services, we can now expand the frontiers of our knowledge bases.


About the Author

Nolan Chen is a Partner Solutions Architect at AWS, where he helps startup companies build innovative solutions using the cloud. Prior to AWS, Nolan specialized in data security and helping customers deploy high-performing wide area networks. Nolan holds a bachelor’s degree in Mechanical Engineering from Princeton University.

Read More

GENEVA uses large language models for interactive game narrative design

This paper was presented at the IEEE 2024 Conference on Games (IEEE CoG 2024), the leading forum on innovation in and through games.

Mastering the art of storytelling, a highly valued skill across films, novels, games, and more, requires creating rich narratives with compelling plots and characters. In recent years, the rise of AI has prompted inquiries into whether large language models (LLMs) can effectively generate and sustain detailed, coherent storylines that engage audiences. Consequently, researchers have been actively exploring AI’s potential to support creative processes in video game development, where the growing demands of narrative design often surpass the capabilities of traditional tools. This investigation focuses on AI’s capacity for innovation in storytelling and the necessary human interactions to drive such advances.

In this context, we introduce “GENEVA: GENErating and Visualizing branching narratives using LLMs,” presented at IEEE CoG 2024. This graph-based narrative generation and visualization tool requires a high-level narrative description and constraints, such as the number of different starts, endings, and storylines, as well as context for grounding the narrative. GENEVA uses the generative capabilities of GPT-4 to create narratives with branching storylines and renders them in a graph format, allowing users to interactively explore different narrative paths through its web interface.

Visualizing narratives using graphs

The narrative graph itself is a directed acyclic graph (DAG), where each node represents a narrative beat—an event that moves the plot forward—with directed edges (arrows) marking the progression through the story’s events. These beats are the fundamental units of the narrative structure, representing the exchange of action and reaction. A single path from a start node to an end node outlines a unique storyline, and the graph illustrates the various potential storylines based on the same overarching narrative. 
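
As an illustration of this structure (not GENEVA’s own code), the sketch below builds a tiny narrative DAG with networkx, using invented beat names, and enumerates every start-to-end path as a storyline.

import networkx as nx

# Invented beats for illustration only; in GENEVA these are generated by GPT-4.
graph = nx.DiGraph()
graph.add_edges_from([
    ("scientist revives creature", "creature escapes the lab"),
    ("scientist revives creature", "creature befriends the scientist"),
    ("creature escapes the lab", "tragic ending"),
    ("creature escapes the lab", "reconciliation ending"),
    ("creature befriends the scientist", "reconciliation ending"),
])

starts = [n for n in graph if graph.in_degree(n) == 0]
ends = [n for n in graph if graph.out_degree(n) == 0]

# Each simple path from a start node to an end node is one storyline.
for start in starts:
    for end in ends:
        for path in nx.all_simple_paths(graph, start, end):
            print(" -> ".join(path))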

The generation and visualization of these narrative graphs are accomplished using GPT-4 in a two-step process. First, the model generates the branching storylines from the given description and constraints. Second, it produces code to render these narratives in a visually comprehensible graph format.

We detail this methodology in our paper, through a case study where we used GENEVA to construct narrative graphs for four well-known stories—Dracula, Frankenstein, Jack and the Beanstalk, and Little Red Riding Hood. Each was set in one of four distinct worlds: the game of Minecraft, the 21st century, ancient Rome, and the quantum realm. Figure 1 shows a narrative graph of Frankenstein set in the 21st century, and Figure 2 shows the storylines generated for this story.

Figure 1: A narrative graph for the novel, Frankenstein, grounded in the 21st century. Additional constraints on the graph include one start, two endings, and four storylines.
Figure 2: A detailed view of the four different storylines in the narrative graph in Figure 1.

Assessing GENEVA’s narrative adaptations

In our assessment, we found that GENEVA performed better in specific narrative contexts. For example, in Frankenstein’s adaptation to the 21st century, the storylines included themes like creating life from DNA fragments and genetic engineering, maintaining relevance while preserving the original story’s essence. However, upon closer examination, we noted areas for improvement, such as the need for more variety and better grounding of the narrative. Generally, stories that are better known and more thoroughly documented tend to yield richer and more varied adaptations.

Implications and looking forward

GENEVA remains a prototype, serving as a tool for exploring the narrative capabilities of LLMs. As these models evolve, we anticipate corresponding advances in their narrative generation abilities. The ultimate goal in game design is to engage players with compelling interactive experiences. With the skilled input of experienced game designers, tools like GENEVA could increasingly contribute to creating engaging gameplay experiences through iterative refinement of narrative paths.

Our collaboration with Xbox and Inworld AI continues to advance the use of AI in game development, incorporating these developments into practical tools for creators. Discover more about this transformative technology by watching this video.

Read More

Players, creators, and AI collaborate to build and expand rich game narratives

This paper was presented at the IEEE 2024 Conference on Games (IEEE CoG 2024), the leading forum on innovation in and through games.

In the fast-evolving landscape of video game development, crafting dialogues and narratives is a labor-intensive endeavor. Traditionally, creating these elements involved meticulous hand-coding, resulting in static interactions that limit player agency. However, the rise of large language models (LLMs) is introducing possibilities for richer, more dynamic narrative experiences and automating some of the more challenging aspects of game creation. Despite this advance, a key challenge with using LLMs for narrative design in games is that, without human intervention, they tend to repeat patterns.

We address this in our paper, “Player-Driven Emergence in LLM-Driven Game Narrative,” presented at IEEE CoG 2024, where we explore how LLMs can foster unique forms of creativity when players participate in the design process. Rather than replacing designers, LLMs can empower players with considerable freedom in their interactions with nonplayer characters (NPCs)—characters not controlled by the players but crucial for gameplay. These interactions provide implicit feedback for designers, offering insights unattainable with traditional dialogue trees—a branching structure of player dialogue choices affecting the narrative.

Creating and designing “Dejaboom!”

To test this hypothesis, we developed a text-adventure game called “Dejaboom!” The game’s premise involves a player waking up at home with déjà vu, recalling an explosion in their village from the day before. The objective is to relive the day and prevent the disaster. Players interact with five NPCs in the village. After a set number of steps, the bomb explodes, causing the player to lose all the items they gathered but retain memories of the NPC interactions. Figure 1 illustrates the game design.

Figure 1: A map of the village showing the locations, objects, and NPCs.

We built the game using TextWorld, an open-source, extensible engine for text adventure games, modifying it to include dialogue with NPCs through OpenAI’s GPT-4 model. TextWorld provided the core game logic, while GPT-4 allowed for dynamic input and output—including both game feedback and NPC responses. Figure 2 illustrates our implementation of the game. In a conventional text game, this setup would allow only a fixed set of player commands and offer a predefined set of game responses. However, the use of GPT-4 allows the game’s input and output to be dynamic.

Figure 2: In our implementation of the game, the user’s commands are classified by GPT-4 as actions or words. Actions are processed by the game agent, while words trigger GPT-4 to generate contextually appropriate NPC responses.
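
As an illustrative sketch of this routing step (not the authors’ code), the snippet below uses the OpenAI Python client to classify a raw player command before handing it to either the TextWorld game agent or an in-character NPC response call. The prompt wording, model name, and helper names are assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROUTER_SYSTEM_PROMPT = (
    "You route commands for a text-adventure game. Reply with exactly one word: "
    "ACTION if the player's input is a game action (e.g., 'take the water bucket'), "
    "or WORDS if it is dialogue aimed at a character."
)

def route_player_command(command: str) -> str:
    """Classify a raw player command as 'action' or 'words' (see Figure 2)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": ROUTER_SYSTEM_PROMPT},
            {"role": "user", "content": command},
        ],
    )
    label = response.choices[0].message.content.strip().upper()
    return "action" if label.startswith("ACTION") else "words"

# Actions go to the TextWorld game agent; dialogue goes to a second GPT-4 call
# that answers in character as the relevant NPC.
print(route_player_command("can I see your menu"))  # expected: words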

Narrative analysis and user study

Our goal was to identify narrative paths that players create and how they diverge from the designer’s original narrative. We used GPT-4 to transform player game logs into a narrative graph, where a node represents a player’s strategy at specific points and directed edges (arrows) show game progression. We compared these to a graph of the designer’s intended narrative. We defined emergent nodes as those that appear in the narrative graph of players but are not present in the original narrative graph. 
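
In set terms, the emergent nodes are simply the difference between the two node sets. The toy example below uses invented node labels; in the study, the nodes are strategy summaries extracted from game logs by GPT-4.

# Invented node labels for illustration only.
designer_nodes = {"wake up at home", "gather items", "find the bomb", "defuse the bomb"}
player_nodes = {"wake up at home", "gather items", "interrogate Moriarty",
                "flood the storage room", "defuse the bomb"}

emergent_nodes = player_nodes - designer_nodes
print(emergent_nodes)  # {'interrogate Moriarty', 'flood the storage room'}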

When we applied this approach to a user study with 28 gamers playing Dejaboom!, we found that players often introduced new strategies and elements, indicating a high level of creative engagement. Those generating the most emergent nodes tended to enjoy games that emphasize discovery, exploration, and experimentation, suggesting that such players are ideally suited for a collaborative approach to game development.

Figure 3: The single circles indicate the initial narrative graph intended by the designers. The double circles denote the emergent nodes created by players, representing creative new paths.

Implications and looking ahead

Our goal is to build methods that help empower game creators to create novel NPC experiences, design new narratives, and ultimately build entire new worlds through implicit player feedback and the progressive application of advanced AI technologies. This work represents a foundational step, marking the start of a new paradigm of game development in which designers, players, and generative AI models can collaboratively design and evolve games. Utilizing AI models introduces a new mechanism for capturing implicit player feedback through players’ emergent behaviors.

Read More

LLM in a Flash: Efficient Large Language Model Inference with Limited Memory

This paper was accepted at ACL 2024.
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory and bringing them to DRAM on demand. Our method involves constructing an inference cost model that takes into account the characteristics of…Apple Machine Learning Research

Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

Aligning large language models (LLMs) with human expectations without human-annotated preference data is an important problem. In this paper, we propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs, which could achieve better performance on LLaMA2-7B and LLaMA2-13B compared to RLAIF. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA). First, we use contrastive prompt pairs to automatically generate preference data. Then, we continue to evaluate the generated preference…Apple Machine Learning Research