Accelerate video Q&A workflows using Amazon Bedrock Knowledge Bases, Amazon Transcribe, and thoughtful UX design

Organizations are often inundated with video and audio content that contains valuable insights. However, extracting those insights efficiently and with high accuracy remains a challenge. This post explores an innovative solution to accelerate video and audio review workflows through a thoughtfully designed user experience that enables human and AI collaboration. By approaching the problem from the user’s point of view, we can create a powerful tool that allows people to quickly find relevant information within long recordings while guarding against the risk of AI hallucinations.

Many professionals, from lawyers and journalists to content creators and medical practitioners, need to review hours of recorded content regularly to extract verifiably accurate insights. Traditional methods of manual review or simple keyword searches over transcripts are time-consuming and often miss important context. More advanced AI-powered summarization tools exist, but they risk producing hallucinations or inaccurate information, which can be dangerous in high-stakes environments like healthcare or legal proceedings.

Our solution, the Recorded Voice Insight Extraction Webapp (ReVIEW), addresses these challenges by providing a seamless method for humans to collaborate with AI, accelerating the review process while maintaining accuracy and trust in the results. The application is built on top of Amazon Transcribe and Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

User experience

To accelerate a user’s review of a long-form audio or video file while mitigating the risk of hallucinations, we introduce the concept of timestamped citations. Large language models (LLMs) are capable not only of answering a user’s question based on the transcript of the file, but also of identifying the timestamp (or timestamps) in the transcript at which the answer was discussed. By using a combination of transcript preprocessing, prompt engineering, and structured LLM output, we enable the user experience shown in the following screenshot, which demonstrates the conversion of LLM-generated timestamp citations into clickable buttons (shown underlined in red) that navigate to the correct portion of the source video.

ReVIEW Application screenshot demonstrating citations linking to specific timestamps in a video.

The user in this example has uploaded a number of videos, including some recordings of AWS re:Invent talks. You’ll notice that the preceding answer actually contains a hallucination originating from an error in the transcript; the AI assistant replied that “Hyperpaths” was announced, when in reality the service is called Amazon SageMaker HyperPod.

The user in the preceding screenshot had the following journey:

  1. The user asks the AI assistant “What’s new with SageMaker?” The assistant searches the timestamped transcripts of the uploaded re:Invent videos.
  2. The assistant provides an answer with citations. Those citations contain both the name of the video and a timestamp, and the frontend displays buttons corresponding to the citations. Each citation can point to a different video, or to different timestamps within the same video.
  3. The user reads that SageMaker “Hyperpaths” was announced. They proceed to verify the accuracy of the generated answer by selecting the buttons, which auto play the source video starting at that timestamp.
  4. The user sees that the product is actually called Amazon SageMaker HyperPod, and can be confident that SageMaker HyperPod was the product announced at re:Invent.

This experience, which is at the heart of the ReVIEW application, enables users to efficiently get answers to questions based on uploaded audio or video files and to verify the accuracy of the answers by rewatching the source media for themselves.
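
The following is a minimal sketch of how a Streamlit frontend could render such clickable citations. It is illustrative only; the hard-coded answer and the get_presigned_media_url helper are hypothetical stand-ins, not the actual ReVIEW frontend code.

import streamlit as st

def get_presigned_media_url(media_name: str) -> str:
    """Hypothetical helper; the real app would return a playable pre-signed S3 URL."""
    return f"https://example.com/{media_name}"

# Hypothetical parsed LLM output: (partial answer, source media name, timestamp in seconds)
partial_answers = [
    ("Amazon SageMaker HyperPod was announced.", "reinvent-keynote.mp4", 1325),
]

for i, (text, media_name, ts) in enumerate(partial_answers):
    st.markdown(text)
    # One button per citation; clicking it cues the cited media at the cited timestamp
    if st.button(f"{media_name} @ {ts}s", key=f"citation-{i}"):
        st.session_state["selected_media"] = media_name
        st.session_state["start_time"] = ts

if "selected_media" in st.session_state:
    st.video(
        get_presigned_media_url(st.session_state["selected_media"]),
        start_time=st.session_state["start_time"],
    )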

Solution overview

The full code for this application is available on the GitHub repo.

The architecture of the solution is shown in the following diagram, showcasing the flow of data through the application.

Architecture diagram of ReVIEW application

The workflow consists of the following steps:

  1. A user accesses the application through an Amazon CloudFront distribution, which adds a custom header and forwards HTTPS traffic to an Elastic Load Balancing application load balancer. Behind the load balancer is a containerized Streamlit application running on Amazon Elastic Container Service (Amazon ECS).
  2. Amazon Cognito handles user logins to the frontend application and Amazon API Gateway.
  3. When a user uploads a media file through the frontend, a pre-signed URL is generated for the frontend to upload the file to Amazon Simple Storage Service (Amazon S3).
  4. The frontend posts the file to an application S3 bucket, at which point a file processing flow is initiated by a triggered AWS Lambda function. The file is sent to Amazon Transcribe and the resulting transcript is stored in Amazon S3 (see the first code sketch following this list). The transcript is postprocessed into a text form more appropriate for use by an LLM, and an AWS Step Functions state machine syncs the transcript to a knowledge base configured in Amazon Bedrock Knowledge Bases. The knowledge base sync process handles chunking the transcript, generating embeddings, and storing the embedding vectors and file metadata in an Amazon OpenSearch Serverless vector database.
  5. If a user asks a question of one specific transcript (designated by the “pick media file” dropdown menu in the UI), the entire transcript is used to generate the response, so a retrieval step using the knowledge base is not required and an LLM is called directly through Amazon Bedrock.
  6. If the user asks a question whose answer might appear in any number of source videos (by choosing Chat with all media files on the dropdown menu in the UI), the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API is used to embed the user query, find semantically similar chunks in the vector database, insert those chunks into an LLM prompt, and generate a specially formatted response (see the second code sketch following this list).
  7. Throughout the process, Amazon DynamoDB stores application data such as transcription and ingestion status, mappings of user names to uploaded files, and cached responses.
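
To make steps 3 and 4 concrete, the following is a minimal boto3 sketch of the pre-signed upload URL and transcription job calls. The bucket, key, and job names are placeholders and the surrounding Lambda wiring is omitted, so treat this as an illustration rather than the repository's exact implementation.

import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe")

# Step 3: generate a pre-signed PUT URL so the frontend can upload directly to S3
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "review-app-bucket", "Key": "uploads/my-talk.mp4"},  # placeholders
    ExpiresIn=3600,
)

# Step 4: once the object lands in S3, a triggered Lambda function starts a transcription job
transcribe.start_transcription_job(
    TranscriptionJobName="review-my-talk",  # placeholder job name
    Media={"MediaFileUri": "s3://review-app-bucket/uploads/my-talk.mp4"},
    OutputBucketName="review-app-bucket",
    IdentifyLanguage=True,  # or pass an explicit LanguageCode such as "en-US"
)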
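
Step 6 corresponds to a single RetrieveAndGenerate call. The following sketch shows the general shape of that call with boto3; the knowledge base ID and model ARN are placeholders, and the real application additionally injects the structured-output instructions described later in this post.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What's new with SageMaker?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0",
        },
    },
)

# The generated answer and the retrieved source chunks come back in the same response
print(response["output"]["text"])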

One important characteristic of the architecture is the clear separation of frontend and backend logic through a REST API deployed with API Gateway. This design decision enables users of this application to replace the Streamlit frontend with a custom frontend. Instructions for replacing the frontend are in the README of the GitHub repository.

Timestamped citations

The key to this solution lies in the prompt engineering and structured output format. When generating a response to a user’s question, the LLM is instructed to not only provide an answer to the question (if possible), but also to cite its sources in a specific way.

The full prompt can be seen in the GitHub repository, but a shortened pseudo prompt (for brevity) is shown here:

You are an intelligent AI which attempts to answer questions based on retrieved chunks of automatically generated transcripts.

Below are retrieved chunks of transcript with metadata including the file name. Each chunk includes a <media_name> and lines of a transcript, each line beginning with a timestamp.

$$ retrieved transcript chunks $$

Your answer should be in json format, including a list of partial answers, each of which has a citation. The citation should include the source file name and timestamp. Here is the user’s question:

$$ user question $$

The frontend then parses the LLM response into a fixed schema data model, described with Pydantic BaseModels:

from typing import List

from pydantic import BaseModel

class Citation(BaseModel):
    """A single citation from a transcript"""
    media_name: str
    timestamp: int

class PartialQAnswer(BaseModel):
    """Part of a complete answer, to be concatenated with other partial answers"""
    partial_answer: str
    citations: List[Citation]

class FullQAnswer(BaseModel):
    """Full user query response including citations and one or more partial answers"""
    answer: List[PartialQAnswer]

This format allows the frontend to parse the response and display buttons for each citation that cue up the relevant media segment for user review.
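
For example, assuming Pydantic v2 and the models defined above, parsing a raw LLM response might look like the following; the JSON string here is a made-up illustration:

# Hypothetical raw LLM output that follows the schema above
raw_llm_output = (
    '{"answer": [{"partial_answer": "Amazon SageMaker HyperPod was announced.", '
    '"citations": [{"media_name": "reinvent-keynote.mp4", "timestamp": 1325}]}]}'
)

parsed = FullQAnswer.model_validate_json(raw_llm_output)
for part in parsed.answer:
    for citation in part.citations:
        # Each citation becomes a button that cues the media at citation.timestamp
        print(part.partial_answer, citation.media_name, citation.timestamp)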

Deployment details

The solution is deployed in the form of one AWS Cloud Development Kit (AWS CDK) stack, which contains four nested stacks:

  • A backend stack that handles transcribing uploaded media and tracking job statuses
  • A Retrieval Augmented Generation (RAG) stack that handles setting up OpenSearch Serverless and Amazon Bedrock Knowledge Bases
  • An API stack that stands up an Amazon Cognito authorized REST API and various Lambda functions to logically separate the frontend from the backend
  • A frontend stack that consists of a containerized Streamlit application running as a load balanced service in an ECS cluster, with a CloudFront distribution connected to the load balancer
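
The composition looks roughly like the following AWS CDK (Python) sketch; the class and construct names are illustrative placeholders rather than the repository's actual class names:

from aws_cdk import App, Stack, NestedStack
from constructs import Construct

class BackendStack(NestedStack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Transcription pipeline resources (S3, Lambda, Step Functions, DynamoDB) go here

class ReVIEWParentStack(Stack):
    """Parent stack that composes the nested backend, RAG, API, and frontend stacks."""
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        BackendStack(self, "Backend")
        # RagStack, ApiStack, and FrontendStack would be instantiated here the same way

app = App()
ReVIEWParentStack(app, "ReVIEW")
app.synth()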

Prerequisites

The solution requires the following prerequisites:

  • You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?
  • You also need to request access to at least one Amazon Bedrock LLM (to generate answers to questions) and one embedding model (to find transcript chunks that are semantically similar to a user question). The following Amazon Bedrock models are the defaults, but they can be changed using a configuration file at application deployment time, as described later in this post:
    • Amazon Titan Embeddings V2 – Text
    • Amazon Nova Pro
  • You need a Python environment with AWS CDK dependencies installed. For instructions, see Working with the AWS CDK in Python.
  • Docker is required to build the Streamlit frontend container at deployment time.
  • The minimal IAM permissions needed to bootstrap and deploy the AWS CDK are described in the ReVIEW/infra/minimal-iam-policy.json file in the GitHub repository. Make sure the IAM user or role deploying the stacks has these permissions.

Clone the repository

Fork the repository, and clone it to the location of your choice. For example:

$ git clone https://github.com/aws-samples/recorded-voice-insight-extraction-webapp.git

Edit the deployment config file

Optionally, edit the infra/config.yaml file to provide a descriptive base name for your stack. In this file, you can also choose specific Amazon Bedrock embedding models for semantic retrieval and LLMs for response generation, define chunking strategies for the knowledge base that ingests transcriptions of uploaded media files, and reuse an existing Amazon Cognito user pool if you want to bootstrap your application with an existing user base.

Deploy the AWS CDK stacks

Deploy the AWS CDK stacks with the following code:

$ cd infra
$ cdk bootstrap
$ cdk deploy --all

You only need to run the cdk bootstrap command one time per AWS account. The cdk deploy command deploys the parent stack and its four nested stacks. The process takes approximately 20 minutes to complete.

When the deployment is complete, a CloudFront distribution URL of the form xxx.cloudfront.net will be printed on the console screen to access the application. This URL can also be found on the AWS CloudFormation console: locate the stack whose name matches the value in the config file, then choose the Outputs tab and find the value associated with the key ReVIEWFrontendURL. That URL will lead you to a login screen like the following screenshot.

Screenshot of ReVIEW application login page.

Create an Amazon Cognito user to access the app

To log in to the running web application, you have to create an Amazon Cognito user. Complete the following steps:

  1. On the Amazon Cognito console, navigate to the recently created user pool.
  2. In the Users section under User Management, choose Create user.
  3. Create a user name and password to log in to the ReVIEW application deployed in the account.

When the application deployment is destroyed (as described in the cleanup section), the Amazon Cognito pool remains to preserve the user base. The pool can be fully removed manually using the Amazon Cognito console.

Test the application

Test the application by uploading one or more audio or video files on the File Upload tab. The application supports the media formats that Amazon Transcribe accepts. If you are looking for a sample video, consider downloading a TED talk. After uploading, you will see the file appear on the Job Status tab, where you can track its progress through the transcription, postprocessing, and knowledge base syncing steps. After at least one file is marked Complete, you can chat with it on the Chat With Your Media tab.

The Analyze Your Media tab allows you to create and apply custom LLM template prompts to individual uploaded files. For example, you can create a basic summary template or an extract key information template and apply it to your uploaded files here. This functionality is not described in detail in this post.

Clean up

The deployed application will incur ongoing costs even if it isn’t used, for example from OpenSearch Serverless indexing and search OCU minimums. To delete all resources created when deploying the application, run the following command:

$ cdk destroy --all

Conclusion

The solution presented in this post demonstrates a powerful pattern for accelerating video and audio review workflows while maintaining human oversight. By combining the power of AI models in Amazon Bedrock with human expertise, you can create tools that not only boost productivity but also maintain the critical element of human judgment in important decision-making processes.

We encourage you to explore this fully open source solution, adapt it to your specific use cases, and provide feedback on your experiences.

For expert assistance, the AWS Generative AI Innovation Center, AWS Professional Services, and our AWS Partners are here to help.


About the Author

David Kaleko is a Senior Applied Scientist in the AWS Generative AI Innovation Center.
