Businesses today heavily rely on video conferencing platforms for effective communication, collaboration, and decision-making. However, despite the convenience these platforms offer, there are persistent challenges in seamlessly integrating them into existing workflows. One of the major pain points is the lack of comprehensive tools to automate the process of joining meetings, recording discussions, and extracting actionable insights from them. This gap results in inefficiencies, missed opportunities, and limited productivity, hindering the seamless flow of information and decision-making processes within organizations.
To address this challenge, we’ve developed the Amazon Chime SDK Meeting Summarizer application deployed with the AWS Cloud Development Kit (AWS CDK). This application uses an Amazon Chime SDK SIP media application, Amazon Transcribe, and Amazon Bedrock to seamlessly join meetings, record meeting audio, and process recordings for transcription and summarization. By integrating these services programmatically through the AWS CDK, we aim to streamline the meeting workflow, empower users with actionable insights, and drive better decision-making outcomes. Our solution currently integrates with popular platforms such as Amazon Chime, Zoom, Cisco Webex, Microsoft Teams, and Google Meet.
In addition to deploying the solution, we’ll also teach you the intricacies of prompt engineering in this post. We guide you through addressing parsing and information extraction challenges, including speaker diarization, call scheduling, summarization, and transcript cleaning. Through detailed instructions and structured approaches tailored to each use case, we illustrate the effectiveness of Amazon Bedrock, powered by Anthropic Claude models.
Solution overview
The following infrastructure diagram provides an overview of the AWS services that are used to create this meeting summarization bot. The core services used in this solution are:
- An Amazon Chime SDK SIP Media Application is used to dial into the meeting and record meeting audio
- Amazon Transcribe is used to perform speech-to-text processing of the recorded audio, including speaker diarization
- Anthropic Claude models in Amazon Bedrock are used to identify names, improve the quality of the transcript, and provide a detailed summary of the meeting
For a detailed explanation of the solution, refer to the Amazon Chime Meeting Summarizer documentation.
Prerequisites
Before diving into the project setup, make sure you have the following requirements in place:
- Yarn – Yarn must be installed on your machine.
- AWS account – You’ll need an active AWS account.
- Enable Anthropic Claude models – These models should be enabled in your AWS account. For more information, see Model access.
- Enable Amazon Titan – Amazon Titan should be activated in your AWS account. For more information, see Amazon Titan Models.
Refer to our GitHub repository for a step-by-step guide on deploying this solution.
Access the meeting with Amazon Chime SDK
To capture the audio of a meeting, an Amazon Chime SDK SIP media application will dial into the meeting using the meeting provider’s dial-in number. The Amazon Chime SDK SIP media application (SMA) is a programmable telephony service that will make a phone call over the public switched telephone network (PSTN) and capture the audio. SMA uses a request/response model with an AWS Lambda function to process actions. In this demo, an outbound call is made using the CreateSipMediaApplicationCall API. This will cause the Lambda function to be invoked with a NEW_OUTBOUND_CALL event.
Because most dial-in mechanisms for meetings require a PIN or other identification to be entered, the SMA will use the SendDigits action to send these dual tone multi-frequency (DTMF) digits to the meeting provider. When the application has joined the meeting, it will introduce itself using the Speak action and then record the audio using the RecordAudio action. This audio will be saved in WAV format to an Amazon Simple Storage Service (Amazon S3) bucket.
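The request/response flow described above can be sketched as a pure function that maps an SMA invocation event to a list of actions. This is a minimal sketch, not the code from the sample repository: the `JoinConfig` type, the greeting text, the voice, and the choice to send all three actions when the `CALL_ANSWERED` event arrives are assumptions made for illustration.

```typescript
// Minimal shapes for the parts of the SMA event and response used in this sketch.
interface SmaEvent {
  InvocationEventType: string;
  CallDetails: { Participants: { CallId: string }[] };
}

interface SmaAction {
  Type: string;
  Parameters: Record<string, unknown>;
}

interface SmaResponse {
  SchemaVersion: '1.0';
  Actions: SmaAction[];
}

// Hypothetical configuration, not taken from the sample repository.
interface JoinConfig {
  meetingPin: string; // DTMF digits to send, for example the meeting ID followed by '#'
  bucketName: string; // S3 bucket that receives the recording
  prefix: string; // key prefix for the recording
}

function handleSmaEvent(event: SmaEvent, config: JoinConfig): SmaResponse {
  const callId = event.CallDetails.Participants[0]?.CallId ?? '';
  let actions: SmaAction[] = [];

  if (event.InvocationEventType === 'CALL_ANSWERED') {
    // Assumption: once the provider answers, send the PIN, announce the bot,
    // and start recording. Actions in one response run in order.
    actions = [
      {
        Type: 'SendDigits',
        Parameters: { CallId: callId, Digits: config.meetingPin, ToneDurationInMilliseconds: 100 },
      },
      {
        Type: 'Speak',
        Parameters: {
          CallId: callId,
          Text: 'Hello, I am a meeting summarizer bot and I am recording this call.',
          Engine: 'neural',
          LanguageCode: 'en-US',
          VoiceId: 'Joanna',
        },
      },
      {
        Type: 'RecordAudio',
        Parameters: {
          CallId: callId,
          RecordingTerminators: ['#'],
          RecordingDestination: { Type: 'S3', BucketName: config.bucketName, Prefix: config.prefix },
        },
      },
    ];
  }
  // NEW_OUTBOUND_CALL and any other event: return no actions and wait for the next event.
  return { SchemaVersion: '1.0', Actions: actions };
}
```

Keeping the handler free of AWS SDK calls makes the action sequence easy to unit test before wiring it into the Lambda function.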
Speaker diarization with Amazon Transcribe
Because the SMA is joined to the meeting as a participant, the audio will be a single channel of all the participants. To process this audio file, Amazon Transcribe will be used with the ShowSpeakerLabels setting:
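// Requires: import { StartTranscriptionJobCommand } from '@aws-sdk/client-transcribe';
const response = await transcribeClient.send(
  new StartTranscriptionJobCommand({
    TranscriptionJobName: jobName,
    IdentifyLanguage: true, // detect the spoken language automatically
    MediaFormat: 'wav',
    Media: {
      MediaFileUri: audioSource, // S3 URI of the recording
    },
    Settings: {
      ShowSpeakerLabels: true, // enable speaker diarization
      MaxSpeakerLabels: 10, // upper bound on distinct speakers
    },
    OutputBucketName: `${BUCKET}`,
    OutputKey: `${PREFIX_TRANSCRIBE_S3}/`,
  }),
);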
With speaker diarization, Amazon Transcribe will distinguish different speakers in the transcription output. The JSON file that is produced will include the transcripts and items along with speaker labels grouped by speaker with start and end timestamps. With this information provided by Amazon Transcribe, a turn-by-turn transcription can be generated by parsing through the JSON. The result will be a more readable transcription. See the following example:
spk_0: Hey Court , how’s it going ?
spk_1: Hey Adam, it’s going good . How are you
spk_0: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .
spk_1: Awesome . Yeah , thank you for inviting me . So ,
spk_0: uh can you tell me a little bit about uh the servers you’re currently using on premises ?
spk_1: Yeah . So for our servers , we currently have four web servers running with four gigabyte of RA M and two CP US and we’re currently running Linux as our operating system .
spk_0: Ok . And what are you using for your database ?
spk_1: Oh , yeah , for a database , we currently have a 200 gigabyte database running my as to will and I can’t remember the version . But um the thing about our database is sometimes it lags . So we’re really looking for a faster option for
spk_0: that . So , um when you’re , you’re noticing lags with reads or with rights to the database
spk_1: with , with reads .
spk_0: Yeah . Ok . Have you heard of uh read replicas ?
spk_1: I have not .
spk_0: Ok . That could be a potential uh solution to your problem . Um If you don’t mind , I’ll go ahead and just uh send you some information on that later for you and your team to review .
spk_1: Oh , yeah , thank you , Adam . Yeah . Anything that could help . Yeah , be really helpful .
spk_0: Ok , last question before I let you go . Um what are you doing uh to improve the security of your on premises ? Uh data ?
spk_1: Yeah , so , so number one , we have been experiencing some esto injection attacks . We do have a Palo Alto firewall , but we’re not able to fully achieve layer server protection . So we do need a better option for that .
spk_0: Have you ex have you experienced uh my sequel attacks in the past or sequel injections ?
spk_1: Yes.
spk_0: Ok , great . Uh All right . And then are you guys bound by any compliance standards like PC I DS S or um you know GDR ? Uh what’s another one ? Any of those C just uh
spk_1: we are bound by fate , moderate complaints . So .
spk_0: Ok . Well , you have to transition to fed ramp high at any time .
spk_1: Uh Not in the near future . No .
spk_0: Ok . All right , Court . Well , thanks for that uh context . I will be reaching out to you again here soon with a follow up email and uh yeah , I’m looking forward to talking to you again uh next week .
spk_1: Ok . Sounds good . Thank you , Adam for your help .
spk_0: All right . Ok , bye .
spk_1: All right . Take care .
Here, speakers have been identified based on the order they spoke. Next, we show you how to further enhance this transcription by identifying speakers using their names, rather than spk_0, spk_1, and so on.
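One way to build a turn-by-turn transcript like the one above is to walk the items array in the Amazon Transcribe output and start a new line whenever the speaker label changes. The following is a minimal sketch under two assumptions: each item carries its own speaker_label field (items without one, such as some punctuation, inherit the previous speaker), and only the first alternative of each item is used. The function name toTurnByTurn is hypothetical.

```typescript
// Subset of the Amazon Transcribe output JSON used by this sketch.
interface TranscribeItem {
  type: 'pronunciation' | 'punctuation';
  alternatives: { content: string }[];
  speaker_label?: string;
}

interface TranscribeOutput {
  results: { items: TranscribeItem[] };
}

// Group consecutive items from the same speaker into a single turn.
function toTurnByTurn(output: TranscribeOutput): string {
  const turns: { speaker: string; words: string[] }[] = [];
  let lastSpeaker = 'unknown';

  for (const item of output.results.items) {
    const content = item.alternatives[0]?.content;
    if (content === undefined) continue;

    // Items without a label stay with the speaker of the previous item.
    const speaker = item.speaker_label ?? lastSpeaker;
    lastSpeaker = speaker;

    const current = turns[turns.length - 1];
    if (current && current.speaker === speaker) {
      current.words.push(content);
    } else {
      turns.push({ speaker, words: [content] });
    }
  }

  // Joining every item with a space leaves a space before punctuation,
  // which matches the raw look of the example transcript.
  return turns.map((turn) => `${turn.speaker}: ${turn.words.join(' ')}`).join('\n');
}
```

Because diarization works on audio segments rather than sentences, a turn can end mid-sentence, as in the example; that is addressed later when the transcript is cleaned.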
Use Anthropic Claude and Amazon Bedrock to enhance the solution
This application uses a large language model (LLM) to complete the following tasks:
- Speaker name identification
- Transcript cleaning
- Call summarization
- Meeting invitation parsing
Prompt engineering for speaker name identification
The first task is to enhance the transcription by assigning names to speaker labels. These names are extracted from the transcript itself when a person introduces themselves and then are returned as output in JSON format by the LLM.
Special instructions are provided for cases where only one speaker is identified to provide consistency in the response structure. By following these instructions, the LLM will process the meeting transcripts and extract the names of the speakers without additional words or spacing in the response. If no names are identified by the LLM, we prompt the model to return an Unknown tag.
In this demo, the prompts are designed using Anthropic Claude Sonnet as the LLM. You may need to tune the prompts to modify the solution to use another available model on Amazon Bedrock.
Human: You are a meeting transcript names extractor. Go over the transcript and extract the names from it. Use the following instructions in the <instructions></instructions> xml tags
<transcript> ${transcript} </transcript>
<instructions>
– Extract the names like this example – spk_0: “name1”, spk_1: “name2”.
– Only extract the names like the example above and do not add any other words to your response
– Your response should only have a list of “speakers” and their associated name separated by a “:” surrounded by {}
– if there is only one speaker identified then surround your answer with {}
– the format should look like this {“spk_0” : “Name”, “spk_1” : “Name2”, etc.}, no unnecessary spacing should be added
</instructions>
Assistant: Should I add anything else in my answer?
Human: Only return a JSON formatted response with the Name and the speaker label associated to it. Do not add any other words to your answer. Do NOT EVER add any introductory sentences in your answer. Only give the names of the speakers actively speaking in the meeting. Only give the names of the speakers actively speaking in the meeting in the format shown above.
Assistant:
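The prompts in this post are written as Human: and Assistant: turns. Claude 3 models on Amazon Bedrock are invoked through the Anthropic Messages API, so a turn-based prompt has to be converted into a messages array first. The following is a minimal sketch of that conversion; the function name toMessagesBody and the default max_tokens value are assumptions, and the sample repository may structure its requests differently.

```typescript
type Role = 'user' | 'assistant';

interface Message {
  role: Role;
  content: { type: 'text'; text: string }[];
}

interface ClaudeRequestBody {
  anthropic_version: 'bedrock-2023-05-31';
  max_tokens: number;
  messages: Message[];
}

// Convert a "Human: ... Assistant: ..." prompt into a Messages API request body.
function toMessagesBody(prompt: string, maxTokens = 1024): ClaudeRequestBody {
  // Splitting with a capture group keeps the markers:
  // ['', 'Human', ' text', 'Assistant', ' text', ...]
  const parts = prompt.split(/^(Human|Assistant):/m);
  const messages: Message[] = [];

  for (let i = 1; i < parts.length; i += 2) {
    const role: Role = parts[i] === 'Human' ? 'user' : 'assistant';
    const text = (parts[i + 1] ?? '').trim();
    // The trailing empty "Assistant:" marks where the model answers, so it is skipped.
    if (text.length === 0) continue;
    messages.push({ role, content: [{ type: 'text', text }] });
  }

  return { anthropic_version: 'bedrock-2023-05-31', max_tokens: maxTokens, messages };
}
```

The resulting object can be serialized with JSON.stringify and passed as the body of an InvokeModelCommand from @aws-sdk/client-bedrock-runtime. Split the prompt template before substituting the transcript, so that a transcript line that happens to start with Human: is not mistaken for a turn marker.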
After the speakers are identified and returned in JSON format, we can replace the generic speaker labels in the transcript with the speakers’ names. The result is a more readable transcription:
Adam: Hey Court , how’s it going ?
Court: Hey Adam , it’s going good . How are you
Adam: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .
Court: Awesome . Yeah , thank you for inviting me . So ,
Adam: uh can you tell me a little bit about uh the servers you’re currently using on premises ?
Court: Yeah . So for our servers , we currently have four web servers running with four gigabyte of RA M and two CP US and we’re currently running Linux as our operating system .
… transcript continues….
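The label replacement can be sketched as a small function that parses the JSON returned by the LLM and rewrites the label at the start of each line. This is a minimal sketch, assuming the response is valid JSON with straight quotes and that every turn starts with a label such as spk_0 followed by a colon; the function name applySpeakerNames is hypothetical.

```typescript
// Replace "spk_N:" prefixes with the names returned by the LLM.
function applySpeakerNames(transcript: string, llmResponse: string): string {
  const names = JSON.parse(llmResponse) as Record<string, string>;

  return transcript
    .split('\n')
    .map((line) => {
      const match = /^(spk_\d+):/.exec(line);
      if (!match) return line;

      const label = match[1] ?? '';
      const name = names[label];
      // Leave the line untouched when the LLM returned no name for this label.
      return name ? name + line.slice(label.length) : line;
    })
    .join('\n');
}

const labeled = applySpeakerNames(
  'spk_0: Hey Court , how’s it going ?\nspk_1: Hey Adam , it’s going good .',
  '{"spk_0": "Adam", "spk_1": "Court"}',
);
console.log(labeled);
```

Only the label at the start of a line is replaced, so a name or label that appears inside the spoken text is left alone.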
But what if a speaker can’t be identified because they never introduced themselves? In that case, we want the LLM to leave them as unknown, rather than try to force or hallucinate a label.
We can add the following to the instructions:
If no name is found for a speaker, use UNKNOWN_X where X is the speaker label number
The following transcript has three speakers, but only two of them can be identified by name. The LLM must label the third speaker as UNKNOWN rather than forcing a name or other response on the speaker.
spk_0: Yeah .
spk_1: Uh Thank you for joining us Court . I am your account executive here at Aws . Uh Joining us on the call is Adam , one of our solutions architect at Aws . Adam would like to introduce yourself .
spk_2: Hey , everybody . High Court . Good to meet you . Uh As your dedicated solutions architect here at Aws . Uh my job is to help uh you at every step of your cloud journey . Uh My , my role involves helping you understand architecting on Aws , including best practices for the cloud . Uh So with that , I’d love it if you could just take a minute to introduce yourself and maybe tell me a little about what you’re currently running on premises .
spk_0: Yeah , great . It’s uh great to meet you , Adam . Um My name is Court . I am the V P of engineering here at any company . And uh yeah , really excited to hear what you can do for us .
spk_1: Thanks , work . Well , we , we talked a little bit of last time about your goals for migrating to Aws . I invited Adam to , to join us today to get a better understanding of your current infrastructure and other technical requirements .
spk_2: Yeah . So co could you tell me a little bit about what you’re currently running on premises ?
spk_1: Sure . Yeah , we’re , uh ,
spk_0: we’re running a three tier web app on premise .
When we give Claude Sonnet the option to not force a name, we see results like this:
{“spk_0”: “Court”, “spk_1”: “UNKNOWN_1”, “spk_2”: “Adam”}
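Even with this instruction, it can be worth guarding against a response that omits a label or returns an empty name. The following is a minimal sketch of such a guard, assuming the list of speaker labels is already known from the transcript; the function name fillUnknownSpeakers is hypothetical and the check is not part of the prompt itself.

```typescript
// Make sure every speaker label has a name, falling back to UNKNOWN_X.
function fillUnknownSpeakers(
  labels: string[],
  names: Record<string, string>,
): Record<string, string> {
  const result: Record<string, string> = {};

  for (const label of labels) {
    const name = names[label]?.trim();
    // "spk_1" becomes "UNKNOWN_1", mirroring the instruction given to the LLM.
    const index = label.replace(/^spk_/, '');
    result[label] = name ? name : `UNKNOWN_${index}`;
  }

  return result;
}
```

Applying the same UNKNOWN_X convention in code keeps the downstream transcript consistent whether the placeholder came from the model or from the guard.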
Prompt engineering for cleaning the transcript
Now that the transcript has been diarized and the speaker labels have been replaced with names, we can use Amazon Bedrock to clean the transcript. Cleaning tasks include eliminating distracting filler words, identifying and correcting mistranscribed homophones, and addressing any diarization errors stemming from subpar audio quality. For guidance on accomplishing these tasks using Anthropic Claude Sonnet models, see the provided prompt:
Human: You are a transcript editor, please follow the <instructions> tags.
<transcript> ${transcript} </transcript>
<instructions>
– The <transcript> contains a speaker diarized transcript
– Go over the transcript and remove all filler words. For example “um, uh, er, well, like, you know, okay, so, actually, basically, honestly, anyway, literally, right, I mean.”
– Fix any errors in transcription that may be caused by homophones based on the context of the sentence. For example, “one instead of won” or “high instead of hi”
– In addition, please fix the transcript in cases where diarization is improperly performed. For example, in some cases you will see that sentences are split between two speakers. In this case infer who the actual speaker is and attribute it to them.
– Please review the following example of this,
Input Example
Court: Adam you are saying the wrong thing. What
Adam: um do you mean, Court?
Output:
Court: Adam you are saying the wrong thing.
Adam: What do you mean, Court?
– In your response, return the entire cleaned transcript, including all of the filler word removal and the improved diarization. Only return the transcript, do not include any leading or trailing sentences. You are not summarizing. You are cleaning the transcript. Do not include any xml tags <>
</instructions>
Assistant:
Transcript Processing
After the initial transcript is passed into the LLM, it returns a polished transcript, free from errors. The following are excerpts from the transcript:
Speaker Identification
Input:
spk_0: Hey Court , how’s it going ?
spk_1: Hey Adam, it’s going good . How are you
spk_0: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .
Output:
Adam: Hey Court, how’s it going?
Court: Hey Adam, it’s going good. How are you?
Adam: Thanks for joining me today. I’m excited to talk to you about architecting on AWS
Homophone Replacement
Input:
Adam: Have you ex have you experienced uh my sequel attacks in the past or sequel injections ?
Court: Yes .
Output:
Adam: Have you experienced SQL injections in the past?
Court: Yes.
Filler Word Removal
Input:
Adam: Ok , great . Uh All right . And then are you guys bound by any compliance standards like PC I DS S or um you know GDR ? Uh what’s another one ? Any of those C just uh
Court: we are bound by fate , moderate complaints . So .
Adam: Ok . Well , you have to transition to fed ramp high at any time
Output:
Adam: Ok, great. And then are you guys bound by any compliance standards like PCI DSS or GDPR? What’s another one? Any of those? CJIS?
Court: We are bound by FedRAMP moderate compliance.
Adam: Ok. Will you have to transition to FedRAMP high at any time?
Prompt engineering for summarization
Now that the transcript has been created by Amazon Transcribe, diarized, and enhanced with Amazon Bedrock, the transcript can be summarized with Amazon Bedrock and an LLM. A simple version might look like the following:
Human:
You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the transcript.
Transcript: ${transcript}
Assistant:
Although this will work, it can be improved with additional instructions. You can use XML tags to provide structure and clarity to the task requirements:
Human: You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the content within the <instructions> tags.
<transcript> ${transcript} </transcript>
<instructions>
– Go over the conversation that was had in the transcript.
– Create a summary based on what occurred in the meeting.
– Highlight specific action items that came up in the meeting, including follow-up tasks for each person.
</instructions>
Assistant:
Because meetings often involve action items and date/time information, instructions are added that explicitly ask the LLM to include this information. Because the LLM knows the speaker names, it can assign action items to specific people when any are found. To prevent hallucinations, an instruction is included that allows the LLM to fail gracefully.
Human: You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the content within the <instructions> tags.
<transcript> ${transcript} </transcript>
<instructions>
– Go over the conversation that was had in the transcript.
– Create a summary based on what occurred in the meeting.
– Highlight specific action items that came up in the meeting, including follow-up tasks for each person.
– If relevant, focus on what specific AWS services were mentioned during the conversation.
– If there’s sufficient context, infer the speaker’s role and mention it in the summary. For instance, “Bob, the customer/designer/sales rep/…”
– Include important date/time, and indicate urgency if necessary (e.g., deadline/ETAs for outcomes and next steps)
</instructions>
Assistant: Should I add anything else in my answer?
Human: If there is not enough context to generate a proper summary, then just return a string that says “Meeting not long enough to generate a transcript.”
Assistant:
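On the application side, the graceful-failure string from the prompt above can be detected so that a placeholder is not stored or displayed as if it were a real summary. The following is a minimal sketch; the SummaryResult type and the interpretSummary function are hypothetical, and matching on the phrase case-insensitively is an assumption about how closely the model echoes the requested string.

```typescript
// The sentinel phrase requested in the prompt, lowercased and without punctuation.
const FALLBACK_PHRASE = 'meeting not long enough to generate a transcript';

type SummaryResult =
  | { ok: true; summary: string }
  | { ok: false; reason: string };

// Decide whether the LLM returned a usable summary or the graceful-failure string.
function interpretSummary(llmResponse: string): SummaryResult {
  const text = llmResponse.trim();

  if (text.length === 0) {
    return { ok: false, reason: 'empty response' };
  }
  if (text.toLowerCase().includes(FALLBACK_PHRASE)) {
    return { ok: false, reason: 'not enough context to summarize' };
  }
  return { ok: true, summary: text };
}
```

Returning a tagged result instead of a bare string forces the caller to handle the short-meeting case explicitly.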
Alternatively, we invite you to explore Amazon Transcribe Call Analytics generative call summarization for an out-of-the-box solution that integrates directly with Amazon Transcribe.
Prompt engineering for meeting invitation parsing
Included in this demo is a React-based UI that will start the process of the SMA joining the meeting. Because this demo supports multiple meeting types, the invitation must be parsed. Rather than parse this with a complex regular expression (regex), it will be processed with an LLM. This prompt will first identify the meeting type: Amazon Chime, Zoom, Google Meet, Microsoft Teams, or Cisco Webex. Based on the meeting type, the LLM will extract the meeting ID and dial-in information. Simply copy and paste the meeting invitation into the UI, and the invitation will be processed by the LLM to determine how to call the meeting provider. This works for a meeting that is in progress or one that is scheduled for the future. See the following example:
Human: You are an information extracting bot. Go over the meeting invitation below and determine what the meeting id and meeting type are. Use the following instructions in the <instructions></instructions> xml tags
<meeting_invitation>
${meetingInvitation}
</meeting_invitation>
<instructions>
1. Identify Meeting Type:
Determine if the meeting invitation is for Chime, Zoom, Google, Microsoft Teams, or Webex meetings.
2. Chime, Zoom, and Webex
– Find the meetingID
– Remove all spaces from the meeting ID (e.g., #### ## #### -> ##########).
3. If Google – Instructions Extract Meeting ID and Dial in
– For Google only, the meeting invitation will call a meetingID a ‘pin’, so treat it as a meetingID
– Remove all spaces from the meeting ID (e.g., #### ## #### -> ##########).
– Extract Google and Microsoft Dial-In Number (if applicable):
– If the meeting is a Google meeting, extract the unique dial-in number.
– Locate the dial-in number following the text “to join by phone dial.”
– Format the extracted Google dial-in number as (+1 ###-###-####), removing dashes and spaces. For example +1 111-111-1111 would become +11111111111)
4. If Microsoft Teams – Instructions if meeting type is Microsoft Teams.
– Pay attention to these instructions carefully
– The meetingId we want to store in the generated response is the ‘Phone Conference ID’ : ### ### ###
– in the meeting invitation, there are two IDs a ‘Meeting ID’ (### ### ### ##) and a ‘Phone Conference ID’ (### ### ###), ignore the ‘Meeting ID’ use the ‘Phone Conference ID’
– The meetingId we want to store in the generated response is the ‘Phone Conference ID’ : ### ### ###
– The meetingID that we want is referenced as the ‘Phone Conference ID’ store that one as the meeting ID.
– Find the phone number, extract it and store it as the dialIn number (format (+1 ###-###-####), removing dashes and spaces. For example +1 111-111-1111 would become +11111111111)
5. meetingType rules
– The only valid responses for meetingType are ‘Chime’, ‘Webex’, ‘Zoom’, ‘Google’, ‘Teams’
6. Generate Response:
– Create a response object with the following format:
{
meetingId: “meeting id goes here with spaces removed”,
meetingType: “meeting type goes here (options: ‘Chime’, ‘Webex’, ‘Zoom’, ‘Google’, ‘Teams’)”,
dialIn: “Insert Google/Microsoft Teams Dial-In number with no dashes or spaces, or N/A if not a Google/Microsoft Teams Meeting”
}
Meeting ID Formats:
Zoom: ### #### ####
Webex: #### ### ####
Chime: #### ## ####
Google: ### ### ####
Teams: ### ### ###
Ensure that the program does not create fake phone numbers and only includes the Microsoft or Google dial-in number if the meeting type is Google or Teams.
</instructions>
Assistant: Should I add anything else in my answer?
Human: Only return a JSON formatted response with the meetingid and meeting type associated to it. Do not add any other words to your answer. Do not add any introductory sentences in your answer.
Assistant:
With this information extracted from the invitation, a call is placed to the meeting provider so that the SMA can join the meeting as a participant.
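Before placing the call, it is prudent to validate what the LLM returned rather than trust it blindly. The following is a minimal sketch of such a check, assuming the response is valid JSON with the meetingId, meetingType, and dialIn fields requested in the prompt. The parseMeetingInfo function, the digits-only rule for meeting IDs, and the phone number length check are assumptions made for illustration.

```typescript
type MeetingType = 'Chime' | 'Webex' | 'Zoom' | 'Google' | 'Teams';

interface MeetingInfo {
  meetingId: string;
  meetingType: MeetingType;
  dialIn: string | null; // only set for Google and Teams meetings
}

const MEETING_TYPES: readonly string[] = ['Chime', 'Webex', 'Zoom', 'Google', 'Teams'];

// Parse and validate the JSON returned by the invitation-parsing prompt.
function parseMeetingInfo(llmResponse: string): MeetingInfo {
  const raw = JSON.parse(llmResponse) as Record<string, unknown>;

  const meetingType = String(raw['meetingType'] ?? '');
  if (!MEETING_TYPES.includes(meetingType)) {
    throw new Error(`Unsupported meeting type: ${meetingType}`);
  }

  // Remove any spaces the LLM left in the ID, then require digits only (assumption).
  const meetingId = String(raw['meetingId'] ?? '').replace(/\s+/g, '');
  if (!/^\d+$/.test(meetingId)) {
    throw new Error('Meeting ID must contain only digits');
  }

  let dialIn: string | null = null;
  if (meetingType === 'Google' || meetingType === 'Teams') {
    // Strip spaces, dashes, and parentheses: "+1 111-111-1111" becomes "+11111111111".
    const candidate = String(raw['dialIn'] ?? '').replace(/[\s()-]/g, '');
    if (!/^\+\d{8,15}$/.test(candidate)) {
      throw new Error('Missing or malformed dial-in number');
    }
    dialIn = candidate;
  }

  return { meetingId, meetingType: meetingType as MeetingType, dialIn };
}
```

Rejecting a malformed response here avoids dialing a number or sending DTMF digits that the model may have invented.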
Clean up
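If you deployed this sample solution, clean up your resources by destroying the AWS CDK application from the command line. AWS CDK applications are typically removed with the cdk destroy command, run from the project directory; refer to the GitHub repository for the exact command for this project.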
Conclusion
In this post, we showed how to enhance Amazon Transcribe with an LLM using Amazon Bedrock by extracting information that would otherwise be difficult for a regex to extract. We also used this method to extract information from a meeting invitation sent from an unknown source. Finally, we showed how to use an LLM to provide a summarization of the meeting using detailed instructions to produce action items and include date/time information in the response.
We invite you to deploy this demo into your own account. We’d love to hear from you. Let us know what you think in the issues forum of the Amazon Chime SDK Meeting Summarizer GitHub repository. Alternatively, we invite you to explore other methods for meeting transcription and summarization, such as Amazon Live Meeting Assistant, which uses a browser extension to collect call audio.
About the authors
Adam Neumiller is a Solutions Architect for AWS. He is focused on helping public sector customers drive cloud adoption through the use of infrastructure as code. Outside of work, he enjoys spending time with his family and exploring the great outdoors.
Court Schuett is a Principal Specialist SA – GenAI focused on third party models and how they can be used to help solve customer problems. When he’s not coding, he spends his time exploring parks, traveling with his family, and listening to music.
Christopher Lott is a Principal Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California, and enjoys gardening, cooking, aerospace/general aviation, and traveling the world.
Hang Su is a Senior Applied Scientist at AWS AI. He has been leading the AWS Transcribe Contact Lens Science team. His interests lie in call-center analytics, LLM-based abstractive summarization, and general conversational AI.
Jason Cai is an Applied Scientist at AWS AI. He has made contributions to AWS Bedrock, Contact Lens, Lex and Transcribe. His interests include LLM agents, dialogue summarization, LLM prediction refinement, and knowledge graph.
Edgar Costa Filho is a Senior Cloud Infrastructure Architect with a focus on Foundations and Containers, including expertise in integrating Amazon EKS with open-source tooling like Crossplane, Terraform, and GitOps. In his role, Edgar is dedicated to assisting customers in achieving their business objectives by implementing best practices in cloud infrastructure design and management.