Philips accelerates development of AI-enabled healthcare solutions with an MLOps platform built on Amazon SageMaker

This is a joint blog post by AWS and Philips.

Philips is a health technology company focused on improving people’s lives through meaningful innovation. Since 2014, the company has been offering customers its Philips HealthSuite Platform, which orchestrates dozens of AWS services that healthcare and life sciences companies use to improve patient care. It partners with healthcare providers, startups, universities, and other companies to develop technology that helps doctors make more precise diagnoses and deliver more personalized treatment for millions of people worldwide.

One of the key drivers of Philips’ innovation strategy is artificial intelligence (AI), which enables the creation of smart and personalized products and services that can improve health outcomes, enhance customer experience, and optimize operational efficiency.

Amazon SageMaker provides purpose-built tools for machine learning operations (MLOps) to help automate and standardize processes across the ML lifecycle. With SageMaker MLOps tools, teams can easily train, test, troubleshoot, deploy, and govern ML models at scale to boost productivity of data scientists and ML engineers while maintaining model performance in production.

In this post, we describe how Philips partnered with AWS to develop AI ToolSuite—a scalable, secure, and compliant ML platform on SageMaker. The platform provides capabilities for experimentation, data annotation, training, and model deployment, along with reusable templates. All these capabilities are built to help multiple lines of business innovate with speed and agility while governing at scale with central controls. We outline the key use cases that provided requirements for the first iteration of the platform, the core components, and the outcomes achieved. We conclude by describing the ongoing efforts to enable the platform for generative AI workloads and to rapidly onboard new users and teams.

Customer context

Philips uses AI in various domains, such as imaging, diagnostics, therapy, personal health, and connected care. Some examples of AI-enabled solutions that Philips has developed over the past years are:

  • Philips SmartSpeed – An AI-based imaging technology for MRI that uses a unique Compressed-SENSE based deep learning AI algorithm to take speed and image quality to the next level for a large variety of patients
  • Philips eCareManager – A telehealth solution that uses AI to support the remote care and management of critically ill patients in intensive care units, by using advanced analytics and clinical algorithms to process the patient data from multiple sources, and providing actionable insights, alerts, and recommendations for the care team
  • Philips Sonicare – A smart toothbrush that uses AI to analyze the brushing behavior and oral health of users, and provide real-time guidance and personalized recommendations, such as optimal brushing time, pressure, and coverage, to improve their dental hygiene and prevent cavities and gum diseases.

For many years, Philips has been pioneering the development of data-driven algorithms to fuel its innovative solutions across the healthcare continuum. In the diagnostic imaging domain, Philips developed a multitude of ML applications for medical image reconstruction and interpretation, workflow management, and treatment optimization. Teams in patient monitoring, image-guided therapy, ultrasound, and personal health have also been creating ML algorithms and applications. However, innovation was hampered by fragmented AI development environments across teams. These environments ranged from individual laptops and desktops to diverse on-premises computational clusters and cloud-based infrastructure. This heterogeneity initially enabled different teams to move fast in their early AI development efforts, but it later held back opportunities to scale and improve the efficiency of AI development processes across the company.

It was evident that a fundamental shift towards a unified and standardized environment was imperative to truly unleash the potential of data-driven endeavors at Philips.

Key AI/ML use cases and platform requirements

AI/ML-enabled propositions can transform healthcare by automating or augmenting tasks across clinical and administrative workflows. For example:

  • AI can analyze medical images to help radiologists diagnose diseases faster and more accurately
  • AI can predict future medical events by analyzing patient data and improving proactive care
  • AI can recommend personalized treatment tailored to patients’ needs
  • AI can extract and structure information from clinical notes to make record-taking more efficient
  • AI interfaces can provide patient support for queries, reminders, and symptom checkers

Overall, AI/ML promises reduced human error, time and cost savings, optimized patient experiences, and timely, personalized interventions.

One of the key requirements for the ML development and deployment platform was the ability of the platform to support the continuous iterative development and deployment process, as shown in the following figure.

The AI asset development starts in a lab environment, where the data is collected and curated, and then the models are trained and validated. When the model is ready and approved for use, it’s deployed into the real-world production systems. Once deployed, model performance is continuously monitored. The real-world performance and feedback are eventually used for further model improvements with full automation of the model training and deployment.

The more detailed AI ToolSuite requirements were driven by three example use cases:

  • Develop a computer vision application aimed at object detection at the edge. The data science team expected an AI-based automated image annotation workflow to speed up a time-consuming labeling process.
  • Enable a data science team to manage a family of classic ML models for benchmarking statistics across multiple medical units. The project required automation of model deployment, experiment tracking, model monitoring, and more control over the entire process end to end both for auditing and retraining in the future.
  • Improve the quality and time to market for deep learning models in diagnostic medical imaging. The existing computing infrastructure didn’t allow for running many experiments in parallel, which delayed model development. Also, for regulatory purposes, it’s necessary to enable full reproducibility of model training for several years.

Non-functional requirements

Building a scalable and robust AI/ML platform requires careful consideration of non-functional requirements. These requirements go beyond the specific functionalities of the platform and focus on ensuring the following:

  • Scalability – The AI ToolSuite platform must scale Philips’ insights generation infrastructure so it can handle a growing volume of data, users, and AI/ML workloads without sacrificing performance. It should be designed to scale horizontally and vertically to meet increasing demands while providing central resource management.
  • Performance – The platform must deliver high-performance computing capabilities to efficiently process complex AI/ML algorithms. SageMaker offers a wide range of instance types, including instances with powerful GPUs, which can significantly accelerate model training and inference tasks. The platform should also minimize latency and response times to provide real-time or near-real-time results.
  • Reliability – The platform must provide a highly reliable and robust AI infrastructure that spans across multiple Availability Zones. This multi-AZ architecture should ensure uninterrupted AI operations by distributing resources and workloads across distinct data centers.
  • Availability – The platform must be available 24/7, with minimal downtime for maintenance and upgrades. AI ToolSuite’s high availability should include load balancing, fault-tolerant architectures, and proactive monitoring.
  • Security and governance – The platform must employ robust security measures, encryption, access controls, dedicated roles, and authentication mechanisms, with continuous monitoring for unusual activity and regular security audits.
  • Data management – Efficient data management is crucial for AI/ML platforms, and regulations in the healthcare industry call for especially rigorous data governance. The platform should include features like data versioning, data lineage, data governance, and data quality assurance to ensure accurate and reliable results.
  • Interoperability – The platform should be designed to integrate easily with Philips’s internal data repositories, allowing seamless data exchange and collaboration with third-party applications.
  • Maintainability – The platform’s architecture and code base should be well organized, modular, and maintainable. This enables Philips ML engineers and developers to provide updates, bug fixes, and future enhancements without disrupting the entire system.
  • Resource optimization – The platform should monitor utilization reports closely to make sure computing resources are used efficiently, and allocate resources dynamically based on demand. In addition, Philips should use AWS Billing and Cost Management tools to make sure teams receive notifications when spending passes the allocated budget threshold.
  • Monitoring and logging – The platform should use Amazon CloudWatch alerts for comprehensive monitoring and logging capabilities, which are necessary to track system performance, identify bottlenecks, and troubleshoot issues effectively.
  • Compliance – The platform can also help improve regulatory compliance of AI-enabled propositions. Reproducibility and traceability must be enabled automatically by the end-to-end data processing pipelines, where many mandatory documentation artifacts, such as data lineage reports and model cards, can be prepared automatically.
  • Testing and validation – Rigorous testing and validation procedures must be in place to ensure the accuracy and reliability of AI/ML models and prevent unintended biases.

Solution overview

AI ToolSuite is an end-to-end, scalable, quick start AI development environment offering native SageMaker and associated AI/ML services with Philips HealthSuite security and privacy guardrails and Philips ecosystem integrations. There are three personas with dedicated sets of access permissions:

  • Data scientist – Prepare data, and develop and train models in a collaborative workspace
  • ML engineer – Productionize ML applications with model deployment, monitoring, and maintenance
  • Data science admin – Create a project per team request to provide dedicated isolated environments with use case-specific templates

The platform development spanned multiple release cycles in an iterative process of discover, design, build, test, and deploy. Because some applications were unique, extending the platform required embedding existing custom components, such as data stores or proprietary annotation tools.
The following figure illustrates the three-layer architecture of AI ToolSuite, including the base infrastructure as the first layer, common ML components as the second layer, and project-specific templates as the third layer.

Layer 1 contains the base infrastructure:

  • A networking layer with parametrized access to the internet with high availability
  • Self-service provisioning with infrastructure as code (IaC)
  • An integrated development environment (IDE) using an Amazon SageMaker Studio domain
  • Platform roles (data science admin, data scientist)
  • Artifacts storage
  • Logging and monitoring for observability

Layer 2 contains common ML components:

  • Automated experiment tracking for every job and pipeline
  • A model build pipeline to launch a new model build update
  • A model training pipeline comprising model training, evaluation, and registration
  • A model deploy pipeline to deploy the model for final testing and approval
  • A model registry to easily manage model versions
  • A project role created specifically for a given use case, to be assigned to SageMaker Studio users
  • An image repository for storing processing, training, and inference container images built for the project
  • A code repository to store code artifacts
  • A project Amazon Simple Storage Service (Amazon S3) bucket to store all project data and artifacts

Layer 3 contains project-specific templates that can be created with custom components as required by new projects. For example:

  • Template 1 – Includes a component for data querying and history tracking
  • Template 2 – Includes a component for data annotations with a custom annotation workflow to use proprietary annotation tooling
  • Template 3 – Includes components for custom container images so teams can customize both their development environment and training routines, a dedicated high-performance computing (HPC) file system, and access from a local IDE for users

The following diagram highlights the key AWS services spanning multiple AWS accounts for development, staging, and production.

In the following sections, we discuss the key capabilities of the platform enabled by AWS services, including SageMaker, AWS Service Catalog, CloudWatch, AWS Lambda, Amazon Elastic Container Registry (Amazon ECR), Amazon S3, AWS Identity and Access Management (IAM), and others.

Infrastructure as code

The platform uses IaC, which allows Philips to automate the provisioning and management of infrastructure resources. This approach also supports reproducibility, scalability, version control, consistency, security, and portability across development, testing, and production.
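
For illustration, a minimal AWS CDK (Python) sketch of declaring a SageMaker Studio domain as code might look like the following; the domain name, VPC, subnet, and role identifiers are placeholders, not the actual AI ToolSuite definitions.

from aws_cdk import App, Stack
from aws_cdk import aws_sagemaker as sagemaker
from constructs import Construct

class StudioDomainStack(Stack):
    """Declares a SageMaker Studio domain as code (placeholder identifiers)."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        sagemaker.CfnDomain(
            self,
            "AiDevDomain",
            domain_name="ml-platform-dev",           # hypothetical domain name
            auth_mode="IAM",
            vpc_id="vpc-0123456789abcdef0",          # placeholder VPC
            subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
            default_user_settings=sagemaker.CfnDomain.UserSettingsProperty(
                execution_role="arn:aws:iam::111122223333:role/DataScienceUserRole"
            ),
        )

app = App()
StudioDomainStack(app, "StudioDomainStack")
app.synth()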

Access to AWS environments

SageMaker and associated AI/ML services are accessed with security guardrails for data preparation, model development, training, annotation, and deployment.

Isolation and collaboration

The platform ensures data isolation by storing and processing each project’s data separately, reducing the risk of unauthorized access or data breaches.

The platform facilitates team collaboration, which is essential in AI projects that typically involve cross-functional teams, including data scientists, data science admins, and MLOps engineers.

Role-based access control

Role-based access control (RBAC) is essential for managing permissions and simplifying access management by defining roles and permissions in a structured manner. It makes it straightforward to manage permissions as teams and projects grow, and to control access for the different personas involved in AWS AI/ML projects, such as the data science admin, data scientist, annotation admin, annotator, and MLOps engineer.
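
The following is a minimal, hypothetical sketch of how such a persona role could be created with boto3; the role name is illustrative, and a production setup would attach a custom least-privilege policy rather than the broad managed policy shown here.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets SageMaker assume the role on behalf of the data scientist persona
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="DataScientistRole",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Persona role for data scientists on the ML platform",
)

# Illustration only; a real platform would attach a scoped, least-privilege policy instead
iam.attach_role_policy(
    RoleName="DataScientistRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)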

Access to data stores

The platform allows SageMaker access to data stores, which ensures that data can be efficiently utilized for model training and inference without the need to duplicate or move data across different storage locations, thereby optimizing resource utilization and reducing costs.

Annotation using Philips-specific annotation tools

AWS offers a suite of AI and ML services, such as SageMaker, Amazon SageMaker Ground Truth, and Amazon Cognito, which are fully integrated with Philips-specific in-house annotation tools. This integration enables developers to train and deploy ML models using the annotated data within the AWS environment.

ML templates

The AI ToolSuite platform offers templates in AWS for various ML workflows. These templates are preconfigured infrastructure setups tailored to specific ML use cases and are accessible through services like SageMaker project templates, AWS CloudFormation, and Service Catalog.
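
As an illustration, a project based on a Service Catalog-backed template could be instantiated programmatically as sketched below; the project name, product ID, and provisioning artifact ID are placeholders.

import boto3

sm = boto3.client("sagemaker")

# Create a SageMaker project from a Service Catalog-backed template (placeholder IDs)
sm.create_project(
    ProjectName="cv-object-detection-edge",            # hypothetical project name
    ProjectDescription="Project provisioned from a custom ML template",
    ServiceCatalogProvisioningDetails={
        "ProductId": "prod-xxxxxxxxxxxxx",             # placeholder Service Catalog product
        "ProvisioningArtifactId": "pa-xxxxxxxxxxxxx",  # placeholder template version
    },
)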

Integration with Philips GitHub

Integration with GitHub enhances efficiency by providing a centralized platform for version control, code reviews, and automated CI/CD (continuous integration and continuous deployment) pipelines, reducing manual tasks and boosting productivity.

Visual Studio Code integration

Integration with Visual Studio Code provides a unified environment for coding, debugging, and managing ML projects. This streamlines the entire ML workflow, reducing context switching and saving time. The integration also enhances collaboration among team members by enabling them to work on SageMaker projects together within a familiar development environment, utilizing version control systems, and sharing code and notebooks seamlessly.

Model and data lineage and traceability for reproducibility and compliance

The platform provides versioning, which helps keep track of changes to training and inference data over time, making it easier to reproduce results and understand the evolution of the datasets.

The platform also enables SageMaker experiment tracking, which allows end-users to log and track all the metadata associated with their ML experiments, including hyperparameters, input data, code, and model artifacts. These capabilities are essential for demonstrating compliance with regulatory standards and ensuring transparency and accountability in AI/ML workflows.
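
The following sketch shows the kind of logging this enables through the SageMaker Python SDK experiments API; the experiment, run, parameter, and metric names are illustrative.

from sagemaker.experiments.run import Run

# Log parameters and metrics for a training run (illustrative names and values)
with Run(experiment_name="mri-reconstruction", run_name="baseline-run") as run:
    run.log_parameter("learning_rate", 1e-4)
    run.log_parameter("epochs", 20)
    run.log_metric(name="validation:ssim", value=0.91, step=20)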

AI/ML specification report generation for regulatory compliance

AWS maintains compliance certifications for various industry standards and regulations. AI/ML specification reports serve as essential compliance documentation, showcasing adherence to regulatory requirements. These reports document the versioning of datasets, models, and code. Version control is essential for maintaining data lineage, traceability, and reproducibility, all of which are critical for regulatory compliance and auditing.

Project-level budget management

Project-level budget management allows the organization to set limits on spending, helping to avoid unexpected costs and ensuring that ML projects stay within budget. With budget management, the organization can allocate specific budgets to individual projects or teams, which helps teams identify resource inefficiencies or unexpected cost spikes early on. In addition, the platform automatically shuts down idle notebooks, so team members don’t pay for unused resources and valuable capacity is released for other tasks and users.
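
For example, a monthly budget with an alert at 80% of spend could be defined for a project as sketched below; the account ID, tag key, amounts, and email address are assumptions.

import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="111122223333",  # placeholder account ID
    Budget={
        "BudgetName": "ml-project-alpha-monthly",        # hypothetical project budget
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        # Assumes a cost-allocation tag named "project" is applied to the project's resources
        "CostFilters": {"TagKeyValue": ["user:project$alpha"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "ml-team@example.com"}],
        }
    ],
)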

Outcomes

AI ToolSuite was designed and implemented as an enterprise-wide platform for ML development and deployment for data scientists across Philips. Diverse requirements from all business units were collected and considered during the design and development. Early in the project, Philips identified champions from the business teams who provided feedback and helped evaluate the value of the platform.

The following outcomes were achieved:

  • User adoption is one of the key leading indicators for Philips. Users from several business units were trained and onboarded to the platform, and that number is expected to grow in 2024.
  • Another important metric is the efficiency for data science users. With AI ToolSuite, new ML development environments are deployed in less than an hour instead of several days.
  • Data science teams can access a scalable, secure, cost-efficient, cloud-based compute infrastructure.
  • Teams can run multiple model training experiments in parallel, which significantly reduced the average training time from weeks to 1–3 days.
  • Because the environment deployment is fully automated, it requires virtually no involvement of the cloud infrastructure engineers, which reduced operational costs.
  • The use of AI ToolSuite significantly enhanced the overall maturity of data and AI deliverables by promoting the use of good ML practices, standardized workflows, and end-to-end reproducibility, which is critical for regulatory compliance in the healthcare industry.

Looking forward with generative AI

As organizations race to adopt the next state-of-the-art in AI, it’s imperative to adopt new technology in the context of the organization’s security and governance policy. The architecture of AI ToolSuite provides an excellent blueprint for enabling access to generative AI capabilities in AWS for different teams at Philips. Teams can use foundation models made available with Amazon SageMaker JumpStart, which provides a vast number of open source models from Hugging Face and other providers. With the necessary guardrails already in place in terms of access control, project provisioning, and cost controls, it will be seamless for teams to start using the generative AI capabilities within SageMaker.

Additionally, access to Amazon Bedrock, a fully managed API-driven service for generative AI, can be provisioned for individual accounts based on project requirements, and the users can access Amazon Bedrock APIs either via the SageMaker notebook interface or through their preferred IDE.
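
As a minimal sketch of such access (the model ID, Region, and request format depend on the model enabled in your account), invoking a Bedrock text model from a notebook could look like the following, shown here with the Amazon Titan Text model.

import json
import boto3

# Assumes Amazon Bedrock model access has been granted in this Region
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# The request body format is model-specific; this example uses the Titan Text schema
body = json.dumps({
    "inputText": "Summarize the key MLOps controls needed for a regulated healthcare workload.",
    "textGenerationConfig": {"maxTokenCount": 256, "temperature": 0.2},
})

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-text-express-v1",
    contentType="application/json",
    accept="application/json",
    body=body,
)
print(json.loads(response["body"].read())["results"][0]["outputText"])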

There are additional considerations concerning the adoption of generative AI in a regulated setting, such as healthcare. Careful consideration needs to be given to the value created by generative AI applications against the associated risks and costs. There is also a need to create a risk and legal framework that governs the organization’s use of generative AI technologies. Elements such as data security, bias and fairness, and regulatory compliance need to be considered as part of such mechanisms.

Conclusion

Philips embarked on a journey of harnessing the power of data-driven algorithms to revolutionize healthcare solutions. Over the years, innovation in diagnostic imaging has yielded several ML applications, from image reconstruction to workflow management and treatment optimization. However, the diverse range of setups, from individual laptops to on-premises clusters and cloud infrastructure, posed formidable challenges. Separate system administration, security measures, support mechanisms, and data protocols inhibited a comprehensive view of total cost of ownership (TCO) and complicated transitions between teams. The transition from research and development to production was burdened by the lack of lineage and reproducibility, making continuous model retraining difficult.

As part of the strategic collaboration between Philips and AWS, AI ToolSuite was created as a scalable, secure, and compliant ML platform built with SageMaker. The platform provides capabilities for experimentation, data annotation, training, and model deployment, along with reusable templates. All these capabilities were built iteratively over several cycles of discover, design, build, test, and deploy. This helped multiple business units innovate with speed and agility while governing at scale with central controls.

This journey serves as an inspiration for organizations looking to harness the power of AI and ML to drive innovation and efficiency in healthcare, ultimately benefiting patients and care providers worldwide. As it continues to build upon this success, Philips stands poised to make even greater strides in improving health outcomes through innovative AI-enabled solutions.

To learn more about Philips innovation on AWS, visit Philips on AWS.


About the authors

Frank Wartena is a program manager at Philips Innovation & Strategy. He coordinates data and AI related platform assets in support of Philips’ data and AI enabled propositions. He has broad experience in artificial intelligence, data science, and interoperability. In his spare time, Frank enjoys running, reading, rowing, and spending time with his family.

Irina Fedulova is a Principal Data & AI Lead at Philips Innovation & Strategy. She is driving strategic activities focused on the tools, platforms, and best practices that speed up and scale the development and productization of (Generative) AI-enabled solutions at Philips. Irina has a strong technical background in machine learning, cloud computing, and software engineering. Outside work, she enjoys spending time with her family, traveling and reading.

Selvakumar Palaniyappan is a Product Owner at Philips Innovation & Strategy, in charge of product management for Philips HealthSuite AI & ML platform. He is highly experienced in technical product management and software engineering. He is currently working on building a scalable and compliant AI and ML development and deployment platform. Furthermore, he is spearheading its adoption by Philips’ data science teams in order to develop AI-driven health systems and solutions.

Adnan Elci is a Senior Cloud Infrastructure Architect at AWS Professional Services. He operates in the capacity of a Tech Lead, overseeing various operations for clients in Healthcare and Life Sciences, Finance, Aviation, and Manufacturing. His enthusiasm for automation is evident in his extensive involvement in designing, building and implementing enterprise level customer solutions within the AWS environment. Beyond his professional commitments, Adnan actively dedicates himself to volunteer work, striving to create a meaningful and positive impact within the community.

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS. He helps customers design and deploy machine learning applications in production on AWS. He has over 12 years of work experience as a data scientist, machine learning practitioner, and software developer. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Sreoshi Roy is a Senior Global Engagement Manager with AWS. As a business partner to healthcare and life sciences customers, she brings extensive experience in defining and delivering solutions for complex business problems. She helps her customers set strategic objectives, define and design cloud and data strategies, and implement scalable, robust solutions to meet their technical and business goals. Beyond her professional endeavors, her dedication lies in creating a meaningful impact on people’s lives by fostering empathy and promoting inclusivity.

Wajahat Aziz is a leader for AI/ML and HPC in the AWS Healthcare and Life Sciences team. Having served as a technology leader in different roles with life sciences organizations, Wajahat draws on his experience to help healthcare and life sciences customers use AWS technologies for developing state-of-the-art ML and HPC solutions. His current areas of focus are early research, clinical trials, and privacy-preserving machine learning.

Wioletta Stobieniecka is a Data Scientist at AWS Professional Services. Throughout her professional career, she has delivered multiple analytics-driven projects for industries such as banking, insurance, telco, and the public sector. Her knowledge of advanced statistical methods and machine learning is well combined with business acumen. She brings recent AI advancements to create value for customers.


Fine-tune Whisper models on Amazon SageMaker with LoRA

Whisper is an automatic speech recognition (ASR) model that has been trained on 680,000 hours of supervised data from the web, encompassing a range of languages and tasks. One of its limitations is its low performance on low-resource languages such as Marathi and the Dravidian languages, which can be remediated with fine-tuning. However, fine-tuning a Whisper model has become a considerable challenge, both in terms of computational resources and storage requirements. Five to ten runs of full fine-tuning for Whisper models demand approximately 100 hours of A100 GPU time (40 GB SXM4), varying with model size and parameters, and each fine-tuned checkpoint requires about 7 GB of storage space. This combination of high computational and storage demands can pose significant hurdles, especially in environments with limited resources, often making it exceptionally difficult to achieve meaningful results.

Low-Rank Adaptation, also known as LoRA, takes a unique approach to model fine-tuning. It keeps the pre-trained model weights frozen and introduces trainable rank decomposition matrices into each layer of the Transformer structure. This method can decrease the number of trainable parameters needed for downstream tasks by a factor of 10,000 and reduce the GPU memory requirement by a factor of 3. In terms of model quality, LoRA has been shown to match or even exceed the performance of traditional fine-tuning methods, despite operating with fewer trainable parameters (see the results from the original LoRA paper). It also offers the benefit of increased training throughput. Unlike adapter methods, LoRA doesn’t introduce additional latency during inference, thereby maintaining the efficiency of the model during the deployment phase. Fine-tuning Whisper using LoRA has shown promising results. Take Whisper-Large-v2, for instance: running 3 epochs on a 12-hour Common Voice dataset with an 8 GB GPU takes 6–8 hours, which is 5 times faster than full fine-tuning with comparable performance.

Amazon SageMaker is an ideal platform to implement LoRA fine-tuning of Whisper. Amazon SageMaker enables you to build, train, and deploy machine learning models for any use case with fully managed infrastructure, tools, and workflows. Additional model training benefits can include lower training costs with Managed Spot Training, distributed training libraries to split models and training datasets across AWS GPU instances, and more.  The trained SageMaker models can be easily deployed for inference directly on SageMaker. In this post, we present a step-by-step guide to implement LoRA fine-tuning in SageMaker. The source code associated with this implementation can be found on GitHub.

Prepare the dataset for fine-tuning

We use the low-resource language Marathi for the fine-tuning task. Using the Hugging Face datasets library, you can download and split the Common Voice dataset into training and testing datasets. See the following code:

from datasets import load_dataset, DatasetDict

language = "Marathi"
language_abbr = "mr"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

common_voice = DatasetDict()
common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation", use_auth_token=True)
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test", use_auth_token=True)

The Whisper speech recognition model requires audio inputs to be 16 kHz mono 16-bit signed integer WAV files. Because the Common Voice dataset is sampled at 48 kHz, you need to downsample the audio files first. Then you apply Whisper’s feature extractor to the audio to extract log-mel spectrogram features, and apply Whisper’s tokenizer to convert each sentence in the transcript into token IDs. See the following code:

from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer
from datasets import Audio

model_name_or_path = "openai/whisper-large-v2"  # pre-trained model used in this post

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)

# resample the audio column from 48kHz to the 16kHz expected by Whisper
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    # load the resampled audio data
    audio = batch["audio"]

    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# apply the data preparation function to all fine-tuning samples using the dataset's .map method
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)
common_voice.save_to_disk("marathi-common-voice-processed")
!aws s3 cp --recursive "marathi-common-voice-processed" s3://<Your-S3-Bucket>

After you have processed all the training samples, upload the processed data to Amazon S3, so that when using the processed training data in the fine-tuning stage, you can use FastFile to mount the S3 file directly instead of copying it to local disk:

from sagemaker.inputs import TrainingInput

training_input_path = s3uri  # S3 URI of the processed dataset uploaded in the previous step
training = TrainingInput(
    s3_data_type='S3Prefix',        # Available Options: S3Prefix | ManifestFile | AugmentedManifestFile
    s3_data=training_input_path,
    distribution='FullyReplicated', # Available Options: FullyReplicated | ShardedByS3Key
    input_mode='FastFile'
)

Train the model

For demonstration, we use whisper-large-v2 as the pre-trained model (Whisper v3 is now available), which can be imported through the Hugging Face transformers library. You can use 8-bit quantization to further improve training efficiency. 8-bit quantization optimizes memory by rounding floating-point values to 8-bit integers. It is a commonly used model compression technique that reduces memory consumption without sacrificing too much precision during inference.

To load the pre-trained model in 8-bit quantized format, we simply add the load_in_8bit=True argument when instantiating the model, as shown in the following code. This will load the model weights quantized to 8 bits, reducing the memory footprint.

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")

We use the LoRA implementation from Hugging Face’s peft package. There are four steps to fine-tune a model using LoRA:

  1. Instantiate a base model (as we did in the last step).
  2. Create a configuration (LoraConfig) where LoRA-specific parameters are defined.
  3. Wrap the base model with get_peft_model() to get a trainable PeftModel.
  4. Train the PeftModel as you would the base model.

See the following code:

from peft import LoraConfig, get_peft_model
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")
model = get_peft_model(model, config)

training_args = Seq2SeqTrainingArguments(
    output_dir=args.model_dir,
    per_device_train_batch_size=int(args.train_batch_size),
    gradient_accumulation_steps=1,
    learning_rate=float(args.learning_rate),
    warmup_steps=args.warmup_steps,
    num_train_epochs=args.num_train_epochs,
    evaluation_strategy="epoch",
    fp16=True,
    per_device_eval_batch_size=args.eval_batch_size,
    generation_max_length=128,
    logging_steps=25,
    remove_unused_columns=False,
    label_names=["labels"],
)
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset["train"],
    eval_dataset=train_dataset["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
)

To run a SageMaker training job, we bring our own Docker container. You can download the Docker image from GitHub, where ffmpeg4 and git-lfs are packaged together with other Python requirements. To learn more about how to adapt your own Docker container to work with SageMaker, refer to Adapting your own training container. Then you can use the Hugging Face Estimator and start a SageMaker training job:

from sagemaker.huggingface import HuggingFace

OUTPUT_PATH = f's3://{BUCKET}/{PREFIX}/{TRAINING_JOB_NAME}/output/'

huggingface_estimator = HuggingFace(
    entry_point='train.sh',
    source_dir='./src',
    output_path=OUTPUT_PATH,
    instance_type=instance_type,
    instance_count=1,
    # transformers_version='4.17.0',  # not needed when a custom image_uri is provided
    # pytorch_version='1.10.2',
    py_version='py310',
    image_uri=<ECR-PATH>,             # URI of the custom Docker image in Amazon ECR
    role=ROLE,
    metric_definitions=metric_definitions,
    volume_size=200,
    distribution=distribution,
    keep_alive_period_in_seconds=1800,
    environment=environment,
)

huggingface_estimator.fit(job_name=TRAINING_JOB_NAME, wait=False)

The implementation of LoRA enabled us to run the Whisper large fine-tuning task on a single GPU instance (for example, ml.g5.2xlarge). In comparison, the Whisper large full fine-tuning task requires multiple GPUs (for example, ml.p4d.24xlarge) and a much longer training time. More specifically, our experiment demonstrated that the full fine-tuning task requires 24 times more GPU hours compared to the LoRA approach.

Evaluate model performance

To evaluate the performance of the fine-tuned Whisper model, we calculate the word error rate (WER) on a held-out test set. WER measures the difference between the predicted transcript and the ground truth transcript. A lower WER indicates better performance. You can run the following script against the pre-trained model and fine-tuned model and compare their WER difference:

import gc
import evaluate
import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

metric = evaluate.load("wer")

eval_dataloader = DataLoader(common_voice["test"], batch_size=8, collate_fn=data_collator)

model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    del generated_tokens, labels, batch
    gc.collect()
wer = 100 * metric.compute()
print(f"{wer=}")

Conclusion

In this post, we demonstrated fine-tuning Whisper, a state-of-the-art speech recognition model. In particular, we used Hugging Face’s PEFT LoRA and enabled 8-bit quantization for efficient training. We also demonstrated how to run the training job on SageMaker.

Although this is an important first step, there are several ways you can build on this work to further improve the Whisper model. Going forward, consider using SageMaker distributed training to scale training on a much larger dataset. This will allow the model to train on more varied and comprehensive data, improving accuracy. You can also optimize latency when serving the Whisper model to enable real-time speech recognition. Additionally, you could expand this work to handle longer audio transcriptions, which requires changes to the model architecture and training schemes.

Acknowledgement

The authors extend their gratitude to Paras Mehra, John Sol and Evandro Franco for their insightful feedback and review of the post.


About the Authors

Jun Shi is a Senior Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML infrastructure and applications. He has over a decade of experience in the FinTech industry as a software engineer.

Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, and spending time with friends and families.


Use foundation models to improve model accuracy with Amazon SageMaker

Photo by Scott Webb on Unsplash

Determining the value of housing is a classic example of using machine learning (ML). A significant influence was made by Harrison and Rubinfeld (1978), who published a groundbreaking paper and dataset that became known informally as the Boston housing dataset. This seminal work proposed a method for estimating housing prices as a function of numerous dimensions, including air quality, which was the principal focus of their research. Almost 50 years later, the estimation of housing prices has become an important teaching tool for students and professionals interested in using data and ML in business decision-making.

In this post, we discuss the use of an open-source model specifically designed for the task of visual question answering (VQA). With VQA, you can ask a question of a photo using natural language and receive an answer to your question—also in plain language. Our goal in this post is to inspire and demonstrate what is possible using this technology. We propose using this capability with the Amazon SageMaker platform of services to improve regression model accuracy in an ML use case, and independently, for the automated tagging of visual images.

We provide a corresponding YouTube video that demonstrates what is discussed here. Video playback will start midway to highlight the most salient point. We suggest you follow this reading with the video to reinforce and gain a richer understanding of the concept.

Foundation models

This solution centers on the use of a foundation model published to the Hugging Face model repository. Here, we use the term foundation model to describe an artificial intelligence (AI) capability that has been pre-trained on a large and diverse body of data. Foundation models can sometimes be ready to use without the burden of training a model from zero. Some foundation models can be fine-tuned, which means teaching them additional patterns that are relevant to your business but missing from the original, generalized published model. Fine-tuning is sometimes needed to deliver correct responses that are unique to your use case or body of knowledge.

In the Hugging Face repository, there are several VQA models to choose from. We selected the model with the most downloads at the time of this writing. Although this post demonstrates the ability to use a model from an open-source model repository, the same concept would apply to a model you trained from zero or used from another trusted provider.

A modern approach to a classic use case

Home price estimation has traditionally relied on tabular data, where features of the property are used to inform price. Although there can be hundreds of features to consider, some fundamental examples are the finished size of the home, the number of bedrooms and bathrooms, and the location of the residence.

Machine learning is capable of incorporating diverse input sources beyond tabular data, such as audio, still images, motion video, and natural language. In AI, the term multimodal refers to the use of a variety of media types, such as images and tabular data. In this post, we show how to use multimodal data to find and liberate hidden value locked up in the abundant digital exhaust produced by today’s modern world.

With this idea in mind, we demonstrate the use of foundation models to extract latent features from images of the property. By utilizing insights found in the images, not previously available in the tabular data, we can improve the accuracy of the model. Both the images and tabular data discussed in this post were originally made available and published to GitHub by Ahmed and Moustafa (2016).

A picture is worth a thousand words

Now that we understand the capabilities of VQA, let’s consider the two following images of kitchens. How would you assess the home’s value from these images? What are some questions you would ask yourself? Each picture may elicit dozens of questions in your mind. Some of those questions may lead to meaningful answers that improve a home valuation process.

Photos credit Francesca Tosolini (L) and Sidekix Media (R) on Unsplash

The following table provides anecdotal examples of VQA interactions by showing questions alongside their corresponding answers. Answers can come in the form of categorical, continuous value, or binary responses.

Example Question | Example Answer from Foundation Model
What are the countertops made from? | granite, tile, marble, laminate, etc.
Is this an expensive kitchen? | yes, no
How many separated sinks are there? | 0, 1, 2

Reference architecture

In this post, we use Amazon SageMaker Data Wrangler to ask a uniform set of visual questions for thousands of photos in the dataset. SageMaker Data Wrangler is purpose-built to simplify the process of data preparation and feature engineering. By providing more than 300 built-in transformations, SageMaker Data Wrangler helps reduce the time it takes to prepare tabular and image data for ML from weeks to minutes. Here, SageMaker Data Wrangler combines data features from the original tabular set with photo-born features from the foundation model for model training.

Next, we build a regression model with the use of Amazon SageMaker Canvas. SageMaker Canvas can build a model, without writing any code, and deliver preliminary results in as little as 2–15 minutes. In the section that follows, we provide a reference architecture used to make this solution guidance possible.

Many popular models from Hugging Face and other providers are one-click deployable with Amazon SageMaker JumpStart. There are hundreds of thousands of models available in these repositories. For this post, we choose a model not available in SageMaker JumpStart, which requires a custom deployment. As shown in the following figure, we deploy a Hugging Face model for inference using an Amazon SageMaker Studio notebook. The notebook is used to deploy an endpoint for real-time inference. The notebook uses assets that include the Hugging Face binary model, a pointer to a container image, and a purpose-built inference.py script that matches the model’s expected input and output. As you read this, the mix of available VQA models may change. The important thing is to review the available VQA models at the time you read this and be prepared to deploy the model you choose, which will have its own API request and response contract.
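
As a rough sketch of such a deployment (the S3 path, role, instance type, and container versions are assumptions, and the packaged artifacts must include the inference.py script for the model you choose):

from sagemaker.huggingface import HuggingFaceModel

# Model artifacts packaged together with a custom inference.py (placeholder S3 path and role)
vqa_model = HuggingFaceModel(
    model_data="s3://my-bucket/vqa/model.tar.gz",   # hypothetical packaged model and code
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Real-time endpoint that SageMaker Data Wrangler calls later in this post
predictor = vqa_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name="my-vqa-endpoint-name",
)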

After the VQA model is served by the SageMaker endpoint, we use SageMaker Data Wrangler to orchestrate the pipeline that ultimately combines tabular data with features extracted from the digital images and reshapes the data for model training. The next figure offers a view of how the full-scale data transformation job is run.

In the following figure, we use SageMaker Data Wrangler to orchestrate data preparation tasks and SageMaker Canvas for model training. First, SageMaker Data Wrangler uses Amazon Location Service to convert ZIP codes available in the raw data into latitude and longitude features. Second, SageMaker Data Wrangler is able to coordinate sending thousands of photos to a SageMaker hosted endpoint for real-time inference, asking a uniform set of questions per scene. This results in a rich array of features that describe characteristics observed in kitchens, bathrooms, home exteriors, and more. After the data has been prepared by SageMaker Data Wrangler, a training dataset is available in Amazon Simple Storage Service (Amazon S3). Using the S3 data as input, SageMaker Canvas is able to train a model, in as little as 2–15 minutes, without writing any code.
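
For illustration, the ZIP-to-coordinates lookup could be performed with a call like the following; the place index name is a placeholder and must already exist in your account.

import boto3

location = boto3.client("location")

def zip_to_lat_lon(zip_code: str):
    """Return the centroid latitude/longitude for a US ZIP code (placeholder place index)."""
    response = location.search_place_index_for_text(
        IndexName="my-place-index",   # hypothetical place index created in Amazon Location Service
        Text=zip_code,
        FilterCountries=["USA"],
        MaxResults=1,
    )
    longitude, latitude = response["Results"][0]["Place"]["Geometry"]["Point"]
    return latitude, longitude

print(zip_to_lat_lon("98109"))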

Data transformation using SageMaker Data Wrangler

The following screenshot shows a SageMaker Data Wrangler workflow. The workflow begins with thousands of photos of homes stored in Amazon S3. Next, a scene detector determines the scene, such as kitchen or bathroom. Finally, a scene-specific set of questions are asked of the images, resulting in a richer, tabular dataset available for training.

The following is an example of the SageMaker Data Wrangler custom transformation code used to interact with the foundation model and obtain information about pictures of kitchens. In the preceding screenshot, if you were to choose the kitchen features node, the following code would appear:


from botocore.config import Config
import json
import boto3
import base64
from pyspark.sql.functions import col, udf, struct, lit

def get_answer(question,image):

	encoded_input_image = base64.b64encode(bytearray(image)).decode()

	payload = {
		"question": question,
		"image": encoded_input_image
	}

	payload = json.dumps(payload).encode('utf-8')
	response = boto3.client('runtime.sagemaker', config=Config(region_name='us-west-2')).invoke_endpoint(EndpointName='my-vqa-endpoint-name', ContentType='application/json', Body=payload)
	return json.loads(response['Body'].read())["predicted_answer"]


vqaUDF = udf(lambda q,img: get_answer(q,img))

# process only images of kitchen type
df = df[df['scene']=='kitchen']

visual_questions = [
	('kitchen_floor_composition', 'what is the floor made of'),
	('kitchen_floor_color', 'what color is the floor'),
	('kitchen_counter_composition', 'what is the countertop made of'),
	('kitchen_counter_color', 'what color is the countertop'),
	('kitchen_wall_composition', 'what are the walls made of'),
	('kitchen_refrigerator_stainless', 'is the refrigerator stainless steel'),
	('kitchen_refrigerator_builtin', 'is there a built-in refrigerator'),
	('kitchen_refrigerator_visible', 'is a refrigerator visible'),
	('kitchen_cabinet_composition', 'what are the kitchen cabinets made of'),
	('kitchen_cabinet_wood', 'what type of wood are the kitchen cabinets'),
	('kitchen_window', 'does the kitchen have windows'),
	('kitchen_expensive', 'is this an expensive kitchen'),
	('kitchen_large', 'is this a large kitchen'),
	('kitchen_recessed_lights', 'are there recessed lights')
	]

for i in visual_questions:
	df = df.withColumn(i[0], vqaUDF(lit(i[1]),col('image_col.data')))

As a security consideration, you must first enable SageMaker Data Wrangler to call your SageMaker real-time endpoint through AWS Identity and Access Management (IAM). Similarly, any other AWS resources you invoke through SageMaker Data Wrangler will need the appropriate allow permissions.
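
For example, the execution role used by SageMaker Data Wrangler could be granted an inline policy similar to the following sketch; the role name, policy name, account ID, and Region are placeholders.

import json
import boto3

iam = boto3.client("iam")

# Allow the execution role to invoke the VQA endpoint (placeholder account ID)
invoke_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "arn:aws:sagemaker:us-west-2:111122223333:endpoint/my-vqa-endpoint-name",
        }
    ],
}

iam.put_role_policy(
    RoleName="DataWranglerExecutionRole",   # placeholder execution role
    PolicyName="AllowInvokeVqaEndpoint",
    PolicyDocument=json.dumps(invoke_policy),
)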

Data structures before and after SageMaker Data Wrangler

In this section, we discuss the structure of the original tabular data and the enhanced data. The enhanced data contains new data features relative to this example use case. In your application, take time to imagine the diverse set of questions available in your images to help your classification or regression task. The idea is to imagine as many questions as possible and then test them to make sure they provide added value.

Structure of original tabular data

As described in the source GitHub repo, the sample dataset contains 535 tabular records including four images per property. The following table illustrates the structure of the original tabular data.

Feature | Comment
Number of bedrooms | –
Number of bathrooms | –
Area (square feet) | –
ZIP Code | –
Price | This is the target variable to be predicted.

Structure of enhanced data

The following table illustrates the enhanced data structure, which contains several new features derived from the images.

Feature | Comment
Number of bedrooms | –
Number of bathrooms | –
Area (square feet) | –
Latitude | Computed by passing the original ZIP code to Amazon Location Service. This is the centroid value for the ZIP code.
Longitude | Computed by passing the original ZIP code to Amazon Location Service. This is the centroid value for the ZIP code.
Does the bedroom contain a vaulted ceiling? | 0 = no; 1 = yes
Is the bathroom expensive? | 0 = no; 1 = yes
Is the kitchen expensive? | 0 = no; 1 = yes
Price | This is the target variable to be predicted.

Model training with SageMaker Canvas

A SageMaker Data Wrangler processing job fully prepares and makes the entire tabular training dataset available in Amazon S3. Next, SageMaker Canvas addresses the model building phase of the ML lifecycle. Canvas begins by opening the S3 training set. Being able to understand a model is often a key customer requirement. Without writing code, and within a few clicks, SageMaker Canvas provides rich, visual feedback on model performance. As seen in the screenshot in the following section, SageMaker Canvas shows how individual features inform the model.

Model trained with original tabular data and features derived from real-estate images

We can see from the following screenshot that features developed from images of the property were important. Based on these results, the question “Is this kitchen expensive” from the photo was more significant than “number of bedrooms” in the original tabular set, with feature importance values of 7.08 and 5.498, respectively.

The following screenshot provides important information about the model. First, the residual graph shows most points in the set clustering around the purple shaded zone. Here, two outliers were manually annotated outside SageMaker Canvas for this illustration. These outliers represent significant gaps between the true home value and the predicted value. Additionally, the R2 value, which has a possible range of 0–100%, is shown at 76%. This indicates the model is imperfect and doesn’t have enough information to fully account for the variation in home values.

We can use outliers to find and propose additional signals to build a more comprehensive model. For example, these outlier properties may include a swimming pool or be located on large plots of land. The dataset didn’t include these features; however, you may be able to locate this data and train a new model with “has swimming pool” included as an additional feature. Ideally, on your next attempt, the R2 value would increase and the MAE and RMSE values would decrease.

Model trained without features derived from real-estate images

Finally, before moving to the next section, let’s explore whether the features from the images were helpful. The following screenshot shows another SageMaker Canvas model, trained without the features from the VQA model. We see the model error rate has increased from an RMSE of 282K to an RMSE of 352K. From this, we can conclude that three simple questions about the images improved model accuracy by about 20%. Not shown, but to be complete, the R2 value for this model deteriorated as well, dropping to 62% from 76% with the VQA features provided. This is an example of how SageMaker Canvas makes it straightforward to quickly experiment and use a data-driven approach that yields a model to serve your business need.

Looking ahead

Many organizations are becoming increasingly interested in foundation models, especially since generative pre-trained transformers (GPTs) officially became a mainstream topic of interest in December 2022. A large portion of the interest in foundation models is centered on large language model (LLM) tasks; however, there are other diverse use cases available, such as computer vision and, more narrowly, the specialized VQA task described here.

This post is an example to inspire the use of multimodal data to solve industry use cases. Although we demonstrated the use and benefit of VQA in a regression model, it can also be used to label and tag images for subsequent search or business workflow routing. Imagine being able to search for properties listed for sale or rent. Suppose you want to find a property with tile floors or marble countertops. Today, you might have to get a long list of candidate properties and filter them yourself by sight as you browse through each candidate. Instead, imagine being able to filter for listings that contain these features, even if a person didn’t explicitly tag them. In the insurance industry, imagine the ability to estimate claim damages, or route next actions in a business workflow from images. In social media platforms, photos could be auto-tagged for subsequent use.

Summary

This post demonstrated how to use computer vision enabled by a foundation model to improve a classic ML use case using the SageMaker platform. As part of the solution proposed, we located a popular VQA model available on a public model registry and deployed it using a SageMaker endpoint for real-time inference.

Next, we used SageMaker Data Wrangler to orchestrate a workflow in which uniform questions were asked of the images in order to generate a rich set of tabular data. Finally, we used SageMaker Canvas to train a regression model. It’s important to note that the sample dataset was very simple and, therefore, imperfect by design. Even so, SageMaker Canvas makes it easy to understand model accuracy and seek out additional signals to improve the accuracy of a baseline model.

We hope this post has encouraged you to use the multimodal data your organization may possess. Additionally, we hope the post has inspired you to consider model training as an iterative process. A great model can be achieved with some patience. Models that are near-perfect may be too good to be true, perhaps the result of target leakage or overfitting. An ideal scenario would begin with a model that is good, but not perfect. Using errors, losses, and residual plots, you can obtain additional data signals to increase the accuracy of your initial baseline estimate.

AWS offers the broadest and deepest set of ML services and supporting cloud infrastructure, putting ML in the hands of every developer, data scientist, and expert practitioner. If you’re curious to learn more about the SageMaker platform, including SageMaker Data Wrangler and SageMaker Canvas, please reach out to your AWS account team and start a conversation. Also, consider reading more about SageMaker Data Wrangler custom transformations.


About The Author

Charles Laughlin is a Principal AI/ML Specialist Solution Architect and works in the Amazon SageMaker service team at AWS. He helps shape the service roadmap and collaborates daily with diverse AWS customers to help transform their businesses using cutting-edge AWS technologies and thought leadership. Charles holds a M.S. in Supply Chain Management and a Ph.D. in Data Science.

Read More

Implement a custom AutoML job using pre-selected algorithms in Amazon SageMaker Automatic Model Tuning

AutoML allows you to derive rapid, general insights from your data right at the beginning of a machine learning (ML) project lifecycle. Understanding up front which preprocessing techniques and algorithm types provide best results reduces the time to develop, train, and deploy the right model. It plays a crucial role in every model’s development process and allows data scientists to focus on the most promising ML techniques. Additionally, AutoML provides a baseline model performance that can serve as a reference point for the data science team.

An AutoML tool applies a combination of different algorithms and various preprocessing techniques to your data. For example, it can scale the data, perform univariate feature selection, conduct PCA at different variance threshold levels, and apply clustering. Such preprocessing techniques could be applied individually or be combined in a pipeline. Subsequently, an AutoML tool would train different model types, such as Linear Regression, Elastic-Net, or Random Forest, on different versions of your preprocessed dataset and perform hyperparameter optimization (HPO). Amazon SageMaker Autopilot eliminates the heavy lifting of building ML models. After providing the dataset, SageMaker Autopilot automatically explores different solutions to find the best model. But what if you want to deploy your tailored version of an AutoML workflow?
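To make this concrete, the following is a minimal, self-contained sketch of the idea (not the solution built later in this post): it tries a few scikit-learn preprocessing recipes and model types on the California Housing dataset and compares their error. The specific recipes and models shown here are illustrative assumptions.

# Minimal illustration of the AutoML idea: evaluate a few preprocessing recipes
# combined with a few model types and compare their RMSE. Illustrative only; the
# post's actual solution uses SageMaker HPO jobs instead of a local loop.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

preprocessing_recipes = {
    "scale": [("scaler", StandardScaler())],
    "scale-pca": [("scaler", StandardScaler()), ("pca", PCA(n_components=0.9))],
}
candidate_models = {
    "elastic-net": ElasticNet(),
    "random-forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

for recipe_name, steps in preprocessing_recipes.items():
    for model_name, model in candidate_models.items():
        pipeline = Pipeline(steps + [("model", model)])
        pipeline.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, pipeline.predict(X_test)))
        print(f"{recipe_name} + {model_name}: RMSE = {rmse:.3f}")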

This post shows how to create a custom-made AutoML workflow on Amazon SageMaker using Amazon SageMaker Automatic Model Tuning with sample code available in a GitHub repo.

Solution overview

For this use case, let’s assume you are part of a data science team that develops models in a specialized domain. You have developed a set of custom preprocessing techniques and selected a number of algorithms that you typically expect to work well with your ML problem. When working on new ML use cases, you would like first to perform an AutoML run using your preprocessing techniques and algorithms to narrow down the scope of potential solutions.

For this example, you don’t use a specialized dataset; instead, you work with the California Housing dataset that you will import from Amazon Simple Storage Service (Amazon S3). The focus is to demonstrate the technical implementation of the solution using SageMaker HPO, which later can be applied to any dataset and domain.

The following diagram presents the overall solution workflow.

Architecture diagram showing the steps explained in the following walkthrough.

Prerequisites

The following are prerequisites for completing the walkthrough in this post:

Implement the solution

The full code is available in the GitHub repo.

The steps to implement the solution (as noted in the workflow diagram) are as follows:

  1. Create a notebook instance and specify the following:
    1. For Notebook instance type, choose ml.t3.medium.
    2. For Elastic Inference, choose none.
    3. For Platform identifier, choose Amazon Linux 2, Jupyter Lab 3.
    4. For IAM role, choose the default AmazonSageMaker-ExecutionRole. If it doesn’t exist, create a new AWS Identity and Access Management (IAM) role and attach the AmazonSageMakerFullAccess IAM policy.

Note that you should create a minimally scoped execution role and policy in production.

  2. Open the JupyterLab interface for your notebook instance and clone the GitHub repo.

You can do that by starting a new terminal session and running the git clone <REPO> command or by using the UI functionality, as shown in the following screenshot.

JupyterLab git integration button

  3. Open the automl.ipynb notebook file, select the conda_python3 kernel, and follow the instructions to trigger a set of HPO jobs.

To run the code without any changes, you need to increase two service quotas: ml.m5.large for training job usage and Number of instances across all training jobs. By default, AWS allows only 20 parallel SageMaker training jobs for both quotas, so you need to request an increase to 30 for each. Both quota changes are typically approved within a few minutes. Refer to Requesting a quota increase for more information.

AWS Service Quotas page, where you can request an increase in parallel training jobs for a particular instance type

If you don’t want to change the quota, you can simply modify the value of the MAX_PARALLEL_JOBS variable in the script (for example, to 5).

  4. Each HPO job will complete a set of training job trials and indicate the model with optimal hyperparameters.
  5. Analyze the results and deploy the best-performing model.

This solution will incur costs in your AWS account. The cost of this solution will depend on the number and duration of HPO training jobs. As these increase, so will the cost. You can reduce costs by limiting training time and configuring TuningJobCompletionCriteriaConfig according to the instructions discussed later in this post. For pricing information, refer to Amazon SageMaker Pricing.

In the following sections, we discuss the notebook in more detail with code examples and the steps to analyze the results and select the best model.

Initial setup

Let’s start with running the Imports & Setup section in the custom-automl.ipynb notebook. It installs and imports all the required dependencies, instantiates a SageMaker session and client, and sets the default Region and S3 bucket for storing data.
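For orientation, the setup might look something like the following minimal sketch; the notebook in the repository may differ in its details, and the variable names shown (such as sm_role) are assumptions based on how they are used later in this post.

# Sketch of the Imports & Setup cell (assumed; see the repository notebook for the exact code)
import boto3
import sagemaker

session = sagemaker.Session()
sm_client = boto3.client("sagemaker")

region = session.boto_region_name          # default Region
bucket = session.default_bucket()          # default S3 bucket for storing data
sm_role = sagemaker.get_execution_role()   # execution role used by the estimators later on

print(region, bucket, sm_role)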

Data preparation

Download the California Housing dataset and prepare it by running the Download Data section of the notebook. The dataset is split into training and testing data frames and uploaded to the SageMaker session default S3 bucket.

The entire dataset has 20,640 records and 9 columns in total, including the target. The goal is to predict the median value of a house (medianHouseValue column). The following screenshot shows the top rows of the dataset.

Top five rows of the California housing data frame showing the structure of the table
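The preparation step might look like the following sketch. The source location and file names are placeholders, and the repository notebook may store the resulting S3 locations differently (later code refers to them as s3_data_train and s3_data_test).

# Hypothetical sketch of downloading, splitting, and uploading the dataset.
# The S3 source path below is a placeholder; reading s3:// paths with pandas requires s3fs.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("s3://<YOUR-SOURCE-BUCKET>/california_housing.csv")

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_parquet("train.parquet", index=False)
test_df.to_parquet("test.parquet", index=False)

# Upload both splits to the SageMaker session default bucket
train_s3_uri = session.upload_data("train.parquet", bucket=bucket, key_prefix="automl/train")
test_s3_uri = session.upload_data("test.parquet", bucket=bucket, key_prefix="automl/test")
print(train_s3_uri, test_s3_uri)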

Training script template

The AutoML workflow in this post is based on scikit-learn preprocessing pipelines and algorithms. The aim is to generate a large combination of different preprocessing pipelines and algorithms to find the best-performing setup. Let’s start with creating a generic training script, which is persisted locally on the notebook instance. In this script, there are two empty comment blocks: one for injecting hyperparameters and the other for the preprocessing-model pipeline object. They will be injected dynamically for each preprocessing model candidate. The purpose of having one generic script is to keep the implementation DRY (don’t repeat yourself).

# Create the base script template
_script = """
import argparse
import joblib
import os
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

###########################
### Inference functions ###
###########################
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

if __name__ == "__main__":
    print("Extracting arguments")
    parser = argparse.ArgumentParser()

    # Hyperparameters
    ##### WILL BE INSERTED DYNAMICALLY #####
    {}
    ############################

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="train.parquet")
    parser.add_argument("--test-file", type=str, default="test.parquet")
    parser.add_argument("--features", type=str)
    parser.add_argument("--target", type=str)
    args, _ = parser.parse_known_args()

    # Load and prepare data
    train_df = pd.read_parquet(os.path.join(args.train, args.train_file))
    test_df = pd.read_parquet(os.path.join(args.test, args.test_file))
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # Train model
    ##### WILL BE INSERTED DYNAMICALLY #####
    {}
    {}
    ############################
    pipeline = Pipeline([('preprocessor', preprocessor), ('model', model)])
    pipeline.fit(X_train, y_train)

    # Validate model and print metrics
    rmse = mean_squared_error(y_test, pipeline.predict(X_test), squared=False)
    print("RMSE: " + str(rmse))

    # Persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(pipeline, path)
"""
# Write _script to a file just to have it in hand
with open("script_draft.py", "w") as f:
    print(_script, file=f)

Create preprocessing and model combinations

The preprocessors dictionary contains a specification of the preprocessing techniques applied to all input features of the model. Each recipe is defined using a Pipeline or a FeatureUnion object from scikit-learn, which chains together individual data transformations and stacks them on top of one another. For example, mean-imp-scale is a simple recipe that ensures missing values are imputed using the mean of the respective columns and that all features are scaled using StandardScaler. In contrast, the mean-imp-scale-pca recipe chains together a few more operations:

  1. Impute missing values in each column with its mean.
  2. Apply feature scaling using mean and standard deviation.
  3. Calculate PCA on top of the input data at a specified variance threshold value and merge it together with the imputed and scaled input features.

In this post, all input features are numeric. If you have more data types in your input dataset, you should specify a more complicated pipeline where different preprocessing branches are applied to different feature type sets.

preprocessors = {
    "mean-imp-scale": "preprocessor = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])\n",

    "mean-imp-scale-knn": "preprocessor = FeatureUnion([('base-features', Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])), ('knn', Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('knn', KMeans(n_clusters=10))]))])\n",

    "mean-imp-scale-pca": "preprocessor = FeatureUnion([('base-features', Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])), ('pca', Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('pca', PCA(n_components=0.9))]))])\n"
}

The models dictionary contains specifications of different algorithms that you fit the dataset to. Every model type comes with the following specification in the dictionary:

  • script_output – Points to the location of the training script used by the estimator. This field is filled dynamically when the models dictionary is combined with the preprocessors dictionary.
  • insertions – Defines code that will be inserted into the script_draft.py and subsequently saved under script_output. The key “preprocessor” is intentionally left blank because this location is filled with one of the preprocessors in order to create multiple model-preprocessor combinations.
  • hyperparameters – A set of hyperparameters that are optimized by the HPO job.
  • include_cls_metadata – More configuration details required by the SageMaker Tuner class.

A full example of the models dictionary is available in the GitHub repository.

models = {
    "rf": {
        "script_output": None,
        "insertions": {
            # Arguments
            "arguments" : 
            "parser.add_argument('--n_estimators', type=int, default=100)n"+
            "    parser.add_argument('--max_depth', type=int, default=None)n"+            
            "    parser.add_argument('--min_samples_leaf', type=int, default=1)n"+
            "    parser.add_argument('--min_samples_split', type=int, default=2)n"+            
            "    parser.add_argument('--max_features', type=str, default='auto')n",
            # Model call
            "preprocessor": None,
            "model_call" : "model = RandomForestRegressor(n_estimators=args.n_estimators,max_depth=args.max_depth,min_samples_leaf=args.min_samples_leaf,min_samples_split=args.min_samples_split,max_features=args.max_features)n"
        },
        "hyperparameters": {
            "n_estimators": IntegerParameter(100, 2000, "Linear"),
            "max_depth": IntegerParameter(1, 100, "Logarithmic"),
            "min_samples_leaf": IntegerParameter(1, 6, "Linear"),
            "min_samples_split": IntegerParameter(2, 20, "Linear"),
            "max_features": CategoricalParameter(["auto", "sqrt", "log2"]),
        },
        "include_cls_metadata": False,
    }
}

Next, let’s iterate through the preprocessors and models dictionaries and create all possible combinations. For example, if your preprocessors dictionary contains 10 recipes and you have 5 model definitions in the models dictionary, the newly created pipelines dictionary contains 50 preprocessor-model pipelines that are evaluated during HPO. Note that individual pipeline scripts are not created yet at this point. The next code block (cell 9) of the Jupyter notebook iterates through all preprocessor-model objects in the pipelines dictionary, inserts all the relevant code pieces, and persists a pipeline-specific version of the script locally on the notebook instance. Those scripts are used in the next steps when creating individual estimators that you plug into the HPO job.

pipelines = {}
for model_name, model_spec in models.items():
    pipelines[model_name] = {}
    for preprocessor_name, preprocessor_spec in preprocessors.items():
        pipeline_name = f"{model_name}-{preprocessor_name}"
        pipelines[model_name][pipeline_name] = {}
        pipelines[model_name][pipeline_name]["insertions"] = {}
        pipelines[model_name][pipeline_name]["insertions"]["preprocessor"] = preprocessor_spec
        pipelines[model_name][pipeline_name]["hyperparameters"] = model_spec["hyperparameters"]
        pipelines[model_name][pipeline_name]["include_cls_metadata"] = model_spec["include_cls_metadata"]        
        pipelines[model_name][pipeline_name]["insertions"]["arguments"] = model_spec["insertions"]["arguments"]
        pipelines[model_name][pipeline_name]["insertions"]["model_call"] = model_spec["insertions"]["model_call"]
        pipelines[model_name][pipeline_name]["script_output"] = f"scripts/{model_name}/script-{pipeline_name}.py"

Define estimators

With the scripts ready, you can now define the SageMaker estimators that the HPO jobs use. Let’s start with creating a wrapper class that defines some common properties for all estimators. It inherits from the SKLearn class and specifies the role, instance count, and type, as well as which columns the script uses as features and as the target.

class SKLearnBase(SKLearn):
    def __init__(
        self, 
        entry_point=".", # intentionally left blank, will be overwritten in the next function
        framework_version="1.2-1",
        role=sm_role,
        instance_count=1,
        instance_type="ml.c5.xlarge",
        hyperparameters={
           "features": "medianIncome housingMedianAge totalRooms totalBedrooms population households latitude longitude",
            "target": "medianHouseValue",
        },  
        **kwargs,
        ):
        super(SKLearnBase, self).__init__(
            entry_point=entry_point,
            framework_version=framework_version,
            role=role,
            instance_count=instance_count,
            instance_type=instance_type,
            hyperparameters=hyperparameters,
            **kwargs
        )

Let’s build the estimators dictionary by iterating through all scripts generated before and located in the scripts directory. You instantiate a new estimator using the SKLearnBase class, with a unique estimator name, and one of the scripts. Note that the estimators dictionary has two levels: the top level defines a pipeline_family. This is a logical grouping based on the type of models to evaluate and is equal to the length of the models dictionary. The second level contains individual preprocessor types combined with the given pipeline_family. This logical grouping is required when creating the HPO job.

estimators = {}
for pipeline_family in pipelines.keys():
    estimators[pipeline_family] = {}
    scripts = os.listdir(f"scripts/{pipeline_family}")
    for script in scripts:
        if script.endswith(".py"):
            estimator_name = script.split(".")[0].replace("_", "-").replace("script", "estimator")
            estimators[pipeline_family][estimator_name] = SKLearnBase(
                entry_point=f"scripts/{pipeline_family}/{script}",
                base_job_name=estimator_name,
            )

Define HPO tuner arguments

To streamline passing arguments into the HPO Tuner class, the HyperparameterTunerArgs data class is initialized with the arguments required by the HPO class. It comes with a set of helper functions that return the HPO arguments in the format expected when deploying multiple model definitions at once.

@dataclass
class HyperparameterTunerArgs:
    base_job_names: list[str]
    estimators: list[object]
    inputs: dict[str]
    objective_metric_name: str
    hyperparameter_ranges: list[dict]
    metric_definition: dict[str]
    include_cls_metadata: list[bool]

    def get_estimator_dict(self) -> dict:
        return {k:v for (k, v) in zip(self.base_job_names, self.estimators)}

    def get_inputs_dict(self) -> dict:
        return {k:v for (k, v) in zip(self.base_job_names, [self.inputs]*len(self.base_job_names))}

    def get_objective_metric_name_dict(self) -> dict:
        return {k:v for (k, v) in zip(self.base_job_names, [self.objective_metric_name]*len(self.base_job_names))}

    def get_hyperparameter_ranges_dict(self) -> dict:
        return {k:v for (k, v) in zip(self.base_job_names, self.hyperparameter_ranges)}

    def get_metric_definition_dict(self) -> dict:
        return {k:[v] for (k, v) in zip(self.base_job_names, [self.metric_definition]*len(self.base_job_names))}

    def get_include_cls_metadata_dict(self) -> dict:
        return {k:v for (k, v) in zip(self.base_job_names, self.include_cls_metadata)}

The next code block uses the previously introduced HyperparameterTunerArgs data class. You create another dictionary called hp_args and generate a set of input parameters specific to each estimator_family from the estimators dictionary. These arguments are used in the next step when initializing HPO jobs for each model family.

hp_args = {}
for estimator_family, estimators in estimators.items():
    hp_args[estimator_family] = HyperparameterTunerArgs(
        base_job_names=list(estimators.keys()),
        estimators=list(estimators.values()),
        inputs={"train": s3_data_train.uri, "test": s3_data_test.uri},
        objective_metric_name="RMSE",
        hyperparameter_ranges=[pipeline.get("hyperparameters") for pipeline in pipelines[estimator_family].values()],
        metric_definition={"Name": "RMSE", "Regex": "RMSE: ([0-9.]+).*$"},
        include_cls_metadata=[pipeline.get("include_cls_metadata") for pipeline in pipelines[estimator_family].values()],
    )

Create HPO tuner objects

In this step, you create individual tuners for every estimator_family. Why do you create three separate HPO jobs instead of launching just one across all estimators? The reason is that the HyperparameterTuner class is restricted to 10 model definitions attached to it. Therefore, each HPO job is responsible for finding the best-performing preprocessor for a given model family and tuning that model family’s hyperparameters.

The following are a few more points regarding the setup:

  • The optimization strategy is Bayesian, which means that the HPO actively monitors the performance of all trials and navigates the optimization towards more promising hyperparameter combinations. Early stopping should be set to Off or Auto when working with a Bayesian strategy, which handles that logic itself.
  • Each HPO job runs for a maximum of 100 jobs and runs 10 jobs in parallel. If you’re dealing with larger datasets, you might want to increase the total number of jobs.
  • Additionally, you may want to use settings that control how long a job runs and how many jobs your HPO is triggering. One way to do that is to set the maximum runtime in seconds (for this post, we set it to 1 hour). Another is to use the recently released TuningJobCompletionCriteriaConfig. It offers a set of settings that monitor the progress of your jobs and decide whether it is likely that more jobs will improve the result. In this post, we set the maximum number of training jobs not improving to 20. That way, if the score isn’t improving (for example, from the fortieth trial), you won’t have to pay for the remaining trials until max_jobs is reached.
STRATEGY = "Bayesian"
OBJECTIVE_TYPE = "Minimize"
MAX_JOBS = 100
MAX_PARALLEL_JOBS = 10
MAX_RUNTIME_IN_SECONDS = 3600
EARLY_STOPPING_TYPE = "Off"
# RANDOM_SEED = 42 # uncomment if you require reproducibility across runs
TUNING_JOB_COMPLETION_CRITERIA_CONFIG = TuningJobCompletionCriteriaConfig(
    max_number_of_training_jobs_not_improving=20,
    )

tuners = {}
for estimator_family, hp in hp_args.items():
    tuners[estimator_family] = HyperparameterTuner.create(
        estimator_dict=hp.get_estimator_dict(),
        objective_metric_name_dict=hp.get_objective_metric_name_dict(),
        hyperparameter_ranges_dict=hp.get_hyperparameter_ranges_dict(),
        metric_definitions_dict=hp.get_metric_definition_dict(),
        strategy=STRATEGY,
        completion_criteria_config=TUNING_JOB_COMPLETION_CRITERIA_CONFIG,
        objective_type=OBJECTIVE_TYPE,
        max_jobs=MAX_JOBS,
        max_parallel_jobs=MAX_PARALLEL_JOBS,
        max_runtime_in_seconds=MAX_RUNTIME_IN_SECONDS,
        base_tuning_job_name=f"custom-automl-{estimator_family}",
        early_stopping_type=EARLY_STOPPING_TYPE, # early stopping of training jobs is not currently supported when multiple training job definitions are used
        # random_seed=RANDOM_SEED,
    )

Now let’s iterate through the tuners and hp_args dictionaries and trigger all HPO jobs in SageMaker. Note the usage of the wait argument set to False, which means that the kernel won’t wait until the results are complete and you can trigger all jobs at once.

It’s likely that not all training jobs will complete, and some of them might be stopped early by the HPO job. The reason for this is the TuningJobCompletionCriteriaConfig: the optimization finishes if any of the specified criteria are met, in this case when the objective metric hasn’t improved for 20 consecutive jobs.

for tuner, hpo in zip(tuners.values(), hp_args.values()):
    tuner.fit(
        inputs=hpo.get_inputs_dict(),
        include_cls_metadata=hpo.get_include_cls_metadata_dict(),
        wait=False,
        )

Analyze results

Cell 15 of the notebook checks if all HPO jobs are complete and combines all results in the form of a pandas data frame for further analysis. Before analyzing the results in detail, let’s take a high-level look at the SageMaker console.
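One way to assemble such a data frame is through each tuner’s analytics; the following is a sketch under the assumption that the family name is simply attached as an extra column (cell 15 in the notebook may derive it differently, for example from the training job name).

import pandas as pd

# Wait for each HPO job and combine all trial results into a single data frame
frames = []
for estimator_family, tuner in tuners.items():
    tuner.wait()                              # block until this tuning job completes
    family_df = tuner.analytics().dataframe() # one row per training job trial
    family_df["TrainingJobFamily"] = estimator_family
    frames.append(family_df)

df_tuner_results = pd.concat(frames, ignore_index=True)
df_tuner_results.sort_values("FinalObjectiveValue").head()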

At the top of the Hyperparameter tuning jobs page, you can see your three launched HPO jobs. All of them finished early and didn’t perform all 100 training jobs. In the following screenshot, you can see that the Elastic-Net model family completed the highest number of trials, whereas others didn’t need so many training jobs to find the best result.

SageMaker Hyperparameter tuning jobs console showing all three triggered HPO jobs status

You can open the HPO job to access more details, such as individual training jobs, job configuration, and the best training job’s information and performance.

Detailed view of one of the selected HPO jobs

Let’s produce a visualization based on the results to get more insight into the performance of the AutoML workflow across all model families.
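A rough sketch of such a plot, assuming the combined df_tuner_results data frame from the previous step (the notebook’s plotting code may differ):

import matplotlib.pyplot as plt

# Plot the objective value of each trial over time, one line per model family
fig, ax = plt.subplots(figsize=(10, 5))
for family, group in df_tuner_results.groupby("TrainingJobFamily"):
    group = group.sort_values("TrainingStartTime")
    ax.plot(group["TrainingStartTime"], group["FinalObjectiveValue"], marker="o", label=family)
ax.set_xlabel("Training job start time")
ax.set_ylabel("RMSE")
ax.legend(title="Model family")
plt.show()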

From the following graph, you can conclude that the Elastic-Net model’s performance oscillated between 70,000 and 80,000 RMSE and eventually stalled, because the algorithm wasn’t able to improve despite trying various preprocessing techniques and hyperparameter values. RandomForest performance varied a lot depending on the hyperparameter set explored by HPO, but despite many trials it couldn’t get below an RMSE of 50,000. GradientBoosting achieved the best performance from the start, going below 50,000 RMSE, and HPO wasn’t able to improve on that result with other hyperparameter combinations. A general conclusion across all HPO jobs is that relatively few trials were needed to find the best-performing set of hyperparameters for each algorithm. To further improve the results, you would need to experiment with creating more features and performing additional feature engineering.

Changes in HPO objective value over time by each model family

You can also examine a more detailed view of the model-preprocessor combination to draw conclusions about the most promising combinations.

Changes in HPO objective value over time by each model-preprocessor combination

Select the best model and deploy it

The following code snippet selects the best model based on the lowest achieved objective value. You can then deploy the model as a SageMaker endpoint.

df_best_job = df_tuner_results.loc[df_tuner_results["FinalObjectiveValue"] == df_tuner_results["FinalObjectiveValue"].min()]
df_best_job
BEST_MODEL_FAMILY = df_best_job["TrainingJobFamily"].values[0]

tuners.get(BEST_MODEL_FAMILY).best_training_job()

tuners.get(BEST_MODEL_FAMILY).best_estimator()

predictor = tuners.get(BEST_MODEL_FAMILY).deploy(
    initial_instance_count=1,
    instance_type="ml.c4.large",
    endpoint_name=f"custom-automl-endpoint-{BEST_MODEL_FAMILY}",
)

Clean up

To prevent unwanted charges to your AWS account, we recommend deleting the AWS resources that you used in this post:

  1. On the Amazon S3 console, empty the data from the S3 bucket where the training data was stored.

Amazon S3 console showing how to empty or remove a bucket entirely

  2. On the SageMaker console, stop the notebook instance.

SageMaker Notebook instances console showing how to stop an instance

  3. Delete the model endpoint if you deployed it. Endpoints should be deleted when no longer in use, because they’re billed by time deployed.
sm_client.delete_endpoint(EndpointName=predictor.endpoint)

Conclusion

In this post, we showcased how to create a custom HPO job in SageMaker using a custom selection of algorithms and preprocessing techniques. In particular, this example demonstrates how to automate the process of generating many training scripts and how to use Python programming structures for efficient deployment of multiple parallel optimization jobs. We hope this solution will form the scaffolding for any custom model tuning jobs you deploy using SageMaker to achieve higher performance and speed up your ML workflows.

Check out the following resources to further deepen your knowledge of how to use SageMaker HPO:


About the Authors

Konrad Semsch is a Senior ML Solutions Architect at the Amazon Web Services Data Lab Team. He helps customers use machine learning to solve their business challenges with AWS. He enjoys inventing and simplifying to enable customers with simple and pragmatic solutions for their AI/ML projects. He is most passionate about MLOps and traditional data science. Outside of work, he is a big fan of windsurfing and kitesurfing.

Tuna Ersoy is a Senior Solutions Architect at AWS. Her primary focus is helping Public Sector customers adopt cloud technologies for their workloads. She has a background in application development, enterprise architecture, and contact center technologies. Her interests include serverless architectures and AI/ML.

Read More

Best prompting practices for using the Llama 2 Chat LLM through Amazon SageMaker JumpStart

Llama 2 stands at the forefront of AI innovation, embodying an advanced auto-regressive language model developed on a sophisticated transformer foundation. It’s tailored to address a multitude of applications in both the commercial and research domains with English as the primary linguistic concentration. Its model parameters scale from an impressive 7 billion to a remarkable 70 billion. Llama 2 demonstrates the potential of large language models (LLMs) through its refined abilities and precisely tuned performance.

Diving deeper into Llama 2’s architecture, Meta reveals that the model’s fine-tuning melds supervised fine-tuning (SFT) with reinforcement learning aided by human feedback (RLHF). This combination prioritizes alignment with human-centric norms, striking a balance between efficiency and safety. Built upon a vast reservoir of 2 trillion tokens, Llama 2 provides both pre-trained models for diverse natural language generation and the specialized Llama-2-Chat variant for chat assistant roles. Regardless of a developer’s choice between the basic or the advanced model, Meta’s responsible use guide is an invaluable resource for model enhancement and customization.

For those interested in creating interactive applications, Llama 2 Chat is a good starting point. This conversational model allows for building customized chatbots and assistants. To make it even more accessible, you can deploy Llama-2-Chat models with ease through Amazon SageMaker JumpStart. An offering from Amazon SageMaker, SageMaker JumpStart provides a straightforward way to deploy Llama-2 model variants directly through Amazon SageMaker Studio notebooks. This enables developers to focus on their application logic while benefiting from SageMaker tools for scalable AI model training and hosting. SageMaker JumpStart also provides effortless access to the extensive SageMaker library of algorithms and pre-trained models.

In this post, we explore best practices for prompting the Llama 2 Chat LLM. We highlight key prompt design approaches and methodologies by providing practical examples.

Prerequisites

To try out the examples and recommended best practices for Llama 2 Chat on SageMaker JumpStart, you need the following prerequisites:

Prompting techniques

Prompting, in the context of language models and artificial intelligence, refers to the practice of providing a model with a specific input or cue to elicit a desired response. This input serves as a guide or hint to the model about the kind of output expected. Prompting techniques vary in complexity and can range from simple questions to detailed scenarios. Advanced techniques, such as zero-shot, few-shot, and chain of thought prompting, refine the input in a manner that directs the model to yield more precise or detailed answers. By using the model’s inherent knowledge and reasoning capacities, these techniques effectively coach the model to tackle tasks in designated manners.

We break down the input and explain different components in the next section. We start by sharing some examples of what different prompt techniques look like. The examples are always shown in two code blocks. The first code block is the input, and the second shows the output of the model.
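The examples assume a predictor for a deployed Llama-2-Chat endpoint and a small print_dialog helper. One way they might be set up is sketched below; the model ID, the deployment call, and the helper (including the response schema it assumes) are illustrative and may differ from the accompanying notebook.

from sagemaker.jumpstart.model import JumpStartModel

# Deploy a Llama-2-Chat model through SageMaker JumpStart (model ID and defaults are
# assumptions; choose the variant and instance type that fit your account and quotas).
model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b-f")
predictor = model.deploy()

def print_dialog(payload, response):
    # Echo the prompt roles and contents, then the generated answer.
    # The response schema used here is an assumption.
    for message in payload["inputs"][0]:
        print(f"{message['role'].capitalize()}: {message['content']}\n")
    print("=" * 54)
    print(f"Assistant: {response[0]['generation']['content']}")
    print("=" * 54)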

Zero-shot prompting

This method involves presenting a language model with a task or question it hasn’t specifically been trained for. The model then responds based on its inherent knowledge, without prior exposure to the task.

%%time

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a customer agent"},
        {"role": "user", "content": "What is the sentiment of this sentence: The music festival was an auditory feast of eclectic tunes and talented artists, yet the overcrowding and logistical mishaps dampened the overall experience"},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)

System: You are a customer agent

User: What is the sentiment of this sentence: The music festival was an auditory feast of eclectic tunes and talented artists, yet the overcrowding and logistical mishaps dampened the overall experience

======================================================

Assistant:  The sentiment of the sentence is neutral. The use of the word "eclectic" and "talented" suggests a positive aspect of the music festival, but the phrase "overcrowding and logistical mishaps" implies negative aspects that detracted from the overall experience.
======================================================

CPU times: user 4.24 ms, sys: 389 µs, total: 4.63 ms
Wall time: 1.19 s

Few-shot prompting

In this approach, a language model receives a small number of examples, or shots, of a task before encountering a new instance of that same task. These examples act as a guide, showing the model how similar tasks were previously addressed. Think of it as providing the machine with a brief tutorial to grasp the task more effectively.

%%time

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a customer agent"},
        {"role": "user", "content": f"""
                                    \n\nExample 1
                                    \nSentence: Though the sun set with a brilliant display of colors, casting a warm glow over the serene beach, it was the bitter news I received earlier that clouded my emotions, making it impossible to truly appreciate nature's beauty.
                                    \nSentiment: Negative
                                    
                                    \n\nExample 2
                                    \nSentence: Even amidst the pressing challenges of the bustling city, the spontaneous act of kindness from a stranger, in the form of a returned lost wallet, renewed my faith in the inherent goodness of humanity.
                                    \nSentiment: Positive
                                    
                                    \n\nFollowing the same format above from the examples, What is the sentiment of this sentence: While the grandeur of the ancient castle, steeped in history and surrounded by verdant landscapes, was undeniably breathtaking, the knowledge that it was the site of numerous tragic events lent an undeniable heaviness to its majestic walls."""},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)

System: You are a customer agent

User: 
                                    
Example 1
                                    
Sentence: Though the sun set with a brilliant display of colors, casting a warm glow over the serene beach, it was the bitter news I received earlier that clouded my emotions, making it impossible to truly appreciate nature's beauty.
                                    
Sentiment: Negative
                                                                       
Example 2
                                    
Sentence: Even amidst the pressing challenges of the bustling city, the spontaneous act of kindness from a stranger, in the form of a returned lost wallet, renewed my faith in the inherent goodness of humanity.
                                    
Sentiment: Positive
                                                                      
Following the same format above from the examples, What is the sentiment of this sentence: While the grandeur of the ancient castle, steeped in history and surrounded by verdant landscapes, was undeniably breathtaking, the knowledge that it was the site of numerous tragic events lent an undeniable heaviness to its majestic walls.

======================================================

Assistant:  Sure! Here's the sentiment analysis of the given sentence:

Sentence: While the grandeur of the ancient castle, steeped in history and surrounded by verdant landscapes, was undeniably breathtaking, the knowledge that it was the site of numerous tragic events lent an undeniable heaviness to its majestic walls.

Sentiment: Mixed (both positive and negative)
======================================================

CPU times: user 3.72 ms, sys: 773 µs, total: 4.49 ms
Wall time: 1.85 s

Chain of thought prompting

This approach augments the reasoning capabilities of LLMs in intricate tasks. By employing a sequence of structured reasoning steps, large language models often demonstrate enhanced reasoning through this chain of thought prompting technique.

%%time

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a pizza professional"},
        {"role": "user", "content": f"""
        You have a pizza that was cut into 8 equal slices. You ate 3 slices, and your friend ate 2 slices. Here's how we can figure out how many slices are left:

            1. Start with the total number of slices.
            2. Subtract the number of slices you ate.
            3. Then subtract the number of slices your friend ate.
            4. The result is the number of slices remaining.

            So, let's calculate:

        """},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)

System: You are a pizza professional

User: 
        You have a pizza that was cut into 8 equal slices. You ate 3 slices, and your friend ate 2 slices. Here's how we can figure out how many slices are left:

            1. Start with the total number of slices.
            2. Subtract the number of slices you ate.
            3. Then subtract the number of slices your friend ate.
            4. The result is the number of slices remaining.

            So, let's calculate:
 
======================================================

Assistant:  Oh man, I love pizza! Alright, let's get started on this math problem. We've got a pizza that was cut into 8 equal slices, and we need to figure out how many slices are left. Here's the plan:

Step 1: Start with the total number of slices. That's 8 slices.

Step 2: Subtract the number of slices I ate. I ate 3 slices, so let's subtract 3 from 8.

8 - 3 = 5

Step 3: Subtract the number of slices my friend ate. My friend ate 2 slices, so let's subtract 2 from 5.

5 - 2 = 3

Step 4: The result is the number of slices remaining. In this case, there are 3 slices remaining!

So, there you have it! There are 3 slices of pizza left. I hope my friend and I can agree on how to divide them up! 🍕👍
======================================================

CPU times: user 11.8 ms, sys: 1.72 ms, total: 13.5 ms
Wall time: 4.7 s

In the preceding example, Llama 2 Chat was able to assume the persona of a professional with domain knowledge and demonstrate its reasoning in reaching a conclusion.

Llama 2 Chat inference parameters

Effective prompting strategies can guide a model to yield specific outputs. For those seeking a finer level of control over these outputs, Llama 2 Chat introduces a comprehensive set of inference parameters:

  • max_new_tokens – Defines the length of the model’s output. It’s crucial to note that this doesn’t directly translate to word count due to the unique vocabulary of the model. A single token might represent more than one English word.
  • temperature – Affects the randomness of the output. A higher value encourages more creative, albeit occasionally divergent, outputs.
  • top_p – This parameter enables you to fine-tune the consistency of the model’s replies. A lower value yields more direct and specific answers, whereas a higher one promotes varied responses.

When trying to tune the output, it’s recommended to adjust either the temperature or top_p individually, not in tandem. Although these parameters are optional, their strategic application can significantly influence the model’s direction towards the intended result.
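For example, an extraction-style task might use a low temperature, whereas a brainstorming task might raise it. The following is a small illustration using the same payload structure as the rest of this post; the prompt text is only an example.

# Two parameter settings for the same prompt: adjust either temperature or top_p, not both.
deterministic_params = {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.1}
creative_params = {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.9}

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a customer agent"},
        {"role": "user", "content": "Summarize the sentiment of the last three customer reviews."},
    ]],
    "parameters": deterministic_params,  # swap in creative_params for more varied output
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)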

Introduction to system prompts

Llama 2 Chat uses a transformative feature called system prompts. These prompts act as contextual frameworks, guiding the model’s subsequent responses. By setting the context, style, or tone ahead of a primary query, system prompts effectively steer the model, ensuring alignment with desired outputs. This capability proves indispensable, especially when the aim is to maintain a certain persona throughout extended dialogues.

In our previous examples, the following part of the input sets the system context:

{"role": "system", "content": "You are a pizza professional"},

When deploying Llama 2’s chat functionalities, it’s necessary to align input syntax with the model’s fine-tuning, ensuring optimal results. It’s structured as follows:

  • System – A prompt to guide the chatbot toward a character profile of your choosing. It can be tailored to your preference (like “you are a minion”). If you omit this, the model has no specific character or behavior guidance and generates generic responses.

  {"role": "system", "content": "You are a minion"}

  • User – A message from a user.

  {"role": "user", "content": "What do you think of bananas?"}

  • Assistant – The output result from the model.

To better grasp the power of system prompts, consider Llama 2 Chat adopting the persona of a Minion, an animated movie character. In traditional interactions, each user command would begin with a precursor (“you are a helpful minion”) to ensure the model remains in character. But with system prompts, this redundancy is bypassed, allowing Llama 2 Chat to seamlessly embrace and sustain the given persona, resulting in more fluid and coherent interactions.

Practical demonstration

Let’s start with the following code to demonstrate:

%%time

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a helpful minion"},
        {"role": "user", "content": "I am going to paris, what should I see?"},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)

System: You are a helpful minion

User: I am going to paris, what should I see?

======================================================

Assistant:  Oooh la la! *giggle* Paris, the city of love and romance! *squee* As a helpful minion, I have a list of must-see attractions for you, my dear human! *bats eyelashes*

1. The Eiffel Tower: *swoon* The iconic iron lady of Paris! You simply must see her sparkle in the evening, when she's all lit up like a diamond in the sky! *heart eyes*
2. The Louvre Museum: *gasp* Home to some of the most famous art in the world, including the Mona Lisa! *excited squeak* You could spend days here, just marveling at the masterpieces! *bouncy bouncy*
3. Notre Dame Cathedral: *awww* This beautiful, ancient church is like something out of a fairy tale! *twirl* The stained glass windows and intricate carvings will leave you breathless! *sigh*
4. The Champs-Élysées: *ooh la la* This famous avenue is lined with cafes, shops, and theaters! *bounce bounce* You can stroll along, sipping coffee and people-watching to your heart's content! *wink wink*
5. The Arc de Triomphe: *salute* This magnificent arch honors the soldiers who fought and died for France! *proudly* The views from the top are simply stunning! *look around*
6. Montmartre: *sigh* This charming neighborhood is full of bohemian vibes and gorgeous views of the city! *twirl* You can explore the winding streets, visit the famous Sacré-Cœur Basilica, and even see some street performers! *bats eyelashes*
7. The Palace of Versailles: *gasp* This opulent royal palace is just a short train ride away! *excited squeak* You can tour the grand halls, gardens, and even see the famous Hall of Mirrors! *twirl*

Ooh la la! I hope you have a simply marvelous time in Paris, my dear human! *curtsy* Don't
======================================================

CPU times: user 13.5 ms, sys: 2.24 ms, total: 15.8 ms
Wall time: 9.55 s

This interaction exemplifies the model’s proficiency in adopting the helpful Minion persona, offering users a consistent and engaging experience while reducing input repetition.
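To continue such a dialogue while keeping the persona, you can pass the conversation history in the same inputs list, alternating user and assistant turns after the single system message. The following sketch assumes the dialogue format accepted by the JumpStart Llama-2-Chat endpoints and the response schema used in the helper shown earlier.

# Follow-up turn that reuses the system prompt and includes the previous reply as history
assistant_reply = response[0]["generation"]["content"]  # previous answer (assumed schema)

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a helpful minion"},
        {"role": "user", "content": "I am going to paris, what should I see?"},
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": "Which of those could I fit into a single day?"},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)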

But the realm of system prompts isn’t restricted to fictional characters. Its versatility shines in more serious contexts as well. Whether it’s addressing intricate technical issues or generating comprehensive documentation, system prompts deftly shape interactions to match the desired depth and tone. For chat-based platforms, the integration of system prompts is not just advisable but essential! Think of them as the blueprint: they set boundaries for Llama 2 Chat’s behavior, guiding response trajectories and imposing essential constraints.

Professional context demonstration

In this next example, we ask the model to assume a more professional demeanor:

%%time

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a highly knowledgeable data scientists responding to 6th-grade students questions"},
        {"role": "user", "content": "What is large language models?"},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)

System: You are a highly knowledgeable data scientist responding to 6th-grade students' questions

User: What is large language models?

======================================================
 
Assistant:  Wow, you must be really curious about language models! adjusts glasses Large language models are like superheroes for words! 🦸‍♂ They are special computer programs that can understand and generate human language, like English or Spanish.
 
Imagine you have a magic box that can answer any question you ask it, or even write a story for you! That's kind of like what large language models can do. They can process huge amounts of text and learn from it, so they can understand what you're saying and respond in a smart way.
 
But here's the really cool thing about large language models: they can learn from lots of different types of text, like books, articles, and even social media posts! 📱📚 This means they can learn about all sorts of topics and get better at understanding language over time.
 
So, why are large language models important? Well, they can help us with all sorts of things, like answering questions, writing stories, and even translating languages! 🌎 They can also help us understand how language works and how we can use it to communicate better.
 
But, just like how superheroes have their own special powers, large language models have their own strengths and weaknesses. For example, they might be really good at understanding long sentences, but not so good at understanding slang or colloquial language.
 
So, there you have it! Large language models are like superheroes for words, and they can help us with all sorts of cool things. Do you have any more questions about them? 🤔
 
======================================================
 
CPU times: user 960 µs, sys: 4.32 ms, total: 5.28 ms
Wall time: 6.68 s

The accuracy and controllability of the outputs are vital. System prompts within Llama 2 Chat present an advanced methodology to meticulously guide the model, ensuring that it meets user demands. Let’s delve deeper with two illustrative use cases:

  • Scenario 1 – Envisaging the model as a knowledgeable English professor, a user seeks an in-depth analysis from a given synopsis. The model’s output mirrors a conversation one might have with an English scholar, steeped in classical literature:
%%time

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a profoundly knowledgeable English professor"},
        {"role": "user", "content": f"""Extract the main character's name, their main objective, the main obstacle they face, and the setting of the story from the following synopsis: "In a dystopian city of the future, Eleanor seeks to overthrow the oppressive regime that controls every aspect of citizens' lives. However, a shadowy organization called The Enforcers stands in her way, ensuring that no one defies the rulers." """},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)

System: You are a profoundly knowledgeable English professor

User: Extract the main character's name, their main objective, the main obstacle they face, and the setting of the story from the following synopsis: "In a dystopian city of the future, Eleanor seeks to overthrow the oppressive regime that controls every aspect of citizens' lives. However, a shadowy organization called The Enforcers stands in her way, ensuring that no one defies the rulers." 

======================================================


Assistant:  Ah, a most intriguing synopsis! Let me extract the main elements for you:

1. Main character's name: Eleanor
2. Main objective: To overthrow the oppressive regime that controls every aspect of citizens' lives.
3. Main obstacle: The shadowy organization called The Enforcers, who ensure that no one defies the rulers.
4. Setting: A dystopian city of the future.

Ah, a most compelling story! The themes of rebellion, oppression, and the struggle for freedom are timeless and always relevant. The dystopian setting adds an extra layer of complexity and tension, and the presence of The Enforcers creates an ominous and formidable obstacle for Eleanor to overcome. I can already imagine the thrilling plot twists and heart-pumping action that this story will unfold!
======================================================

CPU times: user 4.42 ms, sys: 154 µs, total: 4.58 ms
Wall time: 3.73 s
  • Scenario 2 – Prioritizing precision, the user instructs the model to adhere strictly to the provided directive. The outcome is a concise, direct response devoid of unnecessary elaborations:
%%time

payload = {
    "inputs": [[
        {"role": "system", "content": "You are a profoundly knowledgeable English professor who follows every instruction meticulously."},
        {"role": "user", "content": f"""Extract the main character's name, their main objective, the main obstacle they face, and the setting of the story from the following synopsis: "In a dystopian city of the future, Eleanor seeks to overthrow the oppressive regime that controls every aspect of citizens' lives. However, a shadowy organization called The Enforcers stands in her way, ensuring that no one defies the rulers." """},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)

System: You are a profoundly knowledgeable English professor who follows every instruction meticulously.

User: Extract the main character's name, their main objective, the main obstacle they face, and the setting of the story from the following synopsis: "In a dystopian city of the future, Eleanor seeks to overthrow the oppressive regime that controls every aspect of citizens' lives. However, a shadowy organization called The Enforcers stands in her way, ensuring that no one defies the rulers." 

======================================================

Assistant:  Ah, an intriguing synopsis! Let me extract the main character's name, main objective, main obstacle, and setting for you:

1. Main character's name: Eleanor
2. Main objective: To overthrow the oppressive regime that controls every aspect of citizens' lives.
3. Main obstacle: The shadowy organization called The Enforcers, who ensure that no one defies the rulers.
4. Setting: A dystopian city of the future.

Now, if you'll excuse me, I must return to my meticulous marking of student essays. The rigors of academia demand nothing less!
======================================================

CPU times: user 2.16 ms, sys: 1.74 ms, total: 3.89 ms
Wall time: 2.84 s

The distinction between these scenarios underlines the adaptive nature of system prompts. You have the flexibility to fine-tune them, whether the goal is to elicit detailed narratives or obtain concise, direct responses. Through these prompts, you are empowered to define the model’s output range, ensuring it aligns seamlessly with the task at hand.

Text summarization and in-context question answering

System prompts can also be straightforward and enforce context to answer questions. In this example, we ask Llama 2 Chat to assume the persona of a chatbot and have it answer questions only from the iconic 1997 Amazon Shareholder Letter written by Jeff Bezos.

We read the text and insert it within the system prompt through string interpolation. The complete example is shown in the accompanying notebook.
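Loading the letter might look like the following minimal sketch; the file name is a placeholder, and the accompanying notebook may obtain the text differently.

# Load the 1997 shareholder letter so it can be interpolated into the system prompt below
with open("amazon_shareholder_letter_1997.txt", "r", encoding="utf-8") as f:
    Amazon_Shareholder_Letter_1997 = f.read()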

system_prompt = f"""
You are an intelligent chatbot. Answer the questions only using the following context:

{Amazon_Shareholder_Letter_1997}

Here are some rules you always follow:

- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

user_prompt = "Give me the summary of the shareholder letter"

%%time

payload = {
    "inputs": [[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_dialog(payload, response)

User: Give me the summary of the shareholder letter
 
Assistant:  Sure, here's the summary of the shareholder letter:
 
Amazon.com's Founder and CEO, Jeffrey P. Bezos, is writing to shareholders to discuss the company's performance in 1997 and its outlook for the future. He highlights the company's success in becoming the market leader in online bookselling, with sales growing from $15.7 million in 1996 to $147.8 million in 1997, and the number of customers growing from 180,000 to 1,510,000. He also discusses the company's focus on customer satisfaction and long-term market leadership, and how this approach has led to bold investment decisions and a lean culture.
 
Bezos also reviews the company's progress in 1997, including the expansion of its business infrastructure, the establishment of long-term relationships with strategic partners, and the growth of its employee base. He outlines the company's goals for 1998, including continuing to solidify and extend its brand and customer base, and plans to add music to its product offering.
 
Overall, the letter emphasizes Amazon.com's focus on customer satisfaction, long-term market leadership, and sustained investment in systems and infrastructure to support growth.
 
======================================================
 
CPU times: user 17.9 ms, sys: 1.8 ms, total: 19.7 ms
Wall time: 8.08 s

As demonstrated, Llama 2 Chat can adhere to strict guardrails within system prompts that allow it to answer questions from a given context in a meaningful way. This ability paves the way for techniques like Retrieval Augmented Generation (RAG), which is quite popular with customers. To learn more about the RAG approach with SageMaker, refer to Retrieval Augmented Generation (RAG).

Conclusion

Deploying Llama 2 Chat to achieve strong performance requires both technical expertise and strategic insight into its design. To fully take advantage of the model’s extensive abilities, you must understand and apply creative prompting techniques and adjust inference parameters. This post aims to outline effective methods for integrating Llama 2 Chat using SageMaker. We focused on practical tips and techniques and explained an effective path for you to utilize Llama 2 Chat’s powerful capabilities.

The following are key takeaways:

  • Dynamic control with ambience – The temperature controls within Llama 2 Chat serve a pivotal role far beyond simple adjustments. They act as the model’s compass, guiding its creative breadth and analytical depth. Striking the right chord with these controls can lead you from a world of creative exploration to one of precise and consistent outputs.
  • Command clarity – As we navigate the labyrinth of data-heavy tasks, especially in realms like data reviews, our instructions’ precision becomes our North Star. Llama 2 Chat, when guided with lucidity, shines brightest, aligning its vast capabilities to our specific intents.
  • Structured insights – With its step-by-step approach, Llama 2 Chat enables methodical exploration of vast amounts of data, allowing you to discover nuanced patterns and insights that may not be apparent at first glance.

Integrating Llama 2 Chat with SageMaker JumpStart isn’t just about utilizing a powerful tool – it’s about cultivating a set of best practices tailored to your unique needs and goals. Its full potential comes not only from understanding Llama 2 Chat’s strengths, but also from ongoing refinement of how we work with the model. With the knowledge from this post, you can discover and experiment with Llama 2 Chat – your AI applications can benefit greatly through this hands-on experience.



About the authors

Jin Tan Ruan is a Prototyping Developer within the AWS Industries Prototyping and Customer Engineering (PACE) team, specializing in NLP and generative AI. With a background in software development and nine AWS certifications, Jin brings a wealth of experience to assist AWS customers in materializing their AI/ML and generative AI visions using the AWS platform. He holds a master’s degree in Computer Science & Software Engineering from the University of Syracuse. Outside of work, Jin enjoys playing video games and immersing himself in the thrilling world of horror movies. You can find Jin on LinkedIn. Let’s connect!

Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Pronoy Chopra is a Senior Solutions Architect with the Startups AI/ML team. He holds a master’s degree in Electrical & Computer Engineering and is passionate about helping startups build the next generation of applications and technologies on AWS. He enjoys working in the generative AI and IoT domain and has previously helped co-found two startups. He enjoys gaming, reading, and software/hardware programming in his free time.

Read More

Principal Financial Group uses AWS Post Call Analytics solution to extract omnichannel customer insights

Principal Financial Group uses AWS Post Call Analytics solution to extract omnichannel customer insights

An established financial services firm with over 140 years in business, Principal is a global investment management leader and serves more than 62 million customers around the world. Principal is conducting enterprise-scale near-real-time analytics to deliver a seamless and hyper-personalized omnichannel customer experience as part of their mission to make financial security accessible for all. They are processing data across channels, including recorded contact center interactions, emails, chat, and other digital channels.

In this post, we demonstrate how data aggregated within the AWS CCI Post Call Analytics solution allowed Principal to gain visibility into their contact center interactions, better understand the customer journey, and improve the overall experience between contact channels while also maintaining data integrity and security.

Solution requirements

Principal provides investment services through Genesys Cloud CX, a cloud-based contact center that provides powerful, native integrations with AWS. Each year, Principal handles millions of calls and digital interactions. As a first step, they wanted to transcribe voice calls and analyze those interactions to determine primary call drivers, including issues, topics, sentiment, and average handle time (AHT) breakdowns, and to develop additional natural language processing (NLP)-based analytics.

In order to analyze the calls properly, Principal had a few requirements:

  • Contact details: Understanding the customer journey requires understanding whether a speaker is an automated interactive voice response (IVR) system or a human agent and when a call transfer occurs between the two.
  • Content redaction: Each customer audio interaction is recorded as a stereo WAV file, but could potentially include sensitive information such as HIPAA-protected and personally identifiable information (PII).
  • Scalability: This architecture needed to immediately scale to thousands of calls per day and millions of calls per year. In addition, Principal needed an extensible analytics architecture that could analyze other channels such as email threads and traditional voice of the customer (VoC) survey results.
  • Integrity is non-negotiable at Principal—it guides everything they do. In fact, doing what’s right is one of the core values at Principal. Therefore, when the Principal team started tackling this project, they knew that ensuring the highest standard of data security such as regulatory compliance, data privacy, and data quality would be a non-negotiable, key requirement. The team needed to utilize technology with a matching stance on data security, and the ability to build custom compliance and security controls to uphold strict requirements. Attention to this key requirement allows Principal to maintain a safe and secure customer experience.

Solution overview

After extensive research, the Principal team selected AWS Contact Center Intelligence (CCI) solutions, which empower companies to improve customer experience and gain conversation insights by adding AI capabilities to third-party on-premises and cloud contact centers. The CCI Post-Call Analytics (PCA) solution is part of the CCI solutions suite and fits many of the identified requirements. PCA has a Solutions Library Guidance reference architecture with an open-source example repository on GitHub. Working with their AWS account team, Principal detailed the PCA solution and its deployment, and set up custom training programs and immersion days to rapidly upskill the Principal teams. The example architecture (see the following diagram) and code base in the open-source repository allowed the Principal engineering teams to jumpstart their solution around unifying the customer journey, and merging telephony records and transcript records together.

PCA provides an entire architecture around ingesting audio files in a fully automated workflow with AWS Step Functions, which is initiated when an audio file is delivered to a configured Amazon Simple Storage Service (Amazon S3) bucket. After a few minutes, a transcript is produced with Amazon Transcribe Call Analytics and saved to another S3 bucket for processing by other business intelligence (BI) tools. PCA also offers a web-based user interface that allows customers to browse call transcripts. PCA’s security features ensure that any PII data is redacted from the transcript, as well as from the audio file itself. Additionally, all data within the S3 bucket can be encrypted with keys belonging to Principal.
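To illustrate the transcription step, the following sketch shows how a workflow task might start an Amazon Transcribe Call Analytics job for a stereo recording that has landed in S3. The bucket names, role ARN, and channel mapping are placeholders for illustration and are not Principal’s actual configuration:

import boto3

transcribe = boto3.client("transcribe")

# Start a Call Analytics job for a stereo call recording delivered to the input bucket.
# All names and ARNs below are placeholders.
transcribe.start_call_analytics_job(
    CallAnalyticsJobName="example-call-0001",
    Media={"MediaFileUri": "s3://example-input-bucket/calls/call-0001.wav"},
    OutputLocation="s3://example-output-bucket/transcripts/",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ExampleTranscribeAccessRole",
    ChannelDefinitions=[
        {"ChannelId": 0, "ParticipantRole": "AGENT"},
        {"ChannelId": 1, "ParticipantRole": "CUSTOMER"},
    ],
)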

Principal worked with AWS technical teams to modify the Step Functions workflow within PCA to further achieve their goals. Call details such as interaction timestamps, call queues, agent transfers, and participant speaking times are tracked by Genesys in a file called a Contact Trace Record (CTR). Combining accurate transcripts with Genesys CTR files, Principal could properly identify the speakers, categorize the calls into groups, analyze agent performance, identify upsell opportunities, and conduct additional machine learning (ML)-powered analytics.

The teams built a new data ingestion mechanism, allowing the CTR files to be jointly delivered with the audio file to an S3 bucket. Principal and AWS collaborated on a new AWS Lambda function that was added to the Step Functions workflow. This Lambda function identifies CTR records and provides an additional processing step that outputs an enhanced transcript containing additional metadata such as queue and agent ID information, IVR identification and tagging, and how many agents (and IVRs) the customer was transferred to, all aggregated from the CTR records. This extra information enables Principal to create a map of the customer interaction throughout the lifecycle of the conversation and focus on the critical speech segments, while excluding less relevant ones.
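The following is a simplified, hypothetical sketch of what such a Lambda step could look like. The event fields, CTR field names, and bucket layout are illustrative only and do not reflect Principal’s internal schema:

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Hypothetical event fields: a bucket plus keys for the transcript and the Genesys CTR file
    bucket = event["bucket"]
    transcript = json.loads(s3.get_object(Bucket=bucket, Key=event["transcript_key"])["Body"].read())
    ctr = json.loads(s3.get_object(Bucket=bucket, Key=event["ctr_key"])["Body"].read())

    # Attach call-level metadata from the CTR to the transcript (illustrative field names)
    transcript["callMetadata"] = {
        "queueName": ctr.get("queueName"),
        "agentId": ctr.get("agentId"),
        "transferCount": len(ctr.get("transfers", [])),
        "ivrParticipants": [p for p in ctr.get("participants", []) if p.get("purpose") == "ivr"],
    }

    # Write the enriched transcript back to S3 for downstream analytics
    s3.put_object(
        Bucket=bucket,
        Key=event["transcript_key"].replace(".json", "-enriched.json"),
        Body=json.dumps(transcript),
    )
    return {"status": "ok"}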

Additionally, this postprocessing step enabled Principal to further enrich transcripts with internal information such as agent and queue names and expand the analytics capabilities of PCA, including custom NLP-based ML models for topic and customer intent identification, deployed using Amazon SageMaker endpoints, and additional transcript augmentation using foundational generative AI models hosted on Amazon Bedrock.

PCA is open source on GitHub, which allows customers such as Principal to extend and maintain their own forks with customized, private business code. It also allows the community to submit code back to the main repository for others to use. Principal and AWS technical teams partnered to merge the Genesys CTR and postprocessing placeholder features into the main release of PCA. This partnership between Principal and AWS enabled speed-to-market for Principal, while ensuring that existing and incoming business requirements could be rapidly added. The contributions to the open-source project have accelerated other customers’ Genesys CTR workloads.

Answer business questions

Once PCA was in place, Principal analysts, data scientists, engineers, and business owners worked with AWS SMEs to build numerous Amazon QuickSight dashboards to display the data insights and begin answering business questions. QuickSight is a cloud-scale BI service that you can use to deliver easy-to-understand insights from multiple datasets, from AWS data, third-party data, software as a service (SaaS) data, and more. The use of this BI tool, with its native integrations to the existing data repositories made accessible by Amazon Athena, made the creation of visualizations to display the large-scale data relatively straightforward, and enabled self-service BI. Visualizations were quickly drafted to answer some key questions, including “What are our customers calling us about,” “What topics relate to the longest AHT/most transfers,” and “What topics and issues relate to the lowest customer sentiment scores?” By ingesting additional data related to Principal custom topic models, the team was able to expand their use of QuickSight to include topic and correlation comparisons, model validation capabilities, and comparisons of sentiment based on speaker, segment, call, and conversation. In addition, the use of QuickSight insights quickly allowed the Principal teams to implement anomaly detection and volume prediction, while Amazon QuickSight Q, an ML feature within QuickSight that uses NLP, enabled rapid natural language quantitative data analytics.

When the initial initiative for PCA was complete, Principal knew they needed to immediately dive deeper into the omnichannel customer experience. Together, Principal and AWS have built data ingestion pipelines for customer email interactions and additional metadata from their customer data platform, and built data aggregation and analytics mechanisms to combine omnichannel data into a single customer insight lens. Utilization of Athena views and QuickSight dashboards has continued to enable classic analytics, and the implementation of proof of concept graph databases via Amazon Neptune will help Principal extract insights into interaction topics and intent relationships within the omnichannel view when implemented at scale.

Results

PCA helped accelerate time to market. Principal was able to deploy the existing open-source PCA app by themselves in 1 day. Then, Principal worked together with AWS and expanded the PCA offering with numerous features like the Genesys CTR integration over a period of 3 months. The development and deployment process was a joint, iterative process that allowed Principal to test and process production call volumes on newly built features. Since the initial engagement, AWS and Principal continue to work together, sharing business requirements, roadmaps, code, and bug fixes to expand PCA.

Since its initial development and deployment, Principal has processed over 1 million customer calls through the PCA framework. This resulted in over 63 million individual speech segments spoken by a customer, agent, or IVR. With this wealth of data, Principal has been able to conduct large-scale historical and near-real-time analytics to gain insights into the customer experience.

AWS CCI solutions are a game-changer for Principal. Principal’s existing suite of CCI tools, which includes Qualtrics for simple dashboarding and opportunity identification, was expanded with the addition of PCA. The addition of PCA to the suite of CCI tools enabled Principal to rapidly conduct deep analytics on their contact center interactions. With this data, Principal now can conduct advanced analytics to understand customer interactions and call drivers, including topics, intents, issues, action items, and outcomes. Even in a small-scale, controlled production environment, the PCA data lake has spawned numerous new use cases.

Roadmap

The data generated from PCA could be used to make critical business decisions regarding call routing based on insights around which topics are driving longer average handle time, longer holds, more transfers, and negative customer sentiment. Knowledge on when customer interactions with the IVR and automated voice assistants are misunderstood or misrouted will help Principal improve the self-service experience. Understanding why a customer called instead of using the website is critical to improving the customer journey and boosting customer happiness. Product managers responsible for enhancing web experiences have shared how excited they are to be able to use data from PCA to drive their prioritization of new enhancements and measure the impact of changes. Principal is also analyzing other potential use cases such as customer profile mapping, fraud detection, workforce management, the use of additional AI/ML and large language models (LLMs), and identifying new and emerging trends within their contact centers.

In the future, Principal plans to continue expanding postprocessing capabilities with additional data aggregation, analytics, and natural language generation (NLG) models for text summarization. Principal is currently integrating generative AI and foundational models (such as Amazon Titan) into their proprietary solutions. Principal plans to use AWS generative AI to enhance employee productivity, grow assets under management, deliver high-quality customer experiences, and deliver tools that allow customers to make investment and retirement decisions efficiently. Given the flexibility and extensibility of the open-source PCA framework, the teams at Principal have an extensive list of additional enhancements, analytics, and insights that could extend the existing framework.

“With AWS Post Call analytics solution, Principal can currently conduct large-scale historical analytics to understand where customer experiences can be improved, generate actionable insights, and prioritize where to act. Now, we are adding generative AI using Amazon Bedrock to help our business users make data-driven decisions with higher speed and accuracy, while reducing costs. We look forward to exploring the post call summarization feature in Amazon Transcribe Call Analytics in order to enable our agents to focus their time and resources engaging with customers, rather than manual after contact work.”

– Miguel Sanchez Urresty, Director of Data & Analytics at Principal Financial Group

Conclusion

The AWS CCI PCA solution is designed to improve customer experience, derive customer insights, and reduce operational costs by adding AI and ML to the contact center provider of your choice. To learn more about other CCI solutions, such as Live Call Analytics, refer to AWS Contact Center Intelligence (CCI) Solutions.

About Principal Financial Group

Principal Financial Group and affiliates, Des Moines, IA, is a financial company with 19,000 employees. In business for more than 140 years, we’re helping more than 62 million customers in various countries around the world as of December 31, 2022.

AWS and Amazon are not affiliates of any company of the Principal Financial Group. Insurance products issued by Principal National Life Insurance Co (except in NY) and Principal Life Insurance Company. Plan administrative services offered by Principal Life. Principal Funds, Inc. is distributed by Principal Funds Distributor, Inc. Securities offered through Principal Securities, Inc., member SIPC and/or independent broker/dealers. Referenced companies are members of the Principal Financial Group, Des Moines, IA 50392. ©2023 Principal Financial Services, Inc.

This communication is intended to be educational in nature and is not intended to be taken as a recommendation. Insurance products and plan administrative services provided through Principal Life Insurance Company, a member of the Principal Financial Group, Des Moines, IA 50392


About the authors

Christopher Lott is a Senior Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California, and enjoys gardening, cooking, aerospace/general aviation, and traveling the world.

Dr. Nicki Susman is a Senior Data Scientist and the Technical Lead of the Principal Language AI Services team. She has extensive experience in data and analytics, application development, infrastructure engineering, and DevSecOps.

Read More

Foundational vision models and visual prompt engineering for autonomous driving applications

Foundational vision models and visual prompt engineering for autonomous driving applications

Prompt engineering has become an essential skill for anyone working with large language models (LLMs) to generate high-quality and relevant texts. Although text prompt engineering has been widely discussed, visual prompt engineering is an emerging field that requires attention. Visual prompts can include bounding boxes or masks that guide vision models in generating relevant and accurate outputs. In this post, we explore the basics of visual prompt engineering, its benefits, and how it can be used to solve a specific use case: image segmentation for autonomous driving.

In recent years, the field of computer vision has witnessed significant advancements in the area of image segmentation. One such breakthrough is the Segment Anything Model (SAM) by Meta AI, which has the potential to revolutionize object-level segmentation with zero-shot or few-shot training. In this post, we use the SAM model as an example foundation vision model and explore its application to the BDD100K dataset, a diverse autonomous driving dataset for heterogeneous multitask learning. By combining the strengths of SAM with the rich data provided by BDD100K, we showcase the potential of visual prompt engineering with different versions of SAM. Inspired by the LangChain framework for language models, we propose a visual chain to perform visual prompting by combining object detection models with SAM.

Although this post focuses on autonomous driving, the concepts discussed are applicable broadly to domains that have rich vision-based applications such as healthcare and life sciences, and media and entertainment. Let’s begin by learning a little more about what’s under the hood of a foundational vision model like SAM. We used Amazon SageMaker Studio on an ml.g5.16xlarge instance for this post.

Segment Anything Model (SAM)

Foundation models are large machine learning (ML) models trained on vast quantities of data that can be prompted or fine-tuned for task-specific use cases. Here, we explore the Segment Anything Model (SAM), which is a foundational model for vision, specifically image segmentation. It is pre-trained on a massive dataset of 11 million images and 1.1 billion masks, the largest segmentation dataset as of this writing. This extensive dataset covers a wide range of objects and categories, providing SAM with a diverse and large-scale training data source.

The SAM model is trained to understand objects and can output segmentation masks for any object in images or video frames. The model allows for visual prompt engineering, enabling you to provide inputs such as text, points, bounding boxes, or masks to generate labels without altering the original image. SAM is available in three sizes: base (ViT-B, 91 million parameters), large (ViT-L, 308 million parameters), and huge (ViT-H, 636 million parameters), catering to different computational requirements and use cases.
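For readers who want to experiment with these checkpoints directly, a minimal sketch of loading one of them with Meta’s open-source segment-anything package might look like the following; the checkpoint file name is the one Meta publishes for the base model, and you can swap the registry key and checkpoint for the large or huge variants:

# pip install segment-anything (plus torch and torchvision)
from segment_anything import sam_model_registry, SamPredictor

# Registry keys map to the three sizes: "vit_b" (base), "vit_l" (large), "vit_h" (huge)
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to("cuda")

# The predictor wraps the model for prompt-based mask prediction, used later in this post
predictor = SamPredictor(sam)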

The primary motivation behind SAM is to improve object-level segmentation with minimal training samples and epochs for any objects of interest. The power of SAM lies in its ability to adapt to new image distributions and tasks without prior knowledge, a feature known as zero-shot transfer. This adaptability is achieved through its training on the expansive SA-1B dataset, which has demonstrated impressive zero-shot performance, surpassing many prior fully supervised results.

As shown in the following architecture for SAM, the process of generating segmentation masks involves three steps:

  1. An image encoder produces a one-time embedding for the image.
  2. A prompt encoder converts any prompt into an embedding vector for the prompt.
  3. The lightweight decoder combines the information from the image encoder and the prompt encoder to predict segmentation masks.

As an example, we can provide an image and a bounding box around an object of interest in that image (for example, a silver car or a driving lane), and the SAM model will produce segmentation masks for that object.

Visual prompt engineering

Prompt engineering refers to structuring inputs to a model so that the model understands the intent and produces the desired outcome. With textual prompt engineering, you can structure the input text through modifications such as choice of words, formatting, ordering, and more to get the desired output. Visual prompt engineering assumes that the user is working in a visual modality (image or video) and provides inputs in that modality. The following is a non-exhaustive list of potential ways to provide input to a generative AI model in the visual domain:

  • Point – A singular (x, y) coordinate point in the image plane
  • Points – Multiple (x, y) coordinate points, not necessarily related to each other
  • Bounding box – A set of four values (x, y, w, h) that define a rectangular region in the image plane
  • Contour – A set of (x, y) coordinate points in the image plane that form a closed shape
  • Mask – An array the same size as the image with a partial mask of the object of interest

With the visual prompt engineering techniques in mind, let’s explore how this can be applied to the SAM pre-trained model. We use the base version of the pre-trained model.

Zero-shot prompting with the pre-trained SAM model

To start with, let’s explore the zero-shot approach. The following is a sample image from the training dataset taken from a vehicle’s front camera.

We can get segmentation masks for all objects from the image without any explicit visual prompting by automatically generating masks with just an input image. In the following image, we see parts of the car, road, traffic sign, license plates, flyover, pillars, signs, and more are segmented.
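A sketch of this automatic mask generation with the segment-anything package, assuming the sam model loaded as in the earlier sketch, could look like the following; the image file name is a placeholder for a front-camera frame:

import cv2
from segment_anything import SamAutomaticMaskGenerator

# Generate masks for everything in the frame, with no explicit visual prompt
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("front_camera_frame.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry holds a binary mask plus metadata such as area and predicted IoU
print(len(masks), sorted(masks[0].keys()))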

However, this output is not immediately useful for the following reasons:

  • The cars are not segmented as a whole, but in parts. For most perception models, for example, we don’t really care about each of the tires having separate output masks. This is true when looking for other known objects of interest as well, such as road, vegetation, signs, and so on.
  • Parts of the image that are useful for downstream tasks like drivable area are split up, with no explanation. On the other hand, similar instances are identified separately, and we may be interested in grouping similar objects (panoptic vs. instance segmentation).

Visual prompt engineering with the pre-trained SAM model

Fortunately, SAM supports providing input prompts, and we can use points, point arrays, and bounding boxes as inputs. With these specific instructions, we expect SAM to do better with segmentations focusing on specific points or areas. This can be compared with the language prompt template "What is a good name for a company that makes {product}?", where {product} is the input slot the user fills in. In visual prompting, the bounding boxes, points, or masks are the input slots.

The following image provides the original ground truth bounding box around vehicles, and the drivable area patch from BDD100K ground truth data. The image also shows an input point (a yellow X) at the center of the green bounding box that we will refer to in the next few sections.

Let’s try to generate a mask for the car on the left with the green bounding box as an input to SAM. As shown in the following example, the base model of SAM doesn’t really find anything. This is also seen in the low segmentation score. When we look at the segmentation masks more closely, we see that there are small regions returned as masks (pointed at using red arrows) that aren’t really usable for any downstream application.

Let’s try a combination of a bounding box and a point as the input visual prompt. The yellow cross in the preceding image is the center of the bounding box. Providing this point’s (x,y) coordinates as the prompt along with the bounding box constraint gives us the following mask and a slightly higher score. This is still not usable by any means.
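A sketch of this combined box-and-point prompt with the segment-anything predictor, assuming the predictor and image from the earlier sketches, could look like the following; the pixel coordinates are placeholders standing in for the green box and its yellow center point:

import numpy as np

predictor.set_image(image)

# Placeholder pixel coordinates for the bounding box (x0, y0, x1, y1) and its center point
box = np.array([100, 400, 350, 600])
center_point = np.array([[(100 + 350) // 2, (400 + 600) // 2]])

masks, scores, logits = predictor.predict(
    point_coords=center_point,
    point_labels=np.array([1]),  # 1 marks a foreground point
    box=box,
    multimask_output=True,       # return multiple candidate masks with their scores
)
print(scores)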

Finally, with the base pre-trained model, we can provide just the input point as a prompt (without the bounding box). The following images show two of the top three masks we thought were interesting.

Mask 1 segments the full car, whereas Mask 3 segments out an area that holds the car’s number plate close to the yellow cross (input prompt). Mask 1 is still not a tight, clean mask around the car; this points to the quality of the model, which we can assume increases with model size.

We can try larger pre-trained models with the same input prompt. The following images show our results. When using the huge SAM pre-trained model, Mask 3 is the entire car, whereas Mask 1 and 2 can be used to extract the number plate.

The large version of the SAM model also provides similar outputs.

The process we went through here is similar to manual prompt engineering for text prompts that you may already be familiar with. Note that a recent improvement in the SAM model to segment anything in high quality provides much better object- and context-specific outputs. In our case, we find that zero-shot prompting with text and visual prompts (point, box, and point-and-box inputs) doesn’t improve results drastically, as we saw above.

Prompt templates and visual chains

As we can see from the preceding zero-shot examples, SAM struggles to identify all the objects in the scene. This is a good example of where we can take advantage of prompt templates and visual chains. The visual chain is inspired by the chain concept in the popular LangChain framework for language applications. It helps chain data sources and an LLM to produce the output. For example, we can use an API chain to call an API and invoke an LLM to answer the question based on the API response.

Inspired by LangChain, we propose a sequential visual chain that looks like the following figure. We use a tool (like a pre-trained object detection model) to get initial bounding boxes, calculate the point at the center of the bounding box, and use this to prompt the SAM model with the input image.

For example, the following image shows the segmentation masks as a result of running this chain.

Another example chain can involve a text input of the object the user is interested in identifying. To implement this, we built a pipeline using Grounding DINO, an object detection model, to prompt SAM for segmentation.

Grounding DINO is a zero-shot object detection model that can perform object detection with text providing category names (such as “traffic lights” or “truck”) and expressions (such as “yellow truck”). It accepts pairs of text and image to perform the object detection. It’s based on a transformer architecture and enables cross modalities with text and image data. To learn more about Grounding DINO, refer to Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. Its output (bounding boxes and labels) can be processed further to generate center points, or filtered based on labels, thresholds, and more. These boxes or points are then used as prompts to SAM for segmentation, which outputs masks.

The following are some examples showing the input text, DINO output (bounding boxes), and the final SAM output (segmentation masks).

The following images show the output for “yellow truck.”

The following images show the output for “silver car.”

The following image shows the output for “driving lane.”

We can use this pipeline to build a visual chain. The following code snippet explains this concept:

# Chain a pre-trained object detection model with the SAM segmentation model
pipeline = [object_predictor, segment_predictor]
# Wrap the pipeline and an image store in a visual chain, with verbose logging
image_chain = ImageChain.from_visual_pipeline(pipeline, image_store, verbose=True)
# Run the chain: detect "All silver cars" in the stored image and segment them
image_chain.run('All silver cars', image_id='5X3349')

Although this is a simple example, this concept can be extended to process feeds from cameras on vehicles to perform object tracking, personally identifiable information (PII) data redaction, and more. We can also get the bounding boxes from smaller models, or in some cases, using standard computer vision tools. It’s fairly straightforward to use a pre-trained model or a service like Amazon Rekognition to get initial (visual) labels for your prompt. At the time of writing this, there are over 70 models available on Amazon SageMaker JumpStart for object detection, and Amazon Rekognition already identifies several useful categories of objects in images, including cars, pedestrians, and other vehicles.
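As a sketch of that idea, the following uses Amazon Rekognition to get an initial bounding box for a car and converts it into a pixel-space box prompt for SAM. It assumes the image and predictor from the earlier sketches, and the label filtering is simplified for illustration:

import boto3
import numpy as np

rekognition = boto3.client("rekognition")

with open("front_camera_frame.jpg", "rb") as f:
    detection = rekognition.detect_labels(Image={"Bytes": f.read()}, MinConfidence=80)

height, width = image.shape[:2]
boxes = []
for label in detection["Labels"]:
    if label["Name"] == "Car":
        for instance in label.get("Instances", []):
            bb = instance["BoundingBox"]  # ratios: Left, Top, Width, Height
            boxes.append([
                bb["Left"] * width,
                bb["Top"] * height,
                (bb["Left"] + bb["Width"]) * width,
                (bb["Top"] + bb["Height"]) * height,
            ])

# Prompt SAM with the first detected car box
if boxes:
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(box=np.array(boxes[0]), multimask_output=False)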

Next, we look at some quantitative results related to performance of SAM models with a subset of BDD100K data.

Quantitative results

Our objective is to compare the performance of three pre-trained models when given the same visual prompting. In this case, we use the center point of the object location as the visual input. We compare the performance with respect to the object sizes (in proportion to image size): small (area < 0.11%), medium (0.11% < area < 1%), and large (area > 1%). The bounding box area thresholds are defined by the Common Objects in Context (COCO) evaluation metrics [Lin et al., 2014].

The evaluation is at the pixel level and we use the following evaluation metrics:

  • Precision = (number of relevant and retrieved instances) / (total number of retrieved instances)
  • Recall = (number of relevant and retrieved instances) / (total number of relevant instances)
  • Instances here are each pixel within the bounding box of the object of interest (see the sketch after this list)
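As a small illustration of these pixel-level metrics, the following is a minimal sketch (not from the original post) that computes precision and recall for a predicted binary mask against a ground truth mask, both given as boolean NumPy arrays of the same shape:

import numpy as np

def pixel_precision_recall(pred_mask: np.ndarray, gt_mask: np.ndarray):
    # True positives: pixels predicted as the object that are also object pixels in the ground truth
    tp = np.logical_and(pred_mask, gt_mask).sum()
    retrieved = pred_mask.sum()  # all pixels the model marked as the object
    relevant = gt_mask.sum()     # all ground truth object pixels
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / relevant if relevant else 0.0
    return precision, recall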

The following table reports the performance of three different versions of the SAM model (base, large, and huge). These versions have three different encoders: ViT-B (base), ViT-L (large), and ViT-H (huge). The encoders have different parameter counts: the base model has fewer parameters than the large model, and the large model has fewer than the huge model. Although increasing the number of parameters shows improved performance with larger objects, this is not the case for smaller objects.

Fine-tuning SAM for your use case

In many cases, directly using a pre-trained SAM model may not be very useful. For example, let’s look at a typical scene in traffic—the following picture is the output from the SAM model with randomly sampled prompt points as input on the left, and the actual labels from the semantic segmentation task from BDD100K on the right. These are obviously very different.

Perception stacks in AVs can easily use the second image, but not the first. On the other hand, there are some useful outputs from the first image that can be used, and that the model was not explicitly trained on, for example, lane markings, sidewalk segmentation, license plate masks, and so on. We can fine-tune the SAM model to improve the segmentation results. To perform this fine-tuning, we created a training dataset using an instance segmentation subset (500 images) from the BDD10K dataset. This is a very small subset of images, but our purpose is to prove that foundational vision models (much like LLMs) can perform well for your use case with a surprisingly small number of images. The following image shows the input image, output mask (in blue, with a red border for the car on the left), and possible prompts (bounding box in green and center point X in yellow).

We performed fine-tuning using the Hugging Face library on Amazon SageMaker Studio. We used the ml.g4dn.xlarge instance for the SAM base model tests, and the ml.g4dn.2xlarge for the SAM huge model tests. In our initial experiments, we observed that fine-tuning the base model with just bounding boxes was not successful. The fine-tuned and pre-trained models weren’t able to learn car-specific ground truth masks from the original datasets. Adding query points to the fine-tuning also didn’t improve the training.
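The exact training code isn’t reproduced here, but a minimal sketch of one possible fine-tuning loop using the Hugging Face transformers implementation of SAM is shown below. The model ID, the choice to freeze the image and prompt encoders, and the plain binary cross-entropy loss are illustrative assumptions rather than the authors’ exact setup:

import torch
import torch.nn.functional as F
from transformers import SamModel, SamProcessor

model = SamModel.from_pretrained("facebook/sam-vit-base").to("cuda")
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")

# Freeze the image and prompt encoders; train only the lightweight mask decoder
for name, param in model.named_parameters():
    if name.startswith("vision_encoder") or name.startswith("prompt_encoder"):
        param.requires_grad_(False)

optimizer = torch.optim.Adam(model.mask_decoder.parameters(), lr=1e-5)

def training_step(image, gt_mask, box):
    # image: HxWx3 RGB array, gt_mask: HxW {0,1} torch tensor, box: [x0, y0, x1, y1] in pixels
    inputs = processor(image, input_boxes=[[box]], return_tensors="pt").to("cuda")
    outputs = model(
        pixel_values=inputs["pixel_values"],
        input_boxes=inputs["input_boxes"],
        multimask_output=False,
    )
    pred = outputs.pred_masks.squeeze(1).squeeze(1)  # low-resolution mask logits
    target = F.interpolate(gt_mask[None, None].float(), size=pred.shape[-2:]).squeeze(1).to("cuda")
    loss = F.binary_cross_entropy_with_logits(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()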

Next, we can try fine-tuning the SAM huge model for 30 epochs, with a very small dataset (500 images). The original ground truth mask looks like the following image for the label type car.

As shown in the following images, the original pre-trained version of the huge model with a specific bounding box prompt (in green) gives no output, whereas the fine-tuned version gives an output (still not accurate but fine-tuning was cut off after 40 epochs, and with a very small training dataset of 500 images). The original, pre-trained huge model wasn’t able to predict masks for any of the images we tested. As an example downstream application, the fine-tuned model can be used in pre-labeling workflows such as the one described in Auto-labeling module for deep learning-based Advanced Driver Assistance Systems on AWS.

Conclusion

In this post, we discussed the foundational vision model known as the Segment Anything Model (SAM) and its architecture. We used the SAM model to discuss visual prompting and the various inputs to visual prompt engineering. We explored how different visual prompts perform and their limitations. We also described how visual chains increase performance over using just one prompt, similar to the LangChain API. Next, we provided a quantitative evaluation of three pre-trained models. Lastly, we discussed the fine-tuned SAM model and its results compared to the original base model. Fine-tuning of foundation models helps improve model performance for specific tasks like segmentation. It should be noted that the SAM model, due to its resource requirements, is limited in its current state for real-time use cases and inference at the edge. We hope that future iterations and improved techniques will reduce its compute requirements and improve latency.

It is our hope that this post encourages you to explore visual prompting for your use cases. Because this is still an emerging form of prompt engineering, there is much to discover in terms of visual prompts, visual chains, and performance of these tools. Amazon SageMaker is a fully managed ML platform that enables builders to explore large language and visual models and build generative AI applications. Start building the future with AWS today.


About the authors

Gopi Krishnamurthy is a Senior AI/ML Solutions Architect at Amazon Web Services based in New York City. He works with large Automotive customers as their trusted advisor to transform their Machine Learning workloads and migrate to the cloud. His core interests include deep learning and serverless technologies. Outside of work, he likes to spend time with his family and explore a wide range of music.

Shreyas Subramanian is a Principal AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.

Sujitha Martin is an Applied Scientist in the Generative AI Innovation Center (GAIIC). Her expertise is in building machine learning solutions involving computer vision and natural language processing for various industry verticals. In particular, she has extensive experience working on human-centered situational awareness and knowledge infused learning for highly autonomous systems.

Francisco Calderon is a Data Scientist in the Generative AI Innovation Center (GAIIC). As a member of the GAIIC, he helps discover the art of the possible with AWS customers using generative AI technologies. In his spare time, Francisco likes to play music and guitar, play soccer with his daughters, and enjoy time with his family.

Read More

Flag harmful content using Amazon Comprehend toxicity detection

Flag harmful content using Amazon Comprehend toxicity detection

Online communities are driving user engagement across industries like gaming, social media, ecommerce, dating, and e-learning. Members of these online communities trust platform owners to provide a safe and inclusive environment where they can freely consume content and contribute. Content moderators are often employed to review user-generated content and check that it’s safe and compliant with your terms of use. However, the ever-increasing scale, complexity, and variety of inappropriate content makes human moderation workflows unscalable and expensive. The result is poor, harmful, and non-inclusive communities that disengage users and negatively impact the community and business.

Along with user-generated content, machine-generated content has brought a fresh challenge to content moderation. Generative AI can automatically create highly realistic content at scale that may be inappropriate or harmful. The industry is facing the new challenge of automatically moderating content generated by AI to protect users from harmful material.

In this post, we introduce toxicity detection, a new feature from Amazon Comprehend that helps you automatically detect harmful content in user- or machine-generated text. This includes plain text, text extracted from images, and text transcribed from audio or video content.

Detect toxicity in text content with Amazon Comprehend

Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning (ML) to uncover valuable insights and connections in text. It offers a range of ML models that can be either pre-trained or customized through API interfaces. Amazon Comprehend now provides a straightforward, NLP-based solution for toxic content detection in text.

The Amazon Comprehend Toxicity Detection API assigns an overall toxicity score to text content, ranging from 0–1, indicating the likelihood of it being toxic. It also categorizes text into the following seven categories and provides a confidence score for each:

  • HATE_SPEECH – Speech that criticizes, insults, denounces, or dehumanizes a person or a group on the basis of an identity, be it race, ethnicity, gender identity, religion, sexual orientation, ability, national origin, or another identity group.
  • GRAPHIC – Speech that uses visually descriptive, detailed, and unpleasantly vivid imagery. Such language is often made verbose so as to amplify an insult, or discomfort or harm to the recipient.
  • HARASSMENT_OR_ABUSE – Speech that imposes disruptive power dynamics between the speaker and hearer (regardless of intent), seeks to affect the psychological well-being of the recipient, or objectifies a person.
  • SEXUAL – Speech that indicates sexual interest, activity, or arousal by using direct or indirect references to body parts, physical traits, or sex.
  • VIOLENCE_OR_THREAT – Speech that includes threats that seek to inflict pain, injury, or hostility towards a person or group.
  • INSULT – Speech that includes demeaning, humiliating, mocking, insulting, or belittling language.
  • PROFANITY – Speech that contains words, phrases, or acronyms that are impolite, vulgar, or offensive.

You can access the Toxicity Detection API by calling it directly using the AWS Command Line Interface (AWS CLI) and AWS SDKs. Toxicity detection in Amazon Comprehend is currently supported in the English language.

Use cases

Text moderation plays a crucial role in managing user-generated content across diverse formats, including social media posts, online chat messages, forum discussions, website comments, and more. Moreover, platforms that accept video and audio content can use this feature to moderate transcribed audio content.

The emergence of generative AI and large language models (LLMs) represents the latest trend in the field of AI. Consequently, there is a growing need for responsive solutions to moderate content generated by LLMs. The Amazon Comprehend Toxicity Detection API is ideally suited for addressing this need.

Amazon Comprehend Toxicity Detection API request

You can send up to 10 text segments to the Toxicity Detection API, each with a size limit of 1 KB. Every text segment in the request is handled independently. In the following example, we generate a JSON file named toxicity_api_input.json containing the text content, including three sample text segments for moderation. Note that in the example, the profane words are masked as XXXX.

{
  "TextSegments": [     
    {"Text": "and go through the door go through the door he's on the right"},
    {"Text": "he's on the right XXXXX him"},
    {"Text": "what the XXXX are you doing man that's why i didn't want to play"}
  ],
  "LanguageCode": "en"
}

You can use the AWS CLI to invoke the Toxicity Detection API using the preceding JSON file containing the text content:

aws comprehend detect-toxic-content --cli-input-json file://toxicity_api_input.json

Amazon Comprehend Toxicity Detection API response

The Toxicity Detection API response JSON output includes the toxicity analysis result in the ResultList field. ResultList lists the text segment items, and the sequence represents the order in which the text segments were received in the API request. Toxicity represents the overall confidence score of detection (between 0–1). Labels includes a list of toxicity labels with confidence scores, categorized by the type of toxicity.

The following code shows the JSON response from the Toxicity Detection API based on the request example in the previous section:

{
    "ResultList": [
        {
            "Toxicity": 0.009200000204145908,
            "Labels": [
                { "Name": "PROFANITY", "Score": 0.0007999999797903001},
                { "Name": "HATE_SPEECH", "Score": 0.0017999999690800905},
                { "Name": "INSULT", "Score": 0.003000000026077032},
                { "Name": "GRAPHIC", "Score": 0.0010000000474974513},
                { "Name": "HARASSMENT_OR_ABUSE", "Score": 0.0013000000035390258},
                { "Name": "SEXUAL", "Score": 0.0017000000225380063},
                { "Name": "VIOLENCE_OR_THREAT", "Score": 0.004999999888241291}
            ]
        },
        {
            "Toxicity": 0.7358999848365784,
            "Labels": [
                { "Name": "PROFANITY", "Score": 0.011900000274181366},
                { "Name": "HATE_SPEECH", "Score": 0.019500000402331352},
                { "Name": "INSULT", "Score": 0.0714000016450882},
                { "Name": "GRAPHIC", "Score": 0.006099999882280827},
                { "Name": "HARASSMENT_OR_ABUSE", "Score": 0.018200000748038292},
                { "Name": "SEXUAL", "Score": 0.0027000000700354576},
                { "Name": "VIOLENCE_OR_THREAT", "Score": 0.8145999908447266}
            ]
        },
        {
            "Toxicity": 0.9843000173568726,
            "Labels": [
                { "Name": "PROFANITY", "Score": 0.9369999766349792 },
                { "Name": "HATE_SPEECH", "Score": 0.30880001187324524 },
                { "Name": "INSULT", "Score": 0.42100000381469727 },
                { "Name": "GRAPHIC", "Score": 0.12630000710487366 },
                { "Name": "HARASSMENT_OR_ABUSE", "Score": 0.25519999861717224 },
                { "Name": "SEXUAL", "Score": 0.19169999659061432 },
                { "Name": "VIOLENCE_OR_THREAT", "Score": 0.19539999961853027 }
            ]
        }
    ]
}

In the preceding JSON, the first text segment is considered safe with a low toxicity score. However, the second and third text segments received toxicity scores of 73% and 98%, respectively. For the second segment, Amazon Comprehend detects a high toxicity score for VIOLENCE_OR_THREAT; for the third segment, it detects PROFANITY with a high toxicity score.

Sample request using the Python SDK

The following code snippet demonstrates how to utilize the Python SDK to invoke the Toxicity Detection API. This code receives the same JSON response as the AWS CLI command demonstrated earlier.

import boto3

# Initialize a Comprehend boto3 client object
comprehend_client = boto3.client('comprehend')

# Call comprehend Detect Toxic Content API with text segments
response = comprehend_client.detect_toxic_content(
    TextSegments=[
        {"Text":  "and go through the door go through the door he's on the right"},
        {"Text":  "he's on the right XXXXX him"},
        {"Text":  "what the XXXX are you doing man that's why i didn't want to play"}
    ],
    LanguageCode='en'
)
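
As a follow-up, you might flag segments whose overall toxicity score exceeds a threshold of your choosing and surface the highest-scoring category for each. The following is a minimal sketch; the 0.5 threshold is an arbitrary illustration, not a recommendation:

# Flag segments above a hypothetical toxicity threshold and report the top category
TOXICITY_THRESHOLD = 0.5

for i, result in enumerate(response["ResultList"]):
    if result["Toxicity"] >= TOXICITY_THRESHOLD:
        top_label = max(result["Labels"], key=lambda label: label["Score"])
        print(f"Segment {i} flagged (toxicity={result['Toxicity']:.2f}, "
              f"top category={top_label['Name']} at {top_label['Score']:.2f})")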

Summary

In this post, we provided an overview of the new Amazon Comprehend Toxicity Detection API. We also described how you can parse the API response JSON. For more information, refer to the Amazon Comprehend API documentation.

Amazon Comprehend toxicity detection is now generally available in four Regions: us-east-1, us-west-2, eu-west-1, and ap-southeast-2.

To learn more about content moderation, refer to Guidance for Content Moderation on AWS. Take the first step towards streamlining your content moderation operations with AWS.


About the Authors

Lana Zhang is a Senior Solutions Architect on the AWS WWSO AI Services team, specializing in AI and ML for content moderation, computer vision, natural language processing, and generative AI. With her expertise, she is dedicated to promoting AWS AI/ML solutions and assisting customers in transforming their business solutions across diverse industries, including social media, gaming, ecommerce, media, and advertising & marketing.

Ravisha SK is a Senior Product Manager, Technical at AWS with a focus on AI/ML. She has over 10 years of experience in data analytics and machine learning across different domains. In her spare time, she enjoys reading, experimenting in the kitchen, and exploring new coffee shops.

Read More