Streamlining ETL data processing at Talent.com with Amazon SageMaker


This post is co-authored by Anatoly Khomenko, Machine Learning Engineer, and Abdenour Bezzouh, Chief Technology Officer at Talent.com.

Established in 2011, Talent.com aggregates paid job listings from its clients and public job listings, and has created a unified, easily searchable platform. Covering over 30 million job listings across more than 75 countries and spanning various languages, industries, and distribution channels, Talent.com caters to the diverse needs of job seekers, effectively connecting millions of job seekers with job opportunities.

Talent.com’s mission is to facilitate global workforce connections. To achieve this, Talent.com aggregates job listings from various sources on the web, offering job seekers access to an extensive pool of over 30 million job opportunities tailored to their skills and experiences. In line with this mission, Talent.com collaborated with AWS to develop a cutting-edge job recommendation engine driven by deep learning, aimed at assisting users in advancing their careers.

To ensure the effective operation of this job recommendation engine, it is crucial to implement a large-scale data processing pipeline responsible for extracting and refining features from Talent.com's aggregated job listings. This pipeline can process 5 million daily records in less than 1 hour, and it can process multiple days of records in parallel. In addition, the solution can be deployed to production quickly. The pipeline's primary input is data in JSON Lines format, stored in Amazon Simple Storage Service (Amazon S3) and partitioned by date. Each day produces tens of thousands of JSON Lines files, with incremental updates arriving daily.

The primary objective of this data processing pipeline is to facilitate the creation of features necessary for training and deploying the job recommendation engine on Talent.com. The pipeline must support incremental updates and meet the complex feature extraction requirements of the training and deployment modules of the job recommendation system. Our pipeline belongs to the general ETL (extract, transform, and load) process family that combines data from multiple sources into a large, central repository.

For further insights into how Talent.com and AWS collaboratively built cutting-edge natural language processing and deep learning model training techniques, utilizing Amazon SageMaker to craft a job recommendation system, refer to From text to dream job: Building an NLP-based job recommender at Talent.com with Amazon SageMaker. The system includes feature engineering, deep learning model architecture design, hyperparameter optimization, and model evaluation, where all modules are run using Python.

This post shows how we used SageMaker to build a large-scale data processing pipeline for preparing features for the job recommendation engine at Talent.com. The resulting solution enables a Data Scientist to ideate feature extraction in a SageMaker notebook using Python libraries, such as scikit-learn or PyTorch, and then to quickly deploy the same code into the data processing pipeline performing feature extraction at scale. The solution does not require porting the feature extraction code to PySpark, as required when using AWS Glue as the ETL solution. Our solution can be developed and deployed end-to-end by a Data Scientist using only SageMaker, and does not require knowledge of other ETL solutions, such as AWS Batch. This can significantly shorten the time needed to deploy the Machine Learning (ML) pipeline to production. The pipeline is operated through Python and integrates seamlessly with feature extraction workflows, making it adaptable to a wide range of data analytics applications.

Solution overview

Overview for ETL pipeline using SageMaker Processing

The pipeline consists of three primary phases:

  1. Utilize an Amazon SageMaker Processing job to handle raw JSONL files associated with a specified day. Multiple days of data can be processed by separate Processing jobs simultaneously.
  2. Employ AWS Glue for data crawling after processing multiple days of data.
  3. Load processed features for a specified date range using SQL from an Amazon Athena table, then train and deploy the job recommender model.

Process raw JSONL files

We process raw JSONL files for a specified day using a SageMaker Processing job. The job implements feature extraction and data compaction, and saves processed features into Parquet files with 1 million records per file. We take advantage of CPU parallelization to perform feature extraction for each raw JSONL file in parallel. The processing result of each JSONL file is saved into a separate Parquet file inside a temporary directory. After all of the JSONL files have been processed, we compact the thousands of small Parquet files into several files with 1 million records each. The compacted Parquet files are then uploaded into Amazon S3 as the output of the processing job. Compaction makes crawling and SQL queries in the next stages of the pipeline more efficient, because they read a few large files instead of thousands of small ones.
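A small, stdlib-only sketch can illustrate the compaction idea. It is not the authors' implementation, which relies on pyarrow.dataset.write_dataset with max_rows_per_file in the processing script. The hypothetical helpers iter_records and compact only show how many small JSON Lines outputs can be streamed into a few large chunks while at most one chunk is held in memory.

```python
import json
from typing import Iterable, Iterator, List


def iter_records(paths: Iterable[str]) -> Iterator[dict]:
    """Yield records one at a time from many small JSON Lines files."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)


def compact(records: Iterable[dict], max_rows: int) -> Iterator[List[dict]]:
    """Group a stream of records into chunks of at most max_rows records.

    Only the current chunk is held in memory, so a full day of data never
    has to fit in RAM at once.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    chunk: List[dict] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == max_rows:
            yield chunk
            chunk = []
    if chunk:  # emit the final, partially filled chunk
        yield chunk
```

Each chunk would then be written out as one compacted file. With max_rows=1024 * 1024, a day of roughly 5 million records compacts into five files, because 5,000,000 / 1,048,576 ≈ 4.77 rounds up to 5.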

The following is the sample code to schedule a SageMaker Processing job for a specified day, for example 2020-01-01, using the SageMaker SDK. The job reads raw JSONL files from Amazon S3 (for example from s3://bucket/raw-data/2020/01/01) and saves the compacted Parquet files into Amazon S3 (for example to s3://bucket/processed/table-name/day_partition=2020-01-01/).

### install dependencies 
%pip install sagemaker pyarrow s3fs awswrangler

import sagemaker
import boto3

from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn.estimator import SKLearn
from sagemaker import get_execution_role
from sagemaker.processing import ProcessingInput, ProcessingOutput

region = boto3.session.Session().region_name
role = get_execution_role()
bucket = sagemaker.Session().default_bucket()

### we use instance with 16 CPUs and 128 GiB memory
### note that the script will NOT load the entire data into memory during compaction
### depending on the size of individual jsonl files, larger instance may be needed
instance = "ml.r5.4xlarge"
n_jobs = 8  ### we use 8 process workers
date = "2020-01-01" ### process data for one day

est_cls = SKLearn
framework_version_str = "0.20.0"

### schedule processing job
script_processor = FrameworkProcessor(
    role=role,
    instance_count=1,
    instance_type=instance,
    estimator_cls=est_cls,
    framework_version=framework_version_str,
    volume_size_in_gb=500,
)

script_processor.run(
    code="processing_script.py", ### name of the main processing script
    source_dir="../src/etl/", ### location of source code directory

    ### our processing script loads raw jsonl files directly from S3
    ### this avoids long start-up times of the processing jobs,
    ### since raw data does not need to be copied into instance
    inputs=[], ### processing job input is empty

    outputs=[
        ProcessingOutput(destination="s3://bucket/processed/table-name/",
                         source="/opt/ml/processing/output"),
    ],
    arguments=[
        ### directory with job's output
        "--output", "/opt/ml/processing/output",

        ### temporary directory inside instance
        "--tmp_output", "/opt/ml/tmp_output",

        "--n_jobs", str(n_jobs), ### number of process workers
        "--date", date, ### date to process

        ### location with raw jsonl files in S3
        "--path", "s3://bucket/raw-data/",
    ],
    wait=False
)

The following is an outline of the main script (processing_script.py) that the SageMaker Processing job runs:

import argparse
import concurrent.futures
import os
from pathlib import Path

import pyarrow.dataset as ds
import s3fs

### function to process raw jsonl file and save extracted features into parquet file
from process_data import process_jsonl


def parse_args():
    ### arguments passed through script_processor.run(arguments=[...])
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", required=True)
    parser.add_argument("--tmp_output", required=True)
    parser.add_argument("--n_jobs", type=int, default=os.cpu_count())
    parser.add_argument("--date", required=True)
    parser.add_argument("--path", required=True)
    return parser.parse_args()


def main():
    ### parse command line arguments
    args = parse_args()

    ### we use s3fs to crawl S3 input path for raw jsonl files
    fs = s3fs.S3FileSystem()
    ### we assume raw jsonl files are stored in S3 directories partitioned by date
    ### for example: s3://bucket/raw-data/2020/01/01/
    jsons = fs.find(os.path.join(args.path, *args.date.split('-')))

    ### temporary directory location inside the Processing job instance
    tmp_out = os.path.join(args.tmp_output, f"day_partition={args.date}")
    os.makedirs(tmp_out, exist_ok=True)

    ### directory location with job's output
    out_dir = os.path.join(args.output, f"day_partition={args.date}")

    ### process individual jsonl files in parallel using n_jobs process workers
    futures = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=args.n_jobs) as executor:
        for file in jsons:
            inp_file = Path(file)
            out_file = os.path.join(tmp_out, inp_file.stem + ".snappy.parquet")
            ### process_jsonl function reads raw jsonl file from S3 location (file)
            ### and saves result into parquet file (out_file) inside temporary directory
            futures.append(executor.submit(process_jsonl, file, out_file))

        ### wait until all jsonl files are processed; result() re-raises worker errors
        for future in concurrent.futures.as_completed(futures):
            future.result()

    ### compact parquet files
    dataset = ds.dataset(tmp_out, format="parquet")

    if len(dataset.schema) > 0:
        ### save compacted parquet files with 1MM (1024 * 1024) records per file
        ds.write_dataset(dataset, out_dir, format="parquet",
                         max_rows_per_file=1024 * 1024)


if __name__ == "__main__":
    main()

Scalability is a key feature of our pipeline. First, multiple SageMaker Processing jobs can be used to process data for several days simultaneously. Second, we avoid loading the entire processed or raw data into memory at once, while processing each specified day of data. This enables the processing of data using instance types that can’t accommodate a full day’s worth of data in primary memory. The only requirement is that the instance type should be capable of loading N raw JSONL or processed Parquet files into memory simultaneously, with N being the number of process workers in use.
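To process several days at once, you launch one job per date. The sketch below assumes the script_processor object and argument names from the scheduling code above. It generates the dates in a range and the matching per-job arguments. The helper names day_range and job_arguments are hypothetical. The commented loop shows how each call could be issued with wait=False so the jobs run concurrently, subject to your account's SageMaker Processing instance quotas.

```python
from datetime import date, timedelta
from typing import Iterator, List


def day_range(start: str, end: str) -> Iterator[str]:
    """Yield ISO dates (YYYY-MM-DD) from start to end, inclusive."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current <= last:
        yield current.isoformat()
        current += timedelta(days=1)


def job_arguments(day: str, n_jobs: int = 8,
                  raw_path: str = "s3://bucket/raw-data/") -> List[str]:
    """Build the command-line arguments for one day's Processing job."""
    return [
        "--output", "/opt/ml/processing/output",
        "--tmp_output", "/opt/ml/tmp_output",
        "--n_jobs", str(n_jobs),
        "--date", day,
        "--path", raw_path,
    ]


# Hypothetical usage with script_processor and the ProcessingOutput list
# (outputs) from the scheduling code; wait=False returns immediately:
# for day in day_range("2020-01-01", "2020-01-07"):
#     script_processor.run(code="processing_script.py", source_dir="../src/etl/",
#                          inputs=[], outputs=outputs,
#                          arguments=job_arguments(day), wait=False)
```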

Crawl processed data using AWS Glue

After all the raw data for multiple days has been processed, we can create an Athena table from the entire dataset. Instead of configuring a separate AWS Glue crawler, we use the AWS SDK for pandas (awswrangler) library. It crawls the Parquet files and registers the table in the AWS Glue Data Catalog using the following snippet:

import awswrangler as wr

### crawl processed data in S3
res = wr.s3.store_parquet_metadata(
    path='s3://bucket/processed/table-name/',
    database="database_name",
    table="table_name",
    dataset=True,
    mode="overwrite",
    sampling=1.0,
    path_suffix='.parquet',
)

### print table schema
print(res[0])
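The Processing job writes each day under a key=value prefix such as day_partition=2020-01-01, so awswrangler (with dataset=True) can expose day_partition as a partition column. That is the column the WHERE clause in the next section filters on. The sketch below only illustrates this Hive-style naming convention with a hypothetical partition_values helper; it is not how awswrangler is implemented.

```python
from typing import Dict


def partition_values(key: str, root: str) -> Dict[str, str]:
    """Extract Hive-style partition columns (name=value) from an object key under root."""
    relative = key[len(root):] if key.startswith(root) else key
    values: Dict[str, str] = {}
    # every path segment except the last one (the file name) may be a partition
    for part in relative.strip("/").split("/")[:-1]:
        if "=" in part:
            name, value = part.split("=", 1)
            values[name] = value
    return values
```

For example, an object such as s3://bucket/processed/table-name/day_partition=2020-01-01/part-0.parquet maps to {"day_partition": "2020-01-01"}.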

Load processed features for training

Processed features for a specified date range can now be loaded from the Athena table using SQL, and these features can then be used for training the job recommender model. For example, the following snippet loads one month of processed features into a DataFrame using the awswrangler library:

import awswrangler as wr

query = """
    SELECT * 
    FROM table_name
    WHERE day_partition BETWEEN '2020-01-01' AND '2020-01-31'
"""

### load 1 month of data from database_name.table_name into a DataFrame
df = wr.athena.read_sql_query(query, database='database_name')
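BETWEEN includes both bounds, so using the first day of the following month as the upper bound would include an extra day. A hypothetical month_query helper that derives the last day of the month keeps the range at exactly one calendar month. It assumes two things: day_partition holds ISO-formatted date strings, so string comparison matches date order, and the table name comes from trusted code rather than user input.

```python
import calendar
from datetime import date


def month_query(table: str, year: int, month: int) -> str:
    """Build a query selecting one calendar month of day partitions.

    BETWEEN is inclusive on both ends, so the upper bound is the last day
    of the month, computed with calendar.monthrange (handles leap years).
    """
    first = date(year, month, 1)
    last = date(year, month, calendar.monthrange(year, month)[1])
    return (
        f"SELECT * FROM {table} "
        f"WHERE day_partition BETWEEN '{first.isoformat()}' AND '{last.isoformat()}'"
    )
```

For example, the string returned by month_query("table_name", 2020, 1) could be passed to wr.athena.read_sql_query together with database='database_name'.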

Additionally, the use of SQL for loading processed features for training can be extended to accommodate various other use cases. For instance, we can apply a similar pipeline to maintain two separate Athena tables: one for storing user impressions and another for storing user clicks on these impressions. Using SQL join statements, we can retrieve impressions that users either clicked on or didn’t click on and then pass these impressions to a model training job.
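You can try the impressions and clicks join locally without Athena. The sketch below runs a LEFT JOIN against an in-memory SQLite database using Python's built-in sqlite3 module. The table and column names are hypothetical, and on Athena the same query string would be passed to wr.athena.read_sql_query instead. Impressions without a matching click get clicked = 0, which yields both positive and negative training examples.

```python
import sqlite3

# Hypothetical schemas: impressions(impression_id, user_id, job_id) and
# clicks(impression_id). Real table and column names will differ.
QUERY = """
    SELECT i.impression_id, i.user_id, i.job_id,
           CASE WHEN c.impression_id IS NULL THEN 0 ELSE 1 END AS clicked
    FROM impressions AS i
    LEFT JOIN clicks AS c ON i.impression_id = c.impression_id
    ORDER BY i.impression_id
"""


def labeled_impressions(conn: sqlite3.Connection) -> list:
    """Return every impression labeled 1 if it was clicked, else 0."""
    return conn.execute(QUERY).fetchall()


def demo() -> list:
    """Build two toy tables in memory and run the join on them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE impressions (impression_id INTEGER, user_id TEXT, job_id TEXT)")
    conn.execute("CREATE TABLE clicks (impression_id INTEGER)")
    conn.executemany("INSERT INTO impressions VALUES (?, ?, ?)",
                     [(1, "u1", "j1"), (2, "u1", "j2"), (3, "u2", "j1")])
    conn.executemany("INSERT INTO clicks VALUES (?)", [(2,)])
    rows = labeled_impressions(conn)
    conn.close()
    return rows
```

If a user can click the same impression more than once, deduplicating the clicks table (for example, with SELECT DISTINCT in a subquery) would keep the join from duplicating impressions.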

Solution benefits

Implementing the proposed solution brings several advantages to our existing workflow, including:

  • Simplified implementation – The solution enables feature extraction to be implemented in Python using popular ML libraries, without porting the code to PySpark. This streamlines feature extraction, because this pipeline executes the same code that a Data Scientist develops in a notebook.
  • Quick path-to-production – The solution can be developed and deployed by a Data Scientist to perform feature extraction at scale, enabling them to develop an ML recommender model against this data. At the same time, an ML Engineer can deploy the same solution to production with few modifications.
  • Reusability – The solution provides a reusable pattern for feature extraction at scale, and can be easily adapted for other use cases beyond building recommender models.
  • Efficiency – The solution offers good performance: processing a single day of Talent.com's data took less than 1 hour.
  • Incremental updates – The solution also supports incremental updates. New daily data can be processed with a SageMaker Processing job, and the S3 location containing the processed data can be recrawled to update the Athena table. We can also use a cron job to update today’s data several times per day (for example, every 3 hours).
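A scheduled refresh only needs to know which day to rebuild. One option for scheduling is an Amazon EventBridge rule with the schedule expression rate(3 hours). The hypothetical partition_to_refresh helper below computes the current UTC day and the matching output prefix. It assumes partitions are keyed by UTC date and uses the output root from earlier in this post. Rerunning the Processing job with that date and then recrawling the processed data updates the Athena table.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def partition_to_refresh(now: Optional[datetime] = None,
                         output_root: str = "s3://bucket/processed/table-name/") -> Dict[str, str]:
    """Return the UTC day a scheduled run should rebuild and its S3 output prefix."""
    now = now or datetime.now(timezone.utc)
    # normalize to UTC so runs near midnight pick a consistent partition
    day = now.astimezone(timezone.utc).date().isoformat()
    return {"date": day, "output_prefix": f"{output_root}day_partition={day}/"}
```

The returned date can be passed as the --date argument of the Processing job. A rerun uploads files into the same prefix without removing existing ones, so clearing the day's prefix first is a reasonable precaution if a rerun could produce fewer files.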

We used this ETL pipeline to help Talent.com process 50,000 files per day containing 5 million records, and created training data using features extracted from 90 days of raw data from Talent.com, a total of 450 million records across 900,000 files. Our pipeline helped Talent.com build and deploy the recommendation system into production within only 2 weeks. The solution performed all ML processes including ETL on Amazon SageMaker without utilizing other AWS services. The job recommendation system drove an 8.6% increase in clickthrough rate in online A/B testing against a previous XGBoost-based solution, helping connect millions of Talent.com's users to better jobs.

Conclusion

This post outlines the ETL pipeline we developed for feature processing for training and deploying a job recommender model at Talent.com. Our pipeline uses SageMaker Processing jobs for efficient data processing and feature extraction at a large scale. Feature extraction code is implemented in Python, enabling the use of popular ML libraries to perform feature extraction at scale without the need to port the code to PySpark.

We encourage readers to explore using the pipeline presented in this post as a template for use cases where feature extraction at scale is required. A Data Scientist can use the pipeline to build an ML model, and an ML Engineer can then adopt the same pipeline to run in production. This can significantly reduce the time needed to productize the ML solution end-to-end, as was the case with Talent.com. Readers can refer to the tutorial for setting up and running SageMaker Processing jobs. We also recommend the post From text to dream job: Building an NLP-based job recommender at Talent.com with Amazon SageMaker, where we discuss deep learning model training techniques using Amazon SageMaker to build Talent.com's job recommendation system.


About the authors

Dmitriy Bespalov is a Senior Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

Yi Xiang is an Applied Scientist II at the Amazon Machine Learning Solutions Lab, where she helps AWS customers across different industries accelerate their AI and cloud adoption.

Tong Wang is a Senior Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

Anatoly Khomenko is a Senior Machine Learning Engineer at Talent.com with a passion for using natural language processing to match good people to good jobs.

Abdenour Bezzouh is an executive with more than 25 years of experience building and delivering technology solutions that scale to millions of customers. Abdenour held the position of Chief Technology Officer (CTO) at Talent.com when the AWS team designed and executed this particular solution for Talent.com.

Yanjun Qi is a Senior Applied Science Manager at the Amazon Machine Learning Solutions Lab. She innovates and applies machine learning to help AWS customers speed up their AI and cloud adoption.


Create summaries of recordings using generative AI with Amazon Bedrock and Amazon Transcribe


Meeting notes are a crucial part of collaboration, yet they often fall through the cracks. Between leading discussions, listening closely, and typing notes, it’s easy for key information to slip away unrecorded. Even when notes are captured, they can be disorganized or illegible, rendering them useless.

In this post, we explore how to use Amazon Transcribe and Amazon Bedrock to automatically generate clean, concise summaries of video or audio recordings. Whether it’s an internal team meeting, conference session, or earnings call, this approach can help you distill hours of content down to salient points.

We walk through a solution to transcribe a project team meeting and summarize the key takeaways with Amazon Bedrock. We also discuss how you can customize this solution for other common scenarios like course lectures, interviews, and sales calls. Read on to simplify and automate your note-taking process.

Solution overview

By combining Amazon Transcribe and Amazon Bedrock, you can save time, capture insights, and enhance collaboration. Amazon Transcribe is an automatic speech recognition (ASR) service that makes it straightforward to add speech-to-text capability to applications. It uses advanced deep learning technologies to accurately transcribe audio into text. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API, along with a broad set of capabilities you need to build generative AI applications. With Amazon Bedrock, you can easily experiment with a variety of top FMs, and privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG).

The solution presented in this post is orchestrated using an AWS Step Functions state machine that is triggered when you upload a recording to the designated Amazon Simple Storage Service (Amazon S3) bucket. Step Functions lets you create serverless workflows to orchestrate and connect components across AWS services. It handles the underlying complexity so you can focus on application logic. It’s useful for coordinating tasks, distributed processing, ETL (extract, transform, and load), and business process automation.

The following diagram illustrates the high-level solution architecture.

The solution workflow includes the following steps:

  1. A user stores a recording in the S3 asset bucket.
  2. This action triggers the Step Functions transcription and summarization state machine.
  3. As part of the state machine, an AWS Lambda function is triggered, which transcribes the recording using Amazon Transcribe and stores the transcription in the asset bucket.
  4. A second Lambda function retrieves the transcription and generates a summary using the Anthropic Claude model in Amazon Bedrock.
  5. Lastly, a final Lambda function uses Amazon Simple Notification Service (Amazon SNS) to send a summary of the recording to the recipient.
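Step 3 could be implemented in many ways. The following sketch shows one possible way for the first Lambda function to build a request for Amazon Transcribe's StartTranscriptionJob API from the uploaded object's bucket and key. The helper name transcription_request, the fixed en-US language code, and the transcripts/ output key are assumptions for illustration, not the code in the GitHub repo.

```python
import os
import re

# Media formats listed as valid for this solution.
SUPPORTED_FORMATS = {"mp3", "mp4", "wav", "flac", "amr", "ogg", "webm"}


def transcription_request(bucket: str, key: str, language_code: str = "en-US") -> dict:
    """Build keyword arguments for Transcribe's start_transcription_job."""
    base, ext = os.path.splitext(os.path.basename(key))
    media_format = ext.lstrip(".").lower()
    if media_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported media format: {ext!r}")
    # Job names may contain only letters, digits, '.', '_' and '-' (max 200 chars).
    job_name = re.sub(r"[^0-9a-zA-Z._-]", "-", base)[:200]
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "MediaFormat": media_format,
        "LanguageCode": language_code,
        "OutputBucketName": bucket,
        "OutputKey": f"transcripts/{job_name}.json",
    }


# Hypothetical use inside the Lambda handler:
# import boto3
# boto3.client("transcribe").start_transcription_job(**transcription_request(bucket, key))
```

Transcription job names must be unique within your AWS account, so uploading two recordings with the same file name would require a unique suffix, such as a timestamp, appended to the job name.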

This solution is supported in Regions where Anthropic Claude on Amazon Bedrock is available.

The state machine orchestrates the steps to perform the specific tasks. The following diagram illustrates the detailed process.

Prerequisites

Amazon Bedrock users need to request access to models before they are available for use. This is a one-time action. For this solution, you’ll need to enable access to the Anthropic Claude (not Anthropic Claude Instant) model in Amazon Bedrock. For more information, refer to Model access.

Deploy solution resources

The solution is deployed using an AWS CloudFormation template, found on the GitHub repo, to automatically provision the necessary resources in your AWS account. The template requires the following parameters:

  • Email address used to send summary – The summary will be sent to this address. You must acknowledge the initial Amazon SNS confirmation email before receiving additional notifications.
  • Summary instructions – These are the instructions given to the Amazon Bedrock model to generate the summary.

Run the solution

After you deploy the solution using AWS CloudFormation, complete the following steps:

  1. Acknowledge the Amazon SNS email confirmation that you should receive a few moments after creating the CloudFormation stack.
  2. On the AWS CloudFormation console, navigate to the stack you just created.
  3. On the stack's Outputs tab, look for the value associated with AssetBucketName; it will look something like summary-generator-assetbucket-xxxxxxxxxxxxx.
  4. On the Amazon S3 console, navigate to your asset bucket.

This is where you’ll upload your recordings. Valid file formats are MP3, MP4, WAV, FLAC, AMR, OGG, and WebM.

  5. Upload your recording to the recordings folder.

Uploading recordings will automatically trigger the Step Functions state machine. For this example, we use a sample team meeting recording in the sample-recording directory of the GitHub repository.

  6. On the Step Functions console, navigate to the summary-generator state machine.
  7. Choose the name of the state machine run with the status Running.

Here, you can watch the progress of the state machine as it processes the recording.

  8. After it reaches its Success state, you should receive an emailed summary of the recording.

Alternatively, you can navigate to the S3 asset bucket and view the transcript in the transcripts folder.

Review the summary

You will get the recording summary emailed to the address you provided when you created the CloudFormation stack. If you don’t receive the email in a few moments, make sure that you acknowledged the Amazon SNS confirmation email that you should have received after you created the stack and then upload the recording again, which will trigger the summary process.

This solution includes a mock team meeting recording that you can use to test the solution. The summary will look similar to the following example. Because of the nature of generative AI, however, your output will look a bit different, but the content should be close.

Here are the key points from the standup:

  • Joe finished reviewing the current state for task EDU1 and created a new task to develop the future state. That new task is in the backlog to be prioritized. He’s now starting EDU2 but is blocked on resource selection.
  • Rob created a tagging strategy for SLG1 based on best practices, but may need to coordinate with other teams who have created their own strategies, to align on a uniform approach. A new task was created to coordinate tagging strategies.
  • Rob has made progress debugging for SLG2 but may need additional help. This task will be moved to Sprint 2 to allow time to get extra resources.

Next Steps:

  • Joe to continue working on EDU2 as able until resource selection is decided
  • New task to be prioritized to coordinate tagging strategies across teams
  • SLG2 moved to Sprint 2
  • Standups moving to Mondays starting next week

Expand the solution

Now that you have a working solution, here are some potential ideas to customize the solution for your specific use cases:

  • Try altering the process to fit your available source content and desired outputs:
    • For situations where transcripts are available, create an alternate Step Functions workflow to ingest existing text-based or PDF-based transcriptions.
    • Instead of using Amazon SNS to notify recipients via email, you can use it to send the output to a different endpoint, such as a team collaboration site, or to the team’s chat channel.
  • Try changing the summary instructions CloudFormation stack parameter provided to Amazon Bedrock to produce outputs specific to your use case (this is the generative AI prompt):
    • When summarizing a company’s earnings call, you could have the model focus on potential promising opportunities, areas of concern, and things that you should continue to monitor.
    • If you are using this to summarize a course lecture, the model could identify upcoming assignments, summarize key concepts, list facts, and filter out any small talk from the recording.
  • For the same recording, create different summaries for different audiences:
    • Engineers’ summaries focus on design decisions, technical challenges, and upcoming deliverables.
    • Project managers’ summaries focus on timelines, costs, deliverables, and action items.
    • Project sponsors get a brief update on project status and escalations.
  • For longer recordings, try generating summaries for different levels of interest and time commitment. For example, create a single sentence, single paragraph, single page, or in-depth summary. In addition to the prompt, you may want to adjust the max_tokens_to_sample parameter to accommodate different content lengths.
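The audience-specific and length-specific variants above come down to changing the prompt and max_tokens_to_sample. The sketch below builds an InvokeModel request in the Anthropic Claude Text Completions format, which is where max_tokens_to_sample applies. The claude_request helper, the anthropic.claude-v2 model ID, the <transcript> tags, and the temperature value are assumptions, not the solution's actual code.

```python
import json


def claude_request(transcript: str, instructions: str,
                   max_tokens_to_sample: int = 1000,
                   model_id: str = "anthropic.claude-v2") -> dict:
    """Build keyword arguments for bedrock-runtime invoke_model (Claude Text Completions)."""
    # The Text Completions format expects a "\n\nHuman:" turn followed by
    # an open "\n\nAssistant:" turn for the model to complete.
    prompt = (
        f"\n\nHuman: {instructions}\n\n"
        f"<transcript>\n{transcript}\n</transcript>"
        "\n\nAssistant:"
    )
    body = {
        "prompt": prompt,
        "max_tokens_to_sample": max_tokens_to_sample,  # caps summary length
        "temperature": 0.5,  # assumed value; lower is more deterministic
    }
    return {
        "modelId": model_id,
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps(body),
    }


# Hypothetical use inside the summarization Lambda function:
# import boto3
# runtime = boto3.client("bedrock-runtime")
# response = runtime.invoke_model(**claude_request(transcript, instructions, 300))
# summary = json.loads(response["body"].read())["completion"]
```

A one-paragraph executive summary and an in-depth engineering summary of the same recording would differ only in the instructions string and the max_tokens_to_sample value.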

Clean up

To clean up the solution, delete the CloudFormation stack that you created earlier. Note that deleting the stack will not delete the asset bucket. If you no longer need the recordings or transcripts, you can delete this bucket separately. Amazon Transcribe will automatically delete transcription jobs after 90 days, but you can delete these manually before then.

Conclusion

In this post, we explored how to use Amazon Transcribe and Amazon Bedrock to automatically generate clean, concise summaries of video or audio recordings. We encourage you to continue evaluating Amazon Bedrock, Amazon Transcribe, and other AWS AI services, like Amazon Textract, Amazon Translate, and Amazon Rekognition, to see how they can help meet your business objectives.


About the Authors

Rob Barnes is a principal consultant for AWS Professional Services. He works with our customers to address security and compliance requirements at scale in complex, multi-account AWS environments through automation.

Jason Stehle is a Senior Solutions Architect at AWS, based in the New England area. He works with customers to align AWS capabilities with their greatest business challenges. Outside of work, he spends his time building things and watching comic book movies with his family.


Fine-tune Llama 2 using QLoRA and Deploy it on Amazon SageMaker with AWS Inferentia2


In this post, we showcase fine-tuning a Llama 2 model using a Parameter-Efficient Fine-Tuning (PEFT) method and deploying the fine-tuned model on AWS Inferentia2. We use the AWS Neuron software development kit (SDK) to access the AWS Inferentia2 device and benefit from its high performance. We then use a large model inference container powered by Deep Java Library (DJLServing) as our model serving solution.

Solution overview

Efficient fine-tuning of Llama 2 using QLoRA

The Llama 2 family of large language models (LLMs) is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. AWS customers sometimes choose to fine-tune Llama 2 models on their own data to achieve better performance for downstream tasks. However, because of the large number of parameters in Llama 2 models, full fine-tuning can be prohibitively expensive and time consuming. The Parameter-Efficient Fine-Tuning (PEFT) approach addresses this problem by fine-tuning only a small number of extra model parameters while freezing most parameters of the pre-trained model. For more information on PEFT, refer to this post. In this post, we use QLoRA to fine-tune a Llama 2 7B model.
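A back-of-the-envelope calculation makes the savings concrete. It assumes Llama 2 7B's shapes (hidden size 4096, 32 decoder layers, roughly 6.74 billion parameters) and LoRA adapters of rank 64 on the q_proj and v_proj attention projections, and it ignores the small overhead of quantization constants. The rank and target modules are illustrative choices, not necessarily those used later in this post.

```python
def lora_params(hidden: int, layers: int, rank: int, matrices_per_layer: int) -> int:
    """Count trainable LoRA parameters for square hidden x hidden projections.

    Each adapted weight W (d_out x d_in) gets low-rank factors B (d_out x r)
    and A (r x d_in), adding r * (d_in + d_out) trainable parameters.
    """
    per_matrix = rank * (hidden + hidden)
    return per_matrix * matrices_per_layer * layers


BASE_PARAMS = 6.74e9  # approximate parameter count of Llama 2 7B

# Assumed setup: rank 64 adapters on q_proj and v_proj in all 32 layers.
trainable = lora_params(hidden=4096, layers=32, rank=64, matrices_per_layer=2)
fraction = trainable / BASE_PARAMS

# Weight memory of the frozen base model: 2 bytes per parameter in fp16,
# 0.5 bytes per parameter in 4-bit NF4 (quantization constants ignored).
base_fp16_gb = BASE_PARAMS * 2 / 1e9
base_4bit_gb = BASE_PARAMS * 0.5 / 1e9

print(f"{trainable:,} trainable parameters ({fraction:.2%} of the base model)")
print(f"base weights: {base_fp16_gb:.1f} GB in fp16 vs {base_4bit_gb:.1f} GB in 4-bit")
```

Under these assumptions, about 0.5% of the parameters are trained. Storing the frozen base weights in 4-bit precision cuts their footprint from roughly 13.5 GB in fp16 to roughly 3.4 GB, which is what lets QLoRA fit on a single GPU.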

Deploy a fine-tuned model on Inf2 using Amazon SageMaker

AWS Inferentia2 is a purpose-built machine learning (ML) accelerator designed for inference workloads. It delivers high performance at up to 40% lower cost for generative AI and LLM workloads than other inference-optimized instances on AWS. In this post, we use an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance, which features AWS Inferentia2 chips, the second-generation Inferentia accelerators, each containing two NeuronCores-v2. Each NeuronCore-v2 is an independent, heterogeneous compute unit with four main engines: Tensor, Vector, Scalar, and GPSIMD. It includes on-chip, software-managed SRAM for maximizing data locality. Because several posts on Inf2 have already been published, refer to this post and our documentation for more information on Inf2.

To deploy models on Inf2, we need the AWS Neuron SDK as the software layer running on top of the Inf2 hardware. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances. It supports the end-to-end ML development lifecycle: building new models, training and optimizing them, and deploying them to production. AWS Neuron includes a deep learning compiler, runtime, and tools that are natively integrated with popular frameworks like TensorFlow and PyTorch. In this post, we use transformers-neuronx, which is part of the AWS Neuron SDK for transformer decoder inference workflows. It supports a range of popular models, including Llama 2.

To deploy models on Amazon SageMaker, we usually use a container that includes the required libraries, such as the Neuron SDK and transformers-neuronx, as well as the model serving component. Amazon SageMaker maintains deep learning containers (DLCs) with popular open source libraries for hosting large models. In this post, we use the Large Model Inference (LMI) Container for Neuron. This container has everything you need to deploy your Llama 2 model on Inf2. For resources to get started with LMI on Amazon SageMaker, refer to our existing posts (blog 1, blog 2, blog 3) on this topic. In short, you can run the container without writing any additional code. Use the default handler and pass in one of the supported model names and any load-time configurable parameters, and the container compiles and serves the LLM on an Inf2 instance. For example, to deploy meta-llama/Llama-2-7b-chat-hf, you can provide the following configuration as a serving.properties file. In serving.properties, we specify the model ID, a batch size of 4, and a tensor parallel degree of 2, along with a few optional settings. For the full list of configurable parameters, refer to All DJL configuration options.

# Engine to use: Python, MXNet, PyTorch, TensorFlow, ONNX, PaddlePaddle, DeepSpeed, etc.
engine = Python
# Default handler for model serving
option.entryPoint = djl_python.transformers_neuronx
# The Hugging Face ID of a model or the S3 URL of the model artifacts
option.model_id = meta-llama/Llama-2-7b-chat-hf
# The dynamic batch size (default is 1)
option.batch_size=4
# Number of tensor parallel partitions of the model
option.tensor_parallel_degree=2
# The input sequence length
option.n_positions=512
# Enable iteration-level batching using one of "auto", "scheduler", "lmi-dist"
option.rolling_batch=auto
# The data type to cast the model weights to
option.dtype=fp16
# Timeout, in seconds, for the worker to load the model
option.model_loading_timeout=1500
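You can also generate the same configuration from Python and package it for SageMaker. This sketch assumes the common LMI pattern: a directory containing serving.properties, archived as a .tar.gz file and uploaded to S3 as the model data. The helper names are hypothetical, and the options simply mirror the file above.

```python
import os
import tarfile
from typing import Dict

# Options mirroring the serving.properties example above.
OPTIONS = {
    "engine": "Python",
    "option.entryPoint": "djl_python.transformers_neuronx",
    "option.model_id": "meta-llama/Llama-2-7b-chat-hf",
    "option.batch_size": "4",
    "option.tensor_parallel_degree": "2",
    "option.n_positions": "512",
    "option.rolling_batch": "auto",
    "option.dtype": "fp16",
    "option.model_loading_timeout": "1500",
}


def render_serving_properties(options: Dict[str, str]) -> str:
    """Render options as key=value lines in the serving.properties format."""
    return "".join(f"{key}={value}\n" for key, value in options.items())


def write_model_dir(options: Dict[str, str], model_dir: str) -> str:
    """Write serving.properties into model_dir and return the file path."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "serving.properties")
    with open(path, "w", encoding="utf-8") as f:
        f.write(render_serving_properties(options))
    return path


def package_model_dir(model_dir: str, archive_path: str) -> str:
    """Archive model_dir as a gzipped tarball, keeping the directory name."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=os.path.basename(os.path.normpath(model_dir)))
    return archive_path
```

For example, write_model_dir(OPTIONS, "mymodel") followed by package_model_dir("mymodel", "mymodel.tar.gz") produces an archive. You could then upload it to S3 and reference it as the model data of a SageMaker model that uses the LMI Neuron container image.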

Alternatively, you can write your own model handler file as shown in this example, but that requires implementing the model loading and inference methods that bridge the DJLServing APIs and your model.

Prerequisites

The following list outlines the prerequisites for deploying the model described in this blog post. You can complete the steps either from the AWS Management Console or with the latest version of the AWS Command Line Interface (AWS CLI).

Walkthrough

In the following sections, we walk through the code in two parts:

  1. Fine-tune a Llama2-7b model and upload the model artifacts to a specified Amazon S3 bucket location.
  2. Deploy the model to an Inferentia2 instance using the DJL serving container hosted in Amazon SageMaker.

The complete code samples with instructions can be found in this GitHub repository.

Part 1: Fine-tune a Llama2-7b model using PEFT

We use QLoRA, a recently introduced method described in the paper QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers et al. QLoRA reduces the memory footprint of large language models during fine-tuning by freezing a 4-bit quantized base model and training small low-rank adapters on top of it, while largely preserving the performance of full 16-bit fine-tuning.
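A back-of-the-envelope calculation shows why 4-bit loading matters here. The sketch below estimates weight memory only, assuming a nominal 7 billion parameters; actual usage is higher because of activations, the LoRA adapter, optimizer state, and quantization constants.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold model weights, in GB (1e9 bytes).

    Weights only: activations, adapters, optimizer state, and
    quantization metadata are not included.
    """
    return num_params * bits_per_param / 8 / 1e9


llama2_7b_params = 7e9  # nominal parameter count
fp16_gb = weight_memory_gb(llama2_7b_params, 16)
nf4_gb = weight_memory_gb(llama2_7b_params, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Roughly 14 GB of fp16 weights shrink to about 3.5 GB, which leaves headroom on a single-GPU instance such as the ml.g5.2xlarge used in this post (one NVIDIA A10G GPU with 24 GB of memory).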

Note: The fine-tuning of the Llama2-7b model shown in the following sections was tested on an Amazon SageMaker Studio notebook with the PyTorch 2.0 GPU Optimized kernel on an ml.g5.2xlarge instance. As a best practice, we recommend using an Amazon SageMaker Studio Integrated Development Environment (IDE) launched in your own Amazon Virtual Private Cloud (Amazon VPC). This allows you to control, monitor, and inspect network traffic within and outside your VPC using standard AWS networking and security capabilities. For more information, see Securing Amazon SageMaker Studio connectivity using a private VPC.

Quantize the base model

We first load the base model with 4-bit quantization using the Hugging Face Transformers library, as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# The base pretrained model for fine-tuning
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Activate 4-bit precision base model loading
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

# Place the whole model on GPU 0 (assumes a single-GPU instance such as ml.g5.2xlarge)
device_map = {"": 0}

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

Load training dataset

Next, we load the dataset that feeds the model during the fine-tuning step, as follows:

from datasets import load_dataset

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

Attach an adapter layer

Here we attach small, trainable adapter layers, configured with LoraConfig from Hugging Face’s PEFT library.

from peft import LoraConfig

# Linear layers to apply LoRA to
# (find_all_linear_names is a helper that isn't shown in this post)
modules = find_all_linear_names(model)

## Setting up LoRA configuration
# LoRA rank (inner dimension of the low-rank update matrices)
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)
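The snippet calls find_all_linear_names, a helper that isn't shown in this post. One common way to write it, sketched below as an assumption (the version in the sample repository may differ), is to collect the names of all bitsandbytes Linear4bit modules and keep only their last component, because LoraConfig's target_modules matches module names by suffix. The output head, lm_head, is left out, as is customary in QLoRA setups. The name handling is split into a pure function so it can be checked without a GPU.

```python
from typing import Iterable, List


def lora_target_names(linear_module_names: Iterable[str]) -> List[str]:
    """Reduce fully qualified module names to the leaf names LoRA targets.

    For example, "model.layers.0.self_attn.q_proj" becomes "q_proj".
    The output head ("lm_head") is excluded.
    """
    leaf_names = set()
    for qualified_name in linear_module_names:
        leaf_names.add(qualified_name.split(".")[-1])
    leaf_names.discard("lm_head")
    return sorted(leaf_names)


def find_all_linear_names(model) -> List[str]:
    """Hypothetical helper: leaf names of all 4-bit linear layers in `model`."""
    # Imported lazily so lora_target_names has no GPU or bitsandbytes dependency.
    import bitsandbytes as bnb

    names = [
        name
        for name, module in model.named_modules()
        if isinstance(module, bnb.nn.Linear4bit)
    ]
    return lora_target_names(names)
```

For Llama 2, this typically yields the attention projections (q_proj, k_proj, v_proj, o_proj) and the MLP projections (gate_proj, up_proj, down_proj); define the helper before the cell that builds LoraConfig.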

Train a model

Using the preceding LoRA configuration and a set of training hyperparameters, we fine-tune the Llama2 model. The following snippet shows the training code:

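from transformers import TrainingArguments
from trl import SFTTrainer

# Assumed values for illustration; adjust them to suit your data
max_seq_length = 512
packing = False
new_model = "llama-2-7b-qlora-adapter"  # hypothetical output directory for the adapter

# Set training parameters
training_arguments = TrainingArguments(...)

# These keyword arguments match older trl releases; newer releases move
# options such as dataset_text_field and max_seq_length into SFTConfig
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,  # LoRA config
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save the trained LoRA adapter weights
trainer.model.save_pretrained(new_model)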

Merge model weight

The training step above saves only the trained LoRA adapter weights, not a full model. In the following code snippet, we merge the adapter into the base model so that we can use the fine-tuned model for inference.

from peft import PeftModel

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

save_dir = "merged_model"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(save_dir)
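merge_and_unload folds each adapter back into its base weight matrix. In the standard LoRA formulation, which PEFT uses by default, an adapted weight becomes W + (lora_alpha / r) · BA, so with lora_alpha = 16 and r = 64 from the configuration above, the adapter update is scaled by 0.25. The toy example below illustrates only this arithmetic, in plain Python with tiny matrices; PEFT performs the equivalent operation on tensors. The toy adapter has rank 1 purely to keep the numbers small; only the scaling value comes from the configuration above.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scaling factor applied to the adapter update in standard LoRA."""
    return lora_alpha / r


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, scaling: float) -> Matrix:
    """Return W + scaling * (B @ A), where B is d x r and A is r x k."""
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scaling * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


scaling = lora_scaling(16, 64)  # lora_alpha and lora_r from the LoRA config
w = [[1.0, 0.0], [0.0, 1.0]]    # toy 2 x 2 base weight
lora_b = [[2.0], [4.0]]         # toy 2 x 1 matrix (d x r)
lora_a = [[1.0, 1.0]]           # toy 1 x 2 matrix (r x k)
merged = merge_lora(w, lora_a, lora_b, scaling)
```

Because the merged matrix has the same shape as W, the saved model has the same architecture as the base model and needs no PEFT code at serving time.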

Upload model weight to Amazon S3

In the final step of part 1, we save the merged model weights to a specified Amazon S3 location. A model serving container in Amazon SageMaker will use these weights to host the model on an Inferentia2 instance.

model_data_s3_location = "s3://<bucket_name>/<prefix>/"
!cd {save_dir} && aws s3 cp --recursive . {model_data_s3_location}
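If you would rather not shell out to the AWS CLI from the notebook, the same recursive copy can be done with boto3. The following is a minimal sketch, assuming the notebook's credentials can write to the bucket; iter_upload_pairs and upload_dir are hypothetical helpers, and the bucket and prefix are the same placeholders as above.

```python
import os
from typing import Iterator, Tuple


def iter_upload_pairs(local_dir: str, prefix: str) -> Iterator[Tuple[str, str]]:
    """Yield (local_path, s3_key) pairs that mirror local_dir under prefix."""
    prefix = prefix.strip("/")
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            key = f"{prefix}/{relative}" if prefix else relative
            yield local_path, key


def upload_dir(local_dir: str, bucket: str, prefix: str) -> None:
    """Upload every file under local_dir to s3://bucket/prefix/ with boto3."""
    import boto3  # imported here so iter_upload_pairs works without boto3

    s3 = boto3.client("s3")
    for local_path, key in iter_upload_pairs(local_dir, prefix):
        s3.upload_file(local_path, bucket, key)
```

For example, upload_dir(save_dir, "<bucket_name>", "<prefix>") copies the merged model to the same location as model_data_s3_location.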

Part 2: Host QLoRA model for inference with AWS Inf2 using SageMaker LMI Container

In this section, we’ll walk through the steps of deploying a QLoRA fine-tuned model into an Amazon SageMaker hosting environment. We’ll use a DJL serving container from SageMaker DLC, which integrates with the transformers-neuronx library to host this model. The setup facilitates the loading of models onto AWS Inferentia2 accelerators, parallelizes the model across multiple NeuronCores, and enables serving via HTTP endpoints.

Prepare model artifacts

DJL supports many deep learning optimization libraries, including DeepSpeed, FasterTransformer, and more. For model-specific configurations, we provide a serving.properties file with key parameters, such as tensor_parallel_degree and model_id, that define the model loading options. The model_id can be a Hugging Face model ID or an Amazon S3 path where the model weights are stored. In our example, we provide the Amazon S3 location of our fine-tuned model. The following code snippet shows the properties used for model serving:

%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=<model data s3 location>
option.batch_size=4
option.neuron_optimize_level=2
option.tensor_parallel_degree=8
option.n_positions=512
option.rolling_batch=auto
option.dtype=fp16
option.model_loading_timeout=1500

For more information about the configurable options available in serving.properties, refer to this documentation. Note that we use option.n_positions=512 in this post for faster AWS Neuron compilation. If you want to try a larger input token length, we recommend pre-compiling the model ahead of time (see AOT Pre-Compile Model on EC2). Otherwise, you might run into a timeout error if compilation takes too long.

After the serving.properties file is defined, we’ll package the file into a tar.gz format, as follows:

%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel
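The same archive can be built with Python's standard tarfile module, which leaves serving.properties in place instead of moving and deleting it. The following is a minimal sketch; package_serving_properties is a hypothetical helper, and the mymodel/ top-level directory mirrors the shell commands above. Unlike tar czvf, it stores only the file entry and not a separate directory entry; extraction recreates the directory either way.

```python
import os
import tarfile


def package_serving_properties(properties_path: str, archive_path: str,
                               top_dir: str = "mymodel") -> None:
    """Write a gzipped tar archive containing <top_dir>/<file name>.

    The original file is left where it is.
    """
    arcname = f"{top_dir}/{os.path.basename(properties_path)}"
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(properties_path, arcname=arcname)
```

For example, package_serving_properties("serving.properties", "mymodel.tar.gz") produces an archive you can upload with the next step.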

Then, we’ll upload the tar.gz to an Amazon S3 bucket location:

import sagemaker

sess = sagemaker.Session()  # SageMaker session used for uploads and deployment
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

Create an Amazon SageMaker model endpoint

To use an Inf2 instance for serving, we use an Amazon SageMaker LMI container with DJL NeuronX support. For more information about using a DJL NeuronX container for inference, refer to this post. The following code shows how to deploy a model using the Amazon SageMaker Python SDK:

import sagemaker
from sagemaker import image_uris, serializers
from sagemaker.model import Model

# Retrieves the DJL-neuronx docker image URI
image_uri = image_uris.retrieve(
    framework="djl-neuronx",
    region=sess.boto_session.region_name,
    version="0.24.0",
)

# Define inf2 instance type to use for serving
instance_type = "ml.inf2.48xlarge"

endpoint_name = sagemaker.utils.name_from_base("lmi-model")

# Wrap the container image and the uploaded tar ball in a SageMaker model
role = sagemaker.get_execution_role()
lmi_model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

# Deploy the model for inference
lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1500,
    volume_size=256,
    endpoint_name=endpoint_name,
)

# Requests are serialized as JSON; responses are returned as raw bytes
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)

Test model endpoint

After the model is deployed successfully, we can validate the endpoint by sending a sample request to the predictor:

import json

prompt = "What is machine learning?"
input_data = f"<s>[INST] <<SYS>>\nAs a data scientist\n<</SYS>>\n{prompt} [/INST]"

response = predictor.predict(
    {"inputs": input_data, "parameters": {"max_new_tokens": 300, "do_sample": True}}
)

print(json.loads(response)["generated_text"])

The sample output is shown as follows:

In the context of data analysis, Machine Learning (ML) refers to a statistical technique capable of extracting predictive power from a dataset with an increasing complexity and accuracy by iteratively narrowing down the scope of a statistic.

Machine Learning is not a new statistical technique, but rather a combination of existing techniques. Furthermore, it has not been designed to be used with a specific dataset or to produce a specific outcome. Rather, it was designed to be flexible enough to adapt to any dataset and to make predictions about any outcome.
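The test request builds the Llama 2 chat prompt by hand. If you send many prompts, a small helper keeps the special markers consistent. This sketch follows the layout used in Meta's reference Llama 2 code, which places a blank line after <</SYS>> (the request above uses a single newline); build_llama2_prompt is a hypothetical name, and the literal <s> prefix mirrors the request above.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_llama2_prompt(user_message: str, system_prompt: str = "") -> str:
    """Format a single-turn prompt in the Llama 2 chat layout."""
    if system_prompt:
        content = B_SYS + system_prompt + E_SYS + user_message
    else:
        content = user_message
    return f"<s>{B_INST} {content.strip()} {E_INST}"
```

For example, predictor.predict({"inputs": build_llama2_prompt(prompt, "As a data scientist"), "parameters": {"max_new_tokens": 300}}) sends the same question with the system prompt formatted by the helper.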

Clean up

If you no longer need the SageMaker endpoint, you can delete it using the AWS SDK for Python (Boto3), the AWS CLI, or the Amazon SageMaker console. You can also shut down any Amazon SageMaker Studio resources that are no longer required.
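With Boto3, deleting the endpoint alone leaves its endpoint configuration and model behind. The following is a minimal sketch that removes all three, assuming an endpoint created as shown earlier; pass in boto3.client("sagemaker") and the endpoint_name used at deployment. delete_endpoint_and_resources is a hypothetical helper. From the notebook, the SageMaker Python SDK's predictor.delete_model() followed by predictor.delete_endpoint() achieves the same result.

```python
def delete_endpoint_and_resources(sm_client, endpoint_name: str) -> None:
    """Delete a SageMaker endpoint, its endpoint config, and its models.

    sm_client is a Boto3 SageMaker client, for example boto3.client("sagemaker").
    """
    # Look up the endpoint config and model names before anything is deleted.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
```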

Conclusion

In this post, we showed you how to fine-tune a Llama2-7b model using a LoRA adapter with 4-bit quantization on a single GPU instance. Then we deployed the model to an Inf2 instance hosted in Amazon SageMaker using a DJL serving container. Finally, we validated the Amazon SageMaker model endpoint with a text generation prediction using the SageMaker Python SDK. Go ahead and give it a try; we’d love to hear your feedback. Stay tuned for updates on more capabilities and new innovations with AWS Inferentia.

For more examples about AWS Neuron, see aws-neuron-samples.


About the Authors

Wei Teh is a Senior AI/ML Specialist Solutions Architect at AWS. He is passionate about helping customers advance their AWS journey, focusing on Amazon Machine Learning services and machine learning-based solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.


Build an end-to-end MLOps pipeline using Amazon SageMaker Pipelines, GitHub, and GitHub Actions


Machine learning (ML) models do not operate in isolation. To deliver value, they must integrate into existing production systems and infrastructure, which necessitates considering the entire ML lifecycle during design and development. ML operations, known as MLOps, focus on streamlining, automating, and monitoring ML models throughout their lifecycle. Building a robust MLOps pipeline demands cross-functional collaboration. Data scientists, ML engineers, IT staff, and DevOps teams must work together to operationalize models from research to deployment and maintenance. With the right processes and tools, MLOps enables organizations to reliably and efficiently adopt ML across their teams.

Although the requirements of continuous integration and continuous delivery (CI/CD) pipelines can be unique and reflect each organization’s needs, scaling MLOps practices across teams can be simplified by using managed orchestrations and tools that can accelerate the development process and remove the undifferentiated heavy lifting.

Amazon SageMaker MLOps is a suite of features that includes Amazon SageMaker Projects (CI/CD), Amazon SageMaker Pipelines and Amazon SageMaker Model Registry.

SageMaker Pipelines allows for straightforward creation and management of ML workflows, while also offering storage and reuse capabilities for workflow steps. The SageMaker Model Registry centralizes model tracking, simplifying model deployment. SageMaker Projects introduces CI/CD practices to ML, including environment parity, version control, testing, and automation. This allows for a quick establishment of CI/CD in your ML environment, facilitating effective scalability throughout your enterprise.

The built-in project templates provided by Amazon SageMaker include integration with some third-party tools, such as Jenkins for orchestration and GitHub for source control, and several use AWS native CI/CD tools such as AWS CodeCommit, AWS CodePipeline, and AWS CodeBuild. In many scenarios, however, customers would like to integrate SageMaker Pipelines with other existing CI/CD tools and therefore create their own custom project templates.

In this post, we show you a step-by-step implementation to achieve the following:

  • Create a custom SageMaker MLOps project template that integrates with GitHub and GitHub Actions
  • Make your custom project templates available in Amazon SageMaker Studio for your data science team with one-click provisioning

Solution overview

In this post, we construct the following architecture. We create an automated model build pipeline that includes steps for data preparation, model training, model evaluation, and registration of the trained model in the SageMaker Model Registry. The resulting trained ML model is then deployed from the SageMaker Model Registry to staging and production environments upon manual approval.

Solution Overview

Let’s delve into the elements of this architecture to understand the complete configuration.

GitHub and GitHub Actions

GitHub is a web-based platform that provides version control and source code management using Git. It enables teams to collaborate on software development projects, track changes, and manage code repositories. GitHub serves as a centralized location to store, version, and manage your ML code base. This ensures that your ML code base and pipelines are versioned, documented, and accessible by team members.

GitHub Actions is a powerful automation tool within the GitHub ecosystem. It allows you to create custom workflows that automate your software development lifecycle processes, such as building, testing, and deploying code. You can create event-driven workflows triggered by specific events, like when code is pushed to a repository or a pull request is created. When implementing MLOps, you can use GitHub Actions to automate various stages of the ML pipeline, such as:

  • Data validation and preprocessing
  • Model training and evaluation
  • Model deployment and monitoring
  • CI/CD for ML models

With GitHub Actions, you can streamline your ML workflows and ensure that your models are consistently built, tested, and deployed, leading to more efficient and reliable ML deployments.

In the following sections, we start by setting up the prerequisites relating to some of the components that we use as part of this architecture:

  • AWS CloudFormation – AWS CloudFormation initiates the model deployment and establishes the SageMaker endpoints after the model deployment pipeline is activated by the approval of the trained model.
  • AWS CodeStar connection – We use AWS CodeStar to establish a link with the GitHub repository and utilize it as code repo integration with AWS resources, like SageMaker Studio.
  • Amazon EventBridge – Amazon EventBridge keeps track of all modifications to the model registry. It also maintains a rule that prompts the Lambda function to deploy the model pipeline when the status of the model package version changes from PendingManualApproval to Approved within the model registry.
  • AWS Lambda – We use an AWS Lambda function to initiate the model deployment workflow in GitHub Actions after a new model is registered in the model registry.
  • Amazon SageMaker – We configure the following SageMaker components:
    • Pipeline – This component consists of a directed acyclic graph (DAG) that helps us build the automated ML workflow for the stages of data preparation, model training, and model evaluation, and then registers the trained model in the model registry.
    • Endpoint – This component sets up two HTTPS real-time endpoints for inference. The hosting configuration can be adjusted, for instance, for batch transform or asynchronous inference. The staging endpoint is generated when the model deployment pipeline is activated by the approval of the trained model from the SageMaker Model Registry. This endpoint is used to validate the deployed model by ensuring it provides predictions that satisfy our accuracy standards. When the model is ready for production, the production endpoint is deployed after a manual approval stage in the GitHub Actions workflow.
    • Code repository – This creates a Git repository as a resource in your SageMaker account. When you create the project, SageMaker uses the GitHub repository details you provide to associate that repository with the project. This links the GitHub repository to SageMaker, enabling interactive actions (pull/push) with your repository.
    • Model registry – This monitors the various versions of the model and the corresponding artifacts, which includes lineage and metadata. A collection known as a model package group is created, housing related versions of the model. Moreover, the model registry oversees the approval status of the model version, ensuring its readiness for subsequent deployment.
  • AWS Secrets Manager – To store your GitHub personal access token securely, you create a secret in AWS Secrets Manager that holds the token.
  • AWS Service Catalog – We use the AWS Service Catalog for the implementation of SageMaker projects, which include components like a SageMaker code repository, Lambda function, EventBridge rule, artifact S3 bucket, etc., all implemented via CloudFormation. This allows your organization to use project templates repeatedly, allocate projects to each user, and streamline operations.
  • Amazon S3 – We use an Amazon Simple Storage Service (Amazon S3) bucket to keep the model artifacts produced by the pipeline.
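The EventBridge rule described above reacts when a model package is approved. The following sketch shows what such an event pattern can look like, expressed as a Python dict, along with a deliberately simplified matcher for local experimentation. The pattern is an assumption based on SageMaker's Model Package State Change events; the rule in the provided template.yml may differ, and matches() implements only exact-value matching, not EventBridge's full pattern syntax (prefix, numeric, exists, and so on).

```python
def model_approved_pattern(model_package_group_name: str) -> dict:
    """EventBridge pattern for 'model package approved' events (assumed shape)."""
    return {
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {
            "ModelPackageGroupName": [model_package_group_name],
            "ModelApprovalStatus": ["Approved"],
        },
    }


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matching: every pattern key must be present, nested dicts
    are matched recursively, and leaf values must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

To create a rule from it, json.dumps(model_approved_pattern("<model-package-group-name>")) can be passed as the EventPattern argument of the Boto3 EventBridge client's put_rule call.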

Prerequisites

You should have the following prerequisites:

You must also complete additional setup steps before implementing the solution.

Set up an AWS CodeStar connection

If you don’t already have an AWS CodeStar connection to your GitHub account, refer to Create a connection to GitHub for instructions to create one. Your AWS CodeStar connection ARN will look like this:

arn:aws:codestar-connections:us-west-2:account_id:connection/aEXAMPLE-8aad-4d5d-8878-dfcab0bc441f

In this example, aEXAMPLE-8aad-4d5d-8878-dfcab0bc441f is the unique ID for this connection. We use this ID when we create our SageMaker project later in this example.
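If you script the project creation, you can extract the ID from the ARN instead of copying it by hand. The following is a minimal sketch, assuming the ARN format shown above; connection_id_from_arn is a hypothetical helper.

```python
def connection_id_from_arn(arn: str) -> str:
    """Return the unique ID at the end of an AWS CodeStar connection ARN."""
    # arn:partition:service:region:account:resource, where the resource part
    # itself may not contain further colons for this service.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "codestar-connections":
        raise ValueError(f"not a CodeStar connection ARN: {arn!r}")
    resource_type, _, connection_id = parts[5].partition("/")
    if resource_type != "connection" or not connection_id:
        raise ValueError(f"unexpected resource in ARN: {parts[5]!r}")
    return connection_id
```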

Set up secret access keys for your GitHub token

To securely store your GitHub personal access token, you need to create a secret in Secrets Manager. If you don’t have a personal access token for GitHub, refer to Managing your personal access tokens for instructions to create one.

You can create either a classic or fine-grained access token. However, make sure that the token has access to the repository’s contents and actions (workflows, runs, and artifacts).

Complete the following steps to store your token in Secrets Manager:

  1. On the Secrets Manager console, choose Store a new secret.
  2. Select Other type of secret for Choose secret type.
  3. Provide a name for your secret in the Key field and add your personal access token to the corresponding Value field.
  4. Choose Next, enter a name for your secret, and choose Next again.
  5. Choose Store to save your secret.

By storing your GitHub personal access token in Secrets Manager, you can securely access it within your MLOps pipeline while ensuring its confidentiality.
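As an illustration of how a component such as the Lambda function could read the token at run time, the following sketch fetches a key/value secret with Boto3 and returns one field. It is an assumption, not the sample's code: the sample repository may read the secret differently, and secret_name and key are whatever you chose in the preceding steps. read_github_token is a hypothetical helper; avoid logging the value it returns.

```python
import json


def read_github_token(secrets_client, secret_name: str, key: str) -> str:
    """Return the value stored under `key` in a key/value Secrets Manager secret.

    secrets_client is a Boto3 Secrets Manager client, for example
    boto3.client("secretsmanager"). Key/value secrets are stored as a
    JSON object in SecretString.
    """
    response = secrets_client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])[key]
```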

Create an IAM user for GitHub Actions

To allow GitHub Actions to deploy SageMaker endpoints in your AWS environment, you need to create an AWS Identity and Access Management (IAM) user and grant it the necessary permissions. For instructions, refer to Creating an IAM user in your AWS account. Use the iam/GithubActionsMLOpsExecutionPolicy.json file (provided in the code sample) to provide sufficient permissions for this user to deploy your endpoints.

After you create the IAM user, generate an access key. You will use this key, which consists of both an access key ID and a secret access key, in the subsequent step when configuring your GitHub secrets.

Set up your GitHub account

The following are the steps to prepare your GitHub account to run this example.

Clone the GitHub repository

You can reuse an existing GitHub repo for this example. However, it’s easier if you create a new repository. This repository is going to contain all the source code for both SageMaker pipeline builds and deployments.

Copy the contents of the seed code directory into the root of your GitHub repository. For instance, the .github directory should be under the root of your GitHub repository.

Create a GitHub secret containing your IAM user access key

In this step, we store the access key details of the newly created user in our GitHub secret.

  1. On the GitHub website, navigate to your repository and choose Settings.
  2. In the security section, select Secrets and Variables and choose Actions.
  3. Choose New Repository Secret.
  4. For Name, enter AWS_ACCESS_KEY_ID.
  5. For Secret, enter the access key ID associated with the IAM user you created earlier.
  6. Choose Add Secret.
  7. Repeat steps 3–6 for AWS_SECRET_ACCESS_KEY, entering the secret access key as the value.

Configure your GitHub environments

To create a manual approval step in our deployment pipelines, we use a GitHub environment. Complete the following steps:

  1. Navigate to the Settings, Environments menu of your GitHub repository and create a new environment called production.
  2. For Environment protection rules, select Required reviewers.
  3. Add the desired GitHub user names as reviewers. For this example, you can choose your own user name.

Note that the environment feature is not available in some types of GitHub plans. For more information, refer to Using environments for deployment.

Deploy the Lambda function

In the following steps, we compress lambda_function.py into a .zip file, which is then uploaded to an S3 bucket.

The relevant code sample for this can be found in the following GitHub repo. Specifically, the lambda_function.py is located in the lambda_functions/lambda_github_workflow_trigger directory.
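At its core, the Lambda function's job is to start the deployment workflow in GitHub Actions. The sample relies on the PyGithub library for its GitHub calls (hence the pygithub layer built later in this section). The following is a dependency-free sketch of one way to do this through GitHub's workflow dispatch REST endpoint, assuming the workflow file declares a workflow_dispatch trigger. build_workflow_dispatch_request is a hypothetical helper, the "main" default branch is an assumption, and any inputs your deploy.yml expects are not shown here.

```python
import json
import urllib.request
from typing import Optional


def build_workflow_dispatch_request(owner: str, repo: str, workflow_file: str,
                                    token: str, ref: str = "main",
                                    inputs: Optional[dict] = None
                                    ) -> urllib.request.Request:
    """Build (without sending) a POST to GitHub's workflow dispatch endpoint."""
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    body = {"ref": ref}
    if inputs:
        body["inputs"] = inputs
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it with urllib.request.urlopen(request) starts the workflow; GitHub responds with HTTP 204 when the dispatch is accepted.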

It’s recommended to create a fork of the code sample and clone that instead. This will give you the freedom to modify the code and experiment with different aspects of the sample.

  1. After you obtain a copy of the code, navigate to the appropriate directory and use the zip command to compress lambda_function.py. Alternatively, Windows and macOS users can create the .zip file with File Explorer or Finder, respectively.
cd lambda_functions/lambda_github_workflow_trigger
zip lambda-github-workflow-trigger.zip lambda_function.py
  2. Upload the lambda-github-workflow-trigger.zip to an S3 bucket.

This bucket will later be accessed by Service Catalog. You can choose any bucket that you have access to, as long as Service Catalog is able to retrieve data from it in subsequent steps.

From this step onwards, we require the AWS CLI v2 to be installed and configured. An alternative would be to utilize AWS CloudShell, which comes with all necessary tools pre-installed, eliminating the need for any additional configurations.

  3. To upload the file to the S3 bucket, use the following command:
aws s3 cp lambda-github-workflow-trigger.zip s3://your-bucket/

Now we construct a Lambda layer for the dependencies related to the lambda_function we just uploaded.

  1. Set up a Python virtual environment and get the dependencies installed:
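mkdir lambda_layer
cd lambda_layer
# Use Python 3.9 so the packages match the layer's python3.9 runtime and the path used below.
# Dependencies with compiled extensions must also be built for arm64 Linux to load on an arm64 function.
python3.9 -m venv .env
source .env/bin/activate
pip install pygithub
deactivate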
  2. Generate the .zip file with the following commands:
mv .env/lib/python3.9/site-packages/ python
zip -r layer.zip python
  3. Publish the layer to AWS:
aws lambda publish-layer-version --layer-name python39-github-arm64 \
  --description "Python3.9 pygithub" \
  --license-info "MIT" \
  --zip-file fileb://layer.zip \
  --compatible-runtimes python3.9 \
  --compatible-architectures "arm64"

With this layer published, all your Lambda functions can now reference it to meet their dependencies. For a more detailed understanding of Lambda layers, refer to Working with Lambda layers.

Create a custom project template in SageMaker

After you complete all the preceding steps, you have all the CI/CD pipeline resources and components. Next, we demonstrate how to make these resources available as a custom project in SageMaker Studio that users can deploy with one click.

As discussed earlier, when the SageMaker-provided templates don’t meet your needs (for example, you want more complex orchestration in CodePipeline with multiple stages and custom approval steps, or you want to integrate with a third-party tool such as GitHub and GitHub Actions, as demonstrated in this post), you can create your own templates. We recommend starting with the SageMaker-provided templates to understand how to organize your code and resources, and building on top of them. For more details, refer to Create Custom Project Templates.

Note that you can also automate this step and use CloudFormation to deploy the Service Catalog portfolio and product as code. In this post, however, we show the console deployment for a better learning experience.

At this stage, we use the provided CloudFormation template to create a Service Catalog portfolio that helps us create custom projects in SageMaker.

You can create a new domain or reuse your SageMaker domain for the following steps. If you don’t have a domain, refer to Onboard to Amazon SageMaker Domain using Quick setup for setup instructions.

After you enable administrator access to the SageMaker templates, complete the following steps:

  1. On the Service Catalog console, under Administration in the navigation pane, choose Portfolios.
  2. Choose Create a new portfolio.
  3. Name the portfolio “SageMaker Organization Templates”.
  4. Download the template.yml file to your computer.

This CloudFormation template provisions all the CI/CD resources we need as configuration and infrastructure as code. You can study the template in more detail to see which resources it deploys. This template has been customized to integrate with GitHub and GitHub Actions.

  1. In the template.yml file, change the S3Bucket value to your bucket where you have uploaded the Lambda .zip file:
GitHubWorkflowTriggerLambda:
  ...
  Code:
    S3Bucket: <your-bucket>
    S3Key: lambda-github-workflow-trigger.zip
  ...
  1. Choose the new portfolio.
  2. Choose Upload a new product.
  3. For Product name, enter a name for your template. We use the name build-deploy-github.
  4. For Description, enter a description.
  5. For Owner, enter your name.
  6. Under Version details, for Method, choose Use a template file.
  7. Choose Upload a template.
  8. Upload the template you downloaded.
  9. For Version title, enter 1.0.
  10. Choose Review.
  11. Review your settings and choose Create product.
  12. Choose Refresh to list the new product.
  13. Choose the product you just created.
  14. On the Tags tab, add the following tag to the product:
    • Key = sagemaker:studio-visibility
    • Value = true

Back in the portfolio details, you should see something similar to the following screenshot (with different IDs).

Service Catalog Portfolio

  1. On the Constraints tab, choose Create constraint.
  2. For Product, choose build-deploy-github (the product you just created).
  3. For Constraint type, choose Launch.
  4. Under Launch constraint, for Method, choose Select IAM role.
  5. Choose AmazonSageMakerServiceCatalogProductsLaunchRole.
  6. Choose Create.
  7. On the Groups, roles, and users tab, choose Add groups, roles, users.
  8. On the Roles tab, select the role you used when configuring your SageMaker Studio domain. This is where the SageMaker domain role can be found.

Service Catalog Launch Constraint

  9. Choose Add access.

Deploy the project from SageMaker Studio

In the previous sections, you prepared the custom MLOps project environment. Now, let’s create a project using this template:

  1. On the SageMaker console, navigate to the domain in which you want to create this project.
  2. On the Launch menu, choose Studio.

You’ll be redirected to the SageMaker Studio environment.

  1. In SageMaker Studio, in the navigation pane under Deployments, choose Projects.
  2. Choose Create project.
  3. At the top of the list of templates, choose Organization templates.

If you have gone through all the previous steps successfully, you should be able to see a new custom project template named Build-Deploy-GitHub.

  1. Select that template and choose Select Project Template.
  2. Enter an optional description.
  3. For GitHub Repository Owner Name, enter the owner of your GitHub repository. For example, if your repository is at https://github.com/pooyavahidi/my-repo, the owner would be pooyavahidi.
  4. For GitHub Repository Name, enter the name of the repository into which you copied the seed code. It would be just the name of the repo. For example, in https://github.com/pooyavahidi/my-repo, the repo is my-repo.
  5. For Codestar connection unique ID, enter the unique ID of the AWS CodeStar connection that you created.
  6. For Name of the secret in the Secrets Manager which stores GitHub token, enter the name of the secret in Secrets Manager where you created and stored the GitHub token.
  7. For GitHub workflow file for deployment, enter the name of the GitHub workflow file (at .github/workflows/deploy.yml) where you have the deployment instructions. For this example, you can keep it as default, which is deploy.yml.
  8. Choose Create project.

SageMaker Studio Project

  1. After creating your project, make sure you update the AWS_REGION and SAGEMAKER_PROJECT_NAME environment variables in your GitHub workflow files accordingly. Workflow files are in your GitHub repo (copied from the seed code), inside the .github/workflows directory. Make sure you update both build.yml and deploy.yml files.
...
env:
  AWS_REGION: <region>   
  SAGEMAKER_PROJECT_NAME: <your project name>
...
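Because a missed placeholder only surfaces when a workflow run fails, a quick check before pushing can help. The following is a minimal sketch, assuming the placeholders are exactly <region> and <your project name> as shown above and that the workflow files live in .github/workflows; find_unset_placeholders is a hypothetical helper.

```python
import pathlib
from typing import Dict, List

PLACEHOLDERS = ("<region>", "<your project name>")


def find_unset_placeholders(workflows_dir: str) -> Dict[str, List[str]]:
    """Map each workflow file name to the placeholders it still contains."""
    leftovers = {}
    # Matches both .yml and .yaml files.
    for path in sorted(pathlib.Path(workflows_dir).glob("*.y*ml")):
        text = path.read_text(encoding="utf-8")
        found = [p for p in PLACEHOLDERS if p in text]
        if found:
            leftovers[path.name] = found
    return leftovers
```

Running print(find_unset_placeholders(".github/workflows")) from the repository root prints an empty dict once both build.yml and deploy.yml are updated.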

Now your environment is ready to go! You can run the pipelines directly, make changes, and push those changes to your GitHub repository to trigger the automated build pipeline and see how all the steps of build and deploy are automated.

Clean up

To clean up the resources, complete the following steps:

  • Delete the CloudFormation stacks used for the SageMaker project and SageMaker endpoints.
  • Delete the SageMaker domain.
  • Delete the Service Catalog resources.
  • Delete the AWS CodeStar connection link with the GitHub repository.
  • Delete the IAM user that you created for GitHub Actions.
  • Delete the secret in Secrets Manager that stores the GitHub personal access token.

Summary

In this post, we walked through the process of using a custom SageMaker MLOps project template to automatically construct and organize a CI/CD pipeline. This pipeline effectively integrates your existing CI/CD mechanisms with SageMaker capabilities for data manipulation, model training, model approval, and model deployment. In our scenario, we focused on integrating GitHub Actions with SageMaker projects and pipelines. For a comprehensive understanding of the implementation details, visit the GitHub repository. Feel free to experiment with this and don’t hesitate to leave any queries you might have in the comments section.


About the Authors

Dr. Romina Sharifpour is a Senior Machine Learning and Artificial Intelligence Solutions Architect at Amazon Web Services (AWS). She has spent over 10 years leading the design and implementation of innovative end-to-end solutions enabled by advancements in ML and AI. Romina’s areas of interest are natural language processing, large language models, and MLOps.

Pooya Vahidi is a Senior Solutions Architect at AWS, passionate about computer science, artificial intelligence, and cloud computing. As an AI professional, he is an active member of the AWS AI/ML Area-of-Depth team. With a background spanning over two decades of expertise in leading the architecture and engineering of large-scale solutions, he helps customers on their transformative journeys through cloud and AI/ML technologies.


Create a web UI to interact with LLMs using Amazon SageMaker JumpStart


The launch of ChatGPT and rise in popularity of generative AI have captured the imagination of customers who are curious about how they can use this technology to create new products and services on AWS, such as more conversational enterprise chatbots. This post shows you how you can create a web UI, which we call Chat Studio, to start a conversation and interact with foundation models available in Amazon SageMaker JumpStart such as Llama 2, Stable Diffusion, and other models available on Amazon SageMaker. After you deploy this solution, users can get started quickly and experience the capabilities of multiple foundation models in conversational AI through a web interface.

Chat Studio can also optionally invoke the Stable Diffusion model endpoint to return a collage of relevant images and videos if the user requests media. This feature can enhance the user experience by adding media as assets that accompany the response. This is just one example of how you can enrich Chat Studio with additional integrations to meet your goals.

The following screenshots show examples of what a user query and response look like.

Chat Studio query interface

Chat Studio response interface

Large language models

Generative AI chatbots such as ChatGPT are powered by large language models (LLMs), which are based on a deep learning neural network that can be trained on large quantities of unlabeled text. The use of LLMs allows for a better conversational experience that closely resembles interactions with real humans, fostering a sense of connection and improved user satisfaction.

SageMaker foundation models

In 2021, the Stanford Institute for Human-Centered Artificial Intelligence termed some LLMs foundation models. Foundation models are pre-trained on a large and broad set of general data and are meant to serve as the foundation for further optimization in a wide range of use cases, from generating digital art to multilingual text classification. These foundation models are popular with customers because training a new model from scratch takes time and can be expensive. SageMaker JumpStart provides access to hundreds of foundation models from third-party open source and proprietary providers.

Solution overview

This post walks through a low-code workflow for deploying pre-trained and custom LLMs through SageMaker, and creating a web UI to interface with the models deployed. We cover the following steps:

  1. Deploy SageMaker foundation models.
  2. Deploy AWS Lambda and AWS Identity and Access Management (IAM) permissions using AWS CloudFormation.
  3. Set up and run the user interface.
  4. Optionally, add other SageMaker foundation models. This step extends Chat Studio’s capability to interact with additional foundation models.
  5. Optionally, deploy the application using AWS Amplify. This step deploys Chat Studio to the web.

Refer to the following diagram for an overview of the solution architecture.

Chat Studio Solution Architecture

Prerequisites

To walk through the solution, you must have the following prerequisites:

  • An AWS account with sufficient IAM user privileges.
  • npm installed in your local environment. For instructions on how to install npm, refer to Downloading and installing Node.js and npm.
  • A service quota of 1 for the corresponding SageMaker endpoints. For Llama 2 13b Chat, we use an ml.g5.48xlarge instance and for Stable Diffusion 2.1, we use an ml.p3.2xlarge instance.

To request a service quota increase, on the AWS Service Quotas console, navigate to AWS services, SageMaker, and request a quota increase to a value of 1 for ml.g5.48xlarge for endpoint usage and for ml.p3.2xlarge for endpoint usage.

The service quota request may take a few hours to be approved, depending on the instance type availability.

Deploy SageMaker foundation models

SageMaker is a fully managed machine learning (ML) service that helps developers quickly build and train ML models. Complete the following steps to deploy the Llama 2 13b Chat and Stable Diffusion 2.1 foundation models using Amazon SageMaker Studio:

  1. Create a SageMaker domain. For instructions, refer to Onboard to Amazon SageMaker Domain using Quick setup.

A domain sets up all the storage and allows you to add users to access SageMaker.

  2. On the SageMaker console, choose Studio in the navigation pane, then choose Open Studio.
  3. After Studio launches, under SageMaker JumpStart in the navigation pane, choose Models, notebooks, solutions.
    SageMaker JumpStart Console
  4. In the search bar, search for Llama 2 13b Chat.
  5. Under Deployment Configuration, for SageMaker hosting instance, choose ml.g5.48xlarge and for Endpoint name, enter meta-textgeneration-llama-2-13b-f.
  6. Choose Deploy.

SageMaker JumpStart Deployment Configuration

After the deployment succeeds, you should be able to see the In Service status.

Llama Model Status

  7. On the Models, notebooks, solutions page, search for Stable Diffusion 2.1.
  8. Under Deployment Configuration, for SageMaker hosting instance, choose ml.p3.2xlarge and for Endpoint name, enter jumpstart-dft-stable-diffusion-v2-1-base.
  9. Choose Deploy.

SageMaker JumpStart Deployment Configuration

After the deployment succeeds, you should be able to see the In Service status.

Stable Diffusion Model Status
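If you prefer code to the console, the two deployments above can be approximated with the SageMaker Python SDK. This is a minimal sketch, not the post's method: it assumes a recent SDK version in which `JumpStartModel` and the `accept_eula` argument to `deploy` are available, and the JumpStart model IDs below (especially the Stable Diffusion one) are assumptions to confirm in the JumpStart catalog. The endpoint names match the ones entered in the console steps.

```python
from typing import Dict

# Endpoint name -> deployment settings, mirroring the console steps above.
# ASSUMPTION: the model_id values; confirm them in the JumpStart catalog.
DEPLOYMENTS: Dict[str, Dict[str, object]] = {
    "meta-textgeneration-llama-2-13b-f": {
        "model_id": "meta-textgeneration-llama-2-13b-f",
        "instance_type": "ml.g5.48xlarge",
        "accept_eula": True,  # Llama 2 models require accepting Meta's EULA
    },
    "jumpstart-dft-stable-diffusion-v2-1-base": {
        "model_id": "model-txt2img-stabilityai-stable-diffusion-v2-1-base",
        "instance_type": "ml.p3.2xlarge",
        "accept_eula": False,
    },
}


def deploy_all() -> None:
    """Deploy each model to an endpoint with the name the Lambda function expects."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    for endpoint_name, cfg in DEPLOYMENTS.items():
        model = JumpStartModel(model_id=cfg["model_id"])
        kwargs = {
            "initial_instance_count": 1,
            "instance_type": cfg["instance_type"],
            "endpoint_name": endpoint_name,
        }
        if cfg["accept_eula"]:
            # ASSUMPTION: supported by deploy() in recent SDK versions.
            kwargs["accept_eula"] = True
        model.deploy(**kwargs)
```

Call `deploy_all()` from an environment that has AWS credentials and the endpoint quotas described in the prerequisites.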

Deploy Lambda and IAM permissions using AWS CloudFormation

This section describes how to launch a CloudFormation stack that deploys a Lambda function, which processes your user request and calls the SageMaker endpoints that you deployed, along with all the necessary IAM permissions. Complete the following steps:

  1. Navigate to the GitHub repository and download the CloudFormation template (lambda.cfn.yaml) to your local machine.
  2. On the CloudFormation console, choose the Create stack drop-down menu and choose With new resources (standard).
  3. On the Specify template page, select Upload a template file and Choose file.
  4. Choose the lambda.cfn.yaml file that you downloaded, then choose Next.
  5. On the Specify stack details page, enter a stack name and the API key that you obtained in the prerequisites, then choose Next.
  6. On the Configure stack options page, choose Next.
  7. Review and acknowledge the changes and choose Submit.

Set up the web UI

This section describes the steps to run the web UI (created using Cloudscape Design System) on your local machine:

  1. On the IAM console, navigate to the user functionUrl.
  2. On the Security Credentials tab, choose Create access key.
  3. On the Access key best practices & alternatives page, select Command Line Interface (CLI) and choose Next.
  4. On the Set description tag page, choose Create access key.
  5. Copy the access key and secret access key.
  6. Choose Done.
  7. Navigate to the GitHub repository and download the react-llm-chat-studio code.
  8. Launch the folder in your preferred IDE and open a terminal.
  9. Navigate to src/configs/aws.json and input the access key and secret access key you obtained.
  10. Enter the following commands in the terminal:
    npm install
    
    npm start

  11. Open http://localhost:3000 in your browser and start interacting with your models!

To use Chat Studio, choose a foundation model on the drop-down menu and enter your query in the text box. To get AI-generated images along with the response, add the phrase “with images” to the end of your query.
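The “with images” convention can be handled with a simple suffix check before the request is routed to the models. The sketch below is hypothetical: the function name `parse_query` and the choice to strip the phrase before sending the prompt to the LLM are assumptions, not necessarily what the repository's Lambda function does.

```python
from typing import Tuple

IMAGE_TRIGGER = "with images"


def parse_query(query: str) -> Tuple[str, bool]:
    """Split a user query into (prompt for the LLM, whether to also request images)."""
    text = query.strip()
    # Case-insensitive match on the trailing phrase, ignoring trailing punctuation.
    stripped = text.rstrip(" .!?")
    if stripped.lower().endswith(IMAGE_TRIGGER):
        # Drop the trigger phrase and any separator left in front of it.
        prompt = stripped[: -len(IMAGE_TRIGGER)].rstrip(" ,")
        return prompt, True
    return text, False
```

When the second value is `True`, the handler would call the Stable Diffusion endpoint in addition to the selected LLM.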

Add other SageMaker foundation models

You can further extend the capability of this solution to include additional SageMaker foundation models. Because every model expects different input and output formats when invoking its SageMaker endpoint, you will need to write some transformation code in the callSageMakerEndpoints Lambda function to interface with the model.

This section describes the general steps and code changes required to implement an additional model of your choice. Note that basic knowledge of Python is required for Steps 6–8.

  1. In SageMaker Studio, deploy the SageMaker foundation model of your choice.
  2. Choose SageMaker JumpStart and Launch JumpStart assets.
  3. Choose your newly deployed model endpoint and choose Open Notebook.
  4. On the notebook console, find the payload parameters.

These are the fields that the new model expects when invoking its SageMaker endpoint. The following screenshot shows an example.

SageMaker Endpoint Configuration

  5. On the Lambda console, navigate to callSageMakerEndpoints.
  6. Add a custom input handler for your new model.

In the following screenshot, we transformed the input for Falcon 40B Instruct BF16 and GPT NeoXT Chat Base 20B FP16. You can insert your custom parameter logic as indicated to add the input transformation logic with reference to the payload parameters that you copied.

Lambda Code Snippet
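As a rough illustration of what an input handler does, the following sketch maps a prompt plus the payload parameters into a JSON request body. The body shapes are assumptions modeled on common JumpStart text-generation payloads (an `inputs` string with a `parameters` object, and a role/content message list for Llama 2 chat); always check them against the payload parameters in your model's notebook. All function names and the Falcon endpoint name are hypothetical.

```python
import json
from typing import Any, Callable, Dict


def text_generation_input(prompt: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """ASSUMED shape for models that take a plain string (for example, Falcon-style)."""
    return {"inputs": prompt, "parameters": params}


def llama2_chat_input(prompt: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """ASSUMED shape for Llama 2 chat: a batch holding one dialog of role/content turns."""
    return {"inputs": [[{"role": "user", "content": prompt}]], "parameters": params}


# Hypothetical registry: endpoint name -> input handler. Add one entry per new model.
INPUT_HANDLERS: Dict[str, Callable[[str, Dict[str, Any]], Dict[str, Any]]] = {
    "meta-textgeneration-llama-2-13b-f": llama2_chat_input,
    "my-falcon-40b-instruct-endpoint": text_generation_input,  # hypothetical name
}


def build_request_body(endpoint_name: str, prompt: str, params: Dict[str, Any]) -> bytes:
    """Serialize the model-specific payload, falling back to the plain-string shape."""
    handler = INPUT_HANDLERS.get(endpoint_name, text_generation_input)
    return json.dumps(handler(prompt, params)).encode("utf-8")
```

In the Lambda function, the resulting bytes would be passed as `Body` to the `sagemaker-runtime` client's `invoke_endpoint` call, together with `EndpointName` and `ContentType="application/json"`.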

  7. Return to the notebook console and locate query_endpoint.

This function gives you an idea of how to transform the output of the model to extract the final text response.

SageMaker Endpoint Configuration
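An output handler is the mirror image of the input handler: it pulls the generated text out of the endpoint's JSON response. The response shapes below are assumptions (a list of `generated_text` objects for text-generation containers, and a `generation` message for Llama 2 chat); `query_endpoint` in your model's notebook is the authoritative reference. Function names and the Falcon endpoint name are hypothetical.

```python
import json
from typing import Any, Callable, Dict


def text_generation_output(response: Any) -> str:
    """ASSUMED response shape: [{"generated_text": "..."}]."""
    return response[0]["generated_text"]


def llama2_chat_output(response: Any) -> str:
    """ASSUMED response shape: [{"generation": {"role": "assistant", "content": "..."}}]."""
    return response[0]["generation"]["content"]


# Hypothetical registry: endpoint name -> output handler.
OUTPUT_HANDLERS: Dict[str, Callable[[Any], str]] = {
    "meta-textgeneration-llama-2-13b-f": llama2_chat_output,
    "my-falcon-40b-instruct-endpoint": text_generation_output,  # hypothetical name
}


def extract_text(endpoint_name: str, raw_body: bytes) -> str:
    """Decode the endpoint response and return only the final text answer."""
    response = json.loads(raw_body.decode("utf-8"))
    handler = OUTPUT_HANDLERS.get(endpoint_name, text_generation_output)
    return handler(response).strip()
```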

  8. With reference to the code in query_endpoint, add a custom output handler for your new model.
    Lambda Code
  9. Choose Deploy.
  10. Open your IDE, launch the react-llm-chat-studio code, and navigate to src/configs/models.json.
  11. Add your model name and model endpoint, and enter the payload parameters from Step 4 under payload using the following format:
    "add_model_name": {
      "endpoint_name": "add_model_endpoint",
      "payload": {
        "add_payload_parameters_here"
      }
    },

  12. Refresh your browser to start interacting with your new model!

Deploy the application using Amplify

Amplify is a complete solution that allows you to quickly and efficiently deploy your application. This section describes the steps to deploy Chat Studio to an Amazon CloudFront distribution using Amplify if you wish to share your application with other users.

  1. Navigate to the react-llm-chat-studio code folder you created earlier.
  2. Enter the following commands in the terminal and follow the setup instructions:
    npm install -g @aws-amplify/cli
    
    amplify configure

  3. Initialize a new Amplify project by using the following command. Provide a project name, accept the default configurations, and choose AWS access keys when prompted to select the authentication method.
    amplify init

  4. Host the Amplify project by using the following command. Choose Amazon CloudFront and S3 when prompted to select the plugin mode.
    amplify hosting add

  5. Finally, build and deploy the project with the following command:
    amplify publish

  6. After the deployment succeeds, open the URL provided in your browser and start interacting with your models!

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.
  2. Delete the SageMaker JumpStart endpoint. For instructions, refer to Delete Endpoints and Resources.
  3. Delete the SageMaker domain. For instructions, refer to Delete an Amazon SageMaker Domain.
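If you prefer to script step 2, the following boto3 sketch deletes each endpoint together with its endpoint configuration and models. It assumes AWS credentials with the relevant SageMaker permissions and the endpoint names used earlier in this post; `delete_endpoint_resources` is a hypothetical helper name.

```python
from typing import List, Optional

ENDPOINT_NAMES = (
    "meta-textgeneration-llama-2-13b-f",
    "jumpstart-dft-stable-diffusion-v2-1-base",
)


def delete_endpoint_resources(endpoint_name: str, region: Optional[str] = None) -> List[str]:
    """Delete an endpoint, its endpoint config, and its models; return the model names."""
    import boto3  # imported lazily so the module loads without boto3 installed

    sm = boto3.client("sagemaker", region_name=region)
    config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    variants = sm.describe_endpoint_config(EndpointConfigName=config_name)["ProductionVariants"]
    model_names = [v["ModelName"] for v in variants]

    # The endpoint holds the running instances, so delete it first.
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm.delete_model(ModelName=name)
    return model_names
```

For example: `for name in ENDPOINT_NAMES: delete_endpoint_resources(name)`.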

Conclusion

In this post, we explained how to create a web UI for interfacing with LLMs deployed on AWS.

With this solution, you can interact with your LLM and hold a conversation in a user-friendly manner to test or ask the LLM questions, and get a collage of images and videos if required.

You can extend this solution in various ways, such as to integrate additional foundation models, integrate with Amazon Kendra to enable ML-powered intelligent search for understanding enterprise content, and more!

We invite you to experiment with different pre-trained LLMs available on AWS, or build on top of or even create your own LLMs in SageMaker. Let us know your questions and findings in the comments, and have fun!


About the authors

Jarrett Yeo Shan Wei is an Associate Cloud Architect in AWS Professional Services covering the Public Sector across ASEAN and is an advocate for helping customers modernize and migrate into the cloud. He has attained five AWS certifications, and has also published a research paper on gradient boosting machine ensembles in the 8th International Conference on AI. In his free time, Jarrett focuses on and contributes to the generative AI scene at AWS.

Tammy Lim Lee Xin is an Associate Cloud Architect at AWS. She uses technology to help customers deliver their desired outcomes in their cloud adoption journey and is passionate about AI/ML. Outside of work she loves travelling, hiking, and spending time with family and friends.


Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium


Large language models (LLMs) have become a topic of daily conversation. Their quick adoption is evident from the time required to reach 100 million users, which has gone from “4.5yrs by facebook” to an all-time low of a mere “2 months by ChatGPT.” A generative pre-trained transformer (GPT) uses causal autoregressive updates to make predictions. These model architectures have demonstrated strong performance on a variety of tasks such as speech recognition, text generation, and question answering. Several recent models such as NeoX, Falcon, and Llama use the GPT architecture as a backbone. Training LLMs requires a colossal amount of compute time, which costs millions of dollars. In this post, we summarize the training procedure of GPT NeoX on AWS Trainium, a purpose-built machine learning (ML) accelerator optimized for deep learning training. We outline how we cost-effectively (3.2M tokens/$) trained such models with AWS Trainium without losing any model quality.

Solution overview

GPT NeoX and Pythia models

GPT NeoX and Pythia are open-source causal language models by EleutherAI, with approximately 20 billion parameters in NeoX and 6.9 billion in Pythia. Both are decoder models following a similar architectural design to GPT-3. However, they also have several additions that are widely adopted in recent models such as Llama. In particular, they use rotary positional embedding (RoPE) with partial rotation across head dimensions. The original models (NeoX and Pythia 6.9B) were trained on the openly available Pile dataset with deduplication, using the Megatron and DeepSpeed backends.

We demonstrate the pre-training and fine-tuning of these models on AWS Trainium-based Trn1 instances using the Neuron NeMo library. For a proof of concept and quick reproduction, we use a smaller Wikipedia dataset subset tokenized with the GPT2 byte-pair encoding (BPE) tokenizer.

Walkthrough

Download the pre-tokenized Wikipedia dataset as shown:

export DATA_DIR=~/examples_datasets/gpt2

mkdir -p ${DATA_DIR} && cd ${DATA_DIR}

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request

Both NeoX 20B and Pythia 6.9B use RoPE with partial rotation, for example, rotating 25% of the head dimensions and keeping the rest unrotated. To implement the partial rotation efficiently on the AWS Trainium accelerator, instead of concatenating the rotating and non-rotating dimensions, we append zero frequencies for the non-rotating dimensions and then rotate the complete set of head dimensions. This simple trick improved throughput (sequences processed per second) on AWS Trainium.
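The zero-frequency trick is safe because a rotation by angle 0 is the identity, so padding the frequency list with zeros leaves the non-rotating dimensions untouched. The pure-Python sketch below checks this numerically. It is an illustration, not the Neuron NeMo implementation: it assumes the interleaved pairing convention (dimensions 2i and 2i+1 form a pair) and a base of 10000; implementations that pair dimensions differently need the zero padding laid out to match their pairing.

```python
import math
from typing import List


def rope_freqs(rot_dim: int, base: float = 10000.0) -> List[float]:
    """One inverse frequency per rotated dimension pair: base^(-2i / rot_dim)."""
    return [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]


def rotate_pairs(x: List[float], freqs: List[float], pos: int) -> List[float]:
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * freqs[i]."""
    out: List[float] = []
    for i, f in enumerate(freqs):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(pos * f), math.sin(pos * f)
        out.extend([a * c - b * s, a * s + b * c])
    return out


def partial_rope_concat(x: List[float], rot_dim: int, pos: int) -> List[float]:
    """Reference version: rotate the first rot_dim dims, then concatenate the rest."""
    return rotate_pairs(x[:rot_dim], rope_freqs(rot_dim), pos) + x[rot_dim:]


def partial_rope_zero_freq(x: List[float], rot_dim: int, pos: int) -> List[float]:
    """Trick: append zero frequencies and rotate the full head in one pass."""
    freqs = rope_freqs(rot_dim) + [0.0] * ((len(x) - rot_dim) // 2)
    return rotate_pairs(x, freqs, pos)


# 25% partial rotation on a 16-dimensional head: 4 rotated dims, 12 untouched.
head = [float(k) for k in range(16)]
assert partial_rope_concat(head, 4, pos=7) == partial_rope_zero_freq(head, 4, pos=7)
```

Both versions produce identical values; the benefit of the second one is on the accelerator, where a single full-width rotation avoids slicing and concatenating tensors.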

Training steps

To run the training, we use a Slurm-managed multi-node Amazon Elastic Compute Cloud (Amazon EC2) Trn1 cluster, with each node containing a trn1.32xl instance. Each trn1.32xl has 16 accelerators with two workers per accelerator. After downloading the latest Neuron NeMo package, use the provided NeoX and Pythia pre-training and fine-tuning scripts with optimized hyperparameters, and run the following steps for a four-node training job.

  1. Compile: Pre-compile the model with three train iterations to generate and save the graphs:
    sbatch --nodes 4 compile.slurm ./neoX_20B_slurm.sh

  2. Run: Execute the training by loading the cached graphs from the first step:
    sbatch --nodes 4 run.slurm ./neoX_20B_slurm.sh

  3. Monitor results:
    tensorboard --logdir=nemo_experiments/megatron_neox

Follow the same steps to run the Pythia 6.9B model, replacing neox_20B_slurm.sh with pythia_6.9B_slurm.sh.
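The compile-then-run sequence is the same for both models, so it can be scripted. This minimal sketch assembles the `sbatch` commands shown above and, by default, only prints them. The helper names are hypothetical, and it assumes you run it from the directory that contains `compile.slurm`, `run.slurm`, and the model scripts. Because `sbatch` returns as soon as a job is queued, the run job is chained to the compile job with Slurm's `--dependency=afterok:<jobid>` option, using `--parsable` to read the compile job ID.

```python
import subprocess
from typing import List

MODEL_SCRIPTS = {
    "neox-20b": "./neoX_20B_slurm.sh",
    "pythia-6.9b": "./pythia_6.9B_slurm.sh",
}


def sbatch_cmd(slurm_file: str, model: str, nodes: int, depends_on: str = "") -> List[str]:
    """Build an sbatch command for the compile or run stage of a model."""
    cmd = ["sbatch", "--parsable", "--nodes", str(nodes)]
    if depends_on:
        # Start only after the given job (the compile stage) finishes successfully.
        cmd.append(f"--dependency=afterok:{depends_on}")
    return cmd + [slurm_file, MODEL_SCRIPTS[model]]


def submit(model: str, nodes: int = 4, dry_run: bool = True) -> None:
    """Submit the compile job, then a run job that waits for it to succeed."""
    compile_cmd = sbatch_cmd("compile.slurm", model, nodes)
    if dry_run:
        print(" ".join(compile_cmd))
        print(" ".join(sbatch_cmd("run.slurm", model, nodes, depends_on="<compile_job_id>")))
        return
    out = subprocess.run(compile_cmd, check=True, capture_output=True, text=True).stdout
    job_id = out.strip().split(";")[0]  # --parsable prints "jobid" or "jobid;cluster"
    subprocess.run(sbatch_cmd("run.slurm", model, nodes, depends_on=job_id), check=True)
```

For example, `submit("pythia-6.9b")` prints the two commands for Pythia 6.9B; pass `dry_run=False` on the cluster's login node to submit them.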

Pre-training and fine-tuning experiments

We demonstrate the pre-training of GPT-NeoX and Pythia models on AWS Trainium using the Neuron NeMo library for 10k iterations, and also show fine-tuning of these models for 1k steps. For pre-training, we use the GPT2 BPE tokenizer inside NeMo and follow the same config as the original model. Fine-tuning on AWS Trainium requires changing a few parameters (such as the vocab size division factor), which are provided in the fine-tuning scripts to accommodate Megatron versus NeMo differences and GPU versus AWS Trainium changes. Table 1 shows the multi-node distributed training throughput with a varying number of nodes.

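| Model | Tensor parallel | Pipeline parallel | Number of instances | Cost ($/hour) | Sequence length | Global batch size | Throughput (seq/sec) | Cost-throughput ratio (tokens/$) |
|---|---|---|---|---|---|---|---|---|
| Pythia 6.9B | 8 | 1 | 1 | 7.59 | 2048 | 256 | 10.4 | 10,102,387 |
| Pythia 6.9B | 8 | 1 | 4 | 30.36 | 2048 | 256 | 35.8 | 8,693,881 |
| NeoX 20B | 8 | 4 | 4 | 30.36 | 2048 | 16384 | 13.60 | 3,302,704 |
| NeoX 20B | 8 | 4 | 8 | 60.72 | 2048 | 16384 | 26.80 | 3,254,134 |
| NeoX 20B | 8 | 4 | 16 | 121.44 | 2048 | 16384 | 54.30 | 3,296,632 |
| NeoX 20B | 8 | 4 | 32 | 242.88 | 2048 | 16384 | 107.50 | 3,263,241 |
| NeoX 20B | 8 | 4 | 64 | 485.76 | 2048 | 16384 | 212.00 | 3,217,708 |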

Table 1. Comparing mean throughput of GPT NeoX and Pythia models for training up to 500 steps with a varying number of nodes. The pricing of trn1.32xl is based on the 3-year reserved effective hourly rate.
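The cost-throughput column can be reproduced from the other columns: tokens per dollar equals sequences per second times sequence length times 3,600 seconds, divided by the hourly cost. The sketch below recomputes rows of Table 1 and also derives the data-parallel degree implied by the parallelism settings. It assumes 32 workers per trn1.32xl (16 accelerators with two workers each) and that every worker belongs to exactly one tensor-parallel by pipeline-parallel group.

```python
WORKERS_PER_NODE = 16 * 2  # 16 accelerators per trn1.32xl, two workers each


def tokens_per_dollar(seq_per_sec: float, seq_len: int, cost_per_hour: float) -> float:
    """Tokens processed per dollar: hourly token throughput divided by hourly cost."""
    return seq_per_sec * seq_len * 3600 / cost_per_hour


def data_parallel_degree(nodes: int, tp: int, pp: int) -> int:
    """Number of model replicas once all workers are split into TP x PP groups."""
    workers = nodes * WORKERS_PER_NODE
    if workers % (tp * pp):
        raise ValueError("worker count must be divisible by TP * PP")
    return workers // (tp * pp)


# Pythia 6.9B on 1 node and NeoX 20B on 4 nodes (values from Table 1).
print(round(tokens_per_dollar(10.4, 2048, 7.59)))    # 10102387
print(round(tokens_per_dollar(13.60, 2048, 30.36)))  # 3302704
print(data_parallel_degree(4, tp=8, pp=4))           # 4
```

The same formula explains why NeoX 20B stays near 3.2M tokens/$ as nodes are added: cost and throughput both scale almost linearly with the instance count.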

Next, we evaluate the loss trajectory of model training on AWS Trainium and compare it with the corresponding run on a P4d (NVIDIA A100 GPU) cluster. Along with the training loss, we also compare a useful indicator, the gradient norm, which is the 2-norm of the model gradients computed at each training iteration to monitor training progress. Figures 1 and 2 show the training results, and Figure 3 shows the fine-tuning of NeoX 20B.
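The gradient norm plotted in these figures is the global 2-norm: the square root of the sum of squared gradient entries across all parameters. A minimal pure-Python sketch of that definition follows. In a real training loop the gradients are framework tensors, and when they are sharded across workers, implementations typically sum the squared per-shard norms before taking the square root, as `combine_shard_norms` illustrates.

```python
import math
from typing import Iterable, Sequence


def global_grad_norm(grads: Iterable[Sequence[float]]) -> float:
    """2-norm over the gradients of all parameters, treated as one flat vector."""
    squared = 0.0
    for g in grads:  # one flat gradient sequence per parameter tensor
        squared += sum(v * v for v in g)
    return math.sqrt(squared)


def combine_shard_norms(shard_norms: Iterable[float]) -> float:
    """Global norm from per-worker partial norms: sum of squares, then square root."""
    return math.sqrt(sum(n * n for n in shard_norms))


print(global_grad_norm([[3.0], [4.0]]))  # 5.0
```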


Figure-1. Training loss averaged across all workers (left) and gradient norm (right) at each training step. NeoX 20B is trained on 4 nodes with a small wiki dataset on GPU and Trainium with the same training hyperparameters (global batch size=256). GPU uses BF16 with default mixed precision, while AWS Trainium uses full BF16 with stochastic rounding. The loss and gradient norm trajectories match for GPU and AWS Trainium.


Figure-2. Training loss averaged across all workers (left) and gradient norm (right) at each training step. Similar to GPT NeoX in Figure-1, Pythia 6.9B is trained on 4 nodes with a small wiki dataset on GPU and Trainium with the same training hyperparameters (global batch size=256). The loss and gradient norm trajectories match for GPU and Trainium.


Figure-3. Fine-tuning GPT NeoX 20B model on GPU and AWS Trainium with training loss averaged across all workers (left) and gradient norm (right). A small wiki dataset is used for fine-tuning demonstration. The loss and gradient norm trajectories match for GPU and AWS Trainium.

In this post, we showed cost-efficient training of LLMs on AWS deep learning hardware. We trained GPT NeoX 20B and Pythia 6.9B models on AWS Trn1 with the Neuron NeMo library. The cost-normalized throughput for the 20-billion-parameter model on AWS Trainium is approximately 3.2M tokens/$ spent. Along with cost-efficient training on AWS Trainium, we obtain model accuracy similar to GPU training, as evident from the training loss and gradient norm trajectories. We also fine-tuned the available checkpoints for the NeoX 20B model on AWS Trainium. For additional information on distributed training with NeMo Megatron on AWS Trainium, see AWS Neuron Reference for NeMo Megatron. To get started with fine-tuning Llama models, see Llama2 fine-tuning. To get started with managed AWS Trainium on Amazon SageMaker, see Train your ML Models with AWS Trainium and Amazon SageMaker.


About the Authors

Gaurav Gupta is currently an Applied Scientist at Amazon Web Services (AWS) AI labs. Dr. Gupta completed his PhD at USC Viterbi. His research interests span the domain of sequential data modeling, learning partial differential equations, information theory for machine learning, fractional dynamical models, and complex networks. He is currently working on applied and mathematical problems on LLM training behavior, vision models with PDEs, and information-theoretic multi-modality models. Dr. Gupta has publications in top journals and conferences such as NeurIPS, ICLR, ICML, Nature, IEEE Control Society, and ACM cyber-physical society.

Ben Snyder is an applied scientist with AWS Deep Learning. His research interests include foundational models, reinforcement learning, and asynchronous optimization. Outside of work, he enjoys cycling and backcountry camping.

Amith (R) Mamidala is a senior machine learning application engineer at AWS Annapurna Labs. Dr. Mamidala completed his PhD at the Ohio State University in high performance computing and communication. During his tenure at IBM Research, Dr. Mamidala contributed to the BlueGene class of computers, which often led the Top500 ranking of the most powerful and power-efficient supercomputers. The project was awarded the 2009 National Medal of Technology and Innovation. After a brief stint as an AI engineer at a financial hedge fund, Dr. Mamidala joined Annapurna Labs, focusing on large language model training.

Jun (Luke) Huan is a principal scientist at AWS AI Labs. Dr. Huan works on AI and Data Science. He has published more than 180 peer-reviewed papers in leading conferences and journals. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu Research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI start-up, and worked as the CEO and Chief Scientist in 2019-2021. Before joining industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.


Vodafone advances its machine learning skills with AWS DeepRacer and Accenture


Vodafone is transitioning from a telecommunications company (telco) to a technology company (TechCo) by 2025, with objectives of innovating faster, reducing costs, improving security, and simplifying operations. Thousands of engineers are being onboarded to contribute to this transition. By 2025, Vodafone plans to have 50% of its global workforce actively involved in software development, with an objective to deliver 60% of digital services in-house. This new workforce requires rapid reskilling and understanding of disruptive services such as artificial intelligence (AI) and machine learning (ML) to drive meaningful outcomes.

To help achieve this ambitious transition, Vodafone has partnered with Accenture and AWS to build a cloud platform that helps its engineers work in flexible, creative, and agile ways by providing them a curated set of managed, security and DevOps-oriented AWS services and application workloads. To learn more, check out Redefining Vodafone’s customer experience with AWS and the following talk at AWS re:Invent 2022.

Vodafone Digital engineering (VDE) invited Accenture and AWS to co-host an exclusive event at their annual DigiFest, a week-long event celebrating the scale of their global VDE teams, championing reusable apps and collaborative idea generation. As one of the main events of the DigiFest, AWS and Accenture conceptualized a company-wide AWS DeepRacer challenge where engineers can build and train their models to become better versed in using ML with AWS.

In this post, we share how Vodafone is advancing its ML skills using AWS DeepRacer and Accenture.

Why is machine learning important to Vodafone?

Machine learning is one of the fastest growing domains in technology and telecommunications, owing to the benefits of improved productivity and forecasting across key domains in telecommunications such as channels, CRM, billing, order management, service assurance, network management, and more.

Vodafone has already adopted ML in the proactive detection and correction of network anomalies to improve customer satisfaction. Their AI and ML capabilities in digital self-care, via a chatbot, have been helping their customer care team focus on cases that need deeper attention. Because they use AWS for providing digital services packaged as telco as a service, incorporating AI and ML components is crucial to maintain a competitive edge in delivering state-of-the-art services to customers.

Why AWS DeepRacer?

AWS DeepRacer is an interesting and fun way to get started with reinforcement learning (RL). RL is an advanced ML technique that takes a very different approach to training models than other ML methods. Its superpower is that it learns very complex behaviors without requiring any labeled training data, and can make short-term decisions while optimizing for a longer-term goal. The AWS DeepRacer Challenge provided an opportunity for Vodafone’s engineers to engage in a friendly competition, develop an ML mindset, and share insights on how to succeed in a private virtual racing event.

Racing with AWS DeepRacer

The event played out in three stages, starting with a workshop on AWS DeepRacer to cover the basics of reinforcement learning, which was attended by over 225 Vodafone engineers. They learned how to fine-tune an AWS DeepRacer model by creating a reward function, exploring the action space, systematically tuning hyperparameters, examining the training job progress, evaluating the model, and testing the model on a virtual AWS DeepRacer vehicle and virtual track.

In the next stage, a league race was organized where 130 racers could view the race videos of every participant’s best model submission on a live leaderboard. This helped them understand how a high-performance model behaves after it’s trained. They quickly learned that training a model for too long leads to overfitting, which causes it to underperform in a new environment. They also experimented with different styles of reward functions, such as following the center line, penalizing excessive steering, penalizing slowness, and rewarding progress.
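For readers new to AWS DeepRacer, a reward function is a Python function that receives a `params` dictionary describing the car’s state and returns a float. The sketch below combines the four styles mentioned above. The thresholds and weights are illustrative assumptions, not the values Vodafone’s racers used; the `params` keys (`track_width`, `distance_from_center`, `all_wheels_on_track`, `steering_angle`, `speed`, `progress`, `steps`) are standard DeepRacer inputs.

```python
def reward_function(params):
    """AWS DeepRacer reward function combining several common reward styles."""
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]

    # Near-zero reward when the car leaves the track.
    if not params["all_wheels_on_track"]:
        return 1e-3

    # Follow the center line: reward falls off in bands as the car drifts outward.
    if distance_from_center <= 0.1 * track_width:
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    else:
        reward = 0.1

    # Excessive steering penalty (15 degrees is an illustrative threshold).
    if abs(params["steering_angle"]) > 15.0:
        reward *= 0.8

    # Slowness penalty (1.0 m/s is an illustrative threshold).
    if params["speed"] < 1.0:
        reward *= 0.5

    # Progress reward: bonus for lap progress (percent) made per step so far.
    if params["steps"] > 0:
        reward += 0.1 * params["progress"] / params["steps"]

    return float(reward)
```

Multiplicative penalties keep the center-line signal dominant while still discouraging zig-zagging and crawling; tuning these weights is exactly the kind of experiment the racers ran between submissions.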

The event culminated with a grand finale, a showdown of 11 racers who tuned their models one final time to compete in a live race with commentary. All 11 racers completed a full lap with their models. Eight racers had a lap time of less than 15 seconds, with the winner coming in with an incredible lap time of 11.194 seconds on the tricky Toronto Turnpike virtual race track.

Summary

The goal of the AWS DeepRacer Challenge was to build awareness of and excitement about ML on AWS for a global cloud engineering audience with varying technology skills and competencies. The tournament drew more than 585 registrations across the globe, with over 400 models submitted and over 600 hours of training and evaluation.

Vodafone was able to help a broad range of builders get hands-on with ML through the AWS DeepRacer challenge. With more than 47% of participants being AWS and ML beginners, the challenge reaffirms how effective AWS DeepRacer can be at introducing beginners to ML on AWS in a safe and engaging environment.

“Having the Digital Engineering team attend events like DigiFest and participate in challenges like AWS DeepRacer is a huge part of our vision of building a world-class software engineering team in Vodafone. As we take on the complex challenge of transforming a telecommunications company into a technology company, growing our skillset becomes a top priority and our partnership with Accenture and AWS has provided the team with not just this, but multiple opportunities to learn and develop. I am excited for more of this to come!”

Ben Connolly, Vodafone Global Director of Cloud Engineering


About the Author

Ramakrishna Natarajan is a Senior Partner Solutions Architect at Amazon Web Services. He is based out of London and helps AWS Partners find optimal solutions on AWS for their customers. He specialises in Telecommunications OSS/BSS and has a keen interest in evolving domains such as AI/ML, Data Analytics, Security and Modernisation. He enjoys playing squash, going on long hikes and learning new languages.
