How Games24x7 transformed their retraining MLOps pipelines with Amazon SageMaker

This is a guest blog post co-written with Hussain Jagirdar from Games24x7.

Games24x7 is one of India’s most valuable multi-game platforms and entertains over 100 million gamers across various skill games. With “Science of Gaming” as their core philosophy, they have enabled a vision of end-to-end informatics around game dynamics, game platforms, and players by consolidating orthogonal research directions of game AI, game data science, and game user research. The AI and data science team dives into a plethora of multi-dimensional data and runs a variety of use cases such as player journey optimization, game action detection, hyper-personalization, customer 360, and more on AWS.

Games24x7 employs an automated, data-driven, AI-powered framework to assess each player’s behavior through interactions on the platform and flag users with anomalous behavior. They have built a deep learning model, ScarceGAN, which focuses on identifying extremely rare or scarce samples from multi-dimensional longitudinal telemetry data with small and weak labels. This work was published in CIKM’21 and is open source for rare class identification on any longitudinal telemetry data. Productionizing and adopting the model was paramount to creating the backbone for enabling responsible game play on their platform, where flagged users can be taken through a different journey of moderation and control.

In this post, we share how Games24x7 improved their training pipelines for their responsible gaming platform using Amazon SageMaker.

Customer challenges

The DS/AI team at Games24x7 used multiple services provided by AWS, including SageMaker notebooks, AWS Step Functions, AWS Lambda, and Amazon EMR, for building pipelines for various use cases. To handle drift in the data distribution, and therefore retrain their ScarceGAN model, they discovered that the existing system needed a better MLOps solution.

In the previous pipeline built on Step Functions, a single monolithic codebase ran data preprocessing, retraining, and evaluation. This became a bottleneck for troubleshooting, adding or removing a step, or even making small changes in the overall infrastructure. The Step Functions workflow instantiated a cluster of instances to extract and process data from Amazon Simple Storage Service (Amazon S3), and the subsequent preprocessing, training, and evaluation steps ran on a single large Amazon EC2 instance. In scenarios where the pipeline failed at any step, the whole workflow needed to be restarted from the beginning, which resulted in repeated runs and increased cost. All the training and evaluation metrics were inspected manually from Amazon S3. There was no mechanism to pass and store the metadata of the multiple experiments run on the model. Because model monitoring was decentralized, thorough investigation and cherry-picking the best model required hours from the data science team. The accumulation of all these efforts resulted in lower team productivity and increased overhead. Additionally, with a fast-growing team, it was very challenging to share this knowledge across the team.

Because MLOps concepts are very extensive and implementing all the steps would need time, we decided that in the first stage we would address the following core issues:

  • A secure, controlled, and templatized environment to retrain our in-house deep learning model using industry best practices
  • A parameterized training environment to send a different set of parameters for each retraining job and audit the most recent runs
  • The ability to visually track training metrics and evaluation metrics, and have metadata to track and compare experiments
  • The ability to scale each step individually and reuse the previous steps in cases of step failures
  • A single dedicated environment to register models, store features, and invoke inferencing pipelines
  • A modern toolset that could minimize compute requirements, drive down costs, and drive sustainable ML development and operations by incorporating the flexibility of using different instances for different steps
  • Creating a benchmark template of state-of-the-art MLOps pipeline that could be used across various data science teams

Games24x7 started evaluating other solutions, including Amazon SageMaker Pipelines with SageMaker Studio. The existing Step Functions solution had limitations. SageMaker Pipelines offered the flexibility of adding or removing a step at any point in time, and the overall architecture and the data dependencies between steps can be visualized as a directed acyclic graph (DAG). The evaluation and fine-tuning of the retraining steps became quite efficient after we adopted different Amazon SageMaker functionalities such as SageMaker Studio, Pipelines, Processing, Training, the model registry, and experiments and trials. The AWS Solutions Architecture team did deep dives with us and was instrumental in the design and implementation of this solution.

Solution overview

The following diagram illustrates the solution architecture.

[Image: solution architecture]

The solution uses a SageMaker Studio environment to run the retraining experiments. The code to invoke the pipeline script is available in the Studio notebooks, and we can change the hyperparameters and input/output when invoking the pipeline. This is quite different from our earlier method, where all the parameters were hardcoded within the scripts and all the processes were inextricably linked. This required modularizing the monolithic code into different steps.

The following diagram illustrates our original monolithic process.

[Image: legacy monolithic workflow]

Modularization

To scale, track, and run each step individually, the monolithic code needed to be modularized. Parameters, data, and code dependencies between steps were removed, and shared modules were created for the components common across steps. An illustration of the modularization is shown below.

[Image: monolithic to modular transformation with SageMaker]

For every module, testing was done locally using the SageMaker SDK’s Script mode for training, processing, and evaluation, which required only minor changes in the code to run with SageMaker. Local mode testing for deep learning scripts can be done either on SageMaker notebooks, if already in use, or by using local mode with SageMaker Pipelines when starting directly with Pipelines. This helps validate that our custom scripts will run on SageMaker instances.
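
The following is a minimal sketch of running the training script in local mode before moving to managed instances; the entry point and inputs are assumptions based on the examples that follow.

import sagemaker
from sagemaker.tensorflow import TensorFlow

# Run the training script in a local container to validate it before moving to
# SageMaker-managed ML instances
local_estimator = TensorFlow(
    entry_point="train.py",                # hypothetical entry point
    source_dir="scripts_train/training/",
    instance_type="local",                 # local mode instead of an ML instance type
    instance_count=1,
    framework_version="2.11",
    py_version="py39",
    role=sagemaker.get_execution_role(),
)
local_estimator.fit(inputs)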

Each module was then tested in isolation using the SageMaker Training and Processing SDKs in Script mode, and run manually in sequence on SageMaker instances for each step, as in the following training step:

estimator = TensorFlow(
    entry_point="inference.py",
    source_dir="scripts_train/training/",
    instance_type="ml.c5.2xlarge",  # Running on SageMaker ML instances
    instance_count=1,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),  # Passes to the container the AWS role that you are using on this notebook
    framework_version="2.11",
    py_version="py39",
)

estimator.fit(inputs)
2022-09-28 11:10:34 Starting - Starting the training job...

Amazon S3 was used to get the source data to process and then store the intermediate data, data frames, and NumPy results back to Amazon S3 for the next step. After integration testing between the individual preprocessing, training, and evaluation modules was complete, the SageMaker Pipelines SDK, which is integrated with the SageMaker Python SDK we had already used in the preceding steps, allowed us to chain all these modules programmatically by passing the input parameters, data, metadata, and output of each step as input to the next step.

We could reuse the SageMaker Python SDK code that previously ran the modules individually in SageMaker Pipelines SDK-based runs. The relationships between the steps of the pipeline are determined by the data dependencies between steps.

The final steps of the pipeline are as follows:

  • Data preprocessing
  • Retraining
  • Evaluation
  • Model registration

[Image: pipeline DAG with preprocessing, retraining, evaluation, and model registration steps]

In the following sections, we discuss each of the steps in more detail when run with the SageMaker Pipelines SDK.

Data preprocessing

This step transforms the raw input data, preprocesses it, and splits it into train, validation, and test sets. For this processing step, we instantiated a SageMaker Processing job with the TensorFlow Framework Processor, which takes our script, copies the data from Amazon S3, and then pulls a Docker image provided and maintained by SageMaker. This Docker container allowed us to pass our library dependencies in the requirements.txt file while having all the TensorFlow libraries already included, and to pass the source_dir path for the script. The train and validation data goes to the training step, and the test data goes to the evaluation step. The best part of using this container was that it allowed us to pass a variety of inputs and outputs as different S3 locations, which could then be passed as step dependencies to subsequent steps in the SageMaker pipeline.

# Initialize the TensorFlowProcessor
from sagemaker import get_execution_role
from sagemaker.tensorflow.processing import TensorFlowProcessor

tp = TensorFlowProcessor(
    framework_version='2.11',
    role=get_execution_role(),
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name='frameworkprocessor-TF',
    py_version='py39',
    sagemaker_session=pipeline_session,
)
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
processor_args = tp.run(
    code='new_data_collection_kfold.py',
    source_dir='scripts_processing',
    inputs=[
        ProcessingInput(input_name='data_unlabeled',source=data_unlabeled, destination="/opt/ml/processing/data_unlabeled"),
        ProcessingInput(input_name='data_risky',source=data_risky, destination= "/opt/ml/processing/data_risky"),
        ProcessingInput(input_name='data_dormant',source=data_dormant, destination= "/opt/ml/processing/data_dormant"),
        ProcessingInput(input_name='data_normal',source=data_normal, destination= "/opt/ml/processing/data_normal"),
        ProcessingInput(input_name='data_heavy',source=data_heavy, destination= "/opt/ml/processing/data_heavy")
    ],
    outputs=[
        ProcessingOutput(output_name="train_output_data", source="/opt/ml/processing/train/data", destination=f's3://{BUCKET}/{op_train_path}/data'),
        ProcessingOutput(output_name="train_output_label", source="/opt/ml/processing/train/label", destination=f's3://{BUCKET}/{op_train_path}/label'),
        ProcessingOutput(output_name="train_kfold_output_data", source="/opt/ml/processing/train/kfold/data", destination=f's3://{BUCKET}/{op_train_path}/kfold/data'),
        ProcessingOutput(output_name="train_kfold_output_label", source="/opt/ml/processing/train/kfold/label", destination=f's3://{BUCKET}/{op_train_path}/kfold/label'),
        ProcessingOutput(output_name="val_output_data", source="/opt/ml/processing/val/data", destination=f's3://{BUCKET}/{op_val_path}/data'),
        ProcessingOutput(output_name="val_output_label", source="/opt/ml/processing/val/label", destination=f's3://{BUCKET}/{op_val_path}/label'),
        ProcessingOutput(output_name="val_output_kfold_data", source="/opt/ml/processing/val/kfold/data", destination=f's3://{BUCKET}/{op_val_path}/kfold/data'),
        ProcessingOutput(output_name="val_output_kfold_label", source="/opt/ml/processing/val/kfold/label", destination=f's3://{BUCKET}/{op_val_path}/kfold/label'),
        ProcessingOutput(output_name="train_unlabeled_kfold_data", source="/opt/ml/processing/train/unlabeled/kfold/", destination=f's3://{BUCKET}/{op_train_path}/unlabeled/kfold/'),
        ProcessingOutput(output_name="test_output", source="/opt/ml/processing/test", destination=f's3://{BUCKET}/{op_test_path}')
    ],
    arguments=["--scaler_path", op_scaler_path,
              "--bucket", BUCKET],
)
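
With the processor run arguments defined against the pipeline session, they can be wrapped in a ProcessingStep. The following is a minimal sketch; the step name matches the one used later in the pipeline definition.

from sagemaker.workflow.steps import ProcessingStep

# Wrap the processor run arguments in a pipeline step so downstream steps can
# reference its outputs through step_process.properties
step_process = ProcessingStep(
    name="Preprocess-Kfold",
    step_args=processor_args,
)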

Retraining

We wrapped the training module through the SageMaker Pipelines TrainingStep API and used the already available deep learning container images through the TensorFlow Framework estimator (also known as Script mode) for SageMaker training. Script mode allowed us to have minimal changes in our training code, and the SageMaker pre-built Docker container handles the Python and framework versions, and so on. The ProcessingOutput values from the data preprocessing step were forwarded as the TrainingInput of this step.

from sagemaker.inputs import TrainingInput

inputs = {
    "train_output_data": TrainingInput(
        s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train_output_data"].S3Output.S3Uri,
        content_type="text/csv",
    ),
    "train_output_label": TrainingInput(
        s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train_output_label"].S3Output.S3Uri,
        content_type="text/csv",
    ),
}

All the hyperparameters were passed to the estimator through a JSON file. For every epoch in our training, we were already sending our training metrics to stdout in the script. Because we wanted to track the metrics of an ongoing training job and compare them with previous training jobs, we just had to parse this stdout by defining metric definitions with regular expressions to fetch the metrics for every epoch.

tensorflow_version = "2.11"
training_py_version = "py39"
training_instance_count = 1
training_instance_type = "ml.c5.2xlarge"

tf2_estimator = TensorFlow(
    source_dir='scripts_train/training/',
    entry_point='train.py',
    instance_type=training_instance_type,
    instance_count=training_instance_count,
    framework_version=tensorflow_version,
    hyperparameters=hyperparameters,
    image_uri="763104351884.dkr.ecr.ap-south-1.amazonaws.com/tensorflow-training:2.11.0-cpu-py39-ubuntu20.04-sagemaker",
    role=role,
    base_job_name="Training-Marco-model",
    py_version=training_py_version,
    metric_definitions=[
        {'Name': 'iteration', 'Regex': 'Iteration=(.*?);'},
        {'Name': 'Discriminator_Supervised_Loss', 'Regex': 'Discriminator_Supervised_Loss=(.*?);'},
        {'Name': 'Discriminator_UnSupervised_Loss', 'Regex': 'Discriminator_UnSupervised_Loss=(.*?);'},
        {'Name': 'Generator_Loss', 'Regex': 'Generator_Loss=(.*?);'},
        {'Name': 'Accuracy_Supervised', 'Regex': 'Accuracy_Supervised=(.*?);'},
    ],
)
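
The estimator and the training inputs defined earlier can then be wrapped in a TrainingStep. The following is a minimal sketch; the step name matches the one used later in the pipeline definition.

from sagemaker.workflow.steps import TrainingStep

# Chain the estimator into the pipeline; the inputs dict references the
# preprocessing step's outputs, which creates the data dependency between steps
step_train = TrainingStep(
    name="Training-Marco",
    estimator=tf2_estimator,
    inputs=inputs,
)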

It was interesting to learn that SageMaker Pipelines automatically integrates with the SageMaker Experiments API, which by default creates an experiment, trial, and trial component for every run. This allows us to compare training metrics like accuracy and precision across multiple runs, as shown below.

[Image: SageMaker Experiments metrics comparison across runs]

For each training job run, we generate four different models and save them to Amazon S3 based on our custom business definitions.

Evaluation

This step loads the trained models from Amazon S3 and evaluates them against our custom metrics. This ProcessingStep takes the models and the test data as its input and writes the model performance reports to Amazon S3.

We’re using custom metrics, so in order to register these custom metrics to the model registry, we needed to convert the evaluation metrics stored as CSV in Amazon S3 into the SageMaker model quality JSON format. Then we could register the location of these JSON evaluation metrics in the model registry.

The following screenshots show an example of how we converted a CSV to the SageMaker model quality JSON format; a sketch of the conversion follows.

[Images: CSV metrics and the converted model quality JSON schema]
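
The following is a minimal sketch of such a conversion; the file name, metrics section name, and S3 key are assumptions for illustration.

import json

import boto3
import pandas as pd

# Read the custom evaluation metrics produced by the evaluation step (one row
# of metric columns) and map them into the model quality statistics schema
metrics_df = pd.read_csv("evaluation_metrics.csv")  # hypothetical file name

report = {
    "multiclass_classification_metrics": {  # assumed metrics section
        col: {"value": float(metrics_df[col].iloc[0]), "standard_deviation": "NaN"}
        for col in metrics_df.columns
    }
}

# Upload the JSON report so its S3 location can be registered with the model
boto3.client("s3").put_object(
    Bucket=BUCKET,
    Key="evaluation/evaluation_metrics.json",  # hypothetical key
    Body=json.dumps(report),
)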

Model registration

As mentioned earlier, we were creating multiple models in a single training step, so we had to use the SageMaker Pipelines Lambda integration to register all four models in the model registry. For a single model registration, we could have used the ModelStep API to create a SageMaker model in the registry. For each model, the Lambda function retrieves the model artifact and evaluation metrics from Amazon S3 and creates a model package in a specific model package group, so that all four models can be registered in a single model registry. The SageMaker Python APIs also allowed us to send custom metadata that we wanted to pass to select the best models. This proved to be a major milestone for productivity, because all the models can now be compared and audited from a single window. We provided metadata to uniquely distinguish the models from each other. This also helped in approving a single model with the help of peer reviews and management reviews based on model metrics.

import boto3

sm_client = boto3.client("sagemaker")

def register_model_version(model_url, model_package_group_name, model_metrics_path, key, run_id):
    modelpackage_inference_specification = {
        "InferenceSpecification": {
            "Containers": [
                {
                    "Image": '763104351884.dkr.ecr.ap-south-1.amazonaws.com/tensorflow-inference:2.11.0-cpu-py39-ubuntu20.04-sagemaker',
                    "ModelDataUrl": model_url
                }
            ],
            "SupportedContentTypes": ["text/csv"],
            "SupportedResponseMIMETypes": ["text/csv"],
        }
    }

    model_metrics = {
        'ModelQuality': {
            'Statistics': {
                'ContentType': 'application/json',
                'S3Uri': model_metrics_path
            },
        }
    }

    create_model_package_input_dict = {
        "ModelPackageGroupName": model_package_group_name,
        "ModelPackageDescription": key + " run_id:" + run_id,  # additional metadata example
        "ModelApprovalStatus": "PendingManualApproval",
        "ModelMetrics": model_metrics
    }
    create_model_package_input_dict.update(modelpackage_inference_specification)

    create_model_package_response = sm_client.create_model_package(**create_model_package_input_dict)
    model_package_arn = create_model_package_response["ModelPackageArn"]
    return model_package_arn

The preceding code block shows an example of how we added metadata through the model package input to the model registry, along with the model metrics.
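
To invoke this registration function from the pipeline, the Lambda function can be wrapped in a LambdaStep. The following is a minimal sketch; the function ARN, input keys, and evaluation output name are assumptions for illustration.

from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaStep

# Invoke the registration Lambda function as a pipeline step, passing the
# evaluation metrics location so all four models can be registered
step_register = LambdaStep(
    name="ScarceGAN-Model-register",
    lambda_func=Lambda(
        function_arn="arn:aws:lambda:ap-south-1:111122223333:function:register-models"  # hypothetical ARN
    ),
    inputs={
        "model_metrics_path": step_evaluate.properties.ProcessingOutputConfig.Outputs["evaluation"].S3Output.S3Uri,
        "model_package_group_name": "ScarceGAN",
    },
)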

The screenshot below shows how easily we can compare metrics of different model versions once they are registered.

[Image: model registry comparison of registered model versions]

Pipeline invocation

The pipeline can be invoked through Amazon EventBridge, SageMaker Studio, or the SDK itself. The invocation runs the jobs based on the data dependencies between steps; a sketch of an EventBridge-based schedule follows the SDK example below.

import json

from sagemaker.workflow.pipeline import Pipeline

# The step objects correspond to the named steps Preprocess-Kfold,
# Training-Marco, Evaluate-Marco, and ScarceGAN-Model-register
pipeline = Pipeline(
    name=pipeline_name,
    steps=[step_process, step_train, step_evaluate, step_register],
)

definition = json.loads(pipeline.definition())
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()
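
To invoke the pipeline on a schedule through EventBridge, a rule can target the pipeline ARN. The following is a minimal sketch; the rule name, schedule, and role ARN are assumptions for illustration.

import boto3

events = boto3.client("events")

# Create a scheduled rule that triggers the retraining pipeline every week
events.put_rule(
    Name="scarcegan-weekly-retraining",  # hypothetical rule name
    ScheduleExpression="rate(7 days)",
    State="ENABLED",
)

events.put_targets(
    Rule="scarcegan-weekly-retraining",
    Targets=[
        {
            "Id": "scarcegan-retraining-pipeline",
            "Arn": pipeline_arn,           # ARN of the SageMaker pipeline
            "RoleArn": events_role_arn,    # role that allows EventBridge to start the pipeline
            "SageMakerPipelineParameters": {"PipelineParameterList": []},
        }
    ],
)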

Conclusion

In this post, we demonstrated how Games24x7 transformed their MLOps assets through SageMaker Pipelines. The ability to visually track training and evaluation metrics, combined with a parameterized environment, the ability to scale each step individually with the right compute, and a central model registry, proved to be a major milestone in standardizing and advancing to an auditable, reusable, efficient, and explainable workflow. This project serves as a blueprint across different data science teams and has increased overall productivity by allowing members to operate, manage, and collaborate using best practices.

If you have a similar use case and want to get started, we recommend going through SageMaker Script mode and the SageMaker end-to-end examples using SageMaker Studio. These examples cover the technical details discussed in this post.

A modern data strategy gives you a comprehensive plan to manage, access, analyze, and act on data. AWS provides the most complete set of services for the entire end-to-end data journey for all workloads, all types of data and all desired business outcomes. In turn, this makes AWS the best place to unlock value from your data and turn it into insight.


About the Authors

Hussain Jagirdar is a Senior Scientist – Applied Research at Games24x7. He is currently involved in research efforts in the area of explainable AI and deep learning. His recent work has involved deep generative modeling, time-series modeling, and related subareas of machine learning and AI. He is also passionate about MLOps and standardizing projects that demand constraints such as scalability, reliability, and sensitivity.

Sumir Kumar is a Solutions Architect at AWS and has over 13 years of experience in technology industry. At AWS, he works closely with key AWS customers to design and implement cloud based solutions that solve complex business problems. He is very passionate about data analytics and machine learning and has a proven track record of helping organizations unlock the full potential of their data using AWS Cloud.

Detect real and live users and deter bad actors using Amazon Rekognition Face Liveness

Financial services, the gig economy, telco, healthcare, social networking, and other customers use face verification during online onboarding, step-up authentication, age-based access restriction, and bot detection. These customers verify user identity by matching the user’s face in a selfie captured by a device camera with a government-issued identity card photo or preestablished profile photo. They also estimate the user’s age using facial analysis before allowing access to age-restricted content. However, bad actors increasingly deploy spoof attacks using the user’s face images or videos posted publicly, captured secretly, or created synthetically to gain unauthorized access to the user’s account. To deter this fraud, as well as reduce the costs associated with it, customers need to add liveness detection before face matching or age estimation is performed in their face verification workflow to confirm that the user in front of the camera is a real and live person.

We are excited to introduce Amazon Rekognition Face Liveness to help you easily and accurately deter fraud during face verification. In this post, we start with an overview of the Face Liveness feature, its use cases, and the end-user experience; provide an overview of its spoof detection capabilities; and show how you can add Face Liveness to your web and mobile applications.

Face Liveness overview

Today, customers detect liveness using various solutions. Some customers use open-source or commercial facial landmark detection machine learning (ML) models in their web and mobile applications to check if users correctly perform specific gestures such as smiling, nodding, shaking their head, blinking their eyes, or opening their mouth. These solutions are costly to build and maintain, fail to deter advanced spoof attacks performed using physical 3D masks or injected videos, and require high user effort to complete. Some customers use third-party face liveness features that can only detect spoof attacks presented to the camera (such as printed or digital photos or videos on a screen), which work well for users in select geographies, and are often completely customer-managed. Lastly, some customer solutions rely on hardware-based infrared and other sensors in phone or computer cameras to detect face liveness, but these solutions are costly, hardware-specific, and work only for users with select high-end devices.

With Face Liveness, you can detect in seconds that real users, and not bad actors using spoofs, are accessing your services. Face Liveness includes these key features:

  • Analyzes a short selfie video from the user in real time to detect whether the user is real or a spoof
  • Returns a liveness confidence score—a score from 0–100 that indicates the probability that the person is real and live
  • Returns a high-quality reference image—a selfie frame with quality checks that can be used for downstream Amazon Rekognition face matching or age estimation analysis
  • Returns up to four audit images—frames from the selfie video that can be used for maintaining audit trails
  • Detects spoofs presented to the camera, such as a printed photo, digital photo, digital video, or 3D mask, as well as spoofs that bypass the camera, such as a pre-recorded or deepfake video
  • Can easily be added to applications running on most devices with a front-facing camera using open-source pre-built AWS Amplify UI components

In addition, no infrastructure management, hardware-specific implementation, or ML expertise is required. The feature automatically scales up or down in response to demand, and you only pay for the face liveness checks you perform. Face Liveness uses ML models trained on diverse datasets to provide high accuracy across user skin tones, ancestries, and devices.

Use cases

The following diagram illustrates a typical workflow using Face Liveness.

You can use Face Liveness in the following user verification workflows:

  • User onboarding – You can reduce fraudulent account creation on your service by validating new users with Face Liveness before downstream processing. For example, a financial services customer can use Face Liveness to detect a real and live user and then perform face matching to check that this is the right user prior to opening an online account. This can deter a bad actor using social media pictures of another person to open fraudulent bank accounts.
  • Step-up authentication – You can strengthen the verification of high-value user activities on your services, such as device change, password change, and money transfers, with Face Liveness before the activity is performed. For example, a ride-sharing or food-delivery customer can use Face Liveness to detect a real and live user and then perform face matching using an established profile picture to verify a driver’s or delivery associate’s identity before a ride or delivery to promote safety. This can deter unauthorized delivery associates and drivers from engaging with end-users.
  • User age verification – You can deter underage users from accessing restricted online content. For example, online tobacco retailers or online gambling customers can use Face Liveness to detect a real and live user and then perform age estimation using facial analysis to verify the user’s age before granting them access to the service content. This can deter an underage user from using their parent’s credit cards or photo and gaining access to harmful or inappropriate content.
  • Bot detection – You can prevent bots from engaging with your service by using Face Liveness in place of “real human” captcha checks. For example, social media customers can use Face Liveness to pose real-human checks to keep bots at bay. This significantly increases the cost and effort required by users driving bot activity, because key bot actions now need to pass a face liveness check.

End-user experience

When end-users need to onboard or authenticate themselves on your application, Face Liveness provides the user interface and real-time feedback for the user to quickly capture a short selfie video of moving their face into an oval rendered on their device’s screen. As the user’s face moves into the oval, a series of colored lights is displayed on the device’s screen and the selfie video is securely streamed to the cloud APIs, where advanced ML models analyze the video in real time. After the analysis is complete, you receive a liveness prediction score (a value between 0 and 100), a reference image, and audit images. Depending on whether the liveness confidence score is above or below the customer-set thresholds, you can perform downstream verification tasks for the user. If the liveness score is below the threshold, you can ask the user to retry or route them to an alternative verification method.

The sequence of screens that the end-user will be exposed to is as follows:

  1. The sequence begins with a start screen that includes an introduction and photosensitive warning. It prompts the end-user to follow instructions to prove they are a real person.
  2. After the end-user chooses Begin check, a camera screen is displayed and the check starts a countdown from 3.
  3. At the end of the countdown, a video recording begins, and an oval appears on the screen. The end-user is prompted to move their face into the oval. When Face Liveness detects that the face is in the correct position, the end-user is prompted to hold still for a sequence of colors that are displayed.
  4. The video is submitted for liveness detection and a loading screen with the message “Verifying” appears.
  5. The end-user receives a notification of success or a prompt to try again.

Here is what the user experience in action looks like in a sample implementation of Face Liveness.

Spoof detection

Face Liveness can deter presentation and bypass spoof attacks. Let’s outline the key spoof types and see Face Liveness deterring them.

Presentation spoof attacks

These are spoof attacks where a bad actor presents the face of another user to the camera using printed or digital artifacts. The bad actor can use a print-out of a user’s face, display the user’s face on their device display using a photo or video, or wear a 3D face mask that looks like the user. Face Liveness can successfully detect these types of presentation spoof attacks, as we demonstrate in the following example.

The following shows a presentation spoof attack using a digital video on the device display.

The following shows an example of a presentation spoof attack using a digital photo on the device display.

The following example shows a presentation spoof attack using a 3D mask.

The following example shows a presentation spoof attack using a printed photo.

Bypass or video injection attacks

These are spoof attacks where a bad actor bypasses the camera to send a selfie video directly to the application using a virtual camera.

Face Liveness components

Amazon Rekognition Face Liveness uses multiple components:

  • AWS Amplify web and mobile SDKs with the FaceLivenessDetector component
  • AWS SDKs
  • Cloud APIs

Let’s review the role of each component and how you can easily use these components together to add Face Liveness in your applications in just a few days.

Amplify web and mobile SDKs with the FaceLivenessDetector component

The Amplify FaceLivenessDetector component integrates the Face Liveness feature into your application. It handles the user interface and real-time feedback for users while they capture their video selfie.

When a client application renders the FaceLivenessDetector component, it establishes a connection to the Amazon Rekognition streaming service, renders an oval on the end-user’s screen, and displays a sequence of colored lights. It also records and streams video in real-time to the Amazon Rekognition streaming service, and appropriately renders the success or failure message.

AWS SDKs and cloud APIs

When you configure your application to integrate with the Face Liveness feature, it uses the following API operations:

  • CreateFaceLivenessSession – Starts a Face Liveness session, letting the Face Liveness detection model be used in your application. Returns a SessionId for the created session.
  • StartFaceLivenessSession – Is called by the FaceLivenessDetector component. Starts an event stream containing information about relevant events and attributes in the current session.
  • GetFaceLivenessSessionResults – Retrieves the results of a specific Face Liveness session, including a Face Liveness confidence score, reference image, and audit images.

You can test Amazon Rekognition Face Liveness with any supported AWS SDK like the AWS Python SDK Boto3 or the AWS SDK for Java V2.
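
The following is a minimal sketch of the backend calls using Boto3; the Region, bucket name, and audit image limit are assumptions for illustration.

import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Backend: create a Face Liveness session and hand the SessionId to the client app
session = rekognition.create_face_liveness_session(
    Settings={
        "OutputConfig": {"S3Bucket": "my-liveness-audit-bucket"},  # hypothetical bucket
        "AuditImagesLimit": 4,
    }
)
session_id = session["SessionId"]

# Backend (after the FaceLivenessDetector component finishes streaming):
# retrieve the confidence score and images for the session
results = rekognition.get_face_liveness_session_results(SessionId=session_id)
print(results["Status"], results.get("Confidence"))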

Developer experience

The following diagram illustrates the solution architecture.

The Face Liveness check process involves several steps:

  1. The end-user initiates a Face Liveness check in the client app.
  2. The client app calls the customer’s backend, which in turn calls Amazon Rekognition. The service creates a Face Liveness session and returns a unique SessionId.
  3. The client app renders the FaceLivenessDetector component using the obtained SessionId and appropriate callbacks.
  4. The FaceLivenessDetector component establishes a connection to the Amazon Rekognition streaming service, renders an oval on the user’s screen, and displays a sequence of colored lights. FaceLivenessDetector records and streams video in real time to the Amazon Rekognition streaming service.
  5. Amazon Rekognition processes the video in real time, stores the results, including the reference image and audit images, in an Amazon Simple Storage Service (Amazon S3) bucket, and returns a DisconnectEvent to the FaceLivenessDetector component when the streaming is complete.
  6. The FaceLivenessDetector component calls the appropriate callbacks to signal to the client app that the streaming is complete and that scores are ready for retrieval.
  7. The client app calls the customer’s backend to get a Boolean flag indicating whether the user was live or not. The customer backend makes the request to Amazon Rekognition to get the confidence score, reference, and audit images. The customer backend uses these attributes to determine whether the user is live and returns an appropriate response to the client app.
  8. Finally, the client app passes the response to the FaceLivenessDetector component, which appropriately renders the success or failure message to complete the flow.

Conclusion

In this post, we showed how the new Face Liveness feature in Amazon Rekognition detects if a user going through a face verification process is physically present in front of a camera and not a bad actor using a spoof attack. Using Face Liveness, you can deter fraud in your face-based user verification workflows.

Get started today by visiting the Face Liveness feature page for more information and to access the developer guide. Amazon Rekognition Face Liveness cloud APIs are available in the US East (N. Virginia), US West (Oregon), Europe (Ireland), Asia Pacific (Mumbai), and Asia Pacific (Tokyo) Regions.


About the Authors

Zuhayr Raghib is an AI Services Solutions Architect at AWS. Specializing in applied AI/ML, he is passionate about enabling customers to use the cloud to innovate faster and transform their businesses.

Pavan Prasanna Kumar is a Senior Product Manager at AWS. He is passionate about helping customers solve their business challenges through artificial intelligence. In his spare time, he enjoys playing squash, listening to business podcasts, and exploring new cafes and restaurants.

Tushar Agrawal leads Product Management for Amazon Rekognition. In this role, he focuses on building computer vision capabilities that solve critical business problems for AWS customers. He enjoys spending time with family and listening to music.

Build Streamlit apps in Amazon SageMaker Studio

Developing web interfaces to interact with a machine learning (ML) model is a tedious task. With Streamlit, developing demo applications for your ML solution is easy. Streamlit is an open-source Python library that makes it easy to create and share web apps for ML and data science. As a data scientist, you may want to showcase your findings for a dataset, or deploy a trained model. Streamlit applications are useful for presenting progress on a project to your team, sharing insights with your managers, and even getting feedback from customers.

With the integrated development environment (IDE) of Amazon SageMaker Studio with Jupyter Lab 3, we can build, run, and serve Streamlit web apps from within that same environment for development purposes. This post outlines how to build and host Streamlit apps in Studio in a secure and reproducible manner without any time-consuming front-end development. As an example, we use a custom Amazon Rekognition demo, which will annotate and label an uploaded image. This will serve as a starting point, and it can be generalized to demo any custom ML model. The code for this blog can be found in this GitHub repository.

Solution overview

The following is the architecture diagram of our solution.

A user first accesses Studio through the browser. The Jupyter Server associated with the user profile runs inside the Studio Amazon Elastic Compute Cloud (Amazon EC2) instance. Inside the Studio EC2 instance exists the example code and dependencies list. The user can run the Streamlit app, app.py, in the system terminal. Studio runs the JupyterLab UI in a Jupyter Server, decoupled from notebook kernels. The Jupyter Server comes with a proxy and allows us to access our Streamlit app. Once the app is running, the user can initiate a separate session through the AWS Jupyter Proxy by adjusting the URL.

From a security aspect, the AWS Jupyter Proxy is extended by AWS authentication. As long as a user has access to the AWS account, Studio domain ID, and user profile, they can access the link.

Create Studio using JupyterLab 3.0

Studio with JupyterLab 3 must be installed for this solution to work. Older versions might not support features outlined in this post. For more information, refer to Amazon SageMaker Studio and SageMaker Notebook Instance now come with JupyterLab 3 notebooks to boost developer productivity. By default, Studio comes with JupyterLab 3. You should check the version and change it if running an older version. For more information, refer to JupyterLab Versioning.

You can set up Studio using the AWS Cloud Development Kit (AWS CDK); for more information, refer to Set up Amazon SageMaker Studio with Jupyter Lab 3 using the AWS CDK. Alternatively, you can use the SageMaker console to change the domain settings. Complete the following steps:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Select your domain and choose Edit.

  1. For Default Jupyter Lab version, make sure the version is set to Jupyter Lab 3.0.

(Optional) Create a shared space

We can use the SageMaker console or the AWS CLI to add support for shared spaces to an existing domain by following the steps in the docs or in this blog. Creating a shared space in AWS has the following benefits:

  1. Collaboration: A shared space allows multiple users or teams to collaborate on a project or set of resources, without having to duplicate data or infrastructure.
  2. Cost savings: Instead of each user or team creating and managing their own resources, a shared space can be more cost-effective, as resources can be pooled and shared across multiple users.
  3. Simplified management: With a shared space, administrators can manage resources centrally, rather than having to manage multiple instances of the same resources for each user or team.
  4. Improved scalability: A shared space can be more easily scaled up or down to meet changing demands, as resources can be allocated dynamically to meet the needs of different users or teams.
  5. Enhanced security: By centralizing resources in a shared space, security can be improved, as access controls and monitoring can be applied more easily and consistently.

Install dependencies and clone the example on Studio

Next, we launch Studio and open the system terminal. We use the SageMaker IDE to clone our example and the system terminal to launch our app. The code for this blog can be found in this GitHub repository. We start by cloning the repository.

Next, we open the System Terminal.

Once cloned, in the system terminal, install the dependencies to run our example code by running the following command. This first installs the dependencies by running pip install --no-cache-dir -r requirements.txt. The no-cache-dir flag disables the cache. Caching helps store the installation files (.whl) of the modules that you install through pip. It also stores the source files (.tar.gz) to avoid re-downloading them when they haven’t expired. If there isn’t space on our hard drive, or if we want to keep a Docker image as small as possible, we can use this flag so the command runs to completion with minimal memory usage. Next, the script installs the packages iproute and jq, which are used in the following step.

sh setup.sh

Run the Streamlit demo and create a shareable link

To verify all dependencies are successfully installed and to view the Amazon Rekognition demo, run the following command:

sh run.sh

The port number hosting the app will be displayed.

Note that while developing, it might be helpful to automatically rerun the script when app.py is modified on disk. To do so, we can modify the runOnSave configuration option by adding the --server.runOnSave true flag to our command:

streamlit run app.py --server.runOnSave true

The following screenshot shows an example of what should be displayed on the terminal.

From the preceding example, we see the port number, domain ID, and Studio URL our app is running on. Finally, we can see the URL we need to use to access our Streamlit app. This script modifies the Studio URL, replacing lab? with proxy/[PORT NUMBER]/. The Amazon Rekognition object detection demo will be displayed, as shown in the following screenshot.
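
The following is a minimal sketch of that URL rewrite; the domain ID, Region, and port are placeholder assumptions.

# Rewrite the Studio URL so the Jupyter Server proxy forwards traffic to the
# port that Streamlit is listening on (8501 by default)
studio_url = "https://d-xxxxxxxxxxxx.studio.us-east-1.sagemaker.aws/jupyter/default/lab?"
port = 8501
proxy_url = studio_url.replace("lab?", f"proxy/{port}/")
print(proxy_url)
# https://d-xxxxxxxxxxxx.studio.us-east-1.sagemaker.aws/jupyter/default/proxy/8501/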

Now that we have the Streamlit app working, we can share this URL with anyone who has access to this Studio domain ID and user profile. To make sharing these demos easier, we can check the status and list all running Streamlit apps by running the following command: sh status.sh

We can use lifecycle scripts or shared spaces to extend this work. Instead of manually running the shell scripts and installing dependencies, use lifecycle scripts to streamline this process. To develop and extend this app with a team and share dashboards with peers, use shared spaces. By creating shared spaces in Studio, users can collaborate in the shared space to develop a Streamlit app in real time. All resources in a shared space are filtered and tagged, making it easier to focus on ML projects and manage costs. Refer to the following code to make your own applications in Studio.

Cleanup

Once we are done using the app, we want to free up the listening ports. To get all the processes running Streamlit and free up their ports, we can run our cleanup script: sh cleanup.sh

Conclusion

In this post, we showed an end-to-end example of hosting a Streamlit demo for an object detection task using Amazon Rekognition. We detailed the motivations for building quick web applications, security considerations, and setup required to run our own Streamlit app in Studio. Finally, we modified the URL pattern in our web browser to initiate a separate session through the AWS Jupyter Proxy.

This demo allows you to upload any image and visualize the outputs from Amazon Rekognition. The results are also processed, and you can download a CSV file with all the bounding boxes through the app. You can extend this work to annotate and label your own dataset, or modify the code to showcase your custom model!


About the Authors

Dipika Khullar is an ML Engineer in the Amazon ML Solutions Lab. She helps customers integrate ML solutions to solve their business problems. Most recently, she has built training and inference pipelines for media customers and predictive models for marketing.

Marcelo Aberle is an ML Engineer in the AWS AI organization. He is leading MLOps efforts at the Amazon ML Solutions Lab, helping customers design and implement scalable ML systems. His mission is to guide customers on their enterprise ML journey and accelerate their ML path to production.

Yash Shah is a Science Manager in the Amazon ML Solutions Lab. He and his team of applied scientists and ML engineers work on a range of ML use cases from healthcare, sports, automotive, and manufacturing.

Secure Amazon SageMaker Studio presigned URLs Part 3: Multi-account private API access to Studio

Enterprise customers have multiple lines of business (LOBs) and groups and teams within them. These customers need to balance governance, security, and compliance against the need for machine learning (ML) teams to quickly access their data science environments in a secure manner. These enterprise customers that are starting to adopt AWS, expanding their footprint on AWS, or planning to enhance an established AWS environment need to ensure they have a strong foundation for their cloud environment. One important aspect of this foundation is to organize their AWS environment following a multi-account strategy.

In the post Secure Amazon SageMaker Studio presigned URLs Part 2: Private API with JWT authentication, we demonstrated how to build a private API to generate Amazon SageMaker Studio presigned URLs that are only accessible by an authenticated end-user within the corporate network from a single account. In this post, we show how you can extend that architecture to multiple accounts to support multiple LOBs. We demonstrate how you can use Studio presigned URLs in a multi-account environment to secure and route access from different personas to their appropriate Studio domain. We explain the process and network flow, and how to easily scale this architecture to multiple accounts and Amazon SageMaker domains. The proposed solution also ensures that all network traffic stays within AWS’s private network and communication happens in a secure way.

Although we demonstrate using two different LOBs, each with a separate AWS account, this solution can scale to multiple LOBs. We also introduce a logical construct of a shared services account that plays a key role in governance, administration, and orchestration.

Solution overview

We can achieve communication between all LOBs’ SageMaker VPCs and the shared services account VPC using either VPC peering or AWS Transit Gateway. In this post, we use a transit gateway because it provides a simpler VPC-to-VPC communication mechanism over VPC peering when there are a large number of VPCs involved. We also use Amazon Route 53 forwarding rules in combination with inbound and outbound resolvers to resolve all DNS queries to the shared service account VPC endpoints. The networking architecture has been designed using the following patterns:

Let’s look at the two main architecture components, the information flow and network flow, in more detail.

Information flow

The following diagram illustrates the architecture of the information flow.

The workflow steps are as follows:

  1. The user authenticates with the Amazon Cognito user pool and receives a token to consume the Studio access API.
  2. The user calls the API to access Studio and includes the token in the request.
  3. When this API is invoked, the custom AWS Lambda authorizer is triggered to validate the token with the identity provider (IdP), and returns the proper permissions for the user.
  4. After the call is authorized, a Lambda function is triggered.
  5. This Lambda function uses the user’s name to retrieve their LOB name and the LOB account from the following Amazon DynamoDB tables that store these relationships:
    1. Users table – This table holds the relationship between users and their LOB.
    2. LOBs table – This table holds the relationship between the LOBs and the AWS account where the SageMaker domain for that LOB exists.
  6. With the account ID, the Lambda function assumes the PresignedUrlGenerator role in that account (each LOB account has a PresignedUrlGenerator role that can only be assumed by the Lambda function in charge of generating the presigned URLs).
  7. Finally, the function invokes the SageMaker create-presigned-domain-url API call for that user in their LOB’s SageMaker domain.
  8. The presigned URL is returned to the end-user, who consumes it via the Studio VPC endpoint.

Steps 1–4 are covered in more detail in Part 2 of this series, where we explain how the custom Lambda authorizer works and takes care of the authorization process in the access API Gateway.
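
The following is a minimal sketch of steps 6 and 7; the account ID, domain ID, and user profile variables are assumptions for illustration.

import boto3

sts = boto3.client("sts")

# Step 6: assume the PresignedUrlGenerator role in the user's LOB account
credentials = sts.assume_role(
    RoleArn=f"arn:aws:iam::{lob_account_id}:role/PresignedUrlGenerator",
    RoleSessionName="presigned-url-generation",
)["Credentials"]

sagemaker = boto3.client(
    "sagemaker",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

# Step 7: create the presigned URL for the user's Studio profile in that domain
response = sagemaker.create_presigned_domain_url(
    DomainId=lob_domain_id,        # hypothetical domain ID variable
    UserProfileName=user_name,     # matches the user's Studio profile name
    ExpiresInSeconds=300,
)
presigned_url = response["AuthorizedUrl"]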

Network flow

All network traffic flows in a secure and private manner using AWS PrivateLink, as shown in the following diagram.

The steps are as follows:

  1. When the user calls the access API, it happens via the VPC endpoint for Amazon API Gateway in the networking VPC in the shared services account. This API is set as private, and has a policy that allows its consumption only via this VPC endpoint, as described in Part 2 of this series.
  2. All the authorization process happens privately between API Gateway, Lambda, and Amazon Cognito.
  3. After authorization is granted, API Gateway triggers the Lambda function in charge of generating the presigned URLs using AWS’s private network.
  4. Then, because the routing Lambda function lives in a VPC, all calls to different services happen through their respective VPC endpoints in the shared services account. The function performs the following actions:
    1. Retrieve the credentials to assume the role via the AWS Security Token Service (AWS STS) VPC endpoint in the networking account.
    2. Call DynamoDB to retrieve user and LOB information through the DynamoDB VPC endpoint.
    3. Call the SageMaker API to create a presigned URL for the user in their SageMaker domain through the SageMaker API VPC endpoint.
  5. The user finally consumes the presigned URL via the Studio VPC endpoint in the networking VPC in the shared services account, because this VPC endpoint has been specified during the creation of the presigned URL.
  6. All further communications between Studio and AWS services happen via Studio’s ENI inside the LOB account’s SageMaker VPC. For example, to allow SageMaker to call Amazon Elastic Container Registry (Amazon ECR), the Amazon ECR interface VPC endpoint can be provisioned in the shared services account VPC, and a forwarding rule is shared with the SageMaker accounts that need to consume it. This allows SageMaker queries to Amazon ECR to be resolved to this endpoint, and the Transit Gateway routing will do the rest.

Prerequisites

To represent a multi-account environment, we use one shared services account and two different LOBs:

  • Shared services account – Where the VPC endpoints and the Studio access Gateway API live
  • SageMaker account LOB A – The account for the SageMaker domain for LOB A
  • SageMaker account LOB B – The account for the SageMaker domain for LOB B

For more information on how to create an AWS account, refer to How do I create and activate a new AWS account.

LOB accounts are logical entities that are business, department, or domain specific. We assume one account per logical entity. However, there will be different accounts per environment (development, test, production). For each environment, you typically have a separate shared services account (based on compliance requirements) to restrict the blast radius.

You can use the templates and instructions in the GitHub repository to set up the needed infrastructure. This repository is structured into folders for the different accounts and different parts of the solution.

Infrastructure setup

For large companies with many Studio domains, it’s also advisable to have a centralized endpoint architecture. This can result in cost savings as the architecture scales and more domains and accounts are created. The networking.yml template in the shared services account deploys the VPC endpoints and needed Route 53 resources, and the Transit Gateway infrastructure to scale out the proposed solution.

Detailed instructions of the deployment can be found in the README.md file in the GitHub repository. The full deployment includes the following resources:

  • Two AWS CloudFormation templates in the shared services account: one for networking infrastructure and one for the AWS Serverless Application Model (AWS SAM) Studio access Gateway API
  • One CloudFormation template for the infrastructure in the SageMaker account LOB A
  • One CloudFormation template for the infrastructure of the SageMaker account LOB B
  • Optionally, an on-premises simulator can be deployed in the shared services account to test the end-to-end deployment

After everything is deployed, navigate to the Transit Gateway console for each SageMaker account (LOB accounts) and confirm that the transit gateway has been correctly shared and the VPCs are associated with it.

Optionally, if any forwarding rules have been shared with the accounts, they can be associated with the SageMaker accounts’ VPC. The basic rules to make the centralized VPC endpoints solution work are automatically shared with the LOB Account during deployment. For more information about this approach, refer to Centralized access to VPC private endpoints.

Populate the data

Run the following script to populate the DynamoDB tables and Amazon Cognito user pool with the required information:

./scripts/setup/fill-data.sh

The script performs the required API calls using the AWS Command Line Interface (AWS CLI) and the previously configured parameters and profiles.

Amazon Cognito users

This step works the same as Part 2 of this series, but has to be performed for users in all LOBs and should match their user profile in SageMaker, regardless of which LOB they belong to. For this post, we have one user in a Studio domain in LOB A (user-lob-a) and one user in a Studio domain in LOB B (user-lob-b). The following table lists the users populated in the Amazon Cognito user pool.

User Password
user-lob-a UserLobA1!
user-lob-b UserLobB1!

Note that these passwords have been configured for demo purposes.

DynamoDB tables

The access application uses two DynamoDB tables to direct requests from the different users to their LOB’s Studio domain.

The users table holds the relationship between users and their LOB.

Primary Key LOB
user-lob-a lob-a
user-lob-b lob-b

The LOB table holds the relationship between the LOB and the AWS account where the SageMaker domain for that LOB exists.

LOB ACCOUNT_ID
lob-a <YOUR_LOB_A_ACCOUNT_ID>
lob-b <YOUR_LOB_B_ACCOUNT_ID>

Note that these user names must be consistent across the Studio user profiles and the names of the users we previously added to the Amazon Cognito user pool.

Test the deployment

At this point, we can test the deployment by going to API Gateway and checking what the API returns for any of the users. We get a presigned URL in the response; however, consuming that URL in the browser will give an auth token error.

For this demo, we have set up a simulated on-premises environment with a bastion host and a Windows application. We install Firefox in the Windows instance and use the dev tools to add authorization headers to our requests and test the solution. More detailed information on how to set up the on-premises simulated environment is available in the associated GitHub repository.

The following diagram shows our test architecture.

We have two users, one for LOB A (User A) and another one for LOB B (User B), and we show how the Studio domain changes just by changing the authorization key retrieved from Amazon Cognito when logging in as User A and User B.

Complete the following steps to test the deployment:

  1. Retrieve the session token for User A, as shown in Part 2 of the series and also in the instructions in the GitHub repository.

We use the following example command to get the user credentials from Amazon Cognito:

aws cognito-idp initiate-auth \
--auth-flow USER_PASSWORD_AUTH \
--client-id <your-cognito-client-id> \
--auth-parameters USERNAME=user-lob-a,PASSWORD=UserLobA1! \
--region <your-region>
  2. For this demo, we use a simulated Windows on-premises application. To connect to the Windows instance, you can follow the same approach specified in Secure access to Amazon SageMaker Studio with AWS SSO and a SAML application.
  3. Firefox should be installed on the instance. If not, once in the instance, we can install Firefox.
  4. Open Firefox and try to access the Studio access API with either user-lob-a or user-lob-b as the API path parameter.

You get a not authorized message.

  5. Open the Firefox developer tools, and on the Network tab, choose (right-click) the previous API call and choose Edit and Resend.

  6. Add the token as an authorization header in the Firefox developer tools and make the request to the Studio access API again.

This time, we see in the developer tools that the URL is returned along with a 302 redirect.

  7. Although the redirect won’t work when using the developer tools, you can still choose it to access the LOB SageMaker domain for that user.

  8. Repeat the process for User B with their corresponding token and check that they get redirected to a different Studio domain.

If you perform these steps correctly, you can access both domains at the same time.

In our on-premises Windows application, we can have both domains consumed via the Studio VPC endpoint through our VPC peering connection.

Let’s explore some other testing scenarios.

If we edit the API call again and change the path to the opposite LOB, resending the request returns an error: a forbidden response from API Gateway.

Taking the returned URL for the correct user and opening it in your laptop's browser also fails, because the URL isn't consumed via the internal Studio VPC endpoint. This is the same error we saw when testing with API Gateway: an “Auth token containing insufficient permissions” error.

Taking too long to consume the presigned URL will result in an “Invalid or Expired Auth Token” error.
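
For completeness, the same test can be scripted instead of using the Firefox developer tools. The following is a minimal sketch; the API URL shape, the token type expected by the authorizer, and the response handling are assumptions, and the deployed values are in the GitHub repository.

import boto3
import requests  # assumed to be available in the test environment

cognito = boto3.client("cognito-idp", region_name="<your-region>")
auth = cognito.initiate_auth(
    ClientId="<your-cognito-client-id>",
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "user-lob-a", "PASSWORD": "UserLobA1!"},
)
token = auth["AuthenticationResult"]["IdToken"]  # the authorizer may expect the access token instead

# Call the Studio access API for this user with the Cognito token
response = requests.get(
    "https://<api-id>.execute-api.<your-region>.amazonaws.com/<stage>/user-lob-a",  # assumed URL shape
    headers={"Authorization": token},
    allow_redirects=False,  # the presigned URL only resolves via the Studio VPC endpoint
)
print(response.status_code, response.headers.get("Location"))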

Scale domains

Whenever a new SageMaker domain is added, you must complete the following networking and access steps:

  1. Share the transit gateway with the new account using AWS Resource Access Manager (AWS RAM).
  2. Attach the VPC to the transit gateway in the LOB account (this is done in AWS CloudFormation).

In our scenario, the transit gateway was set with automatic association to the default route table and automatic propagation enabled. In a real-world use case, you may need to complete three additional steps:

  1. In the shared services account, associate the attached Studio VPC to the respective Transit Gateway route table for SageMaker domains.
  2. Propagate the associated VPC routes to Transit Gateway.
  3. Lastly, add the account ID along with the LOB name to the LOBs’ DynamoDB table (a minimal sketch of this step follows the list).
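
The last step can be done on the DynamoDB console or with a short script. The following is a minimal boto3 sketch; the table name and the lob-c entry are illustrative.

import boto3

dynamodb = boto3.resource("dynamodb")
lobs_table = dynamodb.Table("lobs")  # illustrative table name

# Register a new LOB and the account that hosts its Studio domain
lobs_table.put_item(Item={"LOB": "lob-c", "ACCOUNT_ID": "<YOUR_LOB_C_ACCOUNT_ID>"})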

Clean up

Complete the following steps to clean up your resources:

  1. Delete the VPC peering connection.
  2. Remove the associated VPCs from the private hosted zones.
  3. Delete the on-premises simulator template from the shared services account.
  4. Delete the Studio CloudFormation templates from the SageMaker accounts.
  5. Delete the access CloudFormation template from the shared services account.
  6. Delete the networking CloudFormation template from the shared services account.

Conclusion

In this post, we walked through how you can set up multi-account private API access to Studio. We explained how the networking and application flows happen as well as how you can easily scale this architecture for multiple accounts and SageMaker domains. Head over to the GitHub repository to begin your journey. We’d love to hear your feedback!


About the Authors

Neelam Koshiya is an Enterprise Solutions Architect at AWS. Her current focus is helping enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.

Alberto Menendez is an Associate DevOps Consultant in Professional Services at AWS. He helps accelerate customers' journeys to the cloud. In his free time, he enjoys playing sports, especially basketball and padel, spending time with family and friends, and learning about technology.

Rajesh Ramchander is a Senior Data & ML Engineer in Professional Services at AWS. He helps customers migrate big data and AI/ML workloads to AWS.

Ram Vittal is a machine learning solutions architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys tennis and photography.

Read More

Run secure processing jobs using PySpark in Amazon SageMaker Pipelines

Run secure processing jobs using PySpark in Amazon SageMaker Pipelines

Amazon SageMaker Studio can help you build, train, debug, deploy, and monitor your models and manage your machine learning (ML) workflows. Amazon SageMaker Pipelines enables you to build a secure, scalable, and flexible MLOps platform within Studio.

In this post, we explain how to run PySpark processing jobs within a pipeline. This enables anyone that wants to train a model using Pipelines to also preprocess training data, postprocess inference data, or evaluate models using PySpark. This capability is especially relevant when you need to process large-scale data. In addition, we showcase how to optimize your PySpark steps using configurations and Spark UI logs.

Pipelines is an Amazon SageMaker tool for building and managing end-to-end ML pipelines. It’s a fully managed on-demand service, integrated with SageMaker and other AWS services, that creates and manages resources for you. This ensures that instances are only provisioned and used when the pipelines run. Furthermore, Pipelines is supported by the SageMaker Python SDK, letting you track your data lineage and reuse steps by caching them to reduce development time and cost. A SageMaker pipeline can use processing steps to process data or perform model evaluation.

When processing large-scale data, data scientists and ML engineers often use PySpark, an interface for Apache Spark in Python. SageMaker provides prebuilt Docker images that include PySpark and other dependencies needed to run distributed data processing jobs, including data transformations and feature engineering using the Spark framework. Although those images allow you to quickly start using PySpark in processing jobs, large-scale data processing often requires specific Spark configurations in order to optimize the distributed computing of the cluster created by SageMaker.

In our example, we create a SageMaker pipeline running a single processing step. For more information about what other steps you can add to a pipeline, refer to Pipeline Steps.

SageMaker Processing library

SageMaker Processing can run with specific frameworks (for example, SKLearnProcessor, PySparkProcessor, or HuggingFaceProcessor). Independent of the framework used, each ProcessingStep requires the following:

  • Step name – The name to be used for your SageMaker pipeline step
  • Step arguments – The arguments for your ProcessingStep

Additionally, you can provide the following:

  • The configuration for your step cache in order to avoid unnecessary runs of your step in a SageMaker pipeline
  • A list of step names, step instances, or step collection instances that the ProcessingStep depends on
  • The display name of the ProcessingStep
  • A description of the ProcessingStep
  • Property files
  • Retry policies

The arguments are handed over to the ProcessingStep. You can use the sagemaker.spark.PySparkProcessor or sagemaker.spark.SparkJarProcessor class to run your Spark application inside of a processing job.

Each processor comes with its own needs, depending on the framework. This is best illustrated using the PySparkProcessor, where you can pass additional information to optimize the ProcessingStep further, for instance via the configuration parameter when running your job.

Run SageMaker Processing jobs in a secure environment

It’s best practice to create a private Amazon VPC and configure it so that your jobs aren’t accessible over the public internet. SageMaker Processing jobs allow you to specify the private subnets and security groups in your VPC as well as enable network isolation and inter-container traffic encryption using the NetworkConfig.VpcConfig request parameter of the CreateProcessingJob API. We provide examples of this configuration using the SageMaker SDK in the next section.
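
The pipeline code in the next section imports a get_network_configuration helper whose implementation isn’t shown in this post. The following is a minimal sketch of what such a helper might look like using the SageMaker Python SDK’s NetworkConfig class; the flag values are illustrative assumptions.

from sagemaker.network import NetworkConfig


def get_network_configuration(subnets, security_group_ids):
    """Keep the processing job inside the VPC and encrypt traffic between Spark nodes."""
    return NetworkConfig(
        # Full network isolation would block the direct s3a:// reads used later,
        # so it stays disabled in this sketch; S3 access goes through VPC endpoints.
        enable_network_isolation=False,
        encrypt_inter_container_traffic=True,
        security_group_ids=security_group_ids,
        subnets=subnets,
    )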

PySpark ProcessingStep within SageMaker Pipelines

For this example, we assume that you have Studio deployed in a secure environment already available, including VPC, VPC endpoints, security groups, AWS Identity and Access Management (IAM) roles, and AWS Key Management Service (AWS KMS) keys. We also assume that you have two buckets: one for artifacts like code and logs, and one for your data. The basic_infra.yaml file provides example AWS CloudFormation code to provision the necessary prerequisite infrastructure. The example code and deployment guide is also available on GitHub.

As an example, we set up a pipeline containing a single ProcessingStep in which we’re simply reading and writing the abalone dataset using Spark. The code samples show you how to set up and configure the ProcessingStep.

We define parameters for the pipeline (name, role, buckets, and so on) and step-specific settings (instance type and count, framework version, and so on). In this example, we use a secure setup and also define subnets, security groups, and the inter-container traffic encryption. For this example, you need a pipeline execution role with SageMaker full access and a VPC. See the following code:

{
	"pipeline_name": "ProcessingPipeline",
	"trial": "test-blog-post",
	"pipeline_role": "arn:aws:iam::<ACCOUNT_NUMBER>:role/<PIPELINE_EXECUTION_ROLE_NAME>",
	"network_subnet_ids": [
		"subnet-<SUBNET_ID>",
		"subnet-<SUBNET_ID>"
	],
	"network_security_group_ids": [
		"sg-<SG_ID>"
	],
	"pyspark_process_volume_kms": "arn:aws:kms:<REGION_NAME>:<ACCOUNT_NUMBER>:key/<KMS_KEY_ID>",
	"pyspark_process_output_kms": "arn:aws:kms:<REGION_NAME>:<ACCOUNT_NUMBER>:key/<KMS_KEY_ID>",
	"pyspark_helper_code": "s3://<INFRA_S3_BUCKET>/src/helper/data_utils.py",
	"spark_config_file": "s3://<INFRA_S3_BUCKET>/src/spark_configuration/configuration.json",
	"pyspark_process_code": "s3://<INFRA_S3_BUCKET>/src/processing/process_pyspark.py",
	"process_spark_ui_log_output": "s3://<DATA_S3_BUCKET>/spark_ui_logs/{}",
	"pyspark_framework_version": "2.4",
	"pyspark_process_name": "pyspark-processing",
	"pyspark_process_data_input": "s3a://<DATA_S3_BUCKET>/data_input/abalone_data.csv",
	"pyspark_process_data_output": "s3a://<DATA_S3_BUCKET>/pyspark/data_output",
	"pyspark_process_instance_type": "ml.m5.4xlarge",
	"pyspark_process_instance_count": 6,
	"tags": {
		"Project": "tag-for-project",
		"Owner": "tag-for-owner"
	}
}

To demonstrate, the following code example runs a PySpark script on SageMaker Processing within a pipeline by using the PySparkProcessor:

# import code requirements
# standard libraries import
import logging
import json

# sagemaker model import
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_experiment_config import PipelineExperimentConfig
from sagemaker.workflow.steps import CacheConfig
from sagemaker.processing import ProcessingInput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.spark.processing import PySparkProcessor

from helpers.infra.networking.networking import get_network_configuration
from helpers.infra.tags.tags import get_tags_input
from helpers.pipeline_utils import get_pipeline_config

def create_pipeline(pipeline_params, logger):
    """
    Args:
        pipeline_params (ml_pipeline.params.pipeline_params.py.Params): pipeline parameters
        logger (logger): logger
    Returns:
        pipeline (sagemaker.workflow.pipeline.Pipeline): the created pipeline
    """
    # Create SageMaker Session
    sagemaker_session = PipelineSession()

    # Get Tags
    tags_input = get_tags_input(pipeline_params["tags"])

    # get network configuration
    network_config = get_network_configuration(
        subnets=pipeline_params["network_subnet_ids"],
        security_group_ids=pipeline_params["network_security_group_ids"]
    )

    # Get Pipeline Configurations
    pipeline_config = get_pipeline_config(pipeline_params)

    # setting processing cache obj
    logger.info("Setting " + pipeline_params["pyspark_process_name"] + " cache configuration 3 to 30 days")
    cache_config = CacheConfig(enable_caching=True, expire_after="p30d")

    # Create PySpark Processing Step
    logger.info("Creating " + pipeline_params["pyspark_process_name"] + " processor")

    # setting up spark processor
    processing_pyspark_processor = PySparkProcessor(
        base_job_name=pipeline_params["pyspark_process_name"],
        framework_version=pipeline_params["pyspark_framework_version"],
        role=pipeline_params["pipeline_role"],
        instance_count=pipeline_params["pyspark_process_instance_count"],
        instance_type=pipeline_params["pyspark_process_instance_type"],
        volume_kms_key=pipeline_params["pyspark_process_volume_kms"],
        output_kms_key=pipeline_params["pyspark_process_output_kms"],
        network_config=network_config,
        tags=tags_input,
        sagemaker_session=sagemaker_session
    )
    
    # set up the run arguments for the processing step
    run_args = processing_pyspark_processor.run(
        submit_app=pipeline_params["pyspark_process_code"],
        submit_py_files=[pipeline_params["pyspark_helper_code"]],
        arguments=[
            # Processing input arguments. To add a new argument to this list, provide two entries:
            # the argument name preceded by "--", followed by the argument value
            "--input_table", pipeline_params["pyspark_process_data_input"],
            "--output_table", pipeline_params["pyspark_process_data_output"]
        ],
        spark_event_logs_s3_uri=pipeline_params["process_spark_ui_log_output"].format(pipeline_params["trial"]),
        inputs = [
            ProcessingInput(
                source=pipeline_params["spark_config_file"],
                destination="/opt/ml/processing/input/conf",
                s3_data_type="S3Prefix",
                s3_input_mode="File",
                s3_data_distribution_type="FullyReplicated",
                s3_compression_type="None"
            )
        ],
    )

    # create step
    pyspark_processing_step = ProcessingStep(
        name=pipeline_params["pyspark_process_name"],
        step_args=run_args,
        cache_config=cache_config,
    )

    # Create Pipeline
    pipeline = Pipeline(
        name=pipeline_params["pipeline_name"],
        steps=[
            pyspark_processing_step
        ],
        pipeline_experiment_config=PipelineExperimentConfig(
            pipeline_params["pipeline_name"],
            pipeline_config["trial"]
        ),
        sagemaker_session=sagemaker_session
    )
    pipeline.upsert(
        role_arn=pipeline_params["pipeline_role"],
        description="Example pipeline",
        tags=tags_input
    )
    return pipeline


def main():
    # set up logging
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    logger.info("Get Pipeline Parameter")

    with open("ml_pipeline/params/pipeline_params.json", "r") as f:
        pipeline_params = json.load(f)
    print(pipeline_params)

    logger.info("Create Pipeline")
    pipeline = create_pipeline(pipeline_params, logger=logger)
    logger.info("Execute Pipeline")
    execution = pipeline.start()
    return execution


if __name__ == "__main__":
    main()

As shown in the preceding code, we’re overwriting the default Spark configurations by providing configuration.json as a ProcessingInput. We use a configuration.json file that was saved in Amazon Simple Storage Service (Amazon S3) with the following settings:

[
    {
        "Classification":"spark-defaults",
        "Properties":{
            "spark.executor.memory":"10g",
            "spark.executor.memoryOverhead":"5g",
            "spark.driver.memory":"10g",
            "spark.driver.memoryOverhead":"10g",
            "spark.driver.maxResultSize":"10g",
            "spark.executor.cores":5,
            "spark.executor.instances":5,
            "spark.yarn.maxAppAttempts":1
            "spark.hadoop.fs.s3a.endpoint":"s3.<region>.amazonaws.com",
            "spark.sql.parquet.fs.optimized.comitter.optimization-enabled":true
        }
    }
]

We can update the default Spark configuration either by passing the file as a ProcessingInput or by using the configuration argument of the run() function.
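
The following is a minimal sketch of that second option (it is not the setup used in this post, which ships configuration.json as a ProcessingInput); processing_pyspark_processor and pipeline_params refer to the objects defined in the pipeline code above.

# A sketch of passing Spark settings directly to run() instead of shipping
# configuration.json as a ProcessingInput.
spark_configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": "10g",
            "spark.executor.memoryOverhead": "5g",
            "spark.executor.cores": "5",
            "spark.executor.instances": "5",
        },
    }
]

run_args = processing_pyspark_processor.run(
    submit_app=pipeline_params["pyspark_process_code"],
    submit_py_files=[pipeline_params["pyspark_helper_code"]],
    arguments=[
        "--input_table", pipeline_params["pyspark_process_data_input"],
        "--output_table", pipeline_params["pyspark_process_data_output"],
    ],
    configuration=spark_configuration,  # SageMaker stages this for the Spark cluster
    spark_event_logs_s3_uri=pipeline_params["process_spark_ui_log_output"].format(
        pipeline_params["trial"]
    ),
)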

The Spark configuration is dependent on other options, like the instance type and instance count chosen for the processing job. The first consideration is the number of instances, the vCPU cores that each of those instances has, and the instance memory. You can use Spark UIs or CloudWatch instance metrics and logs to calibrate these values over multiple run iterations.

In addition, the executor and driver settings can be optimized even further. For an example of how to calculate these, refer to Best practices for successfully managing memory for Apache Spark applications on Amazon EMR.
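
As a rough, illustrative sketch of that arithmetic for the ml.m5.4xlarge instances used in this example (16 vCPUs and 64 GiB of memory each), and deliberately not matching the demo configuration above:

# Back-of-the-envelope executor sizing following the EMR best-practices approach.
# These numbers are illustrative, not the settings used in configuration.json.
vcpus_per_instance = 16        # ml.m5.4xlarge
memory_per_instance_gib = 64
instance_count = 6

executor_cores = 5                                                     # commonly recommended value
executors_per_instance = (vcpus_per_instance - 1) // executor_cores    # leave 1 vCPU for daemons -> 3
total_executors = executors_per_instance * instance_count - 1          # reserve one slot for the driver -> 17

memory_per_executor_gib = memory_per_instance_gib // executors_per_instance   # ~21 GiB
spark_executor_memory = int(memory_per_executor_gib * 0.9)                    # ~18g
spark_executor_memory_overhead = memory_per_executor_gib - spark_executor_memory  # ~3g

print(total_executors, spark_executor_memory, spark_executor_memory_overhead)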

Next, for driver and executor settings, we recommend investigating the committer settings to improve performance when writing to Amazon S3. In our case, we’re writing Parquet files to Amazon S3 and setting “spark.sql.parquet.fs.optimized.comitter.optimization-enabled” to true.

If needed for a connection to Amazon S3, a regional endpoint “spark.hadoop.fs.s3a.endpoint” can be specified within the configurations file.

In this example pipeline, the PySpark script process_pyspark.py (as shown in the following code) loads a CSV file from Amazon S3 into a Spark data frame and saves the data back to Amazon S3 as CSV and Parquet.

Note that our example configuration is not proportionate to the workload, because reading and writing the abalone dataset could be done with default settings on a single instance. The configurations we mentioned should be defined based on your specific needs.

# import requirements
import argparse
import logging
import sys
import os

# spark imports
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType, FloatType

from data_utils import Unbuffered

sys.stdout = Unbuffered(sys.stdout)

# Define custom handler
logger = logging.getLogger(__name__)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def main(data_path):

    spark = SparkSession.builder.appName("PySparkJob").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    schema = StructType(
        [
            StructField("sex", StringType(), True),
            StructField("length", FloatType(), True),
            StructField("diameter", FloatType(), True),
            StructField("height", FloatType(), True),
            StructField("whole_weight", FloatType(), True),
            StructField("shucked_weight", FloatType(), True),
            StructField("viscera_weight", FloatType(), True),
            StructField("rings", FloatType(), True),
        ]
    )

    df = spark.read.csv(data_path, header=False, schema=schema)
    return df.select("sex", "length", "diameter", "rings")

if __name__ == "__main__":
    logger.info(f"===============================================================")
    logger.info(f"================= Starting pyspark-processing =================")
    parser = argparse.ArgumentParser(description="app inputs")
    parser.add_argument("--input_table", type=str, help="path to the channel data")
    parser.add_argument("--output_table", type=str, help="path to the output data")
    args = parser.parse_args()
    
    df = main(args.input_table)

    logger.info("Writing transformed data")
    df.write.csv(os.path.join(args.output_table, "transformed.csv"), header=True, mode="overwrite")

    # save data
    df.coalesce(10).write.mode("overwrite").parquet(args.output_table)

    logger.info(f"================== Ending pyspark-processing ==================")
    logger.info(f"===============================================================")

To dive deeper into optimizing Spark processing jobs, you can use the CloudWatch logs as well as the Spark UI. Provided the Spark UI logs were saved to Amazon S3, you can view the Spark UI for Processing jobs running within a pipeline by running the Spark history server on a SageMaker notebook instance (a sketch follows).
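
The following is a minimal sketch of starting the history server from a notebook with the SageMaker Python SDK; the role ARN and S3 URI are placeholders, and the notebook instance needs Docker available.

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-history-server",
    framework_version="2.4",
    role="<EXECUTION_ROLE_ARN>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Start a local Spark history server pointing at the event logs written by the pipeline run
spark_processor.start_history_server(
    spark_event_logs_s3_uri="s3://<DATA_S3_BUCKET>/spark_ui_logs/test-blog-post"
)

# ...inspect the Spark UI in the browser, then shut the server down
# spark_processor.terminate_history_server()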

Clean up

If you followed this tutorial, it’s good practice to delete resources that are no longer used to stop incurring charges. Make sure to delete the CloudFormation stack that you used to create your resources; this removes the stack along with all the resources it created.

Conclusion

In this post, we showed how to run a secure SageMaker Processing job using PySpark within SageMaker Pipelines. We also demonstrated how to optimize PySpark using Spark configurations and set up your Processing job to run in a secure networking configuration.

As a next step, explore how to automate the entire model lifecycle and how customers built secure and scalable MLOps platforms using SageMaker services.


About the Authors

Maren Suilmann is a Data Scientist at AWS Professional Services. She works with customers across industries unveiling the power of AI/ML to achieve their business outcomes. Maren has been with AWS since November 2019. In her spare time, she enjoys kickboxing, hiking to great views, and board game nights.


Maira Ladeira Tanke is an ML Specialist at AWS. With a background in data science, she has 9 years of experience architecting and building ML applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through emerging technologies and innovative solutions. In her free time, Maira enjoys traveling and spending time with her family someplace warm.


Pauline Ting is a Data Scientist in the AWS Professional Services team. She supports customers in achieving and accelerating their business outcomes by developing AI/ML solutions. In her spare time, Pauline enjoys traveling, surfing, and trying new dessert places.


Donald Fossouo is a Sr Data Architect in the AWS Professional Services team, mostly working with Global Finance Service. He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Donald enjoys reading, running, and traveling.

Read More

Create your RStudio on Amazon SageMaker licensed or trial environment in three easy steps

Create your RStudio on Amazon SageMaker licensed or trial environment in three easy steps

RStudio on Amazon SageMaker is the first fully managed cloud-based Posit Workbench (formerly known as RStudio Workbench). RStudio on Amazon SageMaker removes the need for you to manage the underlying Posit Workbench infrastructure, so your teams can concentrate on producing value for your business. You can quickly launch the familiar RStudio integrated development environment (IDE) and scale up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.

Setting up a new Amazon SageMaker Studio domain with RStudio support or adding RStudio to an existing domain is now easier, thanks to the service integration with AWS Marketplace and AWS License Manager. You can now acquire your new Posit Workbench license or request a trial directly from AWS Marketplace and set up your environment using the AWS Management Console. In this post, we walk you through this process in three straightforward steps:

  1. Acquire a Posit Workbench license or request a time-bound trial in AWS Marketplace.
  2. Create a license grant in License Manager for your AWS account.
  3. Provision a new Studio domain with RStudio or add RStudio to your existing domain.

Prerequisites

Before beginning this walkthrough, make sure you have the following prerequisites:

Step 1: Acquire your Posit Workbench license

To acquire your Posit Workbench license, complete the following steps:

  1. Log in to your AWS account and navigate to the AWS Marketplace console.
  2. In the navigation pane, choose Discover Products.
  3. Search for Posit, then choose Posit Workbench and choose Continue to Subscribe.

Fig 1: Posit Workbench product page on AWS marketplace

  4. Specify your settings for Contract duration, Renewal Settings, and Contract options, then choose Create Contract.

Fig 2: Posit Workbench product agreement page on AWS marketplace

You will see a message stating your request is being processed. This step will take a few minutes to complete.


Fig 3: AWS Marketplace manage subscriptions page

After a few minutes, you will see the RStudio Workbench product under your subscriptions.

Request a trial license

If you want to create a test environment or a proof of concept, you can use the Posit Workbench product page to request a trial license. Complete the following steps:

  1.  Locate the evaluation request form link on the Overview tab in AWS Marketplace.


    Fig 4: Contact form link on the Posit Workbench product page

  2. Fill out the contact form and make sure you include your AWS account ID in the How can we help? prompt.

This is important because it allows the trial license private offer to be sent directly to your email without any additional back and forth.

You will receive an email with a link to a $0 limited-time private offer that you can open while logged in to your AWS account. After you accept the offer, you will be able to follow the next steps to activate your license grant.

Step 2: Manage your license grant in License Manager

To activate your license grant, complete the following steps:

  1. Navigate to the License Manager console to view the Posit Workbench license.
  2. If you’re using License Manager for the first time, you need to grant permission to use License Manager by selecting I grant AWS License Manager the required permissions and choosing Grant permissions.

Fig 5: AWS License Manager one-time setup page for IAM Permissions

  3. Choose Granted licenses in the navigation pane.

You can see two entitlements related to Posit Workbench: one for AWS Marketplace usage and the other for named users. In order to be able to use your license and create a Studio domain with RStudio support, you need to accept the license.

  4. On the Granted licenses page, select the license grant with RStudio Workbench as the product name and choose View.

Fig 6: AWS License Manager console with Granted licenses

  5. On the license detail page, choose Accept & activate license.

Fig 7: AWS License Manager console with License details

If you have a single account and want to create your Studio domain in the same account you’re managing your license, you can jump to Step 11. However, it’s an AWS-recommended best practice to use a multi-account AWS environment with a dedicated shared services account to manage your licenses. If that’s the case, you need to create a license grant for the AWS account where you will create the Studio domain with RStudio.

  6. In the navigation pane, choose Granted licenses, then choose the license ID to open the license details page.
  7. In the Grants section, choose Create grant.
  8. Enter a name and the AWS account ID of the grant recipient (the AWS account where you will create your RStudio-enabled Studio domain).
  9. Choose Create grant.

Fig 8: Create grant page in AWS License Manager console

  10. Log in to the AWS account where you will set up your RStudio on Amazon SageMaker domain and navigate to the License Manager console to accept and activate the granted license that appears as Pending acceptance.

The status changes to Active when you accept the grant or Rejected otherwise.

  11. Choose the license ID to see the details of the license.
  12. Choose Accept & activate license.

Fig 9: Amazon License Manager console with license status available

The license status changes to Available.

  13. To finalize, choose Activate license.

Fig 10: Amazon License Manager console with active license status

Now that you have accepted your Posit Workbench license, you’re ready to create your RStudio on Amazon SageMaker domain. Your license can be consumed by RStudio on Amazon SageMaker in any AWS Region that supports the feature.

Prerequisites to create a SageMaker domain

RStudio on Amazon SageMaker requires an IAM execution role that has permissions to License Manager and Amazon CloudWatch. For instructions, refer to Create DomainExecution role.

You can also use the following AWS CloudFormation stack template that creates the required IAM execution role in your account. Complete the following steps:

  1. Choose Launch Stack:

The link takes you to the us-east-1 Region, but you can change to your preferred Region. IAM roles are global resources, so you can access the role in any Region.

  2. In the Specify template section, choose Next.
  3. In the Specify stack details section, for Stack name, enter a name and choose Next.
  4. In the Configure stack options section, choose Next.
  5. In the Review section, select I acknowledge that AWS CloudFormation might create IAM resources and choose Create stack.
  6. When the stack status changes to CREATE_COMPLETE, go to the Resources tab to find the IAM role you created.

Step 3: Create a Studio domain with RStudio

You can configure RStudio on Amazon SageMaker as part of a multi-step SageMaker domain creation process on the console. You can also perform the steps using the AWS Command Line Interface (AWS CLI) following the instructions on Create an Amazon SageMaker Domain with RStudio using the AWS CLI. To create your domain on the console, complete the following steps:

  1. On the SageMaker console, on the Setup SageMaker Domain page, choose Standard setup, and then choose Configure.

Fig 11: Amazon SageMaker domain creation

  2. In Step 1 of the Standard setup, provide the following:
    • Your domain name.
    • Your chosen authentication method (IAM or AWS IAM Identity Center).
    • Your domain execution role (see the prerequisites section above).
    • Your network and storage selection.
  3. In Step 2, provide the configuration of your Studio Jupyter Lab environment (you can keep the default values and proceed).
  4. In Step 3, Studio automatically detects your RStudio Workbench license after it’s added and accepted in License Manager, as seen below.

Fig 12: Amazon SageMaker domain creation – general settings

You can choose the instance type for the RStudio server that is shared by all users in your domain. ml.t3.medium is recommended for domains with low UI usage and is free to use. For more information about how to choose an instance type, see the RStudioServerPro instance type page. Note that this is not the instance where your R sessions run their analysis and ML code.

The domain creation takes a couple of minutes. When it’s complete, we can add users for data scientists to access RStudio on SageMaker.

Add RStudio support to an existing Studio domain

If you already have a SageMaker domain, you can add RStudio support by using the update-domain API call from the AWS CLI. Complete the following steps:

  1. Delete all apps in your SageMaker domain. This is necessary because adding RStudio will update all your existing user profile security groups. (A scripted sketch of this step follows the list.)
    • Obtain a list of all existing apps by running the following command:
      aws sagemaker \
          list-apps \
          --domain-id-equals <DOMAIN_ID>

    • Then delete every app by running the following command:
      # JupyterServer apps
      aws sagemaker \
      delete-app \
      --domain-id <DOMAIN_ID> \
      --user-profile-name <USER_PROFILE> \
      --app-type JupyterServer \
      --app-name <APP_NAME>
      
      # KernelGateway apps
      aws sagemaker \
         delete-app \
         --domain-id <DOMAIN_ID> \
         --user-profile-name <USER_PROFILE> \
         --app-type KernelGateway \
         --app-name <APP_NAME>

  2. Activate RStudio by updating your domain. Depending on the type of networking you set up your domain with, choose between the following code examples:
    • If your domain is in VPCOnly mode:
      aws sagemaker \
          update-domain \
          --domain-id <DOMAIN_ID> \
          --app-security-group-management Service \
          --domain-settings-for-update RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
          --default-user-settings '{"SecurityGroups": ["<SECURITY_GROUP>", "<SECURITY_GROUP>"]}'

    • If your domain is in PublicInternetOnly mode:
      aws sagemaker \
          update-domain \
          --domain-id <DOMAIN_ID> \
          --domain-settings-for-update RStudioServerProDomainSettingsForUpdate={DomainExecutionRoleArn=<DOMAIN_EXECUTION_ROLE_ARN>} \
          --default-user-settings '{"SecurityGroups": ["<SECURITY_GROUP>", "<SECURITY_GROUP>"]}'

Important: If you have modified the security groups for existing user profiles in your domain, you have to make an additional update to make sure you don’t run into the limit on the maximum number of security groups per elastic network interface. For more information, refer to Add RStudio support to an existing Domain.

  3. You can now start adding new user profiles to your domain with RStudio support (by default, they will have access to RStudio). You can also add RStudio access to pre-existing user profiles. This is necessary because, by default, pre-existing user profiles in the domain are not granted access to RStudio on SageMaker.
    • Run the following command to add RStudio access to existing user profiles:
      aws sagemaker \
          update-user-profile \
          --domain-id <DOMAIN_ID> \
          --user-profile-name <USER_PROFILE> \
          --user-settings '{"RStudioServerProAppSettings": {"AccessStatus": "ENABLED"}}'
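
The app cleanup in step 1 can also be scripted. The following is a minimal boto3 sketch (not part of the original walkthrough); it lists every app in the domain and deletes each one that isn’t already deleted or being deleted.

import boto3

sagemaker_client = boto3.client("sagemaker")
domain_id = "<DOMAIN_ID>"

paginator = sagemaker_client.get_paginator("list_apps")
for page in paginator.paginate(DomainIdEquals=domain_id):
    for app in page["Apps"]:
        if app["Status"] in ("Deleted", "Deleting"):
            continue
        sagemaker_client.delete_app(
            DomainId=domain_id,
            UserProfileName=app["UserProfileName"],
            AppType=app["AppType"],
            AppName=app["AppName"],
        )
        print(f"Deleting {app['AppType']} app {app['AppName']} for {app['UserProfileName']}")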

Create a Studio domain user profile

Creating a user in your Studio domain allows access to both Studio and RStudio on SageMaker. You can configure both on the SageMaker console. If you prefer to use the AWS CLI to set up a user, refer to Manage users. To enable RStudio for a user via the console, complete the following steps:

  1. On the Domain details page, choose Add user.
  2. For Name, enter a user name.
  3. For Default execution role, create the user profile’s execution role.
  4. Choose Next.

Fig 13: Amazon SageMaker Add user – General Settings tab

  5. Next, configure access to SageMaker project templates and JumpStart. You can keep the defaults even though we don’t use this feature in this post; you can always edit it later.
  6. Choose Next to proceed.
  7. For License Authorization, Studio automatically detects and adds RStudio Workbench licenses to the domain for you to choose from:
    • RStudio Admin – Has access to the RStudio IDE and RStudio administrative dashboard
    • RStudio User – Has access to the RStudio IDE
    • Unauthorized – Doesn’t have access to the RStudio IDE

Note that all options grant access to Studio.

  8. Choose either RStudio Admin or RStudio User and choose Next to proceed.
  9. Choose Submit.

The user profile creation takes less than a minute.


Fig 14: Amazon SageMaker Add user – RStudio setting tab

  10. To open RStudio on SageMaker, on the Launch app menu in the user list, choose RStudio.

Fig 15: Amazon SageMaker Domain users

You will see the RStudio Workbench home page and a list of sessions, projects, and published content.

  11. To create a new session, choose New Session.
  12. Choose a desired instance on the Instance Type menu and choose Start Session.

Fig 16: Creating a new RStudio session in RStudio workbench

When you launch an RStudio session, the Base R image serves as the basis of your instance. This Docker image includes R v4.0, AWS tooling such as the AWS CLI, the sagemaker and boto3 Python packages, and the reticulate package for interoperability between Python and R.


Fig 17: RStudio Workbench session

Clean up

As part of this walkthrough, you provisioned a SageMaker domain, user profiles, and RStudio session. To delete these resources, refer to Delete an Amazon SageMaker Domain.

Conclusion

In this post, we showed how you can easily set up your RStudio on Amazon SageMaker environment in three straightforward steps. You can now either acquire a new paid Posit Workbench license or request a trial directly from AWS Marketplace and quickly import your license using License Manager. We also showed you how, after you accept the license grant, Studio automatically detects your new license and allows you to create a Studio domain with Posit Workbench support. We encourage you to try out RStudio on Amazon SageMaker today by following these steps and give us your feedback in the comments section!


About the Authors

Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Eric Peña is a Senior Technical Product Manager in the AWS Artificial Intelligence Platforms team, working on Amazon SageMaker Interactive Machine Learning. He currently focuses on IDE integrations on SageMaker Studio. He holds an MBA degree from MIT Sloan and outside of work enjoys playing basketball and football.

Read More

Inpaint images with Stable Diffusion using Amazon SageMaker JumpStart

Inpaint images with Stable Diffusion using Amazon SageMaker JumpStart

In November 2022, we announced that AWS customers can generate images from text with Stable Diffusion models using Amazon SageMaker JumpStart. Today, we are excited to introduce a new feature that enables users to inpaint images with Stable Diffusion models. Inpainting refers to the process of replacing a portion of an image with another image based on a textual prompt. By providing the original image, a mask image that outlines the portion to be replaced, and a textual prompt, the Stable Diffusion model can produce a new image that replaces the masked area with the object, subject, or environment described in the textual prompt.

You can use inpainting for restoring degraded images or creating new images with novel subjects or styles in certain sections. Within the realm of architectural design, Stable Diffusion inpainting can be applied to repair incomplete or damaged areas of building blueprints, providing precise information for construction crews. In the case of clinical MRI imaging, the patient’s head must be restrained, which may lead to subpar results due to the cropping artifact causing data loss or reduced diagnostic accuracy. Image inpainting can effectively help mitigate these suboptimal outcomes.

In this post, we present a comprehensive guide on deploying and running inference using the Stable Diffusion inpainting model in two methods: through JumpStart’s user interface (UI) in Amazon SageMaker Studio, and programmatically through JumpStart APIs available in the SageMaker Python SDK.

Solution overview

The following images are examples of inpainting. The original images are on the left, the mask images are in the center, and the inpainted images generated by the model are on the right. For the first example, the model was provided with the original image, a mask image, and the textual prompt “a white cat, blue eyes, wearing a sweater, lying in park,” as well as the negative prompt “poorly drawn feet.” For the second example, the textual prompt was “A female model gracefully showcases a casual long dress featuring a blend of pink and blue hues.”

Running large models like Stable Diffusion requires custom inference scripts. You have to run end-to-end tests to make sure that the script, the model, and the desired instance work together efficiently. JumpStart simplifies this process by providing ready-to-use scripts that have been robustly tested. You can access these scripts with one click through the Studio UI or with very few lines of code through the JumpStart APIs.

The following sections guide you through deploying the model and running inference using either the Studio UI or the JumpStart APIs.

Note that by using this model, you agree to the CreativeML Open RAIL++-M License.

Access JumpStart through the Studio UI

In this section, we illustrate the deployment of JumpStart models using the Studio UI. The accompanying video demonstrates locating the pre-trained Stable Diffusion inpainting model on JumpStart and deploying it. The model page offers essential details about the model and its usage. To perform inference, we employ the ml.p3.2xlarge instance type, which delivers the required GPU acceleration for low-latency inference at an affordable price. After the SageMaker hosting instance is configured, choose Deploy. The endpoint will be operational and prepared to handle inference requests within approximately 10 minutes.

JumpStart provides a sample notebook that can help accelerate the time it takes to run inference on the newly created endpoint. To access the notebook in Studio, choose Open Notebook in the Use Endpoint from Studio section of the model endpoint page.

Use JumpStart programmatically with the SageMaker SDK

Utilizing the JumpStart UI enables you to deploy a pre-trained model interactively with only a few clicks. Alternatively, you can employ JumpStart models programmatically by using APIs integrated within the SageMaker Python SDK.

In this section, we choose an appropriate pre-trained model in JumpStart, deploy this model to a SageMaker endpoint, and perform inference on the deployed endpoint, all using the SageMaker Python SDK. The following examples contain code snippets. To access the complete code with all the steps included in this demonstration, refer to the Introduction to JumpStart Image editing – Stable Diffusion Inpainting example notebook.

Deploy the pre-trained model

SageMaker utilizes Docker containers for various build and runtime tasks. JumpStart utilizes the SageMaker Deep Learning Containers (DLCs) that are framework-specific. We first fetch any additional packages, as well as scripts to handle training and inference for the selected task. Then the pre-trained model artifacts are separately fetched with model_uris, which provides flexibility to the platform. This allows multiple pre-trained models to be used with a single inference script. The following code illustrates this process:

model_id, model_version = "model-inpainting-stabilityai-stable-diffusion-2-inpainting-fp16", "*"
# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)
# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="inference")

base_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="inference")

Next, we provide those resources to a SageMaker model instance and deploy an endpoint:

# Create the SageMaker model instance
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=base_model_uri,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    role=aws_role,
    predictor_cls=Predictor,
    name=endpoint_name,
)

# deploy the Model - note that we need to pass the Predictor class when we deploy the model through the Model class,
# in order to run inference through the SageMaker API
base_model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=endpoint_name,
)

After the model is deployed, we can obtain real-time predictions from it!

Input

The input is the base image, a mask image, and the prompt describing the subject, object, or environment to be substituted in the masked-out portion. Creating the perfect mask image for inpainting involves several best practices. Start with a specific prompt, and don’t hesitate to experiment with various Stable Diffusion settings to achieve the desired outcome. Use a mask image that closely resembles the image you aim to inpaint. This approach aids the inpainting algorithm in completing the missing sections of the image, resulting in a more natural appearance. High-quality images generally yield better results, so make sure your base and mask images are of good quality and resemble each other. Additionally, opt for a large and smooth mask image to preserve detail and minimize artifacts.

The endpoint accepts the base image and mask as raw RGB values or a base64 encoded image. The inference handler decodes the image based on content_type:

  • For content_type = “application/json”, the input payload must be a JSON dictionary with the raw RGB values, textual prompt, and other optional parameters
  • For content_type = “application/json;jpeg”, the input payload must be a JSON dictionary with the base64 encoded image, a textual prompt, and other optional parameters

Output

The endpoint can generate two types of output: the raw RGB values of the generated images as a JSON dictionary, or the generated images as base64-encoded JPEGs. You specify the output format by setting the accept header to "application/json" (raw RGB values) or "application/json;jpeg" (base64-encoded JPEG).

  • For accept = “application/json”, the endpoint returns a JSON dictionary with the RGB values of the image
  • For accept = “application/json;jpeg”, the endpoint returns a JSON dictionary with the JPEG image as base64-encoded bytes

Note that sending or receiving the payload with the raw RGB values may hit default limits for the input payload and the response size. Therefore, we recommend using the base64 encoded image by setting content_type = “application/json;jpeg” and accept = “application/json;jpeg”.

The following code is an example inference request:

content_type = "application/json;jpeg"

with open(input_img_file_name, "rb") as f:
    input_img_image_bytes = f.read()
with open(input_img_mask_file_name, "rb") as f:
    input_img_mask_image_bytes = f.read()

encoded_input_image = base64.b64encode(bytearray(input_img_image_bytes)).decode()
encoded_mask = base64.b64encode(bytearray(input_img_mask_image_bytes)).decode()


payload = {
    "prompt": "a white cat, blue eyes, wearing a sweater, lying in park",
    "image": encoded_input_image,
    "mask_image": encoded_mask,
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
    "seed": 0,
    "negative_prompt": "poorly drawn feet",
}


accept = "application/json;jpeg"

def query(model_predictor, payload, content_type, accept):
    """Query the model predictor."""

    query_response = model_predictor.predict(
        payload,
        {
            "ContentType": content_type,
            "Accept": accept,
        },
    )
    return query_response

query_response = query(base_model_predictor, json.dumps(payload).encode("utf-8"), content_type, accept)
generated_images = parse_response(query_response)
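
The snippet above calls a parse_response helper that isn’t shown here; the full version lives in the example notebook. The following is a minimal sketch, assuming accept = "application/json;jpeg" and that the response dictionary exposes the images under a generated_images key (the key name is an assumption and may differ in the notebook).

# A sketch of a response parser for accept = "application/json;jpeg".
import base64
import io
import json

from PIL import Image  # Pillow, assumed available in the notebook environment


def parse_response(query_response):
    """Decode the base64-encoded JPEG images returned by the endpoint into PIL images."""
    response_dict = json.loads(query_response)
    return [
        Image.open(io.BytesIO(base64.b64decode(encoded_image)))
        for encoded_image in response_dict["generated_images"]  # assumed key name
    ]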

Supported parameters

Stable Diffusion inpainting models support many parameters for image generation:

  • image – The original image.
  • mask – An image where the blacked-out portion remains unchanged during image generation and the white portion is replaced.
  • prompt – A prompt to guide the image generation. It can be a string or a list of strings.
  • num_inference_steps (optional) – The number of denoising steps during image generation. More steps lead to a higher-quality image, but also to a longer response time. If specified, it must be a positive integer.
  • guidance_scale (optional) – A higher guidance scale results in an image more closely related to the prompt, at the expense of image quality. If specified, it must be a float. guidance_scale<=1 is ignored.
  • negative_prompt (optional) – This guides the image generation against this prompt. If specified, it must be a string or a list of strings and used with guidance_scale. If guidance_scale is disabled, this is also disabled. Moreover, if the prompt is a list of strings, then the negative_prompt must also be a list of strings.
  • seed (optional) – This fixes the randomized state for reproducibility. If specified, it must be an integer. Whenever you use the same prompt with the same seed, the resulting image will always be the same.
  • batch_size (optional) – The number of images to generate in a single forward pass. If using a smaller instance or generating many images, reduce batch_size to be a small number (1–2). The number of images = number of prompts*num_images_per_prompt.

Limitations and biases

Even though Stable Diffusion has impressive performance in inpainting, it suffers from several limitations and biases. These include but are not limited to:

  • The model may not generate accurate faces or limbs because the training data doesn’t include sufficient images with these features.
  • The model was trained on the LAION-5B dataset, which has adult content and may not be fit for product use without further considerations.
  • The model may not work well with non-English languages because the model was trained on English language text.
  • The model can’t generate good text within images.
  • Stable Diffusion inpainting typically works best with images of lower resolutions, such as 256×256 or 512×512 pixels. When working with high-resolution images (768×768 or higher), the method might struggle to maintain the desired level of quality and detail.
  • Although the use of a seed can help control reproducibility, Stable Diffusion inpainting may still produce varied results with slight alterations to the input or parameters. This might make it challenging to fine-tune the output for specific requirements.
  • The method might struggle with generating intricate textures and patterns, especially when they span large areas within the image or are essential for maintaining the overall coherence and quality of the inpainted region.

For more information on limitations and bias, refer to the Stable Diffusion Inpainting model card.

Inpainting solution with mask generated via a prompt

CLIPSeq is an advanced deep learning technique that utilizes the power of pre-trained CLIP (Contrastive Language-Image Pretraining) models to generate masks from input images. This approach provides an efficient way to create masks for tasks such as image segmentation, inpainting, and manipulation. CLIPSeq uses CLIP to generate a text description of the input image. The text description is then used to generate a mask that identifies the pixels in the image that are relevant to the text description. The mask can then be used to isolate the relevant parts of the image for further processing.

CLIPSeq has several advantages over other methods for generating masks from input images. First, it’s a more efficient method, because it doesn’t require the image to be processed by a separate image segmentation algorithm. Second, it’s more accurate, because it can generate masks that are more closely aligned with the text description of the image. Third, it’s more versatile, because you can use it to generate masks from a wide variety of images.

However, CLIPSeq also has some disadvantages. First, the technique may have limitations in terms of subject matter, because it relies on pre-trained CLIP models that may not encompass specific domains or areas of expertise. Second, it can be a sensitive method, because it’s susceptible to errors in the text description of the image.

For more information, refer to Virtual fashion styling with generative AI using Amazon SageMaker.

Clean up

After you’re done running the notebook, make sure to delete all resources created in the process to ensure that the billing is stopped. The code to clean up the endpoint is available in the associated notebook.

Conclusion

In this post, we showed how to deploy a pre-trained Stable Diffusion inpainting model using JumpStart. We showed code snippets in this post—the full code with all of the steps in this demo is available in the Introduction to JumpStart – Enhance image quality guided by prompt example notebook. Try out the solution on your own and send us your comments.

To learn more about the model and how it works, see the following resources:

To learn more about JumpStart, check out the following posts:


About the Authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.

Read More