How Medidata used Amazon SageMaker asynchronous inference to accelerate ML inference predictions up to 30 times faster

This post is co-written with Rajnish Jain, Priyanka Kulkarni, and Daniel Johnson from Medidata.

Medidata is leading the digital transformation of life sciences, creating hope for millions of patients. Medidata helps generate the evidence and insights that pharmaceutical, biotech, medical device, and diagnostics companies, as well as academic researchers, need to accelerate value, minimize risk, and optimize outcomes for their solutions. More than one million registered users across over 1,900 customers and partners access the world’s most trusted platform for clinical development, commercial, and real-world data.

Medidata’s AI team combines unparalleled clinical data, advanced analytics, and industry expertise to help life sciences leaders reimagine what is possible, uncover breakthrough insights to make confident decisions, and pursue continuous innovation. Medidata’s AI suite of solutions is backed by an integrated team of scientists, physicians, technologists, and ex-regulatory officials—built upon Medidata’s core platform comprising over 27,000 trials and 8 million patients.

Amazon SageMaker is a fully managed machine learning (ML) platform within the secure AWS landscape. With SageMaker, data scientists and developers can quickly build and train ML models, and then deploy them directly into a production-ready hosted environment. For hosting trained ML models, SageMaker offers a wide array of options, and the right choice depends on your traffic pattern and latency requirements. Real-time inference is suitable for persistent workloads with millisecond latency requirements, payload sizes up to 6 MB, and processing times of up to 60 seconds. With Serverless Inference, you can quickly deploy ML models for inference without having to configure or manage the underlying infrastructure, and you pay only for the compute capacity used to process inference requests, which is ideal for intermittent workloads. For requests with large unstructured payloads up to 1 GB, processing times up to 15 minutes, and near real-time latency requirements, you can use asynchronous inference. Batch transform is ideal for offline predictions on large batches of data that are available up front.

In this collaborative post, we demonstrate how AWS helped Medidata take advantage of the various hosting capabilities within SageMaker to experiment with different architecture choices for predicting the operational success of proposed clinical trials. We also explain why Medidata chose SageMaker asynchronous inference for its final design and how this architecture helped Medidata serve its customers with predictions up to 30 times faster while keeping ML infrastructure costs relatively low.

Architecture evolution

System design is not about choosing one right architecture. It’s about discussing and experimenting with multiple possible approaches and weighing their trade-offs against the requirements of the use case. During this process, it’s essential to take into account prior knowledge of the various requirements and of existing systems that can interact with the proposed design. The scalability of a system is its ability to easily and cost-effectively vary the resources allocated to it to serve changes in load, whether the number of users or requests increases or decreases.

In the following sections, we discuss how Medidata worked with AWS to iterate over a list of possible scalable architecture designs. We focus in particular on the evolution journey, the design choices, and the trade-offs that led to the final choice.

SageMaker batch transform

Medidata originally used SageMaker batch transform for ML inference to meet current requirements and develop a minimum viable product (MVP) for a new predictive solution due to low usage and loose performance requirements of the application. When a batch transform job starts, SageMaker initializes compute instances and distributes the inference or preprocessing workload between them. It’s a high-performance and high-throughput method for transforming data and generating inferences. It’s ideal for scenarios where you’re dealing with large batches of data, don’t need subsecond latency, and need to either preprocess or transform the data or use a trained model to run batch predictions on it in a distributed manner. The SageMaker batch transform workflow also uses Amazon Simple Storage Service (Amazon S3) as the persistent layer, which maps to one of our data requirements.

Initially, SageMaker batch transform worked well for the MVP, but as the requirements evolved and Medidata needed to support its customers in near real time, batch transform was no longer suitable: it’s an offline method, and customers had to wait anywhere between 5–15 minutes for responses. This delay primarily came from the startup cost of spinning up the underlying compute cluster every time a batch workload needed to be processed. This architecture also required configuring Amazon CloudWatch event rules to track the progress of the batch prediction job, together with a database of choice to track the states and metadata of the fired job. The MVP architecture is shown in the following diagram.

The flow of this architecture is as follows:

  1. The incoming bulk payload is persisted as an input to an S3 location. This event in turn triggers an AWS Lambda Submit function.
  2. The Submit function kicks off a SageMaker batch transform job using the SageMaker API (see the sketch following this list).
  3. The Submit function also updates a state and metadata tracker database of choice with the job ID and sets the status of the job to inProgress, storing the job’s corresponding metadata alongside the job ID.
  4. SageMaker spins up the transient (on-demand) compute cluster required to process the payload and runs the batch transform job. While running, the job emits status notifications and other logging information to CloudWatch Logs.
  5. The CloudWatch event rule captures the status of the batch transform job and sends a status notification to an Amazon Simple Notification Service (Amazon SNS) topic configured to capture this information.
  6. A Notification Lambda function subscribes to the SNS topic and is triggered whenever the CloudWatch event rule publishes a message to the topic.
  7. The Notification function then updates the status of the transform job for success or failure in the tracking database.
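To make steps 2 and 3 concrete, the following is a minimal sketch of what the Submit function could look like. It is illustrative only: the bucket, model, and table names are hypothetical, and it assumes Amazon DynamoDB as the state and metadata tracker database.

import time
import boto3

sagemaker = boto3.client("sagemaker")
tracker = boto3.resource("dynamodb").Table("inference-job-tracker")  # hypothetical tracking table


def handler(event, context):
    # Triggered by the S3 event emitted when the bulk payload is persisted (step 1)
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    job_name = f"bulk-predictions-{int(time.time())}"

    # Step 2: kick off the SageMaker batch transform job
    sagemaker.create_transform_job(
        TransformJobName=job_name,
        ModelName="trial-success-model",  # hypothetical SageMaker model name
        TransformInput={
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": f"s3://{bucket}/{key}"}
            },
            "ContentType": "application/json",
        },
        TransformOutput={"S3OutputPath": f"s3://{bucket}/batch-output/"},
        TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    )

    # Step 3: record the job ID, its metadata, and the inProgress status in the tracker database
    tracker.put_item(
        Item={"jobId": job_name, "status": "inProgress", "inputLocation": f"s3://{bucket}/{key}"}
    )
    return {"jobId": job_name}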

While exploring alternative strategies and architectures, Medidata realized that the traffic pattern for the application consisted of short bursts followed by periods of inactivity. To validate the drawbacks of the existing MVP architecture, Medidata performed some initial benchmarking to understand and prioritize the bottlenecks of the pipeline. As shown in the following diagram, the largest bottleneck was the transition time before running the model for inference, caused by spinning up new resources for each bulk request. A bulk request here refers to a payload containing a collection of operational site data to be processed, rather than a single request instance. The second biggest bottleneck was the time to save and write the output, which was also a consequence of the batch architecture.

As the number of clients increased and usage multiplied, Medidata prioritized user experience by tightening performance requirements. Therefore, Medidata decided to replace the batch transform workflow with a faster alternative. This led to Medidata experimenting with several architecture designs involving SageMaker real-time inference, Lambda, and SageMaker asynchronous inference. In the following sections, we compare these evaluated designs in depth and analyze the technical reasons for choosing one over the other for Medidata’s use case.

SageMaker real-time inference

You can use SageMaker real-time endpoints to serve your models for predictions in real time with low latency. Serving your predictions in real time requires a model serving stack that not only has your trained model, but also a hosting stack to be able to serve those predictions. The hosting stack typically includes some type of proxy, a web server that can interact with your loaded serving code, and your trained model. Your model can then be consumed by client applications through a real-time invoke API request. The request payload sent when you invoke the endpoint is routed to a load balancer and then routed to the ML instance or instances hosting your models for prediction. SageMaker real-time inference comes with all of the aforementioned components and makes it relatively straightforward to host any type of ML model for synchronous real-time inference.

SageMaker real-time inference has a 60-second timeout for endpoint invocation, and the maximum payload size for invocation is capped at 6 MB. Because Medidata’s inference logic is complex and frequently requires more than 60 seconds, real-time inference alone isn’t a viable option for bulk requests, which normally require unrolling and processing many individual operational identifiers, without re-architecting the existing ML pipeline. Additionally, real-time inference endpoints need to be sized to handle peak load. This could be challenging because Medidata has quick bursts of high traffic. Auto scaling could potentially fix this issue, but it would require manual tuning to ensure there are enough resources to handle all requests at any given time. Alternatively, we could manage a request queue to limit the number of concurrent requests at a given time, but this would introduce additional overhead.
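For reference, a synchronous real-time invocation carries the payload in the request body (up to 6 MB) and must return within the 60-second limit. The following sketch uses a hypothetical endpoint name and payload:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# The call blocks until the model responds or the 60-second invocation timeout is reached
response = runtime.invoke_endpoint(
    EndpointName="trial-success-endpoint",  # hypothetical real-time endpoint
    ContentType="application/json",
    Body=json.dumps({"operational_site_ids": [101, 102, 103]}),  # hypothetical payload
)
predictions = json.loads(response["Body"].read())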

Lambda

Serverless offerings like Lambda eliminate the hassle of provisioning and managing servers, and automatically take care of scaling in response to varying workloads. They can also be much cheaper for lower-volume services because they don’t run 24/7. Lambda works well for workloads that can tolerate cold starts after periods of inactivity. If a serverless function has not been run for approximately 15 minutes, the next request experiences what is known as a cold start because the function’s container must be provisioned.

Medidata built several proof of concept (POC) architecture designs to compare Lambda with other alternatives. As a first simple implementation, the ML inference code was packaged as a Docker image and deployed as a container using Lambda. To facilitate faster predictions with this setup, the invoked Lambda function requires a large provisioned memory footprint. For larger payloads, there is extra overhead to compress the input before calling the Lambda Docker endpoint. Additional configuration is also needed for the CloudWatch event rules to save the inputs and outputs and track the progress of each request, together with a database of choice to track the internal states and metadata of the fired requests. There is also operational overhead for reading and writing data to Amazon S3. Medidata calculated the projected cost of the Lambda approach based on usage estimates and determined it would be much more expensive than SageMaker with no added benefits.

SageMaker asynchronous inference

Asynchronous inference is one of the newest inference offerings in SageMaker that uses an internal queue for incoming requests and processes them asynchronously. This option is ideal for inferences with large payload sizes (up to 1 GB) or long processing times (up to 15 minutes) that need to be processed as requests arrive. Asynchronous inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.

For use cases that can tolerate a cold start penalty of a few minutes, you can optionally scale down the endpoint instance count to zero when there are no outstanding requests and scale back up as new requests arrive so that you only pay for the duration that the endpoints are actively processing requests.

Creating an asynchronous inference endpoint is very similar to creating a real-time endpoint. You can use your existing SageMaker models and only need to specify additional asynchronous inference configuration parameters while creating your endpoint configuration. Additionally, you can attach an auto scaling policy to the endpoint according to your scaling requirements. To invoke the endpoint, you need to place the request payload in Amazon S3 and provide a pointer to the payload as a part of the invocation request. Upon invocation, SageMaker enqueues the request for processing and returns an output location as a response. Upon processing, SageMaker places the inference response in the previously returned Amazon S3 location. You can optionally choose to receive success or error notifications via Amazon SNS.
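As an illustration, the following sketch creates an asynchronous endpoint configuration and endpoint with boto3, and then attaches a scale-to-zero auto scaling policy. The model, bucket, endpoint, and SNS topic names are hypothetical, and the scaling target value and cooldowns should be tuned for your workload:

import boto3

sagemaker = boto3.client("sagemaker")
autoscaling = boto3.client("application-autoscaling")

# The async-specific settings live in AsyncInferenceConfig; everything else
# mirrors a real-time endpoint configuration
sagemaker.create_endpoint_config(
    EndpointConfigName="medidata-async-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "trial-success-model",  # hypothetical existing SageMaker model
        "InstanceType": "ml.m5.2xlarge",
        "InitialInstanceCount": 1,
    }],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": "s3://medidata-inference/async-output/",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:us-east-1:111122223333:async-success",
                "ErrorTopic": "arn:aws:sns:us-east-1:111122223333:async-error",
            },
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    },
)

sagemaker.create_endpoint(
    EndpointName="medidata-async-endpoint",
    EndpointConfigName="medidata-async-config",
)

# Optionally scale the variant between 0 and 4 instances based on the backlog of queued
# requests, so you only pay while requests are being processed
# (wait for the endpoint to be InService before attaching the scaling policy)
resource_id = "endpoint/medidata-async-endpoint/variant/AllTraffic"
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=4,
)
autoscaling.put_scaling_policy(
    PolicyName="async-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": "medidata-async-endpoint"}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)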

The architecture designs discussed previously each came with bottlenecks and complexity challenges. With the launch of asynchronous inference, and based on extensive experimentation and performance benchmarking, Medidata chose SageMaker asynchronous inference for its final hosting architecture for several reasons. SageMaker is designed from the ground up to support ML workloads, whereas Lambda is more of a general-purpose tool. For Medidata’s specific use case and workload type, SageMaker asynchronous inference is cheaper than Lambda. Also, the asynchronous inference timeout is much longer (15 minutes) than the real-time inference timeout of 60 seconds, which ensures that asynchronous inference can support all of Medidata’s workloads without modification. Additionally, SageMaker asynchronous inference queues up requests during quick bursts of traffic rather than dropping them, which was a strong requirement for this use case, and exception and error handling is taken care of automatically. Asynchronous inference also makes it easy to handle the large payload sizes that are common in Medidata’s inference requirements. The final architecture using SageMaker asynchronous inference is shown in the following figure.

The flow of our final architecture is as follows:

  1. The Submit function receives the bulk payload from upstream consumer applications and is set up to be event-driven. This function uploads the payload to the pre-designated Amazon S3 location.
  2. The Submit function then invokes the SageMaker asynchronous endpoint, providing it with the Amazon S3 pointer to the uploaded payload (see the sketch following this list).
  3. The function also updates the state of the request to inProgress in the state and metadata tracker database.
  4. The SageMaker asynchronous inference endpoint reads the input from Amazon S3 and runs the inference logic. When the ML inference succeeds or fails, the inference output is written back to Amazon S3 and the status is sent to an SNS topic.
  5. A Notification Lambda function subscribes to the SNS topic. The function is invoked whenever a status update notification is published to the topic.
  6. The Notification function updates the status of the request to success or failure in the state and metadata tracker database.
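The following is a minimal sketch of steps 1–3 and 5–6. The names are hypothetical, it assumes DynamoDB as the state and metadata tracker database, and the notification fields reflect the asynchronous inference notification format at the time of writing:

import json
import uuid
import boto3

s3 = boto3.client("s3")
runtime = boto3.client("sagemaker-runtime")
tracker = boto3.resource("dynamodb").Table("inference-request-tracker")  # hypothetical table
BUCKET = "medidata-inference"                                            # hypothetical bucket


def submit_handler(event, context):
    # Steps 1-2: persist the bulk payload to S3 and invoke the async endpoint with a pointer to it
    request_id = str(uuid.uuid4())
    key = f"async-input/{request_id}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event["payload"]))  # hypothetical event shape

    response = runtime.invoke_endpoint_async(
        EndpointName="medidata-async-endpoint",
        InputLocation=f"s3://{BUCKET}/{key}",
        ContentType="application/json",
        InferenceId=request_id,
    )

    # Step 3: mark the request as inProgress in the tracker database
    tracker.put_item(Item={"requestId": request_id, "status": "inProgress",
                           "outputLocation": response["OutputLocation"]})
    return {"requestId": request_id}


def notification_handler(event, context):
    # Steps 5-6: parse the SNS notification and record the final status
    # (invocationStatus and inferenceId follow the async inference notification schema)
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    status = "success" if message.get("invocationStatus") == "Completed" else "failure"
    tracker.update_item(
        Key={"requestId": message.get("inferenceId")},
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": status},
    )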

To recap, the batch transform MVP architecture we started with took 5–15 minutes to run depending on the size of the input. With the switch to asynchronous inference, the new solution runs end to end in 10–60 seconds. We see a speedup of at least five times for larger inputs and up to 30 times for smaller inputs, leading to better customer satisfaction with the performance results. The revised final architecture greatly simplifies the previous asynchronous fan-out/fan-in architecture because we no longer have to worry about partitioning the incoming payload, spawning workers, and delegating and consolidating work among the worker Lambda functions.

Conclusion

With SageMaker asynchronous inference, Medidata’s customers using this new predictive application now experience predictions up to 30 times faster. Requests aren’t dropped during traffic spikes because the asynchronous inference endpoint queues them instead. The built-in SNS notification replaced the custom CloudWatch event notification that Medidata had built to notify the application when a job was complete. In this case, the asynchronous inference approach is cheaper than Lambda. SageMaker asynchronous inference is an excellent option if your team runs heavy ML workloads with burst traffic while trying to minimize cost. This is a great example of collaboration with the AWS team to push the boundaries and use leading-edge technology for maximum efficiency.

For detailed steps on how to create, invoke, and monitor asynchronous inference endpoints, refer to the documentation, which also contains a sample notebook to help you get started. For pricing information, visit Amazon SageMaker Pricing. For examples of using asynchronous inference with unstructured data such as computer vision and natural language processing (NLP), refer to Run computer vision inference on large videos with Amazon SageMaker asynchronous endpoints and Improve high-value research with Hugging Face and Amazon SageMaker asynchronous inference endpoints, respectively.


About the authors

Rajnish Jain is a Senior Director of Engineering at Medidata AI based in NYC. Rajnish heads engineering for a suite of applications that use machine learning on AWS to help customers improve operational success of proposed clinical trials. He is passionate about the use of machine learning to solve business problems.

Priyanka Kulkarni is a Lead Software Engineer within Acorn AI at Medidata Solutions. She architects and develops solutions and infrastructure to support ML predictions at scale. She is a data-driven engineer who believes in building innovative software solutions for customer success.

Daniel Johnson is a Senior Software Engineer within Acorn AI at Medidata Solutions. He builds APIs to support ML predictions around the feasibility of proposed clinical trials.

Arunprasath Shankar is a Senior AI/ML Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Read More

Fusion-Id: A Photoplethysmography and Motion Sensor Fusion Biometric Authenticator With Few-Shot on-Boarding

The abundance of wrist-worn heart rate measuring devices enables long-term cardiovascular monitoring through photoplethysmography (PPG). Such signals contain unique identifiable information that can help in biometric authentication. In this work, we propose Fusion-ID, which uses wrist-worn PPG sensors fused with motion sensor data for biometric authentication on wrist-worn devices. We conducted a user study using a PPG and motion sensor enabled wrist-worn device and collected data from 247 users. We then propose a novel sensor fusion deep Siamese network architecture for feature embedding…Apple Machine Learning Research

PyTorch strengthens its governance by joining the Linux Foundation

Today, I am proud to announce that PyTorch is moving to the Linux Foundation (LF) as a top-level project under the name PyTorch Foundation. The core mission of the Linux Foundation is the collaborative development of open source software. With a governing board of leaders from AMD, Amazon Web Services (AWS), Google Cloud, Meta, Microsoft Azure and NVIDIA, this model aligns with where PyTorch stands today and what it needs to travel forward. The creation of the PyTorch Foundation will ensure business decisions are being made in a transparent and open manner by a diverse group of members for years to come. The technical decisions remain in control of individual maintainers. I’m excited that the Linux Foundation will be our new home as they have notable experience supporting large open-source projects like ours such as Kubernetes and NodeJS. At this pivotal moment, I want to take a look back at how we started, share why we are moving, and what’s ahead.

This January, PyTorch celebrated its 5 year anniversary! I reflected on what it meant to me in this tweet thread, and this conversation with my colleagues Mike Schroepfer, Lin Qiao, and Yann LeCun. When we started PyTorch development in 2016, it was a collective effort by a band of people from the [Lua]Torch community with a big chunk of people and funding from Meta and individuals contributing from NVIDIA, Twitter and other entities.

Since 2017, PyTorch has grown far beyond our initial vision. With over 2,400 contributors who have built nearly 154,000 projects using PyTorch as a foundation, PyTorch has become one of the primary platforms for AI research, as well as commercial production use. We’ve seen its impact across industry and academia, from large companies to numerous university courses at Stanford, NYU, EPFL, Oxford, and other academic institutions. As a maintainer of PyTorch, the journey has been extremely fulfilling, with the impact of the project seen in various fields from self-driving cars to healthcare to aerospace.

As PyTorch grew, many companies have made foundational investments around it. While Meta remains the largest contributor to PyTorch, companies such as AMD, Amazon Web Services (AWS), Google Cloud, HuggingFace, Lightning AI, Microsoft Azure, Nvidia, and many others have made significant investments, including both technical contributions and community building efforts. They’ve established teams around PyTorch or filled significant voids within the PyTorch community and sent countless contributions to the PyTorch core and to the ecosystem around it — PyTorch is an important part of their future. With PyTorch continuing to grow as a multi-stakeholder project, it’s time to move to a broader open-source foundation.

The business governance of PyTorch was fairly unstructured for quite some time since launch – we operated like a scrappy startup. Team members at Meta spent the time and energy to structure this properly and organize PyTorch into an organizationally more healthy entity. Meta helped PyTorch with introducing many structures, such as Contributor License Agreements, Branding Guidelines, and Trademark registration. Keeping PyTorch’s organizational health up to check is essential and beneficial for the community. The next stage of our organizational progress is to support the interests of multiple stakeholders, hence moving to a foundation is good. We chose the Linux Foundation as it has vast organization experience hosting large multi-stakeholder open-source projects with the right balance of organizational structure and finding specific solutions for these projects.

Simultaneously, the technical governance of PyTorch has been a loosely structured community model of open-source development — A set of people maintaining PyTorch by area with their responsibility often tied to their individual identity rather than their employment. While we kept a codified list at the PyTorch – Persons of Interest page, the technical governance was not formalized nor codified. As PyTorch scales as a community, the next step is to structure and codify. The PyTorch Technical Governance now supports a hierarchical maintainer structure and clear outlining of processes around day to day work and escalations. This doesn’t change how we run things, but it does add discipline and openness that at our scale feels essential and timely.

It’s been an exciting journey since 2016. I am grateful for the experiences and people I’ve met along the way. PyTorch started with a small group of contributors which have grown and diversified over the years, all bringing in new ideas and innovations that would not have been possible without our community. We want to continue the open-source spirit – for the community and by the community. Thank you to our contributors, maintainers, users, supporters and new foundation members. We look forward to the next chapter of PyTorch with the PyTorch Foundation.

Read More

Tracking Any Pixel in a Video

We upgrade pixels into PIPs: “Persistent Independent Particles”. With this representation, we track any pixel over time, and overcome visibility issues with a learned temporal prior.

Motion estimation is a fundamental task of computer vision, with extremely broad applications. By tracking something, you can build models of its various properties: shape, texture, articulation, dynamics, affordances, and so on. More fine-grained tracking allows more fine-grained understanding. For robots, fine-grained tracking also enables fine-grained manipulation. Even setting aside downstream AI-related applications, motion tracks are directly useful for video editing applications — making realistic edits to a person or object in a video demands precise-as-possible tracking of the pixels, across an indefinite timespan.

There are a variety of methods for tracking objects (at the level of segmentation masks or bounding boxes), or for tracking certain points in certain categories (e.g., the joints of a person), but there are actually very few options for general-purpose fine-grained tracking. In this domain, the dominant approaches are feature matching and optical flow. The feature matching approach is: compute a feature for the target on the first frame, then compute features for pixels in other frames, and then compute “matches” using feature similarity (i.e., nearest neighbors). This often works well, but does not take into account temporal context, like smoothness of motion. The optical flow approach is: compute a dense “motion field” that relates each pair of frames, and then do some post-processing to link the fields together. Optical flow is very powerful, but since it only describes motion for a pair of frames at a time, it cannot produce useful outputs for targets that undergo multi-frame occlusion. “Occlusion” means our view is obstructed, and we need to guess the target’s location from context.

During an occlusion, appearance information does not suffice, because the target is not even present in the frames. Multi-frame temporal priors are key.

Around the year 2006, Peter Sand and Seth Teller proposed an alternative to flow-based and feature-based methods, called a “particle video.” This approach aims to represent a video with a set of particles that move across multiple frames. Their proposed method did not handle occlusions, but in our view, they laid the groundwork for treating pixels as persistent entities, with multi-frame trajectories and long-range temporal priors.

Inspired by their work, we propose Persistent Independent Particles (PIPs), a new particle video method. Our method takes a video as input, along with the $(x, y)$ coordinate of a target to track, and produces the target’s trajectory as output. The model can be queried for any number of particles, at any positions.

Particle trajectories for arbitrary “target” pixels in the video.

You may notice in our visualizations that the PIP trajectories often leave the video bounds. We treat pixels flying out-of-bounds as just another “occlusion.” This type of robustness is exactly what is missing from feature-based and flow-based methods.

Let’s step through how we achieved this.

How does it work?

At a high level, our method makes an extreme trade-off between spatial awareness and temporal awareness: we estimate the trajectory of every target independently. This extreme choice allows us to devote the majority of parameters into a module that simultaneously learns (1) temporal priors, and (2) an iterative inference mechanism that searches for the target pixel’s location in all input frames. Related work on optical flow estimation typically uses the opposite approach: they estimate the motion of all pixels simultaneously (using maximal spatial context), for just 2 frames at a time (using minimal temporal context).

Given the $(x_1, y_1)$ coordinate of the target on the first frame, our concrete goal is to estimate the target’s full trajectory over $T$ frames: $(x_1, y_1), \ldots, (x_T, y_T)$.

We start by initializing a zero-velocity estimate. This means copying the initial coordinate to every timestep.

To track the target, we need to know what it looks like, so we compute appearance features for all the frames (using a CNN), and initialize the target’s appearance trajectory with a bilinear sample at the given coordinate on the first frame. (The “bilinear sample” step extracts a feature vector at a subpixel location in the spatial map of features.)
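As an aside, the bilinear sample step is a standard grid-sampling operation. The following small PyTorch sketch (illustrative, not the authors’ code) shows how a feature vector can be pulled from a subpixel location:

import torch
import torch.nn.functional as F

def bilinear_sample(feature_map, xy):
    """Sample a feature vector at a subpixel (x, y) location of a (1, C, H, W) feature map."""
    _, _, H, W = feature_map.shape
    # grid_sample expects coordinates normalized to [-1, 1]
    x = 2.0 * xy[0] / (W - 1) - 1.0
    y = 2.0 * xy[1] / (H - 1) - 1.0
    grid = torch.stack([x, y]).view(1, 1, 1, 2)  # shape (N, H_out, W_out, 2)
    feat = F.grid_sample(feature_map, grid, mode="bilinear", align_corners=True)
    return feat.view(-1)  # (C,) feature vector for the target

feature_map = torch.randn(1, 64, 90, 160)                              # toy CNN feature map
target_feat = bilinear_sample(feature_map, torch.tensor([42.3, 17.8]))  # subpixel query
print(target_feat.shape)  # torch.Size([64])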

Our inference process will proceed by iteratively refining the sequence of positions, and sequence of appearance features, until they (hopefully) match the true trajectory of the target. This idea is illustrated in the video below: at initialization, only “timestep 1” has the correct location of the target (since this was given), and gradually, the model “locks on” to the target in all frames.

Our model’s job is to produce updates (i.e., deltas) for the positions and features, so that the trajectory tracks the target on every frame. A critical detail here is that we ask our model to produce these updates for multiple timesteps simultaneously. This allows us to “catch” a target after it re-emerges from an occluder, and “fill in” the missing part of the trajectory.

There are many ways to implement this, but for fast training and good generalization, we need to carefully select what information we provide to the model.

The main source of information we provide to the model is: measurements of local appearance similarity. We obtain these measurements using cross correlation (i.e., dot products), computed at multiple scales. When the target is visible, it should show up as a strong peak in at least one of the similarity maps. Also, when we are “locked on”, the peak should be in the middle of the map. This is illustrated in the animation below.

The second source of information we provide to the model is: the estimated trajectory itself. This allows the model to impose a temporal prior, and fix up parts of the trajectory where the local similarity information was ambiguous.

Finally, we allow the model to inspect the feature vector of the target, in case it might learn different strategies for different types of features. For example, depending on the scale or texture of the target, it may adjust the way it uses information from the multi-scale similarity maps.

For the model architecture, we elected to use an MLP-Mixer, which we found to have a good trade-off between model capacity, training time, and generalization. We also tried convolutional models and transformers, but the convolutional models could not fit the data as well as the MLP-Mixer, and the transformers took too long to train.
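To make the iterative refinement concrete, here is a heavily simplified, self-contained sketch. The shapes, the plain MLP standing in for the MLP-Mixer, and the random tensors standing in for the correlation maps are all illustrative rather than the actual PIPs implementation:

import torch
import torch.nn as nn

T, C, K = 8, 64, 49  # toy sizes: frames, feature dim, flattened 7x7 correlation patch


class RefinementStep(nn.Module):
    """One iterative update over all T timesteps at once (toy stand-in for the MLP-Mixer)."""

    def __init__(self):
        super().__init__()
        in_dim = T * (K + 2 + C)  # per frame: correlations, relative position, appearance feature
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.GELU(), nn.Linear(256, T * (2 + C)))

    def forward(self, corr, traj_xy, traj_feat):
        rel_xy = traj_xy - traj_xy[0:1]  # positions relative to the first frame
        x = torch.cat([corr, rel_xy, traj_feat], dim=-1).reshape(1, -1)
        delta = self.mlp(x).reshape(T, 2 + C)
        return traj_xy + delta[:, :2], traj_feat + delta[:, 2:]


# Zero-velocity initialization: copy the first coordinate and its feature to every timestep
traj_xy = torch.tensor([[42.0, 17.0]]).repeat(T, 1)
traj_feat = torch.randn(1, C).repeat(T, 1)
corr = torch.randn(T, K)  # stand-in for the multi-scale local similarity measurements

step = RefinementStep()
with torch.no_grad():
    for _ in range(6):  # the update is applied a fixed number of times
        traj_xy, traj_feat = step(corr, traj_xy, traj_feat)

In the actual method, the similarity measurements are recomputed around the updated positions at each iteration, so the model can progressively lock on to the target.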

We trained the model on synthetic data that we made (based on an existing optical flow dataset), where we could provide multi-frame ground truth for targets that undergo occlusions. The animation below shows the kind of data we trained on. You might say the data looks crazy — but that’s the point! If you can’t get real data, your best bet is synthetic data with extremely high diversity.

FlyingThings++: We train our model to track objects in this data, so that real videos are easy in comparison.

After training for a couple days on this data, the model starts to work on real videos.

Results

In the paper, we provide some quantitative analysis, showing that it works better than existing methods. It’s certainly not perfect — on keypoint tracking, the model works about 6/10 times, so there is a lot of room for improvement. The baselines are at around 5/10 or less.

The idea here is: pick a point on the first frame, and try to locate that same point in other frames of the video. This is a hard task especially when the point gets occluded.

Output of our PIPs model.

Baseline methods tend to get stuck on occluders, since they do not use multi-frame temporal context. For example, here is the output of a state-of-the-art optical flow method, on the same video and same target.

Output of an optical flow model (RAFT).

We have had fun trying the model on various videos, and observing the estimated trajectories. Sometimes they are surprisingly complex, since the target’s actual motion is subtly entangled with camera motion.

Visualizing the trajectories more densely gives mesmerizing results. Notice that the model even tracks ripples and specularities in the water.

Despite the fact that each particle trajectory is estimated independently of the others, they show surprisingly accurate grouping. Notice that the background particles all move together.

What’s next?

Our method upgrades pixels into PIPs: “Persistent Independent Particles.” This independence assumption, however, is probably not what we want in general. In ongoing work, we are trying to incorporate context across particles, so that confident particles can help the unconfident ones, and so that we track at multiple levels of granularity simultaneously.

We have released our code and model weights on GitHub. We encourage you to try our demo.py! If you are interested in building on our method, the provided tests, visualizations, and training scripts should make that easy. Or, if you are working on a video-based method that currently relies on optical flow, you may want to try our PIP trajectories as a replacement, which should give a better signal under occlusions.

We hope that our work opens up long-range fine-grained tracking of “anything.” For more details, please check out our project page and paper.

Read More

Learning to Walk in the Wild from Terrain Semantics


An important promise for quadrupedal robots is their potential to operate in complex outdoor environments that are difficult or inaccessible for humans. Whether it’s to find natural resources deep in the mountains, or to search for life signals in heavily-damaged earthquake sites, a robust and versatile quadrupedal robot could be very helpful. To achieve that, a robot needs to perceive the environment, understand its locomotion challenges, and adapt its locomotion skill accordingly. While recent advances in perceptive locomotion have greatly enhanced the capability of quadrupedal robots, most works focus on indoor or urban environments, thus they cannot effectively handle the complexity of off-road terrains. In these environments, the robot needs to understand not only the terrain shape (e.g., slope angle, smoothness), but also its contact properties (e.g., friction, restitution, deformability), which are important for a robot to decide its locomotion skills. As existing perceptive locomotion systems mostly focus on the use of depth cameras or LiDARs, it can be difficult for these systems to estimate such terrain properties accurately.

In “Learning Semantics-Aware Locomotion Skills from Human Demonstrations”, we design a hierarchical learning framework to improve a robot’s ability to traverse complex, off-road environments. Unlike previous approaches that focus on environment geometry, such as terrain shape and obstacle locations, we focus on environment semantics, such as terrain type (grass, mud, etc.) and contact properties, which provide a complementary set of information useful for off-road environments. As the robot walks, the framework decides the locomotion skill, including the speed and gait (i.e., shape and timing of the legs’ movement) of the robot based on the perceived semantics, which allows the robot to walk robustly on a variety of off-road terrains, including rocks, pebbles, deep grass, mud, and more.

Our framework selects skills (gait and speed) of the robot from the camera RGB image. We first compute the speed from terrain semantics, and then select a gait based on the speed.

Overview
The hierarchical framework consists of a high-level skill policy and a low-level motor controller. The skill policy selects a locomotion skill based on camera images, and the motor controller converts the selected skill into motor commands. The high-level skill policy is further decomposed into a learned speed policy and a heuristic-based gait selector. To decide a skill, the speed policy first computes the desired forward speed, based on the semantic information from the onboard RGB camera. For energy efficiency and robustness, quadrupedal robots usually select a different gait for each speed, so we designed the gait selector to compute a desired gait based on the forward speed. Lastly, a low-level convex model-predictive controller (MPC) converts the desired locomotion skill into motor torque commands, and executes them on the real hardware. We train the speed policy directly in the real world using imitation learning because it requires fewer training data compared to standard reinforcement learning algorithms.

The framework consists of a high-level skill policy and a low-level motor controller.

Learning Speed Command from Human Demonstrations
As the central component in our pipeline, the speed policy outputs the desired forward speed of the robot based on the RGB image from the onboard camera. Although many robot learning tasks can leverage simulation as a source of lower-cost data collection, we train the speed policy in the real world because accurate simulation of complex and diverse off-road environments is not yet available. As policy learning in the real world is time-consuming and potentially unsafe, we make two key design choices to improve the data efficiency and safety of our system.

The first is learning from human demonstrations. Standard reinforcement learning algorithms typically learn by exploration, where the agent attempts different actions in an environment and builds preferences based on the rewards received. However, such explorations can be potentially unsafe, especially in off-road environments, since any robot failures can damage both the robot hardware and the surrounding environment. To ensure safety, we train the speed policy using imitation learning from human demonstrations. We first ask a human operator to teleoperate the robot on a variety of off-road terrains, where the operator controls the speed and heading of the robot using a remote joystick. Next, we collect the training data by storing (image, forward_speed) pairs. We then train the speed policy using standard supervised learning to predict the human operator’s speed command. As it turns out, the human demonstration is both safe and high-quality, and allows the robot to learn a proper speed choice for different terrains.

The second key design choice is the training method. Deep neural networks, especially those involving high-dimensional visual inputs, typically require lots of data to train. To reduce the amount of real-world training data required, we first pre-train a semantic segmentation model on RUGD (an off-road driving dataset where the images look similar to those captured by the robot’s onboard camera), where the model predicts the semantic class (grass, mud, etc.) for every pixel in the camera image. We then extract a semantic embedding from the model’s intermediate layers and use that as the feature for on-robot training. With the pre-trained semantic embedding, we can train the speed policy effectively using less than 30 minutes of real-world data, which greatly reduces the amount of effort required.

We pre-train a semantic segmentation model and extract a semantic embedding to be fine-tuned on robot data.
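For intuition, the imitation-learning step reduces to supervised regression from the semantic embedding to the operator’s speed command. The following illustrative sketch uses made-up shapes and data, not the paper’s code:

import torch
import torch.nn as nn

# Toy (embedding, speed) pairs standing in for the teleoperated demonstration dataset
embeddings = torch.randn(512, 128)   # pre-trained semantic embeddings, one per camera image
speeds = torch.rand(512, 1) * 1.5    # human speed commands in m/s

speed_policy = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(speed_policy.parameters(), lr=1e-3)

# Plain behavior cloning: regress the demonstrated speed from the semantic embedding
for epoch in range(20):
    prediction = speed_policy(embeddings)
    loss = nn.functional.mse_loss(prediction, speeds)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()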

Gait Selection and Motor Control
The next component in the pipeline, the gait selector, computes the appropriate gait based on the speed command from the speed policy. The gait of a robot, including its stepping frequency, swing height, and base height, can greatly affect the robot’s ability to traverse different terrains.

Scientific studies have shown that animals switch between different gaits at different speeds, and this result is further validated in quadrupedal robots, so we designed the gait selector to compute a robust gait for each speed. Compared to using a fixed gait across all speeds, we find that the gait selector further enhances the robot’s navigation performance on off-road terrains (more details in the paper).
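As a rough illustration, the gait selector can be thought of as a speed-to-gait mapping like the following. The thresholds and gait parameters here are made up for illustration and are not the values used on the robot:

def select_gait(speed_mps: float) -> dict:
    """Map a commanded forward speed to gait parameters (illustrative values only)."""
    if speed_mps < 0.6:   # slow, careful walking, e.g., deep grass or mud
        return {"stepping_frequency_hz": 2.0, "swing_height_m": 0.12, "base_height_m": 0.26}
    if speed_mps < 1.2:   # medium speed, e.g., gravel or pebbles
        return {"stepping_frequency_hz": 2.5, "swing_height_m": 0.09, "base_height_m": 0.28}
    return {"stepping_frequency_hz": 3.0, "swing_height_m": 0.06, "base_height_m": 0.30}  # fast, flat terrain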

The last component of the pipeline is a motor controller, which converts the speed and gait commands into motor torques. Similar to previous work, we use separate control strategies for swing and stance legs. By separating the task of skill learning and motor control, the skill policy only needs to output the desired speed, and does not need to learn low-level locomotion controls, which greatly simplifies the learning process.

Experiment Results
We implemented our framework on an A1 quadrupedal robot and tested it on an outdoor trail with multiple terrain types, including grass, gravel, and asphalt, which pose varying degrees of difficulty for the robot. For example, while the robot needs to walk slowly with high foot swings in deep grass to prevent its foot from getting stuck, on asphalt it can walk much faster with lower foot swings for better energy efficiency. Our framework captures such differences and selects an appropriate skill for each terrain type: slow speed (0.5m/s) on deep grass, medium speed (1m/s) on gravel, and high speed (1.4m/s) on asphalt. It completes the 460m-long trail in 9.6 minutes with an average speed of 0.8m/s (i.e., that’s 1.8 miles or 2.9 kilometers per hour). In contrast, non-adaptive policies either cannot complete the trail safely or walk significantly slower (0.5m/s), illustrating the importance of adapting locomotion skills based on the perceived environments.

The framework selects different speeds based on conditions of the trail.

To test generalizability, we also deployed the robot to a number of trails that are not seen during training. The robot traverses through all of them without failure, and adjusts its locomotion skills based on terrain semantics. In general, the skill policy selects a faster skill on rigid and flat terrains and a slower speed on deformable or uneven terrain. At the time of writing, the robot has traversed over 6km of outdoor trails without failure.

With the framework, the robot walks safely on a variety of outdoor terrains not seen during training.

Conclusion
In this work, we present a hierarchical framework to learn semantic-aware locomotion skills for off-road locomotion. Using less than 30 minutes of human demonstration data, the framework learns to adjust the speed and gait of the robot based on the perceived semantics of the environment. The robot can walk safely and efficiently on a wide variety of off-road terrains. One limitation of our framework is that it only adjusts locomotion skills for standard walking and does not support more agile behaviors such as jumping, which can be essential for traversing more difficult terrains with gaps or hurdles. Another limitation is that our framework currently requires manual steering commands to follow a desired path and reach the goal. In future work, we plan to look into a deeper integration of high-level skill policy with the low-level controller for more agile behaviors, and incorporate navigation and path planning into the framework so that the robot can operate fully autonomously in challenging off-road environments.

Acknowledgements
We would like to thank our paper co-authors: Xiangyun Meng, Wenhao Yu, Tingnan Zhang, Jie Tan, and Byron Boots. We would also like to thank the team members of Robotics at Google for discussions and feedback.

Read More

Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference

The last few years have seen rapid development in the field of natural language processing (NLP). Although hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly encounter issues deploying their large language models. Today, we announce new capabilities in Amazon SageMaker that can help: you can configure the maximum Amazon EBS volume size and timeout quotas to facilitate large model inference. Coupled with model parallel inference techniques, you can now use the fully managed model deployment and management capabilities of SageMaker when working with large models with billions of parameters.

In this post, we demonstrate these new SageMaker capabilities by deploying a large, pre-trained NLP model from Hugging Face across multiple GPUs. In particular, we use the Deep Java Library (DJL) serving and tensor parallelism techniques from DeepSpeed to achieve under 0.1 second latency in a text generation use case with the 6-billion-parameter GPT-J model. A complete example will be available on our GitHub repository soon.

Large language models and the increasing necessity of model parallel inference

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340 million parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500 times, with models such as OpenAI’s 175-billion-parameter GPT-3 and the similarly sized open-source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically-demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge.

Large language models can be difficult to host for low-latency inference use cases because of their size. Typically, ML practitioners simply host a model (or even multiple models) within the memory of a single accelerator device that handles inference end to end on its own. However, large language models can be too big to fit within the memory of a single accelerator, so this paradigm can’t work. For example, open-source GPT-NeoX with 20 billion parameters can require more than 80 GB of accelerator memory, which is more than triple what is available on an NVIDIA A10G, a popular GPU for inference. Practitioners have a few options to work around this accelerator memory constraint. A simple but slow approach is to use CPU memory and stream model parameters sequentially to the accelerator. However, this introduces a communication bottleneck between the CPU and GPU, which can add seconds to inference latency and is therefore unsuitable for many use cases that require fast responses. Another approach is to optimize or compress the model so that it can fit on a single device. Practitioners must implement complex techniques such as quantization, pruning, distillation, and others to reduce the memory requirements. This approach requires a lot of time and expertise and can also reduce the accuracy and generalization of a model, which can also be a non-starter for many use cases.

A third option is to use model parallelism. With model parallelism, the parameters and layers of a model are partitioned and then spread across multiple accelerators. This approach allows practitioners to take advantage of both the memory and processing power of multiple accelerators at once and can deliver low-latency inference without impacting the accuracy of the model. Model parallelism is already a popular technique in training (see Introduction to Model Parallelism) and is increasingly becoming used in inference as practitioners require low-latency responses from large models.

There are two general types of model parallelism: pipeline parallelism and tensor parallelism. Pipeline parallelism splits a model between layers, so that any given layer is contained within the memory of a single GPU. In contrast, tensor parallelism splits layers such that a model layer is spread out across multiple GPUs. Both of these model parallel techniques are used in training (often together), but tensor parallelism can be a better choice for inference because batch size is often one with inference. When batch size is one, only tensor parallelism can take advantage of multiple GPUs at once when processing the forward pass to improve latency.
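As a toy illustration of the difference (on CPU tensors, not DeepSpeed’s implementation), consider a two-layer model whose layers are either placed on separate devices or split column-wise across them:

import torch

x = torch.randn(1, 8)                            # a single inference request (batch size 1)
w1, w2 = torch.randn(8, 16), torch.randn(16, 4)  # weights of a toy two-layer model

# Pipeline parallelism: each layer lives on its own device and the request passes
# through them sequentially, so only one device is busy at a time when batch size is 1
h = x @ w1                                       # would run on GPU 0
y_pipeline = h @ w2                              # would run on GPU 1

# Tensor parallelism: each layer is split across devices, so both devices work on
# the same layer at once, which helps latency even at batch size 1
w1_a, w1_b = w1.chunk(2, dim=1)                  # column split of layer 1 across GPU 0 and GPU 1
h_parallel = torch.cat([x @ w1_a, x @ w1_b], dim=1)
assert torch.allclose(h, h_parallel, atol=1e-5)  # same result as the unsplit layer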

In this post, we use DeepSpeed to partition the model using tensor parallelism techniques. DeepSpeed Inference supports large Transformer-based models with billions of parameters. It allows you to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU inference, accounting for both inference latency and cost. For more information, refer to DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression and DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.

Solution overview

The Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. The DJL is built with native Java concepts on top of existing deep learning frameworks. The DJL is designed to be deep learning engine agnostic: you can switch engines at any point. The DJL also provides automatic CPU/GPU selection based on hardware configuration.

Although the DJL was originally designed for Java developers getting started with ML, DJLServing is a high-performance universal model serving solution powered by the DJL that is programming language agnostic. It can serve the commonly seen model types, such as the PyTorch TorchScript model, TensorFlow SavedModel bundle, Apache MXNet model, ONNX model, TensorRT model, and Python script model. DJLServing supports dynamic batching and worker auto scaling to increase throughput. You can load different versions of a model on a single endpoint. You can also serve models from different ML frameworks at the same time. What’s more, DJLServing natively supports multi-GPU setups by configuring MPI and socket connections for inference. This frees you from the heavy lifting of setting up a multi-GPU environment.

Our proposed solution uses the newly announced SageMaker capabilities, DJLServing and DeepSpeed Inference, for large model inference. As of this writing, all Transformer-based models are supported. This solution is intended for parallel model inference using a single model on a single instance.

DJLServing is built with multiple layers. The routing layer is built on top of Netty. Remote requests are handled in the routing layer and distributed to workers (either threads in Java or processes in Python) to run inference. The total number of Java threads is set to twice the number of CPU cores on the machine to make full use of the computing power. The number of workers can be configured per model or determined by the DJL’s automatic hardware detection. The following diagram illustrates our architecture.

Inference large models on SageMaker

The following steps demonstrate how to deploy a gpt-j-6B model in SageMaker using DJL serving. This is made possible by the capability to configure the EBS volume size, model download timeout, and startup health check timeout. You can try out this demo by running the following notebook.

Pull the Docker image and push to Amazon ECR

The Docker image djl-serving:0.18.0-deepspeed is our DJL serving container with DeepSpeed incorporated. We then push this image to Amazon Elastic Container Registry (Amazon ECR) for later use. See the following code:

docker pull deepjavalibrary/djl-serving:0.18.0-deepspeed

Create our model file

First, we create a file called serving.properties that contains only one line of code. This tells the DJL model server to use the Rubikon engine. Rubikon is an AWS-developed package for large model support. In this demo, it facilitates the MPI thread setup and socket connections. It also sets the number of GPUs (the model slicing number) based on the TENSOR_PARALLEL_DEGREE parameter used in our model.py file, shown in the next section. The file contains the following code:

engine=Rubikon

Next, we create our model.py file, which defines our model as gpt-j-6B. In our code, we read in the TENSOR_PARALLEL_DEGREE environment variable (the default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. Note that DeepSpeed provides built-in partitioning logic for a number of model types, and gpt-j-6B is one of them. We use it by specifying replace_method and replace_with_kernel_inject. If you have a customized model and need DeepSpeed to partition it effectively, you need to change replace_with_kernel_inject to false and add injection_policy to make the runtime partition work. For more information, refer to Initializing for Inference.

from djl_python import Input, Output
import os
import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

predictor = None

def get_model():
    model_name = 'EleutherAI/gpt-j-6B'
    tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '1'))
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    model = AutoModelForCausalLM.from_pretrained(model_name, revision="float32", torch_dtype=torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    model = deepspeed.init_inference(model,
                                     mp_size=tensor_parallel,
                                     dtype=model.dtype,
                                     replace_method='auto',
                                     replace_with_kernel_inject=True)
    generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
    return generator


def handle(inputs: Input) -> None:
    global predictor
    if not predictor:
        predictor = get_model()

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_string()
    result = predictor(data, do_sample=True, min_length=200, max_new_tokens=256)
    return Output().add(result)

We create a directory called gpt-j and copy model.py and serving.properties to this directory:

mkdir gpt-j
cp model.py gpt-j
cp serving.properties gpt-j

Lastly, we create the model archive and upload it to Amazon Simple Storage Service (Amazon S3):

tar cvfz gpt-j.tar.gz gpt-j
aws s3 cp gpt-j.tar.gz s3://djl-sm-test/deepspeed/

Create a SageMaker model

We now create a SageMaker model. We use the ECR image we created earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure TENSOR_PARALLEL_DEGREE=2, which means the model will be partitioned across 2 GPUs. See the following code:

aws sagemaker create-model \
--model-name gpt-j \
--primary-container \
Image=<account_id>.dkr.ecr.us-east-1.amazonaws.com/djl-deepspeed:latest,ModelDataUrl=s3://djl-sm-test/deepspeed/gpt-j.tar.gz,Environment={TENSOR_PARALLEL_DEGREE=2} \
--execution-role-arn <IAM_role_arn>

After running the preceding command, you see output similar to the following:

{
    "ModelArn": "arn:aws:sagemaker:us-east-1:<account_id>:model/gpt-j"
}

Create a SageMaker endpoint

You can use any instance type with multiple GPUs for testing. In this demo, we use a p3.16xlarge instance. In the following code, note how we set the ModelDataDownloadTimeoutInSeconds, ContainerStartupHealthCheckTimeoutInSeconds, and VolumeSizeInGB parameters to accommodate the large model size. The VolumeSizeInGB parameter is applicable to GPU instances that support EBS volume attachment.

aws sagemaker create-endpoint-config \
    --region us-east-1 \
    --endpoint-config-name gpt-j-config \
    --production-variants '[
      {
        "ModelName": "gpt-j",
        "VariantName": "AllTraffic",
        "InstanceType": "ml.p3.16xlarge",
        "InitialInstanceCount": 1,
        "VolumeSizeInGB": 256,
        "ModelDataDownloadTimeoutInSeconds": 1800,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600
        }
    ]'

Lastly, we create a SageMaker endpoint:

aws sagemaker create-endpoint \
--endpoint-name gpt-j \
--endpoint-config-name gpt-j-config

The command returns output similar to the following:

{
    "EndpointArn": "arn:aws:sagemaker:us-east-1:<aws-account-id>:endpoint/gpt-j"
}

Starting the endpoint might take a while. If you run into an InsufficientInstanceCapacity error, you can retry a few times.
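
Instead of checking the status manually, you can wait for the endpoint programmatically and then send a test request. The following boto3 sketch assumes the endpoint name from the preceding steps and sends the payload as plain text, because our model.py handler reads the request body as a string; the prompt and content type are illustrative choices.

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Poll DescribeEndpoint until the endpoint reaches InService
sm.get_waiter("endpoint_in_service").wait(EndpointName="gpt-j")

# Send a text prompt; the handler passes it to the text-generation pipeline
response = runtime.invoke_endpoint(
    EndpointName="gpt-j",
    ContentType="text/plain",
    Body="Deploying large models on SageMaker is",
)
print(response["Body"].read().decode("utf-8"))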

Performance tuning

Performance tuning and optimization is an empirical process that often involves multiple iterations. The number of possible configurations is combinatorial, and the configuration parameter values aren’t independent of each other. Various factors affect optimal parameter tuning, including payload size and type, the number of ML models in the inference request flow graph, storage type, compute instance type, network infrastructure, application code, inference serving software runtime and configuration, and more.

SageMaker real-time inference is ideal for inference workloads where you have real-time, interactive, low-latency requirements. The following are the four most commonly used metrics for monitoring inference request latency for SageMaker inference endpoints:

  • Container latency – The time it takes to send the request, fetch the response from the model’s container, and complete inference in the container. This metric is available in Amazon CloudWatch as part of the invocation metrics published by SageMaker.
  • Model latency – The total time taken by all SageMaker containers in an inference pipeline. This metric is available in CloudWatch as part of the invocation metrics published by SageMaker.
  • Overhead latency – Measured from the time that SageMaker receives the request until it returns a response to the client, minus the model latency. This metric is available in CloudWatch as part of the invocation metrics published by SageMaker.
  • End-to-end latency – Measured from the time the client sends the inference request until it receives a response back. You can publish this as a custom metric in CloudWatch, as shown in the sketch after this list.
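
The following is a minimal sketch of publishing end-to-end latency as a custom CloudWatch metric from the client side; the namespace and dimension names are hypothetical choices, not SageMaker conventions, and the endpoint name is the one used in this example.

import time
import boto3

runtime = boto3.client("sagemaker-runtime")
cloudwatch = boto3.client("cloudwatch")

# Measure the full round trip as seen by the client
start = time.time()
runtime.invoke_endpoint(EndpointName="gpt-j", ContentType="text/plain", Body="Hello")
elapsed_ms = (time.time() - start) * 1000

# Publish the measurement so it can be graphed and alarmed on in CloudWatch
cloudwatch.put_metric_data(
    Namespace="MyApp/Inference",  # hypothetical namespace
    MetricData=[{
        "MetricName": "EndToEndLatency",
        "Dimensions": [{"Name": "EndpointName", "Value": "gpt-j"}],
        "Value": elapsed_ms,
        "Unit": "Milliseconds",
    }],
)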

Container latency depends on several factors; the following are among the most important:

  • Underlying protocol (HTTP(s)/gRPC) used to communicate with the inference server
  • Overhead related to creating new TLS connections
  • Deserialization time of the request/response payload
  • Request queuing and batching features provided by the underlying inference server
  • Request scheduling capabilities provided by the underlying inference server
  • Underlying runtime performance of the inference server
  • Performance of preprocessing and postprocessing libraries before calling the model prediction function
  • Underlying ML framework backend performance
  • Model-specific and hardware-specific optimizations

In this section, we focus primarily on container latency and specifically on optimizing DJLServing running inside a SageMaker container.

Tune the ML engine for multi-threaded inference

One of the advantages of DJL is multi-threaded inference support. It can help increase the throughput of your inference on multi-core CPUs and GPUs and reduce memory consumption compared to Python. Refer to Inference Performance Optimization for more information about optimizing the number of threads for different engines.

Tune Netty

DJLServing is built with multiple layers. The routing layer is built on top of Netty, a NIO client-server framework that enables quick and easy development of network applications such as protocol servers and clients. In Netty, Channel is the main container; it contains a ChannelPipeline and is associated with an EventLoop (a container for a thread) from an EventLoopGroup. An EventLoop is essentially an I/O thread and may be shared by multiple channels. ChannelHandlers run on these EventLoop threads. This simple threading model means that you don’t need to worry about concurrency issues while your ChannelHandlers run; a single pass through your pipeline is always run sequentially on the same thread. DJLServing uses Netty’s EpollEventLoopGroup on Linux. By default, the total number of Netty threads is set to 2 * the number of virtual CPUs on the machine to make full use of the computing power. Furthermore, because you don’t create large numbers of threads, your CPU isn’t overburdened by context switching. This default setting works fine in most cases; however, if you want to set the number of Netty threads for processing incoming requests, you can do so by setting the SERVING_NUMBER_OF_NETTY_THREADS environment variable.
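
On a SageMaker endpoint, one way to pass this variable is through the model's container environment. The following boto3 sketch is illustrative only: it reuses the image, artifact, and role placeholders from earlier, and the model name and thread count are hypothetical values you would tune for your own workload.

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_model(
    ModelName="gpt-j-netty-tuned",  # hypothetical model name
    PrimaryContainer={
        "Image": "<account_id>.dkr.ecr.us-east-1.amazonaws.com/djl-deepspeed:latest",
        "ModelDataUrl": "s3://djl-sm-test/deepspeed/gpt-j.tar.gz",
        "Environment": {
            "TENSOR_PARALLEL_DEGREE": "2",
            "SERVING_NUMBER_OF_NETTY_THREADS": "8",  # override the 2 * vCPU default
        },
    },
    ExecutionRoleArn="<IAM_role_arn>",
)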

Tune workload management (WLM) of DJLServing

DJLServing has a WorkLoadManager, which is responsible for managing the workload of the worker threads. It manages the thread pools and job queues, and scales the number of worker threads per ML model up or down. It has auto scaling, which adds an inference job to the job queue of the next free worker and scales up the worker thread pool for that specific model if necessary. The scaling is primarily based on the job queue depth of the model, the batch size, and the current number of worker threads in the pool. The job_queue_size controls the number of inference jobs that can be queued up at any point in time. By default, it is set to 100. If you have higher concurrency needs per model serving instance, you can increase the job_queue_size, thread pool size, and minimum or maximum number of worker threads for a particular model by setting the properties in serving.properties, as shown in the following example code:

serving.properties
# use minWorkers/maxWorkers for all devices
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4

As of this writing, you can’t configure job_queue_size in serving.properties. The default value of job_queue_size is controlled by an environment variable, and you can only configure the per-model setting with the registerModel API.

Many practitioners tend to run inference sequentially when the server is invoked with multiple independent requests. Although easier to set up, this usually doesn’t make the best use of the GPU’s compute power. To address this, DJLServing offers built-in dynamic batching optimizations that combine these independent inference requests on the server side into a larger batch dynamically to increase throughput.

All the requests reach the dynamic batcher first before entering the actual job queues to wait for inference. You can set your preferred batch sizes for dynamic batching using the batch_size settings in serving.properties. You can also configure max_batch_delay to specify the maximum delay time in the batcher to wait for other requests to join the batch based on your latency requirements.

You can fine-tune the following parameters to increase the throughput per model:

  • batch_size – The inference batch size. The default value is 1.
  • max_batch_delay – The maximum delay for batch aggregation. The default value is 100 milliseconds.
  • max_idle_time – The maximum idle time before the worker thread is scaled down.
  • min_worker – The minimum number of worker processes. For the DJL DeepSpeed engine, min_worker is set to the number of GPUs divided by TENSOR_PARALLEL_DEGREE.
  • max_worker – The maximum number of worker processes. For the DJL DeepSpeed engine, max_worker is set to the number of GPUs divided by TENSOR_PARALLEL_DEGREE.

Tune degree of tensor parallelism

For large models that don’t fit in the memory of a single accelerator device, the number of Python processes is determined by the total number of accelerator devices on the host. The tensor_parallel_degree setting slices the model and distributes it across multiple accelerator devices. In this case, even if a model is too large to host on a single accelerator, it can still be handled by DJLServing and run on multiple accelerator devices by partitioning the model. Internally, DJLServing creates multiple MPI processes (equal to tensor_parallel_degree) to manage the slice of the model on each accelerator device.

You can set the number of partitions for your model by setting the TENSOR_PARALLEL_DEGREE environment variable. Note that this configuration is a global setting and applies to all the models on the host. If TENSOR_PARALLEL_DEGREE is less than the total number of accelerator devices (GPUs), DJLServing launches a number of Python process groups equal to the total number of GPUs divided by TENSOR_PARALLEL_DEGREE. Each Python process group consists of a number of Python processes equal to TENSOR_PARALLEL_DEGREE, and each process group holds a full copy of the model.
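
As a quick worked example of this arithmetic (an illustration only, not DJLServing code), the following snippet shows how the process groups work out on an 8-GPU host such as the ml.p3.16xlarge used earlier with TENSOR_PARALLEL_DEGREE=2.

num_gpus = 8                 # accelerator devices on the host
tensor_parallel_degree = 2   # value of TENSOR_PARALLEL_DEGREE

process_groups = num_gpus // tensor_parallel_degree   # 4 groups, each holding a full model copy
processes_per_group = tensor_parallel_degree          # 2 processes per group, one per model slice

print(f"{process_groups} process groups x {processes_per_group} processes each "
      f"= {process_groups * processes_per_group} Python processes in total")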

Summary

In this post, we showcased the newly launched SageMaker capability that allows you to configure inference instance EBS volumes, the model downloading timeout, and the container startup timeout. We demonstrated this new capability by deploying a large model on SageMaker. We also covered options available to tune the performance of DJLServing. For more details about SageMaker and this newly launched capability, refer to the Amazon SageMaker documentation.


About the authors

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Qing Lan is a Software Development Engineer at AWS. He has been working on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Robert Van Dusen is a Senior Product Manager at AWS.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.


RGB-X Classification for Electronics Sorting

Effectively disassembling and recovering materials from waste electrical and electronic equipment (WEEE) is a critical step in moving global supply chains from carbon-intensive, mined materials to recycled and renewable ones. Conventional recycling processes rely on shredding and sorting waste streams, but for WEEE, which is comprised of numerous dissimilar materials, we explore targeted disassembly of numerous objects for improved material recovery. Many WEEE objects share many key features and therefore can look quite similar, but their material composition and internal component layout can… (Apple Machine Learning Research)

Tips to improve your Amazon Rekognition Custom Labels model

In this post, we discuss best practices to improve the performance of your computer vision models using Amazon Rekognition Custom Labels. Rekognition Custom Labels is a fully managed service to build custom computer vision models for image classification and object detection use cases. Rekognition Custom Labels builds on the pre-trained models in Amazon Rekognition, which are already trained on tens of millions of images across many categories. Instead of thousands of images, you can get started with a small set of training images (a few hundred or fewer) that are specific to your use case. Rekognition Custom Labels abstracts away the complexity involved in building a custom model. It automatically inspects the training data, selects the right ML algorithms, selects the instance type, trains multiple candidate models with various hyperparameter settings, and outputs the best trained model. Rekognition Custom Labels also provides an easy-to-use interface from the AWS Management Console for managing the entire ML workflow, including labeling images, training the model, deploying the model, and visualizing the test results.

There are times when a model’s accuracy isn’t the best, and you don’t have many options to adjust the model’s configuration parameters. Behind the scenes, there are multiple factors that play a key role in building a high-performing model, such as the following:

  • Picture angle
  • Image resolution
  • Image aspect ratio
  • Light exposure
  • Clarity and vividness of background
  • Color contrast
  • Sample data size

The following are the general steps to be followed to train a production-grade Rekognition Custom Labels model:

  1. Review the taxonomy – This defines the list of attributes or items that you want to identify in an image.
  2. Collect relevant data – This is the most important step, where you need to collect images that resemble what you would see in a production environment. This could involve images of objects with varying backgrounds, lighting, or camera angles. You then create training and testing datasets by splitting the collected images. You should only include real-world images in the testing dataset, and shouldn’t include any synthetically generated images. Annotations of the data you collected are crucial for model performance. Make sure the bounding boxes are tight around the objects and the labels are accurate. We discuss some tips that you can consider when building an appropriate dataset later in this post.
  3. Review training metrics – Use the preceding datasets to train a model and review the training metrics for F1 score, precision, and recall. We discuss how to analyze the training metrics in detail later in this post.
  4. Evaluate the trained model – Use a set of unseen images (not used for training the model) with known labels to evaluate the predictions. This step should always be performed to make sure that the model performs as expected in a production environment.
  5. Retrain (optional) – In general, training any machine learning model is an iterative process to achieve the desired results, and a computer vision model is no different. Review the results in Step 4 to see if more images need to be added to the training data, and repeat Steps 3–5.

In this post, we focus on the best practices around collecting relevant data (Step 2) and reviewing your training metrics (Step 3) to improve your model performance.

Collect relevant data

This is the most critical stage of training a production-grade Rekognition Custom Labels model. Specifically, there are two datasets: training and testing. Training data is used for training the model, and you need to spend the effort building an appropriate training set. Rekognition Custom Labels models are optimized for F1 score on the testing dataset to select the most accurate model for your project. Therefore, it’s essential to curate a testing dataset that resembles the real world.

Number of images

We recommend having a minimum of 15–20 images per label. Having more images with more variations that reflect your use case will improve the model performance.

Balanced dataset

Ideally, each label in the dataset should have a similar number of samples. There shouldn’t be a massive disparity in the number of images per label. For example, a dataset where one label has 1,000 images and another label has only 50 images is imbalanced. We recommend avoiding a lopsided ratio, such as 1:50, between the label with the fewest images and the label with the most images.

Varying types of images

Include images in the training and test datasets that resemble what you will be using in the real world. For example, if you want to classify images of living rooms vs. bedrooms, you should include both empty and furnished images of both rooms.

The following is an example image of a furnished living room.

In contrast, the following is an example of an unfurnished living room.

The following is an example image of a furnished bedroom.

The following is an example image of an unfurnished bedroom.

Varying backgrounds

Include images with different backgrounds. Images with natural context can provide better results than a plain background.

The following is an example image of the front yard of a house.

The following is an example image of the front yard of a different house with a different background.

Varying lighting conditions

Include images with varying lighting so that it covers the different lighting conditions that occur during inference (for example, with and without flash). You can also include images with varying saturation, hue, and brightness.

The following is an example image of a flower under normal light.

In contrast, the following image is of the same flower under bright light.

Varying angles

Include images taken from various angles of the object. This helps the model learn different characteristics of the objects.

The following images are of the same bedroom from different angles.


There could be occasions where it’s not possible to acquire images of varying types. In those scenarios, synthetic images can be generated as part of the training dataset. For more information about common image augmentation techniques, refer to Data Augmentation.

Add negative labels

For image classification, adding negative labels can help increase model accuracy. For example, you can add a negative label for images that don’t match any of the required labels. The following image represents the different labels used to identify fully grown flowers.

Adding the negative label not_fully_grown helps the model learn characteristics that aren’t part of the fully_grown label.

Handling label confusion

Analyze the results on the test dataset to recognize any patterns that are missed in the training or testing dataset. Sometimes it’s easy to spot such patterns by visually examining the images. In the following image, the model struggles to distinguish between the backyard and patio labels.

In this scenario, adding more images to these labels in the dataset and also redefining the labels so that each label is distinct can help increase the accuracy of the model.

Data augmentation

Inside Rekognition Custom Labels, we perform various data augmentations for model training, including random cropping of the image, color jittering, random Gaussian noises, and more. Based on your specific use cases, it might also be beneficial to add more explicit data augmentations to your training data. For example, if you’re interested in detecting animals in both color and black and white images, you could potentially get better accuracy by adding black and white and color versions of the same images to the training data.
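
For example, a simple way to add black and white copies of your color training images is sketched below using Pillow; the folder names are hypothetical, and you would include the augmented copies alongside the originals when you build the training dataset.

from pathlib import Path
from PIL import Image

src_dir = Path("training_images")      # hypothetical input folder of color images
dst_dir = Path("training_images_bw")   # hypothetical output folder
dst_dir.mkdir(exist_ok=True)

# Save a grayscale copy of every color training image
for path in src_dir.glob("*.jpg"):
    Image.open(path).convert("L").save(dst_dir / f"bw_{path.name}")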

We don’t recommend augmentations on testing data unless the augmentations reflect your production use cases.

Review training metrics

F1 score, precision, recall, and assumed threshold are the metrics that are generated as an output of training a model using Rekognition Custom Labels. The models are optimized for the best F1 score based on the testing dataset that is provided. The assumed threshold is also generated based on the testing dataset. You can adjust the threshold based on your business requirement in terms of precision or recall.
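
At inference time, you can move along the precision/recall trade-off per request by overriding the assumed threshold. The following boto3 sketch is illustrative; the project version ARN, bucket, object key, and threshold value are placeholders you would replace with your own.

import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:111122223333:project/my-project/version/my-project.2022-01-01T00.00.00/1234567890123",
    Image={"S3Object": {"Bucket": "my-images-bucket", "Name": "test/flower.jpg"}},
    MinConfidence=75,  # raise for higher precision, lower for higher recall
)
for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"])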

Because the assumed thresholds are set on the testing dataset, an appropriate test set should reflect the real-world production use case. If the test dataset isn’t representative of the use case, you may see artificially high F1 scores and poor model performance on your real-world images.

These metrics are helpful when performing an initial evaluation of the model. For a production-grade system, we recommend evaluating the model against an external dataset (500–1,000 unseen images) representative of the real world. This helps evaluate how the model would perform in a production system and also identify any missing patterns and correct them by retraining the model. If you see a mismatch between F1 scores and external evaluation, we suggest you examine whether your test data is reflecting the real-world use case.

Conclusion

In this post, we walked you through the best practices for improving Rekognition Custom Labels models. We encourage you to learn more about Rekognition Custom Labels and try it out for your business-specific datasets.


About the authors

Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Yogesh Chaturvedi is a Solutions Architect at AWS with a focus in computer vision. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys hiking, traveling, and watching sports.

Hao Yang is a Senior Applied Scientist at the Amazon Rekognition Custom Labels team. His main research interests are object detection and learning with limited annotations. Outside works, Hao enjoys watching films, photography, and outdoor activities.

Pashmeen Mistry is the Senior Product Manager for Amazon Rekognition Custom Labels. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.


Use ADFS OIDC as the IdP for an Amazon SageMaker Ground Truth private workforce

To train a machine learning (ML) model, you need a large, high-quality, labeled dataset. Amazon SageMaker Ground Truth helps you build high-quality training datasets for your ML models. With Ground Truth, you can use workers from either Amazon Mechanical Turk, a vendor company of your choosing, or an internal, private workforce to enable you to create a labeled dataset. You can use the labeled dataset output from Ground Truth to train your own models. You can also use the output as a training dataset for an Amazon SageMaker model.

With Ground Truth, you can create a private workforce of employees or contractors to handle your data within your organization. This enables customers who want to keep their data within their organization to use a private workforce to support annotation workloads containing sensitive business data or personal identifiable information (PII) that can’t be handled by external parties. Alternately, if data annotation requires domain-specific subject matter expertise, you can use a private workforce to route tasks to employees, contractors, or third-party annotators with that specific domain knowledge. This workforce can be employees in your company or third-party workers who have domain and industry knowledge of your datasets. For example, if the task is to label medical images, you could create a private workforce of people knowledgeable about the images in question.

You can configure a private workforce to authenticate using OpenID Connect (OIDC) with your Identity Provider (IdP). In this post, we demonstrate how to configure OIDC with on-premises Active Directory using Active Directory Federation Service (ADFS). Once the configuration is set up, you can configure and manage work teams, track worker performance, and set up notifications when labeling tasks are available in Ground Truth.

Solution overview

When you use existing on-premises Active Directory credentials to authenticate your private workforce, you don’t need to worry about managing multiple identities in different environments. Workers use existing Active Directory credentials to federate to your labeling portal.

Prerequisites

Make sure you have the following prerequisites:

  • A registered public domain
  • An existing or newly deployed ADFS environment
  • An AWS Identity and Access Management (IAM) user with permissions to run SageMaker API operations

Additionally, make sure you use Ground Truth in a supported Region.

Configure Active Directory

The Ground Truth private workforce OIDC configuration requires sending a custom claim sagemaker:groups to Ground Truth from your IdP.

  1. Create an AD group named sagemaker (be sure to use all lower-case).
  2. Add the users that will form your private workforce to this group.

Configure ADFS

The next step is to configure an ADFS application with specific claims. Ground Truth uses the Issuer, ClientId, and ClientSecret, along with other optional claims from your IdP, to authenticate workers by obtaining an authentication code from the AuthorizationEndpoint configured in your IdP.

For more information about the claims your IdP sends to Ground Truth, refer to Send Required and Optional Claims to Ground Truth and Amazon A2I.

Create Application Group

To create your application group, complete the following steps:

  1. Open the ADFS Management Console.
  2. Change the ADFS Federation Service identifier from https://${HostName}/adfs/service/trust to https://${HostName}/adfs.
  3. Choose Application Group, right-click, and choose Add Application Group.
  4. Enter a name (for example, SageMaker Ground Truth Workforce) and description.
  5. Under Template, for Client-Server applications, choose Server application accessing a web API.
  6. Choose Next.
  7. Copy and save the client ID for future reference.
  8. For Redirect URI, use a placeholder such as https://privateworkforce.local.
  9. Choose Add, then choose Next.
  10. Select Generate a shared secret and save the generated value for later use, then choose Next.
  11. In the Configure Web API section, enter the client ID obtained earlier.
  12. Choose Add, then choose Next.
  13. Select Permit everyone under Access Control Policy, then choose Next.
  14. Under Permitted scopes, select openid, then choose Next.
  15. Review the configuration information, then choose Next and Close.

Configure claim descriptions

To configure your claim descriptions, complete the following steps:

  1. In the ADFS Management Console, expand Service Section.
  2. Right-click Claim Description and choose Add Claim Description.
  3. For Display name, enter SageMaker Client ID.
  4. For Short Name, enter sagemaker:client_id.
  5. For Claim identifier, enter sagemaker:client_id.
  6. Select the options to publish the claim to federation metadata for both accept and send.
  7. Choose OK.
  8. Repeat these steps for the remaining claim descriptions (Sagemaker Name, Sagemaker Sub, and Sagemaker Groups), as shown in the following screenshot.

Note that your claim identifier is listed as Claim Type.

Configure the application group claim rules

To configure your application group claim rules, complete the following steps:

  1. Choose Application Groups, then choose the application group you just created.
  2. Under Web API, choose the name shown, which opens the Web API properties.
  3. Choose the Issuance Transform Rules tab and choose Add Rule.
  4. Choose Transform an Incoming Claim and provide the following information:
    • For Claim rule name, enter sagemaker:client_id.
    • For Incoming claim type, choose OAuth Client Id.
    • For Outgoing claim type, choose the claim SageMaker Client ID.
    • Leave other values as default.
    • Choose Finish.
  5. Choose Add New Rule.
  6. Choose Transform an Incoming Claim and provide the following information:
    • For Claim rule name, enter sagemaker:sub.
    • For Incoming claim type, choose Primary SID.
    • For Outgoing claim type, choose the claim Sagemaker Sub.
    • Leave other values as default.
    • Choose Finish.
  7. Choose Add New Rule.
  8. Choose Transform an Incoming Claim and provide the following information:
    • For Claim rule name, enter sagemaker:name.
    • For Incoming claim type, choose Name.
    • For Outgoing claim type, choose the claim Sagemaker Name.
    • Leave other values as default.
    • Choose Finish.
  9. Choose Add New Rule.
  10. Choose Send Group Membership as a Claim and provide the following information:
    • For Claim rule name, enter sagemaker:groups.
    • For User’s group, choose the sagemaker AD group created earlier.
    • For Outgoing claim type, choose the claim Sagemaker Groups.
    • For Outgoing claim value, enter sagemaker.
    • Choose Finish.
  11. Choose Apply and OK.

You should have four rules, as shown in the following screenshot.

Create and configure an OIDC IdP workforce using the SageMaker API

In this step, you create a workforce from the AWS Command Line Interface (AWS CLI) using an IAM user or role with appropriate permissions.

  1. Run the following AWS CLI command to create a private workforce. The oidc-config parameter contains information you must obtain from the IdP. Provide the appropriate values that you obtained from your IdP:
    1. client_id is the client ID, and client_secret is the client secret you obtained when creating your application group.
    2. You can reconstruct AuthorizationEndpoint, TokenEndpoint, UserInfoEndpoint, LogoutEndpoint, and JwksUri by replacing only the sts.example.com portion with your ADFS endpoint.
      aws sagemaker create-workforce \
          --oidc-config 'ClientId=9b123069-0afc-56f2-a7ce-bd8e4dc705gh,ClientSecret=vtMG9fz_D9W2Y6u4t390wQ4o-hr8VsdHxD294FsD,Issuer=https://sts.example.com/adfs,AuthorizationEndpoint=https://sts.example.com/adfs/oauth2/authorize/,TokenEndpoint=https://sts.example.com/adfs/oauth2/token/,UserInfoEndpoint=https://sts.example.com/adfs/userinfo,LogoutEndpoint=https://sts.example.com/adfs/oauth2/logout,JwksUri=https://sts.example.com/adfs/discovery/keys' \
          --workforce-name privatewf \
          --region us-east-1

      The preceding command should successfully return the WorkforceArn. Save this output for reference later.

  2. Use the following code to describe the created workforce to get the SubDomain.
    We use this to configure the redirect URI in ADFS. After Ground Truth authenticates a worker, this URI redirects the worker to the worker portal where the workers can access labeling or human review tasks.

    aws sagemaker describe-workforce --workforce-name "privatewf" --region us-east-1

    {
    		"Workforce": {
    			"WorkforceName": "privatewf",
    			"WorkforceArn": "arn:aws:sagemaker:us-east-1:206400014001:workforce/privatewf",
    			"LastUpdatedDate": "2022-03-20T11:45:57.916000-07:00",
    			"SourceIpConfig": {
    				"Cidrs": []
    			},
    			"SubDomain": "drxxxxxlf0.labeling.us-east-1.sagemaker.aws",
    			"OidcConfig": {
    				"ClientId": "9b123069-0afc-56f2-a7ce-bd8e4dc705gh",
    				"Issuer": "https://sts.example.com/adfs",
    				"AuthorizationEndpoint": "https://sts.example.com/adfs/oauth2/authorize/",
    				"TokenEndpoint": "https://sts.example.com/adfs/oauth2/token/",
    				"UserInfoEndpoint": "https://sts.example.com/adfs/userinfo",
    				"LogoutEndpoint": "https://sts.example.com/adfs/oauth2/logout",
    				"JwksUri": "https://sts.example.com/adfs/discovery/keys“"
    			},
    			"CreateDate": "2022-03-20T11:45:57.916000-07:00"
    		}
    	}

  3. Copy the SubDomain and append /oauth2/idpresponse to the end. For example, it should look like https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse. You use this URL to update the redirect URI in ADFS.
  4. Choose the application you created earlier (SageMaker Ground Truth Workforce).
  5. Choose the name under Server application.
  6. Select the placeholder URL used earlier and choose Remove.
  7. Enter the appended SubDomain value.
  8. Choose Add.
  9. Choose OK twice.

Validate the OIDC IdP workforce authentication response

Now that you have configured OIDC with your IdP, it’s time to validate the authentication workflow using curl.

  1. Replace the placeholder values with your information, then enter the modified URI in your browser:
    https://sts.example.com/adfs/oauth2/authorize?client_id=9b123069-0afc-56f2-a7ce-bd8e4dc705gh&redirect_uri=https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse&scope=openid&response_type=code

    You should be prompted to log in with AD credentials. You may receive a 401 Authorization Required error.

  2. Copy the code parameter from the browser query and use it in a curl command as follows. The portion you need to copy starts with code=. Replace the code value with the code you copied. Also, don’t forget to change the values of url, client_id, client_secret, and redirect_uri:
    1. url is the token endpoint from ADFS.
    2. client_id is the client ID from the application group in ADFS.
    3. client_secret is the client secret from ADFS.

      curl -k --request POST \
          --url 'https://sts.example.com/adfs/oauth2/token/' \
          --header 'content-type: application/x-www-form-urlencoded' \
          --data grant_type=authorization_code \
          --data 'client_id=9b123069-0afc-56f2-a7ce-bd8e4dc705gh' \
          --data client_secret=vtMG9fz_D9W2Y6u4t390wQ4o-hr8VsdHxD294FsD \
          --data code=ZE-yvYF7GUmaFmAGAUdlcg.3Oy-_lPP2QgBAJxAW8uvXYgXojg.GXiaFggY5IdmrumD00cPkdjpABXTAG25YdXJxBr64HPwyl1WJDlcr1pqvURR1ZkBsBA1DxrloTQM4IGH1LcNVIzGcoynNm151leWXnIIP11JjOdl4Jt7tGyxyymll0c0IqfQcOk0w-oU9q2k-nx3jmAK4Pmw3D0Ghhm4jL6_15gBwvY4-mY6DVDg2sGQMELj-dNzfvMuMiLJQhX5XyUJcHjW69KX9xxnHfa3MCZbp2oF_41HBtMazPqKKC04TQPvTiAeMzUZ0-Z3IQhA9_mfv28JPdpGlPOxr8QM9vu9ANCbURimjPkmHA2Gm3df9QUbsIxEtQ-OuAPWlcg5MNbqGQ \
          --data 'redirect_uri=https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse'

  3. After making the appropriate modifications, copy the entire command and run it from a terminal.
    The output of the command is an access token in JWT format.
  4. Copy this output into a JWT decoder (for example, the Encoded box on jwt.io) and decode it.
    The decoded message should contain the required claims you configured. If the claims are present, proceed to the next step; if not, ensure you have followed all the steps outlined so far.
  5. Run the following command from a terminal after making the necessary modifications. Replace the Bearer value with the access_token from the preceding curl output, and replace the userinfo URL with your own.
    curl -X POST -H 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IkNLX2k1SEtOS1B2QVdGWnhCRkZ2T2NuVUhNQSIsImtpZCI6IkNLX2k1SEtOS1B2QVdGWnhCRkZ2T2NuVUhNQSJ9.eyJhdWQiOiJ1cm46bWljcm9zb2Z0OnVzZXJpbmZvIiwiaXNzIjoiaHR0cHM6Ly9mcy5hZC5nbWluZHByby5jb20vYWRmcyIsImlhdCI6MTY0OTE5NzYzMCwibmJmIjoxNjQ5MTk3NjMwLCJleHAiOjE2NDkyMDEyMzAsImFwcHR5cGUiOiJDb25maWRlbnRpYWwiLCJhcHBpZCI6IjBlZDQ0MDYzLTMzZDUtNGYxZi1hZTg4LTQ0OTgzZDRlN2E3MiIsImF1dGhtZXRob2QiOiJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6YWM6Y2xhc3NlczpQYXNzd29yZFByb3RlY3RlZFRyYW5zcG9ydCIsImF1dGhfdGltZSI6IjIwMjItMDQtMDVUMjI6Mjc6MDkuOTcxWiIsInZlciI6IjEuMCIsInNjcCI6Im9wZW5pZCIsInN1YiI6ImV2MTdTQkRUWnFXd2NXR0R1Z2s1OHRXQm4wYkRKbDBvYnAzbU9sL1hVUlk9In0.hsED4iUlQPgiiLaCyrKTKg3aKQjsKsLKPusPncRz3rNCSTp5xh8APDo33hhBx5JK-Ie2FG9Pa78dHdY_U2UtGBl2IHKmIfPcBTdkLGc1a8PlSQLvManCcEwzxAaO5J_jGdbt_P3qvy3cA6YCgNUwV3Ex9VTySLK1r-gLvnWE4zEiz_QytdlXvwFDIZi94YTgGf8b5uOQieM9pgJ0D9d-HOUw7-sKMBbZLqeYh_heNekwV3p3FQAIQyqifzl5qaftMR_J6lpOINHPtSPbl80MwHpmoDPHa0emWg6wuSZa7gpDbqDGHmuwQfbVhBdNLY8v9Nm4MA5RbSWQmqZmwG0GkA' -d '' -k -v https://sts.example.com/adfs/userinfo

The output from this command may look similar to the following code:

{
    "sub":"122",
    "exp":"10000",
    "sagemaker-groups":["privateworkforce"]
    "sagemaker-name":"name",
    "sagemaker-sub":"122",
    "sagemaker-client_id":"123456"
}

Now that you have successfully validated your OIDC configuration, it’s time to create the work teams.

Create a private work team

To create a private work team, complete the following steps:

  1. On the Ground Truth console, choose Labeling workforces.
  2. Select Private.
  3. In the Private teams section, select Create private team.
  4. In the Team details section, enter a team name.
  5. In the Add workers section, enter the name of a single user group.
    All workers associated with this group in your IdP are added to this work team.
  6. To add more than one user group, choose Add new user group and enter the names of the user groups you want to add to this work team. Enter one user group per line.
  7. Optionally, for Ground Truth labeling jobs, if you provide worker emails in your JWT and select an Amazon Simple Notification Service (Amazon SNS) topic, Ground Truth notifies workers when a new labeling task is available.
  8. Choose Create private team.
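
If you prefer to script this step instead of using the console, you can create the same work team with the SageMaker API. The following boto3 sketch is a minimal example: the team name and description are placeholders, the workforce name matches the one created earlier, and the group name must match the IdP group you send in the sagemaker:groups claim.

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_workteam(
    WorkteamName="private-labeling-team",  # hypothetical team name
    WorkforceName="privatewf",
    Description="Private work team backed by the ADFS OIDC workforce",
    MemberDefinitions=[
        {"OidcMemberDefinition": {"Groups": ["sagemaker"]}}
    ],
)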

Test access to the private labeling portal

To test your access, browse to https://console.aws.amazon.com/sagemaker/groundtruth#/labeling-workforces and open the labeling portal sign-in URL in a new browser window or incognito mode.

Log in with your IdP credentials. If authentication is successful, you should be redirected to the portal.

Cost

You will be charged for the number of jobs labeled by your internal employees. For more information, refer to Amazon SageMaker Data Labeling Pricing.

Clean up

You can delete the private workforce using the SageMaker DeleteWorkforce API. If you have work teams associated with the private workforce, you must delete them before deleting the workforce. For more information, see Delete a work team.
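
The following boto3 sketch shows the same cleanup order; the work team name is the hypothetical one used earlier, so substitute the names of your own work teams before running it.

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Delete the work teams that belong to the workforce first
sm.delete_workteam(WorkteamName="private-labeling-team")

# Then delete the private workforce itself
sm.delete_workforce(WorkforceName="privatewf")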

Summary

In this post, we demonstrated how to configure an OIDC application with Active Directory Federation Services and use your existing Active Directory credentials to authenticate to a Ground Truth labeling portal.

We’d love to hear from you. Let us know what you think in the comments section.


About the authors

Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming and watching sport events.

Aishwarya Kaushal is a Senior Software Engineer at Amazon. Her focus is on solving challenging problems using machine learning, building scalable AI solutions using distributed systems and helping customers to adopt the new features/products. In her spare time, Aishwarya enjoys watching sci-fi movies, listening to music and dancing.
