How RallyPoint and AWS are personalizing job recommendations to help military veterans and service providers transition back into civilian life using Amazon Personalize

This post was co-written with Dave Gowel, CEO of RallyPoint. In his own words, “RallyPoint is an online social and professional network for veterans, service members, family members, caregivers, and other civilian supporters of the US armed forces. With two million members on the platform, the company provides a comfortable place for this deserving population to connect with each other and programs designed to support them.”

All those who serve – and those who support them – often face a variety of employment challenges when a servicemember transitions back into civilian life. RallyPoint has identified the transition period to a civilian career as a major opportunity to improve the quality of life for this population by creating automated and compelling job recommendations. However, the team historically employed a rule-based curation method to recommend jobs throughout its user experience, which doesn’t allow members to get job recommendations personalized to their individual experience, expertise, and interests.

“To improve this experience for its members, we at RallyPoint wanted to explore how machine learning (ML) could help. We don’t want our servicemembers, veterans, and their loved ones to waste time searching for a fulfilling civilian career path when they decide to leave the military. It should be an easy process. We want our members to tell us about their military experiences, any schools they’ve attended, and their personal preferences. Then by leveraging what we know from our millions of military and veteran members, relevant open jobs should be easily surfaced instead of laboriously searched. This free service for our members is also expected to drive revenue by at least seven figures from employers seeking the right military and veteran talent, allowing us to build more free capabilities for our members.”

This blog post summarizes how the Amazon Machine Learning Solution Lab (MLSL) partnered with RallyPoint to drive a 35% improvement in personalized career recommendations and a 66x increase in coverage, amongst other improvements for RallyPoint members from the current rule-based implementation.

“MLSL helped RallyPoint save and improve the lives of the US military community. Fortunate to work on multiple complex and impactful projects with MLSL to support the most deserving of populations, RallyPoint accelerated growth in multiple core organizational metrics in the process. MLSL’s high caliber talent, culture, and focus on aiding our realization of measurable and compelling results from machine learning investments enabled us to reduce suicide risk, improve career transition, and speed up important connections for our service members, veterans, and their families.”

Screenshot of the RallyPoint Website

*Photo provided by the RallyPoint team.

The following sections cover the business and technical challenges, the approach taken by the AWS and RallyPoint teams, and the performance of the implemented solution, which uses Amazon Personalize.

Amazon Personalize makes it easy for developers to build applications capable of delivering a wide array of personalization experiences, including specific product recommendations, personalized product re-ranking, and customized direct marketing. Amazon Personalize is a fully managed ML service that goes beyond rigid, static rule-based recommendation systems by training, tuning, and deploying custom ML models to deliver highly customized recommendations to customers across industries such as retail and media and entertainment.

Business and Technical challenges

Multiple business challenges inspired this partnership. The most pertinent was the clickthrough rate on the top 10 recommended jobs on the RallyPoint website. RallyPoint analyzed user engagement within their platform and discovered that they needed to increase the number of relevant jobs that users are clicking. The idea is that the more relevant a recommended job is, the higher the likelihood of members applying to those jobs, leading to improved employment outcomes.

The next challenge was to increase the engagement by members on job services offered on the site. RallyPoint offers the opportunity for people to “Build your brand and engage the military community, advertise your products and services, run recruitment marketing campaigns, post jobs, and search veteran talent.” They once again identified an opportunity to apply Amazon Personalize to help more people transition to civilian life, and sought to improve their click-to-customer conversion numbers, leading to better outcomes for RallyPoint’s direct customers.

From a technical perspective, as in many traditional recommender system problems, data sparsity and a long tail were challenges to overcome. The sample set of de-identified, already publicly shared data included thousands of anonymized user profiles, with more than fifty user-metadata points, but many had inconsistent or missing metadata or profile information. To tackle this, the team leveraged the Amazon Personalize cold start recommendation functionality for relevant users.

Solution overview

To solve the problem, MLSL collaborated with RallyPoint to construct a custom Amazon Personalize pipeline for RallyPoint. Some of the services used include Amazon Simple Storage Service (Amazon S3), Amazon SageMaker Notebook Instances, and Amazon Personalize. The following diagram illustrates the solution architecture.

The anonymized raw data used for the solution consisted of a history of interactions with job postings along with metadata on user profiles and job positions. This was stored in S3. The MLSL team used Amazon SageMaker Notebook Instances to prepare data as input to Amazon Personalize. This step included data preprocessing, feature engineering, and creating dataset groups and schemas required for Amazon Personalize. For more information refer to Creating a Custom dataset group.
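As a rough illustration of the schema step, the following sketch builds an Avro-style schema for an interactions dataset in Python. The USER_ID, ITEM_ID, and TIMESTAMP fields are the ones Amazon Personalize requires for interactions data. The EVENT_TYPE field and the mapping of job postings to items are assumptions made for illustration, not details of the RallyPoint pipeline.

```python
import json

# Interactions schema in the Avro format that Amazon Personalize expects.
# USER_ID, ITEM_ID, and TIMESTAMP are required; EVENT_TYPE is an optional
# field included here as an assumption (for example "click" or "apply").
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},  # here: a job posting ID
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},  # Unix epoch time
    ],
    "version": "1.0",
}

# The JSON string is what would be registered as the dataset schema.
schema_json = json.dumps(interactions_schema, indent=2)
print(schema_json)
```

The resulting string is the kind of value that would be supplied when registering a schema for a dataset; user and item metadata datasets would need their own schemas with the relevant profile and job fields.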

The next step was to create a solution in Amazon Personalize. A solution refers to the combination of an Amazon Personalize recipe, customized parameters, and one or more solution versions. For more information refer to Creating a solution. The team used the User-Personalization recipe to generate user-specific job recommendations for users in a validation set. The Amazon Personalize outputs, including the job recommendations and performance metrics, are stored in an Amazon S3 bucket for further analysis.
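The solution-creation step can be sketched as follows. This is a minimal outline rather than the team's actual code: the function name and the solution name are hypothetical, and the `personalize` argument is assumed to be an Amazon Personalize client such as the one returned by `boto3.client("personalize")`.

```python
# ARN of the User-Personalization recipe provided by Amazon Personalize.
USER_PERSONALIZATION_RECIPE_ARN = (
    "arn:aws:personalize:::recipe/aws-user-personalization"
)


def train_user_personalization(personalize, dataset_group_arn, solution_name):
    """Create a solution and start training one solution version.

    `personalize` is assumed to behave like a boto3 Amazon Personalize
    client. Returns the solution ARN and the solution version ARN.
    """
    solution = personalize.create_solution(
        name=solution_name,
        datasetGroupArn=dataset_group_arn,
        recipeArn=USER_PERSONALIZATION_RECIPE_ARN,
    )
    version = personalize.create_solution_version(
        solutionArn=solution["solutionArn"]
    )
    return solution["solutionArn"], version["solutionVersionArn"]
```

Training runs asynchronously, so a real pipeline would poll the solution version status until it becomes ACTIVE before requesting metrics or recommendations.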

In the final step, the team used a notebook instance to prepare the output recommendations for external evaluation by human annotators, as described in the Using Domain Experts section.

Evaluation of Amazon Personalize results

The performance of an Amazon Personalize solution version can be evaluated using offline metrics, online metrics, and A/B testing. Offline metrics allow you to view the effects of modifying hyperparameters and algorithms used to train your models, calculated against historical data. Online metrics are the empirical results observed in your users’ interactions with real-time recommendations provided in a live environment (such as clickthrough rate). A/B testing is an online method of comparing the performance of multiple solution versions to a default solution. Users are randomly assigned to either the control (default) group or one of the treatment (test) groups. The control group users receive recommendations from the default solution (baseline), whereas each of the treatment groups interacts with a different solution version. Statistical significance tests are used to compare the performance metrics (such as clickthrough rate or latency) and business metrics (such as revenue) to those of the default solution.
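To make the significance-testing step concrete, here is a minimal sketch of one common choice, a two-proportion z-test on clickthrough rate. The post does not say which test would be used, so the choice of test and the click counts in the usage lines are assumptions for illustration only.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare the clickthrough rates of control (a) and treatment (b).

    Returns (z, p_value) for the null hypothesis that both groups share
    the same clickthrough rate, using a two-sided test.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        # No variation at all (all clicks or no clicks): nothing to detect.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 1,000 impressions per group.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in clickthrough rate is unlikely to be due to chance alone; business metrics such as revenue would typically need a different test.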

Amazon Personalize measures offline metrics while training a solution version. The team used offline metrics such as mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG@k), Precision@k, and Coverage. For the definitions of all available offline metrics, refer to Metric definitions.
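For intuition, the metrics named above can be sketched with their textbook definitions using binary relevance. Amazon Personalize computes its own versions during training, and its exact calculation may differ in detail from this illustration. The job IDs in the usage lines are made up.

```python
import math


def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0 if none is recommended."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases):
    """Average reciprocal rank over (recommended, relevant) pairs."""
    return sum(reciprocal_rank(rec, rel) for rec, rel in cases) / len(cases)


def precision_at_k(recommended, relevant, k):
    """Fraction of the top k recommendations that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k


def ndcg_at_k(recommended, relevant, k):
    """Discounted gain of the top k, normalized by the best possible order."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def coverage(recommendations_per_user, catalog):
    """Share of catalog items that appear in at least one recommendation."""
    catalog = set(catalog)
    recommended = {item for recs in recommendations_per_user for item in recs}
    return len(recommended & catalog) / len(catalog)


recs = ["j3", "j1", "j7"]
print(reciprocal_rank(recs, {"j1"}), precision_at_k(recs, {"j1"}, 3))
```

A popularity-style recommender that shows the same few items to everyone scores very low on coverage by construction, which is why coverage is a useful complement to the ranking metrics.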

Although Amazon Personalize provides an extensive list of offline metrics that the team can use to objectively measure the performance of solutions during training, online metrics and A/B testing are recommended to track and validate model performance. One caveat to these tests is that they require users to interact with Amazon Personalize recommendations in real time. Because the RallyPoint Amazon Personalize model wasn’t deployed prior to this publication, the team didn’t have results to report for these tests.

Using Domain Experts

A/B testing is the preferred method of analyzing the quality of a recommendation system; however, using domain experts to annotate recommendations is a viable precursor. Because online testing was not an option, the team tested the robustness of the recommendations by asking domain experts at RallyPoint to annotate the recommendations generated by the models. The number of job positions the experts agreed should be recommended (given a user’s information and indicated preferences) was counted as the number of “correct” recommendations. This metric was used to compare solution versions. A popularity solution (the current rule-based criteria), which recommends the top five most popular job positions to every user, was used as a baseline. A solution with default settings, called the Amazon Personalize baseline solution, was used as a second baseline.
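The popularity baseline and the expert-based “correct” count can be sketched as follows. The data structures (a list of user and job pairs, and a per-user set of expert-approved jobs) are assumptions for illustration; the post does not describe how the annotations were stored.

```python
from collections import Counter


def popularity_baseline(interactions, n=5):
    """Return the n most frequently interacted-with job IDs.

    `interactions` is a list of (user_id, job_id) pairs. The same list
    of jobs is then recommended to every user.
    """
    counts = Counter(job_id for _user_id, job_id in interactions)
    return [job_id for job_id, _count in counts.most_common(n)]


def count_correct(recommendations, expert_approved):
    """Count recommendations that the domain experts agreed with.

    `recommendations` maps user_id -> list of recommended job IDs;
    `expert_approved` maps user_id -> set of job IDs judged appropriate.
    """
    return sum(
        len(set(recs) & expert_approved.get(user_id, set()))
        for user_id, recs in recommendations.items()
    )


# Made-up interaction history and annotations.
history = [("u1", "jobA"), ("u2", "jobA"), ("u3", "jobB"),
           ("u1", "jobB"), ("u2", "jobC"), ("u3", "jobA")]
top_jobs = popularity_baseline(history, n=2)
baseline_recs = {"u1": top_jobs, "u2": top_jobs}
print(count_correct(baseline_recs, {"u1": {"jobB"}, "u2": {"jobC"}}))
```

Running the same count over each solution version's recommendations gives directly comparable totals for the same set of users.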

Results

Using the best performing model resulted in a 35% improvement in the number of “correct” recommendations over the Amazon Personalize baseline solution and a 54% improvement over the popularity solution. The team also achieved a 66x improvement in coverage, a 30x improvement in MRR, and a 2x improvement in precision@10 when compared to the popularity solution. Beyond the gains over the popularity solution, the team observed up to a 2x increase in MRR and precision@10 when compared to the Amazon Personalize baseline solution.

Summary

RallyPoint recognized an opportunity to better serve their customers with more personalized career recommendations. They began their user personalization journey with customer obsession in mind, partnering with the Machine Learning Solutions Lab. RallyPoint now has the opportunity to give their users more valuable career recommendations through this solution. Incorporating this improved recommendation system into their website will result in RallyPoint users seeing more relevant jobs in their career feed, easing the path to more fulfilling careers and an improved quality of life for their members.

Use Amazon Personalize to provide an individualized experience for your users today! If you’d like to collaborate with experts to bring ML solutions to your organization, contact the Amazon ML Solutions Lab.


About the Authors

Dave Gowel is an Army veteran and the CEO of RallyPoint. Dave is a graduate of West Point and the US Army Ranger School, served in Iraq as a tank platoon leader, and taught as an assistant professor at the Massachusetts Institute of Technology ROTC program. RallyPoint is the third technology company for which Dave has been CEO.

Matthew Rhodes is a Data Scientist working in the Amazon ML Solutions Lab. He specializes in building machine learning pipelines that involve concepts such as natural language processing and computer vision.

Amin Tajgardoon is an Applied Scientist at the Amazon ML Solutions Lab. He has an extensive background in computer science and machine learning. In particular, Amin’s focus has been on deep learning and forecasting, prediction explanation methods, model drift detection, probabilistic generative models, and applications of AI in the healthcare domain.

Yash Shah is a Science Manager in the Amazon ML Solutions Lab. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Vamshi Krishna Enabothala is a Sr. Applied AI Specialist Architect at AWS. He works with customers from different sectors to accelerate high-impact data, analytics, and machine learning initiatives. He is passionate about recommendation systems, NLP, and computer vision areas in AI and ML. Outside of work, Vamshi is an RC enthusiast, building RC equipment (planes, cars, and drones), and also enjoys gardening.

Greg Tolmie is an Account Manager on the AWS Public Sector ISV partners team. Greg supports a portfolio of AWS public sector ISV partners to help them grow and mature their adoption of AWS services while maximizing benefits of the AWS partner network.

Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis

Reliability managers and technicians in industrial environments such as manufacturing production lines, warehouses, and industrial plants are keen to improve equipment health and uptime to maximize product output and quality. Machine and process failures are often addressed by reactive activity after incidents happen or by costly preventive maintenance, where you run the risk of over-maintaining the equipment or missing issues that could happen between the periodic maintenance cycles. Predictive condition-based maintenance is a proactive strategy that is better than reactive or preventive ones. Indeed, this approach combines continuous monitoring, predictive analytics, and just-in-time action. This enables maintenance and reliability teams to service equipment only when necessary, based on the actual equipment condition.

Generating actionable insights from condition-based monitoring of large industrial asset fleets comes with common challenges. These include, but are not limited to, building and maintaining a complex infrastructure of sensors collecting data from the field, obtaining a reliable high-level summary of industrial asset fleets, efficiently managing failure alerts, identifying possible root causes of anomalies, and effectively visualizing the state of industrial assets at scale.

Amazon Monitron is an end-to-end condition monitoring solution that enables you to start monitoring equipment health with the aid of machine learning (ML) in minutes, so you can implement predictive maintenance and reduce unplanned downtime. It includes sensor devices to capture vibration and temperature data, a gateway device to securely transfer data to the AWS Cloud, the Amazon Monitron service that analyzes the data for anomalies with ML, and a companion mobile app to track potential failures in your machinery. Your field engineers and operators can directly use the app to diagnose and plan maintenance for industrial assets.

From the operational technology (OT) team standpoint, using the Amazon Monitron data also opens up new ways to improve how they operate large industrial asset fleets thanks to AI. OT teams can reinforce the predictive maintenance practice in their organization by building a consolidated view across multiple hierarchies (assets, sites, and plants). They can combine actual measurement and ML inference results with unacknowledged alarms, sensor or gateway connectivity status, or asset state transitions to build a high-level summary for the scope (asset, site, project) they are focused on.

With the recently launched Amazon Monitron Kinesis data export v2 feature, your OT team can stream incoming measurement data and inference results from Amazon Monitron via Amazon Kinesis to Amazon Simple Storage Service (Amazon S3) to build an Internet of Things (IoT) data lake. By leveraging the latest data export schema, you can obtain sensor connectivity status, gateway connectivity status, measurement classification results, closure reason codes, and details of asset state transition events.

Use cases overview

The enriched data stream Amazon Monitron now exposes enables you to implement several key use cases such as automated work order creation, enriching an operational single pane of glass or automating failure reporting. Let’s dive into these use cases.

You can use the Amazon Monitron Kinesis data export v2 to create work orders in Enterprise Asset Management (EAM) systems such as Infor EAM, SAP Asset Management, or IBM Maximo. For example, in the video avoiding mechanical issues with predictive maintenance & Amazon Monitron, you can discover how our Amazon Fulfillment Centers are avoiding mechanical issues on conveyor belts with Amazon Monitron sensors integrated with third-party software such as the EAM used at Amazon as well as with the chat rooms technicians use. This shows how you can naturally integrate Amazon Monitron insights into your existing workflows. Stay tuned in the coming months to read the next installment of this series, which will show how an actual implementation of this integration works.

You can also use the data stream to ingest Amazon Monitron insights back into a shop floor system such as a Supervisory Control and Data Acquisition (SCADA) or a Historian. Shop floor operators are more efficient when all the insights about their assets and processes are provided in a single pane of glass. In this concept, Amazon Monitron doesn’t become yet another tool technicians have to monitor, but another data source with insights provided in the single view they are already used to. Later this year, we will also describe an architecture you can use to perform this task and send Amazon Monitron feedback to major third-party SCADA systems and Historians.

Last but not least, the new data stream from Amazon Monitron includes the asset state transitions and closure codes provided by users when acknowledging alarms (which trigger the transition to a new state). Thanks to this data, you can automatically build visualizations that provide real-time reporting of the failures and actions taken while operating your assets.

Your team can then build a broader data analytics dashboard to support your industrial fleet management practice by combining this asset state data with Amazon Monitron measurement data and other IoT data across large industrial asset fleets by using key AWS services, which we describe in this post. We explain how to build an IoT data lake, the workflow to produce and consume the data, as well as a summary dashboard to visualize Amazon Monitron sensors data and inference results. We use an Amazon Monitron dataset coming from about 780 sensors installed in an industrial warehouse, which has been running for more than 1 year. For the detailed Amazon Monitron installation guide, refer to Getting started with Amazon Monitron.

Solution overview

Amazon Monitron provides ML inference of asset health status after 21 days of the ML model training period for each asset. In this solution, the measurement data and ML inference from these sensors are exported to Amazon S3 via Amazon Kinesis Data Streams by using the latest Amazon Monitron data export feature. As soon as Amazon Monitron IoT data is available in Amazon S3, a database and table are created in Amazon Athena by using an AWS Glue crawler. You can query Amazon Monitron data via AWS Glue tables with Athena, and visualize the measurement data and ML inference with Amazon Managed Grafana. With Amazon Managed Grafana, you can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure. In this post, you connect Amazon Managed Grafana to Athena, and learn how to build a data analytics dashboard with Amazon Monitron data to help you plan industrial asset operations at scale.

The following screenshot is an example of what you can achieve at the end of this post. This dashboard is divided into three sections:

  • Plant View – Analytical information from all sensors across plants; for example, the overall counts of various states of sensors (Healthy, Warning, or Alarm), number of unacknowledged and acknowledged alarms, gateway connectivity, and average time for maintenance
  • Site View – Site-level statistics, such as asset status statistics at each site, total number of days that an alarm remains unacknowledged, top/bottom performing assets at each site, and more
  • Asset View – Summary information for the Amazon Monitron project at the asset level, such as the alarm type for an unacknowledged alarm (ISO or ML), the timeline for an alarm, and more

These panels are examples that can help strategic operational planning, but they are not exclusive. You can use a similar workflow to customize the dashboard according to your targeted KPI.



Architecture overview

The solution you will build in this post combines Amazon Monitron, Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon S3, AWS Glue, Athena, and Amazon Managed Grafana.

The following diagram illustrates the solution architecture. Amazon Monitron sensors measure and detect anomalies from equipment. Both measurement data and ML inference outputs are exported at a frequency of once per hour to a Kinesis data stream, and they are delivered to Amazon S3 via Kinesis Data Firehose with a 1-minute buffer. The exported Amazon Monitron data is in JSON format. An AWS Glue crawler analyzes the Amazon Monitron data in Amazon S3 at a chosen frequency of once per hour, builds a metadata schema, and creates tables in Athena. Finally, Amazon Managed Grafana uses Athena to query the Amazon S3 data, allowing dashboards to be built to visualize both measurement data and device health status.

To build this solution, you complete the following high-level steps:

  1. Enable a Kinesis Data Stream export from Amazon Monitron and create a data stream.
  2. Configure Kinesis Data Firehose to deliver data from the data stream to an S3 bucket.
  3. Build the AWS Glue crawler to create a table of Amazon S3 data in Athena.
  4. Create a dashboard of Amazon Monitron devices with Amazon Managed Grafana.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Additionally, make sure that all the resources you deploy are in the same Region.

Enable a Kinesis data stream export from Amazon Monitron and create a data stream

To configure your data stream export, complete the following steps:

  1. On the Amazon Monitron console, from your project’s main page, choose Start live data export.
  2. Under Select Amazon Kinesis data stream, choose Create a new data stream.
  3. Under Data stream configuration, enter your data stream name.
  4. For Data stream capacity, choose On-demand.
  5. Choose Create data stream.

Note that any live data export enabled after April 4th, 2023 will stream data following the Kinesis Data Streams v2 schema. If you have an existing data export that was enabled before this date, the schema will follow the v1 format.

You can now see live data export information on the Amazon Monitron console with your specified Kinesis data stream.

Configure Kinesis Data Firehose to deliver data to an S3 bucket

To configure your Firehose delivery stream, complete the following steps:

  1. On the Kinesis console, choose Delivery streams in the navigation pane.
  2. Choose Create delivery stream.
  3. For Source, select Amazon Kinesis Data Streams.
  4. For Destination, select Amazon S3.
  5. Under Source settings, for Kinesis data stream, enter the ARN of your Kinesis data stream.
  6. Under Delivery stream name, enter the name of your Kinesis data stream.
  7. Under Destination settings, choose an S3 bucket or enter a bucket URI. You can either use an existing S3 bucket to store Amazon Monitron data, or you can create a new S3 bucket.
  8. Enable dynamic partitioning using inline parsing for JSON:
    • Choose Enabled for Dynamic partitioning.
    • Choose Enabled for Inline parsing for JSON.
    • Under Dynamic partitioning keys, add the following partition keys:
      Key Name   JQ Expression
      project    .projectName| "project=\(.)"
      site       .eventPayload.siteName| "site=\(.)"
      asset      .eventPayload.assetName| "asset=\(.)"
      position   .eventPayload.positionName| "position=\(.)"
      time       .timestamp| sub(" [0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3}$"; "")| "time=\(.)"
  9. Choose Apply dynamic partitioning keys and confirm the generated S3 bucket prefix is:
!{partitionKeyFromQuery:project}/!{partitionKeyFromQuery:site}/!{partitionKeyFromQuery:asset}/!{partitionKeyFromQuery:position}/!{partitionKeyFromQuery:time}/.
  10. Enter a prefix for S3 bucket error output prefix. Any JSON payload that doesn’t contain the keys described earlier will be delivered in this prefix. For instance, the gatewayConnected and gatewayDisconnected events are not linked to a given asset or position. Therefore, they won’t contain the assetName and positionName fields. Specifying this optional prefix here allows you to monitor this location and process these events accordingly.
  11. Choose Create delivery stream.

You can now inspect the Amazon Monitron data in the S3 bucket. Note that Amazon Monitron exports live data at a frequency of once per hour, so you may need to wait up to 1 hour before data appears.

This Kinesis Data Firehose setup enables dynamic partitioning, and the S3 objects delivered will use the following key format:

/project={projectName}/site={siteDisplayName}/asset={assetDisplayName}/position={sensorPositionDisplayName}/time={yyyy-mm-dd 00:00:00}/{filename}
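To see how the partition keys turn a record into an S3 prefix, the following sketch emulates the JQ expressions from the dynamic partitioning step in plain Python. It is an illustration only: Kinesis Data Firehose does this work for you, the sample record and its values are made up, and the timestamp format is inferred from the regular expression in the time key, so the exact time component of delivered keys may differ.

```python
import re

# Same pattern as the JQ sub() call: strips a trailing " HH:MM:SS.mmm".
TIME_OF_DAY = re.compile(r" [0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3}$")
KEY_ORDER = ["project", "site", "asset", "position", "time"]


def partition_prefix(record):
    """Build the S3 prefix for one exported record.

    Returns None when a partition key is missing, which corresponds to
    records that would be delivered to the error output prefix.
    """
    payload = record.get("eventPayload", {})
    values = {
        "project": record.get("projectName"),
        "site": payload.get("siteName"),
        "asset": payload.get("assetName"),
        "position": payload.get("positionName"),
        "time": record.get("timestamp"),
    }
    if any(value is None for value in values.values()):
        return None
    values["time"] = TIME_OF_DAY.sub("", values["time"])
    return "/".join(f"{key}={values[key]}" for key in KEY_ORDER) + "/"


# Hypothetical measurement record with made-up names.
sample = {
    "projectName": "demo-project",
    "timestamp": "2023-04-04 10:15:30.123",
    "eventPayload": {
        "siteName": "site-1",
        "assetName": "conveyor-7",
        "positionName": "motor",
    },
}
print(partition_prefix(sample))
```

A gateway event without asset and position fields returns None here, mirroring the records that land in the error output prefix.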

Build the AWS Glue crawler to create a table of Amazon S3 data in Athena

After the live data has been exported to Amazon S3, we use an AWS Glue crawler to generate the metadata tables. In this post, we use AWS Glue crawlers to automatically infer database and table schema from Amazon Monitron data exported in Amazon S3, and store the associated metadata in the AWS Glue Data Catalog. Athena then uses the table metadata from the Data Catalog to find, read, and process the data in Amazon S3. Complete the following steps to create your database and table schema:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. Enter a name for the crawler (for example, XXX_xxxx_monitron).
  4. Choose Next.
  5. For Is your data already mapped to Glue tables, choose Not yet.
  6. For Data Source, choose S3.
  7. For Location of S3 data, choose In this Account, and enter the path of your S3 bucket directory you set up in the previous section (s3://YourBucketName).
  8. For Repeat crawls of S3 data stores, select Crawl all sub-folders.
  9. Finally, choose Next.
  10. Select Create new IAM role and enter a name for the role.
  11. Choose Next.
  12. Select Add Database, and enter a name for the database. This creates the Athena database where your metadata tables are located after the crawler is complete.
  13. For Crawler Schedule, select a preferred time-based scheduler (for example, hourly) to refresh the Amazon Monitron data in the database, and choose Next.
  14. Review the crawler details and choose Create.
  15. On the Crawlers page of the AWS Glue console, select the crawler you created and choose Run crawler.

You may need to wait a few minutes, depending on the size of the data. When it’s complete, the crawler’s status shows as Ready. To see the metadata tables, navigate to your database on the Databases page and choose Tables in the navigation pane.

You can also view data by choosing Table data on the console.

You’re redirected to the Athena console to view the top 10 records of the Amazon Monitron data in Amazon S3.

Create a dashboard of Amazon Monitron devices with Amazon Managed Grafana

In this section, we build a customized dashboard with Amazon Managed Grafana to visualize Amazon Monitron data in Amazon S3, so that the OT team can get streamlined access to assets in alarm across their whole Amazon Monitron sensor fleet. This will enable the OT team to plan next-step actions based on the possible root cause of the anomalies.

To create a Grafana workspace, complete the following steps:

  1. Ensure that your user role is admin or editor.
  2. On the Amazon Managed Grafana console, choose Create workspace.
  3. For Workspace name, enter a name for the workspace.
  4. Choose Next.
  5. For Authentication access, select AWS IAM Identity Center (successor to AWS Single Sign-On). You can use the same AWS IAM Identity Center user that you used to set up your Amazon Monitron project.
  6. Choose Next.
  7. For this first workspace, confirm that Service managed is selected for Permission type. This selection enables Amazon Managed Grafana to automatically provision the permissions you need for the AWS data sources that you use for this workspace.
  8. Choose Current account.
  9. Choose Next.
  10. Confirm the workspace details, and choose Create workspace. The workspace details page appears. Initially, the status is CREATING.
  11. Wait until the status is ACTIVE to proceed to the next step.

To configure your Athena data source, complete the following steps:

  1. On the Amazon Managed Grafana console, choose the workspace you want to work on.
  2. On the Data sources tab, select Amazon Athena, and choose Actions, Enable service-managed policy.
  3. Choose Configure in Grafana in the Amazon Athena row.
  4. Sign in to the Grafana workspace console using IAM Identity Center if necessary. The user should have the Athena access policy attached to the user or role to have access to the Athena data source. See AWS managed policy: AmazonGrafanaAthenaAccess for more info.
  5. On the Grafana workspace console, in the navigation pane, choose the lower AWS icon (there are two) and then choose Athena on the Data sources menu.
  6. Select the default Region that you want the Athena data source to query from, select the accounts that you want, then choose Add data source.
  7. Follow the steps to configure Athena details.

If your workgroup in Athena doesn’t have an output location configured already, you need to specify an S3 bucket and folder to use for query results. After setting up the data source, you can view or edit it in the Configuration pane.

In the following subsections, we demonstrate several panels in the Amazon Monitron dashboard built in Amazon Managed Grafana to gain operational insights. The Athena data source provides a standard SQL query editor that we’ll use to analyze the Amazon Monitron data to generate desired analytics.

First, if there are many sensors in the Amazon Monitron project and they are in different states (healthy, warning, alarm, and needs maintenance), the OT team wants to see the count of sensor positions in each state. You can display this information as a pie chart widget in Grafana via the following Athena query:

SELECT * FROM (
  SELECT latest_status,
         COUNT(assetdisplayname) OVER (PARTITION BY latest_status) AS asset_health_count
  FROM (
    SELECT timestamp, sitedisplayname, assetdisplayname,
           assetState.newState AS latest_status,
           RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
    FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
  ) tt
  WHERE tt.rnk = 1
)
GROUP BY latest_status, asset_health_count;
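The logic of this query (keep the most recent record per asset, then count assets per state) can be sketched in plain Python for readers less familiar with window functions. The rows are made up, and the nested assetState.newState field is flattened to a newState key for simplicity. Unlike RANK(), this sketch keeps a single row per asset when timestamps tie.

```python
from collections import Counter


def latest_state_counts(rows):
    """Count assets by the state of their most recent record."""
    latest = {}
    for row in rows:
        asset = row["assetdisplayname"]
        # Keep the row with the greatest timestamp for each asset.
        if asset not in latest or row["timestamp"] > latest[asset]["timestamp"]:
            latest[asset] = row
    return Counter(row["newState"] for row in latest.values())


# Made-up rows; timestamps in this format sort correctly as strings.
rows = [
    {"assetdisplayname": "pump-1", "timestamp": "2023-04-04 09:00:00.000", "newState": "HEALTHY"},
    {"assetdisplayname": "pump-1", "timestamp": "2023-04-04 10:00:00.000", "newState": "ALARM"},
    {"assetdisplayname": "fan-2", "timestamp": "2023-04-04 10:00:00.000", "newState": "HEALTHY"},
]
print(latest_state_counts(rows))
```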

The following screenshot shows a panel with the latest distribution of Amazon Monitron sensor status.

To format your SQL query for Amazon Monitron data, refer to Understanding the data export schema.

Next, your OT team may want to plan predictive maintenance based on assets that are in alarm status, and therefore they want to quickly know the total number of acknowledged alarms vs. unacknowledged alarms. You can show the summary information of alarm state as simple stats panels in Grafana:

SELECT COUNT(*) FROM (SELECT timestamp, sitedisplayname, assetdisplayname, assetState.newState AS latest_status, RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name") tt WHERE tt.rnk=1 AND tt.latest_status = 'ALARM';

The following panel shows acknowledged and unacknowledged alarms.

The OT team can also query the amount of time the sensors remain in alarm status, so that they can decide their maintenance priority:

SELECT c.assetdisplayname, b.sensorpositiondisplayname, b.number_of_days_in_alarm_state FROM (SELECT a.assetdisplayname, a.sensorpositiondisplayname, COUNT(*)/24+1 AS number_of_days_in_alarm_state FROM (SELECT * FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name" WHERE (assetState.newState = 'ALARM' AND assetState.newState = assetState.previousState) ORDER BY timestamp DESC) a GROUP BY a.assetdisplayname, a.sensorpositiondisplayname) b INNER JOIN (SELECT * FROM (SELECT timestamp, sitedisplayname, assetdisplayname, assetState.newState AS latest_status, RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name") tt WHERE tt.rnk=1 AND tt.latest_status = 'ALARM') c ON b.assetdisplayname = c.assetdisplayname;
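The expression COUNT(*)/24+1 in this query converts a record count into a day count. The short Python sketch below spells out that arithmetic, assuming roughly one record per sensor position per hour (which the division by 24 suggests but the text does not state) and integer division. The function name and its parameters are hypothetical.

```python
def days_in_alarm(alarm_record_count, records_per_day=24):
    """Approximate days in alarm, mirroring COUNT(*)/24+1 in the query."""
    # Integer division drops the partial day; the +1 counts it back in,
    # so even a single alarm record is reported as day 1.
    return alarm_record_count // records_per_day + 1


# Record counts chosen to show the behavior around a day boundary.
examples = {n: days_in_alarm(n) for n in (1, 23, 24, 50)}
```

If your sensors report at a different cadence, the divisor would need to change accordingly.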

The output of this analysis can be visualized as a bar chart in Grafana, which makes the assets in alarm state easy to identify, as shown in the following screenshot.

To analyze top/bottom asset performance based on the total amount of time the assets are in an alarm or needs maintenance state, use the following query:

Select s.sitedisplayname, s.assetdisplayname, COUNT(s.timestamp)/24 AS trouble_time FROM (Select timestamp, sitedisplayname, assetdisplayname, sensorpositiondisplayname, assetState.newState FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name" WHERE assetState.newState = 'ALARM' OR assetState.newState = 'NEEDS_MAINTENANCE') AS s GROUP BY s.assetdisplayname, s.sitedisplayname ORDER BY trouble_time, s.assetdisplayname ASC LIMIT 5;

The following bar gauge visualizes the preceding query output, with the top performing assets showing 0 days in alarm state, and the bottom performing assets showing accumulated alarm states over the past year.

To help the OT team understand the possible root cause of an anomaly, the alarm types can be displayed for these assets still in alarm state with the following query:

SELECT a.assetdisplayname, a.sensorpositiondisplayname, a.latest_status, CASE WHEN a.temperatureML != 'HEALTHY' THEN 'TEMP' WHEN a.vibrationISO != 'HEALTHY' THEN 'VIBRATION_ISO' ELSE 'VIBRATION_ML' END AS alarm_type FROM (SELECT sitedisplayname, assetdisplayname, sensorpositiondisplayname, models.temperatureML.persistentClassificationOutput AS temperatureML, models.vibrationISO.persistentClassificationOutput AS vibrationISO, models.vibrationML.persistentClassificationOutput AS vibrationML, assetState.newState AS latest_status FROM (SELECT *, RANK() OVER (PARTITION BY assetdisplayname, sensorpositiondisplayname ORDER BY timestamp DESC) AS rnk FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name") tt WHERE tt.rnk=1 AND assetState.newState = 'ALARM') a WHERE (a.temperatureML != 'HEALTHY' OR a.vibrationISO != 'HEALTHY' OR a.vibrationML != 'HEALTHY');
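The CASE expression in this query has two properties worth noting, which the following Python sketch makes explicit. The function name and sample values are hypothetical. The logic mirrors the query: the first non-healthy model wins, and the ELSE branch is only safe because the outer WHERE clause has already removed rows where all three models are healthy.

```python
def alarm_type(temperature_ml, vibration_iso, vibration_ml):
    """Return the alarm type using the same precedence as the CASE expression.

    Assumes at least one of the three model outputs is not 'HEALTHY', as the
    query's outer WHERE clause guarantees. vibration_ml is accepted for
    symmetry but only reached through the fallback branch.
    """
    if temperature_ml != "HEALTHY":
        return "TEMP"
    if vibration_iso != "HEALTHY":
        return "VIBRATION_ISO"
    return "VIBRATION_ML"


# Temperature takes precedence when several models report a problem.
both_unhealthy = alarm_type("ALARM", "ALARM", "HEALTHY")
only_ml = alarm_type("HEALTHY", "HEALTHY", "ALARM")
```

Because only one type is returned per row, a position with both a temperature and a vibration problem is listed as TEMP.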

You can visualize this analysis as a table in Grafana. In this Amazon Monitron project, two alarms were triggered by ML models for vibration measurement.

The Amazon Managed Grafana dashboard is shown here for illustration purposes. You can adapt the dashboard design according to your own business needs.

Failure Reports

When a user acknowledges an alarm in the Amazon Monitron app, the associated assets transition to a new state. The user also has the opportunity to provide some details about this alarm:

  • Failure cause – This can be one of the following: ADMINISTRATION, DESIGN, FABRICATION, MAINTENANCE, OPERATION, OTHER, QUALITY, WEAR, or UNDETERMINED
  • Failure mode – This can be one of the following: NO_ISSUE, BLOCKAGE, CAVITATION, CORROSION, DEPOSIT, IMBALANCE, LUBRICATION, MISALIGNMENT, OTHER, RESONANCE, ROTATING_LOOSENESS, STRUCTURAL_LOOSENESS, TRANSMITTED_FAULT, or UNDETERMINED
  • Action taken – This can be ADJUST, CLEAN, LUBRICATE, MODIFY, OVERHAUL, REPLACE, NO_ACTION, or OTHER

The event payload associated with the asset state transition contains all this information, the previous state of the asset, and the new state of the asset. Stay tuned for an update of this post with more details on how you can use this information in an additional Grafana panel to build Pareto charts of the most common failures and actions taken across your assets.
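As a rough illustration of what a Pareto analysis involves, the following Python sketch ranks failure modes by frequency and adds a cumulative percentage. It assumes the failure mode values have already been extracted into a plain list. How to read them from the payload is not covered here, and the function name and sample data are hypothetical.

```python
from collections import Counter


def pareto(failure_modes):
    """Return (mode, count, cumulative_percent) rows, most frequent first."""
    ranked = Counter(failure_modes).most_common()
    total = sum(count for _, count in ranked)
    rows = []
    running = 0
    for mode, count in ranked:
        running += count
        rows.append((mode, count, round(100.0 * running / total, 1)))
    return rows


# Hypothetical failure modes reported when alarms were acknowledged.
sample_modes = [
    "MISALIGNMENT", "LUBRICATION", "MISALIGNMENT",
    "IMBALANCE", "MISALIGNMENT", "LUBRICATION",
]
rows = pareto(sample_modes)
```

The counts would feed the bars of a Pareto chart and the cumulative percentage would feed its line.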

Conclusion

Enterprise customers of Amazon Monitron are looking for a solution to build an IoT data lake with Amazon Monitron’s live data, so they can manage multiple Amazon Monitron projects and assets, and generate analytics reports across multiple Amazon Monitron projects. This post provides a detailed walkthrough of a solution to build this IoT data lake with the latest Amazon Monitron Kinesis data export v2 feature. This solution also shows how to use other AWS services, such as AWS Glue and Athena, to query the data, generate analytics outputs, and visualize such outputs with Amazon Managed Grafana with frequent refresh.

As a next step, you can expand this solution by sending ML inference results to other EAM systems that you might use for work order management. This will allow your operations team to integrate Amazon Monitron with other enterprise applications and improve their operational efficiency. You can also start building more in-depth insights into your failure modes and actions taken by processing the asset state transitions and the closure codes that are now part of the Kinesis data stream payload.


About the authors

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She has extensive experience in IoT architecture and Applied Data Science, and is part of both the Machine Learning and IoT Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome IoT machine learning (ML) solutions, at the Edge and in the Cloud. She enjoys leveraging latest IoT and big data technology to scale up her ML solution, reduce latency, and accelerate industry adoption.

Bishr Tabbaa is a solutions architect at Amazon Web Services. Bishr specializes in helping customers with machine learning, security, and observability applications. Outside of work, he enjoys playing tennis, cooking, and spending time with family.

Shalika Pargal is a Product Manager at Amazon Web Services. Shalika focuses on building AI products and services for Industrial customers. She brings significant experience at the intersection of Product, Industrial and Business Development. She recently shared Monitron’s success story at re:Invent 2022.

Garry Galinsky is a Principal Solutions Architect supporting Amazon on AWS. He has been involved with Monitron since its debut and has helped integrate and deploy the solution into Amazon’s worldwide fulfillment network. He recently shared Amazon’s Monitron success story at re:Invent 2022.

Michaël Hoarau is an AI/ML Specialist Solutions Architect at AWS who alternates between data scientist and machine learning architect, depending on the moment. He is passionate about bringing the AI/ML power to the shop floors of his industrial customers and has worked on a wide range of ML use cases, ranging from anomaly detection to predictive product quality or manufacturing optimization. He published a book on time series analysis in 2022 and regularly writes about this topic on LinkedIn and Medium. When not helping customers develop the next best machine learning experiences, he enjoys observing the stars, traveling, or playing the piano.

Automatic post-deployment management of cloud applications

SelfTune interaction with Client (Developer Machine) into Data Store (Azure ML Workspace)

Cloud Intelligence/AIOps blog series

In the first two blog posts in this series, we presented our vision for Cloud Intelligence/AIOps (AIOps) research, and scenarios where innovations in AI technologies can help build and operate complex cloud platforms and services effectively and efficiently at scale. In this blog post, we dive deeper into our efforts to automatically manage large-scale cloud services in deployment. In particular, we focus on an important post-deployment cloud management task that is pervasive across cloud services – tuning configuration parameters. And we discuss SelfTune, a horizontal reinforcement learning (RL) platform for successful configuration management of various cloud services in deployment.

Post-deployment management of cloud applications

Managing cloud applications includes mission-critical tasks such as resource allocation, scheduling, pre-provisioning, capacity planning and provisioning, and autoscaling. Currently, several of these tasks rely on hand-tuned and manually designed algorithms, heuristics, and domain knowledge. For a large cloud company like Microsoft, a hand-tuned, manually designed algorithm works well only to a certain extent, because deployments are extremely varied, large-scale, and involve complex interactions of various components. Moreover, user, customer, and application behavior can change over time, making yesterday’s hand-tuning not as relevant today and even less so in the future. The varied nature of today’s cloud technologies forces our engineers to spend an inordinate amount of time on special casing, introducing new configuration parameters, and writing or rewriting heuristics to set them appropriately. This also creates a lot of undocumented domain knowledge and dependence on a few individuals to solve significant problems. All of this, we believe, is unsustainable in the long term.

As we discussed in the earlier posts in this blog series, the right AI/ML formulations and techniques could help to alleviate this problem. Specifically, cloud management tasks are a natural fit for adopting the reinforcement learning paradigm. These tasks are repetitive in space and time; they run simultaneously on multiple machines, clusters, datacenters, and/or regions, and they run once every hour, day, week, or month. For instance, the VM pre-provisioning service for Azure Functions is a continuously running process, pre-provisioning for every application. Scheduling of background jobs on substrate runs separately on every machine. Reinforcement learning also needs a repetitive and iterative platform to converge on an optimized setup and, hence, can go together with the basic functioning of the cloud management task.

Our goal is to reduce manual effort in ensuring service efficiency, performance, and reliability by augmenting, complementing, or replacing existing heuristics for various management tasks with general RL-based solutions. In this blog post, we present our recent solution frameworks for cloud applications, to automatically tune their configuration parameters and to design policies for managing the parameters over time. Our solutions require minimal engineering effort and no AI expertise from the application developers or cloud operators.

Example Microsoft scenarios

O365 Workload Manager: Workload Manager (WLM) is a process that runs on each of the backend Exchange Online (EXO) servers to help schedule resources (CPU, disk, network) to background jobs that periodically execute. WLM has several configuration parameters that need to be carefully set so that the throughput of the scheduler is maximized while also ensuring that the resources are not too strained to execute low-latency user-facing jobs (e.g., Outlook search). Could we help EXO infrastructure manage the various knobs that dictate the control logic implemented in the scheduler for optimizing resource management and user latency?

Azure ML/Spark: Spark is the platform for performing distributed data analytics, and it comes with various configuration knobs that need to be appropriately set by developers based on their job context: Does the query involve JOIN clauses? How big are the data shards? The workload patterns change over time, and pre-trained models for choosing optimal configurations may not suffice. Can we help developers dynamically choose the deployment configuration based on workload signals?

Azure Functions VM management: Can we tune the VM management policy implemented in Azure Functions for VM pre-fetching/eviction to minimize cold starts and memory wastage over time? Our results in simulations are quite encouraging. We want to engage with the Azure and MSR Redmond teams to discuss the possibility of tuning the policy in the production setting.

Azure Kubernetes Service: AKS is chosen by first-party as well as third-party Azure customers for facilitating containerized development and deployment of cloud applications. The in-built workload autoscaling policies in AKS use several configuration parameters, which can be far from optimal in several scenarios. Can we help automatically adjust the parameters that govern resource allocation to containers running microservices based on applications’ workload patterns?

Horizontal solution design for configuration tuning

We see three main reasons why this is the right time to design and incorporate an RL-based solution framework across cloud management tasks:

  1. As the size and complexity of services in the cloud continue to increase, as our hardware footprint continues to include many SKUs, and as configuration and code get larger and more complex, heuristics and hand-tuning cannot provide optimal operations at all times, at least not without significant and proportionate investment in human experts and engineers.
  2. While we will have to rely on domain experts for key changes in systems and the services landscape on the cloud, using RL sub-systems can help reduce dependence on expert decisions and domain-knowledge over time.
  3. It is important to have a horizontal framework with a simple yet expressive API, with appropriate algorithms for tuning configuration parameters in an online fashion to optimize a developer-specific metric of interest or reward.

SelfTune framework

We have designed and deployed the SelfTune framework to help cloud service developers automatically tune the configuration parameters in their codebase, which would otherwise be manually set or heuristically tweaked. SelfTune is an RL-based framework that helps developers automate complex post-deployment cloud management tasks such as parameter tuning and performance engineering.

SelfTune is hosted as a service on the public Azure cloud. First-party applications that are interested in post-deployment parameter tuning can use REST API calls to access SelfTune endpoints. The SelfTune framework has two components:

  1. Client API provides the necessary support to access the SelfTune endpoints via REST API calls, namely, Predict for getting the parameters from the framework and SetReward for providing reward/feedback to the framework.
  2. RL Engine implements a suite of ML/RL algorithms for periodically updating the parameters and returning the latest values to the clients as well as for periodically computing the reward metrics.

At the core of the SelfTune framework is the formulation of the post-deployment parameter tuning problem as that of “online learning from bandit feedback.” SelfTune assumes that the only interaction possible with the external system (i.e., the application being tuned) is a black-box access to some form of feedback (e.g., daily P95 latency of the service). The framework repeatedly deploys configuration parameters and observes the corresponding rewards after a developer-defined period. As the operational environment (e.g., production cluster running certain types of workloads) is constantly in flux, there is no single setting of parameters that will remain optimal throughout. Thus, SelfTune continuously runs the explore-exploit paradigm of RL techniques – explore new parameters in the vicinity of the currently deployed parameters, observe rewards, update its internal model based on the reward, and exploit parameters that tend to give high rewards over time.
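The explore-exploit loop described above can be sketched in a few lines of Python. This is not Bluefin or any SelfTune algorithm. It is a generic two-point random-perturbation scheme, swapped in purely for illustration, and the reward function, step sizes, and names are all assumptions. It only shows the shape of the loop: perturb the current parameters, observe black-box rewards, and move toward the better direction.

```python
import random


def tune(reward_fn, initial, rounds=200, delta=0.1, step=0.05, seed=0):
    """Tune real-valued parameters using only black-box reward feedback."""
    rng = random.Random(seed)
    params = list(initial)
    for _ in range(rounds):
        # Explore: pick a random +/-1 direction near the current parameters.
        direction = [rng.choice((-1.0, 1.0)) for _ in params]
        plus = [p + delta * d for p, d in zip(params, direction)]
        minus = [p - delta * d for p, d in zip(params, direction)]
        # Observe: in a real deployment each call is one reward period.
        gain = reward_fn(plus) - reward_fn(minus)
        # Update and exploit: step toward whichever side scored higher.
        scale = step * gain / (2 * delta)
        params = [p + scale * d for p, d in zip(params, direction)]
    return params


def toy_reward(params):
    # Hypothetical reward with its peak at (3, -1); stands in for a metric
    # such as negative latency that the tuner cannot inspect directly.
    x, y = params
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)


best = tune(toy_reward, [0.0, 0.0])
```

On this toy reward the loop settles close to the peak. In a changing production environment the loop would keep running, so the parameters can track a moving optimum.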

We have designed a bandit learning algorithm called Bluefin in SelfTune that crystallizes the aforementioned idea. Our algorithm has lower sample complexity, which means it takes fewer rounds for the algorithm to converge to desired values when we want to tune multiple real-valued parameters simultaneously, compared to peer techniques like multi-armed bandits (which are the basis of Azure Personalizer), Bayesian Optimization (used by the MLOS framework), or genetic algorithms. This is provable under some assumptions on the reward function, but we observe, across multiple deployments, that the algorithm converges to good solutions in practice even when the theoretical assumptions are violated.

We have open-sourced Bluefin through Vowpal Wabbit, a popular RL library for practitioners, which houses the core algorithms of Azure Personalizer. We are continuing to work on designing vertical RL algorithms and horizontal feature learning for the systems domain. Besides Bluefin, SelfTune supports a suite of black-box optimization (e.g., Bayesian Optimization) and RL techniques (e.g., Deep Deterministic Policy Gradients) that the cloud applications can choose from, based on their needs.

A simple integration use case: Consider the scenario of setting PySpark cluster configuration parameters for Azure ML jobs that are spawned for ML workloads in the O365 MS-AI organization. The workloads are composed of various data processing jobs and run on various Azure ML clusters with different capacities and hardware. It is non-trivial to set parameters for various jobs such that the workloads complete quickly and do not fail midway due to resourcing issues, thereby losing all computations.

Basic SelfTune workflow: The basic integration of SelfTune in the AzureML pipeline is illustrated in the figure below. Here, the developer wants to tune seven key Apache PySpark parameters per job, namely driver memory, driver cores, executor cores, number of executors, executor memory, spark.sql.shuffle.partitions, and spark.default.parallelism.

Basic SelfTune workflow
  1. Developer invokes Predict on the SelfTune instance, asking for the parameters for the next job.
  2. SelfTune service responds with the predicted parameters for the next job.
  3. The developer submits a job using SelfTune’s predicted parameters (outside SelfTune’s purview).
  4. Once the job is complete, the cluster sends job metadata to the data store (outside SelfTune’s purview).
  5. Developer queries rewards for previously completed jobs, if any, from Data Store (e.g., Azure ML workspace).
  6. Data Store responds with the rewards (e.g., job completion times, which are part of the job metadata) from previously completed jobs.
  7. If the rewards exist in the store, the developer invokes SetReward for those jobs (which pushes the rewards to the SelfTune service endpoint).
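The seven steps above can be sketched as a small Python loop. Everything here is a hypothetical stand-in: StubSelfTune is an in-memory placeholder and not the real SelfTune client, a plain dict plays the role of the data store, and run_job fakes a job whose completion time depends on one parameter. Only the operation names Predict and SetReward come from the description above. Their real signatures are not given, so the ones below are assumptions.

```python
class StubSelfTune:
    """Hypothetical in-memory stand-in; NOT the real SelfTune client."""

    def __init__(self, initial_params):
        self.params = dict(initial_params)
        self.rewards = []  # (job_id, reward) pairs received so far
        self._next_id = 0

    def predict(self):
        # Steps 1-2: hand out an id and parameters for the next job.
        job_id = self._next_id
        self._next_id += 1
        return job_id, dict(self.params)

    def set_reward(self, job_id, reward):
        # Step 7: record feedback; a real service would update its model.
        self.rewards.append((job_id, reward))


def run_job(params):
    # Hypothetical job: completion time in seconds shrinks with more cores.
    return 100.0 / params["executor_cores"]


def tuning_round(service, data_store, reported):
    job_id, params = service.predict()            # steps 1-2
    completion_time = run_job(params)             # step 3
    data_store[job_id] = completion_time          # step 4
    for done_id, seconds in data_store.items():   # steps 5-6
        if done_id not in reported:
            # Step 7: negate the time so that faster jobs earn more reward.
            service.set_reward(done_id, -seconds)
            reported.add(done_id)


service = StubSelfTune({"executor_cores": 4})
store, reported = {}, set()
for _ in range(3):
    tuning_round(service, store, reported)
```

In this sketch the job finishes synchronously, so its reward is available in the same round. In the workflow described above, rewards for earlier jobs may only show up in the data store some rounds later, which is why the loop checks for unreported jobs each time.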

Self-tuning substrate background jobs scheduler

User-level background job scheduling: All the substrate backend servers in EXO datacenters (that host user mailboxes) run hundreds of low-priority, latency-insensitive, periodic workloads locally (e.g., mailbox replication, encryption, event-driven assistants, etc.). Workload Management (WLM) is a core substrate service that runs on all such backend servers. It helps with the user-level scheduling of workloads on the servers: a) with the goal of completing the tasks when resources become available (at micro-granular timescales), and b) mindful of the fact that high-priority, latency-sensitive workloads will bypass this scheduler. Thus, ensuring high availability of resources especially during peak hours is critical, besides meeting workload SLAs.

Tuning real-valued configuration parameters: The scheduler is implemented today as part of a huge codebase in the substrate core. The scheduler trades off resource utilization and completion rates by dynamically ramping up and ramping down the number of concurrent background tasks requiring access to the resources. This is achieved by carefully setting several configuration settings (hundreds of real-valued parameters). At a server level, we can achieve better resource utilization and throughput by automatically tuning the key parameters, based on the workloads it receives and the ensuing resource health fluctuations.

Impact of using SelfTune in WLM: We have integrated SelfTune with the substrate background scheduler codebase (the change required is simple, on the order of tens of lines of code, as shown in the figure below). We first deployed in the inner rings of substrate (more than 3,000 servers). The results gathered over 4-5 weeks of deployment clearly indicate that tuning helps on most of the deployed servers, increasing throughput by at least 20% across multiple forests in their heavily throttled servers, with a marginal increase in CPU health and insignificant-to-mild degradation of disk health. Based on this validation, we have now rolled out SelfTune integration to most EXO backend servers (nearly 200,000) across the worldwide production ring.

SelfTune Application library contains the SelfTune client API and the RL/ML algorithms

Ongoing work and future AI+systems research

SelfTune is a general platform and can be readily applied to many RL-for-cloud scenarios without any additional feature engineering or onboarding efforts (which are typically required in AIOps). We expect that developers can define a suitable spatial and temporal tuning scope in the service/system, tuning the parameters of the service running in the cluster, at the level of machines, every hour of every day. Thus, instead of hand-coding the optimal operating points for various machines or various clusters that the service operates in, we could integrate SelfTune in the service codebase to dynamically figure them out over time, based on the real-time feedback at a determined temporal granularity.

Our work poses a lot of interesting design and algorithmic questions in this space. For instance, can we automatically scope the tuning problem based on some observed context such as cluster type, hardware, workload volumes, etc., and find optimal parameters per scope? Given that typical cloud applications have hundreds, if not thousands, of knobs to tune, can we automatically identify the knobs that impact the performance metric of interest, and then tune those knobs more efficiently?

A combination of system insights, ML formulations, and cross-layer optimization is vital for effective post-deployment management of cloud applications and services. We will post an update to this blog post on our ongoing work in this space soon. Meanwhile, the final blog post in this series will explore how AIOps can be made more comprehensive by spanning the entire cloud stack.

The post Automatic post-deployment management of cloud applications appeared first on Microsoft Research.

Read More

Automatic post-deployment management of cloud applications

Automatic post-deployment management of cloud applications

SelfTune interaction with Client (Developer Machine) into Data Store (Azure ML Workspace)

Cloud Intelligence/AIOps blog series

In the first two blog posts in this series, we presented our vision for Cloud Intelligence/AIOps (AIOps) research, and scenarios where innovations in AI technologies can help build and operate complex cloud platforms and services effectively and efficiently at scale. In this blog post, we dive deeper into our efforts to automatically manage large-scale cloud services in deployment. In particular, we focus on an important post-deployment cloud management task that is pervasive across cloud services – tuning configuration parameters. And we discuss SelfTune, a horizontal reinforcement learning (RL) platform for successful configuration management of various cloud services in deployment.

Post-deployment management of cloud applications

Managing cloud applications includes mission-critical tasks such as resource allocation, scheduling, pre-provisioning, capacity planning and provisioning, and autoscaling. Currently, several of these tasks rely on hand-tuned and manually designed algorithms, heuristics, and domain knowledge. For a large cloud company like Microsoft, a hand-tuned, manually designed algorithm works well only to a certain extent, because deployments are extremely varied, large-scale, and involve complex interactions of various components. Moreover, user, customer, and application behavior can change over time, making yesterday’s hand-tuning not as relevant today and even less so in the future. The varied nature of today’s cloud technologies forces our engineers to spend an inordinate amount of time on special casing, introducing new configuration parameters, and writing or rewriting heuristics to set them appropriately. This also creates a lot of undocumented domain knowledge and dependence on a few individuals to solve significant problems. All of this, we believe, is unsustainable in the long term.

As we discussed in the earlier posts in this blog series, the right AI/ML formulations and techniques could help to alleviate this problem. Specifically, cloud management tasks are a natural fit for adopting the reinforcement learning paradigm. These tasks are repetitive in space and time; they run simultaneously on multiple machines, clusters, datacenters, and/or regions, and they run once every hour, day, week, or month. For instance, the VM pre-provisioning service for Azure Functions is a continuously running process, pre-provisioning for every application. Scheduling of background jobs on substrate runs separately on every machine. Reinforcement learning also needs a repetitive and iterative platform to converge on an optimized setup and, hence, can go together with the basic functioning of the cloud management task.

Spotlight: On-Demand EVENT

Microsoft Research Summit 2022

On-Demand
Watch now to learn about some of the most pressing questions facing our research community and listen in on conversations with 120+ researchers around how to ensure new technologies have the broadest possible benefit for humanity.

Our goal is to reduce manual effort in ensuring service efficiency, performance, and reliability by augmenting, complimenting, or replacing existing heuristics for various management tasks with general RL-based solutions. In this blog post, we present our recent solution frameworks for cloud applications, to automatically tune their configuration parameters and to design policies for managing the parameters over time. Our solutions require minimal engineering effort and no AI expertise from the application developers or cloud operators.

Example Microsoft scenarios

O365 Workload Manager: Workload Manager (WLM) is a process that runs on each of the backend Exchange Online (EXO) servers to help schedule resources (CPU, disk, network) to background jobs that periodically execute. WLM has several configuration parameters that need to be carefully set so that the throughput of the scheduler is maximized while also ensuring that the resources are not too strained to execute low-latency user-facing jobs (e.g., Outlook search). Could we help EXO infrastructure manage the various knobs that dictate the control logic implemented in the scheduler for optimizing resource management and user latency?

Azure ML/Spark: Spark is the platform for performing distributed data analytics, and it comes with various configuration knobs that need to be appropriately set by developers based on their job context: Does the query involve JOIN clauses? How big are the data shards? The workload patterns change over time, and pre-trained models for choosing optimal configurations may not suffice. Can we help developers dynamically choose the deployment configuration based on workload signals?

Azure Functions VM management: Can we tune the VM management policy implemented in Azure Functions for VM pre-fetching/eviction to minimize cold starts and memory wastage over time? Our results in simulations are quite encouraging. We want to engage with the Azure and MSR Redmond teams to discuss the possibility of tuning the policy in the production setting.

Azure Kubernetes Service: AKS is chosen by first-party as well as third-party Azure customers for facilitating containerized development and deployment of cloud applications. The in-built workload autoscaling policies in AKS use several configuration parameters, which can be far from optimal in several scenarios. Can we help automatically adjust the parameters that govern resource allocation to containers running microservices based on applications’ workload patterns?

Horizontal solution design for configuration tuning

We see three main reasons why this is the right time to design and incorporate an RL-based solution framework across cloud management tasks:

  1. As the size and complexity of services in the cloud continue to increase, as our hardware footprint continues to include many SKUs, and as configuration and code get larger and more complex, heuristics and hand-tuning cannot provide optimal operations at all times. Not without significant and proportionate investment in human experts and engineers.
  2. While we will have to rely on domain experts for key changes in systems and the services landscape on the cloud, using RL sub-systems can help reduce dependence on expert decisions and domain-knowledge over time.
  3. It is important to have a horizontal framework with a simple yet expressive API, with appropriate algorithms for tuning configuration parameters in an online fashion to optimize a developer-specific metric of interest or reward.

SelfTune framework

We have designed and deployed the SelfTune framework to help cloud service developers automatically tune the configuration parameters in their codebase, which would otherwise be manually set or heuristically tweaked. SelfTune is an RL-based framework that helps developers automate complex post-deployment cloud management tasks such as parameter tuning and performance engineering.

SelfTune is hosted as a service on the public Azure cloud. First-party applications that are interested in post-deployment parameter tuning can use RestAPI calls to access SelfTune endpoints. The SelfTune framework has two components:

  1. Client API provides necessary support to access the SelfTune endpoints via RestAPI calls, namely, Predict for getting the parameters from the framework and SetReward for providing reward/feedback to the framework.
  2. RL Engine implements a suite of ML/RL algorithms for periodically updating the parameters and returning the latest values to the clients as well as for periodically computing the reward metrics.

At the core of the SelfTune framework is the formulation of the post-deployment parameter tuning problem as that of “online learning from bandit feedback.” SelfTune assumes that the only interaction possible with the external system (i.e., the application being tuned) is a black-box access to some form of feedback (e.g., daily P95 latency of the service). The framework repeatedly deploys configuration parameters and observes the corresponding rewards after a developer-defined period. As the operational environment (e.g., production cluster running certain types of workloads) is constantly in flux, there is no single setting of parameters that will remain optimal throughout. Thus, SelfTune continuously runs the explore-exploit paradigm of RL techniques – explore new parameters in the vicinity of the currently deployed parameters, observe rewards, update its internal model based on the reward, and exploit parameters that tend to give high rewards over time.

We have designed a bandit learning algorithm called Bluefin in SelfTune that crystallizes the aforementioned idea. Our algorithm has lower sample complexity, which means it takes fewer rounds to converge to the desired values when we want to tune multiple real-valued parameters simultaneously, compared to peer techniques like multi-armed bandits (which are the basis of Azure Personalizer), Bayesian Optimization (used by the MLOS framework), or genetic algorithms. This is provable under some assumptions on the reward function, but we observe, across multiple deployments, that the algorithm converges to good solutions in practice even when the theoretical assumptions are often violated.

We have open-sourced Bluefin through Vowpal Wabbit, a popular RL library for practitioners, which houses the core algorithms of Azure Personalizer. We are continuing to work on designing vertical RL algorithms and horizontal feature learning for the systems domain. Besides Bluefin, SelfTune supports a suite of black-box optimization (e.g. Bayesian Optimization) and RL techniques (e.g., Deep Deterministic Policy Gradients) that the cloud applications can choose from, based on their needs.

A simple integration use case: Consider the scenario of setting PySpark cluster configuration parameters for Azure ML jobs that are spawned for ML workloads in the O365 MS-AI organization. The workloads are composed of various data processing jobs and run on various Azure ML clusters with different capacities and hardware. It is non-trivial to set parameters for the various jobs such that the workloads complete quickly and do not fail midway due to resourcing issues, thereby losing all computation.

Basic SelfTune workflow: The basic integration of SelfTune in the Azure ML pipeline is illustrated in the figure below. Here, the developer wants to tune seven key Apache PySpark parameters per job, namely driver memory, driver cores, executor cores, number of executors, executor memory, spark.sql.shuffle.partitions, and spark.default.parallelism.

Basic SelfTune workflow
  1. Developer invokes Predict on the SelfTune instance, asking for the parameters for the next job.
  2. SelfTune service responds with the predicted parameters for the next job.
  3. The developer submits a job using SelfTune’s predicted parameters. // outside SelfTune’s purview
  4. Once the job is complete, the cluster sends job metadata to the data store. // outside SelfTune’s purview
  5. Developer queries rewards for previously completed jobs, if any, from the data store (e.g., Azure ML workspace).
  6. The data store responds with the rewards (e.g., job completion times, which are part of the job metadata) from previously completed jobs.
  7. If the rewards exist in the store, the developer invokes SetReward for those jobs (which pushes the rewards to the SelfTune service endpoint hosted somewhere).
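The seven steps above can be sketched as a small loop. The class below is a hypothetical in-memory stand-in for the SelfTune service, not its real client API; only the operation names Predict and SetReward come from the text, and the parameter names, the toy job model, and the acceptance rule are assumptions made for illustration.

```python
import random


class FakeSelfTuneClient:
    """Hypothetical in-memory stand-in for the SelfTune service; not the real client API."""

    def __init__(self, initial, seed=0):
        self._best = dict(initial)
        self._best_reward = float("-inf")
        self._pending = {}
        self._rng = random.Random(seed)
        self._next_id = 0

    def predict(self):
        # Steps 1-2: return parameters for the next job (a small random perturbation).
        params = {k: max(1, v + self._rng.choice((-1, 0, 1))) for k, v in self._best.items()}
        self._next_id += 1
        self._pending[self._next_id] = params
        return self._next_id, params

    def set_reward(self, job_id, reward):
        # Step 7: push the observed reward back so later predictions can use it.
        params = self._pending.pop(job_id)
        if reward > self._best_reward:
            self._best, self._best_reward = params, reward


def run_job(params):
    # Steps 3-4 (outside SelfTune's purview): pretend jobs run fastest with 8 executors.
    return 100.0 + 5.0 * abs(params["num_executors"] - 8)  # completion time in seconds


client = FakeSelfTuneClient({"num_executors": 2, "executor_cores": 2})
data_store = {}  # Steps 5-6: job metadata keyed by job id
for _ in range(50):
    job_id, params = client.predict()
    data_store[job_id] = run_job(params)
    # Negate completion time so that faster jobs earn a higher reward.
    client.set_reward(job_id, -data_store.pop(job_id))
```

Because lower completion time is better, the sketch negates it before reporting; any monotone transformation of the metric that makes "higher is better" would serve the same purpose.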

Self-tuning substrate background jobs scheduler

User-level background job scheduling: All the substrate backend servers in EXO datacenters (that host user mailboxes) run hundreds of low-priority, latency-insensitive, periodic workloads locally (e.g., mailbox replication, encryption, event-driven assistants, etc.). Workload Management (WLM) is a core substrate service that runs on all such backend servers. It helps with the user-level scheduling of workloads on the servers: a) with the goal of completing the tasks when resources become available (at micro-granular timescales), and b) mindful of the fact that high-priority, latency-sensitive workloads will bypass this scheduler. Thus, ensuring high availability of resources especially during peak hours is critical, besides meeting workload SLAs.

Tuning real-valued configuration parameters: The scheduler is implemented today as part of a huge codebase in the substrate core. The scheduler trades off resource utilization and completion rates by dynamically ramping up and ramping down the number of concurrent background tasks requiring access to the resources. This is achieved by carefully setting several configuration settings (hundreds of real-valued parameters). At a server level, we can achieve better resource utilization and throughput by automatically tuning the key parameters, based on the workloads it receives and the ensuing resource health fluctuations.

Impact of using SelfTune in WLM: We have integrated SelfTune with the substrate background scheduler codebase (the change required is simple, on the order of tens of lines of code, as shown in the figure below). We first deployed in the inner rings of substrate (over 3,000 servers). The results gathered over 4-5 weeks of deployment clearly indicate that tuning helps on most of the deployed servers, increasing throughput by at least 20% across multiple forests in their heavily throttled servers, with a marginal increase in CPU health and insignificant-to-mild degradation of disk health. Based on this validation, we have now rolled out SelfTune integration to most EXO backend servers (nearly 200,000) across the worldwide production ring.

SelfTune Application library contains the SelfTune client API and the RL/ML algorithms

Ongoing work and future AI+systems research

SelfTune is a general platform and can be readily applied to many RL-for-cloud scenarios without any additional feature engineering or onboarding efforts (which are typically required in AIOps). We expect that developers can define a suitable spatial and temporal tuning scope in the service/system, tuning the parameters of the service running in the cluster, at the level of machines, every hour of every day. Thus, instead of hand-coding the optimal operating points for various machines or various clusters that the service operates in, we could integrate SelfTune in the service codebase to dynamically figure them out over time, based on the real-time feedback at a determined temporal granularity.

Our work poses a lot of interesting design and algorithmic questions in this space. For instance, can we automatically scope the tuning problem based on some observed context such as cluster type, hardware, workload volumes, etc., and find optimal parameters per scope? Given that typical cloud applications have hundreds, if not thousands, of knobs to tune, can we automatically identify the knobs that impact the performance metric of interest, and then tune those knobs more efficiently?

A combination of system insights, ML formulations, and cross-layer optimization is vital for effective post-deployment management of cloud applications and services. We will post an update to this blog post on our ongoing work in this space soon. Meanwhile, the final blog post in this series will explore how AIOps can be made more comprehensive by spanning the entire cloud stack.

The post Automatic post-deployment management of cloud applications appeared first on Microsoft Research.

Deploy large models at high performance using FasterTransformer on Amazon SageMaker

Sparked by the release of large AI models like AlexaTM, GPT, OpenChatKit, BLOOM, GPT-J, GPT-NeoX, FLAN-T5, OPT, Stable Diffusion, and ControlNet, the popularity of generative AI has seen a recent boom. Businesses are beginning to evaluate new cutting-edge applications of the technology in text, image, audio, and video generation that have the potential to revolutionize the services they provide and the ways they interact with customers. However, as the size and complexity of the deep learning models that power generative AI continue to grow, deployment can be a challenging task. Advanced techniques such as model parallelism and quantization become necessary to achieve latency and throughput requirements. Without expertise in using these techniques, many customers struggle to get started with hosting large models for generative AI applications.

This post can help! We begin by discussing different types of model optimizations that can be used to boost performance before you deploy your model. Then, we highlight how Amazon SageMaker large model inference deep learning containers (LMI DLCs) can help with optimization and deployment. Finally, we include code examples using LMI DLCs and FasterTransformer model parallelism to deploy models like flan-t5-xxl and flan-ul2. You can find an accompanying example notebook in the SageMaker examples repository.

Large model deployment pipeline

Major steps in any model inference workflow include loading a model into memory and handling inference requests on this in-memory model through a model server. Large models complicate this process because loading a 350 GB model such as BLOOM-176B can take tens of minutes, which materially impacts endpoint startup time. Furthermore, because these models can’t fit within the memory of a single accelerator, the model must be organized and partitioned such that it can be spread across the memory of multiple accelerators; then, model servers must handle processes and communication across multiple accelerators. Beyond model loading, partitioning, and serving, compression techniques are increasingly necessary to achieve performance goals (such as subsecond latency) for customers working with large models. Quantization and compression can reduce model size and serving cost by reducing the precision of weights or reducing the number of parameters via pruning or distillation. Compilation can optimize the computation graph and fuse operators to reduce memory and compute requirements of a model. Achieving low latency for large language models (LLMs) requires improvements in all the steps in the inference workflow: compilation, model loading, compression (runtime quantization), partitioning (tensor or pipeline parallelism), and model serving. At a high level, partitioning (with kernel optimization) reduces inference latency by up to 66% (for example, BLOOM-176B from 30 seconds to 10 seconds), compilation by 20%, and compression by 50% (fp32 to fp16). An example pipeline for large model hosting with runtime partitioning is illustrated in the following diagram.
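The sizes and percentages quoted above follow from simple arithmetic. The sketch below assumes 2 bytes per parameter for 16-bit weights and a hypothetical accelerator with 40 GB of memory; it counts weights only and ignores activations and other runtime overhead.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_accelerators(num_params, dtype, accelerator_memory_gb):
    """Lower bound on the accelerators needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, dtype) / accelerator_memory_gb)


bloom_fp16_gb = weight_memory_gb(176e9, "fp16")             # 352.0, close to the 350 GB quoted
min_40gb_accelerators = min_accelerators(176e9, "fp16", 40)  # 9
latency_reduction = (30 - 10) / 30                           # about 0.67, the "up to 66%" figure
```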

Overview of large model inference optimization techniques

With the large model deployment pipeline in mind, we now explore the optimizations. Optimizations can be critical to achieve latency and throughput goals. However, you need to be thoughtful about which optimizations you use and to what degree, because the accuracy of your model can be affected.

The following diagram is a high-level overview of different inference optimization techniques. Optimization approaches can be at the hardware or software level. We focus only on software optimization techniques in this post.

Optimized kernels and compilation

Today, optimized kernels are the greatest source of performance improvement for LMI (for example, DeepSpeed’s kernels reduced BLOOM-176B latency by a factor of three). Fused kernel operators are model specific, and different model parallel libraries have different approaches. DeepSpeed created an injection policy for each model family, and has handwritten PyTorch modules and CUDA kernels that speed up parts of the model. Meanwhile, FasterTransformer rewrites the model in pure C++ and CUDA to speed up the model as a whole. PyTorch 2.0 offers an open portal (via torch.compile) to allow easy compilation into different platforms. To bring cost- and performance-oriented optimization to SageMaker for LLMs, we offer SageMaker LMI containers that provide the best open-source compilation stack on a per-model basis, like T5 with FasterTransformer and GPT-J with DeepSpeed.

Compilation or integration to optimized runtime

ML compilers, such as Amazon SageMaker Neo, apply techniques such as operator fusion, memory planning, graph optimizations, and automatic integration to optimized inference libraries. Because inference includes only a forward propagation, intermediate tensors between layers are discarded instead of stored for reuse in back-propagation. The graph optimization techniques improve the inference throughput and have a small impact on model memory footprints. Relative to other optimization techniques, compilation for inference provides a limited benefit for reducing a model’s memory requirements. Several runtime libraries for GPU are available today, such as FasterTransformer, TensorRT, and ONNX Runtime.

Model compression

Model compression is a collection of approaches that researchers and practitioners can use to reduce the size of their model, realize faster speed, and reduce hosting cost. Model compression techniques primarily include knowledge distillation, pruning, and quantization. Most compression technologies are challenging for LLMs due to requiring additional training cycles to improve the accuracy of compressed models.

Quantization

Quantization is the process of mapping values from a larger or continuous set of numbers to a smaller set of numbers (for example, INT8 {-128:127}, uINT8 {0:255}). Using a smaller set of numbers reduces memory use and complexity of computations, but the decreased precision can degrade the accuracy of the model. The level of quantization can be adjusted to fit size constraints and accuracy needs. For example, a model quantized to FP8 will be about half the size of a model in FP16 but at the expense of reduced accuracy.

Quantization has shown great and consistent success for inference tasks by reducing the size of the model up to 75%, offering 2–4 times throughput improvements and cost savings.

Quantization is successful because it’s broadly applicable across a range of models and use cases, with approximately 1% accuracy/score loss if a proper technique is used. It doesn’t require changing the model architecture. Typically, it starts with an existing floating-point model and quantizes it to obtain a fixed-point quantized model. Quantizing from FP32 to INT8 reduces the model size by 75%, and the accuracy/score loss is often less than a point.
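The 75% figure for FP32 to INT8 can be seen in a minimal sketch of symmetric per-tensor quantization. This is one simple scheme chosen for illustration, not the method used by any particular library; the function names and sample weights are hypothetical.

```python
from array import array


def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to INT8 values in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    q = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return q, scale


def dequantize(q, scale):
    return [x * scale for x in q]


weights = array("f", [0.02, -1.27, 0.635, 0.9, -0.4])  # 4 bytes per value
q_weights, scale = quantize_int8(weights)              # 1 byte per value
restored = dequantize(q_weights, scale)

size_reduction = 1 - (q_weights.itemsize / weights.itemsize)  # 0.75
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is why the accuracy impact depends on how widely the weight values are spread.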

Distillation

With distillation, a larger teacher model transfers knowledge to a smaller student model. The model size can be reduced until the student model can fit on an edge device or smaller cloud-based hardware, but accuracy decreases as the model is reduced. There is no industry standard for distillation, and many techniques are experimental. Distillation requires more work by the customer in tuning and trial and error to shrink the model without affecting accuracy. For more information, refer to Knowledge distillation in deep learning and its applications.
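As the text notes, there is no single standard for distillation. One common formulation trains the student to match the teacher's temperature-softened output distribution; the sketch below computes that loss for a single example. The temperature, the logits, and the function names are assumptions for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]
loss_far = distillation_loss(teacher, [0.5, 1.0, 4.0])   # student disagrees with teacher
loss_near = distillation_loss(teacher, [3.8, 1.1, 0.6])  # student nearly matches teacher
```

In practice this term is usually combined with the ordinary task loss, and the weighting between the two is one of the things that requires tuning.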

Pruning

Pruning is a model compression technique that reduces the number of operations by removing parameters. To minimize the impact to model accuracy, parameters are first ranked by importance. Parameters that are less important are set to zero or connections to the neuron are removed. This decreases the number of operations with minimal impact to model accuracy. For example, when using a pre-trained model for a narrow use case, parts of the larger model that are less relevant to your application could be pruned away to reduce size without significantly degrading performance for your task.
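A minimal sketch of the ranking-then-zeroing step, assuming that a weight's absolute value is used as its importance score (magnitude pruning). Other importance measures exist; the function name and sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Rank parameter indices by importance, approximated here by magnitude.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(ranked[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
# pruned == [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```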

Model partitioning

A model that can’t fit on a single accelerator’s memory must be split into multiple partitions. At a high level, there are two fundamental approaches to partitioning the model (model parallelism): tensor parallelism and pipeline parallelism.

Tensor parallelism is also called intra-layer model parallelism. In this approach, each one of the layers is partitioned across the workers (accelerators). On the positive side, we can handle models with very large layers, because the layers are split across workers. Therefore, we no longer need to fit at least a single layer on a worker, as was the case for pipeline parallelism. However, this leads to an all-to-all communication pattern between the workers after each one of the layers, so there’s a heavy burden on the GPU/accelerator interconnect.
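To make the intra-layer split concrete, the following sketch partitions one layer's weight matrix by output columns across two simulated workers and then gathers the partial outputs. It is a toy illustration in plain Python with hypothetical helper names; real frameworks shard tensors on accelerators and use collective communication for the gather.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def split_columns(w, num_workers):
    """Give each worker a contiguous slice of the layer's output columns."""
    cols = len(w[0])
    step = cols // num_workers  # assumes cols is divisible by num_workers
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(num_workers)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each worker computes its slice of the layer output from its own weight shard.
partials = [matmul(x, shard) for shard in split_columns(w, num_workers=2)]
# Gathering the slices is the communication step that follows the layer.
gathered = [value for part in partials for value in part]
```

Each worker holds only half of the layer's weights, but every layer ends with a gather, which is the source of the interconnect pressure mentioned above.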

Pipeline parallelism partitions the model into layers. Each worker may end up with having one or more layers. This approach uses point-to-point communication and therefore introduces lower communication overhead compared to tensor parallelism. However, this approach won’t be useful if a layer can’t fit into a single worker’s or accelerator’s memory. This approach is also prone to pipeline idleness and may reduce the scaling efficiency.
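A toy sketch of the layer-wise split, with hypothetical names and simple functions standing in for layers. Each stage would live on a different worker, and only the activation is handed from one stage to the next.

```python
def make_stages(layers, num_workers):
    """Assign contiguous groups of layers to workers, one group (stage) per worker."""
    per_stage = -(-len(layers) // num_workers)  # ceiling division
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]


def run_pipeline(stages, x):
    # The activation is handed from one stage to the next (point-to-point).
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x


layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
stages = make_stages(layers, num_workers=2)
output = run_pipeline(stages, 5)  # ((5 + 1) * 2 - 3) ** 2 == 81
```

With a single input, only one stage is busy at any moment, which illustrates the pipeline idleness mentioned above; splitting a batch into micro-batches is the usual way to keep more stages occupied.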

Open-source frameworks like DeepSpeed, Hugging Face Accelerate, and FasterTransformer allow per-model optimization to shard the model. For DeepSpeed in particular, the partitioning algorithm is tightly coupled with fused kernel operators. SageMaker LMI containers come with pre-integrated model partitioning frameworks like FasterTransformer, DeepSpeed, Hugging Face Accelerate, and Transformers-NeuronX. Currently, DeepSpeed, FasterTransformer, and Hugging Face Accelerate shard the model at model loading time. Runtime model partitioning can take more than 10 minutes (OPT-66B) and consume extensive CPU, GPU, and accelerator memory. Ahead-of-time (AOT) partitioning can help reduce model loading times. With AOT, models are partitioned before deployment and the partitions are kept ready for downstream optimization and subsequent ingestion by model parallel frameworks. When model parallel frameworks are fed already partitioned models, runtime partitioning doesn’t happen. This improves model loading time and reduces CPU, GPU, and accelerator memory consumption. DeepSpeed and FasterTransformer support pre-partitioning and saving models.

Prompt engineering

Prompt engineering refers to efforts to extract accurate, consistent, and fair outputs from large models, such as text-to-image synthesizers or large language models. LLMs are trained on large-scale bodies of text, so they encode a great deal of factual information about the world. A prompt consists of text and optionally an image given to a pre-trained model for a prediction task. A prompt text may consist of additional components like context, task (instruction, question, and so on), image or text, and training samples. Prompt engineering also provides a way for LLMs to do few-shot generalization, in which a machine learning model trained on a set of generic tasks learns a new or related task from just a handful of examples. For more information, refer to EMNLP: Prompt engineering is the new feature engineering. Refer to the following GitHub repo for more information about getting the most out of your large models using prompt engineering on SageMaker.
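The prompt components listed above (instruction, a handful of examples, and the new input) can be assembled with plain string formatting. The following is a minimal sketch; the field labels, the example reviews, and the function name are hypothetical and not taken from any SageMaker API.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble instruction, worked examples, and the new input into one prompt string."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts += [f"Review: {text}", f"Sentiment: {label}", ""]
    parts += [f"Review: {query}", "Sentiment:"]
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great battery life.", "positive"), ("Stopped working in a week.", "negative")],
    "The screen is sharp and bright.",
)
```

Ending the prompt with the empty "Sentiment:" field invites the model to continue with a label in the same format as the examples.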

Model downloading and loading

Large language models incur long download times (for example, 40 minutes to download BLOOM-176B). In 2022, SageMaker Hosting added support for larger Amazon Elastic Block Store (Amazon EBS) volumes up to 500 GB, longer download timeout up to 60 minutes, and longer container startup time of 60 minutes. You can enable this configuration to deploy LLMs on SageMaker. SageMaker LMI containers include model download optimization by using the s5cmd library to speed up the model download time and container startup times, and eventually speed up auto scaling on SageMaker.

Diving deep into SageMaker LMI containers

SageMaker maintains large model inference containers with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. With these containers, you can use corresponding open-source libraries such as DeepSpeed, Accelerate, FasterTransformer, and Transformers-NeuronX to partition model parameters using model parallelism techniques to use the memory of multiple GPUs or accelerators for inference. Transformers-NeuronX is a model parallel library introduced by the AWS Neuron team for AWS Inferentia and AWS Trainium to support LLMs. It supports tensor parallelism across Neuron cores.

The LMI container uses DJLServing as the pre-built integrated model server and includes pre-built integrated model partitioning frameworks like DeepSpeed, Accelerate, FasterTransformer, and Transformers-NeuronX, along with support for PyTorch. It comes with pre-installed cuDNN, cuBLAS, NCCL, and CUDA Toolkit for GPUs, MKL for CPU, and the Neuron SDK and runtime for running models on AWS Inferentia and Trainium.

Pre-integrated model partitioning frameworks in SageMaker LMI containers

SageMaker LMI comes with pre-integrated model partitioning frameworks to suit your performance and model support requirements.

Most of the model parallel frameworks support both pipeline and tensor parallelism. Pipeline parallelism is simpler to implement than tensor parallelism. However, due to its sequential operating nature, it’s slower than tensor parallelism. Pipeline parallelism and tensor parallelism can also be combined.

Transformers-NeuronX is a model parallel library introduced by the Neuron team to support LLMs on AWS Inferentia and Trainium. It supports tensor parallelism across Neuron cores. The following table summarizes different model partitioning frameworks. This will help you select the right framework for deploying your models on SageMaker.

Feature | Hugging Face Accelerate | DeepSpeed | FasterTransformer | Transformers-NeuronX (Inf2/Trn1)
Model Parallel | Pipeline Parallelism | Pipeline and Tensor Parallelism | Pipeline and Tensor Parallelism | Tensor Parallelism
Load Hugging Face checkpoints
Runtime partition .
Ahead-of-time partition . .
Model partitioning on CPU memory . . .
Supported models | All Hugging Face models | All GPT family, Stable Diffusion, and T5 family | GPT2/OPT/BLOOM/T5 | GPT2/OPT/GPTJ/GPT-NeoX*
Streaming tokens .
Fast model loading .
Model loading speed | Medium | Fast | Fast | .
Performance on model types | All other non-optimized models | GPT family | T5 and BLOOM | All supported models
Hardware support | CPU/GPU | GPU | GPU | Inf2/Trn1
SM MME support .

Large model deployment pipeline on SageMaker

SageMaker LMI containers offer a low-code/no-code mechanism to set up your large model deployment pipeline with the following capabilities:

  • Faster model download time using s5cmd
  • Pre-built optimized model parallel frameworks including Transformers-NeuronX, DeepSpeed, Hugging Face Accelerate, and FasterTransformer
  • Pre-built foundation software stack including PyTorch, NCCL, and MPI
  • Low-code/no-code deployment of large models by configuring serving.properties
  • SageMaker-compatible containers

The following diagram gives an overview of a SageMaker LMI deployment pipeline you can use to deploy your models.

Deploy a FLAN-T5-XXL model on SageMaker using the newly released LMI container version

FasterTransformer is a library implementing an accelerated engine for the inference of transformer-based neural networks, with a special emphasis on large models, spanning many GPUs and nodes in a distributed manner. FasterTransformer contains the implementation of the highly optimized version of the transformer block that contains the encoder and decoder parts. With this block, you can run the inference of both the full encoder-decoder architectures like T5, as well as encoder-only models such as BERT, or decoder-only models such as GPT. It’s written in C++/CUDA and relies on the highly optimized cuBLAS, cuBLASLt, and cuSPARSELt libraries. This allows you to build the fastest transformer inference pipeline on GPU.

The FasterTransformer model parallel library is now available in a SageMaker LMI container, adding support for popular models such as flan-t5-xxl and flan-ul2. FasterTransformer is an open-source library from NVIDIA that provides an accelerated engine for efficiently running transformer-based neural network inference. It has been designed to handle large models that require multiple GPUs or accelerators and nodes in a distributed manner. The library includes an optimized version of the transformer block, which comprises both the encoder and decoder parts, enabling you to run the inference of full encoder-decoder architectures like T5, as well as encoder-only models like BERT and decoder-only models like GPT.

Runtime architecture of hosting a model using an LMI container’s FasterTransformer engine on SageMaker

The FasterTransformer engine in an LMI container supports loading model weights from an Amazon Simple Storage Service (Amazon S3) path or the Hugging Face Hub. After fetching the model, it converts the Hugging Face model checkpoint to FasterTransformer-supported partitioned model artifacts based on input parameters like tensor parallel degree, and loads the partitioned model artifacts across GPU devices. It loads models faster by using multi-process loading in Python. It supports AOT compilation and uses the CPU to partition the model. SageMaker LMI containers improve the performance of downloading models from Amazon S3 by using s5cmd. They also provide the FasterTransformer engine, a layer of abstraction for developers that loads the model in Hugging Face checkpoint or PyTorch bin format and uses the FasterTransformer library to convert it into a FasterTransformer-compatible format. These steps happen during container startup and load the model into memory before inference requests come in. The FasterTransformer engine provides high-performance C++ and CUDA implementations for the models to run inference. This helps improve the container startup time and reduce the inference latency. The following diagram illustrates the runtime architecture of serving models using FasterTransformer on SageMaker. For more information about DJLServing’s runtime architecture, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

Use SageMaker LMI container images

To use a SageMaker LMI container to host a FLAN-T5 model, we have a no-code option or a bring-your-own-script option. We showcase the bring-your-own-script option in this post. The first step in the process is to use the right LMI container image. An example notebook is available in the GitHub repo.

Use the following code to use the SageMaker LMI container image after replacing the Region with the specific Region you’re running the notebook in:

from sagemaker import image_uris

# sess is assumed to be an existing sagemaker.Session()
inference_image_uri = image_uris.retrieve(
    framework="djl-fastertransformer", region=sess.boto_session.region_name, version="0.21.0"
)

Download the model weights

An LMI container allows us to download the model weights from the Hugging Face Hub at run time when spinning up the instance for deployment. However, that takes longer because it’s dependent on the network and on the provider. The faster option is to download the model weights into Amazon S3 and then use the LMI container to download them to the container from Amazon S3. This is also a preferred method when we need to scale up our instances. In this post, we showcase how to download the weights to Amazon S3 and then use them when configuring the container. See the following code:

from huggingface_hub import snapshot_download

model_name = "google/flan-t5-xxl"
# Only download PyTorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]
# Use snapshot_download because the model is stored in the repository using Git LFS
# (local_model_path is assumed to be an existing local directory)
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"

model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)

Create the model configuration and inference script

First, we create a file called serving.properties that configures the container. This tells the DJL model server to use the FasterTransformer engine to load and shard the model weights. Second, we point to the S3 URI where the model weights have been uploaded. The LMI container will download the model artifacts from Amazon S3 using s5cmd. The file contains the following code:

engine = FasterTransformer
option.tensor_parallel_degree = 4
option.s3url = {{s3url}}

For the no-code option, the key change is to specify the entryPoint as the built-in handler. We specify the value as djl_python.fastertransformer. For more details, refer to the GitHub repo. You can modify this code for your own use case as needed. A complete example that illustrates the no-code option can be found in the following notebook. The serving.properties file will now look like the following code:

engine=FasterTransformer
option.entryPoint=djl_python.fastertransformer
option.s3url={{s3url}}
option.tensor_parallel_degree=4

Next, we create our model.py file, which defines the code needed to load and then serve the model. The only mandatory method is handle(inputs). We continue to use the functional programming paradigm to build the other helpful methods like load_model(), pipeline_generate(), and more. In our code, we read in the tensor_parallel_degree property value (the default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. Second, we get the model weights downloaded under the /tmp location on the container and referenceable by the environment variable “model_dir”. To load the model, we use the FasterTransformer init method as shown in the following code. Note that we load the full precision weights in FP32. You can also quantize the model at runtime by setting dtype = "fp16" in the following code and setting tensor_parallel_degree = 2 in serving.properties. However, note that the FP16 version of this model may not provide output quality similar to that of the FP32 version. In addition, refer to an existing issue related to impact on the model quality on FasterTransformer for the T5 model for certain NLP tasks.

import fastertransformer as ft
from djl_python import Input, Output
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    T5Tokenizer,
    T5ForConditionalGeneration,
)
import os
import logging
import math
import torch


def load_model(properties):
    model_name = "google/flan-t5-xxl"
    # Properties arrive as strings; fall back to the documented default of 1
    tensor_parallel_degree = int(properties.get("tensor_parallel_degree", 1))
    pipeline_parallel_degree = 1
    model_location = properties["model_dir"]
    if "model_id" in properties:
        model_location = properties["model_id"]
    logging.info(f"Loading model in {model_location}")

    tokenizer = T5Tokenizer.from_pretrained(model_location)
    dtype = "fp32"
    model = ft.init_inference(
        model_location, tensor_parallel_degree, pipeline_parallel_degree, dtype
    )
    return model, tokenizer


model = None
tokenizer = None


def handle(inputs: Input):
    """
    inputs: Contains the configurations from serving.properties
    """
    global model, tokenizer

    if not model:
        model, tokenizer = load_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()

    input_sentences = data["inputs"]
    params = data["parameters"]

    outputs = model.pipeline_generate(input_sentences, **params)
    result = {"outputs": outputs}

    return Output().add_as_json(result)
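The two files created in this section need to reach the container together. A minimal sketch of one way to bundle them into a tarball with the standard library follows; the directory name `code`, the tarball name, the bucket path, and the helper name are assumptions for illustration, and the accompanying notebook remains the reference for the exact layout.

```python
import tarfile
import tempfile
from pathlib import Path

SERVING_TEMPLATE = """engine=FasterTransformer
option.tensor_parallel_degree=4
option.s3url={{s3url}}
"""


def package_model_code(s3url, model_py_source, out_dir):
    """Write serving.properties and model.py, then bundle them into model.tar.gz."""
    code_dir = Path(out_dir) / "code"  # hypothetical directory name
    code_dir.mkdir(parents=True, exist_ok=True)
    properties = SERVING_TEMPLATE.replace("{{s3url}}", s3url)
    (code_dir / "serving.properties").write_text(properties)
    (code_dir / "model.py").write_text(model_py_source)
    tarball = Path(out_dir) / "model.tar.gz"
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(code_dir, arcname="code")
    return tarball


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder S3 location and model.py body, for illustration only.
    tarball = package_model_code("s3://my-bucket/my-prefix/", "# model.py contents\n", tmp)
    with tarfile.open(tarball) as tar:
        members = sorted(tar.getnames())
```

The resulting tarball would then be uploaded to Amazon S3, and its S3 URI passed as the model data location when creating the SageMaker model.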

Create a SageMaker endpoint for inference

In this section, we go through the steps to create a SageMaker model and endpoint for inference.

Create a SageMaker model

We now create a SageMaker model. We use the Amazon Elastic Container Registry (Amazon ECR) image retrieved earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure tensor_parallel_degree to 4 in serving.properties, which means the model is partitioned across 4 GPUs. See the following code:

from sagemaker.utils import name_from_base
model_name = name_from_base("flan-xxl-fastertransformer")
print(model_name)
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri, 
        "ModelDataUrl": s3_code_artifact
    },
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")

Create a SageMaker endpoint for inference

You can use any instances with multiple GPUs for testing. In this demo, we use a g5.12xlarge instance. In the following code, note how we set ModelDataDownloadTimeoutInSeconds and ContainerStartupHealthCheckTimeoutInSeconds. We don’t set the VolumeSizeInGB parameters because this instance comes with SSD. The VolumeSizeInGB parameter is applicable to GPU instances supporting the EBS volume attachment.

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB": 200,
            "ModelDataDownloadTimeoutInSeconds": 600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        },
    ],
)

Lastly, we create a SageMaker endpoint:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Starting the endpoint might take a few minutes. You can try a few more times if you run into the InsufficientInstanceCapacity error, or you can raise a request to AWS to increase the limit in your account.

Invoke the model

This is a generative model, so we pass in text as a prompt, and the model completes the sentence and returns the results.

You can pass a batch of prompts as input to the model. This is done by setting inputs to the list of prompts. The model then returns a result for each prompt. The text generation can be configured using appropriate parameters.

# -- we set the prompt in the parameter name which matches what we try and extract in model.py
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({
        "batch_size": 1,
        "inputs" : "Amazon.com is an awesome site",
        "parameters" : {},
    }),
    ContentType="application/json",
)
response_model["Body"].read().decode("utf8")

Model parameters at inference time

The following code lists the set of default parameters that is used by the model. You can set these arguments to a specific value of your choice while invoking the endpoint.

default_args = dict(
            inputs_embeds=None,
            beam_width=1,
            max_seq_len=200,
            top_k=1,
            top_p=0.0,
            beam_search_diversity_rate=0.0,
            temperature=1.0,
            len_penalty=0.0,
            repetition_penalty=1.0,
            presence_penalty=None,
            min_length=0,
            random_seed=0,
            is_return_output_log_probs=False,
            is_return_cum_log_probs=False,
            is_return_cross_attentions=False,
            bad_words_list=None,
            stop_words_list=None
        )

The following code has a sample invocation to the endpoint we deployed. We use the max_seq_len parameter to control the number of tokens that are generated and temperature to control the randomness of the generated text.

smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": [
                "Title: “University has a new facility coming up”\nGiven the above title of an imaginary article, imagine the article.\n"
            ],
            "parameters": {"max_seq_len": 200, "temperature": 0.7},
            "padding": True,
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")
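
The request and response shapes used above can be wrapped in two small helpers. This is a sketch under the assumption that the handler reads the inputs and parameters keys and returns a JSON object with an outputs key, as in the model.py shown earlier; the helper names and the sample prompts and response are hypothetical.

```python
import json


def build_request_body(prompts, **generation_params):
    """Serialize prompts and generation parameters into the JSON shape handle() reads."""
    if isinstance(prompts, str):
        prompts = [prompts]  # treat a single prompt as a batch of one
    return json.dumps({"inputs": list(prompts), "parameters": generation_params})


def parse_response_body(raw_body):
    """Return the generated texts from a JSON response of the form {"outputs": [...]}."""
    return json.loads(raw_body)["outputs"]


body = build_request_body(
    ["Translate to German: Good morning", "Summarize: SageMaker hosts models."],
    max_seq_len=50,
    temperature=0.7,
)
print(body)

# Hypothetical response text, shaped like the handler's {"outputs": ...} result.
sample_response = json.dumps({"outputs": ["Guten Morgen", "SageMaker hosts models."]})
print(parse_response_body(sample_response))
```

The string returned by build_request_body can be passed as the Body argument of invoke_endpoint, and the decoded response body can be passed to parse_response_body.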

Clean up

When you’re done testing the model, delete the endpoint to save costs if the endpoint is no longer required:

# - Delete the end point
sm_client.delete_endpoint(EndpointName=endpoint_name)

Performance tuning

If you intend to use this post and accompanying notebook with a different model, you may want to explore some of the tunable parameters that SageMaker, DeepSpeed, and the DJL offer. Iteratively experimenting with these parameters can have a material impact on the latency, throughput, and cost of your hosted large model. To learn more about tuning parameters such as number of workers, degree of tensor parallelism, job queue size, and others, refer to DJLServing configurations and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

Benchmarking results on hosting FLAN-T5 model on SageMaker

The following table summarizes our benchmarking results.

Model | Model Partitioning and Optimization Engine | Quantization | Batch Size | Tensor Parallel Degree | Number of Workers | Inference Latency P50 (ms) | Inference Latency P90 (ms) | Inference Latency P99 (ms) | Data Quality
flan-t5-xxl | FasterTransformer | FP32 | 4 | 4 | 1 | 327.39 | 331.01 | 612.73 | Normal

For our benchmark, we combined four different types of tasks into a single batch and benchmarked the Flan-T5-XXL model. FasterTransformer uses a tensor parallel degree of 4 (the model is partitioned across four accelerator devices on the same host). In our benchmark, FasterTransformer was the most performant in terms of latency and throughput compared to other frameworks for hosting this model. The P99 inference latency was 612.73 milliseconds.
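
If you want to reproduce this kind of summary for your own endpoint, percentiles can be computed from a list of per-request latencies. The following is an illustrative sketch using only the Python standard library; the sample values are made up and are not the measurements reported above, and the interpolation method is an assumption that may differ from the tooling used for the benchmark.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return the P50, P90, and P99 of a list of latency samples in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index i - 1 holds the i-th percentile.
    cut_points = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cut_points[49], "p90": cut_points[89], "p99": cut_points[98]}


# Made-up samples for illustration only; these are not the benchmark measurements.
samples_ms = [320.0, 325.5, 327.0, 328.1, 329.4, 330.2, 331.0, 333.7, 340.9, 610.0]
summary = latency_percentiles(samples_ms)
print(summary)
```

A single slow request is enough to pull P99 far above P50 and P90, which is the same pattern visible in the table.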

Conclusion

In this post, we gave an overview of large model hosting challenges, and how SageMaker LMI containers help you address these challenges using its low-code/no-code capabilities. We showcased how to host large models using FasterTransformer with high performance on SageMaker using the SageMaker LMI container. We demonstrated this new capability in an example of deploying a FLAN-T5-XXL model on SageMaker. We also covered options available to tune the performance of your models using different model optimization approaches and how SageMaker LMI containers offer low-code/no-code options to you in hosting and optimizing the large models.


About the authors

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Rohith Nallamaddi is a Software Development Engineer at AWS. He works on optimizing deep learning workloads on GPUs, building high performance ML inference and serving solutions. Prior to this, he worked on building microservices based on AWS for Amazon F3 business. Outside of work he enjoys playing and watching sports.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or catching up with sports.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.


Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy


“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.” – Andrew Ng

A data-centric AI approach involves building AI systems with quality data involving data preparation and feature engineering. This can be a tedious task involving data collection, discovery, profiling, cleansing, structuring, transforming, enriching, validating, and securely storing the data.

Amazon SageMaker Data Wrangler is a service in Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data using little to no coding. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify data preprocessing and feature engineering, taking data preparation to production faster without the need to author PySpark code, install Apache Spark, or spin up clusters.

For scenarios where you need to add your own custom scripts for data transformations, the Data Wrangler custom transform capability lets you write your transformation logic in Pandas, PySpark, or PySpark SQL. Data Wrangler now supports NLTK and SciPy libraries for authoring custom transformations to prepare text data for ML and perform constraint optimization.

In this post, we discuss how you can write your custom transformation in NLTK to prepare text data for ML. We also share some example custom code transforms using other common frameworks such as NumPy, SciPy, and scikit-learn, as well as AWS AI services. For the purpose of this exercise, we use the Titanic dataset, a popular dataset in the ML community, which has now been added as a sample dataset within Data Wrangler.

Solution overview

Data Wrangler provides over 40 built-in connectors for importing data. After data is imported, you can build your data analysis and transformations using over 300 built-in transformations. You can then generate industrialized pipelines to push the features to Amazon Simple Storage Service (Amazon S3) or Amazon SageMaker Feature Store. The following diagram shows the end-to-end high-level architecture.

Prerequisites

Data Wrangler is a SageMaker feature available within Amazon SageMaker Studio. You can follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On) for authentication (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

Import the Titanic dataset

Start your Studio environment and create a new Data Wrangler flow. You can either import your own dataset or use a sample dataset (Titanic) as shown in the following screenshot. Data Wrangler allows you to import datasets from different data sources. For our use case, we import the sample dataset from an S3 bucket.

Once imported, you will see two nodes (the source node and the data type node) in the data flow. Data Wrangler automatically identifies the data type for all the columns in the dataset.

Custom transformations with NLTK

For data preparation and feature engineering with Data Wrangler, you can use over 300 built-in transformations or build your own custom transformations. Custom transforms can be written as separate steps within Data Wrangler. They become part of the .flow file within Data Wrangler. The custom transform feature supports Python, PySpark, and SQL as different steps in code snippets. After notebook files (.ipynb) are generated from the .flow file or the .flow file is used as recipes, the custom transform code snippets persist without requiring any changes. This design of Data Wrangler allows custom transforms to become part of a SageMaker Processing job for processing massive datasets with custom transformations.

The Titanic dataset has a couple of features (name and home.dest) that contain text information. We use NLTK to split the name column and extract the last name, and print the frequency of last names. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength natural language processing (NLP) libraries.

To add a new transform, complete the following steps:

  1. Choose the plus sign and choose Add Transform.
  2. Choose Add Step and choose Custom Transform.

You can create a custom transform using Pandas, PySpark, Python user-defined functions, and SQL PySpark.

  1. Choose Python (Pandas) and add the following code to extract the last name from the name column:
    import nltk
    nltk.download('punkt')
    tokens = [nltk.word_tokenize(name) for name in df['name']]
    
    # Extract the last names of the passengers (names are formatted as
    # "Last, Title First", so the last name is the first token)
    df['last_name'] = [token[0] for token in tokens]

  2. Choose Preview to review the results.

The following screenshot shows the last_name column extracted.

  1. Add another custom transform step to identify the frequency distribution of the last names, using the following code:
    import nltk
    fd = nltk.FreqDist(df["last_name"])
    print(fd.most_common(10))

  2. Choose Preview to review the results of the frequency.

Custom transformations with AWS AI services

AWS pre-trained AI services provide ready-made intelligence for your applications and workflows. AWS AI services easily integrate with your applications to address many common use cases. You can now use the capabilities for AWS AI services as a custom transform step in Data Wrangler.

Amazon Comprehend uses NLP to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.

We use Amazon Comprehend to extract the entities from the name column. Complete the following steps:

  1. Add a custom transform step.
  2. Choose Python (Pandas).
  3. Enter the following code to extract the entities:
    import boto3
    comprehend = boto3.client("comprehend")
    
    # Detect entities in the first passenger name
    response = comprehend.detect_entities(LanguageCode = 'en', Text = df['name'].iloc[0])
    
    for entity in response['Entities']:
        print(entity['Type'] + ":" + entity["Text"])

  4. Choose Preview and visualize the results.

We have now added three custom transforms in Data Wrangler.

  1. Choose Data Flow to visualize the end-to-end data flow.

Custom transformations with NumPy and SciPy

NumPy is an open-source library for Python offering comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more. SciPy is an open-source Python library used for scientific computing and technical computing, containing modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier transform (FFT), signal and image processing, solvers, and more.

Data Wrangler custom transforms allow you to combine Python, PySpark, and SQL as different steps. In the following Data Wrangler flow, different functions from Python packages, NumPy, and SciPy are applied on the Titanic dataset as multiple steps.

NumPy transformations

The fare column of the Titanic dataset has boarding fares of different passengers. The histogram of the fare column shows uniform distribution, except for the last bin. By applying NumPy transformations like log or square root, we can change the distribution (as shown by the square root transformation).

See the following code:

import pandas as pd
import numpy as np
df["fare_log"] = np.log(df["fare_interpolate"])
df["fare_sqrt"] = np.sqrt(df["fare_interpolate"])
df["fare_cbrt"] = np.cbrt(df["fare_interpolate"])

SciPy transformations

SciPy functions like z-score are applied as part of the custom transform to standardize fare distribution with mean and standard deviation.

See the following code:

from scipy.stats import zscore
df["fare_zscore"] = zscore(df["fare_interpolate"])
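
To make the standardization explicit, the following sketch computes z-scores by hand with the Python standard library. It assumes the population standard deviation, which matches the default behavior of scipy.stats.zscore (ddof=0); the function name and the fare values are hypothetical.

```python
import statistics


def zscores(values):
    """Standardize values to zero mean and unit variance using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical fares, used only to illustrate the calculation.
fares = [7.25, 71.28, 7.92, 53.10, 8.05]
standardized = zscores(fares)
print([round(z, 3) for z in standardized])
```

After the transform, the column has a mean of 0 and a standard deviation of 1, so fares can be compared on the same scale as other standardized features.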

Constraint optimization with NumPy and SciPy

Data Wrangler custom transforms can handle advanced transformations like constraint optimization applying SciPy optimize functions and combining SciPy with NumPy. In the following example, fare as a function of age doesn’t show any observable trend. However, constraint optimization can transform fare as a function of age. The constraint condition in this case is that the new total fare remains the same as the old total fare. Data Wrangler custom transforms allow you to run the SciPy optimize function to determine the optimal coefficient that can transform fare as a function of age under constraint conditions.

Optimization definition, objective definition, and multiple constraints can be mentioned as different functions while formulating constraint optimization in a Data Wrangler custom transform using SciPy and NumPy. Custom transforms can also bring different solver methods that are available as part of the SciPy optimize package. A new transformed variable can be generated by multiplying the optimal coefficient with the original column and added to existing columns of Data Wrangler. See the following code:

import numpy as np
import scipy.optimize as opt
import pandas as pd

df2 = pd.DataFrame({"Y": df["fare_interpolate"], "X1": df["age_interpolate"]})

# optimization definition
def main(df2):
    x0 = [0.1]
    res = opt.minimize(fun=obj, x0=x0, args=(df2,), method="SLSQP", bounds=[(0, 50)], constraints=cons)
    return res

# objective function: sum of squared differences between fare and coefficient * age
def obj(x0, df2):
    sumSquares = np.sum((df2["Y"] - x0 * df2["X1"]) ** 2)
    return sumSquares

# constraints: the new total fare must equal the old total fare
def constraint1(x0):
    sum_cons1 = np.sum(df2["Y"] - x0 * df2["X1"]) - 0
    return sum_cons1

con1 = {'type': 'eq', 'fun': constraint1}
cons = [con1]

# run the optimization once and reuse the result
res = main(df2)
print(res)

# multiply the optimal coefficient with the original column
df["new_fare_age_optimized"] = res.x[0] * df2["X1"]

The Data Wrangler custom transform feature has the UI capability to show the results of SciPy optimize functions like value of optimal coefficient (or multiple coefficients).
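
With a single coefficient and a single equality constraint, the result can be checked by hand: requiring the new total fare to equal the old total fare means the coefficient is the sum of the fares divided by the sum of the ages, as long as that value lies within the bounds. The following sketch illustrates this check with hypothetical numbers; it is a sanity check on the formulation above, not a replacement for the SciPy solver.

```python
def constrained_coefficient(fares, ages):
    """Coefficient x such that sum(x * age) equals sum(fare)."""
    return sum(fares) / sum(ages)


# Hypothetical values, used only to illustrate the constraint.
fares = [10.0, 20.0, 30.0]
ages = [20.0, 30.0, 50.0]
coef = constrained_coefficient(fares, ages)
new_fares = [coef * age for age in ages]
print(coef, sum(new_fares))
```

The solver becomes necessary once you add more coefficients or constraints, where a closed-form answer is no longer available.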

Custom transformations with scikit-learn

scikit-learn is a Python module for machine learning built on top of SciPy. It’s an open-source ML library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes. One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, preprocessing with a discretizer can introduce nonlinearity to linear models.

In the following code, we use KBinsDiscretizer to discretize the age column into 10 bins:

# Table is available as variable `df`
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
# discretization transform the raw data
df = df.dropna()
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
ages = np.array(df["age"]).reshape(-1, 1)
df["age"] = kbins.fit_transform(ages)
print(kbins.bin_edges_)

You can see the bin edges printed in the following screenshot.
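
To see what the uniform strategy does, the following standard-library sketch builds equal-width bin edges and assigns each value an ordinal bin index. It is an approximation of the idea rather than the scikit-learn implementation, which may treat edge cases differently; the function names and ages are hypothetical.

```python
from bisect import bisect_right


def uniform_bin_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges spanning min(values) to max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins)] + [high]


def assign_bins(values, edges):
    """Map each value to an ordinal bin index; the maximum falls in the last bin."""
    inner_edges = edges[1:-1]
    return [bisect_right(inner_edges, v) for v in values]


# Hypothetical ages, used only to illustrate the binning.
ages = [0.0, 4.0, 22.0, 38.0, 61.0, 80.0]
edges = uniform_bin_edges(ages, n_bins=10)
print(edges)
print(assign_bins(ages, edges))
```

Each bin is 8 years wide in this example, so ages 0 and 4 share bin 0 while age 80 lands in the last bin, 9.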

One-hot encoding

Values in the embarked column are categorical. Therefore, we have to represent these strings as numerical values in order to perform classification with our model. The following code does this with a label encoder, which assigns an integer to each category; a one-hot encoding transform is an alternative.

There are three values for embarked: S, C, and Q. We represent these with numbers. See the following code:

# Table is available as variable `df`
from sklearn.preprocessing import LabelEncoder

le_embarked = LabelEncoder()
le_embarked.fit(df["embarked"])

encoded_embarked_training = le_embarked.transform(df["embarked"])
df["embarked"] = encoded_embarked_training
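
The difference between label encoding and one-hot encoding can be illustrated without any libraries. The following sketch assumes alphabetical ordering of categories, which is how LabelEncoder orders its classes; the function names and sample values are hypothetical.

```python
def label_encode(values):
    """Map each category to an integer, ordering the categories alphabetically."""
    classes = sorted(set(values))
    mapping = {category: index for index, category in enumerate(classes)}
    return [mapping[v] for v in values], classes


def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector with one position per category."""
    codes, classes = label_encode(values)
    return [[1 if code == i else 0 for i in range(len(classes))] for code in codes]


# Hypothetical ports of embarkation, used only to illustrate the two encodings.
embarked = ["S", "C", "Q", "S"]
codes, classes = label_encode(embarked)
print(classes, codes)
print(one_hot_encode(embarked))
```

Label encoding produces a single integer column, which implies an ordering among the ports. One-hot encoding avoids that implied ordering at the cost of one column per category.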

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees.

Data Wrangler automatically saves your data flow every 60 seconds. To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we demonstrated how you can use custom transformations in Data Wrangler. We used the libraries and framework within the Data Wrangler container to extend the built-in data transformation capabilities. The examples in this post represent a subset of the frameworks used. The transformations in the Data Wrangler flow can now be scaled into a pipeline for DataOps.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler. To learn more about Autopilot and AutoML on SageMaker, visit Automate model development with Amazon SageMaker Autopilot.


About the authors

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience in end-to-end designs and solutions for machine learning; business analytics within financial, operational, and marketing analytics; healthcare; supply chain; and IoT. Outside work, Sovik enjoys traveling and watching movies.

Abigail is a Software Development Engineer at Amazon SageMaker. She is passionate about helping customers prepare their data in Data Wrangler and building distributed machine learning systems. In her free time, Abigail enjoys traveling, hiking, skiing, and baking.


Overcome the machine learning cold start challenge in fraud detection using Amazon Fraud Detector


As more businesses increase their online presence to serve their customers better, new fraud patterns are constantly emerging. In today’s ever-evolving digital landscape, where fraudsters are becoming more sophisticated in their tactics, detecting and preventing such fraudulent activities has become paramount for companies and financial institutions.

Traditional rule-based fraud detection systems are capped in their ability to quickly iterate as they rely on predefined rules and thresholds to flag potentially fraudulent activity. These systems can generate a large number of false positives, significantly increasing the volume of manual investigations performed by the fraud team. Furthermore, humans are also error-prone and have limited capacity to process large amounts of data, making manual efforts to detect fraud time-consuming, which can result in missed fraudulent transactions, increased losses, and reputational damage.

Machine learning (ML) plays a crucial role in detecting fraud because it can quickly and accurately analyze large volumes of data to identify anomalous patterns and possible fraud trends. ML fraud model performance relies heavily on the quality of data it is trained on, and, specifically for the supervised models, accurate labeled data is crucial. In ML, a lack of significant historical data to train a model is called the cold start problem.

In the world of fraud detection, the following are some traditional cold start scenarios:

  • Building an accurate fraud model while lacking a history of transactions or fraud cases
  • Being able to accurately distinguish legitimate activity from fraud for new customers and accounts
  • Risk-decisioning payments to an address or beneficiary never seen before by the fraud system

There are multiple ways to solve for these scenarios. For example, you can use generic models, known as one-size-fits-all models, which are typically trained on top of fraud data sharing platforms like fraud consortiums. The challenge with this approach is that no two businesses are alike, and fraud attack vectors change constantly.

Another option is to use an unsupervised anomaly detection model to monitor and surface unusual behavior among customer events. The challenge with this approach is that not all fraud events are anomalies, and not all anomalies are indeed fraud. Therefore, you can expect higher false positive rates.

In this post, we show how you can quickly bootstrap a real-time fraud prevention ML model with as few as 100 events using the new Amazon Fraud Detector feature, Cold Start, thereby dramatically lowering the barrier to entry to custom ML models for many organizations that simply don’t have the time or ability to collect and accurately label large datasets. Moreover, we discuss how by using Amazon Fraud Detector stored events, you can review results and correctly label the events to retrain your models, thereby improving the effectiveness of fraud prevention measures over time.

Solution overview

Amazon Fraud Detector is a fully managed fraud detection service that automates detecting potentially fraudulent activities online. You can use Amazon Fraud Detector to build customized fraud detection models using your own historical dataset, add decision logic using the built-in rules engine, and orchestrate risk decision workflows with a click of a button.

Previously, you had to provide over 10,000 labeled events with at least 400 examples of fraud to train a model. With the release of the Cold Start feature, you can quickly train a model with a minimum of 100 events and at least 50 classified as fraud. Compared with initial data requirements, this is a reduction of 99% in historical data and an 87% reduction in label requirements.
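
The reduction figures follow directly from the old and new minimums. The following short sketch works through the arithmetic; the function name is illustrative.

```python
def percent_reduction(old, new):
    """Percentage by which `new` is smaller than `old`."""
    return (old - new) / old * 100


event_reduction = percent_reduction(10_000, 100)
fraud_label_reduction = percent_reduction(400, 50)
print(event_reduction, fraud_label_reduction)
```

Going from 10,000 events to 100 is a 99% reduction, and going from 400 fraud examples to 50 is an 87.5% reduction, which the text above states as 87%.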

The new Cold Start feature provides intelligent methods for enriching, extending, and risk modeling small sets of data. Moreover, Amazon Fraud Detector performs label assignments and sampling for unlabeled events.

Experiments performed with public datasets show that, by lowering the limits to 50 fraud and only 100 events, you can build fraud ML models that consistently outperform unsupervised and semi-supervised models.

Cold Start model performance

The ability of an ML model to generalize and make accurate predictions on unseen data is impacted by the quality and diversity of the training dataset. For Cold Start models, this is no different. You should have processes in place as more data is collected to correctly label these events and retrain the models, ultimately leading to an optimal model performance.

With a lower data requirement, the instability of reported performance increases due to the increased variance of the model and the limited test data size. To help you build the right expectation of model performance, besides model AUC, Amazon Fraud Detector also reports uncertainty range metrics. The following table defines these metrics.

AUC uncertainty interval | AUC < 0.6 | AUC 0.6 – 0.8 | AUC >= 0.8
> 0.3 | The model performance is very low and might vary greatly. Expect low fraud detection performance. | The model performance is low and might vary greatly. Expect limited fraud detection performance. | The model performance might vary greatly.
0.1 – 0.3 | The model performance is very low and might vary significantly. Expect low fraud detection performance. | The model performance is low and might vary significantly. Expect limited fraud detection performance. | The model performance might vary significantly.
< 0.1 | The model performance is very low. Expect low fraud detection performance. | The model performance is low. Expect limited fraud detection performance. | No Warning
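
The table can be read as a lookup from the two metrics to a warning message. The following sketch encodes that lookup; the function name is hypothetical, and the handling of values that fall exactly on a band boundary is an assumption, because the table does not specify it.

```python
def cold_start_warning(auc, uncertainty_interval):
    """Return the warning text for a model, following the table's AUC and interval bands."""
    # Interval band (assumption: 0.1 and 0.3 themselves belong to the middle band).
    if uncertainty_interval > 0.3:
        variation = "might vary greatly"
    elif uncertainty_interval >= 0.1:
        variation = "might vary significantly"
    else:
        variation = None

    # High-AUC column: only the variation, if any, is reported.
    if auc >= 0.8:
        if variation is None:
            return "No Warning"
        return f"The model performance {variation}."

    level, expectation = ("very low", "low") if auc < 0.6 else ("low", "limited")
    quality = f"The model performance is {level}"
    if variation is not None:
        quality += f" and {variation}"
    return f"{quality}. Expect {expectation} fraud detection performance."


print(cold_start_warning(0.55, 0.35))
print(cold_start_warning(0.85, 0.05))
```

For example, a model with an AUC of 0.85 and an uncertainty interval of 0.05 produces no warning, whereas the same AUC with an interval of 0.2 is flagged as one whose performance might vary significantly.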

Train a Cold Start model

Training a Cold Start fraud model is identical to training any other Amazon Fraud Detector model; what differs is the dataset size. You can find sample datasets for Cold Start training in our GitHub repo. To train an Amazon Fraud Detector custom model, you can follow our hands-on tutorial. You can either use the Amazon Fraud Detector console tutorial or the SDK tutorial to build, train, and deploy a fraud detection model.

After your model is trained, you can review performance metrics and then deploy it by changing its status to Active. To learn more about model scores and performance metrics, see Model scores and Model performance metrics. At this point, you can now add your model to your detector, add business rules to interpret the risk scores that the model outputs, and make real-time predictions using the GetEventPrediction API.

Fraud ML model continuous improvement and feedback loop

With the Amazon Fraud Detector Cold Start feature, you can quickly bootstrap a fraud detector endpoint and start protecting your businesses immediately. However, new fraud patterns are constantly emerging, so it’s critical to retrain Cold Start models with newer data to improve the accuracy and effectiveness of the predictions over time.

To help you iterate on your models, Amazon Fraud Detector automatically stores all events sent to the service for inference. You can change the event ingestion flag, or validate that it is on, at the event type level, as shown in the following screenshot.

With the stored events feature, you can use the Amazon Fraud Detector SDK to programmatically access an event, review the event metadata and the prediction explanation, and make an informed risk decision. Moreover, you can label the event for future model retraining and continuous model improvement. The following diagram shows an example of this workflow.

In the following code snippets, we demonstrate the process to label a stored event:

  • To do a real-time fraud prediction on an event, call the GetEventPrediction API:
import boto3

def get_event_prediction():
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.get_event_prediction(
        detectorId='your_detector_name',
        detectorVersionId='1',
        eventId='my-event-id-1234',
        eventTypeName='your_event_type',
        entities=[
            {
                'entityType': 'user',
                'entityId': 'A12345'
            },
        ],
        eventTimestamp= '2023-03-23T21:42:03.658Z',
        eventVariables={
            'email': 'test@anymockcompany.com',
            'ip': '123.123.123.123',
            'card_bin': '400022',
            'billing_zip': '50401'
        }
    )
    return(prediction)

API Response:
{
  "modelScores": [
    {
      "modelVersion": {
        "modelId": "your_model_name",
        "modelType": "TRANSACTION_FRAUD_INSIGHTS",
        "modelVersionNumber": "1.0"
      },
      "scores": {
        "your_model_insightscore": 932
      }
    }
  ],
  "ruleResults": [
    {
      "ruleId": "high_risk_score",
      "outcomes": [
        "high_risk_send_for_manual_review"
      ]
    }
  ]
}

As seen in the response, based on the decision engine rule matched, the event should be sent for manual review by the fraud team. By gathering the prediction explanation metadata, you can gain insights into how each event variable impacted the model’s fraud prediction score.

  • To collect these insights, we use the get_event_prediction_metadata API:
import boto3

def get_event_prediction_metadata(event, context):
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.get_event_prediction_metadata(
        eventId = 'my-event-id-1234',
        eventTypeName = 'your_event_type',
        predictionTimestamp = '2023-03-23T21:44:39.318Z',
        detectorId = 'your_detector_name',
        detectorVersionId = '1'
    )
    return(prediction)

API Response:

{
  "eventId": "my-event-id-1234",
  …
  <REDACTED>
  …
  "eventVariables": [
    {
      "name": "ip",
      "value": "123.123.123.123"
    },
    {
      "name": "billing_zip",
      "value": "50401"
    },
    {
      "name": "email",
      "value": "test@anymockcompany.com"
    },
    {
      "name": "card_bin",
      "value": "400022"
    }
  ],
…
 <REDACTED>
…
   "evaluations": [
        {
          "evaluationScore": "932.0",
          "predictionExplanations": {
            "variableImpactExplanations": [
              {
                "eventVariableName": "billing_zip",
                "relativeImpact": "1",
                "logOddsImpact": 1.018196990713477135
              },
              {
                "eventVariableName": "ip",
                "relativeImpact": "0",
                "logOddsImpact": -0.23122438788414001
              },
              {
                "eventVariableName": "email",
                "relativeImpact": "0",
                "logOddsImpact": 0.004304269328713417
              },
              {
                "eventVariableName": "card_bin",
                "relativeImpact": "0",
                "logOddsImpact": -0.011150157079100609
              } 
           ],
}

With these insights, the fraud analyst can make an informed risk decision about the event in question and update the event label.
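
As a rough illustration of how an analyst might read the explanation data, the following sketch ranks the variables by the magnitude of their logOddsImpact. It assumes that a positive value means the variable pushed the risk score up and a negative value pushed it down. The function name rank_variable_impacts is hypothetical, and the values are copied from the sample response above.

```python
def rank_variable_impacts(explanations):
    """Sort variable impact explanations by the magnitude of logOddsImpact."""
    return sorted(
        explanations,
        key=lambda item: abs(float(item['logOddsImpact'])),
        reverse=True,
    )


# Values copied from the sample response in this post.
sample_explanations = [
    {'eventVariableName': 'billing_zip', 'relativeImpact': '1',
     'logOddsImpact': 1.018196990713477135},
    {'eventVariableName': 'ip', 'relativeImpact': '0',
     'logOddsImpact': -0.23122438788414001},
    {'eventVariableName': 'email', 'relativeImpact': '0',
     'logOddsImpact': 0.004304269328713417},
    {'eventVariableName': 'card_bin', 'relativeImpact': '0',
     'logOddsImpact': -0.011150157079100609},
]

for item in rank_variable_impacts(sample_explanations):
    # Positive log-odds impact is read here as raising the risk score.
    direction = 'raises' if item['logOddsImpact'] > 0 else 'lowers'
    print(f"{item['eventVariableName']}: {direction} the risk score "
          f"(log-odds impact {item['logOddsImpact']:+.4f})")
```

In this sample, billing_zip has by far the largest impact, which matches its relativeImpact of 1.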

  • To update the event label, call the update_event_label API:
import boto3

def update_event_label(event, context):
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.update_event_label(
        eventId = "my-event-id-1234",
        eventTypeName = "your_event_type",
        assignedLabel='1', # Fraud
        labelTimestamp='2023-03-25T11:20:03.658Z'
    )
    
    return(prediction)

API Response

{
  "ResponseMetadata": {
    "RequestId": "3e28caa0-2a06-4b8d-9a10-9081811bf22d",
    "HTTPStatusCode": 200,
    …
     <REDACTED>
    …

    "RetryAttempts": 0
  }
}

As a final step, you can verify if the event label was correctly updated.

  • To verify the event label, call the get_event API:
import boto3

def get_event():
    fraudDetector = boto3.client('frauddetector')
    
    event = fraudDetector.get_event(
        eventId='my-event-id-1234',
        eventTypeName='your_event_type'
    )
    
    return(event)

API Response

{
  "event": {
    "eventId": "my-event-id-1234",
    "eventTimestamp": "2023-03-23T21:42:03.658Z",
    "eventVariables": {
      "billing_zip": "50401",
      "card_bin": "400022",
      "email": "test@anymockcompany.com",
      "ip": "123.123.123.123"
    },
    "currentLabel": "1",
    "labelTimestamp": "2023-03-25T11:20:03.658Z",
    "entities": [
      {
        "entityType": "user",
        "entityId": "A12345"
      }
    ]
  }
}
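
The verification step can also be automated. The following is a minimal sketch, assuming the get_event response has the shape shown above. The helper name label_matches is hypothetical, and the sample dictionary is a trimmed copy of that response.

```python
def label_matches(get_event_response, expected_label):
    """Return True if the stored event carries the expected label."""
    event = get_event_response.get('event', {})
    return event.get('currentLabel') == expected_label


# Trimmed copy of the sample get_event response in this post.
sample_response = {
    'event': {
        'eventId': 'my-event-id-1234',
        'currentLabel': '1',
        'labelTimestamp': '2023-03-25T11:20:03.658Z',
    }
}

print(label_matches(sample_response, '1'))
```

A check like this could run right after update_event_label to confirm the label was stored before the event is used for retraining.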

Clean up

To avoid incurring future charges, delete the resources created for the solution.

Conclusion

This post demonstrated how you can quickly bootstrap a real-time fraud prevention system with as few as 100 events using the new Cold Start feature in Amazon Fraud Detector. We discussed how you can use stored events to review results, correctly label the events, and retrain your models, improving the effectiveness of fraud prevention measures over time.

Fully managed AWS services such as Amazon Fraud Detector help reduce the time businesses spend analyzing user behavior to identify fraud in their platforms and focus more on driving business value. To learn more about how Amazon Fraud Detector can help your business, visit Amazon Fraud Detector.


About the Authors

Marcel Pividal is a Global Sr. AI Services Solutions Architect in the World-Wide Specialist Organization. Marcel has more than 20 years of experience solving business problems through technology for FinTechs, payment providers, pharma, and government agencies. His current areas of focus are risk management, fraud prevention, and identity verification.

Julia Xu is a Research Scientist with Amazon Fraud Detector. She is passionate about solving customer challenges using machine learning techniques. In her free time, she enjoys hiking, painting, and exploring new coffee shops.

Guilherme Ricci is a Senior Solution Architect at AWS, helping startups modernize and optimize the costs of their applications. With over 10 years of experience with companies in the financial sector, he is currently working together with the team of AI/ML specialists.

Connect Amazon EMR and RStudio on Amazon SageMaker

RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.

In conjunction with tools like RStudio on SageMaker, users are analyzing, transforming, and preparing large amounts of data as part of the data science and ML workflow. Data scientists and data engineers use Apache Spark, Hive, and Presto running on Amazon EMR for large-scale data processing. Using RStudio on SageMaker and Amazon EMR together, you can continue to use the RStudio IDE for analysis and development, while using Amazon EMR managed clusters for larger data processing.

In this post, we demonstrate how you can connect your RStudio on SageMaker domain with an EMR cluster.

Solution overview

We use an Apache Livy connection to submit a sparklyr job from RStudio on SageMaker to an EMR cluster. This is demonstrated in the following diagram.

Scope of Solution
All code demonstrated in the post is available in our GitHub repository. We implement the following solution architecture.

Prerequisites

Prior to deploying any resources, make sure you have all the requirements for setting up and using RStudio on SageMaker and Amazon EMR:

We’ll also build a custom RStudio on SageMaker image, so ensure you have Docker running and all required permissions. For more information, refer to Use a custom image to bring your own development environment to RStudio on Amazon SageMaker.

Create resources with AWS CloudFormation

We use an AWS CloudFormation stack to generate the required infrastructure.

If you already have an RStudio domain and an existing EMR cluster, you can skip this step and start building your custom RStudio on SageMaker image. Substitute the information of your EMR cluster and RStudio domain in place of the EMR cluster and RStudio domain created in this section.

Launching this stack creates the following resources:

  • Two private subnets
  • EMR Spark cluster
  • AWS Glue database and tables
  • SageMaker domain with RStudio
  • SageMaker RStudio user profile
  • IAM service role for the SageMaker RStudio domain
  • IAM service role for the SageMaker RStudio user profile

Complete the following steps to create your resources:

Choose Launch Stack to create the stack.

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, provide a name for your stack and leave the remaining options as default, then choose Next.
  3. On the Configure stack options page, leave the options as default and choose Next.
  4. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  5. Choose Create stack.
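
If you prefer to launch the stack from code, the two acknowledgements on the Review page correspond to the CloudFormation capabilities CAPABILITY_NAMED_IAM and CAPABILITY_AUTO_EXPAND. The following sketch only builds the arguments for the boto3 create_stack call. The stack name and template URL are placeholders, not values from this post; substitute the template location from the GitHub repository.

```python
def build_create_stack_args(stack_name, template_url):
    """Build keyword arguments for the CloudFormation create_stack call."""
    return {
        'StackName': stack_name,
        'TemplateURL': template_url,
        # Programmatic equivalents of the two acknowledgements on the Review page.
        'Capabilities': ['CAPABILITY_NAMED_IAM', 'CAPABILITY_AUTO_EXPAND'],
    }


# Hypothetical values, used here for illustration only.
args = build_create_stack_args(
    'rstudio-emr-demo',
    'https://example.com/path/to/root-template.yaml',
)

# With boto3 installed and credentials configured, the stack could be created with:
#   import boto3
#   boto3.client('cloudformation').create_stack(**args)
```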

The template generates five stacks.

To see the EMR Spark cluster that was created, navigate to the Amazon EMR console. You will see a cluster created for you called sagemaker. This is the cluster we connect to through RStudio on SageMaker.

Build the custom RStudio on SageMaker image

We have created a custom image that will install all the dependencies of sparklyr, and will establish a connection to the EMR cluster we created.

If you’re using your own EMR cluster and RStudio domain, modify the scripts accordingly.

Make sure Docker is running. Start by getting into our project repository:

cd sagemaker-rstudio-emr/sparklyr-image
./build-r-image.sh

We will now build the Docker image and register it to our RStudio on SageMaker domain.

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain rstudio-domain.
  3. On the Environment tab, choose Attach image.

    Now we attach the sparklyr image that we created earlier to the domain.
  4. For Choose image source, select Existing image.
  5. Select the sparklyr image we built.
  6. For Image properties, leave the options as default.
  7. For Image type, select RStudio image.
  8. Choose Submit.

    Validate the image has been added to the domain. It may take a few minutes for the image to attach fully.
  9. When it’s available, log in to the RStudio on SageMaker console using the rstudio-user profile that was created.
  10. From here, create a session with the sparklyr image that we created earlier.

    First, we have to connect to our EMR cluster.
  11. In the connections pane, choose New Connection.
  12. Select the EMR cluster connect code snippet and choose Connect to Amazon EMR Cluster.

    After the connect code has run, you will see a Spark connection through Livy, but no tables.
  13. Change the database to credit_card:
    tbl_change_db(sc, "credit_card")
  14. Choose Refresh Connection Data.
    You can now see the tables.
  15. Now navigate to the rstudio-sparklyr-code-walkthrough.md file.

This has a set of Spark transformations we can use on our credit card dataset to prepare it for modeling. The following code is an excerpt:

Let’s count how many transactions are in the transactions table. First, we create a reference to each table with the tbl() function.

users_tbl <- tbl(sc, "users")
cards_tbl <- tbl(sc, "cards")
transactions_tbl <- tbl(sc, "transactions")

Let’s run a count of the number of rows for each table.

count(users_tbl)
count(cards_tbl)
count(transactions_tbl)

Now let’s register our tables as Spark DataFrames and pull them into the cluster-wide in-memory cache for better performance. We also filter out the header that gets placed in the first row of each table.

users_tbl <- tbl(sc, 'users') %>%
  filter(gender != 'Gender')
sdf_register(users_tbl, "users_spark")
tbl_cache(sc, 'users_spark')
users_sdf <- tbl(sc, 'users_spark')

cards_tbl <- tbl(sc, 'cards') %>%
  filter(expire_date != 'Expires')
sdf_register(cards_tbl, "cards_spark")
tbl_cache(sc, 'cards_spark')
cards_sdf <- tbl(sc, 'cards_spark')

transactions_tbl <- tbl(sc, 'transactions') %>%
  filter(amount != 'Amount')
sdf_register(transactions_tbl, "transactions_spark")
tbl_cache(sc, 'transactions_spark')
transactions_sdf <- tbl(sc, 'transactions_spark')

To see the full list of commands, refer to the rstudio-sparklyr-code-walkthrough.md file.

Clean up

To clean up resources and avoid incurring recurring costs, delete the root CloudFormation stack. Also delete all Amazon Elastic File System (Amazon EFS) mounts created and any Amazon Simple Storage Service (Amazon S3) buckets and objects created.

Conclusion

The integration of RStudio on SageMaker with Amazon EMR provides a powerful solution for data analysis and modeling tasks in the cloud. By connecting RStudio on SageMaker and establishing a Livy connection to Spark on EMR, you can take advantage of the computing resources of both platforms for efficient processing of large datasets. RStudio, one of the most widely used IDEs for data analysis, allows you to take advantage of the fully managed infrastructure, access control, networking, and security capabilities of SageMaker. Meanwhile, the Livy connection to Spark on Amazon EMR provides a way to perform distributed processing and scaling of data processing tasks.

If you’re interested in learning more about using these tools together, this post serves as a starting point. For more information, refer to RStudio on Amazon SageMaker. If you have any suggestions or feature improvements, please create a pull request on our GitHub repo or leave a comment on this post!


About the Authors

Ryan Garner is a Data Scientist with AWS Professional Services. He is passionate about helping AWS customers use R to solve their Data Science and Machine Learning problems.


Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM) and Machine Learning infrastructure and operations projects (MLOps).


Saiteja Pudi is a Solutions Architect at AWS, based in Dallas, TX. He has been with AWS for more than 3 years, helping customers derive the true potential of AWS by being their trusted advisor. He comes from an application development background and is interested in data science and machine learning.

Powering the Future: Next Step in Siemens, NVIDIA Collaboration Showcased With FREYR Virtual Factory Demos

At the Hannover Messe trade show this week, Siemens unveiled a digital model of next-generation FREYR Battery factories that was developed using NVIDIA technology.

The model was created in part to highlight a strategic partnership announced Monday by Siemens and FREYR, with Siemens becoming FREYR’s preferred supplier in automation technology, enabling the Norway-based group to scale up production and maximize plant efficiency.

Built by Siemens, the demo uses the NVIDIA Omniverse development platform to provide an immersive experience of the FREYR factories and follows the joint vision for an industrial metaverse unveiled last year by Siemens and NVIDIA.

Displayed as part of an industrial metaverse experience in the Siemens booth during Hannover Messe 2023, the world’s largest industrial technology trade show, the demos incorporate operational data from the FREYR factory in Norway.

Highlighting the integration between Siemens Xcelerator and NVIDIA Omniverse, the demo features 3D representations of the infrastructure, plant, machinery, equipment, human ergonomics, safety information, robots, automated guided vehicles, and detailed product and production simulations.

These technologies will help FREYR to meet surging demand for high-density, cost-effective battery cells for stationary energy storage, electric mobility and marine applications.

Amid growing worldwide sustainability initiatives and the rapid electrification of transportation, the battery industry is projected to grow to $400 billion by 2030. Battery cell manufacturing is a critical step in the battery value chain, with manufacturers investing billions of dollars in new battery-cell plants to meet this new demand.

In the demo, Siemens shows a vision for how teams can harness comprehensive digital twins in the industrial metaverse using models of existing and future plants.

Within moments, FREYR can set up a meeting with potential investors or customers to take place within the digital FREYR plant in Norway and explore the facility’s exterior before entering to view current production processes at work.

The striking interior flythrough instantly conveys the facility’s size and scale. The real-time, physically accurate simulation shows how machines and robots inside the factory move, and can even simulate complex processes. Sensors capturing machine information allow real-time performance visualization and ergonomic assessments.

The demo also demonstrates how the model can be used for production planning, highlighting how a plant manager can rapidly evaluate plant performance using a custom Siemens application, which provides at a glance an overview of the facility’s operation.

From there, the manager initiates a Microsoft Teams meeting with colleagues at a manufacturing “cell” — which places key people, machines and supplies in one strategic location — inside the virtual factory.

The team can then examine a robotic arm experiencing low-cycle-time issues, access machine performance data, identify specific cycle-time problems and view a live video stream with accompanying sensor data on machine performance.

This showcase at Hannover Messe is only the beginning, as more industries embrace and implement the industrial metaverse.

Learn more about NVIDIA Omniverse and our partnership with Siemens.

Microsoft at NSDI 2023: A commitment to advancing networking and distributed systems

Microsoft has made significant contributions to the prestigious USENIX NSDI’23 conference, which brings together experts in computer networks and distributed systems. A silver sponsor for the conference, Microsoft is a leader in developing innovative technologies for networking, and we are proud to have contributed to 30 papers accepted this year. Our team members also served on the program committee, highlighting our commitment to advancing the field.

The accepted research papers span a wide range of topics, including networking for AI workloads, cloud networking, WAN, and wireless networks. These papers showcase some of the latest advancements in networking research.

The paper, “DOTE: Rethinking (Predictive) WAN Traffic Engineering”, which revisits traffic engineering in the Wide Area Network (WAN), was selected for one of the Best Paper Awards at the conference. This work was done jointly by researchers at Microsoft, along with academics at Hebrew University of Jerusalem and Technion.

Some other innovations on cloud networking infrastructure include:

Empowering Azure Storage with RDMA, which presents the findings from deploying intra-region Remote Direct Memory Access (RDMA) to support storage workloads in Azure. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings. This research is a testament to Microsoft’s ongoing commitment to providing customers with the best possible user experience.

Disaggregating Stateful Network Functions, which introduces a new approach for better reliability and performance at a lower per-server cost for cloud users. The core idea is to move the network function processing off individual servers and into shared resource pools. This technology is now shipping as part of Microsoft Azure Accelerated Connections.

Our colleagues from Microsoft Research Asia will present ARK: GPU-driven Code Execution for Distributed Deep Learning, which overcomes the overhead of GPU communication for large deep learning workloads by having GPUs run their own code and handle communication events autonomously, without CPU intervention.

Microsoft’s collective contributions to the USENIX NSDI’23 conference highlight our commitment to advancing the field of networking research and developing innovative solutions to real-world networking problems, leveraging strong academic collaborations. We look forward to continuing to push the boundaries of what is possible in networking research and delivering cutting-edge solutions to our customers.

A complete list of Microsoft papers accepted at USENIX NSDI’23:

  1. Understanding RDMA Microarchitecture Resources for Performance Isolation, Xinhao Kong and Jingrong Chen, Duke University; Wei Bai, Microsoft; Yechen Xu, Shanghai Jiao Tong University; Mahmoud Elhaddad, Shachar Raindel, and Jitendra Padhye, Microsoft; Alvin R. Lebeck and Danyang Zhuo, Duke University
  2. Empowering Azure Storage with RDMA, Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel, Ridwan Howlader, Neetha John, Julia Johnstone, Tom Jolly, Greg Kramer, David Kruse, Ankit Kumar, Erica Lan, Ivan Lee, Avi Levy, Marina Lipshteyn, Xin Liu, Chen Liu, Guohan Lu, Yuemin Lu, Xiakun Lu, Vadim Makhervaks, Ulad Malashanka, David A. Maltz, Ilias Marinos, Rohan Mehta, Sharda Murthi, Anup Namdhari, Aaron Ogus, Jitendra Padhye, Madhav Pandya, Douglas Phillips, Adrian Power, Suraj Puri, Shachar Raindel, Jordan Rhee, Anthony Russo, Maneesh Sah, Ali Sheriff, Chris Sparacino, Ashutosh Srivastava, Weixiang Sun, Nick Swanson, Fuhou Tian, Lukasz Tomczyk, Vamsi Vadlamuri, Alec Wolman, Ying Xie, Joyce Yom, Lihua Yuan, Yanzhao Zhang, and Brian Zill, Microsoft
  3. ARK: GPU-driven Code Execution for Distributed Deep Learning, Changho Hwang, KAIST, Microsoft Research; KyoungSoo Park, KAIST; Ran Shu, Xinyuan Qu, Peng Cheng, and Yongqiang Xiong, Microsoft Research
  4. Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications, Inho Choi, National University of Singapore; Ellis Michael, University of Washington; Yunfan Li, National University of Singapore; Dan R. K. Ports, Microsoft Research; Jialin Li, National University of Singapore
  5. Waverunner: An Elegant Approach to Hardware Acceleration of State Machine Replication, Mohammadreza Alimadadi and Hieu Mai, Stony Brook University; Shenghsun Cho, Microsoft; Michael Ferdman, Peter Milder, and Shuai Mu, Stony Brook University
  6. Scalable Distributed Massive MIMO Baseband Processing, Junzhi Gong, Harvard University; Anuj Kalia, Microsoft; Minlan Yu, Harvard University
  7. Unlocking unallocated cloud capacity for long, uninterruptible workloads, Anup Agarwal, Carnegie Mellon University; Shadi Noghabi, Microsoft Research; Íñigo Goiri, Azure Systems Research; Srinivasan Seshan, Carnegie Mellon University; Anirudh Badam, Microsoft Research
  8. Invisinets: Removing Networking from Cloud Networks, Sarah McClure and Zeke Medley, UC Berkeley; Deepak Bansal and Karthick Jayaraman, Microsoft; Ashok Narayanan, Google; Jitendra Padhye, Microsoft; Sylvia Ratnasamy, UC Berkeley and Google; Anees Shaikh, Google; Rishabh Tewari, Microsoft
  9. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs, John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, and Yifan Qiao, UCLA; Zhihao Jia, CMU; Minjia Zhang, Microsoft Research; Ravi Netravali, Princeton University; Guoqing Harry Xu, UCLA
  10. OneWAN is better than two: Unifying a split WAN architecture, Umesh Krishnaswamy, Microsoft; Rachee Singh, Microsoft and Cornell University; Paul Mattes, Paul-Andre C Bissonnette, Nikolaj Bjørner, Zahira Nasrin, Sonal Kothari, Prabhakar Reddy, John Abeln, Srikanth Kandula, Himanshu Raj, Luis Irun-Briz, Jamie Gaudette, and Erica Lan, Microsoft
  11. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches, Aashaka Shah, University of Texas at Austin; Vijay Chidambaram, University of Texas at Austin and VMware Research; Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi, Microsoft Research; Rachee Singh, Microsoft and Cornell University
  12. Synthesizing Runtime Programmable Switch Updates, Yiming Qiu, Rice University; Ryan Beckett, Microsoft; Ang Chen, Rice University
  13. Formal Methods for Network Performance Analysis, Mina Tahmasbi Arashloo, University of Waterloo; Ryan Beckett, Microsoft Research; Rachit Agarwal, Cornell University
  14. Scalable Tail Latency Estimation for Data Center Networks, Kevin Zhao, University of Washington; Prateesh Goyal, Microsoft Research; Mohammad Alizadeh, MIT CSAIL; Thomas E. Anderson, University of Washington
  15. Addax: A fast, private, and accountable ad exchange infrastructure, Ke Zhong, Yiping Ma, and Yifeng Mao, University of Pennsylvania; Sebastian Angel, University of Pennsylvania & Microsoft Research
  16. RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics, Mehrdad Khani, MIT CSAIL and Microsoft; Ganesh Ananthanarayanan and Kevin Hsieh, Microsoft; Junchen Jiang, University of Chicago; Ravi Netravali, Princeton University; Yuanchao Shu, Zhejiang University; Mohammad Alizadeh, MIT CSAIL; Victor Bahl, Microsoft
  17. Tambur: Efficient loss recovery for videoconferencing via streaming codes, Michael Rudow, Carnegie Mellon University; Francis Y. Yan, Microsoft Research; Abhishek Kumar, Carnegie Mellon University; Ganesh Ananthanarayanan and Martin Ellis, Microsoft; K.V. Rashmi, Carnegie Mellon University
  18. Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge, Arthi Padmanabhan, UCLA; Neil Agarwal, Princeton University; Anand Iyer and Ganesh Ananthanarayanan, Microsoft Research; Yuanchao Shu, Zhejiang University; Nikolaos Karianakis, Microsoft Research; Guoqing Harry Xu, UCLA; Ravi Netravali, Princeton University
  19. On Modular Learning of Distributed Systems for Predicting End-to-End Latency, Chieh-Jan Mike Liang, Microsoft Research; Zilin Fang, Carnegie Mellon University; Yuqing Xie, Tsinghua University; Fan Yang, Microsoft Research; Zhao Lucis Li, University of Science and Technology of China; Li Lyna Zhang, Mao Yang, and Lidong Zhou, Microsoft Research
  20. SelfTune: Tuning Cluster Managers, Ajaykrishna Karthikeyan and Nagarajan Natarajan, Microsoft Research; Gagan Somashekar, Stony Brook University; Lei Zhao, Microsoft; Ranjita Bhagwan, Microsoft Research; Rodrigo Fonseca, Tatiana Racheva, and Yogesh Bansal, Microsoft
  21. OpenLoRa: Validating LoRa Implementations through an Extensible and Open-sourced Framework, Manan Mishra, Daniel Koch, Muhammad Osama Shahid, and Bhuvana Krishnaswamy, University of Wisconsin-Madison; Krishna Chintalapudi, Microsoft Research; Suman Banerjee, University of Wisconsin-Madison
  22. ExoPlane: An Operating System for On-Rack Switch Resource Augmentation, Daehyeok Kim, Microsoft and University of Texas at Austin; Vyas Sekar and Srinivasan Seshan, Carnegie Mellon University
  23. Sketchovsky: Enabling Ensembles of Sketches on Programmable Switches, Hun Namkung, Carnegie Mellon University; Zaoxing Liu, Boston University; Daehyeok Kim, Microsoft Research; Vyas Sekar and Peter Steenkiste, Carnegie Mellon University
  24. Acoustic Sensing and Communication Using Metasurface, Yongzhao Zhang, Yezhou Wang, and Lanqing Yang, Shanghai Jiao Tong University; Mei Wang, UT Austin; Yi-Chao Chen, Shanghai Jiao Tong University and Microsoft Research Asia; Lili Qiu, UT Austin and Microsoft Research Asia; Yihong Liu, University of Glasgow; Guangtao Xue and Jiadi Yu, Shanghai Jiao Tong University
  25. Disaggregating Stateful Network Functions, Deepak Bansal, Gerald DeGrace, Rishabh Tewari, Michal Zygmunt, and James Grantham, Microsoft; Silvano Gai, Mario Baldi, Krishna Doddapaneni, Arun Selvarajan, Arunkumar Arumugam, and Balakrishnan Raman, AMD Pensando; Avijit Gupta, Sachin Jain, Deven Jagasia, Evan Langlais, Pranjal Srivastava, Rishiraj Hazarika, Neeraj Motwani, Soumya Tiwari, Stewart Grant, Ranveer Chandra, and Srikanth Kandula, Microsoft
  26. Doing More with Less: Orchestrating Serverless Applications without an Orchestrator, David H. Liu and Amit Levy, Princeton University; Shadi Noghabi and Sebastian Burckhardt, Microsoft Research
  27. NetPanel: Traffic Measurement of Exchange Online Service, Yu Chen, Microsoft 365, China; Liqun Li and Yu Kang, Microsoft Research, China; Boyang Zheng, Yehan Wang, More Zhou, Yuchao Dai, and Zhenguo Yang, Microsoft 365, China; Brad Rutkowski and Jeff Mealiffe, Microsoft 365, USA; Qingwei Lin, Microsoft Research, China
  28. DOTE: Rethinking (Predictive) WAN Traffic Engineering, Yarin Perry, Hebrew University of Jerusalem; Felipe Vieira Frujeri, Microsoft Research; Chaim Hoch, Hebrew University of Jerusalem; Srikanth Kandula and Ishai Menache, Microsoft Research; Michael Schapira, Hebrew University of Jerusalem; Aviv Tamar, Technion
  29. Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker, Yinfang Chen and Xudong Sun, University of Illinois at Urbana-Champaign; Suman Nath, Microsoft Research; Ze Yang and Tianyin Xu, University of Illinois at Urbana-Champaign
  30. Test Coverage for Network Configurations, Xieyang Xu and Weixin Deng, University of Washington; Ryan Beckett, Microsoft; Ratul Mahajan, University of Washington; David Walker, Princeton University

NSDI 2023 Program Committee members:

Members of other committees:

The post Microsoft at NSDI 2023: A commitment to advancing networking and distributed systems appeared first on Microsoft Research.
