Relevance tuning with Amazon Kendra

Amazon Kendra is a highly accurate and easy-to-use enterprise search service powered by machine learning (ML). As your users begin to perform searches using Amazon Kendra, you can fine-tune which search results they receive. For example, you might want to prioritize results from certain data sources that are more actively curated and therefore more authoritative. Or if your users frequently search for documents like quarterly reports, you may want to display the more recent quarterly reports first.

Relevance tuning allows you to change how Amazon Kendra processes the importance of certain fields or attributes in search results. In this post, we walk through how you can manually tune your index to achieve the best results.

It’s important to understand the three main response types of Amazon Kendra: matching to FAQs, reading comprehension to extract suggested answers, and document ranking. Relevance tuning impacts document ranking. Additionally, relevance tuning is just one of many factors that impact search results for your users. You can’t change specific results, but you can influence how much weight Amazon Kendra applies to certain fields or attributes.

Faceting

Because you’re tuning based on fields, you need to have those fields faceted in your index. For example, if you want to boost the signal of the author field, you need to make the author field a searchable facet in your index. For more information about adding facetable fields to your index, see Creating custom document attributes.

Performing relevance tuning

You can perform relevance tuning in several different ways, such as on the AWS Management Console through the Amazon Kendra search console or with the Amazon Kendra API. You can also use several different types of fields when tuning:

  • Date fields – Boost more recent results
  • Number fields – Amplify content based on number fields, such as total view counts
  • String fields – Elevate results based on string fields, for example those that are tagged as coming from a more authoritative data source

Prerequisites

This post requires you to complete the following prerequisites: set up your environment, upload the example dataset, and create an index.

Setting up your environment

Ensure you have the AWS CLI installed. Open a terminal window and create a new working directory. From that directory, download the following files:

  • The sample dataset, available from: s3://aws-ml-blog/artifacts/kendra-relevance-tuning/ml-blogs.tar.gz
  • The Python script to create your index, available from: s3://aws-ml-blog/artifacts/kendra-relevance-tuning/create-index.py

The following screenshot shows how to download the dataset and the Python script.
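If you're working from the terminal, the downloads are two AWS CLI copy commands, run from your working directory:

    # Download the sample dataset and the index-creation script
    aws s3 cp s3://aws-ml-blog/artifacts/kendra-relevance-tuning/ml-blogs.tar.gz .
    aws s3 cp s3://aws-ml-blog/artifacts/kendra-relevance-tuning/create-index.py .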

Uploading the dataset

For this use case, we use a dataset that is a selection of posts from the AWS Machine Learning Blog. If you want to use your own dataset, make sure you have a variety of metadata. You should ideally have varying string fields and date fields. In the example dataset, the different fields include:

  • Author name – Author of the post
  • Content type – Blog posts and whitepapers
  • Topic and subtopic – The main topic is Machine Learning and subtopics include Computer Vision and ML at the Edge
  • Content language – English, Japanese, and French
  • Number of citations in scientific journals – These are randomly fabricated numbers for this post

To get started, create two Amazon Simple Storage Service (Amazon S3) buckets. Make sure to create them in the same Region as your index. Our index will have two data sources, one for each bucket.

Within the ml-blogs.tar.gz tarball there are two directories. Extract the tarball and sync the contents of the first directory, ‘bucket1’, to your first S3 bucket. Then sync the contents of the second directory, ‘bucket2’, to your second S3 bucket.

The following screenshot shows how to download the dataset and upload it to your S3 buckets.
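For example, if your buckets are named my-kendra-bucket-1 and my-kendra-bucket-2 (substitute your own bucket names, and adjust the local paths if the tarball extracts into a parent directory), the commands look like the following:

    # Extract the tarball, then sync each directory to its own bucket
    tar -xzf ml-blogs.tar.gz
    aws s3 sync bucket1/ s3://my-kendra-bucket-1/
    aws s3 sync bucket2/ s3://my-kendra-bucket-2/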

Creating the index

Using your preferred code editor, open the Python script ‘create-index.py’ that you downloaded previously. You will need to set your bucket name variables to the names of the Amazon S3 buckets you created earlier. Make sure you uncomment those lines.

Once this is done, run the script by typing python create-index.py. This does the following:

  • Creates an AWS Identity and Access Management (IAM) role to allow your Amazon Kendra index to read data from Amazon S3 and write logs to Amazon CloudWatch Logs
  • Creates an Amazon Kendra index
  • Adds two Amazon S3 data sources to your index
  • Adds new facets to your index, which allows you to search based on the different fields in the dataset
  • Initiates a data source sync job
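The following is a rough boto3 sketch of those steps. The names, role ARN, and field definitions are illustrative rather than the actual contents of create-index.py, and each create call is asynchronous, so in practice you wait for each resource to become ACTIVE before the next call:

    import boto3

    kendra = boto3.client('kendra')

    # Create the index, using an IAM role that can read from Amazon S3 and
    # write to Amazon CloudWatch Logs (example ARN shown).
    role_arn = 'arn:aws:iam::111122223333:role/kendra-demo-role'
    index_id = kendra.create_index(
        Name='ml-blogs-index',
        Edition='DEVELOPER_EDITION',
        RoleArn=role_arn
    )['Id']

    # Add an S3 data source for each bucket, then start a sync job
    # (wait for the index and each data source to become ACTIVE first).
    for i, bucket in enumerate(['my-kendra-bucket-1', 'my-kendra-bucket-2'], start=1):
        ds_id = kendra.create_data_source(
            IndexId=index_id,
            Name=f'data-source-{i}',
            Type='S3',
            RoleArn=role_arn,
            Configuration={'S3Configuration': {'BucketName': bucket}}
        )['Id']
        kendra.start_data_source_sync_job(Id=ds_id, IndexId=index_id)

    # Make the custom metadata fields facetable, searchable, and displayable
    # (repeat for the other fields in the dataset).
    kendra.update_index(
        Id=index_id,
        DocumentMetadataConfigurationUpdates=[{
            'Name': 'Type',
            'Type': 'STRING_VALUE',
            'Search': {'Facetable': True, 'Searchable': True, 'Displayable': True}
        }]
    )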

Working with relevance tuning

Now that our data is properly indexed and our metadata is facetable, we can test different settings to understand how relevance tuning affects search results. In the following examples, we will boost based on several different attributes. These include the data source, document type, freshness, and popularity.

Boosting your authoritative data sources

The first kind of tuning we look at is based on data sources. Perhaps you have one data source that is well maintained and curated, and another with information that is less accurate and dated. You want to prioritize the results from the first data source so your users get the most relevant results when they perform searches.

When we created our index, we created two data sources. One contains all our blog posts—this is our primary data source. The other contains only a single file, which we’re treating as our legacy data source.

Our index creation script set the field _data_source_id to be facetable, searchable, and displayable. This is an essential step in boosting particular data sources.

The following screenshot shows the index fields of our Amazon Kendra index.

  1. On the Amazon Kendra search console, search for Textract.

Your results should reference posts about Amazon Textract, a service that can automatically extract text and data from scanned documents.

The following screenshot shows the results of a search for ‘Textract’.

Also in the results should be a file called Test_File.txt. This is a file from our secondary, less well-curated data source. Make a note of where this result appears in your search results. We want to de-prioritize this result and boost the results from our primary source.
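If you'd rather test from code than the search console, the same query is a single API call (the index ID is a placeholder):

    import boto3

    kendra = boto3.client('kendra')
    response = kendra.query(IndexId='<your-index-id>', QueryText='Textract')
    for item in response['ResultItems']:
        print(item['Type'], item['DocumentTitle']['Text'])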

  2. Choose Tuning to open the Relevance tuning panel.
  3. Under Text fields, expand data source.
  4. Drag the slider for your first data source to the right to boost the results from this source. For this post, we start by setting it to 8.
  5. Perform another search for Textract.

You should find that the file from the second data source has moved down the search rankings.

  6. Drag the slider all the way to the right, so that the boost is set to 10, and perform the search again.

You should find that the result from the secondary data source has disappeared from the first page of search results.

The following screenshot shows the relevance tuning panel with data source field boost applied to one data source, and the search results excluding the results from our secondary data source.

Although we used this approach with S3 buckets as our data sources, you can use it to prioritize any data source available in Amazon Kendra. You can boost the results from your Amazon S3 data lake and de-prioritize the results from your Microsoft SharePoint system, or vice-versa.

Boosting certain document types

In this use case, we boost the results of our whitepapers over the results from the AWS Machine Learning Blog. We first establish a baseline search result.

  1. Open the Amazon Kendra search console and search for What is machine learning?

Although the top result is a suggested answer from a whitepaper, the next results are likely from blog posts.

The following screenshot shows the results of a search for ‘What is machine learning?’

How do we influence Amazon Kendra to push whitepapers towards the top of its search results?

First, we want to tune the search results based on the content Type field.

  2. Open the Relevance tuning panel on the Amazon Kendra console.
  3. Under Custom fields, expand Type.
  4. Drag the Type field boost slider all the way to the right to set the relevancy of this field to 10.

We also want to boost the importance of a particular Type value, namely Whitepapers.

  5. Expand Advanced boosting and choose Add value.
  6. Whitepapers are indicated in our metadata by the field “Type”: “Whitepaper”, so enter a value of Whitepaper and set the value to 10.
  7. Choose Save.

The following screenshot shows the relevance tuning panel with type field boost applied to the ‘Whitepaper’ document type.
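You can apply the same boost programmatically. The following is a minimal boto3 sketch with a placeholder index ID; an Importance of 10 corresponds to the maximum slider position:

    import boto3

    kendra = boto3.client('kendra')
    kendra.update_index(
        Id='<your-index-id>',
        DocumentMetadataConfigurationUpdates=[{
            'Name': 'Type',
            'Type': 'STRING_VALUE',
            'Relevance': {
                'Importance': 10,                         # field-level boost
                'ValueImportanceMap': {'Whitepaper': 10}  # advanced boosting for this value
            },
            'Search': {'Facetable': True, 'Searchable': True, 'Displayable': True}
        }]
    )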

Wait for up to 10 seconds before you rerun your search. The top results should all be whitepapers, and blog post results should appear further down the list.

The following screenshot shows the results of a search for ‘What is machine learning?’ with type field boost applied.

  8. Return your Type field boost settings to their normal values.

Boosting based on document freshness

You might have a large archive of documents spanning multiple decades, but the more recent answers are more useful. For example, if your users ask, “Where is the IT helpdesk?” you want to make sure they’re given the most up-to-date answer. To achieve this, you can give a freshness boost based on date attributes.

In this use case, we boost the search results to include more recent posts.

  1. On the Amazon Kendra search console, search for medical.

The first result is De-identify medical images with the help of Amazon Comprehend Medical and Amazon Rekognition, published March 19, 2019.

The following screenshot shows the results of a search for ‘medical’.

 

  2. Open the Relevance tuning panel again.
  3. On the Date tab, open Custom fields.
  4. Adjust the Freshness boost of PublishDate to 10.
  5. Search again for medical.

This time the first result is Enhancing speech-to-text accuracy of COVID-19-related terms with Amazon Transcribe Medical, published May 15, 2020.

The following screenshot shows the results of a search for ‘medical’ with freshness boost applied.

You can also expand Advanced boosting to boost results from a particular period of time. For example, if you release quarterly business results, you might want to set the sensitivity range to the last 3 months. This boosts documents released in the last quarter so users are more likely to find them.

The following screenshot shows the section of the relevance tuning panel related to freshness boost, showing the Sensitivity slider to capture range of sensitivity.
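The API equivalent of a freshness boost is a Relevance configuration on the date field. The following sketch uses our PublishDate field with a Duration that roughly corresponds to a 3-month sensitivity range (the index ID is a placeholder):

    import boto3

    kendra = boto3.client('kendra')
    kendra.update_index(
        Id='<your-index-id>',
        DocumentMetadataConfigurationUpdates=[{
            'Name': 'PublishDate',
            'Type': 'DATE_VALUE',
            'Relevance': {
                'Freshness': True,      # boost newer documents
                'Importance': 10,       # strength of the boost
                'Duration': '7776000s'  # ~3 months, expressed in seconds
            }
        }]
    )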

Boosting based on document popularity

The final scenario is tuning based on numerical values. In this use case, we assigned a random number to each post to represent the number of citations they received in scientific journals. (It’s important to reiterate that these are just random numbers, not actual citation numbers!) We want the most frequently cited posts to surface.

  1. Run a search for keras, which is the name of a popular library for ML.

You might see a suggested answer from Amazon Kendra, but the top results (and their synthetic citation numbers) are likely to include:

  2. On the Relevance tuning panel, on the Numeric tab, pull the slider for Citations all the way to 10.
  3. Select Ascending to boost the results that have more citations.

The following screenshot shows the relevance tuning panel with numeric boost applied to the Citations custom field.
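The same numeric boost can be expressed through the API. In this sketch, ASCENDING means documents with larger citation counts receive a larger boost:

    import boto3

    kendra = boto3.client('kendra')
    kendra.update_index(
        Id='<your-index-id>',
        DocumentMetadataConfigurationUpdates=[{
            'Name': 'Citations',
            'Type': 'LONG_VALUE',
            'Relevance': {
                'Importance': 10,
                'RankOrder': 'ASCENDING'  # higher values rank higher
            }
        }]
    )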

  4. Search for keras again and see which results appear.

At the top of the search results are:

Amazon Kendra prioritized the results with more citations.

Conclusion

This post demonstrated how to use relevance tuning to adjust your users’ Amazon Kendra search results. We used a small and somewhat synthetic dataset to give you an idea of how relevance tuning works. Real datasets have a lot more complexity, so it’s important to work with your users to understand which types of search results they want prioritized. With relevance tuning, you can get the most value out of enterprise search with Amazon Kendra! For more information about Amazon Kendra, see AWS re:Invent 2019 – Keynote with Andy Jassy on YouTube, Amazon Kendra FAQs, and What is Amazon Kendra?

Thanks to Tapodipta Ghosh for providing the sample dataset and technical review. This post couldn’t have been written without his assistance.


About the Author

James Kingsmill is a Solution Architect in the Australian Public Sector team. He has a longstanding interest in helping public sector customers achieve their transformation, automation, and security goals. In his spare time, you can find him canyoning in the Blue Mountains near Sydney.

 

Read More

Using A/B testing to measure the efficacy of recommendations generated by Amazon Personalize

Machine learning (ML)-based recommender systems aren’t a new concept, but developing such a system can be a resource-intensive task—from data management during training and inference, to managing scalable real-time ML-based API endpoints. Amazon Personalize allows you to easily add sophisticated personalization capabilities to your applications by using the same ML technology used on Amazon.com for over 20 years. No ML experience required. Customers in industries such as retail, media and entertainment, gaming, travel and hospitality, and others use Amazon Personalize to provide personalized content recommendations to their users. With Amazon Personalize, you can solve the most common use cases: providing users with personalized item recommendations, surfacing similar items, and personalized re-ranking of items.

Amazon Personalize automatically trains ML models from your user-item interactions and provides an API to retrieve personalized recommendations for any user. A frequently asked question is, “How do I compare the performance of recommendations generated by Amazon Personalize to my existing recommendation system?” In this post, we discuss how to perform A/B tests with Amazon Personalize, a common technique for comparing the efficacy of different recommendation strategies.

You can quickly create a real-time recommender system on the AWS Management Console or the Amazon Personalize API by following these simple steps:

  1. Import your historical user-item interaction data.
  2. Based on your use case, start a training job using an Amazon Personalize ML algorithm (also known as a recipe).
  3. Deploy an Amazon Personalize-managed, real-time recommendations endpoint (also known as a campaign).
  4. Record new user-item interactions in real time by streaming events to an event tracker attached to your Amazon Personalize deployment.
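The following boto3 sketch maps to those four steps. All ARNs, names, and the S3 location are placeholders, and each create call is asynchronous, so in practice you wait for every resource to become active before moving to the next step:

    import boto3

    personalize = boto3.client('personalize')
    runtime = boto3.client('personalize-runtime')
    events = boto3.client('personalize-events')

    # 1. Import historical user-item interactions (the dataset group, schema,
    #    and dataset are assumed to already exist in this sketch).
    personalize.create_dataset_import_job(
        jobName='interactions-import',
        datasetArn='<interactions-dataset-arn>',
        dataSource={'dataLocation': 's3://<your-bucket>/interactions.csv'},
        roleArn='<import-role-arn>'
    )

    # 2. Train a solution version with a recipe that matches your use case.
    solution_arn = personalize.create_solution(
        name='user-personalization-solution',
        datasetGroupArn='<dataset-group-arn>',
        recipeArn='arn:aws:personalize:::recipe/aws-user-personalization'
    )['solutionArn']
    version_arn = personalize.create_solution_version(
        solutionArn=solution_arn
    )['solutionVersionArn']

    # 3. Deploy a campaign (a managed real-time recommendations endpoint)
    #    and query it for a user.
    campaign_arn = personalize.create_campaign(
        name='recommendations-campaign',
        solutionVersionArn=version_arn,
        minProvisionedTPS=1
    )['campaignArn']
    recommendations = runtime.get_recommendations(
        campaignArn=campaign_arn, userId='user-123'
    )

    # 4. Stream new interactions to an event tracker in real time.
    events.put_events(
        trackingId='<event-tracker-id>',
        userId='user-123',
        sessionId='session-456',
        eventList=[{'eventType': 'click', 'itemId': 'item-789', 'sentAt': 1597000000}]
    )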

The following diagram represents which tasks Amazon Personalize manages.

Metrics overview

You can measure the performance of ML recommender systems through offline and online metrics. Offline metrics allow you to view the effects of modifying hyperparameters and algorithms used to train your models, calculated against historical data. Online metrics are the empirical results observed in your user’s interactions with real-time recommendations provided in a live environment.

Amazon Personalize generates offline metrics using test datasets derived from the historical data you provide. These metrics showcase how the model recommendations performed against historical data. The following diagram illustrates a simple example of how Amazon Personalize splits your data at training time.

Consider a training dataset containing 10 users with 10 interactions per user; interactions are represented by circles and ordered from oldest to newest based on their timestamp. In this example, Amazon Personalize uses 90% of the users’ interactions data (blue circles) to train your model, and the remaining 10% for evaluation. For each of the users in the evaluation data subset, 90% of their interaction data (green circles) is used as input for the call to the trained model, and the remaining 10% of their data (orange circle) is compared to the output produced by the model to validate its recommendations. The results are displayed to you as evaluation metrics.

Amazon Personalize produces the following metrics:

  • Coverage – This metric is appropriate if you’re looking for what percentage of your inventory is recommended
  • Mean reciprocal rank (at 25) – This metric is appropriate if you’re interested in the single highest ranked recommendation
  • Normalized discounted cumulative gain (at K) – The discounted cumulative gain is a measure of ranking quality; it refers to how well the recommendations are ordered
  • Precision (at K) – This metric is appropriate if you’re interested in how a carousel of size K may perform in front of your users

For more information about how Amazon Personalize calculates these metrics, see Evaluating a Solution Version.

Offline metrics are a great representation of how your hyperparameters and data features influence your model’s performance against historical data. To find empirical evidence of the impact of Amazon Personalize recommendations on your business metrics, such as click-through rate, conversion rate, or revenue, you should test these recommendations in a live environment, getting them in front of your customers. This exercise is important because a seemingly small improvement in these business metrics can translate into a significant increase in your customer engagement, satisfaction, and business outputs, such as revenue.

The following sections include an experimentation methodology and reference architecture in which you can identify the steps required to expose multiple recommendation strategies (for example, Amazon Personalize vs. an existing recommender system) to your users in a randomized fashion and measure the difference in performance in a scientifically sound manner (A/B testing).

Experimentation methodology

The data collected across experiments enables you to measure the efficacy of Amazon Personalize recommendations in terms of business metrics. The following diagram illustrates the experimentation methodology we suggest adhering to.

The process consists of five steps:

  • Research – The formulation of your questions and definition of the metrics to improve are solely based on the data you gather before starting your experiment. For example, after exploring your historical data, you might be interested in why you experience shopping cart abandonment or high bounce rates from leads generated by advertising.
  • Develop a hypothesis – You use the data gathered during the research phase to make observations and develop a change and effect statement. The hypothesis must be quantifiable. For example, providing user personalization through an Amazon Personalize campaign on the shopping cart page will translate into an increase of the average cart value by 10%.
  • Create variations based on the hypothesis – The variations of your experiment are based on the hypothesized behavior you’re evaluating. A newly created Amazon Personalize campaign can be considered the variation of your experiment when compared against an existing rule-based recommendation system.
  • Run an experiment – You can use several techniques to test your recommendation system; this post focuses on A/B testing. The metrics data gathered during the experiment help validate (or invalidate) the hypothesis. For example, a 10% increase in the average cart value after adding Amazon Personalize recommendations to the shopping cart page over 1 month, compared to the average cart value with the current system’s recommendations.
  • Measure the results – In this step, you determine if there is statistical significance to draw a conclusion and select the best-performing variation. Was the increase in your average cart value a result of the randomness of your user testing set, or did the newly created Amazon Personalize campaign influence this increase?

A/B testing your Amazon Personalize deployment

The following architecture showcases a microservices-based implementation of an A/B test between two Amazon Personalize campaigns. One is trained with one of the recommendation recipes provided by Amazon Personalize, and the other is trained with a variation of this recipe. Amazon Personalize provides various predefined ML algorithms (recipes); HRNN-based recipes enable you to provide personalized user recommendations.

This architecture compares two Amazon Personalize campaigns. You can apply the same logic when comparing an Amazon Personalize campaign against a custom rule-based or ML-based recommender system. For more information about campaigns, see Creating a Campaign.

The basic workflow of this architecture is as follows:

  1. The web application requests customer recommendations from the recommendations microservice.
  2. The microservice determines if there is an active A/B test. For this post, we assume your testing strategy settings are stored in Amazon DynamoDB.
  3. When the microservice identifies the group your customer belongs to, it resolves the Amazon Personalize campaign endpoint to query for recommendations.
  4. Amazon Personalize campaigns provide the recommendations for your users.
  5. The users interact with their respective group recommendations.
  6. The web application streams user interaction events to Amazon Kinesis Data Streams.
  7. The microservice consumes the Kinesis stream, which sends the user interaction event to both Amazon Personalize event trackers. Recording events is an Amazon Personalize feature that collects real-time user interaction data and provides relevant recommendations in real time.
  8. Amazon Kinesis Data Firehose ingests your user-item interactions stream and stores the interactions data in Amazon Simple Storage Service (Amazon S3) to use in future trainings.
  9. The microservice keeps track of your pre-defined business metrics throughout the experiment.
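The following is a minimal sketch of the group-assignment logic in steps 2 and 3. The experiment configuration shape and campaign ARNs are illustrative; in the architecture above they would be read from DynamoDB:

    import hashlib
    import boto3

    runtime = boto3.client('personalize-runtime')

    # Illustrative experiment settings; the microservice would read these
    # from DynamoDB.
    EXPERIMENT = {
        'active': True,
        'split': 0.5,  # fraction of users routed to group A
        'campaign_a': '<campaign-arn-for-group-a>',
        'campaign_b': '<campaign-arn-for-group-b>',
    }

    def assign_group(user_id: str) -> str:
        """Deterministically assign a user to group A or B."""
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000 / 1000
        return 'A' if bucket < EXPERIMENT['split'] else 'B'

    def get_recommendations(user_id: str):
        group = assign_group(user_id) if EXPERIMENT['active'] else 'A'
        campaign_arn = EXPERIMENT['campaign_a'] if group == 'A' else EXPERIMENT['campaign_b']
        response = runtime.get_recommendations(campaignArn=campaign_arn, userId=user_id)
        return group, [item['itemId'] for item in response['itemList']]

Hashing the user ID keeps each user in the same group for the duration of the experiment, which is important for consistent measurement.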

For instructions on running an A/B test, see the Retail Demo Store Experimentation Workshop section in the GitHub repo.

Tracking well-defined business metrics is a critical task during A/B testing; seeing improvements on these metrics is the true indicator of the efficacy of your Amazon Personalize recommendations. The metrics measured throughout your A/B tests need to be consistent across your variations (groups A and B). For example, an ecommerce site can evaluate the change (increase or decrease) on the click-through rate of a particular widget after adding Amazon Personalize recommendations (group A) compared to the click-through rate obtained using the current rule-based recommendations (group B).

An A/B experiment runs for a defined period, typically dictated by the number of users necessary to reach a statistically significant result. Tools such as Optimizely, AB Tasty, and Evan Miller’s Awesome A/B Tools can help you determine how large your sample size needs to be. A/B tests are usually active across multiple days or even weeks, in order to collect a large enough sample from your userbase. The following graph showcases the feedback loop between testing, adjusting your model, and rolling out new features on success.

For an A/B test to be considered successful, you need to perform a statistical analysis of the data gathered from your population to determine if there is a statistically significant result. This analysis is based on the significance level you set for the experiment; a 5% significance level is considered the industry standard. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. A lower significance level means that we need stronger evidence for a statistically significant result. For more information about statistical significance, see A Refresher on Statistical Significance.

The next step is to calculate the p-value. The p-value is the probability of observing a result at least as extreme as the one measured, assuming that the null hypothesis is true. In other words, a large p-value indicates that the observed difference is within the range of ordinary sampling fluctuation. For example, imagine we ran an A/A test where we displayed the same variation to two groups of users. After such an experiment, we would expect the metrics results across groups to be very similar but not dramatically different: a p-value greater than your significance level. Therefore, in an A/B test, we hope to see a p-value that is less than our significance level, so we can conclude that the change in the business metric was driven by the variation rather than by chance. AWS Partners such as Amplitude or Optimizely provide A/B testing tools to facilitate the setup and analysis of your experiments.
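As a concrete illustration of the calculation (a sketch, not part of the reference architecture), a two-proportion z-test comparing click-through rates between groups A and B looks like the following:

    from math import erf, sqrt

    def two_proportion_p_value(clicks_a, views_a, clicks_b, views_b):
        """Two-sided p-value for the difference between two click-through rates."""
        p_a, p_b = clicks_a / views_a, clicks_b / views_b
        p_pool = (clicks_a + clicks_b) / (views_a + views_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
        z = (p_a - p_b) / se
        # Convert the z-score to a two-sided p-value using the normal CDF.
        return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

    # Example with made-up counts: group A (Amazon Personalize) vs. group B (existing system).
    p = two_proportion_p_value(clicks_a=620, views_a=10000, clicks_b=540, views_b=10000)
    print(f'p-value: {p:.4f}')  # compare against the 0.05 significance level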

A/B tests are statistical measures of the efficacy of your Amazon Personalize recommendations, allowing you to quantify the impact these recommendations have on your business metrics. Additionally, A/B tests allow you to gather organic user-item interactions that you can use to train subsequent Amazon Personalize implementations. We recommend spending less time on offline tests and getting your Amazon Personalize recommendations in front of your users as quickly as possible. This helps eliminate biases from existing recommender systems in your training dataset, which allows your Amazon Personalize deployments to learn from organic user-item interactions data.

Conclusion

Amazon Personalize is an easy-to-use, highly scalable solution that can help you solve some of the most popular recommendation use cases:

  • Personalized recommendations
  • Similar items recommendations
  • Personalized re-ranking of items

A/B testing provides invaluable information on how your customers interact with your Amazon Personalize recommendations. These results, measured according to well-defined business metrics, will give you a sense of the efficacy of these recommendations along with clues on how to further adjust your training datasets. After you iterate through this process multiple times, you will see an improvement on the metrics that matter most to improve customer engagement.

If this post helps you or inspires you to use A/B testing to improve your business metrics, please share your thoughts in the comments.

Additional resources

For more information about Amazon Personalize, see the following:


About the Author

Luis Lopez Soria is an AI/ML specialist solutions architect working with the AWS machine learning team. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys playing sports, traveling around the world, and exploring new foods and cultures.

 

 

 

Read More

On Becoming Green: 800+ Interns Enliven Our First-Ever Virtual Internship Program

More than 800 students from over 100 universities around the world joined NVIDIA as the first class of our virtual internship program — I’m one of them, working on the corporate communications team this summer.

Shortly after the pandemic’s onset, NVIDIA decided to reinvent its internship program as a virtual one. We’ve been gaining valuable experience and having a lot of fun — all through our screens.

Fellow interns have contributed ideas to teams ranging from robotics to financial reporting. I’ve been writing stories on how cutting-edge tech improves industries from healthcare to animation, learning to work the backend of the company newsroom, and fostering close relationships with some fabulous colleagues.

And did I mention fun? Game show and cook-along events, a well-being panel series and gatherings such as book clubs were part of the programming. We also had several swag bags sent to our doorsteps, which included a customized intern company sweatshirt and an NVIDIA SHIELD TV.

Meet a few other interns who joined the NVIDIA family this year:

Amevor Aids Artists by Using Deep Learning

Christoph Amevor just graduated with a bachelor’s in computational sciences and engineering from ETH Zurich in Switzerland.

At NVIDIA, he’s working on a variety of deep learning projects including one to simplify the workflow of artists and creators using NVIDIA Omniverse, a real-time simulation platform for 3D production pipelines.

“Machine learning is such a powerful tool, and I’ve been interested in seeing how it can help us solve problems that are simply too complex to tackle with analytic math,” Amevor said.

He lives with another NVIDIA intern, which he said has made working from home feel like a mini company location.

Santos Shows Robots the Ropes

Beatriz Santos is an undergrad at California State University, East Bay, studying computer science. She’s a software intern working on the Isaac platform for robotics.

Though the pandemic has forced her to social distance from other humans, Santos has been spending a lot of time with the robot Kaya, in simulation, training it to do various tasks.

Her favorite virtual event this summer was the women’s community panel featuring female leaders at NVIDIA.

“I loved their inputs on working in a historically male-dominated field, and how they said we don’t have to change because of that,” she said. “We can just be ourselves, be girls.”

Sindelar Sharpens Websites

When researching potential summer internships, Justin Sindelar — a marketing major at San Jose State University — was immediately drawn to NVIDIA’s.

“The NVIDIA I once knew as a consumer graphics card company has grown into a multifaceted powerhouse that serves several high-tech industries and has contributed to the proliferation of AI,” he said.

Using the skills he’s learned at school and as a web designer, Sindelar has been performing UX analyses to help improve NVIDIA websites and their accessibility features.

His favorite intern activity was the game show event where he teamed up with his manager and mentors in the digital marketing group to answer trivia questions and fill in movie quotes.

Zhang Zaps Apps Into Shape

Maggie Zhang is a third-year biomedical engineering student at the University of Waterloo in Ontario. She works on the hardware infrastructure team to make software applications that improve workflow for hardware engineers.

When not coding or testing a program, she’s enjoyed online coffee chats, where she formed an especially tight bond with other Canadian interns.

She also highlighted how thankful she is for her team lead and mentor, who set up frequent one-on-one check-ins and taught her new concepts to improve code and make programs more manageable.

“They’ve taught me to be brave, experiment and learn as I go,” she said. “It’s more about what you learn than what you already know.”

For many interns, this fulfilling and challenging summer will lead to future roles at NVIDIA.

Learn more about NVIDIA’s internship program.

Read More

Tackling Open Challenges in Offline Reinforcement Learning

Posted by George Tucker and Sergey Levine, Research Scientists, Google Research


Over the past several years, there has been a surge of interest in reinforcement learning (RL) driven by its high-profile successes in game playing and robotic control. However, unlike supervised learning methods, which learn from massive datasets that are collected once and then reused, RL algorithms use a trial-and-error feedback loop that requires active interaction during learning, collecting data every time a new policy is learned. This approach is prohibitive in many real-world settings, such as healthcare, autonomous driving, and dialogue systems, where trial-and-error data collection can be costly, time consuming, or even irresponsible. Even for problems where some active data collection can be used, the requirement for interactive collection limits dataset size and diversity.

Offline RL (also called batch RL or fully off-policy RL) relies solely on a previously collected dataset without further interaction. It provides a way to utilize previously collected datasets — from previous RL experiments, from human demonstrations, and from hand-engineered exploration strategies — in order to automatically learn decision-making strategies. In principle, while off-policy RL algorithms can be used in the offline setting (fully off-policy), they are generally only successful when used with active environment interaction — without receiving this direct feedback, they often exhibit undesirable performance in practice. Consequently, while offline RL has enormous potential, that potential cannot be reached without resolving significant algorithmic challenges.

In “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, we provide a comprehensive tutorial on approaches for tackling the challenges of offline RL and discuss the many issues that remain. To address these issues, we have designed and released an open-source benchmarking framework, Datasets for Deep Data-Driven Reinforcement Learning (D4RL), as well as a new, simple, and highly effective offline RL algorithm, called conservative Q-learning (CQL).

Benchmarks for Offline RL
In order to understand the capabilities of current approaches and to guide future progress, it is first necessary to have effective benchmarks. A common choice in prior work was to simply use data generated by a successful online RL run. However, while simple, this data collection approach is artificial, because it involves training an online RL agent, which is prohibitive in many real-world settings, as we discussed previously. Instead, one wishes to learn a policy that is better than the current best from diverse data sources that provide good coverage of the task. For example, one might have data collected from a hand-designed controller of a robot arm, and use offline RL to train an improved controller. To enable progress in this field under realistic settings, one needs a benchmark suite that accurately reflects these settings, while being simple and accessible enough to enable rapid experimentation.

D4RL provides standardized environments, datasets and evaluation protocols, as well as reference scores for recent algorithms to help accomplish this. This is a “batteries-included” resource, making it ideal for anyone to jump in and get started with minimal fuss.
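Getting started looks roughly like the following sketch. The dataset name follows D4RL's naming scheme, but check the D4RL repository for the exact identifiers and current API:

    import gym
    import d4rl  # registers the offline RL environments with gym

    # Load a maze navigation task and its pre-collected dataset.
    env = gym.make('maze2d-umaze-v1')
    dataset = env.get_dataset()  # dict of numpy arrays
    print(dataset['observations'].shape,  # (N, obs_dim)
          dataset['actions'].shape,       # (N, act_dim)
          dataset['rewards'].shape)       # (N,)

    # Convenience view as (s, a, r, s') tuples for Q-learning-style methods.
    qlearning_data = d4rl.qlearning_dataset(env)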

Environments in D4RL

The key design goal for D4RL was to develop tasks that reflect both real-world dataset challenges as well as real-world applications. Previous datasets used data collected either from random agents or agents trained with RL. Instead, by thinking through potential applications in autonomous driving, robotics, and other domains, we considered how real-world applications of offline RL might require handling of data generated from human demonstrations or hard-coded controllers, data collected from heterogeneous sources, and data collected by agents with a variety of different goals.

Aside from the widely used MuJoCo locomotion tasks, D4RL includes datasets for more complex tasks. The Adroit domain, which requires manipulating a realistic robotic hand to use a hammer, for example, illustrates the challenges of working with limited human demonstrations, without which these tasks are extremely challenging. Previous work found that existing datasets could not distinguish between competing methods, whereas the Adroit domain reveals clear deficiencies between them.

Another common scenario for real-world tasks is one in which the dataset used for training is collected from agents performing a wide range of other activities that are related to, but not specifically targeted towards, the task of interest. For example, data from human drivers may illustrate how to drive a car well, but do not necessarily show how to reach a specific desired destination. In this case, one might like offline RL methods to “stitch” together parts of routes in the driving dataset to accomplish a task that was not actually seen in the data (i.e., navigation). As an illustrative example, given paths labeled “A” and “B” in the picture below, offline RL should be able to “remix” them to produce path C.

Having only observed paths A and B, they can be combined to form a shortest path (C).

We constructed a series of increasingly difficult tasks to exercise this “stitching” ability. The maze environments, shown below, require two robots (a simple ball or an “Ant” robot) to navigate to locations in a series of mazes.

Maze navigation environments in D4RL, which require “stitching” parts of paths to accomplish new navigational goals that were not seen in the dataset.

A more complex “stitching” scenario is provided by the Franka kitchen domain (based on the Adept environment), where demonstrations from humans using a VR interface comprise a multi-task dataset, and offline RL methods must again “remix” this data.

The “Franka kitchen” domain requires using data from human demonstrators performing a variety of different tasks in a simulated kitchen.

Finally, D4RL includes two tasks that are meant to more accurately reflect potential realistic applications of offline RL, both based on existing driving simulators. One is a first-person driving dataset that utilizes the widely used CARLA simulator developed at Intel, which provides photo-realistic images in realistic driving domains, and the other is a dataset from the Flow traffic control simulator (from UC Berkeley), which requires controlling autonomous vehicles to facilitate effective traffic flow.

D4RL includes datasets based on existing realistic simulators for driving with CARLA (left) and traffic management with Flow (right).

We have packaged these tasks and standardized datasets into an easy-to-use Python package to accelerate research. Furthermore, we provide benchmark numbers for all tasks using relevant prior methods (BC, SAC, BEAR, BRAC, AWR, BCQ), in order to baseline new approaches. We are not the first to propose a benchmark for offline RL: a number of prior works have proposed simple datasets based on running RL algorithms, and several more recent works have proposed datasets with image observations and other features. However, we believe that the more realistic dataset composition in D4RL makes it an effective way to drive progress in the field.

An Improved Algorithm for Offline RL
As we developed the benchmark tasks, we found that existing methods could not solve the more challenging tasks. The central challenge arises from a distributional shift: in order to improve over the historical data, offline RL algorithms must learn to make decisions that differ from the decisions taken in the dataset. However, this can lead to problems when the consequences of a seemingly good decision cannot be deduced from the data — if no agent has taken this particular turn in the maze, how does one know if it leads to the goal or not? Without handling this distributional shift problem, offline RL methods can extrapolate erroneously, making over-optimistic conclusions about the outcomes of rarely seen actions. Contrast this with the online setting, where reward bonuses modeled after curiosity and surprise optimistically bias the agent to explore all potentially rewarding paths. Because the agent receives interactive feedback, if the action turns out to be unrewarding, then it can simply avoid the path in the future.

To address this, we developed conservative Q-learning (CQL), an offline RL algorithm designed to guard against overestimation while avoiding explicit construction of a separate behavior model and without using importance weights. While standard Q-learning (and actor-critic) methods bootstrap from previous estimates, CQL is unique in that it is fundamentally a pessimistic algorithm: it assumes that if a good outcome was not seen for a given action, that action is likely to not be a good one. The central idea of CQL is to learn a lower bound on the policy’s expected return (called the Q-function), instead of learning to approximate the expected return. If we then optimize our policy under this conservative Q-function, we can be confident that its value is no lower than this estimate, preventing errors from overestimation.
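The following simplified sketch shows the conservative penalty for a discrete-action Q-network. It omits many details of the full CQL algorithm, such as the actor-critic formulation used for continuous actions and the tuning of the penalty weight:

    import torch
    import torch.nn.functional as F

    def cql_loss(q_network, target_network, batch, gamma=0.99, alpha=1.0):
        """Conservative Q-learning loss for discrete actions (simplified sketch)."""
        obs, actions, rewards, next_obs, dones = batch

        q_values = q_network(obs)  # shape (batch, num_actions)
        q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

        # Standard Bellman error, computed only from transitions in the dataset.
        with torch.no_grad():
            next_q = target_network(next_obs).max(dim=1).values
            target = rewards + gamma * (1.0 - dones) * next_q
        bellman_error = F.mse_loss(q_taken, target)

        # Conservative penalty: push down Q-values over all actions (logsumexp)
        # while pushing up Q-values of the actions actually seen in the data,
        # so that rarely seen actions are not over-estimated.
        conservative_penalty = (torch.logsumexp(q_values, dim=1) - q_taken).mean()

        return bellman_error + alpha * conservative_penalty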

We found that CQL attains state-of-the-art results on many of the harder D4RL tasks: CQL outperformed other approaches on the AntMaze, Kitchen tasks, and 6 out of 8 Adroit tasks. In particular, on the AntMaze tasks, which require navigating through a maze with an “Ant” robot, CQL is often the only algorithm that is able to learn non-trivial policies. CQL also performs well on other tasks, including Atari games. On the Atari tasks from Agarwal et al., CQL outperforms prior methods when data is limited (“1%” dataset). Moreover, CQL is simple to implement on top of existing algorithms (e.g., QR-DQN and SAC), without training additional neural networks.

Performance of CQL on Atari games with the 1% dataset from Agarwal et al.

Future Thoughts
We are excited about the fast-moving field of offline RL. While we took a first step towards a standard benchmark, there is clearly still room for improvement. We expect that as algorithms improve, we will need to reevaluate the tasks in the benchmark and develop more challenging tasks. We look forward to working with the community to evolve the benchmark and evaluation protocols. Together, we can bring the rich promises of offline RL to real-world applications.

Acknowledgements
This work was carried out in collaboration with UC Berkeley PhD students Aviral Kumar, Justin Fu, and Aurick Zhou, with contributions from Ofir Nachum from Google Research.

Read More

Starry, Starry Night: AI-Based Camera System Discovers Two New Meteor Showers

Spotting a meteor flash across the sky is a rare event for most people, unless you’re the operators of the CAMS meteor shower surveillance project, who frequently spot more than a thousand in a single night and recently discovered two new showers.

CAMS, which stands for Cameras for Allsky Meteor Surveillance, was founded in 2010. Since 2017, it’s been improved by researchers using AI at the Frontier Development Lab, in partnership with NASA and the SETI Institute.

The project uses AI to identify whether a point of light moving in the night sky is a bird, plane, satellite or, in fact, a meteor. The CAMS network consists of cameras that take pictures of the sky, at a rate of 60 frames per second.

The AI pipeline also verifies the findings to confirm the direction from which meteoroids, small pieces of comets that cause meteors, approach the Earth. The project’s AI model training is optimized on NVIDIA TITAN GPUs housed at the SETI Institute.

Each night’s meteor sightings are then mapped onto the NASA meteor shower portal, a visualization tool available to the public. All meteor showers identified since 2010 are available on the portal.

CAMS detected two new meteor showers in mid-May, called the gamma Piscis Austrinids and the sigma Phoenicids. They were added to the International Astronomical Union’s meteor data center, which has recorded 1,041 unique meteor showers to date.

Analysis found both showers to be caused by meteoroids from long-period comets, which take more than 200 years to complete an orbit around the sun.

Improving the Meteor Classification Process

Peter Jenniskens, principal investigator for CAMS, has been classifying meteors since he founded the project in 2010. Before having access to NVIDIA’s GPUs, Jenniskens would look at the images these cameras collected and judge by eye if a light curve from a surveyed object fit the categorization for a meteor.

Now, the CAMS pipeline is entirely automated, from the transferring of data from an observatory to the SETI Institute’s server, to analyzing the findings and displaying them on the online portal on a nightly basis.

With the help of AI, researchers have been able to expand the project and focus on its real-world impact, said Siddha Ganju, a solutions architect at NVIDIA and member of FDL’s AI technical steering committee who worked on the CAMS project.

“The goal of studying space is to figure out the unknowns of the unknowns,” said Ganju. “We want to know what we aren’t yet able to know. Access to data, instruments and computational power is the holy trifecta available today to make discoveries that would’ve been impossible 50 years ago.”

Public excitement around the CAMS network has spurred it to expand the number of cameras fourfold since the project began incorporating AI in 2017. With stations all over the world, from Namibia to the Netherlands, the project now hunts for one-hour long meteor showers, which are only visible in a small part of the world at a given time.

Applying the Information Gathered

The AI model, upon identifying a meteor, calculates the direction it’s coming from. According to Jenniskens, meteors come in groups, called meteoroid streams, which are mostly caused by comets. A comet can approach from as far as Jupiter or Saturn, he said, and when it’s that far away, it’s impossible to see until it comes closer to Earth.

The project’s goal is to enable astronomers to look along the path of an approaching comet and provide enough time to figure out the potential impact it may have on Earth.

Mapping out all discoverable meteor showers brings us a step closer to figuring out what the entire solar system looks like, said Ganju, which is crucial to identifying the potential dangers of comets.

But this map, NASA’s meteor shower portal, isn’t just for professional use. The visualization tool was made available online with the goal of “democratizing science for citizens and fostering interest in the project,” according to Ganju. Anyone can use it to find out what meteor showers are visible each night.

Check out a timeline of notable CAMS discoveries.

Read More

There’s a Code for That: Hugging Face’s Sam Shleifer Talks Natural Language Processing

Hugging Face is more than just an adorable emoji — it’s a company that’s demystifying AI by transforming the latest developments in deep learning into usable code for businesses and researchers.

Research engineer Sam Shleifer spoke with AI Podcast host Noah Kravitz about Hugging Face NLP technology, which is in use at over 1,000 companies, including Apple, Bing and Grammarly, across fields ranging from finance to medical technology.

Hugging Face’s models serve a variety of purposes for their customers, including autocompletion, customer service automation and translation. Their popular web application, Write with Transformer, can even take half-formed thoughts and suggest options for completion.

Shleifer is currently at work developing models that are accessible to everyone, whether they are proficient coders or not.

In the next few years, Shleifer envisions the continued growth of smaller NLP models that power a wave of chat apps with state-of-the-art translation capabilities.

Key Points From This Episode:

  • Hugging Face first launched an original chatbot app, before moving into natural language processing models. The move was well-received, and last year the company announced a $15 million funding round.
  • The company is a member of NVIDIA Inception, a virtual accelerator that Shleifer credits with significantly accelerating their experiments.
  • Hugging Face has released over 1,000 models trained with unsupervised learning and the Open Parallel Corpus project, pioneered by the University of Helsinki. These models are capable of machine translation in a huge variety of languages, even for low-resource languages with minimal training data.

Tweetables:

“We’re trying to make state-of-the-art NLP accessible to everyone who wants to use it, whether they can code or not code.” — Sam Shleifer [1:44]

“Our research is targeted at this NLP accessibility mission — and NLP isn’t really accessible when models can’t fit on a single GPU.” — Sam Shleifer [10:38]

You Might Also Like

Sarcasm Detector Uses AI to Understand People at Their Funniest, Meanest

Dr. Pushpak Bhattacharyya’s work is giving computers the ability to understand one of humanity’s most challenging, and amusing, modes of communication. Bhattacharyya, director of IIT Patna, and a professor at the Computer Science and Engineering Department at IIT Bombay, has spent the past few years using GPU-powered deep learning to detect sarcasm.

Speaking the Same Language: How Oracle’s Conversational AI Serves Customers

At Oracle, customer service chatbots use conversational AI to respond to users with more speed and complexity. Suhas Uliyar, vice president of bots, AI and mobile product management at Oracle, talks about how the newest wave of conversational AI can keep up with the nuances of human conversation.

How Syed Ahmed Taught AI to Translate Sign Language

Syed Ahmed, a research assistant at the National Technical Institute for the Deaf, is directing the power of AI toward another form of communication: American Sign Language. And what Ahmed has done is set up a deep learning model that translates ASL into English.

Tune in to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn. If your favorite isn’t listed here, drop us a note.

Make the AI Podcast Better

Have a few minutes to spare? Fill out this listener survey. Your answers will help us make a better podcast.

Read More

The fastest driver in Formula 1

This blog post was co-authored, and includes an introduction, by Rob Smedley, Director of Data Systems at Formula 1

Formula 1 (F1) racing is the most complex sport in the world. It is the blended perfection of human and machine that create the winning formula. It is this blend that makes F1 racing, or more pertinently, the driver talent, so difficult to understand. How many races or Championships would Michael Schumacher really have won without the power of Benetton and later, Ferrari, and the collective technical genius that were behind those teams? Could we really have seen Lewis Hamilton win six World Championships if his career had taken a different turn and he was confined to back-of-the-grid machinery? Maybe these aren’t the best examples because they are two of the best drivers the world has ever seen. There are many examples, however, of drivers whose real talent has remained fairly well hidden throughout their career. Those that never got that “right place, right time” break into a winning car and, therefore, those that will be forever remembered as a midfield driver.

The latest F1 Insight powered by AWS is designed to build mathematical models and algorithms that can help us answer the perennial question: who is the fastest driver of all time? F1 and AWS scientists have spent almost a year building these models and algorithms to bring us that very insight. The output focuses solely on one element of a driver’s vast armory—the pure speed that is most evident on a Saturday afternoon during the qualifying hour. It doesn’t focus on racecraft or the ability to win races or drive at 200 mph while still having the bandwidth to understand everything going on around you (displayed so well by the likes of Michael Schumacher or Fernando Alonso). This ability, which transcends speed alone, allowed them both, on many an occasion, to operate as master tacticians. For someone like myself, who has had the honor of watching those very skills in action from the pitwall, I cannot emphasize enough how important those skills are—they are the difference between the good and the great. It is important to point out that these skills are not included in this insight. This is about raw speed only and the ability to push the car to its very limits over one lap.

The output and the list of the fastest drivers of all time (based on the F1 Historic Data Repository information spanning from 1983 to present day) offers some great names indeed. Of course, there are the obvious ones that rank highly—Ayrton Senna, Michael Schumacher, Lewis Hamilton, all of whom emerge as the top five fastest drivers. However, there are some names that many may not think of as top 20 drivers on first glance. A great example I would cite is Heikki Kovalainen. Is that the Kovalainen that finished his career circling round at the back of the Grand Prix field in Caterham, I hear you ask? Yes in fact, it’s the very same. For those of us who watched Kovalainen throughout his F1 career, it comes as little surprise that he is so high up the list when we consider pure speed. Look at his years on the McLaren team against Lewis Hamilton. The qualifying speaks volumes, with the median difference of just 0.1 seconds per lap. Ask Kovalainen himself and he’ll tell you that he didn’t perform at the same level as Hamilton in the races for many reasons (this is a tough business, believe me). But in qualifying, his statistics speak for themselves—the model has ranked him so highly because of his consistent qualifying performances throughout his career. I, for one, am extremely happy to see Kovalainen get the data-driven recognition that he deserves for that raw talent that was always on display during qualifying. There are others in the list, too, and hopefully some of these are your favorites—drivers that you have been banging the drum about for the last 10, 20, 40 years; the ones that might never have gotten every break, but you were able to see just how talented they were.

— Rob Smedley


Fastest Driver

As part of F1’s 70th anniversary celebrations and to help fans better understand who are the fastest drivers in the sport’s history, F1 and the Amazon Machine Learning Solutions Lab teamed up to develop Fastest Driver, the latest F1 Insight powered by AWS.

Fastest Driver uses AWS machine learning (ML) to rank drivers using their qualifying sessions lap times from F1’s Historic Data Repository going back to 1983. In this post, we demonstrate how by using Amazon SageMaker, a fully managed service to build, train, and deploy ML models, the Fastest Driver insight can objectively determine the fastest drivers in F1.

Quantifying driver pace using qualifying data

We define pace as a driver’s lap time during qualifying sessions. Driver race performance depends on a large number of factors, such as weather conditions, car setup (such as tires), and race track. F1 qualifying sessions are split into three sessions: the first session eliminates cars that set a lap time in 16th position or lower, the second eliminates positions 11–15, and the final part determines the grid position of 1st (pole position) to 10th. We use all qualification data from the driver qualifying sessions to construct Fastest Driver.

Lap times from qualifying sessions are normalized to adjust for differences in race tracks, which enables us to pool lap times across different tracks. This normalization process equalizes driver lap time differences, helping us compare drivers across race tracks and eliminating the need to construct track-specific models to account for track alterations over time. Another important technique is that we compare qualifying data for drivers on the same race team (such as Aston Martin Red Bull Racing), where teammates have competed against each other in a minimum of five qualifying sessions. By holding the team constant, we get a direct performance comparison under the same race conditions while controlling for car effects.

Differences in race conditions (such as wet weather) and rule changes lead to significant variations in driver performance. We identify and remove anomalous lap time outliers using deviations from the median lap time difference between teammates, with a 2-second threshold. For example, let’s compare Daniel Ricciardo with Sebastian Vettel when they raced together for Red Bull in 2014. During that season, Ricciardo was, on average, 0.2 seconds faster than Vettel. However, the average lap time difference between Ricciardo and Vettel falls to 0.1 seconds if we exclude the 2014 US Grand Prix (GP), where Ricciardo was more than 2 seconds faster than Vettel on account of Vettel being penalized to comply with the 107% rule (which forced him to start from the pit lane).
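
To make the 2-second rule concrete, the following sketch flags and drops sessions where a teammate delta strays more than 2 seconds from the pairing’s median; the table and numbers are illustrative, not the actual qualifying data:

import pandas as pd

# Per-session lap time deltas (driver_a minus driver_b, in seconds) for one pairing
deltas = pd.DataFrame({
    'race_id':  [1, 2, 3, 4, 5],
    'driver_a': ['RIC'] * 5,
    'driver_b': ['VET'] * 5,
    'delta':    [-0.15, -0.20, -0.25, -0.10, -2.30],
})

# Flag sessions that deviate from the pairing's median delta by more than 2 seconds
median_delta = deltas.groupby(['driver_a', 'driver_b'])['delta'].transform('median')
outlier = (deltas['delta'] - median_delta).abs() > 2.0
clean_deltas = deltas[~outlier]
print(clean_deltas)

Dropping the flagged session moves the average delta in this toy example from roughly 0.6 seconds to under 0.2 seconds, mirroring the Ricciardo/Vettel example above.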

Constructing Fastest Driver

Building a performant ML model starts with good data. Following the driver qualification data aggregation process, we construct a network of teammate comparisons over the years, with the goal of comparing drivers across all teams, seasons, and circuits. For example, Sebastian Vettel and Max Verstappen have never been on the same team, so we compare them through their respective connections with Daniel Ricciardo at Red Bull. Ricciardo was, on average, 0.18 seconds slower than Verstappen during the 2016–2018 seasons while they were at Red Bull. We remove outlier sessions, such as the 2018 Bahrain GP, where Ricciardo was quicker than Verstappen by a large margin because Verstappen didn’t get past Q1 due to a crash. If each qualifying session is assumed to be equally important, a subset of our driver network including only Ricciardo, Vettel, and Verstappen yields Verstappen as the fastest driver: Verstappen was 0.18 seconds faster than Ricciardo, and Ricciardo 0.1 seconds faster than Vettel.

Using the full driver network, we can compare all driver pairings to determine the faster racers. Going back to Heikki Kovalainen, let’s look at his years at McLaren against Lewis Hamilton. The qualifying record speaks volumes, with a median difference of just 0.1 seconds per lap. Kovalainen doesn’t have the same number of World Championships as Hamilton, but his qualifying statistics speak for themselves: the model ranks him highly because of his consistent qualifying performance throughout his career.

An algorithm called Massey’s method (a form of linear regression) is one of the core models behind the insight. Fastest Driver uses Massey’s method to rank drivers by solving a set of linear equations, where each driver’s rating is calculated from their average lap time difference against teammates. Additionally, when comparing teammates’ ratings, the model uses features like the driver’s strength of schedule, normalized by the number of interactions with that driver. Overall, the model gives high rankings to drivers who perform extraordinarily well against their teammates or who perform well against strong opponents.

Our goal is to assign each driver a numeric rating from which we can infer that driver’s competitive advantage relative to other drivers, assuming the expected margin of lap time difference in any session is proportional to the difference between the drivers’ true intrinsic ratings. For the more mathematically inclined reader: let x_j index the drivers and r_j represent the true intrinsic driver ratings. For every paired comparison i, we can predict the margin of lap time advantage or disadvantage (y_i) between any pair of drivers as:

y_i = sum_j x_ij * r_j + e_i

In this equation, x_ij is +1 for the winner and -1 for the loser of comparison i, and e_i is the error term due to unexplained variations. For a given set of m paired observations and n drivers, we can formulate an m x n system of linear equations:

y = X r + e

where y is the m-vector of observed lap time margins, X is the m x n matrix of +1/-1 indicators, and r is the n-vector of driver ratings.

Driver ratings (r) are the solution to the normal equations via linear regression:

r = (X^T X)^(-1) X^T y

The following example code for Massey’s method and for calculating driver rankings with Amazon SageMaker demonstrates the training process:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

# Example data comparing five drivers (driver indices 0-4 and their lap times in seconds)
data = pd.DataFrame([[1, 0, 88, 90],
                     [2, 1, 87, 88],
                     [2, 0, 87, 90],
                     [3, 4, 88, 92],
                     [3, 1, 88, 90],
                     [1, 4, 90, 92]], columns=['Driver1', 'Driver2', 'Driver1_laptime', 'Driver2_laptime'])

def init_linear_regressor_matrix(data, num_of_drivers, col_to_rank):
    """Initialize the linear system matrix for regression."""
    wins = np.zeros((data.shape[0], num_of_drivers))
    score_diff = np.zeros(data.shape[0])

    for index, row in data.iterrows():
        idx1 = row["Driver1"]
        idx2 = row["Driver2"]
        # The faster driver of each pairing gets +1, the slower driver -1,
        # and the response is the (positive) lap time difference.
        if row['Driver1_laptime'] - row['Driver2_laptime'] > 0:
            wins[index][idx1] = -1
            wins[index][idx2] = 1
            score_diff[index] = row['Driver1_laptime'] - row['Driver2_laptime']
        else:
            wins[index][idx1] = 1
            wins[index][idx2] = -1
            score_diff[index] = row['Driver2_laptime'] - row['Driver1_laptime']
    wins_df = pd.DataFrame(wins)
    wins_df[col_to_rank] = score_diff
    return wins_df

def massey(data, num_of_drivers, col_to_rank='delta'):
    """Compute, for each driver, the adjacency matrix and aggregated scores as input to the Massey model."""
    wins_df = init_linear_regressor_matrix(data, num_of_drivers, col_to_rank)
    model = sm.OLS(
        wins_df[col_to_rank], wins_df.drop(columns=[col_to_rank])
    )
    results = model.fit(cov_type='HC1')
    rankings = pd.DataFrame(results.params)
    rankings['std'] = np.sqrt(np.diag(results.cov_params()))
    rankings['consistency'] = (norm.ppf(0.9) - norm.ppf(0.1)) * rankings['std']
    rankings = (
        rankings
        .sort_values(by=0, ascending=False)
        .reset_index()
        .rename(columns={"index": "Driver", 0: "massey"})
    )
    rankings = rankings.sort_values(by=["massey"], ascending=False)
    # Express each rating as a gap to the best (fastest) driver, so the fastest driver has 0.
    rankings["massey_new"] = rankings["massey"].max() - rankings["massey"]
    return rankings[['Driver', 'massey_new']]

rankings = massey(data, 5)
print(rankings)

The kings of the asphalt

Topping our list of the fastest drivers are the esteemed Ayrton Senna, Michael Schumacher, Lewis Hamilton, Max Verstappen, and Fernando Alonso. The ranking is delivered through the Fastest Driver insight, which produces a dataset covering all drivers from 1983 to the present day, ordered by pure speed (qualifying pace); each entry lists the Driver, a Rank (integer), and the Gap to Best (milliseconds).

It’s important to note that to quantify a driver’s ability, we need to observe a minimum number of interactions. To factor this in, we only include teammates who have competed against each other in at least five qualifying sessions. A number of parameters and considerations are also in place to identify conditions that make comparisons unfair, such as crashes, mechanical failures, age, career breaks, or weather changing over the course of a qualifying session.

Furthermore, we noticed that if a driver rejoined F1 following a break of three years or more (such as Michael Schumacher in 2010, Pedro de la Rosa in 2010, Narain Karthikeyan in 2011, and Robert Kubica in 2019), this adds a 0.1-second advantage to the driver’s relative pace. This is exemplified when drivers have a large age gap with their teammates, such as Mark Webber vs. Sebastian Vettel in 2013, Felipe Massa vs. Lance Stroll in 2017, and Kimi Räikkönen vs. Antonio Giovinazzi in 2019. From 1983 to 2019, we observe that competing against a teammate who is significantly older gives a 0.06-second advantage.

These rankings aren’t proposed as definitive, and there will no doubt be disagreement among fans. In fact, we encourage a healthy debate! Fastest Driver presents a scientific approach to driver ranking, aimed at objectively assessing a driver’s performance while controlling for car differences.

Lightweight and flexible deployment with Amazon SageMaker

To deliver the insights from Fastest Driver, we implemented Massey’s method on a Python web server. One complication was that the qualifying data consumed by the model is updated with fresh lap times after every race weekend. To handle this, in addition to the standard request to the web server for the rankings, we implemented a refresh request that instructs the server to download new qualifying data from Amazon Simple Storage Service (Amazon S3).

We deployed our model web server to an Amazon SageMaker model endpoint. This ensures our endpoint is highly available, because multi-instance Amazon SageMaker model endpoints are distributed across multiple Availability Zones by default and have automatic scaling capabilities built in. As an additional benefit, the endpoints integrate with other Amazon SageMaker features, such as Amazon SageMaker Model Monitor, which automatically monitors model drift in an endpoint. Using a fully managed service like Amazon SageMaker keeps our final architecture very lightweight. To complete the deployment, we added an API layer around our endpoint using Amazon API Gateway and AWS Lambda. The following diagram shows this architecture in action.

The architecture includes the following steps:

  1. The user makes a request to API Gateway.
  2. API Gateway passes the request to a Lambda function.
  3. The Lambda function makes a request to the Amazon SageMaker model endpoint. If the request is for rankings, the endpoint computes the driver rankings using the currently available qualifying data and returns the result. If the request is to refresh, the endpoint downloads the new qualifying data from Amazon S3 (a minimal sketch of this Lambda function follows).
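
The following is a minimal sketch of the Lambda function in steps 2 and 3, assuming a Python runtime with boto3; the endpoint name, query parameter, and payload format are illustrative rather than the values used in production:

import json
import boto3

sagemaker_runtime = boto3.client('sagemaker-runtime')
ENDPOINT_NAME = 'fastest-driver-endpoint'  # hypothetical endpoint name

def lambda_handler(event, context):
    # API Gateway forwards the request; 'rankings' returns the current rankings,
    # while 'refresh' tells the model server to pull new qualifying data from Amazon S3.
    action = (event.get('queryStringParameters') or {}).get('action', 'rankings')
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=json.dumps({'action': action}),
    )
    return {
        'statusCode': 200,
        'body': response['Body'].read().decode('utf-8'),
    }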

Summary

In this post, we described how F1 and the Amazon ML Solutions Lab scientists collaborated to create Fastest Driver, the first objective and data-driven model to determine who might be the fastest driver ever. This collaborative work between F1 and AWS has provided a unique view of one of the sport’s most enduring questions by looking back at its history on its 70th anniversary. Although F1 is the first to employ ML in this way, you can apply the technology to answer complex questions in sports, or even settle age-old disputes with fans of rival teams. This F1 season, fans will have many opportunities to see Fastest Driver in action and launch into their own debates about the sport’s all-time fastest drivers.

Sports leagues around the world are using AWS machine learning technology to transform the fan experience. The Guinness Six Nations Rugby Championship competition and Germany’s Bundesliga use AWS to bring fans closer to the action of the game and deliver deeper insights. In America, the NFL uses AWS to bring advanced stats to fans, players, and the league to improve player health and safety initiatives using AI and ML.

If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab program.


About the Authors

Rob Smedley has over 20 years of experience in the world of motorsport, having spent time at multiple F1 teams, including Jordan, as a Race Engineer at Ferrari, and most recently as Head of Vehicle Performance at Williams. He is now Director of Data Systems at Formula 1 and oversees the F1 Insights program from the technical data side.

Colby Wise is a Data Scientist and manager at the Amazon ML Solutions Lab, where he helps AWS customers across numerous industries accelerate their AI and cloud adoption.

Delger Enkhbayar is a data scientist in the Amazon ML Solutions Lab. She has worked on a wide range of deep learning use cases in sports analytics, the public sector, and healthcare. Her background is in mechanism design and econometrics.

Guang Yang is a data scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals and applies creative problem solving to generate value with state-of-the-art ML/AI solutions.

Ryan Cheng is a Deep Learning Architect in the Amazon ML Solutions Lab. He has worked on a wide range of ML use cases from sports analytics to optical character recognition. In his spare time, Ryan enjoys cooking.

George Price is a Deep Learning Architect at the Amazon ML Solutions Lab where he helps build models and architectures for AWS customers. Previously, he was a software engineer working on Amazon Alexa.

 

Read More

Understanding Deep Learning on Controlled Noisy Labels

Understanding Deep Learning on Controlled Noisy Labels

Posted by Lu Jiang, Senior Research Scientist and Weilong Yang, Senior Staff Software Engineer, Google Research

The success of deep neural networks depends on access to high-quality labeled training data, as the presence of label errors (label noise) in training data can greatly reduce the accuracy of models on clean test data. Unfortunately, large training datasets almost always contain examples with inaccurate or incorrect labels. This leads to a paradox: on one hand, large datasets are necessary to train better deep networks, while on the other hand, deep networks tend to memorize training label noise, resulting in poorer model performance in practice.

The research community has recognized the importance of this problem, introducing works attempting to understand noisy training labels, e.g., by Arpit et al., as well as mitigation strategies, such as MentorNet or co-teaching, to overcome them. Controlled experiments play a crucial role in understanding noisy labels by studying the impact of the noise level — the percentage of examples with incorrect labels in the dataset — on model performance. However, current experiments have only been performed on synthetic labels, in which noisy examples have randomly assigned labels, not real-world label noise, which follows a different noise distribution. Such studies may then result in very different or even contradictory findings about noisy labels compared to practical experience. In addition, methods that perform well on synthetic noise may not work as well on real-world noisy labels.

In “Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels”, published at ICML 2020, we make three contributions towards better understanding deep learning on non-synthetic noisy labels. First, we establish the first controlled dataset and benchmark of realistic, real-world label noise sourced from the web (i.e., web label noise). Second, we propose a simple but highly effective method to overcome both synthetic and real-world noisy labels. Finally, we conduct the largest study to date that compares synthetic and web label noise across a wide variety of settings.

Properties of Synthetic vs Real-World (Web) Label Noise
There are a number of differences between the distribution of images with synthetic versus real-world (web) label noise. First, images with web label noise tend to be more consistent, visually or semantically, with the true positive images. Second, synthetic label noise is at class-level (all examples in the same class are equally noisy), whereas real-world label noise is at instance-level (certain images are more likely to be mislabelled than others, regardless of the associated class). For example, images of “Honda Civic” and “Honda Accord” are more often confused when the images are taken from the side than when the vehicles are imaged from the front. Third, images with real-world label noise come from an open class vocabulary that may not overlap with the class vocabulary of a specific dataset. For example, the web noisy images of “ladybug” include classes such as “fly” and other bugs that are not included in the class list of the dataset being used. The benchmark for controlled label noise will help provide better quantitative understanding of the differences between synthetic and real-world web label noise.

Benchmark for Controlled Label Noise from the Web
The benchmark in this work is built on two public datasets: Mini-ImageNet, for coarse-grained image classification, and Stanford Cars, for fine-grained image classification. We gradually replace clean images in these datasets with incorrectly labeled images gathered from the web, following standard methods for the construction of synthetic datasets.

To do this, we collect images from the web using the class name (e.g., “ladybug”) as a keyword, an automatic approach to collect noisy labeled images from the web without manual annotations. Each retrieved image is then examined by 3-5 annotators using the Google Cloud Labeling Service, who identify whether the given web label is correct, yielding nearly 213k annotated images. We use these web images with incorrect labels to replace a percentage of the clean training images in the original Mini-ImageNet and Stanford Cars datasets. We create 10 different datasets with progressively higher levels of label noise (from 0% to 80% of the training data carrying erroneous labels). The datasets have been open-sourced at our Controlled Noisy Web Labels website.
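
As a rough sketch of how a controlled noise level can be assembled (the function and the noise-level spacing below are illustrative; the actual benchmark uses the annotated web images described above), a fixed fraction of clean examples is swapped for verified mislabeled web examples:

import random

def build_noisy_split(clean_examples, web_noisy_examples, noise_level, seed=0):
    """Replace a fraction `noise_level` of clean examples with web-noise examples."""
    rng = random.Random(seed)
    # Assumes enough verified web-noise examples are available for the requested level.
    n_noisy = int(round(noise_level * len(clean_examples)))
    kept_clean = rng.sample(clean_examples, len(clean_examples) - n_noisy)
    injected = rng.sample(web_noisy_examples, n_noisy)
    dataset = kept_clean + injected
    rng.shuffle(dataset)
    return dataset

# Ten training sets with progressively higher label noise, from 0% up to 80%
# (the exact spacing of levels here is illustrative).
noise_levels = [0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]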

Comparison of synthetic label noise and web label noise. From left to right, columns are true positive images in the Mini-ImageNet or Stanford Cars dataset, images with incorrect synthetic labels, and images with incorrect web labels (collected in the present work).

MentorMix: A Simple Robust Learning Method
Given a dataset of some unknown noise level, our goal is to train a robust model that can generalize well on the clean test data. Building on two existing techniques, MentorNet and Mixup, we introduce a simple yet effective method called MentorMix, which works with a given model of interest to overcome both synthetic and real-world noisy labels.

MentorMix is an iterative approach that comprises four steps: weight, sample, mixup, and weight again. In the first step, a weight is computed for every example in a mini-batch by a MentorNet network, which can be tailored to the task at hand, and the weights are normalized into a distribution. In practice, the goal is to assign high weights for correctly labeled examples and zero weights for incorrectly labeled examples. In reality, we don’t know which are correct and which are incorrect, so MentorNet weights are based on approximations. In the example here, MentorNet uses the StudentNet training loss to determine the weights in the distribution. 

Next, for each example, we use importance sampling to select another example in the same mini-batch according to the distribution. As examples with higher weights tend to have the correct label, they are favored in the sampling procedure. We then use Mixup to mix the original and sampled examples to regularize the model prediction between noisy training examples. Finally, we may compute another weight for the mixed example to scale the final loss. The impact of this second weighting strategy becomes more pronounced for high noise levels. 

Conceptually, the above steps implement a new robust loss, which turns out to be more resilient to noisy training labels. More discussion on this topic can be found in our paper. The animation below illustrates the four key steps in MentorMix, where StudentNet is the model to be trained on noisy labeled data. We employ a very simple version of MentorNet, as described by Jiang et al., to compute the weight for each example.
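
To make the four steps concrete, here is a framework-agnostic sketch of one MentorMix update on a mini-batch, assuming per-example losses from StudentNet are already available. The percentile thresholding rule, the hyperparameters, and the reuse of the first-pass weights in the final step are simplifications for illustration, not the exact formulation from the paper:

import numpy as np

def mentormix_step(x, y, losses, keep_fraction=0.75, alpha=0.4, rng=np.random):
    """One MentorMix step: weight, sample, mixup, and weight again.

    x: (batch, ...) inputs; y: (batch, num_classes) one-hot labels;
    losses: (batch,) per-example training losses from StudentNet.
    """
    batch = len(losses)

    # 1) Weight: a simple MentorNet-style rule that favors low-loss (likely clean) examples.
    threshold = np.percentile(losses, keep_fraction * 100)
    weights = (losses <= threshold).astype(float)
    probs = weights / weights.sum() if weights.sum() > 0 else np.full(batch, 1.0 / batch)

    # 2) Sample: importance-sample a mixing partner for each example from that distribution,
    #    so examples with higher weights (more likely correctly labeled) are favored.
    partner = rng.choice(batch, size=batch, p=probs)

    # 3) Mixup: convexly combine each example with its sampled partner.
    lam = rng.beta(alpha, alpha, size=batch).reshape(-1, *([1] * (x.ndim - 1)))
    x_mix = lam * x + (1 - lam) * x[partner]
    lam_y = lam.reshape(-1, 1)
    y_mix = lam_y * y + (1 - lam_y) * y[partner]

    # 4) Weight again: scale each mixed example's final loss. The full method recomputes
    #    this weight with MentorNet on the mixed loss; here we reuse the first-pass weights.
    return x_mix, y_mix, weights

In the full method, the mixed examples and their final weights feed a weighted loss that is backpropagated through StudentNet, which is what makes the overall objective more resilient to noisy labels.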

Illustration of four steps in the MentorMix method: weight, sample, mixup, and weight again.

Evaluation
We evaluate MentorMix on five datasets including CIFAR 10/100 with synthetic label noise, and WebVision 1.0, a large dataset of 2.2 million images with real-world noisy labels. MentorMix consistently yields improved results on the CIFAR 10/100 datasets and achieves the best published result on the WebVision dataset, improving the previous best method by a significant ~3% in terms of the top-1 classification accuracy on the ImageNet ILSVRC12 validation set.

Our model is trained only on the 2.2 million noisy WebVision training samples and tested on the ImageNet ILSVRC12 validation set. The baseline models reported are (Lee et al. 2018), (MentorNet 2018), and (Guo et al. 2018).

New Findings on Noisy Labels from the Web
This work represents the largest study to date into understanding deep neural networks trained on noisy labels. We propose three new findings on web label noise:

  • Deep neural networks generalize much better on web label noise

    While it is well known that deep neural networks generalize poorly on synthetic label noise, our results suggest that deep neural networks generalize much better on web label noise. For example, the classification accuracy of a network trained on the Stanford Cars dataset using the 60% web label noise level is 0.66, much higher than that for the same network trained at the same 60% level of synthetic noise, which achieves only 0.09. This pattern is consistent across our two datasets using both fine-tuning and training from scratch.

  • Deep neural networks may NOT learn patterns first when trained on web label noise

    Our common understanding is that deep neural networks learn patterns first — an interesting property in which DNNs are able to automatically capture generalizable “patterns” in the early training stage before memorizing noisy training labels. Because of this, early stopping is commonly used for training on noisy data. However, our results suggest deep neural networks may not learn patterns first when trained using datasets with web label noise, at least for the fine-grained classification task, suggesting that early stopping may not be effective on real-world label noise from the web.

  • ImageNet architectures generalize on noisy training labels when the networks are fine-tuned

    Kornblith et al. (2019) found that fine-tuning more advanced architectures trained on ImageNet tend to perform better on downstream tasks that have clean training labels. Our results extend this finding to noisy training data, showing that a better pre-trained architecture that exhibits better performance when pre-trained on ImageNet is likely to perform better even when it is fine-tuned on noisy training labels.

Summary
Based on our findings, we have the following practical recommendations for training deep neural networks on noisy data.

  1. A simple way to deal with noisy labels is to fine-tune a model that is pre-trained on clean datasets, like ImageNet. The better the pre-trained model is, the better it may generalize on downstream noisy training tasks (a minimal sketch follows this list).
  2. Early stopping may not be effective on the real-world label noise from the web.
  3. Methods that perform well on synthetic noise may not work as well on the real-world noisy labels from the web.
  4. The label noise from the web appears to be less harmful, yet it is more difficult for our current robust learning methods to tackle. This encourages more future research to be carried out on controlled real-world label noise.
  5. The proposed MentorMix can better overcome both synthetic and real-world noisy labels.
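
As a minimal illustration of the first recommendation (assuming PyTorch and torchvision; the architecture and hyperparameters are our own illustrative choices, not from the post), fine-tuning an ImageNet-pretrained network on a noisy downstream dataset looks like this:

import torch
import torch.nn as nn
from torchvision import models

num_classes = 100  # illustrative size of the noisy downstream label space

# Start from clean ImageNet weights and replace the classification head.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Fine-tune on the noisy dataset as usual; per the findings above, prefer model
# selection on a clean held-out set over early stopping for web label noise.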

The code for MentorMix is available on GitHub, and the datasets are on our Dataset Website.

Acknowledgements
This research was conducted by Lu Jiang, Di Huang, Mason Liu, and Weilong Yang. We’d like to thank Boqing Gong and Fei Sha for constructive feedback. Additional thanks go to Andrew Moore for his leadership in supporting our data labeling effort, along with Tomas Izo and Rahul Sukthankar for help in releasing the dataset.

Read More

2 Million Registered Developers, Countless Breakthroughs

2 Million Registered Developers, Countless Breakthroughs

Everyone has problems.

Whether they’re tackling challenges at the cutting edge of physics, trying to tame a worldwide pandemic, or sorting their child’s Lego collection, innovators join NVIDIA’s developer program to help them solve their most challenging problems.

With the number of registered NVIDIA developers having just hit 2 million, NVIDIA developers are pursuing more breakthroughs than ever.

Their ranks continue to grow by larger numbers every year. It took 13 years to reach 1 million registered developers, and less than two more to reach 2 million.

Most recently, teams at the U.S. National Institutes of Health, Scripps Research Institute and Oak Ridge National Laboratory have been among the NVIDIA developers at the forefront of efforts to combat COVID-19.

Every Country, Every Field

No surprise. Whether they’re software programmers, data scientists, or DevOps engineers, developers are problem solvers.

They write, debug and optimize code, often taking a set of software building blocks — frameworks, application programming interfaces and other tools — and putting them to work to do something new.

These developers include business and academic leaders from every region in the world.

In China, Alibaba and Baidu are among the most active GPU developers. In North America, those names include Microsoft, Amazon and Google. In Japan, it’s Sony, Hitachi and Panasonic. In Europe, they include Bosch, Daimler and Siemens.

All the top technical universities are represented, including Caltech, MIT, Oxford, Cambridge, Stanford, Tsinghua University, the University of Tokyo, and IIT campuses throughout India.

Look beyond the big names — there are too many to drop here — and you’ll find tens of thousands of entrepreneurs, hobbyists and enthusiasts.

Developers are signing up for our developer program to put NVIDIA accelerated computing tools to work across fields such as scientific and high performance computing, graphics and professional visualization, robotics, AI and data science, networking, and autonomous vehicles.

Developers are trained and equipped for success through our GTC conferences, online and in-person tutorials, our Deep Learning Institute training, and technical blogs. We provide them with software development kits such as CUDA, cuDNN, TensorRT and OptiX.

Registered developers account for 100,000 downloads a month, thousands participate each month in DLI training sessions, and thousands more engage in our online forums or attend conferences and webinars.

NVIDIA’s developer program, however, is just a piece of a much bigger developer story. There are now more than a billion CUDA GPUs in the world — each capable of running CUDA-accelerated software — giving developers, hackers and makers a vast installed base to work with.

As a result, CUDA, which is free to download and requires no registration, sees far more downloads than there are registered developers. On average, 39,000 developers sign up for memberships each month, and 438,000 download CUDA each month.

That’s an awful lot of problem solvers.

Breakthroughs in Science and Research

The ranks of those who depend on such problem solvers include the team who won the 2017 Nobel Prize in Chemistry — Jacques Dubochet, Joachim Frank and Richard Henderson — for their contribution to cryogenic electron microscopy.

They also include the team that won the 2017 Nobel Prize in Physics — Rainer Weiss, Barry Barish and Kip Thorne — for their work detecting gravitational waves.

More scientific breakthroughs are coming, as developers attack new HPC problems and, increasingly, deep learning.

William Tang, principal research physicist at the Princeton Plasma Physics Laboratory — one of the world’s foremost experts on fusion energy — leads a team using deep learning and HPC to advance the quest for cheap, clean energy.

Michael Kirk and Raphael Attie, scientists at NASA’s Goddard Space Flight Center and among the many active GPU developers at NASA, rely on Quadro RTX data science workstations to analyze the vast quantities of data streaming in from satellites monitoring the sun.

And at UC Berkeley, astrophysics Ph.D. student Gerry Zhang uses GPU-accelerated deep learning to analyze signals from space for signs of intelligent extraterrestrial civilizations.

Top Companies

Outside of research and academia, developers at the world’s top companies are tackling problems faced by every one of the world’s industries.

At Intuit, Chief Data Officer Ashok Srivastava leads a team using GPU-accelerated machine learning to help consumers with taxes and help small businesses through the financial effects of COVID-19.

At health insurer Anthem, Chief Digital Officer Rajeev Ronanki uses GPU-accelerated AI to help patients personalize and better understand their healthcare information.

Arne Stoschek, head of autonomous systems at Acubed, the Silicon Valley-based advanced products and partnerships outpost of Airbus Group, is developing self-piloted air taxis powered by GPU-accelerated AI.

New Problems, New Businesses: Entrepreneurs Swell Developer Ranks

Other developers — many supported by the NVIDIA Inception program — work at startups building businesses that solve new kinds of problems.

Looking to invest in a genuine pair of vintage Air Jordans? Michael Hall, director of data at GOAT Group, uses GPU-accelerated AI to help the startup connect sneaker enthusiasts with Air Jordans, Yeezys and a variety of old-school kicks that they can be confident are authentic.

Don’t know what to wear? Brad Klingenberg, chief algorithms officer at fashion ecommerce startup Stitch Fix, leads a team that uses GPU-accelerated AI to help us all dress better.

And Benjamin Schmidt, at Roadbotics, offers what might be the ultimate case study in how developers are solving concrete problems: his startup helps cities find and fix potholes.

Entrepreneurs are also supported by NVIDIA’s Inception program, which includes more than 6,000 startups in industries ranging from agriculture to healthcare to logistics to manufacturing.

Of course, just because something’s a problem, doesn’t mean you can’t love solving it.

Love beer? Eric Boucher, a home brewing enthusiast, is using AI to invent new kinds of suds.

Love a critter-free lawn? Robert Bond has trained a system that can detect cats and gently shoo them from his grass by turning on his sprinklers to the amazement and delight of his grandchildren.

Francisco “Paco” Garcia has even trained an AI to help sort out his children’s Lego pile.

Most telling: stories from developers working at the cutting edge of the arts.

Pierre Barreau has created an AI, named AIVA, which uses mathematical models based on the work of great composers to create new music.

And Raiders of the Lost Art, a collaboration between Anthony Bourached and George Cann, a pair of Ph.D. candidates at University College London, has used neural style transfer techniques to tease out hidden artwork in a Leonardo da Vinci painting.

Wherever you go, follow the computing power and you’ll find developers delivering breakthroughs.

How big is the opportunity for problem solvers like these? However many problems there are in the world.

Want more stories like these? No problem. Over the months to come, we’ll be bringing as many to you as we can. 

The post 2 Million Registered Developers, Countless Breakthroughs appeared first on The Official NVIDIA Blog.

Read More