PyTorch 1.6 now includes Stochastic Weight Averaging

Do you use stochastic gradient descent (SGD) or Adam? Regardless of the procedure you use to train your neural network, you can likely achieve significantly better generalization at virtually no additional cost with a simple new technique now natively supported in PyTorch 1.6, Stochastic Weight Averaging (SWA) [1]. Even if you have already trained your model, it’s easy to realize the benefits of SWA by running SWA for a small number of epochs starting with a pre-trained model. Again and again, researchers are discovering that SWA improves the performance of well-tuned models in a wide array of practical applications with little cost or effort!

SWA has a wide range of applications and features:

  • SWA significantly improves performance compared to standard training techniques in computer vision (e.g., VGG, ResNets, Wide ResNets and DenseNets on ImageNet and CIFAR benchmarks [1, 2]).
  • SWA provides state-of-the-art performance on key benchmarks in semi-supervised learning and domain adaptation [2].
  • SWA was shown to improve performance in language modeling (e.g., AWD-LSTM on WikiText-2 [4]) and policy-gradient methods in deep reinforcement learning [3].
  • SWAG, an extension of SWA, can approximate Bayesian model averaging in Bayesian deep learning and achieves state-of-the-art uncertainty calibration results in various settings. Moreover, its recent generalization MultiSWAG provides significant additional performance gains and mitigates double-descent [4, 10]. Another approach, Subspace Inference, approximates the Bayesian posterior in a small subspace of the parameter space around the SWA solution [5].
  • SWA for low precision training, SWALP, can match the performance of full-precision SGD training, even with all numbers quantized down to 8 bits, including gradient accumulators [6].
  • SWA in parallel, SWAP, was shown to greatly speed up the training of neural networks by using large batch sizes and, in particular, set a record by training a neural network to 94% accuracy on CIFAR-10 in 27 seconds [11].

Figure 1. Illustrations of SWA and SGD with a Preactivation ResNet-164 on CIFAR-100 [1]. Left: test error surface for three FGE samples and the corresponding SWA solution (averaging in weight space). Middle and Right: test error and train loss surfaces showing the weights proposed by SGD (at convergence) and SWA, starting from the same initialization of SGD after 125 training epochs. Please see [1] for details on how these figures were constructed.

In short, SWA performs an equal average of the weights traversed by SGD (or any stochastic optimizer) with a modified learning rate schedule (see the left panel of Figure 1.). SWA solutions end up in the center of a wide flat region of loss, while SGD tends to converge to the boundary of the low-loss region, making it susceptible to the shift between train and test error surfaces (see the middle and right panels of Figure 1). We emphasize that SWA can be used with any optimizer, such as Adam, and is not specific to SGD.

Previously, SWA was in PyTorch contrib. In PyTorch 1.6, we provide a new convenient implementation of SWA in torch.optim.swa_utils.

Is this just Averaged SGD?

At a high level, averaging SGD iterates dates back several decades in convex optimization [7, 8], where it is sometimes referred to as Polyak-Ruppert averaging, or averaged SGD. But the details matter. Averaged SGD is often used in conjunction with a decaying learning rate and an exponential moving average (EMA), typically for convex optimization. In convex optimization, the focus has been on improved rates of convergence. In deep learning, this form of averaged SGD smooths the trajectory of SGD iterates but does not perform very differently from standard SGD training.

By contrast, SWA uses an equal average of SGD iterates with a modified cyclical or high constant learning rate and exploits the flatness of training objectives [8] specific to deep learning for improved generalization.

How does Stochastic Weight Averaging Work?

There are two important ingredients that make SWA work. First, SWA uses a modified learning rate schedule so that SGD (or other optimizers such as Adam) continues to bounce around the optimum and explore diverse models instead of simply converging to a single solution. For example, we can use the standard decaying learning rate strategy for the first 75% of training time and then set the learning rate to a reasonably high constant value for the remaining 25% of the time (see Figure 2 below). The second ingredient is to take an average of the weights (typically an equal average) of the networks traversed by SGD. For example, we can maintain a running average of the weights obtained at the end of every epoch within the last 25% of training time (see Figure 2). After training is complete, we then set the weights of the network to the computed SWA averages.
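The second ingredient can be maintained incrementally: each time a new set of weights is collected, it is folded into the running equal average. The small sketch below is illustrative only (the helper name is ours, not part of the PyTorch API), but it mirrors the default averaging rule used by the AveragedModel utility described later.

def update_swa_average(swa_param, new_param, num_averaged):
    # Running equal average: swa <- swa + (new - swa) / (n + 1).
    # After folding in N sets of weights, this equals their plain mean.
    return swa_param + (new_param - swa_param) / (num_averaged + 1)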

Figure 2. Illustration of the learning rate schedule adopted by SWA. Standard decaying schedule is used for the first 75% of the training and then a high constant value is used for the remaining 25%. The SWA averages are formed during the last 25% of training.

One important detail concerns batch normalization. Batch normalization layers compute running statistics of activations during training, but the SWA averages of the weights are never used to make predictions during training, so the batch normalization layers do not have the activation statistics of the averaged model at the end of training. We can compute these statistics by doing a single forward pass over the training data with the SWA model.

While we focus on SGD for simplicity in the description above, SWA can be combined with any optimizer. You can also use cyclical learning rates instead of a high constant value (see e.g., [2]).

How to use SWA in PyTorch?

In torch.optim.swa_utils we implement all the SWA ingredients to make it convenient to use SWA with any model. In particular, we implement the AveragedModel class for SWA models, the SWALR learning rate scheduler, and the update_bn utility function to update the SWA batch normalization statistics at the end of training.

In the example below, swa_model is the SWA model that accumulates the averages of the weights. We train the model for a total of 300 epochs, and we switch to the SWA learning rate schedule and start to collect SWA averages of the parameters at epoch 160.

from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

loader, optimizer, model, loss_fn = ...
swa_model = AveragedModel(model)
scheduler = CosineAnnealingLR(optimizer, T_max=300)
swa_start = 160
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

for epoch in range(300):
    for input, target in loader:
        optimizer.zero_grad()
        loss_fn(model(input), target).backward()
        optimizer.step()
    if epoch > swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
# Use swa_model to make predictions on test data 
preds = swa_model(test_input)

Next, we explain each component of torch.optim.swa_utils in detail.

AveragedModel class serves to compute the weights of the SWA model. You can create an averaged model by running swa_model = AveragedModel(model). You can then update the parameters of the averaged model by swa_model.update_parameters(model). By default, AveragedModel computes a running equal average of the parameters that you provide, but you can also use custom averaging functions with the avg_fn parameter. In the following example, ema_model computes an exponential moving average.

ema_avg = lambda averaged_model_parameter, model_parameter, num_averaged: \
    0.1 * averaged_model_parameter + 0.9 * model_parameter
ema_model = torch.optim.swa_utils.AveragedModel(model, avg_fn=ema_avg)

In practice, we find an equal average with the modified learning rate schedule in Figure 2 provides the best performance.

SWALR is a learning rate scheduler that anneals the learning rate to a fixed value, and then keeps it constant. For example, the following code creates a scheduler that linearly anneals the learning rate from its initial value to 0.05 in 5 epochs within each parameter group.

swa_scheduler = torch.optim.swa_utils.SWALR(
    optimizer, anneal_strategy="linear", anneal_epochs=5, swa_lr=0.05)

We also implement cosine annealing to a fixed value (anneal_strategy="cos"). In practice, we typically switch to SWALR at epoch swa_start (e.g. after 75% of the training epochs), and simultaneously start to compute the running averages of the weights:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
swa_start = 75
for epoch in range(100):
    # <train epoch>
    if epoch > swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

Finally, update_bn is a utility function that computes the batchnorm statistics for the SWA model on a given dataloader loader:

torch.optim.swa_utils.update_bn(loader, swa_model) 

update_bn applies the swa_model to every element in the dataloader and computes the activation statistics for each batch normalization layer in the model.

Once you have computed the SWA averages and updated the batch normalization layers, you can apply swa_model to make predictions on test data.

Why does it work?

Neural network loss surfaces contain large flat regions [9]. In Figure 3 below, we show a visualization of the loss surface in a subspace of the parameter space containing a path connecting two independently trained SGD solutions, such that the loss is similarly low at every point along the path. SGD converges near the boundary of these regions because there isn’t much gradient signal to move inside, as the points in the region all have similarly low loss. By increasing the learning rate, SWA keeps SGD bouncing around this flat region, and then, by averaging the iterates, moves towards the center of the flat region.

Figure 3: visualization of mode connectivity for ResNet-20 with no skip connections on CIFAR-10 dataset. The visualization is created in collaboration with Javier Ideami (https://losslandscape.com/). For more details, see this blogpost.

We expect solutions that are centered in the flat region of the loss to generalize better than those near the boundary. Indeed, train and test error surfaces are not perfectly aligned in the weight space. Solutions that are centered in the flat region are not as susceptible to the shifts between train and test error surfaces as those near the boundary. In Figure 4 below, we show the train loss and test error surfaces along the direction connecting the SWA and SGD solutions. As you can see, while the SWA solution has a higher train loss compared to the SGD solution, it is centered in a region of low loss and has a substantially better test error.

Figure 4. Train loss and test error along the line connecting the SWA solution (circle) and SGD solution (square). The SWA solution is centered in a wide region of low train loss, while the SGD solution lies near the boundary. Because of the shift between train loss and test error surfaces, the SWA solution leads to much better generalization.

What are the results achieved with SWA?

We release a GitHub repo with examples using the PyTorch implementation of SWA for training DNNs. These examples can be used to achieve the following results on CIFAR-100:

                  VGG-16       ResNet-164   WideResNet-28×10
Regular Training  72.8 ± 0.3   78.4 ± 0.3   81.0 ± 0.3
SWA               74.4 ± 0.3   79.8 ± 0.4   82.5 ± 0.2

Semi-Supervised Learning

In a follow-up paper SWA was applied to semi-supervised learning, where it improved the best reported results in multiple settings [2]. For example, with SWA you can get 95% accuracy on CIFAR-10 if you only have the training labels for 4k training data points (the previous best reported result on this problem was 93.7%). This paper also explores averaging multiple times within epochs, which can accelerate convergence and find still flatter solutions in a given time.

Figure 5. Performance of fast-SWA on semi-supervised learning with CIFAR-10. fast-SWA achieves record results in every setting considered.

Reinforcement Learning

In another follow-up paper SWA was shown to improve the performance of policy gradient methods A2C and DDPG on several Atari games and MuJoCo environments [3]. This application is also an instance of where SWA is used with Adam. Recall that SWA is not specific to SGD and can benefit essentially any optimizer.

Environment Name A2C A2C + SWA
Breakout 522 ± 34 703 ± 60
Qbert 18777 ± 778 21272 ± 655
SpaceInvaders 7727 ± 1121 21676 ± 8897
Seaquest 1779 ± 4 1795 ± 4
BeamRider 9999 ± 402 11321 ± 1065
CrazyClimber 147030 ± 10239 139752 ± 11618

Low Precision Training

We can filter through quantization noise by combining weights that have been rounded down with weights that have been rounded up. Moreover, by averaging weights to find a flat region of the loss surface, large perturbations of the weights will not affect the quality of the solution (Figures 9 and 10). Recent work shows that by adapting SWA to the low precision setting, in a method called SWALP, one can match the performance of full-precision SGD even with all training in 8 bits [6]. This is a practically important result, given that (1) SGD training in 8 bits performs notably worse than full precision SGD, and (2) low precision training is significantly harder than making predictions in low precision after training (the usual setting). For example, a ResNet-164 trained on CIFAR-100 with float (16-bit) SGD achieves 22.2% error, while 8-bit SGD achieves 24.0% error. By contrast, SWALP with 8-bit training achieves 21.8% error.

Figure 9. Quantizing a solution leads to a perturbation of the weights which has a greater effect on the quality of the sharp solution (left) compared to wide solution (right).

Figure 10. The difference between standard low precision training and SWALP.

Another work, SQWA, presents an approach for quantization and fine-tuning of neural networks in low precision [12]. In particular, SQWA achieved state-of-the-art results for DNNs quantized to 2 bits on CIFAR-100 and ImageNet.

Calibration and Uncertainty Estimates

By finding a solution centered in a flat region of the loss, SWA can also improve calibration and uncertainty representation. Indeed, SWA can be viewed as an approximation to an ensemble, resembling a Bayesian model average, but with a single model [1].

SWA can be viewed as taking the first moment of SGD iterates with a modified learning rate schedule. We can directly generalize SWA by also taking the second moment of iterates to form a Gaussian approximate posterior over the weights, further characterizing the loss geometry with SGD iterates. This approach, SWA-Gaussian (SWAG), is a simple, scalable and convenient approach to uncertainty estimation and calibration in Bayesian deep learning [4]. The SWAG distribution approximates the shape of the true posterior: Figure 6 below shows the SWAG distribution and the posterior log-density for ResNet-20 on CIFAR-10.
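As a rough illustration of this idea (a simplified, diagonal-only sketch with names of our own choosing, not the official SWAG implementation linked below), one can track running first and second moments of the SGD iterates and then sample weights from the resulting Gaussian:

import torch

class DiagonalSWAG:
    def __init__(self, params):
        # Running estimates of E[w] and E[w^2] for each parameter tensor.
        self.mean = [torch.zeros_like(p) for p in params]
        self.sq_mean = [torch.zeros_like(p) for p in params]
        self.n = 0

    def collect(self, params):
        # Fold the current iterate into the running first and second moments.
        self.n += 1
        for mean, sq_mean, p in zip(self.mean, self.sq_mean, params):
            mean += (p.detach() - mean) / self.n
            sq_mean += (p.detach() ** 2 - sq_mean) / self.n

    def sample(self):
        # Draw one weight sample from N(mean, diag(E[w^2] - E[w]^2)).
        return [m + (sq - m ** 2).clamp(min=0.0).sqrt() * torch.randn_like(m)
                for m, sq in zip(self.mean, self.sq_mean)]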

Figure 6. SWAG posterior approximation and the loss surface for a ResNet-20 without skip-connections trained on CIFAR-10 in the subspace formed by the two largest eigenvalues of the SWAG covariance matrix. The shape of SWAG distribution is aligned with the posterior: the peaks of the two distributions coincide, and both distributions are wider in one direction than in the orthogonal direction. Visualization created in collaboration with Javier Ideami.

Empirically, SWAG performs on par or better than popular alternatives including MC dropout, KFAC Laplace, and temperature scaling on uncertainty quantification, out-of-distribution detection, calibration and transfer learning in computer vision tasks. Code for SWAG is available here.

Figure 7. MultiSWAG generalizes SWAG and deep ensembles, to perform Bayesian model averaging over multiple basins of attraction, leading to significantly improved performance. By contrast, as shown here, deep ensembles select different modes, while standard variational inference (VI) marginalizes (model averages) within a single basin.

MultiSWAG [10] uses multiple independent SWAG models to form a mixture of Gaussians as an approximate posterior distribution. Different basins of attraction contain highly complementary explanations of the data. Accordingly, marginalizing over these multiple basins provides a significant boost in accuracy and uncertainty representation. MultiSWAG can be viewed as a generalization of deep ensembles, but with performance improvements.

Indeed, we see in Figure 8 that MultiSWAG entirely mitigates double descent – more flexible models have monotonically improving performance – and provides significantly improved generalization over SGD. For example, when the ResNet-18 has layers of width 20, Multi-SWAG achieves under 30% error whereas SGD achieves over 45%, more than a 15% gap!

Figure 8. SGD, SWAG, and Multi-SWAG on CIFAR-100 for a ResNet-18 with varying widths. We see Multi-SWAG in particular mitigates double descent and provides significant accuracy improvements over SGD.

Reference [10] also considers Multi-SWA, which uses multiple independently trained SWA solutions in an ensemble, providing performance improvements over deep ensembles without any additional computational cost. Code for MultiSWA and MultiSWAG is available here.

Another method, Subspace Inference, constructs a low-dimensional subspace around the SWA solution and marginalizes the weights in this subspace to approximate the Bayesian model average [5]. Subspace Inference uses the statistics from the SGD iterates to construct both the SWA solution and the subspace. The method achieves strong performance in terms of prediction accuracy and uncertainty calibration both in classification and regression problems. Code is available here.

Try it Out!

One of the greatest open questions in deep learning is why SGD manages to find good solutions, given that the training objectives are highly multimodal, and there are many settings of parameters that achieve no training loss but poor generalization. By understanding geometric features such as flatness, which relate to generalization, we can begin to resolve these questions and build optimizers that provide even better generalization, and many other useful features, such as uncertainty representation. We have presented SWA, a simple drop-in replacement for standard optimizers such as SGD and Adam, which can, in principle, benefit anyone training a deep neural network. SWA has been demonstrated to have strong performance in several areas, including computer vision, semi-supervised learning, reinforcement learning, uncertainty representation, calibration, Bayesian model averaging, and low precision training.

We encourage you to try out SWA! SWA is now as easy as any standard training in PyTorch. And even if you have already trained your model, you can use SWA to significantly improve performance by running it for a small number of epochs from a pre-trained model.

[1] Averaging Weights Leads to Wider Optima and Better Generalization; Pavel Izmailov, Dmitry Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson; Uncertainty in Artificial Intelligence (UAI), 2018.

[2] There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average; Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson; International Conference on Learning Representations (ICLR), 2019.

[3] Improving Stability in Deep Reinforcement Learning with Weight Averaging; Evgenii Nikishin, Pavel Izmailov, Ben Athiwaratkun, Dmitrii Podoprikhin, Timur Garipov, Pavel Shvechikov, Dmitry Vetrov, Andrew Gordon Wilson; UAI 2018 Workshop: Uncertainty in Deep Learning, 2018.

[4] A Simple Baseline for Bayesian Uncertainty in Deep Learning; Wesley Maddox, Timur Garipov, Pavel Izmailov, Andrew Gordon Wilson; Neural Information Processing Systems (NeurIPS), 2019.

[5] Subspace Inference for Bayesian Deep Learning; Pavel Izmailov, Wesley Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson; Uncertainty in Artificial Intelligence (UAI), 2019.

[6] SWALP: Stochastic Weight Averaging in Low Precision Training; Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Christopher De Sa; International Conference on Machine Learning (ICML), 2019.

[7] Efficient Estimations from a Slowly Convergent Robbins-Monro Process; David Ruppert; Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

[8] Acceleration of Stochastic Approximation by Averaging; Boris T. Polyak, Anatoli B. Juditsky; SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[9] Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs; Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, Andrew Gordon Wilson; Neural Information Processing Systems (NeurIPS), 2018.

[10] Bayesian Deep Learning and a Probabilistic Perspective of Generalization; Andrew Gordon Wilson, Pavel Izmailov; arXiv preprint, 2020.

[11] Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well; Vipul Gupta, Santiago Akle Serrano, Dennis DeCoste; International Conference on Learning Representations (ICLR), 2019.

[12] SQWA: Stochastic Quantized Weight Averaging for Improving the Generalization Capability of Low-Precision Deep Neural Networks; Sungho Shin, Yoonho Boo, Wonyong Sung; arXiv preprint, 2020.

Meet the Maker: YouTuber Insists It’s Easier Than You Think to Make Something Super Using AI

Alex Schepelmann went from being a teacher’s assistant for an Intro to Programming class to educating 40,000 YouTube subscribers by championing the mantra: anyone can make something super using AI and machine learning.

His YouTube channel, Super Make Something, posts two types of videos. “Basics” videos provide in-depth explanations of technologies and their methods, using fun, understandable lingo. “Project” videos let viewers follow along with instructions for creating a product.

About the Maker

Schepelmann got a B.S. and M.S. in mechanical engineering from Case Western Reserve University and a Ph.D. in robotics from Carnegie Mellon University. His master’s thesis focused on employing computer vision to identify grass and obstacles in a camera stream, and he was part of a team that created an award-winning autonomous lawnmower.

Now, he’s a technical fellow for an engineering consulting firm and an aerospace contractor supporting various robotics projects in partnership with NASA. In his free time, he creates content for his channel, based out of his home in Cleveland.

His Inspiration

In his undergrad years, Schepelmann saw how classmates found the introductory programming class hard because the assignments didn’t relate to their everyday lives. So, when he got to teach the class as a grad student, he implemented fun projects, like coding a Tamagotchi digital pet.

His aim was to help students realize that choosing topics they’re interested in can make learning easy and enjoyable. Schepelmann later heard from one of his students, an art history major, that his class had inspired her to add a computer science minor to her degree.

“Since then, I’ve thought it was great to introduce these topics to people who might never have considered them or felt that they were too hard,” he said. “I want to show people that AI can be really fun and easy to learn. With YouTube, it’s now possible to reach an audience of any background or age range on a large scale.”

Schepelmann’s YouTube channel started as a hobby during his years at Carnegie Mellon. It’s grown to reach 2.1 million total views on videos explaining 3D printing, robotics and machine learning, including how to use the NVIDIA Jetson platform to train AI models.

His Favorite Jetson Projects

“It’s super, super easy to use the NVIDIA Jetson products,” said Schepelmann. “It’s a great machine learning platform and an inexpensive way for people to learn AI and experiment with computationally intensive applications.”

To show viewers exactly how, he’s created two Jetson-based tutorials:

Machine Learning 101: Intro to Neural Networks – Schepelmann dives into what neural networks are and walks through how to set up the NVIDIA Jetson Nano developer kit to train a neural network model from scratch.

Machine Learning 101: Naive Bayes Classifier – Schepelmann explains how the probabilistic classifier can be used for image processing and speech recognition applications, using the NVIDIA Jetson Xavier NX developer kit to demonstrate.

The creator has released the full code used in both tutorials on his GitHub site for anyone to explore.

Where to Learn More 

To make something super with Super Make Something, visit Schepelmann’s YouTube channel.

Discover tools, inspiration and three easy steps to help kickstart your project with AI on our “Get AI, Learn AI, Build AI” page.

Amazon Personalize can now create up to 50% better recommendations for fast changing catalogs of new products and fresh content

Amazon Personalize now makes it easier to create personalized recommendations for fast-changing catalogs of books, movies, music, news articles, and more, improving recommendations by up to 50% (measured by click-through rate) with just a few clicks in the AWS console. Without needing to change any application code, Amazon Personalize enables customers to include completely new products and fresh content in their usual recommendations, so that the best new products and content are discovered, clicked, purchased, or consumed by end users an order of magnitude more quickly than with other recommendation systems.

Many catalogs are fast moving, with new products and fresh content added continuously, and it is crucial for businesses to help their users discover and engage with these products or content. For example, users on a news website expect to see the latest personalized news, and users consuming media via video-on-demand services expect to be recommended the latest series and episodes they might like. Meeting these expectations by showcasing new products and content to users helps keep the user experience fresh and aids in sales, either through direct conversion or through subscriber conversion and retention. However, there are usually far too many new products in fast-moving catalogs to make it feasible to showcase each of them to every user. It makes much more sense to personalize the user experience by matching these new products with users based on their interests and preferences. Personalization of new products is inherently hard due to the absence of data about past views, clicks, purchases, and subscriptions for these products. In such a scenario, most recommender systems only make recommendations for products they have sufficient past data about, and ignore the products that are new to the catalog.

With today’s launch, Amazon Personalize can help customers create personalized recommendations for new products and fresh content for their users in a matter of a few clicks. Amazon Personalize does this by recommending new products to users who have positively engaged (clicked, purchased, etc.) with similar products in the past. If users positively engage with the recommended new products, Personalize further recommends them to more users with similar interests. At Amazon, this capability has been in use for many years for creating product recommendations, and has resulted in 21% higher conversions compared to recommendations that do not include new products. This capability is now available in Amazon Personalize at no additional cost as part of its existing deep learning based algorithms that have been perfected over years of development and use at Amazon. It’s a win-win situation for customers, as they can benefit from this new capability at no extra cost, without losing out on the highly relevant recommendations that they already create through Amazon Personalize.

Amazon Personalize makes it easy for customers to develop applications with a wide array of personalization use cases, including real time product recommendations and customized direct marketing. Amazon Personalize brings the same machine learning technology used by Amazon.com to everyone for use in their applications – with no machine learning experience required. Amazon Personalize customers pay for what they use, with no minimum fees or upfront commitment. You can start using Amazon Personalize with a simple three step process, which only takes a few clicks in the AWS console, or a set of simple API calls. First, point Amazon Personalize to user data, catalog data, and activity stream of views, clicks, purchases, etc. in Amazon S3 or upload using a simple API call. Second, with a single click in the console or an API call, train a custom private recommendation model for your data (CreateSolution). Third, retrieve personalized recommendations for any user by creating a campaign, and using the GetRecommendations API.
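For readers who prefer the APIs, the following is a minimal boto3 sketch of the same three steps; the names and ARNs are placeholders, and in practice you would wait for each resource to become ACTIVE before moving on.

import boto3

personalize = boto3.client("personalize")
personalize_runtime = boto3.client("personalize-runtime")

# Step 1 (dataset group, datasets, and import jobs) is assumed to be done already.

# Step 2: train a custom model (solution version) on the imported data.
solution = personalize.create_solution(
    name="demo-user-personalization",
    datasetGroupArn="<your dataset group arn>",
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)
solution_version = personalize.create_solution_version(solutionArn=solution["solutionArn"])

# Step 3: deploy a campaign and retrieve personalized recommendations for a user.
campaign = personalize.create_campaign(
    name="demo-campaign",
    solutionVersionArn=solution_version["solutionVersionArn"],
    minProvisionedTPS=1,
)
recommendations = personalize_runtime.get_recommendations(
    campaignArn=campaign["campaignArn"], userId="1"
)
print([item["itemId"] for item in recommendations["itemList"]])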

The rest of this post walks you through this process in greater detail and discusses the recommended best practices.

Adding your data to Personalize

For this post, we create a dataset group with an interaction dataset and item dataset (item metadata). For instructions on creating a dataset group, see Getting Started (Console).

Creating an interaction dataset

To create an interaction dataset, use the following schema and import the file bandits-demo-interactions.csv, which is a synthetic movie rating dataset:

{
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",
            "type": "string"
        },
        {
            "name": "EVENT_VALUE",
            "type": ["null","float"]
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "IMPRESSION",
            "type": "string"
        }
    ],
    "version": "1.0"
}

You can now optionally add impression information to Amazon Personalize. Impressions are the list of items that were visible to the user when they interacted with a particular item. The following screenshot shows some interactions with impression data.

The impression is represented as an ordered list of item IDs that are pipe separated. The first row of the data in the preceding screenshot shows that when user_id 1 rated item_id 1270, they had items 1270, 1...9 in that order visible in the UX. The contrast between which items were recommended to the user and which they interacted with helps us generate better recommendations.

Amazon Personalize has two modes to input impression information:

  • Explicit impressions – Impressions that you manually record and send to Personalize. The preceding example pertains to explicit impressions.
  • Implicit impressions – The lists of recommended items that users receive from Amazon Personalize, which you can reference through the RecommendationID rather than sending the item IDs explicitly

Amazon Personalize now returns a RecommendationID for each set of recommendations from the service. If you do not change the order or content of the recommendations when generating your user experience, you can reference the impression through the RecommendationID without needing to send a list of ItemIDs (explicit impressions). If you provide both explicit and implicit impressions for an interaction, the explicit impression takes precedence. You can also send both implicit and explicit impressions via the PutEvents API. Please see our documentation for more details.
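As a hypothetical sketch of how such impressions can be sent, the following uses the PutEvents API with an explicit impression list; the tracking ID, session ID, and item IDs are placeholders, and a recommendationId could be passed instead to rely on implicit impressions.

import time
import boto3

personalize_events = boto3.client("personalize-events")

personalize_events.put_events(
    trackingId="<your event tracker tracking id>",
    userId="1",
    sessionId="session-1",
    eventList=[{
        "eventType": "RATING",
        "sentAt": int(time.time()),
        "itemId": "1270",
        # Explicit impression: the ordered list of items shown to the user.
        "impression": ["1270", "1", "9"],
        # Alternatively, pass the recommendationId returned by GetRecommendations:
        # "recommendationId": "<recommendation id>",
    }],
)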

Creating an item dataset

You follow similar steps to create an item dataset and import your data using bandits-demo-items.csv, which has metadata for each movie. We use the optional reserved keyword CREATION_TIMESTAMP for the item dataset, which helps Amazon Personalize compute the age of the item and adjust recommendations accordingly. When using your own data to train a model, provide the timestamp when the item was first available to your users in this field. We infer the age of an item from the reference point of the latest interaction timestamp in your dataset.

If you don’t provide the CREATION_TIMESTAMP, the model infers this information from the interaction dataset and uses the timestamp of the item’s earliest interaction as its corresponding release date. If an item doesn’t have an interaction, its release date is set as the timestamp of the latest interaction in the training set and it is considered a new item with age 0.

Our dataset for this post has 1,931 movies, of which 191 have a creation timestamp marked as the latest timestamp in the interaction dataset. These newest 191 items are considered cold items and have a label number higher than 1800 in the dataset. The schema of the item dataset is as follows:

{
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "GENRES",
            "type": ["null","string"],
            "categorical": true
        },
        {
            "name": "TITLE",
            "type": "string"
        },
        {
            "name": "CREATION_TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

Training a model

After the dataset import jobs are complete, you’re ready to train your model.

  1. On the Solutions tab, choose Create solution.
  2. Choose the new aws-user-personalization recipe.

This new recipe effectively combines deep learning models (RNNs) with bandits to provide more accurate user modeling (high relevance) and effective exploration.

  3. Leave the Solution configuration section at its default values, and continue to the Create solution version page.
  4. On the Create solution version page, choose Finish to start training.

When the training is complete, you can navigate to the Solution Version Overview page to see the offline metrics. In certain situations, you might see a slight drop in accuracy metrics (such as mrr or precision@k) and in coverage compared to models trained on the HRNN-Metadata recipe. This is because recommendations made by the new aws-user-personalization recipe aren’t solely based on exploitation, and the recipe may sacrifice short-term interest for long-term reward. The offline metrics are computed using the default values of the parameters (explorationWeight, explorationItemAgeCutoff) that impact item exploration. You can find more details on these in the following section.

After several rounds of retraining, you should see the accuracy metrics and item coverage increase, and the new aws-user-personalization recipe should outperform the exploitation-based HRNN-Metadata recipe.

Creating a campaign

In Amazon Personalize, you use a campaign to make recommendations for your users. In this step, you create two campaigns using the solution you created in the previous step and demonstrate the impact of different amounts of exploration.

To create a new campaign, complete the following steps:

  1. On the Campaigns tab, choose Create Campaign.
  2. For Campaign name, enter a name.
  3. For Solution, choose user-personalization-solution.
  4. For Solution version ID, choose the solution version that uses the aws-user-personalization recipe.

You now have the option of setting additional configuration for the campaign, which allows you to adjust the exploration Amazon Personalize does for the item recommendations and therefore adjust the results. These settings are only available if you’re creating a campaign whose solution version uses the user-personalization recipe. The configuration options are as follows (see the code sketch after this list):

  • explorationWeight – Higher values for explorationWeight signify higher exploration; new items with low impressions are more likely to be recommended. A value of 0 signifies that there is no exploration and results are ranked according to relevance. You can set this parameter in a range of [0,1] and its default value is 0.3.
  • explorationItemAgeCutoff – This is the maximum duration in days relative to the latest interaction(event) timestamp in the training data. For example, if you set explorationItemAgeCutoff to 7, the items with an age over or equal to 7 days aren’t considered cold items and there is no exploration on these items. You may still see some items older than or equal to 7 days in the recommendation list because they’re relevant to the user’s interests and are of good quality even without the help of the exploration. The default value for this parameter is 30, and you can set it to any value over 0.
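The two configuration options above can also be set programmatically when creating the campaign. The sketch below uses placeholder ARNs; note that the itemExplorationConfig values are passed as strings.

import boto3

personalize = boto3.client("personalize")

personalize.create_campaign(
    name="user-personalization-explore",
    solutionVersionArn="<your aws-user-personalization solution version arn>",
    minProvisionedTPS=1,
    campaignConfig={
        "itemExplorationConfig": {
            "explorationWeight": "0.3",        # 0 = no exploration, 1 = maximum exploration
            "explorationItemAgeCutoff": "30",  # only items newer than 30 days are explored
        }
    },
)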

To demonstrate the effect of exploration, we create two campaigns.

  1. For the first campaign, set Exploration weight to 0.
  2. Leave Exploration item age cut off at its default of 30.0.
  3. Choose Create campaign.

Repeat the preceding steps to create a second campaign, but give it a different name and change the exploration weight to 1.

Getting recommendations

After you create or update your campaign, you can get recommended items for a user, similar items for an item, or a reranked list of input items for a user.

  1. On the Campaigns detail page, enter the user ID for your user personalization campaign.

The following screenshot shows the campaign detail page with results from a GetRecommendations call, which include the recommended items and the recommendation ID; you can use the recommendation ID as an implicit impression, and the service interprets it as such during training.

  1. Enter a user ID that has interactions in the interactions dataset. For this post, we get recommendations for user ID 1.
  2. On the campaign detail page of the campaign that has an exploration weight of 0, choose the Detail
  3. For User ID, enter 1.
  4. Choose Get recommendations.

The following image is for the campaign with an exploration weight of 0; we can see that the recommended items are older items that users have already seen or rated.

The next image shows recommendation results for the same user but for a campaign where we set the exploration weight to 1. This results in a higher proportion of movies that were recently added and that few users have rated being recommended. Furthermore, the trade-off between the relevance (exploitation) and exploration is adjusted automatically depending on the coldness of the new items and as new feedback from users is leveraged.

Retraining and updating campaigns

New interactions against explored items hold important feedback on the quality of the item, which you can use to update exploration on the items. We recommend updating the model hourly to adjust the future item exploration.

To update a model (solutionVersion), you can call the createSolutionVersion API with trainingMode set to UPDATE. This updates the model with the latest item information and adjusts the exploration according to implicit feedback from the users. This is not equivalent to fully retraining the model, which you can do by setting trainingMode to FULL. You should perform full training less frequently, typically once every 1–5 days. When the new updated solutionVersion is created, you can update the campaign to get recommendations using it.

The following code walks you through these steps:

#Updating the solutionVersion (model) and Campaign

import time

import boto3

# Client for the Amazon Personalize control-plane APIs
personalize = boto3.client("personalize")

def wait_for_solution_version(solution_version_arn):
    status = None
    max_time = time.time() + 60*60 # 1 hour
    while time.time() < max_time:
        describe_solution_version_response = personalize.describe_solution_version(
            solutionVersionArn = solution_version_arn
        )
        status = describe_solution_version_response["solutionVersion"]["status"]
        print("SolutionVersion: {}".format(status))

        if status == "ACTIVE" or status == "CREATE FAILED":
            break
        time.sleep(60) 
        
def update_campaign(solution_arn, campaign_arn):
    create_solution_version_response = personalize.create_solution_version(
        solutionArn = solution_arn, 
        trainingMode='UPDATE')
    new_solution_version_arn = create_solution_version_response['solutionVersionArn']
    print("Creating solution version: {}".format(new_solution_version_arn))
    wait_for_solution_version(new_solution_version_arn)
    personalize.update_campaign(campaignArn=campaign_arn, solutionVersionArn=new_solution_version_arn)
    print("Updating campaign...")

# Update the campaign every hour
while True:
    dt = time.time() + 60*60
    try:
        solution_arn = "<your solution arn>"
        campaign_arn = "<your campaign arn>"
        update_campaign(solution_arn, campaign_arn)
    except Exception as e:
        print("Not able to update the campaign: {}".format(str(e)))
    while time.time() < dt:
        time.sleep(1)

Best practices

As you use the new aws-user-personalization recipe, keep the following best practices in mind.

  1. Don’t forget to retrain. Retraining with ‘UPDATE’ mode is essential for learning about “cold” items. During inference, the model recommends “cold” items to the user and collects user feedback, and retraining lets the model discover the properties of the “cold” items from that feedback. Without retraining, the model never learns anything about the “cold” items beyond their item metadata, and continued exploration of “cold” items will not be useful.
  2. Provide good item metadata. Even with exploration, item metadata is still crucial for recommending relevant cold items. The model learns item properties from two sources: interactions and item metadata, and because “cold” items don’t have any interactions, the model can only learn from the item metadata before exploration.
  3. Provide an accurate item release date via ‘CREATION_TIMESTAMP’ in the item dataset. This information is used to model the effect of time on an item, so that old items are not explored.

Conclusion

The new aws-user-personalization recipe from Amazon Personalize effectively mitigates the item cold start problem by also recommending new items with few interactions and learning their properties through user feedback during retraining. For more information about optimizing your user experience with Amazon Personalize, see What Is Amazon Personalize?


About the Authors

Hao Ding is an Applied Scientist at AWS AI Labs and is working on developing the next generation of recommender systems for Amazon Personalize. His research interests include recommender systems, deep learning, and graph mining.

Yen Su is a software development engineer on the Amazon Personalize team. After work, she enjoys hiking and exploring new restaurants.

Vaibhav Sethi is the lead Product Manager for Amazon Personalize. He focuses on delivering products that make it easier to build machine learning solutions. In his spare time, he enjoys hiking and reading.

Even Faster Mobile GPU Inference with OpenCL

Posted by Juhyun Lee and Raman Sarokin, Software Engineers

While the TensorFlow Lite (TFLite) GPU team continuously improves the existing OpenGL-based mobile GPU inference engine, we also keep investigating other technologies. One of those experiments turned out quite successful, and we are excited to announce the official launch of OpenCL-based mobile GPU inference engine for Android, which offers up to ~2x speedup over our existing OpenGL backend, on reasonably sized neural networks that have enough workload for the GPU.

Figure 1. Duo’s AR effects are powered by our OpenCL backend.

Improvements over the OpenGL Backend

Historically, OpenGL is an API designed for rendering vector graphics. Compute shaders were added with OpenGL ES 3.1, but its backward compatible API design decisions were limiting us from reaching the full potential of the GPU. OpenCL, on the other hand, was designed for computation with various accelerators from the beginning and is thus more relevant to our domain of mobile GPU inference. Therefore, we have looked into an OpenCL-based inference engine, and it brings quite a lot of features that let us optimize our mobile GPU inference engine.

Performance Profiling: Optimizing the OpenCL backend was much easier than OpenGL, because OpenCL offers good profiling features and Adreno supports them well. With these profiling APIs, we are able to measure the performance of each kernel dispatch very precisely.

Optimized Workgroup Sizes: We have observed that the performance of TFLite GPU on Qualcomm Adreno GPUs is very sensitive to workgroup sizes; picking the right workgroup size can boost performance, whereas picking the wrong one can degrade it by an equal amount. Unfortunately, picking the right workgroup size is not trivial for complex kernels with complicated memory access patterns. With the help of the aforementioned performance profiling features in OpenCL, we were able to implement an optimizer for workgroup sizes, which resulted in up to a 50% speedup over the average.

Native 16-bit Precision Floating Point (FP16): OpenCL supports FP16 natively and requires the accelerator to specify the data type’s availability. Being a part of the official spec, even some of the older GPUs, e.g. Adreno 305 from 2012, can operate at their full capabilities. OpenGL, on the other hand, relies on hints which the vendors can choose to ignore in their implementations, leading to no performance guarantees.

Constant Memory: OpenCL has a concept of constant memory. Qualcomm added a physical memory that has properties that makes it ideal to be used with OpenCL’s constant memory. This turned out to be very efficient for certain special cases, e.g. very thin layers at the beginning or at the end of the neural network. OpenCL on Adreno is able to greatly outperform OpenGL’s performance by having a synergy with this physical constant memory and the aforementioned native FP16 support.

Performance Evaluation

Below, we show the performance of TFLite on the CPU (single-threaded on a big core), on the GPU using our existing OpenGL backend, and on the GPU using our new OpenCL backend. Figure 2 and Figure 3 depict the performance of the inference engine on select Android devices with OpenCL on a couple of well-known neural networks, MNASNet 1.3 and SSD MobileNet v3 (large), respectively. Each group of 3 bars is to be observed independently and shows the relative speedup among the TFLite backends on a device. Our new OpenCL backend is roughly twice as fast as the OpenGL backend, but does particularly better on Adreno devices (annotated with SD), as we have tuned the workgroup sizes with Adreno’s performance profilers mentioned earlier. Also, the difference between Figure 2 and Figure 3 shows that OpenCL performs even better on larger networks.

Figure 2. Inference latency of MNASNet 1.3 on select Android devices with OpenCL.
Figure 3. Inference latency of SSD MobileNet v3 (large) on select Android devices with OpenCL.

Seamless Integration through the GPU Delegate

One major hurdle in employing the OpenCL inference engine is that OpenCL is not a part of the standard Android distribution. While major Android vendors include OpenCL as part of their system library, it is possible that OpenCL is not available for some users. For these devices, one needs to fall back to the OpenGL backend which is available on every Android device.

To make developers’ life easy, we have added a couple of modifications to the TFLite GPU delegate. We first check the availability of OpenCL at runtime. If it is available, we employ the new OpenCL backend as it is much faster than the OpenGL backend; if it is unavailable or couldn’t be loaded, we fall back to the existing OpenGL backend. In fact, the OpenCL backend has been in the TensorFlow repository since mid 2019 and seamlessly integrated through the TFLite GPU delegate v2, so you might be already using it through the delegate’s fallback mechanism.

Acknowledgements

Andrei Kulik, Matthias Grundman, Jared Duke, Sarah Sirajuddin, and special thanks to Sachin Joglekar for his contributions to this blog post.

NVIDIA Partner Program Expands to 1,500 Members, Adds New Benefits

NVIDIA’s enterprise partner program has grown to more than 1,500 members worldwide and added new resources to boost opportunities for training, collaboration and sales.

The expanded NVIDIA Partner Network boasts an array of companies that span the globe and help customers across a variety of needs, from high performance computing to systems integration.

NPN partners span the globe.

The NPN has seen exponential growth over the past two years, and these new program enhancements enable future expansion as Mellanox and Cumulus partner programs are set to be integrated into NPN throughout 2021.

Mellanox and Cumulus bring strong partners into the NVIDIA fold. Focused on enterprise data center markets, they provide accelerated, disaggregated and software-defined networking to meet the rapid growth in AI, cloud and HPC.

In anticipation of this growth, the NPN has introduced educational opportunities, tools and resources for training and collaboration, as well as added sales incentives. These benefits include:

Educational opportunities:

  • Industry-Specific Training Curriculums: New courses and enablement tools in healthcare, higher education and research, financial services and insurance, and retail. Additional courses in energy and telco are coming next year.
  • NPN Learning Maps: These dramatically reduce the time partners need to get up and running. Partners can discover and build their NVIDIA learning matrix based on industry and cross-referenced by role, including sales, solution architect or data scientist.

New tools and resources:

  • AI Consulting Network: New AI consulting services for data scientists and solution architects who are part of our Service Delivery Partner-Professional Services program to help build and deploy HPC and AI solutions.
  • Enhanced NPN Partner Portal: Expanded to allow access to the vast storehouse of NVIDIA-built sales tools and data, including partner rebate history and registered opportunities. The simplified portal gives partners increased visibility and easy access to the information required to quickly track sales and build accurate forecasts.
  • Industry-Specific Marketing Campaigns: Provides partners with the opportunity to build campaigns that more accurately target customers with content built from data-driven insights.

New incentives:

  • A fixed backend rebate for Elite-level Solution Provider and Solutions Integration partners for compute, compute DGX, visualization and virtualization.
  • An enhanced quarterly performance bonus program, incorporating an annualized goal to better align with sudden fluctuations in partner selling seasons.
  • Expanded AI Champions Club to honor top NVIDIA DGX systems sellers.
  • Dedicated market development funds for Elite-level providers and integration partners for most competencies.

NPN expanded categories:

  • Solution advisors focused on storage solutions and mutual reference architectures
  • Federal government system integrators

The NVIDIA Partner Network is dedicated to supporting partners that deliver world-class products and services to customers. The NPN collaborates with hundreds of companies globally, across a range of businesses and competencies, to serve customers in HPC, AI and emerging high-growth areas such as visualization, edge computing and virtualization.

Learn how to become a partner in the NPN.

AI Will Change the World. Who Will Change AI? We Will.

Editor’s Note: The following blog is a special guest post by a recent graduate of Berkeley BAIR’s AI4ALL summer program for high school students.

AI4ALL is a nonprofit dedicated to increasing diversity and inclusion in AI education, research, development, and policy.

The idea for AI4ALL began in early 2015 with Prof. Olga Russakovsky, then a Stanford University Ph.D. student, AI researcher Prof. Fei-Fei Li, and Rick Sommer, Executive Director of Stanford Pre-Collegiate Studies. They founded SAILORS as a summer outreach program for high school girls to learn about human-centered AI, which later became AI4ALL. In 2016, Prof. Anca Dragan started the Berkeley/BAIR AI4ALL camp, geared towards high school students from underserved communities.

How Citibot’s chatbot search engine uses AI to find more answers

This is a guest blog post by Francisco Zamora and Nicholas Burden at TensorIoT and Bratton Riley at Citibot. In their own words, “TensorIoT is an AWS Advanced Consulting Partner with competencies in IoT, Machine Learning, Industrial IoT and Retail. Founded by AWS alums, they have delivered end-to-end IoT and Machine Learning solutions to customers across the globe. Citibot provides tools for citizens and their governments to use for efficient and effective communication and civic change.”

Citibot is a technology company that builds AI-powered chat solutions for local governments from Fort Worth, Texas to Arlington, Virginia. With Citibot, local residents can quickly get answers to city-related questions, report issues, and receive real-time alerts via text responses. To power these interactions, Citibot uses Amazon Lex, a service for building conversational interfaces for text and voice applications. Citibot built a chatbot to handle basic call queries, which allows government employees to allocate more time to higher-impact community actions.

The challenges imposed by the COVID-19 pandemic surfaced the need for public organizations to have scalable, self-service tools that can quickly provide reliable information to its constituents. With COVID-19, Citibot call centers saw a dramatic uptick in wait times and call abandonments as citizens tried to get information about virus prevention and unemployment insurance. To increase the flexibility and robustness of their chatbot to new query types, Citibot looked to add a general search capability. Citibot wanted a solution that could outperform third-party solutions and effectively use curated FAQ content and recently published data from multiple websites such as the CDC and federal, state, and local government.

The following image shows screenshots of sample Citibot conversations.

To design this general search solution, Citibot chose TensorIoT, an AWS Advanced Consulting Partner that specializes in serverless application development. TensorIoT developed a solution that included TensorIoT’s Web Connector Tool and Amazon Kendra, an enterprise search service. TensorIoT’s Web Connector Tool, built natively on AWS, enabled Amazon Kendra to index the content of target web pages and serve as a fallback search intent when Amazon Lex intents can’t provide an answer.

This new chatbot search solution helped local citizens quickly find the answers they needed and reduced wait times by up to 90%. This in turn decreased the volume of interactions handled by city officials, eased uncertainty within communities, and allowed municipal governments to focus on keeping their communities safe. As offices closed due to the pandemic, this solution provided a contactless way for residents without internet access to search for information on government websites at any time through their phones.

The following diagram illustrates the architecture for Citibot’s general search solution.

How it all came together

First, TensorIoT deployed a custom Amazon Lex search intent that is triggered when the chatbot receives a question or utterance it can’t answer. The team used AWS Lambda to develop the intent’s dialog and fulfillment code hooks to manage the conversation flow and fulfillment APIs. This new search intent was developed, tested, and merged into the dev version of Citibot to ensure all the original intents worked properly.
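Citibot’s actual handler isn’t public, but a simplified fulfillment Lambda for such a fallback intent might look like the sketch below, assuming the Amazon Lex V1 event format and a placeholder Amazon Kendra index ID.

import boto3

kendra = boto3.client("kendra")
KENDRA_INDEX_ID = "<your kendra index id>"

def lambda_handler(event, context):
    # The utterance that none of the regular intents could answer.
    query_text = event["inputTranscript"]

    response = kendra.query(IndexId=KENDRA_INDEX_ID, QueryText=query_text)

    if response["ResultItems"]:
        top_result = response["ResultItems"][0]
        message = top_result["DocumentExcerpt"]["Text"]
    else:
        message = "Sorry, I couldn't find an answer to that question."

    # Close the conversation with the answer as the bot's response.
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }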

Second, TensorIoT needed to create a search query index. They chose Amazon Kendra because it can integrate a variety of data sources and data types into Citibot’s existing technology stack. The TensorIoT and Citibot development teams determined a target group of government data sources, including the CDC website for COVID-19 data and multiple city websites for municipal data, that are checked on a routine basis. This helps the chatbot access the most recent guidelines about the virus and social distancing.

The following diagram illustrates the data sources used for Citibot’s general search solution.

Next, the teams researched the optimal format type and data storage containers for saving information and connecting to Amazon Kendra. TensorIoT knew that Amazon Kendra is trained to systematically process and index data sources to derive meaning from a variety of data formats, such as .pdf, .csv, and .html files. To increase the processing efficiency of Amazon Kendra, the TensorIoT team intelligently partitioned the data into queryable information chunks that could be relayed back to the users. The TensorIoT approach used a combination of .csv, .pdf, and .html files to provide complete data, giving a solid foundation for product development.

The TensorIoT team then developed a versatile Web Connector using NodeJS and the JavaScript library Cheerio to crawl trusted websites and deposit that information into the data stores. Because COVID-19-related information changes frequently, TensorIoT created an Amazon DynamoDB table to store all the websites to routinely index for updated information.
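
The production connector was written in NodeJS with Cheerio, but the crawl-and-store loop can be sketched roughly in Python as follows; the DynamoDB table name, S3 bucket, and the choice to store raw HTML are assumptions for illustration.

    import boto3
    import requests

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")
    SOURCES_TABLE = "citibot-crawl-sources"   # hypothetical table listing URLs to re-index
    BUCKET = "citibot-kendra-documents"       # hypothetical bucket that Amazon Kendra indexes

    def crawl_once():
        table = dynamodb.Table(SOURCES_TABLE)
        for item in table.scan()["Items"]:    # each item stores one target URL
            url = item["url"]
            page = requests.get(url, timeout=10)
            page.raise_for_status()
            key = url.replace("https://", "").replace("/", "_") + ".html"
            # Store the raw page so the S3 data source picks it up on the next sync.
            s3.put_object(
                Bucket=BUCKET,
                Key=key,
                Body=page.text.encode("utf-8"),
                ContentType="text/html",
            )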

With the additional information from the targeted websites, the TensorIoT and Citibot teams decided to use Amazon Simple Storage Service (Amazon S3) buckets for data storage. Amazon Kendra provides machine learning (ML)-powered search capabilities for all unstructured data stored in AWS and offers easy-to-use native connectors for popular sources like Amazon S3, SharePoint, Salesforce, ServiceNow, RDS databases, and OneDrive. By unifying the extracted .html pages and .pdf files from the CDC website in the same S3 bucket, the development team could sync the index to the data source, providing readily available data. They also used Amazon Kendra to extract metadata files from the scraped .html pages, which provided additional file attributes such as city names to further improve answer results.
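
A minimal sketch of that wiring, assuming an existing index, a placeholder IAM role, and a metadata/ prefix for the per-document attribute files (the exact configuration keys should be checked against the current Amazon Kendra API documentation):

    import boto3

    kendra = boto3.client("kendra")

    kendra.create_data_source(
        IndexId="<kendra-index-id>",
        Name="citibot-s3-documents",
        Type="S3",
        RoleArn="arn:aws:iam::<account-id>:role/<kendra-s3-access-role>",
        Configuration={
            "S3Configuration": {
                "BucketName": "citibot-kendra-documents",
                # Optional per-document JSON metadata (e.g., a "city" attribute) lives here.
                "DocumentsMetadataConfiguration": {"S3Prefix": "metadata/"},
            }
        },
    )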

The following image shows an example of the attributes that Citibot could use to tune search results.

Without any model training, TensorIoT and Citibot could point Amazon Kendra at their content stores and start receiving specific answers to natural language queries (such as, “How can I protect myself from Covid-19?”) by extracting the answer from the most relevant document.

To test the solution, the engineers ran sample event scripts with test inputs that allowed them to verify if all the sample questions were being answered successfully. TensorIoT tested and confirmed that each question or utterance returned an answer with a valid text excerpt and link. Additionally, the team used a negative feedback API that flagged answers users had downvoted and gave Citibot the ability to revisit the search answers that were voted as unhelpful. This data helps drive continuous improvement around the answers provided by the index for specific questions.
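
A hedged sketch of how such a downvote might be relayed to the index through the feedback API (the query ID and result ID come from the original query response; how the chatbot surfaces the downvote is outside this snippet):

    import boto3

    kendra = boto3.client("kendra")

    def record_downvote(index_id, query_id, result_id):
        # Mark the result as not relevant so the unhelpful answer can be revisited.
        kendra.submit_feedback(
            IndexId=index_id,
            QueryId=query_id,
            RelevanceFeedbackItems=[
                {"ResultId": result_id, "RelevanceValue": "NOT_RELEVANT"}
            ],
        )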

For curated content search, the developers could also upload a .csv file of FAQs to provide direct answers to the most commonly asked questions. For Citibot, TensorIoT used this feature to fill in specific answers to municipal information questions, adding a .csv file of relevant questions and answers (Q&A) without needing to build a separate search engine microservice. Using these features brings numerous benefits, including accuracy, simplicity, and connectivity.
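
Registering such a curated FAQ file could look roughly like the following; the bucket, key, and role ARN are placeholders.

    import boto3

    kendra = boto3.client("kendra")

    kendra.create_faq(
        IndexId="<kendra-index-id>",
        Name="citibot-municipal-faq",
        RoleArn="arn:aws:iam::<account-id>:role/<kendra-faq-role>",
        S3Path={"Bucket": "citibot-kendra-documents", "Key": "faqs/municipal-faq.csv"},
        FileFormat="CSV",  # each row pairs a question with its curated answer
    )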

In just a few weeks, TensorIoT also built and added custom query logic and feedback submission APIs to the Amazon Lex bot, giving users better answers without requiring human interaction or extensive searching. Amazon Kendra exposes its functionality via APIs, such as the submit feedback API, which allows end users to rate search results. The team used the custom Amazon Lex intent and Lambda to handle the incoming queries and create a powerful search service.

The following image shows how the solution uses Amazon Lex and Lambda.

The TensorIoT solution was designed so Citibot can effortlessly add new cities to the service and disseminate information to their respective communities. The next challenge for the TensorIoT team was using city-specific information to provide more relevant search results. Using the additional session and request attributes of Amazon Lex, TensorIoT provided Amazon Kendra with search filters that refine the query with city-specific information. If no city was stated, the system defaulted to the user’s call location. With TensorIoT’s custom search intent deployed, search filter in place, data sources filled, and APIs built, the team started to integrate this search engine into the existing chatbot product.
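
One way such a city-scoped query could look (the attribute name "city" and the way its value is pulled from the Lex session attributes are illustrative assumptions):

    import boto3

    kendra = boto3.client("kendra")

    def city_scoped_query(index_id, query_text, city):
        # Restrict results to documents tagged with the caller's city.
        return kendra.query(
            IndexId=index_id,
            QueryText=query_text,
            AttributeFilter={
                "EqualsTo": {"Key": "city", "Value": {"StringValue": city}}
            },
        )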

Deployment

To deploy this TensorIoT solution, the development teams integrated the new Amazon Lex custom search intent with Citibot and tested the bot’s ability to successfully answer queries. Using a sample phone number provided by Citibot through Twilio, TensorIoT used SMS to validate the returned results for each utterance.

With Amazon Kendra, the TensorIoT team eliminated the need for a third-party search engine and could focus on creating an automated solution for gathering information. After the chatbot was updated, the team redeployed the service with a version upgrade of the software development kit. The upgraded chatbot now uses the search power of Amazon Kendra to answer more questions for users based on the curation of document content. The resulting chatbot outperforms the tools the cities had previously used.

Storing information in a curated content form is especially useful when combining Amazon Lex and Amazon Kendra: Amazon Kendra handles the customized information retrieval, and Amazon Lex communicates the results to the end user through agentless voice and text interactions.

Conclusion

This use case demonstrates how TensorIoT used multiple AWS services to add value in solution development. Beyond COVID-19, cities can continue to use the Amazon Kendra-powered chatbot to provide fast access to information about public facility hours, road closures, and events. Depending on your use case, you can easily customize the subject matter of the Amazon Kendra index to provide information for emerging user needs.

The TensorIoT search engine proved to be a powerful solution to a modern-day problem, allowing communities to stay informed and connected through text. Although the primary purpose of this application was to enhance customer support services, the solution is applicable to searching internal knowledge bases for schools, banks, local businesses, and non-profit organizations. With AWS and TensorIoT, companies like Citibot can use new and powerful technologies such as Amazon Kendra to improve their existing chatbot solutions.

 


About the Authors

Francisco Zamora is a Software Engineer at TensorIoT.

Nicholas Burden is a Technical Evangelist at TensorIoT.

Bratton Riley is the CEO at Citibot.

High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks

Fig. 1: The central hypothesis: within a dataset with finite samples, there are correlations between the high-frequency component and the “semantic” component of the images. As a result, the model will perceive both the high-frequency component as well as the “semantic” ones, leading to generalization behavior counterintuitive to humans.

It’s all about data

There are many works aiming to explain the generalization behavior of neural networks using heavy mathematical machinery, but we will do something different here: with a simple and intuitive twist on the data, we will show that many generalization mysteries (like adversarial vulnerability, BatchNorm’s efficacy, and the “generalization paradox”) might be the result of our overconfidence in judging data with the naked eye. Or simply:

The models may not have outsmarted us, but the data has.

Let’s start with an interesting observation (Fig. 2): we trained a ResNet-18 on the CIFAR-10 dataset, picked a test sample, and plotted the model’s prediction confidence for this sample. Then we mapped the sample into the frequency domain through a Fourier transform and cut the frequency representation into its high-frequency component (HFC) and low-frequency component (LFC). We reconstructed an image from each component and fed the reconstructed images into the model:

  • HFC-reconstructed images look distinctly different from the original sample but are predicted to have the same label.
  • LFC-reconstructed images look similar to the original sample, but the model classifies them differently.

Fig. 2: The striking misalignment between humans and models: HFC-reconstructed images look distinctly different from the original sample but are predicted to have the same label; LFC-reconstructed images look similar to the original sample, but the model classifies them differently.

Although this phenomenon can only be observed with a subset of samples (~600 images), it’s striking enough to raise an alarm.
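
For concreteness, here is a minimal numpy sketch of this kind of frequency split, assuming a grayscale image and a cutoff radius r measured in the centered Fourier domain (the exact radius used in the experiments is not reproduced here):

    import numpy as np

    def split_frequency(img, r):
        """Split a grayscale image into low- and high-frequency reconstructions."""
        f = np.fft.fftshift(np.fft.fft2(img))        # centered spectrum
        h, w = img.shape
        yy, xx = np.ogrid[:h, :w]
        dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
        mask = dist <= r                             # True inside the low-frequency disk
        lfc = np.fft.ifft2(np.fft.ifftshift(f * mask)).real      # LFC reconstruction
        hfc = np.fft.ifft2(np.fft.ifftshift(f * (~mask))).real   # HFC reconstruction
        return lfc, hfc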

Why does a model behave like this?

We believe the underlying reason is the coincidental correlation between HFC and the “semantics” depicted within a dataset (Fig. 1). With a finite number of samples from the same distribution, chances are that the human-imperceptible HFC is correlated with how a human annotates the image; thus, when a model is optimized to reduce the training loss, it can pick up either the “semantics” or HFC to reduce the loss, leading to high prediction accuracy even though the model may not truly “understand” the data. 

Please note: we are not claiming that the model itself has a tendency to capture HFC. Instead, our main argument is that a generic model does not have the incentive to learn LFC only; thus, it may end up learning a mixture of LFC and HFC.

Also, one may wonder whether the fact that models can capture HFC is promising or worrisome. One side may argue that this enables the development of models that can surpass humans on the test data (likely only when the test data is from the same distribution as the training data), while the other side may argue that the resulting models, despite performing better on test data from the same distribution, may underperform after being deployed on similar data from other distributions (i.e., HFC may be dataset-specific). This post does not intend to resolve this argument, but only to offer the observations we made.

Observations

We can leverage the main observation to help explain multiple previously elusive empirical phenomena. In this post, we will highlight two discussions from our paper.

One of the roots of adversarial vulnerability

Conceptually similar to the argument made in a preceding paper, we show that the predictive signal from HFC is one of the roots of adversarial vulnerability. However, in contrast to this prior work, we offer a concrete proposal regarding what the adversarial features might be: signals from HFC.

To investigate the relationship, we trained an adversarially robust model with Madry’s adversarial training method and studied the convolutional kernels of a robust model and a vanilla model. We notice that the convolutional kernels of a robust model tend to be smoother (“smooth” in the sense that differences between adjacent values are small), as shown in Fig. 3. Relevant mathematical tools suggest that smooth convolutional kernels only consider a negligible amount of HFC in data, thus linking HFC to adversarial vulnerability.

Fig. 3. Left: visualization of convolutional kernels of a vanilla CNN; Right: visualization of an adversarially robust CNN from adversarial training.

With this knowledge, a more enticing question is whether we can directly smooth a vanilla model’s convolutional kernel to improve its adversarial robustness. To answer this question, we tested multiple methods to smooth convolutional kernels:

  • To heuristically adjust the weights of trained kernels to improve the smoothness.
  • To filter out the high-frequency information of trained kernels (not reported in our paper).
  • To design regularization schemes limiting the differences between adjacent values of kernels during training (not reported in our paper; see the sketch below).
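
As a minimal PyTorch sketch of that third option, one could penalize differences between adjacent kernel entries and add the penalty (scaled by a hypothetical weight) to the training loss; this illustrates the idea and is not the exact scheme from the paper.

    import torch.nn as nn

    def kernel_smoothness_penalty(model):
        # Sum of squared differences between horizontally and vertically adjacent
        # entries of every Conv2d kernel; smoother kernels give a smaller penalty.
        penalty = 0.0
        for m in model.modules():
            if isinstance(m, nn.Conv2d):
                w = m.weight  # shape: (out_channels, in_channels, kH, kW)
                penalty = penalty + (w[..., 1:, :] - w[..., :-1, :]).pow(2).sum()
                penalty = penalty + (w[..., :, 1:] - w[..., :, :-1]).pow(2).sum()
        return penalty

    # During training: loss = task_loss + lambda_smooth * kernel_smoothness_penalty(model)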

Unfortunately, only marginal improvements were observed. Thus, we can conclude that adversarially robust models tend to have smooth convolutional kernels, but the converse is not necessarily true. In other words, HFC is one source of adversarial vulnerability, but not the only one.

However, one solution inspired by this observation can indeed defend against most adversarial attacks at a remarkable rate:

  • To filter out HFC of input images before feeding the data into models.

This method can improve the robustness of a model, but it effectively masks the model’s gradient, so it does not solve the adversarial attack-and-defense problem the research community focuses on.
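
Reusing the split_frequency sketch from above, the defense amounts to a single, schematic preprocessing step before inference (the cutoff radius is a hypothetical hyperparameter):

    def predict_on_lfc(model, img, r=12):
        lfc, _ = split_frequency(img, r)   # keep only the low-frequency reconstruction
        return model(lfc)                  # robustness gain largely comes from gradient masking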

The Efficacy of BatchNormalization

Another interesting observation concerns the efficacy of BatchNorm. BatchNorm is one of the most effective training heuristics in modern deep learning research, especially in computer vision applications. However, why BatchNorm can help training so significantly is not yet well understood. Interestingly, our experiments offer another perspective on why BatchNorm often helps.

Fig. 4. How test accuracy changes along with the increment of epochs during training. Each panel depicts the performances of different heuristics. Each color represents the performance from a different radius to cut LFC and HFC. Solid lines represent performances for LFC and dashed lines show those for HFC. The higher the curves of dashed lines, the more HFC a method exploits.

In Fig. 4, we report, as training progresses, the accuracies on the training data and on various copies of the test data, where r refers to the radius we used to cut LFC and HFC, and the solid/dashed lines denote the performance from LFC/HFC, respectively. Thus, the higher the dashed curves, the more HFC a model takes advantage of.

Surprisingly, the model trained with BatchNorm exploits a significant amount of HFC, as the dashed curves of the 4th panel are remarkably higher than those of the other panels. This observation suggests that one of the reasons why BatchNorm helps is that it encourages the usage of HFC. As we argued previously, there are multiple predictive signals (LFC and HFC) in the data. It is expected that the more signals a model uses, the higher test accuracy a model can get, consistent with the fact that BatchNorm is widely known as a method to improve testing accuracy.

Intuitively, we conjecture that the performance boost is due to the fact that HFC usually only involves pixels with very small magnitude (as the reconstructed images look mostly black to humans). BatchNorm conveniently rescales these signals through normalization, thus leading to improved accuracy.

One may naturally wonder what our observation implies about the usage of BatchNorm: we suggest that the community may need to reevaluate the value of BatchNorm, especially for models that are expected to be robust across multiple datasets. Our observation also conveniently aligns with another observation suggesting that BatchNorm may encourage adversarial vulnerability.

In our paper, we also have discussions relating to the paradox widely known as “rethinking generalization”, formal results on the tradeoff between accuracy and robustness, and experiments suggesting that these interesting phenomena may appear in other vision tasks such as object detection.

Conclusions

For more information, please refer to our paper, where we mainly draw three conclusions:

  • Since HFC may be dataset-specific, SOTA (state-of-the-art) performance may not be as important as the community thought; the alignment between humans and models may matter more.
  • We may need a new testing regime for computer vision; for example, a simple option is to always test models on LFC-reconstructed images in addition to the original test images.
  • Explicitly designing inductive biases that align models with humans may play an important role.

DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.

Creating Sounds Of India: An on device, AI powered, musical experience built with TensorFlow

Posted by Anusha Ramesh, Product Manager, TFX; David Zats, Software Engineer, TFX; Ping Yu, Software Engineer, TensorFlow.js; and Lamtharn (Hanoi) Hantrakul, AI Resident, Magenta

Introduction

Sounds of India is a unique and fun interactive musical experience launching for India’s 74th Independence Day, inspired by Indian tradition and powered by machine learning. When users throughout India (and around the world) sing the Indian National Anthem into the microphone of their mobile devices, machine learning models transform their voices into a range of classical Indian musical instruments live in the browser. The entire process of creating this experience took only 12 weeks, showing how rapidly developers can take models from research to production at scale using the TensorFlow Ecosystem.

The Research: Magenta’s Differentiable Digital Signal Processing (DDSP)

Magenta is an open source research project within Google AI exploring the role of machine learning in the creative process. Differentiable Digital Signal Processing, or DDSP, is a new open source library fusing modern machine learning with interpretable signal processing. Instead of training a pure deep learning model like WaveNet to render waveforms sample by sample, we can train lightweight models that output time-varying control signals into these differentiable DSP modules (hence the extra “D” in DDSP), which synthesize the final sound. Both recurrent and convolutional models incorporating DDSP in TensorFlow Keras layers can efficiently generate audio 1,000 times faster than their larger autoregressive counterparts, with a 100x reduction in model parameters and training data requirements. One particularly fun application of DDSP is Tone Transfer, which transforms sounds into musical instruments. Try it by first training a DDSP model on 15 minutes of audio of a target saxophone; you can then sing a melody, and the trained DDSP model will re-render it as a saxophone. For Sounds of India, we applied this technology to three classical Indian instruments: the Bansuri, the Shehnai, and the Sarangi.
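
To make the “control signals into DSP modules” idea concrete, here is a toy numpy sketch of an additive harmonic synthesizer driven by frame-rate controls; the real DDSP modules are differentiable TensorFlow ops, and the sample rate, hop size, and harmonic count below are arbitrary choices for illustration.

    import numpy as np

    def harmonic_synth(f0_hz, harm_amps, sample_rate=16000, hop=64):
        """Render audio from frame-rate control signals.

        f0_hz: (n_frames,) fundamental frequency per frame.
        harm_amps: (n_frames, n_harmonics) amplitude of each harmonic per frame.
        """
        # Upsample frame-rate controls to sample rate by simple repetition.
        f0 = np.repeat(f0_hz, hop)                     # (n_samples,)
        amps = np.repeat(harm_amps, hop, axis=0)       # (n_samples, n_harmonics)

        # Integrate frequency into phase; the modulo keeps the accumulator bounded
        # (a similar trick is described below for the float16 WebGL implementation).
        phase = np.cumsum(2 * np.pi * f0 / sample_rate) % (2 * np.pi)
        k = np.arange(1, amps.shape[1] + 1)            # harmonic numbers 1..n_harmonics
        audio = (amps * np.sin(np.outer(phase, k))).sum(axis=1)
        return audio / max(1e-9, np.abs(audio).max())  # normalize for playback

Roughly speaking, a learned model predicts control signals like these from the singer’s voice, and the fixed synthesizer renders them with the timbre of the target instrument.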

Train with TFX, deploy to the browser with TensorFlow.js

TFX

TensorFlow Extended (TFX) is an end-to-end platform for production ML, which includes preparing data, training, validating, and deploying models in production environments. TFX was used to train the models responsible for transforming the user’s voice to one of the instruments, and these models were then converted to TensorFlow.js for deployment on a standard Web browser.

Deploying to the browser provides a seamless experience for users to interact with the machine learning model: simply click a hyperlink and load the page just like any other website. No complicated installs necessary. By executing client side in the browser, we are able to perform inference right at the source of the sensor data, minimising latency and reducing server costs associated with large graphics cards, CPU, and memory. Moreover, given the application uses your voice as input, user privacy is quite important. Since the entire end-to-end experience happens client-side and in the browser, absolutely no sensor or microphone data is sent to the server side.

Browser-based machine learning models are often optimized to be as small as possible to minimize bandwidth used. In this case, the ideal hyperparameters for each musical instrument can also vary drastically. We leveraged TFX to perform large-scale training and tuning over hundreds of models to determine the smallest size for each instrument. As a result, we were able to dramatically reduce their memory footprints. For example, the Bansuri instrument model had a reduction in its on-disk size of ~20x without a noticeable impact on sound quality.

TFX also empowered us to perform rapid iteration over different model architectures (GRU, CNN), different types of inputs (loudness, RMS energy), and varying musical instrument data sources. Each time, we were able to quickly and effectively run the TFX pipeline to produce a new model with the desired characteristics.

TensorFlow.js

Creating a TensorFlow.js DDSP model was uniquely challenging because of the need to hit tight performance and model quality targets. We needed the model to be highly efficient at performing tone transfer so that it could effectively run on mobile devices. At the same time, any degradation in model quality would quickly lead to audio distortions and a poor user experience.

We started by exploring a wide range of TensorFlow.js backends and model architectures. The WebGL backend is the most optimized, while the WebAssembly backend works well on low-end phones. Given the computational requirements of DDSP, we settled on a convnet-based DDSP model and leveraged the WebGL backend.

To minimize the model download time, we studied the topology of the model, and compressed a large set of constant tensors with Fill/ZeroLike ops, which reduced the size from 10MB to 300KB.

We also focused on three key areas to make the TensorFlow.js model ready for production scale deployment on devices: inference performance, memory footprint, and numerical stability.

Inference Performance Optimization
DDSP models contain both a neural network and a signal synthesizer. The synthesizer part has many signal processing ops that require large amounts of computation. To improve performance on mobile devices, we re-wrote several kernels with special WebGL shaders to fully utilize the GPU. For example, a parallel version of the cumulative summation op reduced inference time by 90%.

Reduce memory footprint
Our goal is to be able to run the model on as many types of mobile devices as possible. Since many phones have very limited GPU memory, we need to make sure that the model has a minimal memory footprint. We achieve this by disposing of intermediate tensors and adding a new flag to allow early disposal of GPU textures. Through these approaches, we were able to reduce memory usage by 60%.

Numerical stability
The DDSP model requires very high numerical precision in order to generate beautiful music. This is quite different from typical classification models, where a certain level of precision loss does not affect the final classifications. The DDSP models used in this experience are generative models, and any loss in precision or discontinuity in the audio output is easily picked up by our sensitive ears. We encountered numerical stability problems with float16 WebGL textures, so we rewrote some of the key ops to reduce overflow and underflow in the outputs. For example, in the cumulative summation op, we make sure accumulation is done within the shader at full float precision and apply a modulo calculation to avoid overflow before writing the output to a float16 texture.

Try it yourself!

You can try out the experience on your mobile phone at g.co/SoundsofIndia – and please share your results with us if you wish. We would love to see what you create with your voice.

If you are excited about how machine learning can augment creativity and innovation, you can learn more about Magenta through the team’s blog and contribute to their open source GitHub, or check out #MadeWithTFJS for even more examples of browser-based machine learning from the TensorFlow.js community. If you are interested in training and deploying models at production scale using ML best practices, check out the TensorFlow Extended blog.

Acknowledgements

This project wouldn’t have been possible without the incredible effort of Miguel de Andrés-Clavera, Yiling Liu, Aditya Mirchandani, KC Chung, Alap Bharadwaj, Kiattiyot (Boon) Panichprecha, Pittayathorn (Kim) Nomrak, Phatchara (Lek) Pongsakorntorn, Nattadet Chinthanathatset, Hieu Dang, Ann Yuan, Sandeep Gupta, Chong Li, Edwin Toh, Jesse Engel and additional help from Michelle Carney, Nida Zada, Doug Eck, Hannes Widsomer and Greg Mikels. Huge thanks to Tris Warkentin and Mitch Trott for their tremendous support.

AI of the Storm: How We Built the Most Powerful Industrial Computer in the U.S. in Three Weeks During a Pandemic

In under a month amid the global pandemic, a small team assembled the world’s seventh-fastest computer.

Today that mega-system, called Selene, communicates with its operators on Slack, has its own robot attendant and is driving AI forward in automotive, healthcare and natural-language processing.

While many supercomputers tap exotic, proprietary designs that take months to commission, Selene is based on an open architecture NVIDIA shares with its customers.

The Argonne National Laboratory, outside Chicago, is using a system based on Selene’s DGX SuperPOD design to research ways to stop the coronavirus. The University of Florida will use the design to build the fastest AI computer in academia.

DGX SuperPODs are driving business results for companies like Continental in automotive, Lockheed Martin in aerospace and Microsoft in cloud-computing services.

Birth of an AI System

The story of how and why NVIDIA built Selene starts in 2015.

NVIDIA engineers started their first system-level design with two motivations. They wanted to build something both powerful enough to train the AI models their colleagues were building for autonomous vehicles and general purpose enough to serve the needs of any deep-learning researcher.

The result was the SATURNV cluster, born in 2016 and based on the NVIDIA Pascal GPU. When the more powerful NVIDIA Volta GPU debuted a year later, the budding systems group’s motivation and its designs expanded rapidly.

AI Jobs Grow Beyond the Accelerator

“We’re trying to anticipate what’s coming based on what we hear from researchers, building machines that serve multiple uses and have long lifetimes, packing as much processing, memory and storage as possible,” said Michael Houston, a chief architect who leads the systems team.

As early as 2017, “we were starting to see new apps drive the need for multi-node training, demanding very high-speed communications between systems and access to high-speed storage,” he said.

AI models were growing rapidly, requiring multiple GPUs to handle them. Workloads were demanding new computing styles, like model parallelism, to keep pace.

So, in fast succession, the team crafted ever larger clusters of V100-based NVIDIA DGX-2 systems, called DGX PODs. They used 32, then 64 DGX-2 nodes, culminating in a 96-node architecture dubbed the DGX SuperPOD.

They christened it Circe for the irresistible Greek goddess. It debuted in June 2019 at No. 22 on the TOP500 list of the world’s fastest supercomputers and currently holds No. 23.

Cutting Cables in a Computing Jungle

Along the way, the team learned lessons about networking, storage, power and thermals. Those learnings got baked into the latest NVIDIA DGX systems, reference architectures and today’s 280-node Selene.

In the race through ever larger clusters to get to Circe, some lessons were hard won.

“We tore everything out twice, we literally cut the cables out. It was the fastest way forward, but it still had a lot of downtime and cost. So we vowed to never do that again and set ease of expansion and incremental deployment as a fundamental design principle,” said Houston.

The team redesigned the overall network to simplify assembling the system.

They defined modules of 20 nodes connected by relatively simple “thin switches.” Each of these so-called scalable units could be laid down, cookie-cutter style, turned on and tested before the next one was added.

The design let engineers specify set lengths of cables that could be bundled together with Velcro at the factory. Racks could be labeled and mapped, radically simplifying the process of filling them with dozens of systems.

Doubling Up on InfiniBand

Early on, the team learned to split up compute, storage and management fabrics into independent planes, spreading them across more, faster network-interface cards.

The number of NICs per GPU doubled to two. So did their speeds, going from 100 Gbit per second InfiniBand in Circe to 200G HDR InfiniBand in Selene. The result was a 4x increase in the effective node bandwidth.

Likewise, memory and storage links grew in capacity and throughput to handle jobs with hot, warm and cold storage needs. Four storage tiers spanned 100 TB/s memory links to 100 GB/s storage pools.

Power and thermals stayed within air-cooled limits. The default designs used 35kW racks typical in leased data centers, but they can stretch beyond 50kW for the most aggressive supercomputer centers and down to 7kW racks some telcos use.

Seeking the Big, Balanced System

The net result is a more balanced design that can handle today’s many different workloads. That flexibility also gives researchers the freedom to explore new directions in AI and high performance computing.

“To some extent HPC and AI both require max performance, but you have to look carefully at how you deliver that performance in terms of power, storage and networking as well as raw processing,” said Julie Bernauer, who leads an advanced development team that’s worked on all of NVIDIA’s large-scale systems.

A portrait of Selene by the numbers

Skeleton Crews on Strict Protocols

The gains paid off in early 2020.

Within days of the pandemic hitting, the first NVIDIA Ampere architecture GPUs arrived, and engineers faced the job of assembling the 280-node Selene.

In the best of times, it can take dozens of engineers a few months to assemble, test and commission a supercomputer-class system. NVIDIA had to get Selene running in a few weeks to participate in industry benchmarks and fulfill obligations to customers like Argonne.

And engineers had to stay well within public-health guidelines of the pandemic.

“We had skeleton crews with strict protocols to keep staff healthy,” said Bernauer.

“To unbox and rack systems, we used two-person teams that didn’t mix with the others — they even took vacation at the same time. And we did cabling with six-foot distances between people. That really changes how you build systems,” she said.

Even with the COVID restrictions, engineers racked up to 60 systems in a day, the maximum their loading dock could handle. Virtual log-ins let administrators validate cabling remotely, testing the 20-node modules as they were deployed.

Bernauer’s team put several layers of automation in place. That cut the need for people at the co-location facility where Selene was built, a block from NVIDIA’s Silicon Valley headquarters.

Slacking with a Supercomputer

Selene talks to staff over a Slack channel as if it were a co-worker, reporting loose cables and isolating malfunctioning hardware so the system can keep running.

“We don’t want to wake up in the night because the cluster has a problem,” Bernauer said.

It’s part of the automation customers can access if they follow the guidance in the DGX POD and SuperPOD architectures.

Thanks to this approach, the University of Florida, for example, is expected to rack and power up a 140-node extension to its HiPerGator system, switching on the most powerful AI supercomputer in academia within as little as 10 days of receiving it.

As an added touch, the NVIDIA team bought a telepresence robot from Double Robotics so non-essential designers sheltering at home could maintain daily contact with Selene. Tongue-in-cheek, they dubbed it Trip, given early concerns that essential technicians on site might bump into it.

The fact that Trip is powered by an NVIDIA Jetson TX2 module was an added attraction for team members who imagined some day they might tinker with its programming.

Trip robot with Selene
Trip helped engineers inspect Selene while it was under construction.

Since late July, Trip’s been used regularly to let them virtually drive through Selene’s aisles, observing the system through the robot’s camera and microphone.

“Trip doesn’t replace a human operator, but if you are worried about something at 2 a.m., you can check it without driving to the data center,” she said.

Delivering HPC, AI Results at Scale

In the end, it’s all about the results, and they came fast.

In June, Selene hit No. 7 on the TOP500 list and No. 2 on the Green500 list of the most power-efficient systems. In July, it broke records in all eight systems tests for AI training performance in the latest MLPerf benchmarks.

“The big surprise for me was how smoothly everything came up given we were using new processors and boards, and I credit all the testing along the way,” said Houston. “To get this machine up and do a bunch of hard back-to-back benchmarks gave the team a huge lift,” he added.

The work pre-testing NGC containers and HPC software for Argonne was even more gratifying. The lab is already hammering on hard problems in protein docking and quantum chemistry to shine a light on the coronavirus.

Separately, Circe donates many of its free cycles to the Folding@Home initiative that fights COVID.

At the same time, NVIDIA’s own researchers are using Selene to train autonomous vehicles and refine conversational AI, nearing advances they’re expected to report soon. They are among more than a thousand jobs run, often simultaneously, on the system so far.

Meanwhile the team already has on the whiteboard ideas for what’s next. “Give performance-obsessed engineers enough horsepower and cables and they will figure out amazing things,” said Bernauer.

At top: An artist’s rendering of a portion of Selene.

The post AI of the Storm: How We Built the Most Powerful Industrial Computer in the U.S. in Three Weeks During a Pandemic appeared first on The Official NVIDIA Blog.