DALL·E 2 Research Preview Update

Last month, we started previewing DALL·E 2 to a limited number of trusted users to learn about the technology’s capabilities and limitations.

Since then, we’ve been working with our users to actively incorporate the lessons we learn. As of today:

  • Our users have collectively created over 3 million images with DALL·E.
  • We’ve enhanced our safety system, improving the text filters and tuning the automated detection & response system for content policy violations.
  • Less than 0.05% of downloaded or publicly shared images were flagged as potentially violating our content policy. About 30% of those flagged images were confirmed by human reviewers to be policy violations, leading to account deactivation.
  • As we work to understand and address the biases that DALL·E has inherited from its training data, we’ve asked early users not to share photorealistic generations that include faces and to flag problematic generations. We believe this has been effective in limiting potential harm, and we plan to continue the practice in the current phase.

Learning from real-world use is an important part of our commitment to develop and deploy AI responsibly, so we’re starting to widen access to users who joined our waitlist, slowly but steadily.

We intend to onboard up to 1,000 people every week as we iterate on our safety system and require all users to abide by our content policy. We hope to increase the rate at which we onboard new users as we learn more and gain confidence in our safety system. We’re inspired by what our users have created with DALL·E so far, and excited to see what new users will create.


OpenAI

Use Amazon Lex to capture street addresses

Amazon Lex provides automatic speech recognition (ASR) and natural language understanding (NLU) technologies to transcribe user input, identify the nature of their request, and efficiently manage conversations. Lex lets you create sophisticated conversations, streamline your user experience to improve customer satisfaction (CSAT) scores, and increase containment in your contact centers.

Natural, effective customer interactions require that the Lex virtual agent accurately interprets the information provided by the customer. One scenario that can be particularly challenging is capturing a street address during a call. For example, consider a customer who has recently moved to a new city and calls in to update their street address for their wireless account. Even a single United States zip code can contain a wide range of street names. Getting the right address over the phone can be difficult, even for human agents.

In this post, we’ll demonstrate how you can use Amazon Lex and Amazon Location Service to provide an effective user experience for capturing a caller’s address via voice or text.

Solution overview

For this example, we’ll use an Amazon Lex bot that provides self-service capabilities as part of an Amazon Connect contact flow. When the user calls in on their phone, they can ask to change their address, and the bot will ask them for their customer number and their new address. In many cases, the new address will be captured correctly on the first try. For more challenging addresses, the bot may ask them to restate their street name, spell their street name, or repeat their zip code or address number to capture the correct address.

Here’s a sample interaction that we’ll use to model our Lex bot:

IVR: Hi, welcome to ACME bank customer service. How can I help? You can check account balances, order checks, or change your address.

User: I want to change my address.

IVR: Can you please tell me your customer number?

User: 123456.

IVR: Thanks. Please tell me your new zip code.

User: 32312.

IVR: OK, what’s your new street address?

User: 6800 Thomasville Road, Suite 1-oh-1.

IVR: Thank you. To make sure I get it right, can you tell me just the name of your street?

User: Thomasville Road.

IVR: OK, your new address is 6800 Thomasville Road, Suite 101, Tallahassee Florida 32312, USA. Is that right?

User: Yes.

IVR: OK, your address has been updated. Is there anything else I can help with?

User: No thanks.

IVR: Thank you for reaching out. Have a great day!

As an alternative approach, you can capture the whole address in a single turn, rather than asking for the zip code first:

IVR: Hi, welcome to ACME bank customer service. How can I help? You can check account balances, order checks, or change your address.

User: I want to update my address.

IVR: Can you please tell me your customer number?

User: 123456.

IVR: Thanks. Please tell me your new address, including the street, city, state, and zip code.

User: 6800 Thomasville Road, Suite 1-oh-1, Tallahassee Florida, 32312.

IVR: Thank you. To make sure I get it right, can you tell me just the name of your street?

User: Thomasville Road.

IVR: OK, your new address is 6800 Thomasville Road, Suite 101, Tallahassee Florida 32312, US. Is that right?

User: Yes.

IVR: OK, your address has been updated. Is there anything else I can help with?

User: No thanks.

IVR: Thank you for reaching out. Have a great day!

Solution architecture

We’ll use an Amazon Lex bot integrated with Amazon Connect in this solution. When the user calls in and provides their new address, Lex uses automatic speech recognition to transcribe their speech to text. Then, it uses an AWS Lambda fulfillment function to send the transcribed text to Amazon Location Service, which performs address lookup and returns a normalized address.

As part of the AWS CloudFormation stack, you can also create an optional Amazon CloudWatch Logs log group for capturing Lex conversation logs, which can be used to create a conversation analytics dashboard to visualize the results (see the post Building a business intelligence dashboard for your Amazon Lex bots for one way to do this).

How it works

This solution combines several techniques to create an effective user experience, including:

  • Amazon Lex automatic speech recognition technology to convert speech to text.
  • Integration with Amazon Location Service for address lookup and normalization.
  • Lex spelling styles, to implement a “say-spell” approach when voice inputs are not clear (for example, ask the user to say their street name, and then if necessary, to spell it).

The first step is to make sure that the required slots have been captured.

In the first code section that follows, we prompt the user for their zip code and street address using the Lex ElicitSlot dialog action. The elicit_slot_with_retries() function prompts the user based on a set of configurable prompts.

 
    # check for the ZipCode slot; if not available, elicit it
    zip_code = None
    zipCode = slot_values.get('ZipCode', None)
    if zipCode is not None:
        zip_code = zipCode['value'].get('interpretedValue', None)
    else:
        response = helpers.elicit_slot_with_retries(intent, activeContexts, sessionAttributes, 'ZipCode', requestAttributes)
        return response

    # check for the StreetAddress slot; if not available, elicit it
    street_address = None
    streetAddress = slot_values.get('StreetAddress', None)
    if streetAddress is not None:
        street_address = streetAddress['value'].get('interpretedValue', None)
    else:
        # give the caller extra time for this response
        sessionAttributes['x-amz-lex:audio:end-timeout-ms:' + intent_name + ':StreetAddress'] = 2000
        response = helpers.elicit_slot_with_retries(intent, activeContexts, sessionAttributes, 'StreetAddress', requestAttributes)
        return response

    # convert spoken numbers to digits and save the result for later turns
    street_address = parse_address.parse(street_address)
    sessionAttributes['inputAddress'] = street_address

The last section of code above uses a helper function parse_address.parse() that converts spoken numbers into digits (for example, it converts “sixty eight hundred” to “6800”).
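
The deployed helper isn’t listed in this post; the following is a minimal, purely illustrative sketch of that kind of conversion. Only the parse() name mirrors the call above — the word table and logic are assumptions, and only simple patterns such as “sixty eight hundred” are handled.

# Illustrative sketch only: convert simple spoken number phrases to digits,
# e.g. "sixty eight hundred Thomasville Road" -> "6800 Thomasville Road".
WORD_VALUES = {
    "zero": 0, "oh": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}

def parse(text):
    out, number = [], None
    for word in text.split():
        key = word.lower()
        if key in WORD_VALUES:
            value = WORD_VALUES[key]
            number = value if number is None else number + value
        elif key == "hundred" and number is not None:
            number *= 100
        else:
            if number is not None:   # flush any pending number before a non-number word
                out.append(str(number))
                number = None
            out.append(word)
    if number is not None:
        out.append(str(number))
    return " ".join(out)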

Then, we send the user’s utterance to Amazon Location Service and inspect the response. We discard any entries that don’t have a street, a street number, or have an incorrect zip code. In cases where we have to re-prompt for a street name or number, we also discard any previously suggested addresses.

    # validate the address using Amazon Location Service
    location_response = locationClient.search_place_index_for_text(IndexName='explore.place', Text=street_address)

    # inspect the response from Amazon Location Service
    if location_response.get('Results', None) is not None:
        for address in location_response['Results']:
            if address.get('Place', None) is not None:
                addressLabel = address['Place'].get('Label', None)
                addressNumber = address['Place'].get('AddressNumber', None)
                street = address['Place'].get('Street', None)
                postalCode = address['Place'].get('PostalCode', None)
                # skip entries without a street name or street number
                if street is None:
                    continue
                if addressNumber is None:
                    continue
                # skip entries whose postal code is missing or doesn't match the captured zip code
                if zip_code is not None:
                    if postalCode is None or postalCode[:len(zip_code)] != zip_code:
                        continue
                # skip addresses we have already suggested to the caller
                already_tried = False
                prior_suggestions = helpers.get_all_values('suggested_address', sessionAttributes)
                for prior_suggestion in prior_suggestions:
                    if addressLabel == prior_suggestion:
                        already_tried = True
                        break
                if already_tried:
                    continue
                # the first entry with a valid street that was not already tried is the next best guess
                resolvedAddress = addressLabel
                break

Once we have a resolved address, we confirm it with the user.

    # confirm the resolved address with the caller, using SSML for voice responses
    if event.get('inputMode') == 'Speech':
        response_string = '<speak>OK, your new address is <say-as interpret-as="address">'
        response_string += resolvedAddress + '</say-as>. Is that right?</speak>'
        response_message = helpers.format_message_array(response_string, 'SSML')
    else:
        response_string = 'OK, your new address is ' + resolvedAddress + '. Is that right?'
        response_message = helpers.format_message_array(response_string, 'PlainText')

    intent['state'] = 'Fulfilled'
    response = helpers.confirm(intent, activeContexts, sessionAttributes, response_message, requestAttributes)
    return response

If we don’t get a resolved address back from the Amazon Location Service, or if the user says the address that we suggested wasn’t right, then we re-prompt for some additional information, and try again. The additional information slots include:

  • StreetName: slot type AMAZON.StreetName
  • SpelledStreetName: slot type AMAZON.AlphaNumeric (using Amazon Lex spelling styles)
  • StreetAddressNumber: slot type AMAZON.Number

The logic to re-prompt is controlled by the next_retry() function, which consults a list of actions to try:

RETRY_ACTIONS = [
    { "street_name": {
          "method": elicit_street_name,
          "style": None,
          "no-match": "Thank you. To make sure I get it right, can you tell me just the name of your street?",
          "incorrect": "Let's try again. Can you tell me just the name of your street?"
       }
    },
    { "street_name_spelled_by_letter": {
          "method": elicit_spelled_street, 
          "style": "SpellByLetter",
          "no-match": "Let's try a different way. Can you please spell just the name of your street?",
          "incorrect": "Let's try a different way. Can you please spell just the name of your street?"
       }
    },
    { "street_address_number": {
          "method": elicit_street_address_number, 
          "style": None,
          "no-match": "I didn't find a matching address. Can you please tell me your street address number?",
          "incorrect": "OK, let's try your street address number. Can you tell me that once more?"
       }
    },
    { "street_name_spelled_by_word": {
          "method": elicit_spelled_street, 
          "style": "SpellByWord",
          "no-match": "Let's try one last time. Please spell the name of your street. You can use words for letters, such as a as in apple, or b like bob.",
          "incorrect": "Let's try one last time. Please spell the name of your street. You can use words for letters, such as a as in apple, or b like bob."
       }
    },
    { "agent": {
          "method": route_to_agent, 
          "style": None,
          "no-match": "Sorry, I was not able to find a match for your address. Let me get you to an agent.",
          "incorrect": "Sorry, I was not able to find a match for your address. Let me get you to an agent."
       }
    }
]

The next_retry() function will try these actions in order. You can modify the sequence of prompts by changing the order in the RETRY_ACTIONS list. You can also configure different prompts for scenarios where Amazon Location Service doesn’t find a match, versus when the user says that the suggested address wasn’t correct. As you can see, we may ask the user to restate their street name, and failing that, to spell it using Amazon Lex spelling styles. We refer to this as a “say-spell” approach, and it’s similar to how a human agent would interact with a customer in this scenario.
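
The full next_retry() implementation ships with the sample solution; the following is a simplified, hypothetical sketch of how such a function could walk the RETRY_ACTIONS list. The session attribute name and the signatures of the elicit_* helpers are assumptions for illustration only.

# Hypothetical sketch only: pick the next untried retry action and elicit the corresponding slot.
def next_retry(intent, activeContexts, sessionAttributes, requestAttributes, reason='no-match'):
    tried = [t for t in sessionAttributes.get('retriesAttempted', '').split(',') if t]
    for action in RETRY_ACTIONS:
        for name, config in action.items():
            if name in tried:
                continue
            # remember that this step has been attempted in this conversation
            sessionAttributes['retriesAttempted'] = ','.join(tried + [name])
            # use the prompt that matches why we are retrying ('no-match' or 'incorrect')
            prompt = config.get(reason, config['no-match'])
            # each elicit_* helper would issue an ElicitSlot dialog action, applying the
            # spelling style (for example, SpellByLetter) when one is configured
            return config['method'](intent, activeContexts, sessionAttributes,
                                    requestAttributes, prompt, config['style'])
    return None  # list exhausted (the last entry routes to an agent, so this is not expected)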

To see this in action, you can deploy it in your AWS account.

Prerequisites

You can use the CloudFormation link that follows to deploy the solution in your own AWS account. Before deploying this solution, you should confirm that you have the following prerequisites:

  • An available AWS account where you can deploy the solution.
  • Access to the following AWS services:
    • Amazon Lex
    • AWS Lambda, for integration with Amazon Location Service
    • Amazon Location Service, for address lookup
    • AWS Identity and Access Management (IAM), for creating the necessary policies and roles
    • CloudWatch Logs, to create log groups for the Lambda function and optionally for capturing Lex conversation logs
    • CloudFormation to create the stack
  • An Amazon Connect instance (for instructions on setting one up, see Create an Amazon Connect instance).

The following AWS Regions support Amazon Lex, Amazon Connect, and Amazon Location Service: US East (N. Virginia), US West (Oregon), Europe (Frankfurt), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo).

Deploying the sample solution

Sign in to the AWS Management Console in your AWS account, and select the following link to deploy the sample solution:

This will create a new CloudFormation stack.

Enter a Stack name, such as lex-update-address-example. Enter the ARN (Amazon Resource Name) for the Amazon Connect instance that you’ll use for testing the solution. You can keep the default values for the other parameters, or change them to suit your needs. Choose Next, and add any tags that you may want for your stack (optional). Choose Next again, review the stack details, select the checkbox to acknowledge that IAM resources will be created, and then choose Create stack.

After a few minutes, your stack will be complete and will include the following resources:

  • A Lex bot, including a published version with an alias (Development-Alias)
  • A Lambda fulfillment function for the bot (BotHandler)
  • A CloudWatch Logs log group for Lex conversation logs
  • Required IAM roles
  • A custom resource that adds a sample contact flow to your Connect instance

At this point, you can try the example interaction above in the Lex V2 console. You should see the sample bot with the name that you specified in the CloudFormation template (e.g., update-address-bot).

Choose this bot, choose Bot versions in the left-side navigation panel, choose the Version 1 version, and then choose Intents in the left-side panel. You’ll see the list of intents, as well as a Test button.

To test, select the Test button, select Development-Alias, and then select Confirm to open the test window.

Try “I want to change my address” to get started. This will use the UpdateAddressZipFirst intent to capture an address, starting by asking for the zip code, and then asking for the street address.

You can also say “I want to update my address” to try the UpdateAddress intent, which captures an address all at once with a single utterance.

Testing with Amazon Connect

Now let’s try this with voice using a Connect instance. A sample contact flow was already configured in your Connect instance by the CloudFormation stack.

All you need to do is set up a phone number, and associate it with this contact flow. To do this, follow these steps:

  • Launch Amazon Connect in the AWS Console.
  • Open your Connect instance by selecting the Access URL, and logging in to the instance.
  • In Dashboard, select View phone numbers.
  • Select Claim a number, choose a country from the Country drop-down, and choose a number.
  • Enter a Description, such as “Example flow to update an address with Amazon Lex”, and select the sample contact flow that was added to your instance.
  • Choose Save.

Now you’re ready to call in to your Connect instance to test your bot using voice. Just dial the number on your phone, and try some US addresses. To try the zip-code-first approach, say “change my address”. To try capturing the address in a single turn, say “update my address”. You can also just say “my new address is”, followed by a valid US address.

But wait… there’s more

Another challenging use case for voice scenarios is capturing a user’s email address. This is often needed for user verification purposes, or simply to let the user change their email address on file. Lex has built-in support for email addresses using the AMAZON.EmailAddress built-in slot type, which also supports Lex spelling styles.

Using a “say-spell” approach for capturing email addresses can be very effective, and since the user experience is similar to the street address capture scenarios described above, we’ve included it in the sample bot as well. Give it a try!

Clean up

When you’re done using the bot, you may want to clean up the resources created by the CloudFormation template to avoid incurring ongoing charges. To do this, delete the CloudFormation stack.

Conclusion

Amazon Lex offers powerful automatic speech recognition and natural language understanding capabilities that can be used to capture the information needed from your users to provide automated, self-service functionality. Capturing a customer’s address via speech recognition can be challenging due to the range of names for streets, cities, and towns. However, you can easily integrate Amazon Lex with Amazon Location Service to look up the correct address, based on the customer’s input. You can incorporate this technique in your own Lex conversation flows.


About the Author

Brian Yost is a Senior Technical Program Manager on the AWS Lex team. In his spare time, he enjoys mountain biking, home brewing, and tinkering with technology.


Vector-Quantized Image Modeling with Improved VQGAN

In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora.

This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer codes (represented as natural numbers), and modeling them autoregressively (i.e., predicting sequences one token at a time). In these approaches, a convolutional neural network (CNN) is trained to encode an image into discrete tokens, each corresponding to a small patch of the image. A second stage CNN or Transformer is then trained to model the distribution of encoded latent variables. The second stage can also be applied to autoregressively generate an image after the training. But while such models have achieved strong performance for image generation, few studies have evaluated the learned representation for downstream discriminative tasks (such as image classification).

In “Vector-Quantized Image Modeling with Improved VQGAN”, we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks. In the first stage, an image quantization model, called VQGAN, encodes an image into lower-dimensional discrete latent codes. Then a Transformer model is trained to model the quantized latent codes of an image. This approach, which we call Vector-quantized Image Modeling (VIM), can be used for both image generation and unsupervised image representation learning. We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding.

Vector-Quantized Image Modeling with ViT-VQGAN
One recent, commonly used model that quantizes images into integer tokens is the Vector-quantized Variational AutoEncoder (VQVAE), a CNN-based auto-encoder whose latent space is a matrix of discrete learnable variables, trained end-to-end. VQGAN is an improved version of this that introduces an adversarial loss to promote high quality reconstruction. VQGAN uses transformer-like elements in the form of non-local attention blocks, which allows it to capture distant interactions using fewer layers.

In our work, we propose taking this approach one step further by replacing both the CNN encoder and decoder with ViT. In addition, we introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens. Specifically, we reduced the encoder output from a 768-dimension vector to a 32- or 8-dimension vector per code, which we found encourages the decoder to better utilize the token outputs, improving model capacity and efficiency.

Overview of the proposed ViT-VQGAN (left) and VIM (right), which, when working together, is capable of both image generation and image understanding. In the first stage, ViT-VQGAN converts images into discrete integers, which the autoregressive Transformer (Stage 2) then learns to model. Finally, the Stage 1 decoder is applied to these tokens to enable generation of high quality images from scratch.

With our trained ViT-VQGAN, images are encoded into discrete tokens represented by integers, each of which encompasses an 8×8 patch of the input image. Using these tokens, we train a decoder-only Transformer to predict a sequence of image tokens autoregressively. This two-stage model, VIM, is able to perform unconditioned image generation by simply sampling token-by-token from the output softmax distribution of the Transformer model.

VIM is also capable of performing class-conditioned generation, such as synthesizing a specific image of a given class (e.g., a dog or a cat). We extend the unconditional generation to class-conditioned generation by prepending a class-ID token before the image tokens during both training and sampling.
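
As a rough illustration of the sampling loop described above, class-conditioned generation can be thought of as prepending the class-ID token and then sampling image tokens one at a time. This is a generic sketch with hypothetical names and an arbitrary framework choice, not the paper’s code; the transformer is assumed to return per-position logits over a vocabulary that places class IDs after the image-token codebook.

import torch

def sample_image_tokens(transformer, class_id, num_image_tokens, codebook_size):
    # Class IDs are assumed to occupy a vocabulary range after the image-token codebook.
    tokens = torch.tensor([[codebook_size + class_id]])            # (1, 1) - the class-ID prefix
    for _ in range(num_image_tokens):
        logits = transformer(tokens)                               # (1, seq_len, vocab_size)
        probs = torch.softmax(logits[:, -1, :codebook_size], -1)   # next-token distribution over image tokens
        next_token = torch.multinomial(probs, num_samples=1)       # sample token-by-token
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, 1:]   # drop the class token; the Stage 1 decoder turns these into an image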

Uncurated set of dog samples from class-conditioned image generation trained on ImageNet. Conditioned classes: Irish terrier, Norfolk terrier, Norwich terrier, Yorkshire terrier, wire-haired fox terrier, Lakeland terrier.

To test the image understanding capabilities of VIM, we also fine-tune a linear projection layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Similar to ImageGPT, we take a layer output at a specific block, average over the sequence of token features (frozen) and insert a softmax layer (learnable) projecting averaged features to class logits. This allows us to capture intermediate features that provide more information useful for representation learning.
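
A minimal sketch of this kind of linear-probe setup follows; the names, shapes, and framework are assumptions for illustration rather than the actual evaluation code. The backbone is frozen, token features from a chosen intermediate block are averaged, and only a linear softmax head is trained.

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, backbone, feature_dim, num_classes=1000):
        super().__init__()
        self.backbone = backbone                  # pretrained model returning the chosen block's token features; kept frozen
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Linear(feature_dim, num_classes)   # the only learnable parameters

    def forward(self, image_tokens):
        feats = self.backbone(image_tokens)       # (batch, seq_len, feature_dim)
        pooled = feats.mean(dim=1)                # average over the token sequence
        return self.head(pooled)                  # class logits for ImageNet classification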

Experimental Results
We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 CloudTPUv4 cores. All models are trained with an input image resolution of 256×256. On top of the pre-learned ViT-VQGAN image quantizer, we train Transformer models for unconditional and class-conditioned image synthesis and compare with previous work.

We measure the performance of our proposed methods for class-conditioned image synthesis and unsupervised representation learning on the widely used ImageNet benchmark. In the table below we demonstrate the class-conditioned image synthesis performance measured by the Fréchet Inception Distance (FID). Compared to prior work, VIM improves the FID to 3.07 (lower is better), a relative improvement of 58.6% over the VQGAN model (FID 7.35). VIM also improves the capacity for image understanding, as indicated by the Inception Score (IS), which goes from 188.6 to 227.4, a 20.6% improvement relative to VQGAN.

Model                Acceptance Rate   FID     IS
Validation data      1.0               1.62    235.0
DCTransformer        1.0               36.5    N/A
BigGAN               1.0               7.53    168.6
BigGAN-deep          1.0               6.84    203.6
IDDPM                1.0               12.3    N/A
ADM-G, 1.0 guid.     1.0               4.59    186.7
VQVAE-2              1.0               ~31     ~45
VQGAN                1.0               17.04   70.6
VQGAN                0.5               10.26   125.5
VQGAN                0.25              7.35    188.6
ViT-VQGAN (Ours)     1.0               4.17    175.1
ViT-VQGAN (Ours)     0.5               3.04    227.4
Fréchet Inception Distance (FID) comparison between different models for class-conditional image synthesis and Inception Score (IS) for image understanding, both on ImageNet with resolution 256×256. The acceptance rate shows results filtered by a ResNet-101 classification model, similar to the process in VQGAN.

After training a generative model, we test the learned image representations by fine-tuning a linear layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Our model outperforms previous generative models on the image understanding task, improving classification accuracy through linear probing (i.e., training a single linear classification layer, while keeping the rest of the model frozen) from 60.3% (iGPT-L) to 73.2%. These results showcase VIM’s strong generation results as well as image representation learning abilities.

Conclusion
We propose Vector-quantized Image Modeling (VIM), which pretrains a Transformer to predict image tokens autoregressively, where discrete image tokens are produced from improved ViT-VQGAN image quantizers. With our proposed improvements on image quantization, we demonstrate superior results on both image generation and understanding. We hope our results can inspire future work towards more unified approaches for image generation and understanding.

Acknowledgements
We would like to thank Xin Li, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu for the preparation of the VIM paper. We thank Wei Han, Yuan Cao, Jiquan Ngiam‎, Vijay Vasudevan, Zhifeng Chen and Claire Cui for helpful discussions and feedback, and others on the Google Research and Brain Team for support throughout this project.


What’s new in TensorFlow 2.9?

Posted by Goldie Gadde and Douglas Yarrington for the TensorFlow team

TensorFlow 2.9 has been released! Highlights include performance improvements with oneDNN, and the release of DTensor, a new API for model distribution that can be used to seamlessly move from data parallelism to model parallelism.

We’ve also made improvements to the core library, including Eigen and tf.function unification, deterministic behavior, and new support for Windows’ WSL2. Finally, we’re releasing new experimental APIs for tf.function retracing and Keras Optimizers. Let’s take a look at these new and improved features.

Improved CPU performance: oneDNN by default

We have worked with Intel to integrate the oneDNN performance library with TensorFlow to achieve top performance on Intel CPUs. Since TensorFlow 2.5, TensorFlow has had experimental support for oneDNN, which could provide up to a 4x performance improvement. In TensorFlow 2.9, we are turning on oneDNN optimizations by default on Linux x86 packages and for CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, and others, which are found on Intel Cascade Lake and newer CPUs.

Users running TensorFlow with oneDNN optimizations enabled might observe slightly different numerical results from when the optimizations are off. This is because floating-point round-off approaches and order differ, and can create slight errors. If this causes issues for you, turn the optimizations off by setting TF_ENABLE_ONEDNN_OPTS=0 before running your TensorFlow programs. To enable or re-enable them, set TF_ENABLE_ONEDNN_OPTS=1 before running your TensorFlow program. To verify that the optimizations are on, look for a message beginning with "oneDNN custom operations are on" in your program log. We welcome feedback on GitHub and the TensorFlow Forum.
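
For example, one way to toggle the flag from Python is to set the environment variable before TensorFlow is imported (setting it afterwards has no effect). This is a minimal sketch:

```
import os
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"   # "1" to re-enable the oneDNN optimizations

import tensorflow as tf   # must be imported after the variable is set
print(tf.__version__)
```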

Model parallelism with DTensor

DTensor is a new TensorFlow API for distributed model processing that allows models to seamlessly move from data parallelism to single program multiple data (SPMD) based model parallelism, including spatial partitioning. This means you have tools to easily train models where the model weights or inputs are so large they don’t fit on a single device. (If you are familiar with Mesh TensorFlow in TF1, DTensor serves a similar purpose.)

DTensor is designed with the following principles at its core:

  • A device-agnostic API: This allows the same model code to be used on CPU, GPU, or TPU, including models partitioned across device types.
  • Multi-client execution: Removes the coordinator and leaves each task to drive its locally attached devices, allowing scaling a model with no impact to startup time.
  • A global perspective vs. per-replica: Traditionally with TensorFlow, distributed model code is written around replicas, but with DTensor, model code is written from the global perspective and per replica code is generated and run by the DTensor runtime. Among other things, this means no uncertainty about whether batch normalization is happening at the global level or the per replica level.

We have developed several introductory tutorials on DTensor, from DTensor concepts to training DTensor ML models with Keras.
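
As a taste of the programming model, here is a minimal sketch using the public tf.experimental.dtensor API; the 8-way logical CPU mesh is just an assumption so the example can run on a single machine.

```
import tensorflow as tf
from tensorflow.experimental import dtensor

# Expose 8 logical CPU devices so this runs on a single host.
phys_cpu = tf.config.list_physical_devices("CPU")[0]
tf.config.set_logical_device_configuration(
    phys_cpu, [tf.config.LogicalDeviceConfiguration()] * 8)

# A 1-D mesh named "batch" spanning the 8 logical devices.
mesh = dtensor.create_mesh([("batch", 8)], devices=[f"CPU:{i}" for i in range(8)])

# Shard the first tensor dimension across "batch" and replicate the second.
layout = dtensor.Layout(["batch", dtensor.UNSHARDED], mesh)

# Create a tensor with that layout; DTensor places one shard per device.
x = dtensor.call_with_layout(tf.zeros, layout, shape=(8, 4))
print(x)
```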

TraceType for tf.function

We have revamped the way tf.function retraces to make it simpler, predictable, and configurable.

All arguments of tf.function are assigned a tf.types.experimental.TraceType. Custom user classes can declare a TraceType using the Tracing Protocol (tf.types.experimental.SupportsTracingProtocol).

The TraceType system makes it easy to understand retracing rules. For example, subtyping rules indicate what type of arguments can be used with particular function traces. Subtyping also explains how different specific shapes are joined into a generic shape that is their supertype, to reduce the number of traces for a function.

To learn more, see the new APIs for tf.types.experimental.TraceType, tf.types.experimental.SupportsTracingProtocol, and the reduce_retracing parameter of tf.function.
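
As a small, hedged example of the reduce_retracing parameter (the comments describe the intended effect rather than measured output):

```
import tensorflow as tf

@tf.function(reduce_retracing=True)
def double(x):
    return x * 2

double(tf.constant([1.0, 2.0]))             # first call creates a trace for this shape
double(tf.constant([1.0, 2.0, 3.0]))        # may retrace once with a relaxed (generic) shape
double(tf.constant([1.0, 2.0, 3.0, 4.0]))   # further shapes reuse the relaxed trace instead of retracing
```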

Support for WSL2

The Windows Subsystem for Linux lets developers run a Linux environment directly on Windows, without the overhead of a traditional virtual machine or dual boot setup. TensorFlow now supports WSL2 out of the box, including GPU acceleration. Please see the documentation for more details about the requirements and how to install WSL2 on Windows.

Deterministic behavior

The API tf.config.experimental.enable_op_determinism makes TensorFlow ops deterministic.

Determinism means that if you run an op multiple times with the same inputs, the op returns the exact same outputs every time. This is useful for debugging models, and if you train your model from scratch several times with determinism, your model weights will be the same every time. Normally, many ops are non-deterministic due to the use of threads within ops which can add floating-point numbers in a nondeterministic order.

TensorFlow 2.8 introduced an API to make ops deterministic, and TensorFlow 2.9 improved determinism performance in tf.data in some cases. If you want your TensorFlow models to run deterministically, just add the following to the start of your program:

```
tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()
```

The first line sets the random seed for Python, NumPy, and TensorFlow, which is necessary for determinism. The second line makes each TensorFlow op deterministic. Note that determinism in general comes at the expense of lower performance and so your model may run slower when op determinism is enabled.

Optimized Training with Keras

In TensorFlow 2.9, we are releasing a new experimental version of the Keras Optimizer API, tf.keras.optimizers.experimental. The API provides a more unified and expanded catalog of built-in optimizers which can be more easily customized and extended.

In a future release, tf.keras.optimizers.experimental.Optimizer (and subclasses) will replace tf.keras.optimizers.Optimizer (and subclasses), which means that workflows using the legacy Keras optimizer will automatically switch to the new optimizer. The current (legacy) tf.keras.optimizers.* API will still be accessible via tf.keras.optimizers.legacy.*, such as tf.keras.optimizers.legacy.Adam.
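
As a minimal sketch of using the new namespace (the toy model is purely for illustration):

```
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.experimental.Adam(learning_rate=1e-3)  # new experimental optimizer
model.compile(optimizer=optimizer, loss="mse")
```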

Here are some highlights of the new optimizer class:

  • Incrementally faster training for some models.
  • Easier to write customized optimizers.
  • Built-in support for moving average of model weights (“Polyak averaging”).

Most users won’t need to take any action. But if you have an advanced workflow that falls into one of the following cases, please make the corresponding changes:

Use Case 1: You implement a customized optimizer based on the Keras optimizer

In this case, first check whether it is possible to change your dependency to tf.keras.optimizers.experimental.Optimizer. If for any reason you decide to stay with the old optimizer (we discourage it), you can change your optimizer to tf.keras.optimizers.legacy.Optimizer to avoid being automatically switched to the new optimizer in a later TensorFlow version.

Use Case 2: Your work depends on third-party Keras-based optimizers (such as tensorflow_addons)

Your work should run successfully as long as the library continues to support the specific optimizer. However, if the library maintainers don’t take action to accommodate the Keras optimizer change, your work may error out. So stay tuned for the third-party library’s announcements, and file a bug with the Keras team if your work breaks due to an optimizer malfunction.

Use Case 3: Your work is based on TF1

First of all, please try migrating to TF2. It is worth it, and may be easier than you think! If for any reason migration is not going to happen soon, replace tf.keras.optimizers.XXX with tf.keras.optimizers.legacy.XXX to avoid being automatically switched to the new optimizer.

Use Case 4: Your work has customized gradient aggregation logic

Usually this means you are doing gradient aggregation outside the optimizer and calling apply_gradients() with experimental_aggregate_gradients=False. We changed the argument name, so please change your optimizer to tf.keras.optimizers.experimental.Optimizer and set skip_gradients_aggregation=True. If this errors out after the change, please file a bug with the Keras team.

Use Case 5: Your work has direct calls to deprecated optimizer public APIs

Please check whether your method call has an equivalent in the new optimizer API; if so, change your optimizer to tf.keras.optimizers.experimental.Optimizer. If for any reason you want to keep using the old optimizer, change your optimizer to tf.keras.optimizers.legacy.Optimizer.

Next steps

Check out the release notes for more information. To stay up to date, you can read the TensorFlow blog, follow twitter.com/tensorflow, or subscribe to youtube.com/tensorflow. If you’ve built something you’d like to share, please submit it for our Community Spotlight at goo.gle/TFCS. For feedback, please file an issue on GitHub or post to the TensorFlow Forum. Thank you!


Introducing Accelerated PyTorch Training on Mac

In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac. Until now, PyTorch training on Mac only leveraged the CPU, but with the upcoming PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training. This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac.

Metal Acceleration

Accelerated GPU training is enabled using Apple’s Metal Performance Shaders (MPS) as a backend for PyTorch. The MPS backend extends the PyTorch framework, providing scripts and capabilities to set up and run operations on Mac. MPS optimizes compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family. The new device maps machine learning computational graphs and primitives on the MPS Graph framework and tuned kernels provided by MPS.

Training Benefits on Apple Silicon

Every Apple silicon Mac has a unified memory architecture, providing the GPU with direct access to the full memory store. This makes Mac a great platform for machine learning, enabling users to train larger networks or batch sizes locally. This reduces costs associated with cloud-based development or the need for additional local GPUs. The Unified Memory architecture also reduces data retrieval latency, improving end-to-end performance.

In the graphs below, you can see the performance speedup from accelerated GPU training and evaluation compared to the CPU baseline:

Testing conducted by Apple in April 2022 using production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU, 128GB of RAM, and 2TB SSD. Tested with macOS Monterey 12.3, prerelease PyTorch 1.12, ResNet50 (batch size=128), HuggingFace BERT (batch size=64), and VGG16 (batch size=64). Performance tests are conducted using specific computer systems and reflect the approximate performance of Mac Studio.

Getting Started

To get started, just install the latest Preview (Nightly) build on your Apple silicon Mac running macOS 12.3 or later with a native version (arm64) of Python.
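
Once installed, selecting the new device is a small change; here is a hedged sketch, assuming PyTorch 1.12 or a recent nightly with the MPS backend:

import torch

# Fall back to CPU when the MPS backend isn't available (requires macOS 12.3+ on Apple silicon).
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
x = torch.randn(64, 3, 224, 224, device=device)
y = model(x)          # runs on the Apple GPU via Metal when the MPS device is selected
print(y.device)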

You can also learn more about Metal and MPS on Apple’s Metal page.


Living better with algorithms

Laboratory for Information and Decision Systems (LIDS) student Sarah Cen remembers the lecture that sent her down the track to an upstream question.

At a talk on ethical artificial intelligence, the speaker brought up a variation on the famous trolley problem, which outlines a philosophical choice between two undesirable outcomes.

The speaker’s scenario: Say a self-driving car is traveling down a narrow alley with an elderly woman walking on one side and a small child on the other, and no way to thread between both without a fatality. Who should the car hit?

Then the speaker said: Let’s take a step back. Is this the question we should even be asking?

That’s when things clicked for Cen. Instead of considering the point of impact, a self-driving car could have avoided choosing between two bad outcomes by making a decision earlier on — the speaker pointed out that, when entering the alley, the car could have determined that the space was narrow and slowed to a speed that would keep everyone safe.

Recognizing that today’s AI safety approaches often resemble the trolley problem, focusing on downstream regulation such as liability after someone is left with no good choices, Cen wondered: What if we could design better upstream and downstream safeguards to such problems? This question has informed much of Cen’s work.

“Engineering systems are not divorced from the social systems on which they intervene,” Cen says. Ignoring this fact risks creating tools that fail to be useful when deployed or, more worryingly, that are harmful.

Cen arrived at LIDS in 2018 via a slightly roundabout route. She first got a taste for research during her undergraduate degree at Princeton University, where she majored in mechanical engineering. For her master’s degree, she changed course, working on radar solutions in mobile robotics (primarily for self-driving cars) at Oxford University. There, she developed an interest in AI algorithms, curious about when and why they misbehave. So, she came to MIT and LIDS for her doctoral research, working with Professor Devavrat Shah in the Department of Electrical Engineering and Computer Science, for a stronger theoretical grounding in information systems.

Auditing social media algorithms

Together with Shah and other collaborators, Cen has worked on a wide range of projects during her time at LIDS, many of which tie directly to her interest in the interactions between humans and computational systems. In one such project, Cen studies options for regulating social media. Her recent work provides a method for translating human-readable regulations into implementable audits.

To get a sense of what this means, suppose that regulators require that any public health content — for example, on vaccines — not be vastly different for politically left- and right-leaning users. How should auditors check that a social media platform complies with this regulation? Can a platform be made to comply with the regulation without damaging its bottom line? And how does compliance affect the actual content that users do see?

Designing an auditing procedure is difficult in large part because there are so many stakeholders when it comes to social media. Auditors have to inspect the algorithm without accessing sensitive user data. They also have to work around tricky trade secrets, which can prevent them from getting a close look at the very algorithm that they are auditing because these algorithms are legally protected. Other considerations come into play as well, such as balancing the removal of misinformation with the protection of free speech.

To meet these challenges, Cen and Shah developed an auditing procedure that does not need more than black-box access to the social media algorithm (which respects trade secrets), does not remove content (which avoids issues of censorship), and does not require access to users (which preserves users’ privacy).

In their design process, the team also analyzed the properties of their auditing procedure, finding that it ensures a desirable property they call decision robustness. As good news for the platform, they show that a platform can pass the audit without sacrificing profits. Interestingly, they also found the audit naturally incentivizes the platform to show users diverse content, which is known to help reduce the spread of misinformation, counteract echo chambers, and more.

Who gets good outcomes and who gets bad ones?

In another line of research, Cen looks at whether people can receive good long-term outcomes when they not only compete for resources, but also don’t know upfront what resources are best for them.

Some platforms, such as job-search platforms or ride-sharing apps, are part of what is called a matching market, which uses an algorithm to match one set of individuals (such as workers or riders) with another (such as employers or drivers). In many cases, individuals have matching preferences that they learn through trial and error. In labor markets, for example, workers learn their preferences about what kinds of jobs they want, and employers learn their preferences about the qualifications they seek from workers.

But learning can be disrupted by competition. If workers with a particular background are repeatedly denied jobs in tech because of high competition for tech jobs, for instance, they may never get the knowledge they need to make an informed decision about whether they want to work in tech. Similarly, tech employers may never see and learn what these workers could do if they were hired.

Cen’s work examines this interaction between learning and competition, studying whether it is possible for individuals on both sides of the matching market to walk away happy.

Modeling such matching markets, Cen and Shah found that it is indeed possible to get to a stable outcome (workers aren’t incentivized to leave the matching market), with low regret (workers are happy with their long-term outcomes), fairness (happiness is evenly distributed), and high social welfare.

Interestingly, it’s not obvious that it’s possible to get stability, low regret, fairness, and high social welfare simultaneously. So another important aspect of the research was uncovering when it is possible to achieve all four criteria at once and exploring the implications of those conditions.

What is the effect of X on Y?

For the next few years, though, Cen plans to work on a new project, studying how to quantify the effect of an action X on an outcome Y when it’s expensive — or impossible — to measure this effect, focusing in particular on systems that have complex social behaviors.

For instance, when Covid-19 cases surged in the pandemic, many cities had to decide what restrictions to adopt, such as mask mandates, business closures, or stay-home orders. They had to act fast and balance public health with community and business needs, public spending, and a host of other considerations.

Typically, in order to estimate the effect of restrictions on the rate of infection, one might compare the rates of infection in areas that underwent different interventions. If one county has a mask mandate while its neighboring county does not, one might think comparing the counties’ infection rates would reveal the effectiveness of mask mandates. 

But of course, no county exists in a vacuum. If, for instance, people from both counties gather to watch a football game in the maskless county every week, people from both counties mix. These complex interactions matter, and Sarah plans to study questions of cause and effect in such settings.

“We’re interested in how decisions or interventions affect an outcome of interest, such as how criminal justice reform affects incarceration rates or how an ad campaign might change the public’s behaviors,” Cen says.

Cen has also applied the principles of promoting inclusivity to her work in the MIT community.

As one of three co-presidents of the Graduate Women in MIT EECS student group, she helped organize the inaugural GW6 research summit featuring the research of women graduate students — not only to showcase positive role models to students, but also to highlight the many successful graduate women at MIT who are not to be underestimated.

Whether in computing or in the community, a system taking steps to address bias is one that enjoys legitimacy and trust, Cen says. “Accountability, legitimacy, trust — these principles play crucial roles in society and, ultimately, will determine which systems endure with time.” 


Contextual Rephrasing in Google Assistant

When people converse with one another, context and references play a critical role in driving their conversation more efficiently. For instance, if one asks the question “Who wrote Romeo and Juliet?” and, after receiving an answer, asks “Where was he born?”, it is clear that ‘he’ is referring to William Shakespeare without the need to explicitly mention him. Or if someone mentions “python” in a sentence, one can use the context from the conversation to determine whether they are referring to a type of snake or a computer language. If a virtual assistant cannot robustly handle context and references, users would be required to adapt to the limitation of the technology by repeating previously shared contextual information in their follow-up queries to ensure that the assistant understands their requests and can provide relevant answers.

In this post, we present a technology currently deployed on Google Assistant that allows users to speak in a natural manner when referencing context that was defined in previous queries and answers. The technology, based on the latest machine learning (ML) advances, rephrases a user’s follow-up query to explicitly mention the missing contextual information, thus enabling it to be answered as a stand-alone query. While Assistant considers many types of context for interpreting the user input, in this post we are focusing on short-term conversation history.

Context Handling by Rephrasing
One of the approaches taken by Assistant to understand contextual queries is to detect if an input utterance is referring to previous context and then rephrase it internally to explicitly include the missing information. Following on from the previous example in which the user asked who wrote Romeo and Juliet, one may ask follow-up questions like “When?”. Assistant recognizes that this question is referring to both the subject (Romeo and Juliet) and answer from the previous query (William Shakespeare) and can rephrase “When?” to “When did William Shakespeare write Romeo and Juliet?”

While there are other ways to handle context, for instance, by applying rules directly to symbolic representations of the meaning of queries, like intents and arguments, the advantage of the rephrasing approach is that it operates horizontally at the string level across any query answering, parsing, or action fulfillment module.

Conversation on a smart display device, where Assistant understands multiple contextual follow-up queries, allowing the user to have a more natural conversation. The phrases appearing at the bottom of the display are suggestions for follow-up questions that the user can select. However, the user can still ask different questions.

A Wide Variety of Contextual Queries
The natural language processing field, traditionally, has not put much emphasis on a general approach to context, focusing on the understanding of stand-alone queries that are fully specified. Accurately incorporating context is a challenging problem, especially when considering the large variety of contextual query types. The table below contains example conversations that illustrate query variability and some of the many contextual challenges that Assistant’s rephrasing method can resolve (e.g., differentiating between referential and non-referential cases or identifying what context a query is referencing). We demonstrate how Assistant is now able to rephrase follow-up queries, adding contextual information before providing an answer.

System Architecture
At a high level, the rephrasing system generates rephrasing candidates by using different types of candidate generators. Each rephrasing candidate is then scored based on a number of signals, and the one with the highest score is selected.
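
In pseudocode terms, the flow can be sketched roughly as follows; all names are illustrative, not Assistant internals:

# Purely illustrative sketch: run every candidate generator, score each candidate, keep the best.
def rephrase(query, context, generators, scorer):
    candidates = []
    for generate in generators:
        candidates.extend(generate(query, context))
    if not candidates:
        return query   # nothing referential detected; answer the query as-is
    return max(candidates, key=lambda candidate: scorer(query, context, candidate))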

High level architecture of Google Assistant contextual rephraser.

Candidate Generation
To generate rephrasing candidates we use a hybrid approach that applies different techniques, which we classify into three categories:

  1. Generators based on the analysis of the linguistic structure of the queries use grammatical and morphological rules to perform specific operations — for instance, the replacement of pronouns or other types of referential phrases with antecedents from the context.
  2. Generators based on query statistics combine key terms from the current query and its context to create candidates that match popular queries from historical data or common query patterns.
  3. Generators based on Transformer technologies, such as MUM, learn to generate sequences of words according to a number of training samples. LaserTagger and FELIX are technologies suitable for tasks with high overlap between the input and output texts, are very fast at inference time, and are not vulnerable to hallucination (i.e., generating text that is not related to the input texts). Once presented with a query and its context, they can generate a sequence of text edits to transform the input queries into a rephrasing candidate by indicating which portions of the context should be preserved and which words should be modified.

Candidate Scoring
We extract a number of signals for each rephrasing candidate and use an ML model to select the most promising candidate. Some of the signals depend only on the current query and its context. For example, is the topic of the current query similar to the topic of the previous query? Or, is the current query a good stand-alone query or does it look incomplete? Other signals depend on the candidate itself: How much of the information of the context does the candidate preserve? Is the candidate well-formed from a linguistic point of view? Etc.

Recently, new signals generated by BERT and MUM models have significantly improved the performance of the ranker, fixing about one-third of the recall headroom while minimizing false positives on query sequences that are not contextual (and therefore do not require a rephrasing).

Example conversation on a phone where Assistant understands a sequence of contextual queries.

Conclusion
The solution described here attempts to resolve contextual queries by rephrasing them in order to make them fully answerable in a stand-alone manner, i.e., without having to relate to other information during the fulfillment phase. The benefit of this approach is that it is agnostic to the mechanisms that would fulfill the query, thus making it usable as a horizontal layer to be deployed before any further processing.

Given the variety of contexts naturally used in human languages, we adopted a hybrid approach that combines linguistic rules, large amounts of historic data through logs, and ML models based on state-of-the-art Transformer approaches. By generating a number of rephrasing candidates for each query and its context, and then scoring and ranking them using a variety of signals, Assistant can rephrase and thus correctly interpret most contextual queries. As Assistant can handle most types of linguistic references, we are empowering users to have more natural conversations. To make such multi-turn conversations even less cumbersome, Assistant users can turn on Continued Conversation mode to enable asking follow-up queries without the need to repeat “Hey Google” between each query. We are also using this technology in other virtual assistant settings, for instance, interpreting context from something shown on a screen or playing on a speaker.

Acknowledgements
This post reflects the combined work of Aliaksei Severyn, André Farias, Cheng-Chun Lee, Florian Thöle, Gabriel Carvajal, Gyorgy Gyepesi, Julien Cretin, Liana Marinescu, Martin Bölle, Patrick Siegler, Sebastian Krause, Victor Ähdel, Victoria Fossum, Vincent Zhao. We also thank Amar Subramanya, Dave Orr, Yury Pinsky for helpful discussions and support.


Can artificial intelligence overcome the challenges of the health care system?

Even as rapid improvements in artificial intelligence have led to speculation over significant changes in the health care landscape, the adoption of AI in health care has been minimal. A 2020 survey by Brookings, for example, found that less than 1 percent of job postings in health care required AI-related skills.

The Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic), a research center within the MIT Schwarzman College of Computing, recently hosted the MITxMGB AI Cures Conference in an effort to accelerate the adoption of clinical AI tools by creating new opportunities for collaboration between researchers and physicians focused on improving care for diverse patient populations.

Previously held virtually, the AI Cures Conference returned to in-person attendance at MIT’s Samberg Conference Center on the morning of April 25, welcoming over 300 attendees, primarily researchers and physicians from MIT and Mass General Brigham (MGB).

MIT President L. Rafael Reif began the event by welcoming attendees and speaking to the “transformative capacity of artificial intelligence and its ability to detect, in a dark river of swirling data, the brilliant patterns of meaning that we could never see otherwise.” MGB’s president and CEO Anne Klibanski followed up by lauding the joint partnership between the two institutions and noting that the collaboration could “have a real impact on patients’ lives” and “help to eliminate some of the barriers to information-sharing.”

Domestically, about $20 million in subcontract work currently takes place between MIT and MGB. MGB’s chief academic officer and AI Cures co-chair Ravi Thadhani thinks that five times that amount would be necessary in order to do more transformative work. “We could certainly be doing more,” Thadhani said. “The conference … just scratched the surface of a relationship between a leading university and a leading health-care system.”

MIT Professor and AI Cures Co-Chair Regina Barzilay echoed similar sentiments during the conference. “If we’re going to take 30 years to take all the algorithms and translate them into patient care, we’ll be losing patient lives,” she said. “I hope the main impact of this conference is finding a way to translate it into a clinical setting to benefit patients.”

This year’s event featured 25 speakers and two panels, with many of the speakers addressing the obstacles facing the mainstream deployment of AI in clinical settings, from fairness and clinical validation to regulatory hurdles and translation issues using AI tools. 

On the speaker list, of note was the appearance of Amir Khan, a senior fellow from the U.S. Food and Drug Administration (FDA), who fielded a number of questions from curious researchers and clinicians on the FDA’s ongoing efforts and challenges in regulating AI in health care.

The conference also covered many of the impressive advancements AI made in the past several years: Lecia Sequist, a lung cancer oncologist from MGB, spoke about her collaborative work with MGB radiologist Florian Fintelmann and Barzilay to develop an AI algorithm that could detect lung cancer up to six years in advance. MIT Professor Dina Katabi presented with MGB’s doctors Ipsit Vahia and Aleksandar Videnovic on an AI device that could detect the presence of Parkinson’s disease simply by monitoring a person’s breathing patterns while asleep. “It is an honor to collaborate with Professor Katabi,” Videnovic said during the presentation.

MIT Assistant Professor Marzyeh Ghassemi, whose presentation concerned designing machine learning processes for more equitable health systems, found the longer-range perspectives shared by the speakers during the first panel on AI changing clinical science compelling.

“What I really liked about that panel was the emphasis on how relevant technology and AI has become in clinical science,” Ghassemi says. “You heard some panel members [Eliezer Van Allen, Najat Khan, Isaac Kohane, Peter Szolovits] say that they used to be the only person at a conference from their university that was focused on AI and ML [machine learning], and now we’re in a space where we have a miniature conference with posters just with people from MIT.”

The 88 posters accepted to AI Cures were on display for attendees to peruse during the lunch break. The presented research spanned different areas of focus from clinical AI and AI for biology to AI-powered systems and others. 

“I was really impressed with the breadth of work going on in this space,” Collin Stultz, a professor at MIT, says. Stultz also spoke at AI Cures, focusing primarily on the risks of interpretability and explainability when using AI tools in a clinical setting, using cardiovascular care as an example to show how algorithms could potentially mislead clinicians, with grave consequences for patients.

“There are a growing number of failures in this space where companies or algorithms strive to be the most accurate, but do not take into consideration how the clinician views the algorithm and their likelihood of using it,” Stultz said. “This is about what the patient deserves and how the clinician is able to explain and justify their decision-making to the patient.” 

Phil Sharp, MIT Institute Professor and chair of the advisory board for Jameel Clinic, found the conference energizing and thought that the in-person interactions were crucial to gaining insight and motivation, unmatched by many conferences that are still being hosted virtually. 

“The broad participation by students and leaders and members of the community indicate that there’s an awareness that this is a tremendous opportunity and a tremendous need,” Sharp says. He pointed out that AI and machine learning are being used to predict the structures of “almost everything” from protein structures to drug efficacy. “It says to young people, watch out, there might be a machine revolution coming.” 
