Fine-tune text-to-image Stable Diffusion models with Amazon SageMaker JumpStart
In November 2022, we announced that AWS customers can generate images from text with Stable Diffusion models in Amazon SageMaker JumpStart. Stable Diffusion is a deep learning model that allows you to generate realistic, high-quality images and stunning art in just a few seconds. Although creating impressive images has uses in industries ranging from art to NFTs and beyond, today we also expect AI to be personalizable. We are therefore announcing that you can personalize the image generation model to your use case by fine-tuning it on your custom dataset in Amazon SageMaker JumpStart. This can be useful when creating art, logos, custom designs, NFTs, and so on, as well as for fun applications such as generating custom AI images of your pets or avatars of yourself.
In this post, we provide an overview of how to fine-tune the Stable Diffusion model in two ways: programmatically through JumpStart APIs available in the SageMaker Python SDK, and through JumpStart’s user interface (UI) in Amazon SageMaker Studio. We also discuss how to make design choices, including dataset quality, size of the training dataset, choice of hyperparameter values, and applicability to multiple datasets. Finally, we discuss the over 80 publicly available fine-tuned models with different input languages and styles recently added in JumpStart.
Stable Diffusion and transfer learning
Stable Diffusion is a text-to-image model that enables you to create photorealistic images from just a text prompt. A diffusion model trains by learning to remove noise that was added to a real image. This de-noising process generates a realistic image. These models can also generate images from text alone by conditioning the generation process on the text. For instance, Stable Diffusion is a latent diffusion model, in which the model learns to recognize shapes in a pure noise image and gradually brings those shapes into focus when they match the words in the input text. The text must first be embedded into a latent space using a language model. Then, a series of noise addition and noise removal operations are performed in the latent space with a U-Net architecture. Finally, the de-noised output is decoded into the pixel space.
In machine learning (ML), the ability to transfer the knowledge learned in one domain to another is called transfer learning. You can use transfer learning to produce accurate models on your smaller datasets, with much lower training costs than the ones involved in training the original model. With transfer learning, you can fine-tune the Stable Diffusion model on your own dataset with as few as five images. For example, the images on the left are training images of a dog named Doppler used to fine-tune the model; the middle and right images were generated by the fine-tuned model when asked to depict Doppler on the beach and as a pencil sketch.
On the left are images of a white chair used to fine-tune the model and an image of the chair in red generated by the fine-tuned model. On the right are images of an ottoman used to fine-tune the model and an image of a cat sitting on an ottoman.
Fine-tuning large models like Stable Diffusion usually requires you to provide training scripts. These scripts can run into a host of issues, including out-of-memory errors, payload size limits, and more. Furthermore, you have to run end-to-end tests to make sure that the script, the model, and the desired instance work together in an efficient manner. JumpStart simplifies this process by providing ready-to-use scripts that have been robustly tested. The JumpStart fine-tuning script for Stable Diffusion models builds on the fine-tuning script from DreamBooth. You can access these scripts with a single click through the Studio UI or with very few lines of code through the JumpStart APIs.
Note that by using the Stable Diffusion model, you agree to the CreativeML Open RAIL++-M License.
Use JumpStart programmatically with the SageMaker SDK
This section describes how to train and deploy the model with the SageMaker Python SDK. We choose an appropriate pre-trained model in JumpStart, train this model with a SageMaker training job, and deploy the trained model to a SageMaker endpoint. Furthermore, we run inference on the deployed endpoint, all using the SageMaker Python SDK. The following examples contain code snippets. For the full code with all of the steps in this demo, see the Introduction to JumpStart – Text to Image example notebook.
Train and fine-tune the Stable Diffusion model
Each model is identified by a unique model_id. The following code shows how to fine-tune a Stable Diffusion 2.1 base model identified by model_id model-txt2img-stabilityai-stable-diffusion-v2-1-base on a custom training dataset. For a full list of model_id values and which models are fine-tunable, refer to Built-in Algorithms with pre-trained Model Table. For each model_id, in order to launch a SageMaker training job through the Estimator class of the SageMaker Python SDK, you need to fetch the Docker image URI, training script URI, and pre-trained model URI through the utility functions provided in SageMaker. The training script URI contains all the necessary code for data processing, loading the pre-trained model, model training, and saving the trained model for inference. The pre-trained model URI contains the pre-trained model architecture definition and the model parameters, and is specific to the particular model. The pre-trained model tarballs have been pre-downloaded from Hugging Face and saved with the appropriate model signature in Amazon Simple Storage Service (Amazon S3) buckets, such that the training job runs in network isolation. See the following code:
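The following is a minimal sketch of fetching these artifacts with the SageMaker Python SDK utility functions. The instance type shown is an assumption for illustration; the full version appears in the example notebook.

```python
from sagemaker import image_uris, model_uris, script_uris

model_id, model_version = "model-txt2img-stabilityai-stable-diffusion-v2-1-base", "*"
training_instance_type = "ml.g5.2xlarge"  # assumed instance type for illustration

# Docker image URI for the JumpStart training container
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Training script bundle (data processing, model loading, training, saving)
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

# Pre-trained model artifacts (architecture definition and parameters)
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)
```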
With these model-specific training artifacts, you can construct an object of the Estimator class:
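A hedged sketch of the Estimator construction follows; the S3 paths, execution role, and entry point name are placeholders or assumptions drawn from the JumpStart example notebooks rather than guaranteed values.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

aws_role = sagemaker.get_execution_role()                      # your SageMaker execution role
s3_output_location = "s3://<your-bucket>/sd-fine-tune/output"  # placeholder output path

sd_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",  # entry point name assumed from the JumpStart notebooks
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,     # retrieved as shown in the Hyperparameters section below
    output_path=s3_output_location,
    base_job_name=name_from_base(f"jumpstart-{model_id}"),
)

# Launch the training job; the S3 prefix must end with a trailing "/"
sd_estimator.fit({"training": "s3://<your-bucket>/<input_directory>/"}, logs=True)
```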
Training dataset
The following are the instructions for how the training data should be formatted:
- Input – A directory containing the instance images and a dataset_info.json file, with the following configuration:
  - Images may be of .png, .jpg, or .jpeg format.
  - The dataset_info.json file must be of the format {'instance_prompt':<<instance_prompt>>}.
- Output – A trained model that can be deployed for inference.

The S3 path should look like s3://bucket_name/input_directory/. Note that the trailing / is required.
The following is an example format of the training data:
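As an illustration (bucket and file names are placeholders), the input directory could look like the following:

```
s3://<bucket_name>/<input_directory>/
    instance_image_1.jpg
    instance_image_2.jpg
    ...
    instance_image_n.jpg
    dataset_info.json
```

where dataset_info.json contains, for example:

```json
{"instance_prompt": "a photo of a riobugger cat"}
```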
For instructions on how to format the data while using prior preservation, refer to the section Prior Preservation in this post.
We provide a default dataset of cat images. It consists of eight images (instance images corresponding to the instance prompt) of a single cat, with no class images. It can be downloaded from GitHub. If you use the default dataset, try the prompt “a photo of a riobugger cat” when running inference in the demo notebook.
License: MIT.
Hyperparameters
Next, for transfer learning on your custom dataset, you might need to change the default values of the training hyperparameters. You can fetch a Python dictionary of these hyperparameters with their default values by calling hyperparameters.retrieve_default, update them as needed, and then pass them to the Estimator class. See the following code:
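A minimal sketch of retrieving and overriding the defaults follows; the specific override shown (max_steps = 400) is just an illustration, matching the value used later in this post.

```python
from sagemaker import hyperparameters as hp

# Retrieve the default hyperparameters for fine-tuning this model
hyperparameters = hp.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override defaults with custom values before passing them to the Estimator
hyperparameters["max_steps"] = "400"
print(hyperparameters)
```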
The following hyperparameters are supported by the fine-tuning algorithm:
- with_prior_preservation – Flag to add prior preservation loss. Prior preservation is a regularizer that avoids overfitting. (Choices: ["True", "False"], default: "False".)
- num_class_images – The minimum number of class images for prior preservation loss. If with_prior_preservation = True and there aren’t enough images already present in class_data_dir, additional images will be sampled with class_prompt. (Values: positive integer, default: 100.)
- epochs – The number of passes that the fine-tuning algorithm takes through the training dataset. (Values: positive integer, default: 20.)
- max_steps – The total number of training steps to perform. If not "None", overrides epochs. (Values: "None" or a string representing an integer, default: "None".)
- batch_size – The number of training examples that are worked through before the model weights are updated. This is also the batch size during class image generation if with_prior_preservation = True. (Values: positive integer, default: 1.)
- learning_rate – The rate at which the model weights are updated after working through each batch of training examples. (Values: positive float, default: 2e-06.)
- prior_loss_weight – The weight of the prior preservation loss. (Values: positive float, default: 1.0.)
- center_crop – Whether to crop the images before resizing to the desired resolution. (Choices: ["True", "False"], default: "False".)
- lr_scheduler – The type of learning rate scheduler. (Choices: ["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"], default: "constant".) For more information, see Learning Rate Schedulers.
- adam_weight_decay – The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the AdamW optimizer. (Value: float, default: 1e-2.)
- adam_beta1 – The beta1 hyperparameter (exponential decay rate for the first moment estimates) for the AdamW optimizer. (Value: float, default: 0.9.)
- adam_beta2 – The beta2 hyperparameter (exponential decay rate for the second moment estimates) for the AdamW optimizer. (Value: float, default: 0.999.)
- adam_epsilon – The epsilon hyperparameter for the AdamW optimizer. It is usually set to a small value to avoid division by 0. (Value: float, default: 1e-8.)
- gradient_accumulation_steps – The number of update steps to accumulate before performing a backward/update pass. (Value: integer, default: 1.)
- max_grad_norm – The maximum gradient norm (for gradient clipping). (Value: float, default: 1.0.)
- seed – Fixes the random state to achieve reproducible results in training. (Value: integer, default: 0.)
Deploy the fine-tuned model
After model training is finished, you can directly deploy the model to a persistent, real-time endpoint. We fetch the required Docker Image URIs and script URIs and deploy the model. See the following code:
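The following is a hedged sketch of the deployment call. The inference instance type, endpoint name, and entry point file name are assumptions drawn from the JumpStart example notebooks rather than guaranteed values.

```python
from sagemaker import image_uris, script_uris
from sagemaker.utils import name_from_base

inference_instance_type = "ml.g4dn.2xlarge"  # assumed; any supported GPU instance type works

# Inference Docker image and inference script bundle for this JumpStart model
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)

# Deploy the fine-tuned estimator to a persistent, real-time endpoint
predictor = sd_estimator.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",   # entry point name assumed from the JumpStart notebooks
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    endpoint_name=name_from_base(f"jumpstart-{model_id}"),
)
```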
On the left are the training images of a cat named riobugger used to fine-tune the model (default parameters except max_steps = 400). In the middle and right are the images generated by the fine-tuned model when asked to predict riobugger’s image on the beach and as a pencil sketch.
For more details on inference, including supported parameters, response format, and so on, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.
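For orientation, here is a hedged sketch of querying the endpoint with boto3. The content type and the generated_image response field are assumptions based on the companion inference post linked above; verify them there before relying on this snippet.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="application/x-text",      # assumed content type for plain-text prompts
    Accept="application/json",
    Body="a photo of a riobugger cat on the beach".encode("utf-8"),
)

payload = json.loads(response["Body"].read())
# The response is assumed to contain the generated image as nested lists of RGB values
image_array = payload["generated_image"]
```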
Access JumpStart through the Studio UI
In this section, we demonstrate how to train and deploy JumpStart models through the Studio UI. The following video shows how to find the pre-trained Stable Diffusion model on JumpStart, train it, and then deploy it. The model page contains valuable information about the model and how to use it. After configuring the SageMaker training instance, choose Train. After the model is trained, you can deploy the trained model by choosing Deploy. After the endpoint is in the “in service” stage, it’s ready to respond to inference requests.
To accelerate the time to inference, JumpStart provides a sample notebook that shows how to run inference on the newly created endpoint. To access the notebook in Studio, choose Open Notebook in the Use Endpoint from Studio section of the model endpoint page.
JumpStart also provides a simple notebook that you can use to fine-tune the Stable Diffusion model and deploy the resulting fine-tuned model, for example to generate fun images of your dog. To access the notebook, search for “Generate Fun images of your dog” in the JumpStart search bar. You can run the notebook with as few as five training images uploaded to the local Studio folder; if you have more than five images, you can upload them as well. The notebook uploads the training images to Amazon S3, trains the model on your dataset, and deploys the resulting model. Training may take about 20 minutes to finish; you can reduce the number of steps to speed up training. The notebook provides some sample prompts to try with the deployed model, but you can try any prompt you like. You can also adapt the notebook to create avatars of yourself or your pets. For instance, instead of your dog, you can upload images of your cat in the first step and then change the prompts from dogs to cats, and the model will generate images of your cat.
Fine-tuning considerations
Training Stable Diffusion models tends to overfit quickly. To get good-quality images, we must find a good balance among the available training hyperparameters, such as the number of training steps and the learning rate. In this section, we show some experimental results and provide guidance on how to set these parameters.
Recommendations
Consider the following recommendations:
- Start with a set of good-quality training images (4–20). If training on human faces, you may need more images.
- Train for 200–400 steps when training on dogs, cats, and other non-human subjects. If training on human faces, you may need more steps. If overfitting happens, reduce the number of steps. If under-fitting happens (the fine-tuned model can’t generate the target subject’s image), increase the number of steps.
- If training on non-human subjects, you may set with_prior_preservation = False because it doesn’t significantly impact performance. For human faces, you may need to set with_prior_preservation = True.
- If setting with_prior_preservation = True, use the ml.g5.2xlarge instance type.
- When training on multiple subjects sequentially, if the subjects are very similar (for example, all dogs), the model retains the last subject and forgets the previous subjects. If the subjects are different (for example, first a cat, then a dog), the model retains both subjects.
- We recommend using a low learning rate and progressively increasing the number of steps until the results are satisfactory.
Training dataset
The quality of the fine-tuned model is directly impacted by the quality of the training images. Therefore, you need to collect high-quality images to get good results. Blurred or low-resolution images will impact the quality of the fine-tuned model. Keep in mind the following additional parameters:
- Number of training images – You may fine-tune the model on as few as four training images. We experimented with training datasets as small as 4 images and as large as 16 images. In both cases, fine-tuning was able to adapt the model to the subject.
- Dataset formats – We tested the fine-tuning algorithm on images of format .png, .jpg, and .jpeg. Other formats may also work.
- Image resolution – Training images may be any resolution. The fine-tuning algorithm will resize all training images before starting fine-tuning. That being said, if you want to have more control over the cropping and resizing of the training images, we recommend resizing the images yourself to the base resolution of the model (in this example, 512×512 pixels).
Experiment settings
In the experiments in this post, we use the default values of the hyperparameters during fine-tuning unless otherwise specified. Furthermore, we use one of the following four datasets:
- Dog1-8 – Dog 1 with 8 images
- Dog1-16 – Dog 1 with 16 images
- Dog2-4 – Dog 2 with 4 images
- Cat-8 – Cat with 8 images
To reduce clutter, we show only one representative image of each dataset in each section, along with the dataset name. You can find the full training sets in the section Appendix: Experiment datasets in this post.
Overfitting
Stable Diffusion models tend to overfit when fine-tuning on a few images. Therefore, you need to select parameters such as epochs, max_steps, and the learning rate carefully. In this section, we use the Dog1-16 dataset.
To evaluate the model’s performance, we evaluate the fine-tuned model for four tasks:
- Can the fine-tuned model generate images of the subject (Doppler dog) in the same setting as it was trained on?
- Observation – Yes it can. It’s worth noting that model performance increases with the number of training steps.
- Can the fine-tuned model generate images of the subject in a different setting than it was trained on? For example, can it generate images of Doppler on a beach?
- Observation – Yes it can. It’s worth noting that model performance increases with the number of training steps up to a certain point. If the model is being trained for too long, however, the model performance degrades as the model tends to overfit.
- Can the fine-tuned model generate images of the class which the training subject belongs to? For example, can it generate an image of a generic dog?
- Observation – As we increase the number of training steps, the model starts to overfit. As a result, it forgets the generic class of a dog and will only produce images related to the subject.
- Can the fine-tuned model generate images of a class or subject not in the training dataset? For example, can it generate an image of a cat?
- Observation – As we increase the number of training steps, the model starts to overfit. As a result, it will only produce images related to the subject, regardless of the class specified.
We fine-tune the model for different numbers of steps (by setting the max_steps hyperparameter), and for each fine-tuned model, we generate images for each of the following four prompts (shown in the following examples from left to right):
- “A photo of a Doppler dog”
- “A photo of a Doppler dog on a beach”
- “A photo of a dog”
- “A photo of a cat”
The following images are from the model trained with 50 steps.
The following model was trained with 100 steps.
We trained the following model with 200 steps.
The following images are from a model trained with 400 steps.
Lastly, the following images are the result of 800 steps.
Train on multiple datasets
While fine-tuning, you may want to fine-tune on multiple subjects and have the fine-tuned model be able to generate images of all the subjects. Unfortunately, JumpStart is currently limited to training on a single subject. You can’t fine-tune the model on multiple subjects at the same time. Furthermore, fine-tuning the model for different subjects sequentially results in the model forgetting the first subject if the subjects are similar.
We consider the following experimentation in this section:
- Fine-tune the model for Subject A.
- Fine-tune the resulting model from Step 1 for Subject B.
- Generate images of Subject A and Subject B using the output model from Step 2.
In the following experiments, we observe that:
- If A is dog 1 and B is dog 2, then all images generated in Step 3 resemble dog 2
- If A is dog 2 and B is dog 1, then all images generated in Step 3 resemble dog 1
- If A is dog 1 and B is cat, then images generated with dog prompts resemble dog 1 and images generated with cat prompts resemble cat
Train on dog 1 and then dog 2
In Step 1, we fine-tune the model for 200 steps on eight images of dog 1. In Step 2, we fine-tune the model further for 200 steps on four images of dog 2.
The following are the images generated by the fine-tuned model at the end of Step 2 for different prompts.
Train on dog 2 and then dog 1
In Step 1, we fine-tune the model for 200 steps on four images of dog 2. In Step 2, we fine-tune the model further for 200 steps on eight images of dog 1.
The following are the images generated by the fine-tuned model at the end of Step 2 with different prompts.
Train on dogs and cats
In Step 1, we fine-tune the model for 200 steps on eight images of a cat. Then we fine-tune the model further for 200 steps on eight images of dog 1.
The following are the images generated by the fine-tuned model at the end of Step 2. Images with cat-related prompts look like the cat in Step 1 of the fine-tuning, and images with dog-related prompts look like the dog in Step 2 of the fine-tuning.
Prior preservation
Prior preservation is a technique that uses additional images of the same class that we are trying to train on. For instance, if the training data consists of images of a particular dog, with prior preservation, we incorporate class images of generic dogs. It tries to avoid overfitting by showing images of different dogs while training for a particular dog. The class prompt omits the tag indicating the specific dog that is present in the instance prompt. For instance, the instance prompt may be “a photo of a riobugger cat” and the class prompt may be “a photo of a cat.” You can enable prior preservation by setting the hyperparameter with_prior_preservation = True. If setting with_prior_preservation = True, you must include class_prompt in dataset_info.json, and you may include any class images available to you. The following is the training dataset format when setting with_prior_preservation = True:
- Input – A directory containing the instance images, dataset_info.json, and (optional) the directory class_data_dir. Note the following:
  - Images may be of .png, .jpg, or .jpeg format.
  - The dataset_info.json file must be of the format {'instance_prompt':<<instance_prompt>>,'class_prompt':<<class_prompt>>}.
  - The class_data_dir directory must contain class images. If class_data_dir is not present or there aren’t enough images already present in class_data_dir, additional images will be sampled with class_prompt.

An example of this layout is shown after this list.
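The following is an illustrative layout (bucket and file names are placeholders):

```
s3://<bucket_name>/<input_directory>/
    instance_image_1.jpg
    ...
    instance_image_n.jpg
    class_data_dir/
        class_image_1.jpg
        ...
        class_image_m.jpg
    dataset_info.json
```

where dataset_info.json might contain:

```json
{
  "instance_prompt": "a photo of a riobugger cat",
  "class_prompt": "a photo of a cat"
}
```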
For datasets such as cats and dogs, prior preservation doesn’t significantly impact the performance of the fine-tuned model and therefore can be avoided. However, when training on faces, this is necessary. For more information, refer to Training Stable Diffusion with Dreambooth using Diffusers.
Instance types
Fine-tuning Stable Diffusion models requires the accelerated computation provided by GPU-supported instances. We experimented with fine-tuning on ml.g4dn.2xlarge (16 GB CUDA memory, 1 GPU) and ml.g5.2xlarge (24 GB CUDA memory, 1 GPU) instances. The memory requirement is higher when generating class images. Therefore, if setting with_prior_preservation = True, use the ml.g5.2xlarge instance type, because training runs into a CUDA out-of-memory issue on the ml.g4dn.2xlarge instance. The JumpStart fine-tuning script currently utilizes a single GPU; therefore, fine-tuning on multi-GPU instances will not yield a performance gain. For more information on different instance types, refer to Amazon EC2 Instance Types.
Limitations and bias
Even though Stable Diffusion has impressive performance in generating images, it suffers from several limitations and biases. These include but are not limited to:
- The model may not generate accurate faces or limbs because the training data doesn’t include sufficient images with these features
- The model was trained on the LAION-5B dataset, which has adult content and may not be fit for product use without further considerations
- The model may not work well with non-English languages because the model was trained on English language text
- The model can’t generate good text within images
For more information on limitations and bias, see Stable Diffusion v2-1-base Model Card. These limitations for the pre-trained model can also carry over to the fine-tuned models.
Clean up
After you’re done running the notebook, make sure to delete all resources created in the process to ensure that the billing is stopped. Code to clean up the endpoint is provided in the associated Introduction to JumpStart – Text to Image example notebook.
Publicly available fine-tuned models in JumpStart
Even though the Stable Diffusion models released by StabilityAI have impressive performance, they have limitations in terms of the language or domain they were trained on. For instance, Stable Diffusion models were trained on English text, but you may need to generate images from non-English text. Alternatively, Stable Diffusion models were trained to generate photorealistic images, but you may need to generate animated or artistic images.
JumpStart provides over 80 publicly available models with various languages and themes. These models are often fine-tuned versions of the Stable Diffusion models released by StabilityAI. If your use case matches one of the fine-tuned models, you don’t need to collect your own dataset and fine-tune it. You can simply deploy one of these models through the Studio UI or using the easy-to-use JumpStart APIs. To deploy a pre-trained Stable Diffusion model in JumpStart, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.
The following are some of the examples of images generated by the different models available in JumpStart.
Note that these models are not fine-tuned using JumpStart scripts or DreamBooth scripts. You can download the full list of publicly available fine-tuned models with example prompts from here.
For more example generated images from these models, please see section Open Sourced Fine-tuned models in the Appendix.
Conclusion
In this post, we showed how to fine-tune the Stable Diffusion model for text-to-image and then deploy it using JumpStart. Furthermore, we discussed some of the considerations you should make while fine-tuning the model and how it can impact the fine-tuned model’s performance. We also discussed the over 80 ready-to-use fine-tuned models available in JumpStart. We showed code snippets in this post—for the full code with all of the steps in this demo, see the Introduction to JumpStart – Text to Image example notebook. Try out the solution on your own and send us your comments.
To learn more about the model and the DreamBooth fine-tuning, see the following resources:
- High-Resolution Image Synthesis with Latent Diffusion Models
- Stable Diffusion Launch Announcement
- Stable Diffusion 2.0 Launch Announcement
- Stable Diffusion x4 upscaler model card
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
- Training Stable Diffusion with Dreambooth using Diffusers
- How to Fine-tune Stable Diffusion using Dreambooth
To learn more about JumpStart, check out the following blog posts:
- Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart
- Upscale images with Stable Diffusion in Amazon SageMaker JumpStart
- AlexaTM 20B is now available in Amazon SageMaker JumpStart
- Run text generation with Bloom and GPT models on Amazon SageMaker JumpStart
- Run image segmentation with Amazon SageMaker JumpStart
- Run text classification with Amazon SageMaker JumpStart using TensorFlow Hub and Hugging Face models
- Amazon SageMaker JumpStart models and algorithms now available via API
- Incremental training with Amazon SageMaker JumpStart
- Transfer learning for TensorFlow object detection models in Amazon SageMaker
- Transfer learning for TensorFlow text classification models in Amazon SageMaker
- Transfer learning for TensorFlow image classification models in Amazon SageMaker
About the Authors
Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from the University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers at EMNLP, ICLR, COLT, FOCS, and SODA.
Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning with a special focus on natural language processing (NLP), large language models (LLMs), and generative AI. Prior to this role, he was the Head of Data Science for Amazon’s EU Customer Service. Heiko helps our customers be successful in their AI/ML journey on AWS and has worked with organizations in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. In his spare time, Heiko travels as much as possible.
Appendix: Experiment datasets
This section contains the datasets used in the experiments in this post.
Dog1-8
Dog1-16
Dog2-4
Dog3-8
Appendix: Open Sourced Fine-tuned models
The following are some examples of images generated by the different models available in JumpStart. Each image is captioned with a model_id starting with the prefix huggingface-txt2img-, followed on the next line by the prompt used to generate the image.
FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation
Many languages spoken worldwide cover numerous regional varieties (sometimes called dialects), such as Brazilian and European Portuguese or Mainland and Taiwan Mandarin Chinese. Although such varieties are often mutually intelligible to their speakers, there are still important differences. For example, the Brazilian Portuguese word for “bus” is ônibus, while the European Portuguese word is autocarro. Yet, today’s machine translation (MT) systems typically do not allow users to specify which variety of a language to translate into. This may lead to confusion if the system outputs the “wrong” variety or mixes varieties in an unnatural way. Also, region-unaware MT systems tend to favor whichever variety has more data available online, which disproportionately affects speakers of under-resourced language varieties.
In “FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation”, accepted for publication in Transactions of the Association for Computational Linguistics, we present an evaluation dataset used to measure MT systems’ ability to support regional varieties through a case study on Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin Chinese. With the release of the FRMT data and accompanying evaluation code, we hope to inspire and enable the research community to discover new ways of creating MT systems that are applicable to the large number of regional language varieties spoken worldwide.
Challenge: Few-Shot Generalization
Most modern MT systems are trained on millions or billions of example translations, such as an English input sentence and its corresponding Portuguese translation. However, the vast majority of available training data doesn’t specify what regional variety the translation is in. In light of this data scarcity, we position FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given no more than 100 labeled examples of each language variety. MT models need to use the linguistic patterns showcased in the small number of labeled examples (called “exemplars”) to identify similar patterns in their unlabeled training examples. In this way, models can generalize, producing correct translations of phenomena not explicitly shown in the exemplars.
An illustration of a few-shot MT system translating the English sentence, “The bus arrived,” into two regional varieties of Portuguese: Brazilian (🇧🇷; left) and European (🇵🇹; right).
Few-shot approaches to MT are attractive because they make it much easier to add support for additional regional varieties to an existing system. While our work is specific to regional varieties of two languages, we anticipate that methods that perform well will be readily applicable to other languages and regional varieties. In principle, those methods should also work for other language distinctions, such as formality and style.
Data Collection
The FRMT dataset consists of partial English Wikipedia articles, sourced from the Wiki40b dataset, that have been translated by paid, professional translators into different regional varieties of Portuguese and Mandarin. In order to highlight key region-aware translation challenges, we designed the dataset using three content buckets: (1) Lexical, (2) Entity, and (3) Random.
- The Lexical bucket focuses on regional differences in word choice, such as the “ônibus” vs. “autocarro” distinction when translating a sentence with the word “bus” into Brazilian vs. European Portuguese, respectively. We manually collected 20-30 terms that have regionally distinctive translations according to blogs and educational websites, and filtered and vetted the translations with feedback from volunteer native speakers from each region. Given the resulting list of English terms, we extracted texts of up to 100 sentences each from the associated English Wikipedia articles (e.g., bus). The same process was carried out independently for Mandarin.
- The Entity bucket is populated in a similar way and concerns people, locations or other entities strongly associated with one of the two regions in question for a given language. Consider an illustrative sentence like, “In Lisbon, I often took the bus.” In order to translate this correctly into Brazilian Portuguese, a model must overcome two potential pitfalls:
- The strong geographical association between Lisbon and Portugal might influence a model to generate a European Portuguese translation instead, e.g., by selecting “autocarro” rather than “ônibus“.
- Replacing “Lisbon” with “Brasília” might be a naive way for a model to localize its output toward Brazilian Portuguese, but would be semantically inaccurate, even in an otherwise fluent translation.
- The Random bucket is used to check that a model correctly handles other diverse phenomena, and consists of text from 100 randomly sampled articles from Wikipedia’s “featured” and “good” collections.
Evaluation Methodology
To verify that the translations collected for the FRMT dataset capture region-specific phenomena, we conducted a human evaluation of their quality. Expert annotators from each region used the Multi-dimensional Quality Metrics (MQM) framework to identify and categorize errors in the translations. The framework includes a category-wise weighting scheme to convert the identified errors into a single score that roughly represents the number of major errors per sentence; so a lower number indicates a better translation. For each region, we asked MQM raters to score both translations from their region and translations from their language’s other region. For example, Brazilian Portuguese raters scored both the Brazilian and European Portuguese translations. The difference between these two scores indicates the prevalence of linguistic phenomena that are acceptable in one variety but not the other. We found that in both Portuguese and Chinese, raters identified, on average, approximately two more major errors per sentence in the mismatched translations than in the matched ones. This indicates that our dataset truly does capture region-specific phenomena.
While human evaluation is the best way to be sure of model quality, it is often slow and expensive. We therefore wanted to find an existing automatic metric that researchers can use to evaluate their models on our benchmark, and considered chrF, BLEU, and BLEURT. Using the translations from a few baseline models that were also evaluated by our MQM raters, we discovered that BLEURT has the best correlation with human judgments, and that the strength of that correlation (0.65 Pearson correlation coefficient, ρ) is comparable to the inter-annotator consistency (0.70 intraclass correlation).
| Metric | Pearson’s ρ |
| --- | --- |
| chrF | 0.48 |
| BLEU | 0.58 |
| BLEURT | 0.65 |
Correlation between different automatic metrics and human judgments of translation quality on a subset of FRMT. Values are between -1 and 1; higher is better.
System Performance
Our evaluation covered a handful of recent models capable of few-shot control. Based on human evaluation with MQM, the baseline methods all showed some ability to localize their output for Portuguese, but for Mandarin, they mostly failed to use knowledge of the targeted region to produce superior Mainland or Taiwan translations.
Google’s recent language model, PaLM, was rated best overall among the baselines we evaluated. In order to produce region-targeted translations with PaLM, we feed an instructive prompt into the model and then generate text from it to fill in the blank (see the example shown below).
Translate the following texts from English to European Portuguese.
English: [English example 1].
European Portuguese: [correct translation 1].
...
English: [input].
European Portuguese: _____
PaLM obtained strong results using a single example, and had marginal quality gains on Portuguese when increasing to ten examples. This performance is impressive when taking into consideration that PaLM was trained in an unsupervised way. Our results also suggest language models like PaLM may be particularly adept at memorizing region-specific word choices required for fluent translation. However, there is still a significant performance gap between PaLM and human performance. See our paper for more details.
MQM performance across dataset buckets using human and PaLM translations. Thick bars represent the region-matched case, where raters from each region evaluate translations targeted at their own region. Thin, inset bars represent the region-mismatched case, where raters from each region evaluate translations targeted at the other region. Human translations exhibit regional phenomena in all cases. PaLM translations do so for all Portuguese buckets and the Mandarin lexical bucket only.
Conclusion
In the near future, we hope to see a world where language generation systems, especially machine translation, can support all speaker communities. We want to meet users where they are, generating language fluent and appropriate for their locale or region. To that end, we have released the FRMT dataset and benchmark, enabling researchers to easily compare performance for region-aware MT models. Validated via our thorough human-evaluation studies, the language varieties in FRMT have significant differences that outputs from region-aware MT models should reflect. We are excited to see how researchers utilize this benchmark in development of new MT models that better support under-represented language varieties and all speaker communities, leading to improved equitability in natural-language technologies.
Acknowledgements
We gratefully acknowledge our paper co-authors for all their contributions to this project: Timothy Dozat, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and Noah Constant. For helpful discussion and comments on the paper, we thank Jacob Eisenstein, Noah Fiedel, Macduff Hughes and Mingfei Lau. For essential feedback around specific regional language differences, we thank Andre Araujo, Chung-Ching Chang, Andreia Cunha, Filipe Gonçalves, Nuno Guerreiro, Mandy Guo, Luis Miranda, Vitor Rodrigues and Linting Xue. For logistical support in collecting human translations and ratings, we thank the Google Translate team. We thank the professional translators and MQM raters for their role in producing the dataset. We also thank Tom Small for providing the animation in this post.
Scaling Large Language Model (LLM) training with Amazon EC2 Trn1 UltraClusters
Modern model pre-training often calls for larger cluster deployment to reduce time and cost. At the server level, such training workloads demand faster compute and increased memory allocation. As models grow to hundreds of billions of parameters, they require a distributed training mechanism that spans multiple nodes (instances).
In October 2022, we launched Amazon EC2 Trn1 Instances, powered by AWS Trainium, the second-generation machine learning accelerator designed by AWS. Trn1 instances are purpose-built for high-performance deep learning model training while offering up to 50% cost-to-train savings over comparable GPU-based instances. To bring training time down from weeks to days, or from days to hours, and to distribute a large model’s training job, we can use an EC2 Trn1 UltraCluster, which consists of densely packed, co-located racks of Trn1 compute instances, all interconnected by non-blocking, petabyte-scale networking. It is our largest UltraCluster to date, offering 6 exaflops of compute power on demand with up to 30,000 Trainium chips.
In this post, we use a Hugging Face BERT-Large model pre-training workload as a simple example to explain how to use Trn1 UltraClusters.
Trn1 UltraClusters
A Trn1 UltraCluster is a placement group of Trn1 instances in a data center. As part of a single cluster run, you can spin up a cluster of Trn1 instances with Trainium accelerators. The following diagram shows an example.
UltraClusters of Trn1 instances are co-located in a data center and interconnected using Elastic Fabric Adapter (EFA), a petabyte-scale, non-blocking network interface with up to 800 Gbps of networking bandwidth, which is twice the bandwidth supported by AWS P4d instances (the upcoming Trn1n instances will offer 1.6 Tbps, four times the P4d bandwidth). These EFA interfaces help run model training workloads that use the Neuron Collective Communication Libraries at scale. Trn1 UltraClusters also include co-located network-attached storage services such as Amazon FSx for Lustre to enable high-throughput access to large datasets, ensuring clusters operate efficiently. A Trn1 UltraCluster can host up to 30,000 Trainium devices and deliver up to 6 exaflops of compute in a single cluster, literally an on-demand supercomputer with a pay-as-you-go usage model. In this post, we use HPC tools such as Slurm to ramp up an UltraCluster and manage workloads.
Solution overview
AWS offers a wide variety of services for distributed model training or inferencing workloads at scale, including AWS Batch, Amazon Elastic Kubernetes Service (Amazon EKS), and UltraClusters. This post focuses on model training in an UltraCluster. Our solution uses the AWS ParallelCluster management tool to create the necessary infrastructure and environment to spin up a Trn1 UltraCluster. The infrastructure consists of a head node and multiple Trn1 compute nodes within a virtual private cloud (VPC). We use Slurm as the cluster management and job scheduling system. The following diagram illustrates our solution architecture.
For more details and how to deploy this solution, see Train a model on AWS Trn1 ParallelCluster.
Let’s look at some important steps of this solution:
- Create a VPC and subnets.
- Configure the compute fleet.
- Create the cluster.
- Inspect the cluster.
- Launch your training job.
Prerequisites
To follow along with this post, broad familiarity with core AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) is assumed, and basic familiarity with deep learning and PyTorch would be helpful.
Create VPC and subnets
An easy way to create the VPC and subnets is through the Amazon Virtual Private Cloud (Amazon VPC) console. Complete instructions can be found on GitHub. After the VPC and subnets are created, you need to configure the instances in the compute fleet. Briefly, this is made possible by an installation script specified by CustomActions in the YAML file used for creating the ParallelCluster (see Create ParallelCluster). A ParallelCluster requires a VPC that has two subnets and a Network Address Translation (NAT) gateway, as shown in the preceding architecture diagram. This VPC has to reside in an Availability Zone where Trn1 instances are available. Also, in this VPC, you need a public subnet and a private subnet to hold the head node and the Trn1 compute nodes, respectively. You also need a NAT gateway for internet access, so that the Trn1 compute nodes can download AWS Neuron packages. In general, the compute nodes receive updates for the OS packages, the Neuron driver and runtime, and the EFA driver for multi-instance training.
As for the head node, in addition to the aforementioned components for the compute nodes, it also receives the PyTorch-NeuronX package and the NeuronX compiler, which enable the model compilation process for XLA devices such as Trainium.
Configure the compute fleet
In the YAML file for creating the Trn1 UltraCluster, InstanceType is specified as trn1.32xlarge. MaxCount and MinCount are used to indicate your compute fleet size range. You may use MinCount to keep some or all Trn1 instances available at all times. MinCount may be set to zero so that, if there is no running job, the Trn1 instances are released from this cluster.
Trn1 may also be deployed in an UltraCluster with multiple queues. In the following example, there is only one queue being set up for Slurm job submission:
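The following is a hedged sketch of the Slurm queue section of the cluster YAML; the queue and compute resource names, subnet ID, and counts are placeholders, and the field names follow the AWS ParallelCluster 3 schema as we understand it.

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute1
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: queue1-trn1
          InstanceType: trn1.32xlarge
          MaxCount: 16
          MinCount: 0
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxxxxxxx   # private subnet for the compute nodes
        PlacementGroup:
          Enabled: true
```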
If you need more than one queue, you can specify multiple InstanceType entries, each with its own MaxCount, MinCount, and Name:
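As a sketch (extending the single-queue example above, with placeholder names, counts, and subnet IDs), a two-queue setup might look like the following:

```yaml
SlurmQueues:
  - Name: queue1
    ComputeResources:
      - Name: queue1-trn1
        InstanceType: trn1.32xlarge
        MaxCount: 16
        MinCount: 0
    Networking:
      SubnetIds:
        - subnet-xxxxxxxxxxxxxxxxx
  - Name: queue2
    ComputeResources:
      - Name: queue2-trn1
        InstanceType: trn1.32xlarge
        MaxCount: 8
        MinCount: 0
    Networking:
      SubnetIds:
        - subnet-xxxxxxxxxxxxxxxxx
```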
Here, two queues are set up so that users have the flexibility to choose the resources for their Slurm jobs.
Create the cluster
To launch a Trn1 UltraCluster, use the following pcluster command from where your ParallelCluster tool is installed:
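A sketch of the command follows; the configuration file name and cluster name are placeholders.

```bash
pcluster create-cluster \
    --cluster-configuration trn1-ultracluster.yaml \
    -n my-trn1-ultracluster
```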
We use the following options in this command:
- --cluster-configuration – This option expects a YAML file that describes the cluster configuration
- -n (or --cluster-name) – The name of this cluster
This command creates a Trn1 cluster in your AWS account. You can check the progress of cluster creation on the AWS CloudFormation console. For more information, refer to Using the AWS CloudFormation console.
Alternatively, you can use the following command to see the status of your request:
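For example (the cluster name is a placeholder):

```bash
pcluster describe-cluster -n my-trn1-ultracluster
```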
The command returns a JSON description of the cluster, including its status.
The following are parameters of interest from the output:
- instanceId – This is the instance ID of the head node, which will be listed on the Amazon EC2 console
- computeFleetStatus – This attribute indicates readiness of the compute nodes
- Tags – This attribute indicates the version of the pcluster tool used to create this cluster
Inspect the cluster
You can use the aforementioned pcluster describe-cluster command to check the cluster. After the cluster is created, you will observe the following in the output:
At this point, you may SSH into the head node (identified by instance ID on the Amazon EC2 console). The following is a logical diagram of the cluster.
After you SSH into the head node, you can verify the compute fleet and its status with a Slurm command such as sinfo, which shows the node information for the system. The following is an example output:
This indicates that there is one queue as shown by a single partition. There are 16 nodes available, and resources are allocated. From the head node, you can SSH into any given compute node:
Use exit to get back to the head node.

Likewise, you can SSH into a compute node from another compute node. Each compute node has Neuron tools installed, such as neuron-top. You can invoke neuron-top during the training script run to inspect NeuronCore utilization at each node.
Launch your training job
We use the Hugging Face BERT-Large Pretraining Tutorial as an example to run on this cluster. After the training data and scripts are downloaded to the cluster, we use the Slurm controller to manage and orchestrate our workload. We submit the training job with the sbatch command. The shell script invokes the Python script via the neuron_parallel_compile API to compile the model into graphs without a full training run. See the following code:
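The following is a hedged sketch of the submission command; the node count and script name follow the example in this post, and the exact invocation in the tutorial may differ.

```bash
sbatch --exclusive --nodes=16 \
    --wrap "srun neuron_parallel_compile ./run_dp_bert_large_hf_pretrain_bf16_s128.sh"
```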
We use the following options in this command:
- --exclusive – This job will use all nodes and will not share nodes with other jobs while running the current job.
- --nodes – The number of nodes for this job.
- --wrap – This defines a command string that is run by the Slurm controller. In this case, it simply compiles the model in parallel using all nodes.
After the model is compiled successfully, you may start the full training job with the following command:
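For example, under the same assumptions as the compilation step above:

```bash
sbatch --exclusive --nodes=16 \
    --wrap "srun ./run_dp_bert_large_hf_pretrain_bf16_s128.sh"
```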
This command will launch the training job for the Hugging Face BERT-Large model. With 16 trn1.32xlarge nodes, you can expect it to complete in less than 8 hours.
At this point, you can use a Slurm command such as squeue to inspect the submitted job. An example output is as follows:

This output shows the job is running (R) on 16 compute nodes.
As the job is running, outputs are captured and appended to a Slurm log file. From the head node’s terminal, you can inspect it in real time.
Also, in the same directory as the Slurm log file, there is a corresponding directory for this job. This directory includes the following (for example):
This directory is accessible to all compute nodes. results.json captures the metadata of this particular job run, such as the model’s configuration, batch size, total steps, gradient accumulation steps, and training dataset name. The model checkpoint and output log for each compute node are also captured in this directory.
Consider scalability of the cluster
In a Trn1 UltraCluster, multiple interconnected Trn1 instances run a large model training workload in parallel and reduce the total computation time, or time to convergence. There are two measures of scalability of a cluster: strong scaling and weak scaling. Typically, for model training, the need is to speed up the training run, because usage cost is determined by sample throughput for rounds of gradient updates. Strong scaling refers to the scenario where the total problem size stays the same as the number of processors increases, and it is an important measure of scalability for model training. In evaluating strong scaling (that is, the impact of parallelization), we want to keep the global batch size the same and see how much time it takes to reach convergence. In such a scenario, we need to adjust the gradient accumulation microsteps according to the number of compute nodes. This is achieved with the following in the training shell script run_dp_bert_large_hf_pretrain_bf16_s128.sh:
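The following is a sketch of the kind of adjustment involved; the variable names are illustrative rather than the exact ones used in the tutorial script.

```bash
# Keep the global batch size constant: divide the baseline gradient accumulation
# microsteps by the number of nodes participating in the Slurm job.
GRAD_ACCUM_USTEPS=$((BASE_GRAD_ACCUM_USTEPS / SLURM_JOB_NUM_NODES))
```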
On the other hand, if you want to evaluate how many more workloads can be run in a fixed time by adding more nodes, use weak scaling to measure scalability. In weak scaling, the problem size increases at the same rate as the number of NeuronCores, thereby keeping the amount of work per NeuronCore the same. To evaluate weak scaling, or the effect of adding more nodes on the increased workload, simply remove the above line from the training script and keep the number of steps for gradient accumulation constant at the default value (32) provided in the training script.
Evaluate your results
We provide some benchmark results in the Neuron performance page to demonstrate the effect of scaling. The data demonstrates the benefit of using multiple instances to parallelize the training job for many different large models to train at scale.
Clean up your infrastructure
To delete all the infrastructure of this UltraCluster, use the pcluster command to delete the cluster and its resources:
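For example (the cluster name is a placeholder):

```bash
pcluster delete-cluster -n my-trn1-ultracluster
```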
Conclusion
In this post, we discussed how scaling your training job across a Trn1 UltraCluster, powered by Trainium accelerators on AWS, reduces the time to train a model. We also provided a link to the Neuron samples repository, which contains instructions on how to deploy a distributed training job for a BERT-Large model. A Trn1 UltraCluster runs distributed training workloads to train ultra-large deep learning models at scale. A distributed training setup results in much faster model convergence as compared to training on a single Trn1 instance.
To learn more about how to get started with Trainium-powered Trn1 instances, visit the Neuron documentation.
About the Authors
K.C. Tung is a Senior Solution Architect in AWS Annapurna Labs. He specializes in large deep learning model training and deployment at scale in the cloud. He has a Ph.D. in molecular biophysics from the University of Texas Southwestern Medical Center in Dallas. He has spoken at AWS Summits and AWS re:Invent. Today he helps customers train and deploy large PyTorch and TensorFlow models in the AWS cloud. He is the author of two books: Learn TensorFlow Enterprise and TensorFlow 2 Pocket Reference.
Jeffrey Huynh is a Principal Engineer in AWS Annapurna Labs. He is passionate about helping customers run their training and inference workloads on Trainium and Inferentia accelerator devices using AWS Neuron SDK. He is a Caltech/Stanford alumni with degrees in Physics and EE. He enjoys running, tennis, cooking, and reading about science and technology.
Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt EC2 accelerated computing infrastructure for their machine learning needs.
Transportation Generation: See How AI and the Metaverse Are Shaping the Automotive Industry at GTC
Novel AI technologies are generating images, stories and, now, new ways to imagine the automotive future.
At NVIDIA GTC, a global conference for the era of AI and the metaverse running online March 20-23, industry luminaries working on these breakthroughs will come together and share their visions to transform transportation.
This year’s slate of in-depth sessions includes leaders from automotive, robotics, healthcare and other industries, as well as trailblazing AI researchers.
Headlining GTC is NVIDIA founder and CEO Jensen Huang, who will present the latest in AI and NVIDIA Omniverse, a platform for creating and operating metaverse applications, in a keynote address on Tuesday, March 21, at 8 a.m. PT.
Conference attendees will have plenty of opportunities to network and learn from NVIDIA and industry experts about the technologies powering the next generation of automotive.
Here’s what to expect from auto sessions at GTC:
End-to-End Innovation
The entire automotive industry is being transformed by AI and metaverse technologies, whether they’re used for design and engineering, manufacturing, autonomous driving or the customer experience.
Speakers from these areas will share how they’re using the latest innovations to supercharge development:
- Sacha Vražić, director of autonomous driving R&D at Rimac Technology, discusses how the supercar maker is using AI to teach any driver how to race like a professional on the track.
- Toru Saito, deputy chief of Subaru Lab at Subaru Corporation, walks through how the automaker is improving camera perception with AI, using large-dataset training on GPUs and in the cloud.
- Tom Xie, vice president at ZEEKR, explains how the electric vehicle company is rethinking the electronic architecture in EVs to develop a software-defined lineup that is continuously upgradeable.
- Liz Metcalfe-Williams, senior data scientist, and Otto Fitzke, machine learning engineer at Jaguar Land Rover, cover key learnings from the premium automaker’s research into natural language processing to improve knowledge and systems, and to accelerate the development of high-quality, validated, cutting-edge products.
- Marco Pavone, director of autonomous vehicle research; Sanja Fidler, vice president of AI research; and Sarah Tariq, vice president of autonomous vehicle software at NVIDIA, show how generative AI and novel, highly integrated system architectures will radically change how AVs are designed and developed.
Develop Your Drive
In addition to sessions from industry leaders, GTC attendees can access talks on the latest NVIDIA DRIVE technologies led by in-house experts.
NVIDIA DRIVE Developer Days consist of a series of deep-dive sessions on building safe and robust autonomous vehicles. Led by the NVIDIA engineering team, these talks will highlight the newest DRIVE features and how to apply them.
Topics include high-definition mapping, AV simulation, synthetic data generation for testing and validation, enhancing AV safety with in-system testing, and multi-task models for AV perception.
Access these virtual sessions and more by registering free to attend and see the technologies generating the intelligent future of transportation.
How should AI systems behave, and who should decide?
OpenAI’s mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. We therefore think a lot about the behavior of AI systems we build in the run-up to AGI, and the way in which that behavior is determined.
Since our launch of ChatGPT, users have shared outputs that they consider politically biased, offensive, or otherwise objectionable. In many cases, we think that the concerns raised have been valid and have uncovered real limitations of our systems which we want to address. We’ve also seen a few misconceptions about how our systems and policies work together to shape the outputs you get from ChatGPT.
Below, we summarize:
- How ChatGPT’s behavior is shaped;
- How we plan to improve ChatGPT’s default behavior;
- Our intent to allow more system customization; and
- Our efforts to get more public input on our decision-making.
Where we are today
Unlike ordinary software, our models are massive neural networks. Their behaviors are learned from a broad range of data, not programmed explicitly. Though not a perfect analogy, the process is more similar to training a dog than to ordinary programming. An initial “pre-training” phase comes first, in which the model learns to predict the next word in a sentence, informed by its exposure to lots of Internet text (and to a vast array of perspectives). This is followed by a second phase in which we “fine-tune” our models to narrow down system behavior.
As of today, this process is imperfect. Sometimes the fine-tuning process falls short of our intent (producing a safe and useful tool) and the user’s intent (getting a helpful output in response to a given input). Improving our methods for aligning AI systems with human values is a top priority for our company, particularly as AI systems become more capable.
A two step process: Pre-training and fine-tuning
The two main steps involved in building ChatGPT work as follows:
- First, we “pre-train” models by having them predict what comes next in a big dataset that contains parts of the Internet. They might learn to complete the sentence “instead of turning left, she turned ___.” By learning from billions of sentences, our models learn grammar, many facts about the world, and some reasoning abilities. They also learn some of the biases present in those billions of sentences. (A toy illustration of next-word prediction follows this list.)
- Then, we “fine-tune” these models on a more narrow dataset that we carefully generate with human reviewers who follow guidelines that we provide them. Since we cannot predict all the possible inputs that future users may put into our system, we do not write detailed instructions for every input that ChatGPT will encounter. Instead, we outline a few categories in the guidelines that our reviewers use to review and rate possible model outputs for a range of example inputs. Then, while they are in use, the models generalize from this reviewer feedback in order to respond to a wide array of specific inputs provided by a given user.
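To make the next-word-prediction objective in the first step concrete, here is a toy, hypothetical illustration in Python. It is nothing like the scale or architecture of the actual models (which are large neural networks, not word counters), but it shows the shape of the task: learn from text which word tends to come next.

```python
from collections import Counter, defaultdict

# A tiny corpus standing in for "parts of the Internet" (illustrative only).
corpus = "instead of turning left she turned right and then she turned back".split()

# Count which word follows which (a bigram model): the simplest possible
# version of "predict what comes next".
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

# Ask the toy model to complete "... she turned ___".
print(following["turned"].most_common(1))  # [('right', 1)] on this tiny corpus
```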
The role of reviewers and OpenAI’s policies in system development
In some cases, we may give guidance to our reviewers on a certain kind of output (for example, “do not complete requests for illegal content”). In other cases, the guidance we share with reviewers is more high-level (for example, “avoid taking a position on controversial topics”). Importantly, our collaboration with reviewers is not one-and-done—it’s an ongoing relationship, in which we learn a lot from their expertise.
A large part of the fine-tuning process is maintaining a strong feedback loop with our reviewers, which involves weekly meetings to address questions they may have, or provide clarifications on our guidance. This iterative feedback process is how we train the model to be better and better over time.
Addressing biases
Many are rightly worried about biases in the design and impact of AI systems. We are committed to robustly addressing this issue and being transparent about both our intentions and our progress. Towards that end, we are sharing a portion of our guidelines that pertain to political and controversial topics. Our guidelines are explicit that reviewers should not favor any political group. Biases that nevertheless may emerge from the process described above are bugs, not features.
While disagreements will always exist, we hope sharing this blog post and these instructions will give more insight into how we view this critical aspect of such a foundational technology. It’s our belief that technology companies must be accountable for producing policies that stand up to scrutiny.
We’re always working to improve the clarity of these guidelines—and based on what we’ve learned from the ChatGPT launch so far, we’re going to provide clearer instructions to reviewers about potential pitfalls and challenges tied to bias, as well as controversial figures and themes. Additionally, as part of ongoing transparency initiatives, we are working to share aggregated demographic information about our reviewers in a way that doesn’t violate privacy rules and norms, since this is an additional source of potential bias in system outputs.
We are currently researching how to make the fine-tuning process more understandable and controllable, and are building on external advances such as rule-based rewards and Constitutional AI.
Where we’re going: The building blocks of future systems
In pursuit of our mission, we’re committed to ensuring that access to, benefits from, and influence over AI and AGI[1] are widespread. We believe there are at least three building blocks required in order to achieve these goals in the context of AI system behavior.[2]
1. Improve default behavior. We want as many users as possible to find our AI systems useful to them “out of the box” and to feel that our technology understands and respects their values.
Towards that end, we are investing in research and engineering to reduce both glaring and subtle biases in how ChatGPT responds to different inputs. In some cases ChatGPT currently refuses outputs that it shouldn’t, and in some cases, it doesn’t refuse when it should. We believe that improvement in both respects is possible.
Additionally, we have room for improvement in other dimensions of system behavior such as the system “making things up.” Feedback from users is invaluable for making these improvements.
2. Define your AI’s values, within broad bounds. We believe that AI should be a useful tool for individual people, and thus customizable by each user up to limits defined by society. Therefore, we are developing an upgrade to ChatGPT to allow users to easily customize its behavior.
This will mean allowing system outputs that other people (ourselves included) may strongly disagree with. Striking the right balance here will be challenging–taking customization to the extreme would risk enabling malicious uses of our technology and sycophantic AIs that mindlessly amplify people’s existing beliefs.
There will therefore always be some bounds on system behavior. The challenge is defining what those bounds are. If we try to make all of these determinations on our own, or if we try to develop a single, monolithic AI system, we will be failing in the commitment we make in our Charter to “avoid undue concentration of power.”
3. Public input on defaults and hard bounds. One way to avoid undue concentration of power is to give people who use or are affected by systems like ChatGPT the ability to influence those systems’ rules.
We believe that many decisions about our defaults and hard bounds should be made collectively, and while practical implementation is a challenge, we aim to include as many perspectives as possible. As a starting point, we’ve sought external input on our technology in the form of red teaming. We also recently began soliciting public input on AI in education (one particularly important context in which our technology is being deployed).
We are in the early stages of piloting efforts to solicit public input on topics like system behavior, disclosure mechanisms (such as watermarking), and our deployment policies more broadly. We are also exploring partnerships with external organizations to conduct third-party audits of our safety and policy efforts.
Conclusion
Combining the three building blocks above gives the following picture of where we’re headed:
Sometimes we will make mistakes. When we do, we will learn from them and iterate on our models and systems.
We appreciate the ChatGPT user community as well as the wider public’s vigilance in holding us accountable, and are excited to share more about our work in the three areas above in the coming months.
If you are interested in doing research to help achieve this vision, including but not limited to research on fairness and representation, alignment, and sociotechnical research to understand the impact of AI on society, please apply for subsidized access to our API via the Researcher Access Program.
We are also hiring for positions across Research, Alignment, Engineering, and more.
OpenAI
TensorFlow Datasets is turning 4!
Posted by the TensorFlow Datasets team
The datasets landscape has changed a lot since TensorFlow Datasets (TFDS) was introduced about 4 years ago: TFDS made sharing and re-using datasets significantly easier, and transformed the landscape by inspiring other ML tools, libraries and services.
Loading a dataset went from complicated scripts to:
```python
import tensorflow_datasets as tfds

ds = tfds.load('mnist', split='train')
for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
    print(list(example.keys()))
    image = example["image"]
    label = example["label"]
    print(image.shape, label)
```
Read the documentation for a more extensive introduction.
Over the years, TFDS has grown to become a recognized way to load datasets. To celebrate our latest release, 4.8.2, we would like to take some time to reflect on the progress and improvements made over those past years and thank the community for its support.
TFDS is still a library to facilitate download, preparation and loading of datasets for ML pipelines, but it now supports hundreds of datasets and offers the following main features:
- A large variety of features with encoding and decoding, ranging from text to images, videos, audio and even RL-specific types (e.g. dataset of datasets).
- Large datasets support: TFDS is successfully used within Google to prepare and load large datasets (PBs) using high performance input pipelines.
- Dataset collections, to arbitrarily group together a number of existing TFDS datasets, for example used in a benchmark.
- Support for all main ML Python frameworks: yes there is “TF” in “TFDS”, but besides TensorFlow, one can use TFDS with Torch, Jax, NumPy, Keras and any other Python ML framework that can consume a tf.data.Dataset or a NumPy Iterator.
- Global shuffling at preparation time: It is good practice to shuffle training data, so TFDS optionally performs a global shuffle at preparation time in case the source of the data wasn’t already shuffled.
- Splits and slicing: datasets can specify their splits, and readers can specify which split(s), or which slices of splits, they want to read, e.g., test[:10%] to load the first 10 percent of the test split (see the sketch after this list).
- Versioning and determinism: TFDS datasets and collections are versioned, so it is possible to reproduce experiments reliably. Loading a dataset pinned at a particular version will always return the same set of examples. This works with slicing and global shuffling too, as those are deterministic.
- Code-less sharing: TFDS can read TFDS prepared datasets even if the code used to prepare the dataset is not available. This facilitates sharing and versioning datasets.
- Community datasets and support for internal datasets within organizations: TFDS allows organizations to manage different corpuses of datasets and make them available to their internal users.
- Formats-specific builders: to easily define datasets based on well known formats such as CoNLL.
- GCS integration: TFDS works well with GCS.
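As a short sketch of two of the features above, splits/slicing and framework-agnostic reading, the following snippet loads a slice of a split and iterates over it as NumPy arrays (the dataset name and slice are just examples):

```python
import tensorflow_datasets as tfds

# Load only the first 10 percent of the MNIST test split.
ds = tfds.load('mnist', split='test[:10%]')

# Iterate as NumPy arrays, so the examples can feed Torch, JAX,
# or any other Python framework.
for example in tfds.as_numpy(ds):
    print(example['image'].shape, example['label'])
    break  # show just the first example
```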
Thank you to all of our contributors and users!
What’s next?
TFDS is under active development to bring you the best datasets to use as input in your ML pipelines.
Notably, we work on making transformations seamless. Sometimes, a dataset is derived from another dataset by a few transformations (e.g., data augmentation or column renaming). We want those transformations to be as easy to implement as possible. This feature is already available experimentally, don’t hesitate to give feedback on GitHub!
We are also working on making the TensorFlow dependency optional. TFDS is a framework-agnostic library that provides datasets and tools to support machine learning research, and it should not force a dependency on any particular ML framework.
We have other plans too, smaller ones such as the support of partitioned datasets, and longer-term ones that could durably influence the field. Follow us on GitHub to receive future updates about those upcoming developments!
New expanded data format support in Amazon Kendra
Enterprises across the globe are looking to utilize multiple data sources to implement a unified search experience for their employees and end customers. Given the large volume of data that needs to be examined and indexed, retrieval speed, solution scalability, and search performance become key factors when choosing an enterprise intelligent search solution. Additionally, these data sources comprise structured and unstructured content repositories—including various file types—which may cause compatibility issues.
Amazon Kendra is a highly accurate and intelligent search service that enables users to search for answers to their questions from your unstructured and structured data using natural language processing and advanced search algorithms. It returns specific answers to questions, giving users an experience that’s close to interacting with a human expert.
Today, Amazon Kendra launched support for seven additional data formats. This allows you to easily integrate your existing data sources as is and perform intelligent search across multiple content repositories.
In this post, we discuss the new supported data formats and how to use them.
New supported data formats
Previously, Amazon Kendra supported documents that included structured text in the form of frequently asked questions and answers, as well as unstructured text in the form of HTML files, Microsoft PowerPoint presentations, Microsoft Word documents, plain text documents, and PDFs.
With this launch, Amazon Kendra now offers support for seven additional data formats:
- Rich Text Format (RTF)
- JavaScript Object Notation (JSON)
- Markdown (MD)
- Comma separated values (CSV)
- Microsoft Excel (MS Excel)
- Extensible Markup Language (XML)
- Extensible Stylesheet Language Transformations (XSLT)
Amazon Kendra users can ingest documents in these data formats into their index in the following two ways:
- Using the BatchPutDocument API (a programmatic sketch follows this list):
- Pass the document as an Amazon Simple Storage Service (Amazon S3) file.
- Pass the document as binary data (blob).
- As a data source. For more information, see Creating a data source.
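For example, here is a minimal sketch of the BatchPutDocument path using the AWS SDK for Python (Boto3). The index ID, role ARN, bucket, and key are placeholders, and the exact ContentType strings should be checked against the API reference:

```python
import boto3

kendra = boto3.client('kendra')

# Placeholder identifiers; replace with your own index, role, bucket, and key.
response = kendra.batch_put_document(
    IndexId='your-index-id',
    RoleArn='arn:aws:iam::111122223333:role/KendraS3AccessRole',
    Documents=[
        {
            'Id': 'sample-report-1',
            'S3Path': {'Bucket': 'your-bucket', 'Key': 'sample-data/report.json'},
            'ContentType': 'JSON',  # other values cover formats such as RTF, CSV, MS_EXCEL, XML, XSLT, MD
        }
    ],
)
print(response.get('FailedDocuments', []))  # an empty list means the document was accepted for indexing
```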
Solution overview
In the following sections, we walk through the steps for adding documents from a data source and performing a search on those documents.
The following diagram shows our solution architecture.
To test this solution with any of the supported formats, you need to use your own data. Upload documents of the same or different formats to the S3 bucket.
Create an Amazon Kendra index
For instructions on creating your Amazon Kendra index, refer to Creating an index.
You can skip this step if you have a pre-existing index to use for this demo.
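If you prefer to create the index programmatically rather than in the console, a minimal Boto3 sketch (with a placeholder role ARN) looks roughly like this:

```python
import boto3

kendra = boto3.client('kendra')

# Placeholder role ARN; the role must grant Amazon Kendra the permissions it needs.
create_response = kendra.create_index(
    Name='demo-index',
    Edition='DEVELOPER_EDITION',
    RoleArn='arn:aws:iam::111122223333:role/KendraIndexRole',
)
index_id = create_response['Id']

# Index creation takes several minutes; check the status until it becomes ACTIVE.
status = kendra.describe_index(Id=index_id)['Status']
print(index_id, status)
```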
Upload documents to an S3 bucket and ingest to the index using the S3 connector
Complete the following steps to connect an S3 bucket to your index (a programmatic alternative is sketched after these steps):
- Create an S3 bucket to store your documents.
- Create a folder named sample-data.
- Upload the documents that you want to test to the folder.
- On the Amazon Kendra console, go to your index and choose Data sources.
- Choose Add data source.
- Under Available data sources, select S3 and choose Add Connector.
- Enter a name for your connector (such as Demo_S3_connector) and choose Next.
- Choose Browse S3 and choose the S3 bucket where you uploaded the documents.
- For IAM Role, create a new role.
- For Set sync run schedule, select Run on demand.
- Choose Next.
- On the Review and create page, choose Add data source.
- After the creation process is complete, choose Sync Now.
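If you would rather script these console steps, a hedged Boto3 equivalent (with placeholder index ID, bucket, and role values) looks roughly like this:

```python
import boto3

kendra = boto3.client('kendra')

index_id = 'your-index-id'  # placeholder

# Create an S3 data source pointing at the folder with the sample documents.
ds_response = kendra.create_data_source(
    IndexId=index_id,
    Name='Demo_S3_connector',
    Type='S3',
    Configuration={
        'S3Configuration': {
            'BucketName': 'your-bucket',
            'InclusionPrefixes': ['sample-data/'],
        }
    },
    RoleArn='arn:aws:iam::111122223333:role/KendraS3DataSourceRole',
)
data_source_id = ds_response['Id']

# Equivalent of choosing Sync Now in the console.
sync_response = kendra.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
print(sync_response['ExecutionId'])
```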
Now that you have ingested some documents, you can navigate to the built-in search console to test queries.
Search your documents with the Amazon Kendra search console
On the Amazon Kendra console, choose Search indexed content in the navigation pane.
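You can also run the same searches programmatically with the Query API. The following is a minimal sketch; the index ID and question are placeholders:

```python
import boto3

kendra = boto3.client('kendra')

# Placeholder index ID and an example natural-language question.
response = kendra.query(
    IndexId='your-index-id',
    QueryText='What formats does the sample report cover?',
)

for item in response['ResultItems']:
    title = item.get('DocumentTitle', {}).get('Text', '')
    print(item['Type'], title)
```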
The following are examples of the results from the search for different document types:
- RTF – Upload input data in RTF format to the S3 bucket and sync the data source. The following screenshot shows the search results.
- JSON – Upload input data in JSON format to the S3 bucket and sync the data source. The following screenshot shows the search results.
- Markdown – Upload input data in MD format to the S3 bucket and sync the data source. The following screenshot shows the search results.
- CSV – Upload input data in CSV format to the S3 bucket and sync the data source. The following screenshot shows the search results.
- Excel – Upload input data in Excel format to the S3 bucket and sync the data source. The following screenshot shows the search results.
- XML – Upload input data in XML format to the S3 bucket and sync the data source. The following screenshot shows the search results.
- XSLT – Upload input data in XSLT format to the S3 bucket and sync the data source. The following screenshot shows the search results.
Clean up
To avoid incurring future costs, clean up the resources you created as part of this solution using the following steps:
- On the Amazon Kendra console, choose Indexes in the navigation pane.
- Choose the index that contains the data source to delete.
- In the navigation pane, choose Data sources.
- Choose the data source to remove, then choose Delete.
When you delete a data source, Amazon Kendra removes all the stored information about the data source. Amazon Kendra removes all the document data stored in the index, and all run histories and metrics associated with the data source. Deleting a data source does not remove the original documents from your storage.
- On the Amazon Kendra console, choose Indexes in the navigation pane.
- Choose the index to delete, then choose Delete.
Refer to Deleting an index and data source for more details.
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Select the bucket you want to delete, then choose Delete.
- Enter the name of the bucket to confirm deletion, then choose Delete bucket.
If the bucket contains any objects, you’ll receive an error alert. Empty the bucket before deleting it by choosing the link in the error message and following the instructions on the Empty bucket page. Then return to the Delete bucket page and delete the bucket.
- To verify that you’ve deleted the bucket, open the Buckets page and enter the name of the bucket that you deleted. If the bucket can’t be found, your deletion was successful.
Refer to the Deleting a bucket page for more details.
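If you created the resources with the SDK, the cleanup can be scripted as well. The following is a hedged sketch with placeholder IDs and bucket name:

```python
import boto3

kendra = boto3.client('kendra')
s3 = boto3.resource('s3')

index_id = 'your-index-id'                # placeholder
data_source_id = 'your-data-source-id'    # placeholder

# Delete the data source first, then the index.
kendra.delete_data_source(Id=data_source_id, IndexId=index_id)
kendra.delete_index(Id=index_id)

# Empty and delete the S3 bucket used for the demo documents.
bucket = s3.Bucket('your-bucket')  # placeholder bucket name
bucket.objects.all().delete()
bucket.delete()
```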
Conclusion
In this post, we discussed the new data formats that Amazon Kendra now supports. In addition, we discussed how to use Amazon Kendra to ingest and perform a search on these new document types stored in an S3 bucket. To learn more about the different data formats supported, refer to Types of documents.
We introduced you to the basics, but there are many additional features that we didn’t cover in this post, such as the following:
- You can enable user-based access control for your Amazon Kendra index and restrict access to users and groups that you configure.
- You can map additional fields to Amazon Kendra index attributes and enable them for faceting, search, and display in the search results.
- You can integrate different third-party data source connectors like ServiceNow and Salesforce with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion. For the complete list of supported connectors, refer to Connectors.
To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide.
About the authors
Rishabh Yadav is a Partner Solutions Architect at AWS with an extensive background in DevOps and Security offerings at AWS. He works with ASEAN partners to provide guidance on enterprise cloud adoption and architecture reviews, along with building the AWS practice through implementation of the Well-Architected Framework. Outside of work, he likes to spend his time on the sports field and playing FPS games.
Kruthi Jayasimha Rao is a Partner Solutions Architect with a focus in AI and ML. She provides technical guidance to AWS Partners in following best practices to build secure, resilient, and highly available solutions in the AWS Cloud.
Keerthi Kumar Kallur is a Software Development Engineer at AWS. He has been with the Amazon Kendra team for the past two years and has worked on various features and with customers. In his spare time, he likes outdoor activities such as hiking and sports such as volleyball.
UK’s Conservation AI Makes Huge Leap Detecting Threats to Endangered Species Across the Globe
The video above represents one of the first times that a pangolin, one of the world’s most critically endangered species, was detected in real time using artificial intelligence.
A U.K.-based nonprofit called Conservation AI made this possible with the help of NVIDIA technology. Such use of AI can help track even the rarest, most reclusive of species in real time, enabling conservationists to protect them from threats, such as poachers and fires, before it’s too late to intervene.
The organization was founded four years ago by researchers at Liverpool John Moores University — Paul Fergus, Carl Chalmers, Serge Wich and Steven Longmore.
In the past year and a half, Conservation AI has deployed 70+ AI-powered cameras across the world. These help conservationists preserve biodiversity through real-time detection of threats using deep learning models trained with transfer learning.
“It’s very simple — if we don’t protect our biodiversity, there won’t be people on this planet,” said Chalmers, who teaches deep learning and applied AI at Liverpool John Moores University. “And without AI, we’re never going to achieve our targets for protecting endangered species.”
The Conservation AI platform — built using NVIDIA Jetson modules for edge AI and the NVIDIA Triton Inference Server — analyzes footage, identifies species of interest and alerts conservationists and other users to potential threats via email, all in just four seconds.
It can also rapidly model trends in biodiversity and habitat health using a huge database of images and other metadata that would otherwise take years to analyze. The platform now enables conservationists to identify these trends and species activities in real time.
Conservation AI works with 150 organizations across the globe, including conservation societies, safaris and game reserves. To date, the platform has processed over 2 million images, about half of which were from the past three months.
Saving Time to Save Species
Threats to biodiversity have long been monitored using camera traps — networks of cameras equipped with infrared sensors that are placed in the wild. But camera traps can produce data that is hard to manage, as there’s often much variability in images of the animals and their environments.
“A typical camera trap study can take three years to analyze, so by the time you get the insights, it’s too late to do anything about the threat to those species,” said Fergus, a professor of machine learning at Liverpool John Moores University. “Conservation AI can analyze the same amount of data and send results to conservation teams so that interventions can happen in real time, all enabled by NVIDIA technology.”
Many endangered species occupy remote areas without access to human communication systems. The team uses NVIDIA Jetson AGX Xavier modules to analyze drone footage from such areas streamed to a smart controller that can count species population or alert conservationists when species of interest are detected.
Energy-efficient edge AI provided by the Jetson modules, which are equipped with Triton Inference Server, has sped up deep learning inference by 4x compared to the organization’s previous methods, according to Chalmers.
“We chose Triton because of the elasticity of the framework and the many types of models it supports,” he added. “Being able to train the models on the NVIDIA accelerated computing stack means we can make huge improvements on the models very, very quickly.”
Conservation AI trains and inferences its deep learning models with NVIDIA RTX 8000, T4 and A100 Tensor Core GPUs — along with the NVIDIA CUDA toolkit. Fergus called NVIDIA GPUs “game changers in the world of applied AI and conservation, where there are big-data challenges.”
In addition, the team’s species-detection pipeline is built on the NVIDIA DeepStream software development kit for vision AI applications, which enables real-time video inference in the field.
“Without this technology, helicopters would normally be sent up to observe the animals, which is hugely expensive and bad for the environment as it emits huge amounts of carbon dioxide,” Chalmers said. “Conservation AI technology helps reduce this problem and detects threats to animals before it’s too late to intervene.”
Detecting Pangolins, Rhinos and More
The Conservation AI platform has been deployed by Chester Zoo, a renowned conservation society based in the U.K., to detect poachers in real time, including those hunting pangolins in Uganda.
Since many endangered species, like pangolins, are so elusive, obtaining enough imagery of them to train AI models can be difficult. So, the Conservation AI team is working with NVIDIA to explore the use of synthetic data for model training.
The platform is also deployed at a game reserve in Limpopo, South Africa, where the AI keeps an eye on wildlife in the region, including black and white rhinos.
“Pound for pound, rhino horn is worth more than diamond,” Chalmers said. “We’ve basically created a geofence around these rhinos, so the reserve can intervene as soon as a poacher or another type of threat is detected.”
The organization’s long-term goal, Fergus said, is to create a toolkit that supports conservationists with many types of efforts, including wildlife monitoring through satellite imagery, as well as using deep learning models that analyze audio — like animal cries or the sounds of a forest fire.
“The loss of biodiversity is really a ticking time bomb, and the beauty of NVIDIA AI is that it makes every second count,” Chalmers said. “Without the NVIDIA accelerated computing stack, we just wouldn’t be able to do this — we wouldn’t be able to tackle climate change and reverse biodiversity loss, which is the ultimate dream.”
Read more about how NVIDIA technology helps to boost conservation and prevent poaching.
Featured imagery courtesy of Chester Zoo.