Intelligently search Alfresco content using Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML). With Amazon Kendra, you can easily aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer. Many organizations use the content management platform Alfresco to store their content. One of the key requirements for many enterprise customers using Alfresco is the ability to easily and securely find accurate information across all the documents in the data source.

We are excited to announce the public preview of the Amazon Kendra Alfresco connector. With the Alfresco OnPrem connector, you can index Alfresco content, filter the types of content you want to index, and search your Alfresco data using Amazon Kendra intelligent search.

This post shows you how to use the Amazon Kendra Alfresco OnPrem connector to configure the connector as a data source for your Amazon Kendra index and search your Alfresco documents. Based on the configuration of the Alfresco connector, you can synchronize the connector to crawl and index different types of Alfresco content such as wikis and blogs. The connector also ingests the access control list (ACL) information for each file. The ACL information is used for user context filtering, where search results for a query are filtered by what a user has authorized access to.

Prerequisites

To try out the Amazon Kendra connector for Alfresco using this post as a reference, you need the following:

Configure the data source using the Amazon Kendra connector for Alfresco

To add a data source to your Amazon Kendra index using the Alfresco OnPrem connector, you can use an existing index or create a new index. Then complete the following steps. For more information on this topic, refer to the Amazon Kendra Developer Guide.

  1. On the Amazon Kendra console, open your index and choose Data sources in the navigation pane.
  2. Choose Add data source.
  3. Under Alfresco, choose Add connector.
  4. In the Specify data source details section, enter a name and description and choose Next.
  5. In the Define access and security section, for Alfresco site URL, enter the Alfresco host name.
  6. To configure the SSL certificates, you can create a self-signed certificate for this setup (for example, using openssl x509 -in pattern.pem -out alfresco.crt) and upload the certificate to an Amazon Simple Storage Service (Amazon S3) bucket. Choose Browse S3 and choose the S3 bucket that contains the SSL certificate.
  7. For Site ID, enter the Alfresco site ID where you want to search documents.
  8. Under Authentication, you have two options:
    1. Use Secrets Manager to create new Alfresco authentication credentials. You need an Alfresco admin user name and password.
    2. Use an existing Secrets Manager secret that has the Alfresco authentication credentials you want the connector to access.
  9. Choose Save and add secret.
  10. For IAM role, choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
  11. Choose Next.
  12. In the Configure sync settings section, provide information about your sync scope and run schedule.
    You can include the files to be crawled using inclusion patterns or exclude them using exclusion patterns.
  13. Choose Next.
  14. In the Set field mappings section, you can optionally configure the field mappings to specify how the Alfresco field names are mapped to Amazon Kendra attributes or facets.
  15. Choose Next.
  16. Review your settings and confirm to add the data source.
  17. After the data source is added, choose Data sources in the navigation pane, select the newly added data source, and choose Sync now to start data source synchronization with the Amazon Kendra index.

    The sync process can take about 10–15 minutes. You can now search indexed Alfresco content using the search console or a search application. Optionally, you can search with ACL with the following additional steps.
  18. Go to the index page that you created and on the User access control tab, choose Edit settings.
  19. Under Access control settings, select Yes.
  20. For Token type, choose JSON.
  21. Choose Next.
  22. Choose Update.

Wait a few minutes for the index to be updated with the changes. Now let's see how you can perform intelligent search with Amazon Kendra.

Perform intelligent search with Amazon Kendra

Before you try searching on the Amazon Kendra console or using the API, make sure that the data source sync is complete. To check, view the data sources and verify that the last sync was successful.
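
If you prefer to check programmatically, the following is a minimal sketch using boto3 (the index and data source IDs are placeholders):

import boto3

kendra = boto3.client('kendra')

# List recent sync jobs for the Alfresco data source and inspect their status.
jobs = kendra.list_data_source_sync_jobs(
    Id='your-data-source-id',   # placeholder
    IndexId='your-index-id',    # placeholder
)
for job in jobs['History']:
    print(job['Status'], job.get('StartTime'))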

  1. To start your search, on the Amazon Kendra console, choose Search indexed content in the navigation pane.
    You’re redirected to the Amazon Kendra Search console. Now you can search information from the Alfresco documents you indexed using Amazon Kendra.
  2. For this post, we search for a document stored in Alfresco using the query AWS.
  3. Expand Test query with an access token and choose Apply token.
  4. For Username, enter the email address associated with your Alfresco account.
  5. Choose Apply.

Now the user can only see the content they have access to. In our example, user test@amazon.com doesn’t have access to any documents on Alfresco, so none are visible.
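
If you want to run the same token-filtered search programmatically, the following is a minimal sketch with the Kendra Query API (the index ID and token value are placeholders; the token format depends on your index's user access control settings):

import boto3

kendra = boto3.client('kendra')

# Query the index with a user access token so results are filtered by the
# ACLs that the Alfresco connector ingested.
response = kendra.query(
    IndexId='your-index-id',                      # placeholder
    QueryText='AWS',
    UserContext={'Token': 'your-access-token'},   # placeholder token
)
for item in response['ResultItems']:
    print(item['Type'], item['DocumentTitle']['Text'])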

Limitations

The connector has the following limitations:

  • As of this writing, we only support Alfresco OnPrem. Alfresco PaaS is not supported.
  • The connector doesn’t crawl the following entities: calendars, discussions, data lists, links, and system files.
  • During public preview, we support only basic authentication. For support for other forms of authentication, contact your Amazon representative.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Alfresco, delete that data source.

Conclusion

With the Amazon Kendra Alfresco connector, your organization can securely search Alfresco content using intelligent search powered by Amazon Kendra.

To learn more about the Amazon Kendra Alfresco connector, refer to the Amazon Kendra Developer Guide.

For more information on other Amazon Kendra built-in connectors to popular data sources, refer to Amazon Kendra native connectors.


About the author

Vikas Shah is an Enterprise Solutions Architect at Amazon Web Services. He is a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His areas of interest are ML, IoT, robotics, and storage. In his spare time, Vikas enjoys building robots, hiking, and traveling.

Read More

Best practices for TensorFlow 1.x acceleration training on Amazon SageMaker

Today, many customers use TensorFlow to train deep learning models for clickthrough rate prediction in advertising and personalized recommendations in ecommerce. As the behavior of their clients changes, they accumulate large amounts of new data every day. Model iteration is one of a data scientist's daily jobs, but training on large datasets can take too long.

Amazon SageMaker is a fully managed machine learning (ML) platform that helps data scientists focus on models instead of infrastructure, with native support for bring-your-own-algorithms and frameworks such as TensorFlow and PyTorch. SageMaker offers flexible distributed training options that adjust to your specific workflows. Because many data scientists may lack experience in the acceleration training process, in this post we show you the factors that matter for fast deep learning model training and the best practices of acceleration training for TensorFlow 1.x on SageMaker. We also provide sample code for DeepFM distributed training on SageMaker in the GitHub repo.

There are many factors you should consider to maximize CPU/GPU utilization when you run your TensorFlow script on SageMaker, such as infrastructure, type of accelerator, distributed training method, data loading method, mixed precision training, and more.

We discuss best practices in the following areas:

  • Accelerate training on a single instance
  • Accelerate training on multiple instances
  • Data pipelines
  • Automatic mixed precision training

Accelerate training on a single instance

When running your TensorFlow script on a single instance, you could choose a compute optimized series such as the Amazon Elastic Compute Cloud (Amazon EC2) C5 series, or an accelerated computing series with multiple GPUs in a single instance, such as p3.8xlarge, p3.16xlarge, p3dn.24xlarge, and p4d.24xlarge.

In this section, we discuss strategies for multiple CPUs on a single instance, and distributed training with multiple GPUs on a single instance.

Multiple CPUs on a single instance

In this section, we discuss manually setting operators’ parallelism on CPU devices, the tower method, TensorFlow MirroredStrategy, and Horovod.

Manually setting operators’ parallelism on CPU devices

TensorFlow automatically selects the appropriate number of threads to parallelize the operation calculation in the training process. However, you can set the intra_op thread pool and inter_op parallelism settings provided by TensorFlow and use MKL-DNN environment variables to control OS thread binding. See the following code:

# Set parallelism of intra_op and inter_op
num_cpus = int(os.environ['SM_NUM_CPUS'])
config = tf.ConfigProto(allow_soft_placement=True, device_count={'CPU': num_cpus}, intra_op_parallelism_threads=num_cpus, inter_op_parallelism_threads=num_cpus)
run_config = tf.estimator.RunConfig().replace(session_config = config)

# Use Intel MKL-DNN Setting to accelerate training speed
os.environ["KMP_AFFINITY"]= "verbose,disabled"
os.environ['OMP_NUM_THREADS'] = str(num_cpus)
os.environ['KMP_SETTINGS'] = '1'

The environment variable KMP_AFFINITY of MKL-DNN is set to granularity=fine,compact,1,0 by default. After setting both intra and inter of TensorFlow to the maximum number of vCPUs of the current instance, the upper limit of CPU usage is almost the same as the number of physical cores of the training instance.

If you set os.environ["KMP_AFFINITY"]= "verbose,disabled", the OS thread isn’t bound to the hardware hyper thread, and CPU usage could exceed the number of physical cores.

Regarding the settings of TensorFlow intra parallelism, TensorFlow inter parallelism, and the number of MKL-DNN threads, different combinations of these three parameters result in different training speeds. Therefore, you need to test each case to find the best combination. A common approach is to set the three parameters (intra_op_parallelism_threads and inter_op_parallelism_threads for TensorFlow, os.environ['OMP_NUM_THREADS'] for MKL-DNN) to half the number of vCPUs (the physical core count) or to the total number of vCPUs.
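
The following is a minimal sketch of the half-the-vCPUs starting point (it assumes the SageMaker SM_NUM_CPUS environment variable, as in the earlier snippet):

import os
import tensorflow as tf

num_cpus = int(os.environ['SM_NUM_CPUS'])
threads = num_cpus // 2  # starting point: half the vCPUs; benchmark against the full count

config = tf.ConfigProto(intra_op_parallelism_threads=threads,
                        inter_op_parallelism_threads=threads)
os.environ['OMP_NUM_THREADS'] = str(threads)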

Tower method

To replicate a model over GPUs, each GPU gets its own instance of the forward pass. The instance of the forward pass is called a tower. The tower method is almost always used for GPU devices. To compare training speed with other methods, here we also use the tower method for our CPU device.

If you don't set the CPU device manually, TensorFlow doesn't use the tower method to average the gradients, so you don't need to scale the batch size in such cases.

  1. Set the CPU device manually:
device_list = []
if manual_CPU_device_set:
    cpu_prefix = '/cpu:'
    for i in range(1, num_cpus):
        device_list.append(cpu_prefix + str(i))
  2. Use replicate_model_fn to wrap model_fn:
DeepFM = tf.estimator.Estimator(model_fn=tf.contrib.estimator.replicate_model_fn(model_fn, devices=device_list), model_dir=FLAGS.model_dir, params=model_params, config=config)
  3. Use TowerOptimizer to wrap optimizer:
optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)
  4. Wrap your model_fn:
with tf.variable_scope('deepfm_model', reuse=tf.AUTO_REUSE):

  5. Scale the batch size to (NUM_CPU - 1).

Let's look at the difference in CPU utilization with tower mode enabled. The following figure shows the ml.c5.18xlarge instance's CPU utilization with the following configuration:

No Tower + LibSVM data + pipe mode + MKL-DNN disable binding + TensorFlow intra/inter op parallelism setting to max number of instance’s vCPUs

No Tower

The following figure shows the ml.c5.18xlarge instance’s CPU utilization with the following configuration:

Tower with set CPU device + LibSVM data + pipe mode + MKL-DNN disable binding + TensorFlow intra/inter op parallelism setting to max number of instance’s vCPUs

with Tower

The CPU usage is higher when using the tower method, and it exceeds the number of physical cores.

TensorFlow MirroredStrategy

TensorFlow MirroredStrategy means synchronous training across multiple replicas on one machine. This strategy is typically used for training on one machine with multiple GPUs. To compare training speed with another method, we use MirroredStrategy for our CPU device.

When using TensorFlow MirroredStrategy, if you don't set the CPU devices, TensorFlow just uses one CPU device as a single worker, which is a waste of resources. We recommend setting the CPU devices manually, because MirroredStrategy performs the reduce operation on /CPU:0, so the /CPU:0 device isn't used as a replica. See the following code:

device_list = []
if manual_CPU_device_set:
    cpu_prefix = '/cpu:'
    for i in range(1, num_cpus):
        device_list.append(cpu_prefix + str(i))
    mirrored_strategy = tf.distribute.MirroredStrategy(devices=device_list)
else:
    mirrored_strategy = tf.distribute.MirroredStrategy()

# Set strategy to config:
config = tf.estimator.RunConfig(train_distribute=mirrored_strategy,
                                eval_distribute=mirrored_strategy,
                                session_config=config)

You need to scale the batch size when using MirroredStrategy; for example, scale the batch size to a multiple of the number of replica devices.
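
A minimal sketch, reusing mirrored_strategy from the preceding code (the per-replica batch size of 256 is a placeholder):

per_replica_batch_size = 256
# Keep the per-device batch size constant as the number of replicas changes.
global_batch_size = per_replica_batch_size * mirrored_strategy.num_replicas_in_sync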

Regarding the sub-strategy when you set the CPU devices: if you don't set the cross_device_ops parameter in tf.distribute.MirroredStrategy(), TensorFlow uses the ReductionToOneDevice sub-strategy by default. However, if you set HierarchicalCopyAllReduce as the sub-strategy, TensorFlow just does the reduce work on /CPU:0. When you use the TensorFlow dataset API and a distribution strategy together, input_fn should return the dataset object instead of features and labels.

Usually, TensorFlow MirroredStrategy is slower than the tower method on CPU training, so we don’t recommend using MirroredStrategy on a multi-CPU single host.

Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

The SageMaker Python SDK Estimator API provides a distribution parameter, which you can use to enable Horovod distributed training. SageMaker provisions the infrastructure and runs your script with MPI. See the following code:

hvd_processes_per_host = 4
distribution = {'mpi': {
    'enabled': True,
    'processes_per_host': hvd_processes_per_host,
    'custom_mpi_options': '-verbose -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'
    }
}

When choosing a GPU instance such as ml.p3.8xlarge, you need to pin each worker to a GPU:

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

To speed up model convergence, the official Horovod documentation recommends scaling the learning rate by the number of workers. However, in real-world projects, you should scale the learning rate to some extent, but not by the full number of workers, which can result in bad model performance. For example, if the original learning rate is 0.001, we scale the learning rate to 0.0015, even if the number of workers is four or more.
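
The following is a minimal sketch of this heuristic (the optimizer choice and the 1.5 factor are illustrative, not prescriptive):

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

base_lr = 0.001
# Scale the learning rate moderately instead of multiplying by hvd.size().
scaled_lr = base_lr * 1.5 if hvd.size() >= 4 else base_lr

optimizer = tf.train.AdamOptimizer(learning_rate=scaled_lr)
optimizer = hvd.DistributedOptimizer(optimizer)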

Generally, only the primary worker (Horovod rank 0) saves the checkpoint and model and runs the evaluation operation. You don't need to scale the batch size when using Horovod. SageMaker offers Pipe mode to stream data from Amazon Simple Storage Service (Amazon S3) into training instances. When you enable Pipe mode, be aware that different workers on the same host need to use different channels to avoid errors. This is because the first worker process reads the FIFO/channel data, and other worker processes on the same instance hang because they can't read data from the same FIFO/channel, so Horovod doesn't work properly. To avoid this issue, set the channels according to the number of workers per instance. At a minimum, make sure that different workers on the same host consume different channels; the same channel can be consumed by workers on a different host.
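
For example, a minimal sketch of giving each local worker its own channel (the channel names train_0, train_1, and so on are an assumption about how you define the input channels for the training job):

import horovod.tensorflow as hvd
from sagemaker_tensorflow import PipeModeDataset

# Each worker process on a host reads its own FIFO, so two local workers never
# compete for the same channel.
channel_name = 'train_{}'.format(hvd.local_rank())
dataset = PipeModeDataset(channel_name, record_format='TFRecord')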

When using Horovod, you may encounter the following error:

“One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.”

The possible cause for this issue is that a certain rank (such as rank 0) works slower or does more jobs than the other ranks, which causes the other ranks to wait for a long time. Although rank 0 sometimes has to do more work than the other ranks, it shouldn't be busy with extra work for too long. For example, if model evaluation on the validation set and checkpoint saving during training inevitably take a long time and could cause this error, one workaround is to let all workers do the same work as rank 0 (checkpoint saving, evaluation, and so on).

Data sharding is one of the most important things to consider when using distributed training. You can use TensorFlow dataset.shard() in your script. SageMaker also offers a dataset shard feature on the input channel by setting distribution=ShardedByS3Key in the dataset channel. See the following code:

from sagemaker_tensorflow import PipeModeDataset

dataset = PipeModeDataset(channel, record_format='TFRecord')

number_host = len(FLAGS.hosts)

if FLAGS.enable_data_multi_path:  # multiple channels map to different S3 paths
    if FLAGS.enable_s3_shard == False:
        if number_host > 1:
            index = hvd.rank() // FLAGS.worker_per_host
            dataset = dataset.shard(number_host, index)
else:
    if FLAGS.enable_s3_shard:
        dataset = dataset.shard(FLAGS.worker_per_host, hvd.local_rank())
    else:
        dataset = dataset.shard(hvd.size(), hvd.rank())

The following figure shows the result when using Horovod (ml.c5.18xlarge, Horovod + LibSVM + default intra op and inter op setting), which you can compare to the tower method.

horovod

Distributed training with multiple GPUs on a single instance

It's common to start distributed training with multiple GPUs on a single instance, because data scientists only need to manage one instance and can take advantage of the high-speed interconnect between GPUs. SageMaker training jobs support multiple instance types that have multiple GPUs on a single instance, such as ml.p3.8xlarge, ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge. The method is the same as with multiple CPUs in a single instance, but with a few changes in the script.

Tower method

The tower method here is almost the same as in multi-CPU training. You need to scale the batch size according to the number of GPUs in use.

TensorFlow MirroredStrategy

The default sub-strategy of MirroredStrategy is NcclAllReduce. You need to scale the batch size according to the number of GPUs in use. See the following code:

mirrored_strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=mirrored_strategy,
                                eval_distribute=mirrored_strategy)

Accelerate training on multiple instances

Scaling out is always an option to improve training speed. More and more data scientists choose it as the default option for distributed training. In this section, we discuss strategies for distributed training with multiple hosts.

Multiple CPUs with multiple instances

There are four main methods for using multiple CPUs with multiple instances when enabling distributed training:

    • Parameter server without manually setting operators’ parallelism on CPU devices
    • Parameter server with manually setting operators’ parallelism on CPU devices
    • Parameter server with tower (setting CPU devices manually, and setting allow_soft_placement=True in tf.ConfigProto)
    • Horovod

When using a parameter server in the tf.estimator API, the checkpoint path must be a shareable path such as Amazon S3 or a local path on Amazon Elastic File System (Amazon EFS) mapped to the container. For a parameter server in tf.keras, the checkpoint path can be set to the local path. For Horovod, the checkpoint path can be set to a local path of the training instance.

When using a parameter server and the tf.estimator API with the checkpoint path set to Amazon S3, if the model is quite large, you might encounter an error where the primary is stuck saving the checkpoint to S3. You can use the SageMaker built-in container for TensorFlow 1.15 or TensorFlow 1.15.2, or use Amazon EFS as the shared checkpoint path.

When using a parameter server for multiple hosts, the parameter load on each parameter server process may be unbalanced (especially when there are relatively large embedding table variables), which could cause errors. You can check the file size of each shard's checkpoint in Amazon S3 to determine whether the parameters on the parameter servers are balanced, because each parameter server corresponds to a shard of the checkpoint file. To avoid such issues, you can use the partitioner function to try to distribute the parameters of each parameter server evenly:

with tf.variable_scope('deepfm_model', reuse=tf.AUTO_REUSE, partitioner = tf.fixed_size_partitioner(num_shards=len(FLAGS.hosts))):

Single GPU with multiple instances

SageMaker training jobs support instances that only have one GPU, like the ml.p3.2xlarge, ml.g4dn, and ml.g5 series. There are two main methods used in this scenario: parameter servers and Horovod.

The built-in parameter server distributed training method of SageMaker is to start a parameter server process and a worker process for each training instance (each parameter server is only responsible for part of the model parameters), so the default is multi-machine single-GPU training. The SageMaker built-in parameter server distributed training is an asynchronous gradient update method. To reduce the impact of asynchronous updates on training convergence, it’s recommended to reduce the learning rate. If you want to use all the GPUs on the instance, you need to use a combination of parameter servers and the tower method.

For Horovod, just set processes_per_host=1 in the distribution parameter of the SageMaker Python Estimator API.
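
The following is a minimal sketch of such an estimator (the entry point, role, and instance settings are placeholders):

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',         # placeholder training script
    role=role,                      # an IAM role you have already defined
    instance_count=4,
    instance_type='ml.p3.2xlarge',  # single-GPU instance type
    framework_version='1.15.2',
    py_version='py3',
    distribution={'mpi': {'enabled': True, 'processes_per_host': 1}},
)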

Multiple GPUs with multiple instances

For parameter servers and the tower method, the code changes are basically the same as the tower method for a single instance with multiple GPUs, and there is no need to manually set the GPU devices.

For Horovod, set processes_per_host in the distribution parameter to the number of GPUs of each training instance. If you use Pipe mode, the number of workers per instance needs to match the number of channels.

Data pipelines

In addition to the infrastructure we have discussed, there is another important thing to consider: the data pipeline. A data pipeline refers to how you load and transform data before it feeds into the neural network. The CPU is used to prepare data, whereas the GPU is used to compute on the data coming from the CPU. Because the GPU is an expensive resource, GPU idle time is inefficient; a good data pipeline in your training job can improve GPU and CPU utilization.

When you’re trying to optimize your TensorFlow data input pipeline, consider the API order used in TensorFlow datasets, the training data size (a lot of small files or several large files), batch size, and so on.

Let’s look at the interaction between GPU and CPU during training. The following figures compare interactions with and without a pipeline.

pipeline

A better pipeline could reduce GPU idle time. Consider the following tips:

  • Use simple function logic in extracting features and labels
  • Prefetch samples to memory
  • Reduce unnecessary disk I/O and networking I/O
  • Cache the processed features and labels in memory
  • Reduce the number of replication times between CPU and GPU
  • Have different workers deal with different parts of the training dataset
  • Reduce the number of calls to the TensorFlow dataset API

TensorFlow provides transformation APIs related to dataset formats, and the order in which you call them affects training speed significantly. The best order of calling the TensorFlow dataset APIs needs to be tested. The following are some basic principles:

  • Use a vectorized map. This means calling the TensorFlow dataset batch API first, then the dataset map API. The custom parsing function provided in the map function, such as decode_tfrecord in the sample code, parses a mini-batch of data. In contrast, calling map first and then batch is a scalar map, where the custom parser function processes just one sample.
  • Use the TensorFlow dataset cache API to cache features and labels. Put the TensorFlow dataset cache API before the TensorFlow dataset repeat API; otherwise, RAM utilization increases linearly epoch by epoch. If the dataset is as large as your RAM, don't use the TensorFlow dataset cache API. If you need to use the TensorFlow dataset cache API and shuffle API, consider using the following order: create the TensorFlow dataset object -> cache API -> shuffle API -> batch API -> map API -> repeat API -> prefetch API (a short sketch of this order follows the code sample below).
  • Prefer the TFRecord dataset format over the LibSVM format.
  • File mode or Pipe mode depends on your dataset format and number of files. The tfrecorddataset API can set num_parallel_reads to read multiple files in parallel and set buffer_size to optimize data reading, whereas the pipemodedataset API doesn't have such settings. Pipe mode is more suitable for situations where a single file is large and the total number of files is small. We recommend using a SageMaker processing job to do the preprocessing work, such as joining multiple files into a bigger file according to labels, using a sampling method to make the dataset more balanced, and shuffling the balanced dataset.

See the following code sample:

def decode_tfrecord(batch_examples):
    # The feature definition here should be consistent with the LibSVM-to-TFRecord conversion process.
    features = tf.parse_example(batch_examples,
                                features={
                                    "label": tf.FixedLenFeature([], tf.float32),
                                    "ids": tf.FixedLenFeature(dtype=tf.int64, shape=[FLAGS.field_size]),
                                    "values": tf.FixedLenFeature(dtype=tf.float32, shape=[FLAGS.field_size])
                                })

    batch_label = features["label"]
    batch_ids = features["ids"]
    batch_values = features["values"]

    return {"feat_ids": batch_ids, "feat_vals": batch_values}, batch_label


def decode_libsvm(line):
    columns = tf.string_split([line], ' ')
    labels = tf.string_to_number(columns.values[0], out_type=tf.float32)
    splits = tf.string_split(columns.values[1:], ':')
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    feat_ids = tf.string_to_number(feat_ids, out_type=tf.int32)
    feat_vals = tf.string_to_number(feat_vals, out_type=tf.float32)
    return {"feat_ids": feat_ids, "feat_vals": feat_vals}, labels


if FLAGS.pipe_mode == 0:
    dataset = tf.data.TFRecordDataset(filenames)
else:
    # Enter Pipe mode
    dataset = PipeModeDataset(channel, record_format='TFRecord')

if FLAGS.enable_s3_shard == False:
    host_rank = FLAGS.hosts.index(FLAGS.current_host)
    number_host = len(FLAGS.hosts)
    dataset = dataset.shard(number_host, host_rank)

dataset = dataset.batch(batch_size, drop_remainder=True)  # Batch size to use
dataset = dataset.map(decode_tfrecord,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)

if num_epochs > 1:
    dataset = dataset.repeat(num_epochs)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
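
As a complement, the following is a minimal sketch of the cache -> shuffle -> batch -> map -> repeat -> prefetch order recommended earlier, for File mode, reusing the names from the preceding sample (the shuffle buffer size is a placeholder):

dataset = tf.data.TFRecordDataset(filenames,
                                  num_parallel_reads=tf.data.experimental.AUTOTUNE)
dataset = dataset.cache()   # cache before repeat so RAM usage doesn't grow epoch by epoch
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size, drop_remainder=True)
dataset = dataset.map(decode_tfrecord,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.repeat(num_epochs)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)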

For training on CPU instances, setting parallelism of intra op, inter op, and the environment variable of MKL-DNN is a good starting point.

Automatic mixed precision training

The last thing we discuss is automatic mixed precision training, which can accelerate training speed while preserving model performance. As of this writing, the Nvidia V100 GPU (P3 instance) and A100 (P4dn instance) support Tensor Cores. You can enable mixed precision training in TensorFlow when using those types of instances. Starting from version 1.14, TensorFlow has supported automatic mixed precision training. You can use the following statement to wrap your original optimizer:

tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)

If the model is small and GPU utilization is low, there's no advantage to automatic mixed precision training. If the model is large, automatic mixed precision training can accelerate training speed.

Conclusion

When you start your deep learning model training in SageMaker, consider the following tips to achieve a faster training speed:

  • Try the multi-CPU, single-instance method or single-GPU, single-instance method first. If CPU/GPU utilization is very high (for example more than 90%), move to the next step.
  • Try more CPUs in a single host or more GPUs in a single host. If utilization is near the maximum utilization of the CPUs or GPUs, move to the next step.
  • Try multiple CPUs or multiple GPUs with multiple hosts.
  • You need to modify your code when using parameter servers or Horovod. The code modification isn't the same for the TensorFlow session-based API, tf.estimator API, and tf.keras API. A parameter server or Horovod may show different training speeds in different training cases and tasks, so try both methods if you have the time and budget to determine the best one.

Keep in mind the following advice:

  • Check utilization before scaling, optimize your data pipeline, and make CPU and GPU overlap in the timeline.
  • First scale up, then scale out.
  • If you can't increase your GPU utilization after trying all the methods, try CPU. There are many cases (especially for the clickthrough rate ranking model) where the total training time of CPU instance training is shorter and more cost-effective than GPU instance training.

We also have a code sample in the GitHub repo, where we show two samples of DeepFM distributed training on SageMaker. One uses a TensorFlow parameter server on CPU instances, and the other uses Horovod on GPU instances.


About the Authors

Yuhui Liang is a Sr. Machine Learning Solutions Architect. He's focused on the promotion and application of machine learning, and is deeply involved in many customers' machine learning projects. He has rich experience in distributed deep learning training, recommendation systems, and computational advertising.

Shishuai Wang is a Sr. Machine Learning Solutions Architect. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys watching movies and traveling around the world.

Read More

Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search

So much data, so little time. Machine learning (ML) experts, data scientists, engineers and enthusiasts have encountered this problem the world over. From natural language processing to computer vision, tabular to time series, and everything in-between, the age-old problem of optimizing for speed when running data against as many GPUs as you can get has inspired countless solutions. Today, we’re happy to announce features for PyTorch developers using native open-source frameworks, like PyTorch Lightning and PyTorch DDP, that will streamline their path to the cloud.

Amazon SageMaker is a fully-managed service for ML, and SageMaker model training is an optimized compute environment for high-performance training at scale. SageMaker model training offers a remote training experience with a seamless control plane to easily train and reproduce ML models at high performance and low cost. We’re thrilled to announce new features in the SageMaker training portfolio that make running PyTorch at scale even easier and more accessible:

  1. PyTorch Lightning can now be integrated with SageMaker's distributed data parallel library with only a one-line code change.
  2. SageMaker model training now has support for native PyTorch Distributed Data Parallel with the NCCL backend, allowing developers to migrate onto SageMaker more easily than ever before.

In this post, we discuss these new features, and also learn how Amazon Search has run PyTorch Lightning with the optimized distributed training backend in SageMaker to speed up model training time.

Before diving into the Amazon Search case study, for those who aren’t familiar we would like to give some background on SageMaker’s distributed data parallel library. In 2020, we developed and launched a custom cluster configuration for distributed gradient descent at scale that increases overall cluster efficiency, introduced on Amazon Science as Herring. Using the best of both parameter servers and ring-based topologies, SageMaker Distributed Data Parallel (SMDDP) is optimized for the Amazon Elastic Compute Cloud (Amazon EC2) network topology, including EFA. For larger cluster sizes, SMDDP is able to deliver 20–40% throughput improvements relative to Horovod (TensorFlow) and PyTorch Distributed Data Parallel. For smaller cluster sizes and supported models, we recommend the SageMaker Training Compiler, which is able to decrease overall job time by up to 50%.

Customer spotlight: PyTorch Lightning on SageMaker’s optimized backend with Amazon Search

Amazon Search is responsible for the search and discovery experience on Amazon.com. It powers the search experience for customers looking for products to buy on Amazon. At a high level, Amazon Search builds an index for all products sold on Amazon.com. When a customer enters a query, Amazon Search then uses a variety of ML techniques, including deep learning models, to match relevant and interesting products to the customer query. Then it ranks the products before showing the results to the customer.

Amazon Search scientists have used PyTorch Lightning as one of the main frameworks to train the deep learning models that power Search ranking due to its added usability features on top of PyTorch. SMDDP was not supported for deep learning models written in PyTorch Lightning before this new SageMaker launch. This prevented Amazon Search scientists who prefer using PyTorch Lightning from scaling their model training using data parallel techniques, significantly slowing down their training time and preventing them from testing new experiments that require more scalable training.

The team's early benchmarking results show 7.3 times faster training time for a sample model when trained on eight nodes as compared to a single-node training baseline. The baseline model used in this benchmarking is a multi-layer perceptron neural network with seven dense fully connected layers and over 200 parameters. The following table summarizes the benchmarking results on ml.p3.16xlarge SageMaker training instances.

Number of Instances | Training Time (minutes) | Improvement
1                   | 99                      | Baseline
2                   | 55                      | 1.8x
4                   | 27                      | 3.7x
8                   | 13.5                    | 7.3x

Next, we dive into the details on the new launches. If you like, you can step through our corresponding example notebook.

Run PyTorch Lightning with the SageMaker distributed training library

We are happy to announce that SageMaker Data Parallel now seamlessly integrates with PyTorch Lightning within SageMaker training.

PyTorch Lightning is an open-source framework that provides a simplification for writing custom models in PyTorch. In some ways similar to what Keras did for TensorFlow, or even arguably Hugging Face, PyTorch Lightning provides a high-level API with abstractions for much of the lower-level functionality of PyTorch itself. This includes defining the model, profiling, evaluation, pruning, model parallelism, hyperparameter configurations, transfer learning, and more.

Previously, PyTorch Lightning developers were uncertain about how to seamlessly migrate their training code on to high-performance SageMaker GPU clusters. In addition, there was no way for them to take advantage of efficiency gains introduced by SageMaker Data Parallel.

For PyTorch Lightning, generally speaking, there should be little-to-no code changes to simply run these APIs on SageMaker Training. In the example notebooks we use the DDPStrategy and DDPPlugin methods.

There are three steps to use PyTorch Lightning with SageMaker Data Parallel as an optimized backend:

  1. Use a supported AWS Deep Learning Container (DLC) as your base image, or optionally create your own container and install the SageMaker Data Parallel backend yourself. Ensure that you have PyTorch Lightning included in your necessary packages, such as with a requirements.txt file.
  2. Make a few minor code changes to your training script that enable the optimized backend. These include:
    1. Import the SM DDP library:
      import smdistributed.dataparallel.torch.torch_smddp
      

    2. Set up the PyTorch Lightning environment for SageMaker:
      import os

      from pytorch_lightning.plugins.environments.lightning_environment import LightningEnvironment

      env = LightningEnvironment()
      env.world_size = lambda: int(os.environ["WORLD_SIZE"])
      env.global_rank = lambda: int(os.environ["RANK"])

    3. If you’re using a version of PyTorch Lightning older than 1.5.10, you’ll need to add a few more steps.
      1. First, add the environment variable:
        os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "smddp"

      2. Second, ensure you use DDPPlugin, rather than DDPStrategy. If you’re using a more recent version, which you can easily set by placing the requirements.txt in the source_dir for your job, then this isn’t necessary. See the following code:
        ddp = DDPPlugin(parallel_devices=[torch.device("cuda", d) for d in range(num_gpus)], cluster_environment=env)

    4. Optionally, define your process group backend as "smddp" in the DDPStrategy object. However, if you're using PyTorch Lightning with the PyTorch DDP backend, which is also supported, simply remove this `process_group_backend` parameter. See the following code:
      ddp = DDPStrategy(
        cluster_environment=env, 
        process_group_backend="smddp", 
        accelerator="gpu")

  3. Ensure that you have a distribution method noted in the estimator, such as distribution={"smdistributed": {"dataparallel": {"enabled": True}}} if you're using the Herring backend, or distribution={"pytorchddp": {"enabled": True}}.
  • For a full listing of suitable parameters in the distribution parameter, see our documentation here.

Now you can launch your SageMaker training job! You can launch your training job via the Python SDK, Boto3, the SageMaker console, the AWS Command Line Interface (AWS CLI), and countless other methods. From an AWS perspective, this is a single API command: create-training-job. Whether you launch this command from your local terminal, an AWS Lambda function, an Amazon SageMaker Studio notebook, a KubeFlow pipeline, or any other compute environment is completely up to you.

Please note that the integration between PyTorch Lightning and SageMaker Data Parallel is currently supported for only newer versions of PyTorch, starting at 1.11. In addition, this release is only available in the AWS DLCs for SageMaker starting at PyTorch 1.12. Make sure you point to this image as your base. In us-east-1, this address is as follows:

ecr_image = '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker'

Then you can extend your Docker container using this as your base image, or you can pass this as a variable into the image_uri argument of the SageMaker training estimator.
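
Putting this together, the following is a minimal sketch of an estimator that uses this image and the SageMaker data parallel backend (the entry point, role, and instance settings are placeholders):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",         # placeholder Lightning training script
    source_dir="code",
    role=role,                      # an IAM role you have already defined
    image_uri=ecr_image,            # the DLC image shown above
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()  # optionally pass your input channels here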

As a result, you’ll be able to run your PyTorch Lightning code on SageMaker Training’s optimized GPUs, with the best performance available on AWS.

Run PyTorch Distributed Data Parallel on SageMaker

The biggest problem PyTorch Distributed Data Parallel (DDP) solves is deceptively simple: speed. A good distributed training framework should provide stability, reliability, and, most importantly, excellent performance at scale. PyTorch DDP delivers on this by providing torch developers with APIs to replicate their models over multiple GPU devices, in both single-node and multi-node settings. The framework then manages sharding different objects from the training dataset to each model copy, averaging the gradients for each of the model copies to synchronize them at each step. This produces one model at the completion of the full training run. The following diagram illustrates this process.

PyTorch DDP is common in projects that use large datasets. The precise size of each dataset will vary widely, but a general guideline is to scale datasets, compute sizes, and model sizes in similar ratios. Also called scaling laws, the optimal combination of these three is very much up for debate and will vary based on applications. At AWS, based on working with multiple customers, we can clearly see benefits from data parallel strategies when an overall dataset size is at least a few tens of GBs. When the datasets get even larger, implementing some type of data parallel strategy is a critical technique to speed up the overall experiment and improve your time to value.

Previously, customers who were using PyTorch DDP for distributed training on premises or in other compute environments lacked a framework to easily migrate their projects onto SageMaker Training to take advantage of high-performance GPUs with a seamless control plane. Specifically, they needed to either migrate their data parallel framework to SMDDP, or develop and test the capabilities of PyTorch DDP on SageMaker Training manually. Today, SageMaker Training is happy to provide a seamless experience for customers onboarding their PyTorch DDP code.

To use this effectively, you don’t need to make any changes to your training scripts.

You can see this new parameter in the following code. In the distribution parameter, simply add pytorchddp and set enabled as true.

estimator = PyTorch(
    base_job_name="pytorch-dataparallel-mnist",
    source_dir="code",
    entry_point = "my_model.py",
    ... 
    # Training using native PyTorch DDP on SageMaker
    distribution = {"pytorchddp": {"enabled": "true"}}
)

This new configuration starts with SageMaker Python SDK version 2.102.0 and PyTorch DLC 1.11.

For PyTorch DDP developers who are familiar with the popular torchrun framework, it’s helpful to know that this isn’t necessary on the SageMaker training environment, which already provides robust fault tolerance. However, to minimize code rewrites, you can bring another launcher script that runs this command as your entry point.

Now PyTorch developers can easily move their scripts onto SageMaker, ensuring their scripts and containers can run seamlessly across multiple compute environments.

This prepares them to, in the future, take advantage of SageMaker’s distributed training libraries that provide AWS-optimized training topologies to deliver up to 40% speedup enhancements. For PyTorch developers, this is a single line of code! For PyTorch DDP code, you can simply set the backend to smddp in the initialization (see Modify a PyTorch Training Script), as shown in the following code:

import smdistributed.dataparallel.torch.torch_smddp
import torch.distributed as dist
dist.init_process_group(backend='smddp')

As we saw above, you can also set the backend of DDPStrategy to smddp when using Lightning. This can lead to up to 40% overall speedups for large clusters! To learn more about distributed training on SageMaker see our on-demand webinar, supporting notebooks, relevant documentation, and papers.

Conclusion

In this post, we introduced two new features within the SageMaker Training family. These make it much easier for PyTorch developers to use their existing code on SageMaker, both PyTorch DDP and PyTorch Lightning.

We also showed how Amazon Search uses SageMaker Training for training their deep learning models, and in particular PyTorch Lightning with the SageMaker Data Parallel optimized collective library as a backend. Moving to distributed training overall helped Amazon Search achieve 7.3x faster training times.


About the authors

Emily Webber joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.

Karan Dhiman is a Software Development Engineer at AWS, based in Toronto, Canada. He is very passionate about the machine learning space and about building solutions for accelerating distributed computing workloads.

Vishwa Karia is a Software Development Engineer at AWS Deep Engine. Her interests lie at the intersection of Machine Learning and Distributed Systems and she is also passionate about empowering women in tech and AI.

Eiman Elnahrawy is a Principal Software Engineer at Amazon Search leading the efforts on Machine Learning acceleration, scaling, and automation. Her expertise spans multiple areas, including Machine Learning, Distributed Systems, and Personalization.

Read More

Visualize your Amazon Lookout for Metrics anomaly results with Amazon QuickSight

One of the challenges encountered by teams using Amazon Lookout for Metrics is quickly and efficiently connecting it to data visualization. The anomalies are presented individually on the Lookout for Metrics console, each with their own graph, making it difficult to view the set as a whole. An automated, integrated solution is needed for deeper analysis.

In this post, we use a Lookout for Metrics live detector built following the Getting Started section from the AWS Samples, Amazon Lookout for Metrics GitHub repo. After the detector is active and anomalies are generated from the dataset, we connect Lookout for Metrics to Amazon QuickSight. We create two datasets: one by joining the dimensions table with the anomaly table, and another by joining the anomaly table with the live data. We can then add these two datasets to a QuickSight analysis, where we can add charts in a single dashboard.

We can provide two types of data to the Lookout for Metrics detector: continuous and historical. The AWS Samples GitHub repo offers both, though we focus on the continuous live data. The detector monitors this live data to identify anomalies and writes the anomalies to Amazon Simple Storage Service (Amazon S3) as they’re generated. At the end of a specified interval, the detector analyzes the data. Over time, the detector learns to more accurately identify anomalies based on patterns it finds.

Lookout for Metrics uses machine learning (ML) to automatically detect and diagnose anomalies in business and operational data, such as a sudden dip in sales revenue or customer acquisition rates. The service is now generally available as of March 25, 2021. It automatically inspects and prepares data from a variety of sources to detect anomalies with greater speed and accuracy than traditional methods used for anomaly detection. You can also provide feedback on detected anomalies to tune the results and improve accuracy over time. Lookout for Metrics makes it easy to diagnose detected anomalies by grouping together anomalies related to the same event and sending an alert that includes a summary of the potential root cause. It also ranks anomalies in order of severity so you can prioritize your attention to what matters the most to your business.

QuickSight is a fully-managed, cloud-native business intelligence (BI) service that makes it easy to connect to your data to create and publish interactive dashboards. Additionally, you can use Amazon QuickSight to get instant answers through natural language queries.

You can access serverless, highly scalable QuickSight dashboards from any device, and seamlessly embed them into your applications, portals, and websites. The following screenshot is an example of what you can achieve by the end of this post.

Overview of solution

The solution is a combination of AWS services, primarily Lookout for Metrics, QuickSight, AWS Lambda, Amazon Athena, AWS Glue, and Amazon S3.

The following diagram illustrates the solution architecture. Lookout for Metrics detects and sends the anomalies to Lambda via an alert. The Lambda function generates the anomaly results as CSV files and saves them in Amazon S3. An AWS Glue crawler analyzes the metadata, and creates tables in Athena. QuickSight uses Athena to query the Amazon S3 data, allowing dashboards to be built to visualize both the anomaly results and the live data.

Solution Architecture

This solution expands on the resources created in the Getting Started section of the GitHub repo. For each step, we include options to create the resources either using the AWS Management Console or launching the provided AWS CloudFormation stack. If you have a customized Lookout for Metrics detector, you can use it and adapt the following notebook to achieve the same results.

The implementation steps are as follows:

  1. Create the Amazon SageMaker notebook instance (ALFMTestNotebook) and notebooks using the stack provided in the Initial Setup section from the GitHub repo.
  2. Open the notebook instance on the SageMaker console and navigate to the amazon-lookout-for-metrics-samples/getting_started folder.
  3. Create the S3 bucket and complete the data preparation using the first notebook (1.PrereqSetupData.ipynb). Open the notebook with the conda_python3 kernel, if prompted.

We skip the second notebook because it’s focused on backtesting data.

  4. If you're walking through the example using the console, create the Lookout for Metrics live detector and its alert using the third notebook (3.GettingStartedWithLiveData.ipynb).

If you’re using the provided CloudFormation stacks, the third notebook isn’t required. The detector and its alert are created as part of the stack.

  5. After you create the Lookout for Metrics live detector, you need to activate it from the console.

This can take up to 2 hours to initialize the model and detect anomalies.

  6. Deploy a Lambda function, using Python with a Pandas library layer, and create an alert attached to the live detector to launch it.
  7. Use the combination of Athena and AWS Glue to discover and prepare the data for QuickSight.
  8. Create the QuickSight data source and datasets.
  9. Finally, create a QuickSight analysis for visualization, using the datasets.

The CloudFormation scripts are typically run as a set of nested stacks in a production environment. They’re provided individually in this post to facilitate a step-by-step walkthrough.

Prerequisites

To go through this walkthrough, you need an AWS account where the solution will be deployed. Make sure that all the resources you deploy are in the same Region. You need a running Lookout for Metrics detector built from notebooks 1 and 3 from the GitHub repo. If you don’t have a running Lookout for Metrics detector, you have two options:

  • Run notebooks 1 and 3, and continue from the step 1 of this post (creating the Lambda function and alert)
  • Run notebook 1 and then use the CloudFormation template to generate the Lookout for Metrics detector

Create the live detector using AWS CloudFormation

The L4MLiveDetector.yaml CloudFormation script creates the Lookout for Metrics anomaly detector with its source pointing to the live data in the specified S3 bucket. To create the detector, complete the following steps:

  1. Launch the stack from the following link:

  2. On the Create stack page, choose Next.
  3. On the Specify stack details page, provide the following information:
    1. A stack name. For example, L4MLiveDetector.
    2. The S3 bucket, <Account Number>-lookoutmetrics-lab.
    3. The Role ARN, arn:aws:iam::<Account Number>:role/L4MTestRole.
    4. An anomaly detection frequency. Choose PT1H (hourly).
  4. Choose Next.
  5. On the Configure stack options page, leave everything as is and choose Next.
  6. On the Review page, leave everything as is and choose Create stack.

Create the live detector SMS alert using AWS CloudFormation (Optional)

This step is optional. The alert is presented as an example, with no impact on the dataset creation. The L4MLiveDetectorAlert.yaml CloudFormation script creates the Lookout for Metrics anomaly detector alert with an SMS target.

  1. Launch the stack from the following link:

  2. On the Create stack page, choose Next.
  3. On the Specify stack details page, update the SMS phone number and enter a name for the stack (for example, L4MLiveDetectorAlert).
  4. Choose Next.
  5. On the Configure stack options page, leave everything as is and choose Next.
  6. On the Review page, select the acknowledgement check box, leave everything else as is, and choose Create stack.

Resource cleanup

Before proceeding with the next step, stop your SageMaker notebook instance to ensure no unnecessary costs are incurred. It is no longer needed.

Create the Lambda function and alert

In this section, we provide instructions on creating your Lambda function and alert via the console or AWS CloudFormation.

Create the function and alert with the console

You need a Lambda AWS Identity and Access Management (IAM) role following the least privilege best practice to access the bucket where you want the results to be saved.

  1. On the Lambda console, create a new function.
  2. Select Author from scratch.
  3. For Function name, enter a name.
  4. For Runtime, choose Python 3.8.
  5. For Execution role, select Use an existing role and specify the role you created.
  6. Choose Create function.
  7. Download the ZIP file containing the necessary code for the Lambda function.
  8. On the Lambda console, open the function.
  9. On the Code tab, choose Upload from, choose .zip file, and upload the file you downloaded.
  10. Choose Save.

Your file tree should remain the same after uploading the ZIP file.

  11. In the Layers section, choose Add layer.
  12. Select Specify an ARN.
  13. In the following GitHub repo, choose the CSV corresponding to the Region you're working in and copy the ARN from the latest Pandas version.
  14. For Specify an ARN, enter the ARN you copied.
  15. Choose Add.

  16. To adapt the function to your environment, at the bottom of the code in the lambda_function.py file, make sure to update the bucket name to the bucket where you want to save the anomaly results, and update the DataSet_ARN with the one from your anomaly detector.
  17. Choose Deploy to make the changes active.

You now need to connect the Lookout for Metrics detector to your function.

  18. On the Lookout for Metrics console, navigate to your detector and choose Add alert.
  19. Enter the alert name and your preferred severity threshold.
  20. From the channel list, choose Lambda.
  21. Choose the function you created and make sure you have the right role to trigger it.
  22. Choose Add alert.

Now you wait for your alert to trigger. The time varies depending on when the detector finds an anomaly.

When an anomaly is detected, Lookout for Metrics triggers the Lambda function. It receives the necessary information from Lookout for Metrics and checks if there is already a saved CSV file in Amazon S3 at the corresponding timestamp of the anomaly. If there isn't a file, Lambda generates the file and adds the anomaly data. If the file already exists, Lambda updates the file with the extra data received. The function generates a separate CSV file for each timestamp.
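
The downloadable ZIP file contains the actual function code; the following is only a minimal sketch of that flow, with a placeholder bucket, key layout, and event fields:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'your-results-bucket'  # placeholder: the bucket you configured in the function

def lambda_handler(event, context):
    # The field names below are illustrative; the real function parses the full
    # alert payload that Lookout for Metrics sends.
    timestamp = event['timestamp']
    key = 'anomalyResults/metricValue_AnomalyScore/{}.csv'.format(timestamp)
    try:
        body = s3.get_object(Bucket=BUCKET, Key=key)['Body'].read().decode('utf-8')
    except ClientError:
        body = ''  # no file for this timestamp yet
    body += '{},{}\n'.format(timestamp, event['anomalyScore'])
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)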

Create the function and alert using AWS CloudFormation

Similar to the console instructions, you download the ZIP file containing the necessary code for the Lambda function. However, in this case it needs to be uploaded to the S3 bucket in order for the AWS CloudFormation code to load it during function creation.

In the S3 bucket specified in the Lookout for Metrics detector creation, create a folder called lambda-code, and upload the ZIP file.

The Lambda function loads this as its code during creation.

The L4MLambdaFunction.yaml CloudFormation script creates the Lambda function and alert resources and uses the function code archive stored in the same S3 bucket.

  1. Launch the stack from the following link:

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, specify a stack name (for example, L4MLambdaFunction).
  3. In the following GitHub repo, open the CSV corresponding to the Region you’re working in and copy the ARN from the latest Pandas version.
  4. Enter the ARN as the Pandas Lambda layer ARN parameter.
  5. Choose Next.
  6. On the Configure stack options page, leave everything as is and choose Next.
  7. On the Review page, select the acknowledgement check box, leave everything else as is, and choose Create stack.

Activate the detector

Before proceeding to the next step, you need to activate the detector from the console.

  1. On the Lookout for Metrics console, choose Detectors in the navigation pane.
  2. Choose your newly created detector.
  3. Choose Activate, then choose Activate again to confirm.

Activation initializes the detector; it’s finished when the model has completed its learning cycle. This can take up to 2 hours.
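
If you prefer to script this step, the detector can also be activated with the SDK for Python (Boto3). The following is a minimal sketch; the detector ARN is a placeholder you need to replace with the ARN shown on your detector’s page.

import boto3

lookoutmetrics = boto3.client('lookoutmetrics')

# Placeholder ARN: replace with the ARN of the detector you created earlier
detector_arn = 'arn:aws:lookoutmetrics:us-east-1:111111111111:AnomalyDetector:your-detector'

lookoutmetrics.activate_anomaly_detector(AnomalyDetectorArn=detector_arn)

# The Status field reports progress: LEARNING while initializing, ACTIVE when ready
status = lookoutmetrics.describe_anomaly_detector(
    AnomalyDetectorArn=detector_arn)['Status']
print('Detector status:', status)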

Prepare the data for QuickSight

Before you complete this step, give the detector time to find anomalies. The Lambda function you created saves the anomaly results in the Lookout for Metrics bucket in the anomalyResults directory. We can now process this data to prepare it for QuickSight.

Create the AWS Glue crawler on the console

After some anomaly CSV files have been generated, we use an AWS Glue crawler to generate the metadata tables.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Add crawler.

  1. Enter a name for the crawler (for example, L4MCrawler).
  2. Choose Next.
  3. For Crawler source type, select Data stores.
  4. For Repeat crawls of S3 data stores, select Crawl all folders.
  5. Choose Next.

  1. On the data store configuration page, for Crawl data in, select Specified path in my account.
  2. For Include path, enter the path of your dimensionContributions file (s3://YourBucketName/anomalyResults/dimensionContributions).
  3. Choose Next.
  4. Choose Yes to add another data store and repeat the instructions for metricValue_AnomalyScore (s3://YourBucketName/anomalyResults/metricValue_AnomalyScore).
  5. Repeat the instructions again for the live data to be analyzed by the Lookout for Metrics anomaly detector (this is the S3 dataset location from your Lookout for Metrics detector).

You should now have three data stores for the crawler to process.

Now you need to select the role to allow the crawler to go through the S3 locations of your data.

  1. For this post, select Create an IAM role and enter a name for the role.
  2. Choose Next.

  1. For Frequency, leave as Run on demand and choose Next.
  2. In the Configure the crawler’s output section, choose Add database.

This creates the Athena database where your metadata tables are located after the crawler is complete.

  1. Enter a name for your database and choose Create.
  2. Choose Next, then choose Finish.

  1. On the Crawlers page of the AWS Glue console, select the crawler you created and choose Run crawler.

You may need to wait a few minutes, depending on the size of the data. When it’s complete, the crawler’s status shows as Ready. To see the metadata tables, navigate to your database on the Databases page and choose Tables in the navigation pane.

In this example, the metadata table called live represents the S3 dataset from the Lookout for Metrics live detector. As a best practice, it’s recommended to encrypt your AWS Glue Data Catalog metadata.

Athena automatically recognizes the metadata tables, and QuickSight uses Athena to query the data and visualize the results.
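
As a quick sanity check before building visuals, you can also query the crawled tables directly through the Athena API. The following is a minimal Boto3 sketch; the database name and query results location are assumptions you need to adjust to your environment (the live table name comes from the crawler output described above).

import time

import boto3

athena = boto3.client('athena')

DATABASE = 'your-l4m-database'                           # assumption: the database created by the crawler
OUTPUT_LOCATION = 's3://YourBucketName/athena-results/'  # assumption: any S3 prefix for query results

execution_id = athena.start_query_execution(
    QueryString='SELECT * FROM live LIMIT 10',
    QueryExecutionContext={'Database': DATABASE},
    ResultConfiguration={'OutputLocation': OUTPUT_LOCATION},
)['QueryExecutionId']

# Poll until the query finishes, then print the returned rows
while True:
    state = athena.get_query_execution(
        QueryExecutionId=execution_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results['ResultSet']['Rows']:
        print([field.get('VarCharValue') for field in row['Data']])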

Create the AWS Glue crawler using AWS CloudFormation

The L4MGlueCrawler.yaml CloudFormation script creates the AWS Glue crawler, its associated IAM role, and the output Athena database.

  1. Launch the stack from the following link:

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, enter a name for your stack (for example, L4MGlueCrawler), and choose Next.
  3. On the Configure stack options page, leave everything as is and choose Next.
  4. On the Review page, select the acknowledgement check box, leave everything else as is, and choose Create stack.

Run the AWS Glue crawler

After you create the crawler, you need to run it before moving to the next step. You can run it from the console or the AWS Command Line Interface (AWS CLI). To use the console, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Select your crawler (L4MCrawler).
  3. Choose Run crawler.

When the crawler is complete, it shows the status Ready.
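
If you want to script this step instead of using the console, the following is a minimal Boto3 sketch that starts the crawler and waits for it to return to the READY state (it assumes you kept the L4MCrawler name used earlier).

import time

import boto3

glue = boto3.client('glue')

glue.start_crawler(Name='L4MCrawler')

# Poll the crawler state until it finishes (RUNNING, then STOPPING, then READY)
while True:
    state = glue.get_crawler(Name='L4MCrawler')['Crawler']['State']
    print('Crawler state:', state)
    if state == 'READY':
        break
    time.sleep(30)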

Create a QuickSight account

Before starting this next step, navigate to the QuickSight console and create an account if you don’t already have one. To make sure you have access to the corresponding services (Athena and S3 bucket), choose your account name on the top right, choose Manage QuickSight, and choose Security and Permissions, where you can add the necessary services. When setting up your Amazon S3 access, make sure to select Write permission for Athena Workgroup.

Now you’re ready to visualize your data in QuickSight.

Create the QuickSight datasets on the console

If this is your first time using Athena, you have to configure the output location of the queries. For instructions, refer to Steps 1–6 in Create a database. Then complete the following steps:

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset.
  3. Choose Athena as your source.
  4. Enter a name for your data source.
  5. Choose Create data source.

  1. For your database, specify the one you created earlier with the AWS Glue crawler.
  2. Specify the table that contains your live data (not the anomalies).
  3. Choose Edit/preview data.

You’re redirected to an interface similar to the following screenshot.

The next step is to add and combine the metricValue_AnomalyScore data with the live data.

  1. Choose Add data.
  2. Choose Add data source.
  3. Specify the database you created and the metricValue_AnomalyScore table.
  4. Choose Select.

You now need to configure the join between the two tables.

  1. Choose the link between the two tables.
  2. Leave the join type as Left, add the timestamp and each dimension you have as a join clause, and choose Apply.

In the following example, we use timestamp, platform, and marketplace as join clauses.

On the right pane, you can remove the fields you’re not interested in keeping.

  1. Remove the timestamp from the metricValue_AnomalyScore table so you don’t have a duplicate column.
  2. Change the timestamp data type (of the live data table) from string to date, and specify the correct format. In our case, it should be yyyy-MM-dd HH:mm:ss.

The following screenshot shows your view after you remove some fields and adjust the data type.

  1. Choose Save and visualize.
  2. Choose the pencil icon next to the dataset.
  3. Choose Add dataset and choose dimensioncontributions.

Create the QuickSight datasets using AWS CloudFormation

This step contains three CloudFormation stacks.

The first CloudFormation script, L4MQuickSightDataSource.yaml, creates the QuickSight Athena data source.

  1. Launch the stack from the following link:

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, enter your QuickSight user name, the QuickSight account Region (specified when creating the QuickSight account), and a stack name (for example, L4MQuickSightDataSource).
  3. Choose Next.
  4. On the Configure stack options page, leave everything as is and choose Next.
  5. On the Review page, leave everything as is and choose Create stack.

The second CloudFormation script, L4MQuickSightDataSet1.yaml, creates a QuickSight dataset that joins the dimensions table with the anomaly table.

  1. Launch the stack from the following link:

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, enter a stack name (for example, L4MQuickSightDataSet1).
  3. Choose Next.
  4. On the Configure stack options page, leave everything as is and choose Next.
  5. On the Review page, leave everything as is and choose Create stack.

The third CloudFormation script, L4MQuickSightDataSet2.yaml, creates the QuickSight dataset that joins the anomaly table with the live data table.

  1. Launch the stack from the following link:

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, enter a stack name (for example, L4MQuickSightDataSet2).
  3. Choose Next.
  4. On the Configure stack options page, leave everything as is and choose Next.
  5. On the Review page, leave everything as is and choose Create stack.

Create the QuickSight analysis for dashboard creation

This step can only be completed on the console. After you’ve created your QuickSight datasets, complete the following steps:

  1. On the QuickSight console, choose Analysis in the navigation pane.
  2. Choose New analysis.
  3. Choose the first dataset, L4MQuickSightDataSetWithLiveData.

  1. Choose Create analysis.

The QuickSight analysis is initially created with only the first dataset.

  1. To add the second dataset, choose the pencil icon next to Dataset and choose Add dataset.
  2. Choose the second dataset and choose Select.

You can then use either dataset for creating charts by choosing it on the Dataset drop-down menu.

Dataset metrics

You have successfully created a QuickSight analysis from Lookout for Metrics inference results and the live data. Two datasets are in QuickSight for you to use: L4M_Visualization_dataset_with_liveData and L4M_Visualization_dataset_with_dimensionContribution.

The L4M_Visualization_dataset_with_liveData dataset includes the following metrics:

  • timestamp – The date and time of the live data passed to Lookout for Metrics
  • views – The value of the views metric
  • revenue – The value of the revenue metric
  • platform, marketplace, revenueAnomalyMetricValue, viewsAnomalyMetricValue, revenueGroupScore, and viewsGroupScore – These metrics are part of both datasets

The L4M_Visualization_dataset_with_dimensionContribution dataset includes the following metrics:

  • timestamp – The date and time of when the anomaly was detected
  • metricName – The metrics you’re monitoring
  • dimensionName – The dimension within the metric
  • dimensionValue – The value of the dimension
  • valueContribution – The percentage indicating how much the dimensionValue contributed to the anomaly when it was detected

The following screenshot shows these five metrics on the anomaly dashboard of the Lookout for Metrics detector.

The following metrics are part of both datasets:

  • platform – The platform where the anomaly happened
  • marketplace – The marketplace where the anomaly happened
  • revenueAnomalyMetricValue and viewsAnomalyMetricValue – The corresponding values of the metric when the anomaly was detected (in this situation, the metrics are revenue or views)
  • revenueGroupScore and viewsGroupScore – The severity scores for each metric for the detected anomaly

To better understand these last metrics, you can review the CSV files created by the Lambda function in your S3 bucket where you saved anomalyResults/metricValue_AnomalyScore.
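
For example, you can load those CSV files into a pandas DataFrame for a quick look. The following is a minimal sketch; the bucket name is a placeholder, and the prefix assumes you kept the directory structure described earlier.

import boto3
import pandas as pd

s3 = boto3.client('s3')

BUCKET = 'YourBucketName'                              # assumption: your Lookout for Metrics bucket
PREFIX = 'anomalyResults/metricValue_AnomalyScore'

# Read every CSV the Lambda function saved under the prefix and combine them
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get('Contents', [])
frames = [
    pd.read_csv(s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'])
    for obj in objects
    if obj['Key'].endswith('.csv')
]
anomalies = pd.concat(frames, ignore_index=True)
print(anomalies.head())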

Next steps

The next step is to build the dashboards for the data you want to see. This post doesn’t include an explanation on creating QuickSight charts. If you’re new to QuickSight, refer to Getting started with data analysis in Amazon QuickSight for an introduction. The following screenshots show examples of basic dashboards. For more information, check out the QuickSight workshops.

Conclusion

The anomalies are presented individually on the Lookout for Metrics console, each with their own graph, making it difficult to view the set as a whole. An automated, integrated solution is needed for deeper analysis. In this post, we used a Lookout for Metrics detector to generate anomalies, and connected the data to QuickSight to create visualizations. This solution enables us to conduct deeper analysis of anomalies and view them all in a single dashboard.

As a next step, you could expand this solution by adding an extra dataset and combining anomalies from multiple detectors. You could also adapt the Lambda function, which contains the code that generates the datasets and the variable names used for the QuickSight dashboards. You can adapt this code to your particular use case by changing the datasets themselves or by choosing variable names that make more sense to you.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Benoît de Patoul is an AI/ML Specialist Solutions Architect at AWS. He helps customers by providing guidance and technical assistance to build solutions related to AI/ML when using AWS.

Paul Troiano is a Senior Solutions Architect at AWS, based in Atlanta, GA. He helps customers by providing guidance on technology strategies and solutions on AWS. He is passionate about all things AI/ML and solution automation.

Read More

AWS Localization uses Amazon Translate to scale localization

The AWS website is currently available in 16 languages (12 for the AWS Management Console and for technical documentation): Arabic, Chinese Simplified, Chinese Traditional, English, French, German, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, Turkish, and Vietnamese. Customers all over the world gain hands-on experience with the AWS platform, products, and services in their native language. This is made possible thanks to the AWS Localization team (AWSLOC).

AWSLOC manages the end-to-end localization process of digital content at AWS (webpages, consoles, technical documentation, e-books, banners, videos, and more). On average, the team manages 48,000 projects across all digital assets yearly, which amounts to over 3 billion translated words. Given the growing demand of global customers and new local cloud adoption journeys, AWS Localization needs to support content localization at scale, with the aim to make more content available and cater to new markets. To do so, AWSLOC uses a network of over 2,800 linguists globally and supports hundreds of content creators across AWS to scale localization. The team strives to continuously improve the language experience for customers by investing heavily in automation and building automated pipelines for all content types.

AWSLOC aspires to build a future where you can interact with AWS in your preferred language. To achieve this vision, they’re using AWS machine translation and Amazon Translate. The goal is to remove language barriers and make AWS content more accessible through consistent locale-specific experiences to help every AWS creator deliver what matters most to global audiences.

This post describes how AWSLOC uses Amazon Translate to scale localization and offer their services to new locales. Amazon Translate is a neural machine translation service that delivers fast, high-quality, cost-effective, and customizable language translation. Neural machine translation is a form of language translation that uses deep learning models to deliver accurate and natural sounding translation. For more information about the languages Amazon Translate supports, see Supported languages and language codes.

How AWSLOC uses Amazon Translate

The implementation of machine translation allows AWSLOC to speed up the localization process for all types of content. AWSLOC chose AWS technical documentation to jumpstart their machine translation journey with Amazon Translate because it’s one of the pillars of AWS. Around 18% of all customers chose to view technical documentation in their local language in 2021, which is a 27% increase since 2020. In 2020 alone, over 1,435 features and 31 new services were added in technical documentation, which generated an increase of translation volume of 353% in 2021.

To cater to this demand for translated documentation, AWSLOC partnered with Amazon Translate to optimize the localization processes.

Amazon Translate is used to pre-translate the strings that fall below a fuzzy matching threshold (against the translation memory) across 10 supported languages. A dedicated Amazon Translate instance was configured with Active Custom Translation (ACT), and the corresponding parallel data is updated on a monthly basis. For most language pairs, the Amazon Translate plus ACT output has shown a positive trend in quality improvement across the board. Furthermore, to raise the bar on quality, a human post-editing process is performed on assets that have higher customer visibility. AWSLOC established a governance process to monitor the migration of content across machine translation and machine translation post-editing (MTPE), including MTPE-Light and MTPE-Premium. Human editors review the machine translation output to correct translation errors, which are incorporated back into the engine via the ACT process. The engine is refreshed regularly (once every 40 days on average), with the contributions consisting mostly of bug submissions.

AWSLOC follows best practices to maintain the ACT table, which includes marking some terms with the do not translate feature provided by Amazon Translate.
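
To illustrate how parallel data (for ACT) and terminology fit into the service, the following is a minimal Boto3 sketch of an asynchronous batch translation job. The job name, S3 locations, IAM role, and resource names are placeholders for illustration only and don’t represent AWSLOC’s actual pipeline; using a custom terminology that maps terms to themselves is one common way to keep brand names and trademarks unchanged.

import boto3

translate = boto3.client('translate')

response = translate.start_text_translation_job(
    JobName='docs-localization-example',                       # placeholder job name
    InputDataConfig={
        'S3Uri': 's3://your-source-bucket/docs/',               # placeholder input location
        'ContentType': 'text/html',
    },
    OutputDataConfig={'S3Uri': 's3://your-output-bucket/translated/'},
    DataAccessRoleArn='arn:aws:iam::111111111111:role/TranslateBatchRole',
    SourceLanguageCode='en',
    TargetLanguageCodes=['fr'],
    ParallelDataNames=['your-parallel-data'],                   # enables Active Custom Translation
    TerminologyNames=['your-dnt-terminology'],                  # keeps listed terms consistent
)
print('Job ID:', response['JobId'], 'Status:', response['JobStatus'])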

The following diagram illustrates the detailed workflow.

The main components in the process are as follows:

  1. Translation memory – The database that stores sentences, paragraphs, or bullet points that have been previously translated, in order to help human translators. This database stores the source text and its corresponding translation in language pairs, called translation units.
  2. Language quality service (LQS) – The accuracy check that an asset goes through after the Language Service Provider (LSP) completes their pass. 20% of the asset is spot-checked unless otherwise specified.
  3. Parallel data – A set of examples that pairs source text with the corresponding target-language translation; Amazon Translate uses this data with Active Custom Translation to customize the machine translation output.
  4. Fuzzy matching – This technique is used in computer-assisted translation as a special case of record linkage. It works with matches that may be less than 100% perfect when finding correspondences between segments of a text and entries in a database of previous translations.
  5. Do-not-translate terms – A list of phrases and words that don’t require translation, such as brand names and trademarks.
  6. Pre-translation – The initial application of do-not-translate terms, translation memory, and machine translation or human translation engines against a source text before it’s presented to linguists.

MTPE-Light produces understandable but not stylistically perfect text. The following table summarizes the differences between MTPE-Light and MTPE-Premium.

MTPE-Light                 MTPE-Premium
Additions and omissions    Punctuation
Accuracy                   Consistency
Spelling                   Literalness
Numbers                    Style
Grammar                    Preferential terminology
                           Formatting errors

Multi-faceted impacts

Amazon Translate is a solution for localization projects at scale. With Amazon Translate, the project turnaround time isn’t tethered to translation volume. Amazon Translate can deliver more than 50,000 words within 1 hour, whereas traditional localization cycles complete 10,000-word projects in 7–8 days and 50,000-word projects in 30–35 days. Amazon Translate is also 10 times cheaper than standard translation, and it makes it easier to track and manage the localization budget. Compared to human translation projects that use MTPE-Premium, AWSLOC observed savings of up to 40%, and savings of up to 60% for MTPE-Light. Additionally, projects that use machine translation exclusively incur only a monthly flat fee, which covers the technology costs for the translation management system AWSLOC uses to process machine translation.

Lastly, thanks to Amazon Translate, AWSLOC is now able to go from monthly to weekly refresh cycles for technical documentation.

All in all, machine translation is the most cost-effective and time-saving option for any global localization team that wants to handle an increasing volume of content localization in the long term.

Conclusion

Amazon Translate delivers significant benefits to Amazon and to our customers, both in cost savings and in delivering localized content faster and in multiple languages. For more information about the capabilities of Amazon Translate, see the Amazon Translate Developer Guide. If you have any questions or feedback, feel free to contact us or leave a comment.


About the authors

Marie-Alice Daniel is a Language Quality Manager at AWS, based in Luxembourg. She leads a variety of efforts to monitor and improve the quality of localized AWS content, especially Marketing content, with a focus on customer social outreach. She also supports stakeholders to address quality concerns and to ensure localized content consistently meets the quality bar.

Ajit Manuel is a Senior Product Manager (Tech) at AWS, based in Seattle. Ajit leads the localization product management team that builds solutions centered around language analytics services, translation automation and language research and design. The solutions that Ajit’s team builds help AWS scale its global footprint while staying locally relevant. Ajit is passionate about building innovative products especially in niche markets and has pioneered solutions that augmented digital transformation within the insurance-tech and media-analytics space.

Read More

Incrementally update a dataset with a bulk import mechanism in Amazon Personalize

We are excited to announce that Amazon Personalize now supports incremental bulk dataset imports, a new option for updating your data and improving the quality of your recommendations. Keeping your datasets current is an important part of maintaining the relevance of your recommendations. Prior to this new feature launch, Amazon Personalize offered two mechanisms for ingesting data:

  • DatasetImportJob: DatasetImportJob is a bulk data ingestion mechanism designed to import large datasets into Amazon Personalize. A typical journey starts with importing your historical interactions dataset in addition to your item catalog and user dataset. DatasetImportJob can then be used to keep your datasets current by sending updated records in bulk. Prior to this launch, data ingested via previous import jobs was overwritten by any subsequent DatasetImportJob.
  • Streaming APIs: The streaming APIs (PutEvents, PutUsers, and PutItems) are designed to incrementally update each respective dataset in real-time. For example, after you have trained your model and launched your campaign, your users continue to generate interactions data. This data is then ingested via the PutEvents API, which incrementally updates your interactions dataset. Using the streaming APIs allows you to ingest data as you get it rather than accumulating the data and scheduling ingestion.

With incremental bulk imports, Amazon Personalize simplifies the ingestion of historical records by enabling you to import incremental changes to your datasets with a DatasetImportJob. You can import 100 GB of data per FULL DatasetImportJob or 1 GB of data per INCREMENTAL DatasetImportJob. Data added to the datasets using INCREMENTAL imports is appended to your existing datasets. If your incremental import duplicates any records found in your existing dataset, Amazon Personalize updates them with the newly imported version, further simplifying the data ingestion process. In the following sections, we describe the changes to the existing API to support incremental dataset imports.

CreateDatasetImportJob

A new parameter called importMode has been added to the CreateDatasetImportJob API. This parameter is an enum type with two values: FULL and INCREMENTAL. The parameter is optional and is FULL by default to preserve backward compatibility. The CreateDatasetImportJob request is as follows:

{
   "datasetArn": "string",
   "dataSource": { 
      "dataLocation": "string"
   },
   "jobName": "string",
   "roleArn": "string",
   "importMode": {INCREMENTAL, FULL}
}

The Boto3 API is create_dataset_import_job, and the AWS Command Line Interface (AWS CLI) command is create-dataset-import-job.

DescribeDatasetImportJob

The response to DescribeDatasetImportJob has been extended to include whether the import was a full or incremental import. The type of import is indicated in a new importMode field, which is an enum type with two values: FULL and INCREMENTAL. The DescribeDatasetImportJob response is as follows:

{ 
    "datasetImportJob": {
        "creationDateTime": number,
        "datasetArn": "string",
        "datasetImportJobArn": "string",
        "dataSource": {
            "dataLocation": "string"
        },
        "failureReason": "string",
        "jobName": "string",
        "lastUpdatedDateTime": number,
        "roleArn": "string",
        "status": "string",
        "importMode": {INCREMENTAL, FULL}
    }
}

The Boto3 API is describe_dataset_import_job, and the AWS CLI command is describe-dataset-import-job.

ListDatasetImportJob

The response to ListDatasetImportJob has been extended to include whether the import was a full or incremental import. The type of import is indicated in a new importMode field, which is an enum type with two values: FULL and INCREMENTAL. The ListDatasetImportJob response is as follows:

{ 
    "datasetImportJobs": [ { 
        "creationDateTime": number,
        "datasetImportJobArn": "string",
        "failureReason": "string",
        "jobName": "string",
        "lastUpdatedDateTime": number,
        "status": "string",
        "importMode": " {INCREMENTAL, FULL}
    } ],
    "nextToken": "string" 
}

The Boto3 API is list_dataset_import_jobs, and the AWS CLI command is list-dataset-import-jobs.

Code example

The following code shows how to create a dataset import job for incremental bulk import using the SDK for Python (Boto3):

import boto3

personalize = boto3.client('personalize')

response = personalize.create_dataset_import_job(
    jobName = 'YourImportJob',
    datasetArn = 'arn:aws:personalize:us-east-1:111111111111:dataset/AmazonPersonalizeExample/INTERACTIONS',
    dataSource = {'dataLocation':'s3://bucket/file.csv'},
    roleArn = 'role_arn',
    importMode = 'INCREMENTAL'
)

dsij_arn = response['datasetImportJobArn']

print ('Dataset Import Job arn: ' + dsij_arn)

description = personalize.describe_dataset_import_job(
    datasetImportJobArn = dsij_arn)['datasetImportJob']

print('Name: ' + description['jobName'])
print('ARN: ' + description['datasetImportJobArn'])
print('Status: ' + description['status'])
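
You can also check whether past imports were full or incremental by listing the import jobs for a dataset. The following is a minimal sketch using list_dataset_import_jobs; the dataset ARN is the same placeholder used above.

import boto3

personalize = boto3.client('personalize')

# List recent import jobs and show whether each one was FULL or INCREMENTAL
jobs = personalize.list_dataset_import_jobs(
    datasetArn = 'arn:aws:personalize:us-east-1:111111111111:dataset/AmazonPersonalizeExample/INTERACTIONS',
    maxResults = 25
)['datasetImportJobs']

for job in jobs:
    # Jobs created before this feature launched may not carry an importMode value
    print(job['jobName'], job['status'], job.get('importMode', 'FULL'))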

Summary

In this post, we described how you can use this new feature in Amazon Personalize to perform incremental updates to a dataset with bulk import, keeping the data fresh and improving the relevance of Amazon Personalize recommendations. If you have delayed access to your data, incremental bulk import allows you to import your data more easily by appending it to your existing datasets.

Try out this new feature by accessing Amazon Personalize now.


About the authors

Neelam Koshiya is an enterprise solution architect at AWS. Her current focus is to help enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.

James Jory is a Principal Solutions Architect in Applied AI with AWS. He has a special interest in personalization and recommender systems and a background in ecommerce, marketing technology, and customer data analytics. In his spare time, he enjoys camping and auto racing simulations.

Daniel Foley is a Senior Product Manager for Amazon Personalize. He is focused on building applications that leverage artificial intelligence to solve our customers’ largest challenges. Outside of work, Dan is an avid skier and hiker.

Alex Berlingeri is a Software Development Engineer with Amazon Personalize working on a machine learning powered recommendations service. In his free time he enjoys reading, working out and watching soccer.

Read More

Announcing the launch of the model copy feature for Amazon Rekognition Custom Labels

Amazon Rekognition Custom Labels is a fully managed computer vision service that allows developers to build custom models to classify and identify objects in images that are specific and unique to your business. Rekognition Custom Labels doesn’t require you to have any prior computer vision expertise. For example, you can find your logo in social media posts, identify your products on store shelves, classify machine parts in an assembly line, distinguish healthy and infected plants, or detect animated characters in videos.

Developing a custom model to analyze images is a significant undertaking that requires time, expertise, and resources, often taking months to complete. Additionally, it often requires thousands or tens of thousands of hand-labeled images to provide the model with enough data to accurately make decisions. Generating this data can take months to gather and requires large teams of labelers to prepare it for use in machine learning (ML).

Rekognition Custom Labels builds off of the existing capabilities of Amazon Rekognition, which are already trained on tens of millions of images across many categories. Instead of thousands of images, you simply need to upload a small set of training images (typically a few hundred images or less) that are specific to your use case using the Amazon Rekognition console. If the images are already labeled, you can begin training a model in just a few clicks. If not, you can label them directly on the Rekognition Custom Labels console, or use Amazon SageMaker Ground Truth to label them. Rekognition Custom Labels uses transfer learning to automatically inspect the training data, select the right model framework and algorithm, optimize the hyperparameters, and train the model. When you’re satisfied with the model accuracy, you can start hosting the trained model with just one click.

Today we’re happy to announce the launch of the Rekognition Custom Labels model copy feature. This feature allows you to copy your Rekognition Custom Labels models across projects, which can be in the same AWS account or across AWS accounts in the same AWS Region, without retraining the models from scratch. This new capability makes it easier for you to move Rekognition Custom Labels models through various environments such as development, quality assurance, integration, and production without needing to copy the original training and test datasets or retrain the model. You can use the AWS Command Line Interface (AWS CLI) to copy trained models across projects.

In this post, we show you how to copy models between different AWS accounts in the same AWS Region.

Benefits of the model copy feature

This new feature has the following benefits:

  • Multi-account ML-Ops best practices – You can train a model one time and ensure predictable deployment with consistent results across multiple accounts mapped to various environments such as development, quality assurance, integration, and production, allowing you to follow ML-Ops best practices within your organization.
  • Cost savings and faster deployment – You can quickly copy a trained model between accounts, avoiding the time taken to retrain in every account and saving on the model retraining cost.
  • Protect sensitive datasets – You no longer need to share the datasets between different AWS accounts or users. The training data needs to be available only on the AWS account where model training is done. This is very important for certain industries, where data isolation is essential to meet business or regulatory requirements.
  • Easy collaboration – Partners or vendors can now easily train an Amazon Rekognition Custom Labels model in their own AWS account and share the model with users across AWS accounts.
  • Consistent performance – Model performance is now consistent across different AWS accounts. Model training is generally non-deterministic, and two models trained with the same dataset aren’t guaranteed to produce the same performance scores and the same predictions. Copying the model helps ensure that the behavior of the copied model is consistent with the source model, eliminating the need to retest the model.

Solution overview

The following diagram illustrates our solution architecture.

This post assumes you have trained a Rekognition Custom Labels model in your source account. For instructions, refer to Training a custom single class object detection model with Amazon Rekognition Custom Labels. In this post, we used the image classification “Rooms” project from the Rekognition Custom Labels sample projects list and trained a room classification model in the source account to classify images of kitchens, bathrooms, living rooms, and more.

To demonstrate the functionality of the model copy feature, we go through the following steps in the source account:

  1. Start the model and run inferences on sample images.
  2. Define a resource-based policy to allow cross-account access to copy the Rekognition Custom Labels model.

Then we copy the source model to the target account.

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket, which serves as a container for the model evaluation and performance statistics.
  2. Create a project.
  3. Copy the trained model from the source account to the target account.
  4. Start the model and run inference on the sample images.
  5. Verify the inference results match the results of the source account model.

Prerequisites

In addition to having a trained model in your source account, make sure you complete the following prerequisite steps:

  1. Install the AWS CLI V2.
  2. Configure your AWS CLI with the following code and enter your Region:
    aws configure

  3. Run the following command to ensure you have AWS CLI version 2.xx installed on your local host:
    aws --version

  4. Update the AWS credentials file under $HOME/.aws/credentials with the following entry:
    [source-account]
    aws_access_key_id = ####
    aws_secret_access_key = #######
    
    [target-account]
    aws_access_key_id = ####
    aws_secret_access_key = #######

  5. Get the ProjectArn and ProjectVersionArn for the source AWS account. ProjectArn is the project associated with your source model. ProjectVersionArn is the version of the model you’re interested in copying to the target account. You can find the SourceProjectArn using the following command:
    aws rekognition describe-projects \
    --region us-east-1 \
    --profile source-account
    
    {
        "ProjectDescriptions": [{
            "ProjectArn": "arn:aws:rekognition:us-east-1::111111111111:project/rooms_1/1657588855531",
            .
            .
        }]
    }

    If you see multiple lines of output, pick the ProjectArn associated with the model you’re going to copy.

    You can find the SourceProjectVersionArn for the model you trained using the SourceProjectArn (the preceding output). Replace the SourceProjectArn in the following command:

    aws rekognition describe-project-versions \
    --project-arn SourceProjectArn \
    --region us-east-1 \
    --profile source-account

    The command returns the SourceProjectVersionArn. If you see multiple lines of output, pick the ProjectVersionArn of interest.

    {
        "ProjectVersionDescriptions": [
            {
                "ProjectVersionArn": "arn:aws:rekognition:us-east-1:111111111111:project/rooms_1/version/rooms_1.2022-07-12T09.39.36/1657643976475",
                .
                .
            }
        ]
    }

You’re now ready to run the steps to implement the solution. Replace the values of SourceProjectArn and SourceProjectVersionArn in the following commands with the values you generated.

1. Start the model and run inference on sample images

In the source account, enter the following code to start the model:

aws rekognition start-project-version \
--project-version-arn SourceProjectVersionArn \
--min-inference-units 1 \
--region us-east-1 \
--profile source-account
{
    "Status": "STARTING"
}

After the model is hosted and in the running state, you can run inference.
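
Instead of repeatedly describing the project version, you can block until the model is running by using the SDK for Python (Boto3) waiter, as in the following sketch. SourceProjectArn is the same placeholder used above, and the version name is a placeholder for the last segment of your model’s ProjectVersionArn (for example, rooms_1.2022-07-12T09.39.36).

import boto3

session = boto3.Session(profile_name='source-account', region_name='us-east-1')
rekognition = session.client('rekognition')

# Wait until the model version reaches the RUNNING state
waiter = rekognition.get_waiter('project_version_running')
waiter.wait(
    ProjectArn='SourceProjectArn',
    VersionNames=['your-model-version-name'],
    WaiterConfig={'Delay': 60, 'MaxAttempts': 60},
)
print('The model is running and ready for inference')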

We used the following images (demo1.jpeg and demo2.jpeg) to run inference. These images are located in our local file system in the same directory where the AWS CLI commands are being run from.

The following image is demo1.jpeg, which shows a backyard.

See the following inference code and output:

aws rekognition detect-custom-labels \
--project-version-arn SourceProjectVersionArn \
--image-bytes fileb://demo1.jpeg \
--region us-east-1 \
--profile source-account
{
    "Name": "backyard",
    "Confidence": 45.77000045776367
 }

The following image is demo2.jpeg, which shows a bedroom.

See the following inference code and output:

aws rekognition detect-custom-labels \
--project-version-arn SourceProjectVersionArn \
--image-bytes fileb://demo2.jpeg \
--region us-east-1 \
--profile source-account
{
    "Name": "bedroom",
    "Confidence": 61.84600067138672
 }

The inference results show the images belong to the classes backyard and bedroom, with confidence scores of 45.77 and 61.84, respectively.

2. Define the IAM resource policy for the trained model to allow cross-account access

To create your resource-based IAM policy, complete the following steps in the source account:

  1. Allow your specific AWS account to access resources using the provided IAM resource policy (for more information, refer to Creating a project policy document). Replace the values for TargetAWSAccountId and SourceProjectVersionArn in the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Principal": {
                    "AWS": [ "TargetAWSAccountId" ]
                },
                "Action": "Rekognition:CopyProjectVersion",
                "Resource": "SourceProjectVersionArn",
                "Effect": "Allow"
            }
        ]
    }

  2. Attach the policy to the project in the source account by calling the following command.
    aws rekognition put-project-policy \
    --project-arn SourceProjectArn \
    --policy-name PolicyName \
    --policy-document '{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Principal": {
                    "AWS": [ "TargetAWSAccountId" ]
                },
                "Action": "Rekognition:CopyProjectVersion",
                "Resource": "SourceProjectVersionArn",
                "Effect": "Allow"
            }
        ]
    }' \
    --region us-east-1 \
    --profile source-account

    Replace SourceProjectArn, PolicyName, TargetAWSAccountId, and SourceProjectVersionArn.

    The output shows the policy revision ID created:

    {
        "PolicyRevisionId": "f95907f9c1472c114f61b0e1f31ed131"
    }

Now we’re ready to copy the trained model from the source account to the target account.

3. Create an S3 bucket in the target account

You can use an existing S3 bucket in your account or create a new S3 bucket. For this post, we call this S3 bucket DestinationS3Bucket.

4. Create a new Rekognition Custom Labels project

Create a new project with the following code:

aws rekognition create-project \
--project-name target_rooms_1 \
--region us-east-1 \
--profile target-account 

This creates a TargetProjectArn in the target account:

{
    "ProjectArn": "arn:aws:rekognition:us-east-1:222222222222:project/target_rooms_1/1657599660206"
}

Note the value of the destination project ProjectArn field. We use this value in the following copy model command.

5. Copy the model from the source account to the target account

Supply the source and target ProjectArn, source ProjectVersionArn, and target S3 bucket and S3 key prefix in the following code:

aws rekognition copy-project-version \
--source-project-arn SourceProjectArn \
--source-project-version-arn SourceProjectVersionArn \
--destination-project-arn TargetProjectArn \
--version-name TargetVersionName \
--output-config '{"S3Bucket":"DestinationS3Bucket", "S3KeyPrefix":"DestinationS3BucketPrefix"}' \
--region us-east-1 \
--profile target-account

This creates a copied model with a TargetProjectVersionArn in the target account. In our case, the TargetVersionName is copy_rooms_1:

{
    "ProjectVersionArn": "arn:aws:rekognition:us-east-1:222222222222:project/target_rooms_1/version/copy_rooms_1/1657667877079"
}

Check the status of the model copy process:

aws rekognition describe-project-versions \
--project-arn TargetProjectArn \
--version-names TargetVersionName \
--region us-east-1 \
--profile target-account

The model copy from the source account to the target account is complete when the Status changes to COPYING_COMPLETED:

 {
    "ProjectVersionDescriptions": [
        {
            "ProjectVersionArn": "arn:aws:rekognition:us-east-1:222222222222:project/target_rooms_1/version/copy_rooms_1/1657667877079",
            "CreationTimestamp": "2022-07-12T16:17:57.079000-07:00",
            "Status": "COPYING_COMPLETED",
            "StatusMessage": "Model copy operation was successful",
            ..........
            ..........
            "EvaluationResult": {
                "F1Score": 0.0,
                "Summary": {

6. Start the model and run inference

Enter the following code to start the model in the target account:

aws rekognition start-project-version \
--project-version-arn TargetProjectVersionArn \
--min-inference-units 1 \
--region us-east-1 \
--profile target-account
{
    "Status": "STARTING"
}

Check the status of the model:

aws rekognition describe-project-versions \
--project-arn TargetProjectArn \
--version-names copy_rooms_1 \
--region us-east-1 \
--profile target-account

The model is now hosted and running:

{
    "ProjectVersionDescriptions": [
        {
            "ProjectVersionArn": "arn:aws:rekognition:us-east-1:222222222222:project/target_rooms_1/version/copy_rooms_1/1657667877079",
            "CreationTimestamp": "2022-07-12T16:17:57.079000-07:00",
            "MinInferenceUnits": 1,
            "Status": "RUNNING",
            "StatusMessage": "The model is running.",
            ..........
            ..........
        }
    ]
}

Run inference with the following code:

aws rekognition detect-custom-labels \
 --project-version-arn TargetProjectVersionArn \
 --image-bytes fileb://demo1.jpeg \
 --region us-east-1 \
 --profile target-account
{
    "Name": "backyard",
    "Confidence": 45.77000045776367
 }
aws rekognition detect-custom-labels \
 --project-version-arn TargetProjectVersionArn \
 --image-bytes fileb://demo2.jpeg \
 --region us-east-1 \
 --profile target-account
{
    "Name": "bedroom",
    "Confidence": 61.84600067138672
 }

7. Verify the inference results match

The classes and the confidence scores for the images demo1.jpeg and demo2.jpeg in the target account should match the results in the source account.
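
To automate this check, you can run the same images through both models with the SDK for Python (Boto3) and compare the top label and confidence, as in the following sketch. The project version ARNs are the same placeholders used in the CLI commands above.

import boto3


def classify(profile, project_version_arn, image_path):
    # Run the model hosted in the given account profile on a local image
    session = boto3.Session(profile_name=profile, region_name='us-east-1')
    rekognition = session.client('rekognition')
    with open(image_path, 'rb') as image:
        labels = rekognition.detect_custom_labels(
            ProjectVersionArn=project_version_arn,
            Image={'Bytes': image.read()},
        )['CustomLabels']
    return labels[0]['Name'], labels[0]['Confidence']


for image_path in ('demo1.jpeg', 'demo2.jpeg'):
    source = classify('source-account', 'SourceProjectVersionArn', image_path)
    target = classify('target-account', 'TargetProjectVersionArn', image_path)
    match = 'match' if source == target else 'MISMATCH'
    print(image_path, source, target, match)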

Conclusion

In this post, we demonstrated the Rekognition Custom Labels model copy feature. This feature enables you to train a classification or object detection model in one account and then share the model with another account in the same Region. This simplifies the multi-account strategy where the model can be trained one time and shared between accounts within the same Region without having to retrain or share the training datasets. This allows for a predictable deployment in every account as part of your MLOps workflow. For more information, refer to Copying an Amazon Rekognition Custom Labels model, or try out the walkthrough in this post using a cloud shell with the AWS CLI.

As of this writing, the model copy feature in Amazon Rekognition Custom Labels is available in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Seoul)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • EU (Frankfurt)
  • EU (Ireland)
  • EU (London)

Give the feature a try, and please send us feedback either via the AWS forum for Amazon Rekognition or through your AWS support contacts.


About the authors

Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Yogesh Chaturvedi is a Solutions Architect at AWS with a focus in computer vision. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys hiking, traveling, and watching sports.

Aakash Deep is a Senior Software Engineer with AWS. He enjoys working on computer vision, AI, and distributed systems. Outside of work, he enjoys hiking and traveling.

Pashmeen Mistry is the Senior Product Manager for Amazon Rekognition Custom Labels. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.

Read More