Amazon Translate ranked as #1 machine translation provider by Intento

Customer obsession, one of the key Amazon Leadership Principles that guides everything we do at Amazon, has helped Amazon Translate be recognized as an industry-leading neural machine translation provider. This year, Intento ranked Amazon Translate #1 on its list of top-performing machine translation providers in its report, The State of Machine Translation 2020. We are excited to be recognized for pursuing our passion: designing the best customer experience in machine translation.

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of machine translation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms. Amazon Translate’s development has been fueled by customer feedback, leading to a steady stream of rich features that help you reach more people in more places—without breaking your translation services budget.

Intento is one of the leading organizations helping global companies procure and utilize the best-fit cognitive AI services. In this independent study, Intento evaluated 15 of the most prominent MT providers used by language service providers and localization services. The data included examples from 16 industry sectors and 8 content types, covering topics such as financial documentation, patents, and sales and marketing material. These inputs were translated across 14 common language pairs to determine the best engine for a given translation scenario. Intento ranked the results of each MT engine based on how its output compared to a reference human translation.

The results showed that no MT service is best in all language pairs across all industry sectors and content types. However, Amazon Translate had the highest number of instances in which it was rated “best.”

At Amazon, we strive to bring the most value to our customers and deliver the world’s best machine translation service! If your company is looking for machine translation, please contact us. We’d love to show you what Amazon Translate can do.

More information on the features and capabilities that Intento considered in its analysis of the top MT providers is available in the full report (registration is required).

 


About the Author

Greg Rushing is a US Air Force Fellow in Amazon’s BRIDGE program. He is currently working with the Amazon Translate Product Management team, where he focuses on coordinating the activities required to bring Amazon Translate features to market. Outside of work, you can find him spending time exploring the outdoors with his family, doing auto repair, or woodworking.

Read More

At GTC, Educators and Leaders Focus on Equity in AI, Developer Diversity

Not everyone needs to be a developer, but everyone will need to be an AI decision maker.

That was the message behind a panel discussion on Advancing Equitable AI, which took place at our GPU Technology Conference last week. It was one of several GTC events advancing the conversation on diversity, equity and ethics in AI.

This year, we strengthened our support for women and underrepresented developers and scientists at GTC by providing conference passes to members of professional organizations supporting women, Black and Latino developers. Professors at historically Black colleges and universities — including Prairie View A&M University, Hampton University and Jackson State University — as well as groups like Black in AI and LatinX in AI received complimentary access to training from the NVIDIA Deep Learning Institute.

A Forbes report last year named GTC as one of the U.S.’s top conferences for women to attend to further their careers in AI. At this month’s event, women made up better than one in five registered attendees — doubling last year’s count and an almost 4x increase since 2017 — and more than 100 of the speakers.

And in a collaboration with the National Society of Black Engineers that will extend beyond GTC, we created opportunities for the society’s collegiate and professional developers to engage with NVIDIA’s recruiting team, which provided guidance on navigating the new world of virtual interviewing and networking.

“We’re excited to be embarking on a partnership with NVIDIA,” said Johnnie Tangle, national finance chairman of NSBE Professionals. “Together, we are both on the mission of increasing the visibility of Blacks in development and showing why diversity in the space enhances the community as a whole.”

Panel Discussions: Paving Pathways for Equitable AI

Two power-packed, all-female panels at GTC focused on a roadmap for responsible and equitable AI.

In a live session that drew over 250 attendees, speakers from the University of Florida, the Boys and Girls Club of Western Pennsylvania and AI4All — a nonprofit working to increase diversity and inclusion in AI — discussed the importance of AI exposure and education for children and young adults from underrepresented groups.

When a broader group of young people has access to AI education, “we naturally see a way more diverse and interesting set of problems being addressed,” said Tess Posner, CEO of AI4All, “because young people and emerging leaders in the field are going to connect the technology to a problem they’ve seen in their own lives, in their own experience or in their communities.”

The conversation also covered the role parents and schools play in fostering children's awareness of and exposure to STEM subjects, as well as the need for everyone, developers or not, to have a foundational understanding of how AI works.

“We want students to be conscious consumers, and hopefully producers,” said Christina Gardner-McCune, associate professor and director of the Engaging Learning Lab at the University of Florida, and co-chair of the AI4K12 initiative. “Everybody is going to be making decisions about what AI technologies are used in their homes, what AI technologies their children interact with.”

Later in the week, a panel titled “Aligning Around Common Values to Advance AI Policy” explored ideas to pave the way for responsible AI on a global scale.

The webinar featured representatives from the U.S. National Institute of Standards and Technology, Scotland-based innovation center The Data Lab, and C Minds, a think tank focused on AI initiatives in Latin America. Speakers shared their priorities for developing trustworthy AI, and defined what success would look like to them five years in the future.

Dinner with Strangers: Developer Diversity in AI

In a virtual edition of the popular Dinner with Strangers networking events at GTC, experts from NVIDIA and NSBE partnered to moderate two conversations with GTC attendees. NVIDIA employees shared their experiences and tips with early-career attendees, offering advice on how to build a personal brand in a virtual world, craft a resume and prepare for interviews.

For more about GTC, watch NVIDIA founder and CEO Jensen Huang’s keynote below.


Read More

Increasing the sensitivity of A/B tests by utilizing the variance estimates of experimental units

Kevin Liou is a Research Scientist within Core Data Science, a research and development team focused on improving Facebook’s processes, infrastructure, and products.

What we did

Companies routinely turn to A/B testing when evaluating the effectiveness of their product changes. Also known as a randomized field experiment, A/B testing has been used extensively over the past decade to measure the causal impact of product changes or variants of services, and has proved to be an important success factor for businesses making decisions.

With increased adoption of A/B testing, proper analysis of experimental data is crucial to decision quality. Successful A/B tests must exhibit sensitivity — they must be capable of detecting effects that product changes generate. From a hypothesis-testing perspective, experimenters aim to have high statistical power, or the likelihood that the experiment will detect a nonzero effect when such an effect exists.

In our paper, “Variance-weighted estimators to improve sensitivity in online experiments,” we focus on increasing the sensitivity of A/B tests by attempting to understand the inherent uncertainty introduced by individual experimental units. To leverage this information, we propose directly estimating the pre-experiment individual variance for each unit. For example, if our target metric is “time spent by someone on the site per day,” we may want to give more weight to those who previously exhibited lower variance for this metric through their more consistent usage of the product. We can estimate the variance of a person’s daily time spent during the month before the experiment and assign weights that are higher for people with less noisy behaviors.
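
To make the idea concrete, the following is a minimal sketch, not the estimator implementation from the paper, of weighting each unit's outcome by the inverse of its pre-experiment variance when estimating a treatment effect. The column names and the simple unpooled variance estimate are assumptions for illustration:

import numpy as np
import pandas as pd


def variance_weighted_effect(pre_period, experiment):
    """Illustrative variance-weighted difference-in-means estimator.

    pre_period: columns [user_id, day, time_spent] from the month before the experiment
    experiment: columns [user_id, treatment, time_spent] observed during the experiment
    """
    # Unpooled estimate of each user's pre-experiment variance of daily time spent
    unit_var = (
        pre_period.groupby("user_id")["time_spent"]
        .var()
        .dropna()
        .rename("pre_var")
        .reset_index()
    )

    df = experiment.merge(unit_var, on="user_id", how="inner")
    # Give less weight to noisier (high-variance) users; epsilon guards against zero variance
    df["weight"] = 1.0 / (df["pre_var"] + 1e-6)

    treated = df[df["treatment"] == 1]
    control = df[df["treatment"] == 0]
    return (
        np.average(treated["time_spent"], weights=treated["weight"])
        - np.average(control["time_spent"], weights=control["weight"])
    )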

Applying our approach of using variance-weighted estimators to a corpus of real A/B tests at Facebook, we find opportunity for substantial variance reduction with minimal impact on the bias of treatment effect estimates. Specifically, our results show an average variance reduction of 17 percent, while bias is bounded within 2 percent. In addition, we show that these estimators can achieve improved variance reduction when combined with other standard approaches, such as regression adjustment (also known as CUPED, a commonly used approach at Facebook), demonstrating that this method complements existing work. Our approach has been adopted in several experimental platforms within Facebook.

How we did it

There are several ways in which one can estimate the variance for each unit, and this is still an active area of research. We studied unpooled estimators (using the pre-experiment user-level sampling variance), building a machine learning model to predict out-of-sample variance from features, and using Empirical-Bayes estimators to pool information across those using our platform.

Statistically, we prove that the amount of variance reduction one can achieve when weighting by variance is a function of the coefficient of variation of the variance of experimental users, or roughly, how variable people are in their variability. Details of this proof can be found in our paper.
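
As a rough illustration of that quantity, under an assumed data layout, the snippet below computes the coefficient of variation of the per-user variances from pre-experiment data (how variable users are in their variability):

import pandas as pd


def cv_of_unit_variances(pre_period):
    """Coefficient of variation of per-user variances: std(var_i) / mean(var_i)."""
    unit_var = pre_period.groupby("user_id")["time_spent"].var().dropna()
    return unit_var.std() / unit_var.mean()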

We tested these approaches on Facebook data and experiments. Figure 1, below, shows how better estimates of in-experiment unit-level variance provide much larger variance reduction. Poorer models of user-level variance can actually increase variance of the estimator, so good estimation is important. To demonstrate that variance-weighted estimators are likely to be useful in practical settings, we collected 12 popular metrics used in A/B tests at Facebook (such as likes, comments, posts shared, and so on) to estimate the predictability of the variance for each metric and its coefficient of variation. The results, shown in Figure 2, indicate that the variance of most of the metrics is highly predictable (as measured using R^2). In addition, the coefficient of variation of the variances is large enough that they can be used effectively in a variance-weighted estimator.

We took a sample of 100 Facebook A/B tests that tested for an increase in time spent, with the average sample size of each test at around 500,000 users. Before analyzing the results of each test, we assembled the daily time spent for each user in the month prior to the experiment and estimated the variance for each user. To see how accurate the estimated variance of each user was, we compared how well the pre-experiment variance correlated with the post-experiment variance. The results showed an R^2 of 0.696 and a Pearson correlation of 0.754, indicating that pre-experiment variances, when calculated over an extended period of time, provide reasonable estimates of post-experiment variance.

Next, for each experiment, we ranked all users based on their estimated variance and applied stratification, as in section 4.1 of our paper. To do this, we divided users into quantiles based on pre-experiment estimated variance, and then we calculated the sample variance of the experiment based on various numbers of quantiles. Across all experiments, we found an average of 17 percent decrease in variance with less than 2 percent bias. We also found that our approach worked well with other popular variance reduction approaches, such as CUPED. Table 1, below, shows that we can achieve close to 50 percent variance reduction when both approaches are used together.
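
A minimal sketch of such variance-based stratification follows, assuming a dataframe per experiment arm that holds each user's pre-experiment variance and in-experiment metric; the quantile count is arbitrary and this is not the exact procedure from section 4.1:

import pandas as pd


def stratified_metric_variance(df, n_quantiles=10):
    """Stratify users by pre-experiment variance and combine within-stratum variances.

    df: columns [user_id, pre_var, time_spent] for one experiment arm
    """
    df = df.copy()
    df["stratum"] = pd.qcut(df["pre_var"], q=n_quantiles, labels=False, duplicates="drop")

    total = len(df)
    combined = 0.0
    for _, grp in df.groupby("stratum"):
        share = len(grp) / total  # proportion of users in this stratum
        combined += share * grp["time_spent"].var()
    return combined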

What’s next?

There are several opportunities to explore in future work. In particular, there may be significant gains in devising conditional variance models that estimate variance more accurately. Figure 1 showed in simulations how higher-quality variance estimates improve variance reduction, suggesting that very large gains are possible with more precise estimation. Moreover, we would like to understand how variance-weighted estimators may improve the variance reduction observed from other approaches (such as machine learning–based methods), as well as analytically understand the interactions when using multiple variance reduction approaches at once.


Read More

Building a medical image search platform on AWS

Improving radiologist efficiency and preventing burnout is a primary goal for healthcare providers. A nationwide study published in Mayo Clinic Proceedings in 2015 showed the radiologist burnout rate at a concerning 61% [1]. In addition, the report concludes that “burnout and satisfaction with work-life balance in US physicians worsened from 2011 to 2014. More than half of US physicians are now experiencing professional burnout.”[2] As technologists, we’re looking for ways to put new and innovative solutions in the hands of physicians to make them more efficient, reduce burnout, and improve care quality.

To reduce burnout and improve value-based care through data-driven decision-making, Artificial Intelligence (AI) can be used to unlock the information trapped in the vast amount of unstructured data (e.g., images, text, and voice) and create a clinically actionable knowledge base. AWS AI services can derive insights and relationships from free-form medical reports, automate the knowledge sharing process, and eventually improve the personalized care experience.

In this post, we use Convolutional Neural Networks (CNN) as a feature extractor to convert medical images into a one-dimensional feature vector with a size of 1024. We call this process medical image embedding. Then we index the image feature vector using the K-nearest neighbors (KNN) algorithm in Amazon Elasticsearch Service (Amazon ES) to build a similarity-based image retrieval system. Additionally, we use the AWS managed natural language processing (NLP) service Amazon Comprehend Medical to perform named entity recognition (NER) against free text clinical reports. The detected named entities are also linked to medical ontology, ICD-10-CM, to enable simple aggregation and distribution analysis. The presented solution also includes a front-end React web application and backend GraphQL API managed by AWS Amplify and AWS AppSync, and authentication is handled by Amazon Cognito.

After deploying this working solution, the end-users (healthcare providers) can search through a repository of unstructured free text and medical images, conduct analytical operations, and use it in medical training and clinical decision support. This eliminates the need to manually analyze all the images and reports and get to the most relevant ones. Using a system like this improves the provider’s efficiency. The following graphic shows an example end result of the deployed application.

 

Dataset and architecture

We use the MIMIC CXR dataset to demonstrate how this working solution can benefit healthcare providers, in particular, radiologists. MIMIC CXR is a publicly available database of chest X-ray images in DICOM format and the associated radiology reports as free text files[3]. The methods for data collection and the data structures in this dataset have been well documented and are very detailed [3]. Also, this is a restricted-access resource. To access the files, you must be a registered user and sign the data use agreement. The following sections provide more details on the components of the architecture.

The following diagram illustrates the solution architecture.

The architecture consists of offline data transformation and online query components. In the offline data transformation step, the unstructured data, including free text and image files, is converted into structured data.

Electronic Health Record (EHR) radiology reports as free text are processed using Amazon Comprehend Medical, an NLP service that uses machine learning to extract relevant medical information from unstructured text, such as medical conditions, including clinical signs, diagnoses, and symptoms. The named entities are identified and mapped to structured vocabularies, such as the ICD-10 Clinical Modifications (CM) ontology. The unstructured text plus structured named entities are stored in Amazon ES to enable free text search and term aggregations.
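
As a hedged illustration of this step, the following is a minimal Python sketch that calls Amazon Comprehend Medical to detect entities and infer ICD-10-CM codes for a report; the sample text and the way results are printed are assumptions for illustration, not the exact code used in the solution:

import boto3

# Amazon Comprehend Medical client (region and credentials come from your AWS configuration)
comprehend_medical = boto3.client("comprehendmedical")

# Hypothetical snippet of a free-text radiology report
report_text = "Impression: mild cardiomegaly, no focal consolidation or pleural effusion."

# Named entity recognition over the free-text report
entities = comprehend_medical.detect_entities_v2(Text=report_text)["Entities"]
print([e["Text"] for e in entities])

# Link detected conditions to the ICD-10-CM ontology
for entity in comprehend_medical.infer_icd10_cm(Text=report_text)["Entities"]:
    top_codes = [(c["Code"], c["Description"]) for c in entity.get("ICD10CMConcepts", [])][:3]
    print(entity["Text"], top_codes)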

The medical images from Picture Archiving and Communication System (PACS) are converted into vector representations using a pretrained deep learning model deployed in an Amazon Elastic Container Service (Amazon ECS) AWS Fargate cluster. Similar visual search on AWS has been published previously for online retail product image search. It used an Amazon SageMaker built-in KNN algorithm for similarity search, which supports different index types and distance metrics.

We took advantage of the KNN for Amazon ES to find the k closest images from a feature space as demonstrated on the GitHub repo. KNN search is supported in Amazon ES version 7.4+. The container running on the ECS Fargate cluster reads medical images in DICOM format, carries out image embedding using a pretrained model, and saves a PNG thumbnail in an Amazon Simple Storage Service (Amazon S3) bucket, which serves as the storage for AWS Amplify React web application. It also parses out the DICOM image metadata and saves them in Amazon DynamoDB. The image vectors are saved in an Elasticsearch cluster and are used for the KNN visual search, which is implemented in an AWS Lambda function.
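
The following is a minimal sketch of what the k-NN index mapping and query could look like with the k-NN plugin available in Amazon ES 7.4+ and the elasticsearch Python client; the endpoint, index, and field names are placeholders, and production code would also need AWS request signing. The actual implementation lives in the Lambda function in the GitHub repo.

from elasticsearch import Elasticsearch

es = Elasticsearch("https://<your-es-domain-endpoint>")  # placeholder endpoint

# Index mapping: a 1024-dimensional knn_vector field for the image embeddings
es.indices.create(
    index="medical-image-search",
    body={
        "settings": {"index.knn": True},
        "mappings": {"properties": {"image_vector": {"type": "knn_vector", "dimension": 1024}}},
    },
)

# Index one embedding produced by the featurization container (dummy vector shown here)
es.index(index="medical-image-search", body={"img_key": "study1.png", "image_vector": [0.1] * 1024})

# Retrieve the 5 most similar images for a query embedding
query_vector = [0.1] * 1024
results = es.search(
    index="medical-image-search",
    body={"size": 5, "query": {"knn": {"image_vector": {"vector": query_vector, "k": 5}}}},
)
print([hit["_source"]["img_key"] for hit in results["hits"]["hits"]])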

The unstructured data from EHR and PACS needs to be transferred to Amazon S3 to trigger the serverless data processing pipeline through the Lambda functions. You can achieve this data transfer by using AWS Storage Gateway or AWS DataSync, which is outside the scope of this post. The online query API, including the GraphQL schemas and resolvers, was developed in AWS AppSync. The front-end web application was developed using the Amplify React framework, which can be deployed using the Amplify CLI. The detailed AWS CloudFormation templates and sample code are available in the GitHub repo.

Solution overview

To deploy the solution, you complete the following steps:

  1. Deploy the Amplify React web application for online search.
  2. Deploy the image-embedding container to AWS Fargate.
  3. Deploy the data-processing pipeline and AWS AppSync API.

Deploying the Amplify React web application

The first step creates the Amplify React web application, as shown in the following diagram.

  1. Install and configure the AWS Command Line Interface (AWS CLI).
  2. Install the AWS Amplify CLI.
  3. Clone the code base with stepwise instructions.
  4. Go to your code base folder and initialize the Amplify app using the command amplify init. You must answer a series of questions, like the name of the Amplify app.

After this step, you have the following changes in your local and cloud environments:

  • A new folder named amplify is created in your local environment
  • A file named aws-exports.js is created in the local src folder
  • A new Amplify app is created on the AWS Cloud with the name provided during deployment (for example, medical-image-search)
  • A CloudFormation stack is created on the AWS Cloud with the prefix amplify-<AppName>

You create authentication and storage services for your Amplify app afterwards using the following commands:

amplify add auth
amplify add storage
amplify push

When the CloudFormation nested stacks for authentication and storage are successfully deployed, you can see that a new Amazon Cognito user pool (the authentication backend) and an S3 bucket (the storage backend) have been created. Save the Amazon Cognito user pool ID and S3 bucket name from the Outputs tab of the corresponding CloudFormation nested stack (you use these later).

The following screenshot shows the location of the user pool ID on the Outputs tab.

The following screenshot shows the location of the bucket name on the Outputs tab.

Deploying the image-embedding container to AWS Fargate

We use the Amazon SageMaker Inference Toolkit to serve the PyTorch inference model, which converts a medical image in DICOM format into a feature vector with a size of 1024. To create a container with all the dependencies, you can either use pre-built deep learning container images or derive a Dockerfile from the Amazon SageMaker PyTorch inference CPU container, like the one from the GitHub repo, in the container folder. You can build the Docker container and push it to Amazon ECR manually or by running the shell script build_and_push.sh. You use the repository image URI for the Docker container later to deploy the AWS Fargate cluster.

The following screenshot shows the sagemaker-pytorch-inference repository on the Amazon ECR console.

We use Multi Model Server (MMS) to serve the inference endpoint. You need to install MMS with pip locally, use the Model Archiver CLI to package model artifacts into a single model archive .mar file, and upload it to an S3 bucket to be served by a containerized inference endpoint. The model inference handler is defined in dicom_featurization_service.py in the MMS folder. If you have a domain-specific pretrained PyTorch model, place the model.pth file in the MMS folder; otherwise, the handler uses a pretrained DenseNet121 [4] for image processing. See the following code:

model_file_path = os.path.join(model_dir, "model.pth")
if os.path.isfile(model_file_path):
    model = torch.load(model_file_path) 
else:
    model = models.densenet121(pretrained=True)
    model = model._modules.get('features')
    model.add_module("end_relu", nn.ReLU())
    model.add_module("end_globpool", nn.AdaptiveAvgPool2d((1, 1)))
    model.add_module("end_flatten", nn.Flatten())
model = model.to(self.device)
model.eval()

The intermediate output of this CNN-based model represents each image as a feature vector: the output of the convolutional layers preceding the final classification layer is pooled and flattened into a vector representation. Run the following command in the MMS folder to package up the model archive file:

model-archiver -f --model-name dicom_featurization_service --model-path ./ --handler dicom_featurization_service:handle --export-path ./

The preceding code generates a package file named dicom_featurization_service.mar. Create a new S3 bucket and upload the package file to that bucket with public read Access Control List (ACL). See the following code:

aws s3 cp ./dicom_featurization_service.mar s3://<S3bucketname>/ --acl public-read --profile <profilename>

You’re now ready to deploy the image-embedding inference model to the AWS Fargate cluster using the CloudFormation template ecsfargate.yaml in the CloudFormationTemplates folder. You can deploy using the AWS CLI: go to the CloudFormationTemplates folder and copy the following command:

aws cloudformation deploy --capabilities CAPABILITY_IAM --template-file ./ecsfargate.yaml --stack-name <stackname> --parameter-overrides ImageUrl=<imageURI> InferenceModelS3Location=https://<S3bucketname>.s3.amazonaws.com/dicom_featurization_service.mar --profile <profilename>

You need to replace the following placeholders:

  • stackname – A unique name to refer to this CloudFormation stack
  • imageURI – The image URI for the MMS Docker container uploaded in Amazon ECR
  • S3bucketname – The MMS package in the S3 bucket, such as https://<S3bucketname>.s3.amazonaws.com/dicom_featurization_service.mar
  • profilename – Your AWS CLI profile name (default if not named)

Alternatively, you can choose Launch stack for the following Regions:

  • us-east-1

  • us-west-2

After the CloudFormation stack creation is complete, go to the stack Outputs tab on the AWS CloudFormation console and copy the InferenceAPIUrl for later deployment. See the following screenshot.

You can delete this stack after the offline image embedding jobs are finished to save costs, because it’s not used for online queries.

Deploying the data-processing pipeline and AWS AppSync API

You deploy the image and free text data-processing pipeline and AWS AppSync API backend through another CloudFormation template named AppSyncBackend.yaml in the CloudFormationTemplates folder, which creates the AWS resources for this solution. See the following solution architecture.

To deploy this stack using the AWS CLI, go to the CloudFormationTemplates folder and copy the following command:

aws cloudformation deploy --capabilities CAPABILITY_NAMED_IAM --template-file ./AppSyncBackend.yaml --stack-name <stackname> --parameter-overrides AuthorizationUserPool=<CFN_output_auth> PNGBucketName=<CFN_output_storage> InferenceEndpointURL=<inferenceAPIUrl> --profile <profilename>

Replace the following placeholders:

  • stackname – A unique name to refer to this CloudFormation stack
  • AuthorizationUserPool – Amazon Cognito user pool
  • PNGBucketName – Amazon S3 bucket name
  • InferenceEndpointURL – The inference API endpoint
  • Profilename – The AWS CLI profile name (use default if not named)

Alternatively, you can choose Launch stack for the following Regions:

  • us-east-1

  • us-west-2

You can download the Lambda function for medical image processing, CMprocessLambdaFunction.py, and its dependency layer separately if you deploy this stack in AWS Regions other than us-east-1 and us-west-2. Because their file size exceeds the CloudFormation template limit, you need to upload them to your own S3 bucket (either create a new S3 bucket or use the existing one, like the aforementioned S3 bucket for hosting the MMS model package file) and override the LambdaBucket mapping parameter using your own bucket name.

Save the AWS AppSync API URL and AWS Region from the settings on the AWS AppSync console.

Edit the src/aws-exports.js file in your local environment and replace the placeholders with those values:

const awsmobile = {
  "aws_appsync_graphqlEndpoint": "<AppSync API URL>", 
  "aws_appsync_region": "<AWS AppSync Region>",
  "aws_appsync_authenticationType": "AMAZON_COGNITO_USER_POOLS"
};

After this stack is successfully deployed, you’re ready to use this solution. If you have in-house EHR and PACS databases, you can set up the AWS Storage Gateway to transfer data to the S3 bucket to trigger the transformation jobs.

Alternatively, you can use the public dataset MIMIC CXR: download the MIMIC CXR dataset from PhysioNet (to access the files, you must be a credentialed user and sign the data use agreement for the project) and upload the DICOM files to the S3 bucket mimic-cxr-dicom- and the free text radiology report to the S3 bucket mimic-cxr-report-. If everything works as expected, you should see the new records created in the DynamoDB table medical-image-metadata and the Amazon ES domain medical-image-search.

You can test the Amplify React web application locally by running the following command:

npm install && npm start

Or you can publish the React web app by deploying it in Amazon S3 with AWS CloudFront distribution, by first entering the following code:

amplify hosting add

Then, enter the following code:

amplify publish

You can see the hosting endpoint for the Amplify React web application after deployment.

Conclusion

We have demonstrated how to deploy, index, and search medical images on AWS, separating the offline data ingestion and online search query functions. You can use AWS AI services to transform unstructured data, such as medical images and radiology reports, into structured data.

By default, the solution uses a general-purpose model trained on ImageNet to extract features from images. However, this default model may not be accurate enough to extract medical image features, because there are fundamental differences in appearance, size, and features between raw medical images and natural photographs. Such differences make it hard to train commonly adopted triplet-based learning networks [5], where semantically relevant images or objects can be easily defined or ranked.

To improve search relevancy, we performed an experiment using the same MIMIC CXR dataset and the derived diagnosis labels to train a weakly supervised disease classification network similar to Wang et al. [6]. We found that this domain-specific pretrained model yielded qualitatively better visual search results. So it's recommended to bring your own model (BYOM) to this search platform for a real-world implementation.

The methods presented here enable you to perform indexing, searching, and aggregation against unstructured images in addition to free text. They set the stage for future work that can combine these features into a multimodal medical image search engine. Information retrieval from unstructured corpora of clinical notes and images is a time-consuming and tedious task. Our solution allows radiologists to become more efficient and helps them reduce potential burnout.

To find the latest development to this solution, check out medical image search on GitHub.

Reference:

  1. https://www.radiologybusiness.com/topics/leadership/radiologist-burnout-are-we-done-yet
  2. https://www.mayoclinicproceedings.org/article/S0025-6196(15)00716-8/abstract#secsectitle0010
  3. Johnson, Alistair EW, et al. “MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports.” Scientific Data 6, 2019.
  4. Huang, Gao, et al. “Densely connected convolutional networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  5. Wang, Jiang, et al. “Learning fine-grained image similarity with deep ranking.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  6. Wang, Xiaosong, et al. “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

About the Authors

 Gang Fu is a Healthcare Solution Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over ten years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.

 

 

 

Ujjwal Ratan is a Principal Machine Learning Specialist Solution Architect in the Global Healthcare and Lifesciences team at Amazon Web Services. He works on the application of machine learning and deep learning to real world industry problems like medical imaging, unstructured clinical text, genomics, precision medicine, clinical trials and quality of care improvement. He has expertise in scaling machine learning/deep learning algorithms on the AWS cloud for accelerated training and inference. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.

 

 

Erhan Bas is a Senior Applied Scientist in the AWS Rekognition team, currently developing deep learning algorithms for computer vision applications. His expertise is in machine learning and large scale image analysis techniques, especially in biomedical, life sciences and industrial inspection technologies. He enjoys playing video games, drinking coffee, and traveling with his family.

 

Read More

Streamlining data labeling for YOLO object detection in Amazon SageMaker Ground Truth

Object detection is a common task in computer vision (CV), and the YOLOv3 model is state-of-the-art in terms of accuracy and speed. In transfer learning, you obtain a model trained on a large but generic dataset and retrain the model on your custom dataset. One of the most time-consuming parts in transfer learning is collecting and labeling image data to generate a custom training dataset. This post explores how to do this in Amazon SageMaker Ground Truth.

Ground Truth offers a comprehensive platform for annotating the most common data labeling jobs in CV: image classification, object detection, semantic segmentation, and instance segmentation. You can perform labeling using Amazon Mechanical Turk or create your own private team to label collaboratively. You can also use one of the third-party data labeling service providers listed on the AWS Marketplace. Ground Truth offers an intuitive interface that is easy to work with. You can communicate with labelers about specific needs for your particular task using examples and notes through the interface.

Labeling data is already hard work. Creating training data for a CV modeling task requires data collection and storage, setting up labeling jobs, and post-processing the labeled data. Moreover, not all object detection models expect the data in the same format. For example, the Faster RCNN model expects the data in the popular Pascal VOC format, which the YOLO models can’t work with. These associated steps are part of any machine learning pipeline for CV. You sometimes need to run the pipeline multiple times to improve the model incrementally. This post shows how to perform these steps efficiently by using Python scripts and get to model training as quickly as possible. This post uses the YOLO format for its use case, but the steps are mostly independent of the data format.

The image labeling step of a training data generation task is inherently manual. This post shows how to create a reusable framework to create training data for model building efficiently. Specifically, you can do the following:

  • Create the required directory structure in Amazon S3 before starting a Ground Truth job
  • Create a private team of annotators and start a Ground Truth job
  • Collect the annotations when labeling is complete and save it in a pandas dataframe
  • Post-process the dataset for model training

You can download the code presented in this post from this GitHub repo. This post demonstrates how to run the code from the AWS CLI on a local machine that can access an AWS account. For more information about setting up AWS CLI, see What Is the AWS Command Line Interface? Make sure that you configure it to access the S3 buckets in this post. Alternatively, you can run it in AWS Cloud9 or by spinning up an Amazon EC2 instance. You can also run the code blocks in an Amazon SageMaker notebook.

If you’re using an Amazon SageMaker notebook, you can still access the Linux shell of the underlying EC2 instance and follow along by opening a new terminal from the Jupyter main page and running the scripts from the /home/ec2-user/SageMaker folder.

Setting up your S3 bucket

The first thing you need to do is to upload the training images to an S3 bucket. Name the bucket ground-truth-data-labeling. You want each labeling task to have its own self-contained folder under this bucket. If you start labeling a small set of images that you keep in the first folder, but find that the model performed poorly after the first round because the data was insufficient, you can upload more images to a different folder under the same bucket and start another labeling task.

For the first labeling task, create the folder bounding_box and the following three subfolders under it:

  • images – You upload all the images in the Ground Truth labeling job to this subfolder.
  • ground_truth_annots – This subfolder starts empty; the Ground Truth job populates it automatically, and you retrieve the final annotations from here.
  • yolo_annot_files – This subfolder also starts empty, but eventually holds the annotation files ready for model training. The script populates it automatically.

If your images are in .jpg format and available in the current working directory, you can upload them with the following code:

aws s3 sync . s3://ground-truth-data-labeling/bounding_box/images/ --exclude "*" --include "*.jpg" 

For this use case, you use five images. There are two types of objects in the images—pencil and pen. You need to draw bounding boxes around each object in the images. The following images are examples of what you need to label. All images are available in the GitHub repo.

Creating the manifest file

A Ground Truth job requires a manifest file in JSON format that contains the Amazon S3 paths of all the images to label. You need to create this file before you can start the first Ground Truth job. The format of this file is simple:

{"source-ref": < S3 path to image1 >}
{"source-ref": < S3 path to image2 >}
...

However, creating the manifest file by hand would be tedious for a large number of images. Therefore, you can automate the process by running a script. You first need to create a file holding the parameters required for the scripts. Create a file input.json in your local file system with the following content:

{
    "s3_bucket":"ground-truth-data-labeling",
    "job_id":"bounding_box",
    "ground_truth_job_name":"yolo-bbox",
    "yolo_output_dir":"yolo_annot_files"
}

Save the following code block in a file called prep_gt_job.py:

import boto3
import json


def create_manifest(job_path):
    """
    Creates the manifest file for the Ground Truth job

    Input:
    job_path: Full path of the folder in S3 for GT job

    Returns:
    manifest_file: The manifest file required for GT job
    """

    s3_rec = boto3.resource("s3")
    s3_bucket = job_path.split("/")[0]
    prefix = job_path.replace(s3_bucket, "")[1:]
    image_folder = f"{prefix}/images"
    print(f"using images from ... {image_folder} n")

    bucket = s3_rec.Bucket(s3_bucket)
    objs = list(bucket.objects.filter(Prefix=image_folder))
    img_files = objs[1:]  # first item is the folder name
    n_imgs = len(img_files)
    print(f"there are {n_imgs} images n")

    TOKEN = "source-ref"
    manifest_file = "/tmp/manifest.json"
    with open(manifest_file, "w") as fout:
        for img_file in img_files:
            fname = f"s3://{s3_bucket}/{img_file.key}"
            fout.write(f'{{"{TOKEN}": "{fname}"}}n')

    return manifest_file


def upload_manifest(job_path, manifest_file):
    """
    Uploads the manifest file into S3

    Input:
    job_path: Full path of the folder in S3 for GT job
    manifest_file: Path to the local copy of the manifest file
    """

    s3_rec = boto3.resource("s3")
    s3_bucket = job_path.split("/")[0]
    source = manifest_file.split("/")[-1]
    prefix = job_path.replace(s3_bucket, "")[1:]
    destination = f"{prefix}/{source}"

    print(f"uploading manifest file to {destination} n")
    s3_rec.meta.client.upload_file(manifest_file, s3_bucket, destination)


def main():
    """
    Performs the following tasks:
    1. Reads input from 'input.json'
    2. Collects image names from S3 and creates the manifest file for GT
    3. Uploads the manifest file to S3
    """

    with open("input.json") as fjson:
        input_dict = json.load(fjson)

    s3_bucket = input_dict["s3_bucket"]
    job_id = input_dict["job_id"]

    gt_job_path = f"{s3_bucket}/{job_id}"
    man_file = create_manifest(gt_job_path)
    upload_manifest(gt_job_path, man_file)


if __name__ == "__main__":
    main()

Run the following script:

python prep_gt_job.py

This script reads the S3 bucket and job names from the input file, creates a list of images available in the images folder, creates the manifest.json file, and uploads the manifest file to the S3 bucket at s3://ground-truth-data-labeling/bounding_box/.

This method illustrates a programmatic control of the process, but you can also create the file from the Ground Truth API. For instructions, see Create a Manifest File.

At this point, the folder structure in the S3 bucket should look like the following:

ground-truth-data-labeling 
|-- bounding_box
    |-- ground_truth_annots
    |-- images
    |-- yolo_annot_files
    |-- manifest.json

Creating the Ground Truth job

You’re now ready to create your Ground Truth job. You need to specify the job details and task type, and create your team of labelers and labeling task details. Then you can sign in to begin the labeling job.

Specifying the job details

To specify the job details, complete the following steps:

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.

  1. On the Labeling jobs page, choose Create labeling job.

  1. In the Job overview section, for Job name, enter yolo-bbox. It should be the name you defined in the input.json file earlier.
  2. Pick Manual Data Setup under Input Data Setup.
  3. For Input dataset location, enter s3://ground-truth-data-labeling/bounding_box/manifest.json.
  4. For Output dataset location, enter s3://ground-truth-data-labeling/bounding_box/ground_truth_annots.

  1. In the Create an IAM role section, first select Create a new role from the drop down menu and then select Specific S3 buckets.
  2. Enter ground-truth-data-labeling.

  1. Choose Create.

Specifying the task type

To specify the task type, complete the following steps:

  1. In the Task selection section, from the Task Category drop-down menu, choose Image.
  2. Select Bounding box.

  1. Don’t change Enable enhanced image access, which is selected by default. It enables Cross-Origin Resource Sharing (CORS) that may be required for some workers to complete the annotation task.
  2. Choose Next.

Creating a team of labelers

To create your team of labelers, complete the following steps:

  1. In the Workers section, select Private.
  2. Follow the instructions to create a new team.

Each member of the team receives a notification email titled, “You’re invited to work on a labeling project” that has initial sign-in credentials. For this use case, create a team with just yourself as a member.

Specifying labeling task details

In the Bounding box labeling tool section, you should see the images you uploaded to Amazon S3. You should check that the paths are correct in the previous steps. To specify your task details, complete the following steps:

  1. In the text box, enter a brief description of the task.

This is critical if the data labeling team has more than one member and you want to make sure everyone follows the same rules when drawing the boxes. Any inconsistency in bounding box creation may end up confusing your object detection model. For example, if you’re labeling beverage cans and want to create a tight bounding box only around the visible logo, instead of the entire can, you should specify that to get consistent labeling from all the workers. For this use case, you can enter Please enter a tight bounding box around the entire object.

  1. Optionally, you can upload examples of a good and a bad bounding box.

You can make sure your team is consistent in their labels by providing good and bad examples.

  1. Under Labels, enter the names of the labels you’re using to identify each bounding box; in this case, pencil and pen.

A color is assigned to each label automatically, which helps to visualize the boxes created for overlapping objects.

  1. To run a final sanity check, choose Preview.

  1. Choose Create job.

Job creation can take up to a few minutes. When it’s complete, you should see a job titled yolo-bbox on the Ground Truth Labeling jobs page with In progress as the status.

  1. To view the job details, select the job.

This is a good time to verify the paths are correct; the scripts don’t run if there’s any inconsistency in names.

For more information about providing labeling instructions, see Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs.

Sign in and start labeling

After you receive the initial credentials to register as a labeler for this job, follow the link to reset the password and start labeling.

If you need to interrupt your labeling session, you can resume labeling by choosing Labeling workforces under Ground Truth on the SageMaker console.

You can find the link to the labeling portal on the Private tab. The page also lists the teams and individuals involved in this private labeling task.

After you sign in, start labeling by choosing Start working.

Because you only have five images in the dataset to label, you can finish the entire task in a single session. For larger datasets, you can pause the task by choosing Stop working and return to the task later to finish it.

Checking job status

After the labeling is complete, the status of the labeling job changes to Complete and a new JSON file called output.manifest containing the annotations appears at s3://ground-truth-data-labeling/bounding_box/ground_truth_annots/yolo-bbox/manifests/output/output.manifest.

Parsing Ground Truth annotations

You can now parse through the annotations and perform the necessary post-processing steps to make it ready for model training. Start by running the following code block:

from io import StringIO
import json
import s3fs
import boto3
import pandas as pd


def parse_gt_output(manifest_path, job_name):
    """
    Captures the json Ground Truth bounding box annotations into a pandas dataframe

    Input:
    manifest_path: S3 path to the annotation file
    job_name: name of the Ground Truth job

    Returns:
    df_bbox: pandas dataframe with bounding box coordinates
             for each item in every image
    """

    filesys = s3fs.S3FileSystem()
    with filesys.open(manifest_path) as fin:
        annot_list = []
        for line in fin.readlines():
            record = json.loads(line)
            if job_name in record.keys():  # is it necessary?
                image_file_path = record["source-ref"]
                image_file_name = image_file_path.split("/")[-1]
                class_maps = record[f"{job_name}-metadata"]["class-map"]

                imsize_list = record[job_name]["image_size"]
                assert len(imsize_list) == 1
                image_width = imsize_list[0]["width"]
                image_height = imsize_list[0]["height"]

                for annot in record[job_name]["annotations"]:
                    left = annot["left"]
                    top = annot["top"]
                    height = annot["height"]
                    width = annot["width"]
                    class_name = class_maps[f'{annot["class_id"]}']

                    annot_list.append(
                        [
                            image_file_name,
                            class_name,
                            left,
                            top,
                            height,
                            width,
                            image_width,
                            image_height,
                        ]
                    )

        df_bbox = pd.DataFrame(
            annot_list,
            columns=[
                "img_file",
                "category",
                "box_left",
                "box_top",
                "box_height",
                "box_width",
                "img_width",
                "img_height",
            ],
        )

    return df_bbox


def save_df_to_s3(df_local, s3_bucket, destination):
    """
    Saves a pandas dataframe to S3

    Input:
    df_local: Dataframe to save
    s3_bucket: Bucket name
    destination: Prefix
    """

    csv_buffer = StringIO()
    s3_resource = boto3.resource("s3")

    df_local.to_csv(csv_buffer, index=False)
    s3_resource.Object(s3_bucket, destination).put(Body=csv_buffer.getvalue())


def main():
    """
    Performs the following tasks:
    1. Reads input from 'input.json'
    2. Parses the Ground Truth annotations and creates a dataframe
    3. Saves the dataframe to S3
    """

    with open("input.json") as fjson:
        input_dict = json.load(fjson)

    s3_bucket = input_dict["s3_bucket"]
    job_id = input_dict["job_id"]
    gt_job_name = input_dict["ground_truth_job_name"]

    mani_path = f"s3://{s3_bucket}/{job_id}/ground_truth_annots/{gt_job_name}/manifests/output/output.manifest"

    df_annot = parse_gt_output(mani_path, gt_job_name)
    dest = f"{job_id}/ground_truth_annots/{gt_job_name}/annot.csv"
    save_df_to_s3(df_annot, s3_bucket, dest)


if __name__ == "__main__":
    main()

From the AWS CLI, save the preceding code block in the file parse_annot.py and run:

python parse_annot.py

Ground Truth returns the bounding box information using the following four numbers: x and y coordinates, and its height and width. The procedure parse_gt_output scans through the output.manifest file and stores the information for every bounding box for each image in a pandas dataframe. The procedure save_df_to_s3 saves it in a tabular format as annot.csv to the S3 bucket for further processing.

The creation of the dataframe is useful for a few reasons. JSON files are hard to read and the output.manifest file contains more information, like label metadata, than you need for the next step. The dataframe contains only the relevant information and you can visualize it easily to make sure everything looks fine.

To grab the annot.csv file from Amazon S3 and save a local copy, run the following:

aws s3 cp s3://ground-truth-data-labeling/bounding_box/ground_truth_annots/yolo-bbox/annot.csv .

You can read it back into a pandas dataframe and inspect the first few lines. See the following code:

import pandas as pd
df_ann = pd.read_csv('annot.csv')
df_ann.head()

The following screenshot shows the results.

You also capture the size of the image through img_width and img_height. This is necessary because the object detection models need to know the location of each bounding box within the image. In this case, you can see that images in the dataset were captured with a 4608×3456 pixel resolution.

There are quite a few reasons why it is a good idea to save the annotation information into a dataframe:

  • In a subsequent step, you need to rescale the bounding box coordinates into a YOLO-readable format. You can do this operation easily in a dataframe.
  • If you decide to capture and label more images in the future to augment the existing dataset, all you need to do is join the newly created dataframe with the existing one. Again, you can perform this easily using a dataframe.
  • As of this writing, Ground Truth doesn’t allow more than 30 different categories to be labeled in the same job through the console. If you have more categories in your dataset, you have to label them under multiple Ground Truth jobs and combine them. Ground Truth associates each bounding box with an integer index in the output.manifest file, so the integer labels differ across multiple Ground Truth jobs if you have more than 30 categories. Having the annotations as dataframes makes the task of combining them easier and takes care of conflicting integer labels across multiple jobs, as sketched in the example after this list. In the preceding screenshot, you can see that you used the actual names under the category column instead of the integer index.
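
As a sketch of the combining step mentioned in the last bullet, assuming both jobs were parsed into annot.csv-style dataframes with the same columns (the file names here are hypothetical):

import pandas as pd

# Hypothetical annotation files parsed from two separate Ground Truth jobs
df_job1 = pd.read_csv("annot_job1.csv")
df_job2 = pd.read_csv("annot_job2.csv")

# Because the dataframes store category names rather than per-job integer indices,
# a simple concatenation merges the jobs without label collisions
df_all = pd.concat([df_job1, df_job2], ignore_index=True)
print(df_all["category"].value_counts())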

Generating YOLO annotations

You’re now ready to reformat the bounding box coordinates Ground Truth provided into a format the YOLO model accepts.

In the YOLO format, each bounding box is described by the center coordinates of the box and its width and height. Each number is scaled by the dimensions of the image; therefore, they all range between 0 and 1. Instead of category names, YOLO models expect the corresponding integer categories.

Therefore, you need to map each name in the category column of the dataframe into a unique integer. Moreover, the official Darknet implementation of YOLOv3 needs to have the name of the image match the annotation text file name. For example, if the image file is pic01.jpg, the corresponding annotation file should be named pic01.txt.

The following code block performs all these tasks:

import os
import json
from io import StringIO
import boto3
import s3fs
import pandas as pd


def annot_yolo(annot_file, cats):
    """
    Prepares the annotation in YOLO format

    Input:
    annot_file: csv file containing Ground Truth annotations
    cats: List of object categories in proper order for model training

    Returns:
    df_ann: pandas dataframe with the following columns
            img_file int_category box_center_w box_center_h box_width box_height


    Note:
    YOLO data format: <object-class> <x_center> <y_center> <width> <height>
    """

    df_ann = pd.read_csv(annot_file)

    df_ann["int_category"] = df_ann["category"].apply(lambda x: cats.index(x))
    df_ann["box_center_w"] = df_ann["box_left"] + df_ann["box_width"] / 2
    df_ann["box_center_h"] = df_ann["box_top"] + df_ann["box_height"] / 2

    # scale box dimensions by image dimensions
    df_ann["box_center_w"] = df_ann["box_center_w"] / df_ann["img_width"]
    df_ann["box_center_h"] = df_ann["box_center_h"] / df_ann["img_height"]
    df_ann["box_width"] = df_ann["box_width"] / df_ann["img_width"]
    df_ann["box_height"] = df_ann["box_height"] / df_ann["img_height"]

    return df_ann


def save_annots_to_s3(s3_bucket, prefix, df_local):
    """
    For every image in the dataset, save a text file with annotation in YOLO format

    Input:
    s3_bucket: S3 bucket name
    prefix: Folder name under s3_bucket where files will be written
    df_local: pandas dataframe with the following columns
              img_file int_category box_center_w box_center_h box_width box_height
    """

    unique_images = df_local["img_file"].unique()
    s3_resource = boto3.resource("s3")

    for image_file in unique_images:
        df_single_img_annots = df_local.loc[df_local.img_file == image_file]
        annot_txt_file = image_file.split(".")[0] + ".txt"
        destination = f"{prefix}/{annot_txt_file}"

        csv_buffer = StringIO()
        df_single_img_annots.to_csv(
            csv_buffer,
            index=False,
            header=False,
            sep=" ",
            float_format="%.4f",
            columns=[
                "int_category",
                "box_center_w",
                "box_center_h",
                "box_width",
                "box_height",
            ],
        )
        s3_resource.Object(s3_bucket, destination).put(Body=csv_buffer.getvalue())


def get_cats(json_file):
    """
    Makes a list of the category names in proper order

    Input:
    json_file: s3 path of the json file containing the category information

    Returns:
    cats: List of category names
    """

    filesys = s3fs.S3FileSystem()
    with filesys.open(json_file) as fin:
        line = fin.readline()
        record = json.loads(line)
        labels = [item["label"] for item in record["labels"]]

    return labels


def main():
    """
    Performs the following tasks:
    1. Reads input from 'input.json'
    2. Collects the category names from the Ground Truth job
    3. Creates a dataframe with annotations in YOLO format
    4. Saves a text file in S3 with YOLO annotations
       for each of the labeled images
    """

    with open("input.json") as fjson:
        input_dict = json.load(fjson)

    s3_bucket = input_dict["s3_bucket"]
    job_id = input_dict["job_id"]
    gt_job_name = input_dict["ground_truth_job_name"]
    yolo_output = input_dict["yolo_output_dir"]

    s3_path_cats = (
        f"s3://{s3_bucket}/{job_id}/ground_truth_annots/{gt_job_name}/annotation-tool/data.json"
    )
    categories = get_cats(s3_path_cats)
    print("n labels used in Ground Truth job: ")
    print(categories, "n")

    gt_annot_file = "annot.csv"
    s3_dir = f"{job_id}/{yolo_output}"
    print(f"annotation files saved in = ", s3_dir)

    df_annot = annot_yolo(gt_annot_file, categories)
    save_annots_to_s3(s3_bucket, s3_dir, df_annot)


if __name__ == "__main__":
    main()

Save the preceding code block in a file named create_annot.py and, from the AWS CLI, run:

python create_annot.py
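The script reads its configuration from an input.json file in the working directory. If you created this file earlier in the pipeline, reuse it; otherwise, the following is a minimal sketch for generating it, assuming the example bucket, job ID, Ground Truth job name, and output folder names used elsewhere in this post:

import json

# Assumed example values taken from the S3 paths used in this post;
# replace them with the names from your own labeling job.
params = {
    "s3_bucket": "ground-truth-data-labeling",
    "job_id": "bounding_box",
    "ground_truth_job_name": "bbox-yolo",
    "yolo_output_dir": "yolo_annot_files",
}

with open("input.json", "w") as fjson:
    json.dump(params, fjson, indent=2)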

The annot_yolo procedure transforms the dataframe you created by rescaling the box coordinates by the image size, and the save_annots_to_s3 procedure saves the annotations corresponding to each image into a text file and stores it in Amazon S3.
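As a quick sanity check of that conversion, the following standalone sketch applies the same math to a single hypothetical box (the pixel values are made up for illustration):

# YOLO format: <object-class> <x_center> <y_center> <width> <height>,
# with all box values normalized by the image dimensions.
img_width, img_height = 3024, 4032      # hypothetical image size in pixels
box_left, box_top = 1500, 2000          # hypothetical top-left corner (Ground Truth style)
box_width, box_height = 300, 120        # hypothetical box size in pixels

x_center = (box_left + box_width / 2) / img_width    # 0.5456
y_center = (box_top + box_height / 2) / img_height   # 0.5109
w_norm = box_width / img_width                       # 0.0992
h_norm = box_height / img_height                     # 0.0298

print(f"{x_center:.4f} {y_center:.4f} {w_norm:.4f} {h_norm:.4f}")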

 

You can now inspect a couple of images and their corresponding annotations to make sure they’re properly formatted for model training. However, you first need to write a procedure to draw YOLO formatted bounding boxes on an image. Save the following code block in visualize.py:

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.colors as mcolors
import argparse


def visualize_bbox(img_file, yolo_ann_file, label_dict, figure_size=(6, 8)):
    """
    Plots bounding boxes on images

    Input:
    img_file: Path to the image file
    yolo_ann_file: Text file containing annotations in YOLO format
    label_dict: Dictionary of image categories
    figure_size: Figure size
    """

    img = mpimg.imread(img_file)
    fig, ax = plt.subplots(1, 1, figsize=figure_size)
    ax.imshow(img)

    im_height, im_width, _ = img.shape

    palette = mcolors.TABLEAU_COLORS
    colors = [c for c in palette.keys()]
    with open(yolo_ann_file, "r") as fin:
        for line in fin:
            cat, center_w, center_h, width, height = line.split()
            cat = int(cat)
            category_name = label_dict[cat]
            left = (float(center_w) - float(width) / 2) * im_width
            top = (float(center_h) - float(height) / 2) * im_height
            width = float(width) * im_width
            height = float(height) * im_height

            rect = plt.Rectangle(
                (left, top),
                width,
                height,
                fill=False,
                linewidth=2,
                edgecolor=colors[cat],
            )
            ax.add_patch(rect)
            props = dict(boxstyle="round", facecolor=colors[cat], alpha=0.5)
            ax.text(
                left,
                top,
                category_name,
                fontsize=14,
                verticalalignment="top",
                bbox=props,
            ) 
    plt.show()


def main():
    """
    Plots bounding boxes
    """

    labels = {0: "pen", 1: "pencil"}
    parser = argparse.ArgumentParser()
    parser.add_argument("img", help="image file")
    args = parser.parse_args()
    img_file = args.img
    ann_file = img_file.split(".")[0] + ".txt"
    visualize_bbox(img_file, ann_file, labels, figure_size=(6, 8))


if __name__ == "__main__":
    main()

Download an image and the corresponding annotation file from Amazon S3. See the following code:

aws s3 cp s3://ground-truth-data-labeling/bounding_box/yolo_annot_files/IMG_20200816_205004.txt .
aws s3 cp s3://ground-truth-data-labeling/bounding_box/images/IMG_20200816_205004.jpg .

To display the correct label on each bounding box, you need to specify the names of the objects you labeled in a dictionary and pass it to visualize_bbox. For this use case, you only have two items in the list. However, the order of the labels is important: it must match the order you used when you created the Ground Truth labeling job. If you can't remember the order, you can retrieve it from the s3://ground-truth-data-labeling/bounding_box/ground_truth_annots/bbox-yolo/annotation-tool/data.json file in Amazon S3, which the Ground Truth job creates automatically.

The contents of the data.json file for this task look like the following code:

{"document-version":"2018-11-28","labels":[{"label":"pencil"},{"label":"pen"}]}

Therefore, visualize.py defines the label dictionary as follows:

labels = {0: 'pencil', 1: 'pen'}
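If you prefer not to hard-code the dictionary, the following sketch builds it directly from data.json, reusing the same s3fs pattern as get_cats in create_annot.py (the S3 path shown is the example one from this post):

import json
import s3fs


def build_label_dict(json_file):
    """Map YOLO integer categories to label names in Ground Truth order"""
    filesys = s3fs.S3FileSystem()
    with filesys.open(json_file) as fin:
        record = json.loads(fin.readline())
    return {idx: item["label"] for idx, item in enumerate(record["labels"])}


# Example:
# labels = build_label_dict(
#     "s3://ground-truth-data-labeling/bounding_box/ground_truth_annots/bbox-yolo/annotation-tool/data.json"
# )
# labels -> {0: "pencil", 1: "pen"}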

Now run the following to visualize the image:

python visualize.py IMG_20200816_205004.jpg

The following screenshot shows the bounding boxes correctly drawn around two pens.

To plot an image with a mix of pens and pencils, get the image and the corresponding annotation text from Amazon S3. See the following code:

aws s3 cp s3://ground-truth-data-labeling/bounding_box/yolo_annot_files/IMG_20200816_205029.txt .
aws s3 cp s3://ground-truth-data-labeling/bounding_box/images/IMG_20200816_205029.jpg .

Override the default figure size in the visualize_bbox procedure to (10, 12) and run the following:

python visualize.py IMG_20200816_205029.jpg
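For reference, the override amounts to a one-line change to the visualize_bbox call in main() of visualize.py, for example:

visualize_bbox(img_file, ann_file, labels, figure_size=(10, 12))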

The following screenshot shows three bounding boxes correctly drawn around two types of objects.

Conclusion

This post described how to create an efficient, end-to-end data-gathering pipeline in Amazon SageMaker Ground Truth for an object detection model. Try out this process yourself the next time you create an object detection model. You can modify the annotation post-processing to produce labeled data in the Pascal VOC format, which is required for models such as Faster R-CNN. You can also adapt the basic framework to other data-labeling pipelines with job-specific modifications. For example, you can rewrite the annotation post-processing procedures to adapt the framework for an instance segmentation task, in which an object is labeled at the pixel level instead of with a rectangle drawn around it. Amazon SageMaker Ground Truth is regularly updated with enhanced capabilities, so check the documentation for the most up-to-date features.


About the Author

Arkajyoti Misra is a Data Scientist working in AWS Professional Services. He loves to dig into Machine Learning algorithms and enjoys reading about new frontiers in Deep Learning.

Read More

Lilt CEO Spence Green Talks Removing Language Barriers in Business

Lilt CEO Spence Green Talks Removing Language Barriers in Business

When large organizations require translation services, there’s no room for the amusing errors often produced by automated apps. That’s where Lilt, an AI-powered enterprise language translation company, comes in.

Lilt CEO Spence Green spoke with AI Podcast host Noah Kravitz about how the company is using a human-in-the-loop process to achieve fast, accurate and affordable translation.

Lilt does so with predictive typing software, in which professional translators receive AI-based suggestions for how to translate content. By relying on machine assistance, Lilt keeps its translations efficient without sacrificing accuracy.

However, including people in the company’s workflow also makes localization possible. Professional translators use cultural context to take direct translations and adjust phrases or words to reflect the local language and customs.

Lilt currently supports translations of 45 languages, and aims to continue improving its AI and make translation services more affordable.

Key Points From This Episode:

  • Green’s experience living in Abu Dhabi was part of the inspiration behind Lilt. While there, he met a man, an accountant, who had immigrated from Egypt. When asked why he no longer worked in accounting, the man explained that he didn’t speak English, and accountants who only spoke Arabic were paid less. Green didn’t want the difficulty of adult language learning to be a source of inequality in a business environment.
  • Lilt was founded in 2015, and evolved from a solely software company into a software and services business. Green explains the steps it took for the company to manage translators and act as a complete solution for enterprises.

Tweetables:

“We’re trying to provide technology that’s going to drive down the cost and increase the quality of this service, so that every organization can make all of its information available to anyone.” — Spence Green [2:53]

“One could argue that [machine translation systems] are getting better at a faster rate than at any point in the 70-year history of working on these systems.” — Spence Green [14:01]

You Might Also Like:

Hugging Face’s Sam Shleifer Talks Natural Language Processing

Hugging Face is more than just an adorable emoji — it’s a company that’s demystifying AI by transforming the latest developments in deep learning into usable code for businesses and researchers, explains research engineer Sam Shleifer.

Credit Check: Capital One’s Kyle Nicholson on Modern Machine Learning in Finance

Capital One Senior Software Engineer Kyle Nicholson explains how modern machine learning techniques have become a key tool for financial and credit analysis.

A Conversation with the Entrepreneur Behind the World’s Most Realistic Artificial Voices

Voice recognition is one thing; creating natural-sounding artificial voices is quite another. Lyrebird co-founder Jose Sotelo speaks about how the startup is using deep learning to create a system that’s able to listen to human voices and generate speech mimicking the original human speaker.

Tune in to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn. If your favorite isn’t listed here, drop us a note.


Make the AI Podcast Better

Have a few minutes to spare? Fill out this listener survey. Your answers will help us make a better podcast.


Read More

Measuring Gendered Correlations in Pre-trained NLP Models

Measuring Gendered Correlations in Pre-trained NLP Models

Posted by Kellie Webster, Software Engineer, Google Research

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari1, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in a sentence in which a nurse tells a patient that his shift will be ending soon, the model should recognize that his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the public BERT (Large) model nor the public ALBERT model achieves a zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to the models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there is a range of cues available to understand text, and it is possible for a general model to pick up on any or all of them. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google’s AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



1 Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Read More