Launch processing jobs with a few clicks using Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications by using a visual interface. Previously, when you created a Data Wrangler data flow, you could choose different export options to easily integrate that data flow into your data processing pipeline. Data Wrangler offers export options to Amazon Simple Storage Service (Amazon S3), SageMaker Pipelines, and SageMaker Feature Store, or as Python code. The export options create a Jupyter notebook and require you to run the code to start a processing job facilitated by SageMaker Processing.

We’re excited to announce the general release of destination nodes and the Create Job feature in Data Wrangler. This feature gives you the ability to export all the transformations that you made to a dataset to a destination node with just a few clicks. This allows you to create data processing jobs and export to Amazon S3 purely via the visual interface without having to generate, run, or manage Jupyter notebooks, thereby enhancing the low-code experience. To demonstrate this new feature, we use the Titanic dataset and show how to export your transformations to a destination node.

Prerequisites

Before we learn how to use destination nodes with Data Wrangler, you should already understand how to access and get started with Data Wrangler. You also need to know what a data flow is in the context of Data Wrangler and how to create one by importing your data from the different data sources Data Wrangler supports.

Solution overview

Consider the following data flow named example-titanic.flow:

  • It imports the Titanic dataset three times. You can see these different imports as separate branches in the data flow.
  • For each branch, it applies a set of transformations and visualizations.
  • It joins the branches into a single node with all the transformations and visualizations.

With this flow, you might want to process and save parts of your data to a specific branch or location.

In the following steps, we demonstrate how to create destination nodes, export them to Amazon S3, and create and launch a processing job.

Create a destination node

You can use the following procedure to create destination nodes and export your transformed data to an S3 bucket:

  1. Determine what parts of the flow file (transformations) you want to save.
  2. Choose the plus sign next to the nodes that represent the transformations that you want to export. (If a node is collapsed, choose its options icon (three dots) instead.)
  3. Hover over Add destination.
  4. Choose Amazon S3.
  5. Specify the fields as shown in the following screenshot.
  6. For the second join node, follow the same steps to add Amazon S3 as a destination and specify the fields.

You can repeat these steps as many times as you need for as many nodes as you want in your data flow. Later on, you pick which destination nodes to include in your processing job.

Launch a processing job

Use the following procedure to create a processing job and choose the destination nodes that you want to export to:

  1. On the Data Flow tab, choose Create job.
  2. For Job name, enter the name of the export job.
  3. Select the destination nodes you want to export.
  4. Optionally, specify the AWS Key Management Service (AWS KMS) key ARN.

The KMS key is a cryptographic key that you can use to protect your data. For more information about KMS keys, see the AWS Key Management Service Developer Guide.

  5. Choose Next, 2. Configure job to continue to the job configuration step.
  6. Optionally, configure the job to suit your needs by changing the instance type or count, or by adding any tags to associate with the job.
  7. Choose Run to run the job.

A success message appears when the job is successfully created.
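
The job that Data Wrangler starts is a standard SageMaker Processing job, so you can also track it outside the visual interface. The following boto3 sketch (with a hypothetical job name prefix) shows one way to look up the most recent job and check its status; Data Wrangler does not generate this code for you.

```python
import boto3

# Hypothetical prefix; replace with the job name you entered in the Create job dialog.
JOB_NAME_PREFIX = "example-titanic-export"

sagemaker = boto3.client("sagemaker")

# Find the most recent processing job whose name contains the prefix.
jobs = sagemaker.list_processing_jobs(
    NameContains=JOB_NAME_PREFIX,
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["ProcessingJobSummaries"]

if jobs:
    job_name = jobs[0]["ProcessingJobName"]
    details = sagemaker.describe_processing_job(ProcessingJobName=job_name)
    print(job_name, details["ProcessingJobStatus"])
```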

View the final data

Finally, you can use the following steps to view the exported data:

  1. After you create the job, choose the provided link.

A new tab opens showing the processing job on the SageMaker console.

  2. When the job is complete, review the exported data on the Amazon S3 console.

You should see a new folder with the job name you chose.

  3. Choose the job name to view a CSV file (or multiple files) with the final data. (A programmatic way to list these files is sketched below.)
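
If you'd rather inspect the output programmatically than through the Amazon S3 console, a minimal boto3 sketch like the following lists the exported files. The bucket name and prefix are placeholders for the destination and job name you configured.

```python
import boto3

# Hypothetical bucket and prefix; use the S3 location you specified for the destination
# node plus the folder named after your job.
BUCKET = "example-bucket"
PREFIX = "titanic-export/example-titanic-job/"

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
    # Optionally download each CSV part locally:
    # s3.download_file(BUCKET, obj["Key"], obj["Key"].split("/")[-1])
```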

FAQ

In this section, we address a few frequently asked questions about this new feature:

  • What happened to the Export tab? With this new feature, we removed the Export tab from Data Wrangler. You can still export via the Jupyter notebooks that Data Wrangler generates from any node you created in the data flow with the following steps:
    1. Choose the plus sign next to the node that you want to export.
    2. Choose Export to.
    3. Choose Amazon S3 (via Jupyter Notebook).
    4. Run the Jupyter notebook.
  • How many destination nodes can I include in a job? There is a maximum of 10 destination nodes per processing job.
  • How many destination nodes can I have in a flow file? You can have as many destination nodes as you want.
  • Can I add transformations after my destination nodes? No. Destination nodes are terminal nodes, so no further steps can follow them.
  • What are the supported destinations I can use with destination nodes? As of this writing, Amazon S3 is the only supported destination. Support for more destination types will be added in the future. Reach out if there is a specific one you would like to see.

Summary

In this post, we demonstrated how to use the newly launched destination nodes to create processing jobs and save your transformed datasets directly to Amazon S3 via the Data Wrangler visual interface. With this additional feature, we have enhanced the tool-driven low-code experience of Data Wrangler.

As next steps, we recommend you try the example demonstrated in this post. If you have any questions or want to learn more, see Export or leave a question in the comment section.


About the Authors

Alfonso Austin-Rivera is a Front End Engineer at Amazon SageMaker Data Wrangler. He is passionate about building intuitive user experiences that spark joy. In his spare time, you can find him fighting gravity at a rock-climbing gym or outside flying his drone.

Parsa Shahbodaghi is a Technical Writer in AWS specializing in machine learning and artificial intelligence. He writes the technical documentation for Amazon SageMaker Data Wrangler and Amazon SageMaker Feature Store. In his free time, he enjoys meditating, listening to audiobooks, weightlifting, and watching stand-up comedy. He will never be a stand-up comedian, but at least his mom thinks he’s funny.

Balaji Tummala is a Software Development Engineer at Amazon SageMaker. He helps support Amazon SageMaker Data Wrangler and is passionate about building performant and scalable software. Outside of work, he enjoys reading fiction and playing volleyball.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

Read More

Prepare and analyze JSON and ORC data with Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a new capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications via a visual interface. Data preparation is a crucial step of the ML lifecycle, and Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and analyze data for ML in a seamless, visual, low-code experience. It lets you easily and quickly connect to AWS components like Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and AWS Lake Formation, and external sources like Snowflake. Data Wrangler also supports standard data types such as CSV and Parquet.

Data Wrangler now additionally supports Optimized Row Columnar (ORC), JavaScript Object Notation (JSON), and JSON Lines (JSONL) file formats:

  • ORC – The ORC file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data. ORC is widely used in the Hadoop ecosystem.
  • JSON – The JSON file format is a lightweight, commonly used data interchange format.
  • JSONL – JSON Lines, also called newline-delimited JSON, is a convenient format for storing structured data that may be processed one record at a time.

You can preview ORC, JSON, and JSONL data prior to importing the datasets into Data Wrangler. After you import the data, you can also use one of the newly launched transformers to work with columns that contain JSON strings or arrays that are commonly found in nested JSONs.
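
For context, these are formats you may already work with locally. The following pandas sketch (with placeholder file names) only illustrates the three formats; it is not part of the Data Wrangler workflow.

```python
import pandas as pd

# Local preview of the same formats outside Data Wrangler (file names are placeholders).
orc_df = pd.read_orc("example.orc")                   # ORC; requires pyarrow
json_df = pd.read_json("example.json")                # standard JSON
jsonl_df = pd.read_json("example.jsonl", lines=True)  # JSON Lines: one record per line

print(orc_df.head(), json_df.head(), jsonl_df.head(), sep="\n\n")
```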

Import and analyze ORC data with Data Wrangler

Importing ORC data in Data Wrangler is easy and similar to importing files in any other supported format. Browse to your ORC file in Amazon S3 and, in the DETAILS pane, choose ORC as the file type during import.

If you’re new to Data Wrangler, review Get Started with Data Wrangler. Also, see Import to learn about the various import options.

Import and analyze JSON data with Data Wrangler

Now let’s import files in JSON format with Data Wrangler and work with columns that contain JSON strings or arrays. We also demonstrate how to deal with nested JSON. With Data Wrangler, importing JSON files from Amazon S3 is a seamless process, similar to importing files in any other supported format. After you import the files, you can preview the JSON files as shown in the following screenshot. Make sure to set the file type to JSON in the DETAILS pane.

Next, let’s work on structured columns in the imported JSON file.

To deal with structured columns in JSON files, Data Wrangler is introducing two new transforms: Flatten structured column and Explode array column, which can be found under the Handle structured column option in the ADD TRANSFORM pane.

Let’s start by applying the Explode array column transform to one of the columns in our imported data. Before applying the transform, we can see the column topping is an array of JSON objects with id and type keys.

After we apply the transform, we can observe the new rows added as a result. Each element in the array is now a new row in the resulting DataFrame.

Now let’s apply the Flatten structured column transform on the topping_flattened column that was created as a result of the Explode array column transformation we applied in the previous step.

Before applying the transform, we can see the keys id and type in the topping_flattened column.

After applying the transform, the keys id and type in the topping_flattened column become new columns, topping_flattened_id and topping_flattened_type, created as a result of the transformation. You also have the option to flatten only specific keys by entering comma-separated key names for Keys to flatten on. If left empty, all the keys inside the JSON string or struct are flattened.
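
Outside Data Wrangler, you can approximate the effect of these two transforms with pandas. The following sketch uses toy data shaped like the topping column from our example; it is a conceptual stand-in, not how Data Wrangler implements the transforms.

```python
import pandas as pd

# Toy data shaped like the example: each row has a topping array of objects with id and type keys.
df = pd.DataFrame({
    "name": ["donut_a", "donut_b"],
    "topping": [
        [{"id": "5001", "type": "None"}, {"id": "5002", "type": "Glazed"}],
        [{"id": "5005", "type": "Sugar"}],
    ],
})

# Roughly what Explode array column does: one row per array element.
exploded = df.explode("topping").rename(columns={"topping": "topping_flattened"})

# Roughly what Flatten structured column does: each key becomes a new column.
flattened = pd.json_normalize(exploded["topping_flattened"].tolist()).add_prefix("topping_flattened_")
result = pd.concat(
    [exploded.drop(columns="topping_flattened").reset_index(drop=True), flattened],
    axis=1,
)
print(result)
```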

Conclusion

In this post, we demonstrated how to easily import ORC and JSON files with Data Wrangler. We also applied the newly launched transformations to structured columns in JSON data. This makes working with columns that contain JSON strings or arrays a seamless experience.

As next steps, we recommend you replicate the demonstrated examples in your own Data Wrangler visual interface. If you have any questions related to Data Wrangler, feel free to leave them in the comment section.


About the Authors

Balaji Tummala is a Software Development Engineer at Amazon SageMaker. He helps support Amazon SageMaker Data Wrangler and is passionate about building performant and scalable software. Outside of work, he enjoys reading fiction and playing volleyball.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

Read More

Run AutoML experiments with large Parquet datasets using Amazon SageMaker Autopilot

Starting today, you can use Amazon SageMaker Autopilot to tackle regression and classification tasks on large datasets up to 100 GB. Additionally, you can now provide your datasets in either CSV or Apache Parquet content types.

Businesses are generating more data than ever. A corresponding demand is growing for generating insights from these large datasets to shape business decisions. However, successfully training state-of-the-art machine learning (ML) algorithms on these large datasets can be challenging. Autopilot automates this process and provides a seamless experience for running automated machine learning (AutoML) on large datasets up to 100 GB.

Autopilot subsamples your large datasets automatically to fit the maximum supported limit while preserving the rare class in case of class imbalance. Class imbalance is an important problem to be aware of in ML, especially when dealing with large datasets. Consider a fraud detection dataset where only a small fraction of transactions is expected to be fraudulent. In this case, Autopilot subsamples only the majority class, non-fraudulent transactions, while preserving the rare class, fraudulent transactions.
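
To make the idea of class-preserving subsampling concrete, here is a toy pandas sketch that keeps every rare-class record and downsamples only the majority class. Autopilot performs its own subsampling internally; the column name, file name, and row cap here are invented for illustration.

```python
import pandas as pd

# Illustrative only: keep all rare (fraudulent) records, downsample the majority class.
MAX_MAJORITY_ROWS = 500_000  # made-up cap for the example

df = pd.read_parquet("transactions.parquet")  # placeholder file name

fraud = df[df["is_fraud"] == 1]
non_fraud = df[df["is_fraud"] == 0]
non_fraud_sample = non_fraud.sample(
    n=min(len(non_fraud), MAX_MAJORITY_ROWS), random_state=0
)

subsampled = pd.concat([fraud, non_fraud_sample]).sample(frac=1.0, random_state=0)  # shuffle
print(len(df), "->", len(subsampled))
```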

When you run an AutoML job using Autopilot, all relevant information for subsampling is stored in Amazon CloudWatch. Navigate to the log group for /aws/sagemaker/ProcessingJobs, search for the name of your AutoML job, and choose the CloudWatch log stream that includes -db- in its name.
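
If you prefer to query the logs programmatically, a boto3 sketch along these lines retrieves the relevant stream. It assumes the processing job's log streams begin with your AutoML job name, and the job name shown is a placeholder.

```python
import boto3

# Hypothetical AutoML job name; replace with the name of your Autopilot job.
AUTOML_JOB_NAME = "my-autopilot-job"
LOG_GROUP = "/aws/sagemaker/ProcessingJobs"

logs = boto3.client("logs")

# List streams in the processing-job log group and keep the ones for this job
# that contain -db- in their name.
streams = logs.describe_log_streams(
    logGroupName=LOG_GROUP,
    logStreamNamePrefix=AUTOML_JOB_NAME,
)["logStreams"]

for stream in streams:
    if "-db-" in stream["logStreamName"]:
        events = logs.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=stream["logStreamName"],
            limit=50,
        )["events"]
        for event in events:
            print(event["message"])
```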

Many of our customers prefer the Parquet content type to store their large datasets. This is generally due to its compressed nature, support for advanced data structures, efficiency, and low-cost operations. This data can often reach up to tens or even hundreds of GBs. Now, you can directly bring these Parquet datasets to Autopilot. You can either use our API or navigate to Amazon SageMaker Studio to create an Autopilot job with a few clicks. You can specify the input location of your Parquet dataset as a single file or multiple files specified as a manifest file. Autopilot automatically detects the content type of your dataset, parses it, extracts meaningful features, and trains multiple ML algorithms.
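
If you take the API route, a call along the lines of the following boto3 sketch starts an Autopilot job on a Parquet dataset stored under an S3 prefix. The bucket, target column, and role ARN are placeholders, and you can achieve the same result with a few clicks in SageMaker Studio.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Hypothetical names and paths; replace with your own S3 locations, target column, and role.
sagemaker.create_auto_ml_job(
    AutoMLJobName="parquet-automl-example",
    InputDataConfig=[
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",  # or "ManifestFile" for multiple files
                    "S3Uri": "s3://example-bucket/input/parquet/",
                }
            },
            "TargetAttributeName": "label",
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/automl-output/"},
    ProblemType="BinaryClassification",
    AutoMLJobObjective={"MetricName": "F1"},
    RoleArn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
)
# Autopilot detects the content type (CSV or Parquet) of the input data automatically.
```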

You can get started using our sample notebook for running AutoML using Autopilot on Parquet datasets.


About the Authors

H. Furkan Bozkurt, Machine Learning Engineer, Amazon SageMaker Autopilot.

Valerio Perrone, Applied Science Manager, Amazon SageMaker Autopilot.

Read More

Use a web browser plugin to quickly translate text with Amazon Translate

Web browsers can be a single pane of glass for organizations to interact with their information: all of the tools can be viewed and accessed on one screen so that users don't have to switch between applications and interfaces. For example, a customer call center might have several different applications to see customer reviews, social media feeds, and customer data. Each of these applications is accessed through a web browser. If the information is in a language that the user doesn't speak, however, a separate application often needs to be opened to translate the text. Web browser plugins enable customization of this user experience.

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms. As of writing this post, Amazon Translate supports 75 languages and 5,550 language pairs. For the latest list, see the Amazon Translate Developer Guide.

With the Amazon Translate web browser plugin, you can simply click a button and have an entire web page translated to whatever language you prefer. This browser plugin works in Chromium-based and Firefox-based browsers.

This post shows how you can use a browser plugin to quickly translate web pages with neural translation with Amazon Translate.

Overview of solution

To use the plugin, install it into a browser on your workstation. To translate a web page, activate the plugin, which authenticates to Amazon Translate using AWS Identity and Access Management (IAM), sends the text of the page you wish to translate to the Amazon Translate service, and returns the translated text to be displayed in the web browser. The browser plugin also enables caching of translated pages. When caching is enabled, translations requested for a webpage are cached to your local machine by their language pairs. Caching improves the speed of the translation of the page and reduces the number of requests made to the Amazon Translate service, potentially saving time and money.
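
Under the hood, the plugin calls the Amazon Translate TranslateText API with the credentials you configure. The plugin itself is written in JavaScript; the following Python sketch only illustrates an equivalent call, and the Region and sample text are placeholders.

```python
import boto3

# Credentials are resolved from your environment here; the plugin uses the keys you
# enter in its settings instead.
translate = boto3.client("translate", region_name="us-east-1")

response = translate.translate_text(
    Text="Amazon Translate ist ein neuronaler maschineller Übersetzungsservice.",
    SourceLanguageCode="auto",   # let Amazon Translate detect the source language
    TargetLanguageCode="en",
)
print(response["TranslatedText"])
```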

To install and use the plugin, complete the following steps:

  1. Set up an IAM user and credentials.
  2. Install the browser plugin.
  3. Configure the browser plugin.
  4. Use the plugin to translate text.

The browser plugin is available on GitHub.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • A compatible web browser
  • The privileges to create IAM users to authenticate to Amazon Translate

For more information about how Amazon Translate interacts with IAM, see Identity and Access Management for Amazon Translate.

Set up an IAM user and credentials

The browser plugin needs to be configured with credentials to access Amazon Translate. AWS provides a managed policy called TranslateReadOnly that allows the necessary API calls to Amazon Translate. To set up a read-only IAM user, complete the following steps (a scripted alternative is sketched after the list):

  1. On the IAM console, choose Users in the navigation pane under Access management.
  2. Choose Add users.
  3. For User name, enter TranslateBrowserPlugin.
  4. Choose Next: Permissions.
  5. To add permissions, choose Attach existing policies directly and choose the policy TranslateReadOnly.
  6. Choose Next: Tags.
  7. Optionally, give the user a tag, and choose Next: Review.
  8. Review the new role and choose Create user.
  9. Choose Download .csv and save the credentials locally.
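
If you prefer to script this setup, a boto3 sketch along these lines creates the same user, attaches the managed policy, and generates an access key. Treat the printed credentials with the same care as the downloaded .csv file.

```python
import boto3

iam = boto3.client("iam")
USER_NAME = "TranslateBrowserPlugin"

# Create the user and attach the AWS managed TranslateReadOnly policy.
iam.create_user(UserName=USER_NAME)
iam.attach_user_policy(
    UserName=USER_NAME,
    PolicyArn="arn:aws:iam::aws:policy/TranslateReadOnly",
)

# Create an access key pair for the plugin to use; store it securely.
key = iam.create_access_key(UserName=USER_NAME)["AccessKey"]
print(key["AccessKeyId"], key["SecretAccessKey"])
```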

Although these credentials provide only read-only access to Amazon Translate, take care that they aren't shared with unintended entities. AWS or Amazon will not be responsible if customers share their credentials.

Install the browser plugin

The web browser plugin is supported in all Chromium-based browsers. To install the plugin in Chrome, complete the following steps:

  1. Download the extension.zip file from GitHub.
  2. Unzip the file on your local machine.
  3. In Chrome, choose the extensions icon.
  4. Choose Manage Extensions.
  5. Toggle Developer mode on.
  6. Choose Load Unpacked and point to the extension folder that you just unzipped.

Configure the plugin

To configure the plugin, complete the following steps:

  1. In your browser, choose the extensions toolbar and choose Amazon Translate, the newly installed plugin.

You can choose the pin icon for easier access later.

  2. Choose Extension Settings.
  3. For AWS Region, enter the Region closest to you.
  4. For AWS Access Key ID, enter the AWS access key from the spreadsheet you downloaded.
  5. For AWS Secret Access Key, enter the secret access key from the spreadsheet.
  6. Select the check box to enable caching.
  7. Choose Save Settings.

Use the plugin with Amazon Translate

Now the plugin is ready to be used.

  1. To get started, open a web page in a browser to be translated. For this post, we use the landing page for Amazon Translate in German.
  2. Open the browser plugin and choose Amazon Translate in the browser extension list as you did earlier.
  3. For the source language, choose Auto for Amazon Translate to use automatic language detection, and choose your target language.
  4. Choose Translate.

The plugin sends the text to Amazon Translate and translates the page contents to English.

Cost

Amazon Translate is priced at $15 per million characters prorated by number of characters ($0.000015 per character).

You also get 2 million characters per month for 12 months for free, starting from the date on which you create your first translation request. For more information, see Amazon Translate pricing.

The Amazon Translate landing page we translated has about 8,000 characters, making the translation cost about $0.12. With the caching feature enabled, subsequent calls to translate the page for the language pair use the local cached copy, and don’t require calls to Amazon Translate.
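
As a quick sanity check, the arithmetic behind that estimate looks like this (the character count is the approximate figure mentioned above):

```python
# Back-of-the-envelope cost estimate at $15 per million characters.
PRICE_PER_CHARACTER = 15 / 1_000_000  # $0.000015

page_characters = 8_000
print(f"Estimated cost: ${page_characters * PRICE_PER_CHARACTER:.2f}")  # about $0.12
```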

Conclusion

Amazon Translate provides neural network translation for 75 languages and 5,550 language pairs. You can integrate Amazon Translate into a browser plugin to bring translation seamlessly into an application workflow. We look forward to hearing how using this plugin helps accelerate your translation workloads! Learn more about Amazon Translate in the Amazon Translate Developer Guide or on the AWS blog.


About the Authors

Andrew Stacy is a Front-end Developer with AWS Professional Services. Andrew enjoys creating delightful user experiences for customers through UI/UX development and design. When off the clock, Andrew enjoys playing with his kids, writing code, trying craft beverages, or building things around the house.

Ron Weinstein is a Solutions Architect specializing in Artificial Intelligence and Machine Learning in AWS Public Sector. Ron loves working with his customers on how AI/ML can accelerate and transform their business. When not at work, Ron likes the outdoors and spending time with his family.

Read More

How Clearly accurately predicts fraudulent orders using Amazon Fraud Detector

This post was cowritten by Ziv Pollak, Machine Learning Team Lead, and Sarvi Loloei, Machine Learning Engineer at Clearly. The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.

A pioneer in online shopping, Clearly launched their first site in 2000. Since then, we’ve grown to become one of the biggest online eyewear retailers in the world, providing customers across Canada, the US, Australia, and New Zealand with glasses, sunglasses, contact lenses, and other eye health products. Through its mission to eliminate poor vision, Clearly strives to make eyewear affordable and accessible for everyone. Creating an optimized fraud detection platform is a key part of this wider vision.

Identifying online fraud is one of the biggest challenges every online retail organization has—hundreds of thousands of dollars are lost due to fraud every year. Product costs, shipping costs, and labor costs for handling fraudulent orders further increase the impact of fraud. Easy and fast fraud evaluation is also critical for maintaining high customer satisfaction rates. Transactions shouldn’t be delayed due to lengthy fraud investigation cycles.

In this post, we share how Clearly built an automated and orchestrated forecasting pipeline using AWS Step Functions, and used Amazon Fraud Detector to train a machine learning (ML) model that can identify online fraudulent transactions and bring them to the attention of the billing operations team. This solution also collects metrics and logs, provides auditing, and is invoked automatically.

With AWS services, Clearly deployed a serverless, well-architected solution in just a few weeks.

The challenge: Predicting fraud quickly and accurately

Clearly’s existing solution was based on flagging transactions using hard-coded rules that weren’t updated frequently enough to capture new fraud patterns. Once flagged, the transaction was manually reviewed by a member of the billing operations team.

This existing process had major drawbacks:

  • Inflexible and inaccurate – The hard-coded rules to identify fraud transactions were difficult to update, meaning the team couldn’t respond quickly to emerging fraud trends. The rules were unable to accurately identify many suspicious transactions.
  • Operationally intensive – The process couldn’t scale to high sales volume events (like Black Friday), requiring the team to implement workarounds or accept higher fraud rates. Moreover, the high level of human involvement added significant cost to the product delivery process.
  • Delayed orders – The order fulfillment timeline was delayed by manual fraud reviews, leading to unhappy customers.

Although our existing fraud identification process was a good starting point, it was neither accurate enough nor fast enough to meet the order fulfillment efficiencies that Clearly desired.

Another major challenge we faced was the lack of a tenured ML team—all members had been with the company less than a year when the project kicked off.

Overview of solution: Amazon Fraud Detector

Amazon Fraud Detector is a fully managed service that uses ML to deliver highly accurate fraud detection and requires no ML expertise. All we had to do was upload our data and follow a few straightforward steps. Amazon Fraud Detector automatically examined the data, identified meaningful patterns, and produced a fraud identification model capable of making predictions on new transactions.

The following diagram illustrates our pipeline:

To operationalize the flow, we applied the following workflow (a sketch of a scoring call follows the list):

  1. Amazon EventBridge calls the orchestration pipeline hourly to review all pending transactions.
  2. Step Functions helps manage the orchestration pipeline.
  3. An AWS Lambda function calls Amazon Athena APIs to retrieve and prepare the training data, stored on Amazon Simple Storage Service (Amazon S3).
  4. An orchestrated pipeline of Lambda functions trains an Amazon Fraud Detector model and saves the model performance metrics to an S3 bucket.
  5. Amazon Simple Notification Service (Amazon SNS) notifies users when a problem occurs during the fraud detection process or when the process completes successfully.
  6. Business analysts build dashboards on Amazon QuickSight, which queries the fraud data from Amazon S3 using Athena, as we describe later in this post.
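
For illustration only, a Lambda function in a pipeline like this could score a pending transaction with a call similar to the following boto3 sketch. The detector name, event type, entity, and variables are hypothetical and do not reflect Clearly's actual configuration.

```python
import boto3
from datetime import datetime, timezone

frauddetector = boto3.client("frauddetector")

# All identifiers and variables below are placeholders for a detector and event type
# defined in your own account.
response = frauddetector.get_event_prediction(
    detectorId="example_transaction_detector",
    eventId="order-12345",
    eventTypeName="example_transaction_event",
    eventTimestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    entities=[{"entityType": "customer", "entityId": "customer-678"}],
    eventVariables={
        "order_price": "129.99",
        "billing_country": "CA",
        "payment_method": "credit_card",
    },
)

print(response["modelScores"])   # model risk scores
print(response["ruleResults"])   # outcomes of the detector's rules
```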

We chose to use Amazon Fraud Detector for a few reasons:

  • The service taps into years of expertise that Amazon has fighting fraud. This gave us a lot of confidence in the service’s capabilities.
  • The ease of use and implementation allowed us to quickly confirm we have the dataset we need to produce accurate results.
  • Because the Clearly ML team was less than 1 year old, a fully managed service allowed us to deliver this project without needing deep technical ML skills and knowledge.

Results

Writing the prediction results into our existing data lake allows us to use QuickSight to build metrics and dashboards for senior leadership. This enables them to understand and use these results when making decisions on the next steps to meet our monthly marketing targets.

We were able to present the forecast results on two levels, starting with overall business performance and then drilling down into performance for each line of business (contacts and glasses).

Our dashboard includes the following information:

  • Fraud per day per different lines of business
  • Revenue loss due to fraud transactions
  • Location of fraud transactions (identifying fraud hot spots)
  • Fraud transactions impact by different coupon codes, which allows us to monitor for problematic coupon codes and take further actions to reduce the risk
  • Fraud per hour, which allows us to plan and manage the billing operation team and make sure we have resources available to handle transaction volume when needed

Conclusions

Effective and accurate prediction of customer fraud is one of the biggest challenges in ML for retail today, and having a good understanding of our customers and their behavior is vital to Clearly’s success. Amazon Fraud Detector provided a fully managed ML solution to easily create an accurate and reliable fraud prediction system with minimal overhead. Amazon Fraud Detector predictions have a high degree of accuracy and are simple to generate.

“With leading ecommerce tools like Virtual Try On, combined with our unparalleled customer service, we strive to help everyone see clearly in an affordable and effortless manner—which means constantly looking for ways to innovate, improve, and streamline processes,” said Dr. Ziv Pollak, Machine Learning Team Leader. “Online fraud detection is one of the biggest challenges in machine learning in retail today. In just a few weeks, Amazon Fraud Detector helped us accurately and reliably identify fraud with a very high level of accuracy, and save thousands of dollars.”


About the Author

Dr. Ziv Pollak is an experienced technical leader who transforms the way organizations use machine learning to increase revenue, reduce costs, improve customer service, and ensure business success. He is currently leading the Machine Learning team at Clearly.

Sarvi Loloei is an Associate Machine Learning Engineer at Clearly. Using AWS tools, she evaluates model effectiveness to drive business growth, increase revenue, and optimize productivity.

Read More