Method automatically generates negative training examples for deep-learning model.
Amazon at ACL: Old standards and new forays
Amazon researchers coauthor 17 conference papers, participate in seven workshops.
2019 Q4 AWS Machine Learning Research Awards recipients announced
The AWS Machine Learning Research Awards (MLRA) is pleased to announce the 28 recipients of the 2019 Q4 call-for-proposal cycle.
2019 Q4 recipients of AWS Machine Learning Research Awards
The AWS Machine Learning Research Awards (MLRA) aims to advance machine learning (ML) by funding innovative research and open-source projects, training students, and providing researchers with access to the latest technology. Since 2017, MLRA has supported over 180 research projects from 73 schools and research institutes in 13 countries, with topics such as ML algorithms, computer vision, natural language processing, medical research, neuroscience, social science, physics, and robotics.
On February 18, 2020, we announced the winners of MLRA’s 2019 Q2/Q3 call-for-proposal cycles. We’re now pleased to announce 28 new recipients of MLRA’s 2019 Q4 call-for-proposal cycle. The MLRA recipients represent 26 universities in six countries. The funded projects aim to develop open-source tools and research that benefit the ML community at large, or create impactful research using AWS ML solutions, such as Amazon SageMaker, AWS AI Services, and Apache MXNet on AWS. The following are the 2019 Q4 award recipients:
Recipient | University | Research Title |
--- | --- | --- |
Anasse Bari | New York University | Predicting the 2020 Elections Using Big Data, Analyzing What the World Wants Using Twitter and Teaching Next Generation of Thinkers How to Apply AI for Social Good |
Andrew Gordon Wilson | New York University | Scalable Numerical Methods and Probabilistic Deep Learning with Applications to AutoML |
Bo Li | University of Illinois at Urbana-Champaign | Trustworthy Machine Learning as Services via Robust AutoML and Knowledge Enhanced Logic Inference |
Dawn Song | University of California, Berkeley | Protecting the Public Against AI-Generated Fakes |
Dimosthenis Karatzas | Universitat Autònoma de Barcelona | Document Visual Question Answer (DocVQA) for Large-Scale Document Collections |
Dit-Yan Yeung | Hong Kong University of Science and Technology | Temporally Misaligned Spatiotemporal Sequence Modeling |
Lantao Liu | Indiana University Bloomington | Environment-Adaptive Sensing and Modeling using Autonomous Robots |
Leonidas Guibas | Stanford University | Learning Canonical Spaces for Object-Centric 3D Perception |
Maryam Rahnemoonfar | University of Maryland, Baltimore | Combining Model-Based and Data Driven Approaches to Study Climate Change via Amazon SageMaker |
Mi Zhang | Michigan State University | DA-NAS: An AutoML Framework for Joint Data Augmentation and Neural Architecture Search |
Michael P. Kelly | Washington University | Web-Based Machine Learning for Surgeon Benchmarking in Pediatric Spine Surgery |
Ming Zhao | Arizona State University | Enabling Deep Learning across Edge Devices and Cloud Resources |
Nianwen Xue | Brandeis University | AMR2KB: Construct a High-Quality Knowledge Base by Parsing Meaning Representations |
Nicholas Chia | Mayo Clinic | Massively-Scaled Inverse Reinforcement Learning Approach for Reconstructing the Mutational History of Colorectal Cancer |
Oswald Lanz | Fondazione Bruno Kessler | Structured Representation Learning for Video Recognition and Question Answering |
Pierre Gentine | Columbia University | Learning Fires |
Pratik Chaudhari | University of Pennsylvania | Offline and Off-Policy Reinforcement Learning |
Pulkit Agrawal | Massachusetts Institute of Technology | Curiosity Baselines for the Reinforcement Learning Community |
Quanquan Gu | University of California, Los Angeles | Towards Provably Efficient Deep Reinforcement Learning |
Shayok Chakraborty | Florida State University | Active Learning with Imperfect Oracles |
Soheil Feizi | University of Maryland, College Park | Explainable Deep Learning: Accuracy, Robustness and Fairness |
Spyros Makridakis | University of Nicosia | Clustered Ensemble of Specialist Amazon GluonTS Models for Time Series Forecasting |
Xin Jin | Johns Hopkins University | Making Sense of Network Performance for Distributed Machine Learning |
Xuan (Sharon) Di | Columbia University | Multi-Autonomous Vehicle Driving Policy Learning for Efficient and Safe Traffic |
Yi Yang | University of Technology Sydney | Efficient Video Analysis with Limited Supervision |
Yun Raymond Fu | Northeastern University | Generative Feature Transformation for Multi-Viewed Domain Adaptation |
Zhangyang (Atlas) Wang | Texas A&M University | Mobile-Captured Wound Image Analysis and Dynamic Modeling for Post-Discharge Monitoring of Surgical Site Infection |
Zhi-Li Zhang | University of Minnesota | Universal Graph Embedding Neural Networks for Learning Graph-Structured Data |
Congratulations to all MLRA recipients! We look forward to supporting your research.
For more information about MLRA, see AWS Machine Learning Research Awards or send an email to aws-ml-research-awards@amazon.com.
About the Author
Seo Yeon Shin is a program manager for the AWS AI Academic Programs.
Cisco uses Amazon SageMaker and Kubeflow to create a hybrid machine learning workflow
This is a guest post from members of Cisco’s AI/ML best practices team, including Technical Product Manager Elvira Dzhuraeva, Distinguished Engineer Debo Dutta, and Principal Engineer Amit Saha.
Cisco is a large enterprise company that applies machine learning (ML) and artificial intelligence (AI) across many of its business units. The Cisco AI team in the office of the CTO is responsible for the company’s open source (OSS) AI/ML best practices across the business units that use AI and ML, and is also a major contributor to the Kubeflow open-source project and MLPerf/MLCommons. Our charter is to create artifacts and best practices in ML that both Cisco business units and our customers can use, and we share these solutions as reference architectures.
Due to business needs, such as localized data requirements, Cisco operates a hybrid cloud environment. Model training is done on our own Cisco UCS hardware, but many of our teams leverage the cloud for inference to take advantage of its scalability, geo-redundancy, and resiliency. However, such implementations can be challenging for customers, because hybrid integration often requires deep expertise and knowledge to build and support consistent AI/ML workflows.
To address this, we built an ML pipeline using the Cisco Kubeflow starter pack for a hybrid cloud implementation that uses Amazon SageMaker to serve models in the cloud. By providing this reference architecture, we aim to help customers build seamless and consistent ML workloads across their complex infrastructure, whatever constraints they face.
Kubeflow is a popular open-source library for ML orchestration on Kubernetes. If you operate in a hybrid cloud environment, you can install the Cisco Kubeflow starter pack to develop, build, train, and deploy ML models on-premises. The starter pack includes the latest version of Kubeflow and an application examples bundle.
Amazon SageMaker is a managed ML service that helps you prepare data, process data, train models, track model experiments, host models, and monitor endpoints. With SageMaker Components for Kubeflow Pipelines, you can orchestrate jobs from Kubeflow Pipelines, which we did for our hybrid ML project. This approach lets us seamlessly use Amazon SageMaker managed services for training and inference from our on-premises Kubeflow cluster. Using Amazon SageMaker provides our hosted models with enterprise features such as automatic scaling, multi-model endpoints, model monitoring, high availability, and security compliance.
To illustrate how our use case works, we recreate the scenario using the publicly available BLE RSSI Dataset for Indoor Localization and Navigation dataset, which contains Bluetooth Low Energy (BLE) Received Signal Strength Indication (RSSI) measurements. The pipeline trains and deploys a model to predict the location of Bluetooth devices. The following steps outline how a Kubernetes cluster can interact with Amazon SageMaker for a hybrid solution. The ML model, written in Apache MXNet, is trained using Kubeflow running on Cisco UCS servers to satisfy our localized data requirements and then deployed to AWS using Amazon SageMaker.
The created and trained model is uploaded to Amazon Simple Storage Service (Amazon S3) and uses Amazon SageMaker endpoints for serving. The following diagram shows our end-to-end workflow.
Development environment
To get started, if you don’t currently have Cisco hardware, you can set up Amazon Elastic Kubernetes Service (Amazon EKS) running with Kubeflow. For instructions, see Creating an Amazon EKS cluster and Deploying Kubeflow Pipelines.
If you have an existing UCS machine, the Cisco Kubeflow starter pack offers a quick Kubeflow setup on your Kubernetes cluster (v1.15.x or later). To install Kubeflow, set the `INGRESS_IP` variable to the machine's IP address and run the `kubeflowup.bash` installation script. See the following code:
export INGRESS_IP=<UCS Machine's IP>
bash kubeflowup.bash
For more information about installation, see Installation Instructions on the GitHub repo.
Preparing the hybrid pipeline
For a seamless ML workflow between Cisco UCS and AWS, we created a hybrid pipeline using the Kubeflow Pipelines component and Amazon SageMaker Kubeflow components.
To start using the components, you need to import the Kubeflow Pipelines packages, including the AWS package:
import kfp
import kfp.dsl as dsl
from kfp import components
from kfp.aws import use_aws_secret
For the full code to configure and get the pipeline running, see the GitHub repo.
The pipeline describes the workflow and how the components relate to each other in the form of a graph. The pipeline configuration includes the definition of the inputs (parameters) required to run the pipeline and the inputs and outputs of each component. The following screenshot shows the visual representation of the finished pipeline on the Kubeflow UI.
The pipeline runs the following three steps:
- Train the model
- Create the model resource
- Deploy the model
Training the model
You train the model with the BLE data locally, create an image, upload it to the S3 bucket, and register the model with Amazon SageMaker by applying the MXNet model configuration .yaml file.
When the trained model artifacts are uploaded to Amazon S3, Amazon SageMaker uses the model stored in Amazon S3 to deploy the model to a hosting endpoint. Amazon SageMaker endpoints make it easier for downstream applications to consume models and help the team monitor them with Amazon CloudWatch. See the following code:
def blerssi_mxnet_train_upload_op(step_name='mxnet-train'):
    return dsl.ContainerOp(
        name='mxnet-train-upload-s3',
        image='ciscoai/mxnet-blerssi-train-upload:v0.2',
        command=['python', '/opt/mx-dnn.py', 'train'],
        arguments=['--bucket-name', bucket_name]
    ).apply(use_aws_secret(secret_name=aws_secret_name,
                           aws_access_key_id_name='AWS_ACCESS_KEY_ID',
                           aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
Creating the model resource
When the MXNet model and artifacts are uploaded to Amazon S3, use the KF Pipeline CreateModel component to create an Amazon SageMaker model resource.
The Amazon SageMaker endpoint API is flexible and offers several options to deploy a trained model to an endpoint. For example, you can let the default Amazon SageMaker runtime manage the model deployment, health check, and model invocation. Amazon SageMaker also allows for customization to the runtime with custom containers and algorithms. For instructions, see Overview of containers for Amazon SageMaker.
For our use case, we wanted some degree of control over the model health check API and the model invocation API. We chose the custom override for the Amazon SageMaker runtime to deploy the trained model. The custom predictor allows for flexibility in how the incoming request is processed and passed along to the model for prediction. See the following code:
sagemaker_model_op = components.load_component_from_url(model)
Deploying the model
You deploy the model to an Amazon SageMaker endpoint with the KF Pipeline CreateEndpoint component.
The custom container used for inference gives the team maximum flexibility to define custom health checks and invocations to the model. However, the custom container must follow the golden path for APIs prescribed by the Amazon SageMaker runtime. See the following code:
sagemaker_deploy_op = components.load_component_from_url(deploy)
Running the pipeline
To run your pipeline, complete the following steps:
- Configure the Python code that defines the hybrid pipeline with Amazon SageMaker components:
@dsl.pipeline(
    name='MXNet Sagemaker Hybrid Pipeline',
    description='Pipeline to train BLERSSI model using mxnet and save in aws s3 bucket'
)
def mxnet_pipeline(
    region="",
    image="",
    model_name="",
    endpoint_config_name="",
    endpoint_name="",
    model_artifact_url="",
    instance_type_1="",
    role=""
):
    train_upload_model = blerssi_mxnet_train_upload_op()

    create_model = sagemaker_model_op(
        region=region,
        model_name=model_name,
        image=image,
        model_artifact_url=model_artifact_url,
        role=role
    ).apply(use_aws_secret(secret_name=aws_secret_name,
                           aws_access_key_id_name='AWS_ACCESS_KEY_ID',
                           aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
    create_model.after(train_upload_model)

    sagemaker_deploy = sagemaker_deploy_op(
        region=region,
        endpoint_config_name=endpoint_config_name,
        endpoint_name=endpoint_name,
        model_name_1=create_model.output,
        instance_type_1=instance_type_1
    ).apply(use_aws_secret(secret_name=aws_secret_name,
                           aws_access_key_id_name='AWS_ACCESS_KEY_ID',
                           aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
    sagemaker_deploy.after(create_model)
For more information about configuration, see Pipelines Quickstart. For the full pipeline code, see the GitHub repo.
- Run the pipeline by feeding the following parameters to execute the pipeline:
run = client.run_pipeline(
    blerssi_hybrid_experiment.id,
    'blerssi-sagemaker-pipeline-' + timestamp,
    pipeline_package_path='mxnet_pipeline.tar.gz',
    params={
        'region': aws_region,
        'image': inference_image,
        'model_name': model_name,
        'endpoint_config_name': endpoint_config_name,
        'endpoint_name': endpoint_name,
        'model_artifact_url': model_path,
        'instance_type_1': instance_type,
        'role': role_arn
    })
At this point, the BLERSSI Amazon SageMaker pipeline starts executing. After all the components execute successfully, check the logs of the sagemaker-deploy component to verify that the endpoint is created. The following screenshot shows the logs of the last step with the URL to the deployed model.
Validating the model
After the model is deployed in AWS, we validate it by submitting sample data to the model via an HTTP request using the endpoint name of the model deployed on AWS. The following screenshot shows a snippet from a sample Jupyter notebook that has a Python client and the corresponding output with location predictions.
Conclusion
Amazon SageMaker and Kubeflow Pipelines integrate easily into a single hybrid pipeline. The complete set of blogs and tutorials for Amazon SageMaker makes it easy to create a hybrid pipeline via the Amazon SageMaker Components for Kubeflow Pipelines. The API was comprehensive, covered all the key components we needed, and allowed for the development of custom algorithms and integration with the Cisco Kubeflow starter pack. By uploading a trained ML model to Amazon S3 and serving it on AWS with Amazon SageMaker, we reduced the complexity and TCO of managing complex ML lifecycles by about 50%. We comply with the highest standards of enterprise policies in data privacy and serve models in a scalable fashion with redundancy on AWS, across the United States and the world.
About the Authors
Elvira Dzhuraeva is a Technical Product Manager at Cisco where she is responsible for cloud and on-premise machine learning and artificial intelligence strategy. She is also a Community Product Manager at Kubeflow and a member of MLPerf community.
Debo Dutta is a Distinguished Engineer at Cisco where he leads a technology group at the intersection of algorithms, systems and machine learning. While at Cisco, Debo is currently a visiting scholar at Stanford. He got his PhD in Computer Science from University of Southern California, and an undergraduate in Computer Science from IIT Kharagpur, India.
Amit Saha is a Principal Engineer at Cisco where he leads efforts in systems and machine learning. He is a visiting faculty at IIT Kharagpur. He has a PhD in Computer Science from Rice University, Houston and an undergraduate from IIT Kharagpur. He has served on several program committees for top Computer Science conferences.
Deriving conversational insights from invoices with Amazon Textract, Amazon Comprehend, and Amazon Lex
Organizations across industries have a large number of physical documents, such as invoices, that they need to process. It is difficult to extract information from a scanned document when it contains tables, forms, paragraphs, and check boxes. Organizations have been addressing these problems with manual effort, custom code, or Optical Character Recognition (OCR) technology. However, that requires templates for form extraction and custom workflows.
Moreover, after extracting the text or content from a document, organizations want to extract insights from these receipts or invoices for their end users. However, that would require building a complex NLP model, and training the model would require a large amount of training data and compute resources. Building and training a machine learning model can be expensive and time-consuming.
Further, providing a human-like interface for interacting with these documents is cumbersome for end users. These end users often call the help desk, which over time adds cost to the organization.
This post shows you how to use AWS AI services to automate text data processing and insight discovery. With AWS AI services such as Amazon Textract, Amazon Comprehend, and Amazon Lex, you can set up an automated serverless solution to address this requirement. We walk you through the following steps:
- Extract text from receipts or invoices in PDF or image format with Amazon Textract.
- Derive insights with Amazon Comprehend.
- Interact with these insights in natural language using Amazon Lex.
Next, we go through the services and the architecture for building the solution.
Services used
This solution uses the following AI services, serverless technologies, and managed services to implement a scalable and cost-effective architecture:
- Amazon Cognito – Lets you add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily.
- AWS Lambda – Executes code in response to triggers such as changes in data, shifts in system state, or user actions. Because Amazon S3 can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
- Amazon Lex – Provides an interface to create conversational chatbots.
- Amazon Comprehend – NLP service that uses machine learning to find insights and relationships in text.
- Amazon Textract – Uses ML to extract text and data from scanned documents in PDF, JPEG, or PNG formats.
- Amazon Simple Storage Service (Amazon S3) – Serves as an object store for your documents and allows for central management with fine-tuned access controls.
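The S3-to-Lambda trigger mentioned above hands the function a notification event describing the uploaded object. The following is a minimal, illustrative sketch of the handler side; the event follows the standard S3 notification shape, but the bucket and key names are hypothetical:

```python
def lambda_handler(event, context):
    """Illustrative handler skeleton: pull the bucket name and object key
    out of the S3 notification event that triggered the function."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # A real handler would now start an Amazon Textract job on this object.
    return {"bucket": bucket, "key": key}

# Minimal S3 put-notification shape, with hypothetical names, for local testing
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "assets-upload-bucket"},
                "object": {"key": "invoices/sample-invoice.pdf"}}}
    ]
}
```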
Architecture
The following diagram illustrates the architecture of the solution.
The architecture contains the following steps:
- The backend user or administrator uses the AWS Management Console or AWS Command Line Interface (AWS CLI) to upload the PDF documents or images to an S3 bucket.
- The Amazon S3 upload triggers an AWS Lambda function.
- The Lambda function invokes the Amazon Textract StartDocumentTextDetection API, which sets up an asynchronous job to detect text from the PDF you uploaded.
- Amazon Textract notifies Amazon Simple Notification Service (Amazon SNS) when text processing is complete.
- A second Lambda function gets the notification from the SNS topic when the text detection job is complete.
- Once notified of job completion by Amazon SNS, the Lambda function calls the Amazon Textract GetDocumentTextDetection API to retrieve the results of the asynchronous operation and loads them into an S3 bucket.
- A Lambda function is used for fulfillment of the Amazon Lex intents. For a more detailed sequence of interactions, see the Building your chatbot step in the "Deploying the architecture with AWS CloudFormation" section.
- Amazon Comprehend uses ML to find insights and relationships in text. The Lambda function uses the boto3 APIs that Amazon Comprehend provides for entity and key phrase detection.
- In response to the bot’s welcome message, the user types “Show me the invoice summary.” This invokes the GetInvoiceSummary Lex intent, and the Lambda function invokes the Amazon Comprehend DetectEntities API to detect entities for fulfillment.
- When the user types “Get me the invoice details,” this invokes the GetInvoiceDetails intent. Amazon Lex prompts the user to enter the Invoice Number, and the Lambda function invokes the Amazon Comprehend DetectEntities API to return the Invoice Details message.
- When the user types “Can you show me the invoice notes for <invoice number>,” this invokes the GetInvoiceNotes intent, and the Lambda function invokes the Amazon Comprehend DetectKeyPhrases API to return comments associated with the invoice.
- You deploy the Lex Web UI in your AWS CloudFormation template by using an existing CloudFormation stack as a nested stack. To download the stack, see Deploy a Web UI for Your Chatbot. This nested stack deploys a Lex Web UI; the webpage is served as a static website from an S3 bucket. The web UI uses Amazon Cognito to generate an access token for authentication and uses AWS CodeStar to set up a delivery pipeline. End users interact with this chatbot web UI.
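The intent-fulfillment steps above are handled by a Lambda function that returns responses in the Amazon Lex (V1) response format. The following sketch shows that fulfillment shape; the intent name matches the post, but the summary text is a hardcoded stand-in for the result a real handler would assemble from the Amazon Comprehend output:

```python
def close(message):
    """Build an Amazon Lex (V1) fulfillment response that closes the turn."""
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": message},
        }
    }

def lambda_handler(event, context):
    """Route the invoked intent to a response. A real handler would call
    the Amazon Comprehend DetectEntities API here instead of hardcoding."""
    intent = event["currentIntent"]["name"]
    if intent == "GetInvoiceSummary":
        return close("I found 1 invoice with invoice number 35678-9 totaling $2100.00.")
    return close("Sorry, I can't handle that request yet.")
```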
Deploying the architecture with AWS CloudFormation
You deploy a CloudFormation template to provision the necessary AWS Identity and Access Management (IAM) roles, services, and components of the solution, including Amazon S3, Lambda, Amazon Textract, Amazon Comprehend, and the Amazon Lex chatbot.
- Launch the following CloudFormation template in the US East (N. Virginia) Region:
- Don’t make any changes to the stack name or the botname parameter (`InvoiceBot`).
- In the Capabilities and transforms section, select all three check boxes to acknowledge that AWS CloudFormation can create IAM resources and expand the template.
For more information about these resources, see AWS IAM resources.
This template uses AWS Serverless Application Model (AWS SAM), which simplifies how to define functions and APIs for serverless applications, and also has features for these services, like environment variables.
- Choose Create stack.
The following screenshot of the Stack Detail page shows the status of the stack as `CREATE_IN_PROGRESS`. It can take up to 20 minutes for the status to change to `CREATE_COMPLETE`.
- On the Outputs tab, copy the values of `LexLambdaFunctionArn`, `AssetsUploadBucket`, `ExtractedTextfilesBucket`, and `LexUIWebAppUrl`.
Uploading documents to the S3 bucket
To upload your documents to your new S3 bucket, choose the S3 bucket URL corresponding to `AssetsUploadBucket` that you copied earlier. Upload a PDF or image to start the text extraction flow.
You can download the invoice used in this blog from the GitHub repo and upload it to the `AssetsUploadBucket` S3 URL. We recommend customizing this solution for your invoice templates. For more information about uploading files, see How do I upload files and folders to an S3 bucket?
After the upload completes, you can see the file on the Amazon S3 console on the Overview tab.
After you upload the file, the text is extracted from the document. To see an extracted file with the text, open the bucket by choosing the URL you copied earlier.
On the Overview tab, you can download the file and inspect the content to see if it’s the same as the text in the uploaded file.
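The extracted file is assembled from Amazon Textract's block output. The following sketch shows how LINE blocks from a GetDocumentTextDetection-style response can be joined into plain text; the sample response below is hypothetical and heavily trimmed, not the output for the blog's invoice:

```python
def lines_from_textract(response):
    """Assemble plain text from a GetDocumentTextDetection-style response
    by keeping only LINE blocks, in order."""
    return [b["Text"] for b in response.get("Blocks", []) if b["BlockType"] == "LINE"]

# Hypothetical, trimmed response in the shape Textract returns
sample_response = {
    "JobStatus": "SUCCEEDED",
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice Number: 35678-9"},
        {"BlockType": "LINE", "Text": "Total: $2100.00"},
        {"BlockType": "WORD", "Text": "Invoice"},
    ],
}
```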
Building your chatbot
We will use the following conversation to model the bot:
Bot: Welcome to InvoiceBot. You can ask me to provide your invoice summary, or details of your invoices, or your invoice notes
User: Show me the invoice summary
Bot: I reviewed your input documents and found 1 invoice with invoice numbers 35678-9 totaling $2100.0. I can get you invoice details or invoice notes. Simply type your request
User: Get me the invoice details
Bot: Please enter the invoice number
User: 35678-9
Bot: Invoice Details for 35678-9: On 5/10/1019 for the item Merchant One there is a charge of 1500.00. On 5/11/2019 for the item Merchant Two there is a charge of 100.00. On 5/12/2019 for the item Merchant Three there is a charge of 300.00. On 5/13/2019 for the item Merchant Three there is a charge of 200.00. You can request me for invoice notes or simply close this chat.
User: Can you show me the invoice notes for 35678-9
Bot: Invoice Notes for 35678-9: 5/13/2019 Merchant Three 200.00 Merchant Three 300.00 Laptop Office Supplies Merchant Two 100.00 Team Dinner Food 5/12/2019 5/11/2019 Desks and Office Supplies 5/10/1019 Merchant One 1500.00 Chairs . Feel free to try the options again or you can simply close this chat
We will build an Amazon Lex bot (InvoiceBot) with the following intents:
- GetInvoiceSummary – Intent that’s invoked when the user requests the invoice summary. This is fulfilled by a Lambda function and returns the count of invoices available and the total amount of the invoices.
- GetInvoiceDetails – Intent that’s invoked when the user requests the invoice details. This is fulfilled by a Lambda function and provides an item-level breakdown of the invoices, including Date, Quantity, and Item Details.
- GetInvoiceNotes – Intent that’s invoked when the user requests the invoice notes. This is fulfilled by a Lambda function and provides notes from the uploaded invoices with Date and Item Description.
Publishing your chatbot
As described in the solution overview earlier, you use an Amazon Lex chatbot (InvoiceBot) to interact with the insights Amazon Comprehend derives from the text Amazon Textract extracts.
To publish your chatbot, complete the following steps:
- On the Amazon Lex console, choose Bots.
- Choose the chatbot you created.
- Under Intents, choose GetInvoiceSummary.
- Under Fulfillment, select your Lambda function.
- Search for the function by entering `LexLambdaFunction` and selecting the result.
A pop-up box appears.
- Choose OK.
- Choose Save intent.
- Repeat these steps for the remaining two intents, `GetInvoiceDetails` and `GetInvoiceNotes`.
. - Choose Build.
- When the build is complete, choose Publish.
- For Create an alias, enter `Latest`. You can consider a different name; names like test, dev, beta, or prod primarily refer to the environment of the bot.
The following page opens after the bot is published.
- Choose Close.
Using the chatbot
Your chatbot is now ready to use. Navigate to the `LexUIWebAppUrl` URL copied from the AWS CloudFormation Outputs tab. The following screenshots show the user conversation with the bot (read from left to right):
Conclusion
This post demonstrated how to create a conversational chatbot in Amazon Lex that enables interaction with insights that Amazon Comprehend derives from text that Amazon Textract extracts from images or PDF documents. The code from this post is available on the GitHub repo for you to use and extend. We are interested in hearing how you would apply this solution to your use case. Please share your thoughts and questions in the comments section.
About the Authors
Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with World Wide Public Sector Team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML Explainability areas in AI/ML .
Prem Ranga is an Enterprise Solutions Architect based out of Houston, Texas. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an Autonomous Vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.
Saida Chanda is a Senior Partner Solutions Architect based out of Seattle, WA. He is a technology enthusiast who drives innovation through AWS partners to meet customers’ complex business requirements via simple solutions. His areas of interest are ML and DevOps. In his spare time, he likes to spend time with family and explore his inner self through meditation.
How Euler Hermes detects typo squatting with Amazon SageMaker
This is a guest post from Euler Hermes. In their own words, “For over 100 years, Euler Hermes, the world leader in credit insurance, has accompanied its clients to provide simpler and safer digital products, thus becoming a key catalyzer in the world’s commerce.”
Euler Hermes manages more than 600,000 B2B transactions per month and performs data analytics on over 30 million companies worldwide. At-scale artificial intelligence and machine learning (ML) have become the heart of the business.
Euler Hermes uses ML across a variety of use cases. One recent example is typo squatting detection, which came about after an ideation workshop between the Cybersecurity and IT Innovation teams to better protect clients. As it turns out, moving from idea to production has never been easier when your data is in the AWS Cloud and you can put the right tools in the hands of your data scientists in minutes.
Typo squatting, or hijacking, is a form of cybersecurity attack. It consists of registering internet domain names that closely resemble legitimate, reputable, and well-known ones with the goal of phishing scams, identity theft, advertising, and malware installation, among other potential issues. The sources of typo squatting can be varied, including different top-level domains (TLD), typos, misspellings, combo squatting, or differently phrased domains.
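To make these variant classes concrete, here is a small, illustrative generator of look-alike candidates for a brand domain. This is a sketch of the typo classes described above (alternate TLDs, deletions, adjacent swaps), not Euler Hermes's actual generation algorithm, and the TLD list is an arbitrary example:

```python
def typo_candidates(domain, tlds=("com", "net", "co")):
    """Generate a few classes of look-alike domains for a brand name:
    alternate TLDs, single-character deletions, and adjacent-character swaps."""
    name, _, tld = domain.rpartition(".")
    candidates = set()
    # Different top-level domains
    candidates.update(f"{name}.{t}" for t in tlds if t != tld)
    # Single-character deletions (typos)
    candidates.update(f"{name[:i]}{name[i + 1:]}.{tld}" for i in range(len(name)))
    # Adjacent-character swaps (misspellings)
    for i in range(len(name) - 1):
        swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        candidates.add(f"{swapped}.{tld}")
    candidates.discard(domain)  # never flag the legitimate domain itself
    return sorted(candidates)
```

In practice, a list like this would be matched against newly registered domains, while the ML model described below handles variants no enumeration can anticipate.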
The challenge we faced was building an ML solution to quickly detect any suspicious domains registered that could be used to exploit the Euler Hermes brand or its products.
To simplify the ML workflow and reduce time-to-market, we opted to use Amazon SageMaker. This fully managed AWS service was a natural choice due to the ability to easily build, train, tune, and deploy ML models at scale without worrying about the underlying infrastructure while being able to integrate with other AWS services such as Amazon Simple Storage Service (Amazon S3) or AWS Lambda. Furthermore, Amazon SageMaker meets the strict security requirements necessary for financial services companies like Euler Hermes, including support for private notebooks and endpoints, encryption of data in transit and at rest, and more.
Solution overview
To build and tune ML models, we used Amazon SageMaker notebooks as the main working tool for our data scientists. The idea was to train an ML model to recognize domains related to Euler Hermes. To accomplish this, we worked on the following two key steps: dataset construction and model building.
Dataset construction
Every ML project requires a lot of data, and our first objective was to build the training dataset.
The dataset of negative examples was composed of 1 million entries randomly picked from Alexa, Umbrella, and publicly registered domains, whereas the dataset of 1 million positive examples was created from a domain generation algorithm (DGA) using Euler Hermes’s internal domains.
Model building and tuning
One of the project’s biggest challenges was to decrease the number of false positives to a minimum. On a daily basis, we need to unearth domains related to Euler Hermes from a large dataset of approximately 150,000 publicly registered domains.
We tried two approaches: classical ML models and deep learning.
We considered various models for classical ML, including Random Forest, Logistic regression, and gradient boosting (LightGBM and XGBoost). For these models, we manually created more than 250 features. After an extensive feature-engineering phase, we selected the following as the most relevant:
- Number of FQDN levels
- Vowel ratio
- Number of characters
- Bag of n-grams (top 50 n-grams)
- TF-IDF features
- Latent Dirichlet allocation features
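The first three features in the list above can be computed directly from the domain string. As a minimal sketch (the helper below is hypothetical, not the production feature pipeline):

```python
def basic_features(domain: str) -> dict:
    """Compute simple lexical features of a fully qualified domain name."""
    labels = domain.split(".")
    letters = [c for c in domain if c.isalpha()]
    vowels = [c for c in letters if c.lower() in "aeiou"]
    return {
        "fqdn_levels": len(labels),                         # number of FQDN levels
        "vowel_ratio": len(vowels) / max(len(letters), 1),  # vowels over letters
        "num_chars": len(domain),                           # raw character count
    }

print(basic_features("mail.eulerhermes.com"))
```

The n-gram, TF-IDF, and LDA features are built on top of the same string, typically by treating the domain as a sequence of characters or character n-grams.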
For deep learning, we decided to work with recurrent neural networks. The model we adopted was a Bidirectional LSTM (BiLSTM) with an attention layer. We found this model to be the best at extracting a URL’s underlying structure.
The following diagram shows the architecture designed for the BiLSTM model. To avoid overfitting, a Dropout layer was added.
The following code orchestrates the set of layers:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Reshape
from keras_self_attention import SeqSelfAttention

def AttentionModel_(vocab_size, input_length, hidden_dim):
    model = tf.keras.models.Sequential()
    model.add(Embedding(vocab_size, hidden_dim, input_length=input_length))
    model.add(Bidirectional(LSTM(units=hidden_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
    model.add(SeqSelfAttention(attention_activation='sigmoid'))
    # Flatten the (input_length, 2 * hidden_dim) attention output for the classifier
    model.add(Reshape((2 * hidden_dim * input_length,)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc", tf.keras.metrics.FalsePositives()])
    return model
We built and tuned the classical ML and the deep learning models using the Amazon SageMaker-provided containers for Scikit-learn and Keras.
The following table summarizes the results we obtained. The BiLSTM outperformed the other models with a 13% precision improvement compared to the second-best model (LightGBM). For this reason, we put the BiLSTM model into production.
Models |
Precision | F1-Score |
ROC-AUC (Area Under the Curve) |
Random Forest |
0.832 |
0.841 |
0.908 |
XGBoost |
0.870 |
0.876 |
0.921 |
LightGBM |
0.880 |
0.883 |
0.928 |
RNN (BiLSTM) |
0.996 |
0.997 |
0.997 |
Model training
For model training, we made use of Managed Spot Training in Amazon SageMaker to use Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for training jobs. This allowed us to train models at a significantly lower cost compared to On-Demand Instances.
Because we predominantly used custom deep learning models, we needed GPU instances for time-consuming neural network training jobs, with times ranging from minutes to a few hours. Under these constraints, Managed Spot Training was a game-changing solution: it handled instance interruptions for us, so our data scientists could keep working without disruption.
Productizing
Euler Hermes’s cloud principles follow a serverless-first strategy, with an Infrastructure as Code DevOps practice. Systematically, we construct a serverless architecture based on Lambda whenever possible, but when this isn’t possible, we deploy to containers using AWS Fargate.
Amazon SageMaker allows us to deploy our ML models at scale within the same platform on a 100% serverless and scalable architecture. It creates a model endpoint that is ready to serve inference requests. To get inferences for an entire dataset, we use batch transform, which splits the dataset into smaller batches and gets the predictions for each one. Batch transform manages all the compute resources required to get inferences, including launching instances and deleting them after the batch transform job is complete.
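As a hedged sketch of what launching such a job looks like, the following builds a `CreateTransformJob` request in the boto3 shape (the bucket, prefixes, and model name here are hypothetical placeholders, not Euler Hermes's actual resources):

```python
# Shape of a CreateTransformJob request (boto3 SageMaker API).
# All names and S3 paths below are hypothetical.
transform_job = {
    "TransformJobName": "domain-scoring-daily",
    "ModelName": "bilstm-typosquatting-model",
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/registered-domains/",
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",  # split the dataset into smaller batches by line
    },
    "TransformOutput": {"S3OutputPath": "s3://my-bucket/inferences/"},
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}
# boto3.client("sagemaker").create_transform_job(**transform_job) would launch it;
# SageMaker tears the instances down when the job completes.
print(transform_job["TransformJobName"])
```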
The following figure depicts the architecture deployed for the use case in this post.
First, a daily Amazon CloudWatch event is set to trigger a Lambda function with two jobs: download all the publicly registered domains and store them in an Amazon Simple Storage Service (Amazon S3) bucket subfolder and trigger the BatchTransform job. Amazon SageMaker automatically saves the inferences in an S3 bucket that you specify when creating the batch transform job.
Finally, a second CloudWatch event monitors the task success of Amazon SageMaker. If the task succeeds, it triggers a second Lambda function that retrieves the inferred domains and selects those that have label 1—related to Euler Hermes or its products—and stores them in another S3 bucket subfolder.
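The filtering logic of that second Lambda function can be sketched as follows (the JSON record layout is an assumption for illustration; the actual inference output format depends on the batch transform configuration):

```python
import json

def select_suspicious(inference_lines):
    """Keep only domains the model labeled 1 (related to Euler Hermes).
    Each line is assumed to be JSON like {"domain": ..., "label": 0 or 1}."""
    suspicious = []
    for line in inference_lines:
        record = json.loads(line)
        if record["label"] == 1:
            suspicious.append(record["domain"])
    return suspicious

# Hypothetical sample of batch transform output lines
lines = [
    '{"domain": "eulerhermes-login.com", "label": 1}',
    '{"domain": "example.org", "label": 0}',
]
print(select_suspicious(lines))
```

The selected domains are then written to the dedicated S3 bucket subfolder for the security team to review.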
Following Euler Hermes’s DevOps principles, all the infrastructure in this solution is coded in Terraform to implement an MLOps pipeline to deploy to production.
Conclusion
Amazon SageMaker provides the tool that our data scientists need to quickly and securely experiment and test while maintaining compliance with strict financial service standards. This allows us to bring new ideas into production very rapidly. With flexibility and inherent programmability, Amazon SageMaker helped us tackle our main pain point of industrializing ML models at scale. After we train an ML model, we can use Amazon SageMaker to deploy the model, and can automate the entire pipeline following the same DevOps principles and tools we use for all other applications we run with AWS.
In under 7 months, we were able to launch a new internal ML service from ideation to production and can now identify URL squatting fraud within 24 hours after the creation of a malicious domain.
Although our application is ready, we have some additional steps planned. First, we’ll extend the inferences currently stored on Amazon S3 to our SIEM platform. Second, we’ll implement a web interface to monitor the model and allow manual feedback that is captured for model retraining.
About the Authors
Luis Leon is the IT Innovation Advisor responsible for the data science practice in IT at Euler Hermes. He is in charge of the ideation of digital projects as well as managing the design, build, and industrialization of machine learning products at scale. His main interests are natural language processing, time series analysis, and unsupervised learning.
Hamza Benchekroun is a data scientist in the IT Innovation hub at Euler Hermes, focusing on deep learning solutions to increase productivity and guide decision-making across teams. His research interests include natural language processing, time series analysis, semi-supervised learning, and their applications.
Hatim Binani is a data science intern in the IT Innovation hub at Euler Hermes. He is an engineering student at INSA Lyon in the computer science department. His field of interest is data science and machine learning. He contributed within the IT Innovation team to the deployment of Watson on Amazon SageMaker.
Guillaume Chambert is an IT security engineer at Euler Hermes. As SOC manager, he strives to stay ahead of new threats in order to protect Euler Hermes' sensitive and mission-critical data. He is interested in developing innovative solutions to prevent critical information from being stolen, damaged, or compromised by hackers.
Building a visual search application with Amazon SageMaker and Amazon ES
Sometimes it’s hard to find the right words to describe what you’re looking for. As the adage goes, “A picture is worth a thousand words.” Often, it’s easier to show a physical example or image than to try to describe an item with words, especially when using a search engine to find what you’re looking for.
In this post, you build a visual image search application from scratch in under an hour, including a full-stack web application for serving the visual search results.
Visual search can improve customer engagement in retail businesses and e-commerce, particularly for fashion and home decoration retailers. Visual search allows retailers to suggest thematically or stylistically related items to shoppers, which retailers would struggle to achieve by using a text query alone. According to Gartner, “By 2021, early adopter brands that redesign their websites to support visual and voice search will increase digital commerce revenue by 30%.”
High-level example of visual searching
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost-effectively at scale. Amazon ES offers k-Nearest Neighbor (KNN) search, which can enhance search in similar use cases such as product recommendations, fraud detection, and image, video, and semantic document retrieval. Built using the lightweight and efficient Non-Metric Space Library (NMSLIB), KNN enables high-scale, low-latency, nearest neighbor search on billions of documents across thousands of dimensions with the same ease as running any regular Elasticsearch query.
The following diagram illustrates the visual search architecture.
Overview of solution
Implementing the visual search architecture consists of two phases:
- Building a reference KNN index on Amazon ES from a sample image dataset.
- Submitting a new image to the Amazon SageMaker endpoint and Amazon ES to return similar images.
KNN reference index creation
In this step, you extract a 2,048-dimensional feature vector for each image from a pre-trained Resnet50 model hosted in Amazon SageMaker. Each vector is stored in a KNN index in an Amazon ES domain. For this use case, you use images from FEIDEGGER, a Zalando research dataset consisting of 8,732 high-resolution fashion images. The following screenshot illustrates the workflow for creating the KNN index.
The process includes the following steps:
- Users interact with a Jupyter notebook on an Amazon SageMaker notebook instance.
- A pre-trained Resnet50 deep neural net from Keras is downloaded, the last classifier layer is removed, and the new model artifact is serialized and stored in Amazon Simple Storage Service (Amazon S3). The model is used to start a TensorFlow Serving API on an Amazon SageMaker real-time endpoint.
- The fashion images are pushed through the endpoint, which runs the images through the neural network to extract the image features, or embeddings.
- The notebook code writes the image embeddings to the KNN index in an Amazon ES domain.
Visual search from a query image
In this step, you present a query image from the application, which passes through the Amazon SageMaker hosted model to extract 2,048 features. You use these features to query the KNN index in Amazon ES. KNN for Amazon ES lets you search for points in a vector space and find the “nearest neighbors” for those points by Euclidean distance or cosine similarity (the default is Euclidean distance). When it finds the nearest neighbor vectors (for example, k = 3 nearest neighbors) for a given image, it returns the associated Amazon S3 images to the application. The following diagram illustrates the visual search full-stack application architecture.
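The KNN lookup itself is an ordinary Elasticsearch query. As a sketch, the query body for k = 3 nearest neighbors looks like the following (the index and field names match the ones created later in the notebook; the helper function itself is illustrative):

```python
def knn_query(feature_vector, k=3):
    """Build an Amazon ES KNN search body for the given image embedding."""
    return {
        "size": k,
        "query": {
            "knn": {
                "zalando_img_vector": {  # the knn_vector field in the index mapping
                    "vector": feature_vector,
                    "k": k,
                }
            }
        },
    }

# es.search(index="idx_zalando", body=knn_query(embedding)) would run the lookup;
# each hit carries the associated Amazon S3 image URI.
query = knn_query([0.1] * 2048)
print(query["size"])
```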
The process includes the following steps:
- The end-user accesses the web application from their browser or mobile device.
- A user-uploaded image is sent to Amazon API Gateway and AWS Lambda as a base64 encoded string and is re-encoded as bytes in the Lambda function.
- A publicly readable image URL is passed as a string and downloaded as bytes in the function.
- The bytes are sent as the payload for inference to an Amazon SageMaker real-time endpoint, and the model returns a vector of the image embeddings.
- The function passes the image embedding vector in the search query to the k-nearest neighbor in the index in the Amazon ES domain. A list of k similar images and their respective Amazon S3 URIs is returned.
- The function generates pre-signed Amazon S3 URLs to return back to the client web application, used to display similar images in the browser.
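The base64 round trip in steps 2 and 3 above is simple but easy to get wrong. A minimal sketch of the re-encoding done inside the Lambda function (the function name is illustrative):

```python
import base64

def decode_upload(event_body: str) -> bytes:
    """Re-encode a base64 image string from API Gateway back into raw bytes,
    ready to send as the inference payload."""
    return base64.b64decode(event_body)

# Simulate a client upload: the browser base64-encodes the image bytes...
payload = base64.b64encode(b"\x89PNG fake image bytes").decode("utf-8")
# ...and the Lambda function decodes them before calling the endpoint.
image_bytes = decode_upload(payload)
```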
AWS services
To build the end-to-end application, you use the following AWS services:
- AWS Amplify – AWS Amplify is a JavaScript library for front-end and mobile developers building cloud-enabled applications. For more information, see the GitHub repo.
- Amazon API Gateway – A fully managed service to create, publish, maintain, monitor, and secure APIs at any scale.
- AWS CloudFormation – AWS CloudFormation gives developers and businesses an easy way to create a collection of related AWS and third-party resources and provision them in an orderly and predictable fashion.
- Amazon ES – A managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud.
- AWS IAM – AWS Identity and Access Management (IAM) enables you to manage access to AWS services and resources securely.
- AWS Lambda – An event-driven, serverless computing platform that runs code in response to events and automatically manages the computing resources the code requires.
- Amazon SageMaker – A fully managed end-to-end ML platform to build, train, tune, and deploy ML models at scale.
- AWS SAM – AWS Serverless Application Model (AWS SAM) is an open-source framework for building serverless applications.
- Amazon S3 – An object storage service that offers an extremely durable, highly available, and infinitely scalable data storage infrastructure at very low cost.
Prerequisites
For this walkthrough, you should have an AWS account with appropriate IAM permissions to launch the CloudFormation template.
Deploying your solution
You use a CloudFormation stack to deploy the solution. The stack creates all the necessary resources, including the following:
- An Amazon SageMaker notebook instance to run Python code in a Jupyter notebook
- An IAM role associated with the notebook instance
- An Amazon ES domain to store and retrieve image embedding vectors into a KNN index
- Two S3 buckets: one for storing the source fashion images and another for hosting a static website
From the Jupyter notebook, you also deploy the following:
- An Amazon SageMaker endpoint for getting image feature vectors and embeddings in real time.
- An AWS SAM template for a serverless back end using API Gateway and Lambda.
- A static front-end website hosted on an S3 bucket to demonstrate a real-world, end-to-end ML application. The front-end code uses ReactJS and the Amplify JavaScript library.
To get started, complete the following steps:
- Sign in to the AWS Management Console with your IAM user name and password.
- Choose Launch Stack and open it in a new tab:
- On the Quick create stack page, select the check box to acknowledge the creation of IAM resources.
- Choose Create stack.
- Wait for the stack creation to complete.
You can examine various events from the stack creation process on the Events tab. When the stack creation is complete, you see the status CREATE_COMPLETE.
You can look on the Resources tab to see all the resources the CloudFormation template created.
- On the Outputs tab, choose the SageMakerNotebookURL value.
This hyperlink opens the Jupyter notebook on your Amazon SageMaker notebook instance that you use to complete the rest of the lab.
You should be on the Jupyter notebook landing page.
- Choose visual-image-search.ipynb.
Building a KNN index on Amazon ES
For this step, you should be at the beginning of the notebook with the title Visual image search. Follow the steps in the notebook and run each cell in order.
You use a pre-trained Resnet50 model hosted on an Amazon SageMaker endpoint to generate the image feature vectors (embeddings). The embeddings are saved to the Amazon ES domain created in the CloudFormation stack. For more information, see the markdown cells in the notebook.
Continue when you reach the cell Deploying a full-stack visual search application in your notebook.
The notebook contains several important cells.
To load a pre-trained ResNet50 model without the final CNN classifier layer, see the following code (this model is used just as an image feature extractor):
#Import ResNet50 model as a feature extractor (channels-last input)
model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3), pooling='avg')
You save the model in the TensorFlow SavedModel format, which contains a complete TensorFlow program, including weights and computation. See the following code:
#Save the model in SavedModel format
model.save('./export/Servo/1/', save_format='tf')
Upload the model artifact (model.tar.gz) to Amazon S3 with the following code:
#Upload the model to S3
sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
inputs
You deploy the model into an Amazon SageMaker TensorFlow Serving-based server using the Amazon SageMaker Python SDK. The server provides a superset of the TensorFlow Serving REST API. See the following code:
#Deploy the model in Sagemaker Endpoint. This process will take ~10 min.
from sagemaker.tensorflow.serving import Model
sagemaker_model = Model(entry_point='inference.py', model_data = 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
role = role, framework_version='2.1.0', source_dir='./src' )
predictor = sagemaker_model.deploy(initial_instance_count=3, instance_type='ml.m5.xlarge')
Extract the reference images features from the Amazon SageMaker endpoint with the following code:
# Define a function to extract image features from the endpoint
import json
import boto3
from time import sleep
sm_client = boto3.client('sagemaker-runtime')
ENDPOINT_NAME = predictor.endpoint

def get_predictions(payload):
    return sm_client.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                     ContentType='application/x-image',
                                     Body=payload)

def extract_features(s3_uri):
    key = s3_uri.replace(f's3://{bucket}/', '')
    payload = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    try:
        response = get_predictions(payload)
    except Exception:
        # Back off briefly and retry once on a transient endpoint error
        sleep(0.1)
        response = get_predictions(payload)
    del payload
    response_body = json.loads(response['Body'].read())
    feature_lst = response_body['predictions'][0]
    return s3_uri, feature_lst
You define Amazon ES KNN index mapping with the following code:
#Define KNN Elasticsearch index mapping
knn_index = {
"settings": {
"index.knn": True
},
"mappings": {
"properties": {
"zalando_img_vector": {
"type": "knn_vector",
"dimension": 2048
}
}
}
}
Import the image feature vector and associated Amazon S3 image URI into the Amazon ES KNN Index with the following code:
# Define a function to import the feature vector corresponding to each S3 URI into the Elasticsearch KNN index
# This process will take around ~3 min.
def es_import(i):
es.index(index='idx_zalando',
body={"zalando_img_vector": i[1],
"image": i[0]}
)
process_map(es_import, result, max_workers=workers)
Building a full-stack visual search application
Now that you have a working Amazon SageMaker endpoint for extracting image features and a KNN index on Amazon ES, you’re ready to build a real-world full-stack ML-powered web app. You use an AWS SAM template to deploy a serverless REST API with API Gateway and Lambda. The REST API accepts new images, generates the embeddings, and returns similar images to the client. Then you upload a front-end website that interacts with your new REST API to Amazon S3. The front-end code uses Amplify to integrate with your REST API.
- In the following cell, prepopulate a CloudFormation template that creates the necessary resources, such as Lambda and API Gateway, for the full-stack application:
s3_resource.Object(bucket, 'backend/template.yaml').upload_file('./backend/template.yaml', ExtraArgs={'ACL':'public-read'})
sam_template_url = f'https://{bucket}.s3.amazonaws.com/backend/template.yaml'

# Generate the CloudFormation Quick Create Link
print("Click the URL below to create the backend API for visual search:\n")
print((
    'https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review'
    f'?templateURL={sam_template_url}'
    '&stackName=vis-search-api'
    f'&param_BucketName={outputs["s3BucketTraining"]}'
    f'&param_DomainName={outputs["esDomainName"]}'
    f'&param_ElasticSearchURL={outputs["esHostName"]}'
    f'&param_SagemakerEndpoint={predictor.endpoint}'
))
The following screenshot shows the output: a pre-generated CloudFormation template link.
- Choose the link.
You are sent to the Quick create stack page.
- Select the check boxes to acknowledge the creation of IAM resources, IAM resources with custom names, and CAPABILITY_AUTO_EXPAND.
- Choose Create stack.
After the stack creation is complete, you see the status CREATE_COMPLETE. You can look on the Resources tab to see all the resources the CloudFormation template created.
- After the stack is created, proceed through the cells.
The following cell indicates that your full-stack application, including front-end and back-end code, is successfully deployed:
print('Click the URL below:\n')
print(outputs['S3BucketSecureURL'] + '/index.html')
The following screenshot shows the URL output.
- Choose the link.
You are sent to the application page, where you can upload an image of a dress or provide the URL link of a dress and get similar dresses.
- When you’re done testing and experimenting with your visual search application, run the last two cells at the bottom of the notebook:
# Delete the endpoint
predictor.delete_endpoint()

# Empty S3 contents
training_bucket_resource = s3_resource.Bucket(bucket)
training_bucket_resource.objects.all().delete()
hosting_bucket_resource = s3_resource.Bucket(outputs['s3BucketHostingBucketName'])
hosting_bucket_resource.objects.all().delete()
These cells terminate your Amazon SageMaker endpoint and empty your S3 buckets to prepare you for cleaning up your resources.
Cleaning up
To delete the rest of your AWS resources, go to the AWS CloudFormation console and delete the vis-search-api and vis-search stacks.
Conclusion
In this post, we showed you how to create an ML-based visual search application using Amazon SageMaker and the Amazon ES KNN index. You used a pre-trained ResNet50 model trained on the ImageNet dataset. However, you can also use other pre-trained models, such as VGG, Inception, and MobileNet, and fine-tune them with your own dataset.
A GPU instance is recommended for most deep learning purposes. Training new models is faster on a GPU instance than a CPU instance. You can scale sub-linearly when you have multi-GPU instances or if you use distributed training across many instances with GPUs. However, we used CPU instances for this use case so that you can complete the walkthrough under the AWS Free Tier.
For more information about the code sample in the post, see the GitHub repo. For more information about Amazon ES, see the following:
- How can I improve indexing performance on my Elasticsearch cluster?
- Amazon Elasticsearch Service Best Practices
- Reducing cost for small Amazon Elasticsearch Service domains
About the Authors
Amit Mukherjee is a Sr. Partner Solutions Architect with AWS. He provides architectural guidance to help partners achieve success in the cloud. He has a special interest in AI and machine learning. In his spare time, he enjoys spending quality time with his family.
Laith Al-Saadoon is a Sr. Solutions Architect with a focus on data analytics at AWS. He spends his days obsessing over designing customer architectures to process enormous amounts of data at scale. In his free time, he follows the latest in machine learning and artificial intelligence.
Introducing the open-source Amazon SageMaker XGBoost algorithm container
XGBoost is a popular and efficient machine learning (ML) algorithm for regression and classification tasks on tabular datasets. It implements a technique known as gradient boosting on trees and performs remarkably well in ML competitions.
Since its launch, Amazon SageMaker has supported XGBoost as a built-in managed algorithm. For more information, see Simplify machine learning with XGBoost and Amazon SageMaker. As of this writing, you can take advantage of the open-source Amazon SageMaker XGBoost container, which has improved flexibility, scalability, extensibility, and Managed Spot Training. For more information, see the Amazon SageMaker sample notebooks and sagemaker-xgboost-container on GitHub, or see XGBoost Algorithm.
This post introduces the benefits of the open-source XGBoost algorithm container and presents three use cases.
Benefits of the open-source SageMaker XGBoost container
The new XGBoost container has the following benefits:
Latest version
The open-source XGBoost container supports the latest XGBoost 1.0 release and all improvements, including better performance scaling on multi-core instances and improved stability for distributed training.
Flexibility
With the new script mode, you can now customize or use your own training script. This functionality, which is also available for TensorFlow, MXNet, PyTorch, and Chainer users, allows you to add in custom pre- or post-processing logic, run additional steps after the training process, or take advantage of the full range of XGBoost functions (such as cross-validation support). You can still use the no-script algorithm mode (like other Amazon SageMaker built-in algorithms), which only requires you to specify a data location and hyperparameters.
Scalability
The open-source container has a more efficient implementation of distributed training, which allows it to scale out to more instances and reduces out-of-memory errors.
Extensibility
Because the container is open source, you can extend, fork, or modify the algorithm to suit your needs, beyond using the script mode. This includes installing additional libraries and changing the underlying version of XGBoost.
Managed Spot Training
You can save up to 90% on your Amazon SageMaker XGBoost training jobs with Managed Spot Training support. This fully managed option lets you take advantage of unused compute capacity in the AWS Cloud. Amazon SageMaker manages the Spot Instances on your behalf so you don’t have to worry about polling for capacity. The new version of XGBoost automatically manages checkpoints for you to make sure your job finishes reliably. For more information, see Managed Spot Training in Amazon SageMaker and Use Checkpoints in Amazon SageMaker.
Additional input formats
XGBoost now includes support for Parquet and Recordio-protobuf input formats. Parquet is a standardized, open-source, self-describing columnar storage format for use in data analysis systems. Recordio-protobuf is a common binary data format used across Amazon SageMaker for various algorithms, which XGBoost now supports for training and inference. For more information, see Common Data Formats for Training. Additionally, this container supports Pipe mode training for these data formats. For more information, see Using Pipe input mode for Amazon SageMaker algorithms.
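As a sketch of how these formats are selected, a training channel in the `CreateTrainingJob` request shape combines the content type with Pipe mode (the bucket path below is a hypothetical placeholder):

```python
# Shape of an InputDataConfig channel (CreateTrainingJob) for Parquet input
# streamed via Pipe mode; the S3 path is hypothetical.
train_channel = {
    "ChannelName": "train",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train-parquet/",
            "S3DataDistributionType": "FullyReplicated",
        }
    },
    "ContentType": "application/x-parquet",  # Parquet input for XGBoost
    "InputMode": "Pipe",                     # stream data instead of downloading it first
}
print(train_channel["ChannelName"])
```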
Using the latest XGBoost container as a built-in algorithm
As an existing Amazon SageMaker XGBoost user, you can take advantage of the new features and improved performance by specifying the version when you create your training jobs. For more information about getting started with XGBoost or using the latest version, see the GitHub repo.
You can upgrade to the new container by specifying the framework version (1.0-1). This version specifies the upstream XGBoost framework version (1.0) and an additional Amazon SageMaker version (1). If you have an existing XGBoost workflow based on the legacy 0.72 container, this is the only change necessary to get the same workflow working with the new container. The container also supports XGBoost 0.90 by using the version 0.90-1.
See the following code:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(region, 'xgboost', '1.0-1')
estimator = sagemaker.estimator.Estimator(container,
role,
hyperparameters=hyperparameters,
train_instance_count=1,
train_instance_type='ml.m5.2xlarge',
)
estimator.fit(training_data)
Using managed Spot Instances
You can also take advantage of managed Spot Instance support by enabling the train_use_spot_instances flag on your Estimator. For more information, see the GitHub repo.
When you are training with managed Spot Instances, the training job may be interrupted, which causes it to take longer to start or finish. If a training job is interrupted, you can use a checkpointed snapshot to resume from a previously saved point, which can save training time (and cost). You can also use checkpoint_s3_uri, which is where your training job stores snapshots, to seamlessly resume when a Spot Instance is interrupted. See the following code:
estimator = sagemaker.estimator.Estimator(container,
role,
hyperparameters=hyperparameters,
train_instance_count=1,
train_instance_type='ml.m5.2xlarge',
output_path=output_path,
sagemaker_session=sagemaker.Session(),
train_use_spot_instances=train_use_spot_instances,
train_max_run=train_max_run,
train_max_wait=train_max_wait,
checkpoint_s3_uri=checkpoint_s3_uri
)
estimator.fit({'train': train_input})
Towards the end of the job, you should see the following two lines of output:
- Training seconds: X – The actual compute time used by your training job
- Billable seconds: Y – The time you are billed for after Spot discounting is applied
If you enabled train_use_spot_instances, you should see a notable difference between X and Y, which signifies the cost savings from using Managed Spot Training. This is reflected in the following code:
Managed Spot Training savings: (1-Y/X)*100 %
Using script mode
Script mode is a new feature with the open-source Amazon SageMaker XGBoost container. You can use your own training or hosting script to fully customize the XGBoost training or inference workflow. The following code example is a walkthrough of using a customized training script in script mode. For more information, see the GitHub repo.
Preparing the entry-point script
A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model to model_dir so it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an argparse.ArgumentParser instance.
Starting with the main guard, use a parser to read the hyperparameters passed to your Amazon SageMaker estimator when creating the training job. These hyperparameters are made available as arguments to your input script. You also parse several Amazon SageMaker-specific environment variables to get information about the training environment, such as the location of input data and where to save the model. See the following code:
import argparse
import logging
import os
import pickle as pkl

import xgboost as xgb

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here
    parser.add_argument('--num_round', type=int)
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--eta', type=float, default=0.2)
    parser.add_argument('--gamma', type=float, default=4)
    parser.add_argument('--min_child_weight', type=float, default=6)
    parser.add_argument('--subsample', type=float, default=0.7)
    parser.add_argument('--silent', type=int, default=0)
    parser.add_argument('--objective', type=str, default='reg:squarederror')

    # SageMaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

    args = parser.parse_args()

    train_hp = {
        'max_depth': args.max_depth,
        'eta': args.eta,
        'gamma': args.gamma,
        'min_child_weight': args.min_child_weight,
        'subsample': args.subsample,
        'silent': args.silent,
        'objective': args.objective
    }

    dtrain = xgb.DMatrix(args.train)
    dval = xgb.DMatrix(args.validation)
    watchlist = [(dtrain, 'train'), (dval, 'validation')] if dval is not None else [(dtrain, 'train')]

    callbacks = []
    # If a checkpoint is found, reduce num_boost_round by the previously run number of iterations
    prev_checkpoint, n_iterations_prev_run = add_checkpointing(callbacks)

    bst = xgb.train(
        params=train_hp,
        dtrain=dtrain,
        evals=watchlist,
        num_boost_round=(args.num_round - n_iterations_prev_run),
        xgb_model=prev_checkpoint,
        callbacks=callbacks
    )

    model_location = args.model_dir + '/xgboost-model'
    pkl.dump(bst, open(model_location, 'wb'))
    logging.info("Stored trained model at {}".format(model_location))
Inside the entry-point script, you can optionally customize the inference experience when you use Amazon SageMaker hosting or batch transform. You can customize the following:
- input_fn() – How the input is handled
- predict_fn() – How the XGBoost model is invoked
- output_fn() – How the response is returned
The defaults work for this use case, so you don’t need to define them.
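For reference, a hedged sketch of what such overrides could look like follows. The hook names come from the container's serving contract; the bodies below are illustrative stand-ins, not the container's actual defaults (in particular, `model` is assumed to expose a `.predict()` method):

```python
import json

def input_fn(request_body, request_content_type):
    """Parse a CSV request body into a list of float feature rows."""
    if request_content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return [[float(v) for v in line.split(",")]
            for line in request_body.strip().split("\n")]

def predict_fn(input_data, model):
    """Invoke the loaded booster; `model` is assumed to expose .predict()."""
    return model.predict(input_data)

def output_fn(predictions, accept="application/json"):
    """Serialize predictions as a JSON response body."""
    return json.dumps({"predictions": list(predictions)})

rows = input_fn("1.0,2.0\n3.0,4.0", "text/csv")
print(rows)
```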
Training with the Amazon SageMaker XGBoost estimator
After you prepare your training data and script, the XGBoost estimator class in the Amazon SageMaker Python SDK allows you to run that script as a training job on the Amazon SageMaker managed training infrastructure. You also pass the estimator your IAM role, the type of instance you want to use, and a dictionary of the hyperparameters that you want to pass to your script. See the following code:
from sagemaker.session import s3_input
from sagemaker.xgboost.estimator import XGBoost

xgb_script_mode_estimator = XGBoost(
    entry_point="abalone.py",
    hyperparameters=hyperparameters,
    image_name=container,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.m5.2xlarge",
    framework_version="1.0-1",
    output_path="s3://{}/{}/{}/output".format(bucket, prefix, "xgboost-script-mode"),
    train_use_spot_instances=train_use_spot_instances,
    train_max_run=train_max_run,
    train_max_wait=train_max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri
)

xgb_script_mode_estimator.fit({"train": train_input})
Deploying the custom XGBoost model
After you train the model, you can use the estimator to create an Amazon SageMaker endpoint—a hosted and managed prediction service that you can use to perform inference. See the following code:
predictor = xgb_script_mode_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
test_data = xgboost.DMatrix('/path/to/data')
predictor.predict(test_data)
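When invoking the endpoint over HTTP instead of through the SDK predictor, the request and response are plain text. The following is a hypothetical pair of helpers for building a CSV request body and parsing a comma-separated response; the exact wire format is an assumption and must match the container's input and output handlers:

```python
def to_csv_payload(rows):
    # One observation per line, features comma-separated.
    return "\n".join(",".join(repr(float(v)) for v in row) for row in rows)

def parse_predictions(response_body):
    # Assumes the endpoint returns comma-separated scores as bytes.
    return [float(v) for v in response_body.decode("utf-8").strip().split(",")]
```

Helpers like these are useful when calling the endpoint from a client that doesn't have the SageMaker SDK installed, such as a Lambda function using the low-level runtime API.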
Training with Parquet input
You can now train the latest XGBoost algorithm directly on Parquet-formatted files or streams by using the open-source ML-IO library, which Amazon SageMaker supports. ML-IO is a high-performance data access library for ML frameworks that handles multiple data formats, and it is installed by default on the latest XGBoost container. For more information about importing a Parquet file and training with it, see the GitHub repo.
Conclusion
The open-source XGBoost container for Amazon SageMaker provides a fully managed training experience along with additional benefits: it reduces training costs through features such as Managed Spot Training with checkpointing, and script mode gives you the flexibility to customize training and inference with your own code.
About the Authors
Rahul Iyer is a Software Development Manager at AWS AI. He leads the Framework Algorithms team, building and optimizing machine learning frameworks like XGBoost and Scikit-learn. Outside work, he enjoys nature photography and cherishes time with his family.
Rocky Zhang is a Senior Product Manager at AWS SageMaker. He builds products that help customers solve real world business problems with Machine Learning. Outside of work he spends most of his time watching, playing, and coaching soccer.
Eric Kim is an engineer in the Algorithms & Platforms Group of Amazon AI. He helps support the AWS service SageMaker, and has experience in machine learning research, development, and application. Outside of work, he is a music lover and a fan of all dogs.
Laurence Rouesnel is a Senior Manager in Amazon AI. He leads teams of engineers and scientists working on deep learning and machine learning research and products, like SageMaker AutoPilot and Algorithms. In his spare time he’s an avid fan of traveling, table-top RPGs, and running.
3 questions: Prem Natarajan on issues of AI fairness and bias
Alexa AI vice president of natural understanding Prem Natarajan discusses the upcoming cycle for the National Science Foundation collaboration on fairness in AI, his participation on the Partnership on AI board, and issues related to bias in natural language processing.Read More