NeurIPS competition involves reinforcement learning, with the objective of minimizing both cost and CO2 emissions.
Data-driven fault analysis is key to sustainable facilities management
How data-driven methods can help to identify fault detection and drive energy efficiencies for facilities of all sizes.
How Sophos trains a powerful, lightweight PDF malware detector at ultra scale with Amazon SageMaker
This post is co-authored by Salma Taoufiq and Harini Kannan from Sophos.
As a leader in next-generation cybersecurity, Sophos strives to protect more than 500,000 organizations and millions of customers across over 150 countries against evolving threats. Powered by threat intelligence, machine learning (ML), and artificial intelligence from Sophos X-Ops, Sophos delivers a broad and varied portfolio of advanced products and services to secure and defend users, networks, and endpoints against phishing, ransomware, malware, and the wide range of cyberattacks out there.
The Sophos Artificial Intelligence (AI) group (SophosAI) oversees the development and maintenance of Sophos’s major ML security technology. Security is a big-data problem. To evade detection, cybercriminals are constantly crafting novel attacks. This translates into colossal threat datasets that the group must work with to best defend customers. One notable example is the detection and elimination of files that were cunningly laced with malware, where the datasets are in terabytes.
In this post, we focus on Sophos’s malware detection system for the PDF file format specifically. We showcase how SophosAI uses Amazon SageMaker distributed training with terabytes of data to train a powerful lightweight XGBoost (Extreme Gradient Boosting) model. This allows their team to iterate over large training data faster with automatic hyperparameter tuning and without managing the underlying training infrastructure.
The solution is seamlessly integrated into the production training pipeline, and the model is deployed on millions of user endpoints via the Sophos endpoint service.
Use case context
Whether you want to share an important contract or preserve the fancy design of your CV, the PDF format is the most common choice. Its widespread use and the general perception that such documents are airtight and static have lulled users into a false sense of security. PDF has, therefore, become an infection vector of choice in attackers’ arsenal. Malicious actions using PDFs are most often achieved via embedding a JavaScript payload that is run by the PDF reader to download a virus from a URI, sabotage the user’s machine, or steal sensitive information.
Sophos detects malicious PDF files at various points of an attack using an ensemble of deterministic and ML models. One such approach is illustrated in the following diagram, where the malicious PDF file is delivered through email. As soon as a download attempt is made, it triggers the malicious executable script to connect to the attacker’s Command and Control server. SophosAI’s PDF detector blocks the download attempt after detecting that it’s malicious.
Other approaches include blocking PDF files at the endpoint, sending malicious files to a sandbox (where they're scored using multiple models), submitting the malicious file to a scoring infrastructure and generating a security report, and so on.
Motivation
To build a tree-based detector that can convict malicious PDFs with high confidence, while allowing for low endpoint computing power consumption and fast inference responses, the SophosAI team found the XGBoost algorithm to be a perfect candidate for the task. Such research avenues are important for Sophos for two reasons. Having powerful yet small models deployed at the level of customer endpoints has a high impact on the company’s product reviews by analysts. It also, and more importantly, provides a better user experience overall.
Technical challenge
Because the goal was to have a model with a smaller memory footprint than their existing PDF malware detectors (both on disk and in memory), SophosAI turned to XGBoost, a classification algorithm with a proven record of producing drastically smaller models than neural networks while achieving impressive performance on tabular data. Before venturing into XGBoost modeling experiments, an important consideration was the sheer size of the dataset. Indeed, Sophos's core dataset of PDF files is in the terabytes.
Therefore, the main challenge was training the model with a large dataset without having to downsample. Because it’s crucial for the detector to learn to spot any PDF-based attacks — even needle-in-the-haystack and completely novel ones to better defend Sophos customers — it’s of the utmost importance to use all available diverse datasets.
Unlike neural networks, which can be trained in batches, XGBoost requires the entire training dataset in memory. The largest training dataset for this project is over 1 TB, and there is no way to train at such a scale without a distributed training framework.
Solution overview
SageMaker is a fully managed ML service providing various tools to build, train, optimize, and deploy ML models. The SageMaker built-in libraries of algorithms consist of 21 popular ML algorithms, including XGBoost. (For more information, see Simplify machine learning with XGBoost and Amazon SageMaker.) With the XGBoost built-in algorithm, you can take advantage of the open-source SageMaker XGBoost Container by specifying a framework version greater than 1.0-1, which has improved flexibility, scalability, extensibility, and Managed Spot Training, and supports input formats like Parquet, which is the format used for the PDF dataset.
The main reason SophosAI chose SageMaker is the ability to benefit from the fully managed distributed training on multi-node CPU instances by simply specifying more than one instance. SageMaker automatically splits the data across nodes, aggregates the results across peer nodes, and generates a single model. The instances can be Spot Instances, thereby significantly reducing the training costs. With the built-in algorithm for XGBoost, you can do this without any additional custom script. Distributed versions of XGBoost also exist as open source, such as XGBoost-Ray and XGBoost4J-Spark, but their use requires building, securing, tuning, and self-managing distributed computing clusters, which represents significant effort additional to scientific development.
Additionally, SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs with ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric for the given ML task.
The following diagram illustrates the solution architecture.
It’s worth noting that, when SophosAI started XGBoost experiments before turning to SageMaker, attempts were made to use large-memory Amazon Elastic Compute Cloud (Amazon EC2) instances (for example, r5a.24xlarge and x1.32xlarge) to train the model on as large of a sample of the data as possible. However, these attempts took more than 10 hours on average and usually failed due to running out of memory.
In contrast, by using the SageMaker XGBoost algorithm and a hassle-free distributed training mechanism, SophosAI could train a booster model at scale on the colossal PDF training dataset in a matter of 20 minutes. The team only had to store the data on Amazon Simple Storage Service (Amazon S3) as Parquet files of similar size and choose an EC2 instance type and the desired number of instances; SageMaker managed the underlying compute cluster infrastructure and distributed training between the multiple nodes of the cluster. Under the hood, SageMaker splits the data across nodes using ShardedByS3Key to distribute the file objects equally between instances, and uses XGBoost's implementation of the Rabit protocol (reliable AllReduce and broadcast interface) to launch distributed processing and communicate between primary and peer nodes. (For more details on the histogram aggregation and broadcast across nodes, refer to XGBoost: A Scalable Tree Boosting System.)
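As a rough sketch, a job set up along these lines might look like the following. The bucket path, container version, role, and Spot timeouts are illustrative assumptions, and the SageMaker SDK calls are shown as comments because they require an AWS environment:

```python
# Illustrative configuration mirroring the post: Parquet shards on S3,
# distributed across instances with ShardedByS3Key.
job_config = {
    "instance_type": "ml.m5.24xlarge",
    "instance_count": 40,
    "content_type": "application/x-parquet",
    "distribution": "ShardedByS3Key",  # each instance reads a disjoint subset of files
    "use_spot_instances": True,
}

# With the SageMaker Python SDK, this would look roughly like:
# import sagemaker
# from sagemaker.inputs import TrainingInput
# estimator = sagemaker.estimator.Estimator(
#     image_uri=sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1"),
#     role=role,
#     instance_count=job_config["instance_count"],
#     instance_type=job_config["instance_type"],
#     use_spot_instances=job_config["use_spot_instances"],
#     max_run=3600,
#     max_wait=7200,
# )
# train = TrainingInput(
#     "s3://my-bucket/pdf-features/train/",   # hypothetical path
#     content_type=job_config["content_type"],
#     distribution=job_config["distribution"],
# )
# estimator.fit({"train": train, "validation": validation_input})
```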
Beyond just training one model, with SageMaker, XGBoost hyperparameter tuning was also made quick and easy with the ability to run different experiments simultaneously to fine-tune the best combination of hyperparameters. The tunable hyperparameters include both booster-specific and objective function-specific hyperparameters. Two search strategies are offered: random or Bayesian. The Bayesian search strategy has proven to be valuable because it helps find better hyperparameters than a mere random search, in fewer experimental iterations.
Dataset information
SophosAI’s PDF malware detection modeling relies on a variety of features such as n-gram histograms and byte entropy features (For more information, refer to MEADE: Towards a Malicious Email Attachment Detection Engine). Metadata and features extracted from collected PDF files are stored in a distributed data warehouse. A dataset of over 3,500 features is then computed, further split based on time into training and test sets and stored in batches as Parquet files in Amazon S3 to be readily accessible by SageMaker for training jobs.
The following table provides information about the training and test data.
| Dataset | Number of Samples | Number of Parquet Files | Total Size |
|---|---|---|---|
| Training | 70,391,634 | 5,500 | ~1010 GB |
| Test | 1,242,283 | 98 | ~18 GB |
The data sizes have been computed following the formula:
Data Size = N × (nF + nL) × 4
The formula has the following parameters:
- N is the number of samples in the dataset
- nF is the number of features, with nF = 3585
- nL is the number of ground truth labels, with nL = 1
- 4 is the number of bytes needed for the features’ data type: float32
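The formula can be checked against the table with a quick calculation:

```python
def dataset_size_gb(n_samples, n_features=3585, n_labels=1, bytes_per_value=4):
    """Data Size = N x (nF + nL) x 4 bytes, with float32 features and labels."""
    return n_samples * (n_features + n_labels) * bytes_per_value / 1e9

train_gb = dataset_size_gb(70_391_634)  # ~1009.7 GB, matching the ~1010 GB above
test_gb = dataset_size_gb(1_242_283)    # ~17.8 GB, matching the ~18 GB above
```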
Additionally, the following pie charts show the label distribution of the training and test sets, illustrating the class imbalance faced in the PDF malware detection task.
The distribution shifts from the training set to the one-month test set. A time-based split of the dataset into training and testing is applied in order to simulate the real-life deployment scenario and avoid temporal snooping. This strategy also allowed SophosAI to evaluate the model’s true generalization capabilities when faced with previously unseen, brand-new PDF attacks.
Experiments and results
To kickstart experiments, the SophosAI team trained a baseline XGBoost model with default parameters. Then they began hyperparameter fine-tuning with SageMaker using the Bayesian strategy, which is as simple as specifying the hyperparameters to be tuned, the desired ranges of values, the evaluation metric (ROC (Receiver Operating Characteristic) AUC in this case), and the training and validation sets. For the PDF malware detector, SophosAI prioritized hyperparameters including the number of boosting rounds (`num_round`), the maximum tree depth (`max_depth`), the learning rate (`eta`), and the column sampling ratio when building trees (`colsample_bytree`). Eventually, the best hyperparameters were obtained and used to train a model on the full dataset, which was finally evaluated on the holdout test set.
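A hedged sketch of such a tuning setup follows. The ranges are illustrative assumptions, not SophosAI's actual search space, and the SageMaker SDK calls are shown as comments because they require an AWS environment:

```python
# Illustrative hyperparameter search space for the four prioritized parameters.
ranges = {
    "num_round": (100, 1000),        # boosting rounds (integer)
    "max_depth": (3, 10),            # maximum tree depth (integer)
    "eta": (0.01, 0.3),              # learning rate (continuous, log scale)
    "colsample_bytree": (0.5, 1.0),  # column subsampling ratio (continuous)
}

# With the SageMaker Python SDK, this would map to roughly:
# from sagemaker.tuner import (HyperparameterTuner, IntegerParameter,
#                              ContinuousParameter)
# tuner = HyperparameterTuner(
#     estimator=xgb_estimator,
#     objective_metric_name="validation:auc",
#     hyperparameter_ranges={
#         "num_round": IntegerParameter(*ranges["num_round"]),
#         "max_depth": IntegerParameter(*ranges["max_depth"]),
#         "eta": ContinuousParameter(*ranges["eta"], scaling_type="Logarithmic"),
#         "colsample_bytree": ContinuousParameter(*ranges["colsample_bytree"]),
#     },
#     strategy="Bayesian",
#     max_jobs=15,
#     max_parallel_jobs=3,
# )
# tuner.fit({"train": train_input, "validation": validation_input})
```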
The following plot shows the objective metric (ROC AUC) vs. the 15 training jobs run within the tuning job. The best hyperparameters are those corresponding to the ninth training job.
At the beginning of SophosAI’s experiments on SageMaker, an especially important question to answer was: what type of instances, and how many of them, are needed to train XGBoost on the data at hand? This is crucial because using the wrong number or type of instance can waste time and money: with too little memory, training is bound to fail; with too many or too-large instances, it becomes unnecessarily expensive.
XGBoost is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C4). To make an informed decision, there is a simple SageMaker guideline for picking the number of instances required to run training on the full dataset:
Total Training Data Size × Safety Factor(*) < Instance Count × Instance Type’s Total Memory
In this case: Total Training Data Size × Safety Factor (12) = 12120 GB
The following table summarizes the requirements when the chosen instance type is ml.m5.24xlarge.
| Training Size × Safety Factor (12) | Instance Memory of ml.m5.24xlarge | Minimum Instance Count Required for Training |
|---|---|---|
| 12120 GB | 384 GB | 32 |
*Due to the nature of XGBoost distributed training, which requires the entire training dataset to be loaded into a DMatrix object before training and additional free memory, a safety factor of 10–12 is recommended.
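The guideline reduces to a one-line calculation:

```python
import math

def min_instance_count(data_size_gb, instance_mem_gb, safety_factor=12):
    """Smallest instance count whose total memory exceeds data size x safety factor."""
    return math.ceil(data_size_gb * safety_factor / instance_mem_gb)

min_instance_count(1010, 384)  # 32 ml.m5.24xlarge instances, as in the table
```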
To take a closer look at the memory utilization for a full SageMaker training of XGBoost on the provided dataset, we provide the corresponding graph obtained from the training’s Amazon CloudWatch monitoring. For this training job, 40 ml.m5.24xlarge instances were used, and maximum memory utilization reached around 62%.
The engineering cost saved by integrating a managed ML service like SageMaker into the data pipeline is around 50%. The option to use Spot Instances for training and hyperparameter tuning jobs cut costs by an additional 63%.
Conclusion
With SageMaker, the SophosAI team successfully resolved a complex, high-priority project by building a lightweight PDF malware detection XGBoost model that is much smaller on disk (up to 25 times smaller) and in memory (up to 5 times smaller) than its predecessor. It’s a small but mighty malware detector with ~0.99 AUC, a true positive rate of 0.99, and a low false positive rate. The model can be quickly retrained and its performance easily monitored over time, because it takes less than 20 minutes to train it on more than 1 TB of data.
You can use the SageMaker built-in XGBoost algorithm to build models with your tabular data at scale. Additionally, you can try the new built-in Amazon SageMaker algorithms LightGBM, CatBoost, AutoGluon-Tabular, and TabTransformer, as described in this blog.
About the authors
Salma Taoufiq is a Senior Data Scientist at Sophos, working at the intersection of machine learning and cybersecurity. With an undergraduate background in computer science, she graduated from the Central European University with a MSc. in Mathematics and Its Applications. When not developing a malware detector, Salma is an avid hiker, traveler, and consumer of thrillers.
Harini Kannan is a Data Scientist at SophosAI. She has been in security data science for ~4 years. She was previously the Principal Data Scientist at Capsule8, which got acquired by Sophos. She has given talks at CAMLIS, BlackHat (USA), Open Data Science Conference (East), Data Science Salon, PyData (Boston), and Data Connectors. Her areas of research include detecting hardware-based attacks using performance counters, user behavior analysis, interpretable ML, and unsupervised anomaly detection.
Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS, based in London, UK. Hasan helps customers design and deploy machine learning applications in production on AWS. He has over 12 years of work experience as a data scientist, machine learning practitioner and software developer. In his spare time, Hasan loves to explore nature and spend time with friends and family.
Digant Patel is an Enterprise Support Lead at AWS. He works with customers to design, deploy and operate in cloud at scale. His areas of interest are MLOps and DevOps practices and how it can help customers in their cloud journey. Outside of work, he enjoys photography, playing volleyball and spending time with friends and family.
Amazon Scholar Rupak Majumdar wins CONCUR Test-of-Time Award
Majumdar’s 2003 paper established an elegant algorithm that influences current work on timed games.
The science behind Alexa’s new interactive story-creation experience
AI models that generate stories, place objects in a visual scene, and assemble music on the fly customize content to children’s specifications.
The science behind Amazon’s spatial audio-processing technology
Combining psychoacoustics, signal processing, and speaker beamforming enhances stereo audio and delivers an immersive sound experience for customers.
Amazon Halo Rise advances the future of sleep
Built-in radar technology, deep domain adaptation for sleep stage classification, and low-latency incremental sleep tracking enable Halo Rise to deliver a seamless, no-contact way to help customers improve sleep.
Build an AI-powered virtual agent for Genesys Cloud using QnABot and Amazon Lex
The rise of artificial intelligence technologies enables organizations to adopt and improve self-service capabilities in contact center operations to create a more proactive, timely, and effective customer experience. Voice bots, or conversational interactive voice response systems (IVR), use natural language processing (NLP) to understand customers’ questions and provide relevant answers. Businesses can automate responses to frequently asked transactional questions by deploying bots that are available 24/7. As a result, customers benefit from reduced wait time and faster call resolution time, especially during peak hours.
In the post Enhancing customer service experiences using Conversational AI: Power your contact center with Amazon Lex and Genesys Cloud, we introduced Amazon Lex support on the Genesys Cloud platform and outlined the process of activating the integration. In this post, we demonstrate how to elevate traditional customer service FAQs with an interactive voice bot. We dive into a common self-service use case, explore Q&A interactions, and offer an automated approach using QnABot on AWS Solution built on Amazon Lex with Genesys Cloud.
Solution overview
Informational interactions are widely applicable, with examples such as hours of operation, policy information, school schedules, or other frequently asked questions that are high volume and straightforward. The solution discussed in this post enables customers to interact with a voice bot backed by a curated knowledge base in a natural and conversational manner. Customers can get answers without having to wait for a human customer service representative, thereby improving resolution time and customer satisfaction. You can also implement the same bot directly as a web client, or embed it into an existing site as a chat widget, expanding touch points through multiple channels and increasing overall engagement with customers.
For a demonstration of the experience of a customer dialing into a contact center and interacting with QnABot, check out the following video:
QnABot provides a preconfigured architecture that delivers a low-code experience, as shown in the following diagram. Behind the scenes, it uses Amazon Lex along with other AWS services. Non-technical users can deploy the solution with the click of a button, build their bot through a user-friendly interface, and integrate the voice bot into a Genesys Cloud call flow.
The solution workflow contains the following steps:
- The admin deploys the QnABot solution into their AWS account, opens the Content Designer UI, and uses Amazon Cognito to authenticate.
- After authentication, Amazon CloudFront and Amazon Simple Storage Service (Amazon S3) deliver the contents of the Content Designer UI.
- The admin configures questions and answers in the Content Designer, and the UI sends requests to Amazon API Gateway to save the questions and answers.
- The Content Designer AWS Lambda function saves the input in Amazon OpenSearch Service in a questions bank index.
- The admin activates the Amazon Lex integration on Genesys Cloud, exports a sample flow from the Content Designer UI, and imports this flow into Genesys Cloud using the Genesys Archy tool.
- The customer dials into Genesys Cloud and begins an interaction with QnABot. Genesys Cloud streams this audio to Amazon Lex, which converts the audio to text and calls the Bot Fulfillment Lambda function.
- The Bot Fulfillment function takes the user input and looks up the answer in OpenSearch Service. Alternatively, you can use Amazon Kendra if an index is configured and provided at the time of deployment. The answer is synthesized into voice by Amazon Polly and played back to the customer.
- User interactions with the Bot Fulfillment function generate logs and metrics data, which are sent to Amazon Kinesis Data Firehose then to Amazon S3 for later data analysis.
To implement this solution, we walk through the following steps:
- Enable Amazon Lex V2 integration with Genesys.
- Configure Archy, the Genesys Cloud Architect YAML processor.
- Export the Genesys call flow from the QnABot Content Designer.
- Import and publish the call flow with Archy.
- Import example questions to QnABot.
- Create a test call and interact with the bot.
- Customize the call flow in Genesys Architect.
Prerequisites
To get started, you need the following:
- An AWS account
- An active deployment of QnABot on AWS (version 5.1.0 or later)
- A Genesys Cloud organization
Enable Amazon Lex V2 integration with Genesys Cloud
The first step is to enable Amazon Lex V2 integration with Genesys Cloud. For instructions, refer to Enhancing customer service experiences using Conversational AI: Power your contact center with Amazon Lex and Genesys Cloud.
Configure Archy
We have prepared a sample inbound call flow to get you started with QnABot and Genesys Cloud. We use Archy, the Genesys Cloud Architect YAML processor tool, to publish this call flow. You must first generate an OAuth client ID and client secret, then you can download and configure Archy.
Generate an OAuth client ID and client secret
Archy requires either a client ID and secret pair or an authorization token. For more information about Archy’s OAuth requirements, refer to Prerequisites in the Archy installation documentation.
To generate a client ID and secret pair, complete the following steps:
- On the Genesys Cloud Admin page, navigate to Integrations, then choose OAuth.
- Choose Add Client.
- For App Name, enter `QnABot`.
- For Description, enter a description.
- For Grant Types, select Client Credentials.
A new Roles tab appears.
- On the Roles tab, assign a role that has Architect > flow > publish permissions.
In the following screenshot, we’re assigning the `admin` role. You may have to also assign the `Master Admin` role.
- Choose Save.
- On the Client Details tab, copy the values for the client ID and client secret.
Download and configure Archy
Download and unzip the appropriate version of Archy for your operating system. Then navigate to the folder in a terminal and begin the setup process by running the following command:
Continue through the Archy setup, and provide the client ID and client secret when prompted.
Export the call flow YAML from the QnABot Content Designer
Now that Archy is authorized to publish call flows, we export the preconfigured call flow from the QnABot Content Designer.
- Log in to the QnABot Content Designer.
- On the Tools menu, choose Genesys Cloud.
- Choose Next until you reach the Download Call Flow section.
- Choose Download Inbound Call Flow.
You download a file named `QnABotFlow.yaml`, which is a preconfigured Genesys call flow.
- Copy this file to the same folder Archy is located in.
Import and publish the call flow with Archy
To publish the call flow with Archy, run the following command:
When complete, a new inbound call flow named `QnABotFlow` is available in Genesys Architect.
To assign this call flow, on the Genesys Cloud Admin page, navigate to Routing and choose Call Routing.
The new `QnABotFlow` should appear in the list of call flows under Regular Routing. Assign the flow, then choose Save.
Import example questions to QnABot
Navigate back to the QnABot Content Designer, choose the Tools menu, and choose Import.
Expand Examples/Extensions, find the GenesysWizardQnA example, and choose Load.
If you navigate back to the main questions and answers page, you now have the `GenesysHelper` questions. These are a set of example questions and answers for you to get started.
Create a test phone call and interact with the bot
Back in Genesys Cloud Admin, make sure you have an inbound phone number associated with the `QnABotFlow` call flow under Call Routing. We now navigate to the agent desktop and make a test call to interact with the bot for the first time.
QnABot is designed to answer questions based on the data preconfigured in the Content Designer. Let’s try the following:
- What are your business hours?
- What is the meaning of life?
Each time QnABot provides an answer, you have the option to ask another question, conclude the call by saying “Goodbye,” or ask to be connected to a human agent by saying “I would like to speak to an agent.”
Customize the call flow with Genesys Architect
The Genesys call flow is preconfigured to enable specific Amazon Lex session attributes. For example, if you edit the question with ID `GenesysHelper.Hours`, the answer contains the following statement:
This is based on Handlebars, and allows you to set values for session attributes. The exported Genesys Cloud CX call flow contains a block that reads back the value of the `genesys_nextPrompt` session attribute, which is only spoken by the Genesys call flow.
To branch to a queue or another call flow, a QnABot answer can use `setSessionAttr` to set `genesys_nextAction` to a specific value. An example of this is in the question with ID `GenesysHelper.Agent`, where the answer has `{{setSessionAttr 'nextAction' 'AGENT'}}`. In the call flow’s QnABot reusable task, there is a switch block that reads the value of this attribute to branch to a specific action. The example call flow contains branches for `AGENT`, `MENU`, and `END`. If there is no value for the `genesys_nextAction` session attribute, the call flow plays back any string found in the `genesys_nextPrompt` content, or the value of the `defaultPrompt` task variable defined at the beginning of the main flow, which is set by default to `ask another question or say return to main menu`.
The following diagram illustrates the main call flow.
The following diagram illustrates the flow of the reusable task.
Clean up
To avoid incurring future charges, delete the resources created via the template by navigating to the AWS CloudFormation console, selecting the QnABot stack created by the template, and choosing Delete. This removes all resources created by the template.
To remove the resources in Genesys Cloud, first remove the call flow from call routing. Then delete the call flow from Genesys Architect.
Conclusion
In this post, we walked through how to get started with QnABot and Genesys Cloud with an easy-to-deploy, readily usable solution to address a transactional interaction use case. This voice bot frees your customer service representatives to spend time with your customers on more complex tasks, and provides users with a better experience through self-service. Customer satisfaction increases, while costs decrease, because you’re consuming fewer connected minutes and maximizing agent utilization.
To get started, you can launch QnABot with a single click and go through the QnABot Workshop to learn about additional features. Amazon Lex integration is available on Genesys AppFoundry.
About the Authors
Christopher Lott is a Senior Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California, and enjoys gardening, aerospace, and traveling the world.
Jessica Ho is a Solutions Architect at Amazon Web Services, supporting ISV partners who build business applications on AWS. She is passionate about creating differentiated solutions that unlock customers for cloud adoption. Outside of work, she enjoys turning her garden into a mini jungle.
Set up enterprise-level cost allocation for ML environments and workloads using resource tagging in Amazon SageMaker
As businesses and IT leaders look to accelerate the adoption of machine learning (ML), there is a growing need to understand spend and cost allocation for your ML environment to meet enterprise requirements. Without proper cost management and governance, your ML spend may lead to surprises in your monthly AWS bill. Amazon SageMaker is a fully managed ML platform in the cloud that equips our enterprise customers with tools and resources to establish cost allocation measures and improve visibility into detailed cost and usage by your teams, business units, products, and more.
In this post, we share tips and best practices regarding cost allocation for your SageMaker environment and workloads. Across almost all AWS services, SageMaker included, applying tags to resources is a standard way to track costs. These tags can help you track, report, and monitor your ML spend through out-of-the-box solutions like AWS Cost Explorer and AWS Budgets, as well as custom solutions built on the data from AWS Cost and Usage Reports (CURs).
Cost allocation tagging
Cost allocation on AWS is a three-step process:
- Attach cost allocation tags to your resources.
- Activate your tags in the Cost allocation tags section of the AWS Billing console.
- Use the tags to track and filter for cost allocation reporting.
After you create and attach tags to resources, they appear in the AWS Billing console’s Cost allocation tags section under User-defined cost allocation tags. It can take up to 24 hours for tags to appear after they’re created. You then need to activate these tags for AWS to start tracking them for your resources. Typically, after a tag is activated, it takes about 24–48 hours for the tags to show up in Cost Explorer. The easiest way to check if your tags are working is to look for your new tag in the tags filter in Cost Explorer. If it’s there, then you’re ready to use the tags for your cost allocation reporting. You can then choose to group your results by tag keys or filter by tag values, as shown in the following screenshot.
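Once a tag is active, grouping costs by it can also be done programmatically. The following is a minimal sketch of a Cost Explorer request grouped by a tag key; the tag key `team` and the dates are illustrative assumptions, and the boto3 call is commented out because it requires an AWS account:

```python
# Request shape for the Cost Explorer get_cost_and_usage API, grouping
# monthly unblended cost by the (assumed) activated tag key "team".
request = {
    "TimePeriod": {"Start": "2022-09-01", "End": "2022-10-01"},
    "Granularity": "MONTHLY",
    "Metrics": ["UnblendedCost"],
    "GroupBy": [{"Type": "TAG", "Key": "team"}],
}

# import boto3
# ce = boto3.client("ce")
# response = ce.get_cost_and_usage(**request)
# for group in response["ResultsByTime"][0]["Groups"]:
#     print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])
```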
One thing to note: if you use AWS Organizations and have linked AWS accounts, tags can only be activated in the primary payer account. Optionally, you can also activate CURs for your AWS accounts, which deliver cost allocation reports as CSV files with your usage and costs grouped by your active tags. This gives you more detailed tracking of your costs and makes it easier to set up your own custom reporting solutions.
Tagging in SageMaker
At a high level, tagging SageMaker resources can be grouped into two buckets:
- Tagging the SageMaker notebook environment, either Amazon SageMaker Studio domains and domain users, or SageMaker notebook instances
- Tagging SageMaker-managed jobs (labeling, processing, training, hyperparameter tuning, batch transform, and more) and resources (such as models, work teams, endpoint configurations, and endpoints)
We cover these in more detail in this post and provide some solutions on how to apply governance control to ensure good tagging hygiene.
Tagging SageMaker Studio domains and users
Studio is a web-based, integrated development environment (IDE) for ML that lets you build, train, debug, deploy, and monitor your ML models. You can launch Studio notebooks quickly, and dynamically dial up or down the underlying compute resources without interrupting your work.
To automatically tag these dynamic resources, you need to assign tags to SageMaker domain and domain users who are provisioned access to those resources. You can specify these tags in the tags parameter of create-domain or create-user-profile during profile or domain creation, or you can add them later using the add-tags API. Studio automatically copies and assigns these tags to the Studio notebooks created in the domain or by the specific users. You can also add tags to SageMaker domains by editing the domain settings in the Studio Control Panel.
The following is an example of assigning tags to the profile during creation.
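A minimal sketch of what this might look like with boto3 (the domain ID, profile name, and tag values below are placeholders, not real resources; the API call itself is shown as a comment):

```python
import json

# Hypothetical tags to attach to a new Studio user profile.
tags = [
    {"Key": "environment", "Value": "dev"},
    {"Key": "cost-center", "Value": "ML-Marketing"},
]

# Request parameters for create-user-profile; the domain ID and
# profile name are placeholders.
request = {
    "DomainId": "d-xxxxxxxxxxxx",
    "UserProfileName": "data-scientist-1",
    "Tags": tags,
}

# With boto3, this request would be submitted as:
# sagemaker = boto3.client("sagemaker")
# sagemaker.create_user_profile(**request)

print(json.dumps(request, indent=2))
```

Studio then copies these tags to the notebooks this user creates, so their dynamic compute usage is attributed automatically.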
To tag existing domains and users, use the add-tags API. The tags are then applied to any new notebooks. To have these tags applied to your existing notebooks, you need to restart the Studio apps (Kernel Gateway and Jupyter Server) belonging to that user profile. This won’t cause any loss of notebook data. Refer to Shut Down and Update SageMaker Studio and Studio Apps to learn how to delete and restart your Studio apps.
Tagging SageMaker notebook instances
In the case of a SageMaker notebook instance, tagging is applied to the instance itself. The tags are assigned to all resources running in the same instance. You can specify tags programmatically using the tags parameter in the create-notebook-instance API or add them via the SageMaker console during instance creation. You can also add or update tags anytime using the add-tags API or via the SageMaker console.
Note that this excludes SageMaker managed jobs and resources such as training and processing jobs because they’re in the service environment rather than on the instance. In the next section, we go over how to apply tagging to these resources in greater detail.
Tagging SageMaker managed jobs and resources
For SageMaker managed jobs and resources, tagging must be applied to the tags attribute as part of each API request. An SKLearnProcessor example is illustrated in the following code. You can find more examples of how to assign tags to other SageMaker managed jobs and resources on the GitHub repo.
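As a sketch, the tags argument is just a list of key-value pairs passed to the job constructor (the SDK call is shown as a comment; the framework version, role, and tag values are illustrative):

```python
# Tags are passed via the `tags` attribute of each SageMaker job request.
tags = [
    {"Key": "environment", "Value": "dev"},
    {"Key": "cost-center", "Value": "ML-Marketing"},
]

# With the SageMaker Python SDK, the same tags would be attached like this
# (illustrative values; requires the `sagemaker` package and a valid role):
# from sagemaker.sklearn.processing import SKLearnProcessor
# processor = SKLearnProcessor(
#     framework_version="1.0-1",
#     role=role_arn,
#     instance_type="ml.m5.xlarge",
#     instance_count=1,
#     tags=tags,
# )
# processor.run(code="preprocess.py")

print(tags)
```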
Tagging SageMaker pipelines
In the case of SageMaker pipelines, you can tag the entire pipeline as a whole instead of each individual step. The SageMaker pipeline automatically propagates the tags to each pipeline step. You still have the option to add additional, separate tags to individual steps if needed. In the Studio UI, the pipeline tags appear in the metadata section.
To apply tags to a pipeline, use the SageMaker Python SDK:
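A sketch of pipeline-level tagging (the pipeline name, steps, and role are hypothetical, and the SDK calls are shown as comments):

```python
# Tags applied at the pipeline level; SageMaker propagates them to
# every step in the pipeline automatically.
tags = [
    {"Key": "environment", "Value": "production"},
    {"Key": "team", "Value": "ml-platform"},
]

# With the SageMaker Python SDK (illustrative; requires the `sagemaker`
# package, defined pipeline steps, and a valid execution role):
# pipeline = Pipeline(name="my-pipeline", steps=[step_process, step_train])
# pipeline.upsert(role_arn=role_arn, tags=tags)

print(tags)
```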
Enforce tagging using IAM policies
Although tagging is an effective mechanism for implementing cloud management and governance strategies, enforcing the right tagging behavior can be challenging if you just leave it to the end-users. How do you prevent ML resource creation if a specific tag is missing, how do you ensure the right tags are applied, and how do you prevent users from deleting existing tags?
You can accomplish this using AWS Identity and Access Management (IAM) policies. The following code is an example of a policy that prevents SageMaker actions such as CreateDomain or CreateNotebookInstance if the request doesn’t contain the environment key and one of the list values. The ForAllValues modifier with the aws:TagKeys condition key indicates that only the key environment is allowed in the request. This stops users from including other keys, such as accidentally using Environment instead of environment.
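A sketch of such a policy follows; the allowed tag values are illustrative, so adapt them to your own environment. Because this is an Allow statement with conditions, requests without a matching environment tag are implicitly denied (assuming no broader allow grants these actions elsewhere):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCreateOnlyWithEnvironmentTag",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateDomain",
        "sagemaker:CreateNotebookInstance"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/environment": ["dev", "staging", "production"]
        },
        "ForAllValues:StringEquals": {
          "aws:TagKeys": "environment"
        }
      }
    }
  ]
}
```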
Tag policies and service control policies (SCPs) can also be a good way to standardize creation and labeling of your ML resources. For more information about how to implement a tagging strategy that enforces and validates tagging at the organization level, refer to Cost Allocation Blog Series #3: Enforce and Validate AWS Resource Tags.
Cost allocation reporting
You can view the tags by filtering the views on Cost Explorer, viewing a monthly cost allocation report, or by examining the CUR.
Visualizing tags in Cost Explorer
Cost Explorer is a tool that enables you to view and analyze your costs and usage. You can explore your usage and costs using the main graph: the Cost Explorer cost and usage reports. For a quick video on how to use Cost Explorer, check out How can I use Cost Explorer to analyze my spending and usage?
With Cost Explorer, you can filter how you view your AWS costs by tags. The Group by option lets you group results by tag keys such as Environment, Deployment, or Cost Center, and the tag filter lets you select the values you want regardless of the key, for example Production or Staging. Keep in mind that you must run the resources after adding and activating tags; otherwise, Cost Explorer won’t have any usage data, and the tag values won’t be displayed as filter or group-by options.
The following screenshot is an example of filtering by all values of the BusinessUnit tag.
Examining tags in the CUR
The Cost and Usage Report contains the most comprehensive set of cost and usage data available. The report contains line items for each unique combination of AWS product, usage type, and operation that your AWS account uses. You can customize the CUR to aggregate the information either by the hour or by the day. A monthly cost allocation report is one way to set up cost allocation reporting. You can set up a monthly cost allocation report that lists the AWS usage for your account by product category and linked account user. The report contains the same line items as the detailed billing report and additional columns for your tag keys. You can set it up and download your report by following the steps in Monthly cost allocation report.
The following screenshot shows how user-defined tag keys show up in the CUR. User-defined tag keys have the prefix user, such as user:Department and user:CostCenter. AWS-generated tag keys have the prefix aws.
Visualize the CUR using Amazon Athena and Amazon QuickSight
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. To integrate Athena with CURs, refer to Querying Cost and Usage Reports using Amazon Athena. You can then build custom queries to query CUR data using standard SQL. The following screenshot is an example of a query to filter all resources that have the value TF2WorkflowTraining for the cost-center tag.
In the following example, we’re trying to figure out which resources are missing values under the cost-center tag.
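For illustration, queries along these lines could produce those results, assuming your CUR table is named cur_table and the tag surfaces as the resource_tags_user_cost_center column (CUR converts a user-defined tag like cost-center into a column of that form):

```sql
-- Resources whose cost-center tag has the value TF2WorkflowTraining
SELECT line_item_resource_id,
       SUM(line_item_unblended_cost) AS cost
FROM cur_table
WHERE resource_tags_user_cost_center = 'TF2WorkflowTraining'
GROUP BY line_item_resource_id;

-- Resources that are missing a value for the cost-center tag
SELECT DISTINCT line_item_resource_id
FROM cur_table
WHERE resource_tags_user_cost_center IS NULL
   OR resource_tags_user_cost_center = '';
```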
More information and example queries can be found in the AWS CUR Query Library.
You can also feed CUR data into Amazon QuickSight, where you can slice and dice it any way you’d like for reporting or visualization purposes. For instructions on ingesting CUR data into QuickSight, see How do I ingest and visualize the AWS Cost and Usage Report (CUR) into Amazon QuickSight.
Budget monitoring using tags
AWS Budgets is an excellent way to provide an early warning if spend spikes unexpectedly. You can create custom budgets that alert you when your ML costs and usage exceed (or are forecasted to exceed) your user-defined thresholds. With AWS Budgets, you can monitor your total monthly ML costs or filter your budgets to track costs associated with specific usage dimensions. For example, you can set the budget scope to include SageMaker resource costs tagged as cost-center: ML-Marketing, as shown in the following screenshot. For additional dimensions and detailed instructions on how to set up AWS Budgets, refer to the AWS Budgets documentation.
With budget alerts, you can send notifications when your budget limits are (or are about to be) exceeded. These alerts can also be posted to an Amazon Simple Notification Service (Amazon SNS) topic. An AWS Lambda function that subscribes to the SNS topic is then invoked, and any programmatically implementable actions can be taken.
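A minimal sketch of such a Lambda handler follows; the topic wiring and any downstream action are up to you, and the event shape is the standard SNS-to-Lambda record format:

```python
def lambda_handler(event, context):
    """Invoked by Amazon SNS when a budget alert is published."""
    actions = []
    for record in event["Records"]:
        # The budget notification text arrives in the SNS Message field
        message = record["Sns"]["Message"]
        subject = record["Sns"].get("Subject", "")
        # Take whatever programmatic action fits your workflow here,
        # e.g. notify a chat channel or stop non-production resources.
        actions.append({"subject": subject, "message": message})
    return {"processed": len(actions), "actions": actions}

# Local smoke test with a fake SNS event
if __name__ == "__main__":
    fake_event = {
        "Records": [
            {"Sns": {"Subject": "AWS Budgets: ML budget alert",
                     "Message": "Your ml-monthly budget has exceeded 80%."}}
        ]
    }
    print(lambda_handler(fake_event, None))
```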
AWS Budgets also lets you configure budget actions, which are steps that you can take when a budget threshold is exceeded (actual or forecasted amounts). This level of control allows you to reduce unintentional overspending in your account. You can configure specific responses to cost and usage in your account that will be applied automatically or through a workflow approval process when a budget target has been exceeded. This is a really powerful solution to ensure that your ML spend is consistent with the goals of the business. You can select what type of action to take. For example, when a budget threshold is crossed, you can move specific IAM users from admin permissions to read-only. For customers using Organizations, you can apply actions to an entire organizational unit by moving them from admin to read-only. For more details on how to manage cost using budget actions, refer to How to manage cost overruns in your AWS multi-account environment – Part 1.
You can also set up a report to monitor the performance of your existing budgets on a daily, weekly, or monthly cadence and deliver that report to up to 50 email addresses. With AWS Budgets reports, you can combine all SageMaker-related budgets into a single report. This feature enables you to track your SageMaker footprint from a single location, as shown in the following screenshot. You can opt to receive these reports on a daily, weekly, or monthly cadence (I’ve chosen Weekly for this example), and choose the day of week when you want to receive them.
This feature is useful to keep your stakeholders up to date with your SageMaker costs and usage, and help them see when spend isn’t trending as expected.
After you set up this configuration, you should receive an email similar to the following.
Conclusion
In this post, we showed how you can set up cost allocation tagging for SageMaker and shared tips on tagging best practices for your SageMaker environment and workloads. We then discussed different reporting options like Cost Explorer and the CUR to help you improve visibility into your ML spend. Lastly, we demonstrated AWS Budgets and the budget summary report to help you monitor the ML spend of your organization.
For more information about applying and activating cost allocation tags, see User-Defined Cost Allocation Tags.
About the authors
Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Add-ons.
Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.
Nilesh Shetty is a Senior Technical Account Manager at AWS, where he helps enterprise support customers streamline their cloud operations on AWS. He is passionate about machine learning and has experience working as a consultant, architect, and developer. Outside of work, he enjoys listening to music and watching sports.
James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.
Index your Dropbox content using the Dropbox connector for Amazon Kendra
Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.
Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to pull together data across several structured and unstructured repositories to index and search on.
One such data repository is Dropbox. Enterprise users use Dropbox to upload, transfer, and store documents to the cloud. Along with the ability to store documents, Dropbox offers Dropbox Paper, a coediting tool that lets users collaborate and create content in one place. Dropbox Paper can optionally use templates to add structure to documents. In addition to files and paper, Dropbox also allows you to store shortcuts to webpages in your folders.
We’re excited to announce that you can now use the Amazon Kendra connector for Dropbox to search information stored in your Dropbox account. In this post, we show how to index information stored in Dropbox and use the Amazon Kendra intelligent search function. In addition, Amazon Kendra’s ML-powered intelligent search can accurately find information in unstructured documents containing natural-language narrative content, for which keyword search is not very effective.
Solution overview
With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index a Dropbox repository or folder using the Amazon Kendra connector for Dropbox. The solution consists of the following steps:
- Configure an app on Dropbox and get the connection details.
- Store the details in AWS Secrets Manager.
- Create a Dropbox data source via the Amazon Kendra console.
- Index the data in the Dropbox repository.
- Run a sample query to get the information.
Prerequisites
To try out the Amazon Kendra connector for Dropbox, you need the following:
- A Dropbox Enterprise (not personal) account.
- An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
- Basic knowledge of AWS.
Configure a Dropbox app and gather connection details
Before we set up the Dropbox data source, we need a few details about your Dropbox repository. Let’s gather those in advance.
- Go to www.dropbox.com/developers.
- Choose App console.
- Sign in with your credentials (make sure you’re signing in to an Enterprise account).
- Choose Create app.
- Select Scoped access.
- Select Full Dropbox (or the name of the specific folder you want to index).
- Enter a name for your app.
- Choose Create app.
You can see the configuration screen with a set of tabs.
- To set up permissions, choose the Permissions tab.
- Select a minimal set of permissions, as shown in the following screenshots.
- Choose Submit.
A message appears saying that the permission change was successful.
- On the Settings tab, copy the app key.
- Choose Show next to App secret and copy the secret.
- Under Generated access token, choose Generate and copy the token.
Store these values in a safe place—we need to refer to these later.
The generated access token is valid for up to 4 hours. You have to generate a new access token each time you index the content.
Store Dropbox credentials in Secrets Manager
To store your Dropbox credentials in Secrets Manager, complete the following steps:
- On the Secrets Manager console, choose Store a new secret.
- Choose Other type of secret.
- Create three key-value pairs for appKey, appSecret, and refreshToken, and enter the values saved from Dropbox.
- Choose Save.
- For Secret name, enter a name (for example, AmazonKendra-dropbox-secret).
- Enter an optional description.
- Choose Next.
- In the Configure rotation section, keep all settings at their defaults and choose Next.
- On the Review page, choose Store.
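The same secret can also be created programmatically. A sketch follows, with placeholder credential values and the boto3 call shown as a comment:

```python
import json

# Placeholder values -- substitute the app key, app secret, and refresh
# token you copied from the Dropbox App console.
secret_value = {
    "appKey": "<your-app-key>",
    "appSecret": "<your-app-secret>",
    "refreshToken": "<your-refresh-token>",
}
secret_string = json.dumps(secret_value)

# With boto3, the secret would be stored like this:
# secretsmanager = boto3.client("secretsmanager")
# secretsmanager.create_secret(
#     Name="AmazonKendra-dropbox-secret",
#     SecretString=secret_string,
# )

print(secret_string)
```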
Configure the Amazon Kendra connector for Dropbox
To configure the Amazon Kendra connector, complete the following steps:
- On the Amazon Kendra console, choose Create an Index.
- For Index name, enter a name for the index (for example, my-dropbox-index).
- Enter an optional description.
- For Role name, enter an IAM role name.
- Configure optional encryption settings and tags.
- Choose Next.
- In the Configure user access control section, leave the settings at their defaults and choose Next.
- For Provisioning editions, select Developer edition.
- Choose Create.
This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
- Choose Data sources in the navigation pane.
- Under Dropbox, choose Add connector.
- For Data source name, enter a name (for example, my-dropbox-connector).
- Enter an optional description.
- Choose Next.
- For Type of authentication token, select Access Token (temporary use).
- For AWS Secrets Manager secret, choose the secret you created earlier.
- For IAM role, choose Create a new role.
- For Role name, enter a name (for example, AmazonKendra-dropbox-role).
- Choose Next.
- For Select entities or content types, choose your content types.
- For Frequency, choose Run on demand.
- Choose Next.
- Set any optional field mappings and choose Next.
- Choose Review and Create, then choose Add data source.
- Choose Sync now.
- Wait for the sync to complete.
Test the solution
Now that you have ingested the content from your Dropbox account into your Amazon Kendra index, you can test some queries.
Go to your index and choose Search indexed content. Enter a sample search query and test out your search results (your query will vary based on the contents of your account).
The Dropbox connector also crawls local identity information from Dropbox. For users, it sets the user’s email ID as the principal. For groups, it sets the group ID as the principal. To filter search results by users or groups, go to the search console.
Choose Test query with user name or groups to expand the section, then choose Apply user name or groups.
Enter the user and/or group names and choose Apply. Next, enter the search query and press Enter. This brings you a filtered set of results based on your criteria.
Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from your Dropbox account.
Generate permanent tokens for offline access
The instructions in this post walk you through creating, configuring, and using a temporary access token. Apps can also get long-term access by requesting offline access, in which case the app receives a refresh token that can be used to retrieve new short-lived access tokens as needed, without further manual user intervention. You can find more information in the Dropbox OAuth Guide and Dropbox authorization documentation. Use the following steps to create a permanent refresh token (for example to set the sync to trigger on a schedule):
- Get the app key and app secret as before.
- In a new browser, navigate to https://www.dropbox.com/oauth2/authorize?token_access_type=offline&response_type=code&client_id=<appkey>.
- Accept the defaults and choose Submit.
- Choose Continue.
- Choose Allow.
An access code is generated for you.
- Copy the access code.
Now you get the refresh token from the access code.
- In a terminal window, run the following curl command:
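The token exchange follows the standard Dropbox OAuth 2.0 flow; a command along these lines (with placeholders for the access code, app key, and app secret you saved earlier) exchanges the access code for a refresh token:

```shell
curl https://api.dropbox.com/oauth2/token \
    -d code=<ACCESS_CODE> \
    -d grant_type=authorization_code \
    -u <APP_KEY>:<APP_SECRET>
```

The JSON response includes a refresh_token field, which is the value to store in Secrets Manager.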
You can store this refresh token along with the app key and app secret to configure a permanent token in the data source configuration for Amazon Kendra. Amazon Kendra generates the access token and uses it as needed for access.
Limitations
This solution has the following limitations:
- File comments are not imported into the index
- You don’t have the option to add custom metadata for Dropbox
- Google Docs, Sheets, and Slides require a Google Workspace or Google account and are not included
Conclusion
With the Dropbox connector for Amazon Kendra, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Kendra.
In this post, we introduced you to the basics, but there are many additional features that we didn’t cover. For example:
- You can enable user-based access control for your Amazon Kendra index and restrict access to users and groups that you configure
- You can specify allowedUsersColumn and allowedGroupsColumn so you can apply access controls based on users and groups, respectively
- You can integrate the Dropbox data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion
To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide.
About the author
Ashish Lagwankar is a Senior Enterprise Solutions Architect at AWS. His core interests include AI/ML, serverless, and container technologies. Ashish is based in the Boston, MA, area and enjoys reading, outdoors, and spending time with his family.