Developing advanced machine learning systems at Trumid with the Deep Graph Library for Knowledge Embedding

This is a guest post co-written with Mutisya Ndunda from Trumid.

Like many industries, the corporate bond market doesn’t lend itself to a one-size-fits-all approach. It’s vast, liquidity is fragmented, and institutional clients demand solutions tailored to their specific needs. Advances in AI and machine learning (ML) can be employed to improve the customer experience, increase the efficiency and accuracy of operational workflows, and enhance performance by supporting multiple aspects of the trading process.

Trumid is a financial technology company building tomorrow’s credit trading network—a marketplace for efficient trading, information dissemination, and execution between corporate bond market participants. Trumid is optimizing the credit trading experience by combining leading-edge product design and technology principles with deep market expertise. The result is an integrated trading solution delivering a full ecosystem of protocols and execution tools within one intuitive platform.

The bond trading market has traditionally involved offline buyer/seller matching processes aided by rules-based technology. Trumid has embarked on an initiative to transform this experience. Through its electronic trading platform, traders can access thousands of bonds to buy or sell, a community of engaged users to interact with, and a variety of trading protocols and execution solutions. With an expanding network of users, Trumid’s AI and Data Strategy team partnered with the AWS Machine Learning Solutions Lab. The objective was to develop ML systems that could deliver a more personalized trading experience by modeling the interest and preferences of users for bonds available on Trumid.

These ML models can speed up time to insight and action by personalizing how information is displayed to each user, ensuring that the most relevant and actionable information for each trader is prioritized and accessible.

To solve this challenge, Trumid and the ML Solutions Lab developed an end-to-end data preparation, model training, and inference process based on a deep neural network model built using the Deep Graph Library for Knowledge Embedding (DGL-KE). The solution was then deployed end to end with Amazon SageMaker.

Benefits of graph machine learning

Real-world data is complex and interconnected, and often contains network structures. Examples include molecules in nature, social networks, the internet, roadways, and financial trading platforms.

Graphs provide a natural way to model this complexity by extracting important and rich information that is embedded in the relations between entities.

Traditional ML algorithms require data to be organized as tables or sequences. This generally works well, but some domains are more naturally and effectively represented by graphs (such as a network of objects related to each other, as illustrated later in this post). Instead of coercing these graph datasets into tables or sequences, you can use graph ML algorithms to both represent and learn from the data as presented in its graph form, including information about constituent nodes, edges, and other features.

Considering that bond trading is inherently represented as a network of interactions between buyers and sellers involving various types of bond instruments, an effective solution needs to harness the network effects of the communities of traders that participate in the market. Let’s look at how we leveraged the trading network effects and implemented this vision here.

Solution

Bond trading is characterized by several factors, including trade size, term, issuer, rate, coupon values, bid/ask offer, and type of trading protocol involved. In addition to orders and trades, Trumid also captures “indications of interest” (IOIs). The historical interaction data embodies the trading behavior and the market conditions evolving over time. We used this data to build a graph of timestamped interactions between traders, bonds, and issuers, and used graph ML to predict future interactions.

The recommendation solution comprised four main steps:

  • Preparing the trading data as a graph dataset
  • Training a knowledge graph embedding model
  • Predicting new trades
  • Packaging the solution as a scalable workflow

In the following sections, we discuss each step in more detail.

Preparing the trading data as a graph dataset

There are many ways to represent trading data as a graph. One option is to represent the data exhaustively with nodes, edges, and properties: traders as nodes with properties (such as employer or tenure), bonds as nodes with properties (issuer, amount outstanding, maturity, rate, coupon value), and trades as edges with properties (date, type, size). Another option is to simplify the data and use only nodes and relations (relations are typed edges like traded or issued-by). This latter approach worked better in our case, and we used the graph represented in the following figure.

Graph of relations between traders, bonds, and bond issuers

Additionally, we removed edges we considered stale: if a trader had interacted with more than 100 different bonds, we kept only the 100 most recent bonds.

Finally, we saved the graph dataset as a list of edges in TSV format:

t987	trade-old		i55198
t995	trade-old		i55306
t987	trade-recent	i24528
t995	trade-recent	i49181
t987	ioi-recent		i24523
t995	ioi-old 		i49178
…
i49611	issued-by		XXX
i46569	issued-by		YYY
i46507	issued-by		ZZZ
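
The exact data preparation pipeline is specific to Trumid's data, but as a rough illustration, the following pandas sketch shows how an edge list in this format could be produced from a hypothetical interactions table. The `interactions` DataFrame, its column names (trader_id, bond_id, interaction_type, timestamp), and the 30-day recency cutoff are assumptions made for illustration, not part of the original solution; the issued-by edges between bonds and issuers would be appended to the same file before training.

import numpy as np
import pandas as pd

def build_edge_list(interactions: pd.DataFrame, recent_days: int = 30, max_bonds: int = 100) -> pd.DataFrame:
    df = interactions.sort_values("timestamp")
    # Label each interaction as recent or old relative to a cutoff date
    cutoff = df["timestamp"].max() - pd.Timedelta(days=recent_days)
    df["relation"] = np.where(df["timestamp"] >= cutoff,
                              df["interaction_type"] + "-recent",
                              df["interaction_type"] + "-old")
    # Keep one edge per (trader, bond) pair, based on the most recent interaction
    df = df.drop_duplicates(["trader_id", "bond_id"], keep="last")
    # Keep only the 100 most recently touched bonds per trader
    df = df.groupby("trader_id").tail(max_bonds)
    return df[["trader_id", "relation", "bond_id"]]

edges = build_edge_list(interactions)
edges.to_csv("train.tsv", sep="\t", header=False, index=False)  # tab-separated triples for DGL-KE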

Training a knowledge graph embedding model

For graphs composed only of nodes and relations (often called knowledge graphs), the DGL team developed the knowledge graph embedding framework DGL-KE. KE stands for knowledge embedding, the idea being to represent nodes and relations (knowledge) by coordinates (embeddings) and optimize (train) the coordinates so that the original graph structure can be recovered from the coordinates. In the list of available embedding models, we selected TransE (translational embeddings). TransE trains embeddings with the objective of approximating the following equality:

Source node embedding + relation embedding = target node embedding (1)

We trained the model by invoking the dglke_train command. The output of the training is a model folder containing the trained embeddings.

For more details about TransE, refer to Translating Embeddings for Modeling Multi-relational Data.
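
To make equality (1) concrete, the following minimal NumPy sketch scores a candidate triple with the TransE distance; a well-trained model assigns a small distance (high plausibility) to triples that belong in the graph. This is an illustrative re-implementation of the scoring idea, not code taken from DGL-KE itself.

import numpy as np

def transe_distance(head_emb, relation_emb, tail_emb):
    """TransE plausibility: distance between (head + relation) and tail.

    A smaller distance means the triple (head, relation, tail) is more plausible,
    which is exactly the objective behind equality (1).
    """
    return float(np.linalg.norm(head_emb + relation_emb - tail_emb, ord=2))

# Toy 4-dimensional embeddings, purely for illustration
trader = np.array([0.1, 0.3, -0.2, 0.5])
trade_recent = np.array([0.2, -0.1, 0.4, 0.0])
bond = np.array([0.3, 0.2, 0.2, 0.5])

print(transe_distance(trader, trade_recent, bond))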

Predicting new trades

To predict new trades for a trader with our model, we used equality (1): we added the trader embedding to the trade-recent embedding and looked for the bonds closest to the resulting embedding (see the sketch after the following steps).

We did this in two steps:

  1. Compute scores for all possible trade-recent relations with dglke_predict.
  2. Compute the top 100 highest scores for each trader.
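
Conceptually, this is a nearest-neighbor search in the embedding space. The following NumPy sketch illustrates the idea using the trained embeddings that DGL-KE saves to the model folder; the file names, paths, and index variables below are placeholders, and in practice dglke_predict performs this scoring for you at scale.

import numpy as np

# Placeholder paths: DGL-KE writes trained entity and relation embeddings as .npy files
# in the model folder; the exact file names depend on your dataset and model names.
entity_emb = np.load("ckpts/TransE_l2_trading_0/trading_TransE_l2_entity.npy")
relation_emb = np.load("ckpts/TransE_l2_trading_0/trading_TransE_l2_relation.npy")

def top_bonds_for_trader(trader_idx, trade_recent_idx, bond_indices, k=100):
    """Rank candidate bonds for one trader using equality (1): trader + trade-recent ≈ bond."""
    query = entity_emb[trader_idx] + relation_emb[trade_recent_idx]
    distances = np.linalg.norm(entity_emb[bond_indices] - query, axis=1)
    return bond_indices[np.argsort(distances)[:k]]   # the k closest bonds, most plausible first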

For detailed instructions on how to use the DGL-KE, refer to Training knowledge graph embeddings at scale with the Deep Graph Library and DGL-KE Documentation.

Packaging the solution as a scalable workflow

We used SageMaker notebooks to develop and debug our code. For production, we wanted to invoke the model as a simple API call. We found that we didn’t need to separate data preparation, model training, and prediction, and it was convenient to package the whole pipeline as a single script and use SageMaker Processing. SageMaker Processing allows you to run a script remotely on a chosen instance type and Docker image without having to worry about resource allocation and data transfer. This was simple and cost-effective for us, because the GPU instance is only used and paid for during the 15 minutes needed for the script to run.
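
As an illustration, a SageMaker Processing job that runs the whole pipeline script on a GPU instance can be launched with just a few lines. The script name, image URI, bucket, and instance type below are placeholders rather than the exact values we used, and the role and bucket variables are assumed to be defined.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/dgl-ke:latest",  # placeholder image with DGL-KE installed
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",     # GPU instance, paid only for the duration of the job
)

processor.run(
    code="end_to_end_pipeline.py",      # data prep + dglke_train + prediction in one script
    inputs=[ProcessingInput(source=f"s3://{bucket}/trading-edges/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination=f"s3://{bucket}/recommendations/")],
)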

For detailed instructions on how to use SageMaker processing, see Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation and Processing.

Results

Our custom graph model performed very well compared to other methods: performance improved by 80%, with more stable results across all trader types. We measured performance by mean recall (percentage of actual trades predicted by the recommender, averaged over all traders). With other standard metrics, the improvement ranged from 50–130%.
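
For reference, mean recall can be computed in a few lines once you have, for each trader, the set of bonds actually traded in the evaluation window and the set recommended by the model; the data structures below are illustrative.

def mean_recall(actual: dict, recommended: dict) -> float:
    """Average, over traders, of the fraction of actual trades that appear in the recommendations.

    actual:      {trader_id: set of bond ids actually traded}
    recommended: {trader_id: set of bond ids recommended by the model}
    """
    recalls = [
        len(actual[t] & recommended.get(t, set())) / len(actual[t])
        for t in actual if actual[t]
    ]
    return sum(recalls) / len(recalls)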

This performance enabled us to better match traders and bonds, indicating an enhanced trader experience, with machine learning delivering a big step forward from hard-coded rules, which can be difficult to scale.

Conclusion

Trumid is focused on delivering innovative products and workflow efficiencies to their community of users. Building tomorrow’s credit trading network requires continuous collaboration with peers and industry experts like the AWS ML Solutions Lab, designed to help you innovate faster.



About the authors

Marc van Oudheusden is a Senior Data Scientist with the Amazon ML Solutions Lab team at Amazon Web Services. He works with AWS customers to solve business problems with artificial intelligence and machine learning. Outside of work you may find him at the beach, playing with his children, surfing or kitesurfing.

Mutisya Ndunda is the Head of Data Strategy and AI at Trumid. He is a seasoned financial professional with over 20 years of broad institutional experience in capital markets, trading, and financial technology. Mutisya has a strong quantitative and analytical background, with over a decade of experience in artificial intelligence, machine learning, and big data analytics. Prior to Trumid, he was the CEO of Alpha Vertex, a financial technology company offering analytical solutions powered by proprietary AI algorithms to financial institutions. Mutisya holds a bachelor’s degree in Electrical Engineering and a master’s degree in Financial Engineering, both from Cornell University.

Isaac Privitera is a Senior Data Scientist at the Amazon Machine Learning Solutions Lab, where he develops bespoke machine learning and deep learning solutions to address customers’ business problems. He works primarily in the computer vision space, focusing on enabling AWS customers with distributed training and active learning.


Organize your machine learning journey with Amazon SageMaker Experiments and Amazon SageMaker Pipelines

The process of building a machine learning (ML) model is iterative: you keep experimenting until you find a candidate model that performs well and is ready to be deployed. As data scientists iterate through that process, they need a reliable method to easily track experiments to understand how each model version was built and how it performed.

Amazon SageMaker allows teams to take advantage of a broad range of features to quickly prepare, build, train, deploy, and monitor ML models. Amazon SageMaker Pipelines provides a repeatable process for iterating through model build activities, and is integrated with Amazon SageMaker Experiments. By default, every SageMaker pipeline is associated with an experiment, and every run of that pipeline is tracked as a trial in that experiment. Then your iterations are automatically tracked without any additional steps.

In this post, we take a closer look at the motivation behind having an automated process to track experiments with Experiments and the native capabilities built into Pipelines.

Why is it important to keep your experiments organized?

Let’s take a step back for a moment and try to understand why it’s important to have experiments organized for machine learning. When data scientists approach a new ML problem, they have to answer many different questions, from data availability to how they will measure model performance.

At the start, the process is full of uncertainty and is highly iterative. As a result, this experimentation phase can produce multiple models, each created from their own inputs (datasets, training scripts, and hyperparameters) and producing their own outputs (model artifacts and evaluation metrics). The challenge then is to keep track of all these inputs and outputs of each iteration.

Data scientists typically train many different model versions until they find the combination of data transformation, algorithm, and hyperparameters that results in the best performing version of a model. Each of these unique combinations is a single experiment. With a traceable record of the inputs, algorithms, and hyperparameters that were used by that trial, the data science team can easily reproduce their steps.

Having an automated process in place to track experiments improves the ability to reproduce as well as deploy specific model versions that are performing well. The Pipelines native integration with Experiments makes it easy to automatically track and manage experiments across pipeline runs.

Benefits of SageMaker Experiments

SageMaker Experiments allows data scientists to organize, track, compare, and evaluate their training iterations.

Let’s start first with an overview of what you can do with Experiments:

  • Organize experiments – Experiments structures experimentation with a top-level entity called an experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, and parameters. You can picture experiments as the top-level folder for organizing your hypotheses, your trials as the subfolders for each group test run, and your trial components as your files for each instance of a test run.
  • Track experiments – Experiments allows data scientists to track experiments. It can automatically associate SageMaker jobs with a trial via simple configuration and via the tracking SDKs.
  • Compare and evaluate experiments – The integration of Experiments with Amazon SageMaker Studio makes it easy to produce data visualizations and compare different trials. You can also access the trial data via the Python SDK to generate your own visualization using your preferred plotting libraries.

To learn more about Experiments APIs and SDKs, we recommend the following documentation: CreateExperiment and Amazon SageMaker Experiments Python SDK.

If you want to dive deeper, we recommend looking into the amazon-sagemaker-examples/sagemaker-experiments GitHub repository for further examples.
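
For example, using the sagemaker-experiments Python SDK, you can create an experiment and a trial explicitly before associating training jobs with them. The following is a minimal sketch; the experiment and trial names are placeholders chosen for illustration.

import boto3
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

sm_client = boto3.client("sagemaker")

# Hypothetical names used for illustration
churn_experiment = Experiment.create(
    experiment_name="customer-churn-prediction",
    description="Compare feature sets and algorithms for churn prediction",
    sagemaker_boto_client=sm_client,
)

first_trial = Trial.create(
    trial_name="churn-xgboost-baseline",
    experiment_name=churn_experiment.experiment_name,
    sagemaker_boto_client=sm_client,
)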

Integration between Pipelines and Experiments

The model building pipelines that are part of Pipelines are purpose-built for ML and allow you to orchestrate your model build tasks using a pipeline tool that includes native integrations with other SageMaker features, as well as the flexibility to extend your pipeline with steps run outside SageMaker. Each step defines an action that the pipeline takes. The dependencies between steps are defined by a directed acyclic graph (DAG) built using the Pipelines Python SDK. You can build a SageMaker pipeline programmatically via the same SDK. After a pipeline is deployed, you can optionally visualize its workflow within Studio.

Pipelines integrates with Experiments by automatically creating an experiment and a trial for every pipeline run before the steps run, unless you specify one or both of them yourself. While running the pipeline’s SageMaker jobs, the pipeline associates the trial with the experiment, and associates with the trial every trial component that is created by the jobs. Specifying your own experiment or trial programmatically lets you fine-tune how you organize your experiments.

The workflow we present in this example consists of a series of steps: a preprocessing step to split our input dataset into train, test, and validation datasets; a tuning step to tune our hyperparameters and kick off training jobs to train a model using the XGBoost built-in algorithm; and finally a model step to create a SageMaker model from the best trained model artifact. Pipelines also offers several natively supported step types outside of what is discussed in this post. We also illustrate how you can track your pipeline workflow and generate metrics and comparison charts. Furthermore, we show how to associate the new trial generated to an existing experiment that might have been created before the pipeline was defined.

SageMaker Pipelines code

You can review and download the notebook from the GitHub repository associated with this post. We look at the Pipelines-specific code to understand it better.

Pipelines enables you to pass parameters at run time. Here we define the processing and training instance types and counts at run time with preset defaults:

from sagemaker.workflow.parameters import ParameterInteger, ParameterString

base_job_prefix = "pipeline-experiment-sample"
model_package_group_name = "pipeline-experiment-model-package"

processing_instance_count = ParameterInteger(
  name="ProcessingInstanceCount", default_value=1
)

training_instance_count = ParameterInteger(
  name="TrainingInstanceCount", default_value=1
)

processing_instance_type = ParameterString(
  name="ProcessingInstanceType", default_value="ml.m5.xlarge"
)
training_instance_type = ParameterString(
  name="TrainingInstanceType", default_value="ml.m5.xlarge"
)

Next, we set up a processing script that downloads and splits the input dataset into train, test, and validation parts. We use SKLearnProcessor for running this preprocessing step. To do so, we define a processor object with the instance type and count needed to run the processing job.

Pipelines allows us to achieve data versioning in a programmatic way by using execution-specific variables like ExecutionVariables.PIPELINE_EXECUTION_ID, which is the unique ID of a pipeline run. We can, for example, create a unique key for storing the output datasets in Amazon Simple Storage Service (Amazon S3) that ties them to a specific pipeline run. For the full list of variables, refer to Execution Variables.

framework_version = "0.23-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-ca-housing",
    role=role,
)

process_step = ProcessingStep(
    name="ca-housing-preprocessing",
    processor=sklearn_processor,
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "train",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/validation",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "validation",
                ],
            )
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/test",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "test",
                ],
            )
        ),
    ],
    code="california-housing-preprocessing.py",
)

Then we move on to create an estimator object to train an XGBoost model. We set some static hyperparameters that are commonly used with XGBoost:

model_path = f"s3://{default_bucket}/{base_job_prefix}/ca-housing-experiment-pipeline"

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.2-2",
    py_version="py3",
    instance_type=training_instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=training_instance_count,
    output_path=model_path,
    base_job_name=f"{base_job_prefix}/ca-housing-train",
    sagemaker_session=sagemaker_session,
    role=role,
)

xgb_train.set_hyperparameters(
    eval_metric="rmse",
    objective="reg:squarederror",  # Define the object metric for the training job
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7
)

We perform hyperparameter tuning of the models we create by using a ContinuousParameter range for lambda. Choosing one metric to be the objective metric tells the tuner (the instance that runs the hyperparameter tuning jobs) to evaluate each training job based on that specific metric. The tuner returns the best combination based on the value of this objective metric, in our case the combination that minimizes the root mean square error (RMSE).

objective_metric_name = "validation:rmse"

hyperparameter_ranges = {
    "lambda": ContinuousParameter(0.01, 10, scaling_type="Logarithmic")
}

tuner = HyperparameterTuner(xgb_train,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type="Minimize",  # minimize the validation RMSE
                            strategy="Bayesian",
                            max_jobs=10,
                            max_parallel_jobs=3)

tune_step = TuningStep(
    name="HPTuning",
    tuner=tuner,
    inputs={
        "train": TrainingInput(
            s3_data=process_step.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=process_step.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    } 
)

The tuning step runs multiple trials with the goal of determining the best model among the parameter ranges tested. With the get_top_model_s3_uri method, we rank the trained model versions by the objective metric and extract the model artifact S3 URI of the best-performing one (we specify top_k=0 for the best) to create a SageMaker model.

model_bucket_key = f"{default_bucket}/{base_job_prefix}/ca-housing-experiment-pipeline"
model_candidate = Model(
    image_uri=image_uri,
    model_data=tune_step.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
    sagemaker_session=sagemaker_session,
    role=role,
    predictor_cls=XGBoostPredictor,
)

create_model_step = CreateModelStep(
    name="CreateTopModel",
    model=model_candidate,
    inputs=sagemaker.inputs.CreateModelInput(instance_type="ml.m4.large"),
)

When the pipeline runs, it creates trial components for each hyperparameter tuning job and each SageMaker job created by the pipeline steps.

You can further configure the integration of pipelines with Experiments by creating a PipelineExperimentConfig object and passing it to the pipeline object. Its two parameters define the name of the experiment that will be created and the name of the trial that refers to the whole run of the pipeline.

If you want to associate a pipeline run to an existing experiment, you can pass its name, and Pipelines will associate the new trial to it. You can prevent the creation of an experiment and trial for a pipeline run by setting pipeline_experiment_config to None.

#Pipeline experiment config
ca_housing_experiment_config = PipelineExperimentConfig(
    experiment_name,
    Join(
        on="-",
        values=[
            "pipeline-execution",
            ExecutionVariables.PIPELINE_EXECUTION_ID
        ],
    )
)

We pass in the instance types and counts as parameters, and chain the preceding steps in order as follows. The pipeline workflow is implicitly defined by the outputs of a step being the inputs of another step.

pipeline_name = "CAHousingExperimentsPipeline"

pipeline = Pipeline(
    name=pipeline_name,
    pipeline_experiment_config=ca_housing_experiment_config,
    parameters=[
        processing_instance_count,
        processing_instance_type,
        training_instance_count,
        training_instance_type
    ],
    steps=[process_step,tune_step,create_model_step],
)

The full-fledged pipeline is now created and ready to go. We add an execution role to the pipeline and start it. From here, we can go to the SageMaker Studio Pipelines console and visually track every step. You can also access the linked logs from the console to debug a pipeline.

pipeline.upsert(role_arn=sagemaker.get_execution_role())
execution = pipeline.start()

In the Studio Pipelines view, a successfully run pipeline is shown in green. We obtain the metrics of one trial from a run of the pipeline with the following code:

from sagemaker.analytics import ExperimentAnalytics
from smexperiments.search_expression import Filter, Operator, SearchExpression

# SM Pipeline injects the Execution ID into trial component names
execution_id = execution.describe()['PipelineExecutionArn'].split('/')[-1]
source_arn_filter = Filter(
    name="TrialComponentName", operator=Operator.CONTAINS, value=execution_id
)

source_type_filter = Filter(
    name="Source.SourceType", operator=Operator.EQUALS, value="SageMakerTrainingJob"
)

search_expression = SearchExpression(
    filters=[source_arn_filter, source_type_filter]
)

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session,
    experiment_name=experiment_name,
    search_expression=search_expression.to_boto()
)

analytic_table = trial_component_analytics.dataframe()
analytic_table.head()

Compare the metrics for each trial component

You can plot the results of hyperparameter tuning in Studio or via other Python plotting libraries. We show both ways of doing this.

Explore the training and evaluation metrics in Studio

Studio provides an interactive user interface where you can generate interactive plots. The steps are as follows:

  1. Choose Experiments and Trials from the SageMaker resources icon on the left sidebar.
  2. Choose your experiment to open it.
  3. Choose (right-click) the trial of interest.
  4. Choose Open in trial component list.
  5. Press Shift to select the trial components representing the training jobs.
  6. Choose Add chart.
  7. Choose New chart and customize it to plot the collected metrics that you want to analyze. For our use case, choose the following:
    1. For Data type, select Summary Statistics.
    2. For Chart type, select Scatter Plot.
    3. For X-axis, choose lambda.
    4. For Y-axis, choose validation:rmse_last.

The new chart appears at the bottom of the window.

You can include more or fewer training jobs by pressing Shift and choosing the eye icon for a more interactive experience.

Analytics with SageMaker Experiments

When the pipeline run is complete, we can quickly visualize how different variations of the model compare in terms of the metrics collected during training. Earlier, we exported all trial metrics to a Pandas DataFrame using ExperimentAnalytics. We can reproduce the plot obtained in Studio by using the Matplotlib library.

analytic_table.plot.scatter("lambda", "validation:rmse - Last", grid=True)

Conclusion

The native integration between SageMaker Pipelines and SageMaker Experiments allows data scientists to automatically organize, track, and visualize experiments during model development activities. You can create experiments to organize all your model development work, such as the following:

  • A business use case you’re addressing, such as creating an experiment to predict customer churn
  • An experiment owned by the data science team regarding marketing analytics, for example
  • A specific data science and ML project

In this post, we dove into Pipelines to show how you can use it in tandem with Experiments to organize a fully automated end-to-end workflow.

As a next step, you can use these three SageMaker features – Studio, Experiments and Pipelines – for your next ML project.



About the authors

Paolo Di Francesco is a solutions architect at AWS. He has experience in telecommunications and software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Mario Bourgoin is a Senior Partner Solutions Architect for AWS, an AI/ML specialist, and the global tech lead for MLOps. He works with enterprise customers and partners deploying AI solutions in the cloud. He has more than 30 years of experience doing machine learning and AI at startups and in enterprises, starting with creating one of the first commercial machine learning systems for big data. Mario spends his free time playing with his three Belgian Tervurens, cooking dinner for his family, and learning about mathematics and cosmology.

Ganapathi Krishnamoorthi is a Senior ML Solutions Architect at AWS. Ganapathi provides prescriptive guidance to startup and enterprise customers helping them to design and deploy cloud applications at scale. He is specialized in machine learning and is focused on helping customers leverage AI/ML for their business outcomes. When not at work, he enjoys exploring outdoors and listening to music.

Valerie Sounthakith is a Solutions Architect for AWS, working in the Gaming Industry and with Partners deploying AI solutions. She is aiming to build her career around Computer Vision. In her free time, Valerie enjoys traveling, discovering new food spots, and redecorating her house.


Build taxonomy-based contextual targeting using AWS Media Intelligence and Hugging Face BERT

As new data privacy regulations like the General Data Protection Regulation (GDPR, effective 2018) have come into effect, customers are under increased pressure to monetize media assets while abiding by the new rules. Monetizing media while respecting privacy regulations requires the ability to automatically extract granular metadata from assets like text, images, video, and audio files at internet scale. It also requires a scalable way to map media assets to industry taxonomies that facilitate discovery and monetization of content. This use case is particularly significant for the advertising industry as data privacy rules cause a shift away from behavioral targeting using third-party cookies.

Third-party cookies help enable personalized ads for web users, and allow advertisers to reach their intended audience. A traditional solution to serve ads without third-party cookies is contextual advertising, which places ads on webpages based on the content published on the pages. However, contextual advertising poses the challenge of extracting context from media assets at scale, and likewise using that context to monetize the assets.

In this post, we discuss how you can build a machine learning (ML) solution that we call Contextual Intelligence Taxonomy Mapper (CITM) to extract context from digital content and map it to standard taxonomies in order to generate value. Although we apply this solution to contextual advertising, you can use it to solve other use cases. For example, education technology companies can use it to map their content to industry taxonomies in order to facilitate adaptive learning that delivers personalized learning experiences based on students’ individual needs.

Solution overview

The solution comprises two components: AWS Media Intelligence (AWS MI) capabilities for context extraction from content on web pages, and CITM for intelligent mapping of content to an industry taxonomy. You can access the solution’s code repository for a detailed view of how we implement its components.

AWS Media Intelligence

AWS MI capabilities enable automatic extraction of metadata that provides contextual understanding of a webpage’s content. You can combine ML techniques like computer vision, speech to text, and natural language processing (NLP) to automatically generate metadata from text, videos, images, and audio files for use in downstream processing. Managed AI services such as Amazon Rekognition, Amazon Transcribe, Amazon Comprehend, and Amazon Textract make these ML techniques accessible using API calls. This eliminates the overhead needed to train and build ML models from scratch. In this post, you see how using Amazon Comprehend and Amazon Rekognition for media intelligence enables metadata extraction at scale.

Contextual Intelligence Taxonomy Mapper

After you extract metadata from media content, you need a way to map that metadata to an industry taxonomy in order to facilitate contextual targeting. To do this, you build Contextual Intelligence Taxonomy Mapper (CITM), which is powered by a BERT sentence transformer from Hugging Face.

The BERT sentence transformer enables CITM to categorize web content with contextually related keywords. For example, it can categorize a web article about healthy living with keywords from the industry taxonomy, such as “Healthy Cooking and Eating,” “Running and Jogging,” and more, based on the text written and the images used within the article. CITM also provides the ability to choose the mapped taxonomy terms to use for your ad bidding process based on your criteria.

The following diagram illustrates the conceptual view of the architecture with CITM.

CITM conceptual diagram

The IAB (Interactive Advertising Bureau) Content Taxonomy

For this post, we use the IAB Tech Lab’s Content Taxonomy as the industry standard taxonomy for the contextual advertising use case. By design, the IAB taxonomy helps content creators more accurately describe their content, and it provides a common language for all parties in the programmatic advertising process. The use of a common terminology is crucial because the selection of ads for a webpage a user visits has to happen within milliseconds. The IAB taxonomy serves as a standardized way to categorize content from various sources while also being an industry protocol that real-time bidding platforms use for ad selection. It has a hierarchical structure, which provides granularity of taxonomy terms and enhanced context for advertisers.

Solution workflow

The following diagram illustrates the solution workflow.

CITM solution overview

The steps are as follows:

  1. Amazon Simple Storage Service (Amazon S3) stores the IAB content taxonomy and extracted web content.
  2. Amazon Comprehend performs topic modeling to extract common themes from the collection of articles.
  3. The Amazon Rekognition object label API detects labels in images.
  4. CITM maps content to a standard taxonomy.
  5. Optionally, you can store content to taxonomy mapping in a metadata store.

In the following sections, we walk through each step in detail.

Amazon S3 stores the IAB content taxonomy and extracted web content

We store extracted text and images from a collection of web articles in an S3 bucket. We also store the IAB content taxonomy. As a first step, we concatenate different tiers of the taxonomy to create combined taxonomy terms. This approach helps maintain the taxonomy’s hierarchical structure when the BERT sentence transformer creates embeddings for each keyword. See the following code:

def prepare_taxonomy(taxonomy_df):
    
    """
    Concatenate IAB Tech Lab content taxonomy tiers and prepare keywords for BERT embedding. 
    Use this function as-is if using the IAB Content Taxonomy
    
    Parameters (input):
    ----------
    taxonomy_df : Content taxonomy dataframe

    Returns (output):
    -------
    df_clean : Content taxonomy with tiers in the taxonomy concatenated
    keyword_list: List of concatenated content taxonomy keywords
    ids: List of ids for the content taxonomy keywords
    """
    
    df = taxonomy_df[['Unique ID ','Parent','Name','Tier 1','Tier 2','Tier 3']] 
    df_str = df.astype({"Unique ID ": 'str', "Parent": 'str', "Tier 1": 'str', "Tier 2": 'str', "Tier 3": 'str'})
    df_clean = df_str.replace('nan','')
    
    #create a column that concatenates all tiers for each taxonomy keyword
    df_clean['combined']=df_clean[df_clean.columns[2:6]].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)
    
    #turn taxonomy keywords into a list of strings to prep for encoding with the BERT sentence transformer
    keyword_list=df_clean['combined'].to_list()
                       
    #get list of taxonomy ids
    ids = df_clean['Unique ID '].to_list()                  
            
    return df_clean, keyword_list, ids

taxonomy_df, taxonomy_terms, taxonomy_ids = prepare_taxonomy(read_taxonomy)

The following diagram illustrates the IAB context taxonomy with combined tiers.

IAB Content Taxonomy with concatenated tiers

Amazon Comprehend performs topic modeling to extract common themes from the collection of articles

With the Amazon Comprehend topic modeling API, you analyze all the article texts using the Latent Dirichlet Allocation (LDA) model. The model examines each article in the corpus and groups keywords into the same topic based on the context and frequency in which they appear across the entire collection of articles. To ensure the LDA model detects highly coherent topics, you perform a preprocessing step prior to calling the Amazon Comprehend API. You can use the gensim library’s CoherenceModel to determine the optimal number of topics to detect from the collection of articles or text files. See the following code:

def compute_coherence_scores(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute coherence scores for various number of topics for your topic model. 
    Adjust the parameters below based on your data

    Parameters (input):
    ----------
    dictionary : Gensim dictionary created earlier from input texts
    corpus : Gensim corpus created earlier from input texts
    texts : List of input texts
    limit : The maximum number of topics to test. Amazon Comprehend can detect up to 100 topics in a collection

    Returns (output):
    -------
    models : List of LDA topic models
    coherence_scores : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_scores = []
    models = []
    for num_topics in range(start, limit, step):
        model = gensim.models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        models.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherencemodel.get_coherence())

    return models, coherence_scores

models, coherence_scores = compute_coherence_scores(dictionary=id2word, corpus=corpus_tdf, texts=corpus_words, start=2, limit=100, step=3)

After you get the optimal number of topics, you use that value for the Amazon Comprehend topic modeling job. Providing different values for the NumberOfTopics parameter in the Amazon Comprehend StartTopicsDetectionJob operation results in a variation in the distribution of keywords placed in each topic group. An optimized value for the NumberOfTopics parameter represents the number of topics that provide the most coherent grouping of keywords with higher contextual relevance. You can store the topic modeling output from Amazon Comprehend in its raw format in Amazon S3.
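
As a sketch, the topic modeling job can be started with the boto3 Comprehend client once the optimal topic count is known. The bucket paths, role ARN, and topic count below are placeholders, not values from the original solution.

import boto3

comprehend_client = boto3.client("comprehend")

optimal_num_topics = 40  # illustrative value taken from the coherence analysis above

response = comprehend_client.start_topics_detection_job(
    JobName="article-topic-modeling",
    NumberOfTopics=optimal_num_topics,
    DataAccessRoleArn="arn:aws:iam::<account-id>:role/<comprehend-data-access-role>",  # placeholder role
    InputDataConfig={
        "S3Uri": "s3://<bucket>/extracted-article-text/",        # placeholder input location
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://<bucket>/topic-modeling-output/"},  # placeholder output location
)
print(response["JobId"])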

The Amazon Rekognition object label API detects labels in images

You analyze each image extracted from all webpages using the Amazon Rekognition DetectLabels operation. For each image, the operation provides a JSON response with all labels detected within the image, coupled with a confidence score for each. For our use case, we arbitrarily select a confidence score of 60% or higher as the threshold for object labels to use in the next step. You store object labels in their raw format in Amazon S3. See the following code:

"""
Create a function to extract object labels from a given image using Amazon Rekognition
"""

def get_image_labels(image_loc):
    labels = []
    with fs.open(image_loc, "rb") as im:
        response = rekognition_client.detect_labels(Image={"Bytes": im.read()})
    
    for label in response["Labels"]:
        if label["Confidence"] >= 60:   #change to desired confidence score threshold, value between [0,100]:
            object_label = label["Name"]
            labels.append(object_label)
    return labels

CITM maps content to a standard taxonomy

CITM compares extracted content metadata (topics from text and labels from images) with keywords on the IAB taxonomy, and then maps the content metadata to keywords from the taxonomy that are semantically related. For this task, CITM completes the following three steps:

  1. Generate neural embeddings for the content taxonomy, topic keywords, and image labels using Hugging Face’s BERT sentence transformer. We access the sentence transformer model from Amazon SageMaker. In this post, we use the paraphrase-MiniLM-L6-v2 model, which maps keywords and labels to a 384-dimensional dense vector space. (A minimal encoding sketch follows after this list.)
  2. Compute the cosine similarity score between taxonomy keywords and topic keywords using their embeddings. It also computes cosine similarity between the taxonomy keywords and the image object labels. We use cosine similarity as a scoring mechanism to find semantically similar matches between the content metadata and the taxonomy. See the following code:
def compute_similarity(entity_embeddings, entity_terms, taxonomy_embeddings, taxonomy_terms):
    """
    Compute cosine scores between entity embeddings and taxonomy embeddings
    
    Parameters (input):
    ----------
    entity_embeddings : Embeddings for either topic keywords from Amazon Comprehend or image labels from Amazon Rekognition
    entity_terms : Terms for topic keywords or image labels
    taxonomy_embeddings : Embeddings for the content taxonomy
    taxonomy_terms : Terms for the taxonomy keywords

    Returns (output):
    -------
    mapping_df : Dataframe that matches each entity keyword to each taxonomy keyword and their cosine similarity score
    """
    
    #calculate cosine score, pairing each entity embedding with each taxonomy keyword embedding
    cosine_scores = util.pytorch_cos_sim(entity_embeddings, taxonomy_embeddings)
    pairs = []
    for i in range(cosine_scores.shape[0]):   # iterate over every entity embedding
        for j in range(0, cosine_scores.shape[1]):
            pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})
    
    #Sort cosine similarity scores in decreasing order
    pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
    rows = []
    for pair in pairs:
        i, j = pair['index']
        rows.append([entity_terms[i], taxonomy_terms[j], pair['score']])
    
    #move sorted values to a dataframe
    mapping_df= pd.DataFrame(rows, columns=["term", "taxonomy_keyword","cosine_similarity"])
    mapping_df['cosine_similarity'] = mapping_df['cosine_similarity'].astype('float')
    mapping_df= mapping_df.sort_values(by=['term','cosine_similarity'], ascending=False)
    drop_dups= mapping_df.drop_duplicates(subset=['term'], keep='first')
    mapping_df = drop_dups.sort_values(by=['cosine_similarity'], ascending=False).reset_index(drop=True)
    return mapping_df
                                               
#compute cosine_similarity scores between topic keywords and content taxonomy keywords using BERT embeddings
text_taxonomy_mapping=compute_similarity(keyword_embeddings, topic_keywords, taxonomy_embeddings, taxonomy_terms)
  3. Identify pairings with similarity scores that are above a user-defined threshold and use them to map the content to semantically related keywords on the content taxonomy. In our test, we select all keywords from pairings that have a cosine similarity score of 0.5 or higher. See the following code:
#merge text and image keywords mapped to content taxonomy
rtb_keywords=pd.concat([text_taxonomy_mapping[["term","taxonomy_keyword","cosine_similarity"]],image_taxonomy_mapping]).sort_values(by='cosine_similarity',ascending=False).reset_index(drop=True)

#select keywords with a cosine_similarity score greater than your desired threshold (the value should be from 0 to 1)
rtb_keywords[rtb_keywords["cosine_similarity"] > 0.5] # change to desired threshold for the cosine score, value between [0, 1]
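
As referenced in step 1, a minimal local sketch of the embedding step might look like the following, using the sentence-transformers library directly; in the post, the model is accessed through Amazon SageMaker, and the input lists (taxonomy_terms, topic_keywords, image_labels) are the variables assumed elsewhere in this walkthrough.

from sentence_transformers import SentenceTransformer

# paraphrase-MiniLM-L6-v2 maps each input string to a 384-dimensional dense vector
bert_model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

taxonomy_embeddings = bert_model.encode(taxonomy_terms)   # concatenated IAB taxonomy keywords
keyword_embeddings = bert_model.encode(topic_keywords)    # topic keywords from Amazon Comprehend
label_embeddings = bert_model.encode(image_labels)        # object labels from Amazon Rekognition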

A common challenge when working with internet-scale language representation (such as in this use case) is that you need a model that can fit most of the content—in this case, words in the English language. Hugging Face’s BERT transformer has been pre-trained using a large corpus of Wikipedia articles in the English language to represent the semantic meaning of words in relation to one another. You fine-tune the pre-trained model using your specific dataset of topic keywords, image labels, and taxonomy keywords. When you place all embeddings in the same feature space and visualize them, you see that BERT logically represents semantic similarity between terms.

The following example visualizes IAB content taxonomy keywords for the class Automotive represented as vectors using BERT. BERT places Automotive keywords from the taxonomy close to semantically similar terms.

Visualization of BERT embeddings for taxonomy keywords

The feature vectors allow CITM to compare the metadata labels and taxonomy keywords in the same feature space. In this feature space, CITM calculates cosine similarity between each feature vector for taxonomy keywords and each feature vector for topic keywords. In a separate step, CITM compares taxonomy feature vectors and feature vectors for image labels. Pairings with cosine scores closest to 1 are identified as semantically similar. Note that a pairing can either be a topic keyword and a taxonomy keyword, or an object label and a taxonomy keyword.

The following screenshot shows example pairings of topic keywords and taxonomy keywords using cosine similarity calculated with BERT embeddings.

Topic to taxonomy keyword pairings

To map content to taxonomy keywords, CITM selects keywords from pairings with cosine scores that meet a user-defined threshold. These are the keywords that will be used on real-time bidding platforms to select ads for the webpage’s inventory. The result is a rich mapping of online content to the taxonomy.

Optionally store content to taxonomy mapping in a metadata store

After you identify contextually similar taxonomy terms from CITM, you need a way for low-latency APIs to access this information. In programmatic bidding for advertisements, low response time and high concurrency play an important role in monetizing the content. The schema for the data store needs to be flexible to accommodate additional metadata when needed to enrich bid requests. Amazon DynamoDB can match the data access patterns and operational requirements for such a service.
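
As an illustration, writing the mapped keywords to a DynamoDB table could look like the following sketch; the table name and item schema here are assumptions made for this example, not prescribed by the solution.

import boto3

dynamodb = boto3.resource("dynamodb")
mapping_table = dynamodb.Table("contextual-keyword-mappings")   # hypothetical table keyed on page URL

def store_page_mapping(page_url: str, keywords_df) -> None:
    """Persist the taxonomy keywords mapped to a webpage for low-latency lookup at bid time."""
    mapping_table.put_item(
        Item={
            "page_url": page_url,                                    # partition key (assumed schema)
            "taxonomy_keywords": keywords_df["taxonomy_keyword"].tolist(),
            "max_cosine_similarity": str(keywords_df["cosine_similarity"].max()),  # stored as string to avoid float type issues
        }
    )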

Conclusion

In this post, you learned how to build a taxonomy-based contextual targeting solution using Contextual Intelligence Taxonomy Mapper (CITM). You learned how to use Amazon Comprehend and Amazon Rekognition to extract granular metadata from your media assets. Then, using CITM you mapped the assets to an industry standard taxonomy to facilitate programmatic ad bidding for contextually related ads. You can apply this framework to other use cases that require use of a standard taxonomy to enhance the value of existing media assets.

To experiment with CITM, you can access its code repository and use it with a text and image dataset of your choice.

We recommend learning more about the solution components introduced in this post. Discover more about AWS Media Intelligence to extract metadata from media content. Also, learn more about how to use Hugging Face models for NLP using Amazon SageMaker.


About the Authors

Aramide Kehinde is a Sr. Partner Solution Architect at AWS in Machine Learning and AI. Her career journey has spanned the areas of Business Intelligence and Advanced Analytics across multiple industries. She works to enable partners to build solutions with AWS AI/ML services that serve customers' needs for innovation. She also enjoys building at the intersection of AI and the creative arts, and spending time with her family.

Anuj Gupta is a Principal Solutions Architect working with hyper-growth companies on their cloud native journey. He is passionate about using technology to solve challenging problems and has worked with customers to build highly distributed and low latency applications. He contributes to open-source Serverless and Machine Learning solutions. Outside of work, he loves traveling with his family and writing poems and philosophical blogs.
