Use everyday language to search and retrieve data with Mixtral 8x7B on Amazon SageMaker JumpStart

With the widespread adoption of generative artificial intelligence (AI) solutions, organizations are trying to use these technologies to make their teams more productive. One exciting use case is enabling natural language interactions with relational databases. Rather than writing complex SQL queries, you can describe in plain language what data you want to retrieve or manipulate. The large language model (LLM) can understand the intent behind your natural language input, combine it with the structure of your data, and automatically generate the appropriate SQL code. This allows analysts to be more productive by not having to context switch into rigid query syntax, while also opening up relational databases to less technical users.

In this post, we show you how to set up and deploy a solution to chat with your databases using natural language, allowing users to gain insights into their data without writing any code or SQL queries.

Benefits of text-to-SQL generative AI and the Mixtral 8x7B model

Consider Michelle, a business analyst responsible for preparing weekly sales reports by running complex SQL queries on the company's data warehouse to aggregate numbers by product, region, and time period. In the past, this manual process took 2–3 hours per week of working with the analyst team to write these queries by hand. Now with text-to-SQL generative AI, Michelle simply describes the report she needs in plain English, such as “Show total revenue last week for shoes in the Western region grouped by sub-category.” The AI assistant automatically generates the required SQL query, runs it on the data warehouse, and returns a formatted report in seconds.

By eliminating the SQL bottleneck, Michelle saves hours per week, now spent on more impactful analysis instead of query writing. She can iterate faster and answer questions on demand. Other business users like Michelle gain similar productivity benefits from this conversational access to relational data. The generative AI tool essentially turns self-service analytics aspirations into reality by allowing business teams to leave the SQL to the machines.

For this implementation, Mixtral 8x7B MoE was used. Mixtral 8x7B is a state-of-the-art Sparse Mixture of Experts (MoE) foundation model released by Mistral AI. It supports multiple use cases such as text summarization, classification, text generation, and code generation. It is an 8x model, which means it contains eight distinct groups of parameters. The model has about 45 billion total parameters and supports a context length of 32,000 tokens. MoE is a type of neural network architecture that consists of multiple “experts,” where each expert is a neural network. In the context of transformer models, MoE replaces some feed-forward layers with sparse MoE layers. These layers have a certain number of experts, and a router network selects which experts process each token at each layer. MoE models enable more compute-efficient and faster inference compared to dense models. Compared to traditional LLMs, Mixtral 8x7B offers the advantage of faster decoding at the speed of a smaller parameter-dense model despite containing more parameters. It also outperforms other open-access models on certain benchmarks and supports a longer context length.
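
To make the MoE idea concrete, the following is a small conceptual sketch in PyTorch of a sparse MoE layer with top-2 routing. It illustrates the general technique only, not Mixtral's actual implementation; the dimensions and routing details are simplified assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Conceptual sparse MoE layer: a router picks the top-k experts per token."""
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += top_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

Because only the top-k experts run for each token, the compute per token scales with the active experts rather than the total parameter count, which is why Mixtral decodes at roughly the speed of a much smaller dense model.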

You can currently deploy Mixtral 8x7B on Amazon SageMaker JumpStart with one click. Amazon SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. Instead of having to manually integrate, optimize, and configure each foundation model yourself, SageMaker JumpStart handles those complex tasks for you. With just a few clicks, you can deploy state-of-the-art models from Hugging Face, Cohere, AI21 Labs, Stability AI, and more using optimized containers and SageMaker endpoints. SageMaker JumpStart eliminates the heavy lifting involved in foundation model deployment. You get access to a huge catalog of prebuilt models that you can quickly put to use for inference. It’s a scalable, cost-effective way to implement powerful AI solutions without machine learning (ML) expertise.

Solution overview

The following diagram illustrates the solution architecture.

At a high level, the overall solution consists of three core components: the Mixtral 8x7B Instruct model hosted on a SageMaker endpoint, the Amazon Redshift database that stores the relational data, and the orchestration code that ties them together.

The end-to-end flow is as follows:

  1. The user asks a natural language question, which is passed to the Mixtral 8x7B Instruct model, hosted in SageMaker.
  2. The LLM analyzes the question and uses the schema fetched from the connected Amazon Redshift database to generate a SQL query.
  3. The SQL query is run against the database. In case of an error, a retry workflow is run.
  4. Tabular results received are passed back to the LLM to interpret and convert them into a natural language response to the user’s original question.

Prerequisites

To launch an endpoint to host Mixtral 8x7B from SageMaker JumpStart, you may need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint usage. You can request service quota increases through the AWS Management Console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.

To follow along with this example, you also need access to a relational data source. Amazon Redshift is used as the primary data source in this post with the TICKIT database. This database helps analysts track sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts. In particular, analysts can identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. You can also experiment with other AWS data sources like Amazon RDS, Athena, or your own relational databases. Make sure to have the connection details for your data source available, such as database URL, user name, and password.

To follow the demo using Amazon Redshift, you first need to set up a Redshift cluster if you don’t already have one. Use the Amazon Redshift console or AWS CLI to launch a cluster with your desired node type and number of nodes. When the cluster is available, create a new database and tables in it to hold your sample relational data. You can load data from Amazon Simple Storage Service (Amazon S3) or directly insert rows. When storing data in Amazon S3, make sure that all public access is blocked and the data is encrypted at rest and in transit. For more information, refer to Security best practices for Amazon S3. Finally, make sure to note the cluster endpoint, database name, and credentials to connect. With a Redshift cluster provisioned and loaded with data, you have a relational backend ready to pair with the natural language interface.

To test that you successfully added data to your Redshift cluster, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose the cluster you want to query.
  3. Navigate to the Query Editor tab to open the query editor.
  4. Run the following sample queries or write your own SQL queries:
    • Find total sales on a given date:
      SELECT sum(qtysold)
      FROM sales, date
      WHERE sales.dateid = date.dateid AND caldate = '2008-01-05';

    • Find top 10 buyers:
      SELECT firstname, lastname, total_quantity
      FROM (SELECT buyerid, sum(qtysold) total_quantity 
      FROM sales GROUP BY buyerid ORDER BY total_quantity desc limit 10) Q, users
      WHERE Q.buyerid = userid ORDER BY Q.total_quantity desc;

The query editor allows saving, scheduling, and sharing queries. You can also view query plans, inspect run details, and monitor query performance.

Implement the solution

The code consists of a number of functions that are invoked by the logic shown in the solution diagram. We show you the relevant code blocks in this breakdown that match with the diagram. You can see the complete code for the solution in the GitHub repository.

To implement this solution, complete the following steps:

  1. Set up a Redshift cluster. For this post, we use an RA3 type cluster.
  2. Load the TICKIT sales dataset into the Redshift cluster. For instructions, see Load data from Amazon S3 to Amazon Redshift.
  3. To confirm that Amazon Redshift access is private and restricted only to your VPC, refer to the steps in Enable private access to Amazon Redshift from your client applications in another VPC.
  4. Set up a SageMaker domain, making sure it has the appropriate permissions to interact with Amazon Redshift.
  5. Clone the following GitHub repository into SageMaker Studio Classic.
  6. The first step is to deploy the Mixtral 8x7B Instruct SageMaker endpoint. We use the default ml.g5.48xlarge instance. Make sure that your service quota for ml.g5.48xlarge for endpoint usage is at least 1.
    # Note this requires an ml.g5.48xlarge instance.
    from sagemaker.jumpstart.model import JumpStartModel

    model_id = "huggingface-llm-mixtral-8x7b-instruct"
    model = JumpStartModel(model_id=model_id)
    # MIXTRAL_ENDPOINT is the endpoint name you choose for the deployment
    predictor = model.deploy(endpoint_name=MIXTRAL_ENDPOINT)
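
After the endpoint is in service, you can send a quick test prompt to confirm it responds. The following is a minimal sketch that assumes the standard JumpStart text generation payload format; parameter names such as max_new_tokens may vary by model version.

    # Quick smoke test of the deployed endpoint (payload format assumed from the
    # standard JumpStart text generation interface)
    response = predictor.predict({
        "inputs": "Write a SQL query that returns the current date.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.1},
    })
    print(response)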

  7. Set up the connectivity to the Redshift cluster. Make sure to replace these placeholders with your Redshift identifiers. For security purposes, you should have the credentials secured using AWS Secrets Manager. For instructions, see Enhance your security posture by storing Amazon Redshift admin credentials without human intervention using AWS Secrets Manager integration.
    redshift_client = boto3.client('redshift-data')
    CLUSTER_IDENTIFIER = 'redshift-cluster-1'
    DATABASE = 'dev'
    DB_USER = 'awsuser'
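
For example, rather than hard-coding the database user, you could resolve the credentials at runtime from Secrets Manager. The following is a minimal sketch; the secret name and JSON key are illustrative placeholders, not values from the solution code.

    import json
    import boto3

    secrets_client = boto3.client('secretsmanager')
    # 'redshift/tickit-credentials' and the 'username' key are illustrative names
    secret = secrets_client.get_secret_value(SecretId='redshift/tickit-credentials')
    DB_USER = json.loads(secret['SecretString'])['username']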

  8. Set up the natural language question and the prompt parameters for the model:
    prompt = "What are the top five seller names in San Diego, based on the number of tickets sold in 2008?"
    
    params={'sql-len':700,'text-token':500,'tables':tables,'db':schm,'temp':0.01,
    'model_id':'mixtral','prompt':prompt}
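
The tables and schm values referenced in params identify the schema and the tables the model is allowed to query. For the TICKIT dataset they could look like the following sketch, where the schema name is an assumption based on a default setup.

    # TICKIT tables loaded into the cluster; 'public' is an assumed schema name
    schm = 'public'
    tables = ['users', 'venue', 'category', 'date', 'event', 'listing', 'sales']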

The Redshift cluster is queried to generate the relevant database schema and example records, as shown in Step 2:

%%time
ress=redshift_qna(params)

# The redshift_qna helper invoked above (excerpt):
def redshift_qna(params):
    """
    Execute a Q&A process for generating SQL queries based on user questions.
    Args:
        params (dict): A dictionary containing parameters including table name, database name, prompt, etc.
    Returns:
        tuple: A tuple containing the response, generated SQL statement, and query output.
    """
    sql1=f"SELECT table_catalog,table_schema,table_name,column_name,ordinal_position,is_nullable,data_type FROM information_schema.columns WHERE table_schema='{params['db']}'"
    sql2=[]
    for table in params['tables']:
        sql2.append(f"SELECT * from dev.{params['db']}.{table} LIMIT 3")
    sqls=[sql1]+sql2

    question=params['prompt']
    results=execute_query_with_pagination(sqls, CLUSTER_IDENTIFIER, DATABASE, DB_USER)

    # Build the schema description and sample rows that are passed to the LLM prompt
    col_names=results[0].split('\n')[0]
    observations="\n".join(sorted(results[0].split('\n')[1:])).strip()
    params['schema']=f"{col_names}\n{observations}"
    params['sample']=''
    for examples in results[1:]:
        params['sample']+=f"{examples}\n\n"
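
The execute_query_with_pagination helper is part of the solution code in the GitHub repository. As a rough sketch of how such a helper could run a single statement through the Redshift Data API (error handling and pagination omitted, function name illustrative):

import time
import boto3

redshift_client = boto3.client('redshift-data')

def run_single_statement(sql, cluster_identifier, database, db_user):
    """Sketch: run one statement via the Redshift Data API and return CSV-style text."""
    resp = redshift_client.execute_statement(
        ClusterIdentifier=cluster_identifier, Database=database, DbUser=db_user, Sql=sql
    )
    statement_id = resp['Id']
    # Poll until the statement completes (a real helper would also surface failures)
    while redshift_client.describe_statement(Id=statement_id)['Status'] not in ('FINISHED', 'FAILED', 'ABORTED'):
        time.sleep(1)
    result = redshift_client.get_statement_result(Id=statement_id)
    header = ','.join(col['name'] for col in result['ColumnMetadata'])
    rows = [','.join(str(list(field.values())[0]) for field in record) for record in result['Records']]
    return '\n'.join([header] + rows)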

The generated SQL query is run on the Redshift cluster (Steps 6–8):

q_s=query_llm(prompts,200)
sql_pattern = re.compile(r'<sql>(.*?)(?:</sql>|$)', re.DOTALL)
sql_match = re.search(sql_pattern, q_s)
q_s = sql_match.group(1)
print(f" FIRST ATTEMPT SQL:\n{q_s}")
output, q_s=single_execute_query(q_s, CLUSTER_IDENTIFIER, DATABASE, DB_USER,question)

# The single_execute_query helper invoked above (excerpt):
def single_execute_query(sql_query, cluster_identifier, database, db_user, question):
    """
    Execute a single SQL query on an Amazon Redshift cluster and process the result.

    Args:
        sql_query (str): The SQL query to execute.
        cluster_identifier (str): The identifier of the Redshift cluster.
        database (str): The name of the database.
        db_user (str): The username used to authenticate with the Redshift cluster.
        question (str): A descriptive label or question associated with the query.

    Returns:
        pandas.DataFrame: DataFrame containing the processed result of the SQL query.

    """
    result_sets = []
    response = execute_query_redshift(sql_query, cluster_identifier, database, db_user)

The query might fail because of errors in the LLM-generated SQL. This is why we have a debugging step, which can iterate a certain number of times, asking the LLM to look at the Amazon Redshift error message and the previous context (user question, DB schema, table samples, and the previously generated SQL query) and generate a new query that addresses it. Guidance is provided to the model using prompt engineering and instructions to come up with a different query. The new query is then run on the cluster again. This process is configured in the sample code to repeat up to five times, or until the query runs successfully. If the query doesn’t run successfully within the specified number of retries, a failure message is returned to the user. This step is highlighted in red in the diagram. A condensed sketch of this retry loop appears after the llm_debugger function that follows.

def llm_debugger(question, statement, error, params): 
    """
    Generate debugging guidance and expected SQL correction for a PostgreSQL error.
    Args:
        question (str): The user's question or intent.
        statement (str): The SQL statement that caused the error.
        error (str): The error message encountered.
        params (dict): Additional parameters including schema, sample data, and length.
    Returns:
        str: Formatted debugging guidance and expected SQL correction.
    """
    prompts=f'''<s><<SYS>>[INST]
You are a PostgreSQL developer who is an expert at debugging errors.  

Here are the schema definition of table(s):
{params['schema']}
#############################
Here are example records for each table:
{params['sample']}
#############################
Here is the sql statement that threw the error below:
{statement}
#############################
Here is the error to debug:
{error}
#############################
Here is the intent of the user:
{params['prompt']}
<</SYS>>
First understand the error and think about how you can fix the error.
Use the provided schema and sample row to guide your thought process for a solution.
Do all this thinking inside <thinking></thinking> XML tags. This is a space for you to write down relevant content and will not be shown to the user.

Once you are done debugging, provide the correct SQL statement without any additional text.
When generating the correct SQL statement:
1. Pay attention to the schema and table name and use them correctly in your generated sql. 
2. Never query for all columns from a table unless the question says so. You must query only the columns that are needed to answer the question.
3. Wrap each column name in double quotes (") to denote them as delimited identifiers. Do not use a backslash (\) to escape underscores (_) in column names. 

Format your response as:
<sql> Correct SQL Statement </sql>[/INST]'''
    answer=query_llm(prompts,round(params['sql-len']))
    return answer
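
A condensed sketch of the retry workflow described above, using the helpers shown in this post, could look like the following; the variable names and failure handling are illustrative, and the complete logic is in the GitHub repository.

# Sketch of the debugging/retry loop (illustrative; see the repository for the full logic)
max_retries = 5
attempt = 0
while output is None and attempt < max_retries:
    attempt += 1
    # Ask the LLM to inspect the Redshift error and propose a corrected query
    corrected = llm_debugger(question, q_s, error_message, params)
    sql_match = re.search(sql_pattern, corrected)
    q_s = sql_match.group(1)
    output, q_s = single_execute_query(q_s, CLUSTER_IDENTIFIER, DATABASE, DB_USER, question)
if output is None:
    response = "Sorry, I was unable to generate a working SQL query for this question."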

If the query successfully runs, we pass the tabular results from Amazon Redshift to the LLM to interpret them and, based on the initial question, provide an answer in natural language to be returned to the user (Steps 10–13):

    if len(input_token)>28000:
        csv_rows=output.split('\n')
        chunk_rows=chunk_csv_rows(csv_rows, 20000)
        initial_summary=[]
        for chunk in chunk_rows:
            prompts=f'''<s><<SYS>>[INST]You are a helpful and truthful assistant. Your job is to provide answers based on samples of the tabular data provided.

Here is the tabular data:
#######
{chunk}
#######
<</SYS>>
Question: {question}

When providing your response:
- First, review the result to understand the information within. Then provide a complete answer to my question, based on the result.
- If you can't answer the question, please say so[/INST]'''
            initial_summary.append(qna_llm(prompts,params))
        prompts = f'''<s><<SYS>>[INST]You are a helpful and truthful assistant.

Here are multiple answers to a question, each based on a different subset of a tabular dataset:
#######
{initial_summary}
#######
<</SYS>>
Question: {question}
Based on the given question above, merge all the answers provided into a single coherent answer[/INST]'''
        response=qna_llm(prompts,params)
        
    else:        
        prompts=f'''<s><<SYS>>[INST]You are a helpful and truthful assistant. Your job is to examine a sql statement and its generated result, then provide a response to my question.

Here is the sql query:
{q_s}

Here is the corresponding sql query result:
{output}
<</SYS>>
question: {question}

When providing your response:
- First, review the sql query and the corresponding result. Then provide a complete answer to my question, based on the result.
- If you can't answer the question, please say so[/INST]'''
        response=qna_llm(prompts, params) 
    return response, q_s,output

Test the solution

Let’s explore an example run of the solution. We ask the question, “What are the top five seller names in San Diego, based on the number of tickets sold in 2008?” The following is the SQL query generated:

SELECT MAX("u"."firstname") AS firstname, MAX("u"."lastname") AS lastname, SUM("l"."numtickets") AS total_tickets_sold
FROM "users" AS u
JOIN "sales" AS s ON "u"."userid" = "s"."sellerid"
JOIN "listing" AS l ON "s"."listid" = "l"."listid"
JOIN "event" AS e ON "l"."eventid" = "e"."eventid"
JOIN "date" AS d ON "l"."dateid" = "d"."dateid"
WHERE "d"."year" = 2008 AND "u"."city" = 'San Diego'
GROUP BY "u"."userid"
ORDER BY total_tickets_sold DESC
LIMIT 5;

The following is the query response from Amazon Redshift:

firstname,lastname,total_tickets_sold
Julie,Hanson,156
Joan,Wright,115
Ocean,Sears,100
Hiram,Turner,99
Jena,Burch,90

The following is the final answer provided by the LLM:

The top five sellers in San Diego, based on the number of tickets sold in 2008, are:
1. Julie Hanson with a total of 156 tickets sold.
2. Joan Wright with a total of 115 tickets sold.
3. Ocean Sears with a total of 100 tickets sold.
4. Hiram Turner with a total of 99 tickets sold.
5. Jena Burch with a total of 90 tickets sold.

Best practices

Enhancing response efficiency in text-to-SQL systems involves incorporating several key best practices:

  • Caching parsed SQL – To improve response times and avoid reprocessing repeated queries, parsed SQL and recognized query prompts can be cached by the system. This cache can be checked before invoking the LLM for each new text query (see the sketch after this list).
  • Monitoring – Usage logs and metrics around query parsing, SQL generation latency, and result set sizes should be collected. Monitoring this data enables optimization by revealing pain points—whether from inadequate training data, limitations in prompt engineering, or data model issues.
  • Scheduled data refresh – To keep materialized view data current, refresh schedules using batch or incremental approaches are needed. The right balance mitigates the overhead of the refresh while making sure that text queries generate results using the latest data.
  • Central data catalog – Maintaining a centralized data catalog provides a unified metadata layer across data sources, which is critical for guiding LLM SQL generation. This catalog enables selecting appropriate tables and schemas to handle text queries.
  • Guardrails – Use prompt engineering to instruct the LLM not to generate SQL that would alter tables, and add logic to prevent running queries that would modify any data. One important recommendation is to use a database user role that only has read privileges.
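
As an illustration of the caching idea from the first item, a minimal in-memory cache keyed on the normalized question could look like the following sketch; a production system would more likely use a shared store such as Amazon ElastiCache.

# Sketch of caching generated SQL per normalized question (names are illustrative)
sql_cache = {}

def cached_generate_sql(question, params, generate_sql_fn):
    key = ' '.join(question.lower().split())  # normalize case and whitespace
    if key not in sql_cache:
        sql_cache[key] = generate_sql_fn(question, params)  # invoke the LLM only on a cache miss
    return sql_cache[key]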

By considering these optimization dimensions, natural language-to-SQL solutions can scale efficiently while delivering intuitive data access. As with any generative AI system, keeping an eye on performance is key while enabling more users to benefit.

These are just a few of the different best practices that you can follow. For a deeper dive, see Generating value from enterprise data: Best practices for Text2SQL and generative AI.

Clean up

To clean up your resources, complete the steps in this section.

Delete the SageMaker endpoint

To delete a SageMaker model endpoint, follow these steps:

  1. On the SageMaker console, in the navigation pane, choose Inference, then choose Endpoints.
  2. On the Endpoints page, select the endpoint you want to delete.
  3. On the Actions menu, choose Delete.
  4. On the confirmation page, choose Delete to delete the endpoint.

The endpoint deletion process will begin. You can check the endpoint status on the Endpoints page to confirm it has been deleted.

Delete the Redshift cluster

Complete the following steps to delete your Redshift cluster:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters to display your list of clusters.
  2. Choose the cluster you want to delete.
  3. On the Actions menu, choose Delete.
  4. Confirm the cluster to be deleted, then choose Delete cluster.

The cluster status will be updated as the cluster is deleted. This process usually takes a few minutes.

Conclusion

The ability to query data through intuitive natural language interfaces unlocks huge potential for business users. Instead of struggling with complex SQL syntax, teams can self-serve the analytical insights they need, on demand. This improves time-to-value while allowing less technical users to access and extract meaning from enterprise data.

As highlighted in this post, the latest advances in generative AI make robust NLQ-to-SQL systems achievable. With foundation models such as Mixtral 8x7B running on SageMaker and tools and libraries for connecting to different data sources, organizations can now have an enterprise-grade solution to convert natural language queries into efficient SQL. By eliminating the traditional SQL bottleneck, generative NLQ-to-SQL systems give back countless hours each week for analysts and non-technical roles, driving greater business agility and democratization in self-service analytics.

As generative AI continues to mature rapidly, keeping up with the latest models and optimization techniques is critical. This post only scratched the surface of what will be possible in the near future as these technologies improve. Natural language interfaces for accessing and manipulating data still have huge runways for innovation ahead. To learn more about how AWS is helping customers make their ideas a reality, refer to the Generative AI Innovation Center.


About the Authors

Jose Navarro is an AI/ML Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production. In his spare time, he loves to exercise, spend quality time with friends and family, and catch up on AI news and papers.

Prashanth Ganapathy is a Senior Solutions Architect in the Small Medium Business (SMB) segment at AWS. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them. Outside of work, Prashanth enjoys photography, travel, and trying out different cuisines.

Uchenna Egbe is an Associate Solutions Architect at AWS. He spends his free time researching about herbs, teas, superfoods, and how to incorporate them into his daily diet.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI, assisting with the overall process from ideation to production. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.

Read More

Boost inference performance for Mixtral and Llama 2 models with new Amazon SageMaker containers

In January 2024, Amazon SageMaker launched a new version (0.26.0) of Large Model Inference (LMI) Deep Learning Containers (DLCs). This version offers support for new models (including Mixture of Experts), performance and usability improvements across inference backends, as well as new generation details for increased control and prediction explainability (such as reason for generation completion and token level log probabilities).

LMI DLCs offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI allows you to apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, by just requiring the model ID and optional model parameters. With LMI DLCs on SageMaker, you can accelerate time-to-value for your generative artificial intelligence (AI) applications, offload infrastructure-related heavy lifting, and optimize large language models (LLMs) for the hardware of your choice to achieve best-in-class price-performance.

In this post, we explore the latest features introduced in this release, examine performance benchmarks, and provide a detailed guide on deploying new LLMs with LMI DLCs at high performance.

New features with LMI DLCs

In this section, we discuss new features across LMI backends, and drill down on some others that are backend-specific. LMI currently supports the following backends:

  • LMI-Distributed Library – This is the AWS framework to run inference with LLMs, inspired by open source libraries, to achieve the best possible latency and accuracy of the result
  • LMI vLLM – This is the AWS backend implementation of the memory-efficient vLLM inference library
  • LMI TensorRT-LLM toolkit – This is the AWS backend implementation of NVIDIA TensorRT-LLM, which creates GPU-specific engines to optimize performance on different GPUs
  • LMI DeepSpeed – This is the AWS adaptation of DeepSpeed, which adds true continuous batching, SmoothQuant quantization, and the ability to dynamically adjust memory during inference
  • LMI NeuronX – You can use this for deployment on AWS Inferentia2 and AWS Trainium-based instances, featuring true continuous batching and speedups, based on the AWS Neuron SDK

The following table summarizes the newly added features, both common and backend-specific.

Common across backends

          • New models supported: Mistral7B, Mixtral, Llama2-70B (NeuronX)
          • RoPE scaling support for longer contexts
          • Generation details added: generation finish reason and token-level log probability
          • Server config parameters consolidation

Backend specific

          • LMI-Distributed – Added grouping granularity for optimized GPU collectives
          • vLLM – CUDA graphs support with up to 50% performance improvement
          • TensorRT-LLM – New models supported for managed JIT compilation; support for TensorRT-LLM’s native SmoothQuant quantization
          • NeuronX – Grouped-query attention support; continuous batching performance improvements

New models supported

New popular models are supported across backends, such as Mistral-7B (all backends), the MoE-based Mixtral (all backends except Transformers-NeuronX), and Llama2-70B (Transformers-NeuronX).

Context window extension techniques

Rotary Positional Embedding (RoPE)-based context scaling is now available on the LMI-Dist, vLLM, and TensorRT-LLM backends. RoPE scaling enables the extension of a model’s sequence length during inference to virtually any size, without the need for fine-tuning.

The following are two important considerations when using RoPE:

  • Model perplexity – As the sequence length increases, so can the model’s perplexity. This effect can be partially offset by conducting minimal fine-tuning on input sequences larger than those used in the original training. For an in-depth understanding of how RoPE affects model quality, refer to Extending the RoPE.
  • Inference performance – Longer sequence lengths will consume more of the accelerator’s high bandwidth memory (HBM). This increased memory usage can adversely affect the number of concurrent requests your accelerator can handle.

Added generation details

You can now get two fine-grained details about generation results:

  • finish_reason – This gives the reason for generation completion, which can be reaching the maximum generation length, generating an end-of-sentence (EOS) token, or generating a user-defined stop token. It is returned with the last streamed sequence chunk.
  • log_probs – This returns the log probability assigned by the model for each token in the streamed sequence chunk. You can use these as a rough estimate of model confidence by computing the joint probability of a sequence as the sum of the log_probs of the individual tokens, which can be useful for scoring and ranking model outputs. Be mindful that LLM token probabilities are generally overconfident without calibration.

You can enable the generation results output by adding details=True in your input payload to LMI, leaving all other parameters unchanged:

payload = {"inputs": "your prompt",
           "parameters": {"max_new_tokens": 256, ..., "details": True}
}

Consolidated configuration parameters

Finally, LMI configuration parameters have also been consolidated. For more information about all common and backend-specific deployment configuration parameters, see Large Model Inference Configurations.

LMI-Distributed backend

At AWS re:Invent 2023, LMI-Dist added new, optimized collective operations to speed up communication between GPUs, resulting in lower latency and higher throughput for models that are too big for a single GPU. These collectives are available exclusively for SageMaker, for p4d instances.

Whereas the previous iteration only supported sharding across all 8 GPUs, LMI 0.26.0 introduces support for a tensor parallel degree of 4, in a partial all-to-all pattern. This can be combined with SageMaker inference components, with which you can granularly configure how many accelerators should be allocated to each model deployed behind an endpoint. Together, these features provide better control over the resource utilization of the underlying instance, enabling you to increase model multi-tenancy by hosting different models behind one endpoint, or fine-tune the aggregate throughput of your deployment to match your model and traffic characteristics.

The following figure compares direct all-to-all with partial all-to-all.

All to all partial collectives.

TensorRT-LLM backend

NVIDIA’s TensorRT-LLM was introduced as part of the previous LMI DLC release (0.25.0), enabling state-of-the-art GPU performance and optimizations like SmoothQuant, FP8, and continuous batching for LLMs when using NVIDIA GPUs.

TensorRT-LLM requires models to be compiled into efficient engines before deployment. The LMI TensorRT-LLM DLC can automatically handle compiling a list of supported models just-in-time (JIT), before starting the server and loading the model for real-time inference. Version 0.26.0 of the DLC grows the list of supported models for JIT compilation, introducing Baichuan, ChatGLM, GPT2, GPT-J, InternLM, Mistral, Mixtral, Qwen, SantaCoder, and StarCoder models.

JIT compilation adds several minutes of overhead to endpoint provisioning and scaling time, so it is always recommended to compile your model ahead-of-time. For a guide on how to do this and a list of supported models, see TensorRT-LLM ahead-of-time compilation of models tutorial. If your selected model isn’t supported yet, refer to TensorRT-LLM manual compilation of models tutorial to compile any other model that is supported by TensorRT-LLM.

Additionally, LMI now exposes native TensorRT-LLM SmoothQuant quantization, with parameters to control alpha and the scaling factor by token or channel. For more information about the related configurations, refer to TensorRT-LLM.

vLLM backend

The updated release of vLLM included in LMI DLC features performance improvements of up to 50% fueled by CUDA graph mode instead of eager mode. CUDA graphs accelerate GPU workloads by launching several GPU operations in one go instead of launching them individually, which reduces overheads. This is particularly effective for small models when using tensor parallelism.

The added performance comes at the trade-off of additional GPU memory consumption. CUDA graph mode is now the default for the vLLM backend, so if you are constrained on the amount of GPU memory available, you can set option.enforce_eager=True to force PyTorch eager mode.

Transformers-NeuronX backend

The updated release of NeuronX included in the LMI NeuronX DLC now supports models that feature the grouped-query attention mechanism, such as Mistral-7B and LLama2-70B. Grouped-query attention is an important optimization of the default transformer attention mechanism, where the model is trained with fewer key and value heads than query heads. This reduces the size of the KV cache on GPU memory, allowing for greater concurrency, and improving price-performance.

The following figure illustrates multi-head, grouped-query, and multi-query attention methods (source).

Diagram of grouped query attention

Different KV cache sharding strategies are available to suit different types of workloads. For more information on sharding strategies, see Grouped-query attention (GQA) support. You can enable your desired strategy (shard-over-heads, for example) with the following code:

option.group_query_attention=shard-over-heads

Additionally, the new implementation of the NeuronX DLC introduces a cache API for Transformers-NeuronX that enables access to the KV cache. It allows you to insert and remove KV cache rows for new requests while you’re handling batched inference. Before this API was introduced, the KV cache was recomputed for any newly added requests. Compared to LMI V7 (0.25.0), we have improved latency by more than 33% with concurrent requests, and support much higher throughput.

Selecting the right backend

To decide what backend to use based on the selected model and task, use the following flow chart. For individual backend user guides along with supported models, see LMI Backend User Guides.

Decision tree to decide what backend to use

Deploy Mixtral with LMI DLC with additional attributes

Let’s walk through how you can deploy the Mixtral-8x7B model with LMI 0.26.0 container and generate additional details like log_prob and finish_reason as part of the output. We also discuss how you can benefit from these additional attributes through a content generation use case.

The complete notebook with detailed instructions is available in the GitHub repo.

We start by importing the libraries and configuring the session environment:

import boto3
import sagemaker 
import json 
import io 
import numpy as np 
from sagemaker import Model, image_uris, serializers, deserializers 

role = sagemaker.get_execution_role() # execution role for the endpoint 
session = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs 
region = session._region_name # region name of the current SageMaker Studio environment

You can use SageMaker LMI containers to host models without any additional inference code. You can configure the model server either through the environment variables or a serving.properties file. Optionally, you could have a model.py file for any preprocessing or postprocessing and a requirements.txt file for any additional packages that are required to be installed.

In this case, we use the serving.properties file to configure the parameters and customize the LMI container behavior. For more details, refer to the GitHub repo. The repo explains details of the various configuration parameters that you can set. We need the following key parameters:

  • engine – Specifies the runtime engine for DJL to use. This drives the sharding and the model loading strategy in the accelerators for the model.
  • option.model_id – Specifies the Amazon Simple Storage Service (Amazon S3) URI of the pre-trained model or the model ID of a pretrained model hosted inside a model repository on Hugging Face. In this case, we provide the model ID for the Mixtral-8x7B model.
  • option.tensor_parallel_degree – Sets the number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the number of workers per model that will be started up when DJL serving runs. We set this value to max (maximum GPU on the current machine).
  • option.rolling_batch – Enables continuous batching to optimize accelerator utilization and overall throughput. For the TensorRT-LLM container, we use auto.
  • option.model_loading_timeout – Sets the timeout value for downloading and loading the model to serve inference.
  • option.max_rolling_batch_size – Sets the maximum size of the continuous batch, defining how many sequences can be processed in parallel at any given time.
%%writefile serving.properties 
engine=MPI 
option.model_id=mistralai/Mixtral-8x7B-v0.1 
option.tensor_parallel_degree=max 
option.max_rolling_batch_size=32 
option.rolling_batch=auto 
option.model_loading_timeout=7200

We package the serving.properties configuration file in the tar.gz format, so that it meets SageMaker hosting requirements. We configure the DJL LMI container with tensorrtllm as the backend engine. Additionally, we specify the latest version of the container (0.26.0).

image_uri = image_uris.retrieve(
   framework="djl-tensorrtllm",
   region=session.boto_session.region_name,
   version="0.26.0"
)

Next, we upload the local tarball (containing the serving.properties configuration file) to an S3 prefix. We use the image URI for the DJL container and the Amazon S3 location to which the model serving artifacts tarball was uploaded, to create the SageMaker model object.
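
The following is a minimal sketch of that packaging and upload step; the tarball name and S3 prefix are illustrative, and the returned S3 URI is what the next snippet refers to as code_artifact.

import tarfile

# Package serving.properties into a tarball (file and prefix names are illustrative)
with tarfile.open("mymodel.tar.gz", "w:gz") as tar:
    tar.add("serving.properties")

# Upload the tarball to S3; the returned URI is used as code_artifact below
code_artifact = session.upload_data(
    "mymodel.tar.gz", bucket=session.default_bucket(), key_prefix="mixtral-lmi/code"
)
print(f"Model serving artifacts uploaded to: {code_artifact}")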

model = Model(image_uri=image_uri, model_data=code_artifact, role=role) 

instance_type = "ml.p4d.24xlarge" 
endpoint_name = sagemaker.utils.name_from_base("mixtral-lmi-model") 

model.deploy(
   initial_instance_count=1,
   instance_type=instance_type,
   endpoint_name=endpoint_name,
   container_startup_health_check_timeout=1800
)

As part of LMI 0.26.0, you can now use two additional fine-grained details about the generated output:

  • log_probs – This is the log probability assigned by the model for each token in the streamed sequence chunk. You can use these as a rough estimate of model confidence by computing the joint probability of a sequence as the sum of the log probabilities of the individual tokens, which can be useful for scoring and ranking model outputs. Be mindful that LLM token probabilities are generally overconfident without calibration.
  • finish_reason – This is the reason for generation completion, which can be reaching the maximum generation length, generating an EOS token, or generating a user-defined stop token. This is returned with the last streamed sequence chunk.

You can enable these by passing "details"=True as part of your input to the model.

Let’s see how you can generate these details. We use a content generation example to understand their application.

We define a LineIterator helper class, which has functions to lazily fetch bytes from a response stream, buffer them, and break down the buffer into lines. The idea is to serve bytes from the buffer while fetching more bytes from the stream asynchronously.

class LineIterator:
    def __init__(self, stream):
        # Iterator to get bytes from stream
        self.byte_iterator = iter(stream)
        # Buffer stream bytes until we get a full line
        self.buffer = io.BytesIO()
        # Track current reading position within buffer
        self.read_pos = 0

    def __iter__(self):
        # Make class iterable
        return self

    def __next__(self):
        while True:
            # Seek read position within buffer
            self.buffer.seek(self.read_pos)
            # Try reading a line from current position
            line = self.buffer.readline()
            # If we have a full line
            if line and line[-1] == ord('\n'):
                # Increment reading position past this line
                self.read_pos += len(line)
                # Return the line read without newline char
                return line[:-1]
            # Fetch next chunk from stream
            try:
                chunk = next(self.byte_iterator)
            # Handle end of stream
            except StopIteration:
                # Check if we have any bytes still unread
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                # If not, raise StopIteration
                raise
            # Add fetched bytes to end of buffer
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])

Generate and use token probability as an additional detail

Consider a use case where we are generating content. Specifically, we’re tasked with writing a brief paragraph about the benefits of exercising regularly for a lifestyle-focused website. We want to generate content and output some indicative score of the confidence that the model has in the generated content.

We invoke the model endpoint with our prompt and capture the generated response. We set "details": True as a runtime parameter within the input to the model. Because the log probability is generated for each output token, we append the individual log probabilities to a list. We also capture the complete generated text from the response.

sm_client = boto3.client("sagemaker-runtime")

# Set details: True as a runtime parameter within the input.
body = {"inputs": prompt, "parameters": {"max_new_tokens":512, "details": True}}
resp = sm_client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
event_stream = resp['Body']

overall_log_prob = []

for line in LineIterator(event_stream):
    resp = json.loads(line)
    if resp['token'].get('text') != None:
        token_log_prob = resp['token']['log_prob']
        overall_log_prob.append(token_log_prob)
    elif resp['generated_text'] != None:
        generated_text= resp['generated_text']

To calculate the overall confidence score, we calculate the mean of all the individual token probabilities and subsequently get the exponential value between 0 and 1. This is our inferred overall confidence score for the generated text, which in this case is a paragraph about the benefits of regular exercising.

print(generated_text)
overall_score=np.exp(np.mean(overall_log_prob))
print(f"\n\nOverall confidence score in the generated text: {overall_score}")

This was one example of how you can generate and use log_prob in the context of a content generation use case. Similarly, you can use log_prob as a measure of confidence for classification use cases.

Alternatively, you can use it for overall output sequence or sentence-level scoring to evaluate the effect of parameters, such as temperature, on the generated output.

Generate and use finish reason as an additional detail

Let’s build on the same use case, but this time we’re tasked with writing a longer article. Additionally, we want to make sure that the output is not truncated due to generation length issues (max token length) or due to stop tokens being encountered.

To accomplish this, we use the finish_reason attribute generated in the output, monitor its value, and continue generating until the entire output is generated.

We define an inference function that takes a payload input and calls the SageMaker endpoint, streams back a response, and processes the response to extract generated text. The payload contains the prompt text as inputs and parameters like max tokens and details. The response is read in a stream and processed line by line to extract the generated text tokens into a list. We extract details like finish_reason. We call the inference function in a loop (chained requests) while adding more context each time, and track the number of tokens generated and number of requests sent until the model finishes.

def inference(payload):
    # Call SageMaker endpoint and get response stream
    resp = sm_client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(payload), ContentType="application/json")
    event_stream = resp['Body']
    text_output = []
    for line in LineIterator(event_stream):
        resp = json.loads(line) 
        # Extract text tokens if present
        if resp['token'].get('text') != None:
            token = resp['token']['text']
            text_output.append(token)  
            print(token, end='')
        # Get finish reason if details present
        if resp.get('details') != None:
            finish_reason = resp['details']['finish_reason']
            # Return extracted output, finish reason and token length
            return payload['inputs'] + ''.join(text_output), finish_reason, len(text_output)

# set details: True as a runtime parameter within the input.
payload = {"inputs": prompt,  "parameters": {"max_new_tokens":256, "details": True}} 

finish_reason = "length"
# Print initial output 
print(f"Output: {payload['inputs']}", end='')  
total_tokens = 0
total_requests = 0
while finish_reason == 'length':
    # Call inference and get extracts
    output_text, finish_reason, out_token_len = inference(payload)
    # Update payload for next request
    payload['inputs'] = output_text 
    total_tokens += out_token_len
    total_requests += 1
# Print metrics
print(f"nntotal tokens generated: {total_tokens} ntotal requests sent: {total_requests}")

As we can see, even though the max_new_token parameter is set to 256, we use the finish_reason detail attribute as part of the output to chain multiple requests to the endpoint, until the entire output is generated.

Similarly, based on your use case, you can use finish_reason to detect an insufficient output sequence length specified for a given task or an unintended completion due to a stop sequence.

Conclusion

In this post, we walked through the v0.26.0 release of the AWS LMI container. We highlighted key performance improvements, new model support, and new usability features. With these capabilities, you can better balance cost and performance characteristics while providing a better experience to your end-users.

To learn more about LMI DLC capabilities, refer to Model parallelism and large model inference. We’re excited to see how you use these new capabilities from SageMaker.


About the authors

João Moura is a Senior AI/ML Specialist Solutions Architect at AWS. João helps AWS customers – from small startups to large enterprises – train and deploy large models efficiently, and more broadly build ML platforms on AWS.

Rahul Sharma is a Senior Solutions Architect at AWS Data Lab, helping AWS customers design and build AI/ML solutions. Prior to joining AWS, Rahul has spent several years in the finance and insurance sector, helping customers build data and analytical platforms.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Jian Sheng is a Software Development Engineer at Amazon Web Services who has worked on several key aspects of machine learning systems. He has been a key contributor to the SageMaker Neo service, focusing on deep learning compilation and framework runtime optimization. Recently, he has directed his efforts and contributed to optimizing the machine learning system for large model inference.

Tyler Osterberg is a Software Development Engineer at AWS. He specializes in crafting high-performance machine learning inference experiences within SageMaker. Recently, his focus has been on optimizing the performance of Inferentia Deep Learning Containers on the SageMaker platform. Tyler excels in implementing performant hosting solutions for large language models and enhancing user experiences using cutting-edge technology.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high performance model inference on SageMaker.

Read More

Improving Content Moderation with Amazon Rekognition Bulk Analysis and Custom Moderation

Amazon Rekognition makes it easy to add image and video analysis to your applications. It’s based on the same proven, highly scalable, deep learning technology developed by Amazon’s computer vision scientists to analyze billions of images and videos daily. It requires no machine learning (ML) expertise to use and we’re continually adding new computer vision features to the service. Amazon Rekognition includes a simple, easy-to-use API that can quickly analyze any image or video file that’s stored in Amazon Simple Storage Service (Amazon S3).

Customers across industries such as advertising and marketing technology, gaming, media, and retail & e-commerce rely on images uploaded by their end-users (user-generated content or UGC) as a critical component to drive engagement on their platform. They use Amazon Rekognition content moderation to detect inappropriate, unwanted, and offensive content in order to protect their brand reputation and foster safe user communities.

In this post, we will discuss the following:

  • Content Moderation model version 7.0 and capabilities
  • How does Amazon Rekognition Bulk Analysis work for Content Moderation
  • How to improve Content Moderation prediction with Bulk Analysis and Custom Moderation

Content Moderation Model Version 7.0 and Capabilities

Amazon Rekognition Content Moderation version 7.0 adds 26 new moderation labels and expands the moderation label taxonomy from a two-tier to a three-tier label category. These new labels and the expanded taxonomy enable customers to detect fine-grained concepts on the content they want to moderate. Additionally, the updated model introduces a new capability to identify two new content types, animated and illustrated content. This allows customers to create granular rules for including or excluding such content types from their moderation workflow. With these new updates, customers can moderate content in accordance with their content policy with higher accuracy.

Let’s look at a moderation label detection example for the following image.

The following table shows the moderation labels, content type, and confidence scores returned in the API response.

Moderation Label         Taxonomy Level    Confidence Score
Violence                 L1                92.6%
Graphic Violence         L2                92.6%
Explosions and Blasts    L3                92.6%

Content Type             Confidence Score
Illustrated              93.9%

To obtain the full taxonomy for Content Moderation version 7.0, visit our developer guide.

Bulk Analysis for Content Moderation

Amazon Rekognition Content Moderation also provides batch image moderation in addition to real-time moderation using Amazon Rekognition Bulk Analysis. It enables you to analyze large image collections asynchronously to detect inappropriate content and gain insights into the moderation categories assigned to the images. It also eliminates the need for building a batch image moderation solution for customers.

You can access the bulk analysis feature either via the Amazon Rekognition console or by calling the APIs directly using the AWS CLI and the AWS SDKs. On the Amazon Rekognition console, you can upload the images you want to analyze and get results with a few clicks. Once the bulk analysis job completes, you can identify and view the moderation label predictions, such as Explicit, Non-Explicit Nudity of Intimate parts and Kissing, Violence, Drugs & Tobacco, and more. You also receive a confidence score for each label category.

Create a bulk analysis job on the Amazon Rekognition console

Complete the following steps to try Amazon Rekognition Bulk Analysis:

  1. On the Amazon Rekognition console, choose Bulk Analysis in the navigation pane.
  2. Choose Start Bulk Analysis.
  3. Enter a job name and specify the images to analyze, either by entering an S3 bucket location or by uploading images from your computer.
  4. Optionally, you can select an adapter to analyze images using the custom adapter that you have trained using Custom Moderation.
  5. Choose Start analysis to run the job.

When the process is complete, you can see the results on the Amazon Rekognition console. Also, a JSON copy of the analysis results will be stored in the Amazon S3 output location.

Amazon Rekognition Bulk Analysis API request

In this section, we guide you through creating a bulk analysis job for image moderation using programming interfaces. If your image files aren’t already in an S3 bucket, upload them to ensure access by Amazon Rekognition. Similar to creating a bulk analysis job on the Amazon Rekognition console, when invoking the StartMediaAnalysisJob API, you need to provide the following parameters:

  • OperationsConfig – These are the configuration options for the media analysis job to be created:
    • MinConfidence – The minimum confidence level with the valid range of 0–100 for the moderation labels to return. Amazon Rekognition doesn’t return any labels with a confidence level lower than this specified value.
  • Input – This includes the following:
    • S3Object – The S3 object information for the input manifest file, including the bucket and name of the file. The input file includes one JSON line for each image stored in the S3 bucket, for example: {"source-ref": "s3://MY-INPUT-BUCKET/1.jpg"}
  • OutputConfig – This includes the following:
    • S3Bucket – The S3 bucket name for the output files.
    • S3KeyPrefix – The key prefix for the output files.

See the following code:

import boto3
import os
import datetime
import time
import json
import uuid

region = boto3.session.Session().region_name
s3=boto3.client('s3')
rekognition_client=boto3.client('rekognition', region_name=region)

min_confidence = 50
input_bucket = "MY-INPUT-BUCKET"

input_file = "input_file.jsonl"
output_bucket = "MY-OUTPUT-BUCKET"
key_prefix = "moderation-results"
job_name = "bulk-analysis-demo"

job_start_response = rekognition_client.start_media_analysis_job(
    OperationsConfig={"DetectModerationLabels": {"MinConfidence": min_confidence}},
    JobName = job_name,
    Input={"S3Object": {"Bucket": input_bucket, "Name": input_file}},
    OutputConfig={"S3Bucket": output_bucket, "S3KeyPrefix": key_prefix},
)

job_id = job_start_response["JobId"]
max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = rekognition_client.get_media_analysis_job(JobId=job_id)
    job_status = job["Status"]
    if job_status in ["SUCCEEDED", "FAILED"]:
        print(f"Job {job_name} is {job_status}.")
        if job_status == "SUCCEEDED":
            print(
                f"Bulk Analysis output file copied to:n"
                f"tBucket: {job['Results']['S3Object']['Bucket']}n"
                f"tObject: {job['Results']['S3Object']['Name']}."
            )
        break
    else:
        print(f"Waiting for {job_name}. Current status is {job_status}.")
    time.sleep(10)

You can invoke the same media analysis using the following AWS CLI command:

aws rekognition start-media-analysis-job \
--operations-config "DetectModerationLabels={MinConfidence='50'}" \
--input "S3Object={Bucket=input_bucket,Name=input_file.jsonl}" \
--output-config "S3Bucket=output_bucket,S3KeyPrefix=moderation-results"

Amazon Rekognition Bulk Analysis API results

To get a list of bulk analysis jobs, you can use ListMediaAnalysisJobs. The response includes all the details about the analysis job input and output files and the status of the job:

# get the latest 10 media analysis jobs
moderation_job_list = rekognition_client.list_media_analysis_jobs(MaxResults=10, NextToken="")
for job_result in moderation_job_list["MediaAnalysisJobs"]:
    print(f'JobId: {job_result["JobId"]}, Status: {job_result["Status"]},\n'
          f'Summary: {job_result["ManifestSummary"]["S3Object"]["Name"]},\n'
          f'Result: {job_result["Results"]["S3Object"]["Name"]}\n')

You can also invoke the list-media-analysis-jobs command via the AWS CLI:

aws rekognition list-media-analysis-jobs --max-results 10

Amazon Rekognition Bulk Analysis generates two output files in the output bucket. The first file is manifest-summary.json, which includes bulk analysis job statistics and a list of errors:

{
    "version": "1.0",
    "statistics": {
      "total-json-lines": 2,
      "valid-json-lines": 2,
      "invalid-json-lines": 0
    },
    "errors": []
 }

The second file is results.json, which includes one JSON line for each analyzed image in the following format. Each result includes the top-level category (L1) of a detected label and the second-level category of the label (L2), with a confidence score between 1–100. Some taxonomy level 2 labels may have taxonomy level 3 labels (L3), allowing a hierarchical classification of the content.

{
  "source-ref": "s3://MY-INPUT-BUCKET/1.jpg",
    "detect-moderation-labels": {
    "ModerationLabels": [
      {
        "ParentName": "Products",
        "TaxonomyLevel": 3,
        "Confidence": 91.9385,
        "Name": "Pills"
      },
      {
        "ParentName": "Drugs & Tobacco",
        "TaxonomyLevel": 2,
        "Confidence": 91.9385,
        "Name": "Products"
      },
      {
        "ParentName": "",
        "TaxonomyLevel": 1,
        "Confidence": 91.9385,
        "Name": "Drugs & Tobacco"
      }
    ],
    "ModerationModelVersion": "7.0",
    "ContentTypes": [
      
    ]
  }
}
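
Because results.json is a JSON Lines file, you can post-process it with a few lines of Python. The following is a minimal sketch (the bucket and key are illustrative placeholders) that downloads the results file from Amazon S3 and prints the top-level (L1) categories detected for each image:

import json
import boto3

s3 = boto3.client("s3")

# Illustrative values; use the output bucket and key prefix of your bulk analysis job
output_bucket = "MY-OUTPUT-BUCKET"
results_key = "moderation-results/JOB-ID/results.json"

response = s3.get_object(Bucket=output_bucket, Key=results_key)
for line in response["Body"].read().decode("utf-8").splitlines():
    record = json.loads(line)
    labels = record["detect-moderation-labels"]["ModerationLabels"]
    # Keep only the top-level (L1) categories for a quick summary
    l1_categories = [label["Name"] for label in labels if label["TaxonomyLevel"] == 1]
    print(record["source-ref"], l1_categories)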

Improving Content Moderation model prediction using Bulk Analysis and Custom Moderation

You can enhance the accuracy of the Content Moderation base model with the Custom Moderation feature. With Custom Moderation, you can train a Custom Moderation adapter by uploading and annotating your own images. Adapters are modular components that can extend and enhance the capabilities of the Amazon Rekognition deep learning model. A straightforward way to annotate your images is to verify the predictions of your bulk analysis job and use those verifications to train a custom adapter. To verify the prediction results, complete the following steps:

  1. On the Amazon Rekognition console, choose Bulk Analysis in the navigation pane.
  2. Choose the bulk analysis job, then choose Verify predictions.

On the Verify prediction page, you can see all the images evaluated in this job and the predicted labels.

  3. Select each image's label as present (check mark) to validate a true positive, or mark it as non-present (X mark) to invalidate the assigned label (that is, the label prediction is a false positive).
  4. If the appropriate label is not assigned to the image (that is, a false negative), you can also select and assign the correct labels to the image.

Based on your verifications, false positives and false negatives are updated in the verification statistics. You can use these verifications to train a Custom Moderation adapter, which allows you to enhance the accuracy of the content moderation predictions.

  5. As a prerequisite, training a Custom Moderation adapter requires you to verify at least 20 false positives or 50 false negatives for each moderation label that you want to improve. After you verify 20 false positives or 50 false negatives, you can choose Train an adapter.

You can later use Custom Moderation adapters to analyze your images, either by selecting the custom adapter when creating a new bulk analysis job on the console or by passing the adapter's unique adapter ID through the API.
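
As an illustration, the following minimal sketch starts a bulk analysis job that uses a Custom Moderation adapter; the adapter ARN is a placeholder, and we assume the adapter is supplied through the ProjectVersion field of the moderation configuration (the other variables are the ones defined earlier in this post):

# Placeholder ARN of a trained Custom Moderation adapter
adapter_arn = "arn:aws:rekognition:us-east-1:111122223333:project/my-adapter/version/my-adapter.2024-01-01/1234567890123"

job_start_response = rekognition_client.start_media_analysis_job(
    OperationsConfig={
        "DetectModerationLabels": {
            "MinConfidence": min_confidence,
            "ProjectVersion": adapter_arn,  # use the custom adapter instead of the base model only
        }
    },
    JobName="bulk-analysis-with-adapter",
    Input={"S3Object": {"Bucket": input_bucket, "Name": input_file}},
    OutputConfig={"S3Bucket": output_bucket, "S3KeyPrefix": key_prefix},
)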

Summary

In this post, we provided an overview of Content Moderation version 7.0, Bulk Analysis for Content Moderation, and how to improve Content Moderation predictions using Bulk Analysis and Custom Moderation. To try the new moderation labels and bulk analysis, log in to your AWS account and check out the Amazon Rekognition console for Image Moderation and Bulk Analysis.


About the authors

Mehdy Haghy is a Senior Solutions Architect at AWS WWCS team, specializing in AI and ML on AWS. He works with enterprise customers, helping them migrate, modernize, and optimize their workloads for the AWS cloud. In his spare time, he enjoys cooking Persian foods and electronics tinkering.

Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.

Maria Handoko is a Senior Product Manager at AWS. She focuses on helping customers solve their business challenges through machine learning and computer vision. In her spare time, she enjoys hiking, listening to podcasts, and exploring different cuisines.

Read More

Understanding and predicting urban heat islands at Gramener using Amazon SageMaker geospatial capabilities

Understanding and predicting urban heat islands at Gramener using Amazon SageMaker geospatial capabilities

This is a guest post co-authored by Shravan Kumar and Avirat S from Gramener.

Gramener, a Straive company, contributes to sustainable development by focusing on agriculture, forestry, water management, and renewable energy. By providing authorities with the tools and insights they need to make informed decisions about environmental and social impact, Gramener is playing a vital role in building a more sustainable future.

Urban heat islands (UHIs) are areas within cities that experience significantly higher temperatures than their surrounding rural areas. UHIs are a growing concern because they can lead to various environmental and health issues. To address this challenge, Gramener has developed a solution that uses spatial data and advanced modeling techniques to understand and mitigate the following UHI effects:

  • Temperature discrepancy – UHIs can cause urban areas to be hotter than their surrounding rural regions.
  • Health impact – Higher temperatures in UHIs contribute to a 10–20% increase in heat-related illnesses and fatalities.
  • Energy consumption – UHIs amplify air conditioning demands, resulting in an up to 20% surge in energy consumption.
  • Air quality – UHIs worsen air quality, leading to elevated levels of smog and particulate matter, which can increase respiratory problems.
  • Economic impact – UHIs can result in billions of dollars in additional energy costs, infrastructure damage, and healthcare expenditures.

Gramener’s GeoBox solution empowers users to effortlessly tap into and analyze public geospatial data through its powerful API, enabling seamless integration into existing workflows. This streamlines exploration and saves valuable time and resources, allowing communities to quickly identify UHI hotspots. GeoBox then transforms raw data into actionable insights presented in user-friendly formats like raster, GeoJSON, and Excel, ensuring clear understanding and immediate implementation of UHI mitigation strategies. This empowers communities to make informed decisions and implement sustainable urban development initiatives, ultimately supporting citizens through improved air quality, reduced energy consumption, and a cooler, healthier environment.

This post demonstrates how Gramener’s GeoBox solution uses Amazon SageMaker geospatial capabilities to perform earth observation analysis and unlock UHI insights from satellite imagery. SageMaker geospatial capabilities make it straightforward for data scientists and machine learning (ML) engineers to build, train, and deploy models using geospatial data. SageMaker geospatial capabilities allow you to efficiently transform and enrich large-scale geospatial datasets, and accelerate product development and time to insight with pre-trained ML models.

Solution overview

GeoBox aims to analyze and predict the UHI effect by harnessing spatial characteristics. It helps in understanding how proposed infrastructure and land use changes can impact UHI patterns and identifies the key factors influencing UHI. This analytical model provides accurate estimates of land surface temperature (LST) at a granular level, allowing Gramener to quantify changes in the UHI effect based on parameters such as the spectral indexes and datasets used.

GeoBox provides city departments with the following benefits:

  • Improved climate adaptation planning – Informed decisions reduce the impact of extreme heat events.
  • Support for green space expansion – More green spaces enhance air quality and quality of life.
  • Enhanced interdepartmental collaboration – Coordinated efforts improve public safety.
  • Strategic emergency preparedness – Targeted planning reduces the potential for emergencies.
  • Health services collaboration – Cooperation leads to more effective health interventions.

Solution workflow

In this section, we discuss how the different components work together, from data acquisition to spatial modeling and forecasting, serving as the core of the UHI solution. The solution follows a structured workflow, with a primary focus on addressing UHIs in a city in Canada.

Phase 1: Data pipeline

The Landsat 8 satellite captures detailed imagery of the area of interest every 15 days at 11:30 AM, providing a comprehensive view of the city’s landscape and environment. A grid system is established with a 48-meter grid size using Mapbox’s Supermercado Python library at zoom level 19, enabling precise spatial analysis.

Data Pipeline

Phase 2: Exploratory analysis

Integrating infrastructure and population data layers, Geobox empowers users to visualize the city’s variable distribution and derive urban morphological insights, enabling a comprehensive analysis of the city’s structure and development.

Also, Landsat imagery from phase 1 is used to derive insights like the Normalized Difference Vegetation Index (NDVI) and Normalized Difference Built-up Index (NDBI), with data meticulously scaled to the 48-meter grid for consistency and accuracy.

Exploratory Analysis

The following variables are used:

  • Land surface temperature
  • Building site coverage
  • NDVI
  • Building block coverage
  • NDBI
  • Building area
  • Albedo
  • Building count
  • Modified Normalized Difference Water Index (MNDWI)
  • Building height
  • Number of floors and floor area
  • Floor area ratio

Phase 3: Analytics model

This phase comprises three modules, employing ML models on data to gain insights into LST and its relationship with other influential factors:

  • Module 1: Zonal statistics and aggregation – Zonal statistics play a vital role in computing statistics using values from the value raster. It involves extracting statistical data for each zone based on the zone raster. Aggregation is performed at a 100-meter resolution, allowing for a comprehensive analysis of the data.
  • Module 2: Spatial modeling – Gramener evaluated three regression models (linear, spatial, and spatial fixed effects) to unravel the correlation between Land Surface Temperature (LST) and other variables. Among these models, the spatial fixed effect model yielded the highest mean R-squared value, particularly for the timeframe spanning 2014 to 2020.
  • Module 3: Variables forecasting – To forecast variables in the short term, Gramener employed exponential smoothing techniques. These forecasts aided in understanding future LST values and their trends. Additionally, they delved into long-term scale analysis by using Representative Concentration Pathway (RCP8.5) data to predict LST values over extended periods.

Analytics model

Data acquisition and preprocessing

To implement the modules, Gramener used the SageMaker geospatial notebook within Amazon SageMaker Studio. The geospatial notebook kernel is pre-installed with commonly used geospatial libraries, enabling direct visualization and processing of geospatial data within the Python notebook environment.

Gramener employed various datasets to predict LST trends, including building assessment and temperature data, as well as satellite imagery. The key to the UHI solution was using data from the Landsat 8 satellite. This Earth-imaging satellite, a joint venture of USGS and NASA, served as a fundamental component in the project.

With the SearchRasterDataCollection API, SageMaker provides a purpose-built functionality to facilitate the retrieval of satellite imagery. Gramener used this API to retrieve Landsat 8 satellite data for the UHI solution.

The SearchRasterDataCollection API uses the following input parameters:

  • Arn – The Amazon Resource Name (ARN) of the raster data collection used in the query
  • AreaOfInterest – A GeoJSON polygon representing the area of interest
  • TimeRangeFilter – The time range of interest, denoted as {StartTime: <string>, EndTime: <string>}
  • PropertyFilters – Supplementary property filters, such as specifications for maximum acceptable cloud cover, can also be incorporated

The following example demonstrates how Landsat 8 data can be queried via the API:

search_params = {
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/gmqa64dcu2g9ayx1", # NASA/USGS Landsat
    "RasterDataCollectionQuery": {
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": coordinates
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2014-01-01T00:00:00Z",
            "EndTime": "2020-12-31T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 20.0}}}],
            "LogicalOperator": "AND",
        }
    },
}

response = geospatial_client.search_raster_data_collection(**search_params)

To process large-scale satellite data, Gramener used Amazon SageMaker Processing with the geospatial container. SageMaker Processing enables the flexible scaling of compute clusters to accommodate tasks of varying sizes, from processing a single city block to managing planetary-scale workloads. Traditionally, manually creating and managing a compute cluster for such tasks was both costly and time-consuming, particularly due to the complexities involved in standardizing an environment suitable for geospatial data handling.

Now, with the specialized geospatial container in SageMaker, managing and running clusters for geospatial processing has become more straightforward. This process requires minimal coding effort: you simply define the workload, specify the location of the geospatial data in Amazon Simple Storage Service (Amazon S3), and select the appropriate geospatial container. SageMaker Processing then automatically provisions the necessary cluster resources, facilitating the efficient run of geospatial tasks on scales that range from city level to continent level.

Processing

SageMaker fully manages the underlying infrastructure required for the processing job. It allocates cluster resources for the duration of the job and removes them upon job completion. Finally, the results of the processing job are saved in the designated S3 bucket.

A SageMaker Processing job using the geospatial image can be configured as follows from within the geospatial notebook:

from sagemaker import get_execution_role
from sagemaker.sklearn.processing import ScriptProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

execution_role_arn = get_execution_role()

geospatial_image_uri = '081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest'
processor = ScriptProcessor(
    command=['python3'],
    image_uri=geospatial_image_uri,
    role=execution_role_arn,
    instance_count=20,
    instance_type='ml.m5.xlarge',
    base_job_name='geospatial-processing-spectral-indices'
)

The instance_count parameter defines how many instances the processing job should use, and the instance_type defines what type of instance should be used.

The following example shows how a Python script is run on the processing job cluster. When the run command is invoked, the cluster starts up and automatically provisions the necessary cluster resources:

processor.run(
    code='calculate_variables.py',
    inputs=[
        ProcessingInput(
            source=s3_manifest_url,
            destination='/opt/ml/processing/input_data/',
            s3_data_type="ManifestFile",
            s3_data_distribution_type="ShardedByS3Key"
        ),
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/output_data/',
            destination=s3_output_prefix_url
        )
    ]
)

Spatial modeling and LST predictions

In the processing job, a range of variables, including top-of-atmosphere spectral radiance, brightness temperature, and reflectance from Landsat 8, are computed. Additionally, morphological variables such as floor area ratio (FAR), building site coverage, building block coverage, and Shannon’s Entropy Value are calculated.

The following code demonstrates how this band arithmetic can be performed:

import concurrent.futures

def calculate_ndvi(nir08, red):
    return (nir08 - red) / (nir08 + red) 
 
def calculate_ndbi(swir16, nir08): 
    return (swir16 - nir08) / (swir16 + nir08) 
 
def calculate_st(bt): 
    return ((bt * 0.00341802) + 149.0) - 273 
 
def indices_calc(data): 
    with concurrent.futures.ThreadPoolExecutor() as executor: 
        ndvi_future = executor.submit(calculate_ndvi, data.sel(band="SR_B5"), data.sel(band="SR_B4")) 
        ndbi_future = executor.submit(calculate_ndbi, data.sel(band="SR_B6"), data.sel(band="SR_B5")) 
        st_future = executor.submit(calculate_st, data.sel(band="ST_B10")) 
 
        ndvi = ndvi_future.result() 
        ndbi = ndbi_future.result() 
        st = st_future.result() 
 
    ndvi.attrs = data.attrs 
    ndbi.attrs = data.attrs 
    st.attrs = data.attrs 
 
    return ndvi, ndbi, st 

After the variables have been calculated, zonal statistics are performed to aggregate the data by grid. This involves calculating statistics based on the values of interest within each zone. For these computations, a grid size of approximately 100 meters was used.

# datacube, hexgrid_utm, and DATA are assumed to be defined earlier in the pipeline
def process_iteration(st, ndvi, ndmi, date, city_name): 
    datacube['st'] = (st.dims, st.values) 
    datacube['ndvi'] = (ndvi.dims, ndvi.values) 
    datacube['ndmi'] = (ndmi.dims, ndmi.values) 
    df = datacube.groupby("id").mean().to_dataframe().reset_index() 
    merged_grid = hexgrid_utm.join(df, on='id', how='left', lsuffix='_')[['id', 'hex_id', 'geometry', 'st', 'ndvi', 'ndmi']] 
    merged_grid.to_file(f"{DATA}/{city_name}/{city_name}_outputs_{date}.geojson", driver='GeoJSON') 
    print("Working on:", date) 
 
def iterative_op(city_json, st, ndvi, ndmi, city_name): 
    with concurrent.futures.ThreadPoolExecutor() as executor: 
        futures = [ 
            executor.submit(process_iteration, st[i], ndvi[i], ndmi[i], date, city_name) 
            for i, _ in enumerate(city_json.time) 
            for date in city_json.date 
        ] 
        for future in concurrent.futures.as_completed(futures): 
            future.result() 
 
    print('Process completed') 

After aggregating the data, spatial modeling is performed. Gramener used spatial regression methods, such as linear regression and spatial fixed effects, to account for spatial dependence in the observations. This approach facilitates modeling the relationship between variables and LST at a micro level.

The following code illustrates how such spatial modeling can be run:

# Imports assumed for this snippet; grids (the grid GeoDataFrame) is defined elsewhere in the pipeline
import pandas as pd
import statsmodels.formula.api as smf
from libpysal import weights
from libpysal.weights import KNN

features = [ 
    'ndvi', 
    'ndbi', 
    'st', 
    'build_count', 
    'bbc' 
] 
 
def compute_spatial_weights(df, k=8): 
    knn = KNN.from_dataframe(df, k=k) 
    return df[features].apply(lambda y: weights.spatial_lag.lag_spatial(knn, y)).rename(columns=lambda c: 'w_' + c) 
 
def ordinary_least_squares(df_year, spatial=False): 
    formula = f"lst ~ {' + '.join(features)}"  
    if spatial: 
        df_year = df_year.join(compute_spatial_weights(df_year)) 
        formula += f" + {' + '.join(['w_' + f for f in features])}"  
     
    return smf.ols(formula, data=df_year).fit() 
 
def process(df, year): 
    df_year = pd.merge(df[df['year'] == year].fillna(0), grids[['idx', 'name']], on='idx') 
    ols_model = ordinary_least_squares(df_year) 
    ols_spatial_model = ordinary_least_squares(df_year, spatial=True) 
    ols_spatial_fe_model = ordinary_least_squares(df_year, spatial=True) 
     
    return { 
        'year': year, 
        'ols_model': ols_model, 
        'ols_spatial_model': ols_spatial_model, 
        'ols_spatial_fe_model': ols_spatial_fe_model, 
        'ols_r2': [ols_model.rsquared, ols_spatial_model.rsquared, ols_spatial_fe_model.rsquared] 
    } 

Gramener used exponential smoothing to predict the LST values. Exponential smoothing is an effective method for time series forecasting that applies weighted averages to past data, with the weights decreasing exponentially over time. This method is particularly effective at smoothing out data to identify trends and patterns. By using exponential smoothing, it becomes possible to visualize and predict LST trends with greater precision, allowing for more accurate predictions of future values based on historical patterns.
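
The following is a minimal sketch of this technique (not Gramener's production forecasting code) that uses the Holt-Winters exponential smoothing implementation from statsmodels to forecast LST for a single grid cell; the monthly series values are illustrative:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Illustrative monthly mean LST values (degrees Celsius) for one grid cell
lst_series = pd.Series(
    [24.1, 26.3, 29.8, 32.5, 31.9, 28.4, 25.2, 23.0, 24.6, 27.1, 30.3, 33.0],
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
)

# Fit an additive-trend exponential smoothing model and forecast the next three periods
model = ExponentialSmoothing(lst_series, trend="add", seasonal=None).fit()
print(model.forecast(3))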

To visualize the predictions, Gramener used the SageMaker geospatial notebook with open-source geospatial libraries to overlay model predictions on a base map and provide layered visualizations of geospatial datasets directly within the notebook.

Visualization

Conclusion

This post demonstrated how Gramener is empowering clients to make data-driven decisions for sustainable urban environments. With SageMaker, Gramener achieved substantial time savings in UHI analysis, reducing processing time from weeks to hours. This rapid insight generation allows Gramener’s clients to pinpoint areas requiring UHI mitigation strategies, proactively plan urban development and infrastructure projects to minimize UHI, and gain a holistic understanding of environmental factors for comprehensive risk assessment.

Discover the potential of integrating Earth observation data in your sustainability projects with SageMaker. For more information, refer to Get started with Amazon SageMaker geospatial capabilities.


About the Authors

Abhishek Mittal is a Solutions Architect for the worldwide public sector team with Amazon Web Services (AWS), where he primarily works with ISV partners across industries providing them with architectural guidance for building scalable architecture and implementing strategies to drive adoption of AWS services. He is passionate about modernizing traditional platforms and security in the cloud. Outside work, he is a travel enthusiast.

Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in domains such as autonomous driving.

Shravan Kumar is a Senior Director of Client Success at Gramener, with a decade of experience in business analytics, data evangelism, and forging deep client relations. He has a solid foundation in client and account management within the realm of data analytics, AI, and ML.

Avirat S is a geospatial data scientist at Gramener, leveraging AI/ML to unlock insights from geographic data. His expertise lies in disaster management, agriculture, and urban planning, where his analysis informs decision-making processes.

Read More

Build a news recommender application with Amazon Personalize

Build a news recommender application with Amazon Personalize

With a multitude of articles, videos, audio recordings, and other media created daily across news media companies, readers of all types—individual consumers, corporate subscribers, and more—often find it difficult to find news content that is most relevant to them. Delivering personalized news and experiences to readers can help solve this problem, and create more engaging experiences. However, delivering truly personalized recommendations presents several key challenges:

  • Capturing diverse user interests – News can span many topics and even within specific topics, readers can have varied interests.
  • Addressing limited reader history – Many news readers have sparse activity histories. Recommenders must quickly learn preferences from limited data to provide value.
  • Timeliness and trending – Daily news cycles mean recommendations must balance personalized content with the discovery of new, popular stories.
  • Changing interests – Readers’ interests can evolve over time. Systems have to detect shifts and adapt recommendations accordingly.
  • Explainability – Providing transparency into why certain stories are recommended builds user trust.

The ideal news recommendation system understands the individual and responds to the broader news climate and audience. Tackling these challenges is key to effectively connecting readers with content they find informative and engaging.

In this post, we describe how Amazon Personalize can power a scalable news recommender application. This solution was implemented at a Fortune 500 media customer in H1 2023 and can be reused for other customers interested in building news recommenders.

Solution overview

Amazon Personalize is a great fit to power a news recommendation engine because of its ability to provide real-time and batch personalized recommendations at scale. Amazon Personalize offers a variety of recommendation recipes (algorithms), such as the User Personalization and Trending Now recipes, which are particularly suitable for training news recommender models. The User Personalization recipe analyzes each user’s preferences based on their engagement with content over time. This results in customized news feeds that surface the topics and sources most relevant to an individual user. The Trending Now recipe complements this by detecting rising trends and popular news stories in real time across all users. Combining recommendations from both recipes allows the recommendation engine to balance personalization with the discovery of timely, high-interest stories.

The following diagram illustrates the architecture of a news recommender application powered by Amazon Personalize and supporting AWS services.

This solution has the following limitations:

  • Providing personalized recommendations for just-published articles (articles published a few minutes ago) can be challenging. We describe how to mitigate this limitation later in this post.
  • Amazon Personalize has a fixed number of interactions and items dataset features that can be used to train a model.
  • At the time of writing, Amazon Personalize doesn’t provide recommendation explanations at the user level.

Let’s walk through each of the main components of the solution.

Prerequisites

To implement this solution, you need the following:

  • Historical and real-time user click data for the interactions dataset
  • Historical and real-time news article metadata for the items dataset

Ingest and prepare the data

To train a model in Amazon Personalize, you need to provide training data. In this solution, you use two types of Amazon Personalize training datasets: the interactions dataset and items dataset. The interactions dataset contains data on user-item-timestamp interactions, and the items dataset contains features on the recommended articles.

You can take two different approaches to ingest training data:

  • Batch ingestion – You can use AWS Glue to transform and ingest interactions and items data residing in an Amazon Simple Storage Service (Amazon S3) bucket into Amazon Personalize datasets. AWS Glue performs extract, transform, and load (ETL) operations to align the data with the Amazon Personalize datasets schema. When the ETL process is complete, the output file is placed back into Amazon S3, ready for ingestion into Amazon Personalize via a dataset import job.
  • Real-time ingestion – You can use Amazon Kinesis Data Streams and AWS Lambda to ingest real-time data incrementally. A Lambda function performs the same data transformation operations as the batch ingestion job at the individual record level, and ingests the data into Amazon Personalize using the PutEvents and PutItems APIs.

In this solution, you can also ingest certain items and interactions data attributes into Amazon DynamoDB. You can use these attributes during real-time inference to filter recommendations by business rules. For example, article metadata may contain company and industry names in the article. To proactively recommend articles on companies or industries that users are reading about, you can record how frequently readers are engaging with articles about specific companies and industries, and use this data with Amazon Personalize filters to further tailor the recommended content. We discuss more about how to use items and interactions data attributes in DynamoDB later in this post.

The following diagram illustrates the data ingestion architecture.
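
As an illustration of the real-time ingestion path, the following minimal sketch sends a single click event to Amazon Personalize with the PutEvents API; the tracking ID, user ID, session ID, and item ID are placeholders:

import time
import boto3

personalize_events = boto3.client("personalize-events")

personalize_events.put_events(
    trackingId="YOUR-EVENT-TRACKER-ID",  # from your Amazon Personalize event tracker
    userId="reader-123",
    sessionId="session-456",
    eventList=[
        {
            "eventType": "click",
            "itemId": "article-789",
            "sentAt": int(time.time()),
        }
    ],
)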

Train the model

The bulk of the model training effort should focus on the User Personalization model, because it can use all three Amazon Personalize datasets (whereas the Trending Now model only uses the interactions dataset). We recommend running experiments that systematically vary different aspects of the training process. For the customer that implemented this solution, the team ran over 30 experiments. This included modifying the interactions and items dataset features, adjusting the length of interactions history provided to the model, tuning Amazon Personalize hyperparameters, and evaluating whether an explicit user’s dataset improved offline performance (relative to the increase in training time).

Each model variation was evaluated based on metrics reported by Amazon Personalize on the training data, as well as custom offline metrics on a holdout test dataset. Standard metrics to consider include mean average precision (MAP) @ K (where K is the number of recommendations presented to a reader), normalized discounted cumulative gain, mean reciprocal rank, and coverage. For more information about these metrics, see Evaluating a solution version with metrics. We recommend prioritizing MAP @ K out of these metrics, which captures the average number of articles a reader clicked on out of the top K articles recommended to them, because the MAP metric is a good proxy for (real) article clickthrough rates. K should be selected based on the number of articles a reader can view on a desktop or mobile webpage without having to scroll, allowing you to evaluate recommendation effectiveness with minimal reader effort. Implementing custom metrics, such as recommendation uniqueness (which describes how unique the recommendation output was across the pool of candidate users), can also provide insight into recommendation effectiveness.
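
The following is a minimal sketch of a custom MAP @ K computation on a holdout set (an illustrative implementation, not the exact metric code used in the project):

def average_precision_at_k(recommended, clicked, k):
    """Average precision at K for one reader. recommended is the ranked list of
    article IDs; clicked is the set of articles the reader actually engaged with."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in clicked:
            hits += 1
            score += hits / rank
    return score / min(len(clicked), k) if clicked else 0.0

def map_at_k(all_recommended, all_clicked, k=5):
    """Mean average precision at K across all readers in the holdout set."""
    scores = [average_precision_at_k(r, c, k) for r, c in zip(all_recommended, all_clicked)]
    return sum(scores) / len(scores)

# Example: two readers, each shown a ranked list of articles
print(map_at_k([["a1", "a2", "a3"], ["a4", "a5"]], [{"a2"}, {"a4", "a9"}], k=5))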

With Amazon Personalize, the experimental process allows you to determine the optimal set of dataset features for both the User Personalization and Trending Now models. The Trending Now model exists within the same Amazon Personalize dataset group as the User Personalization model, so it uses the same set of interactions dataset features.

Generate real-time recommendations

When a reader visits a news company’s webpage, an API call will be made to the news recommender via Amazon API Gateway. This triggers a Lambda function that calls the Amazon Personalize models’ endpoints to get recommendations in real time. During inference, you can use filters to filter the initial recommendation output based on article or reader interaction attributes. For example, if “News Topic” (such as sports, lifestyle, or politics) is an article attribute, you can restrict recommendations to specific news topics if that is a product requirement. Similarly, you can use filters on reader interaction events, such as excluding articles a reader has already read.
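
As a sketch of this inference step (the campaign and filter ARNs are placeholders), the Lambda function could call the Amazon Personalize runtime as follows:

import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:111122223333:campaign/news-user-personalization",
    userId="reader-123",
    numResults=10,
    # Placeholder filter that restricts results to a given news topic
    filterArn="arn:aws:personalize:us-east-1:111122223333:filter/by-news-topic",
    filterValues={"TOPIC": '"sports"'},
)
article_ids = [item["itemId"] for item in response["itemList"]]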

One key challenge with real-time recommendations is effectively including just-published articles (also called cold items) into the recommendation output. Just-published articles don’t have any historical interaction data that recommenders normally rely on, and recommendation systems need sufficient processing time to assess how relevant just-published articles are to a specific user (even if only using user-item relationship signals).

Amazon Personalize can natively auto detect and recommend new articles ingested into the items dataset every 2 hours. However, because this use case is focused on news recommendations, you need a way to recommend new articles as soon as they’re published and ready for reader consumption.

One way to solve this problem is by designing a mechanism to randomly insert just-published articles into the final recommendation output for each reader. You can add a feature to control what percent of articles in the final recommendation set were just-published articles, and similar to the original recommendation output from Amazon Personalize, you can filter just-published articles by article attributes (such as “News Topic”) if it is a product requirement. You can track interactions on just-published articles in DynamoDB as they start trickling in to the system, and prioritize the most popular just-published articles during recommendation postprocessing, until the just-published articles are detected and processed by the Amazon Personalize models.

After you have your final set of recommended articles, this output is submitted to another postprocessing Lambda function that checks the output to see if it aligns with pre-specified business rules. These can include checking whether recommended articles meet webpage layout specifications, if recommendations are served in a web browser frontend, for example. If needed, articles can be reranked to ensure business rules are met. We recommend reranking by implementing a function that allows higher-ranking articles to only fall down in ranking one place at a time until all business rules are met, providing minimal relevancy loss for readers. The final list of postprocessed articles is returned to the web service that initiated the request for recommendations.
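
A minimal sketch of such a reranking pass could look like the following; the violates_rule predicate is a placeholder for the business-rules check described above:

def rerank(articles, violates_rule, max_passes=10):
    """Move articles that violate a business rule down one position at a time,
    repeating until no rule is violated, to keep relevancy loss minimal."""
    articles = list(articles)
    for _ in range(max_passes):
        changed = False
        for i in range(len(articles) - 1):
            if violates_rule(articles[i], position=i):
                # Swap the offending article down exactly one place
                articles[i], articles[i + 1] = articles[i + 1], articles[i]
                changed = True
        if not changed:
            break
    return articles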

The following diagram illustrates the architecture for this step in the solution.

Generate batch recommendations

Personalized news dashboards (through real-time recommendations) require a reader to actively search for news, but in our busy lives today, sometimes it’s just easier to have your top news sent to you. To deliver personalized news articles as an email digest, you can use an AWS Step Functions workflow to generate batch recommendations. The batch recommendation workflow gathers and postprocesses recommendations from our User Personalization model or Trending Now model endpoints, giving flexibility to select what combination of personalized and trending articles teams want to push to their readers. Developers also have the option of using the Amazon Personalize batch inference feature; however, at the time of writing, creating an Amazon Personalize batch inference job doesn’t support including items ingested after an Amazon Personalize custom model has been trained, and it doesn’t support the Trending Now recipe.

During a batch inference Step Functions workflow, the list of readers is divided into batches, processed in parallel, and submitted to a postprocessing and validation layer before being sent to the email generation service. The following diagram illustrates this workflow.

Scale the recommender system

To effectively scale, you also need the news recommender to accommodate a growing number of users and increased traffic without creating any degradation in reader experience. Amazon Personalize model endpoints natively auto scale to meet increased traffic. Engineers only need to set and monitor a minimum provisioned transactions per second (TPS) variable for each Amazon Personalize endpoint.
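
For example, the minimum provisioned TPS is set when creating (or later updating) an Amazon Personalize campaign; the following sketch uses a placeholder solution version ARN:

import boto3

personalize = boto3.client("personalize")

response = personalize.create_campaign(
    name="news-user-personalization",
    solutionVersionArn="arn:aws:personalize:us-east-1:111122223333:solution/news/solution-version-1",
    minProvisionedTPS=5,  # baseline capacity; Amazon Personalize auto scales above this
)
print(response["campaignArn"])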

Beyond Amazon Personalize, the news recommender application presented here is built using serverless AWS services, allowing engineering teams to focus on delivering the best reader experience without worrying about infrastructure maintenance.

Conclusion

In this attention economy, it has become increasingly important to deliver relevant and timely content for consumers. In this post, we discussed how you can use Amazon Personalize to build a scalable news recommender, and the strategies organizations can implement to address the unique challenges of delivering news recommendations.

To learn more about Amazon Personalize and how it can help your organization build recommendation systems, check out the Amazon Personalize Developer Guide.

Happy building!


About the Authors

Bala Krishnamoorthy is a Senior Data Scientist at AWS Professional Services, where he helps customers build and deploy AI-powered solutions to solve their business challenges. He has worked with customers across diverse sectors, including media & entertainment, financial services, healthcare, and technology. In his free time, he enjoys spending time with family/friends, staying active, trying new restaurants, travel, and kickstarting his day with a steaming hot cup of coffee.

Rishi Jala is a NoSQL Data Architect with AWS Professional Services. He focuses on architecting and building highly scalable applications using NoSQL databases such as Amazon DynamoDB. Passionate about solving customer problems, he delivers tailored solutions to drive success in the digital landscape.

Read More

Nielsen Sports sees 75% cost reduction in video analysis with Amazon SageMaker multi-model endpoints

Nielsen Sports sees 75% cost reduction in video analysis with Amazon SageMaker multi-model endpoints

This is a guest post co-written with Tamir Rubinsky and Aviad Aranias from Nielsen Sports.

Nielsen Sports shapes the world’s media and content as a global leader in audience insights, data, and analytics. Through our understanding of people and their behaviors across all channels and platforms, we empower our clients with independent and actionable intelligence so they can connect and engage with their audiences—now and into the future.

At Nielsen Sports, our mission is to provide our customers—brands and rights holders—with the ability to measure the return on investment (ROI) and effectiveness of a sport sponsorship advertising campaign across all channels, including TV, online, social media, and even newspapers, and to provide accurate targeting at local, national, and international levels.

In this post, we describe how Nielsen Sports modernized a system running thousands of different machine learning (ML) models in production by using Amazon SageMaker multi-model endpoints (MMEs) and reduced operational and financial cost by 75%.

Challenges with channel video segmentation

Our technology is based on artificial intelligence (AI) and specifically computer vision (CV), which allows us to track brand exposure and identify its location accurately. For example, we identify if the brand is on a banner or a shirt. In addition, we identify the location of the brand on the item, such as the top corner of a sign or the sleeve. The following figure shows an example of our tagging system.

example of Nielsen tagging system

To understand our scaling and cost challenges, let’s look at some representative numbers. Every month, we identify over 120 million brand impressions across different channels, and the system must support the identification of over 100,000 brands and variations of different brands. We have built one of the largest databases of brand impressions in the world with over 6 billion data points.

Our media evaluation process includes several steps, as illustrated in the following figure:

  1. First, we record thousands of channels around the world using an international recording system.
  2. We stream the content in combination with the broadcast schedule (Electronic Programming Guide) to the next stage, which is segmentation and separation between the game broadcasts themselves and other content or advertisements.
  3. We perform media monitoring, where we add additional metadata to each segment, such as league scores, relevant teams, and players.
  4. We perform an exposure analysis of the brands’ visibility and then combine the audience information to calculate the valuation of the campaign.
  5. The information is delivered to the customer by a dashboard or analyst reports. The analyst is given direct access to the raw data or through our data warehouse.

media evaluation steps

Because we operate at a scale of over a thousand channels and tens of thousands of hours of video a year, we must have a scalable automation system for the analysis process. Our solution automatically segments the broadcast and knows how to isolate the relevant video clips from the rest of the content.

We do this using dedicated algorithms and models developed by us for analyzing the specific characteristics of the channels.

In total, we are running thousands of different models in production to support this mission, which is costly, incurs operational overhead, and is error-prone and slow. It took months to get models with a new architecture into production.

This is where we wanted to innovate and rearchitect our system.

Cost-effective scaling for CV models using SageMaker MMEs

Our legacy video segmentation system was difficult to test, change, and maintain. Some of the challenges include working with an old ML framework, inter-dependencies between components, and a hard-to-optimize workflow. This is because the pipeline was built on RabbitMQ, which was a stateful solution. To debug one component, such as feature extraction, we had to test the entire pipeline.

The following diagram illustrates the previous architecture.

previous architecture

As part of our analysis, we identified performance bottlenecks such as running a single model on a machine, which showed a low GPU utilization of 30–40%. We also discovered inefficient pipeline runs and scheduling algorithms for the models.

Therefore, we decided to build a new multi-tenant architecture based on SageMaker, which would implement performance optimization improvements, support dynamic batch sizes, and run multiple models simultaneously.

Each run of the workflow targets a group of videos. Each video is between 30–90 minutes long, and each group has more than five models to run.

Let’s examine an example: a video can be 60 minutes long, consisting of 3,600 images, and each image needs to be inferred by three different ML models during the first stage. With SageMaker MMEs, we can run batches of 12 images in parallel, and the full batch completes in less than 2 seconds. On a regular day, we have more than 20 groups of videos, and on a packed weekend day, we can have more than 100 groups of videos.
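
To illustrate how multiple models are served from a single endpoint, the following minimal sketch (the endpoint name, model artifact name, and image file are placeholders) invokes a SageMaker MME and selects the model for each request with the TargetModel parameter:

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("frame_0001.jpg", "rb") as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="sports-brand-detection-mme",
    TargetModel="brand-detector-v3.tar.gz",  # which model artifact on the MME to run
    ContentType="application/x-image",
    Body=payload,
)
print(response["Body"].read())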

The following diagram shows our new, simplified architecture using a SageMaker MME.

simplified architecture using a SageMaker MME

Results

With the new architecture, we achieved many of our desired outcomes and some unseen advantages over the old architecture:

  • Better runtime – By increasing batch sizes (12 videos in parallel) and running multiple models concurrently (five models in parallel), we have decreased our overall pipeline runtime by 33%, from 1 hour to 40 minutes.
  • Improved infrastructure – With SageMaker, we upgraded our existing infrastructure, and we are now using newer AWS instances with newer GPUs such as g5.xlarge. One of the biggest benefits from the change is the immediate performance improvement from using TorchScript and CUDA optimizations.
  • Optimized infrastructure usage – By having a single endpoint that can host multiple models, we can reduce both the number of endpoints and the number of machines we need to maintain, and also increase the utilization of a single machine and its GPU. For a specific task with five videos, we now use only five machines of g5 instances, which gives us 75% cost benefit from the previous solution. For a typical workload during the day, we use a single endpoint with a single machine of g5.xlarge with a GPU utilization of more than 80%. For comparison, the previous solution had less than 40% utilization.
  • Increased agility and productivity – Using SageMaker allowed us to spend less time migrating models and more time improving our core algorithms and models. This has increased productivity for our engineering and data science teams. We can now research and deploy a new ML model in under 7 days, instead of over 1 month previously. This is a 75% improvement in velocity and planning.
  • Better quality and confidence – With SageMaker A/B testing capabilities, we can deploy our models in a gradual way and be able to safely roll back. The faster lifecycle to production also increased our ML models’ accuracy and results.

The following figure shows our GPU utilization with the previous architecture (30–40% GPU utilization).

GPU utilization with the previous architecture

The following figure shows our GPU utilization with the new simplified architecture (90% GPU utilization).

GPU utilization with the new simplified architecture

Conclusion

In this post, we shared how Nielsen Sports modernized a system running thousands of different models in production by using SageMaker MMEs and reduced their operational and financial cost by 75%.

For further reading, refer to the following:


About the Authors

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Gal Goldman is a Senior Software Engineer and an Enterprise Senior Solution Architect in AWS with a passion for cutting-edge solutions. He specializes in and has developed many distributed Machine Learning services and solutions. Gal also focuses on helping AWS customers accelerate and overcome their engineering and Generative AI challenges.

Tal Panchek is a Senior Business Development Manager for Artificial Intelligence and Machine Learning with Amazon Web Services. As a BD Specialist, he is responsible for growing adoption, utilization, and revenue for AWS services. He gathers customer and industry needs and partners with AWS product teams to innovate, develop, and deliver AWS solutions.

Tamir Rubinsky leads Global R&D Engineering at Nielsen Sports, bringing vast experience in building innovative products and managing high-performing teams. His work transformed sports sponsorship media evaluation through innovative, AI-powered solutions.

Aviad Aranias is an MLOps Team Leader and Nielsen Sports Analysis Architect who specializes in crafting complex pipelines for analyzing sports event videos across numerous channels. He excels in building and deploying deep learning models to handle large-scale data efficiently. In his spare time, he enjoys baking delicious Neapolitan pizzas.

Read More

Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio

Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio

Amazon SageMaker Studio is a web-based, integrated development environment (IDE) for machine learning (ML) that lets you build, train, debug, deploy, and monitor your ML models. SageMaker Studio provides all the tools you need to take your models from data preparation to experimentation to production while boosting your productivity.

Amazon SageMaker Canvas is a powerful no-code ML tool designed for business and data teams to generate accurate predictions without writing code or having extensive ML experience. With its intuitive visual interface, SageMaker Canvas simplifies the process of loading, cleansing, and transforming datasets, and building ML models, making it accessible to a broader audience.

However, as your ML needs evolve, or if you require more advanced customization and control, you may want to transition from a no-code environment to a code-first approach. This is where the seamless integration between SageMaker Canvas and SageMaker Studio comes into play.

In this post, we present a solution for the following types of users:

  • Non-ML experts such as business analysts, data engineers, or developers, who are domain experts and are interested in low-code no-code (LCNC) tools to guide them in preparing data for ML and building ML models. This persona typically is only a SageMaker Canvas user and often relies on ML experts in their organization to review and approve their work.
  • ML experts who are interested in how LCNC tools can accelerate parts of the ML lifecycle (such as data prep), but are also likely to take a high-code approach to certain parts of the ML lifecycle (such as model building). This persona is typically a SageMaker Studio user who might also be a SageMaker Canvas user. ML experts also often play a role in reviewing and approving the work of non-ML experts for production use cases.

The utility of the solutions proposed in this post is two-fold. Firstly, by demonstrating how you can share models across SageMaker Canvas and SageMaker Studio, non-ML and ML experts can collaborate across their preferred environments, which might be a no-code environment (SageMaker Canvas) for non-experts and a high-code environment (SageMaker Studio) for experts. Secondly, by demonstrating how to share a model from SageMaker Canvas to SageMaker Studio, we show how ML experts who want to pivot from a LCNC approach for development to a high-code approach for production can do so across SageMaker environments. The solution outlined in this post is for users of the new SageMaker Studio. For users of SageMaker Studio Classic, see Collaborate with data scientists for how you can seamlessly transition between SageMaker Canvas and SageMaker Studio Classic.

Solution overview

To seamlessly transition between no-code and code-first ML with SageMaker Canvas and SageMaker Studio, we have outlined two options. You can choose the option based on your requirements. In some cases, you might decide to use both options in parallel.

  • Option 1: SageMaker Model Registry – A SageMaker Canvas user registers their model in the Amazon SageMaker Model Registry, invoking a governance workflow for ML experts to review model details and metrics, then approve or reject it, after which the user can deploy the approved model from SageMaker Canvas. This option is an automated sharing process providing you with built-in governance and approval tracking. You can view the model metrics; however, there is limited visibility on the model code and architecture. The following diagram illustrates the architecture.

Option 1: SageMaker Model Registry

  • Option 2: Notebook export – In this option, the SageMaker Canvas user exports the full notebook from SageMaker Canvas to Amazon Simple Storage Service (Amazon S3), then shares it with ML experts to import into SageMaker Studio, enabling complete visibility and customization of the model code and logic before the ML expert deploys the enhanced model. In this option, there is complete visibility of the model code and architecture with the ability for the ML expert to customize and enhance the model in SageMaker Studio. However, this option demands a manual export and import of the model notebook into the IDE. The following diagram illustrates this architecture.

Option 2: Notebook export

The following phases describe the steps for collaboration:

  • Share – The SageMaker Canvas user registers the model from SageMaker Canvas or downloads the notebook from SageMaker Canvas
  • Review – The SageMaker Studio user accesses the model through the model registry to review and run the exported notebook through JupyterLab to validate the model
  • Approval – The SageMaker Studio user approves the model from the model registry
  • Deploy – The SageMaker Studio user can deploy the model from JupyterLab, or the SageMaker Canvas user can deploy the model from SageMaker Canvas

Let’s look at the two options (model registry and notebook export) within each step in detail.

Prerequisites

Before you dive into the solution, make sure you have signed up for and created an AWS account. Then you need to create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites. You can skip this step if you already have your own version of SageMaker Studio running.

Complete the prerequisites for setting up SageMaker Canvas and create the model of your choice for your use case.

Share the model

The SageMaker Canvas user shares the model with the SageMaker Studio user by either registering it in SageMaker Model Registry, which triggers a governance workflow, or by downloading the full notebook from SageMaker Canvas and providing it to the SageMaker Studio user.

SageMaker Model Registry

To deploy using SageMaker Model Registry, complete the following steps:

  1. After a model is created in SageMaker Canvas, choose the options menu (three vertical dots) and choose Add to Model Registry.
    add to model registry
  2. Enter a name for the model group.
  3. Choose Add.
    model group name

You can now see the model is registered.
model registered

You can also see the model is pending approval.
pending approval

SageMaker notebook export

To deploy using a SageMaker notebook, complete the following steps:

  1. On the options menu, choose View Notebook.
    view notebook
  2. Choose Copy S3 URI.
    s3 uri

You can now share the S3 URI with the SageMaker Studio user.

Review the model

The SageMaker Studio user accesses the shared model through the model registry to review its details and metrics, or they can import the exported notebook into SageMaker Studio and use Jupyter notebooks to thoroughly validate the model’s code, logic, and performance.

SageMaker Model Registry

To use the model registry, complete the following steps:

  1. On the SageMaker Studio console, choose Models in the navigation pane.
  2. Choose Registered models.
  3. Choose your model.
    model registry

You can review the model details and see that the status is pending.
status pending

You can also review the different metrics to check on the model performance.
review metrics

You can view the model metrics; however, there is limited visibility on the model code and architecture. If you want complete visibility of the model code and architecture with the ability to customize and enhance the model, use the notebook export option.

SageMaker notebook export

To use the notebook export option as the SageMaker Studio user, complete the following steps.

  1. Launch SageMaker Studio and choose JupyterLab under Applications.
  2. Open the JupyterLab space. If you don’t have a JupyterLab space, you can create one.
    jupyter lab
  3. Open a terminal and run the following command to copy the notebook from Amazon S3 to SageMaker Studio (the account number in the following example is changed to awsaccountnumber):
    sagemaker-user@default:~$ aws s3 cp s3://sagemaker-us-east-1-awsaccountnumber/Canvas/default-20240130t161835/Training/output/Canvas1707947728560/sagemaker-automl-candidates/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb ./canvas.ipynb

    terminal

  4. After the notebook is downloaded, you can open the notebook and run the notebook to evaluate further.

candidate trials

Approve the model

After a comprehensive review, the SageMaker Studio user can make an informed decision to either approve or reject the model in the model registry based on their assessment of its quality, accuracy, and suitability for the intended use case.

If you registered your model through the SageMaker Canvas UI, follow the steps below to approve the model. If you exported the model notebook from the SageMaker Canvas UI, you can optionally register and approve the model using SageMaker Model Registry, but these steps are not required.

SageMaker Model Registry

As the SageMaker Studio user, when you’re comfortable with the model, you can update the status to approved. Approval happens only in SageMaker Model Registry. Complete the following steps:

  1. In SageMaker Studio, navigate to the version of the model.
  2. On the options menu, choose Update status and Approved.
    status update
  3. Enter an optional comment and choose Save and update.

Now you can see the model is approved.
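If you prefer to update the approval status programmatically rather than through the SageMaker Studio UI, the following is a minimal sketch using boto3. The model package group name is an assumption for illustration; substitute the name you entered in SageMaker Canvas.

import boto3

sm_client = boto3.client("sagemaker")

# List the versions registered in the model package group (the group name is illustrative)
packages = sm_client.list_model_packages(ModelPackageGroupName="canvas-shared-model-group")
model_package_arn = packages["ModelPackageSummaryList"][0]["ModelPackageArn"]

# Move the version from PendingManualApproval to Approved
sm_client.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Reviewed metrics and notebook in SageMaker Studio",
)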

Deploy the model

When the model is ready to deploy (it has received the necessary reviews and approvals), you have two options. If you took the model registry approach, you can deploy from either SageMaker Studio or SageMaker Canvas. If you took the notebook export approach, you can deploy from SageMaker Studio. Both deployment options are detailed below.

Deploy via SageMaker Studio

The SageMaker Studio user can deploy the model from the JupyterLab space.
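The following is a minimal sketch of deploying an approved model package from the registry with the SageMaker Python SDK, for example from a notebook in the JupyterLab space. The model package ARN, endpoint name, and instance type are placeholders; adjust them for your model.

import sagemaker
from sagemaker import ModelPackage

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder ARN: copy the approved version's ARN from SageMaker Model Registry
model_package_arn = "arn:aws:sagemaker:us-east-1:awsaccountnumber:model-package/canvas-shared-model-group/1"

model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=session,
)

# Create a real-time inference endpoint (instance type is an assumption; size it for your model)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="canvas-shared-model-endpoint",
)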

After the model is deployed, you can navigate to the SageMaker console, choose Endpoints under Inference in the navigation pane, and view the model.

Deploy via SageMaker Canvas

Alternatively, the SageMaker Canvas user can deploy the model directly from SageMaker Canvas.


After the model is deployed, you can navigate to the Endpoints page on the SageMaker console to view the model.

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas.

To avoid ongoing charges, delete the SageMaker inference endpoints. You can delete the endpoints via the SageMaker console or from the SageMaker Studio notebook using the following commands:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

Previously, you could only share models to SageMaker Canvas (or view shared SageMaker Canvas models) in SageMaker Studio Classic. In this post, we showed how to share models built in SageMaker Canvas with SageMaker Studio so that different teams can collaborate and you can pivot from a no-code to a high-code deployment path. By either using SageMaker Model Registry or exporting notebooks, ML experts and non-experts can collaborate, review, and enhance models across these platforms, enabling a smooth workflow from data preparation to production deployment.

For more information about collaborating on models using SageMaker Canvas, refer to Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas.


About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customer guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Meenakshisundaram Thandavarayan works for AWS as an AI/ML Specialist. He has a passion for designing, creating, and promoting human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker, and strives to drive business to new ways of working through innovation, incubation, and democratization.

Claire O’Brien Rajkumar is a Sr. Product Manager on the Amazon SageMaker team focused on SageMaker Canvas, the SageMaker low-code no-code workspace for ML and generative AI. SageMaker Canvas helps democratize ML and generative AI by lowering barriers to adoption for those new to ML and accelerating workflows for advanced practitioners.

Read More

Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless

Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless

The rise of contextual and semantic search has made searching straightforward for the consumers of ecommerce and retail businesses. Search engines and recommendation systems powered by generative AI can improve the product search experience exponentially by understanding natural language queries and returning more accurate results. This enhances the overall user experience, helping customers find exactly what they’re looking for.

Amazon OpenSearch Service now supports the cosine similarity metric for k-NN indexes. Cosine similarity measures the cosine of the angle between two vectors, where a smaller cosine angle denotes a higher similarity between the vectors. With cosine similarity, you can measure the orientation between two vectors, which makes it a good choice for some specific semantic search applications.
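As a concrete illustration, the following sketch shows what a k-NN index body using the cosine similarity space type could look like when created with an opensearch-py client. The index name, field names, and engine choice are assumptions for this example rather than the exact configuration used later in this post.

# Illustrative k-NN index definition; names and engine choice are assumptions for this sketch
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "image_vector": {
                "type": "knn_vector",
                "dimension": 1024,  # matches the default Titan Multimodal Embeddings output size
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",  # cosine similarity distance
                    "engine": "nmslib",
                },
            },
            "item_name": {"type": "text"},
        }
    },
}
# client.indices.create(index="products", body=index_body)  # client is an opensearch-py OpenSearch client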

In this post, we show how to build a contextual text and image search engine for product recommendations using the Amazon Titan Multimodal Embeddings model, available in Amazon Bedrock, with Amazon OpenSearch Serverless.

A multimodal embeddings model is designed to learn joint representations of different modalities like text, images, and audio. By training on large-scale datasets containing images and their corresponding captions, a multimodal embeddings model learns to embed images and texts into a shared latent space. The following is a high-level overview of how it works conceptually:

  • Separate encoders – These models have separate encoders for each modality—a text encoder for text (for example, BERT or RoBERTa), an image encoder for images (for example, a CNN), and audio encoders for audio (for example, models like Wav2Vec). Each encoder generates embeddings capturing the semantic features of its respective modality.
  • Modality fusion – The embeddings from the uni-modal encoders are combined using additional neural network layers. The goal is to learn interactions and correlations between the modalities. Common fusion approaches include concatenation, element-wise operations, pooling, and attention mechanisms.
  • Shared representation space – The fusion layers help project the individual modalities into a shared representation space. By training on multimodal datasets, the model learns a common embedding space where embeddings from each modality that represent the same underlying semantic content are closer together (the toy sketch after this list illustrates the idea).
  • Downstream tasks – The joint multimodal embeddings generated can then be used for various downstream tasks like multimodal retrieval, classification, or translation. The model uses correlations across modalities to improve performance on these tasks compared to individual modal embeddings. The key advantage is the ability to understand interactions and semantics between modalities like text, images, and audio through joint modeling.
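To make the shared representation space idea concrete, the following toy sketch compares made-up text and image vectors with cosine similarity. The numbers are invented for illustration and are not output from any real encoder.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; closer to 1 means more similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented vectors standing in for embeddings that live in one shared space
text_embedding = np.array([0.12, 0.87, 0.01, 0.45])   # e.g., the text "drinkware glass"
image_embedding = np.array([0.10, 0.90, 0.05, 0.40])  # e.g., a photo of a glass
unrelated_image = np.array([0.95, 0.02, 0.60, 0.01])  # e.g., a photo of a shoe

print(cosine_similarity(text_embedding, image_embedding))  # high: same underlying concept
print(cosine_similarity(text_embedding, unrelated_image))  # low: different concepts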

Solution overview

The solution provides an implementation for building a large language model (LLM) powered search engine prototype to retrieve and recommend products based on text or image queries. We detail the steps to use an Amazon Titan Multimodal Embeddings model to encode images and text into embeddings, ingest embeddings into an OpenSearch Service index, and query the index using the OpenSearch Service k-nearest neighbors (k-NN) functionality.

This solution includes the following components:

  • Amazon Titan Multimodal Embeddings model – This foundation model (FM) generates embeddings of the product images used in this post. With Amazon Titan Multimodal Embeddings, you can generate embeddings for your content and store them in a vector database. When an end-user submits any combination of text and image as a search query, the model generates embeddings for the search query and matches them to the stored embeddings to provide relevant search and recommendation results to end-users. You can further customize the model to enhance its understanding of your unique content and provide more meaningful results using image-text pairs for fine-tuning. By default, the model generates vectors (embeddings) of 1,024 dimensions, and is accessed via Amazon Bedrock. You can also generate smaller dimensions to optimize for speed and performance (see the invocation sketch after this list).
  • Amazon OpenSearch Serverless – It is an on-demand serverless configuration for OpenSearch Service. We use Amazon OpenSearch Serverless as a vector database for storing embeddings generated by the Amazon Titan Multimodal Embeddings model. An index created in the Amazon OpenSearch Serverless collection serves as the vector store for our Retrieval Augmented Generation (RAG) solution.
  • Amazon SageMaker Studio – It is an integrated development environment (IDE) for machine learning (ML). ML practitioners can perform all ML development steps—from preparing their data to building, training, and deploying ML models.
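As a reference point, the following is a minimal sketch of calling the Amazon Titan Multimodal Embeddings model through the Amazon Bedrock runtime API. The image file name and Region are assumptions, and the request assumes model access has already been granted.

import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Encode a product image (file name is illustrative); text, image, or both can be supplied
with open("product.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = json.dumps({
    "inputText": "drinkware glass",
    "inputImage": image_b64,
    "embeddingConfig": {"outputEmbeddingLength": 1024},  # the default dimension
})

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-image-v1",
    body=body,
    accept="application/json",
    contentType="application/json",
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # 1024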

The solution design consists of two parts: data indexing and contextual search. During data indexing, you process the product images to generate embeddings for these images and then populate the vector data store. These steps are completed prior to the user interaction steps.

In the contextual search phase, a search query (text or image) from the user is converted into embeddings, and a similarity search is run on the vector database to find similar product images. You then display the top similar results. All the code for this post is available in the GitHub repo.

The following diagram illustrates the solution architecture.

The following are the solution workflow steps:

  1. Download the product description text and images from the public Amazon Simple Storage Service (Amazon S3) bucket.
  2. Review and prepare the dataset.
  3. Generate embeddings for the product images using the Amazon Titan Multimodal Embeddings model (amazon.titan-embed-image-v1). If you have a large number of images and descriptions, you can optionally use batch inference for Amazon Bedrock.
  4. Store the embeddings in Amazon OpenSearch Serverless, which serves as the search engine.
  5. Finally, fetch the user query in natural language, convert it into embeddings using the Amazon Titan Multimodal Embeddings model, and perform a k-NN search to get the relevant search results (a sketch of this query follows the list).
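The last step can be sketched as follows with the opensearch-py client. The collection endpoint, index name, and vector field name are placeholders for this example, and query_embedding stands in for the vector returned by the Titan Multimodal Embeddings model for the user's query.

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"
host = "your-collection-id.us-east-1.aoss.amazonaws.com"  # placeholder collection endpoint

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")  # "aoss" signs requests for OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

query_embedding = [0.1] * 1024  # placeholder: in practice, the Titan embedding of the user's query

# k-NN query: return the 5 product documents whose image vectors are closest to the query vector
search_body = {
    "size": 5,
    "query": {"knn": {"image_vector": {"vector": query_embedding, "k": 5}}},
    "_source": ["item_name"],
}
results = client.search(index="products", body=search_body)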

We use SageMaker Studio (not shown in the diagram) as the IDE to develop the solution.

These steps are discussed in detail in the following sections. We also include screenshots and details of the output.

Prerequisites

To implement the solution provided in this post, you should have the following:

  • An AWS account and familiarity with FMs, Amazon Bedrock, Amazon SageMaker, and OpenSearch Service.
  • The Amazon Titan Multimodal Embeddings model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If Amazon Titan Multimodal Embeddings is enabled, the access status will show as Access granted, as shown in the following screenshot.

If the model is not available, enable access to the model by choosing Manage model access, selecting Amazon Titan Multimodal Embeddings G1, and choosing Request model access. The model is enabled for use immediately.

Set up the solution

When the prerequisite steps are complete, you’re ready to set up the solution:

  1. In your AWS account, open the SageMaker console and choose Studio in the navigation pane.
  2. Choose your domain and user profile, then choose Open Studio.

Your domain and user profile name may be different.

  3. Choose System terminal under Utilities and files.
  4. Run the following command to clone the GitHub repo to the SageMaker Studio instance:
git clone https://github.com/aws-samples/amazon-bedrock-samples.git
  5. Navigate to the multimodal/Titan/titan-multimodal-embeddings/amazon-bedrock-multimodal-oss-searchengine-e2e folder.
  6. Open the titan_mm_embed_search_blog.ipynb notebook.

Run the solution

Open the file titan_mm_embed_search_blog.ipynb and use the Data Science Python 3 kernel. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook performs the following steps:

  1. Install the packages and libraries required for this solution.
  2. Load the publicly available Amazon Berkeley Objects Dataset and metadata in a pandas data frame.

The dataset is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalogue images. For this post, you only use the item images and item names in US English. You use approximately 1,600 products.

  3. Generate embeddings for the item images with the Amazon Titan Multimodal Embeddings model using the get_titan_multimodal_embedding() function. For the sake of abstraction, we have defined all the important functions used in this notebook in the utils.py file.

Next, you create and set up an Amazon OpenSearch Serverless vector store (collection and index).

  4. Before you create the new vector search collection and index, you must first create three associated OpenSearch Service policies: the encryption security policy, the network security policy, and the data access policy (a boto3 sketch of these policies follows this list).

  5. Finally, ingest the image embeddings into the vector index.
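The three policies from step 4 can also be created with boto3 instead of the console. The following is a rough sketch in which the collection name, policy names, and IAM role ARN are placeholders, and the public network policy is only appropriate for a prototype.

import json
import boto3

aoss = boto3.client("opensearchserverless")
collection_name = "titan-mm-search"  # placeholder collection name

# Encryption security policy using an AWS-owned key
aoss.create_security_policy(
    name=f"{collection_name}-enc",
    type="encryption",
    policy=json.dumps({
        "Rules": [{"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}],
        "AWSOwnedKey": True,
    }),
)

# Network security policy allowing public access (prototype only)
aoss.create_security_policy(
    name=f"{collection_name}-net",
    type="network",
    policy=json.dumps([{
        "Rules": [{"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}],
        "AllowFromPublic": True,
    }]),
)

# Data access policy granting the notebook's execution role access to the collection and its indexes
aoss.create_access_policy(
    name=f"{collection_name}-access",
    type="data",
    policy=json.dumps([{
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection_name}"], "Permission": ["aoss:*"]},
            {"ResourceType": "index", "Resource": [f"index/{collection_name}/*"], "Permission": ["aoss:*"]},
        ],
        "Principal": ["arn:aws:iam::awsaccountnumber:role/your-sagemaker-execution-role"],
    }]),
)
# With the policies in place, create the collection (type "VECTORSEARCH") and the k-NN index, then ingest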

Now you can perform a real-time multimodal search.

Run a contextual search

In this section, we show the results of contextual search based on a text or image query.

First, let’s perform an image search based on text input. In the following example, we use the text input “drinkware glass” and send it to the search engine to find similar items.

The following screenshot shows the results.

Now let’s look at the results based on a simple image. The input image gets converted into vector embeddings and, based on the similarity search, the model returns the result.

You can use any image, but for the following example, we use a random image from the dataset based on item ID (for example, item_id = “B07JCDQWM6”), and then send this image to the search engine to find similar items.

The following screenshot shows the results.

Clean up

To avoid incurring future charges, delete the resources used in this solution. You can do this by running the cleanup section of the notebook.

Conclusion

This post presented a walkthrough of using the Amazon Titan Multimodal Embeddings model in Amazon Bedrock to build powerful contextual search applications. In particular, we demonstrated an example of a product listing search application. We saw how the embeddings model enables efficient and accurate discovery of information from images and textual data, thereby enhancing the user experience while searching for the relevant items.

Amazon Titan Multimodal Embeddings helps you power more accurate and contextually relevant multimodal search, recommendation, and personalization experiences for end-users. For example, a stock photography company with hundreds of millions of images can use the model to power its search functionality, so users can search for images using a phrase, image, or a combination of image and text.

The Amazon Titan Multimodal Embeddings model in Amazon Bedrock is now available in the US East (N. Virginia) and US West (Oregon) AWS Regions. To learn more, refer to Amazon Titan Image Generator, Multimodal Embeddings, and Text models are now available in Amazon Bedrock, the Amazon Titan product page, and the Amazon Bedrock User Guide. To get started with Amazon Titan Multimodal Embeddings in Amazon Bedrock, visit the Amazon Bedrock console.

Start building with the Amazon Titan Multimodal Embeddings model in Amazon Bedrock today.


About the Authors

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in Generative AI, Artificial Intelligence, Machine Learning, and System Design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors of the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Read More

AWS and Mistral AI commit to democratizing generative AI with a strengthened collaboration

AWS and Mistral AI commit to democratizing generative AI with a strengthened collaboration

The generative artificial intelligence (AI) revolution is in full swing, and customers of all sizes and across industries are taking advantage of this transformative technology to reshape their businesses. From reimagining workflows to make them more intuitive and easier, to enhancing decision-making processes through rapid information synthesis, generative AI promises to redefine how we interact with machines. It’s been amazing to see the number of companies launching innovative generative AI applications on AWS using Amazon Bedrock. Siemens is integrating Amazon Bedrock into its low-code development platform Mendix to allow thousands of companies across multiple industries to create and upgrade applications with the power of generative AI. Accenture and Anthropic are collaborating with AWS to help organizations—especially those in highly-regulated industries like healthcare, public sector, banking, and insurance—responsibly adopt and scale generative AI technology with Amazon Bedrock. This collaboration will help organizations like the District of Columbia Department of Health speed innovation, improve customer service, and improve productivity, while keeping data private and secure. Amazon Pharmacy is using generative AI to fill prescriptions with speed and accuracy, making customer service faster and more helpful, and making sure that the right quantities of medications are stocked for customers.

To power so many diverse applications, we recognized the need for model diversity and choice for generative AI early on. We know that different models excel in different areas, each with unique strengths tailored to specific use cases, leading us to provide customers with access to multiple state-of-the-art large language models (LLMs) and foundation models (FMs) through a unified service: Amazon Bedrock. By facilitating access to top models from Amazon, Anthropic, AI21 Labs, Cohere, Meta, Mistral AI, and Stability AI, we empower customers to experiment, evaluate, and ultimately select the model that delivers optimal performance for their needs.

Announcing Mistral Large on Amazon Bedrock

Today, we are excited to announce the next step on this journey with an expanded collaboration with Mistral AI. A French startup, Mistral AI has quickly established itself as a pioneering force in the generative AI landscape, known for its focus on portability, transparency, and its cost-effective design requiring fewer computational resources to run. We recently announced the availability of Mistral 7B and Mixtral 8x7B models on Amazon Bedrock, with weights that customers can inspect and modify. Today, Mistral AI is bringing its latest and most capable model, Mistral Large, to Amazon Bedrock, and is committed to making future models accessible to AWS customers. Mistral AI will also use the AI-optimized AWS Trainium and AWS Inferentia chips to build and deploy its future foundation models on Amazon Bedrock, benefiting from the price, performance, scale, and security of AWS. Along with this announcement, starting today, customers can use Amazon Bedrock in the AWS Europe (Paris) Region. At launch, customers will have access to some of the latest models from Amazon, Anthropic, Cohere, and Mistral AI, expanding their options to support various use cases from text understanding to complex reasoning.

Mistral Large boasts exceptional language understanding and generation capabilities, which is ideal for complex tasks that require reasoning capabilities or ones that are highly specialized, such as synthetic text generation, code generation, Retrieval Augmented Generation (RAG), or agents. For example, customers can build AI agents capable of engaging in articulate conversations, generating nuanced content, and tackling complex reasoning tasks. The model’s strengths also extend to coding, with proficiency in code generation, review, and comments across mainstream coding languages. And Mistral Large’s exceptional multilingual performance, spanning French, German, Spanish, and Italian, in addition to English, presents a compelling opportunity for customers. By offering a model with robust multilingual support, AWS can better serve customers with diverse language needs, fostering global accessibility and inclusivity for generative AI solutions.
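As an illustration of what this looks like in practice, the following is a minimal sketch of invoking Mistral Large through the Amazon Bedrock runtime API. The model ID, Region, and prompt format follow Bedrock's published conventions for Mistral models at the time of writing, but treat them as assumptions to verify against the current documentation.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="eu-west-3")  # for example, Europe (Paris)

body = json.dumps({
    "prompt": "<s>[INST] Write a SQL query that returns the ten best-selling products last month. [/INST]",
    "max_tokens": 512,
    "temperature": 0.2,
})

response = bedrock_runtime.invoke_model(
    modelId="mistral.mistral-large-2402-v1:0",  # assumed model ID; check the Amazon Bedrock console
    body=body,
)
print(json.loads(response["body"].read())["outputs"][0]["text"])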

By integrating Mistral Large into Amazon Bedrock, we can offer customers an even broader range of top-performing LLMs to choose from. No single model is optimized for every use case, and to unlock the value of generative AI, customers need access to a variety of models to discover what works best based for their business needs. We are committed to continuously introducing the best models, providing customers with access to the latest and most innovative generative AI capabilities.

“We are excited to announce our collaboration with AWS to accelerate the adoption of our frontier AI technology with organizations around the world. Our mission is to make frontier AI ubiquitous, and to achieve this mission, we want to collaborate with the world’s leading cloud provider to distribute our top-tier models. We have a long and deep relationship with AWS and through strengthening this relationship today, we will be able to provide tailor-made AI to builders around the world.”

– Arthur Mensch, CEO at Mistral AI.

Customers appreciate choice

Since we first announced Amazon Bedrock, we have been innovating at a rapid clip—adding more powerful features like agents and guardrails. And we’ve said all along that more exciting innovations, including new models, will keep coming. With more model choice, customers tell us they can achieve remarkable results:

“The ease of accessing different models from one API is one of the strengths of Bedrock. The model choices available have been exciting. As new models become available, our AI team is able to quickly and easily evaluate models to know if they fit our needs. The security and privacy that Bedrock provides makes it a great choice to use for our AI needs.”

– Jamie Caramanica, SVP, Engineering at CS Disco.

“Our top priority today is to help organizations use generative AI to support employees and enhance bots through a range of applications, such as stronger topic, sentiment, and tone detection from customer conversations, language translation, content creation and variation, knowledge optimization, answer highlighting, and auto summarization. To make it easier for them to tap into the potential of generative AI, we’re enabling our users with access to a variety of large language models, such as Genesys-developed models and multiple third-party foundational models through Amazon Bedrock, including Anthropic’s Claude, AI21 Labs’ Jurassic-2, and Amazon Titan. Together with AWS, we’re offering customers exponential power to create differentiated experiences built around the needs of their business, while helping them prepare for the future.”

– Glenn Nethercutt, CTO at Genesys.

As the generative AI revolution continues to unfold, AWS is poised to shape its future, empowering customers across industries to drive innovation, streamline processes, and redefine how we interact with machines. Together with outstanding partners like Mistral AI, and with Amazon Bedrock as the foundation, our customers can build more innovative generative AI applications.

Democratizing access to LLMs and FMs

Amazon Bedrock is democratizing access to cutting-edge LLMs and FMs and AWS is the only cloud provider to offer the most popular and advanced FMs to customers. The collaboration with Mistral AI represents a significant milestone in this journey, further expanding Amazon Bedrock’s diverse model offerings and reinforcing our commitment to empowering customers with unparalleled choice through Amazon Bedrock. By recognizing that no single model can optimally serve every use case, AWS has paved the way for customers to unlock the full potential of generative AI. Through Amazon Bedrock, organizations can experiment with and take advantage of the unique strengths of multiple top-performing models, tailoring their solutions to specific needs, industry domains, and workloads. This unprecedented choice, combined with the robust security, privacy, and scalability of AWS, enables customers to harness the power of generative AI responsibly and with confidence, no matter their industry or regulatory constraints.

Resources

  1. Mistral Large News Blog
  2. About Amazon Blog
  3. Mistral AI on Amazon Bedrock Product Page

About the author

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Read More