Text embedding and sentence similarity retrieval at scale with Amazon SageMaker JumpStart

Text vectors or embeddings are numerical vector representations of text that are generated by large language models (LLMs). After LLMs are fully pre-trained on a large dataset or fine-tuned from different tasks, including text completion, question answering, and translations, text embeddings capture semantic information of the input text. Different downstream applications are made possible by text embeddings, including similarity searching, information retrieval, recommendations and personalization, multilingual translations, and more.

Before intelligent applications could be built from embeddings, enterprises and organizations had to embed their existing documents, which can be expensive and technically complicated. Amazon SageMaker JumpStart is a machine learning (ML) hub that helps accelerate this journey. With SageMaker JumpStart, you can access pre-trained, cutting-edge text embedding models from various model providers, including Hugging Face, AI 21 Labs, Cohere, and Meta AI. You can seamlessly deploy these models into production with the SageMaker JumpStart user interface or SDK. In addition, none of your data is used to train the underlying models. Because all data is encrypted and doesn’t leave its own VPC, you can trust your data remains private and confidential.

In this post, we demonstrate how to use the SageMaker Python SDK for text embedding and sentence similarity. Sentence similarity involves assessing the likeness between two pieces of text after they are converted into embeddings by the LLM, which is a foundation step for applications like Retrieval Augmented Generation (RAG). We demonstrate how to do the following:

  • Run inference on a text embedding model deployed from SageMaker JumpStart
  • Find the nearest neighbors for an input sentence with your own dataset
  • Run the batch transform on large documents to minimize costs

All the code is available on GitHub.

Deploy a text embedding model via SageMaker JumpStart

To host a model on Amazon SageMaker, the first step is to set up and authenticate the use of AWS services. In Amazon SageMaker Studio, we use the execution role associated with the notebook instance. See the following code:

import sagemaker, boto3, json
from sagemaker.session import Session
sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

On Hugging Face, the Massive Text Embedding Benchmark (MTEB) is provided as a leaderboard for diverse text embedding tasks. It currently provides 129 benchmarking datasets across 8 different tasks on 113 languages. The top text embedding models from the MTEB leaderboard are made available from SageMaker JumpStart, including bge, gte, e5, and more. In this post, we use huggingface-sentencesimilarity-bge-large-en as an example. We can use the SageMaker SDK to deploy this state-of-the-art text embedding model:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "huggingface-sentencesimilarity-bge-large-en"
text_embedding_model = JumpStartModel(model_id=model_id)
predictor = text_embedding_model.deploy()

Text embedding model query

Let’s look at the text embedding model query in more detail.

Text to embedding

If you have already deployed a SageMaker endpoint before, the predictor can be restored as follows:

from sagemaker.predictor import Predictor
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import IdentitySerializer

predictor = Predictor(
    endpoint_name=<YOUR_ENDPOINT_NAME>,
    deserializer=JSONDeserializer(),
    serializer=IdentitySerializer(),
)
predictor.content_type = "application/x-text"

After the model is successfully deployed, you can query the endpoint with a batch of input texts within a JSON payload:

sentences = [
    # Pets
    "Your dog is so cute.",
    "How cute your dog is!",
    "You have such a cute dog!",
    # Cities
    "Sydney is the place where I work.",
    "I work in Sydney.",
    # Color
    "What colour do you like the most?",
    "What is your favourite colour?",
]

predictor.predict(json.dumps(sentences).encode('utf-8'))

The correlation of the embeddings of these sentences is plotted in the following figure.

correlation_heat_map

As shown in the preceding figure, same subjects are highly correlated within themselves, including Pets, Cities, and Color; different subjects are much dissimilar. This indicates the embedding generated by the LLMs (in this case, bge) can represent the semantic information accurately.

For this post, we used the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. Latency is the amount of time from the moment that a user sends a request until the time that the application indicates that the request has been completed. The numbers in the following table represent the average latency for a total of 100 requests using the same batch of input texts on the ml.g5.2xlarge and ml.c6i.xlarge instances.

Model g5.2xlarge Average Latency (ms) c6i.xlarge Average Latency(ms) Language Support
all-MiniLM-L6-v2 19.5 27.9 English
BGE Base En 21.2 114 English
BGE Small En 28.3 45.6 English
BGE Large En 34.7 337 English
Multilingual E5 Base 22.1 118 Multilingual
Multilingual E5 Large 39.8 360 Multilingual
E5 Base 25.6 117 English
E5 Base V2 25.2 123 English
E5 Large 32.2 339 English
E5 Large V2 32.5 331 English
GTE Base 22.2 112 English
GTE Small 19.7 46 English
GTE Large 39.7 347 English

Get the nearest neighbors

The deployed model from SageMaker JumpStart can also facilitate the process of identifying the nearest neighbors to queries within the corpus. When provided with queries and a corpus, the model will produce the corpus_id, which denotes the position of the relevant corpus entry in the input corpus list, and a score indicating the degree of proximity to the query. It uses the following parameters:

  • corpus – Provides the list of inputs from which to find the nearest neighbor
  • queries – Provides the list of inputs for which to find the nearest neighbor from the corpus
  • top_k – The number of nearest neighbors to find from the corpus
  • mode – Set as nn_corpus for getting the nearest neighbors to input queries within the corpus

See the following code:

corpus = [
    "Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.",
    "Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.",
    "Amazon SageMaker provides a full end-to-end workflow, but you can continue to use your existing tools with SageMaker. You can easily transfer the results of each stage in and out of SageMaker as your business requirements dictate."
]
queries = [
    "What is Amazon SageMaker?",
    "How does Amazon SageMaker secure my code?",
    "What if I have my own notebook, training, or hosting environment in my own business environment?"
]

payload_nearest_neighbor = {"corpus": corpus, "queries": queries, "top_k": 3, "mode": "nn_corpus"}
query_response = predictor.predict(payload_nearest_neighbor)

We get the following output:

[
    [
        {'corpus_id': 0, 'score': 0.8992230892181396},
        {'corpus_id': 2, 'score': 0.8664969205856323},
        {'corpus_id': 1, 'score': 0.8456423282623291}
    ],
    [
        {'corpus_id': 1, 'score': 0.8919335603713989},
        {'corpus_id': 0, 'score': 0.840064525604248},
        {'corpus_id': 2, 'score': 0.8145401477813721}
    ],
    [
        {'corpus_id': 2, 'score': 0.7712811231613159},
        {'corpus_id': 1, 'score': 0.7564010620117188},
        {'corpus_id': 0, 'score': 0.7525666356086731}
    ]
]

This result means the first query is most similar to the first corpus, the second is closer to the second corpus, and so on. This is a correct match in this example.

We also took the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. The numbers in the following table represent the average latency for a total of 100 requests using the same payload on the ml.g5.2xlarge and ml.c6i.xlarge instances.

Model g5.2xlarge Average Latency (ms) c6i.xlarge Average Latency(ms) Language Support
all-MiniLM-L6-v2 21.7 69.1 English
BGE Base En 29.1 372 English
BGE Small En 29.2 124 English
BGE Large En 47.2 1240 English
Multilingual E5 Base 30 389 Multilingual
Multilingual E5 Large 47.1 1380 Multilingual
E5 Base 30.4 373 English
E5 Base V2 31 409 English
E5 Large 45.9 1230 English
E5 Large V2 49.6 1220 English
GTE Base 30.3 375 English
GTE Small 28.5 129 English
GTE Large 46.6 1320 English

Get the nearest neighbors on a large dataset

When making requests to the SageMaker invoke endpoint, payloads are restricted to approximately 5 MB, and the request timeout is set to 1 minute. If corpus size exceeds these limits, you could use a SageMaker training job, which generates embeddings for your large dataset and persists them alongside the model inside the SageMaker endpoint. Therefore, they don’t have to be passed as part of the invocation payload. The process of finding the nearest neighbors is carried out using SentenceTransformer and its utility function. The nearest neighbor is based on the cosine similarity between the input sentence embedding and the precomputed sentence embeddings during the training job.

In the following example, we fetch and prepare the Amazon_SageMaker_FAQs dataset to use it in finding the nearest neighbor to an input question:

!aws s3 cp s3://jumpstart-cache-prod-us-west-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv Amazon_SageMaker_FAQs.csv

import pandas as pd

data = pd.read_csv("Amazon_SageMaker_FAQs.csv", names=["Questions", "Answers"])
data["id"] = data.index
data_req = data[["id", "Answers"]]
data_req.to_csv("data.csv", index=False, header=False)

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ss-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_dataset_s3_path = f"s3://{output_bucket}/{output_prefix}/data/data.csv"

!aws s3 cp data.csv {training_dataset_s3_path}

For algorithm-specific training hyperparameters, the SageMaker SDK can be fetched or overwritten:

from sagemaker import hyperparameters

hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version = "*")
hyperparameters["batch_size"] = "64"
print(hyperparameters)
>>> {'max_seq_length': 'None', 'batch_size': '64', 'store_text_with_embedding': 'True'}

The SageMaker training consists of two steps: create the estimator object and launch the training job. The output is a model prepackaged with embeddings of your large dataset used as training data, which can be deployed for inference to get the nearest neighbor for any input sentence. See the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters=hyperparameters,
    output_path=s3_output_location
)

estimator.fit(
    {"training": f"s3://{output_bucket}/{output_prefix}/data"}
)
predictor = estimator.deploy()

The query syntax to convert text into embeddings is the same as before. The code to get the nearest neighbor, however, can be simplified as follows:

payload_nearest_neighbour = {
    "queries": ["Is R supported with Amazon SageMaker?"],
    "top_k": 1,
    "mode": "nn_train_data",
}

response = predictor.predict(payload_nearest_neighbour)
>>> [[{'id': '9', 'score': 0.9240573048591614}]]

data["Answers"].iloc[int(response[0][0]["id"])]
>>> "Yes, R is supported with Amazon SageMaker. You can use R within SageMaker notebook instances, which include a preinstalled R kernel and the reticulate library. Reticulate offers an R interface for the Amazon SageMaker Python SDK, enabling ML practitioners to build, train, tune, and deploy R models."

We can also query the endpoint with questions in the Amazon_SageMaker_FAQs dataset and compare how many of the correct corresponding answers are returned. In the following example, we measure the top-3 accuracy, given there could be similar question answer pairs. This means if the correct answer is returned as one of the top-3 returns, it’s treated as a correct query.

total_correct_answers = 0

for i in range(len(data)):
    question = data["Questions"].iloc[i]
    payload_nearest_neighbor = {
        "queries": [question],
        "top_k": 3,
        "mode": "nn_train_data",
    }
    response = predictor.predict(payload_nearest_neighbor)
    response_ids = [int(res["id"]) for res in response[0]]

    if i in response_ids:
        total_correct_answers += 1
    else:
        pred_answer = [data["Answers"].iloc[response_id] for response_id in response_ids]

print(total_correct_answers*100/len(data))
>>>
81.16883116883118

Run a batch transform to get embeddings on large datasets

For enterprises and organizations with a large volume of historical documents that exceed the memory of a single endpoint instance, you can use SageMaker batch transform to save cost. When you start a batch transform job, SageMaker launches the necessary compute resources to process the data. During the job, SageMaker automatically provisions and manage the compute resources. When the batch transform job is complete, those resources are automatically cleaned up, which minimizes costs. By dividing a large dataset into smaller chunks and using more instances, you can scale out the compute for faster inference with similar cost, without managing infrastructure. The maximum payload for batch transform is 100 MB and timeout is 1 hour.

The input format for our batch transform job is a JSONL file, with entries as a line of JSON, which consists of id and text_inputs. See the following code:

test_data_file_name = "test.jsonl"
test_data = []

for i in range(len(data)):
    answer = data.loc[i, "Answers"]
    payload = {"id": i, "text_inputs": answer}
    test_data.append(payload)

with open(test_data_file_name, "w") as outfile:
    for entry in test_data:
        outfile.write(f"{json.dumps(entry)}n")

s3 = boto3.client("s3")
s3.upload_file(test_data_file_name, output_bucket, f"{output_prefix}/batch_input/test.jsonl")

When the data is ready in Amazon Simple Storage Service (Amazon S3), you can create the batch transform object from the SageMaker JumpStart model, which triggers the transform job:

s3_input_data_path = f"s3://{output_bucket}/{output_prefix}/batch_input/"
s3_output_data_path = f"s3://{output_bucket}/{output_prefix}/batch_output/"

batch_transformer = text_embedding_model.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=s3_output_data_path,
    assemble_with="Line",
    accept="text/csv",
    max_payload=1,
)

batch_transformer.transform(
    s3_input_data_path,
    content_type="application/jsonlines",
    split_type="Line"
)

batch_transformer.wait()

After the batch transform job is complete, you can download the result from Amazon S3:

s3 = boto3.client("s3")
s3.download_file(
    output_bucket, output_prefix + "/batch_output/" + "test.jsonl.out", "predict.jsonl"
)

with open("predict.jsonl", "r") as json_file:
    json_list = list(json_file)

Conclusion

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language foundation models for text embedding and semantic search. With the user interface or just a few lines of code, you can deploy a highly accurate text embedding model and find semantic matches across large datasets, at scale and cost-efficiently. SageMaker JumpStart removes the barriers to implement semantic search by providing instant access to cutting-edge models like the ones benchmarked on the MTEB leaderboard. Businesses and developers can build intelligent search and recommendation systems faster.

This post demonstrated how to find semantically similar questions and answers, which could be applied to RAG use cases, recommendations and personalization, multilingual translations, and more. With continued advances in language models and the simplicity of SageMaker JumpStart, more organizations can infuse generative AI capabilities into their products. As the next step, you can try text-embedding models from SageMaker JumpStart on your own dataset to test and benchmark the results for your RAG use cases.


About the Authors

Dr. Baichuan Sun, currently serving as a Sr. AI/ML Solution Architect at AWS, focuses on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends, reflecting a commitment to both his professional growth and personal well-being.

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his masters from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Read More