We are excited to announce that Amazon SageMaker JumpStart can now stream large language model (LLM) inference responses. Token streaming allows you to see the model response output as it is being generated instead of waiting for the LLM to finish generating the response before it is made available for you to use or display. The streaming capability in SageMaker JumpStart can help you build applications with a better user experience by creating a perception of low latency for the end-user.
In this post, we walk through how to deploy and stream the response from a Falcon 7B Instruct model endpoint.
At the time of this writing, the following LLMs available in SageMaker JumpStart support streaming:
- Mistral AI 7B, Mistral AI 7B Instruct
- Falcon 180B, Falcon 180B Chat
- Falcon 40B, Falcon 40B Instruct
- Falcon 7B, Falcon 7B Instruct
- Rinna Japanese GPT NeoX 4B Instruction PPO
- Rinna Japanese GPT NeoX 3.6B Instruction PPO
To check for updates on the list of models supporting streaming in SageMaker JumpStart, search for “huggingface-llm” in the Built-in Algorithms with pre-trained Model Table.
Note that you can use the streaming feature of Amazon SageMaker hosting out of the box for any model deployed using the SageMaker TGI Deep Learning Container (DLC), as described in Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker.
Foundation models in SageMaker
SageMaker JumpStart provides access to a range of models from popular model hubs, including Hugging Face, PyTorch Hub, and TensorFlow Hub, which you can use within your ML development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which are typically trained with billions of parameters and can be adapted to a wide category of use cases, such as text summarization, generating digital art, and language translation. Because these models are expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.
You can now find foundation models from different model providers within SageMaker JumpStart, enabling you to get started with foundation models quickly. SageMaker JumpStart offers foundation models based on different tasks or model providers, and you can easily review model characteristics and usage terms. You can also try these models using a test UI widget. When you want to use a foundation model at scale, you can do so without leaving SageMaker by using prebuilt notebooks from model providers. Because the models are hosted and deployed on AWS, you can trust that your data, whether used for evaluating the model or using it at scale, won’t be shared with third parties.
Token streaming
Token streaming allows the inference response to be returned as it’s being generated by the model. This way, you can see the response build up incrementally rather than wait for the model to finish before the complete response is available. Streaming can help enable a better user experience because it reduces perceived latency for the end-user, and because you see the output as it’s generated, you can stop generation early if the output isn’t proving useful for your purposes. Streaming makes an especially big difference for long-running queries: you start seeing output immediately, which creates a perception of lower latency even though the end-to-end latency stays the same.
As of this writing, you can use streaming in SageMaker JumpStart for models that use the Hugging Face LLM Text Generation Inference (TGI) DLC.
[Side-by-side demo: response with no streaming vs. response with streaming]
Solution overview
For this post, we use the Falcon 7B Instruct model to showcase the SageMaker JumpStart streaming capability.
You can use code similar to the following to find other models in SageMaker JumpStart that support streaming. This is a minimal sketch using the SageMaker Python SDK; the filter values are an assumption based on how the TGI-based LLM models are tagged:
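```python
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.jumpstart.filters import And

# Filter for Hugging Face LLM models, which are served with the TGI container
# (filter values are an assumption; adjust to your SDK version if needed).
filter_value = And("task == llm", "framework == huggingface")
model_ids = list_jumpstart_models(filter=filter_value)
print(model_ids)
```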
We get back the model IDs that support streaming, similar to the following (illustrative; the exact list can change as new models are added):
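```
huggingface-llm-falcon-180b-bf16
huggingface-llm-falcon-180b-chat-bf16
huggingface-llm-falcon-40b-bf16
huggingface-llm-falcon-40b-instruct-bf16
huggingface-llm-falcon-7b-bf16
huggingface-llm-falcon-7b-instruct-bf16
huggingface-llm-mistral-7b
huggingface-llm-mistral-7b-instruct
huggingface-llm-rinna-3-6b-instruction-ppo-bf16
huggingface-llm-rinna-4b-instruction-ppo-bf16
```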
Prerequisites
Before running the notebook, there are some initial steps required for setup. Run commands similar to the following (a minimal sketch; your environment may already satisfy these requirements):
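```python
# Upgrade the SageMaker Python SDK and boto3 so the JumpStart model classes
# and the streaming-capable runtime client are available (version pinning is
# an assumption; use whatever your environment requires).
%pip install --upgrade --quiet sagemaker boto3
```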
Deploy the model
As a first step, use SageMaker JumpStart to deploy a Falcon 7B Instruct model. For full instructions, refer to Falcon 180B foundation model from TII is now available via Amazon SageMaker JumpStart. Use code similar to the following (the model ID shown is an assumption; verify the current ID in the pre-trained model table):
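```python
from sagemaker.jumpstart.model import JumpStartModel

# Falcon 7B Instruct model ID (an assumption; verify in the model table).
model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")

# Deploy to a real-time endpoint; JumpStart supplies instance type defaults.
predictor = model.deploy()
```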
Query the endpoint and stream response
Next, construct a payload to invoke your deployed endpoint with. Importantly, the payload should contain the key/value pair `"stream": True`. This instructs the text generation inference server to generate a streaming response.
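For example, a payload might look like the following (the prompt and generation parameter values are illustrative; the parameter names follow the TGI schema):

```python
# Illustrative payload; "stream": True requests a streaming response.
payload = {
    "inputs": "Briefly explain the benefits of token streaming for LLM apps.",
    "parameters": {"max_new_tokens": 256, "temperature": 0.2},
    "stream": True,
}
```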
Before you query the endpoint, you need to create an iterator that can parse the bytes stream response from the endpoint. Data for each token is provided as a separate line in the response, so this iterator returns a token each time a new line is identified in the streaming buffer. This iterator is minimally designed, and you might want to adjust its behavior for your use case; for example, while this iterator returns token strings, the line data contains other information, such as token log probabilities, that could be of interest.
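A minimal sketch of such an iterator follows. It assumes the TGI container's server-sent-event format, where each event is a newline-delimited `data:{...}` JSON line and the token text sits under `token.text`:

```python
import io
import json


class TokenIterator:
    """Yield token strings from a SageMaker response stream (minimal sketch)."""

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                # A complete line is available; consume it from the buffer.
                self.read_pos += len(line)
                full_line = line.decode("utf-8").strip()
                if not full_line.startswith("data:"):
                    continue  # skip blank/keep-alive lines between events
                line_data = json.loads(full_line[len("data:"):])
                return line_data["token"]["text"]
            # No complete line yet; pull the next chunk from the event stream.
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
```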
Now you can use the Boto3 `invoke_endpoint_with_response_stream` API on the endpoint that you created and enable streaming by iterating over a `TokenIterator` instance, as in the following sketch (it assumes the `payload`, `TokenIterator`, and `predictor` defined earlier):
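```python
import json

import boto3

# Runtime client for invoking endpoints; Region comes from your configuration.
smr_client = boto3.client("sagemaker-runtime")

response = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)

# Stream tokens to the console as they arrive.
for token in TokenIterator(response["Body"]):
    print(token, end="")
```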
Specifying an empty `end` parameter to the `print` function (`end=""`) enables a visual stream without newline characters inserted between tokens. This produces the following output:
You can use this code in a notebook or other applications like Streamlit or Gradio to see the streaming in action and the experience it provides for your customers.
Clean up
Finally, remember to clean up your deployed model and endpoint to avoid incurring additional costs. A minimal sketch, using the `predictor` returned by the deployment step:
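```python
# Delete the endpoint and the underlying model to stop incurring charges.
predictor.delete_model()
predictor.delete_endpoint()
```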
Conclusion
In this post, we showed you how to use the newly launched streaming feature in SageMaker JumpStart. We hope you use the token streaming capability to build interactive applications that require low latency for a better user experience.
About the authors
Rachna Chadha is a Principal Solutions Architect, AI/ML, in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.
Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.