Build powerful RAG pipelines with LlamaIndex and Amazon Bedrock

Build powerful RAG pipelines with LlamaIndex and Amazon Bedrock

This post was co-written with Jerry Liu from LlamaIndex.

Retrieval Augmented Generation (RAG) has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs). By combining the vast knowledge stored in external data sources with the generative power of LLMs, RAG enables you to tackle complex tasks that require both knowledge and creativity. Today, RAG techniques are used in every enterprise, small and large, where generative artificial intelligence (AI) is used as an enabler for solving document-based question answering and other types of analysis.

Although building a simple RAG system is straightforward, building production RAG systems using advanced patterns is challenging. A production RAG pipeline typically operates over a larger data volume and larger data complexity, and must meet a higher quality bar compared to building a proof of concept. A general broad challenge that developers face is low response quality; the RAG pipeline is not able to sufficiently answer a large number of questions. This can be due to a variety of reasons; the following are some of the most common:

  • Bad retrievals – The relevant context needed to answer the question is missing.
  • Incomplete responses – The relevant context is partially there but not completely. The generated output doesn’t fully answer the input question.
  • Hallucinations – The relevant context is there but the model is not able to extract the relevant information in order to answer the question.

This necessitates more advanced RAG techniques on the query understanding, retrieval, and generation components in order to handle these failure modes.

This is where LlamaIndex comes in. LlamaIndex is an open source library with both simple and advanced techniques that enables developers to build production RAG pipelines. It provides a flexible and modular framework for building and querying document indexes, integrating with various LLMs, and implementing advanced RAG patterns.

Amazon Bedrock is a managed service providing access to high-performing foundation models (FMs) from leading AI providers through a unified API. It offers a wide range of large models to choose from, along with capabilities to securely build and customize generative AI applications. Key advanced features include model customization with fine-tuning and continued pre-training using your own data, as well as RAG to augment model outputs by retrieving context from configured knowledge bases containing your private data sources. You can also create intelligent agents that orchestrate FMs with enterprise systems and data. Other enterprise capabilities include provisioned throughput for guaranteed low-latency inference at scale, model evaluation to compare performance, and AI guardrails to implement safeguards. Amazon Bedrock abstracts away infrastructure management through a fully managed, serverless experience.

In this post, we explore how to use LlamaIndex to build advanced RAG pipelines with Amazon Bedrock. We discuss how to set up the following:

  • Simple RAG pipeline – Set up a RAG pipeline in LlamaIndex with Amazon Bedrock models and top-k vector search
  • Router query – Add an automated router that can dynamically do semantic search (top-k) or summarization over data
  • Sub-question query – Add a query decomposition layer that can decompose complex queries into multiple simpler ones, and run them with the relevant tools
  • Agentic RAG – Build a stateful agent that can do the preceding components (tool use, query decomposition), but also maintain state-like conversation history and reasoning over time

Simple RAG pipeline

At its core, RAG involves retrieving relevant information from external data sources and using it to augment the prompts fed to an LLM. This allows the LLM to generate responses that are grounded in factual knowledge and tailored to the specific query.

For RAG workflows in Amazon Bedrock, documents from configured knowledge bases go through preprocessing, where they are split into chunks, embedded into vectors, and indexed in a vector database. This allows efficient retrieval of relevant information at runtime. When a user query comes in, the same embedding model is used to convert the query text into a vector representation. This query vector is compared against the indexed document vectors to identify the most semantically similar chunks from the knowledge base. The retrieved chunks provide additional context related to the user’s query. This contextual information is appended to the original user prompt before being passed to the FM to generate a response. By augmenting the prompt with relevant data pulled from the knowledge base, the model’s output is able to use and be informed by an organization’s proprietary information sources. This RAG process can also be orchestrated by agents, which use the FM to determine when to query the knowledge base and how to incorporate the retrieved context into the workflow.

The following diagram illustrates this workflow.

The following is a simplified example of a RAG pipeline using LlamaIndex:

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load documents
documents = SimpleDirectoryReader("data/").load_data()

# Create a vector store index
index = VectorStoreIndex.from_documents(documents)

# Query the index
response = index.query("What is the capital of France?")

# Print the response
print(response)

The pipeline includes the following steps:

  1. Use the SimpleDirectoryReader to load documents from the “data/”
  2. Create a VectorStoreIndex from the loaded documents. This type of index converts documents into numerical representations (vectors) that capture their semantic meaning.
  3. Query the index with the question “What is the capital of France?” The index uses similarity measures to identify the documents most relevant to the query.
  4. The retrieved documents are then used to augment the prompt for the LLM, which generates a response based on the combined information.

LlamaIndex goes beyond simple RAG and enables the implementation of more sophisticated patterns, which we discuss in the following sections.

Router query

RouterQueryEngine allows you to route queries to different indexes or query engines based on the nature of the query. For example, you could route summarization questions to a summary index and factual questions to a vector store index.

The following is a code snippet from the example notebooks demonstrating RouterQueryEngine:

from llama_index import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine

# Create summary and vector indices
summary_index = SummaryIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)

# Define query engines
summary_query_engine = summary_index.as_query_engine()
vector_query_engine = vector_index.as_query_engine()

# Create router query engine
query_engine = RouterQueryEngine(
 # Define logic for routing queries
 # ...
 query_engine_tools=[
 summary_query_engine,
 vector_query_engine,
 ],
)

# Query the engine
response = query_engine.query("What is the main idea of the document?")

Sub-question query

SubQuestionQueryEngine breaks down complex queries into simpler sub-queries and then combines the answers from each sub-query to generate a comprehensive response. This is particularly useful for queries that span across multiple documents. It first breaks down the complex query into sub-questions for each relevant data source, then gathers the intermediate responses and synthesizes a final response that integrates the relevant information from each sub-query. For example, if the original query was “What is the population of the capital city of the country with the highest GDP in Europe,” the engine would first break it down into sub-queries like “What is the highest GDP country in Europe,” “What is the capital city of that country,” and “What is the population of that capital city,” and then combine the answers to those sub-queries into a final comprehensive response.

The following is an example of using SubQuestionQueryEngine:

from llama_index.core.query_engine import SubQuestionQueryEngine

# Create sub-question query engine
sub_question_query_engine = SubQuestionQueryEngine.from_defaults(
 # Define tools for generating sub-questions and answering them
 # ...
)

# Query the engine
response = sub_question_query_engine.query(
 "Compare the revenue growth of Uber and Lyft from 2020 to 2021"
)

Agentic RAG

An agentic approach to RAG uses an LLM to reason about the query and determine which tools (such as indexes or query engines) to use and in what sequence. This allows for a more dynamic and adaptive RAG pipeline. The following architecture diagram shows how agentic RAG works on Amazon Bedrock.

Agentic RAG in Amazon Bedrock combines the capabilities of agents and knowledge bases to enable RAG workflows. Agents act as intelligent orchestrators that can query knowledge bases during their workflow to retrieve relevant information and context to augment the responses generated by the FM.

After the initial preprocessing of the user input, the agent enters an orchestration loop. In this loop, the agent invokes the FM, which generates a rationale outlining the next step the agent should take. One potential step is to query an attached knowledge base to retrieve supplemental context from the indexed documents and data sources.

If a knowledge base query is deemed beneficial, the agent invokes an InvokeModel call specifically for knowledge base response generation. This fetches relevant document chunks from the knowledge base based on semantic similarity to the current context. These retrieved chunks provide additional information that is included in the prompt sent back to the FM. The model then generates an observation response that is parsed and can invoke further orchestration steps, like invoking external APIs (through action group AWS Lambda functions) or provide a final response to the user. This agentic orchestration augmented by knowledge base retrieval continues until the request is fully handled.

One example of an agent orchestration loop is the ReAct agent, which was initially introduced by Yao et al. ReAct interleaves chain-of-thought and tool use. At every stage, the agent takes in the input task along with the previous conversation history and decides whether to invoke a tool (such as querying a knowledge base) with the appropriate input or not.

The following is an example of using the ReAct agent with the LlamaIndex SDK:

from llama_index.core.agent import ReActAgent

# Create ReAct agent with defined tools
agent = ReActAgent.from_tools(
 query_engine_tools,
 llm=llm,
)

# Chat with the agent
response = agent.chat("What was Lyft's revenue growth in 2021?")

The ReAct agent will analyze the query and decide whether to use the Lyft 10K tool or another tool to answer the question. To try out agentic RAG, refer to the GitHub repo.

LlamaCloud and LlamaParse

LlamaCloud represents a significant advancement in the LlamaIndex landscape, offering a comprehensive suite of managed services tailored for enterprise-grade context augmentation within LLM and RAG applications. This service empowers AI engineers to concentrate on developing core business logic by streamlining the intricate process of data wrangling.

One key component is LlamaParse, a proprietary parsing engine adept at handling complex, semi-structured documents replete with embedded objects like tables and figures, seamlessly integrating with LlamaIndex’s ingestion and retrieval pipelines. Another key component is the Managed Ingestion and Retrieval API, which facilitates effortless loading, processing, and storage of data from diverse sources, including LlamaParse outputs and LlamaHub’s centralized data repository, while accommodating various data storage integrations.

Collectively, these features enable the processing of vast production data volumes, culminating in enhanced response quality and unlocking unprecedented capabilities in context-aware question answering for RAG applications. To learn more about these features, refer to Introducing LlamaCloud and LlamaParse.

For this post, we use LlamaParse to showcase the integration with Amazon Bedrock. LlamaParse is an API created by LlamaIndex to efficiently parse and represent files for efficient retrieval and context augmentation using LlamaIndex frameworks. What is unique about LlamaParse is that it is the world’s first generative AI native document parsing service, which allows users to submit documents along with parsing instructions. The key insight behind parsing instructions is that you know what kind of documents you have, so you already know what kind of output you want. The following figure shows a comparison of parsing a complex PDF with LlamaParse vs. two popular open source PDF parsers.

A green highlight in a cell means that the RAG pipeline correctly returned the cell value as the answer to a question over that cell. A red highlight means that the question was answered incorrectly.

Integrate Amazon Bedrock and LlamaIndex to build an Advanced RAG Pipeline

In this section, we show you how to build an advanced RAG stack combining LlamaParse and LlamaIndex with Amazon Bedrock services – LLMs, embedding models, and Bedrock Knowledge Base.

To use LlamaParse with Amazon Bedrock, you can follow these high-level steps:

  1. Download your source documents.
  2. Send the documents to LlamaParse using the Python SDK:
    from llama_parse import LlamaParse
    from llama_index.core import SimpleDirectoryReader
    
    parser = LlamaParse(
        api_key=os.environ.get('LLAMA_CLOUD_API_KEY'),  # set via api_key param or in your env as LLAMA_CLOUD_API_KEY
        result_type="markdown",  # "markdown" and "text" are available
        num_workers=4,  # if multiple files passed, split in `num_workers` API calls
        verbose=True,
        language="en",  # Optionally you can define a language, default=en
    )
    
    file_extractor = {".pdf": parser}
    reader = SimpleDirectoryReader(
        input_dir='data/10k/',
        file_extractor=file_extractor
    )

  3. Wait for the parsing job to finish and upload the resulting Markdown documents to Amazon Simple Storage Service (Amazon S3).
  4. Create an Amazon Bedrock knowledge base using the source documents.
  5. Choose your preferred embedding and generation model from Amazon Bedrock using the LlamaIndex SDK:
    llm = Bedrock(model = "anthropic.claude-v2")
    embed_model = BedrockEmbedding(model = "amazon.titan-embed-text-v1")

  6. Implement an advanced RAG pattern using LlamaIndex. In the following example, we use SubQuestionQueryEngine and a retriever specially created for Amazon Bedrock knowledge bases:
    from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever

  7. Finally, query the index with your question:
    response = await query_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')

We tested Llamaparse on a real-world, challenging example of asking questions about a document containing Bank of America Q3 2023 financial results. An example slide from the full slide deck (48 complex slides!) is shown below.

Using the procedure outlined above, we asked “What is the trend in digital households/relationships from 3Q20 to 3Q23?”; take a look at the answer generated using Llamaindex tools vs. the reference answer from human annotation.

LlamaIndex + LlamaParse answer Reference answer
The trend in digital households/relationships shows a steady increase from 3Q20 to 3Q23. In 3Q20, the number of digital households/relationships was 550K, which increased to 645K in 3Q21, then to 672K in 3Q22, and further to 716K in 3Q23. This indicates consistent growth in the adoption of digital services among households and relationships over the reported quarters. The trend shows a steady increase in digital households/relationships from 645,000 in 3Q20 to 716,000 in 3Q23. The digital adoption percentage also increased from 76% to 83% over the same period.

The following are example notebooks to try out these steps on your own examples. Note the prerequisite steps and cleanup resources after testing them.

Conclusion

In this post, we explored various advanced RAG patterns with LlamaIndex and Amazon Bedrock. To delve deeper into the capabilities of LlamaIndex and its integration with Amazon Bedrock, check out the following resources:

By combining the power of LlamaIndex and Amazon Bedrock, you can build robust and sophisticated RAG pipelines that unlock the full potential of LLMs for knowledge-intensive tasks.


About the Author

Shreyas Subramanian is a Principal data scientist and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.

Jerry Liu is the co-founder/CEO of LlamaIndex, a data framework for building LLM applications. Before this, he has spent his career at the intersection of ML, research, and startups. He led the ML monitoring team at Robust Intelligence, did self-driving AI research at Uber ATG, and worked on recommendation systems at Quora.

Read More

Evaluating prompts at scale with Prompt Management and Prompt Flows for Amazon Bedrock

Evaluating prompts at scale with Prompt Management and Prompt Flows for Amazon Bedrock

As generative artificial intelligence (AI) continues to revolutionize every industry, the importance of effective prompt optimization through prompt engineering techniques has become key to efficiently balancing the quality of outputs, response time, and costs. Prompt engineering refers to the practice of crafting and optimizing inputs to the models by selecting appropriate words, phrases, sentences, punctuation, and separator characters to effectively use foundation models (FMs) or large language models (LLMs) for a wide variety of applications. A high-quality prompt maximizes the chances of having a good response from the generative AI models.

A fundamental part of the optimization process is the evaluation, and there are multiple elements involved in the evaluation of a generative AI application. Beyond the most common evaluation of FMs, the prompt evaluation is a critical, yet often challenging, aspect of developing high-quality AI-powered solutions. Many organizations struggle to consistently create and effectively evaluate their prompts across their various applications, leading to inconsistent performance and user experiences and undesired responses from the models.

In this post, we demonstrate how to implement an automated prompt evaluation system using Amazon Bedrock so you can streamline your prompt development process and improve the overall quality of your AI-generated content. For this, we use Amazon Bedrock Prompt Management and Amazon Bedrock Prompt Flows to systematically evaluate prompts for your generative AI applications at scale.

The importance of prompt evaluation

Before we explain the technical implementation, let’s briefly discuss why prompt evaluation is crucial. The key aspects to consider when building and optimizing a prompt are typically:

  1. Quality assurance – Evaluating prompts helps make sure that your AI applications consistently produce high-quality, relevant outputs for the selected model.
  2. Performance optimization – By identifying and refining effective prompts, you can improve the overall performance of your generative AI models in terms of lower latency and ultimately higher throughput.
  3. Cost efficiency – Better prompts can lead to more efficient use of AI resources, potentially reducing costs associated with model inference. A good prompt allows for the use of smaller and lower-cost models, which wouldn’t give good results with a bad quality prompt.
  4. User experience – Improved prompts result in more accurate, personalized, and helpful AI-generated content, enhancing the end user experience in your applications.

Optimizing prompts for these aspects is an iterative process that requires an evaluation for driving the adjustments in the prompts. It is, in other words, a way to understand how good a given prompt and model combination are for achieving the desired answers.

In our example, we implement a method known as LLM-as-a-judge, where an LLM is used for evaluating the prompts based on the answers it produced with a certain model, according to predefined criteria. The evaluation of prompts and their answers for a given LLM is a subjective task by nature, but a systematic prompt evaluation using LLM-as-a-judge allows you to quantify it with an evaluation metric in a numerical score. This helps to standardize and automate the prompting lifecycle in your organization and is one of the reasons why this method is one of the most common approaches for prompt evaluation in the industry.

Prompt evaluation logic flow

Let’s explore a sample solution for evaluating prompts with LLM-as-a-judge with Amazon Bedrock. You can also find the complete code example in amazon-bedrock-samples.

Prerequisites

For this example, you need the following:

Set up the evaluation prompt

To create an evaluation prompt using Amazon Bedrock Prompt Management, follow these steps:

  1. On the Amazon Bedrock console, in the navigation pane, choose Prompt management and then choose Create prompt.
  2. Enter a Name for your prompt such as prompt-evaluator and a Description such as “Prompt template for evaluating prompt responses with LLM-as-a-judge.” Choose Create.

Create prompt screenshot

  1. In the Prompt field, write your prompt evaluation template. In the example, you can use a template like the following or adjust it according to your specific evaluation requirements.
You're an evaluator for the prompts and answers provided by a generative AI model.
Consider the input prompt in the <input> tags, the output answer in the <output> tags, the prompt evaluation criteria in the <prompt_criteria> tags, and the answer evaluation criteria in the <answer_criteria> tags.

<input>
{{input}}
</input>

<output>
{{output}}
</output>

<prompt_criteria>
- The prompt should be clear, direct, and detailed.
- The question, task, or goal should be well explained and be grammatically correct.
- The prompt is better if containing examples.
- The prompt is better if specifies a role or sets a context.
- The prompt is better if provides details about the format and tone of the expected answer.
</prompt_criteria>

<answer_criteria>
- The answers should be correct, well structured, and technically complete.
- The answers should not have any hallucinations, made up content, or toxic content.
- The answer should be grammatically correct.
- The answer should be fully aligned with the question or instruction in the prompt.
</answer_criteria>

Evaluate the answer the generative AI model provided in the <output> with a score from 0 to 100 according to the <answer_criteria> provided; any hallucinations, even if small, should dramatically impact the evaluation score.
Also evaluate the prompt passed to that generative AI model provided in the <input> with a score from 0 to 100 according to the <prompt_criteria> provided.
Respond only with a JSON having:
- An 'answer-score' key with the score number you evaluated the answer with.
- A 'prompt-score' key with the score number you evaluated the prompt with.
- A 'justification' key with a justification for the two evaluations you provided to the answer and the prompt; make sure to explicitely include any errors or hallucinations in this part.
- An 'input' key with the content of the <input> tags.
- An 'output' key with the content of the <output> tags.
- A 'prompt-recommendations' key with recommendations for improving the prompt based on the evaluations performed.
Skip any preamble or any other text apart from the JSON in your answer.
  1. Under Configurations, select a model to use for running evaluations with the prompt. In our example we selected Anthropic Claude Sonnet. The quality of the evaluation will depend on the model you select in this step. Make sure you balance the quality, response time, and cost accordingly in your decision.
  2. Set the Inference parameters for the model. We recommend that you keep Temperature as 0 for making a factual evaluation and to avoid hallucinations.

You can test your evaluation prompt with sample inputs and outputs using the Test variables and Test window panels.

  1. Now that you have a draft of your prompt, you can also create versions of it. Versions allow you to quickly switch between different configurations for your prompt and update your application with the most appropriate version for your use case. To create a version, choose Create version at the top.

The following screenshot shows the Prompt builder page.

Evaluation prompt template screenshot

Set up the evaluation flow

Next, you need to build an evaluation flow using Amazon Bedrock Prompt Flows. In our example, we use prompt nodes. For more information on the types of nodes supported, check the Node types in prompt flow documentation. To build an evaluation flow, follow these steps:

  • On the Amazon Bedrock console, under Prompt flows, choose Create prompt flow.
  • Enter a Name such as prompt-eval-flow. Enter a Description such as “Prompt Flow for evaluating prompts with LLM-as-a-judge.” Choose Use an existing service role to select a role from the dropdown. Choose Create.
  • This will open the Prompt flow builder. Drag two Prompts nodes to the canvas and configure the nodes as per the following parameters:
    • Flow input
      • Output:
        • Name: document, Type: String
    • Invoke (Prompts)
      • Node name: Invoke
      • Define in node
      • Select model: A preferred model to be evaluated with your prompts
      • Message: {{input}}
      • Inference configurations: As per your preferences
      • Input:
        • Name: input, Type: String, Expression: $.data
      • Output:
        • Name: modelCompletion, Type: String
    • Evaluate (Prompts)
      • Node name: Evaluate
      • Use a prompt from your Prompt Management
      • Prompt: prompt-evaluator
      • Version: Version 1 (or your preferred version)
      • Select model: Your preferred model to evaluate your prompts with
      • Inference configurations: As set in your prompt
      • Input:
        • Name: input, Type: String, Expression: $.data
        • Name: output, Type: String, Expression: $.data
      • Output
        • Name: modelCompletion, Type: String
    • Flow output
      • Node name: End
      • Input:
        • Name: document, Type: String, Expression: $.data
  • To connect the nodes, drag the connecting dots, as shown in the following diagram.

Simple prompt evaluation flow

  • Choose Save.

You can test your prompt evaluation flow by using the Test prompt flow panel. Pass an input, such as the question, “What is cloud computing in a single paragraph?” It should return a JSON with the result of the evaluation similar to the following example. In the code example notebook, amazon-bedrock-samples, we also included the information about the models used for invocation and evaluation to our result JSON.

{
	"answer-score": 95,
	"prompt-score": 90,
	"justification": "The answer provides a clear and technically accurate explanation of cloud computing in a single paragraph. It covers key aspects such as scalability, shared resources, pay-per-use model, and accessibility. The answer is well-structured, grammatically correct, and aligns with the prompt. No hallucinations or toxic content were detected. The prompt is clear, direct, and explains the task well. However, it could be improved by providing more details on the expected format, tone, or length of the answer.",
	"input": "What is cloud computing in a single paragraph?",
	"output": "Cloud computing is a model for delivering information technology services where resources are retrieved from the internet through web-based tools. It is a highly scalable model in which a consumer can access a shared pool of configurable computing resources, such as applications, servers, storage, and services, with minimal management effort and often with minimal interaction with the provider of the service. Cloud computing services are typically provided on a pay-per-use basis, and can be accessed by users from any location with an internet connection. Cloud computing has become increasingly popular in recent years due to its flexibility, cost-effectiveness, and ability to enable rapid innovation and deployment of new applications and services.",
	"prompt-recommendations": "To improve the prompt, consider adding details such as the expected length of the answer (e.g., 'in a single paragraph of approximately 100-150 words'), the desired tone (e.g., 'in a professional and informative tone'), and any specific aspects that should be covered (e.g., 'including examples of cloud computing services or providers').",
	"modelInvoke": "amazon.titan-text-premier-v1:0",
	"modelEval": "anthropic.claude-3-sonnet-20240229-v1:0"
}

As the example shows, we asked the FM to evaluate with separate scores the prompt and the answer the FM generated from that prompt. We asked it to provide a justification for the score and some recommendations to further improve the prompts. All this information is valuable for a prompt engineer because it helps guide the optimization experiments and helps them make more informed decisions during the prompt life cycle.

Implementing prompt evaluation at scale

To this point, we’ve explored how to evaluate a single prompt. Often, medium to large organizations work with tens, hundreds, and even thousands of prompt variations for their multiple applications, making it a perfect opportunity for automation at scale. For this, you can run the flow in full datasets of prompts stored in files, as shown in the example notebook.

Alternatively, you can also rely on other node types in Amazon Bedrock Prompt Flows for reading and storing in Amazon Simple Storage Service (Amazon S3) files and implementing iterator and collector based flows. The following diagram shows this type of flow. Once you have established a file-based mechanism for running the prompt evaluation flow on datasets at scale, you can also automate the whole process by connecting it your preferred continuous integration and continuous development (CI/CD) tools. The details for these are out of the scope of this post.

Prompt evaluation flow at scale

Best practices and recommendations

Based on our evaluation process, here are some best practices for prompt refinement:

  1. Iterative improvement – Use the evaluation feedback to continuously refine your prompts. The prompt optimization is ultimately an iterative process.
  2. Context is key – Make sure your prompts provide sufficient context for the AI model to generate accurate responses. Depending on the complexity of the tasks or questions that your prompt will answer, you might need to use different prompt engineering techniques. You can check the Prompt engineering guidelines in the Amazon Bedrock documentation and other resources on the topic provided by the model providers.
  3. Specificity matters – Be as specific as possible in your prompts and evaluation criteria. Specificity guides the models towards desired outputs.
  4. Test edge cases – Evaluate your prompts with a variety of inputs to verify robustness. You might also want to run multiple evaluations on the same prompt for comparing and testing output consistency, which might be important depending on your use case.

Conclusion and next steps

By using the LLM-as-a-judge method with Amazon Bedrock Prompt Management and Amazon Bedrock Prompt Flows, you can implement a systematic approach to prompt evaluation and optimization. This not only improves the quality and consistency of your AI-generated content but also streamlines your development process, potentially reducing costs and improving user experiences.

We encourage you to explore these features further and adapt the evaluation process to your specific use cases. As you continue to refine your prompts, you’ll be able to unlock the full potential of generative AI in your applications. To get started, check out the full with the code samples used in this post. We’re excited to see how you’ll use these tools to enhance your AI-powered solutions!

For more information on Amazon Bedrock and its features, visit the Amazon Bedrock documentation.


About the Author

Antonio Rodriguez

Antonio Rodriguez is a Sr. Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Read More

Collaborators: Silica in space with Richard Black and Dexter Greene

Collaborators: Silica in space with Richard Black and Dexter Greene

Headshots of Richard Black and Dexter Greene for the Microsoft Research Podcast

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with. 

Nearly 50 years ago, Voyager 1 and 2 took off for space, each with a record comprising a sampling of earthly sounds and sights. The records’ purpose? To give extraterrestrials a sense of humanity. Thanks to students at Avenues: The World School, the universe might be receiving an update. In this episode, college freshman and Avenues alum Dexter Greene and Microsoft research manager Richard Black talk about how Project Silica, a technology that uses tiny laser pulses to store data in small glass “platters,” is supporting the Avenues Golden Record 2.0 project; what it means for data storage more broadly; and why the students’ efforts are valuable even if the information never gets to its intended recipients.

Transcript

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

DEXTER GREENE: So the original Golden Record is … I like to think of it as, sort of, a time capsule of humanity that was designed to represent us—who we are as a species, what we love, why we love it, what we do, and, sort of, our diversity, why we’re all different, why we do different things—to possible extraterrestrials. And so the Golden Record was produced in 1977 by a relatively small team led by Carl Sagan. What we’re doing, my team, is we’re working on creating an updated Golden Record. And I began researching different storage methods, and I began to realize that we hadn’t made that much headway in storage since then. Of course, we’ve made progress but nothing really spectacular until I found 5D storage. And I noticed that there were only two real places that I could find information about this. One was the University of Southampton, and one was Project Silica at Microsoft. I reached out to the University of Southampton and Dr. Black, and somehow, kind of, to my surprise, Dr. Black actually responded!

RICHARD BLACK: I was in particularly intrigued by the Avenues Golden Record application because I could see it was an application not just where Silica was a better media than what people use today but really where Silica was the only media that would work because none of the standard media really work over the kind of time scales that are involved in space travel, and none of them really work in the harsh environments that are involved in space and outer space and space travel. So in some ways for me, it was an easy way to communicate just what a transformative digital media technology Silica is, and that’s why as an application, it really grabbed my interest.


[TEASER ENDS] 

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.

[MUSIC FADES] 

Today I’m talking to Dr. Richard Black, a senior principal research manager and the research director of Project Silica at Microsoft Research. And with him is Dexter Greene, a rising freshman at the University of Michigan and a recent graduate of Avenues: The World School in New York City. Richard and Dexter are involved in a unique multidisciplinary, multi-institutional, and multigenerational collaboration called Avenues Golden Record, a current effort to communicate with extraterrestrial intelligence. We’ll get into that in a lot more detail shortly, but first, let’s meet our collaborators.

Richard, let’s start with you. As I’ve just noted, you’re a research manager at the Cambridge UK lab of Microsoft Research and the research director of a really cool technology called Silica. In a second, I want you to talk about that more specifically, but right now, tell us about yourself. What’s your background? What are your research interests writ large? And what excites you about the broad remit of your work at Cambridge?

RICHARD BLACK: So my background is a computer scientist. I’ve been at Microsoft Research for 24 years, and before that, I had a faculty position at a university here in the UK. So I also have an interest in education, and it’s been a delight to interact with Dexter and the other students at Avenues. My research interests really cover all aspects of computer systems, which means operating systems, networking, and computer architecture. And the exciting thing for me about being at Microsoft Research is that this is really a period of rapid change with the cloud, digital transformation of society. It gives really a huge motivation to research better underlying technologies for everything that we do. And for me in the last few years, that’s been in archival storage with Project Silica.

HUIZINGA: Hmm. Richard, I’m interested to know a little bit more about your background. Where did you go to school, what led you to this kind of research, and what university were you teaching at?

BLACK: Yeah, I went to university and did my PhD here in Cambridge. I was teaching at the University of Glasgow, which is in Scotland in the UK, and teaching again computer systems, so those operating systems, computer architecture, and computer networking.

HUIZINGA: Well, Dexter, you’re the first student collaborator we’ve featured on this show, which is super fun. Tell us about yourself and about Avenues: The World School, where this particular collaboration was born.

DEXTER GREENE: Thanks for having me. I’m super excited to be here. And like you said, it’s very cool to be the first student collaborator that you featured on the show. So I’m 18. I just graduated high school a few months ago, and I will be attending the University of Michigan’s College of Engineering in the fall. If you know me personally, you know that I love robotics. I competed in the FIRST Tech Challenge all throughout high school. The FIRST Tech Challenge is a student robotics competition. There is the FIRST Tech Challenge, FIRST Robotics Competition, and FIRST LEGO League. So it’s, like, three different levels of robotics competition, which is run all around the world. And every year, there’s, like, a championship at the end to declare a winner. And I plan to major in either robotics or mechanical engineering. So more about Avenues. Avenues is a K-through-12 international immersion school, which is very interesting. So younger students might do a day in Spanish and a day in English or a day in Mandarin and then a day in English, going through all their classes in that language. So I actually attended Avenues since second grade, so when I was younger, I would do a full day in Spanish and then I would switch to a full day in English, doing my courses like math, history, English, all in my language, Spanish for me. And Avenues is a very interesting school and very different in many ways. They like to, sort of, think outside the box. There’s a lot of very unique classes, unique programs. A great example is what they call J-Term, or June and January Term, which is where students will have one course every day for the entire month where they can really dive deep into that subject. And I was actually lucky enough to do the Golden Record for a full month in 11th grade, which I’ll talk about this more, but that’s actually when I first made contact with Dr. Black and found this amazing technology, which is, I guess why we’re all here today.

HUIZINGA: Right.

GREENE: So, yeah, there’s many really cool parts about Avenues. There’s travel programs that you can do where you can go all around the world. You can go between different campuses. There’s online classes that you can take. The list goes on …

HUIZINGA: Well, it’s funny that you say “when I first made contact with Dr. Black” because it sounds like something that you’re working on! So let’s talk about that for a second. So the project we’re talking about today is Avenues Golden Record, but it’s not the first Golden Record to exist. So for those of our listeners who don’t know what Golden Record even is, Dexter, give us a little history lesson and chronicle the story from the original Golden Record way back in 1977 all the way to what you’re doing today with the project.

GREENE: Yeah. So I guess let me start with, what is the Golden Record? So the original Golden Record is … I like to think of it as, sort of, a time capsule of humanity that was designed to represent us—who we are as a species, what we love, why we love it, what we do, and, sort of, our diversity, why we’re all different, why we do different things—to possible extraterrestrials. And so the Golden Record was produced in 1977 by a relatively small team led by Carl Sagan[1], an American astronomer who was a professor at, I believe, Cornell. And so it’s basically a series of meticulously curated content. So that could be images, audios, sounds of nature, music, the list goes on. Really anything you can think of. That’s, sort of, the beauty of it. Anything can go on it. So it’s just a compilation of what we are, who we are, and why we are—what’s important to us. A great example, one of my favorite parts of the Golden Record, is one of the first audios on it is a greeting in 55 languages. It’s, sort of, meant to be, like, a welcome … I guess less of a welcome, but more like a hello because we’re not welcoming anyone to Earth, [LAUGHTER] but it’s, like, a hello, nice to meet you, in 55 languages to show that we’re very diverse, very different. And, yeah, you can actually … if you’re interested and if you’d like to learn more, you can actually go see all the content that’s on the Golden Records. NASA has a webpage for that. I definitely recommend if you have a chance to check it out.

HUIZINGA: Yeah.

GREENE: And I guess moving on to future attempts … so what we’re doing, my team, is we’re working on creating an updated Golden Record. So it’s been 47 years now since the original Golden Record—kind of a long time. And of course a lot’s changed. Some for the better, some for the worse. And we think that it’s about time we update that. Update who we are, what we are, and what we care about, what we love.

HUIZINGA: Right.

GREENE: So our team has begun working on that. One project that I’m familiar with, other than our own, that’s, sort of, a similar attempt is known as Humanity’s Message to the Stars, which is led by Dr. Jonathan Jiang, who is a researcher at NASA’s Jet Propulsion Laboratory.[2] Very cool. That’s the only project that’s similar that I’m aware of, but I’m sure there have been other attempts in the past.

HUIZINGA: Yeah … just to make a note right now, we’re using the term “record,” and the original medium was actually a record, like an LP. But excitingly, we’ll get to why Dr. Black is on the show today [LAUGHS] and talk about the new media. Before we do that, as I was preparing this episode, it began to feel like a story of contrasting couplets, like earthlings and aliens, content and media, veteran researcher and high school student. … So let’s talk about the last pairing for a second, the two of you, and how you got together on this project. It’s a fun story. I like to call this question “how I met your mother.” So how did a high school kid from New York come to be a research collaborator with a seasoned scientist from Cambridge? Dexter, tell your side of the story. It’s cool. And then Richard can fill in the blanks from across the pond!

GREENE: Yeah, so let me actually rewind a little bit further than that, about how I got into the project myself, …

HUIZINGA: Good!

GREENE: … which, I think, is a pretty fun story. So one of my teachers—my design and engineering teacher at the time, Mr. Cavalier—gave a presentation at one of our gradewide assemblies. And the first slide was something along the lines of “the most challenging project in human history,” which immediately caught my eye. I was like, I have to do this! There’s no way I’m not doing this project! [LAUGHTER] And the slides to come of course made me want to partake in the project even more. But that first slide … really, I was sold. It was a done deal! So I applied to the project. I got in. And then we began working and researching, and I’ll talk about this more later, as well, but we, sort of, split up into two teams at the beginning: content and media. Media being the form, or medium, that we send it on. And so that was the team that I was on. And I began researching different storage methods and, sort of, advancements in storage methods since the original Golden Record in 1977. And I began to realize that we hadn’t made that much headway in storage since then. Of course we’ve made progress but nothing really spectacular until I found 5D storage. And I was immediately, just, amazed by the longevity, durability, capacity—so many things. I mean, there’s just so many reasons to be amazed. But … so I began researching and I noticed that there were only two real places that I could find information about this. One was the University of Southampton, I believe, and one was Project Silica at Microsoft. And so I actually reached out to both. I reached out to the University of Southampton and Dr. Black, and somehow, [LAUGHS] kind of, to my surprise, Dr. Black actually responded! And I was, kind of, stunned when he responded because I was like, there’s no way this researcher at Microsoft is going to respond to this high school student that he’s never met in the middle of nowhere. So when Dr. Black did respond, I was just amazed and so excited. And, yeah, it went from there. We began communicating back and forth. And then, I believe, we met once over the following summer, and now we’re here!

HUIZINGA: OK, there’s so many parallels right now between this communication contact and what you’re doing with potential extraterrestrial intelligence. It’s like, I contacted him, he contacted me back, and then we started having a conversation. … Yeah, so, Richard, you were the guy who received the cold email from this high school student. What was your reaction, and how did you get interested in pursuing a relationship in terms of the science of this?

BLACK: Yeah, so let me say I was really intrigued by the Avenues Golden Record application. I do get quite a lot of cold emails, [LAUGHTER] and I try to reply to most of them. I do have a few canned answers because I don’t have time to interact with everybody who reaches out to me. But I was in particularly intrigued by the Avenues Golden Record application because I could see it was an application not just where Silica was a better media than what people use today but really where Silica was the only media that would work because none of the standard media really work over the kind of time scales that are involved in space travel, and none of them really work in the harsh environments that are involved in space and outer space and space travel. So in some ways for me, it was an easy way to communicate just what a transformative digital media technology Silica is, and that’s why as an application it really grabbed my interest.

HUIZINGA: So did you have any idea when the initial exchange happened that this would turn into a full-blown project?

BLACK: I didn’t know how much time Dexter and his fellow students would have to invest in it. So for me, at the beginning, I was just quite happy to answer a few questions that they have, to point them in the right direction, to fill in a few blanks, and things like that. And it was only much later, I think, after perhaps we’d had our first meeting, that I realized that Dexter and his team were actually serious, [LAUGHTER] and they had some time, and they were going to actually invest in this and think it through. And so I was happy to work with them and to continue to answer questions that they had and to work towards actually, you know, writing a couple of Silica platters with the output that they were creating and providing it for them.

HUIZINGA: Well, let’s dig in there. Richard, let’s talk about digital data and the storage mediums that love it. I want to break this into two parts because I’m interested in it from two angles. And the first one is purely technical. I’ll take a second to note that we did an episode on Project Silica way back in 2019. I say way back, like … but in technical years right now, [LAUGHS] that seems like a long time! And on that episode, your colleague Ant Rowstron talked with me and Mark Russinovich, the CTO of Microsoft’s Azure. So we’ll put a link in the show notes for that super-fun, interesting show. But right now, Richard, would you give our listeners an overview of the current science of data on glass? What is Silica? How is it different from other storage media? And what’s changed in the five years since I talked to Ant and Mark?

BLACK: Sure. So Silica is an archival storage technology that stores data inside fused silica glass. And it does that using ultrashort laser pulses that make a permanent, detectable, and yet transparent modification to the glass crystal, so the data ends up as durable as the piece of glass itself.

HUIZINGA: Wow.

BLACK: And being transparent means that we can get hundreds of layers of data inside a block of glass that’s only two millimeters thin, making for really incredibly high densities. And since this new physics was discovered at the University of Southampton in the UK, we’ve been working to tame that, and we’ve improved density, energy over a hundred-fold in the time period that we’ve been working on it, and the speed over ten thousand-fold. And we continue to, in our research, to make Silica better and faster. And, yes, you’re right, five years might seem like quite a long time. A comparison that you might think of here is the history of the hard drive. In the history of the hard drive, there was a point in history at which humans discovered the physical effect of magnetism. And it took us actually quite a long time as a species to go from magnetism to hard drives. In this case, this new physical effect that was discovered at Southampton, this new physical effect, you can think of it a bit like discovering magnetism, and taking it all the way from there to actually a real operating storage system actually takes quite a lot of research and effort and development, and that’s the path that we’ve been on doing that, taming and improving densities and speeds and energies and so on during the years of the project.

HUIZINGA: Well, talk a little bit more about the reading and writing of this medium. What’s involved technically on how you get the data on and how you retrieve it?

BLACK: Yeah, and so interestingly the writing of the data and the reading of the data are actually completely different. So writing the data is done with an ultrashort laser pulse. It’s actually a femtosecond-length pulse, and a femtosecond is one-thousandth of one-millionth of one-millionth of a second. And if you take even quite a small amount of energy and you compress it in time into a pulse that short and then you use a lens to focus it in space into just a tiny point, then the intensity of the light at that point during that pulse is just so mind-bogglingly high that you actually get something called a plasma-induced nano-explosion. [LAUGHTER] And I’m not an appropriate physicist of the right sort by background, but I can tell you that what that does is it really transforms the glass crystal at that point but in a way in which it’s, just, it’s so short—the time pulse is so short—it doesn’t really get to damage the crystal around that point. And that’s what enables the data to be incredibly durable because you’ve made this permanent, detectable, and yet transparent change to the glass crystal.

HUIZINGA: So that’s writing. What about reading?

BLACK: Reading you do with a microscope!

HUIZINGA: Oh, my gosh.

BLACK: So it’s a much more straightforward process. A reader is basically a computer-controlled, high-speed, high-quality microscope. And you focus the microscope at an appropriate depth inside the glass, and then you just photograph it. And you get to, if it’s an appropriate sort of microscope, you get to see the changes that you’ve made to the glass crystal. And then we process those images, in fact, using machine learning neural networks to turn it back into the data that we’d originally put into the glass platter. So reading and writing quite different. And on the reading, we’re just using regular light, so the reading process can’t possibly damage the data that’s been stored inside the glass.

HUIZINGA: I imagine you wouldn’t want to get your eye in the path of a femtosecond laser …

BLACK: Yes, femtosecond lasers are not for use at home! That’s quite true. In fact, your joke comment about the eye is … eye surgery is also actually done with femtosecond lasers. That’s one of the other applications.

HUIZINGA: Oh, OK! So maybe you would!

BLACK: But, yes, no, this is definitely something that, for many reasons, Silica is something that’s related to cloud technology, the writing process. And I think we’ll get back to that perhaps later in our discussion.

HUIZINGA: Yeah, yeah.

BLACK: But, yeah, definitely not something for the home.

HUIZINGA: How powerful is the microscope that you have to use to read this incredibly small written data?

BLACK: It’s fairly straightforward from a power point of view, but it has been engineered to be high-speed, high-quality, and under complete computer control that enables us to move rapidly around the piece of glass to wherever the data is of interest and then image at high speed to get the data back out.

HUIZINGA: Yeah. Well, so as you describe it, these amazingly tiny laser pulses store zettabytes of data. Talk for one second, still technically, about how you find and extract the data. You know, I’ve used this analogy before, but at the end of the movie Indiana Jones, the Ark of the Covenant is stored in an army warehouse. And the camera pulls back and there’s just box after box after crate after crate. … It’s like, you’ll never find it. Once you’ve written and stored the data, how do you go about finding it?

BLACK: So like all storage media, whether it be hard drive, tape, flash that might be in your phone in your pocket, there are standard indexing methods. You know, there’s an addressing system, you know, blocks and sectors and tracks. And, you know, we use all of these, kind of, standard terminology in terms of the way we lay the data out on the glass, and then each piece of glass is uniquely identified, and the glass is stored in the library. And actually, we’ve done some quite interesting work and novel work on the robotics that we use for handling and moving the pieces of glass in Silica. It’s interesting Dexter is talking about being interested in robotics. We’ve done a whole bunch of new interesting robotics in Silica because we wanted the shelving or the library system that we keep the glass on to last as long as the glass. And so we wanted it to be completely passive. And we wanted all of the, kind of, the active components to be in the robotics. So we have these new robots that we call shuttles that can, kind of, climb around the library and retrieve the bits of glass that are needed and take them to a reader whenever reading is needed, and that enables us really to scale out a library to enormous scale over many decades or centuries and to just keep growing a passive, completely passive, library.

HUIZINGA: Yeah, I saw a video of the retrieval and it reminded me of those old-fashioned ladders in libraries where you scoot along and you’re on the wall of books and this is, sort of, like the wall of glass. … So, Richard, part two. Let’s talk about Silica from a practical point of view because apparently not all data is equal, and Silica isn’t for everyone’s data all the time. So who are you making this for generally speaking and why? And did you have aliens on your bingo card when you first started?!

BLACK: So, no, I didn’t have aliens [LAUGHTER] on the bingo card when I first started, definitely not. But as I mentioned, yeah, Project Silica is really about archival data. So that’s data that needs to be kept for many years—or longer—where it’s going to be accessed infrequently, and when you do need to access it, you don’t need it back instantaneously. And there’s actually a huge and increasing amount of data that fits those criteria and growing really very rapidly. Of course it’s not the kind of data that you keep in your pocket, but there is a huge amount of it. A lot of archival records that in the past might have been generated and kept on paper, they’re now, in the modern world, they’re all born digital. And we want to look for a low-cost- and low-environment-footprint way of really keeping it in that digital format for the length of time that it needs to be kept. And so Silica is really for data that’s kept in the cloud, not the pocket or the home or the business. Today most organizations already use the cloud for their digital data to get advantages of cost, sustainability, efficiency, reliability, availability, geographic redundancy, and so on. And Silica is definitely designed for that use case. So archival data in the cloud, data that needs to be kept for a long time period, and there’s huge quantities of it and it’s pouring in every day.

HUIZINGA: So concrete example. Financial data, medical data, I mean, what kinds of verticals or sectors would find this most useful?

BLACK: Yeah, so the financial industry, there’s a lot of regulatory requirements to keep data. Obviously in the healthcare situation, there’s a lot of general record keeping, any archives, museums, and so on that exist today. We see a lot of growth in things like the extractive industries, any kind of mining. You want to keep really good records of what it was that you did to, you know, did underground or did to the earth. The media and entertainment industry is one where they create a lot of content that needs to be kept for long time periods. We see scientific research studies where they measure and accumulate a large quantity of data that they want to keep for future analysis, possibly, you know, use it later in training ML models or just for future analysis. Sometimes that data can’t be reproduced. You know, it represents a measurement of the earth at some point and then, you know, things have changed and it wouldn’t be possible to go back and recapture that data.

HUIZINGA: Right.

BLACK: We see stuff in government and local government. One example is we see some local governments who want, essentially, to create a digital twin of their city. And so when new buildings are being built, they want to keep the blueprints, the photographs of the construction site, all of the data about what was built from floor plans and everything else that would help not only emergency services but just help the city in general to understand what’s in its environment, and they want all of that to be kept while that building exists in their city. So there’s lots and lots and lots of growing data that needs to be kept—sometimes for legal reasons, sometimes for practical reasons—lots of it a really fast-growing tier within the data universe.

HUIZINGA: Yeah. Dexter, let’s go back to you. On the Avenues website, it says the purpose of the Golden Record is to, as you mentioned before, “represent humanity and Earth to potential extraterrestrial beings, encapsulating our existence through a collection of visuals and sounds.” That’s pretty similar to the first Golden Record’s mission. But yours is also different in many ways. So talk about what’s new with this version, not just the medium but how you’re going about putting things together, both conceptually and technically.

GREENE: Yeah. So that’s a great question. I can take it in a million different directions. I’ll start by just saying of course the new technology that Dr. Black is working on is, like, the biggest change, at least in my view, because I like this kind of stuff. [LAUGHTER] But that’s like really the huge thing—durability, longevity, and capacity, capacity being one of the main aspects. We could just fit so much more content than was possible 50 years ago. But there’s a lot more. So on the original Golden Record, they only had weeks to work on the project before it had to be ready to go, to put on the Voyager 1 and 2 spacecrafts. So they had a huge time constraint, which of course we don’t have now. We’ve got as much time as we need. And then … I’ll talk about how we’ve been working on the project. So we split up into two main teams, content and form. Form being media, which I, like I said earlier, is the team that I work on. And our content team has been going through loads of websites and online databases, which is another huge difference. When they created the original Golden Record 50 years ago, they actually had to look through books and, like, photocopy each image they wanted. Of course now we don’t have to do that. We just find them online and drag and drop them into a folder. So there’s that aspect, which makes it so much easier to compile so much content and good-quality content that is ethically sourced. So we can find big databases that are OK with giving us their data. Diversity is another big aspect that we’ve been thinking about. The original Golden Record team didn’t have a lot of time to really focus on diversity and capturing everything, the whole image of what we are, which is something that we’ve really been working on. We’re trying to get a lot of different perspectives and cover really everything there is to cover, which is why we actually have an online submission platform on our website where any random person can take an image of their cat that they like [LAUGHTER] or an image of their house or whatever it may be and they can submit that and it will make its way into the content and actually be part of the Golden Record that we hopefully send to space.

HUIZINGA: Right. So, you know, originally, like you say, there’s a sense of curation that has to happen. I know that originally, they chose not to include war or conflict or anything that might potentially scare or frighten any intelligence that found it, saying, hey, we’re not those people. But I know you’ve had a little bit different thinking about that. Tell us about it.

GREENE: Yeah, so that’s something that we’ve talked about a lot, whether or not we should include good and bad. It’s funny. I actually wrote some of my college essays about that, so I have a lot to say about it. I’ll just give you my point of view, and I think most of my team shares the same point of view. We should really capture who we are with the fullest picture that we can without leaving anything out. One of the main reasons that I feel that way is what might be good to us could be bad to extraterrestrials. So I just don’t think it’s worth it to exclude something if we don’t even know how it’s perceived to someone else.

HUIZINGA: Mm-hmm. So back to the space limitations, are you having to make choices for limiting your data, or are you just sort of saying, let’s put everything on?

GREENE: So on the original Golden Record, of course they really meticulously curated everything that went on the record because there wasn’t that much space.

HUIZINGA: Yeah …

GREENE: So they had to be very careful with what they thought was worth it or not. Now that we have so much space, it seems worth it just to include everything that we can include because maybe they see something that we don’t see from an image.

HUIZINGA: Right.

GREENE: The one thing that we … at the very beginning, during my J-Term in 11th grade, we were actually lucky enough to have Jon Lomberg[3], one of the members of the original team, come in to talk to us a bit. And he gave us a, sort of, a lesson about how to choose images, and he was actually the one that chose a lot of the images for the original record. So it was really insightful. One thing we talked a lot about was, like, shadows. A shadow could be very confusing and, sort of, mess up how they perceive the image, but it also might just be worth including because, why not? We can include it, and maybe they get something … they learn about shadows from it even though it’s confusing. So that’s, sort of, how we have thought about it.

HUIZINGA: Well, that’s an interesting segue, because, Richard, at this point, I usually ask what could possibly go wrong if you got everything right. And there are some things that you think, OK, we don’t know. Even on Earth, we have different opinions about different things. And who knows what any other intelligence might think or see or interpret? But, I want to steer away from that question because when we talked earlier, Richard, I was intrigued by something you said, and I want you to talk about it here. I’ll, kind of, paraphrase, but you basically said, even if there’s no intelligent life outside our planet, this is a worthwhile exercise for us as humans. Why’d you say that?

BLACK: Well, I had two answers to that, one, kind of, one selfish and one altruistic! [LAUGHTER] I talk to a lot of archival data users, and those who are serious about keeping their data for many hundreds of years, they think about the problem in, kind of, three buckets. So one is the keeping of the bits themselves. And of course that’s what we are working on in Project Silica and what Silica is really excellent at. One is the metadata, or index, that records what is stored, where it’s stored, and so on. And that’s really the provenance or the remit of the archivist as curator. And then the third is really ensuring that there’s an understanding of how to read the media that persists to those future generations who’ll want to read it. And this is sometimes called the Rosetta Stone problem, and that isn’t the core expertise of me or my team. But the Golden Record, kind of, proves that it can be solved. You know, obviously, humanity isn’t going to give up on microscopes, but if we can explain to extraterrestrials how they would go about reading a Silica platter, then it should be pretty obvious that we can explain to our human descendants how to do so.

HUIZINGA: Hmmm.

BLACK: The altruistic reason is that I think encouraging humanity to reflect on itself—where we are, the challenges ahead for us as a species here on planet Earth—you know, this is a good time to think those thoughts. And any time capsule—and the Golden Record, you can, kind of, view it a bit like a time capsule—it’s a good time to step back and think those philosophical thoughts.

HUIZINGA: Dexter, do you have any thoughts? I know that Dr. Black has, kind of, taken the lead on that, but I wonder if you’ve given any thought to that yourself.

GREENE: Yeah, we’ve given a lot of thought to that: even if the record doesn’t reach extraterrestrials, is it worth it? Why are we doing this? And we feel the exact same as Dr. Black. It’s so worth it just for us to reflect on where we are and how we can improve what we’ve done in the past and what we can do in the future. It’s a … like Dr. Black said, it’s a great exercise for us to do. And it’s exciting. One of the beautiful parts about this project is that there’s no, like, right or wrong answer. Everyone has a different perspective on it.

HUIZINGA: Yeah …

GREENE: And I think this is a great way to think about that.

HUIZINGA: Yeah. So, Dexter, I always ask my collaborators where their project is on the spectrum from lab to life. But this research is a bit different from some of the other projects we featured. What is the, sort of, remit of your timeline? Is there one for completing the record in any way? Who, if anyone, are you accountable to? And what are your options for getting it up into space once it’s ready to go? Because there is no Voyager just imminently leaving right now, as I understand it. So talk a little bit about the scope from lab to life on this.

GREENE: Yeah. So, like you said, we don’t really have an exact timeline. This is, sort of, one of those projects where we could compile content forever. [LAUGHTER] There’s always more content to get. There’s always more perspectives to include. So I could do this forever. But I think the goal is to try and get all the content and get everything ready within the next couple years. As for who we’re accountable to, we’re, sort of, just accountable to ourselves. The way we’ve been working on this is not really like a club, I wouldn’t say, more just like a passion project that a few students and a few teachers have taken a liking to, I guess. So we’re just accountable to ourselves. We of course, like, we have meetings every week, and my teacher was the one that, like, organized the meetings. So I was, sort of, accountable to my teacher but really just doing it for ourselves.

HUIZINGA: Mm-hmm.

GREENE: As for getting it up into space, we have been talking a bit with the team led by Dr. Jiang. So ideally, in the future, we would collaborate more with them and [LAUGHS] go find our ticket to space on a NASA spaceship! But there are of course other options that we’ve been looking at. There’s a bunch of space agencies all around the world. So we’re not just looking at the United States.

HUIZINGA: Well, there’s also private space exploration companies …

GREENE: Yeah, and there’s also private space like SpaceX and etc. So we’ve thought about all of that, and we’ve been reaching out to other space agencies.

HUIZINGA: I love that “ticket to outer space” metaphor but true because there are constraints on what people can put on, although glass of this size would be pretty light.

GREENE: I feel the same way. You do have to get, like, approved. Like, for the original Golden Record, they had to get everything approved to make it to space. But I would think that it would be pretty reasonable—given the technology is just a piece of glass, essentially, and it’s quite small, the smallest it could be, really—I would think that there wouldn’t be too much trouble with that.

HUIZINGA: So, so … but that does lead to a question, kind of, about then extracting, and you’ve addressed this before by kind of saying, if the intelligence that it gets to is sophisticated enough, they’ll probably have a microscope, but I’m assuming you won’t include a microscope? You just send the glass?

GREENE: Yeah. So on the original record, they actually included a … I’m not sure what it’s called, but the device that you need to …

HUIZINGA: A phonograph?

GREENE: … play a rec … yeah, a phonograph, yes. [LAUGHTER] So they include—sorry! [LAUGHS]—they included a phonograph [cartridge and stylus] on the original Voyagers. And we’ve thought about that. It would probably be too difficult to include an actual microscope, but something that I’ve been working on is instructions on not exactly how to make the microscope that you would need but just to explain, “You’re going to need a microscope, and you’re going to need to play around with it.” One of the assumptions that we’ve made is that they will be curious and advanced. I mean, to actually retrieve the data, they would need to catch a spaceship out of the sky as it flies past them …

HUIZINGA: Right!

GREENE: … which we can’t do at the moment. So we’re assuming that they’re more advanced than us, curious, and would put a lot of time into it. Time and effort.

HUIZINGA: I always find it interesting that we always assume they’re smarter than us or more advanced than us. Maybe they’re not. Maybe it’s The Gods Must Be Crazy, and they find a computer and they start banging it on a rock. Who knows? Richard, setting aside any assumptions that this Golden Record on glass makes it into space and assuming that they could catch it and figure it out, Silica’s main mission is much more terrestrial in nature. And part of that, as I understand it, is informing the next generation of cloud infrastructure. So if you could, talk for a minute about the vision for the future of digital storage, particularly in terms of sustainability, and what role Silica may play in helping huge datacenters on this planet be more efficient and maybe even environmentally friendly.

BLACK: Yes, absolutely. So Microsoft is passionate about improving the sustainability of our operations, including data storage. So today archival data uses tape or hard drives, but those have a lifetime of only a few years, and they need to be continually replaced over the lifetime of the data. And that contributes to the costs both in manufacturing and it contributes to e-waste. And of course, those media also can consume electricity during their lifetime, either keeping them spinning or in the careful air-conditioning that’s required to preserve tape. So the transformative advantage of Silica is really in the durability of the data permanently stored in the glass. And this allows us to move from costs—whatever way you think about cost, either money or energy or a sustainability cost—move from costs that are based on the lifetime of the data to costs that are based on the operations that are done to the data. Because the glass doesn’t really need any cost while it’s just sitting there, while it’s doing nothing. And that’s a standout change in the way we can think about keeping archival data because it moves from, you know, a continual, as it were, monthly cost associated with keeping the thing over and over and over to, yeah, you have to pay to write. If you need to read the data, you have to pay the cost to read the data. But in the meantime, there’s no cost to just keeping it around in case you need it. And that’s a big change. And so actually, analysis suggests that Silica should be about a factor of 10 better for sustainability over archival time periods for archival data.

HUIZINGA: And I would imagine “space” is a good proof of concept for how durable and how long you expect it to be able to last and be retrieved. Well …

BLACK: Absolutely. You know, Dexter mentioned the original Golden Record had to get a, kind of, approval to be considered space-worthy. In fact, the windows on spacecraft that we use today are made of fused silica glass. So the fused silica glass is already considered space-worthy! You know, that’s a problem that’s already solved. And, you know, it is known to be very robust and to survive the rigors of outer space.

HUIZINGA: Yeah, and the large datacenter! Well, Dexter, you’re embarking on the next journey in your life, heading off to university this fall. What are you going to be studying, and how are you going to keep going with Avenues’ Golden Record once you’re at college because you don’t have any teachers or groups or whatever?

GREENE: Yeah, that’s a great question. So, like I said, I plan to major in robotics engineering. That’s still, I guess, like, TBD. I might do mechanical engineering, but I’m definitely leaning more towards robotics. And as for the project, I definitely want to continue work on the project. That’s something I’ve made very clear to my team. Like you said, like, I won’t have a teacher there with me, but one of the teachers that works on the project was my physics teacher last year, and I’ve developed a very good relationship with him. I can say for sure that I’ll continue to stay in touch with him, the rest of the team, and this project, which I’m super excited to be working on. And I think we’re really … we, sort of, got past the big first hump, which was like the, I guess, the hardest part, and I feel like it will be smooth sailing from here!

HUIZINGA: Do you think any self-imposed deadlines will help you close off the process? Because I mean, I could see this going … well, I should ask another question. Are there other students at Avenues, or any place else, that are involved in this that haven’t graduated yet?

GREENE: Yes, there are a few of us. Last year when we were working on the project, there were only a handful of us. So it was me and my best friend, Arthur Wilson, who also graduated. There were three other students. One was a ninth grader, and two were 10th graders. So they’re all still working on the project. And there’s one student from another campus that’s still working very closely on the project. And we’ve actually been working on expanding our team within our community. So at the end of last year, we were working on finding other students that we thought would be a great fit for the project and trying to rope them into it! [LAUGHTER] So we definitely want to continue to work on the project. And to answer your question from before about the deadlines, we like to set, sort of, smaller internal deadlines. That’s something that we’ve gotten very used to. As for a long-term deadline, we haven’t set one yet. It could be helpful to set a long-term deadline because if we don’t, we could just do the project forever.

HUIZINGA: [LAUGHS] Right …

GREENE: We might never end because there’s always more to add. But yeah, we do set smaller internal deadlines, so like get x amount of content done by this time, reach out to x number of space agencies, reach out to x number of whatever.

HUIZINGA: Mm-hmm. Yeah, it feels like there should be some kind of, you know, “enough is enough” for this round.

GREENE: Yeah.

HUIZINGA: Otherwise, you’re the artist who never puts enough paint on the canvas and …

GREENE: I also really like what you said just now with, like, “this round” and “next round.” That’s a very good way to look at it. Like Dr. Black said, he produced two platters for us already towards the end of my last school year. And I think that was a very good, like, first round and a good way to continue doing the project where we work on the project and we get a lot of content done and then we can say, let’s let this be a great first draft or a great second draft for now, and we have that draft ready to go, but we can continue to work on it if we want to.

HUIZINGA: Well, you know the famous computer science tagline “Shipping is a feature.” [LAUGHS] So there’s some element of “let’s get it out there” and then we can do the next iteration of upgrades and launch then.

GREENE: Exactly.

HUIZINGA: Well, Richard, while most people don’t put scientists and rock stars in the same bucket, Dexter isn’t the first young person to admit being a little intimidated—and even starstruck—by an accomplished and well-known researcher, but some students aren’t bold enough to cold email someone like you and ask for words of wisdom. So now that we’ve got you on the show, as we close, perhaps you could voluntarily share some encouraging words or direction to the next generation of students who are interested in making the next generation of technologies. So I’ll let you have the last word.

BLACK: Oh, I have a couple of small things to say. First of all, researchers are just people, too. [LAUGHTER] And, you know, they like others to talk to them occasionally. And usually, they like opportunities to be passionate about their research and to communicate the exciting things that they’re doing. So don’t be put off; it’s quite reasonable to talk. You know, I’m really excited by, you know, the, kind of, the passion and imagination that I see in some of the young people around today, and Dexter and his colleagues are an example of that. You know, advice to them would be, you know, work on a technology that excites you and in particular something that, if you were successful, it would have a big impact on our world and, you know, that should give you a kind of motivation and a path to having impact.

HUIZINGA: Hmm. What you just said reminded me of a Saturday Night Live skit with Christopher Walken—it’s the “More Cowbell” skit—but he says, we’re just like other people; we put our pants on one leg at a time, but once our pants are on, we make gold records! I think that’s funny right there!

[MUSIC]

Richard and Dexter, thank you so much for coming on and sharing this project with us today on Collaborators. Really had fun!

GREENE: Yeah, thank you so much for having us.

BLACK: Thank you.

[MUSIC FADES]


[1] (opens in new tab) It was later noted that the original Golden Record team was also led by astrophysicist Frank Drake (opens in new tab), whose efforts to search for extraterrestrial intelligence (SETI) inspired continued work in the area.

[2] (opens in new tab) While Dr. Jiang leads the Humanity’s Message to the Stars (opens in new tab) project, it is independent of NASA at this stage. 

[3] (opens in new tab) In his capacity as Design Director for the original Golden Record, Lomberg (opens in new tab) chose and arranged the images included.

The post Collaborators: Silica in space with Richard Black and Dexter Greene appeared first on Microsoft Research.

Read More

Three Ways to Ride the Flywheel of Cybersecurity AI

Three Ways to Ride the Flywheel of Cybersecurity AI

The business transformations that generative AI brings come with risks that AI itself can help secure in a kind of flywheel of progress.

Companies who were quick to embrace the open internet more than 20 years ago were among the first to reap its benefits and become proficient in modern network security.

Enterprise AI is following a similar pattern today. Organizations pursuing its advances — especially with powerful generative AI capabilities — are applying those learnings to enhance their security.

For those just getting started on this journey, here are ways to address with AI three of the top security threats industry experts have identified for large language models (LLMs).

AI Guardrails Prevent Prompt Injections

Generative AI services are subject to attacks from malicious prompts designed to disrupt the LLM behind it or gain access to its data. As the report cited above notes, “Direct injections overwrite system prompts, while indirect ones manipulate inputs from external sources.”

The best antidote for prompt injections are AI guardrails, built into or placed around LLMs. Like the metal safety barriers and concrete curbs on the road, AI guardrails keep LLM applications on track and on topic.

The industry has delivered and continues to work on solutions in this area. For example, NVIDIA NeMo Guardrails software lets developers protect the trustworthiness, safety and security of generative AI services.

AI Detects and Protects Sensitive Data

The responses LLMs give to prompts can on occasion reveal sensitive information. With multifactor authentication and other best practices, credentials are becoming increasingly complex, widening the scope of what’s considered sensitive data.

To guard against disclosures, all sensitive information should be carefully removed or obscured from AI training data. Given the size of datasets used in training, it’s hard for humans — but easy for AI models — to ensure a data sanitation process is effective.

An AI model trained to detect and obfuscate sensitive information can help safeguard against revealing anything confidential that was inadvertently left in an LLM’s training data.

Using NVIDIA Morpheus, an AI framework for building cybersecurity applications, enterprises can create AI models and accelerated pipelines that find and protect sensitive information on their networks. Morpheus lets AI do what no human using traditional rule-based analytics can: track and analyze the massive data flows on an entire corporate network.

AI Can Help Reinforce Access Control

Finally, hackers may try to use LLMs to get access control over an organization’s assets. So, businesses need to prevent their generative AI services from exceeding their level of authority.

The best defense against this risk is using the best practices of security-by-design. Specifically, grant an LLM the least privileges and continuously evaluate those permissions, so it can only access the tools and data it needs to perform its intended functions. This simple, standard approach is probably all most users need in this case.

However, AI can also assist in providing access controls for LLMs. A separate inline model can be trained to detect privilege escalation by evaluating an LLM’s outputs.

Start the Journey to Cybersecurity AI

No one technique is a silver bullet; security continues to be about evolving measures and countermeasures. Those who do best on that journey make use of the latest tools and technologies.

To secure AI, organizations need to be familiar with it, and the best way to do that is by deploying it in meaningful use cases. NVIDIA and its partners can help with full-stack solutions in AI, cybersecurity and cybersecurity AI.

Looking ahead, AI and cybersecurity will be tightly linked in a kind of virtuous cycle, a flywheel of progress where each makes the other better. Ultimately, users will come to trust it as just another form of automation.

Learn more about NVIDIA’s cybersecurity AI platform and how it’s being put to use. And listen to cybersecurity talks from experts at the NVIDIA AI Summit in October.

Read More

19 New Games to Drop for GeForce NOW in September

19 New Games to Drop for GeForce NOW in September

Fall will be here soon, so leaf it to GeForce NOW to bring the games, with 19 joining the cloud in September.

Get started with the seven games available to stream this week, and a day one PC Game Pass title, Age of Mythology: Retold, from the creators of the award-winning Age of Empires franchise World’s Edge, Forgotten Empires and Xbox Game Studios.

The Open Beta for Call of Duty: Black Ops 6 runs Sept. 6-9, offering everyone a chance to experience game-changing innovations before the title officially launches on Oct. 25. Members can stream the Battle.net and Steam versions of the Open Beta instantly this week on GeForce NOW to jump right into the action.

Where Myths and Heroes Collide

Age of Mythology on GeForce NOW
A vast, mythical world to explore with friends? Say no more…

Age of Mythology: Retold revitalizes the classic real-time strategy game by merging its beloved elements with modern visuals.

Get immersed in a mythical universe, command legendary units and call upon the powers of various gods from the Atlantean, Greek, Egyptian and Norse pantheons. The single-player experience features a 50-mission campaign, including engaging battles and myth exploration in iconic locations like Troy and Midgard. Challenge friends in head-to-head matches or cooperate to take on advanced, AI-powered opponents.

Call upon the gods from the cloud with an Ultimate and Priority membership and stream the game across devices. Games update automatically in the cloud, so members can dive into the action without having to wait.

September Gets Better With New Games

The Casting of Frank Stone on GeForce NOW
Choose your fate.

Catch the storytelling prowess of Supermassive Games in The Casting of Frank Stone, available to stream this week for members. The shadow of Frank Stone looms over Cedar Hills, a town forever altered by his violent past. Delve into the mystery of Cedar Hills alongside an original cast of characters bound together on a twisted journey where nothing is quite as it seems. Every decision shapes the story and impacts the fate of the characters.

In addition, members can look for the following games this week:

  • The Casting of Frank Stone (New release on Steam, Sept. 3)
  • Age of Mythology (New release on Steam and Xbox, available on PC Game Pass, Sept.4 )
  • Sniper Ghost Warrior Contracts  (New release on Epic Games Store, early access Sept. 5)
  • Warhammer 40,000: Space Marine 2 (New release on Steam, early access Sept. 5)
  • Crime Scene Cleaner (Steam)
  • FINAL FANTASY XVI Demo (Epic Games Store)
  • Sins of a Solar Empire II (Steam)

Here’s what members can expect for the rest of September:

  • Frostpunk 2 (New release on Steam and Xbox available  on PC Game Pass, Sept. 17)
  • FINAL FANTASY XVI (New release on Steam and Epic Games Store, Sept. 17)
  • The Plucky Squire (New release on Steam, Sept. 17)
  • Tiny Glade (New release on Steam, Sept. 23)
  • Disney Epic Mickey: Rebrushed (New release on Steam, Sept. 24)
  • Greedfall II: The Dying World (New release on Steam, Sept. 24)
  • Mechabellum ( Steam)
  • Blacksmith Master (New release on Steam, Sept. 26)
  • Breachway (New release on Steam, Sept. 26)
  • REKA (New release on Steam)
  • Test Drive Unlimited Solar Crown (New release on Steam)
  • Rider’s Republic (New release on PC Game Pass, Sept. 11). To begin playing, members need to activate access, and can refer to the help article for instructions.

Additions to August

In addition to the 18 games announced last month, 48 more joined the GeForce NOW library:

  • Prince of Persia: The Lost Crown (Day zero release on Steam, Aug. 8)
  • FINAL FANTASY XVI Demo (New release on Steam, Aug. 19)
  • Black Myth: Wukong (New release on Steam and Epic Games Store, Aug. 20)
  • GIGANTIC: RAMPAGE EDITION (Available on Epic Games Store, free Aug. 22)
  • Skull and Bones (New release on Steam, Aug. 22)
  • Endzone 2 (New release on Steam, Aug. 26)
  • Age of Mythology: Retold (Advanced access on Steam, Xbox, available on PC Game Pass, Aug. 27)
  • Core Keeper (New release on Xbox, available on PC Game Pass, Aug. 27)
  • Alan Wake’s American Nightmare (Xbox, available on Microsoft Store)
  • Car Manufacture (Steam)
  • Cat Quest III (Steam)
  • Commandos 3 – HD Remaster (Xbox, available on Microsoft Store)
  • Cooking Simulator (Xbox, available on PC Game Pass)
  • Crown Trick (Xbox, available on Microsoft Store)
  • Darksiders Genesis (Xbox, available on Microsoft Store)
  • Desperados III (Xbox, available on Microsoft Store)
  • The Dungeon of Naheulbeuk: The Amulet of Chaos (Xbox, available on Microsoft Store)
  • Expeditions: Rome (Xbox, available on Microsoft Store)
  • The Flame in the Flood (Xbox, available on Microsoft Store)
  • FTL: Faster Than Light (Xbox, available on Microsoft Store)
  • Genesis Noir (Xbox, available on PC Game Pass)
  • House Flipper (Xbox, available on PC Game Pass)
  • Into the Breach (Xbox, available on Microsoft Store)
  • Iron Harvest (Xbox, available on Microsoft Store)
  • The Knight Witch (Xbox, available on Microsoft Store)
  • Lightyear Frontier (Xbox, available on PC Game Pass)
  • Medieval Dynasty (Xbox, available on PC Game Pass)
  • Metro Exodus Enhanced Edition (Xbox, available on Microsoft Store)
  • My Time at Portia (Xbox, available on PC Game Pass)
  • Night in the Woods (Xbox, available on Microsoft Store )
  • Offworld Trading Company (Xbox, available on PC Game Pass)
  • Orwell: Keeping an Eye on You (Xbox, available on Microsoft Store)
  • Outlast 2 (Xbox, available on Microsoft Store)
  • Project Winter (Xbox, available on Microsoft Store)
  • Psychonauts (Steam)
  • Psychonauts 2 (Steam and Xbox, available on PC Game Pass)
  • Shadow Tactics: Blades of the Shogun (Xbox, available on Microsoft Store)
  • Sid Meier’s Civilization VI (Steam, Epic Games Store and Xbox, available on the Microsoft store)
  • Sid Meier’s Civilization V (Steam)
  • Sid Meier’s Civilization IV (Steam)
  • Sid Meier’s Civilization: Beyond Earth (Steam)
  • Spirit of the North (Xbox, available on PC Game Pass)
  • SteamWorld Heist II (Steam, Xbox, available on Microsoft Store)
  • Visions of Mana Demo (Steam)
  • This War of Mine (Xbox, available on PC Game Pass)
  • We Were Here Too (Steam)
  • Wreckfest (Xbox, available on PC Game Pass)
  • Yoku’s Island Express (Xbox, available on Microsoft Store)

Breachway was originally included in the August games list, but the launch date was moved to September by the developer. Stay tuned to GFN Thursday for updates.

Starting in October, members will no longer see the option of launching “Epic Games Store” versions of games published by Ubisoft on GeForce NOW.  To play these supported games, members can select the “Ubisoft Connect” option on GeForce NOW and will need to connect their Ubisoft Connect and Epic game store accounts the first time they play the game. Check out more details.

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Volvo Cars EX90 SUV Rolls Out, Built on NVIDIA Accelerated Computing and AI

Volvo Cars EX90 SUV Rolls Out, Built on NVIDIA Accelerated Computing and AI

Volvo Cars’ new, fully electric EX90 is making its way from the automaker’s assembly line in Charleston, South Carolina, to dealerships around the U.S.

To ensure its customers benefit from future improvements and advanced safety features and capabilities, the Volvo EX90 is built on the NVIDIA DRIVE Orin system-on-a-chip (SoC), capable of more than 250 trillion operations per second (TOPS).

Running NVIDIA DriveOS, the system delivers high-performance processing in a package that’s literally the size of a postage stamp. This core compute architecture handles all vehicle functions, ranging from enabling safety and driving assistance features to supporting the development of autonomous driving capabilities — all while delivering an excellent user experience.

The state-of-the-art SUV is an intelligent mobile device on wheels, equipped with the automaker’s most advanced sensor suite to date, including radar, lidar, cameras, ultrasonic sensors and more. NVIDIA DRIVE Orin enables real-time, redundant and advanced 360-degree surround-sensor data processing, supporting Volvo Cars’ unwavering commitment to safety.

DRIVE Thor Powering the Next Generation of Volvo Cars

Setting its sights on the future, Volvo Cars also announced plans to migrate to the next-generation NVIDIA DRIVE Thor SoC for its upcoming fleets.

Before the end of the decade, Volvo Cars will move to NVIDIA DRIVE Thor, which boasts 1,000 TOPS —  quadrupling the processing power of a single DRIVE Orin SoC, while improving energy efficiency sevenfold.

The next-generation DRIVE Thor autonomous vehicle processor incorporates the latest NVIDIA Blackwell GPU architecture, helping unlock a new realm of possibilities and capabilities both in and around the car. This advanced platform will facilitate the deployment of safe advanced driver-assistance system (ADAS) and self-driving features — and pave the way for a new era of in-vehicle experiences powered by generative AI.

Highlighting Volvo Cars’ leap to NVIDIA’s next-generation processor, Volvo Cars CEO Jim Rowan noted, “With NVIDIA DRIVE Thor in our future cars, our in-house developed software becomes more scalable across our product lineup, and it helps us to continue to improve the safety in our cars, deliver best-in-class customer experiences — and increase our margins.”

Zenseact Strategic Investment in NVIDIA Technology

Volvo Cars and its software subsidiary, Zenseact, are also investing in NVIDIA DGX systems for AI model training in the cloud, helping ensure that future fleets are equipped with the most advanced and well-tested AI-powered safety features.

Managing the massive amount of data needed to safely train the next generation of AI-enabled vehicles demands data-center-level compute and infrastructure.

NVIDIA DGX systems provide the computational performance essential for training AI models with unprecedented efficiency. Transportation companies use them to speed autonomous technology development in a cost-effective, enterprise-ready and easy-to-deploy way.

Volvo Cars and Zenseact’s AI training hub, based in the Nordics, will use the systems to help catalyze multiple facets of ADAS and autonomous driving software development. A key benefit is the optimization of the data annotation process — a traditionally time-consuming task involving the identification and labeling of objects for classification and recognition.

The cluster of DGX systems will also enable processing of the required data for safety assurance, delivering twice the performance and potentially halving time to market.

“The NVIDIA DGX AI supercomputer will supercharge our AI training capabilities, making this in-house AI training center one of the largest in the Nordics,” said Anders Bell, chief engineering and technology officer at Volvo Cars. “By leveraging NVIDIA technology and setting up the data center, we pave a quick path to high-performing AI, ultimately helping make our products safer and better.”

With NVIDIA technology as the AI brain inside the car and in the cloud, Volvo Cars and Zenseact can deliver safe vehicles that allow customers to drive with peace of mind, wherever the road may lead.

Read More

Deploy Amazon SageMaker pipelines using AWS Controllers for Kubernetes

Deploy Amazon SageMaker pipelines using AWS Controllers for Kubernetes

Kubernetes is a popular orchestration platform for managing containers. Its scalability and load-balancing capabilities make it ideal for handling the variable workloads typical of machine learning (ML) applications. DevOps engineers often use Kubernetes to manage and scale ML applications, but before an ML model is available, it must be trained and evaluated and, if the quality of the obtained model is satisfactory, uploaded to a model registry.

Amazon SageMaker provides capabilities to remove the undifferentiated heavy lifting of building and deploying ML models. SageMaker simplifies the process of managing dependencies, container images, auto scaling, and monitoring. Specifically for the model building stage, Amazon SageMaker Pipelines automates the process by managing the infrastructure and resources needed to process data, train models, and run evaluation tests.

A challenge for DevOps engineers is the additional complexity that comes from using Kubernetes to manage the deployment stage while resorting to other tools (such as the AWS SDK or AWS CloudFormation) to manage the model building pipeline. One alternative to simplify this process is to use AWS Controllers for Kubernetes (ACK) to manage and deploy a SageMaker training pipeline. ACK allows you to take advantage of managed model building pipelines without needing to define resources outside of the Kubernetes cluster.

In this post, we introduce an example to help DevOps engineers manage the entire ML lifecycle—including training and inference—using the same toolkit.

Solution overview

We consider a use case in which an ML engineer configures a SageMaker model building pipeline using a Jupyter notebook. This configuration takes the form of a Directed Acyclic Graph (DAG) represented as a JSON pipeline definition. The JSON document can be stored and versioned in an Amazon Simple Storage Service (Amazon S3) bucket. If encryption is required, it can be implemented using an AWS Key Management Service (AWS KMS) managed key for Amazon S3. A DevOps engineer with access to fetch this definition file from Amazon S3 can load the pipeline definition into an ACK service controller for SageMaker, which is running as part of an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The DevOps engineer can then use the Kubernetes APIs provided by ACK to submit the pipeline definition and initiate one or more pipeline runs in SageMaker. This entire workflow is shown in the following solution diagram.

architecture

Prerequisites

To follow along, you should have the following prerequisites:

  • An EKS cluster where the ML pipeline will be created.
  • A user with access to an AWS Identity and Access Management (IAM) role that has IAM permissions (iam:CreateRole, iam:AttachRolePolicy, and iam:PutRolePolicy) to allow creating roles and attaching policies to roles.
  • The following command line tools on the local machine or cloud-based development environment used to access the Kubernetes cluster:

Install the SageMaker ACK service controller

The SageMaker ACK service controller makes it straightforward for DevOps engineers to use Kubernetes as their control plane to create and manage ML pipelines. To install the controller in your EKS cluster, complete the following steps:

  1. Configure IAM permissions to make sure the controller has access to the appropriate AWS resources.
  2. Install the controller using a SageMaker Helm Chart to make it available on the client machine.

The following tutorial provides step-by-step instructions with the required commands to install the ACK service controller for SageMaker.

Generate a pipeline JSON definition

In most companies, ML engineers are responsible for creating the ML pipeline in their organization. They often work with DevOps engineers to operate those pipelines. In SageMaker, ML engineers can use the SageMaker Python SDK to generate a pipeline definition in JSON format. A SageMaker pipeline definition must follow the provided schema, which includes base images, dependencies, steps, and instance types and sizes that are needed to fully define the pipeline. This definition then gets retrieved by the DevOps engineer for deploying and maintaining the infrastructure needed for the pipeline.

The following is a sample pipeline definition with one training step:

{
  "Version": "2020-12-01",
  "Steps": [
  {
    "Name": "AbaloneTrain",
    "Type": "Training",
    "Arguments": {
      "RoleArn": "<<YOUR_SAGEMAKER_ROLE_ARN>>",
      "HyperParameters": {
        "max_depth": "5",
        "gamma": "4",
        "eta": "0.2",
        "min_child_weight": "6",
        "objective": "multi:softmax",
        "num_class": "10",
        "num_round": "10"
     },
     "AlgorithmSpecification": {
     "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
     "TrainingInputMode": "File"
   },
   "OutputDataConfig": {
     "S3OutputPath": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/"
   },
   "ResourceConfig": {
     "InstanceCount": 1,
     "InstanceType": "ml.m4.xlarge",
     "VolumeSizeInGB": 5
   },
   "StoppingCondition": {
     "MaxRuntimeInSeconds": 86400
   },
   "InputDataConfig": [
   {
     "ChannelName": "train",
     "DataSource": {
       "S3DataSource": {
         "S3DataType": "S3Prefix",
         "S3Uri": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/train/",
         "S3DataDistributionType": "
       }
     },
     "ContentType": "text/libsvm"
   },
   {
     "ChannelName": "validation",
     "DataSource": {
       "S3DataSource": {
         "S3DataType": "S3Prefix",
         "S3Uri": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/validation/",
         "S3DataDistributionType": "FullyReplicated"
       }
     },
     "ContentType": "text/libsvm"
   }]
  }
 }]
}

With SageMaker, ML model artifacts and other system artifacts are encrypted in transit and at rest. SageMaker encrypts these by default using the AWS managed key for Amazon S3. You can optionally specify a custom key using the KmsKeyId property of the OutputDataConfig argument. For more information on how SageMaker protects data, see Data Protection in Amazon SageMaker.

Furthermore, we recommend securing access to the pipeline artifacts, such as model outputs and training data, to a specific set of IAM roles created for data scientists and ML engineers. This can be achieved by attaching an appropriate bucket policy. For more information on best practices for securing data in Amazon S3, see Top 10 security best practices for securing data in Amazon S3.

Create and submit a pipeline YAML specification

In the Kubernetes world, objects are the persistent entities in the Kubernetes cluster used to represent the state of your cluster. When you create an object in Kubernetes, you must provide the object specification that describes its desired state, as well as some basic information about the object (such as a name). Then, using tools such as kubectl, you provide the information in a manifest file in YAML (or JSON) format to communicate with the Kubernetes API.

Refer to the following Kubernetes YAML specification for a SageMaker pipeline. DevOps engineers need to modify the .spec.pipelineDefinition key in the file and add the ML engineer-provided pipeline JSON definition. They then prepare and submit a separate pipeline execution YAML specification to run the pipeline in SageMaker. There are two ways to submit a pipeline YAML specification:

  • Pass the pipeline definition inline as a JSON object to the pipeline YAML specification.
  • Convert the JSON pipeline definition into String format using the command line utility jq. For example, you can use the following command to convert the pipeline definition to a JSON-encoded string:
jq -r tojson <pipeline-definition.json>

In this post, we use the first option and prepare the YAML specification (my-pipeline.yaml) as follows:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Pipeline
metadata:
  name: my-kubernetes-pipeline
spec:
  parallelismConfiguration:
  	maxParallelExecutionSteps: 2
  pipelineName: my-kubernetes-pipeline
  pipelineDefinition: |
  {
    "Version": "2020-12-01",
    "Steps": [
    {
      "Name": "AbaloneTrain",
      "Type": "Training",
      "Arguments": {
        "RoleArn": "<<YOUR_SAGEMAKER_ROLE_ARN>>",
        "HyperParameters": {
          "max_depth": "5",
          "gamma": "4",
          "eta": "0.2",
          "min_child_weight": "6",
          "objective": "multi:softmax",
          "num_class": "10",
          "num_round": "30"
        },
        "AlgorithmSpecification": {
          "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
          "TrainingInputMode": "File"
        },
        "OutputDataConfig": {
          "S3OutputPath": "s3://<<YOUR_S3_BUCKET>>/sagemaker/"
        },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m4.xlarge",
          "VolumeSizeInGB": 5
        },
        "StoppingCondition": {
          "MaxRuntimeInSeconds": 86400
        },
        "InputDataConfig": [
        {
          "ChannelName": "train",
          "DataSource": {
            "S3DataSource": {
              "S3DataType": "S3Prefix",
              "S3Uri": "s3://<<YOUR_S3_BUCKET>>/sagemaker/xgboost/train/",
              "S3DataDistributionType": "FullyReplicated"
            }
          },
          "ContentType": "text/libsvm"
        },
        {
          "ChannelName": "validation",
          "DataSource": {
            "S3DataSource": {
              "S3DataType": "S3Prefix",
              "S3Uri": "s3://<<YOUR_S3_BUCKET>>/sagemaker/xgboost/validation/",
              "S3DataDistributionType": "FullyReplicated"
            }
          },
          "ContentType": "text/libsvm"
        }
      ]
    }
  }
]}
pipelineDisplayName: my-kubernetes-pipeline
roleARN: <<YOUR_SAGEMAKER_ROLE_ARN>>

Submit the pipeline to SageMaker

To submit your prepared pipeline specification, apply the specification to your Kubernetes cluster as follows:

kubectl apply -f my-pipeline.yaml

Create and submit a pipeline execution YAML specification

Refer to the following Kubernetes YAML specification for a SageMaker pipeline. Prepare the pipeline execution YAML specification (pipeline-execution.yaml) as follows:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: PipelineExecution
metadata:
  name: my-kubernetes-pipeline-execution
spec:
  parallelismConfiguration:
  	maxParallelExecutionSteps: 2
  pipelineExecutionDescription: "My first pipeline execution via Amazon EKS cluster."
  pipelineName: my-kubernetes-pipeline

To start a run of the pipeline, use the following code:

kubectl apply -f pipeline-execution.yaml

Review and troubleshoot the pipeline run

To list all pipelines created using the ACK controller, use the following command:

kubectl get pipeline

To list all pipeline runs, use the following command:

kubectl get pipelineexecution

To get more details about the pipeline after it’s submitted, like checking the status, errors, or parameters of the pipeline, use the following command:

kubectl describe pipeline my-kubernetes-pipeline

To troubleshoot a pipeline run by reviewing more details about the run, use the following command:

kubectl describe pipelineexecution my-kubernetes-pipeline-execution

Clean up

Use the following command to delete any pipelines you created:

kubectl delete pipeline

Use the following command to cancel any pipeline runs you started:

kubectl delete pipelineexecution

Conclusion

In this post, we presented an example of how ML engineers familiar with Jupyter notebooks and SageMaker environments can efficiently work with DevOps engineers familiar with Kubernetes and related tools to design and maintain an ML pipeline with the right infrastructure for their organization. This enables DevOps engineers to manage all the steps of the ML lifecycle with the same set of tools and environment they are used to, which enables organizations to innovate faster and more efficiently.

Explore the GitHub repository for ACK and the SageMaker controller to start managing your ML operations with Kubernetes.


About the Authors

Pratik Yeole is a Senior Solutions Architect working with global customers, helping customers build value-driven solutions on AWS. He has expertise in MLOps and containers domains. Outside of work, he enjoys time with friends, family, music, and cricket.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Read More

Effectively manage foundation models for generative AI applications with Amazon SageMaker Model Registry

Effectively manage foundation models for generative AI applications with Amazon SageMaker Model Registry

Generative artificial intelligence (AI) foundation models (FMs) are gaining popularity with businesses due to their versatility and potential to address a variety of use cases. The true value of FMs is realized when they are adapted for domain specific data. Managing these models across the business and model lifecycle can introduce complexity. As FMs are adapted to different domains and data, operationalizing these pipelines becomes critical.

Amazon SageMaker, a fully managed service to build, train, and deploy machine learning (ML) models, has seen increased adoption to customize and deploy FMs that power generative AI applications. SageMaker provides rich features to build automated workflows for deploying models at scale. One of the key features that enables operational excellence around model management is the Model Registry. Model Registry helps catalog and manage model versions and facilitates collaboration and governance. When a model is trained and evaluated for performance, it can be stored in the Model Registry for model management.

Amazon SageMaker has released new features in Model Registry that make it easy to version and catalog FMs. Customers can use SageMaker to train or tune FMs, including Amazon SageMaker JumpStart and Amazon Bedrock models, and also manage these models within Model Registry. As customers begin to scale generative AI applications across various use cases such as fine-tuning for domain-specific tasks, the number of models can quickly grow. To keep track of models, versions, and associated metadata, SageMaker Model Registry can be used as an inventory of models.

In this post, we explore the new features of Model Registry that streamline FM management: you can now register unzipped model artifacts and pass an End User License Agreement (EULA) acceptance flag without needing users to intervene.

Overview

Model Registry has worked well for traditional models, which are smaller in size. For FMs, there were challenges because of their size and requirements for user intervention for EULA acceptance. With the new features in Model Registry, it’s become easier to register a fine-tuned FM within Model Registry, which then can be deployed for actual use.

A typical model development lifecycle is an iterative process. We conduct many experimentation cycles to achieve expected performance from the model. Once trained, these models can be registered in the Model Registry where they are cataloged as versions. The models can be organized in groups, the versions can be compared for their quality metrics, and models can have an associated approval status indicating if its deployable.

Once the model is manually approved, a continuous integration and continuous deployment (CI/CD) pipeline can be triggered to deploy these models to production. Optionally, Model Registry can be used as a repository of models that are approved for use by an enterprise. Various teams can then deploy these approved models from Model Registry and build applications around it.

An example workflow could follow these steps and is shown in the following diagram:

  1. Select a SageMaker JumpStart model and register it in Model Registry
  2. Alternatively, you can fine-tune a SageMaker JumpStart model
  3. Evaluate the model with SageMaker model evaluation. SageMaker allows for human evaluation if desired.
  4. Create a model group in the Model Registry. For each run, create a model version. Add your model group into one or more Model Registry Collections, which can be used to group registered models that are related to each other. For example, you could have a collection of large language models (LLMs) and another collection of diffusion models.
  5. Deploy the models as SageMaker Inference endpoints that can be consumed by generative AI applications.

Model Registry workflow for foundation modelsFigure 1: Model Registry workflow for foundation models

To better support generative AI applications, Model Registry released two new features: ModelDataSource, and source model URI. The following sections will explore these features and how to use them.

ModelDataSource speeds up deployment and provides access to EULA dependent models

Until now, model artifacts had to be stored along with the inference code when a model gets registered in Model Registry in a compressed format. This posed challenges for generative AI applications where FMs are of very large size with billions of parameters. The large size of FMs when stored as zipped models was causing increased latency with SageMaker endpoint startup time because decompressing these models at run time took very long. The model_data_source parameter can now accept the location of the unzipped model artifacts in Amazon Simple Storage Service (Amazon S3) making the registration process simple. This also eliminates the need for endpoints to unzip the model weights, leading to reduced latency during endpoint startup times.

Additionally, public JumpStart models and certain FMs from independent service providers, such as LLAMA2, require that their EULA must be accepted prior to using the models. Thus, when public models from SageMaker JumpStart were tuned, they could not be stored in the Model Registry because a user needed to accept the license agreement. Model Registry added a new feature: EULA acceptance flag support within the model_data_source parameter, allowing the registration of such models. Now customers can catalog, version, associate metadata such as training metrics, and more in Model Registry for a wider variety of FMs.

Register unzipped models stored in Amazon S3 using the AWS SDK.

model_data_source = {
               "S3DataSource": {
                      "S3Uri": "s3://bucket/model/prefix/", 
                      "S3DataType": "S3Prefix",          
                      "CompressionType": "None",            
                      "ModelAccessConfig": {                 
                           "AcceptEula": true
                       },
                 }
}
model = Model(       
               sagemaker_session=sagemaker_session,        
               image_uri=IMAGE_URI,      
               model_data=model_data_source
)
model.register()

Register models requiring a EULA.

from sagemaker.jumpstart.model importJumpStartModel
model_id = "meta-textgeneration-llama-2-7b"
my_model = JumpStartModel(model_id=model_id)
registered_model =my_model.register(accept_eula=True)
predictor = registered_model.deploy()

Source model URI provides simplified registration and proprietary model support

Model Registry now supports automatic population of inference specification files for some recognized model IDs, including select AWS Marketplace models, hosted models, or versioned model packages in Model Registry. Because of SourceModelURI’s support for automatic population, you can register proprietary JumpStart models from providers such as AI21 labs, Cohere, and LightOn without needing the inference specification file, allowing your organization to use a broader set of FMs in Model Registry.

Previously, to register a trained model in the SageMaker Model Registry, you had to provide the complete inference specification required for deployment, including an Amazon Elastic Container Registry (Amazon ECR) image and the trained model file. With the launch of source_uri support, SageMaker has made it easy for users to register any model by providing a source model URI, which is a free form field that stores model ID or location to a proprietary JumpStart and Bedrock model ID, S3 location, and MLflow model ID. Rather than having to supply the details required for deploying to SageMaker hosting at the time of registrations, you can add the artifacts later on. After registration, to deploy a model, you can package the model an inference specification and update Model Registry accordingly.

For example, you can register a model in Model Registry with a model Amazon Resource Name (ARN) SourceURI.

model_arn = "<arn of the model to be registered>"
registered_model_package = model.register(        
        model_package_group_name="model_group_name",
        source_uri=model_arn
)

Later, you can update the registered model with the inference specification, making it deployable on SageMaker.

model_package = sagemaker_session.sagemaker_client.create_model_package( 
        ModelPackageGroupName="model_group_name", 
        SourceUri="source_uri"
)
mp = ModelPackage(        
       role=get_execution_role(sagemaker_session),
       model_package_arn=model_package["ModelPackageArn"],
       sagemaker_session=sagemaker_session
)
mp.update_inference_specification(image_uris=["ecr_image_uri"])

Register an Amazon JumpStart proprietary FM.

from sagemaker.jumpstart.model import JumpStartModel
model_id = "ai21-contextual-answers"
my_model = JumpStartModel(
           model_id=model_id
)
model_package = my_model.register()

Conclusion

As organizations continue to adopt generative AI in different parts of their business, having robust model management and versioning becomes paramount. With Model Registry, you can achieve version control, tracking, collaboration, lifecycle management, and governance of FMs.

In this post, we explored how Model Registry can now more effectively support managing generative AI models across the model lifecycle, empowering you to better govern and adopt generative AI to achieve transformational outcomes.

To learn more about Model Registry, see Register and Deploy Models with Model Registry. To get started, visit the SageMaker console.


About the Authors

Chaitra Mathur serves as a Principal Solutions Architect at AWS, where her role involves advising clients on building robust, scalable, and secure solutions on AWS. With a keen interest in data and ML, she assists clients in leveraging AWS AI/ML and generative AI services to address their ML requirements effectively. Throughout her career, she has shared her expertise at numerous conferences and has authored several blog posts in the ML area.

Kait Healy is a Solutions Architect II at AWS. She specializes in working with startups and enterprise automotive customers, where she has experience building AI/ML solutions at scale to drive key business outcomes.

Saumitra Vikaram is a Senior Software Engineer at AWS. He is focused on AI/ML technology, ML model management, ML governance, and MLOps to improve overall organizational efficiency and productivity.

Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies

Read More