Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models

Generative AI (GenAI) and large language models (LLMs), such as those available soon via Amazon Bedrock and Amazon Titan, are transforming the way developers and enterprises solve traditionally complex challenges related to natural language processing and understanding. Some of the benefits offered by LLMs include the ability to create more capable and compelling conversational AI experiences for customer service applications and to improve employee productivity through more intuitive and accurate responses.

For these use cases, however, it’s critical for the GenAI applications implementing the conversational experiences to meet two key criteria: limit the responses to company data, thereby mitigating model hallucinations (incorrect statements), and filter responses according to the end-user content access permissions.

To restrict the GenAI application responses to company data only, we need to use a technique called Retrieval Augmented Generation (RAG). An application using the RAG approach retrieves the information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request in a prompt, and then sends it to the LLM to get a GenAI response. LLMs have limits on the maximum word count of the input prompt; therefore, choosing the right passages from among thousands or millions of documents in the enterprise has a direct impact on the LLM’s accuracy.

In designing effective RAG, content retrieval is a critical step to ensure the LLM receives the most relevant and concise context from enterprise content to generate accurate responses. This is where the highly accurate, machine learning (ML)-powered intelligent search in Amazon Kendra plays an important role. Amazon Kendra is a fully managed service that provides out-of-the-box semantic search capabilities for state-of-the-art ranking of documents and passages. You can use the high-accuracy search in Amazon Kendra to source the most relevant content and documents to maximize the quality of your RAG payload, yielding better LLM responses than using conventional or keyword-based search solutions. Amazon Kendra offers easy-to-use deep learning search models that are pre-trained on 14 domains and don’t require any ML expertise, so there’s no need to deal with word embeddings, document chunking, and other lower-level complexities typically required for RAG implementations.

Amazon Kendra also comes with pre-built connectors to popular data sources such as Amazon Simple Storage Service (Amazon S3), SharePoint, Confluence, and websites, and supports common document formats such as HTML, Word, PowerPoint, PDF, Excel, and pure text files. To filter responses based on only those documents that the end-user permissions allow, Amazon Kendra offers connectors with access control list (ACL) support. Amazon Kendra also offers AWS Identity and Access Management (IAM) and AWS IAM Identity Center (successor to AWS Single Sign-On) integration for user-group information syncing with customer identity providers such as Okta and Azure AD.

In this post, we demonstrate how to implement a RAG workflow by combining the capabilities of Amazon Kendra with LLMs to create state-of-the-art GenAI applications providing conversational experiences over your enterprise content. After Amazon Bedrock launches, we will publish a follow-up post showing how to implement similar GenAI applications using Amazon Bedrock, so stay tuned.

Solution overview

The following diagram shows the architecture of a GenAI application with a RAG approach.

We use an Amazon Kendra index to ingest enterprise unstructured data from data sources such as wiki pages, MS SharePoint sites, Atlassian Confluence, and document repositories such as Amazon S3. When a user interacts with the GenAI app, the flow is as follows:

  1. The user makes a request to the GenAI app.
  2. The app issues a search query to the Amazon Kendra index based on the user request.
  3. The index returns search results with excerpts of relevant documents from the ingested enterprise data.
  4. The app sends the user request along with the data retrieved from the index as context in the LLM prompt.
  5. The LLM returns a succinct response to the user request based on the retrieved data.
  6. The response from the LLM is sent back to the user. (A minimal code sketch of this flow follows.)
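
The following is a minimal Python sketch of this flow, assuming a deployed Amazon Kendra index and a placeholder generate() function that wraps whichever LLM you choose. It uses the Amazon Kendra Query API via the AWS SDK for Python (Boto3) to fetch document excerpts and assembles them into a prompt; treat it as an illustration rather than the solution’s actual code.

    import boto3

    kendra = boto3.client("kendra", region_name="<YOUR-AWS-REGION>")

    def retrieve_context(index_id: str, question: str, top_k: int = 3) -> str:
        """Query the Amazon Kendra index and concatenate the top document excerpts."""
        response = kendra.query(IndexId=index_id, QueryText=question)
        excerpts = [
            item["DocumentExcerpt"]["Text"]
            for item in response["ResultItems"][:top_k]
            if "DocumentExcerpt" in item
        ]
        return "\n".join(excerpts)

    def answer(index_id: str, question: str) -> str:
        context = retrieve_context(index_id, question)
        prompt = (
            "Answer the question based only on the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return generate(prompt)  # generate() is a hypothetical wrapper around your LLM endpoint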

With this architecture, you can choose the most suitable LLM for your use case. LLM options include models from our partners Hugging Face, AI21 Labs, Cohere, and others hosted on an Amazon SageMaker endpoint, as well as models from companies like Anthropic and OpenAI. With Amazon Bedrock, you will be able to choose Amazon Titan, Amazon’s own LLM, or partner LLMs such as those from AI21 Labs and Anthropic through APIs, securely and without your data leaving the AWS ecosystem. The additional benefits that Amazon Bedrock will offer include a serverless architecture, a single API to call the supported LLMs, and a managed service to streamline the developer workflow.

For the best results, a GenAI app needs to engineer the prompt based on the user request and the specific LLM being used. Conversational AI apps also need to manage the chat history and the context. GenAI app developers can use open-source frameworks such as LangChain, which provide modules to integrate with the LLM of choice, as well as orchestration tools for activities such as chat history management and prompt engineering. We have provided the KendraIndexRetriever class, which implements a LangChain retriever interface that applications can use in conjunction with other LangChain interfaces, such as chains, to retrieve data from an Amazon Kendra index (a minimal usage sketch follows). We have also provided a few sample applications in the GitHub repo. You can deploy this solution in your AWS account using the step-by-step guide in this post.
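
The following is a minimal sketch of that usage, combining the KendraIndexRetriever with a LangChain RetrievalQA chain. The import path and constructor arguments shown here are assumptions, so refer to the GitHub repo for the authoritative version, and swap in any LangChain-compatible LLM.

    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI  # or a SagemakerEndpoint / Anthropic wrapper
    from aws_langchain.kendra_index_retriever import KendraIndexRetriever  # import path is an assumption

    retriever = KendraIndexRetriever(
        kendraindex="<YOUR-KENDRA-INDEX-ID>",   # argument names are assumptions
        awsregion="<YOUR-AWS-REGION>",
        return_source_documents=True,
    )

    chain = RetrievalQA.from_chain_type(
        llm=OpenAI(temperature=0),   # reads OPENAI_API_KEY from the environment
        chain_type="stuff",          # "stuff" places the retrieved excerpts directly into the prompt
        retriever=retriever,
        return_source_documents=True,
    )

    result = chain("What's SageMaker?")
    print(result["result"])
    for doc in result["source_documents"]:
        print(doc.metadata.get("source"))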

Prerequisites

For this tutorial, you’ll need a bash terminal with Python 3.9 or higher installed on Linux, Mac, or Windows Subsystem for Linux, and an AWS account. We also recommend using an AWS Cloud9 instance or an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Implement a RAG workflow

To configure your RAG workflow, complete the following steps:

  1. Use the provided AWS CloudFormation template to create a new Amazon Kendra index.

This template includes sample data containing AWS online documentation for Amazon Kendra, Amazon Lex, and Amazon SageMaker. Alternatively, if you have an Amazon Kendra index and have indexed your own dataset, you can use that. The stack takes about 30 minutes to launch, followed by about 15 minutes to synchronize the data source and ingest the data into the index, so wait about 45 minutes after launching the stack. Note the index ID and AWS Region on the stack’s Outputs tab.

  2. For an improved GenAI experience, we recommend requesting an Amazon Kendra service quota increase for the maximum DocumentExcerpt size, so that Amazon Kendra provides larger document excerpts to improve the semantic context for the LLM.
  3. Install the AWS SDK for Python on the command line interface of your choice.
  4. If you want to use the sample web apps built using Streamlit, you first need to install Streamlit. This step is optional if you only want to run the command line versions of the sample applications.
  5. Install LangChain.
  6. The sample applications used in this tutorial require you to have access to one or more LLMs from Flan-T5-XL, Flan-T5-XXL, Anthropic Claude-V1, and OpenAI text-davinci-003.
    1. If you want to use Flan-T5-XL or Flan-T5-XXL, deploy them to an endpoint for inference using Amazon SageMaker Studio JumpStart.
    2. If you want to work with Anthropic Claude-V1 or OpenAI text-davinci-003, acquire the API keys for the LLMs of your interest from https://www.anthropic.com/ and https://openai.com/, respectively.
  7. Follow the instructions in the GitHub repo to install the KendraIndexRetriever interface and sample applications.
  8. Before you run the sample applications, set environment variables with the Amazon Kendra index details and the API keys of your preferred LLM or the SageMaker endpoints of your deployments for Flan-T5-XL or Flan-T5-XXL. The following is a sample script to set the environment variables:
    export AWS_REGION="<YOUR-AWS-REGION>"
    export KENDRA_INDEX_ID="<YOUR-KENDRA-INDEX-ID>"
    export FLAN_XL_ENDPOINT="<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XL>"
    export FLAN_XXL_ENDPOINT="<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XXL>"
    export OPENAI_API_KEY="<YOUR-OPEN-AI-API-KEY>"
    export ANTHROPIC_API_KEY="<YOUR-ANTHROPIC-API-KEY>"

  9. In a command line window, change to the samples subdirectory of where you cloned the GitHub repository. You can run the command line apps as python <sample-file-name.py>. You can run the Streamlit web app by changing the directory to samples and running streamlit run app.py <anthropic|flanxl|flanxxl|openai>.
  10. Open the sample file kendra_retriever_flan_xxl.py in an editor of your choice.

Observe the statement result = run_chain(chain, "What's SageMaker?"). This is the user query (“What’s SageMaker?”) being run through the chain, which uses Flan-T5-XXL as the LLM and Amazon Kendra as the retriever. When this file is run, you can observe the output as follows. The chain sent the user query to the Amazon Kendra index, retrieved the top three result excerpts, and sent them as context in a prompt along with the query, to which the LLM responded with a succinct answer. It also provided the sources (the URLs of the documents used in generating the answer).

$ python3 kendra_retriever_flan_xxl.py
Amazon SageMaker is a machine learning service that lets you train and deploy models in the cloud.
Sources:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-whatis.html
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
  11. Now let’s run the web app app.py as streamlit run app.py flanxxl. For this specific run, we are using a Flan-T5-XXL model as the LLM.

It opens a browser window with the web interface. You can enter a query, which in this case is “What is Amazon Lex?” As seen in the following screenshot, the application responds with an answer, and the Sources section lists the URLs of the documents whose excerpts were retrieved from the Amazon Kendra index and sent to the LLM as context in the prompt along with the query.

  12. Now let’s run app.py again and get a feel for the conversational experience using streamlit run app.py anthropic. Here the underlying LLM used is Anthropic Claude-V1.

As you can see in the following video, the LLM provides a detailed answer to the user’s query based on the documents it retrieved from the Amazon Kendra index, and then supports the answer with the URLs of the source documents that were used to generate it. Note that subsequent queries don’t explicitly mention Amazon Kendra; however, the ConversationalRetrievalChain (a type of chain in the LangChain framework, used in this application, that provides an easy mechanism for developing conversational applications based on information retrieved from retriever instances) manages the chat history and the context to produce an appropriate response.
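
The following is a minimal sketch of how such a conversational chain can be wired up with the Amazon Kendra retriever. The Anthropic API key is read from the ANTHROPIC_API_KEY environment variable set earlier, and the retriever import path and arguments are assumptions, so check the sample applications in the GitHub repo for the authoritative version.

    from langchain.chains import ConversationalRetrievalChain
    from langchain.llms import Anthropic
    from aws_langchain.kendra_index_retriever import KendraIndexRetriever  # import path is an assumption

    retriever = KendraIndexRetriever(
        kendraindex="<YOUR-KENDRA-INDEX-ID>",
        awsregion="<YOUR-AWS-REGION>",
    )

    # ConversationalRetrievalChain condenses the chat history and the new question into a
    # standalone query before retrieving context from Amazon Kendra for each turn.
    chain = ConversationalRetrievalChain.from_llm(
        llm=Anthropic(temperature=0),   # reads ANTHROPIC_API_KEY from the environment
        retriever=retriever,
        return_source_documents=True,
    )

    chat_history = []
    for question in ["What is Amazon Kendra?", "Which data sources can it connect to?"]:
        result = chain({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
        print(result["answer"])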

Also note that in the following screenshot, Amazon Kendra finds the extractive answer to the query and shortlists the top documents with excerpts. Then the LLM is able to generate a more succinct answer based on these retrieved excerpts.

In the following sections, we explore two use cases for using Generative AI with Amazon Kendra.

Use case 1: Generative AI for financial service companies

Financial organizations create and store data across various data repositories, including financial reports, legal documents, and whitepapers. They must adhere to strict government regulations and oversight, which means employees need to find relevant, accurate, and trustworthy information quickly. Additionally, searching and aggregating insights across various data sources is cumbersome and error prone. With Generative AI on AWS, users can quickly generate answers from various data sources and types, synthesizing accurate answers at enterprise scale.

We chose a solution using Amazon Kendra and AI21 Labs’ Jurassic-2 Jumbo Instruct LLM. With Amazon Kendra, you can easily ingest data from multiple data sources such as Amazon S3, websites, and ServiceNow. The solution then uses AI21 Labs’ Jurassic-2 Jumbo Instruct LLM to carry out inference activities on the enterprise data, such as data summarization and report generation. Amazon Kendra augments LLMs to provide accurate and verifiable information to the end-users, which reduces hallucination issues with LLMs. With the proposed solution, financial analysts can make faster decisions using accurate data to quickly build detailed and comprehensive portfolios. We plan to make this solution available as an open-source project in the near future.

Example

Using the Kendra Chatbot solution, financial analysts and auditors can interact with their enterprise data (financial reports and agreements) to find reliable answers to audit-related questions. Kendra ChatBot provides answers along with source links and has the capability to summarize longer answers. The following screenshot shows an example conversation with Kendra ChatBot.

Architecture overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. Financial documents and agreements are stored on Amazon S3, and ingested to an Amazon Kendra index using the S3 data source connector.
  2. The LLM is hosted on a SageMaker endpoint.
  3. An Amazon Lex chatbot is used to interact with the user via the Amazon Lex web UI.
  4. The solution uses an AWS Lambda function with LangChain to orchestrate between Amazon Kendra, Amazon Lex, and the LLM (a minimal sketch of this orchestration follows the list).
  5. When users ask the Amazon Lex chatbot for answers from a financial document, Amazon Lex calls the LangChain orchestrator to fulfill the request.
  6. Based on the query, the LangChain orchestrator pulls the relevant financial records and paragraphs from Amazon Kendra.
  7. The LangChain orchestrator provides these relevant records to the LLM along with the query and relevant prompt to carry out the required activity.
  8. The LLM processes the request from the LangChain orchestrator and returns the result.
  9. The LangChain orchestrator gets the result from the LLM and sends it to the end-user through the Amazon Lex chatbot.
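
The following is a minimal sketch of how such a Lambda fulfillment handler might look, assuming an Amazon Lex V2 fulfillment event, a RetrievalQA chain like the one shown earlier, and the assumed KendraIndexRetriever import path; the function and payload shapes are illustrative rather than the solution’s actual code.

    import os

    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI
    from aws_langchain.kendra_index_retriever import KendraIndexRetriever  # import path is an assumption

    # Build the chain once per container so that warm Lambda invocations reuse it.
    retriever = KendraIndexRetriever(
        kendraindex=os.environ["KENDRA_INDEX_ID"],
        awsregion=os.environ["AWS_REGION"],
    )
    chain = RetrievalQA.from_chain_type(
        llm=OpenAI(temperature=0), chain_type="stuff", retriever=retriever
    )

    def lambda_handler(event, context):
        """Amazon Lex V2 fulfillment hook: answer the user's question with RAG over Amazon Kendra."""
        question = event["inputTranscript"]
        answer = chain.run(question)
        intent = event["sessionState"]["intent"]
        intent["state"] = "Fulfilled"
        return {
            "sessionState": {"dialogAction": {"type": "Close"}, "intent": intent},
            "messages": [{"contentType": "PlainText", "content": answer}],
        }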

Use case 2: Generative AI for healthcare researchers and clinicians

Clinicians and researchers often analyze thousands of articles from medical journals or government health websites as part of their research. More importantly, they want trustworthy data sources they can use to validate and substantiate their findings. The process requires hours of intensive research, analysis, and data synthesis, lengthening the time to value and innovation. With Generative AI on AWS, you can connect to trusted data sources and run natural language queries to generate insights across these trusted data sources in seconds. You can also review the sources used to generate the response and validate its accuracy.

We chose a solution using Amazon Kendra and Flan-T5-XXL from Hugging Face. First, we use Amazon Kendra to identify text snippets from semantically relevant documents in the entire corpus. Then an LLM such as Flan-T5-XXL uses the text snippets from Amazon Kendra as context to produce a succinct natural language answer. In this approach, the Amazon Kendra index functions as the passage retriever component in the RAG mechanism. Lastly, we use Amazon Lex to power the front end, providing a seamless and responsive experience to end-users. We plan to make this solution available as an open-source project in the near future.
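
As a minimal sketch, the snippet below shows how a Flan-T5-XXL SageMaker endpoint could be invoked directly with Kendra-retrieved excerpts as the context. The request and response JSON keys depend on how the model was deployed (here they assume a SageMaker JumpStart Flan-T5 deployment), so adjust them to match your endpoint.

    import json

    import boto3

    sagemaker_runtime = boto3.client("sagemaker-runtime")

    def ask_flan_t5(endpoint_name: str, question: str, context: str) -> str:
        """Send a context-grounded prompt to a Flan-T5 endpoint and return the generated answer."""
        prompt = (
            "Answer the question based on the context.\n\n"
            f"Context: {context}\n\nQuestion: {question}"
        )
        payload = {"text_inputs": prompt, "max_length": 200}  # keys assume a JumpStart Flan-T5 deployment
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        body = json.loads(response["Body"].read())
        return body["generated_texts"][0]  # response key is also deployment-dependent

    # Usage, reusing the retrieve_context() helper sketched earlier in this post:
    # context = retrieve_context("<YOUR-KENDRA-INDEX-ID>", "What is Amazon Lex?")
    # print(ask_flan_t5("<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XXL>", "What is Amazon Lex?", context))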

Example

The following screenshot is from a web UI built for the solution using the template available on GitHub. The text in pink shows responses from the Amazon Kendra LLM system, and the text in blue shows the user questions.

Architecture overview

The architecture and solution workflow for this solution are similar to that of use case 1.

Clean up

To save costs, delete all the resources you deployed as part of the tutorial. If you launched the CloudFormation stack, you can delete it via the AWS CloudFormation console. Similarly, you can delete any SageMaker endpoints you may have created via the SageMaker console.

Conclusion

Generative AI powered by large language models is changing how people acquire and apply insights from information. However, for enterprise use cases, the insights must be generated based on enterprise content to keep the answers in-domain and mitigate hallucinations, using the Retrieval Augmented Generation approach. In the RAG approach, the quality of the insights generated by the LLM depends on the semantic relevance of the retrieved information on which it is based, making it increasingly necessary to use solutions such as Amazon Kendra that provide high-accuracy semantic search results out of the box. With Amazon Kendra’s comprehensive ecosystem of data source connectors, support for common file formats, and security features, you can quickly start using Generative AI solutions for enterprise use cases with Amazon Kendra as the retrieval mechanism.

For more information on working with Generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS. You can start experimenting and building RAG proofs of concept (POCs) for your enterprise GenAI apps using the method outlined in this post. As mentioned earlier, once Amazon Bedrock is available, we will publish a follow-up post showing how you can build RAG using Amazon Bedrock.


About the authors

Abhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

Jean-Pierre Dodel is the Principal Product Manager for Amazon Kendra and leads key strategic product capabilities and roadmap prioritization. He brings extensive enterprise search and ML/AI experience to the team, having held leading roles at Autonomy, HP, and search startups before joining Amazon 7 years ago.

Mithil Shah is an ML/AI Specialist at AWS. Currently he helps public sector customers improve the lives of citizens by building machine learning solutions on AWS.

Firaz Akmal is a Sr. Product Manager for Amazon Kendra at AWS. He is a customer advocate, helping customers understand their search and generative AI use-cases with Kendra on AWS. Outside of work Firaz enjoys spending time in the mountains of the PNW or experiencing the world through his daughter’s perspective.

Abhishek Maligehalli Shivalingaiah is a Senior AI Services Solution Architect at AWS with a focus on Amazon Kendra. He is passionate about building applications using Amazon Kendra, generative AI, and NLP. He has around 10 years of experience in building data and AI solutions to create value for customers and enterprises. He has built a (personal) chatbot for fun to answer questions about his career and professional journey. Outside of work he enjoys making portraits of family and friends, and loves creating artwork.

IndoorSim-to-OutdoorReal: Learning to navigate outdoors without any outdoor experience

Teaching mobile robots to navigate in complex outdoor environments is critical to real-world applications, such as delivery or search and rescue. However, this is also a challenging problem as the robot needs to perceive its surroundings, and then explore to identify feasible paths towards the goal. Another common challenge is that the robot needs to overcome uneven terrains, such as stairs, curbs, or rockbed on a trail, while avoiding obstacles and pedestrians. In our prior work, we investigated the second challenge by teaching a quadruped robot to tackle challenging uneven obstacles and various outdoor terrains.

In “IndoorSim-to-OutdoorReal: Learning to Navigate Outdoors without any Outdoor Experience”, we present our recent work to tackle the robotic challenge of reasoning about the perceived surroundings to identify a viable navigation path in outdoor environments. We introduce a learning-based indoor-to-outdoor transfer algorithm that uses deep reinforcement learning to train a navigation policy in simulated indoor environments, and successfully transfers that same policy to real outdoor environments. We also introduce Context-Maps (maps with environment observations created by a user), which are applied to our algorithm to enable efficient long-range navigation. We demonstrate that with this policy, robots can successfully navigate hundreds of meters in novel outdoor environments, around previously unseen outdoor obstacles (trees, bushes, buildings, pedestrians, etc.), and in different weather conditions (sunny, overcast, sunset).

PointGoal navigation

User inputs can tell a robot where to go with commands like “go to the Android statue”, pictures showing a target location, or by simply picking a point on a map. In this work, we specify the navigation goal (a selected point on a map) as a coordinate relative to the robot’s current position (i.e., “go to ∆x, ∆y”); this is also known as the PointGoal Visual Navigation (PointNav) task. PointNav is a general formulation for navigation tasks and is one of the standard choices for indoor navigation tasks. However, due to the diverse visuals, uneven terrains, and long-distance goals in outdoor environments, training PointNav policies for outdoor environments is a challenging task.

Indoor-to-outdoor transfer

Recent successes in training wheeled and legged robotic agents to navigate in indoor environments were enabled by the development of fast, scalable simulators and the availability of large-scale datasets of photorealistic 3D scans of indoor environments. To leverage these successes, we develop an indoor-to-outdoor transfer technique that enables our robots to learn from simulated indoor environments and to be deployed in real outdoor environments.

To overcome the differences between simulated indoor environments and real outdoor environments, we apply kinematic control and image augmentation techniques in our learning system. When using kinematic control, we assume the existence of a reliable low-level locomotion controller that can control the robot to precisely reach a new location. This assumption allows us to directly move the robot to the target location during simulation training through a forward Euler integration and relieves us from having to explicitly model the underlying robot dynamics in simulation, which drastically improves the throughput of simulation data generation. Prior work has shown that kinematic control can lead to better sim-to-real transfer compared to a dynamic control approach, where full robot dynamics are modeled and a low-level locomotion controller is required for moving the robot.

Left: Kinematic control; Right: Dynamic control
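
As a rough illustration of the kinematic-control assumption (not code from this work), the helper below moves the simulated robot directly to the pose implied by the commanded velocities using a forward Euler update, skipping any simulation of forces, contacts, or motor dynamics.

    import math

    def kinematic_step(x, y, heading, linear_vel, angular_vel, dt):
        """Forward Euler update: advance the simulated robot along the commanded velocities.

        No dynamics are simulated; a reliable low-level locomotion controller is assumed
        to track the resulting pose on the real robot.
        """
        x_next = x + linear_vel * math.cos(heading) * dt
        y_next = y + linear_vel * math.sin(heading) * dt
        heading_next = heading + angular_vel * dt
        return x_next, y_next, heading_next

    # Example: one 0.1 s step at 0.5 m/s forward velocity and 0.2 rad/s turn rate.
    print(kinematic_step(0.0, 0.0, 0.0, 0.5, 0.2, 0.1))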

We created an outdoor maze-like environment using objects found indoors for initial experiments, where we used Boston Dynamics’ Spot robot for test navigation. We found that the robot could navigate around novel obstacles in the new outdoor environment.

The Spot robot successfully navigates around obstacles found in indoor environments, with a policy trained entirely in simulation.

However, when faced with unfamiliar outdoor obstacles not seen during training, such as a large slope, the robot was unable to navigate it.

The robot is unable to navigate up slopes, as slopes are rare in indoor environments and the robot was not trained to tackle them.

To enable the robot to walk up and down slopes, we apply an image augmentation technique during the simulation training. Specifically, we randomly tilt the simulated camera on the robot during training, pointing it up or down by as much as 30 degrees. This augmentation effectively makes the robot perceive slopes even though the floor is level. Training on these perceived slopes enables the robot to navigate slopes in the real world.

By randomly tilting the camera angle during training in simulation, the robot is now able to walk up and down slopes.
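
A minimal sketch of this augmentation is shown below, assuming a simulator that exposes a settable camera pitch; the attribute name is hypothetical, not taken from the actual training code.

    import random

    MAX_TILT_DEG = 30.0  # the camera can be pitched up or down by up to 30 degrees

    def randomize_camera_pitch(sim_camera):
        """Apply a random pitch offset to the simulated camera, e.g., at episode reset.

        `sim_camera` is a hypothetical handle exposing a `pitch_degrees` attribute; real
        simulators differ, but the idea is the same: a level floor seen through a tilted
        camera looks like a slope to the policy.
        """
        sim_camera.pitch_degrees = random.uniform(-MAX_TILT_DEG, MAX_TILT_DEG)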

Since the robots were only trained in simulated indoor environments, in which they typically need to walk to a goal just a few meters away, we find that the learned network failed to process longer-range inputs — e.g., the policy failed to walk forward for 100 meters in an empty space. To enable the policy network to handle long-range inputs that are common for outdoor navigation, we normalize the goal vector by using the log of the goal distance.
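
The snippet below illustrates one way such a normalization could look (the exact form used in this work is not spelled out here, so treat it as an assumption): the goal direction is kept as a unit vector while the distance is compressed with a logarithm, so that 5 m and 500 m goals produce inputs of comparable magnitude.

    import math

    def normalize_goal(dx, dy):
        """Convert a relative goal (dx, dy) in meters into a scale-compressed policy input."""
        distance = math.hypot(dx, dy)
        heading = math.atan2(dy, dx)
        # log1p stays finite near zero and compresses long-range distances.
        return math.log1p(distance), math.cos(heading), math.sin(heading)

    print(normalize_goal(3.0, 4.0))      # nearby goal: distance 5 m
    print(normalize_goal(300.0, 400.0))  # distant goal: distance 500 m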

Context-Maps for complex long-range navigation

Putting everything together, the robot can navigate outdoors towards the goal, while walking on uneven terrain, and avoiding trees, pedestrians and other outdoor obstacles. However, there is still one key component missing: the robot’s ability to plan an efficient long-range path. At this scale of navigation, taking a wrong turn and backtracking can be costly. For example, we find that the local exploration strategy learned by standard PointNav policies is insufficient for finding a long-range goal and usually leads to a dead end (shown below). This is because the robot is navigating without context of its environment, and the optimal path may not be visible to the robot from the start.

Navigation policies without context of the environment do not handle complex long-range navigation goals.

To enable the robot to take the context into consideration and purposefully plan an efficient path, we provide a Context-Map (a binary image that represents a top-down occupancy map of the region that the robot is within) as an additional observation for the robot. An example Context-Map is given below, where the black region denotes areas occupied by obstacles and the white region is walkable by the robot. The green and red circles denote the start and goal locations of the navigation task. Through the Context-Map, we can provide hints to the robot (e.g., the narrow opening in the route below) to help it plan an efficient navigation route. In our experiments, we create the Context-Map for each route guided by Google Maps satellite images. We denote this variant of PointNav with environmental context as Context-Guided PointNav.

Example of the Context-Map (right) for a navigation task (left).

It is important to note that the Context-Map does not need to be accurate because it only serves as a rough outline for planning. During navigation, the robot still needs to rely on its onboard cameras to identify and adapt its path to pedestrians, which are absent on the map. In our experiments, a human operator quickly sketches the Context-Map from the satellite image, masking out the regions to be avoided. This Context-Map, together with other onboard sensory inputs, including depth images and the relative position to the goal, is fed into a neural network with attention models (i.e., transformers), which is trained using DD-PPO, a distributed implementation of proximal policy optimization, in large-scale simulations.

The Context-Guided PointNav architecture consists of a 3-layer convolutional neural network (CNN) to process depth images from the robot’s camera, and a multilayer perceptron (MLP) to process the goal vector. The features are passed into a gated recurrent unit (GRU). We use an additional CNN encoder to process the context-map (top-down map). We compute the scaled dot product attention between the map and the depth image, and use a second GRU to process the attended features (Context Attn., Depth Attn.). The outputs of the policy are linear and angular velocities for the Spot robot to follow.
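
The following PyTorch-style sketch mirrors that description at a high level; the layer sizes, image resolution, and exact attention wiring are illustrative assumptions rather than the published model.

    import torch
    import torch.nn as nn

    class ContextGuidedPointNav(nn.Module):
        """Sketch of the described policy: a CNN for depth images, a CNN for the Context-Map,
        an MLP for the goal vector, scaled dot-product attention between map and depth
        features, and two GRUs feeding a velocity head."""

        def __init__(self, hidden=512):
            super().__init__()

            def small_cnn():
                return nn.Sequential(
                    nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
                    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(64, hidden),
                )

            self.depth_encoder = small_cnn()   # 3-layer CNN for depth images
            self.map_encoder = small_cnn()     # additional CNN encoder for the top-down Context-Map
            self.goal_encoder = nn.Sequential(nn.Linear(3, hidden), nn.ReLU())  # MLP for the goal vector
            self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
            self.gru1 = nn.GRU(hidden * 2, hidden, batch_first=True)
            self.gru2 = nn.GRU(hidden, hidden, batch_first=True)
            self.velocity_head = nn.Linear(hidden, 2)  # linear and angular velocity for the robot

        def forward(self, depth, context_map, goal):
            d = self.depth_encoder(depth)      # (batch, hidden)
            m = self.map_encoder(context_map)  # (batch, hidden)
            g = self.goal_encoder(goal)        # (batch, hidden)
            # Scaled dot-product attention between the map and the depth features.
            attended, _ = self.attn(m.unsqueeze(1), d.unsqueeze(1), d.unsqueeze(1))
            x, _ = self.gru1(torch.cat([g, d], dim=-1).unsqueeze(1))
            x, _ = self.gru2(x + attended)     # second GRU processes the attended features
            return self.velocity_head(x.squeeze(1))  # (batch, 2): [linear_vel, angular_vel]

    policy = ContextGuidedPointNav()
    velocities = policy(torch.zeros(1, 1, 128, 128), torch.zeros(1, 1, 128, 128), torch.zeros(1, 3))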

Results

We evaluate our system across three long-range outdoor navigation tasks. The provided Context-Maps are rough, incomplete environment outlines that omit obstacles, such as cars, trees, or chairs.

With the proposed algorithm, our robot can successfully reach the distant goal location 100% of the time, without a single collision or human intervention. The robot was able to navigate around pedestrians and real-world clutter that are not present on the context-map, and navigate on various terrain including dirt slopes and grass.

Route 1

Route 2

Route 3

Conclusion

This work opens up robotic navigation research to the less explored domain of diverse outdoor environments. Our indoor-to-outdoor transfer algorithm uses zero real-world experience and does not require the simulator to model predominantly outdoor phenomena (terrain, ditches, sidewalks, cars, etc.). The success of the approach comes from a combination of a robust locomotion controller, a low sim-to-real gap in depth and map sensors, and large-scale training in simulation. We demonstrate that providing robots with approximate, high-level maps can enable long-range navigation in novel outdoor environments. Our results provide compelling evidence for challenging the (admittedly reasonable) hypothesis that a new simulator must be designed for every new scenario we wish to study. For more information, please see our project page.

Acknowledgements

We would like to thank Sonia Chernova, Tingnan Zhang, April Zitkovich, Dhruv Batra, and Jie Tan for advising and contributing to the project. We would also like to thank Naoki Yokoyama, Nubby Lee, Diego Reyes, Ben Jyenis, and Gus Kouretas for help with the robot experiment setup.

Collaborators: Gov4git with Petar Maymounkov and Kasia Sitkiewicz

GitHub Product Manager Kasia Sitkiewicz and Protocol Labs Research Scientist Petar Maymounkov discuss their collaboration on Gov4git on the Microsoft Research Podcast

Episode 139 | May 3, 2023

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with. 

In this inaugural episode, host Dr. Gretchen Huizinga talks with GitHub Staff Product Manager Kasia Sitkiewicz and Protocol Labs Research Scientist Petar Maymounkov about how their collaboration on Gov4git, a governance tool for decentralized, open-source cooperation, is helping to lay the foundation for a future in which everyone can collaborate more efficiently, transparently, and easily and in the ways that meet the unique desires and needs of their respective communities. They discuss the governance features that make Gov4git more suitable for serving a broader range of communities than today’s public blockchains and the open-source book project allowing them to test the potential and limitations of the work.

Transcript

[MUSIC] 

GRETCHEN HUIZINGA: Every great idea at Microsoft Research is yearning to find its way into the hearts, minds, and hands of people. Microsoft researchers work with an amazing—and sometimes surprising—array of collaborators from across the sciences who are integral to the process of shepherding these ideas from lab to life. Welcome to Collaborators, a podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga, and in this series, we’ll dive deep into the collaboration process and illuminate how research ideas move from mind to market in our ongoing effort to enhance human abilities, strengthen human communities, and benefit human lives. 


[MUSIC ENDS] 

Welcome to Episode 1 of Collaborators. Today, I’m joined by our first two guests, Petar Maymounkov and Kasia Sitkiewicz. Petar and Kasia are working on a project that has collaboration in its DNA: Gov4git, a decentralized, transparent, and secure git-based protocol for governing open-source communities that they say circumvents more costly approaches to things like validation and dispute resolution. 

We’re going to unpack all of that in this episode. But before we do, let’s get to know our collaborators. Kasia, let’s start with you. You’re at GitHub, “an open-source platform for collaborative software development and version management.” This platform is well-known in the dev community but give us a brief elevator tour of GitHub and particularly what your role is there. 

KASIA SITKIEWICZ: Sure. So I’m happy to give an overview of GitHub. Uh, GitHub is primarily known to be a home for all developers and open-source communities. It’s one of the most popular resources for developers, as you mentioned, to share code and work on projects in collaboration. It makes [it] super easy for developers to share code files and collaborate with each other using GitHub issues, which we will be referencing in the podcast, and pull requests, uh, which we call PRs. So imagine GitHub issues being like a project description or some kind of information that what needs to be built, and PRs, um, are pretty much amendments to the code change that a community wants to merge with the main code branch, uh, and that’s very well known among developer community. So pretty much like that’s how we use version control. We know what needs to be changed, what needs to be merged, and community pretty much participates in all of those changes. And what I do at GitHub, uh, I work as a product manager. I oversee growth for GitHub Enterprise Cloud and GitHub Advanced Security, and on the side, I collaborate with Microsoft, Web3, and Microsoft Research team on, uh, working on projects like Gov4git or other Web3 partnerships where I represent GitHub and, um, trying to onboard and make those projects successful. 

HUIZINGA: So there’s meta-collaboration, and then there’s micro-collaboration, and collaboration all over the place in GitHub. 

SITKIEWICZ: Exactly. Yes, we, we do like to collaborate. 

HUIZINGA: [LAUGHS] Well, you’re perfect for this show. So, Petar, you’re at Protocol Labs, “an open-source research, development, and deployment laboratory.” And, and you say you’re “building the next generation of the internet and making human existence orders of magnitude better through technology.” No pressure, right? Briefly tell us about Protocol Labs and your role in taking the internet and humanity to the next level. 

PETAR MAYMOUNKOV: Yeah, um, first, thank you for having us. Since you’re asking about the North Star mission of Protocol Labs, so to speak, I think it’s quite simple. I think it’s really trying to sort of create a better world that is both, um, it’s sustainable, fair, and inclusive, and it’s trying to do this through decentralization as a concept and technologies, of course, in particular. Now this is a mighty goal, and in practice, it, um, comprises essentially three workstreams, if you will. Um, the first thing is decentralized infrastructure, because it’s not possible to, to build anything useful without the infrastructure, and in this regard, Protocol Labs is, um, essentially working on and stewarding, uh, two products Filecoin and IPFS, which provide decentralized infrastructure in a democratic way to the whole world essentially. Um, now the second workstream is, um—Protocol Labs was one of the companies to realize early on that, uh, whenever decentralized technologies are involved, um, they go hand in hand with, uh, enabling everybody to contribute, so this raises the question of decentralized development, which is how do people collaborate across country boundaries, backgrounds, different levels of experience, and so forth. So along with all the engineering efforts, Protocol Labs is also essentially innovating workflows and culture about being productive in a decentralized development kind of, um setting. And the final workstream, uh, which kind of shows you how long term the vision is in Protocol Labs, so we recognize that, um, we cannot have a sustainable, decentralized world unless we replicate some of the important, um, sort of processes that happen in the real world, in particular the research-to-development innovation pipeline. So in the real world, this goes from academia to industry, and so forth. And part of, um, why this question is new and not the same as in the real world is because, uh, decentralized products being a type of public good, um, do not succumb to the same incentive mechanisms that drive the conventional economy. So we, we have a department called network funding and funding of public goods, which is itself involved in thinking about new mechanisms and incentives for, for making this, this process work in a repeatable fashion, basically. And my, uh, my role currently in the company is, uh, to think about facilitating decentralized development through standardized tools and protocols. 

HUIZINGA: Gotcha. Well, as we’re talking about collaboration and collaborators and you two are at two different companies, I’m going to call this question “how I met your mother”! How did Gov4git come about, and what was the initial felt need that defined the purpose? And as you answer that, tell us who’s all involved and how you each got involved on the team. Kasia, I’ll let you take the lead on this one. 

SITKIEWICZ: Sure. So I guess on my end, it all started through the passion I have for open source and the idea of decentralized communities. As I mentioned, I’m part of a lot of, uh, projects here at Microsoft and GitHub, and one of them is Web3 and Plural Technology Collaboratory that is led by Glen Weyl, and a few months ago, Glen and I, we had a conversation about how amazing git is and how amazing our GitHub communities are and overall like the efforts that they are working on towards like better world, public goods, and so on, and I share my vision for GitHub to be a tool or platform that can be accessible by anyone around the world where people can collaborate, they can own, uh, share and like earn money pretty much because of those contributions that they have. So we talk about this vision and we share the same kind of like a passion for all of those different projects and, you know, aspects of like open source, and he mentioned like, “Hey, we’re actually working on this like open-source book, uh, that will be hosted on GitHub, and we would love to do some kind of collaboration here.” And then he introduced me to Petar and Protocol Labs, and we had our first intro call. Uh, we learned like what is the objective, what problems we are trying to solve, and we put a small team of GitHub, Microsoft, and folks from Protocol Labs and a few folks also from open source, like purely I put a tweet about like, “Hey, I’m looking for contributors to this amazing project that will help with governance for open source,” and few folks reach out, and that’s how we kind of put it together. 

HUIZINGA: Right. Petar, how do you see the, the thing coming around? 

MAYMOUNKOV: So I had been working for Protocol Labs for about three and a half years. The first couple of years, I spent most of my time engineering and sort of being in the real-world decentralized development kind of environment, so I saw lots of things that work well; I saw lots of things that need improving; and over time, I developed an interest to kind of address, uh, this question sort of systematically and head on, which is when I, um, started working specifically just on this problem. And about six months ago or so, when I was starting, I was initially researching the space and what’s known. This is how I ran into Glen Weyl’s work, so eventually, we, we connected, and, um, I read sort of most of the stuff that he’s been working on and tried to sort of find a connection between this and what I knew from, from the trenches, if you will, from the engineering department, and then—and then, you know, he connected us with, um, with, uh, with Kasia. But the thing that sparked it, though, so at some point, Glen did sort of point out the specific project that he was trying to initiate, the plurality book, and this was kind of the thing that put a shape to our efforts because it was a very concrete task that we needed to figure out how to like address and accomplish in like a reasonable time. 

HUIZINGA: Yeah, so, so let’s get sort of granular about Gov4git and what it is, because I don’t think we’ve defined that, uh, from the get-go here, so, Kasia, can you kind of explain what it is and why it’s different? 

SITKIEWICZ: Sure. So Gov4git is pretty much a tool that helps, uh, open-source community to govern their community members in a more efficient, transparent, and easy way. There is a lot of problems in traditional governance model for any communities, and the larger communities are, there, there are more problems. And Gov4git is trying to solve a very particular problem of giving autonomy and ownership to the community to make decision what needs to happen and what changes the community needs to prioritize in order to make the project more successful. So, it’s just a solution that helps you to govern your communities in an efficient way. 

HUIZINGA: Yeah, so even as we’re talking, I’m thinking, OK, you’ve got Microsoft Research, you’ve got GitHub, you’ve got Protocol Labs. But do you use this to govern the things that you guys are working through as a community collaboration? 

MAYMOUNKOV: The tool itself is essentially implementing processes that kind of have organically emerged both in, in the context of Protocol Labs, as well as even other organizations like Ethereum. Um, I mean, this is the process of people kind of collaborating on specifications for decentralized protocols and so forth. For the particular—for Gov4git specifically, since the tool is still, uh, in some sense under development, but it, but it is kind of approaching MVP, we have used it internally as, as dog food, um, but not at large scale yet. 

HUIZINGA: Right. Gotcha. 

SITKIEWICZ: Yeah. And I think the beauty of Gov4git is actually very useful when you have a bigger community. Right now, our team is very small. It’s just like, uh, six people working together, so—and this is something I want to elaborate a little bit more in our, later in the podcast—but the smaller community, there is less problems, and you kind of make a decision on the fly, on the go, like, “Hey, what are we going to build next? And should we, should we focus on this or that?” So you can actually make those decisions without really spending too much time. And that’s a beauty for all startups moving fast, but the moment the community grows, you have those constraints and problems. So Gov4git is precisely designed for those growing communities and making sure the communities grow in like a very healthy way versus like there is a stop at some point where, like, you cannot make a consensus because of, you know, this person is out, or I don’t have enough information, or I don’t have rights or permissions to make those changes. So, uh, we—like Petar said—we dogfood the code, but at the same time, the use cases are like for a little bit bigger groups and communities. 

HUIZINGA: Well let’s get specific about the problems and solutions from a technical perspective. And, um, Petar, I’m going to ask you to take the lead on this. As I understand Gov4git from my non-technical perch, it’s a sort of sandbox for community governments mechanisms. How would you define the problems you’re trying to solve with Gov4git, and how are you going about solving them technically? 

MAYMOUNKOV: Yeah, this is a good way of putting it. It’s, it’s a sandbox for governance, um, solutions, so, um, indeed I have the, um, technical kind of part of this, um, project. And from, um, from a computer science point of view, governance is synonymous with trusted computation. So trusted computation is, is an abstraction or a notion whereby there is a public, uh, program or rules of governance and the community has a method of kind of—there is a, there is a device that, that executes and follows the rules of governance and the community members have, um, assurance that the rules are followed as advertised and that nobody can sidestep the system regardless of their role in the community. So governance is trusted computation to scientists, basically. Now, uh, trusted computation being a general abstraction is, is something that has various embodiments in the real world, and the most, uh, famously known currently embodiment of trusted computation are public blockchains such as Ethereum, Filecoin, and others. So we could have sort of chosen to use these existing solutions to how you build governance applications, um, but we ran into a number of practical issues with them that prevent us from delivering sort of practical results in a reasonable amount of time. And also, there are some shortcomings that prevent these solutions from reaching people in unprivileged parts of the world, so developing world, war zones, authoritarian countries. Uh, so effectively, Gov4git from a technical standpoint is a different embodiment, a different implementation, of trusted computation, which is not in competition with public blockchains. It captures a, a different tradeoff, so to speak. 

HUIZINGA: OK, talk a little bit more about the tradeoff. I mean, some of these things would represent to me a barrier to entry—I wouldn’t be able to, um, afford it. What are some of the, the upsides to Gov4git that, um, we don’t find in the other spaces? 

MAYMOUNKOV: Yeah, so to make a fair comparison, I should first give some context on the existing blockchains. Um, so the existing blockchain technologies are quite exciting, um, and they, they’re very promising. But currently, they’re in a state of having overshot in their level of ambition and slightly underdelivered, at least for the present time, and I’m sure they will eventually deliver, uh, sort of completely. So what do I mean by this? So they have overshot in the sense that they are—they provide so many features and, and they capture an extremely large set of applications, but at the same time, this of course involves a lot of complexity that they need to deal with, and this complexity hasn’t been fully sorted out yet to make them usable for sort of common cases. So what, what we’ve noticed here is that there is a large group of applications, in particular community governance, which does not need most of the features that are provided by public blockchains. And so once you realize that this is the case, you unlock much simpler solutions that have the same sort of outcome for the users. So public blockchains—let me be a little bit specific here for the technical listeners—so public blockchains, they’re global systems, so across the world. They’re capable of hosting multiple independent applications. Uh, you can think of this as independent communities which need to interact with each other at very high speeds and with a very high throughput. So the typical applications that you can think of is essentially high-volume, cross-community business or trade interactions. And, of course, this is a real use case, especially with financial systems and so forth. But, um, in contrast, community governance applications, which are sort of designed to serve humancentric deliberative processes within a community, they’re not global; they’re local to a community. They are not multiple applications; they are a single application that governs one community. And because they are human-deliberative applications, they don’t need high speeds and high throughput, so recognizing that these, um, this is the case, alternative designs for trusted computation, um, sort of emerge and this is what we’ve, what we went after. 

HUIZINGA: That’s, that’s awesome. Well, and so, Kasia, let’s go back to a little bit because we’re going to cross over here. There’s a couple of themes that are emerging that I think are really interesting. Um, you talk about, earlier, the issues in pull requests that you deal with and that Gov4git has some mechanisms to help address the tension between what I might call anarchy and dictatorship. Is there some kind of a, a mechanism that’s different that can help mitigate that? 

SITKIEWICZ: Yeah, absolutely. So, as I mentioned, there are different types of communities, and the bigger the community gets, the more issues you have. Within smaller community, you pretty much know who you’re interacting with; you know the contributors; you know who is the maintainer. And it’s actually quite fast to make those changes and like approving those pull requests and reviewing comments and issues and other activities that are happening around every project. With the bigger communities, there’s more, uh, logistics problem and governance problem, and many times, you truly don’t know who is contributing to your code source. You just know their handle. That can be anyone; that can be even some kind of like ChatGPT, especially with like right now like the generative foundation models. Like we’re going to see more problems of like interacting with non-humans, right? So I feel like communities will have more and more problems facing like, “OK, how do I manage my contributors, and, uh, how fast we want to move the project?” So Gov4git is using, uh, a lot of like beautiful features from Web3, which is quadratic voting. It’s, uh, pretty much collective decision-making procedures that involve individuals who are part of your community with allocating votes to express the degree of their preferences. So as you mention, in a traditional organization, there is one person or one dictator that tells you like, “Hey, you’re going to build that.” And once we have it, we’re going to like approve it, right? And we’re going to like ship it. With quadratic voting, the decision is made collectively. So we’re going to implement quadratic voting part of our governance model. Second feature that is also very nice is like the governance tokens. Right now, um, communities, there are few ways of like how they make decisions, either majority of the votes or through consensus. With this type of governance tokens, you will be able to see like how many people voted on a specific pull request or a feature, and the majority of the votes will be pretty much the decision-making. So community can use those governance tokens for making the decision. And lastly, uh, there is a concept of badges. So in the Web3 space, there are like NFTs, and one of the NFTs is a soulbound token, which is a token that you are given that you cannot transfer, and we believe that by implementing those soulbound tokens, you can authenticate the user, you can say, “Hey, I know you; you’re part of this community; you got this badge.” And that badge gives you, let’s say, right to receive those tokens and so on. So again, those are just like a few features that are actually like very nice in that decentralized communities that we want to bring into Gov4git so that the communities can benefit from having specific features like, uh, quadratic voting, governance tokens, or like those badges. And what I want to say is like, you know, GitHub or like other git platforms, they don’t support this type of governance features, and that’s the need from the users and customers being like, “Hey, I need something that will be very easy, efficient, and transparent,” and Gov4git provides all of it. 

HUIZINGA: Yeah. Well, and on that same topic, Petar, I always like to ask what could possibly go wrong, and even as Kasia’s talking, all kinds of things are coming into my head like, um, could a bot get an SBT or, I mean, do you have to be, provide validation to who you are and what you represent yourself as? 

MAYMOUNKOV: Yeah, so, um, let me answer the general question and the specific question. So I think the specific question about bots is that, has the following answer. So I think people in Microsoft Research in particular, but people in general, are realizing that identity is going to be much harder to, uh, prove and understand in the presence of AI. And so here we kind of—especially Glen, sort of leading with his paper on soulbound tokens, is essentially looking into something that we do in the real world, uh, which is that we have deep ways of verifying people’s identity by essentially, um, looking into their history with communities and within society. Uh, so the presence of these badges that Kasia is mentioning is essentially creating a system whereby people can collect certificates from different endeavors that they have participated in to build out a résumé that is verifiable by the communities where they participated that they are who they are. In some, in some sense, the person is the sum total of everything they’ve done for other people. And currently, a bot cannot accomplish as much as a person and get sort of, you know, certificates from other humans that this has been the case. So roughly, this addresses the question of, OK, can something go wrong with, with bots. In a sense, bot or not, to be acknowledged in a system, you have to have contributed verifiably to, to multiple communities eventually. Um, but there is a bigger sort of picture about what can possibly go wrong. And so in this regard, Gov4git kind of sits in a very standard situation with most, uh, very promising software tools, which is that it, it is, it is a powerful tool that can fall in the hands both of good and bad people, acknowledging the fact that good and bad are relative terms. And, and this is, this actually also plays on a, on a general theme in software and science, which is that software engineers and engineers, scientists and so forth, they design software which is symmetric, so the software from the start treats everybody in the same way. It doesn’t have a way of distinguishing, you know, who’s using it. And even though this sounds like the right place to be—it’s a neutral place to be—there are plenty of cases already in the real world where, um, it is unclear, you know, whether society wants symmetric treatment of everybody. The, the classical example here that I would give is, is Twitter. When it comes to the question of censorship on Twitter, there’s a few different alternative, um, kind of directions that that people can think of, of taking. One direction is to say that, uh, no censorship should happen, uh, which is the symmetric treatment. So everybody gets the same agency within a system. But as you know, there’s plenty of people who don’t like this approach. There’s other approaches, such as “somebody should censor us.” But who’s, who’s the somebody? So, so these kinds of issues all apply in this case, as well, because if governance for git is to be successful, what I hope, or, you know, cautiously hope, that it will result in, it’ll enable communities to forum at a much larger speed and a much larger volume around the world. And usually, when things speed up for humans, just like Twitter sped up discourse between people, um, we tend to find ourselves in a situation where we are slightly unprepared to, to, to reason about where does this go. 

HUIZINGA: Right. Kasia, what do you have to add to Petar’s conversation there on the “what could go wrong” from your end? 

SITKIEWICZ: I think from the product side—and I can speak as a product manager—there might be a case where like the community will come back to us like, “Hey, this is not what we want. We want something different,” right. Which, it’s a hypothesis, and can, this can, this feedback can happen, right. But at the same time, I believe that the community will ask for more. So like we are building just a very simple MVP to pretty much let the community to make those decisions, but perhaps the direction might be like, “Hey, the value’s somewhere else.” Uh, because once we launch, we can learn like, OK, this is great, but it’s not enough. So I would speak from the product side and like the user testing that perhaps we might discover like, oh, the actually true value will be somewhere else, and perhaps it can be a quadratic voting; it can be those tokens or those badges, right. So from my end, I feel like that’s the biggest like unknown, and speaking about bots and, uh, all the AI work, I feel like there is a lot of value in that, as well. So it’s not just a negative aspect of like, “Hey, I don’t want automation to be part of my project.” I think we will see it more, and there will be a lot of benefits. It’s just there are a lot of things we do not know as of now, and we just have to make sure like we are very flexible in terms of like how we pivot and how we adapt to feedback. 

HUIZINGA: Right. But, but in other ways, GitHub itself and Gov4git is a platform for people to form their own communities and govern their own communities, right? So you’re not going to be sort of the 10,000-foot hall monitor and try to meta-govern the people that are governing their own communities, correct? 

MAYMOUNKOV: Yes.

SITKIEWICZ: That’s correct, yes.

HUIZINGA: They’re nodding their heads. It’s a podcast—you can’t see it! [LAUGHS] Well, and this, this discussion on the “what could possibly go wrong” is important for me because I think people who are going to use the technology want to know that people promoting it are aware of the potential for unforeseen and unintended consequences and have a plan for mitigating. But it’s such an interesting ramp up to this new kind of use case for collaborative, open-source governance that it’s really cool. Kasia, let’s talk specifically about some of those use cases from the product side that you’ve alluded to. Um, GitHub is well known in the developer community, but how’s the idea of decentralized open-source work moving into non-technical communities and applications? 

SITKIEWICZ: Yeah, absolutely. So in any open-source project, you will find very technical contributors and maintainers and also you will find people who just like want to like observe the project or perhaps help with like project management or translation and so on. So we already have a lot of non-technical contributors who perhaps are struggling when they first log in to GitHub and they learn about git. They were like, “What the heck is that?” It’s a black box. So we truly get that feedback from customers. It’s like a very overwhelming experience, and it takes some time to wrap up and kind of learn how to use it. So the idea for Gov4git is pretty much a very simple presentation, or UI, via extension, Chrome extension, where you will see something very familiar like you see on Twitter, where you have like a post that you need to vote on, and if you are eligible to vote, you will, you’ll be able to use your tokens, uh, and vote on the decision, and you will be able to comment and interact with the community, and so on. So the ultimate goal is to create something very simple, just like a Twitter, you know, is simple, so that community is like, “Hey, I can participate, and I can put my vote, and I can contribute to this project.” So ultimately that’s the case. And the way—how we will be testing it, we talked about this book. So the book is called Plurality: Technology for Collaborative Diversity and Democracy, and it’s led by Audrey Tang and Glen Weyl and with, along with the plurality community. So the Plurality, it’s an open git-based collective book project that aims to offer a vision for the future of technology focusing on empowering and bridging social differences. So that book is on GitHub, and collaborators and maintainers who are participating are writing this book in an open-source way. And as you can imagine, writing a book is not an easy or trivial thing. You have a lot of reviews; you have everyone looking and providing feedback. So we believe that they can benefit from, uh, using Gov4git, with like management of like PRs and issues and decision-making. And, um, the initiative is already like there, right; it’s started. So we are just like trying to see like how that can—book can be completely managed by a community versus like Audrey or Glen has to like spend a lot of hours to review all of those PRs. And it sometimes is very challenging, and it’s almost impossible to go every single comment, so we believe that this can help and expedite the process and make it very transparent and efficient way to write in open source. 

HUIZINGA: Petar, talk a little bit about the other applications, including this one, from a technical perspective. Um, what makes it easier to resolve arguments and make edits with Gov4git versus other mechanisms to do that? 

MAYMOUNKOV: Gov4git, being a sandbox, at least technologically, is not trying to be prescriptive about how people do this. We’re trying to enable people to, to, to pick the mechanisms that they want for themselves, for arbitrating conflicts, so, you know, starting with, with Glen’s project, of course, we are starting with quadratic voting, and we plan, um, the quadratic voting is a, is a large, at this point, field. There’s lots of different variants of it. So we, we build the product so that over time Glen and Audrey can experiment with, you know, different types of conflict resolution and, and so forth. What Gov4git provides is the ease of adding a new mechanism that the community wants. And of course, we plan to have a library of like mechanisms that people can choose from. One nice side benefit from this entire project is that Gov4git, uh, enables people to like reflect on what they’ve done and, and what is happening. So with Gov4git, you always have a complete history, both of the governance motions of, of the community, alongside with the actual open-source collaborative work, which in particular enables academics and researchers from organizations such as the Metagovernance Project being a good example to go in there and study what types of mechanisms make for better results, basically, and kind of improve iteratively over this. 

HUIZINGA: Yeah. So it sounds like there’s a spectrum of assessment or meta-governance testing with computer scientists, product managers, academics. Even there, you see this great collaboration happening. Go back to the, the academics and other, uh, collaborators that are coming in on this. Do you find a broad spectrum of disciplines involved, not just computer scientists in academia but perhaps social scientists, legal scholars, any of these kinds of things coming into this? 

MAYMOUNKOV: Um, it’s too early to tell, but, uh, but there has been indeed interest, so, so from a few places, right. So the, the academics are indeed interested to, to consume this data when it’s available from real-world communities, because the key thing for them is to have real-world data like sufficiently scaled communities, like the Plurality book would be a great example because it’s probably expecting to have thousands of contributors. And otherwise, um, in addition to, uh, the Plurality book as like a first customer, so to speak, uh, we already have lots of interest from AI companies. So these are AI companies that are currently building open-source AI models, and they want to experiment with attaching governance to their open-source work, which is already happening on gits and GitHub. And they want—uh, because once you have governance plus open source, then you, you have a holistically democratic development of something like an AI tool. 

HUIZINGA: Right. That just struck me that you say thousands of contributors to a book and you never [LAUGHS] think of that being the case. Um … 

MAYMOUNKOV: Well, that’s a special, that’s a special book because it’s, it’s going to have translations in multiple languages, and being, being it, uh, also needs to be fact-checked, so there’s a lot of work on fact-checking that, that goes along with the writing process. 

HUIZINGA: Yeah. Sounds a bit like wiki in terms of contributors and checking and making decisions and so on. Um, is, is Gov4git even in beta yet, or is it still just, um, sandboxing itself? 

MAYMOUNKOV: Um, so the, the MVP—the first version, if you will—is, is ready and has been tested for a few months internally at Protocol Labs. What we’re missing and we’re still working on is like the user interface that brings in the non-technical users. So I guess you could say that it’s in beta. I think like our launch with the Plurality book would be the first kind of official introduction event. 

HUIZINGA: Right. Yeah, and that’s an interesting, you know, when the outsiders looking in going open source, you think software, you think developers, you think code, but there’s a lot of other applications, including writing a book, which is basically just text-based writing. So, Kasia, are there any other sort of cream-floating-to-the-top applications or products that you could see coming out of this? 

SITKIEWICZ: Technically, anyone who wants to start something new and is looking for collaborators, and it can be pretty much whatever you want to build. It doesn’t have to be like a big idea. It can be just, “Hey, I want to collaborate with someone, and I want to like figure out how to do things and how to practice.” It can be used by academics, as you mentioned. Like pretty much any, any, any person who wants to start with like building something in public, they can do it and use it. So there is no limits. It’s up to you if you want to build community around the project you’re working on. So we don’t have any restrictions, and I feel like, um, we are in the stage right now or like this AI revolution where we’re just entering this like open-source community’s growth because there is like a lot of hype right now and everybody’s interested in it. Oh, maybe I can build that. It’s just so much easier to do things right now. And, you know, if you want to grow, you have to have a community around you. Um, so I think this is just like a best practices for anyone who wants to start writing in public. Whatever is that is—it might be like just a book or a code or like learning or like sharing some information. It doesn’t really matter. And, you know, being at GitHub, we see a lot of like amazing projects regardless of the discipline and like the area, and communities are just fascinating. And I think that’s the future. Like pretty much a lot of companies will start doing open-source code, just [like] Twitter has done it, right, just to bring the transparencies, because in a decentralized world, that’s like the value proposition, like, hey, it’s a very transparent way of building, and you have a history being displayed of the decision-making. And there are a lot of companies started noticing the beauty of it, and they—I think the movement is just starting, so I see a huge growth. 

HUIZINGA: Yeah, and that leads into the last question I wanted to ask both of you, um, and you both alluded to some of this already in your answers, but just if you could encapsulate in your ideal preferred future, what is your work look like in five to 10 years? How have you changed the landscape of collaborative work, community governance, and even that concept of communities? 

MAYMOUNKOV: So I hope that well within 10 years, this tool becomes perceived as a somewhat go-to tool for building, you know, communities from scratch, and, in particular, I actually hope that the tool reaches a critical point which you can label the beginning of intersectionality, to borrow a term from Glen’s, um, Glen’s vocabulary. Um, and what this means, this is a point where there is enough deployments of Gov4git that you have a non-trivial amount of people that are members of more than one community. So in other words, communities are starting to overlap, and when, when we reach this critical point, there’s a whole new set of applications that open up because now communities can, uh, interact with each other, uh, and ask each other for various kinds of help. The classical example here is that, um, one community can ask another community whether a given member has had a long and productive career in the other community. And this kind of idea—also mostly coming from Glen—is actually a mirror image of what I mentioned earlier, what happens in the real world. So when you apply for a job with, uh, an employer, the employer being a community, this employer calls up your university to verify that you actually went there and you did a good job. So you have these two communities basically sharing information. Um, so there’s lots of applications of intersectionality, but the reason I call this a critical point is because once you get there, you actually expect the network effect that we know from social networks to start taking place. In particular, if the network of communities using Gov4git is, is, is large and there’s lots of intersection, then any new communities being formed would benefit a lot from reusing the same technologies because now they can benefit from all of these other communities that already exist and that they can interoperate with. This is sort of a critical point, because, uh, if we reach it, then the tool really has a chance of becoming like an international standard for like conceiving communities, basically. 

HUIZINGA: Yeah. Kasia, what would you add to that? 

SITKIEWICZ: So I will speak a little bit more high level on the data we are seeing at GitHub, and what we believe that will happen is last year we hit 100 million developers being on our platform … 

HUIZINGA: Wow. 

SITKIEWICZ: and they’re like thousands of thousand different open-source communities. And we, we see a huge growth, and especially with like the AI innovation that is happening in that space, I think this will like triple in the upcoming few years. So the more people start understanding the beauty of technology and collaboration and like writing in public, the more adoption we will have. So I think it’s just a matter of time how fast, uh, tools like Gov4git will grow and will be needed. We’re still early because there is, like we don’t know what we don’t know. We know the problem. But we don’t know how the problem will, um, intensify in the upcoming like months or years, right. So I truly believe that there is a need for it. There will be a huge growth in terms of like creating new communities, and people from around the world, they can unite through using platforms like GitHub or other services where they can actually engage with other people who are passionate about the same thing. So as you mentioned, open-source concept is not new, but it’s actually getting more in the strength, and the value’s there. So in my eyes, it’s just a matter of time on like the scale and the growth, and features like, like prioritization or like quadratic funding will be just like more adopted by the community. So that’s my, uh, take and, uh, opinion about the space. 

[MUSIC] 

HUIZINGA: Petar and Kasia, thank you so much for coming on the show today and being our first guests on the Collaborators podcast. 

MAYMOUNKOV: It’s a pleasure. 

SITKIEWICZ: Thank you for having us. 

[MUSIC ENDS] 



Implement backup and recovery using an event-driven serverless architecture with Amazon SageMaker Studio


Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps required to build, train, tune, debug, deploy, and monitor models. It gives data scientists all the tools they need to take ML models from experimentation to production without leaving the IDE. Moreover, as of November 2022, Studio supports shared spaces to accelerate real-time collaboration and multiple Amazon SageMaker domains in a single AWS Region for each account.

There are two prevailing use cases for Studio domain backup and recovery. The first use case involves a customer business unit and project wanting the ability to replicate data scientists’ artifacts and data files to any target domains and profiles at will. The second use case involves replication only when the domain and profile are deleted due to conditions such as a change from a customer-managed key to an AWS-managed key, or a change of onboarding from AWS Identity and Access Management (IAM) authentication (see Onboard to Amazon SageMaker Domain Using IAM) to AWS IAM Identity Center (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

This post mainly covers the second use case by presenting how to back up and recover users’ work when the user and space profiles are deleted and recreated, but we also provide the Python script to support the first use case.

When the user and space profiles are recreated in the existing Studio domain, a new profile directory with a new ID is created within the Studio Amazon Elastic File System (Amazon EFS) volume. As a result, Studio users could lose access to the model artifacts and data files stored in their previous profile directory if the profiles are deleted. Additionally, Studio domains don’t currently support mounting custom or additional EFS volumes. We recommend keeping the previous Studio EFS volume as a backup using RetentionPolicy in Studio.

Therefore, a proper recovery solution needs to be implemented to access the data from the previous directory in case of profile deletion, or to recover files from a detached volume in case of domain deletion. Data scientists can minimize the potential impact of deleting the domain and profiles if they frequently commit their code to the repository and use external storage for data access. However, the capability to back up and recover the data scientists’ workspaces adds another layer of assurance for their continuity of work, which may increase their productivity. Moreover, if you have tens or hundreds of Studio users, consider how to automate the recovery process to avoid mistakes and save costs and time. To solve this problem, we provide a solution to supplement Studio domain recovery.

This post explains the backup and recovery module and one approach to automate the process using an event-driven architecture. First, we demonstrate how to perform backup and recovery if you create a new Studio domain, user, and space profiles using AWS CloudFormation templates. Next, we explain the required steps to test our recovery solution using the existing domain and profiles without using our CloudFormation templates (you can use your own templates). Although this post focuses on a single domain setting, our solution works for multiple Studio domains as well. Finally, we have automated the provisioning of all resources using the AWS Serverless Application Model (AWS SAM), an open-source framework for building serverless applications.

Solution overview

The following diagram illustrates the high-level workflow of Studio domain backup and recovery with an event-driven architecture.

technical architecture

The event-driven app includes the following steps:

  1. An Amazon CloudWatch Events rule uses AWS CloudTrail to track CreateUserProfile and CreateSpace API calls; these calls trigger the rule, which invokes the AWS Lambda function.
  2. The function updates the user table and appends items to the history table in Amazon DynamoDB. In addition, the database layer keeps track of the domain name, profile name, and file system mapping.
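For illustration, a CloudWatch Events (EventBridge) rule matching the CreateUserProfile and CreateSpace calls recorded by CloudTrail could be created along the lines of the following sketch. The rule name, target ID, and Lambda ARN are placeholders, not values from the solution’s templates.

import json
import boto3

events = boto3.client("events")

# Hypothetical rule that fires when CloudTrail records CreateUserProfile or CreateSpace
events.put_rule(
    Name="studio-profile-creation-rule",  # placeholder name
    EventPattern=json.dumps(
        {
            "source": ["aws.sagemaker"],
            "detail-type": ["AWS API Call via CloudTrail"],
            "detail": {
                "eventSource": ["sagemaker.amazonaws.com"],
                "eventName": ["CreateUserProfile", "CreateSpace"],
            },
        }
    ),
)

# Point the rule at the Lambda function that updates the DynamoDB tables
# (the function also needs a resource-based permission allowing events.amazonaws.com to invoke it)
events.put_targets(
    Rule="studio-profile-creation-rule",
    Targets=[{"Id": "profile-tracker", "Arn": "arn:aws:lambda:<aws_region>:<account_id>:function:<function_name>"}],
)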

The following image shows the structure of the DynamoDB tables. The partition key and sort key in the studioUser table consist of the profile name and domain name. The replication column holds the replication flag, with true as the default value. In addition, the bytes_written, bytes_file_transferred, total_duration_ms, and replication_status fields are populated when the replication completes successfully.

table schema

The database layer can be replaced by other services, such as Amazon Relational Database Service (Amazon RDS) or Amazon Simple Storage Service (Amazon S3). However, we chose DynamoDB because of the Amazon DynamoDB Streams feature.
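To make the schema described above concrete, an item in the studioUser table might look like the following sketch. The attribute names are illustrative; only the fields called out earlier (the profile and domain name keys, the replication flag, and the replication statistics) come from the post.

import boto3

table = boto3.resource("dynamodb").Table("studioUser")

# Hypothetical item: the profile name and domain name form the key, and replication defaults to true
table.put_item(
    Item={
        "profileName": "user1",                          # partition key (assumed attribute name)
        "domainName": "demo-myapp-dev-studio-domain",    # sort key (assumed attribute name)
        "replication": True,                             # default replication flag
        "homeEfsFileSystemUid": "200001",                # profile-to-file-system mapping (assumed attribute name)
        # bytes_written, bytes_file_transferred, total_duration_ms, and replication_status
        # are added after a replication completes successfully
    }
)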

  3. DynamoDB Streams is enabled on the user table, and the Lambda function is set as a trigger and synchronously invoked when new stream records are available.
  4. Another Lambda function triggers the process to restore the files using the user and space files restore tools.

The backup and recovery workflow includes the following steps:

  1. The backup and recovery workflow consists of AWS Step Functions, integrated with other AWS services, including AWS DataSync, to orchestrate the recovery of the user and space files from the previous directory to a new directory within the same Studio domain EFS volume (profile recreation) or to a new domain EFS volume (domain recreation). With Step Functions Workflow Studio, the workflow can be implemented with no code (as in this case) or low code for a more customized solution. The Step Functions state machine is invoked when the event-driven app detects the profile creation event. For each profile, the state machine runs a DataSync task to copy all files from the previous directory to the new directory.

The following image is the actual graph of the Step Functions state machine. Note that the ListApp* step ensures the profile directories are populated in the Studio EFS volume before proceeding. Also, we implemented retries with exponential backoff to handle API throttling for the DataSync CreateLocationEfs and CreateTask API calls.
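The retry logic lives in the state machine definition, but the same idea can be sketched in Python. The following example wraps the DataSync CreateTask call in an exponential backoff loop; the location ARNs are placeholders, and the retried error codes and backoff values are assumptions for illustration.

import time
import boto3
from botocore.exceptions import ClientError

datasync = boto3.client("datasync")

def create_task_with_backoff(source_location_arn, destination_location_arn, max_attempts=6):
    """Call CreateTask, retrying with exponential backoff if the API throttles."""
    for attempt in range(max_attempts):
        try:
            return datasync.create_task(
                SourceLocationArn=source_location_arn,
                DestinationLocationArn=destination_location_arn,
            )
        except ClientError as error:
            # Assumed throttling error codes; adjust to what you observe in practice
            if error.response["Error"]["Code"] not in ("ThrottlingException", "TooManyRequestsException"):
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("CreateTask was throttled on every attempt")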

step functions diagram

  2. When the users open their Studio, all the files from their previous directory are available in the new directory so they can continue their work. In our experiment, the DataSync job took approximately 1 minute to replicate 1 GB of data.

The following services are used as part of the solution and are discussed throughout this post:

  • AWS CloudFormation and the AWS Serverless Application Model (AWS SAM)
  • AWS CloudTrail
  • Amazon CloudWatch
  • AWS DataSync
  • Amazon DynamoDB
  • Amazon EFS
  • AWS Lambda
  • Amazon S3
  • Amazon SageMaker
  • AWS Step Functions

Prerequisites

To implement this solution, you must have the following prerequisites:

  • An AWS account if you don’t already have one. The IAM user that you use must have sufficient permissions to make the necessary AWS service calls and manage AWS resources.
  • The AWS SAM CLI installed and configured.
  • Your AWS credentials set up.
  • Git installed.
  • Python 3.9.
  • A Studio profile and domain name combination that is unique across all Studio domains within a Region and account.
  • An existing Amazon VPC and S3 bucket to use in the deployment steps.
  • Awareness of the service quota for the maximum number of DataSync tasks per account per Region (the default is 100). You can request a quota increase to meet the number of replication tasks for your use case.

Refer to the AWS Regional Services List for service availability based on Region. Additionally, review Amazon SageMaker endpoints and quotas.

Set up a Studio profile recovery infrastructure

The following diagram shows the logical steps for a SageMaker administrator to set up the Studio user and space recovery infrastructure, which a single command can complete with our automated solution.

logical flow 1

To set up the environment, clone the GitHub repo in the terminal:

git clone https://github.com/aws-samples/sagemaker-studio-efs-recovery-serverless.git && cd sagemaker-studio-efs-recovery-serverless

The following code shows the deployment script usage:

bash deploy.sh -h

Usage: deploy.sh [-n <stack_name>] [-v <vpc_id>] [-s <subnet_id>] [-b <s3_bucket>] [-r <aws_region>] [-d]

Options:
  -n: specify stack name
  -v: specify your vpc id
  -s: specify subnet
  -b: specify s3 bucket name to store artifacts
  -r: specify aws region
  -d: whether to skip a creation of a new SageMaker Studio Domain (default: no)

To create a new Amazon SageMaker domain, run the following command. You need to specify which Amazon VPC and subnet you want to use. We use VPC only mode for the Studio deployment. If you don’t have any preference, you can use the default VPC and subnet. Also, specify any stack name, AWS Region, and S3 bucket name for AWS SAM to deploy the Lambda function:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

If you want to use an existing Studio domain, run the following command. Option -d yes will skip creating a new Studio domain:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

For existing domains, the SageMaker administrator must also update the source and target Studio EFS security groups to allow connections from the user and space file restore tool. For example, to run the following command, you need to specify HomeEfsFileSystemId (the Studio EFS file system ID) and the SecurityGroupId used by the user and space file restore tool (we discuss this in more detail later in the post):

python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>

User and space recovery logical flow

The following diagram shows the logical user and space recovery flow so a SageMaker administrator can understand how the solution works; no additional setup is required. If the profile (user or space) and domain are accidentally deleted, the EFS volume is detached but not deleted. A possible scenario is that we may want to revert the deletion by recreating the domain and profiles. If the same profiles are onboarded again, the users may wish to access the files from their respective workspaces in the detached volume. The recovery process is almost entirely automated; the only action required by the SageMaker administrator is to recreate the Studio domain and profiles using the same CloudFormation template, and the rest of the steps are automated.

logical flow 2

Optionally, if the SageMaker admin wants control over replication, run the following command to turn off replication for specific domains and profiles. This script updates the replication field for the given domain and profile name in the table. Note that you need to rerun the script for the same profile each time it is recreated.

python3 src/update-replication-flag.py --profile-name <profile_name> --domain-name <domain_name> --region <aws_region> --no-replication

The following optional step addresses the first use case by allowing replication from the specified source file system to any target domain and profile name. If the SageMaker admin wants to replicate particular profile data to a different domain and a profile that doesn’t exist yet, run the following command. The script inserts the new domain and profile name along with the specified source file system information. The subsequent profile creation will trigger the replication task. Note that you need to run add-security-group.py from the previous step to allow connections from the file restore tool.

python3 src/add-replication-target.py --src-profile-name <profile_name> --src-domain-name <domain_name> --target-profile-name <profile_name> --target-domain-name <domain_name> --region <aws_region>

In the following sections, we test two scenarios to confirm that the solution works as expected.

Create a new Studio domain

Our first test scenario assumes you are starting from scratch and want to create a new Studio domain and profiles in your environment using our templates. We then deploy the Studio domain, the user and space profiles, the backup and recovery workflow, and the event app. The purpose of the first scenario is to confirm that the profile’s files are recovered in the new home directory automatically when the profile is deleted and recreated within the same Studio domain.

Complete the following steps:

  1. To deploy the application, run the following command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

  2. On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
    1. <stack_name>-DemoBootstrap-*
    2. <stack_name>-StepFunction-*
    3. <stack_name>-EventApp-*
    4. <stack_name>-StudioDomain-*
    5. <stack_name>-StudioUser1-*
    6. <stack_name>-StudioSpace-*

cloud formation console

If the deployment fails in any of the stacks, check the error and resolve the issues. Proceed to the next step only after the problems are resolved.

  3. On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
  4. Select studioUser and choose Explore table items to confirm that items for user1 and space1 are populated in the table.
  5. On the SageMaker console, choose Domains in the navigation pane.
  6. Choose demo-myapp-dev-studio-domain.
  7. On the User profiles tab, select user1 and choose Launch, and choose Studio to open the Studio for the user.

Note that Studio may take 10-15 minutes to load for the first time.

  8. On the File menu, choose Terminal to launch a new terminal within Studio.
  9. Run the following command in the terminal to create a file for testing:
    echo "i don't want to lose access to this file" > user1.txt

  10. Repeat these steps for space1 (choose Spaces in Step 7). Feel free to create a file of your choice.
  11. Delete the Studio user user1 and the space space1 by removing the nested stacks <stack_name>-StudioUser1-* and <stack_name>-StudioSpace-* from the parent stack. To do so, comment out the following code blocks in the AWS SAM template file, template.yaml, and make sure to save the file after the edit:
    StudioUser1:
      Type: AWS::Serverless::Application
      Condition: CreateDomainCond
      DependsOn: StudioDomain
      Properties:
        Location: Infrastructure/Templates/sagemaker-studio-user.yaml
        Parameters:
          LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
          StudioUserProfileName: !Ref StudioUserProfileName1
          UID: !Ref UID
          Env: !Ref Env
          AppName: !Ref AppName
    StudioSpace:
      Type: AWS::Serverless::Application
      Condition: CreateDomainCond
      DependsOn: StudioDomain
      Properties:
        Location: Infrastructure/Templates/sagemaker-studio-space.yaml
        Parameters:
          LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
          StudioSpaceName: !Ref StudioSpaceName
          UID: !Ref UID
          Env: !Ref Env
          AppName: !Ref AppName

  12. Run the following command to deploy the stack with this change:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

  13. Recreate the Studio profiles by adding the stacks back to the parent. Uncomment the code block from the previous step, save the file, and run the same command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

After a successful deployment, you can check the results.

  14. On the AWS CloudFormation console, choose the stack <stack_name>-StepFunction-*.
  15. In the stack, choose the value for Physical ID of StepFunction in the Resources section.
  16. Choose the most recent run and confirm its status in Graph view.

It should look like the following screenshot for the user profile replication. You can also check the other run to ensure the same for the space profile.

step functions complete

  17. If you completed Steps 5–10, open the Studio domain for user1 and confirm that the user1.txt file is copied to the newly created directory.

The file should not be visible in the space1 directory, because the original file ownership is preserved.

  18. Repeat this step for space1.
  19. On the DataSync console, choose the most recent task ID.
  20. Choose History and the most recent run ID.

This is another way to inspect the configurations and the run status of the DataSync task. As an example, the following screenshot shows the task result for user1 directory replication.

datasync complete

We only covered profile recreation in this scenario. However, our solution works in the same way for Studio domain recreation, and it can be tested by deleting and recreating the domain.

Use an existing Studio domain

Our second test scenario assumes you want to use an existing SageMaker domain and profiles in the environment. Therefore, we only deploy the backup and recovery workflow and the event app. Again, you can use your own Studio CloudFormation templates or create the resources through the AWS CloudFormation console to follow along. Because we’re using the existing Studio domain, the solution lists the current users and spaces for all domains within the Region, a process we call seeding.

Complete the following steps:

  1. To deploy the application, run the following command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

  2. On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
    1. <stack_name>-DemoBootstrap-*
    2. <stack_name>-StepFunction-*
    3. <stack_name>-EventApp-*

If the deployment fails in any of the stacks, check the error and resolve the issues. Proceed to the next step only after the problems are resolved.

  3. Verify the initial data seed has completed.
  4. On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
  5. Choose studioUser and choose Explore table items to confirm that items for the existing Studio domain are populated in the table.

Proceed to the next step only if the seed has completed successfully. If the tables aren’t populated, check the CloudWatch logs of the corresponding Lambda function. On the AWS CloudFormation console, choose the stack <stack_name>-EventApp-*, and choose the physical ID of DDBSeedLambda in the Resources section. Under Monitor, choose View CloudWatch Logs and check the logs for the most recent run to troubleshoot.

  6. To update the EFS security group, first get the SecurityGroupId. We use the security group created by the CloudFormation template, which allows all outbound traffic. Run the following command:
    echo "SecurityGroupId:" $(aws ssm get-parameter --name /network/vpc/sagemaker/securitygroups --region <aws_region> --query 'Parameter.Value')

  7. Get the HomeEfsFileSystemId, which is the ID of the Studio home EFS volume. Run the following command:
    echo "HomeEfsFileSystemId:" $(aws sagemaker describe-domain --domain-id <domain_id> --region <aws_region> --query 'HomeEfsFileSystemId')

  8. Finally, update the EFS security group by allowing inbound traffic on port 2049 from the security group shared with the DataSync task. Run the following command:
    python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>

  9. Delete and recreate the Studio profiles of your choice using the same profile name.
  10. Confirm the run status of the Step Functions state machine and recovery of the Studio profile directory by following the steps from the first scenario.

You can also test the Step Functions workflow manually with your choice of source and target inputs for replication (more details found in README.md in the GitHub repository).

Clean up

Run the following command to clean up your resources:

sam delete --region <aws_region> --no-prompts --stack-name <stack_name>

Manually delete the SageMakerSecurityGroup after 20 minutes or so. Deletion of the elastic network interfaces (ENIs) can make the stack show as DELETE_IN_PROGRESS for some time, so we intentionally set the security group to be retained. You also need to disassociate that security group from the security group managed by SageMaker before you can delete it.
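If you prefer to script that last cleanup step, a minimal sketch could look like the following. It assumes you know the ID of the retained SageMakerSecurityGroup and the ID of the Studio EFS security group that received the port 2049 ingress rule earlier; both IDs are placeholders here.

import boto3

ec2 = boto3.client("ec2")

efs_security_group_id = "<efs_security_group_id>"             # SageMaker-managed Studio EFS security group
retained_security_group_id = "<sagemaker_security_group_id>"  # security group retained by the stack

# Remove the inbound NFS rule that still references the retained security group
ec2.revoke_security_group_ingress(
    GroupId=efs_security_group_id,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 2049,
            "ToPort": 2049,
            "UserIdGroupPairs": [{"GroupId": retained_security_group_id}],
        }
    ],
)

# Once nothing references the group and its network interfaces are gone, it can be deleted
ec2.delete_security_group(GroupId=retained_security_group_id)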

Conclusion

Studio is a powerful IDE that allows data scientists to quickly develop, train, test, and deploy models. This post discusses how to back up and recover the files stored in a data scientist’s home and shared space directory. We also demonstrated how an event-driven architecture can help automate the recovery process.

Our solution can help improve the resiliency of data scientists’ artifacts within Studio, leading to operational efficiency on the AWS Cloud. Also, the solution is modular, so you can use the necessary components and update them for your usage. For instance, an enhancement to this solution might be cross-account replication. We hope that what we demonstrated in this post will be a helpful resource to support those ideas.

To get started with Studio, check out Amazon SageMaker for Data Scientists. Please send us feedback on the AWS forum for SageMaker or through your AWS support contacts. You can find other Studio examples in our GitHub repository.


About the Authors

Kenny Sato is a Machine Learning Engineer at AWS, guiding customers in architecting and implementing machine learning solutions. He received his master’s in Computer Engineering from Virginia Tech and is pursuing a PhD in Computer Science. In his spare time, you can find him in his backyard or out somewhere playing with his lovely daughters.

Gautam Nambiar is a DevOps Consultant with AWS. He is particularly interested in architecting and building automated solutions, MLOps pipelines, and creating reusable and secure DevOps best practice patterns. In his spare time, he likes playing and watching soccer.


Optimized PyTorch 2.0 inference with AWS Graviton processors


New generations of CPUs offer a significant performance improvement in machine learning (ML) inference due to specialized built-in instructions. Combined with their flexibility, high speed of development, and low operating cost, these general-purpose processors offer an alternative to other existing hardware solutions.

AWS, Arm, Meta and others helped optimize the performance of PyTorch 2.0 inference for Arm-based processors. As a result, we are delighted to announce that AWS Graviton-based instance inference performance for PyTorch 2.0 is up to 3.5 times the speed for Resnet50 compared to the previous PyTorch release (see the following graph), and up to 1.4 times the speed for BERT, making Graviton-based instances the fastest compute optimized instances on AWS for these models.

AWS measured up to 50% cost savings for PyTorch inference with AWS Graviton3-based Amazon Elastic Compute Cloud (Amazon EC2) C7g instances across Torch Hub Resnet50 and multiple Hugging Face models, relative to comparable EC2 instances, as shown in the following figure.

Additionally, the latency of inference is also reduced, as shown in the following figure.

We have seen a similar trend in the price-performance advantage for other workloads on Graviton, for example video encoding with FFmpeg.

Optimization details

The optimizations focused on three key areas:

  • GEMM kernels – PyTorch supports Arm Compute Library (ACL) GEMM kernels via the OneDNN backend (previously called MKL-DNN) for Arm-based processors. The ACL library provides Neon and SVE optimized GEMM kernels for both fp32 and bfloat16 formats. These kernels improve the SIMD hardware utilization and reduce the end-to-end inference latencies.
  • bfloat16 support – The bfloat16 support in Graviton3 allows for efficient deployment of models trained using bfloat16, fp32, and AMP (Automatic Mixed Precision). The standard fp32 models use bfloat16 kernels via OneDNN fast math mode, without model quantization, providing up to two times faster performance compared to the existing fp32 model inference without bfloat16 fast math support.
  • Primitive caching – We also implemented primitive caching for conv, matmul, and inner product operators to avoid redundant GEMM kernel initialization and tensor allocation overhead.

How to take advantage of the optimizations

The simplest way to get started is by using the AWS Deep Learning Containers (DLCs) on Amazon Elastic Compute Cloud (Amazon EC2) C7g instances or Amazon SageMaker. DLCs are available on Amazon Elastic Container Registry (Amazon ECR) for AWS Graviton or x86. For more details on SageMaker, refer to Run machine learning inference workloads on AWS Graviton-based instances with Amazon SageMaker and Amazon SageMaker adds eight new Graviton-based instances for model deployment.

Use AWS DLCs

To use AWS DLCs, use the following code:

sudo apt-get update
sudo apt-get -y install awscli docker

# Login to ECR to avoid image download throttling
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS \
  --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# Pull the AWS DLC for pytorch
# Graviton
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.0-cpu-py310-ubuntu20.04-ec2

# x86
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.0-cpu-py310-ubuntu20.04-ec2

If you prefer to install PyTorch via pip, install the PyTorch 2.0 wheel from the official repo. In this case, you have to set two environment variables, as shown in the following code, before launching PyTorch to activate the Graviton optimization.

Use the Python wheel

To use the Python wheel, refer to the following code:

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch
python3 -m pip install torchvision torchaudio torchtext

# Turn on Graviton3 optimization
export DNNL_DEFAULT_FPMATH_MODE=BF16
export LRU_CACHE_CAPACITY=1024

Run inference

You can use PyTorch TorchBench to measure the CPU inference performance improvements, or to compare different instance types:

# Pre-requisite: 
# pull and run the AWS DLC
# or 
# pip install PyTorch2.0 wheels and set the previously mentioned environment variables

# Clone PyTorch benchmark repo
git clone https://github.com/pytorch/benchmark.git

# Setup Resnet50 benchmark
cd benchmark
python3 install.py resnet50

# Install the dependent wheels
python3 -m pip install numba

# Run Resnet50 inference in jit mode. On successful completion of the inference runs,
# the script prints the inference latency and accuracy results
python3 run.py resnet50 -d cpu -m jit -t eval --use_cosine_similarity

Benchmarking

You can use the Amazon SageMaker Inference Recommender utility to automate performance benchmarking across different instances. With Inference Recommender, you can find the real-time inference endpoint that delivers the best performance at the lowest cost for a given ML model. We collected the preceding data using the Inference Recommender notebooks by deploying the models on production endpoints. For more details on Inference Recommender, refer to the GitHub repo. We benchmarked the following models for this post: ResNet50 image classification, DistilBERT sentiment analysis, RoBERTa fill mask, and RoBERTa sentiment analysis.
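If you only want a quick local sanity check of latency on a single instance, without setting up Inference Recommender, a minimal timing loop like the following sketch is enough. The model, input shape, and iteration counts are arbitrary choices for illustration.

import time
import torch
import torchvision

# A randomly initialized ResNet50 is sufficient for a latency check
model = torchvision.models.resnet50()
model.eval()

inputs = torch.rand(1, 3, 224, 224)

with torch.inference_mode():
    # Warm up so one-time initialization doesn't skew the measurement
    for _ in range(10):
        model(inputs)

    iterations = 100
    start = time.perf_counter()
    for _ in range(iterations):
        model(inputs)
    latency_ms = (time.perf_counter() - start) * 1000 / iterations

print(f"Average latency: {latency_ms:.2f} ms per inference")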

Conclusion

AWS measured up to 50% cost savings for PyTorch inference with AWS Graviton3-based Amazon EC2 C7g instances across Torch Hub Resnet50 and multiple Hugging Face models, relative to comparable EC2 instances. These instances are available on SageMaker and Amazon EC2. The AWS Graviton Technical Guide provides the list of optimized libraries and best practices that will help you achieve cost benefits with Graviton instances across different workloads.

If you find use cases where similar performance gains aren’t observed on AWS Graviton, please open an issue on the AWS Graviton Technical Guide to let us know about it. We will continue to add more performance improvements to make Graviton the most cost-effective and efficient general-purpose processor for inference using PyTorch.


About the author

Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for machine learning, HPC, and multimedia workloads. She is passionate about open-source development and delivering cost-effective software solutions with Arm SoCs.


How Vericast optimized feature engineering using Amazon SageMaker Processing


This post is co-written by Jyoti Sharma and Sharmo Sarkar from Vericast.

For any machine learning (ML) problem, the data scientist begins by working with data. This includes gathering, exploring, and understanding the business and technical aspects of the data, along with evaluation of any manipulations that may be needed for the model building process. One aspect of this data preparation is feature engineering.

Feature engineering refers to the process where relevant variables are identified, selected, and manipulated to transform the raw data into forms that are more useful for the ML algorithm used to train a model and perform inference against it. The goal of this process is to increase the performance of the algorithm and the resulting predictive model. The feature engineering process entails several stages, including feature creation, data transformation, feature extraction, and feature selection.

Building a platform for generalized feature engineering is a common task for customers needing to produce many ML models with differing datasets. This kind of platform includes the creation of a programmatically driven process to produce finalized, feature engineered data ready for model training with little human intervention. However, generalizing feature engineering is challenging. Each business problem is different, each dataset is different, data volumes vary wildly from client to client, and data quality and often cardinality of a certain column (in the case of structured data) might play a significant role in the complexity of the feature engineering process. Furthermore, the dynamic nature of a customer’s data can also result in a large variance of the processing time and resources required to optimally complete the feature engineering.

AWS customer Vericast is a marketing solutions company that makes data-driven decisions to boost marketing ROIs for its clients. Vericast’s internal cloud-based Machine Learning Platform, built around the CRISP-ML(Q) process, uses various AWS services, including Amazon SageMaker, Amazon SageMaker Processing, AWS Lambda, and AWS Step Functions, to produce the best possible models that are tailored to the specific client’s data. This platform aims at capturing the repeatability of the steps that go into building various ML workflows and bundling them into standard generalizable workflow modules within the platform.

In this post, we share how Vericast optimized feature engineering using SageMaker Processing.

Solution overview

Vericast’s Machine Learning Platform aids in the quicker deployment of new business models based on existing workflows or quicker activation of existing models for new clients. For example, a model predicting direct mail propensity is quite different from a model predicting discount coupon sensitivity of the customers of a Vericast client. They solve different business problems and therefore have different usage scenarios in a marketing campaign design. But from an ML standpoint, both can be construed as binary classification models, and therefore could share many common steps from an ML workflow perspective, including model tuning and training, evaluation, interpretability, deployment, and inference.

Because these models are binary classification problems (in ML terms), we are separating the customers of a company into two classes (binary): those that would respond positively to the campaign and those that would not. Furthermore, these examples are considered an imbalanced classification because the data used to train the model wouldn’t contain an equal number of customers who would and would not respond favorably.

The actual creation of a model such as this follows the generalized pattern shown in the following diagram.

Typical Imbalanced Class Binary Classification Model Training Process

Most of this process is the same for any binary classification problem except for the feature engineering step. This is perhaps the most complicated yet at times overlooked step in the process. ML models are largely dependent on the features used to create them.

Vericast’s cloud-native Machine Learning Platform aims to generalize and automate the feature engineering steps for various ML workflows and optimize their performance on a cost vs. time metric by using the following features:

  • The platform’s feature engineering library – This consists of an ever-evolving set of transformations that have been tested to yield high-quality generalizable features based on specific client concepts (for example, customer demographics, product details, transaction details, and so on).
  • Intelligent resource optimizers – The platform uses AWS’s on-demand infrastructure capability to spin up the most optimal type of processing resources for the particular feature engineering job based on the expected complexity of the step and the amount of data it needs to churn through.
  • Dynamic scaling of feature engineering jobs – A combination of various AWS services is used for this, but most notably SageMaker Processing. This ensures that the platform produces high-quality features in a cost-efficient and timely manner.

This post is focused around the third point in this list and shows how to achieve dynamic scaling of SageMaker Processing jobs to achieve a more managed, performant, and cost-effective data processing framework for large data volumes.

SageMaker Processing enables workloads that run steps for data preprocessing or postprocessing, feature engineering, data validation, and model evaluation on SageMaker. It also provides a managed environment and removes the complexity of undifferentiated heavy lifting required to set up and maintain the infrastructure needed to run the workloads. Furthermore, SageMaker Processing provides an API interface for running, monitoring, and evaluating the workload.

Running SageMaker Processing jobs takes place fully within a managed SageMaker cluster, with individual jobs placed into instance containers at run time. The managed cluster, instances, and containers report metrics to Amazon CloudWatch, including usage of GPU, CPU, memory, GPU memory, disk metrics, and event logging.
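For readers who haven’t used the API, a minimal SageMaker Processing job launched from the SageMaker Python SDK looks roughly like the following sketch. The image URI, role, script name, S3 paths, and instance settings are placeholders, not values from Vericast’s platform.

from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

processor = ScriptProcessor(
    image_uri="<processing_image_uri>",  # container with the processing dependencies
    command=["python3"],
    role="<execution_role_arn>",
    instance_type="ml.m5.4xlarge",       # the kind of initial sizing guess discussed below
    instance_count=2,
)

processor.run(
    code="feature_engineering.py",       # placeholder processing script
    inputs=[ProcessingInput(source="s3://<bucket>/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://<bucket>/features/")],
)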

These features provide benefits to Vericast data engineers and scientists by assisting in the development of generalized preprocessing workflows and abstracting the difficulty of maintaining generated environments in which to run them. Technical problems can arise, however, given the dynamic nature of the data and its varied features that can be fed into such a general solution. The system must make an educated initial guess as to the size of the cluster and instances that compose it. This guess needs to evaluate criteria of the data and infer the CPU, memory, and disk requirements. This guess may be wholly appropriate and perform adequately for the job, but in other cases it may not. For a given dataset and preprocessing job, the CPU may be undersized, resulting in maxed out processing performance and lengthy times to complete. Worse yet, memory could become an issue, resulting in either poor performance or out of memory events causing the entire job to fail.

With these technical hurdles in mind, Vericast set out to create a solution. The solution needed to remain general in nature, fit into the larger picture of the preprocessing workflow, and be flexible in the steps involved. It was also important to handle both the potential need to scale up the environment when performance was compromised and the need to gracefully recover from such an event or from a job ending prematurely for any reason.

The solution built by Vericast to solve this issue uses several AWS services working together to reach their business objectives. It was designed to restart and scale up the SageMaker Processing cluster based on performance metrics observed using Lambda functions monitoring the jobs. To not lose work when a scaling event takes place or to recover from a job unexpectedly stopping, a checkpoint-based service was put in place that uses Amazon DynamoDB and stores the partially processed data in Amazon Simple Storage Service (Amazon S3) buckets as steps complete. The final outcome is an auto scaling, robust, and dynamically monitored solution.

The following diagram shows a high-level overview of how the system works.

Solution Architecture Diagram

In the following sections, we discuss the solution components in more detail.

Initializing the solution

The system assumes that a separate process initiates the solution. This design is not intended to work alone, because it won’t yield any artifacts or output on its own; rather, it acts as a sidecar implementation to one of the systems that use SageMaker Processing jobs. In Vericast’s case, the solution is initiated by way of a call from a Step Functions step started in another module of the larger system.

Once the solution is initiated and a first run is triggered, a base standard configuration is read from a DynamoDB table. This configuration is used to set parameters for the SageMaker Processing job and holds the initial assumptions about infrastructure needs. The SageMaker Processing job is then started.

Monitoring metadata and output

When the job starts, a Lambda function writes the job processing metadata (the current job configuration and other log information) into the DynamoDB log table. This metadata and log information maintains a history of the job, its initial and ongoing configuration, and other important data.

At certain points, as steps complete in the job, checkpoint data is added to the DynamoDB log table. Processed output data is moved to Amazon S3 for quick recovery if needed.

This Lambda function also sets up an Amazon EventBridge rule that monitors the running job’s state. Specifically, this rule watches the job to observe whether the job status changes to Stopping or Stopped. This EventBridge rule plays an important part in restarting a job if there is a failure or a planned auto scaling event occurs.
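A rule of that shape could be expressed with an event pattern similar to the following sketch, assuming the standard SageMaker Processing job state change event; the rule name is a placeholder and the target configuration is omitted.

import json
import boto3

events = boto3.client("events")

# Hypothetical rule that fires when a processing job starts stopping or has stopped
events.put_rule(
    Name="processing-job-stop-watch",  # placeholder name
    EventPattern=json.dumps(
        {
            "source": ["aws.sagemaker"],
            "detail-type": ["SageMaker Processing Job State Change"],
            "detail": {"ProcessingJobStatus": ["Stopping", "Stopped"]},
        }
    ),
)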

Monitoring CloudWatch metrics

The Lambda function also sets a CloudWatch alarm based on a metric math expression for the processing job, which monitors the metrics of all the instances for CPU utilization, memory utilization, and disk utilization. This type of alarm (a metric math alarm) uses CloudWatch alarm thresholds. The alarm generates events based on the value of the metric or expression relative to the thresholds over a number of time periods.

In Vericast’s use case, the threshold expression is designed to consider the driver and the executor instances as separate, with metrics monitored individually for each. By having them separate, Vericast knows which is causing the alarm. This is important to decide how to scale accordingly:

  • If the executor metrics are passing the threshold, it’s good to scale horizontally
  • If the driver metrics cross the threshold, scaling horizontally will probably not help, so we must scale vertically

Alarm metrics expression

Vericast can access the following metrics in its evaluation for scaling and failure:

  • CPUUtilization – The sum of each individual CPU core’s utilization
  • MemoryUtilization – The percentage of memory that is used by the containers on an instance
  • DiskUtilization – The percentage of disk space used by the containers on an instance
  • GPUUtilization – The percentage of GPU units that are used by the containers on an instance
  • GPUMemoryUtilization – The percentage of GPU memory used by the containers on an instance

As of this writing, Vericast only considers CPUUtilization, MemoryUtilization, and DiskUtilization. In the future, they intend to consider GPUUtilization and GPUMemoryUtilization as well.

The following code is an example of a CloudWatch alarm based on a metric math expression for Vericast auto scaling:

(IF((cpuDriver) > 80, 1, 0) 
 OR IF((memoryDriver) > 80, 1, 0) 
 OR IF((diskDriver) > 80, 1, 0)) 
 OR (IF(AVG(METRICS("cpuExec")) > 80, 1, 0) 
 OR IF(AVG(METRICS("memoryExec")) > 80, 1, 0) 
 OR IF(AVG(METRICS("diskExec")) > 80, 1, 0))

This expression illustrates that the CloudWatch alarm is considering DriverMemoryUtilization (memoryDriver), CPUUtilization (cpuDriver), DiskUtilization (diskDriver), ExecutorMemoryUtilization (memoryExec), CPUUtilization (cpuExec), and DiskUtilization (diskExec) as monitoring metrics. The number 80 in the preceding expression stands for the threshold value.

Here, IF((cpuDriver) > 80, 1, 0) implies that if the driver CPU utilization goes beyond 80%, the expression evaluates to 1; otherwise, it evaluates to 0. IF(AVG(METRICS("memoryExec")) > 80, 1, 0) implies that all the metrics with the string memoryExec in their ID are considered and an average is calculated over them. If that average memory utilization percentage goes beyond 80, the expression evaluates to 1; otherwise, it evaluates to 0.

The logical operator OR is used to combine all the utilizations in the expression: if any of the utilizations crosses its threshold, the alarm is triggered.

For more information on using CloudWatch metric alarms based on metric math expressions, refer to Creating a CloudWatch alarm based on a metric math expression.
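To make the mechanics concrete, the following sketch shows how an alarm of this shape could be created with boto3, reduced to one driver metric and one executor metric. The namespace, dimension names, job name, and SNS topic ARN are assumptions for illustration, not Vericast’s actual configuration.

import boto3

cloudwatch = boto3.client("cloudwatch")

def cpu_metric(metric_id, host):
    """Build one CPUUtilization metric query for a processing job instance."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "/aws/sagemaker/ProcessingJobs",     # assumed namespace
                "MetricName": "CPUUtilization",
                "Dimensions": [{"Name": "Host", "Value": host}],  # assumed dimension
            },
            "Period": 60,
            "Stat": "Average",
        },
        "ReturnData": False,
    }

cloudwatch.put_metric_alarm(
    AlarmName="processing-job-scaling-alarm",                                   # placeholder name
    AlarmActions=["arn:aws:sns:<aws_region>:<account_id>:processing-alerts"],   # placeholder SNS topic
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    Metrics=[
        cpu_metric("cpuDriver", "<job_name>/algo-1"),
        cpu_metric("cpuExec1", "<job_name>/algo-2"),
        {
            "Id": "breach",
            "Expression": "IF(cpuDriver > 80, 1, 0) OR IF(cpuExec1 > 80, 1, 0)",
            "ReturnData": True,
        },
    ],
)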

CloudWatch alarm limitations

CloudWatch limits the number of metrics per alarm to 10. This can be limiting if you need to consider more metrics than that.

To overcome this limitation, Vericast has set alarms based on the overall cluster size. One alarm is created per three instances (for three instances, there will be one alarm, because that adds up to nine metrics). Because the driver instance is considered separately, another alarm is created for it. Therefore, the total number of alarms created is roughly equivalent to one third of the number of executor nodes, plus an additional one for the driver instance. In each case, the number of metrics per alarm stays under the 10-metric limit.
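In other words, the number of alarms grows with the cluster size roughly as in this small sketch of the grouping arithmetic:

import math

def alarms_needed(executor_count, metrics_per_instance=3, metric_limit=10):
    """One alarm per group of executors that fits under the 10-metric limit, plus one for the driver."""
    instances_per_alarm = metric_limit // metrics_per_instance  # 3 instances x 3 metrics = 9 metrics
    return math.ceil(executor_count / instances_per_alarm) + 1  # + 1 alarm for the driver

print(alarms_needed(9))  # 3 executor alarms + 1 driver alarm = 4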

What happens in an alarm state

If a predetermined threshold is breached, the alarm enters the ALARM state and uses Amazon Simple Notification Service (Amazon SNS) to send out notifications. In this case, it sends an email notification to all subscribers with the alarm details in the message.

Amazon SNS also triggers a Lambda function that stops the currently running SageMaker Processing job, because the job would probably fail anyway. This function also writes a record of the event to the log table.
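
A minimal sketch of such a Lambda handler follows, assuming the alarm naming convention from the earlier sketch, a LOG_TABLE_NAME environment variable, and a simple log-table schema (all hypothetical):

import json
import os
import time

import boto3

sagemaker = boto3.client("sagemaker")
dynamodb = boto3.resource("dynamodb")
log_table = dynamodb.Table(os.environ["LOG_TABLE_NAME"])  # hypothetical log table


def handler(event, context):
    # The SNS message carries the CloudWatch alarm payload as a JSON string.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    alarm_name = message["AlarmName"]
    job_name = alarm_name.rsplit("-resource-alarm", 1)[0]  # assumes the naming from the alarm sketch

    # Stop the running processing job; the restart logic adds resources before relaunching it.
    sagemaker.stop_processing_job(ProcessingJobName=job_name)

    # Record the event in the log table.
    log_table.put_item(
        Item={
            "job_name": job_name,
            "event_time": int(time.time()),
            "event_type": "STOPPED_BY_ALARM",
            "alarm_name": alarm_name,
        }
    )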

A few seconds later, the EventBridge rule set up at job start detects that the job has entered a stopping state. The rule then reruns the first Lambda function to restart the job.
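
The rule itself can be expressed as an event pattern that matches SageMaker processing job state changes. The following Boto3 sketch shows one possible shape; the rule name and Lambda function ARN are hypothetical, and the target Lambda function would also need a resource-based permission allowing EventBridge to invoke it.

import json

import boto3

events = boto3.client("events")

# Match processing jobs that move into a Stopping or Stopped state.
events.put_rule(
    Name="processing-job-stopped-rule",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Processing Job State Change"],
        "detail": {"ProcessingJobStatus": ["Stopping", "Stopped"]},
    }),
)

# Point the rule at the Lambda function that (re)starts the processing job.
events.put_targets(
    Rule="processing-job-stopped-rule",
    Targets=[{
        "Id": "restart-processing-job",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:start-processing-job",  # hypothetical ARN
    }],
)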

The dynamic scaling process

When the first Lambda function runs a second (or subsequent) time, it knows that a previous job already started and has now stopped. The function goes through a similar process: it retrieves the base configuration of the original job from the log DynamoDB table and also retrieves an updated configuration from the internal table. This updated configuration is a resources delta that is set based on the scaling type, which is determined from the alarm metadata as described earlier.

The original configuration plus the resources delta form the new configuration, and a new SageMaker Processing job is started with the increased resources.
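
To make that merge concrete, the following sketch applies a hypothetical resources delta to a base configuration. The instance-type ladder and delta values are illustrative, not Vericast’s actual settings.

INSTANCE_LADDER = ["ml.m5.2xlarge", "ml.m5.4xlarge", "ml.m5.12xlarge"]  # assumed vertical-scaling ladder


def apply_delta(base_config, scaling_type):
    config = dict(base_config)
    if scaling_type == "horizontal":
        # Executor alarm: add executor instances.
        config["InstanceCount"] = base_config["InstanceCount"] + 2
    else:
        # Driver alarm: move to the next larger instance type.
        current = INSTANCE_LADDER.index(base_config["InstanceType"])
        config["InstanceType"] = INSTANCE_LADDER[min(current + 1, len(INSTANCE_LADDER) - 1)]
    return config


# The resulting configuration is then used to relaunch the processing job with more resources.
new_config = apply_delta({"InstanceType": "ml.m5.2xlarge", "InstanceCount": 4}, "horizontal")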

This process continues until the job completes successfully and can result in multiple restarts as needed, adding more resources each time.

Vericast’s outcome

This custom auto scaling solution has been instrumental in making Vericast’s Machine Learning Platform more robust and fault tolerant. The platform can now gracefully handle workloads of different data volumes with minimal human intervention.

Before implementing this solution, estimating the resource requirements for all the Spark-based modules in the pipeline was one of the biggest bottlenecks of the new client onboarding process. Workflows would fail if the client data volume increased, or the cost would be unjustifiable if the data volume decreased in production.

With this new module in place, workflow failures due to resource constraints have been reduced by almost 80%. The few remaining failures are mostly due to AWS account constraints and are beyond the scope of the auto scaling process. Vericast’s biggest win with this solution is the ease with which they can onboard new clients and workflows. Vericast expects to speed up the process by at least 60–70%, with data still to be gathered for a final number.

Though this is viewed as a success by Vericast, there is a cost that comes with it. Based on the nature of this module and the concept of dynamic scaling as a whole, the workflows tend to take around 30% longer (average case) than a workflow with a custom-tuned cluster for each module in the workflow. Vericast continues to optimize in this area, looking to improve the solution by incorporating heuristics-based resource initialization for each client module.

Sharmo Sarkar, Senior Manager, Machine Learning Platform at Vericast, says, “As we continue to expand our use of AWS and SageMaker, I wanted to take a moment to highlight the incredible work of our AWS Client Services Team, dedicated AWS Solutions Architects, and AWS Professional Services that we work with. Their deep understanding of AWS and SageMaker allowed us to design a solution that met all of our needs and provided us with the flexibility and scalability we required. We are so grateful to have such a talented and knowledgeable support team on our side.”

Conclusion

In this post, we shared how SageMaker and SageMaker Processing have enabled Vericast to build a managed, performant, and cost-effective data processing framework for large data volumes. By combining the power and flexibility of SageMaker Processing with other AWS services, they can easily monitor the generalized feature engineering process. They can automatically detect potential issues generated from lack of compute, memory, and other factors, and automatically implement vertical and horizontal scaling as needed.

SageMaker and its tools can help your team meet its ML goals as well. To learn more about SageMaker Processing and how it can assist in your data processing workloads, refer to Process Data. If you’re just getting started with ML and are looking for examples and guidance, Amazon SageMaker JumpStart can get you started. JumpStart is an ML hub from which you can access built-in algorithms and pre-trained foundation models to help you perform tasks such as article summarization and image generation, as well as pre-built solutions for common use cases.

Finally, if this post helps you or inspires you to solve a problem, we would love to hear about it! Please share your comments and feedback.


About the Authors

Anthony McClure is a Senior Partner Solutions Architect with the AWS SaaS Factory team. Anthony also has a strong interest in machine learning and artificial intelligence working with the AWS ML/AI Technical Field Community to assist customers in bringing their machine learning solutions to reality.

Jyoti Sharma is a Data Science Engineer with the machine learning platform team at Vericast. She is passionate about all aspects of data science and focused on designing and implementing a highly scalable and distributed Machine Learning Platform.

Sharmo Sarkar is a Senior Manager at Vericast. He leads the Cloud Machine Learning Platform and the Marketing Platform ML R&D Teams at Vericast. He has extensive experience in Big Data Analytics, Distributed Computing, and Natural Language Processing. Outside work, he enjoys motorcycling, hiking, and biking on mountain trails.
