Meet the Omnivore: Startup Develops App Letting Users Turn Objects Into 3D Models With Just a Smartphone

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who accelerate 3D workflows and create virtual worlds using NVIDIA Omniverse, a development platform built on Universal Scene Description, aka OpenUSD.

As augmented reality (AR) becomes more prominent and accessible across the globe, Kiryl Sidarchuk is helping to erase the border between the real and virtual worlds.

Kiryl Sidarchuk

Sidarchuk is co-founder and CEO of AR-Generation, a member of the NVIDIA Inception program for cutting-edge startups. He and his company developed MagiScan, an AI-based 3D scanner app.

It lets users capture any object with their smartphone camera and quickly creates a high-quality, detailed 3D model of it for use in any AR or metaverse application.

AR-Generation now offers an extension that enables direct export of 3D models from MagiScan to NVIDIA Omniverse, a development platform for connecting and building 3D tools and metaverse applications.

This is made fast and easy by Universal Scene Description, aka OpenUSD, an extensible framework that serves as a common language between digital content-creation tools.

“Augmented reality will become an integral part of everyday life,” said Sidarchuk, who’s based in Nicosia, Cyprus. “We customized our app to allow export of 3D models based on real-world objects directly to Omniverse, enabling users to showcase the models in AR and integrate them into any metaverse or game.”

Omniverse extensions are core building blocks that let anyone create and extend functions of Omniverse apps using the popular Python or C++ programming languages.
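
For readers curious what that looks like in practice, below is a minimal Python sketch of the skeleton an Omniverse extension typically follows: a class deriving from omni.ext.IExt with startup and shutdown hooks. The class name and extension ID are hypothetical, and omni.ext is only importable inside an Omniverse app; this is an illustration, not AR-Generation's code.

    # Minimal sketch of an Omniverse extension entry point (hypothetical names).
    # The omni.ext module is only available inside an Omniverse application.
    import omni.ext


    class ExampleScanImportExtension(omni.ext.IExt):
        """Skeleton extension that runs code when enabled and disabled."""

        def on_startup(self, ext_id: str):
            # Called when the extension is enabled; register UI, menus, or importers here.
            print(f"[example.scan.import] startup: {ext_id}")

        def on_shutdown(self):
            # Called when the extension is disabled; release any resources here.
            print("[example.scan.import] shutdown")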

It was simple and convenient for AR-Generation to build the extension, Sidarchuk said, thanks to easily accessible documentation, as well as technical guidance from NVIDIA teams, free AWS credits and networking opportunities with other AI-driven companies — all benefits of being a part of NVIDIA Inception.

Capture, Click and Create 3D Models From Real-World Objects 

Sidarchuk estimates that MagiScan can create a 3D model of an object 10x faster and at up to 100x lower cost than a designer could manually.

This frees creators up to focus on fine-tuning their work and makes AR more accessible to all through a simple app.

AR-Generation chose to build an extension for Omniverse because the platform “provides a convenient environment that integrates all the tools for working with 3D and generative AI,” said Sidarchuk. “Plus, we can collaborate and exchange ideas with colleagues in real time.”

Export 3D models from MagiScan to Omniverse with OpenUSD.

Sidarchuk’s favorite feature of Omniverse is its OpenUSD compatibility, which enables seamless interchange of 3D data between creative applications. “OpenUSD is the format of the future,” he said.
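
To make the "common language" point concrete, here is a small sketch that uses the open-source pxr Python bindings (the usd-core package) to author a trivial USD stage that any OpenUSD-aware tool can open. The file name and prim paths are arbitrary examples, not something produced by MagiScan or Omniverse.

    # Sketch: author a simple USD stage with the open-source pxr bindings (pip install usd-core).
    from pxr import Usd, UsdGeom

    # Create a new stage and define a transform and a mesh prim (names are arbitrary).
    stage = Usd.Stage.CreateNew("scanned_object.usda")
    UsdGeom.Xform.Define(stage, "/ScannedObject")
    mesh = UsdGeom.Mesh.Define(stage, "/ScannedObject/Geom")

    # Author minimal geometry: a single triangle.
    mesh.CreatePointsAttr([(0, 0, 0), (1, 0, 0), (0, 1, 0)])
    mesh.CreateFaceVertexCountsAttr([3])
    mesh.CreateFaceVertexIndicesAttr([0, 1, 2])

    # Save the layer; the resulting .usda file can be opened by any OpenUSD-aware application.
    stage.GetRootLayer().Save()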

Based on this framework, the MagiScan extension for Omniverse enables fast, affordable creation of high-quality 3D models for any object. MagiScan is available for download on iOS and Android devices.

“It can help everyone from individuals to large corporations save time and money in digitalization,” said Sidarchuk, who claims his first word as a toddler was “money.”

The business-oriented developer started his first company at age 16. It was a one-man endeavor, buying fresh fruits and vegetables from a small village and selling them in Minsk, the capital of Belarus. “That’s how I earned enough to buy my first car,” he mused.

More than a dozen years later, when he’s not working to “enhance human capabilities through augmented-reality technologies,” as he put it, Sidarchuk spends his free time with his five-year-old daughter, Aurora.

Watch Sidarchuk discuss 3D modeling, AI and AR on a replay of his Omniverse livestream on demand, and learn more about the MagiScan extension for Omniverse.

Join In on the Creation

Anyone can build their own Omniverse extension or Connector to enhance their 3D workflows and tools. Creators and developers across the world can download NVIDIA Omniverse for free, and enterprise teams can use the platform for their 3D projects.

Check out artwork from other “Omnivores” and submit projects in the gallery. Connect your workflows to Omniverse with software from Adobe, Autodesk, Epic Games, Maxon, Reallusion and more.

Get started with NVIDIA Omniverse by downloading the standard license free, or learn how Omniverse Enterprise can connect your team. Developers can get started with Omniverse resources and learn about OpenUSD. Explore the growing ecosystem of 3D tools connected to Omniverse.

Stay up to date on the platform by subscribing to the newsletter, and follow NVIDIA Omniverse on Instagram, Medium and Twitter. For more, join the Omniverse community and check out the Omniverse forums, Discord server, Twitch and YouTube channels. 

Quicker Cures: How Insilico Medicine Uses Generative AI to Accelerate Drug Discovery

While generative AI is a relatively new household term, drug discovery company Insilico Medicine has been using it for years to develop new therapies for debilitating diseases.

The company’s early bet on deep learning is bearing fruit — a drug candidate discovered using its AI platform is now entering Phase 2 clinical trials to treat idiopathic pulmonary fibrosis, a relatively rare respiratory disease that causes progressive decline in lung function.

Insilico used generative AI for each step of the preclinical drug discovery process: to identify a molecule that a drug compound could target, generate novel drug candidates, gauge how well these candidates would bind with the target, and even predict the outcome of clinical trials.

Doing this using traditional methods would have cost more than $400 million and taken up to six years. But with generative AI, Insilico accomplished these steps at one-tenth of the cost and in one-third of the time — reaching the first phase of clinical trials just two and a half years after beginning the project.

“This first drug candidate that’s going to Phase 2 is a true highlight of our end-to-end approach to bridge biology and chemistry with deep learning,” said Alex Zhavoronkov, CEO of Insilico Medicine. “This is a significant milestone not only for us, but for everyone in the field of AI-accelerated drug discovery.”

Insilico is a premier member of NVIDIA Inception, a free program that provides cutting-edge startups with technical training, go-to-market support and AI platform guidance. The company uses NVIDIA Tensor Core GPUs in its generative AI drug design engine, Chemistry42, to generate novel molecular structures — and was one of the first adopters of an early precursor to NVIDIA DGX systems in 2015.

AI Enables End-to-End Preclinical Drug Discovery

Insilico’s Pharma.AI platform includes multiple AI models trained on millions of data samples for a range of tasks. One AI tool, PandaOmics, rapidly identifies and prioritizes targets that play a significant role in driving a disease — like the infamous spike protein on the virus that causes COVID-19.

Within days, the Chemistry42 engine can design new potential drug compounds that target the protein identified by PandaOmics. The generative chemistry tool uses deep learning to come up with drug-like molecular structures from scratch.

“Typically, AI companies in drug discovery focus either on biology or on chemistry,” said Petrina Kamya, head of AI platforms at Insilico. “From the start, Insilico has been applying the same deep learning approach to both fields, using AI both to discover drug targets and generate chemical structures of small molecules.”

Over the years, the Insilico team has adopted different kinds of deep neural networks for drug discovery, including generative adversarial networks and transformer models. They’re now using NVIDIA BioNeMo to accelerate the early drug discovery process with generative AI.

Finding the Needle in the AI Stack

To develop its pulmonary fibrosis drug candidate, Insilico used Pharma.AI to design and synthesize about 80 molecules, achieving unprecedented success rates for preclinical drug candidates. The process — from identifying the target to nominating a promising drug candidate for trials — took under 18 months.

During Phase 2 clinical trials, Insilico’s pulmonary fibrosis drug will be tested in several hundred people with the condition in the U.S. and China. The process will take several months — but in parallel, the company has more than 30 programs in the pipeline to target other diseases, including a number of cancer drugs.

“When we first presented our results, people just did not believe that generative AI systems could achieve this level of diversity, novelty and accuracy,” said Zhavoronkov. “Now that we have an entire pipeline of promising drug candidates, people are realizing that this actually works.”

Learn more about Insilico Medicine’s Chemistry42 platform for AI-accelerated drug candidate screening in this talk from NVIDIA GTC.

Subscribe to NVIDIA healthcare news and generative AI news.

Unlocking the future of computing: The Analog Iterative Machine’s lightning-fast approach to optimization 


Picture a world where computing is not limited by the binary confines of zeros and ones but is instead free to explore the vast possibilities of continuous-value data. Over the past three years, a team of Microsoft researchers has been developing a new kind of analog optical computer that uses photons and electrons to process continuous-value data, unlike today’s digital computers that use transistors to crunch through binary data. This innovative new machine has the potential to surpass state-of-the-art digital technology and transform computing in years to come.

The Analog Iterative Machine (AIM) is designed to solve difficult optimization problems, which form the foundation of many industries, such as finance, logistics, transportation, energy, healthcare, and manufacturing. However, traditional digital computers struggle to crack these problems in a timely, energy-efficient and cost-effective manner. This is because the number of possible combinations explodes exponentially as the problem size grows, making it a massive challenge for even the most powerful digital computers. The Traveling Salesman Problem is a classic example. Imagine trying to find the most efficient route for visiting a set of cities just once before returning to the starting point. With only five cities, there are 12 possible routes – but for a 61-city problem, the number of potential routes surpasses the number of atoms in the universe.
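
The route counts follow from a simple formula: with the starting city fixed and direction ignored, n cities admit (n − 1)!/2 distinct round trips. A quick check of the figures above:

    # Number of distinct round trips through n cities (start fixed, direction ignored).
    from math import factorial

    def tsp_routes(n: int) -> int:
        return factorial(n - 1) // 2

    print(tsp_routes(5))    # 12
    print(tsp_routes(61))   # about 4.2e81, more than the roughly 1e80 atoms in the observable universe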

AIM addresses two simultaneous trends. First, it sidesteps the diminishing growth of computing capacity per dollar in digital chips – or the unraveling of Moore’s Law. Second, it overcomes the limitations of specialized machines designed for solving optimization problems. Despite over two decades of research and substantial industry investment, such unconventional hardware-based machines have a limited range of practical applications, because they can only address optimization problems with binary values. This painful realization within the optimization community has driven the team to develop AIM, with a design that combines mathematical insights with cutting-edge algorithmic and hardware advancements. The result? An analog optical computer that can solve a much wider range of real-world optimization problems while operating at the speed of light, offering potential speed and efficiency gains of about a hundred times.

Today, AIM is still a research project, but the cross-disciplinary team has recently assembled the world’s first opto-electronic hardware for mixed – continuous and binary – optimization problems. Though presently operating on a limited scale, the initial results are promising, and the team has started scaling up its efforts. This includes a research collaboration with the UK-based multinational bank Barclays to solve an optimization problem critical to the financial markets on the AIM computer. Separate engagements are aimed at gaining more experience in solving industry-specific optimization problems. In June 2023, the team launched an online service that provides an AIM simulator to allow partners to explore the opportunities created by this new kind of computer.

The technology 

Photons possess a remarkable property of not interacting with one another, which has underpinned the internet era by enabling large amounts of data to be transmitted over light across vast distances. However, photons do interact with the matter through which they propagate, allowing for linear operations such as addition and multiplication, which form the basis for optimization applications. For instance, when light falls on the camera sensor on our smartphones, it adds up the incoming photons and generates the equivalent amount of current. Additionally, data transmission over fiber which brings internet connectivity to homes and businesses relies on encoding zeroes and ones onto light by programmatically controlling its intensity. This scaling of light through light-matter interaction multiplies the light intensity by a specific value – multiplication in the optical domain. Beyond optical technologies for linear operations, various other electronic components prevalent in everyday technologies can perform non-linear operations that are also critical for efficient optimization algorithms.

Analog optical computing thus involves constructing a physical system using a combination of analog technologies – both optical and electronic – governed by equations that capture the required computation. This can be very efficient for specific application classes where linear and non-linear operations are dominant. In optimization problems, finding the optimal solution is akin to discovering a needle in an inconceivably vast haystack. The team has developed a new algorithm that is highly efficient at such needle-finding tasks. Crucially, the algorithm’s core operation involves performing hundreds of thousands or even millions of vector-matrix multiplications – the vectors represent the problem variables whose values need to be determined while the matrix encodes the problem itself. These multiplications are executed swiftly and with low energy consumption using commodity optical and electronic technologies, as shown in Figure 1.
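
The algorithm's details are not given here, but the loop structure it describes, a vector-matrix multiplication followed by an inexpensive non-linear update, repeated many times, can be sketched generically. The toy solver below minimizes a quadratic objective over a box and is only an illustration of that structure, not Microsoft's algorithm.

    # Generic sketch of an iterative solver whose hot loop is a vector-matrix multiply
    # followed by an elementwise non-linearity. Illustrative only; not the AIM algorithm.
    import numpy as np

    def iterative_solve(Q: np.ndarray, steps: int = 10_000, lr: float = 0.01, seed: int = 0) -> np.ndarray:
        rng = np.random.default_rng(seed)
        x = rng.uniform(-1.0, 1.0, size=Q.shape[0])   # problem variables (the "vector")
        for _ in range(steps):
            grad = Q @ x                              # vector-matrix multiply; the matrix encodes the problem
            x = np.clip(x - lr * grad, -1.0, 1.0)     # cheap non-linearity keeps variables in range
        return x

    # Toy usage: minimize x^T Q x over the box [-1, 1]^n for a random symmetric Q.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    print(iterative_solve((A + A.T) / 2))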

Figure 1: Illustration of the AIM computer, which implements massively parallel vector-matrix multiplication using commodity optical technologies (in the back) and non-linearity applied using analog electronics (front). The vector is represented using an array of light sources, the matrix is embedded into the modulator array (shown in grayscale) and the result is collected into the camera sensor.
Figure 2: The second-generation AIM computer, with 48 variables, is a rack-mounted appliance.

Thanks to the miniaturization of all these components onto tiny centimeter-scale chips, the entire AIM computer fits into a small rack enclosure – as shown in Figure 2. As light travels incredibly fast – 5 nanoseconds per meter – each iteration within the AIM computer is significantly faster and consumes less electricity than running the same algorithm on a digital computer. Importantly, since the entire problem is embedded into the modulator matrix inside the computer itself, AIM does not require the problem to be transferred back and forth between storage and compute locations. And unlike synchronous digital computers, AIM’s operation is entirely asynchronous. These architectural choices circumvent key historical bottlenecks for digital computers. 

Finally, all technologies used in AIM are already prevalent in consumer products with existing manufacturing ecosystems, which paves the way for a viable computing platform, at full scale, if all the technical challenges can be tamed by the team.

The importance of optimization problems

Optimization problems are mathematical challenges that require finding the best possible solution from a set of feasible alternatives. The modern world relies heavily on efficient solutions to these problems – from managing electricity in our power grids and streamlining goods delivery across sea, air, and land, to optimizing internet traffic routing.

Effectively and efficiently solving optimization problems can significantly improve processes and outcomes across many other industries. Take finance, for example, where portfolio optimization involves selecting the ideal combination of assets to maximize returns while minimizing risks. In healthcare, optimizing patient scheduling can enhance resource allocation and minimize waiting times in hospitals.

For many larger problems, even the world’s biggest supercomputer would take years or even centuries to find the optimal solution. A common workaround is heuristic algorithms – problem-solving techniques that provide approximate solutions by employing shortcuts or “rules of thumb.” Although these algorithms might not guarantee the discovery of an optimal solution, they are the most practical and efficient methods for finding near-optimal solutions in reasonable timeframes. Now, imagine the immense impact of a computer that could deliver better solutions in a significantly shorter timeframe for these critical problems. In some instances, solving these problems in real time could create a domino effect of positive outcomes, revolutionizing entire workflows and industries.
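
A classic example of such a rule of thumb is the nearest-neighbor heuristic for the Traveling Salesman Problem: always travel to the closest city not yet visited. It offers no optimality guarantee, but it finishes in a fraction of a second at sizes where exhaustive search is hopeless. The sketch below is a generic illustration and has nothing to do with AIM's own algorithm.

    # Nearest-neighbor heuristic for TSP: repeatedly visit the closest unvisited city.
    # Fast and simple, but with no guarantee of finding the optimal tour.
    import numpy as np

    def nearest_neighbor_tour(dist: np.ndarray, start: int = 0) -> list:
        n = dist.shape[0]
        unvisited = set(range(n)) - {start}
        tour = [start]
        while unvisited:
            current = tour[-1]
            nxt = min(unvisited, key=lambda city: dist[current, city])
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour  # the tour implicitly closes by returning to the start city

    # Toy usage with random city coordinates.
    rng = np.random.default_rng(1)
    pts = rng.random((10, 2))
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    print(nearest_neighbor_tour(dist))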

QUMO: a world beyond QUBO

For years, researchers, both in industry and academia, have built impressive specialized machines to efficiently solve optimization problems using heuristic algorithms. This includes an array of custom hardware, such as field programmable gate arrays (FPGAs), quantum annealers, and electrical and optical parametric oscillator systems. However, all of them rely on mapping difficult optimization problems to the same binary representation, often referred to as Ising, Max-Cut or QUBO (quadratic unconstrained binary optimization). Unfortunately, none of these efforts have provided a practical alternative to conventional computers. This is because it is very hard to map real-world optimization problems at scale to the binary abstraction, a common theme in the team’s engagement with practitioners across industry and academia.

With AIM, the team has introduced a more expressive mathematical abstraction called QUMO (quadratic unconstrained mixed optimization), which can represent mixed – binary and continuous – variables and is compatible with hardware implementation, making it the “sweet spot” for many practical, heavily constrained optimization problems. Discussions with industry experts indicate that scaling AIM to 10,000 variables would mean that most of the practical problems discussed earlier are within reach. A problem with 10,000 variables that can be directly mapped to the QUMO abstraction would require an AIM computer with 10,000 physical variables. In contrast, existing specialized machines would need to scale to beyond a million physical variables, well beyond the capabilities of the underlying hardware.
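
To make the abstraction concrete: a QUMO instance can be read as a quadratic objective over a vector of variables in which some entries are restricted to {0, 1} and the rest range over a continuous interval. The sketch below evaluates one plausible form of such an objective; the post does not give AIM's exact formulation, so treat this purely as an illustration.

    # Illustrative QUMO-style objective: a quadratic function over mixed binary and
    # continuous variables. AIM's exact formulation is not given in the post.
    import numpy as np

    def qumo_objective(x: np.ndarray, Q: np.ndarray, c: np.ndarray) -> float:
        """Quadratic objective x^T Q x + c^T x."""
        return float(x @ Q @ x + c @ x)

    def project(x: np.ndarray, binary_mask: np.ndarray) -> np.ndarray:
        """Round binary entries to {0, 1}; clip continuous entries to [-1, 1]."""
        x = np.clip(x, -1.0, 1.0)
        x[binary_mask] = np.round(np.clip(x[binary_mask], 0.0, 1.0))
        return x

    # Toy instance: 3 binary variables followed by 3 continuous ones.
    rng = np.random.default_rng(2)
    A = rng.standard_normal((6, 6))
    Q, c = (A + A.T) / 2, rng.standard_normal(6)
    binary_mask = np.array([True, True, True, False, False, False])
    x = project(rng.uniform(-1, 1, 6), binary_mask)
    print(qumo_objective(x, Q, c))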

AIM also implements a novel and efficient algorithm for solving such QUMO problems that relies on an advanced form of gradient descent, a technique that is also popular in machine learning. The algorithm shows highly competitive performance and accuracy across various industrially inspired problem benchmarks. It even discovered new best-ever solutions to four problems. The first-generation AIM computer, built last year, solves QUMO optimization problems that are represented with an accuracy of up to 7 bits. The team, shown in Figure 3, has also shown good quantitative agreement between the simulated and the hardware version of the AIM computer to gain further confidence in the viability of these efficiency gains as the computer is scaled up. This paper gives more details about the AIM architecture, its implementation, evaluation and scaling roadmap.

Figure 3: AIM’s design involves innovation at the intersection of optical and analog hardware, mathematics and algorithms, and software and system architecture, which is typified in the cross-disciplinary nature of the team working hand-in-hand towards the mission of building a computer that solves practical problems. Photo of the AIM team – Front row (left to right): Doug Kelly, Jiaqi Chu, James Clegg, Babak Rahmani. Back row: Hitesh Ballani, George Mourgias-Alexandris, Daniel Cletheroe, Francesca Parmigiani, Lucinda Pickup, Grace Brennan, Ant Rowstron, Kirill Kalinin, Jonathan Westcott, Christos Gkantsidis. (Greg O’Shea and Jannes Gladrow do not appear in this photo.)

Rethinking optimization with QUMO: A more expressive way of reasoning for experts

AIM’s blueprint for co-designing unconventional hardware with an expressive abstraction and a new algorithm has the potential to spark a new era in optimization techniques, hardware platforms, and automated problem mapping procedures, utilizing the more expressive QUMO abstraction. This exciting journey has already begun, with promising results from mapping problems from diverse domains like finance and healthcare to AIM’s QUMO abstraction. Recent research has already shown that increased expressiveness with continuous variables can substantially expand the real-world business problems that can be tackled. However, to the team’s knowledge, AIM is the first and only hardware to natively support this abstraction.

As we venture into a new abstraction, we must also adopt new ways of thinking. It is crucial for the team to build a strong community to deeply investigate the benefits of embracing QUMO. We invite people who have previously been deterred by the limitations of binary solvers to consider the new opportunities offered by AIM’s QUMO abstraction. To facilitate this, we are releasing our AIM simulator as a service, allowing selected users to get first-hand experience. The initial users are the team’s collaborators at Princeton University and at Cambridge University. They have helped us identify several exciting problems where the AIM computer and its abstraction are a much more natural fit. We are also actively engaging with thought leaders from internal Microsoft divisions and external companies in sectors where optimization is crucial.

Together, we can drive innovation and unlock the true potential of analog optical computing for solving some of the most complex optimization problems across industries.

Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps

Optimal transport (OT) theory focuses, among all maps T that can morph a probability measure onto another, on those that are the “thriftiest,” i.e., such that the averaged cost c(x, T(x)) between a point x and its image T(x) is as small as possible. Many computational approaches have been proposed to estimate such Monge maps when c is the squared Euclidean (ℓ₂²) distance, e.g., using entropic maps (Pooladian and Niles-Weed, 2021) or neural networks (Makkuva et al., 2020; Korotin et al., 2020). We propose a new model for transport maps, built on a family of translation-invariant costs c(x, y) := h(x − y), where h := ½‖·‖₂² + τ and τ is a regularizer. We propose a…

Apple Machine Learning Research
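
For context, the Monge problem the snippet refers to is conventionally written as follows (this is the textbook formulation, not notation taken from the paper itself):

    % Monge's optimal transport problem: among all maps T pushing mu forward onto nu,
    % find one that minimizes the average transport cost.
    \min_{T \,:\, T_{\#}\mu = \nu} \; \int c\bigl(x, T(x)\bigr)\, \mathrm{d}\mu(x),
    \qquad \text{e.g. } c(x, y) = \tfrac{1}{2}\lVert x - y \rVert_2^2 .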

Define customized permissions in minutes with Amazon SageMaker Role Manager via the AWS CDK

Machine learning (ML) administrators play a critical role in maintaining the security and integrity of ML workloads. Their primary focus is to ensure that users operate with the utmost security, adhering to the principle of least privilege. However, accommodating the diverse needs of different user personas and creating appropriate permission policies can sometimes impede agility. To address this challenge, AWS introduced Amazon SageMaker Role Manager in December 2022. SageMaker Role Manager is a powerful tool that you can use to swiftly develop persona-based roles, which can be easily customized to meet specific requirements.

With SageMaker Role Manager, administrators can efficiently define persona-based roles tailored to distinct user groups. This approach ensures that individuals have access only to the resources and actions essential for their tasks, reducing the risk of unauthorized actions or breaches. SageMaker Role Manager also allows for fine-grained customization. ML administrators can tailor the roles to meet specific requirements by modifying the permissions associated with each persona. This flexibility ensures that the permissions align precisely with the tasks and responsibilities of individual users, providing a robust security framework while accommodating unique use cases.

SageMaker Role Manager is currently available on the Amazon SageMaker console in all commercial Regions. Today, we are launching the ability to define customized permissions in minutes with SageMaker Role Manager via the AWS Cloud Development Kit (AWS CDK). This addresses a critical obstacle to wider adoption because ML administrators can now automate their tasks programmatically. With the power of the AWS CDK, ML administrators can streamline workflows, reduce manual efforts, and ensure consistency in managing permissions for their ML infrastructure.

Solution overview

With the release of the SageMaker Role Manager CDK, we are launching two new infrastructure as code (IaC) capabilities:

  • You can create fine-grained AWS Identity and Access Management (IAM) roles for ML personas such as data scientist, ML engineer, or data engineer. SageMaker Role Manager offers predefined personas and ML activities combined to streamline your permission generation process, allowing your ML practitioners to perform their responsibilities with the least privilege permissions. For secure access to your ML resources, SageMaker Role Manager allows you to specify networking and encryption permissions for Amazon Virtual Private Cloud (Amazon VPC) resources and AWS Key Management Service (AWS KMS) encryption keys. Furthermore, you can customize permissions by attaching your own customer managed policies.
  • The SageMaker Role Manager CDK lets you define custom permissions for SageMaker users in minutes. It comes with a set of predefined policy templates for different personas and ML activities. Personas represent the different types of users that need permissions to perform ML activities in SageMaker, such as data scientists or MLOps engineers. ML activities are a set of permissions to accomplish a common ML task, such as running Amazon SageMaker Studio applications or managing experiments, models, or pipelines. After you have selected the persona type and the set of ML activities, the SageMaker Role Manager CDK automatically creates the required IAM role and policies that you can assign to SageMaker users. Similarly, you can also create IAM roles with fine-grained permissions for automated jobs such as running SageMaker Pipelines.

Prerequisites

To start using the SageMaker Role Manager CDK, you need to complete the following prerequisite steps:

  1. Set up a role for your ML administrator to create and manage personas, as well as the IAM permissions for those users. For a sample admin policy, refer to the prerequisite section in the Define customized permissions in minutes with Amazon SageMaker Role Manager blog post.
  2. Create a compute-only persona role (if you don’t have any) for passing to jobs and endpoints. For instructions to set up that role, refer to Using the role manager.
  3. Set up your AWS CDK development environment. For instructions, refer to Getting started with the AWS CDK.

Install and run the SageMaker Role Manager CDK

Complete the following steps to set up the SageMaker Role Manager CDK:

  1. Create your AWS CDK app and give it a name; for example, RoleManager.
  2. Navigate to the RoleManager folder and run the following command to create a blank TypeScript AWS CDK project:
    cdk init app --language typescript

  3. Open package.json and add the @cdklabs/cdk-aws-sagemaker-role-manager package to the dependencies, as shown in the following code:
    "dependencies": {
        "aws-cdk-lib": "2.85.0",
        "@cdklabs/cdk-aws-sagemaker-role-manager": "0.0.15",
        "constructs": "^10.0.0",
        "source-map-support": "^0.5.21"
      }

  4. Run the following command to install the new cdk-aws-sagemaker-role-manager package:
    npm install

  5. Navigate to the lib folder and replace role_manager_stack.ts with the following code:
    import * as cdk from 'aws-cdk-lib';
    import { Construct } from 'constructs';
    import * as iam from 'aws-cdk-lib/aws-iam';
    import { Activity } from '@cdklabs/cdk-aws-sagemaker-role-manager';
    
    export class RoleManagerStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);
    
        const activity = Activity.manageJobs(this, 'id1', {
            rolesToPass: [iam.Role.fromRoleName(this, 'passRoleId', 'passRoleName')],
        });
        
        activity.createRole(this, 'newRoleId', 'newRoleName', 'newRoleDescription');
        
      }
    }

  6. Replace passRoleId, passRoleName, newRoleId, newRoleName, and newRoleDescription based on your requirements for role creation.
  7. Navigate back to your AWS CDK app home folder and run the following command to verify the generated AWS CloudFormation template:
    cdk synth

  8. Finally, run the following command to deploy the CloudFormation stack in your AWS account:
    cdk deploy

You should see an AWS CDK deployment output similar to the one in the following screenshot.

More SageMaker Role Manager CDK examples are available in the following GitHub repo.

ML persona and activity CDK reference

Administrators can define ML activities using one of the ML activity static functions of the ML activity class. For a list of the latest versions, refer to ML activity reference.

The ML persona class supports the following methods:

  • customizeVPC(subnets, securityGroups) – Customizes the VPC of all activities that support VPC customization of personas.
  • customizeKMS(dataKeys, volumeKeys) – Customizes KMS keys of all activities that support KMS key customization of personas.
  • createRole(scope, id, roleNameSuffix, roleDescription) – Creates a role with the persona’s activities’ permissions, similar to the role created through the UI, in the given scope with the given ID, named SageMaker-${roleNameSuffix} and optionally described with the passed role description.
  • grantPermissionsTo(identity) – Grants the persona’s activities’ permissions to the identity. The passed identity can be a role or an AWS resource associated with a role (for example, a Lambda function with the role of the Lambda function describing which resources the Lambda function can access).
  • grantPermissionsTo() – Updates the role of the passed identity to have the permissions specified in the ML activity.

The ML activity class supports the same set of functions as ML personas; however, the difference is an ML activity is constrained to a single activity when using this interface to create IAM roles.

Conclusion

SageMaker Role Manager enables you to create customized roles based on personas, pre-built ML activities, and custom policies, significantly reducing the time required. Now, with this latest AWS CDK support, the ability to define roles is further expanded to support infrastructure as code. This empowers ML practitioners to work programmatically in SageMaker, enhancing efficiency and enabling seamless integration into their workflows.

We would like to hear from you on how this new feature is helping you. Try out the new AWS CDK support for SageMaker Role Manager and send us your feedback!

To learn more about how to use SageMaker Role Manager, refer to the SageMaker Role Manager Developer Guide.


About The Authors

Akash Bhatia is a Principal Solution Architect with experience spanning multiple industries, including Manufacturing, Automotive, Retail, and Space and Technology. Currently working in Amazon Web Services Enterprise Segments, Akash works closely with a diverse range of clients, including Fortune 100 companies and start-ups, to facilitate their cloud migration journey. In addition to his technical expertise, Akash has led product and program management, having successfully overseen numerous large-scale initiatives throughout his career.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys riding his motorcycle, playing tennis, and photography.

Ozan Eken is a Senior Product Manager at Amazon Web Services. He has over 15 years of experience in consulting and product management. He is passionate about building governance products and admin capabilities in machine learning for enterprise customers. Outside of work, he likes exploring different outdoor activities and watching soccer.

Research Focus: Week of June 19, 2023

Microsoft Research Focus 18 | Week of June 19, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESOURCE

Responsible AI Maturity Model

As the use of AI continues to surge, new government regulations are expected. But the organizations that build and use AI technologies needn’t wait to devise best practices for developing and deploying AI systems responsibly. Many companies have adopted responsible AI (RAI) principles as a form of self-regulation. Yet, effectively translating these principles into practice is challenging.

To help organizations identify their current and desired levels of RAI maturity, researchers at Microsoft have developed the Responsible AI Maturity Model (RAI MM). The RAI MM is a framework containing 24 empirically derived dimensions that are key to an organization’s RAI maturity, and a roadmap of maturity progression so organizations and teams can identify where they are and where they could go next.

Derived from interviews and focus groups with over 90 RAI specialists and AI practitioners, the RAI MM can help organizations and teams navigate their RAI journey, even as RAI continues to evolve.

NEW RESEARCH

FoundWright helps people re-find web content they previously discovered

Re-finding information is a common task—most online search requests involve re-finding information. However, this can be difficult when people struggle to express what they seek. People may forget the exact details of the information they want to re-find, making it hard to craft a query to locate it. People may also struggle to recover information within web repositories, such as bookmarks or history, because these do not capture enough information or offer an experience that supports ambiguous queries. As a result, people can feel overwhelmed and cognitively exhausted when faced with a re-finding task.

A new paper from Microsoft researchers, FoundWright: A System to Help People Re-find Pages from Their Web-history, introduces a system to address these problems. FoundWright leverages recent advances in language transformer models to expand people’s ability to express what they seek by defining concepts that can attract documents with semantically similar content. The researchers used FoundWright as a design probe to understand how people create and use concepts; how this expanded ability helps re-finding; and how people engage and collaborate with FoundWright’s machine learning support. The research reveals that this expanded way of expressing re-finding goals complements traditional searching and browsing.
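
The core mechanism described here, defining a "concept" that attracts semantically similar pages, can be approximated with generic text embeddings and cosine similarity. The sketch below uses a toy hashed bag-of-words stand-in for a real embedding model and is an illustration of the idea, not FoundWright's implementation.

    # Illustration of concept-based re-finding: rank web-history pages by cosine
    # similarity between a user-defined "concept" and each page's embedding.
    # embed() is a toy stand-in for a real embedding model; this is NOT FoundWright's code.
    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        """Toy hashed bag-of-words embedding; swap in a sentence-embedding model in practice."""
        vec = np.zeros(dim)
        for token in text.lower().split():
            vec[hash(token) % dim] += 1.0
        return vec

    def rank_history(concept_terms: list, pages: dict) -> list:
        """Return (url, similarity) pairs sorted from most to least related to the concept."""
        concept_vec = np.mean([embed(t) for t in concept_terms], axis=0)
        scores = []
        for url, text in pages.items():
            page_vec = embed(text)
            sim = float(page_vec @ concept_vec /
                        (np.linalg.norm(page_vec) * np.linalg.norm(concept_vec) + 1e-9))
            scores.append((url, sim))
        return sorted(scores, key=lambda kv: kv[1], reverse=True)

    # Toy usage.
    history = {"a.example/usd": "openusd scene description tutorial",
               "b.example/cake": "chocolate cake recipe"}
    print(rank_history(["universal scene description"], history))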


NEW RESEARCH

Trace-Guided Inductive Synthesis of Recursive Functional Programs

In recent years, researchers have made significant advances in synthesis of recursive functional programs, including progress in inductive synthesis of recursive programs from input-output examples. The latter problem, however, continues to pose several challenges.

In a new paper: Trace-Guided Inductive Synthesis of Recursive Functional Programs, which received a distinguished paper award from the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2023), researchers from Microsoft and Purdue University propose a novel trace-guided approach to tackle the challenges of ambiguity and generalization in synthesis of recursive functional programs from examples. This approach augments the search space of programs with recursion traces consisting of sequences of recursive subcalls of programs. It is based on a new version space algebra (VSA) for succinct representation and efficient manipulation of pairs of recursion traces and programs that are consistent with each other. The researchers implement this approach in a tool called SyRup. Evaluating SyRup on benchmarks from prior work demonstrates that it not only requires fewer examples to achieve a certain success rate than existing synthesizers, but is also less sensitive to the quality of the examples.

These results indicate that utilizing recursion traces to differentiate satisfying programs with similar sizes is applicable to a wide range of tasks.


NEW RESEARCH

Wait-Free Weak Reference Counting

Reference counting is a common approach to memory management. One challenge with reference counting is cycles that prevent objects from being deallocated. Systems such as the C++ and Rust standard libraries introduce two types of reference: strong and weak. A strong reference allows access to the object and prevents the object from being deallocated, while a weak reference only prevents deallocation. A weak reference can be upgraded to provide a strong reference if other strong references to the object exist. Hence, the upgrade operation is partial, and may fail dynamically. The classic implementation of this upgrade operation is not wait-free—it can take arbitrarily long to complete if there is contention on the reference count.

In a new paper: Wait-Free Weak Reference Counting, researchers from Microsoft propose a wait-free algorithm for weak reference counting, which requires primitive wait-free atomic operations of “compare and swap”, and “fetch and add”. The paper includes a correctness proof of the algorithm using the Starling verification tool, a full implementation in C++, and a demonstration of the best- and worst-case performance using micro-benchmarks.

The new algorithm is faster than the classic algorithm in the best case, but has an overhead in the worst case. The researchers present a more complex algorithm that effectively combines the classic algorithm and the wait-free algorithm, delivering much better performance in the worst case, while maintaining the benefits of the wait-free algorithm.


NEW RESEARCH

Disaggregating Stateful Network Functions

For security, isolation, metering, and other purposes, public clouds today implement complex network functions at every server. Today’s implementations, in software or on FPGAs and ASICs that are attached to each host, are becoming increasingly complex and costly, creating bottlenecks to scalability.

In a new paper: Disaggregating Stateful Network Functions, researchers from Microsoft present a different design that disaggregates network function processing off the host and into shared resource pools by making novel use of appliances which tightly integrate general-purpose ARM cores with high-speed stateful match processing ASICs. When work is skewed across VMs, such disaggregation can offer better reliability and performance over the state of the art, at a lower per-server cost. The paper, which was published at the 2023 USENIX Symposium on Networked Systems Design and Implementation (NSDI), includes solutions to the consequent challenges and presents results from a production deployment at a large public cloud.


NEW RESEARCH

Industrial-Strength Controlled Concurrency Testing for C# Programs with Coyote

Testing programs with concurrency is challenging because their execution is non-deterministic, making bugs hard to find, reproduce, and debug. Non-determinism can cause flaky tests—which may pass or fail without any code changes—creating a significant engineering burden on development teams. As concurrency is essential for building modern multi-threaded or distributed systems, solutions are required to help developers test their concurrent code for correctness.

Testing concurrent programs comes with two main challenges. First is the problem of reproducibility or control, while the second challenge is the state-space explosion problem. A concurrent program, even with a fixed test input, can have an enormous number of possible behaviors.

In a new research paper: Industrial-Strength Controlled Concurrency Testing for C# Programs with Coyote, researchers from Microsoft describe the design and implementation of the open-source tool Coyote for testing concurrent programs written in the C# language. This research won a 2023 Best Software Science Paper award from The European Association of Software Science and Technology (EASST).

Accelerate time to business insights with the Amazon SageMaker Data Wrangler direct connection to Snowflake

Amazon SageMaker Data Wrangler is a single visual interface that reduces the time required to prepare data and perform feature engineering from weeks to minutes with the ability to select and clean data, create features, and automate data preparation in machine learning (ML) workflows without writing any code.

SageMaker Data Wrangler supports Snowflake, a popular data source for users who want to perform ML. We launched the Snowflake direct connection from SageMaker Data Wrangler to improve the customer experience. Before the launch of this feature, administrators were required to set up the initial storage integration to connect with Snowflake to create features for ML in Data Wrangler. That setup included provisioning Amazon Simple Storage Service (Amazon S3) buckets, AWS Identity and Access Management (IAM) access permissions, Snowflake storage integration for individual users, and an ongoing mechanism to manage or clean up data copies in Amazon S3. This process is not scalable for customers with strict data access control and a large number of users.

In this post, we show how Snowflake’s direct connection in SageMaker Data Wrangler simplifies the administrator’s experience and data scientist’s ML journey from data to business insights.

Solution overview

In this solution, we use SageMaker Data Wrangler to speed up data preparation for ML and Amazon SageMaker Autopilot to automatically build, train, and fine-tune the ML models based on your data. Both services are designed specifically to increase productivity and shorten time to value for ML practitioners. We also demonstrate the simplified data access from SageMaker Data Wrangler to Snowflake with direct connection to query and create features for ML.

Refer to the diagram below for an overview of the low-code ML process with Snowflake, SageMaker Data Wrangler, and SageMaker Autopilot.

The workflow includes the following steps:

  1. Navigate to SageMaker Data Wrangler for your data preparation and feature engineering tasks.
    • Set up the Snowflake connection with SageMaker Data Wrangler.
    • Explore your Snowflake tables in SageMaker Data Wrangler, create an ML dataset, and perform feature engineering.
  2. Train and test the models using SageMaker Data Wrangler and SageMaker Autopilot.
  3. Load the best model to a real-time inference endpoint for predictions.
  4. Use a Python notebook to invoke the launched real-time inference endpoint.

Prerequisites

For this post, the administrator needs the following prerequisites:

Data scientists should have the following prerequisites:

Lastly, you should prepare your data for Snowflake:

  • We use credit card transaction data from Kaggle to build ML models for detecting fraudulent credit card transactions, so customers are not charged for items that they didn’t purchase. The dataset includes credit card transactions in September 2013 made by European cardholders.
  • Install the SnowSQL client on your local machine so you can use it to upload the dataset to a Snowflake table.

The following steps show how to prepare and load the dataset into the Snowflake database. This is a one-time setup.

Snowflake table and data preparation

Complete the following steps for this one-time setup:

  1. First, as the administrator, create a Snowflake virtual warehouse, user, and role, and grant access to other users such as the data scientists to create a database and stage data for their ML use cases:
    -- Use the role SECURITYADMIN to create Role and User
    USE ROLE SECURITYADMIN;
    
    -- Create a new role 'ML Role'
    CREATE OR REPLACE ROLE ML_ROLE COMMENT='ML Role';
    GRANT ROLE ML_ROLE TO ROLE SYSADMIN;
    
    -- Create a new user and password and grant the role to the user
    CREATE OR REPLACE USER ML_USER PASSWORD='<REPLACE_PASSWORD>'
    DEFAULT_ROLE=ML_ROLE
    DEFAULT_WAREHOUSE=ML_WH
    DEFAULT_NAMESPACE=ML_WORKSHOP.PUBLIC
    COMMENT='ML User';
    GRANT ROLE ML_ROLE TO USER ML_USER;
    
    -- Grant privileges to the role
    USE ROLE ACCOUNTADMIN;
    GRANT CREATE DATABASE ON ACCOUNT TO ROLE ML_ROLE;
    
    --Create Warehouse for AI/ML work
    USE ROLE SYSADMIN;
    
    CREATE OR REPLACE WAREHOUSE ML_WH
    WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 120 AUTO_RESUME = true INITIALLY_SUSPENDED = TRUE;
    
    GRANT ALL ON WAREHOUSE ML_WH TO ROLE ML_ROLE;
    

  2. As the data scientist, let’s now create a database and import the credit card transactions into the Snowflake database to access the data from SageMaker Data Wrangler. For illustration purposes, we create a Snowflake database named SF_FIN_TRANSACTION:
    -- Select the role and the warehouse
    USE ROLE ML_ROLE;
    USE WAREHOUSE ML_WH;
    
    -- Create the DB to import the financial transactions
    CREATE DATABASE IF NOT EXISTS sf_fin_transaction;
    
    -- Create CSV File Format
    create or replace file format my_csv_format
    type = csv
    field_delimiter = ','
    skip_header = 1
    null_if = ('NULL', 'null')
    empty_field_as_null = true
    compression = gzip;
    

  3. Download the dataset CSV file to your local machine and create a stage to load the data into the database table. Update the file path to point to the downloaded dataset location before running the PUT command for importing the data to the created stage:
    -- Create a Snowflake named internal stage to store the transactions csv file
    CREATE OR REPLACE STAGE my_stage
    FILE_FORMAT = my_csv_format;
    
    -- Import the file in to the stage
    -- This command needs to be run from the SnowSQL client and not on the web UI
    PUT file:///Users/*******/Downloads/creditcard.csv @my_stage;
    
    -- Check whether the import was successful
    LIST @my_stage;
    

  4. Create a table named credit_card_transaction:
    -- Create table and define the columns mapped to the csv transactions file
    create or replace table credit_card_transaction (
    Time integer,
    V1 float, V2 float, V3 float,
    V4 float, V5 float, V6 float,
    V7 float, V8 float, V9 float,
    V10 float,V11 float,V12 float,
    V13 float,V14 float,V15 float,
    V16 float,V17 float,V18 float,
    V19 float,V20 float,V21 float,
    V22 float,V23 float,V24 float,
    V25 float,V26 float,V27 float,
    V28 float,Amount float,
    Class varchar(5)
    );
    

  5. Import the data into the created table from the stage:
    -- Import the transactions in to a new table named 'credit_card_transaction'
    copy into credit_card_transaction from @my_stage ON_ERROR = CONTINUE;
    
    -- Check whether the table was successfully created
    select * from credit_card_transaction limit 100;

Set up the SageMaker Data Wrangler and Snowflake connection

After we prepare the dataset to use with SageMaker Data Wrangler, let us create a new Snowflake connection in SageMaker Data Wrangler to connect to the sf_fin_transaction database in Snowflake and query the credit_card_transaction table:

  1. Choose Snowflake on the SageMaker Data Wrangler Connection page.
  2. Provide a name to identify your connection.
  3. Select your authentication method to connect with the Snowflake database:
    • If using basic authentication, provide the user name and password shared by your Snowflake administrator. For this post, we use basic authentication to connect to Snowflake using the user credentials we created in the previous step.
    • If you are using OAuth, provide your identity provider credentials.

SageMaker Data Wrangler by default queries your data directly from Snowflake without creating any data copies in S3 buckets. SageMaker Data Wrangler’s new usability enhancement uses Apache Spark to integrate with Snowflake to prepare and seamlessly create a dataset for your ML journey.

So far, we have created the database on Snowflake, imported the CSV file into the Snowflake table, created Snowflake credentials, and created a connector on SageMaker Data Wrangler to connect to Snowflake. To validate the configured Snowflake connection, run the following query on the created Snowflake table:

select * from credit_card_transaction;

Note that the storage integration option that was required before is now optional in the advanced settings.

Explore Snowflake data

After you validate the query results, choose Import to save the query results as the dataset. We use this extracted dataset for exploratory data analysis and feature engineering.

You can choose to sample the data from Snowflake in the SageMaker Data Wrangler UI. Another option is to download complete data for your ML model training use cases using SageMaker Data Wrangler processing jobs.

Perform exploratory data analysis in SageMaker Data Wrangler

The data within Data Wrangler needs to be engineered before it can be used to train a model. In this section, we demonstrate how to perform feature engineering on the data from Snowflake using SageMaker Data Wrangler’s built-in capabilities.

First, let’s use the Data Quality and Insights Report feature within SageMaker Data Wrangler to generate reports to automatically verify the data quality and detect abnormalities in the data from Snowflake.

You can use the report to help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention. To understand the report details, refer to Accelerate data preparation with data quality and insights in Amazon SageMaker Data Wrangler.

After you check out the data type matching applied by SageMaker Data Wrangler, complete the following steps:

  1. Choose the plus sign next to Data types and choose Add analysis.
  2. For Analysis type, choose Data Quality and Insights Report.
  3. Choose Create.
  4. Refer to the Data Quality and Insights Report details to check out high-priority warnings.

You can choose to resolve the warnings reported before proceeding with your ML journey.

The target column Class to be predicted is classified as a string. First, let’s apply a transformation to strip stray leading and trailing characters.

  1. Choose Add step and choose Format string.
  2. In the list of transforms, choose Strip left and right.
  3. Enter the characters to remove and choose Add.

Next, we convert the target column Class from the string data type to Boolean because the transaction is either legitimate or fraudulent.

  1. Choose Add step.
  2. Choose Parse column as type.
  3. For Column, choose Class.
  4. For From, choose String.
  5. For To, choose Boolean.
  6. Choose Add.

After the target column transformation, we reduce the number of feature columns, because there are over 30 features in the original dataset. We use Principal Component Analysis (PCA) to reduce the dimensions based on feature importance. To understand more about PCA and dimensionality reduction, refer to Principal Component Analysis (PCA) Algorithm.

  1. Choose Add step.
  2. Choose Dimensionality Reduction.
  3. For Transform, choose Principal component analysis.
  4. For Input columns, choose all the columns except the target column Class.
  5. Choose the plus sign next to Data flow and choose Add analysis.
  6. For Analysis type, choose Quick Model.
  7. For Analysis name, enter a name.
  8. For Label, choose Class.
  9. Choose Run.

Based on the PCA results, you can decide which features to use for building the model. In the following screenshot, the graph shows the features (or dimensions) ordered based on highest to lowest importance to predict the target class, which in this dataset is whether the transaction is fraudulent or valid.

You can choose to reduce the number of features based on this analysis, but for this post, we leave the defaults as is.
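
If you prefer to prototype this dimensionality-reduction step in a notebook before, or instead of, configuring it in the Data Wrangler UI, a rough scikit-learn equivalent is sketched below. It assumes the transactions have already been exported into a pandas DataFrame and is not part of the Data Wrangler flow itself.

    # Rough notebook equivalent of the PCA step configured in the Data Wrangler UI.
    # Assumes the transactions are already in a pandas DataFrame `df` with a 'Class'
    # target column; not part of the Data Wrangler flow itself.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def reduce_dimensions(df: pd.DataFrame, target_col: str = "Class",
                          n_components: int = 10) -> pd.DataFrame:
        features = df.drop(columns=[target_col])
        scaled = StandardScaler().fit_transform(features)   # PCA is sensitive to feature scale
        pca = PCA(n_components=n_components)
        components = pca.fit_transform(scaled)
        reduced = pd.DataFrame(components, columns=[f"pc_{i}" for i in range(n_components)])
        reduced[target_col] = df[target_col].values
        print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
        return reduced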

This concludes our feature engineering process, although you may choose to run the quick model and create a Data Quality and Insights Report again to understand the data before performing further optimizations.

Export data and train the model

In the next step, we use SageMaker Autopilot to automatically build, train, and tune the best ML models based on your data. With SageMaker Autopilot, you still maintain full control and visibility of your data and model.

Now that we have completed the exploration and feature engineering, let’s train a model on the dataset and export the data to train the ML model using SageMaker Autopilot.

  1. On the Training tab, choose Export and train.

We can monitor the export progress while we wait for it to complete.

Let’s configure SageMaker Autopilot to run an automated training job by specifying the target we want to predict and the type of problem. In this case, because we’re training the dataset to predict whether the transaction is fraudulent or valid, we use binary classification.

  1. Enter a name for your experiment, provide the S3 location of the exported data, and choose Next: Target and features.
  2. For Target, choose Class as the column to predict.
  3. Choose Next: Training method.

Let’s allow SageMaker Autopilot to decide the training method based on the dataset.

  1. For Training method and algorithms, select Auto.

To understand more about the training modes supported by SageMaker Autopilot, refer to Training modes and algorithm support.

  1. Choose Next: Deployment and advanced settings.
  2. For Deployment option, choose Auto deploy the best model with transforms from Data Wrangler, which loads the best model for inference after the experimentation is complete.
  3. Enter a name for your endpoint.
  4. For Select the machine learning problem type, choose Binary classification.
  5. For Objective metric, choose F1.
  6. Choose Next: Review and create.
  7. Choose Create experiment.

This starts a SageMaker Autopilot job that creates a set of training jobs using combinations of hyperparameters to optimize the objective metric.

Wait for SageMaker Autopilot to finish building and evaluating the models to identify the best one.

Launch a real-time inference endpoint to test the best model

SageMaker Autopilot runs experiments to determine the best model that can classify credit card transactions as legitimate or fraudulent.

When SageMaker Autopilot completes the experiment, we can view the training results with the evaluation metrics and explore the best model from the SageMaker Autopilot job description page.

  1. Select the best model and choose Deploy model.

We use a real-time inference endpoint to test the best model created through SageMaker Autopilot.

  1. Select Make real-time predictions.

When the endpoint is available, we can pass the payload and get inference results.

Let’s launch a Python notebook to use the inference endpoint.

  1. On the SageMaker Studio console, choose the folder icon in the navigation pane and choose Create notebook.
  2. Use the following Python code to invoke the deployed real-time inference endpoint:
    # Library imports
    import os
    import io
    import boto3
    import json
    import csv
    
    #: Define the endpoint's name.
    ENDPOINT_NAME = 'SnowFlake-FraudDetection' # replace the endpoint name as per your config
    runtime = boto3.client('runtime.sagemaker')
    
    #: Define a test payload to send to your endpoint.
    payload = {
        "body": {
        "TIME": 152895,
        "V1": 2.021155535,
        "V2": 0.05372872624,
        "V3": -1.620399104,
        "V4": 0.3530165253,
        "V5": 0.3048483853,
        "V6": -0.6850955461,
        "V7": 0.02483335885,
        "V8": -0.05101346021,
        "V9": 0.3550896835,
        "V10": -0.1830053153,
        "V11": 1.148091498,
        "V12": 0.4283365505,
        "V13": -0.9347237892,
        "V14": -0.4615291327,
        "V15": -0.4124343184,
        "V16": 0.4993445934,
        "V17": 0.3411548305,
        "V18": 0.2343833846,
        "V19": 0.278223588,
        "V20": -0.2104513475,
        "V21": -0.3116427235,
        "V22": -0.8690778214,
        "V23": 0.3624146958,
        "V24": 0.6455923598,
        "V25": -0.3424913329,
        "V26": 0.1456884618,
        "V27": -0.07174890419,
        "V28": -0.040882382,
        "AMOUNT": 0.27
        }
    }
    
    #: Submit an API request and capture the response object.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='text/csv',
        Body=str(payload)
    )
    
    #: Print the model endpoint's output.
    print(response['Body'].read().decode()) 
    

The output shows the result as false, which implies the sample feature data is not fraudulent.

Clean up

To make sure you don’t incur charges after completing this tutorial, shut down the SageMaker Data Wrangler application and the notebook instance used to perform inference. You should also delete the inference endpoint you created with SageMaker Autopilot to prevent additional charges.
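
If you prefer to clean up programmatically, the following sketch deletes the endpoint and its configuration with boto3; it assumes the endpoint name you entered earlier and does not cover the Data Wrangler application or the notebook instance.

```python
import boto3

sm = boto3.client("sagemaker")
endpoint_name = "SnowFlake-FraudDetection"  # replace with your endpoint name

# Look up the endpoint configuration before deleting the endpoint itself.
config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=config_name)
```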

Conclusion

In this post, we demonstrated how to bring your data from Snowflake directly without creating any intermediate copies in the process. You can either sample or load your complete dataset into SageMaker Data Wrangler directly from Snowflake. You can then explore the data, clean it, and perform feature engineering using SageMaker Data Wrangler’s visual interface.

We also highlighted how you can easily train and tune a model with SageMaker Autopilot directly from the SageMaker Data Wrangler user interface. With SageMaker Data Wrangler and SageMaker Autopilot integration, we can quickly build a model after completing feature engineering, without writing any code. Then we referenced SageMaker Autopilot’s best model to run inferences using a real-time endpoint.

Try out the new Snowflake direct integration with SageMaker Data Wrangler today to easily build ML models with your data using SageMaker.


About the authors

Hariharan Suresh is a Senior Solutions Architect at AWS. He is passionate about databases, machine learning, and designing innovative solutions. Prior to joining AWS, Hariharan was a product architect, core banking implementation specialist, and developer, and worked with BFSI organizations for over 11 years. Outside of technology, he enjoys paragliding and cycling.

Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He helps enterprise customers migrate and modernize their workloads on the AWS Cloud. He is a cloud architect with over 23 years of experience designing and developing enterprise, large-scale, and distributed software systems. He specializes in machine learning and data analytics, with a focus on data and feature engineering. He is an aspiring marathon runner, and his hobbies include hiking, bike riding, and spending time with his wife and two boys.

Tim Song is a Software Development Engineer at AWS SageMaker. With over 10 years of experience as a software developer, consultant, and tech leader, he has a demonstrated ability to deliver scalable and reliable products and solve complex problems. In his spare time, he enjoys nature, outdoor running, and hiking.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience in working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.

Read More

Deploy a serverless ML inference endpoint of large language models using FastAPI, AWS Lambda, and AWS CDK


For data scientists, moving machine learning (ML) models from proof of concept to production often presents a significant challenge. One of the main challenges can be deploying a well-performing, locally trained model to the cloud for inference and use in other applications. It can be cumbersome to manage the process, but with the right tool, you can significantly reduce the required effort.

Amazon SageMaker inference, which was made generally available in April 2022, makes it easy for you to deploy ML models into production to make predictions at scale, providing a broad selection of ML infrastructure and model deployment options to help meet all kinds of ML inference needs. You can use SageMaker Serverless Inference endpoints for workloads that have idle periods between traffic spurts and can tolerate cold starts. The endpoints scale out automatically based on traffic and take away the undifferentiated heavy lifting of selecting and managing servers. Additionally, you can use AWS Lambda directly to expose your models and deploy your ML applications using your preferred open-source framework, which can prove to be more flexible and cost-effective.

FastAPI is a modern, high-performance web framework for building APIs with Python. It stands out for developing serverless applications with RESTful microservices and for use cases requiring ML inference at scale across multiple industries. Its ease of use and built-in features, such as automatic API documentation, make it a popular choice among ML engineers for deploying high-performance inference APIs. You can define and organize your routes using FastAPI's out-of-the-box functionality to handle growing business logic as needed, test locally, host the app on Lambda, and then expose it through a single API gateway. This lets you bring an open-source web framework to Lambda without heavy lifting or refactoring your code.
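
To make that concrete, here is a minimal, hypothetical FastAPI app: a single decorated function gives you a typed route, request validation, and auto-generated docs at /docs. This is only a sketch and is unrelated to the model deployed later in this post.

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/question")
def answer(question: str, context: str) -> dict:
    # Placeholder logic; a real inference API would call the ML model here.
    return {"question": question, "context": context, "answer": "not implemented"}
```

Running it locally (for example with uvicorn) exposes the route and the interactive documentation at /docs.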

This post shows you how to easily deploy and run serverless ML inference by exposing your ML model as an endpoint using FastAPI, Docker, Lambda, and Amazon API Gateway. We also show you how to automate the deployment using the AWS Cloud Development Kit (AWS CDK).

Solution overview

The following diagram shows the architecture of the solution we deploy in this post.

Scope of Solution

Prerequisites

You must have the following prerequisites:

  • Python3 installed, along with virtualenv for creating and managing virtual environments in Python
  • aws-cdk v2 installed so you can use the AWS CDK CLI
  • Docker installed and running on your local machine

Test if all the necessary software is installed:

  1. The AWS Command Line Interface (AWS CLI) is needed. Log in to your account and choose the Region where you want to deploy the solution.
  2. Use the following code to check your Python version:
    python3 --version

  3. Check if virtualenv is installed for creating and managing virtual environments in Python. Strictly speaking, this is not a hard requirement, but it makes it easier to follow along with this post. Use the following code:
    python3 -m virtualenv --version

  4. Check if cdk is installed. This will be used to deploy our solution.
    cdk --version

  5. Check if Docker is installed. Our solution will make your model accessible through a Docker image to Lambda. To build this image locally, we need Docker.
    docker --version

  6. Make sure Docker is up and running with the following code:
    docker ps

How to structure your FastAPI project using AWS CDK

We use the following directory structure for our project (ignoring some boilerplate AWS CDK code that is immaterial in the context of this post):

```

fastapi_model_serving
│
└───.venv
│
└───fastapi_model_serving
│   │   __init__.py
│   │   fastapi_model_serving_stack.py
│   │
│   └───model_endpoint
│       └───docker
│       │      Dockerfile
│       │      serving_api.tar.gz
│
│
│       └───runtime
│            └───serving_api
│                    requirements.txt
│                    serving_api.py
│                └───custom_lambda_utils
│                     └───model_artifacts
│                            ...
│                     └───scripts
│                            inference.py
│
└───templates
│   └───api
│   │     api.py
│   └───dummy
│         dummy.py
│
│ app.py
│   cdk.json
│   README.md
│   requirements.txt
│   init-lambda-code.sh

```

The directory follows the recommended structure of AWS CDK projects for Python.

The most important part of this repository is the fastapi_model_serving directory. It contains the code that will define the AWS CDK stack and the resources that are going to be used for model serving.

The fastapi_model_serving directory contains the model_endpoint subdirectory, which holds all the assets that make up our serverless endpoint: the Dockerfile to build the Docker image that Lambda will use, the Lambda function code that uses FastAPI to handle inference requests and route them to the correct endpoint, and the artifacts of the model that we want to deploy. model_endpoint also contains the following:

  • docker – This subdirectory contains the following:
    • Dockerfile – Used to build the image for the Lambda function with all the artifacts (Lambda function code, model artifacts, and so on) in the right place so that they can be used without issues.
    • serving_api.tar.gz – A tarball that contains all the assets from the runtime folder that are necessary for building the Docker image. We discuss how to create the .tar.gz file later in this post.
  • runtime – This subdirectory contains the following:
    • serving_api – The code for the Lambda function and its dependencies, specified in the requirements.txt file (a minimal sketch of such a handler follows this list).
    • custom_lambda_utils – Includes an inference script that loads the necessary model artifacts so that the model can be passed to the serving_api, which then exposes it as an endpoint.
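
As a concrete illustration of the serving_api piece, here is a minimal sketch of how a FastAPI app is commonly exposed as a Lambda handler. It assumes the Mangum adapter and a hypothetical predict helper from the inference script; the actual code in the repository may differ.

```python
from fastapi import FastAPI
from mangum import Mangum

# Hypothetical import: the inference script that loads the model artifacts.
from custom_lambda_utils.scripts.inference import predict

app = FastAPI()

@app.get("/")
def root() -> dict:
    return {"message": "hello world"}

@app.get("/question")
def question(question: str, context: str) -> dict:
    # Delegate to the model loaded by the inference script.
    return predict(question=question, context=context)

# Mangum translates API Gateway events into ASGI requests so FastAPI can run on Lambda.
handler = Mangum(app)
```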

Additionally, we have the templates directory, which provides a template of folder structures and files where you can define your customized code and APIs following the sample we went through earlier. The templates directory contains dummy code that you can use to create new Lambda functions:

  • dummy – Contains the code that implements the structure of an ordinary Lambda function using the Python runtime
  • api – Contains the code that implements a Lambda function that wraps a FastAPI endpoint around an existing API gateway

Deploy the solution

By default, the code is deployed inside the eu-west-1 region. If you want to change the Region, you can change the DEPLOYMENT_REGION context variable in the cdk.json file.

Keep in mind, however, that the solution tries to deploy a Lambda function on top of the arm64 architecture, and that this feature might not be available in all Regions. In this case, you need to change the architecture parameter in the fastapi_model_serving_stack.py file, as well as the first line of the Dockerfile inside the Docker directory, to host this solution on the x86 architecture.
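
For orientation, the architecture switch typically amounts to changing one enum on the Lambda construct in the stack definition. The following is an illustrative AWS CDK v2 sketch; the construct IDs, asset path, and other arguments are assumptions and will not match the repository exactly. Remember to also adjust the base image in the first line of the Dockerfile.

```python
from aws_cdk import Stack, aws_lambda as _lambda
from constructs import Construct

class FastapiModelServingStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Illustrative only; argument names and values in the repository may differ.
        _lambda.DockerImageFunction(
            self, "ServingApi",
            code=_lambda.DockerImageCode.from_image_asset(
                "fastapi_model_serving/model_endpoint/docker"
            ),
            # Use Architecture.ARM_64 where arm64 Lambda is available; switch to X86_64 otherwise.
            architecture=_lambda.Architecture.X86_64,
        )
```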

To deploy the solution, complete the following steps:

  1. Clone the GitHub repository:
    git clone https://github.com/aws-samples/lambda-serverless-inference-fastapi

    Because we want to showcase that the solution can work with model artifacts that you train locally, the serving_api.tar.gz file contains a sample model artifact of a DistilBERT model from the Hugging Face model hub, pretrained for a question answering task. The download can take around 3–5 minutes. Now, let’s set up the environment.
  2. Download the pretrained model that will be deployed from the Hugging Face model hub into the ./model_endpoint/runtime/serving_api/custom_lambda_utils/model_artifacts directory. This command also creates a virtual environment and installs all dependencies that are needed. You only need to run it once:
    make prep

    This command can take around 5 minutes (depending on your internet bandwidth) because it needs to download the model artifacts.
  3. Package the model artifacts inside a .tar.gz archive that will be used inside the Docker image that is built in the AWS CDK stack. Run this command whenever you make changes to the model artifacts or the API itself so that your serving endpoint is always packaged with the most up-to-date version:
    make package_model

    With the artifacts in place, we can deploy the AWS CDK stack to your AWS account.
  4. Run cdk bootstrap if it’s your first time deploying an AWS CDK app into an environment (account + Region combination):
    make cdk_bootstrap

    This stack includes resources that are needed for the toolkit’s operation. For example, the stack includes an Amazon Simple Storage Service (Amazon S3) bucket that is used to store templates and assets during the deployment process.

    Because we’re building Docker images locally in this AWS CDK deployment, we need to ensure that the Docker daemon is running before we can deploy this stack via the AWS CDK CLI.

  5. To check whether or not the Docker daemon is running on your system, use the following command:
    docker ps

    If you don’t get an error message, you should be ready to deploy the solution.

  6. Deploy the solution with the following command:
    make deploy

    This step can take around 5–10 minutes due to building and pushing the Docker image.

Troubleshooting

If you’re a Mac user, you may encounter an error when logging into Amazon Elastic Container Registry (Amazon ECR) with the Docker login, such as Error saving credentials ... not implemented. For example:

exited with error code 1: Error saving credentials: error storing credentials - err: exit status 1,...dial unix backend.sock: connect: connection refused

Before you can use Lambda on top of Docker containers with the AWS CDK, you may need to change the ~/.docker/config.json file. More specifically, change the credsStore parameter in ~/.docker/config.json to osxkeychain. That resolves Amazon ECR login issues on a Mac.

Run real-time inference

After your AWS CloudFormation stack is deployed successfully, go to the Outputs tab for your stack on the AWS CloudFormation console and open the endpoint URL. Now our model is accessible via the endpoint URL and we’re ready to run real-time inference.

Navigate to the URL to verify that you see the “hello world” message, then add /docs to the address to open the interactive Swagger UI page. There might be some cold start time, so you may need to wait or refresh a few times.

FastAPI Docs web page

From the landing page of the FastAPI Swagger UI, you can run requests against the root / route or the /question route.

From /, you can run the API and get the “hello world” message.

From /question, you can run the API and run ML inference on the model we deployed for a question answering case. For example, we use the question What is the color of my car now? with the context My car used to be blue but I painted red.

FastAPI web page question

When you choose Execute, based on the given context, the model will answer the question with a response, as shown in the following screenshot.

Execute result

In the response body, you can see the answer with the confidence score from the model. You could also experiment with other examples or embed the API in your existing application.

Alternatively, you can run the inference via code. Here is one example written in Python, using the requests library:

import requests

url = "https://<YOUR_API_GATEWAY_ENDPOINT_ID>.execute-api.<YOUR_ENDPOINT_REGION>.amazonaws.com/prod/question"
params = {
    "question": "What is the color of my car now?",
    "context": "My car used to be blue but I painted red",
}

response = requests.get(url, params=params)

print(response.text)

The code outputs a string similar to the following:

'{"score":0.6947233080863953,"start":38,"end":41,"answer":"red"}'

If you are interested in learning more about deploying generative AI and large language models on AWS, check out the related posts on the AWS Machine Learning Blog.

Clean up

Inside the root directory of your repository, run the following code to clean up your resources:

make destroy

Conclusion

In this post, we introduced how you can use Lambda to deploy your trained ML model using your preferred web application framework, such as FastAPI. We provided a detailed code repository that you can deploy, and you retain the flexibility to swap in whichever trained model artifacts you choose. The performance can depend on how you implement and deploy the model.

You are welcome to try it out yourself, and we’re excited to hear your feedback!


About the Authors

Tingyi Li is an Enterprise Solutions Architect at AWS, based in Stockholm, Sweden, supporting customers in the Nordics. She enjoys helping customers with the architecture, design, and development of cloud-optimized infrastructure solutions. She specializes in AI and machine learning and is interested in empowering customers with intelligence in their AI/ML applications. In her spare time, she is also a part-time illustrator who writes novels and plays the piano.

Demir Catovic is a Machine Learning Engineer at AWS, based in Zurich, Switzerland. He engages with customers and helps them implement scalable and fully functional ML applications. He is passionate about building and productionizing machine learning applications for customers and is always keen to explore new trends and cutting-edge technologies in the AI/ML world.

Read More

Preference learning with automated feedback for cache eviction


Caching is a ubiquitous idea in computer science that significantly improves the performance of storage and retrieval systems by storing a subset of popular items closer to the client based on request patterns. An important algorithmic piece of cache management is the decision policy used for dynamically updating the set of items being stored, which has been extensively optimized over several decades, resulting in several efficient and robust heuristics. While applying machine learning to cache policies has shown promising results in recent years (e.g., LRB, LHD, storage applications), it remains a challenge to outperform robust heuristics in a way that can generalize reliably beyond benchmarks to production settings, while maintaining competitive compute and memory overheads.

In “HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube Content Delivery Network”, presented at NSDI 2023, we introduce a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. The Heuristic Aided Learned Preference (HALP) framework is a meta-algorithm that uses randomization to merge a lightweight heuristic baseline eviction rule with a learned reward model. The reward model is a lightweight neural network that is continuously trained with ongoing automated feedback on preference comparisons designed to mimic the offline oracle. We discuss how HALP has improved infrastructure efficiency and user video playback latency for YouTube’s content delivery network.

Learned preferences for cache eviction decisions

The HALP framework computes cache eviction decisions based on two components: (1) a neural reward model trained with automated feedback via preference learning, and (2) a meta-algorithm that combines a learned reward model with a fast heuristic. As the cache observes incoming requests, HALP continuously trains a small neural network that predicts a scalar reward for each item by formulating this as a preference learning method via pairwise preference feedback. This aspect of HALP is similar to reinforcement learning from human feedback (RLHF) systems, but with two important distinctions:

  • Feedback is automated and leverages well-known results about the structure of offline optimal cache eviction policies.
  • The model is learned continuously using a transient buffer of training examples constructed from the automated feedback process.

The eviction decisions rely on a filtering mechanism with two steps. First, a small subset of candidates is selected using a heuristic that is efficient, but suboptimal in terms of performance. Then, a re-ranking step optimizes from within the baseline candidates via the sparing use of a neural network scoring function to “boost” the quality of the final decision.

As a production-ready cache policy implementation, HALP not only makes eviction decisions, but also subsumes the end-to-end process of sampling pairwise preference queries used to efficiently construct relevant feedback and update the model to power eviction decisions.

A neural reward model

HALP uses a light-weight two-layer multilayer perceptron (MLP) as its reward model to selectively score individual items in the cache. The features are constructed and managed as a metadata-only “ghost cache” (similar to classical policies like ARC). After any given lookup request, in addition to regular cache operations, HALP conducts the book-keeping (e.g., tracking and updating feature metadata in a capacity-constrained key-value store) needed to update the dynamic internal representation. This includes: (1) externally tagged features provided by the user as input, along with a cache lookup request, and (2) internally constructed dynamic features (e.g., time since last access, average time between accesses) constructed from lookup times observed on each item.

HALP learns its reward model fully online starting from a random weight initialization. This might seem like a bad idea, especially if the decisions are made exclusively for optimizing the reward model. However, the eviction decisions rely on both the learned reward model and a suboptimal but simple and robust heuristic like LRU. This allows for optimal performance when the reward model has fully generalized, while remaining robust to a temporarily uninformative reward model that is yet to generalize, or in the process of catching up to a changing environment.

Another advantage of online training is specialization. Each cache server runs in a potentially different environment (e.g., geographic location), which influences local network conditions and what content is locally popular, among other things. Online training automatically captures this information while reducing the burden of generalization, as opposed to a single offline training solution.

Scoring samples from a randomized priority queue

It can be impractical to optimize for the quality of eviction decisions with an exclusively learned objective for two reasons.

  1. Compute efficiency constraints: Inference with a learned network can be significantly more expensive than the computations performed in practical cache policies operating at scale. This limits not only the expressivity of the network and features, but also how often these are invoked during each eviction decision.
  2. Robustness for generalizing out-of-distribution: HALP is deployed in a setup that involves continual learning, where a quickly changing workload might generate request patterns that might be temporarily out-of-distribution with respect to previously seen data.

To address these issues, HALP first applies an inexpensive heuristic scoring rule that corresponds to an eviction priority to identify a small candidate sample. This process is based on efficient random sampling that approximates exact priority queues. The priority function for generating candidate samples is intended to be quick to compute using existing manually-tuned algorithms, e.g., LRU. However, this is configurable to approximate other cache replacement heuristics by editing a simple cost function. Unlike prior work, where the randomization was used to tradeoff approximation for efficiency, HALP also relies on the inherent randomization in the sampled candidates across time steps for providing the necessary exploratory diversity in the sampled candidates for both training and inference.

The final evicted item is chosen from among the supplied candidates, equivalent to the best-of-n reranked sample, corresponding to maximizing the predicted preference score according to the neural reward model. The same pool of candidates used for eviction decisions is also used to construct the pairwise preference queries for automated feedback, which helps minimize the training and inference skew between samples.

An overview of the two-stage process invoked for each eviction decision.
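
The two-stage decision can be summarized in a short conceptual sketch. This is not the production implementation; heuristic_priority and reward_model stand in for the LRU-like rule and the trained MLP described above.

```python
import random

def evict(cache_items, heuristic_priority, reward_model, sample_size=16):
    """Conceptual HALP-style eviction: heuristic sampling, then learned re-ranking."""
    # Stage 1: cheap, randomized sampling that approximates an exact priority queue.
    candidates = random.sample(list(cache_items), min(sample_size, len(cache_items)))
    candidates.sort(key=heuristic_priority, reverse=True)  # highest eviction priority first
    shortlist = candidates[: max(2, sample_size // 4)]

    # Stage 2: best-of-n re-ranking with the learned reward model.
    victim = max(shortlist, key=reward_model)  # highest predicted eviction preference

    # The same shortlist is reused to construct pairwise preference queries for feedback.
    return victim, shortlist
```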

Online preference learning with automated feedback

The reward model is learned using online feedback, which is based on automatically assigned preference labels that indicate, wherever feasible, the ranked preference ordering for the time taken to receive future re-accesses, starting from a given snapshot in time among each queried sample of items. This is similar to the oracle optimal policy, which, at any given time, evicts an item with the farthest future access from all the items in the cache.

Generation of the automated feedback for learning the reward model.

To make this feedback process informative, HALP constructs pairwise preference queries that are most likely to be relevant for eviction decisions. In sync with the usual cache operations, HALP issues a small number of pairwise preference queries while making each eviction decision, and appends them to a set of pending comparisons. The labels for these pending comparisons can only be resolved at a random future time. To operate online, HALP also performs some additional book-keeping after each lookup request to process any pending comparisons that can be labeled incrementally after the current request. HALP indexes the pending comparison buffer with each element involved in the comparison, and recycles the memory consumed by stale comparisons (neither of which may ever get a re-access) to ensure that the memory overhead associated with feedback generation remains bounded over time.

Overview of all main components in HALP.
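
A rough sketch of the feedback bookkeeping described above, with stand-in data structures (none of this is the production code): comparisons are buffered at decision time and resolved later, when one of the two items is re-accessed first.

```python
from collections import deque

pending = deque(maxlen=10_000)  # bounded buffer of unresolved pairwise comparisons
training_examples = []          # resolved preference pairs for updating the reward model

def enqueue_comparisons(shortlist):
    # Pair up candidates from the eviction shortlist as preference queries.
    for a, b in zip(shortlist[::2], shortlist[1::2]):
        pending.append({"a": a, "b": b})

def on_lookup(requested_item):
    # Resolve pending comparisons involving the re-accessed item: the item seen
    # again sooner should have been kept, so the other item is the preferred eviction.
    # (The real system indexes the buffer by item instead of scanning it.)
    for query in list(pending):
        if requested_item in (query["a"], query["b"]):
            loser = query["a"] if requested_item == query["b"] else query["b"]
            training_examples.append((loser, requested_item))  # (evict-preferred, keep-preferred)
            pending.remove(query)
```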

Results: Impact on the YouTube CDN

Through empirical analysis, we show that HALP compares favorably to state-of-the-art cache policies on public benchmark traces in terms of cache miss rates. However, while public benchmarks are a useful tool, they are rarely sufficient to capture all the usage patterns across the world over time, not to mention the diverse hardware configurations that we have already deployed.

Until recently, YouTube servers used an optimized LRU-variant for memory cache eviction. HALP increases YouTube’s memory egress/ingress — the ratio of the total bandwidth egress served by the CDN to that consumed for retrieval (ingress) due to cache misses — by roughly 12% and memory hit rate by 6%. This reduces latency for users, since memory reads are faster than disk reads, and also improves egressing capacity for disk-bounded machines by shielding the disks from traffic.

The figure below shows a visually compelling reduction in the byte miss ratio in the days following HALP’s final rollout on the YouTube CDN, which is now serving significantly more content from within the cache with lower latency to the end user, and without having to resort to more expensive retrieval that increases the operating costs.

Aggregate worldwide YouTube byte miss ratio before and after rollout (vertical dashed line).

An aggregated performance improvement could still hide important regressions. In addition to measuring overall impact, we also conduct an analysis in the paper to understand its impact on different racks using a machine level analysis, and find it to be overwhelmingly positive.

Conclusion

We introduced a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. Because of its design choices, HALP can be deployed in a manner similar to any other cache policy without the operational overhead of having to separately manage the labeled examples, training procedure and the model versions as additional offline pipelines common to most machine learning systems. Therefore, it incurs only a small extra overhead compared to other classical algorithms, but has the added benefit of being able to take advantage of additional features to make its eviction decisions and continuously adapt to changing access patterns.

This is the first large-scale deployment of a learned cache policy to a widely used and heavily trafficked CDN, and has significantly improved the CDN infrastructure efficiency while also delivering a better quality of experience to users.

Acknowledgements

Ramki Gummadi is now part of Google DeepMind. We would like to thank John Guilyard for help with the illustrations and Richard Schooler for feedback on this post.

Read More