Create high-quality data for ML models with Amazon SageMaker Ground Truth

Machine learning (ML) has improved business across industries in recent years—from the recommendation system on your Prime Video account, to document summarization and efficient search with Alexa’s voice assistance. However, the question remains of how to incorporate this technology into your business. Unlike traditional rule-based methods, ML automatically infers patterns from data so as to perform your task of interest. Although this bypasses the need to curate rules for automation, it also means that ML models can only be as good as the data on which they’re trained. However, data creation is often a challenging task. At the Amazon Machine Learning Solutions Lab, we’ve repeatedly encountered this problem and want to ease this journey for our customers. If you want to offload this process, you can use Amazon SageMaker Ground Truth Plus.

By the end of this post, you’ll be able to achieve the following:

  • Understand the business processes involved in setting up a data acquisition pipeline
  • Identify AWS Cloud services for supporting and expediting your data labeling pipeline
  • Run a data acquisition and labeling task for custom use cases
  • Create high-quality data following business and technical best practices

Throughout this post, we focus on the data creation process and rely on AWS services to handle the infrastructure and process components. Namely, we use Amazon SageMaker Ground Truth to handle the labeling infrastructure pipeline and user interface. This service uses a point-and-go approach to collect your data from Amazon Simple Storage Service (Amazon S3) and set up a labeling workflow. For labeling, it provides you with the built-in flexibility to acquire data labels using your private team, an Amazon Mechanical Turk workforce, or your preferred labeling vendor from AWS Marketplace. Lastly, you can use AWS Lambda and Amazon SageMaker notebooks to process, visualize, or quality control the data, either before or after labeling.
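To make the Amazon S3 hand-off concrete, the following is a minimal sketch of how an input manifest for a Ground Truth labeling job could be assembled. It assumes the JSON Lines manifest format in which each line carries a `source-ref` key pointing at one object in Amazon S3. The bucket name, object keys, and the `build_manifest` helper are hypothetical and only serve as illustration.

```python
import json


def build_manifest(bucket, keys):
    """Return JSON Lines text with one {"source-ref": ...} object per data item."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    return "\n".join(lines) + "\n"


# Hypothetical bucket and object keys, for illustration only.
manifest = build_manifest(
    "example-bucket",
    ["images/scan_001.png", "images/scan_002.png"],
)
print(manifest, end="")
```

You would typically upload the resulting text as a manifest file to Amazon S3 and reference it when you create the labeling job.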

Now that all of the pieces have been laid down, let’s start the process!

The data creation process

Contrary to common intuition, the first step for data creation is not data collection. Working backward from the users to articulate the problem is crucial. For example, what do users care about in the final artifact? Where do experts believe the signals relevant to the use case reside in the data? What information about the use case environment could be provided to the model? If you don’t know the answers to those questions, don’t worry. Give yourself some time to talk with users and field experts to understand the nuances. This initial understanding will orient you in the right direction and set you up for success.

For this post, we assume that you have covered this initial process of user requirement specification. The next three sections walk you through the subsequent process of creating quality data: planning, source data creation, and data annotation. Piloting loops at the data creation and annotation steps are vital for ensuring the efficient creation of labeled data. This involves iterating between data creation, annotation, quality assurance, and updating the pipeline as necessary.

The following figure provides an overview of the steps required in a typical data creation pipeline. You can work backward from the use case to identify the data that you need (Requirements Specification), build a process to obtain the data (Planning), implement the actual data acquisition process (Data Collection and Annotation), and assess the results. Pilot runs, highlighted with dashed lines, let you iterate on the process until a high-quality data acquisition pipeline has been developed.

In a typical data creation pipeline, you go through requirements specification for the use case in scope, plan for the data creation process, implement the process for data collection and labeling, and evaluate the results against the original requirements specification. Successive iterations of this workflow enable refinement of the pipeline.

Overview of steps required in a typical data creation pipeline.


Planning

A standard data creation process can be time-consuming and a waste of valuable human resources if conducted inefficiently. Why would it be time-consuming? To answer this question, we must understand the scope of the data creation process. To assist you, we have collected a high-level checklist and description of key components and stakeholders that you must consider. Answering these questions can be difficult at first. Depending on your use case, only some of these may be applicable.

  • Identify the legal point of contact for required approvals – Using data for your application can require license or vendor contract review to ensure compliance with company policies and use cases. It’s important to identify your legal support throughout the data acquisition and annotation steps of the process.
  • Identify the security point of contact for data handling – Leakage of purchased data might result in serious fines and repercussions for your company. It’s important to identify your security support throughout the data acquisition and annotation steps to ensure secure practices.
  • Detail use case requirements and define source data and annotation guidelines – Creating and annotating data is difficult due to the high specificity required. Stakeholders, including data generators and annotators, must be completely aligned to avoid wasting resources. To this end, it’s common practice to use a guidelines document that specifies every aspect of the annotation task: exact instructions, edge cases, an example walkthrough, and so on.
  • Align on expectations for collecting your source data – Consider the following:

    • Conduct research on potential data sources – For example, public datasets, existing datasets from other internal teams, self-collected, or purchased data from vendors.
    • Perform quality assessment – Create an analysis pipeline with relation to the final use case.
  • Align on expectations for creating data annotations – Consider the following:

    • Identify the technical stakeholders – This is usually an individual or team in your company capable of using the technical documentation regarding Ground Truth to implement an annotation pipeline. These stakeholders are also responsible for quality assessment of the annotated data to make sure that it meets the needs of your downstream ML application.
    • Identify the data annotators – These individuals use predetermined instructions to add labels to your source data within Ground Truth. They may need to possess domain knowledge depending on your use case and annotation guidelines. You can use a workforce internal to your company, or pay for a workforce managed by an external vendor.
  • Ensure oversight of the data creation process – As you can see from the preceding points, data creation is a detailed process that involves numerous specialized stakeholders. Therefore, it’s crucial to monitor it end to end toward the desired outcome. Having a dedicated person or team oversee the process can help you ensure a cohesive, efficient data creation process.

Depending on the route that you decide to take, you must also consider the following:

  • Create the source dataset – This applies when existing data isn’t suitable for the task at hand, or when legal constraints prevent you from using it. In that case, internal teams or external vendors (see the next point) must create the data. This is often the case for highly specialized domains or areas with little public research, for example, a physician’s common questions, garment lay-downs, or data that requires sports expertise.
  • Research vendors and conduct an onboarding process – When external vendors are used, a contracting and onboarding process must be set in place between both entities.

In this section, we reviewed the components and stakeholders that we must consider. However, what does the actual process look like? In the following figure, we outline a process workflow for data creation and annotation. The iterative approach uses small batches of data called pilots to decrease turnaround time, detect errors early on, and avoid wasting resources in the creation of low-quality data. We describe these pilot rounds later in this post. We also cover some best practices for data creation, annotation, and quality control.

The following figure illustrates the iterative development of a data creation pipeline. Vertically, we find the data sourcing block (green) and the annotation block (blue). Both blocks have independent pilot rounds (Data creation/Annotation, QAQC, and Update). Increasingly higher-quality source data is created and can be used to construct increasingly higher-quality annotations.

During the iterative development of a data creation or annotation pipeline, small batches of data are used for independent pilots. Each pilot round has a data creation or annotation phase, some quality assurance and quality control of the results, and an update step to refine the process. After these processes are finessed through successive pilots, you can proceed to large-scale data creation and annotation.

Overview of iterative development in a data creation pipeline.
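As a rough illustration of the pilot idea, the following sketch sets aside a small, reproducible random batch for a pilot round and keeps the rest for later rounds or large-scale processing. The item names, the batch size of 20, and the `split_pilot` helper are assumptions made for this example, not part of any AWS service.

```python
import random


def split_pilot(items, pilot_size, seed=0):
    """Shuffle a copy of the items and return (pilot_batch, remaining_items)."""
    rng = random.Random(seed)  # fixed seed so the same pilot can be reproduced
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:pilot_size], shuffled[pilot_size:]


# Hypothetical pool of 1,000 source items.
all_items = [f"item_{i:04d}" for i in range(1000)]
pilot_batch, remaining = split_pilot(all_items, pilot_size=20)
print(len(pilot_batch), len(remaining))
```

Sampling at random, rather than taking the first items collected, is a design choice that keeps the pilot closer to the variety you expect in the full dataset.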

Source data creation

The input creation process revolves around staging your items of interest, which depend on your task type. These could be images (newspaper scans), videos (traffic scenes), 3D point clouds (medical scans), or simply text (subtitle tracks, transcriptions). In general, when staging your task-related items, make sure of the following:

  • Reflect the real-world use case for the eventual AI/ML system – The setup for collecting images or videos for your training data should closely match the setup for your input data in the real-world application. This means having consistent placement surfaces, lighting sources, or camera angles.
  • Account for and minimize variability sources – Consider the following:

    • Develop best practices for maintaining data collection standards – Depending on the granularity of your use case, you may need to specify requirements to guarantee consistency among your data points. For example, if you’re collecting image or video data from single camera points, you may need to make sure of the consistent placement of your objects of interest, or require a quality check for the camera before a data capture round. This can avoid issues like camera tilt or blur, and minimize downstream overheads like removing out-of-frame or blurry images, as well as needing to manually center the image frame on your area of interest.
    • Pre-empt test time sources of variability – If you anticipate variability in any of the attributes mentioned so far during test time, make sure that you can capture those variability sources during training data creation. For example, if you expect your ML application to work in multiple different light settings, you should aim to create training images and videos at various light settings. Depending on the use case, variability in camera positioning can also influence the quality of your labels.
  • Incorporate prior domain knowledge when available – Consider the following:

    • Inputs on sources of error – Domain practitioners can provide insights into sources of error based on their years of experience. They can provide feedback on the best practices for the previous two points: What settings reflect the real-world use case best? What are the possible sources of variability during data collection, or at the time of use?
    • Domain-specific data collection best practices – Although your technical stakeholders may already have a good idea of the technical aspects to focus on in the images or videos collected, domain practitioners can provide feedback on how best to stage or collect the data such that these needs are met.

Quality control and quality assurance of the created data

Now that you have set up the data collection pipeline, it might be tempting to go ahead and collect as much data as possible. Wait a minute! We must first check if the data collected through the setup is suitable for your real-world use case. We can use some initial samples and iteratively improve the setup through the insights that we gained from analyzing that sample data. Work closely with your technical, business, and annotation stakeholders during the pilot process. This will make sure that your resultant pipeline is meeting business needs while generating ML-ready labeled data with minimal overhead.


Data annotation

The annotation of inputs is where we add the magic touch to our data: the labels! Depending on your task type and data creation process, you may need manual annotators, or you can use off-the-shelf automated methods. The data annotation pipeline itself can be a technically challenging task. Ground Truth eases this journey for your technical stakeholders with its built-in repertoire of labeling workflows for common data sources. With a few additional steps, it also enables you to build custom labeling workflows beyond preconfigured options.

Ask yourself the following questions when developing a suitable annotation workflow:

  • Do I need a manual annotation process for my data? In some cases, automated labeling services may be sufficient for the task at hand. Reviewing the documentation and available tools can help you identify if manual annotation is necessary for your use case (for more information, see What is data labeling?). The data creation process can allow for varying levels of control regarding the granularity of your data annotation. Depending on this process, you can also sometimes bypass the need for manual annotation. For more information, refer to Build a custom Q&A dataset using Amazon SageMaker Ground Truth to train a Hugging Face Q&A NLU model.
  • What forms my ground truth? In most cases, the ground truth will come from your annotation process—that’s the whole point! In others, the user may have access to ground truth labels. This can significantly speed up your quality assurance process, or reduce the overhead required for multiple manual annotations.
  • What is the upper bound for the amount of deviance from my ground truth state? Work with your end-users to understand the typical errors around these labels, the sources of such errors, and the desired reduction in errors. This will help you identify which aspects of the labeling task are most challenging or are likely to have annotation errors.
  • Are there preexisting rules used by the users or field practitioners to label these items? Use and refine these guidelines to build a set of instructions for your manual annotators.

Piloting the input annotation process

When piloting the input annotation process, consider the following:

  • Review the instructions with the annotators and field practitioners – Instructions should be concise and specific. Ask for feedback from your users (Are the instructions accurate? Can we revise any instructions to make sure that they are understandable by non-field practitioners?) and annotators (Is everything understandable? Is the task clear?). If possible, add an example of good and bad labeled data to help your annotators identify what is expected, and what common labeling errors might look like.
  • Collect data for annotations – Review the data with your customer to make sure that it meets the expected standards, and to align on expected outcomes from the manual annotation.
  • Provide examples to your pool of manual annotators as a test run – What is the typical variance among the annotators in this set of examples? Study the variance for each annotation within a given image to identify the consistency trends among annotators. Then compare the variances across the images or video frames to identify which labels are challenging to place.
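The variance comparison in the last point can be sketched as follows. This is a minimal example under stated assumptions: each annotator places one (x, y) key point per label, the spread of a label within an image is measured as the mean distance of the annotators' points from their centroid, and the image names, label names, and coordinates are made up for illustration.

```python
from collections import defaultdict
from math import dist
from statistics import mean

# Hypothetical pilot data: image -> label -> one (x, y) point per annotator.
pilot = {
    "img_001": {
        "nose": [(10, 10), (11, 10), (10, 11)],
        "left_ear": [(30, 5), (38, 9), (25, 2)],
    },
    "img_002": {
        "nose": [(50, 40), (50, 41), (51, 40)],
        "left_ear": [(70, 35), (62, 30), (75, 41)],
    },
}


def spread(points):
    """Mean distance of each annotator's point from the centroid of all points."""
    centroid = (mean(p[0] for p in points), mean(p[1] for p in points))
    return mean(dist(p, centroid) for p in points)


def rank_labels_by_spread(data):
    """Average the within-image spread per label and sort from least to most consistent."""
    per_label = defaultdict(list)
    for labels in data.values():
        for name, points in labels.items():
            per_label[name].append(spread(points))
    averaged = [(name, mean(values)) for name, values in per_label.items()]
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)


ranking = rank_labels_by_spread(pilot)
for name, value in ranking:
    print(f"{name}: {value:.2f}")
```

Labels at the top of the ranking are candidates for clearer instructions or additional example images.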

Quality control of the annotations

Annotation quality control has two main components: assessing consistency between the annotators, and assessing the quality of the annotations themselves.

You can assign multiple annotators to the same task (for example, three annotators label the key points on the same image), and measure the average value alongside the standard deviation of these labels among the annotators. Doing so helps you identify any outlier annotations (incorrect label used, or label far away from the average annotation), which can guide actionable outcomes, such as refining your instructions or providing further training to certain annotators.
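A minimal sketch of this check follows. It assumes one numeric value per annotator (here, the x-coordinate of a single key point) and flags annotators whose value lies more than a chosen number of standard deviations from the mean. The annotator names, the values, and the threshold of 1.5 are assumptions for illustration. With only a handful of annotators the achievable z-scores are small, so the threshold needs tuning for your own setup.

```python
from statistics import mean, pstdev


def flag_outliers(values_by_annotator, z_threshold=1.5):
    """Return (mean, standard deviation, annotators far from the mean)."""
    values = list(values_by_annotator.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation across annotators
    if sigma == 0:
        return mu, sigma, []  # perfect agreement, nothing to flag
    flagged = [
        annotator
        for annotator, value in values_by_annotator.items()
        if abs(value - mu) / sigma > z_threshold
    ]
    return mu, sigma, flagged


# Hypothetical x-coordinates placed by five annotators for the same key point.
x_coords = {"ann_a": 100, "ann_b": 101, "ann_c": 99, "ann_d": 100, "ann_e": 140}
mu, sigma, flagged = flag_outliers(x_coords)
print(mu, round(sigma, 2), flagged)
```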

Assessing the quality of annotations themselves is tied to annotator variability and (when available) the availability of domain experts or ground truth information. Are there certain labels (across all of your images) where the average variance between annotators is consistently high? Are any labels far off from your expectations of where they should be, or what they should look like?

Based on our experience, a typical quality control loop for data annotation can look like this:

  • Iterate on the instructions or image staging based on results from the test run – Are any objects occluded, or does image staging not match the expectations of annotators or users? Are the instructions misleading, or did you miss any labels or common errors in your exemplar images? Can you refine the instructions for your annotators?
  • If you are satisfied that you have addressed any issues from the test run, do a batch of annotations – For testing the results from the batch, follow the same quality assessment approach of assessing inter-annotator and inter-image label variabilities.


Conclusion

This post serves as a guide for business stakeholders to understand the complexities of data creation for AI/ML applications. The processes described also serve as a guide for technical practitioners to generate quality data while optimizing business constraints such as personnel and costs. If not done well, a data creation and labeling pipeline can take upwards of 4–6 months.

With the guidelines and suggestions outlined in this post, you can preempt roadblocks, reduce time to completion, and minimize the costs in your journey toward creating high-quality data.

About the authors

Jasleen Grewal is an Applied Scientist at Amazon Web Services, where she works with AWS customers to solve real world problems using machine learning, with special focus on precision medicine and genomics. She has a strong background in bioinformatics, oncology, and clinical genomics. She is passionate about using AI/ML and cloud services to improve patient care.

Boris Aronchik is a Manager in the Amazon AI Machine Learning Solutions Lab, where he leads a team of ML scientists and engineers to help AWS customers realize business goals leveraging AI/ML solutions.

Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.

Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.


Automate your time series forecasting in Snowflake using Amazon Forecast

This post is a joint collaboration with Andries Engelbrecht and James Sun of Snowflake, Inc.

The cloud computing revolution has enabled businesses to capture and retain corporate and organizational data without capacity planning or data retention constraints. Now, with diverse and vast reserves of longitudinal data, companies are increasingly able to find novel and impactful ways to use their digital assets to make better-informed short-term and long-term planning decisions. Time series forecasting is a unique and essential science that allows companies to make surgical planning decisions to help balance customer service levels against often competing goals of optimal profitability.

At AWS, we sometimes work with customers who have selected our technology partner Snowflake to deliver a cloud data platform experience. Having a platform that can recall years and years of historical data is powerful—but how can you use this data to look ahead and use yesterday’s evidence to plan for tomorrow? Imagine not only having what has happened available in Snowflake—your single version of the truth—but also an adjacent set of non-siloed data that offers a probabilistic forecast for days, weeks, or months into the future.

In a collaborative supply chain, sharing information between partners can improve performance, increase competitiveness, and reduce wasted resources. Sharing your future forecasts can be facilitated with Snowflake Data Sharing, which enables you to seamlessly collaborate with your business partners securely and identify business insights. If many partners share their forecasts, it can help control the bullwhip effect in the connected supply chain. You can effectively use Snowflake Marketplace to monetize your predictive analytics from datasets produced in Amazon Forecast.

In this post, we discuss how to implement an automated time series forecasting solution using Snowflake and Forecast.

Essential AWS services that enable this solution

Forecast provides several state-of-the-art time series algorithms and manages the allocation of enough distributed computing capacity to meet the needs of nearly any workload. With Forecast, you don’t get one model; you get the strength of many models that are further optimized into a uniquely weighted model for each time series in the set. In short, the service delivers all the science, data handling, and resource management into a simple API call.

AWS Step Functions provides a process orchestration mechanism that manages the overall workflow. The service encapsulates API calls with Amazon Athena, AWS Lambda, and Forecast to create an automated solution that harvests data from Snowflake, uses Forecast to convert historical data into future predictions, and then creates the data inside Snowflake.

Athena federated queries can connect to several enterprise data sources, including Amazon DynamoDB, Amazon Redshift, Amazon OpenSearch Service, MySQL, PostgreSQL, Redis, and other popular third-party data stores, such as Snowflake. Data connectors run as Lambda functions—you can use this source code to help launch the Amazon Athena Lambda Snowflake Connector and connect with AWS PrivateLink or through a NAT Gateway.

Solution overview

One of the things we often do at AWS is work to help customers realize their goals while also removing the burden of the undifferentiated heavy lifting. With this in mind, we propose the following solution to help AWS and Snowflake customers perform the following steps:

  1. Export data from Snowflake. You can use flexible metadata to unload the necessary historical data driven by a ready-to-go workflow.
  2. Import data into Forecast. No matter the use case, industry, or scale, importing prepared data inputs is easy and automated.
  3. Train a state-of-the-art time series model. You can automate time series forecasting without managing the underlying data science or hardware provisioning.
  4. Generate inference against the trained model. Forecast-produced outputs are easy to consume for any purpose. They’re available as simple CSV or Parquet files on Amazon Simple Storage Service (Amazon S3).
  5. Use history and future predictions side by side directly in Snowflake.

The following diagram illustrates how to implement an automated workflow that enables Snowflake customers to benefit from highly accurate time series predictions supported by Forecast, an AWS managed service. Transcending use case and industry, the design offered here first extracts historical data from Snowflake. Next, the workflow submits the prepared data for time series computation. Lastly, future period predictions are available natively in Snowflake, creating a seamless user experience for joint AWS and Snowflake customers.

Although this architecture only highlights the key technical details, the solution is simple to put together, sometimes within 1–2 business days. We provide you with working sample code to help remove the undifferentiated heavy lifting of creating the solution alone and without a head start. After you discover how to implement this pattern for one workload, you can repeat the forecasting process for any data held in Snowflake. In the sections that follow, we outline the key steps that enable you to build an automated pipeline.

Extract historical data from Snowflake

In this first step, you use SQL to define what data you want forecasted and let an Athena Federated Query connect to Snowflake, run your customized SQL, and persist the resulting record set on Amazon S3. Forecast requires historical training data to be available on Amazon S3 before ingestion; therefore, Amazon S3 serves as an intermediate storage buffer between Snowflake and Forecast. We feature Athena in this design to enable Snowflake and other heterogeneous data sources. If you prefer, another approach is using the Snowflake COPY command and storage integration to write query results to Amazon S3.
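The following is a minimal sketch of how such a query could be submitted programmatically, assuming the AWS SDK for Python (Boto3) and its Athena `start_query_execution` call. The catalog name, database name, table and column names, and the S3 output location are all hypothetical placeholders, and the catalog is assumed to be one you registered for the Snowflake connector.

```python
def build_athena_request(sql, catalog, database, output_s3_uri):
    """Assemble the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def run_query(request):
    """Submit the query; requires boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the sketch can be read and run without AWS access

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical names throughout: replace with your own catalog, schema, and bucket.
request = build_athena_request(
    sql='SELECT item_id, order_date AS "timestamp", units AS target_value FROM demand_history',
    catalog="snowflake_catalog",
    database="PUBLIC",
    output_s3_uri="s3://example-bucket/forecast/target-time-series/",
)
print(request["ResultConfiguration"]["OutputLocation"])
```

Athena writes the query results to the output location, which is what makes Amazon S3 the buffer between Snowflake and Forecast in this design.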

Regardless of the transport mechanism used, we now outline the kind of data Forecast needs and how data is defined, prepared, and extracted. In the section that follows, we describe how to import data into Forecast.

The following screenshot depicts what a set of data might look like in its native Snowflake schema.

Although this screenshot shows how the data looks in its natural state, Forecast requires data to be shaped into three different datasets:

  • Target time series – This is a required dataset containing the target variable and is used to train and predict a future value. Alone, this dataset serves as a univariate time series model.
  • Related time series – This is an optional dataset that contains temporal variables that should have a relationship to the target variable. Examples include variable pricing, promotional efforts, hyperlocal event traffic, economic outlook data, or anything else you feel might help explain variance in the target time series and produce a better forecast. The related time series dataset turns your univariate model into a multivariate one to help improve accuracy.
  • Item metadata – This is an optional dataset containing categorical data about the forecasted item. Item metadata often helps boost performance for newly launched products, which we term a cold start.
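To illustrate the shaping step for the required dataset, the following sketch renames and reorders source columns into a target time series CSV. It assumes the column names `item_id`, `timestamp`, and `target_value`, which is our understanding of what the Forecast custom domain expects. The source field names (`sku`, `order_date`, `units_sold`) and the sample rows are hypothetical. Whether to write a header row depends on how you configure the import, so it is left as an option.

```python
import csv
import io

TARGET_COLUMNS = ["item_id", "timestamp", "target_value"]


def to_target_time_series(rows, include_header=True):
    """Map hypothetical source fields onto the three target time series columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if include_header:
        writer.writerow(TARGET_COLUMNS)
    for row in rows:
        writer.writerow([row["sku"], row["order_date"], row["units_sold"]])
    return buffer.getvalue()


# Hypothetical records as they might come back from a source query.
source_rows = [
    {"sku": "meal_1", "order_date": "2022-01-03", "units_sold": 12},
    {"sku": "meal_1", "order_date": "2022-01-10", "units_sold": 15},
]
tts_csv = to_target_time_series(source_rows)
print(tts_csv, end="")
```

In practice you can do the same renaming directly in the SQL query; the sketch only makes the expected shape explicit.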

With the scope of each of the Forecast datasets defined, you can write queries in Snowflake that source the correct data fields from the necessary source tables with the proper filters to get the desired subset of data. The following are three example SQL queries used to generate each dataset that Forecast needs for a specific food demand planning scenario.

We start with the target time series query:


The optional related time series query pulls covariates such as price and promotions:


The item metadata query fetches distinct categorical values that help give dimension and further define the forecasted item:


With the source queries defined, we can connect to Snowflake through an Athena Federated Query to submit the queries and persist the resulting datasets for forecasting use. For more information, refer to Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake.

The Athena Snowflake Connector GitHub repo helps install the Snowflake connector. The Forecast MLOps GitHub repo helps orchestrate all macro steps defined in this post, and makes them repeatable without writing code.

Import data into Forecast

After we complete the previous step, a target time series dataset is in Amazon S3 and ready for import into Forecast. In addition, the optional related time series and item metadata datasets may also be prepared and ready for ingestion. With the provided Forecast MLOps solution, all you have to do here is initiate the Step Functions state machine responsible for importing data—no code is necessary. Forecast launches a cluster for each of the datasets you have provided and makes the data ready for the service to use for ML model building and model inference.
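If you prefer to start the state machine from a script rather than the console, a sketch along the following lines could work. It assumes Boto3 and the Step Functions `start_execution` call. The state machine ARN, the run name, and the input payload keys are hypothetical; the input that the deployed state machine actually expects is defined by the solution you installed, so treat the payload here as a placeholder.

```python
import json


def build_execution_request(state_machine_arn, run_name, payload):
    """Assemble the keyword arguments for Step Functions' start_execution call."""
    return {
        "stateMachineArn": state_machine_arn,
        "name": run_name,
        "input": json.dumps(payload),  # the API expects the input as a JSON string
    }


def start_import(request):
    """Start the run; requires boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the sketch runs without AWS access

    sfn = boto3.client("stepfunctions")
    return sfn.start_execution(**request)["executionArn"]


# Hypothetical ARN and payload, for illustration only.
execution_request = build_execution_request(
    state_machine_arn="arn:aws:states:us-east-1:111122223333:stateMachine:ExampleImportStateMachine",
    run_name="import-2022-01-17",
    payload={"note": "placeholder input"},
)
print(execution_request["name"])
```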

Create a time series ML model with accuracy statistics

After data has been imported, highly accurate time series models are created simply by calling an API. This step is encapsulated inside a Step Functions state machine that initiates the Forecast API to start model training. After the predictor model is trained, the state machine exports the model statistics and predictions during the backtest window to Amazon S3. Backtest exports are queryable by Snowflake as an external stage, as shown in the following screenshot. If you prefer, you can store the data in an internal stage. The point is to use the backtest metrics to evaluate the spread of performance across the time series in the dataset you provided.
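To help read the exported metrics, the following sketch computes two of them, WAPE and RMSE, from a small set of actual and forecasted values using their standard definitions. The numbers are made up for illustration, and the functions are not the Forecast implementation; they only show what the metrics measure.

```python
from math import sqrt


def wape(actuals, forecasts):
    """Weighted absolute percentage error: total absolute error over total absolute actuals."""
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return total_error / sum(abs(a) for a in actuals)


def rmse(actuals, forecasts):
    """Root mean square error: square root of the average squared error."""
    squared = [(a - f) ** 2 for a, f in zip(actuals, forecasts)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical backtest window for one item.
actual_demand = [10, 20, 30, 40]
predicted_demand = [12, 18, 33, 35]
print(round(wape(actual_demand, predicted_demand), 4))
print(round(rmse(actual_demand, predicted_demand), 4))
```

Comparing these values per item, rather than only in aggregate, is one way to see which time series the model handles well and which need attention.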

Create future predictions

With the model trained from the previous step, a purpose-built Step Functions state machine calls the Forecast API to create future-dated forecasts. Forecast provisions a cluster to perform the inference and pulls the imported target time series, related time series, and item metadata datasets through a named predictor model created in the previous step. After the predictions are generated, the state machine writes them to Amazon S3, where, once again, they can be queried in place as a Snowflake external stage or moved into Snowflake as an internal stage.

Use the future-dated prediction data directly in Snowflake

AWS hasn’t built a fully automated solution for this step; however, with the solution in this post, data was already produced by Forecast in the previous two steps. You may treat the outputs as actionable events or build business intelligence dashboards on the data. You may also use the data to create future manufacturing plans and purchase orders, estimate future revenue, build staffing resource plans, and more. Every use case is different, but the point of this step is to deliver the predictions to the correct consuming systems in your organization or beyond.

The following code snippet shows how to query Amazon S3 data directly from within Snowflake:

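-- File format for the CSV files exported by Forecast
CREATE or REPLACE FILE FORMAT mycsvformat
type = 'CSV'
field_delimiter = ','
empty_field_as_null = TRUE
skip_header = 1;

-- Storage integration that lets Snowflake assume the IAM role for the bucket.
-- Snowflake requires TYPE, STORAGE_PROVIDER, and STORAGE_ALLOWED_LOCATIONS;
-- the allowed location below is assumed to match the stage URL.
CREATE or REPLACE STORAGE INTEGRATION amazon_forecast_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::nnnnnnnnnn:role/snowflake-forecast-poc-role'
ENABLED = true
STORAGE_ALLOWED_LOCATIONS = ('s3://bucket/folder/backtest-export/accuracy-metrics-values');

-- External stage that points at the backtest accuracy metrics export
CREATE or REPLACE STAGE backtest_accuracy_metrics
storage_integration = amazon_forecast_integration
url = 's3://bucket/folder/backtest-export/accuracy-metrics-values'
file_format = mycsvformat;

-- External table over the staged files (the table name is a placeholder)
CREATE or REPLACE EXTERNAL TABLE backtest_accuracy_metrics_table (
ITEM_ID varchar AS (value:c1::varchar),
LOCATION_ID varchar AS (value:c2::varchar),
backtest_window varchar AS (value:c3::varchar),
backtestwindow_start_time varchar AS (value:c4::varchar),
backtestwindow_end_time varchar AS (value:c5::varchar),
wQL_10 varchar AS (value:c6::varchar),
wQL_30 varchar AS (value:c7::varchar),
wQL_50 varchar AS (value:c8::varchar),
wQL_70 varchar AS (value:c9::varchar),
wQL_90 varchar AS (value:c10::varchar),
AVG_wQL varchar AS (value:c11::varchar),
RMSE varchar AS (value:c12::varchar),
WAPE varchar AS (value:c13::varchar),
MAPE varchar AS (value:c14::varchar),
MASE varchar AS (value:c15::varchar)
)
with location = @backtest_accuracy_metrics
file_format = (format_name = 'mycsvformat');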

For more information about setting up permissions, refer to Option 1: Configuring a Snowflake Storage Integration to Access Amazon S3. Additionally, you can use the AWS Service Catalog to configure Amazon S3 storage integration; more information is available on the GitHub repo.

Initiate a schedule-based or event-based workflow

After you install a solution for your specific workload, your final step is to automate the process on a schedule that makes sense for your unique requirement, such as daily or weekly. The main thing is to decide how to start the process. One method is to use Snowflake to invoke the Step Functions state machine and then orchestrate the steps serially. Another approach is to chain state machines together and start the overall run through an Amazon EventBridge rule, which you can configure to run from an event or scheduled task—for example, at 9:00 PM GMT-8 each Sunday night.
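For the scheduled option, note that EventBridge rule schedules are evaluated in UTC, so a local time such as 9:00 PM GMT-8 on Sunday has to be shifted before it goes into a cron expression. The following sketch does that conversion for a fixed UTC offset. The `utc_cron` helper is hypothetical, and it deliberately ignores daylight saving time, which you would need to handle separately.

```python
from datetime import datetime, timedelta, timezone

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


def utc_cron(weekday, hour, minute, utc_offset_hours):
    """Convert a weekly local time at a fixed UTC offset into an EventBridge cron expression."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    # 2024-01-01 was a Monday, so adding the weekday index lands on the requested day.
    local = datetime(2024, 1, 1, hour, minute, tzinfo=local_tz) + timedelta(
        days=DAYS.index(weekday)
    )
    utc = local.astimezone(timezone.utc)
    # Field order: minutes, hours, day-of-month, month, day-of-week, year.
    return f"cron({utc.minute} {utc.hour} ? * {DAYS[utc.weekday()]} *)"


schedule = utc_cron("SUN", hour=21, minute=0, utc_offset_hours=-8)
print(schedule)  # cron(0 5 ? * MON *)
```

The resulting string is the kind of value you would supply as the schedule expression of an EventBridge rule whose target is the Step Functions state machine. Sunday evening at GMT-8 falls on Monday morning in UTC, which is why the day of the week changes.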


With the most experience; the most reliable, scalable, and secure cloud; and the most comprehensive set of services and solutions, AWS is the best place to unlock value from your data and turn it into insight. In this post, we showed you how to create an automated time series forecasting workflow. Better forecasting can lead to higher customer service outcomes, less waste, less idle inventory, and more cash on the balance sheet.

If you’re ready to automate and improve forecasting, we’re here to help support you on your journey. Contact your AWS or Snowflake account team to get started today and ask for a forecasting workshop to see what kind of value you can unlock from your data.

About the Authors

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.

Frank Dallezotte is a Sr. Solutions Architect at AWS and is passionate about working with independent software vendors to design and build scalable applications on AWS. He has experience creating software, implementing build pipelines, and deploying these solutions in the cloud.

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Charles Laughlin is a Principal AI/ML Specialist Solutions Architect and works on the Time Series ML team at AWS. He helps shape the Amazon Forecast service roadmap and collaborates daily with diverse AWS customers to help transform their businesses using cutting-edge AWS technologies and thought leadership. Charles holds an M.S. in Supply Chain Management and has spent the past decade working in the consumer packaged goods industry.

James Sun is a Senior Partner Solutions Architect at Snowflake. James has over 20 years of experience in storage and data analytics. Prior to Snowflake, he held several senior technical positions at AWS and MapR. James holds a PhD from Stanford University.


Achieve four times higher ML inference throughput at three times lower cost per inference with Amazon EC2 G5 instances for NLP and CV PyTorch models

Amazon Elastic Compute Cloud (Amazon EC2) G5 instances are the first and only instances in the cloud to feature NVIDIA A10G Tensor Core GPUs, which you can use for a wide range of graphics-intensive and machine learning (ML) use cases. With G5 instances, ML customers get high performance and a cost-efficient infrastructure to train and deploy larger and more sophisticated models for natural language processing (NLP), computer vision (CV), and recommender engine use cases.

The purpose of this post is to showcase the performance benefits of G5 instances for large-scale ML inference workloads. We do this by comparing the price-performance (measured as $ per million inferences) for NLP and CV models with G4dn instances. We start by describing our benchmarking approach and then present throughput vs. latency curves across batch sizes and data type precision. In comparison to G4dn instances, we find that G5 instances deliver consistently lower cost per million inferences for both full precision and mixed precision modes for the NLP and CV models while achieving higher throughput and lower latency.

Benchmarking approach

To develop a price-performance study between G5 and G4dn, we need to measure throughput, latency, and cost per million inferences as a function of batch size. We also study the impact of full precision vs. mixed precision. Both the model graph and inputs are loaded into CUDA prior to inferencing.

As shown in the following architecture diagram, we first create respective base container images with CUDA for the underlying EC2 instance (G4dn, G5). To build the base container images, we start with AWS Deep Learning Containers, which use pre-packaged Docker images to deploy deep learning environments in minutes. The images contain the required deep learning PyTorch libraries and tools. You can add your own libraries and tools on top of these images for a higher degree of control over monitoring, compliance, and data processing.

Then we build a model-specific container image that encapsulates the model configuration, model tracing, and related code to run forward passes. All container images are loaded into Amazon ECR to allow for horizontal scaling of these models for various model configurations. We use Amazon Simple Storage Service (Amazon S3) as a common data store to download configuration and upload benchmark results for summarization. You can use this architecture to recreate and reproduce the benchmark results and repurpose it to benchmark various model types (such as Hugging Face models, PyTorch models, other custom models) across EC2 instance types (CPU, GPU, Inf1).

With this experiment set up, our goal is to study latency as a function of throughput. This curve is important for application design to arrive at a cost-optimal infrastructure for the target application. To achieve this, we simulate different loads by queuing up queries from multiple threads and then measuring the round-trip time for each completed request. Throughput is measured based on the number of completed requests per unit clock time. Furthermore, you can vary the batch sizes and other variables like sequence length and full precision vs. half precision to comprehensively sweep the design space to arrive at indicative performance metrics. In our study, through a parametric sweep of batch size and queries from multi-threaded clients, the throughput vs. latency curve is determined. Every request can be batched to ensure full utilization of the accelerator, especially for small requests that may not fully utilize the compute node. You can also adopt this setup to identify the client-side batch size for optimal performance.

In summary, we can represent this problem mathematically as: (Throughput, Latency) = function of (Batch Size, Number of threads, Precision).
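The measurement loop described above can be sketched in a few lines of standard-library Python. This is not the code from the benchmark repository: `fake_forward_pass` is a hypothetical stand-in that sleeps instead of calling a model, precision is not modeled, and round-trip time is measured inside each worker thread.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_forward_pass(batch_size: int) -> None:
    """Hypothetical stand-in for a model call; sleeps instead of running a GPU kernel."""
    time.sleep(0.001 + 0.0002 * batch_size)


def timed_request(batch_size: int) -> float:
    """Run one request and return its round-trip time in milliseconds."""
    start = time.perf_counter()
    fake_forward_pass(batch_size)
    return (time.perf_counter() - start) * 1000.0


def run_experiment(batch_size: int, num_threads: int, num_requests: int) -> dict:
    """Issue num_requests from num_threads client threads and summarize the run."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        latencies_ms = list(pool.map(timed_request, [batch_size] * num_requests))
    wall_seconds = time.perf_counter() - wall_start
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95_ms = statistics.quantiles(latencies_ms, n=20)[-1]
    return {
        "batch_size": batch_size,
        "threads": num_threads,
        "throughput": batch_size * num_requests / wall_seconds,  # samples/sec
        "p95_latency_ms": p95_ms,
    }


# Sweep a small grid; each (batch size, thread count) point is an independent experiment.
results = [run_experiment(b, t, num_requests=40) for b in (1, 8, 32) for t in (1, 4)]
for row in results:
    print(row)
```

Because every grid point is independent, each call to `run_experiment` could just as well be a separate job, which is what makes the sweep easy to scale horizontally.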

This means that, given the exhaustive space, the number of experiments can be large. Fortunately, each experiment can be run independently. We recommend using AWS Batch to perform this horizontally scaled benchmarking in compressed time without an increase in benchmarking cost compared to a linear approach to testing. The code for replicating the results is present in the GitHub repository prepared for AWS re:Invent 2021. The repository is comprehensive enough to benchmark different accelerators. You can refer to the GPU aspect of the code to build the container (Dockerfile-gpu) and then refer to the code inside Container-Root for specific examples for BERT and ResNet50.

We used the preceding approach to develop performance studies across two model types: Bert-base-uncased (110 million parameters, NLP) and ResNet50 (25.6 million parameters, CV). The following table summarizes the model details.

Model Type | Model | Details
NLP | twmkn9/bert-base-uncased-squad2 | 110 million parameters; sequence length = 128
CV | ResNet50 | 25.6 million parameters

Additionally, to benchmark across data types (full, half precision), we use torch.cuda.amp, which provides convenient methods to handle mixed precision where some operations use the torch.float32 (float) data type and other operations use torch.float16 (half). For example, operators like linear layers and convolutions are much faster with float16, whereas others like reductions often require the dynamic range of float32. Automatic mixed precision tries to match each operator to its appropriate data type to optimize the network’s runtime and memory footprint.
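To make the point about reductions concrete, here is a small sketch that uses Python's `struct` module (not torch) to round values through IEEE 754 half precision. The helper names `to_half` and `accumulate_in_half` are made up for this illustration; the example only shows why a running sum kept in float16 can go wrong, not how torch.cuda.amp is implemented.

```python
import struct


def to_half(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision (float16) value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def accumulate_in_half(values):
    """Sum values while rounding the running total to float16 after every add."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total


ones = [1.0] * 4096
print(sum(ones))                 # 4096.0 with a wide accumulator
print(accumulate_in_half(ones))  # 2048.0: float16 cannot represent 2049, so the sum stalls
```

Keeping the accumulator in a wider type avoids the stall, which is the kind of per-operator decision automatic mixed precision makes.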

Benchmarking results

For a fair comparison, we selected G4dn.4xlarge and G5.4xlarge instances with similar attributes, as listed in the following table.

Instance | GPUs | GPU Memory (GiB) | vCPUs | Memory (GiB) | Instance Storage (GB) | Network Performance (Gbps) | EBS Bandwidth (Gbps) | Linux On-Demand Pricing (us-east-1)
G5.4xlarge | 1 | 24 | 16 | 64 | 1x 600 NVMe SSD | up to 25 | 8 | $1.624/hour
G4dn.4xlarge | 1 | 16 | 16 | 64 | 1x 225 NVMe SSD | up to 25 | 4.75 | $1.204/hour

In the following sections, we compare ML inference performance of BERT and ResNet50 models with a grid sweep approach for specific batch sizes (32, 16, 8, 4, 1) and data type precision (full and half precision) to arrive at the throughput vs. latency curve. Additionally, we investigate the effect of throughput vs. batch size for both full and half precision. Lastly, we measure cost per million inferences as a function of batch size. The consolidated results across these experiments are summarized later in this post.

Throughput vs. latency

The following figures compare G4dn and G5 instances for NLP and CV workloads at both full and half precision. Compared with G4dn instances, the G5 instance delivers about five times higher throughput at full precision and about 2.5 times higher throughput at half precision for a BERT base model, and about 2–2.5 times higher throughput for a ResNet50 model. Overall, from a performance perspective, G5 is the preferred choice for both models at both full and mixed precision, and its advantage grows with batch size.

The following graphs compare throughput and P95 latency at full and half precision for BERT.

The following graphs compare throughput and P95 latency at full and half precision for ResNet50.

Throughput and latency vs. batch size

The following graphs show the throughput as a function of the batch size. At low batch sizes, the accelerator isn’t functioning to its fullest capacity and as the batch size increases, throughput is increased at the cost of latency. The throughput curve asymptotes to a maximum value that is a function of the accelerator performance. The curve has two distinct features: a rising section and a flat asymptotic section. For a given model, a performant accelerator (G5) is able to stretch the rising section to higher batch sizes than G4dn and asymptote at a higher throughput. Also, there is a linear trade-off between latency and batch size. Therefore, if the application is latency bound, we can use P95 latency vs. batch size to determine the optimum batch size. However, if the objective is to maximize throughput at the lowest latency, it’s better to select the batch size corresponding to the “knee” between the rising and the asymptotic sections, because any further increase in batch size would result in the same throughput at a worse latency. To achieve the best price-performance ratio, targeting higher throughput at lowest latency, you’re better off horizontally scaling this optimum through multiple inference servers rather than just increasing the batch size.
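One simple way to locate the "knee" programmatically is to take the smallest batch size whose throughput is already close to the peak. The sketch below is a heuristic under assumptions, not the method used in this study: the function name `knee_batch_size` and the 15% tolerance are arbitrary choices for illustration, and the throughput numbers are the G5 BERT base full-precision values from the results table in this post.

```python
def knee_batch_size(throughput_by_batch: dict, tolerance: float = 0.15) -> int:
    """Return the smallest batch size whose throughput is within `tolerance` of the peak."""
    peak = max(throughput_by_batch.values())
    threshold = (1.0 - tolerance) * peak
    # The peak itself always qualifies, so the candidate set is never empty.
    return min(b for b, t in throughput_by_batch.items() if t >= threshold)


# G5, BERT base, full precision throughput from the results table in this post.
g5_bert_full = {1: 160, 8: 642, 16: 651, 32: 723}
print(knee_batch_size(g5_bert_full))  # 8
```

A tighter tolerance moves the chosen batch size toward the flat section of the curve, trading latency for the last few percent of throughput.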

Cost vs. batch size

In this section, we present the comparative results of inference costs ($ per million inferences) versus the batch size. From the following figure, we can clearly observe that the cost (measured as $ per million inferences) is consistently lower with G5 than with G4dn at both full and half precision.

The following table summarizes throughput, latency, and cost ($ per million inferences) comparisons for BERT and ResNet50 models across both precision modes for specific batch sizes. In spite of a higher cost per instance, G5 outperforms G4dn on inference latency, throughput, and cost ($ per million inferences) at batch sizes of 8 and above; at batch size 1, the cost is on par for BERT and slightly higher for ResNet50. Measured as cost ($ per million inferences), the BERT model (batch size 32, full precision) is 3.7 times more favorable on G5 than on G4dn, and the ResNet50 model (batch size 32, full precision) is 1.6 times more favorable.

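Throughput is batch size × requests/sec; cost is $ per million inferences at On-Demand pricing; cost benefit is G5 over G4dn.

Model | Batch Size | Precision | Throughput G5 | Throughput G4dn | Latency (ms) G5 | Latency (ms) G4dn | $/million inferences G5 | $/million inferences G4dn | Cost Benefit (G5 over G4dn)
Bert-base-uncased | 32 | Full | 723 | 154 | 44 | 208 | $0.6 | $2.2 | 3.7X
Bert-base-uncased | 32 | Mixed | 870 | 410 | 37 | 79 | $0.5 | $0.8 | 1.6X
Bert-base-uncased | 16 | Full | 651 | 158 | 25 | 102 | $0.7 | $2.1 | 3.0X
Bert-base-uncased | 16 | Mixed | 762 | 376 | 21 | 43 | $0.6 | $0.9 | 1.5X
Bert-base-uncased | 8 | Full | 642 | 142 | 13 | 57 | $0.7 | $2.3 | 3.3X
Bert-base-uncased | 8 | Mixed | 681 | 350 | 12 | 23 | $0.7 | $1.0 | 1.4X
Bert-base-uncased | 1 | Full | 160 | 116 | 6 | 9 | $2.8 | $2.9 | 1.0X
Bert-base-uncased | 1 | Mixed | 137 | 102 | 7 | 10 | $3.3 | $3.3 | 1.0X
ResNet50 | 32 | Full | 941 | 397 | 34 | 82 | $0.5 | $0.8 | 1.6X
ResNet50 | 32 | Mixed | 1533 | 851 | 21 | 38 | $0.3 | $0.4 | 1.3X
ResNet50 | 16 | Full | 888 | 384 | 18 | 42 | $0.5 | $0.9 | 1.8X
ResNet50 | 16 | Mixed | 1474 | 819 | 11 | 20 | $0.3 | $0.4 | 1.3X
ResNet50 | 8 | Full | 805 | 340 | 10 | 24 | $0.6 | $1.0 | 1.7X
ResNet50 | 8 | Mixed | 1419 | 772 | 6 | 10 | $0.3 | $0.4 | 1.3X
ResNet50 | 1 | Full | 202 | 164 | 5 | 6 | $2.2 | $2.0 | 0.9X
ResNet50 | 1 | Mixed | 196 | 180 | 5 | 6 | $2.3 | $1.9 | 0.8X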
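The cost figures in the table above follow from throughput and hourly price alone. The sketch below shows the arithmetic; it assumes On-Demand prices of $1.624/hour for G5.4xlarge and $1.204/hour for G4dn.4xlarge, which are the values that reproduce the cost columns, and the function name `cost_per_million_inferences` is our own.

```python
def cost_per_million_inferences(hourly_price_usd: float, throughput_per_sec: float) -> float:
    """Dollars to serve one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600.0
    return hourly_price_usd / inferences_per_hour * 1_000_000


# On-demand prices assumed here: $1.624/hour (G5.4xlarge), $1.204/hour (G4dn.4xlarge).
g5 = cost_per_million_inferences(1.624, 723)    # BERT base, batch size 32, full precision
g4dn = cost_per_million_inferences(1.204, 154)
print(f"G5: ${g5:.2f}  G4dn: ${g4dn:.2f}")  # G5: $0.62  G4dn: $2.17
```

The table rounds these to $0.6 and $2.2, and the 3.7X cost benefit is the ratio of the rounded values.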

Additional inference benchmarks

In addition to the BERT base and ResNet50 results in the prior sections, we present additional benchmarking results for other commonly used large NLP and CV models in PyTorch. The performance benefit of G5 over G4dn is presented for BERT Large models at various precisions, and for YOLOv5 models of various sizes. For the code for replicating the benchmark, refer to NVIDIA Deep Learning Examples for Tensor Cores. These results show the benefit of using G5 over G4dn for a wide range of inference tasks spanning different model types.

Model | Precision | Batch Size | Sequence Length | Throughput, G5 (sent/sec) | Throughput, G4dn (sent/sec) | Speedup Over G4dn
BERT-large | FP16 | 1 | 128 | 93.5 | 40.31 | 2.3
BERT-large | FP16 | 4 | 128 | 264.2 | 87.4 | 3.0
BERT-large | FP16 | 8 | 128 | 392.1 | 107.5 | 3.6
BERT-large | FP32 | 1 | 128 | 68.4 | 22.67 | 3.0
BERT-large | FP32 | 4 | 128 | 118.5 | 32.21 | 3.7
BERT-large | FP32 | 8 | 128 | 132.4 | 34.67 | 3.8

Model | GFLOPS | Number of Parameters | Preprocessing (ms) | Inference (ms) | Inference (Non-max-suppression) (NMS/image)
YOLOv5s | 16.5 | 7.2M | 0.2 | 3.6 | 4.5
YOLOv5m | 49.1 | 21M | 0.2 | 6.5 | 4.5
YOLOv5l | 109.3 | 46M | 0.2 | 9.1 | 3.5
YOLOv5x | 205.9 | 86M | 0.2 | 14.4 | 1.3


In this post, we showed that for inference with large NLP and CV PyTorch models, EC2 G5 instances are a better choice compared to G4dn instances. Although the on-demand hourly cost for G5 instances is higher than for G4dn instances, their higher performance can achieve 2–5 times the throughput at any precision for NLP and CV models, which makes the cost per million inferences 1.5–3.5 times more favorable than with G4dn instances. Even for latency-bound applications, G5 is 2.5–5 times better than G4dn for NLP and CV models.

In summary, AWS G5 instances are an excellent choice for your inference needs from both a performance and cost per inference perspective. The universality of the CUDA framework and the scale and depth of the G5 instance pool on AWS provide you with a unique ability to perform inference at scale.

About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in Mechanical Engineering at Rice University and post-doctoral research at the Massachusetts Institute of Technology.

Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.

Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in the area of physics-infused deep learning, building, and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from the Massachusetts Institute of Technology and has over 25 patents and publications to his credit.

Amr Ragab is a Principal Solutions Architect for EC2 Accelerated Platforms for AWS, devoted to helping customers run computational workloads at scale. In his spare time he likes traveling and finding new ways to integrate technology into daily life.


Building a reinforcement learning agent with JAX, and deploying it on Android with TensorFlow Lite

Posted by Wei Wei, Developer Advocate

In our previous blog post Building a board game app with TensorFlow: a new TensorFlow Lite reference app, we showed you how to use TensorFlow and TensorFlow Agents to train a reinforcement learning (RL) agent to play a simple board game ‘Plane Strike’. We also converted the trained model to TensorFlow Lite and then deployed it into a fully-functional Android app. In this blog, we will demonstrate a new path: train the same RL agent with Flax/JAX and deploy it into the same Android app we have built before. The complete code has been open sourced in the tensorflow/examples repository for your reference.

To refresh your memory, our RL-based agent needs to predict a strike position based on the human player’s board position so that it can finish the game before the human player does. For more detailed game rules, please refer to our previous blog.

Demo game play in ‘Plane Strike’

Background: JAX and TensorFlow

JAX is a NumPy-like library developed by Google Research for high performance computing. It uses XLA to compile programs optimized for GPUs and TPUs. Flax is a popular neural network library built on top of JAX. Researchers have been using JAX/Flax to train very large models with billions of parameters (such as PaLM for language understanding and generation, or Imagen for image generation), making full use of modern hardware. If you’re new to JAX and Flax, start with this JAX 101 tutorial and this Flax Getting Started example.

TensorFlow started as a library for ML towards the end of 2015 and has since become a rich ecosystem that includes tools for productionizing ML pipelines (TFX), data visualization (TensorBoard), deploying ML models to edge devices (TensorFlow Lite), and running models in a web browser or on any device capable of executing JavaScript (TensorFlow.js). Models developed in JAX or Flax can tap into this rich ecosystem by first being converted to the TensorFlow SavedModel format, and then using the same tooling as if they had been developed in TensorFlow natively.

If you already have a JAX-trained model and want to deploy it today, we have put together a list of resources for you:

  • This blog post demos how to convert a Flax/JAX model to TFLite and run it in a native Android app

Overall, no matter what your deployment target is (server, web, or mobile), we've got you covered.

Implementing the game agent with Flax/JAX

Coming back to our board game, to implement our RL agent, we will leverage the same gym environment as before. We will train the same policy gradient model using Flax/JAX this time. Recall that mathematically the policy gradient is defined as:

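$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(*)\right]$$
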
  • T: the number of timesteps per episode, which can vary per episode
  • s_t: the state at timestep t
  • a_t: the chosen action at timestep t given state s_t
  • π_θ: the policy parameterized by θ
  • R(*): the reward gathered, given the policy

We define a 3-layer MLP as our policy network, which predicts the agent’s next strike position.

import jax.numpy as jnp
from flax import linen as nn

# `common` is the shared game-helper module from the example repository.


class PolicyGradient(nn.Module):
  """Neural network to predict the next strike position."""

  @nn.compact
  def __call__(self, x):
    dtype = jnp.float32
    # Flatten each board into a vector before the dense layers.
    x = x.reshape((x.shape[0], -1))
    x = nn.Dense(
        features=2 * common.BOARD_SIZE**2, name='hidden1', dtype=dtype)(
            x)
    x = nn.relu(x)
    x = nn.Dense(features=common.BOARD_SIZE**2, name='hidden2', dtype=dtype)(x)
    x = nn.relu(x)
    x = nn.Dense(features=common.BOARD_SIZE**2, name='logits', dtype=dtype)(x)
    # One probability per board cell.
    policy_probabilities = nn.softmax(x)
    return policy_probabilities

In our main training loop, in each iteration we use the neural network to play a round of the game, gather the trajectory information (game board positions, actions taken and rewards), discount the rewards, and then train the model with the trajectories.

for i in tqdm(range(iterations)):
  predict_fn = functools.partial(run_inference, params)
  board_log, action_log, result_log = common.play_game(predict_fn)
  rewards = common.compute_rewards(result_log)
  optimizer, params, opt_state = train_step(optimizer, params, opt_state,
                                            board_log, action_log, rewards)
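The "discount the rewards" step in the loop can be pictured with the following dependency-free sketch. It is not the implementation of `common.compute_rewards`; the function name `discount_rewards` and the discount factor `gamma=0.5` are assumptions chosen for illustration.

```python
def discount_rewards(step_rewards, gamma=0.5):
    """Return the discounted return G_t = r_t + gamma * G_{t+1} for every timestep."""
    discounted = []
    running = 0.0
    # Walk the episode backwards so each step can reuse the return of the next step.
    for reward in reversed(step_rewards):
        running = reward + gamma * running
        discounted.append(running)
    discounted.reverse()
    return discounted


print(discount_rewards([0.0, 0.0, 1.0]))  # [0.25, 0.5, 1.0]
```

Actions taken shortly before a reward receive more credit than earlier ones, which is what lets the agent learn from sparse hit/miss signals.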

In the train_step() method, we first compute the loss using the trajectories. Then we use jax.grad() to compute the gradients. Lastly we use Optax, a gradient processing and optimization library for JAX, to update the model parameters.

import jax
import jax.numpy as jnp
import optax


def compute_loss(logits, labels, rewards):
  # `logits` holds the softmax probabilities returned by PolicyGradient.
  one_hot_labels = jax.nn.one_hot(labels, num_classes=common.BOARD_SIZE**2)
  loss = -jnp.mean(
      jnp.sum(one_hot_labels * jnp.log(logits), axis=-1) * jnp.asarray(rewards))
  return loss


def train_step(model_optimizer, params, opt_state, game_board_log,
               predicted_action_log, action_result_log):
  """Run one training step."""

  def loss_fn(model_params):
    logits = run_inference(model_params, game_board_log)
    loss = compute_loss(logits, predicted_action_log, action_result_log)
    return loss

  def compute_grads(params):
    return jax.grad(loss_fn)(params)

  grads = compute_grads(params)
  # Optax turns the raw gradients into parameter updates.
  updates, opt_state = model_optimizer.update(grads, opt_state)
  params = optax.apply_updates(params, updates)
  return model_optimizer, params, opt_state


def run_inference(model_params, board):
  logits = PolicyGradient().apply({'params': model_params}, board)
  return logits
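To see what `compute_loss` evaluates without any array library, here is a plain-Python sketch of the same quantity: the negative mean of the reward-weighted log-probabilities of the actions that were actually taken. The function name `policy_gradient_loss` and the two-cell board are made up for this example.

```python
import math


def policy_gradient_loss(action_probs, actions, rewards):
    """Negative mean of reward-weighted log-probabilities of the actions taken."""
    weighted = [
        math.log(probs[action]) * reward
        for probs, action, reward in zip(action_probs, actions, rewards)
    ]
    return -sum(weighted) / len(weighted)


# Two timesteps on a hypothetical two-cell board.
loss = policy_gradient_loss(
    action_probs=[[0.5, 0.5], [0.25, 0.75]],
    actions=[0, 1],
    rewards=[1.0, 2.0],
)
print(round(loss, 4))  # 0.6343
```

Minimizing this value raises the probability of actions that led to high rewards, which is the gradient direction the policy gradient formula prescribes.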

That’s it for the training loop. We can visualize the training progress in TensorBoard as below; here we use the proxy metric ‘game_length’ (the number of steps to finish the game) to track the progress. The intuition is that when the agent becomes smarter, it can finish the game in fewer steps.

Converting the Flax/JAX model to TensorFlow Lite and integrating with the Android app

After the model is trained, we use jax2tf, a TensorFlow-JAX interoperation tool, to convert the JAX model into a TensorFlow concrete function. The final step is to call the TensorFlow Lite converter to convert the concrete function into a TFLite model.

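import os

import tensorflow as tf
from jax.experimental import jax2tf

# Convert to tflite model
model = PolicyGradient()
jax_predict_fn = lambda input: model.apply({'params': params}, input)

# Wrap the converted function with a fixed input signature: one board per call.
tf_predict = tf.function(
    jax2tf.convert(jax_predict_fn, enable_xla=False),
    input_signature=[
        tf.TensorSpec(
            shape=[1, common.BOARD_SIZE, common.BOARD_SIZE],
            dtype=tf.float32)
    ],
    autograph=False,
)

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [tf_predict.get_concrete_function()], tf_predict)

tflite_model = converter.convert()

# Save the model
with open(os.path.join(modeldir, 'planestrike.tflite'), 'wb') as f:
  f.write(tflite_model)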

The JAX-converted TFLite model behaves exactly like any TensorFlow-trained TFLite model. You can visualize it with Netron:

Visualizing TFLite model converted from Flax/JAX using Netron
We can use exactly the same Java code as before to invoke the model and get the prediction.

convertBoardStateToByteBuffer(board);
// `tflite` (the TFLite Interpreter) and `boardData` (the input buffer) are assumed names.
tflite.run(boardData, outputProbArrays);
float[] probArray = outputProbArrays[0];
int agentStrikePosition = -1;
float maxProb = 0;
// Pick the untried cell with the highest predicted probability.
for (int i = 0; i < probArray.length; i++) {
  int x = i / Constants.BOARD_SIZE;
  int y = i % Constants.BOARD_SIZE;
  if (board[x][y] == BoardCellStatus.UNTRIED && probArray[i] > maxProb) {
    agentStrikePosition = i;
    maxProb = probArray[i];
  }
}

In summary, this article walks you through how to train a simple reinforcement learning model with Flax/JAX, leverage jax2tf to convert it to TensorFlow Lite, and integrate the converted model into an Android app.

Now you have learned how to build neural network models with Flax/JAX, and tap into the powerful TensorFlow ecosystem to deploy your models pretty much anywhere you want. We can’t wait to see the fantastic apps you build with both JAX and TensorFlow!


Wiggling toward bio-inspired machine intelligence

Juncal Arbelaiz Mugica is a native of Spain, where octopus is a common menu item. However, Arbelaiz appreciates octopus and similar creatures in a different way, with her research into soft-robotics theory. 

More than half of an octopus’ nerves are distributed through its eight arms, each of which has some degree of autonomy. This distributed sensing and information processing system intrigued Arbelaiz, who is researching how to design decentralized intelligence for human-made systems with embedded sensing and computation. At MIT, Arbelaiz is an applied math student who is working on the fundamentals of optimal distributed control and estimation in the final weeks before completing her PhD this fall.

She finds inspiration in the biological intelligence of invertebrates such as octopus and jellyfish, with the ultimate goal of designing novel control strategies for flexible “soft” robots that could be used in tight or delicate surroundings, such as in surgical tools or search-and-rescue missions.

“The squishiness of soft robots allows them to dynamically adapt to different environments. Think of worms, snakes, or jellyfish, and compare their motion and adaptation capabilities to those of vertebrate animals,” says Arbelaiz. “It is an interesting expression of embodied intelligence — lacking a rigid skeleton gives advantages to certain applications and helps to handle uncertainty in the real world more efficiently. But this additional softness also entails new system-theoretic challenges.”

In the biological world, the “controller” is usually associated with the brain and central nervous system — it creates motor commands for the muscles to achieve movement. Jellyfish and a few other soft organisms lack a centralized nerve center, or brain. Inspired by this observation, she is now working toward a theory where soft-robotic systems could be controlled using decentralized sensory information sharing.

“When sensing and actuation are distributed in the body of the robot and onboard computational capabilities are limited, it might be difficult to implement centralized intelligence,” she says. “So, we need these sort of decentralized schemes that, despite sharing sensory information only locally, guarantee the desired global behavior. Some biological systems, such as the jellyfish, are beautiful examples of decentralized control architectures — locomotion is achieved in the absence of a (centralized) brain. This is fascinating as compared to what we can achieve with human-made machines.”

A fluid transition to MIT

Her graduate studies at the University of Navarra in San Sebastian led to her working with MIT Professor John Bush in fluid dynamics. In 2015, he invited Arbelaiz to MIT as a visiting student to investigate droplet interactions. This led to their 2018 paper in Physical Review Fluids, and her pursuit of a PhD at MIT.   

In 2018, her doctoral research shifted to the interdisciplinary Sociotechnical Systems Research Center (SSRC), and she is now advised by Ali Jadbabaie, the JR East Professor of Engineering and head of the Department of Civil and Environmental Engineering; and School of Engineering Associate Dean Anette “Peko” Hosoi, who is the Neil and Jane Pappalardo Professor of Mechanical Engineering as well as an applied math professor. Arbelaiz also regularly works with Bassam Bamieh, associate director of the Center for Control, Dynamical Systems, and Computation at the University of California at Santa Barbara. She says that working with this team of advisors gives her the freedom to explore the multidisciplinary research projects she has been drawn to over the past five years.

For example, she uses system-theoretic approaches to design novel optimal controllers and estimators for systems with spatiotemporal dynamics, and to gain a fundamental understanding of the sensory feedback communication topologies required to optimally control these systems. For the soft-robotic applications, this amounts to ranking which sensory measurements are important to best trigger each of the “muscles” of this robot. Does the robot’s performance degrade when each actuator only has access to the closest sensory measurements? Her research characterizes such a trade-off between closed-loop performance, uncertainty, and complexity in spatially distributed systems.

“I am determined to bridge the gap between machine autonomy, systems theory, and biological intelligence,” she says.

Next chapter

A two-year Schmidt Science Fellowship, which funds young researchers to pursue postdoctoral studies in a field different from their graduate work, will let Arbelaiz further explore the intersection of biological and machine intelligence after graduation. 

She plans to spend her postdoc time at Princeton University with Professor Naomi Leonard, and to work with researchers in systems biology, computer science, and robotics, to explore the reliability and robustness of biological and artificial ensembles. Specifically, she is interested in learning how biological systems efficiently adapt to different environments so that she can apply this knowledge to human-made systems, such as autonomous machines, whose vulnerability to noise and uncertainty creates safety issues.

“I foresee an unprecedented revolution approaching in autonomous and intelligent machines, facilitated by a fruitful symbiosis between systems theory, computation, and (neuro)biology,” she says.

Paying it forward

Arbelaiz grew up in Spain acutely aware of the privilege of having access to a better education than her parents. Her father earned a degree in economics through independent study while working to support his family. His daughter inherited his persistence. 

“The hardships my parents experienced made them cherish autodidactism, lifelong learning, and critical thinking,” she says. “They passed on these values to me, so I grew up to be a curious and persevering person, enthusiastic about science and ready to seize every educational opportunity.”  

In a desire to pass this on to others, she mentors STEM students who lack guidance or resources. “I firmly believe that we should promote talent everywhere, and mentoring could be the key driver to encourage underrepresented minorities to pursue careers in STEM,” she says.

An advocate for women in STEM, she was part of the executive committee of Graduate Women at MIT (GWAMIT) and MIT Women in Mathematics, and participates in various panels and workshops. She also runs live experiments for kids, such as at the MIT Museum’s Girls Day events.

“As scientists, we are responsible to share our knowledge, to inform the public about scientific discovery and its impact, and to raise awareness about the value of research and the need to invest in it.” 

Arbelaiz also supports MIT’s Covid-19 outreach efforts, including talks about the mathematical modeling of the virus, and translating into Basque her former mentor John Bush’s MIT Covid-19 Indoor Safety app.

This interest in paying her STEM knowledge forward is something she credits to her MIT education. 

“MIT has been one of the best experiences of my life so far: it has brought enormous academic, professional, and personal growth,” she says. “I share MIT’s taste for collaborative and multidisciplinary research, the attraction to intellectual challenges, and the enthusiasm for advancing science and technology to benefit humankind.”

Read More

Wiggling toward bio-inspired machine intelligence

Juncal Arbelaiz Mugica is a native of Spain, where octopus is a common menu item. However, Arbelaiz appreciates octopus and similar creatures in a different way, with her research into soft-robotics theory. 

More than half of an octopus’ nerves are distributed through its eight arms, each of which has some degree of autonomy. This distributed sensing and information processing system intrigued Arbelaiz, who is researching how to design decentralized intelligence for human-made systems with embedded sensing and computation. At MIT, Arbelaiz is an applied math student who is working on the fundamentals of optimal distributed control and estimation in the final weeks before completing her PhD this fall.

She finds inspiration in the biological intelligence of invertebrates such as octopus and jellyfish, with the ultimate goal of designing novel control strategies for flexible “soft” robots that could be used in tight or delicate surroundings, such as in surgical tools or on search-and-rescue missions.

“The squishiness of soft robots allows them to dynamically adapt to different environments. Think of worms, snakes, or jellyfish, and compare their motion and adaptation capabilities to those of vertebrate animals,” says Arbelaiz. “It is an interesting expression of embodied intelligence — lacking a rigid skeleton gives advantages to certain applications and helps to handle uncertainty in the real world more efficiently. But this additional softness also entails new system-theoretic challenges.”

In the biological world, the “controller” is usually associated with the brain and central nervous system — it creates motor commands for the muscles to achieve movement. Jellyfish and a few other soft organisms lack a centralized nerve center, or brain. Inspired by this observation, she is now working toward a theory where soft-robotic systems could be controlled using decentralized sensory information sharing.

“When sensing and actuation are distributed in the body of the robot and onboard computational capabilities are limited, it might be difficult to implement centralized intelligence,” she says. “So, we need these sort of decentralized schemes that, despite sharing sensory information only locally, guarantee the desired global behavior. Some biological systems, such as the jellyfish, are beautiful examples of decentralized control architectures — locomotion is achieved in the absence of a (centralized) brain. This is fascinating as compared to what we can achieve with human-made machines.”

A fluid transition to MIT

Her graduate studies at the University of Navarra in San Sebastian led to her working with MIT Professor John Bush in fluid dynamics. In 2015, he invited Arbelaiz to MIT as a visiting student to investigate droplet interactions. This led to their 2018 paper in Physical Review Fluids, and her pursuit of a PhD at MIT.   

In 2018, her doctoral research shifted to the interdisciplinary Sociotechnical Systems Research Center (SSRC), and she is now advised by Ali Jadbabaie, the JR East Professor of Engineering and head of the Department of Civil and Environmental Engineering; and School of Engineering Associate Dean Anette “Peko” Hosoi, who is the Neil and Jane Pappalardo Professor of Mechanical Engineering as well as an applied math professor. Arbelaiz also regularly works with Bassam Bamieh, associate director of the Center for Control, Dynamical Systems, and Computation at the University of California at Santa Barbara. She says that working with this team of advisors gives her the freedom to explore the multidisciplinary research projects she has been drawn to over the past five years.

For example, she uses system-theoretic approaches to design novel optimal controllers and estimators for systems with spatiotemporal dynamics, and to gain a fundamental understanding of the sensory feedback communication topologies required to optimally control these systems. For the soft-robotic applications, this amounts to ranking which sensory measurements are most important for triggering each of the robot’s “muscles.” Does the robot’s performance degrade when each actuator has access only to the closest sensory measurements? Her research characterizes such trade-offs between closed-loop performance, uncertainty, and complexity in spatially distributed systems.

“I am determined to bridge the gap between machine autonomy, systems theory, and biological intelligence,” she says.

Next chapter

A two-year Schmidt Science Fellowship, which funds young researchers to pursue postdoctoral studies in a field different from their graduate work, will let Arbelaiz further explore the intersection of biological and machine intelligence after graduation. 

She plans to spend her postdoc time at Princeton University with Professor Naomi Leonard, and to work with researchers in systems biology, computer science, and robotics, to explore the reliability and robustness of biological and artificial ensembles. Specifically, she is interested in learning how biological systems efficiently adapt to different environments so that she can apply this knowledge to human-made systems, such as autonomous machines, whose vulnerability to noise and uncertainty creates safety issues.

“I foresee an unprecedented revolution approaching in autonomous and intelligent machines, facilitated by a fruitful symbiosis between systems theory, computation, and (neuro)biology,” she says.

Paying it forward

Arbelaiz grew up in Spain acutely aware of the privilege of having access to a better education than her parents. Her father earned a degree in economics through independent study while working to support his family. His daughter inherited his persistence. 

“The hardships my parents experienced made them cherish autodidactism, lifelong learning, and critical thinking,” she says. “They passed on these values to me, so I grew up to be a curious and persevering person, enthusiastic about science and ready to seize every educational opportunity.”  
