Assessing AI system performance: thinking beyond models to deployment contexts

A graphic overview of the way performance assessment methods change across the development lifecycle. It has four phases: getting started, connecting with users, tuning the user experience, and performance assessment in the deployment context. It visually shows how the balance of user experience and tech development changes over these four phases.
Figure 1: Performance assessment methods change across the development lifecycle for complex AI systems in ways that differ from general purpose AI. The emphasis shifts from rapid technical innovation that requires easy-to-calculate aggregate performance metrics at the beginning of the development process to metrics that reflect the performance of critical AI system attributes needed to underpin the user experience at the end.

AI systems are becoming increasingly complex as we move from visionary research to deployable technologies such as self-driving cars, clinical predictive models, and novel accessibility devices. Compared with singular AI models, it is more difficult to assess whether these more complex AI systems are performing consistently and as intended. Several characteristics of such systems contribute to this difficulty:

    1. Real-world contexts in which the data may be noisy or differ from the training data;
    2. Multiple AI components that interact with each other, creating unanticipated dependencies and behaviors;
    3. Human-AI feedback loops that arise from repeated engagements between people and the AI system;
    4. Very large AI models (e.g., transformer models);
    5. AI models that interact with other parts of a system (e.g., a user interface or heuristic algorithm).

How do we know when these more advanced systems are ‘good enough’ for their intended use? When assessing the performance of AI models, we often rely on aggregate performance metrics, such as the percentage of accurate predictions. But doing so ignores the many, often human, elements that make up an AI system.

Our research on what it takes to build forward-looking, inclusive AI experiences has demonstrated that getting to ‘good enough’ requires multiple performance assessment approaches at different stages of the development lifecycle, based upon realistic data and key user needs (figure 1).

Shifting emphasis gradually from iterative adjustments in the AI models themselves toward approaches that improve the AI system as a whole has implications not only for how performance is assessed but also for who should be involved in the performance assessment process. Engaging (and training) non-technical domain experts earlier (e.g., for choosing test data or defining experience metrics) and in a larger capacity throughout the development lifecycle can enhance the relevance, usability, and reliability of the AI system.

Performance assessment best practices from the PeopleLens

The PeopleLens (figure 2) is a new Microsoft technology designed to enable children who are born blind to experience social agency and build up the range of social attention skills needed to initiate and maintain social interactions. Running on smart glasses, it provides the wearer with continuous, real-time information about the people around them through spatial audio, helping them build up a dynamic map of the whereabouts of others. Its underlying technology is a complex AI system that uses several computer vision algorithms to calculate pose, identify registered people, and track those entities over time.

The PeopleLens offers a useful illustration of the wide range of performance assessment methods and people necessary to comprehensively gauge its efficacy.

A young boy wearing the PeopleLens sits on the floor of a playroom holding a blind tennis ball in his hands. His attention is directed toward a woman sitting on the floor in front of him holding her hands out. The PeopleLens looks like small goggles that sit on the forehead. The image is marked with visual annotations to indicate what the PeopleLens is seeing and what sounds are being heard.
Figure 2: The PeopleLens is a new research technology designed to help people who are blind or have low vision better understand their immediate social environments by locating and identifying people in the space dynamically in real-time.

Getting started: AI model or AI system performance?

Calculating aggregate performance metrics on open-source benchmark datasets may demonstrate the capability of an individual AI model but may be insufficient when applied to an entire AI system. It can be tempting to believe that a single aggregate performance metric (such as accuracy) is sufficient to validate multiple AI models individually, but the performance of two AI models in a system cannot be comprehensively measured by simply summing each model’s aggregate performance metric.

We used two AI models to test the ability of the PeopleLens to locate and identify people. The first was a benchmarked, state-of-the-art pose model used to indicate the location of people in an image; the second was a novel facial recognition algorithm previously demonstrated to have greater than 90% accuracy. Despite the strong historical performance of these two models, when combined in the PeopleLens, the AI system recognized only 10% of people in a realistic dataset in which people were not always facing the camera.

This finding illustrates that multi-algorithm systems are more than a sum of their parts, requiring specific performance assessment approaches.
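
To make the composition effect concrete, here is a toy Monte Carlo sketch in Python. All numbers (accuracies, the fraction of frames with a camera-facing person) are illustrative assumptions rather than PeopleLens measurements; the point is that when one model can only succeed under conditions the other model's benchmark never exercised, system-level accuracy falls far below either model's standalone score.

```python
import random

random.seed(0)

def system_recognition_rate(n_frames=100_000, p_facing=0.35,
                            pose_acc=0.95, face_acc=0.90):
    """Toy model of a two-stage pipeline: the face recognizer can only
    succeed when the pose model located the person AND the person is
    facing the camera."""
    recognized = 0
    for _ in range(n_frames):
        pose_ok = random.random() < pose_acc
        facing = random.random() < p_facing
        face_ok = random.random() < face_acc
        if pose_ok and facing and face_ok:
            recognized += 1
    return recognized / n_frames

print(f"system-level recognition rate: {system_recognition_rate():.0%}")
# ~30% despite 95% and 90% standalone accuracies; stricter assumptions
# about head angle push this toward the 10% observed in practice.
```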

Connecting to the human experience: Metric scorecards and realistic data 

Metrics scorecards, calculated on a realistic reference dataset, offer one way to connect to the human experience while the AI system is still undergoing significant technical iteration. A metrics scorecard can combine several metrics to measure aspects of the system that are most important to users.

We used ten metrics in the development of the PeopleLens. The two most valuable were time-to-first-identification, which measured how long it took from the moment a person appeared in a frame to the user hearing the name of that person, and number of repeat false positives, which measured how often a false positive occurred in three or more consecutive frames within the reference dataset.

The first metric captured the core value proposition for the user: having the social agency to be the first to say hello when someone approaches. The second was important because the AI system would self-correct single misidentifications, while repeated mistakes would lead to a poor user experience; it measured the ramifications of accuracy throughout the system, rather than just on a per-frame basis.
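
As a rough illustration of how such experience metrics might be computed, here is a hedged Python sketch over a hypothetical per-frame log. The Frame structure, its field names, and the sample data are ours, not the PeopleLens implementation; the two functions follow the definitions above: time from a person's first appearance to their first announcement, and false positives repeated in three or more consecutive frames.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    t: float        # timestamp in seconds
    present: set    # ground-truth names visible in this frame
    announced: set  # names the system announced in this frame

# Hypothetical log of four frames at 2 fps.
log = [
    Frame(0.0, {"Ana"}, set()),
    Frame(0.5, {"Ana"}, {"Ben"}),          # false positive
    Frame(1.0, {"Ana"}, {"Ben"}),          # repeated false positive
    Frame(1.5, {"Ana"}, {"Ben", "Ana"}),   # third repeat; Ana finally heard
]

def time_to_first_identification(log, person):
    first_seen = next(f.t for f in log if person in f.present)
    first_heard = next(f.t for f in log if person in f.announced)
    return first_heard - first_seen

def repeat_false_positives(log, min_run=3):
    """Count names wrongly announced in >= min_run consecutive frames."""
    runs, count = {}, 0
    for f in log:
        wrong = f.announced - f.present
        for name in wrong:
            runs[name] = runs.get(name, 0) + 1
            if runs[name] == min_run:
                count += 1
        for name in runs:
            if name not in wrong:
                runs[name] = 0
    return count

print(time_to_first_identification(log, "Ana"))  # 1.5 seconds
print(repeat_false_positives(log))               # 1 repeated false positive
```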

Beyond metrics: Using visualization tools to fine-tune the user experience

While metrics play a critical role in the development of AI systems, a wider range of tools is needed to fine-tune the intended user experience. It is essential for development teams to test on realistic datasets to understand how the AI system generates the actual user experience. This is especially important for complex systems, where multiple models, human-AI feedback loops, or unpredictable data (e.g., user-controlled data capture) can cause the AI system to respond unpredictably.

Visualization tools can enhance the top-down statistical tools of data scientists, helping domain experts contribute to system development. In the PeopleLens, we used custom-built visualization tools to compare side-by-side renditions of the experience with different model parameters (figure 3). We leveraged these visualizations to enable domain experts—in this case parents and teachers—to spot patterns of odd system behavior across the data.

Project Tokyo studio interface
Figure 3: Visualization tools helped the development team, including domain experts, connect the AI system to the user experience using realistic data. In this image, the top bar shows images taken from the wearable camera stream overlaid with the various model outcomes. The bottom bar shows the output of the world-state tracking algorithm on the left and the ground truth on the right. The panel in the middle shows model parameters being changed, with the impact on the user experience viewed in real time.

AI system performance in the context of the user experience

A user experience can only be as good as the underlying AI system. Testing the AI system in a realistic context, measuring things that matter to users, is a critical stage before widespread deployment. We know, for example, that improving AI system performance does not necessarily correspond to improved performance of human-AI teams (reference).

We also know that human-AI feedback loops can make it difficult to measure an AI system’s performance. These feedback loops, which are essentially repeated interactions between the AI system and its user, can surface (and intensify) errors. They can also be repaired by the user when the system offers good intelligibility.

The PeopleLens gave users feedback about the locations and faces of the people around them. A missed identification (e.g., because the camera captured a person’s chest rather than their face) can be resolved once the user responds to this feedback (e.g., by looking up). This shows that performance assessment need not focus on missed identifications, as the human-AI feedback loop resolves them. However, users were very perplexed by identifications of people who were no longer present, so performance assessments needed to focus on these false positive misidentifications.

Our work on the PeopleLens points to four takeaways for assessing complex AI systems:

    1. Multiple performance assessment methods should be used in AI system development. In contrast to developing individual AI models, general aggregate performance metrics are a small component, relevant primarily in the earliest stages of development.
    2. Documenting AI system performance should include a range of approaches, from metrics scorecards to system performance metrics for a deployed user experience, to visualization tools.
    3. Domain experts play an important role in performance assessment, beginning early in the development lifecycle. However, domain experts are often not prepared for, or skilled in, the in-depth participation that is optimal in AI system development, which is why engagement should include training.
    4. Visualization tools are as important as metrics in creating and documenting an AI system for a particular intended use. It is critical that domain experts have access to these tools as key decision-makers in AI system deployment.

Bringing it all together 

For complex AI systems, performance assessment methods change across the development lifecycle in ways that differ from individual AI models. At the beginning of development, rapid technical innovation calls for easy-to-calculate aggregate metrics; toward the end, assessment shifts to performance metrics that reflect the critical AI system attributes that make up the user experience. Shifting techniques in this way helps every type of stakeholder precisely and collectively define what is ‘good enough’ to achieve the intended use.

It is useful for developers to remember performance assessment is not an end goal in itself; it is a process that defines how the system has reached its best state and whether that state is ready for deployment. The performance assessment process must include a broad range of stakeholders, including domain experts, who may need new tools to fulfill critical (sometimes unexpected) roles in the development and deployment of an AI system.

Read More

Announcing PyTorch Conference 2022

We are excited to announce that the PyTorch Conference returns in person as a satellite event to NeurIPS (Neural Information Processing Systems) in New Orleans on Dec. 2nd.

We changed the name from PyTorch Developer Day to PyTorch Conference to signify the turning of a new chapter as we look to the future of PyTorch, encompassing the entire PyTorch community. This conference will bring together leading researchers, academics, and developers from the Machine Learning (ML) and Deep Learning (DL) communities for a program of talks and a poster session covering new software releases of PyTorch, use cases in academia and industry, and ML/DL development and production trends.

EVENT OVERVIEW

When: Dec 2nd, 2022 (In-Person and Virtual)

Where: New Orleans, Louisiana (USA) | Virtual option as well

SCHEDULE

All times are in Central Standard Time.

8:00-9:00 am   Registration/Check in

9:00-11:20 am   Keynote & Technical Talks

11:30 am-1:00 pm   Lunch

1:00-3:00 pm   Poster Session & Breakouts

3:00-4:00 pm   Community/Partner Talks

4:00-5:00 pm   Panel Discussion

Agenda subject to change.

All talks will be livestreamed and available to the public. The in-person event will be by invitation only as space is limited. If you’d like to apply to attend in person, please submit all requests here.

Read More

Understanding reality through algorithms

Although Fernanda De La Torre still has several years left in her graduate studies, she’s already dreaming big when it comes to what the future has in store for her.

“I dream of opening up a school one day where I could bring this world of understanding of cognition and perception into places that would never have contact with this,” she says.

It’s that kind of ambitious thinking that’s gotten De La Torre, a doctoral student in MIT’s Department of Brain and Cognitive Sciences, to this point. A recent recipient of the prestigious Paul and Daisy Soros Fellowship for New Americans, De La Torre has found at MIT a supportive, creative research environment that’s allowed her to delve into the cutting-edge science of artificial intelligence. But she’s still driven by an innate curiosity about human imagination and a desire to bring that knowledge to the communities in which she grew up.

An unconventional path to neuroscience

De La Torre’s first exposure to neuroscience wasn’t in the classroom, but in her daily life. As a child, she watched her younger sister struggle with epilepsy. At 12, she crossed into the United States from Mexico illegally to reunite with her mother, exposing her to a whole new language and culture. Once in the States, she had to grapple with her mother’s shifting personality in the midst of an abusive relationship. “All of these different things I was seeing around me drove me to want to better understand how psychology works,” De La Torre says, “to understand how the mind works, and how it is that we can all be in the same environment and feel very different things.”

But finding an outlet for that intellectual curiosity was challenging. As an undocumented immigrant, her access to financial aid was limited. Her high school was also underfunded and lacked elective options. Mentors along the way, though, encouraged the aspiring scientist, and through a program at her school, she was able to take community college courses to fulfill basic educational requirements.

It took an inspiring amount of dedication to her education, but De La Torre made it to Kansas State University for her undergraduate studies, where she majored in computer science and math. At Kansas State, she was able to get her first real taste of research. “I was just fascinated by the questions they were asking and this entire space I hadn’t encountered,” says De La Torre of her experience working in a visual cognition lab and discovering the field of computational neuroscience.

Although Kansas State didn’t have a dedicated neuroscience program, her research experience in cognition led her to a machine learning lab led by William Hsu, a computer science professor. There, De La Torre became enamored by the possibilities of using computation to model the human brain. Hsu’s support also convinced her that a scientific career was a possibility. “He always made me feel like I was capable of tackling big questions,” she says fondly.

With the confidence imparted in her at Kansas State, De La Torre came to MIT in 2019 as a post-baccalaureate student in the lab of Tomaso Poggio, the Eugene McDermott Professor of Brain and Cognitive Sciences and an investigator at the McGovern Institute for Brain Research. With Poggio, also the director of the Center for Brains, Minds and Machines, De La Torre began working on deep-learning theory, an area of machine learning focused on how artificial neural networks modeled on the brain learn to recognize patterns.

“It’s a very interesting question because we’re starting to use them everywhere,” says De La Torre of neural networks, listing off examples from self-driving cars to medicine. “But, at the same time, we don’t fully understand how these networks can go from knowing nothing and just being a bunch of numbers to outputting things that make sense.”

Her experience as a post-bac was De La Torre’s first real opportunity to apply the technical computer skills she developed as an undergraduate to neuroscience. It was also the first time she could fully focus on research. “That was the first time that I had access to health insurance and a stable salary. That was, in itself, sort of life-changing,” she says. “But on the research side, it was very intimidating at first. I was anxious, and I wasn’t sure that I belonged here.”

Fortunately, De La Torre says she was able to overcome those insecurities, both through a growing unabashed enthusiasm for the field and through the support of Poggio and her other colleagues in MIT’s Department of Brain and Cognitive Sciences. When the opportunity came to apply to the department’s PhD program, she jumped on it. “It was just knowing these kinds of mentors are here and that they cared about their students,” says De La Torre of her decision to stay on at MIT for graduate studies. “That was really meaningful.”

Expanding notions of reality and imagination

In her two years so far in the graduate program, De La Torre’s work has expanded the understanding of neural networks and their applications to the study of the human brain. Working with Guangyu Robert Yang, an associate investigator at the McGovern Institute and an assistant professor in the departments of Brain and Cognitive Sciences and Electrical Engineering and Computer Science, she’s engaged in what she describes as more philosophical questions about how one develops a sense of self as an independent being. She’s interested in how that self-consciousness develops and why it might be useful.

De La Torre’s primary advisor, though, is Professor Josh McDermott, who leads the Laboratory for Computational Audition. With McDermott, De La Torre is attempting to understand how the brain integrates vision and sound. While combining sensory inputs may seem like a basic process, there are many unanswered questions about how our brains combine multiple signals into a coherent impression, or percept, of the world. Many of the questions are raised by audiovisual illusions in which what we hear changes what we see. For example, if one sees a video of two discs passing each other, but the clip contains the sound of a collision, the brain will perceive that the discs are bouncing off, rather than passing through each other. Given an ambiguous image, that simple auditory cue is all it takes to create a different perception of reality.

“There’s something interesting happening where our brains are receiving two signals telling us different things and, yet, we have to combine them somehow to make sense of the world,” she says.

De La Torre is using behavioral experiments to probe how the human brain makes sense of multisensory cues to construct a particular perception. To do so, she’s created various scenes of objects interacting in 3D space over different sounds, asking research participants to describe characteristics of the scene. For example, in one experiment, she combines visuals of a block moving across a surface at different speeds with various scraping sounds, asking participants to estimate how rough the surface is. Eventually she hopes to take the experiment into virtual reality, where participants will physically push blocks in response to how rough they perceive the surface to be, rather than just reporting on what they experience.

Once she’s collected data, she’ll move into the modeling phase of the research, evaluating whether multisensory neural networks perceive illusions the way humans do. “What we want to do is model exactly what’s happening,” says De La Torre. “How is it that we’re receiving these two signals, integrating them and, at the same time, using all of our prior knowledge and inferences of physics to really make sense of the world?”

Although her two strands of research with Yang and McDermott may seem distinct, she sees clear connections between the two. Both projects are about grasping what artificial neural networks are capable of and what they tell us about the brain. At a more fundamental level, she says that how the brain perceives the world from different sensory cues might be part of what gives people a sense of self. Sensory perception is about constructing a cohesive, unitary sense of the world from multiple sources of sensory data. Similarly, she argues, “the sense of self is really a combination of actions, plans, goals, emotions, all of these different things that are components of their own, but somehow create a unitary being.”

It’s a fitting sentiment for De La Torre, who has been working to make sense of and integrate different aspects of her own life. Working in the Computational Audition lab, for example, she’s started experimenting with combining electronic music with folk music from her native Mexico, connecting her “two worlds,” as she says. Having the space to undertake those kinds of intellectual explorations, and colleagues who encourage it, is one of De La Torre’s favorite parts of MIT.

“Beyond professors, there’s also a lot of students whose way of thinking just amazes me,” she says. “I see a lot of goodness and excitement for science and a little bit of — it’s not nerdiness, but a love for very niche things — and I just kind of love that.”

Read More


Large-scale revenue forecasting at Bosch with Amazon Forecast and Amazon SageMaker custom models

This post is co-written by Goktug Cinar, Michael Binder, and Adrian Horvath from Bosch Center for Artificial Intelligence (BCAI).

Revenue forecasting is a challenging yet crucial task for strategic business decisions and fiscal planning in most organizations. Often, revenue forecasting is manually performed by financial analysts and is both time consuming and subjective. Such manual efforts are especially challenging for large-scale, multinational business organizations that require revenue forecasts across a wide range of product groups and geographical areas at multiple levels of granularity. This requires not only accuracy but also hierarchical coherence of the forecasts.

Bosch is a multinational corporation with entities operating in multiple sectors, including automotive, industrial solutions, and consumer goods. Given the impact of accurate and coherent revenue forecasting on healthy business operations, the Bosch Center for Artificial Intelligence (BCAI) has been heavily investing in the use of machine learning (ML) to improve the efficiency and accuracy of financial planning processes. The goal is to alleviate the manual processes by providing reasonable baseline revenue forecasts via ML, with only occasional adjustments needed by the financial analysts using their industry and domain knowledge.

To achieve this goal, BCAI has developed an internal forecasting framework capable of providing large-scale hierarchical forecasts via customized ensembles of a wide range of base models. A meta-learner selects the best-performing models based on features extracted from each time series. The forecasts from the selected models are then averaged to obtain the aggregated forecast. The architectural design is modularized and extensible through the implementation of a REST-style interface, which allows continuous performance improvement via the inclusion of additional models.

BCAI partnered with the Amazon ML Solutions Lab (MLSL) to incorporate the latest advances in deep neural network (DNN)-based models for revenue forecasting. Recent advances in neural forecasters have demonstrated state-of-the-art performance for many practical forecasting problems. Compared to traditional forecasting models, many neural forecasters can incorporate additional covariates or metadata of the time series. We include CNN-QR and DeepAR+, two off-the-shelf models in Amazon Forecast, as well as a custom Transformer model trained using Amazon SageMaker. The three models cover a representative set of the encoder backbones often used in neural forecasters: convolutional neural network (CNN), sequential recurrent neural network (RNN), and transformer-based encoders.

One of the key challenges faced by the BCAI-MLSL partnership was to provide robust and reasonable forecasts under the impact of COVID-19, an unprecedented global event causing great volatility on global corporate financial results. Because neural forecasters are trained on historical data, the forecasts generated based on out-of-distribution data from the more volatile periods could be inaccurate and unreliable. Therefore, we proposed the addition of a masked attention mechanism in the Transformer architecture to address this issue.

The neural forecasters can be bundled as a single ensemble model, or incorporated individually into Bosch’s model universe, and accessed easily via REST API endpoints. We propose an approach to ensemble the neural forecasters through backtest results, which provides competitive and robust performance over time. Additionally, we investigated and evaluated a number of classical hierarchical reconciliation techniques to ensure that forecasts aggregate coherently across product groups, geographies, and business organizations.

In this post, we demonstrate the following:

  • How to apply Forecast and SageMaker custom model training for hierarchical, large-scale time-series forecasting problems
  • How to ensemble custom models with off-the-shelf models from Forecast
  • How to reduce the impact of disruptive events such as COVID-19 on forecasting problems
  • How to build an end-to-end forecasting workflow on AWS

Challenges

We addressed two challenges: creating hierarchical, large-scale revenue forecasting, and the impact of the COVID-19 pandemic on long-term forecasting.

Hierarchical, large-scale revenue forecasting

Financial analysts are tasked with forecasting key financial figures, including revenue, operational costs, and R&D expenditures. These metrics provide business planning insights at different levels of aggregation and enable data-driven decision-making. Any automated forecasting solution needs to provide forecasts at any arbitrary level of business-line aggregation. At Bosch, the aggregations can be thought of as grouped time series, a more general form of hierarchical structure. The following figure shows a simplified example with a two-level structure, which mimics the hierarchical revenue forecasting structure at Bosch. The total revenue is split into multiple levels of aggregation based on product and region.

The total number of time series that need to be forecasted at Bosch is at the scale of millions. Notice that the top-level time series can be split by either products or regions, creating multiple paths to the bottom level forecasts. The revenue needs to be forecasted at every node in the hierarchy with a forecasting horizon of 12 months into the future. Monthly historical data is available.

The hierarchical structure can be represented with the notation of a summing matrix S (Hyndman and Athanasopoulos):

y_t = S b_t

Here, y_t is the vector of observations for all time series in the hierarchy at time t, S is the summing matrix that encodes the aggregation structure, and b_t is the vector of the bottom-level time series at time t.
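
A minimal numpy example of this representation, assuming a toy hierarchy (a total split into two products, each split into two regions) rather than Bosch's actual structure:

```python
import numpy as np

# Bottom-level series b_t = [A1, A2, B1, B2] at a single time step.
b_t = np.array([10., 5., 7., 3.])

# Summing matrix S: one row per node in the hierarchy.
# Rows: total, product A, product B, A1, A2, B1, B2.
S = np.array([
    [1, 1, 1, 1],   # total = A1 + A2 + B1 + B2
    [1, 1, 0, 0],   # product A
    [0, 0, 1, 1],   # product B
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

y_t = S @ b_t
print(y_t)  # [25. 15. 10. 10.  5.  7.  3.]
```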

Impacts of the COVID-19 pandemic

The COVID-19 pandemic brought significant challenges for forecasting due to its disruptive and unprecedented effects on almost all aspects of work and social life. For long-term revenue forecasting, the disruption also brought unexpected downstream impacts. To illustrate this problem, the following figure shows a sample time series where the product revenue experienced a significant drop at the start of the pandemic and gradually recovered afterwards. A typical neural forecasting model will take revenue data including the out-of-distribution (OOD) COVID period as the historical context input, as well as the ground truth for model training. As a result, the forecasts produced are no longer reliable.

Modeling approaches

In this section, we discuss our various modeling approaches.

Amazon Forecast

Forecast is a fully managed AI/ML service from AWS that provides preconfigured, state-of-the-art time series forecasting models. It combines these offerings with its internal capabilities for automated hyperparameter optimization, ensemble modeling (for the models provided by Forecast), and probabilistic forecast generation. This allows you to easily ingest custom datasets, preprocess data, train forecasting models, and generate robust forecasts. The service’s modular design further enables us to easily query and combine predictions from additional custom models developed in parallel.

We incorporate two neural forecasters from Forecast: CNN-QR and DeepAR+. Both are supervised deep learning methods that train a global model for the entire time series dataset. Both CNN-QR and DeepAR+ can take in static metadata about each time series, which in our case is the corresponding product, region, and business organization. They also automatically add temporal features, such as month of the year, as part of the input to the model.
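
As a sketch of what training one of these predictors can look like with the boto3 Forecast API: the dataset group ARN below is a placeholder for an already-ingested revenue dataset, and the 12-month horizon and monthly frequency follow this use case rather than any published Bosch configuration.

```python
import boto3

forecast = boto3.client("forecast")

# Train a CNN-QR predictor with automatic HPO enabled.
response = forecast.create_predictor(
    PredictorName="revenue_cnnqr",
    AlgorithmArn="arn:aws:forecast:::algorithm/CNN-QR",
    ForecastHorizon=12,  # 12 months ahead
    PerformHPO=True,
    InputDataConfig={"DatasetGroupArn": "arn:aws:forecast:..."},  # placeholder
    FeaturizationConfig={"ForecastFrequency": "M"},
)
print(response["PredictorArn"])
```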

Transformer with attention masks for COVID

The Transformer architecture (Vaswani et al.), originally designed for natural language processing (NLP), has recently emerged as a popular architectural choice for time series forecasting. Here, we used the Transformer architecture described in Zhou et al. without the probabilistic log-sparse attention mechanism. The model uses a typical architecture design combining an encoder and a decoder. For revenue forecasting, we configure the decoder to directly output the forecast for the 12-month horizon instead of generating the forecast month by month in an autoregressive manner. Based on the frequency of the time series, additional time-related features, such as month of the year, are added as input variables. Additional categorical variables describing the meta information (product, region, business organization) are fed into the network via a trainable embedding layer.

The following diagram illustrates the Transformer architecture and the attention masking mechanism. Attention masking is applied throughout all the encoder and decoder layers, as highlighted in orange, to prevent OOD data from affecting the forecasts.

We mitigate the impact of OOD context windows by adding attention masks. The model is trained to apply very little attention to the COVID period that contains outliers via masking, and performs forecasting with masked information. The attention mask is applied throughout every layer of the decoder and encoder architecture. The masked window can be either specified manually or through an outlier detection algorithm. Additionally, when using a time window containing outliers as the training labels, the losses are not back-propagated. This attention masking-based method can be applied to handle disruptions and OOD cases brought by other rare events and improve the robustness of the forecasts.
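
The following PyTorch sketch illustrates the masking idea with the built-in nn.MultiheadAttention rather than the customized architecture used in the project; the sequence length, masked positions, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

seq_len, d_model = 18, 64             # e.g., an 18-month context window
x = torch.randn(1, seq_len, d_model)  # embedded monthly inputs (batch of 1)

# Boolean mask marking OOD time steps (say, April-May 2020 fall at
# positions 10-11 here); True means "do not attend to this position".
ood = torch.zeros(seq_len, dtype=torch.bool)
ood[10:12] = True

attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

# key_padding_mask hides the masked months from every query position.
out, weights = attn(x, x, x, key_padding_mask=ood.unsqueeze(0))
print(weights[0, :, 10:12].abs().max())  # ~0: no attention on masked months
```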

Model ensemble

Model ensembles often outperform single models for forecasting: they improve model generalizability and better handle time series with varying characteristics in periodicity and intermittency. We incorporate a series of model ensemble strategies to improve model performance and the robustness of forecasts. One common form of deep learning model ensemble is to aggregate results from model runs with different random weight initializations, or from different training epochs. We utilize this strategy to obtain forecasts for the Transformer model.

To further build an ensemble on top of different model architectures, such as Transformer, CNN-QR, and DeepAR+, we use a pan-model ensemble strategy that selects the top-k best-performing models for each time series based on the backtest results and averages their forecasts. Because backtest results can be exported directly from trained Forecast models, this strategy enables us to take advantage of turnkey services like Forecast alongside the improvements gained from custom models such as Transformer. Such an end-to-end model ensemble approach doesn’t require training a meta-learner or calculating time series features for model selection.
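
A small pandas sketch of this top-k selection and averaging step, with hypothetical backtest errors and forecasts (the column names and numbers are ours, not the framework's):

```python
import pandas as pd

# Hypothetical backtest errors: one MAAPE per (series, model).
backtest = pd.DataFrame({
    "series": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "model":  ["Transformer", "CNN-QR", "DeepAR+"] * 2,
    "maape":  [0.21, 0.18, 0.25, 0.30, 0.34, 0.28],
})

# Current forecasts from each model for the same series.
forecasts = pd.DataFrame({
    "series":   ["s1"] * 3 + ["s2"] * 3,
    "model":    ["Transformer", "CNN-QR", "DeepAR+"] * 2,
    "forecast": [100.0, 104.0, 96.0, 50.0, 47.0, 55.0],
})

def topk_ensemble(backtest, forecasts, k=2):
    # Pick the k models with the lowest backtest MAAPE per series...
    best = (backtest.sort_values("maape")
                    .groupby("series").head(k)[["series", "model"]])
    # ...then average their forecasts.
    picked = forecasts.merge(best, on=["series", "model"])
    return picked.groupby("series")["forecast"].mean()

print(topk_ensemble(backtest, forecasts))
# s1: mean(CNN-QR, Transformer) = 102.0
# s2: mean(DeepAR+, Transformer) = 52.5
```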

Hierarchical reconciliation

The framework is adaptive and can incorporate a wide range of techniques as postprocessing steps for hierarchical forecast reconciliation, including bottom-up (BU), top-down reconciliation with forecast proportions (TDFP), ordinary least squares (OLS), and weighted least squares (WLS). All the experimental results in this post are reported using top-down reconciliation with forecast proportions.
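
A minimal sketch of top-down reconciliation with forecast proportions for a single aggregation level (for deeper hierarchies, the same disaggregation is applied recursively); the numbers are illustrative:

```python
import numpy as np

# Independent base forecasts: one for the total, one per child node.
total_fc = 120.0
child_fc = np.array([50.0, 40.0, 20.0])  # children sum to 110, not 120

# Keep the top-level forecast and disaggregate it using the
# proportions implied by the children's own forecasts.
proportions = child_fc / child_fc.sum()
reconciled = total_fc * proportions

print(reconciled)        # [54.55 43.64 21.82]
print(reconciled.sum())  # 120.0, coherent with the top-level forecast
```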

Architecture overview

We developed an automated end-to-end workflow on AWS to generate revenue forecasts utilizing services including Forecast, SageMaker, Amazon Simple Storage Service (Amazon S3), AWS Lambda, AWS Step Functions, and AWS Cloud Development Kit (AWS CDK). The deployed solution provides individual time series forecasts through a REST API using Amazon API Gateway, by returning the results in predefined JSON format.

The following diagram illustrates the end-to-end forecasting workflow.

Key design considerations for the architecture are versatility, performance, and user-friendliness. The system should be sufficiently versatile to incorporate a diverse set of algorithms during development and deployment with minimal changes, and should be easily extensible when new algorithms are added in the future. The system should also add minimal overhead and support parallelized training for both Forecast and SageMaker to reduce training time and obtain the latest forecast faster. Finally, the system should be simple to use for experimentation purposes.

The end-to-end workflow sequentially runs through the following modules:

  1. A preprocessing module for data reformatting and transformation
  2. A model training module incorporating both the Forecast model and custom model on SageMaker (both are running in parallel)
  3. A postprocessing module supporting model ensemble, hierarchical reconciliation, metrics, and report generation

Step Functions organizes and orchestrates the workflow from end to end as a state machine. Each state machine run is configured with a JSON file containing all the necessary information, including the location of the historical revenue CSV files in Amazon S3, the forecast start time, and model hyperparameter settings. Asynchronous calls are created to parallelize model training in the state machine using Lambda functions. All historical data, config files, and forecast results, as well as intermediate results such as backtest results, are stored in Amazon S3. The REST API is built on top of Amazon S3 to provide a queryable interface for forecasting results. The system can be extended to incorporate new forecast models and supporting functions such as generating forecast visualization reports.
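
As an illustration, a run might be started with boto3 as sketched below; the configuration keys and the state machine ARN are hypothetical stand-ins for the inputs described above, not Bosch's actual schema.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical run configuration mirroring the inputs described above:
# data location, forecast start, and model hyperparameter settings.
run_config = {
    "input_data_s3": "s3://revenue-bucket/history/revenue.csv",
    "forecast_start": "2021-07-01",
    "forecast_horizon_months": 12,
    "models": ["CNN-QR", "DeepAR+", "Transformer"],
    "transformer": {
        "context_months": 18,
        "masked_months": ["2020-04", "2020-05"],
    },
}

sfn.start_execution(
    stateMachineArn="arn:aws:states:...",  # placeholder ARN
    input=json.dumps(run_config),
)
```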

Evaluation

In this section, we detail the experiment setup. Key components include the dataset, evaluation metrics, backtest windows, and model setup and training.

Dataset

To protect the financial privacy of Bosch while using a meaningful dataset, we used a synthetic dataset that has similar statistical characteristics to a real-world revenue dataset from one business unit at Bosch. The dataset contains 1,216 time series in total with revenue recorded in a monthly frequency, covering January 2016 to April 2022. The dataset is delivered with 877 time series at the most granular level (bottom time series), with a corresponding grouped time series structure represented as a summing matrix S. Each time series is associated with three static categorical attributes, which corresponds to product category, region, and organizational unit in the real dataset (anonymized in the synthetic data).

Evaluation metrics

We use median-Mean Arctangent Absolute Percentage Error (median-MAAPE) and weighted-MAAPE, the standard metrics used at Bosch, to evaluate model performance and perform comparative analyses. MAAPE addresses the shortcomings of the Mean Absolute Percentage Error (MAPE) metric commonly used in business contexts. Median-MAAPE gives an overview of model performance by computing the median of the MAAPEs calculated individually for each time series. Weighted-MAAPE reports a weighted combination of the individual MAAPEs, where the weights are the proportion of revenue for each time series relative to the aggregated revenue of the entire dataset. Weighted-MAAPE thus better reflects the downstream business impact of forecasting accuracy. Both metrics are reported on the entire dataset of 1,216 time series.
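
For concreteness, a small numpy sketch of MAAPE and the two aggregate variants, with toy numbers:

```python
import numpy as np

def maape(actual, forecast):
    """Mean Arctangent Absolute Percentage Error for one series.
    arctan bounds the per-point error, unlike MAPE, which explodes
    when actuals are near zero."""
    ape = np.abs((actual - forecast) / actual)
    return np.arctan(ape).mean()

# Two toy series with their actuals, forecasts, and revenues.
m1 = maape(np.array([100., 120.]), np.array([90., 130.]))
m2 = maape(np.array([10., 12.]), np.array([14., 9.]))
series_maape = np.array([m1, m2])
revenue = np.array([220., 22.])  # revenue per series

median_maape = np.median(series_maape)
weighted_maape = np.average(series_maape, weights=revenue / revenue.sum())
print(median_maape, weighted_maape)  # weighting favors the large series
```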

Backtest windows

We use rolling 12-month backtest windows to compare model performance. The following figure illustrates the backtest windows used in the experiments and highlights the corresponding data used for training and hyperparameter optimization (HPO). For backtest windows after the onset of COVID-19, results are affected by the OOD inputs from April to May 2020, based on what we observed in the revenue time series.

Model setup and training

For Transformer training, we used quantile loss and scaled each time series by its historical mean value before feeding it into the Transformer and computing the training loss. The final forecasts are rescaled back to calculate the accuracy metrics, using the MeanScaler implemented in GluonTS. We use a context window of monthly revenue data from the past 18 months, selected via HPO in the backtest window from July 2018 to June 2019. Additional metadata about each time series, in the form of static categorical variables, is fed into the model via an embedding layer before being passed to the transformer layers. We train the Transformer with five different random weight initializations and average the forecast results from the last three epochs of each run, averaging 15 models in total. The five model training runs can be parallelized to reduce training time. For the masked Transformer, we mark the months of April and May 2020 as outliers.
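
A minimal sketch of the mean-scaling step (similar in spirit to the GluonTS MeanScaler, though simplified here), with illustrative numbers:

```python
import numpy as np

def mean_scale(series):
    """Scale a series by the mean of its absolute historical values;
    the same factor rescales model outputs back to revenue units."""
    scale = np.abs(series).mean()
    return series / scale, scale

history = np.array([120., 95., 140., 110.])  # monthly revenue (toy)
scaled_history, scale = mean_scale(history)

model_output = np.array([1.1, 0.9])  # forecasts in scaled space
forecast = model_output * scale      # back to revenue units
print(forecast)                      # [127.875 104.625]
```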

For all Forecast model training, we enabled automatic HPO, which selects the model and training parameters based on a user-specified backtest period; we set this to the last 12 months of the data window used for training and HPO.

Experiment results

We trained masked and unmasked Transformers using the same set of hyperparameters and compared their performance for backtest windows immediately after the COVID-19 shock. In the masked Transformer, the two masked months are April and May 2020. The following table shows the results from a series of backtest periods with 12-month forecasting windows starting from June 2020. We can observe that the masked Transformer consistently outperforms the unmasked version.

We further evaluated the model ensemble strategy based on backtest results. In particular, we compare two cases: selecting only the top-performing model, and selecting the top two performing models and averaging their forecasts. We compare the performance of the base models and the ensemble models in the following figures. Notice that none of the neural forecasters consistently outperforms the others across the rolling backtest windows.

The following table shows that, on average, ensembling the top two models gives the best performance. CNN-QR provides the second-best result.

Conclusion

This post demonstrated how to build an end-to-end ML solution for large-scale forecasting problems combining Forecast and a custom model trained on SageMaker. Depending on your business needs and ML knowledge, you can use a fully managed service such as Forecast to offload the build, train, and deployment process of a forecasting model; build your custom model with specific tuning mechanisms with SageMaker; or perform model ensembling by combining the two services.

If you would like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.

References

Hyndman RJ, Athanasopoulos G. Forecasting: principles and practice. OTexts; 2018 May 8.

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in neural information processing systems. 2017;30.

Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence; 2021.


About the Authors

Goktug Cinar is a lead ML scientist and the technical lead of the ML and stats-based forecasting at Robert Bosch LLC and Bosch Center for Artificial Intelligence. He leads the research of the forecasting models, hierarchical consolidation, and model combination techniques as well as the software development team which scales these models and serves them as part of the internal end-to-end financial forecasting software.

Michael Binder is a product owner at Bosch Global Services, where he coordinates the development, deployment, and implementation of the company-wide predictive analytics application for large-scale, automated, data-driven forecasting of financial key figures.

Adrian Horvath is a Software Developer at Bosch Center for Artificial Intelligence, where he develops and maintains systems to create predictions based on various forecasting models.

Panpan Xu is a Senior Applied Scientist and Manager with the Amazon ML Solutions Lab at AWS. She is working on research and development of Machine Learning algorithms for high-impact customer applications in a variety of industrial verticals to accelerate their AI and cloud adoption. Her research interest includes model interpretability, causal analysis, human-in-the-loop AI and interactive data visualization.

Jasleen Grewal is an Applied Scientist at Amazon Web Services, where she works with AWS customers to solve real world problems using machine learning, with special focus on precision medicine and genomics. She has a strong background in bioinformatics, oncology, and clinical genomics. She is passionate about using AI/ML and cloud services to improve patient care.

Selvan Senthivel is a Senior ML Engineer with the Amazon ML Solutions Lab at AWS, focusing on helping customers on machine learning, deep learning problems, and end-to-end ML solutions. He was a founding engineering lead of Amazon Comprehend Medical and contributed to the design and architecture of multiple AWS AI services.

Ruilin Zhang is an SDE with the Amazon ML Solutions Lab at AWS. He helps customers adopt AWS AI services by building solutions to address common business problems.

Shane Rai is a Sr. ML Strategist with the Amazon ML Solutions Lab at AWS. He works with customers across a diverse spectrum of industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services.

Lin Lee Cheong is an Applied Science Manager with the Amazon ML Solutions Lab team at AWS. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.

Read More