Interpretable Deep Learning for Time Series Forecasting

Multi-horizon forecasting, i.e. predicting variables-of-interest at multiple future time steps, is a crucial challenge in time series machine learning. Most real-world datasets have a time component, and forecasting the future can unlock great value. For example, retailers can use future sales to optimize their supply chain and promotions, investment managers are interested in forecasting the future prices of financial assets to maximize their performance, and healthcare institutions can use the number of future patient admissions to have sufficient personnel and equipment.

Deep neural networks (DNNs) have increasingly been used in multi-horizon forecasting, demonstrating strong performance improvements over traditional time series models. While many models (e.g., DeepAR, MQRNN) have focused on variants of recurrent neural networks (RNNs), recent improvements, including Transformer-based models, have used attention-based layers to enhance the selection of relevant past time steps, going beyond the inductive bias of RNNs, i.e., the sequential, ordered processing of information. However, these often do not consider the different inputs commonly present in multi-horizon forecasting, and either assume that all exogenous inputs are known into the future or neglect important static covariates.

Multi-horizon forecasting with static covariates and various time-dependent inputs.

Additionally, conventional time series models are controlled by complex nonlinear interactions between many parameters, making it difficult to explain how such models arrive at their predictions. Unfortunately, common methods to explain the behavior of DNNs have limitations. For example, post-hoc methods (e.g., LIME and SHAP) do not consider the order of input features. Some attention-based models have been proposed with inherent interpretability for sequential data, primarily language or speech, but multi-horizon forecasting includes many different types of inputs, not just language or speech. Attention-based models can provide insights into relevant time steps, but they cannot distinguish the importance of different features at a given time step. New methods are needed both to tackle the heterogeneity of data in multi-horizon forecasting for high performance and to render these forecasts interpretable.

To that end, we announce “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting”, published in the International Journal of Forecasting, where we propose the Temporal Fusion Transformer (TFT), an attention-based DNN model for multi-horizon forecasting. TFT is designed to explicitly align the model with the general multi-horizon forecasting task for both superior accuracy and interpretability, which we demonstrate across various use cases.

Temporal Fusion Transformer
We design TFT to efficiently build feature representations for each input type (i.e., static, known, or observed inputs) for high forecasting performance. The major constituents of TFT (shown below) are:

  1. Gating mechanisms to skip over any unused components of the model (learned from the data), providing adaptive depth and network complexity to accommodate a wide range of datasets.
  2. Variable selection networks to select relevant input variables at each time step. While conventional DNNs may overfit to irrelevant features, attention-based variable selection can improve generalization by encouraging the model to anchor most of its learning capacity on the most salient features.
  3. Static covariate encoders integrate static features to control how temporal dynamics are modeled. Static features can have an important impact on forecasts, e.g., a store location could have different temporal dynamics for sales (e.g., a rural store may see higher weekend traffic, but a downtown store may see daily peaks after working hours).
  4. Temporal processing to learn both long- and short-term temporal relationships from both observed and known time-varying inputs. A sequence-to-sequence layer is employed for local processing, as its inductive bias for ordered information processing is beneficial, whereas long-term dependencies are captured using a novel interpretable multi-head attention block. This can cut the effective path length of information, i.e., any past time step with relevant information (e.g., sales from last year) can be focused on directly.
  5. Prediction intervals show quantile forecasts to determine the range of target values at each prediction horizon, which help users understand the distribution of the output, not just the point forecasts.
TFT inputs static metadata, time-varying past inputs and time-varying a priori known future inputs. Variable Selection is used for judicious selection of the most salient features based on the input. Gated information is added as a residual input, followed by normalization. Gated residual network (GRN) blocks enable efficient information flow with skip connections and gating layers. Time-dependent processing is based on LSTMs for local processing, and multi-head attention for integrating information from any time step.
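
To make the gating idea (component 1 above) concrete, below is a minimal PyTorch sketch of a gated residual network (GRN) block in the spirit of TFT. It is an illustrative approximation under stated assumptions, not the published architecture: the layer sizes, the omission of the optional static context input, and the class name are choices made only for this example.

import torch
import torch.nn as nn

class GatedResidualNetwork(nn.Module):
    """Minimal GRN sketch: nonlinear transform, gating, residual skip, layer norm."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.elu = nn.ELU()
        self.fc2 = nn.Linear(d_hidden, d_model)
        # gating layer: a sigmoid gate multiplied elementwise with a linear transform
        self.gate = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = self.fc2(self.elu(self.fc1(x)))
        gated = torch.sigmoid(self.gate(hidden)) * self.value(hidden)
        # a gate close to zero effectively skips the block, leaving only the residual path
        return self.norm(x + gated)

# usage: a batch of 32 feature vectors of size 16
grn = GatedResidualNetwork(d_model=16, d_hidden=16)
out = grn(torch.randn(32, 16))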

Forecasting Performance
We compare TFT to a wide range of models for multi-horizon forecasting, including various deep learning models with iterative methods (e.g., DeepAR, DeepSSM, ConvTrans) and direct methods (e.g., LSTM Seq2Seq, MQRNN), as well as traditional models such as ARIMA, ETS, and TRMF. Below is a comparison to a truncated list of models.

Model     Electricity      Traffic          Volatility      Retail
ARIMA     0.154 (+180%)    0.223 (+135%)    –               –
ETS       0.102 (+85%)     0.236 (+148%)    –               –
DeepAR    0.075 (+36%)     0.161 (+69%)     0.050 (+28%)    0.574 (+62%)
Seq2Seq   0.067 (+22%)     0.105 (+11%)     0.042 (+7%)     0.411 (+16%)
MQRNN     0.077 (+40%)     0.117 (+23%)     0.042 (+7%)     0.379 (+7%)
TFT       0.055            0.095            0.039           0.354
P50 quantile losses (lower is better) for TFT vs. alternative models; "–" indicates the model was not evaluated on that dataset.

As shown above, TFT outperforms all benchmarks over a variety of datasets. This applies to both point forecasts and uncertainty estimates, with TFT yielding an average of 7% lower P50 loss and 9% lower P90 loss compared to the next best model.
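
The P50 and P90 numbers above are quantile (pinball) losses. As a reference point, here is a minimal NumPy sketch of the core quantile loss for a forecast at quantile q; note that the published benchmark reports a normalized version of this quantity, which this sketch omits.

import numpy as np

def quantile_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Pinball loss: under-prediction is penalized by q, over-prediction by (1 - q)."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# toy example: actuals vs. median (P50) and 90th-percentile (P90) forecasts
y_true = np.array([10.0, 12.0, 9.0])
p50 = np.array([9.5, 12.5, 9.0])
p90 = np.array([12.0, 14.0, 11.0])
print(quantile_loss(y_true, p50, 0.5), quantile_loss(y_true, p90, 0.9))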

Interpretability Use Cases
We demonstrate how TFT’s design allows for analysis of its individual components for enhanced interpretability with three use cases.

  • Variable Importance
    One can observe how different variables impact retail sales by observing their model weights. For example, the largest weights for static variables were the specific store and item, while the largest weights for future variables were promotion period and national holiday (shown below).

    Variable importance for the retail dataset. The 10th, 50th, and 90th percentiles of the variable selection weights are shown, with values larger than 0.1 in bold purple.
  • Persistent Temporal Patterns
    Visualizing persistent temporal patterns can help in understanding the time-dependent relationships present in a given dataset. We identify persistent patterns by measuring the contributions of features at fixed lags in the past on forecasts at various horizons. As shown below, attention weights reveal the most important past time steps on which TFT bases its decisions.

    Persistent temporal patterns for the traffic dataset (𝛕 denotes the forecasting horizon) for the 10%, 50% and 90% quantile levels. Clear periodicity is observed with peaks being separated by ~24 hours, i.e., the model attends the most to the time steps that are at the same time of the day from past days, which is aligned with the expected daily traffic patterns.

    The above shows the attention weight patterns across time, indicating how TFT learns persistent temporal patterns without any hard-coding. Such capability can help build trust with users because the output confirms expected known patterns. Model developers can also use these towards model improvements, e.g., via specific feature engineering or data collection.

  • Identifying Significant Events
    Identifying sudden changes can be useful, as temporary shifts can occur due to the presence of significant events. TFT measures the distance between the attention pattern at each point and the average pattern to identify significant deviations. The figures below show that TFT can alter its attention between events — placing equal attention across past inputs when volatility is low, while attending more to sharp trend changes during high volatility periods.

    Event identification for S&P 500 realized volatility from 2002 through 2014.

    Significant deviations in attention patterns can be observed above around periods of high volatility, corresponding to the peaks in dist(t), the distance between attention patterns (red line). We use a threshold to denote significant events, as highlighted in purple.

    Focusing on periods around the 2008 financial crisis, the bottom plot below zooms in on the middle of the significant event (evident from the increased attention on sharp trend changes), compared with the normal regime in the top plot (where attention is spread evenly over low-volatility periods).

    Event identification for S&P 500 realized volatility, a zoom of the above on a period from 2004 to 2005.

    Event identification for S&P 500 realized volatility, a zoom of the above on a period from 2008 to 2009.

Real-World Impact
Finally, TFT has been used to help retail and logistics companies with demand forecasting by both improving forecasting accuracy and providing interpretability capabilities.

Additionally, TFT has potential applications for climate-related challenges: for example, reducing greenhouse gas emissions by balancing electricity supply and demand in real time, and improving the accuracy and interpretability of rainfall forecasting results.

Conclusion
We present a novel attention-based model for high-performance multi-horizon forecasting. In addition to improved performance across a range of datasets, TFT also contains specialized components for inherent interpretability — i.e., variable selection networks and interpretable multi-head attention. With three interpretability use-cases, we also demonstrate how these components can be used to extract insights on feature importance and temporal dynamics.

Acknowledgements
We gratefully acknowledge contributions of Bryan Lim, Nicolas Loeff, Minho Jin, Yaguang Li, and Andrew Moore.

Read More

Forrester Report: ‘NVIDIA GPUs Are Synonymous With AI Infrastructure’

In a new evaluation of enterprise AI infrastructure providers, Forrester Research Monday recognized NVIDIA and three of its key partners — Amazon Web Services, Google and Microsoft — as the leaders in AI infrastructure.

The “Forrester Wave: AI Infrastructure, Q4 2021” report states that “NVIDIA’s DNA is in every other AI infrastructure solution we evaluated. It’s an understatement to say that NVIDIA GPUs are synonymous with AI infrastructure.”

The report recognizes NVIDIA DGX systems and NVIDIA-accelerated offerings not only from AWS, Google and Microsoft, but also from Dell, Exxact, Hewlett Packard Enterprise, IBM, Inspur, Lambda, Run: AI, Spell and Supermicro.

NVIDIA software, including NVIDIA AI Enterprise, brings full-stack AI development to the broad ecosystem of providers recognized in the Forrester report.

The news comes as multitrillion-dollar industries — from healthcare to finance to retail — are racing to put AI to work and relying on NVIDIA AI infrastructure, including NVIDIA DGX Foundry to accelerate their business transformations.

‘Rocking the World’

“AI is rocking the world,” Forrester’s team wrote.

The firm noted that 74 percent of surveyed global data and analytics decision-makers whose firm is implementing or expanding its use of AI said that adoption has had a positive impact.

“It’s rapidly gone from ‘if’ to ‘when’ to ‘now,’” Forrester reported.

The report from a leading independent research firm signals the arrival of enterprise AI in the mainstream and provides support for even more widespread adoption.

NVIDIA Positioned as a Leader for Its DGX Systems

And that requires flexible, enterprise-grade infrastructure.

“AI platform software that many of these enterprises rely on is all well and good, but it needs hearty infrastructure — compute, storage, networking — to keep AI teams working, not waiting,” Forrester’s team wrote.

NVIDIA’s DGX systems, engineered for enterprise AI workloads, scored the highest in the market presence category and received among the highest scores in the strategy category.

“Breakthroughs in deep learning around 2012 brought AI into focus, but only NVIDIA had the strategy, vision, and roadmap to invest in supporting these now mainstream AI workloads,” the Forrester report noted.

The Forrester report cited NVIDIA’s strengths in “architectural components, throughput, latency, and overall product strategy.”

“The vendor’s sweet spot for its DGX systems are for customers that want a complete system that is engineered by NVIDIA to include its latest component technology for AI workloads,” the Forrester report noted.

NVIDIA AI Available Across the Industry

With NVIDIA AI, customers can choose from any cloud provider and server maker and select from systems located in their data centers or colocation facilities, in the cloud and even on the desktop.

NVIDIA AI infrastructure supports every major cloud service provider. It’s available on the desktop from every major server and system vendor. And NVIDIA technologies can be deployed at the edge by enterprises in autonomous vehicles, robots and embedded systems.

As a result, every vendor mentioned in Forrester’s report is an NVIDIA customer, partner or NVIDIA Inception member, underscoring NVIDIA’s unique role in the industry as a full-stack AI company providing semiconductors, software and systems.

“Reference customers appreciate the vendor’s thought leadership in AI, its frameworks designed to run on NVIDIA GPUs, and having first access to chips coming out of NVIDIA R&D,” the Forrester report stated.

With widespread reports of positive results, universal support for NVIDIA enterprise AI and recognition from one of the most widely respected industry analyst firms, enterprise AI adoption is poised to accelerate.

Enterprises can experience NVIDIA-accelerated AI with NVIDIA AI Enterprise and NVIDIA DGX Foundry through the NVIDIA LaunchPad program, available free of charge in nine regions worldwide.


Read More

Blender 3.0 Release Accelerated by NVIDIA RTX GPUs, Adds USD Support for Omniverse

‘Tis the season for all content creators, especially 3D artists, this month on NVIDIA Studio.

Blender, the world’s most popular open-source 3D creative application, launched a highly anticipated 3.0 release, delivering extraordinary performance gains powered by NVIDIA RTX GPUs, with added Universal Scene Description (USD) support for NVIDIA Omniverse.

Faster 3D creative workflows are made possible by a significant upgrade to Isotropix Clarisse’s redesigned renderer Angie, and Cyberlink’s PowerDirector integration with NVIDIA Broadcast technology — all backed by the December Studio Driver, ready for download.

Plus, NVIDIA Studio and Adobe continue to share the love this holiday season, offering three free months of Adobe Creative Cloud with the purchase of a select NVIDIA Studio laptop or desktop.

And, in the spirit of giving, check out the NVIDIA Studio Facebook, Twitter and Instagram channels all month long for a chance to win a new NVIDIA Studio laptop.

Blender and RTX Render Better Together

Blender 3.0 marks the beginning of a new era for 3D content creation, with new features including a more responsive viewport, reduced shadow artifacts, an asset library to access owned and borrowed assets quickly, and additional customization options.

Blender Cycles renderer has been completely overhauled, maximizing NVIDIA RTX GPU RT Cores for OptiX ray tracing and Tensor Cores for OptiX AI denoising.

 

With the GeForce RTX 3080 Laptop GPU, this enables rendering nearly 12x faster than with a MacBook Pro M1 Max and 15x faster than with a CPU alone.

3D artists can get instant feedback when working in Cycles as the viewport renderer. OptiX support has also been added to model rendering with materials, delivering stunning performance.

Blender 3.0 also adds USD file importing. USD files are the foundation of NVIDIA Omniverse, allowing scenes to be edited by multiple creative applications simultaneously.

The new importer converts USD geometry, lights, cameras, time-sampled animation and preview surface materials to their Blender representations — a massive leap in the Omniverse ecosystem.

Omniverse users can access an alpha build of Blender 3.1, unlocking additional features such as instancing and added support for USD material exporting, which includes basic NVIDIA Material Definition Language conversions.

To download the Blender 3.1 alpha, open the Omniverse launcher and look for the Blender 3.1 build.

Getting started? Check out the Omniverse Showroom app, which gives beginners a look into the foundational technologies of Omniverse, with new content releases over time.

A December (Studio Driver) to Remember

Isotropix Clarisse is a fully interactive computer graphics toolset specialized for set-dressing, look development, lighting and rendering.

The new release, Clarisse 5.5 with Angie, a redesigned renderer, significantly speeds up renders and enables interactive rendering of huge scenes with ray-traced acceleration, exclusively for NVIDIA RTX GPUs.

Clarisse 5.5 with Angie in beta is available for download today.

Cyberlink’s video editing creative app, PowerDirector, has added NVIDIA Broadcast integration and its advanced AI tools, including video denoising to remove grainy noise from video shot in low-light conditions; audio denoising to remove unwanted background noise; and audio dereverb to get rid of room echoes.

Now, video and audio enhancements can be done in post-production, adding more flexibility and power to a video editor’s toolbox.

These incredible creator app upgrades — as well as optimizations in Blender 3.0, NVIDIA Omniverse, Isotropix Clarisse, Cyberlink PowerDirector, Blackmagic’s DaVinci Resolve, BorisFX Sapphire, JangaFX Embergen and more — are all supported by the December Studio Driver (472.84), available for download now.

Join the Studio Holiday Fun

NVIDIA Studio is sharing the love in December by way of featurette videos, created by some of the most popular content creators, who discuss their journeys into content creation and share tips on how to get started.

These one-of-a-kind insights into various creative fields will feature giveaways that include the same model of NVIDIA Studio laptops the creators use.

Visit the NVIDIA Studio Facebook, Twitter and Instagram channels to learn more and enter.

Get Studio Support During Winter Break

For new ways to create, including exclusive step-by-step tutorials from industry-leading artists, inspiring community showcases and more, visit the NVIDIA Studio YouTube channel.

This year, we’ve worked with 52 leading creative professionals to launch 137 videos, which feature 75+ RTX-accelerated apps.

The channel’s videos have over 3 million minutes watched and 14,000+ shares, so thank you for your continued support and be sure to subscribe for new weekly videos.

Finally, check out Adobe’s The Great Shoecase contest and customize a model shoe by applying and painting materials in Substance 3D Painter. Win epic prizes including an NVIDIA Studio laptop or NVIDIA SHIELD TV. Submissions end on Thursday, Dec. 16.

Stay up to date on all things Studio by subscribing to the NVIDIA Studio newsletter.


Read More

Hierarchical Forecasting using Amazon SageMaker

Time series forecasting is a common problem in machine learning (ML) and statistics. Some common day-to-day use cases of time series forecasting involve predicting product sales, item demand, component supply, and service tickets, all as a function of time. More often than not, time series data follows a hierarchical aggregation structure. For example, in retail, weekly sales for a Stock Keeping Unit (SKU) at a store can roll up to different geographical hierarchies at the city, state, or country level. In these cases, we must make sure that the sales estimates are in agreement when rolled up to a higher level. In these scenarios, Hierarchical Forecasting is used. It is the process of generating coherent forecasts (or reconciling incoherent forecasts) that allows each time series to be forecast individually while still preserving the relationships within the hierarchy. Hierarchical time series often arise due to various smaller geographies combining to form a larger one. For example, the following figure shows the case of a hierarchical structure in time series for store sales in the state of Texas. Individual store sales are depicted in the lowest level (level 2) of the tree, followed by sales aggregated on the city level (level 1), and finally all of the city sales aggregated on the state level (level 0).

In this post, we will first review the concept of hierarchical forecasting, including different reconciliation approaches. Then, we will take an example of demand forecasting on synthetic retail data to show you how to train and tune multiple hierarchical time series models across algorithm and hyperparameter combinations, using the scikit-hts toolkit on Amazon SageMaker, the most comprehensive and fully managed ML service. Amazon SageMaker lets data scientists and developers quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment.

The forecasts at all of the levels must be coherent. The forecast for Texas in the previous figure should break down accurately into forecasts for the cities, and the forecasts for cities should also break down accurately for forecasts on the individual store level. There are various approaches to combining and breaking forecasts at different levels. The most common of these methods, as discussed in detail in Hyndman and Athanasopoulos, are as follows:

  • Bottom-Up:

In this method, the forecasts are carried out at the bottom-most level of the hierarchy, and then summed going up. For example, in the preceding figure, by using the bottom-up method, the time series’ for the individual stores (level 2) are used to build forecasting models. The outputs of individual models are then summed to generate the forecast for the cities. For example, forecasts for Store 1 and Store 2 are summed to get the forecasts for Austin. Finally, forecasts for all of the cities are summed to generate the forecasts for Texas.

  • Top-down:

In top-down approaches, the forecast is first generated for the top level (Texas in the preceding figure) and then disaggregated down the hierarchy. Disaggregate proportions are used in conjunction with the top level forecast to generate forecasts at the bottom level of the hierarchy. There are multiple methods to generate these disaggregate proportions, such as average historical proportions, proportions of the historical averages, and forecast proportions. These methods are briefly described in the following section. For a detailed discussion, please see Hyndman and Athanasopoulos.

    • Average historical proportions:

In this method, the bottom level series is generated by using the average of the historical proportions of the series at the bottom level (stores in the figure preceding), relative to the series at the top level (Texas in the preceding figure).

    • Proportions of the historical averages:

The average historical value of the series at the bottom level (stores in the preceding figure) relative to the average historical value of the series at the top level (Texas in the preceding figure) is used as the disaggregation proportion.

While both of the preceding top-down approaches are simple to implement and use, they are generally very accurate for the top level and are less accurate for lower levels. This is due to the loss of information and the inability to take advantage of characteristics of individual time series at lower levels. Furthermore, these methods also fail to account for how the historical proportions may change over time.

    • Forecast proportions:

In this method, instead of historical data, proportions based on forecasts are used for disaggregation. Forecasts are first generated for each individual series. These forecasts are not used directly, since they are not coherent at different levels of hierarchy. At each level, the proportions of these initial forecasts to that of the aggregate of all initial forecasts at the level are calculated. Then, these forecast proportions are used to disaggregate the top level forecast into individual forecasts at various levels. This method does not rely on outdated historical proportions and uses the current data to calculate the appropriate proportions. Due to this reason, forecast proportions often result in much more accurate forecasts as compared to the average historical proportions and proportions of the historical averages top-down approaches.

  • Middle-out:

In this method, forecasts are first generated for all of the series at a “middle level” (for example, Austin, Dallas, Houston, and San Antonio in the preceding figure). From these forecasts, the bottom-up approach is used to generate the aggregated forecasts for the levels above this middle level. For the levels below the middle level, a top-down approach is used.

  • Ordinary least squares (OLS):

In OLS, a least squares estimator is used to compute the reconciliation weights needed for generating coherent forecasts.
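
To make the two approaches used later in this post concrete, here is a minimal, self-contained pandas sketch (not the scikit-hts implementation) of bottom-up summation and top-down disaggregation with average historical proportions, on a toy two-level hierarchy; the store and city names are made up for illustration.

import numpy as np
import pandas as pd

# toy history: daily sales for two stores that roll up to one city
history = pd.DataFrame({
    "store_1": [10.0, 12.0, 11.0, 13.0],
    "store_2": [20.0, 18.0, 22.0, 21.0],
})
history["city"] = history["store_1"] + history["store_2"]

# independent base forecasts for the next two days (e.g., from per-series models)
base_forecasts = pd.DataFrame({
    "store_1": [12.5, 13.0],
    "store_2": [21.0, 22.0],
    "city":    [35.0, 36.0],
})

# bottom-up: forecast the bottom level and sum upward to obtain the city forecast
bottom_up_city = base_forecasts[["store_1", "store_2"]].sum(axis=1)

# top-down (average historical proportions): split the city forecast using the
# average share each store historically contributed to the city total
props = history[["store_1", "store_2"]].div(history["city"], axis=0).mean()
top_down_stores = pd.DataFrame(np.outer(base_forecasts["city"], props), columns=props.index)

print(bottom_up_city)
print(top_down_stores)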

Solution overview

In this post, we take the example of demand forecasting on synthetic retail data to fine-tune multiple hierarchical time series models across algorithms and hyperparameter combinations, using the scikit-hts toolkit on Amazon SageMaker.

First, we will show you how to set up scikit-hts on SageMaker, train multiple models using the SKLearn estimator, and track and organize experiments using SageMaker Experiments. We will walk you through the following steps:

  1. Prerequisites
  2. Prepare Time Series Data
  3. Setup the scikit-hts training script
  4. Setup the SKLearn Estimator
  5. Setup Amazon SageMaker Experiment and Trials
  6. Evaluate metrics and select a winning candidate
  7. Run time series forecasts
  8. Visualize the forecasts:
    • Visualization at Region Level
    • Visualization at State Level

Prerequisites

The following is needed to follow along with this post and run the associated code:

Prepare Time Series Data

For this post, we will use synthetic retail clothing data, perform feature engineering steps to clean it, and then convert the data into the hierarchical representation required by the scikit-hts package.

The retail clothing data is the time series daily quantity of sales data for six item categories: men’s clothing, men’s shoes, women’s clothing, women’s shoes, kids’ clothing, and kids’ shoes. The date range for the data is 11/25/1997 through 7/28/2009. Each row of the data corresponds to the quantity of sales for an item category in a state (total of 18 US states) for a specific date in the date range. Furthermore, the 18 states are also categorized into five US regions. The data is synthetically generated using repeatable patterns (for seasonality) with random noise added for each day.

First, let’s read the data into a Pandas DataFrame.

import pandas as pd

# read the raw synthetic retail data, parsing the date column
df_raw = pd.read_csv("retail-usa-clothing.csv",
                          parse_dates=True,
                          header=0,
                          names=['date', 'state',
                                   'item', 'quantity', 'region',
                                   'country']
                    )


Define the S3 bucket and folder locations to store the test and training data. This should be within the same region as SageMaker Studio.

Now, let’s divide the raw data into train and test samples, and save them in their respective S3 folder locations using the Pandas DataFrame query function. We can check the first few entries of the train and test dataset. Both datasets should have the same fields, as in the following code:

df_train = df_raw.query(f'date <= "2009-04-29"').copy()
df_train.to_csv("train.csv")
s3_client.upload_file("train.csv", bucket, pref+"/train.csv")

df_test = df_raw.query(f'date > "2009-04-29"').copy()
df_test.to_csv("test.csv")
s3_client.upload_file("test.csv", bucket, pref+"/test.csv")

Convert data into Hierarchical Representation

scikit-hts requires that each column in our DataFrame be a time series of its own, with columns for all hierarchy levels. To achieve this, we have created a dataset_prep.py script, which performs the following steps:

  1. Transform the dataset into a column-oriented one.
  2. Create the hierarchy representation as a dictionary.

For a complete description of how this is done under the hood, and for a sense of what the API accepts, see the scikit-hts’ docs.
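
As a rough, hedged sketch of what such a preparation step can look like (this is not the actual dataset_prep.py from the repo; the function and column names are assumptions based on the data description above), the transformation amounts to pivoting the long-format data into one column per node and building a dictionary that maps each parent to its children:

import pandas as pd

def prepare_hierarchy(df: pd.DataFrame):
    """Pivot long-format sales into one column per hierarchy node and build the hierarchy dict."""
    df = df.copy()
    df["region_state"] = df["region"] + "_" + df["state"]

    # bottom level: one column per region_state; middle level: one column per region
    bottom = df.pivot_table(index="date", columns="region_state", values="quantity", aggfunc="sum")
    region = df.pivot_table(index="date", columns="region", values="quantity", aggfunc="sum")

    wide = pd.concat([bottom, region], axis=1)
    wide["total"] = region.sum(axis=1)  # top level: grand total across all regions

    # hierarchy: total -> regions, and each region -> its region_state children
    hierarchy = {"total": region.columns.tolist()}
    for reg in region.columns:
        hierarchy[reg] = [c for c in bottom.columns if c.startswith(reg + "_")]
    return hierarchy, wide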

Once we have created the hierarchy representation as a dictionary, we can visualize the data as a tree structure:

from hts.hierarchy import HierarchyTree
ht = HierarchyTree.from_nodes(nodes=train_hierarchy, df=train_product_bottom_level)
- total
   |- Mid-Alantic
   |  |- Mid-Alantic_NewJersey
   |  |- Mid-Alantic_NewYork
   |  - Mid-Alantic_Pennsylvania
   |- SouthCentral
   |  |- SouthCentral_Alabama
   |  |- SouthCentral_Kentucky
   |  |- SouthCentral_Mississippi
   |  - SouthCentral_Tennessee
   |- Pacific
   |  |- Pacific_Alaska
   |  |- Pacific_California
   |  |- Pacific_Hawaii
   |  - Pacific_Oregon
   |- EastNorthCentral
   |  |- EastNorthCentral_Illinois
   |  |- EastNorthCentral_Indiana
   |  - EastNorthCentral_Ohio
   - NewEngland
      |- NewEngland_Connecticut
      |- NewEngland_Maine
      |- NewEngland_RhodeIsland
      - NewEngland_Vermont

Setup the scikit-hts training script

We use a Python entry script to import the necessary SKLearn libraries, set up the scikit-hts estimators using the model packages for our algorithms of interest, and pass in our algorithm and hyper-parameter preferences from the SKLearn estimator that we set up in the notebook. In this post and associated code, we show the implementation and results for the bottom-up approach and the top-down approach with the average historical proportions division method. Note that the user can change these to select different hierarchical methods from the package. In addition, for the hyperparameters, we used additive and multiplicative seasonality with both the bottom-up and top-down approaches. The script uses the train and test data files that we uploaded to Amazon S3 to create the corresponding SKLearn datasets for training and evaluation. When training is complete, the script runs an evaluation to generate metrics, which we use to choose a winning model. For further analysis, the metrics are also available via the SageMaker trial component analytics (discussed later in this post). Then, the model is serialized for storage and future retrieval.

For more details, refer to the entry script “train.py” that is available in the GitHub repo. From the accompanying notebook, you can also run the cell in Step 3 to review the script. The following code shows the train function calling HTSRegressor with the Prophet algorithm along with the hierarchical method and seasonality mode:

def train(bucket, seasonality_mode, algo, daily_seasonality, changepoint_prior_scale, revision_method):
    print('**************** Training Script ***********************')
    # create train dataset
    df = pd.read_csv(filepath_or_buffer=os.environ['SM_CHANNEL_TRAIN'] + "/train.csv", header=0, index_col=0)
    hierarchy, data, region_states = prepare_data(df)
    regions = df["region"].unique().tolist()
    # create test dataset
    df_test = pd.read_csv(filepath_or_buffer=os.environ['SM_CHANNEL_TEST'] + "/test.csv", header=0, index_col=0)
    test_hierarchy, test_df, region_states = prepare_data(df_test)
    print("************** Create Root Edges *********************")
    print(hierarchy)
    print('*************** Data Type for Hierarchy *************', type(hierarchy))
    # determine estimators##################################
    if algo == "Prophet":
        print('************** Started Training Prophet Model ****************')
        estimator = HTSRegressor(model='prophet', 
                                 revision_method=revision_method, 
                                 n_jobs=4, 
                                 daily_seasonality=daily_seasonality, 
                                 changepoint_prior_scale = changepoint_prior_scale,
                                 seasonality_mode=seasonality_mode,
                                )
        # train the model
        print("************** Calling fit method ************************")
        model = estimator.fit(data, hierarchy)
        print("Prophet training is complete SUCCESS")
        
        # evaluate the model on test data
        evaluate(model, test_df, regions, region_states)
    
    ###################################################
 
    mainpref = "scikit-hts/models/"
    prefix = mainpref + "/"
    print('************************ Saving Model *************************')
    joblib.dump(estimator, os.path.join(os.environ['SM_MODEL_DIR'], "model.joblib"))
    print('************************ Model Saved Successfully *************************')

    return model

Setup Amazon SageMaker Experiment and Trials

SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. SageMaker Experiments is integrated with SageMaker Studio. This provides a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best-performing models. SageMaker Experiments comes with its own Experiments SDK, which makes the analytics capabilities easily accessible in SageMaker notebooks. Because SageMaker Experiments enables tracking of all the steps and artifacts that go into creating a model, you can quickly revisit the origins of a model when you’re troubleshooting issues in production or auditing your models for compliance verifications. You can create your experiment with the following code:

from datetime import datetime
from smexperiments.experiment import Experiment

#name of experiment
timestep = datetime.now()
timestep = timestep.strftime("%d-%m-%Y-%H-%M-%S")
experiment_name = "hierarchical-forecast-models-" + timestep

#create experiment
Experiment.create(
experiment_name=experiment_name,
description="Hierarchical Timeseries models",
sagemaker_boto_client=sagemaker_boto_client)

For each job, we define a new Trial component within that experiment:

from smexperiments.trial import Trial
trial = Trial.create(
experiment_name=experiment_name,
sagemaker_boto_client=sagemaker_boto_client
)
print(trial)

Next, we define an experiment config, which is a dictionary that we pass into the fit() method of SKLearn estimator later on. This makes sure that the training job is associated with that experiment and trial. For the full code block for this step, refer to the accompanying notebook. In the notebook, we use the bottom-up and top-down (with average historical proportions) approaches, along with additive and multiplicative seasonality as the seasonality hyperparameter values. This lets us train four different models. The code can be modified easily to use the rest of the hierarchical forecasting approaches discussed in the previous sections, since they are also implemented in scikit-hts package.

Creating the SKLearn estimator

You can run SKLearn training scripts on SageMaker’s fully managed training environment by creating an SKLearn estimator. Let’s set up the actual training runs with a combination of parameters and encapsulate the training jobs within SageMaker experiments.

We will use scikit-hts to fit the FBProphet model to our data and compare the results.

  • FBProphet
    • daily_seasonality: By default, daily seasonality is set to False, so we explicitly change it to True.
    • changepoint_prior_scale: If the trend changes are being overfit (too much flexibility) or underfit (not enough flexibility), you can adjust the strength of the sparse prior using the input argument changepoint_prior_scale. By default, this parameter is set to 0.05. Increasing it will make the trend more flexible.

See the following code:

import sagemaker
from sagemaker.sklearn import SKLearn

for idx, row in df_hps_combo.iterrows():
    trial = Trial.create(
        experiment_name=experiment_name,
        sagemaker_boto_client=sagemaker_boto_client
    )

    experiment_config = { "ExperimentName": experiment_name, 
                      "TrialName":  trial.trial_name,
                      "TrialComponentDisplayName": "Training"}
    
    sklearn_estimator = SKLearn('train.py',
                                source_dir='code',
                                instance_type='ml.m4.xlarge',
                                framework_version='0.23-1',
                                role=sagemaker.get_execution_role(),
                                debugger_hook_config=False,
                                hyperparameters = {'bucket': bucket,
                                                   'algo': "Prophet", 
                                                   'daily_seasonality': True,
                                                   'changepoint_prior_scale': 0.5,
                                                   'seasonality_mode': row['seasonality_mode'],
                                                   'revision_method' : row['revision_method']
                                                  },
                                metric_definitions = metric_definitions,
                               )

After specifying our estimator with all of the necessary hyperparameters, we can train it using our training dataset. We train it by invoking the fit() method of the SKLearn estimator. We pass the location of the train and test data, as well as the experiment configuration. The training algorithm returns a fitted model that we can use to construct forecasts. See the following code:

sklearn_estimator.fit({'train': s3_train_channel, "test": s3_test_channel},
                     experiment_config=experiment_config, wait=False)

We start four training jobs in this case corresponding to the combinations of two hierarchical forecasting methods and two seasonality modes. These jobs are run in parallel using SageMaker training. The average runtime for these training jobs in this example was approximately 450 seconds on ml.m4.xlarge instances. You can review the job parameters and metrics from the trial component view in SageMaker Studio (see the following screenshot):

Evaluate metrics and select a winning candidate

Amazon SageMaker Studio provides an experiments browser that you can use to view the lists of experiments, trials, and trial components. You can choose one of these entities to view detailed information about the entity, or choose multiple entities for comparison. For more details, refer to the documentation. Once the training jobs are running, we can use the experiment view in Studio (see the following screenshot) or the ExperimentAnalytics module to track the status of our training jobs and their metrics.

In the training script, we used scikit-learn metrics to calculate the mean squared error (MSE) and stored it in the experiment. We can access the recorded metrics via the ExperimentAnalytics module and convert them to a Pandas DataFrame. The training job with the lowest MSE is the winner.

from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(experiment_name=experiment_name)
tc_df = trial_component_analytics.dataframe()

# collect the final MSE metric and model artifact location for each training job
total_mse = []
model_url = []
for name in tc_df['sagemaker_job_name']:
    description = sagemaker_boto_client.describe_training_job(TrainingJobName=name[1:-1])
    total_mse.append(description['FinalMetricDataList'][0]['Value'])
    model_url.append(description['ModelArtifacts']['S3ModelArtifacts'])
tc_df['total_mse'] = total_mse
new_df = tc_df[['sagemaker_job_name', 'algo', 'changepoint_prior_scale', 'revision_method', 'total_mse', 'seasonality_mode']]
mse_min = new_df['total_mse'].min()
df_winner = new_df[new_df['total_mse'] == mse_min]

Let’s select the winner model and download it for running forecasts:

for name in df_winner['sagemaker_job_name']:
    # look up the S3 location of the winning model artifact and download it locally
    model_dir = sagemaker_boto_client.describe_training_job(TrainingJobName=name[1:-1])['ModelArtifacts']['S3ModelArtifacts']
    key = model_dir.split('s3://{}/'.format(bucket))
    s3_client.download_file(bucket, key[1], 'model.tar.gz')

Run time series forecasts

Now, we will load the model and make forecasts 90 days into the future:

import tarfile
import joblib

# the training job saved the fitted estimator as model.joblib inside model.tar.gz,
# so extract the downloaded archive before loading
with tarfile.open('model.tar.gz') as tar:
    tar.extractall()

def model_fn(model_dir):
    clf = joblib.load(model_dir)
    return clf

model = model_fn('model.joblib')
predictions = model.predict(steps_ahead=90)

Visualize the forecasts

Let’s visualize the model results and fitted values for all of the states:

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
def plot_results(cols, axes, preds):
    axes = np.hstack(axes)
    for ax, col in zip(axes, cols):
        preds[col].plot(ax=ax, label="Predicted")
        train_product_bottom_level[col].plot(ax=ax, label="Observed")
        ax.legend()
        ax.set_title(col)
        ax.set_xlabel("Date")
        ax.set_ylabel("Quantity")  

Visualization at Region Level
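
As a hedged sketch of how the region-level plots could be produced with the plot_results helper above (the region column names are taken from the hierarchy tree printed earlier; predictions, train_product_bottom_level, and plt are assumed to be in scope from the previous steps):

# five regions on a 3x2 grid; the last axis stays empty
region_cols = ["Mid-Alantic", "SouthCentral", "Pacific", "EastNorthCentral", "NewEngland"]

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(16, 12))
plot_results(region_cols, axes, predictions)
plt.tight_layout()
plt.show()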

Visualization at State Level

The following screenshot is for some of the states. For a full list of state visualizations, execute the visualization section of the notebook.

Clean up

Make sure to shut down the Studio notebook. You can reach the Running Terminals and Kernels pane from the icon on the left side of Amazon SageMaker Studio. The pane consists of four sections, and each section lists all of the resources of that type. You can shut down each resource individually or shut down all of the resources in a section at the same time.

Conclusion

Hierarchical forecasting is important where time series data can be grouped or aggregated at various levels in a hierarchical fashion. For accurate forecasting/prediction at various levels of hierarchy, methods that generate coherent forecasts at these different levels are needed. In this post, we demonstrated how we can leverage Amazon SageMaker’s training capabilities to carry out hierarchical forecasting. We used synthetic retail data and showed how to train hierarchical forecasting models using the scikit-hts package. We used the FBProphet model along with bottom-up and top-down (average historical proportions) hierarchical aggregation and disaggregation methods (see code). Furthermore, we used SageMaker Experiments to train multiple models and picked the best model out of the four trained models. While we only demonstrated this approach on a synthetic retail dataset, the code provided can easily be used with any time-series dataset that exhibits a similar hierarchical structure.

References


About the Authors

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges with AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge. She has created her own lab with a self-driving kit and prototype manufacturing production line, where she spends a lot of her free time.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from The University of Texas at Austin and a MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization and related domains. Based in Dallas, Texas, he and his family love to travel and make long road trips.

Neha Gupta is a Solutions Architect at AWS and has 16 years of experience as a Database architect/ DBA. Apart from work, she’s outdoorsy and loves to dance.

Read More

Machine learning speeds up vehicle routing

Waiting for a holiday package to be delivered? There’s a tricky math problem that needs to be solved before the delivery truck pulls up to your door, and MIT researchers have a strategy that could speed up the solution.

The approach applies to vehicle routing problems such as last-mile delivery, where the goal is to deliver goods from a central depot to multiple cities while keeping travel costs down. While there are algorithms designed to solve this problem for a few hundred cities, these solutions become too slow when applied to a larger set of cities.

To remedy this, Cathy Wu, the Gilbert W. Winslow Career Development Assistant Professor in Civil and Environmental Engineering and the Institute for Data, Systems, and Society, and her students have come up with a machine-learning strategy that accelerates some of the strongest algorithmic solvers by 10 to 100 times.

The solver algorithms work by breaking up the problem of delivery into smaller subproblems to solve — say, 200 subproblems for routing vehicles between 2,000 cities. Wu and her colleagues augment this process with a new machine-learning algorithm that identifies the most useful subproblems to solve, instead of solving all the subproblems, to increase the quality of the solution while using orders of magnitude less compute.

Their approach, which they call “learning-to-delegate,” can be used across a variety of solvers and a variety of similar problems, including scheduling and pathfinding for warehouse robots, the researchers say.

The work pushes the boundaries on rapidly solving large-scale vehicle routing problems, says Marc Kuo, founder and CEO of Routific, a smart logistics platform for optimizing delivery routes. Some of Routific’s recent algorithmic advances were inspired by Wu’s work, he notes.

“Most of the academic body of research tends to focus on specialized algorithms for small problems, trying to find better solutions at the cost of processing times. But in the real-world, businesses don’t care about finding better solutions, especially if they take too long for compute,” Kuo explains. “In the world of last-mile logistics, time is money, and you cannot have your entire warehouse operations wait for a slow algorithm to return the routes. An algorithm needs to be hyper-fast for it to be practical.”

Wu, social and engineering systems doctoral student Sirui Li, and electrical engineering and computer science doctoral student Zhongxia Yan presented their research this week at the 2021 NeurIPS conference.

Selecting good problems

Vehicle routing problems are a class of combinatorial problems, which involve using heuristic algorithms to find “good-enough solutions” to the problem. It’s typically not possible to come up with the one “best” answer to these problems, because the number of possible solutions is far too huge.

“The name of the game for these types of problems is to design efficient algorithms … that are optimal within some factor,” Wu explains. “But the goal is not to find optimal solutions. That’s too hard. Rather, we want to find as good of solutions as possible. Even a 0.5% improvement in solutions can translate to a huge revenue increase for a company.”

Over the past several decades, researchers have developed a variety of heuristics to yield quick solutions to combinatorial problems. They usually do this by starting with a poor but valid initial solution and then gradually improving the solution — by trying small tweaks to improve the routing between nearby cities, for example. For a large problem like a 2,000-plus city routing challenge, however, this approach just takes too much time.

More recently, machine-learning methods have been developed to solve the problem, but while faster, they tend to be more inaccurate, even at the scale of a few dozen cities. Wu and her colleagues decided to see if there was a beneficial way to combine the two methods to find speedy but high-quality solutions.

“For us, this is where machine learning comes in,” Wu says. “Can we predict which of these subproblems, that if we were to solve them, would lead to more improvement in the solution, saving computing time and expense?”

Traditionally, a large-scale vehicle routing heuristic might choose which subproblems to solve, and in which order, either randomly or by applying yet another carefully devised heuristic. In this case, the MIT researchers ran sets of subproblems through a neural network they created to automatically find the subproblems that, when solved, would lead to the greatest gain in quality of the solutions. This sped up the subproblem selection process by 1.5 to 2 times, Wu and colleagues found.

“We don’t know why these subproblems are better than other subproblems,” Wu notes. “It’s actually an interesting line of future work. If we did have some insights here, these could lead to designing even better algorithms.”

Surprising speed-up

Wu and colleagues were surprised by how well the approach worked. In machine learning, the idea of garbage-in, garbage-out applies — that is, the quality of a machine-learning approach relies heavily on the quality of the data. A combinatorial problem is so difficult that even its subproblems can’t be optimally solved. A neural network trained on the “medium-quality” subproblem solutions available as the input data “would typically give medium-quality results,” says Wu. In this case, however, the researchers were able to leverage the medium-quality solutions to achieve high-quality results, significantly faster than state-of-the-art methods.

For vehicle routing and similar problems, users often must design very specialized algorithms to solve their specific problem. Some of these heuristics have been in development for decades.

The learning-to-delegate method offers an automatic way to accelerate these heuristics for large problems, no matter what the heuristic or — potentially — what the problem.

Since the method can work with a variety of solvers, it may be useful for a variety of resource allocation problems, says Wu. “We may unlock new applications that now will be possible because the cost of solving the problem is 10 to 100 times less.”

The research was supported by MIT Indonesia Seed Fund, U.S. Department of Transportation Dwight David Eisenhower Transportation Fellowship Program, and the MIT-IBM Watson AI Lab.

Read More

A Fast WordPiece Tokenization System

Tokenization is a fundamental pre-processing step for most natural language processing (NLP) applications. It involves splitting text into smaller units called tokens (e.g., words or word segments) in order to turn an unstructured input string into a sequence of discrete elements that is suitable for a machine learning (ML) model. In deep learning–based models (e.g., BERT), each token is mapped to an embedding vector to be fed into the model.

Tokenization in a typical deep learning model, like BERT.

A fundamental tokenization approach is to break text into words. However, using this approach, words that are not included in the vocabulary are treated as “unknown”. Modern NLP models address this issue by tokenizing text into subword units, which often retain linguistic meaning (e.g., morphemes). So, even though a word may be unknown to the model, individual subword tokens may retain enough information for the model to infer the meaning to some extent. One such subword tokenization technique that is commonly used and can be applied to many other NLP models is called WordPiece. Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespaces) and then tokenizes each word into subword units, called wordpieces.

The WordPiece tokenization process with an example sentence.

In “Fast WordPiece Tokenization”, presented at EMNLP 2021, we developed an improved end-to-end WordPiece tokenization system that speeds up the tokenization process, reducing the overall model latency and saving computing resources. In comparison to traditional algorithms that have been used for decades, this approach reduces the complexity of the computation by an order of magnitude, resulting in significantly improved performance, up to 8x faster than standard approaches. The system has been applied successfully in a number of systems at Google and has been publicly released in TensorFlow Text.

Single-Word WordPiece Tokenization
WordPiece uses a greedy longest-match-first strategy to tokenize a single word — i.e., it iteratively picks the longest prefix of the remaining text that matches a word in the model’s vocabulary. This approach is known as maximum matching or MaxMatch, and has also been used for Chinese word segmentation since the 1980s. Yet despite its wide use in NLP for decades, it is still relatively computation intensive, with the commonly adopted MaxMatch approaches’ computation being quadratic with respect to the input word length (n). This is because two pointers are needed to scan over the input: one to mark a start position, and the other to search for the longest substring matching a vocabulary token at that position.

We propose an alternative to the MaxMatch algorithm for WordPiece tokenization, called LinMaxMatch, which has a tokenization time that is strictly linear with respect to n. First, we organize the vocabulary tokens in a trie (also called a prefix tree), where each trie edge is labeled by a character, and a tree path from the root to some node represents a prefix of some token in the vocabulary. In the figure below, nodes are depicted as circles and tree edges are black solid arrows. Given a trie, a vocabulary token can be located to match an input text by traversing from the root and following the trie edges to match the input character by character; this process is referred to as trie matching.

The figure below shows the trie created from the vocabulary consisting of “a”, “abcd”, “##b”, “##bc”, and “##z”. An input text “abcd” can be matched to a vocabulary token by walking from the root (upper left) and following the trie edges with labels “a”, “b”, “c”, “d” one by one. (The leading “##” symbols are special characters used in WordPiece tokenization that are described in more detail below.)

Trie diagram of the vocabulary [“a”, “abcd”, “##b”, “##bc”, “##z”]. Circles and arrows represent nodes and edges along the trie, respectively.

Second, inspired by the Aho-Corasick algorithm, a classical string-searching algorithm invented in 1975, we introduce a method that breaks out of a trie branch that fails to match the given input and skips directly to an alternative branch to continue matching. As in standard trie matching, during tokenization, we follow the trie edges to match the input characters one by one. When trie matching cannot match an input character for a given node, a standard algorithm would backtrack to the last character where a token was matched and then restart the trie matching procedure from there, which results in repetitive and wasteful iterations. Instead of backtracking, our method triggers a failure transition, which is done in two steps: (1) it collects the precomputed tokens stored at that node, which we call failure pops; and (2) it then follows the precomputed failure link to a new node from which the trie matching process continues.

For example, given a model with the vocabulary described above (“a”, “abcd”, “##b”, “##bc”, and “##z”), WordPiece tokenization distinguishes subword tokens matching at the start of the input word from the subword tokens starting in the middle (the latter being marked with two leading hashes “##”). Hence, for input text “abcz”, the expected tokenization output is [“a”, “##bc”, “##z”], where “a” matches at the beginning of the input while “##bc” and “##z” match in the middle. For this example, the figure below shows that, after successfully matching three characters ‘a’, ‘b’, ‘c’, trie matching cannot match the next character ‘z’ because “abcz” is not in the vocabulary. In this situation, LinMaxMatch conducts a failure transition by outputting the first recognized token (using the failure pop token “a”) and following the failure link to a new node to continue the matching process (in this case, the node with “##bc” as its failure pop token). The process then repeats from the new node.

Trie structure for the same vocabulary as shown in the example above, now illustrating the approach taken by our new Fast WordPiece Tokenizer algorithm. Failure pops are bracketed and shown in purple. Failure links between nodes are indicated with dashed red line arrows.

Since at least n operations are required to read the entire input, the LinMaxMatch algorithm is asymptotically optimal for the MaxMatch problem.

End-to-End WordPiece Tokenization
Whereas existing systems pre-tokenize the input text (splitting it into words by punctuation and whitespace characters) and then call WordPiece tokenization on each resulting word, we propose an end-to-end WordPiece tokenizer that combines pre-tokenization and WordPiece into a single, linear-time pass. It uses the LinMaxMatch trie matching and failure transitions as much as possible and only checks for punctuation and whitespace characters among the relatively few input characters that are not handled by that loop. It is more efficient because it traverses the input only once, performs fewer punctuation/whitespace checks, and skips the creation of intermediate words.

End-to-End WordPiece Tokenization.

Benchmark Results
We benchmark our method against two widely adopted WordPiece tokenization implementations: HuggingFace Tokenizers, from the HuggingFace Transformers library, one of the most popular open-source NLP tools, and TensorFlow Text, the official library of text utilities for TensorFlow. We use the WordPiece vocabulary released with the BERT-Base, Multilingual Cased model.

We compared our algorithms with HuggingFace Tokenizers and TensorFlow Text on a large corpus (several million words) and confirmed that our token splits are identical to those of the other implementations, for both single-word and end-to-end tokenization.

To generate the test data, we sample 1,000 sentences from the multilingual Wikipedia dataset, covering 82 languages. On average, each word has four characters, and each sentence has 82 characters or 17 words. We found this dataset large enough because a much larger dataset (consisting of hundreds of thousands of sentences) generated similar results.

We compare the average runtime when tokenizing a single word or general text (end-to-end) for each system. Fast WordPiece tokenizer is 8.2x faster than HuggingFace and 5.1x faster than TensorFlow Text, on average, for general text end-to-end tokenization.

Average runtime of each system. Note that for better visualization, single-word tokenization and end-to-end tokenization are shown in different scales.

We also examine how the runtime grows with respect to the input length for single-word tokenization. Because of its linear-time complexity, the runtime of LinMaxMatch increases at most linearly with the input length, growing much more slowly than that of the quadratic-time approaches.

The average runtime of each system with respect to the input length for single-word tokenization.

Conclusion
We proposed LinMaxMatch for single-word WordPiece tokenization, which solves the decades-old MaxMatch problem in asymptotically optimal time with respect to the input length. LinMaxMatch extends the Aho-Corasick algorithm, and the idea can be applied to other string-search and transducer problems. We also proposed an End-to-End WordPiece algorithm that combines pre-tokenization and WordPiece tokenization into a single, linear-time pass for even higher efficiency.

Acknowledgements
We gratefully acknowledge the key contributions and useful advice from other team members and colleagues, including Abbas Bazzi, Alexander Frömmgen, Alex Salcianu, Andrew Hilton, Bradley Green, Ed Chi, Chen Chen, Dave Dopson, Eric Lehman, Fangtao Li, Gabriel Schubiner, Gang Li, Greg Billock, Hong Wang, Jacob Devlin, Jayant Madhavan, JD Chen, Jifan Zhu, Jing Li, John Blitzer, Kirill Borozdin, Kristina Toutanova, Majid Hadian-Jazi, Mark Omernick, Max Gubin, Michael Fields, Michael Kwong, Namrata Godbole, Nathan Lintz, Pandu Nayak, Pew Putthividhya, Pranav Khaitan, Robby Neale, Ryan Doherty, Sameer Panwar, Sundeep Tirumalareddy, Terry Huang, Thomas Strohmann, Tim Herrmann, Tom Small, Tomer Shani, Wenwei Yu, Xiaoxue Zang, Xin Li, Yang Guo, Yang Song, Yiming Xiao, Yuan Shen, and many more.

Read More

Live transcriptions of F1 races using Amazon Transcribe

The Formula 1 (F1) live streaming service, F1 TV, has live automated closed captions in three different languages: English, Spanish, and French.

For the 2021 season, FORMULA 1 achieved another technological breakthrough, building a fully automated workflow to create closed captions in three languages and broadcast them to 85 territories using Amazon Transcribe. Amazon Transcribe is an automatic speech recognition (ASR) service that lets you generate transcriptions from audio.

In this post, we share how Formula 1 joined forces with the AWS Professional Services team to make it happen. We discuss how they used Amazon Transcribe and its custom vocabulary feature as well as custom-built postprocessing logic to improve their live transcription accuracy in three languages.

The challenge

For F1, everything is about extreme speed: with pit stops as short as 2 seconds, speeds of up to 375 KPH (233 MPH), and 5g forces on drivers under braking and through corners. In this fast-paced and dynamic environment, milliseconds dictate the difference between pole position and second on the grid. The role of the race commentators is to weave the multitude of parallel events and information into a single exciting narrative. This form of commentary greatly increases the engagement and excitement of viewers.

F1 has a strong affinity for cutting-edge technology, and partnered with AWS to build a scalable and sustainable closed caption solution for F1 TV, their over-the-top (OTT) platform, that can support a growing calendar and language portfolio. F1 now provides real-time live captions in three languages across four series: F1 in British English, US Spanish, and French; and F2, F3, and Porsche Supercup in British English and US Spanish. This was achieved using Amazon Transcribe to automatically convert the commentary into subtitles.

This task presents many unique challenges. With the excitement of an F1 race, it’s common to have commentators with differing accents move quickly from one topic to another as the race unfolds. Being a sport steeped in technology, commentators often refer to F1 domain-specific terminology such as DRS (Drag Reduction System), aerodynamics, downforce, or the halo (a safety device). Moreover, F1 is a global sport, traveling across the world and drawing drivers from many different countries. Looking only at the 2021 season, 16/20 drivers had non-English names and 17/20 had non-Spanish or non-French names. With the advanced customization features available in Amazon Transcribe, we tailored the underlying language models to recognize domain-specific terms that are rare in general language use, which boosted transcription accuracy.

In the following sections, we take a deep dive into how AWS Professional Services partnered with F1 to build a robust, state-of-the-art, real-time race commentary captioning system by enhancing Amazon Transcribe to understand the particularities of the F1 world. You will learn how to utilize Amazon Transcribe in real-time broadcasts and supercharge live captioning for your use case with custom vocabularies, postprocessing steps, and a human-in-the-loop validation layer.

Solution overview

The solution works as a proxy to Amazon Transcribe. Custom vocabularies are passed as parameters to Amazon Transcribe, and the resulting captions are postprocessed. The postprocessed text is then moderated by an F1 moderator before being transformed to captions that are displayed to the viewers. The following diagram shows the sequential process.

Live transcriptions: Understanding use case specific terminology and context

The output of Automatic Speech Recognition (ASR) systems is highly context-dependent. ASR language models benefit from utilizing the words across a fully spoken sentence. For example, in the following sentence, the system uses the words ‘WORLD CHAMPIONSHIP’ towards the end of the sentence to inform context and allow ‘FORMER ONE’ to be correctly transcribed as ‘FORMULA 1’.

GOOD AFTERNOON EVERYBODY WELCOME ALONG TO ROUND 4 OF THE FORMER ONE

 

GOOD AFTERNOON EVERYBODY WELCOME ALONG TO ROUND 4 OF THE FORMULA 1 WORLD CHAMPIONSHIP IN 2019

Amazon Transcribe supports both batch and streaming transcription models. In batch transcription, the model issues a transcription using the full context provided in the audio segment. Amazon Transcribe streaming transcription enables you to send an audio stream and receive a transcription stream in real time. Generating subtitles for a live broadcast requires a streaming model because transcriptions should appear on screen shortly after the commentary is spoken. This real-time need presents unique challenges compared to batch transcriptions and often affects the quality of the results because the language model has limited knowledge of the future context.
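As a hedged sketch of how a streaming transcription might be wired up with the open-source amazon-transcribe streaming SDK for Python (class and parameter names may differ slightly across SDK versions; the region, sample rate, and vocabulary name below are placeholders, not F1’s actual configuration):

```python
import asyncio
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler

class CaptionHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, event):
        # Streaming results arrive incrementally; partial results are refined
        # as more audio (and therefore more context) becomes available.
        for result in event.transcript.results:
            if not result.is_partial:
                print(result.alternatives[0].transcript)

async def stream_commentary(audio_chunks):
    client = TranscribeStreamingClient(region="eu-west-1")   # placeholder region
    stream = await client.start_stream_transcription(
        language_code="en-GB",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
        vocabulary_name="f1-custom-vocabulary-en",           # hypothetical name
    )

    async def send_audio():
        for chunk in audio_chunks:   # raw PCM frames from the broadcast feed
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = CaptionHandler(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())
```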

Amazon Transcribe is pre-trained to capture a wide range of use cases. However, F1 domain-specific terminology, names, and locations aren’t present in the Amazon Transcribe general language model. Getting those words correct is nevertheless crucial for the understanding of the narrative, such as who is leading the race, circuit corners, and technical details.

Amazon Transcribe allows you to develop with custom vocabularies and custom language models to improve transcription accuracy. You can use them separately for streaming transcriptions or together for batch transcriptions.

Custom vocabularies consist of a list of specific words that you want Amazon Transcribe to recognize in the audio input. These are generally domain-specific words and phrases, such as proper nouns. You can inform Amazon Transcribe how to pronounce these terms with information such as SoundsLike (in regular orthography) or the IPA (International Phonetic Alphabet) description of the term. Custom vocabularies are available for all languages supported by Amazon Transcribe. Custom vocabularies improve the ability of Amazon Transcribe to recognize terms without using the context in which they’re spoken.

The following table shows some examples of a custom vocabulary.

Phrase | DisplayAs | SoundsLike | IPA
Charles-Leclerc | Charles Leclerc | | ʃ ɑ ɹ l l ə k l ɛ ɹ
Charles-Leclerc | Charles Leclerc | shal-luh-klurk |
Lewis-Hamilton | Lewis Hamilton | loo-is-ha-muhl-tn |
Lewis-Hamilton | Lewis Hamilton | loo-uhs-ha-muhl-tn |
Ferrari | Ferrari | | f ɝ ɹ ɑ ɹ ɪ
Ferrari | Ferrari | fuh-rehr-ee |
Mercedes | Mercedes | mer-sey-deez |
Mercedes | Mercedes | | m ɛ ɹ s eɪ d i z

The custom vocabulary includes the following details:

  • Phrase – The term that should be recognized.
  • DisplayAs – How the word or phrase looks when it’s output. If not declared, the output would be the phrase.
  • SoundsLike – The term broken into small pieces with the respective pronunciations in the specified language using standard orthography.
  • IPA – The International Phonetic Alphabet representation for the term.
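As a sketch of how a table like the one above could be registered with the service using boto3 (the vocabulary name and S3 URI are placeholders, not F1’s actual resources):

```python
import boto3

transcribe = boto3.client("transcribe")

# The table above can be saved as a tab-separated file with the header row
# "Phrase\tDisplayAs\tSoundsLike\tIPA", uploaded to S3, and then registered.
transcribe.create_vocabulary(
    VocabularyName="f1-custom-vocabulary-en",                 # hypothetical name
    LanguageCode="en-GB",
    VocabularyFileUri="s3://example-bucket/f1_vocab_en.txt",  # placeholder URI
)

# Vocabulary creation is asynchronous; poll until it is ready to use.
response = transcribe.get_vocabulary(VocabularyName="f1-custom-vocabulary-en")
print(response["VocabularyState"])  # PENDING -> READY (or FAILED)
```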

Custom language models are valuable when there are larger corpora of text data that can be used to train models. With the additional data, the models learn to predict the probabilities of sequences of words in the domain-specific context. For this project, F1 chose to use custom vocabularies, given the words and phrases that are unique to F1 racing.

Postprocessing: the final layer of performance boosting

Due to the fast-paced nature of F1 commentary, with rapidly changing context as well as commentator accents, inaccurate transcriptions may still occur. However, recurring mistakes can be easily fixed using text replacement. For example, “Kvyat and Albon” may be misheard as “create an album” by the British English language model. Because “create an album” is an unlikely phrase in F1 commentary, we can safely replace it with its assumed real meaning in a postprocessing routine. On top of that, postprocessing terms can be defined as general, or based on location and race series filters. Such selection allows for more specific term replacement, reducing the chance of erroneous replacements with this approach.

For this project, we gathered thousands of replacements for each language using hours of real-life F1 audio commentary that was analyzed by F1 domain specialists. On top of that, during every live event, F1 runs a transcribed commentary through a human-in-the-loop tool (described in the next section), which allows sentence rejection before the subtitles appear on screen. This data is used later to continuously improve the custom vocabulary and postprocessing rules. The following table shows examples of postprocessing rules for English captions. The location filter is a replacement filter based on race location, and the race series filter is based on the race series.

Original Term | Replacement | Location Filter | Race Series Filter
CHARLOTTE CLAIRE | CHARLES LECLERC | | FORMULA 1
CREATE AN ALBUM | KVYAT AND ALBON | | FORMULA 1
SCHWARTZMAN | SHWARTZMAN | | FORMULA 2
CURVE A PARABOLIC | CURVA PARABOLICA | Italy |
CIRCUIT THE CATALONIA | CIRCUIT DE CATALUNYA | Spain |
TYPE COMPOUNDS | TYRE COMPOUNDS | |
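A minimal sketch of how such replacement rules might be applied, assuming the rules are stored as simple records with optional filters (this is illustrative; the production logic is more extensive):

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    original: str
    replacement: str
    location: Optional[str] = None   # e.g., "Italy"; None means any location
    series: Optional[str] = None     # e.g., "FORMULA 1"; None means any series

# A few of the English rules from the table above, expressed as data.
RULES = [
    Rule("CHARLOTTE CLAIRE", "CHARLES LECLERC", series="FORMULA 1"),
    Rule("CREATE AN ALBUM", "KVYAT AND ALBON", series="FORMULA 1"),
    Rule("CURVE A PARABOLIC", "CURVA PARABOLICA", location="Italy"),
    Rule("TYPE COMPOUNDS", "TYRE COMPOUNDS"),
]

def postprocess(text, location, series, rules=RULES):
    """Apply whole-phrase replacements, keeping only rules whose filters match."""
    for rule in rules:
        if rule.location and rule.location != location:
            continue
        if rule.series and rule.series != series:
            continue
        text = re.sub(rf"\b{re.escape(rule.original)}\b", rule.replacement, text)
    return text

print(postprocess("AND CREATE AN ALBUM ARE SIDE BY SIDE INTO TURN ONE",
                  location="Bahrain", series="FORMULA 1"))
# AND KVYAT AND ALBON ARE SIDE BY SIDE INTO TURN ONE
```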

Another important function of postprocessing is the standardization and formatting of numbers. When generating transcriptions for live broadcasts such as television, it’s a best practice to use digits when displaying numbers because they’re faster to read and occupy less space on screen. In English, Amazon Transcribe automatically displays numbers greater than 10 as digits, and numbers between 0 and 10 are converted to digits under specific conditions, such as when more than one appears in a row. For example, “three four five” converts to 345. In an effort to standardize number transcriptions, we digitize all numbers.

As of August 8, 2021, transcriptions only output numbers as digits instead of words for a defined list of languages in both batch and streaming (for more information, see Transcribing numbers and punctuation). Notably, this list doesn’t include Spanish (es-US and es-ES) or French (fr-FR and fr-CA). With the postprocessing routine, numbers were also formatted to handle integers, decimals, and ordinals, as well as F1-specific lap time formatting.

The following shows an example of number postprocessing for different languages that were built for F1.
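As a minimal illustration of this kind of rule, the sketch below digitizes spelled-out small numbers and collapses lap times into the M:SS.mmm format; the patterns are our own simplified assumptions, not F1’s production rules:

```python
import re

WORD_TO_DIGIT = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
}

def digitize_words(text):
    """Replace spelled-out small numbers with digits (Transcribe already does
    this for larger numbers in English, but not in every language)."""
    return re.sub(
        r"\b(" + "|".join(WORD_TO_DIGIT) + r")\b",
        lambda m: WORD_TO_DIGIT[m.group(1)],
        text,
    )

def format_lap_times(text):
    """Collapse 'M SS mmm' sequences into the F1 lap-time format M:SS.mmm."""
    return re.sub(r"\b(\d) (\d{2}) (\d{3})\b", r"\1:\2.\3", text)

raw = "THE GERMAN DRIVER SEBASTIAN VETTEL COMPLETED THE LAST LAP IN ONE 15 632"
print(format_lap_times(digitize_words(raw)))
# THE GERMAN DRIVER SEBASTIAN VETTEL COMPLETED THE LAST LAP IN 1:15.632
```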

Human in the loop: Continuous improvement and adaptation

Amazon Transcribe custom vocabularies and postprocessing boost the service’s real-time performance significantly. However, the fast-paced and quickly changing environment remains a challenge for automated transcriptions. It’s better for a person reliant on closed captions to miss out on a phase of commentary, rather than see an incorrect transcription that may be misleading. To this end, F1 employs a human in the loop as a final validation, where a moderator has a number of seconds to verify if a word or an entire sentence should be removed before it’s included in the video stream. Any removed sentences are then used to improve the custom vocabularies and postprocessing step for the next races.

Evaluation

Minor grammatical errors don’t greatly decrease the understandability of a sentence. However, using the wrong F1 terminology breaks a sentence. Usually ASR systems are evaluated on word error rate (WER), which quantifies how many insertions, deletions, and substitutions are required to change the predicted sentence to the correct one.
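For reference, WER can be computed with a word-level edit distance; the sketch below is a straightforward dynamic-programming implementation (not the evaluation code used in this project):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "THE GERMAN DRIVER SEBASTIAN VETTEL COMPLETED THE LAST LAP"
hyp = "THE GERMAN DRIVER THE BASTION BETTER COMPLETED THE LAST LAP"
print(round(word_error_rate(ref, hyp), 2))  # 0.33: 3 edits over 9 reference words
```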

Although WER is important, F1-specific terms are even more crucial. For this, we created an accuracy score that measures the accuracy of people names (such as Charles Leclerc), teams (McLaren), locations (Hungaroring), and other F1 terms (DRS) transcribed in a commentary. These scores allow us to evaluate how understandable the transcriptions are to F1 fans and, combined with WER, allow us to maintain high-quality transcriptions and improvements in Amazon Transcribe.

Results

The F1 TV enhanced live transcription system was released on March 26, 2021, during the Formula 1 Gulf Air Bahrain Grand Prix. By the first race, the solution had already achieved a strong reduction in WER and F1-specific accuracy improvements for all three languages, compared to the standard Amazon Transcribe model. In the following tables, we highlight the WER and F1-specific accuracy improvements for the different languages. The numbers compare the developed solution (Amazon Transcribe with custom vocabularies and postprocessing) against the generic Amazon Transcribe model. The lower the WER, the better.

Language | Standard Amazon Transcribe WER | Amazon Transcribe with CV and Postprocessing WER | WER Improvement
English | 18.95% | 11.37% | 39.99%
Spanish | 25.95% | 16.21% | 37.16%
French | 37.40% | 16.80% | 55.08%

Language | Accuracy Group | Standard Amazon Transcribe Accuracy | Amazon Transcribe with CV and Postprocessing Accuracy | Accuracy Improvement
English | People Names | 40.17% | 92.25% | 129.68%
English | Teams | 56.33% | 95.28% | 69.15%
English | Locations | 61.82% | 94.33% | 52.59%
English | Other F1 terms | 81.47% | 90.89% | 11.55%
Spanish | People Names | 45.31% | 95.43% | 110.62%
Spanish | Teams | 39.40% | 95.46% | 142.28%
Spanish | Locations | 58.32% | 87.58% | 50.17%
Spanish | Other F1 terms | 63.87% | 85.25% | 33.47%
French | People Names | 39.12% | 92.38% | 136.15%
French | Teams | 33.20% | 90.84% | 173.61%
French | Locations | 55.34% | 89.33% | 61.42%
French | Other F1 terms | 61.15% | 86.77% | 41.90%

Although the approach significantly improves the WER measures, its main influence is seen on F1 names, teams, and locations. Because the F1-specific terms are often in local languages, custom vocabularies and postprocessing steps can quickly teach Amazon Transcribe to consider those terms and correctly transcribe them. The postprocessing step then further adapts the output transcriptions to F1’s domain to provide highly accurate automated transcriptions. In the following examples, we present phrases in English, Spanish, and French where Amazon Transcribe custom vocabularies, postprocessing, and number handling techniques successfully improved the transcription accuracy.

For Spanish, we have the original Amazon Transcribe output “EL PILOTO BRITÁNICO LORIS JAMIL TODOS ESTÁ A DOS SEGUNDOS PUNTO TRES DEL LIDER. COMPLETÓ SU ÚLTIMA VUELTA EN UNO VEINTINUEVE DOSCIENTOS TREINTA Y CUATRO” compared to the final transcription “EL PILOTO BRITÁNICO LEWIS HAMILTON ESTÁ A 2.3 s DEL LIDER. COMPLETÓ SU ÚLTIMA VUELTA EN 1:29.234.”

The custom vocabulary and postprocessing combination converted “LORIS JAMIL TODOS” to “LEWIS HAMILTON,” and the number handling routine converted the lap time to digits and added the appropriate punctuation (1:29.234).

For English, compare the original output “THE GERMAN DRIVER THE BASTION BETTER COMPLETED THE LAST LAP IN ONE 15 632” to the final transcription “THE GERMAN DRIVER SEBASTIAN VETTEL COMPLETED THE LAST LAP IN 1:15.632.”

The custom vocabulary and postprocessing combination converted “THE BASTION BETTER” to “SEBASTIAN VETTEL.”

In French, we can compare the original output “VICTOIRE POUR LES MISS MILLE TONNE DIX-HUIT POLE CENT TROIS PODIUM QUATRE VICTOIRES ICI” to the final output “VICTOIRE POUR LEWIS HAMILTON 18 POLE 103 PODIUM 4 VICTOIRES ICI.”

The custom vocabulary and postprocessing combination converted “LES MISS MILLE TONNE” to “LEWIS HAMILTON,” and the number handling routine converted the numbers to digits.

The following short video shows live captions in action during the Formula 1 Gulf Air Bahrain Grand Prix 2021.

Summary

In this post, we explained how F1 is now able to provide live closed captions on their OTT (Over-The-Top) platform to benefit viewers with accessibility needs and those who want to ensure they do not miss any live commentary.

In collaboration with AWS Professional Services, F1 has set up live transcriptions in English, Spanish, and French by using Amazon Transcribe and applying enhancements to capture domain-specific terminology.

Whether for sport broadcasting, streaming educational content, or conferences and webinars, AWS Professional Services is ready to help your team develop a real-time captioning system that is accurate and customizable by making full use of your domain-specific knowledge and the advanced features of Amazon Transcribe. For more information, see AWS Professional Services or reach out through your account manager to get in touch.


About the Authors

Beibit Baktygaliyev is a Senior Data Scientist with AWS Professional Services. As a technical lead, he helps customers to attain their business goals through innovative technology. In his spare time, Beibit enjoys sports and spending time with his family and friends.

Maira Ladeira Tanke is a Data Scientist at AWS Professional Services. She works with customers across industries to help them achieve business outcomes with AI and ML technologies. In her spare time, Maira likes to play with her cat Smila. She also loves to travel and spend time with her family and friends.

Sara Kazdagli is a Professional Services consultant specialized in Data Analytics and Machine Learning. She helps customers across different industries to build innovative solutions and make data-driven decisions. Sara holds an MSc in Software Engineering and an MSc in Data Science. In her spare time, she likes to go on hikes and walks with her Australian shepherd dog Kiba.

Pablo Hermoso Moreno is a Data Scientist in the AWS Professional Services Team. He works with clients across industries using Machine Learning to tell stories with data and reach more informed engineering decisions faster. Pablo’s background is in Aerospace Engineering and having worked in the motorsport industry he has an interest in bridging physics and domain expertise with ML. In his spare time, he enjoys rowing and playing guitar.

Read More

Understanding User Interfaces with Screen Parsing

This blog post summarizes our paper Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots, which was published in the proceedings of UIST 2021. [This blog post was cross-posted in the ACM UIST 2021 Blog on 10/22/2021.]

The Benefits of Machines that Understand User Interfaces

Machines that understand and operate user interfaces (UIs) on behalf of users could offer many benefits. For example, a screen reader (e.g., VoiceOver and TalkBack) could facilitate access to UIs for blind and visually impaired users, and task automation agents (e.g., Siri Shortcuts and IFTTT) could allow users to automate repetitive or complex tasks with their devices more efficiently. These benefits are gated on how well such systems can understand an underlying app’s UI by reasoning about 1) the functionality present, 2) how its different components work together, and 3) how it can be operated to accomplish some goal. Many of these systems rely on the availability of UI metadata (e.g., the view hierarchy and the accessibility hierarchy), which provides some information about what elements are present and their properties. However, this metadata is often unavailable due to poor toolkit support and low developer awareness. To maximize the number of apps they support and the situations in which they are helpful to users, these systems can benefit from understanding UIs solely from their visual appearance.

Recent efforts have focused on predicting the presence of an app’s on-screen elements and semantic regions solely from its visual appearance. These efforts have enabled many useful applications, such as allowing assistive technology to work with inaccessible apps and enabling example-based search for UI designers. However, they constitute only a surface-level understanding of UIs, as they primarily focus on extracting what elements are on a screen and where they appear spatially. To further advance the UI understanding capabilities of machines and perform more valuable tasks, we focus on modeling higher-level relationships by predicting UI structure.

Our work makes the following contributions:

  • A problem definition of screen parsing which is useful for a wide range of UI modeling applications
  • A description of our implementation and its training procedure
  • A comprehensive evaluation of our implementation with baseline comparison
  • Three implemented examples of how our model can be used to facilitate downstream applications such as (i) UI similarity, (ii) accessibility metadata generation, and (iii) code generation.

Achieving Better Understanding of UIs through Hierarchy

An example of an input screen (left) and the corresponding UI hierarchy (right). The tree contains all of the visible elements on the screen (the output is complete), groups them together to form higher-level structures (the output is abstractive), and its nodes can be used to reference UI elements (the output is grounded).

Structural representations enhance the understanding of many types of content by capturing higher-level semantics. For example, scene graphs enrich visual scenes by making sense of interactions between individual objects and parse trees disambiguate sentences by analyzing their grammar. Similarly, structure is a core property of UIs reflected in how they are constructed (i.e., stacking together views and widgets) and used. Modeling element relationships can help machines perceive UIs as humans do — not as a set of elements but as a coordinated presentation of content.

We introduce the problem of screen parsing, which we use to predict structured UI models (a high-level definition of a UI) from visual information. We focus on generating an app screen’s UI hierarchy, which specifies how UI elements are grouped and rendered on the screen. The following are properties of UI hierarchies:

  • Complete — the output is a single directed tree that spans all of the UI elements on a screen
  • Grounded — nodes in the output reference on-screen elements and regions
  • Abstractive — the output can group elements (potentially more than once) to form higher-level structures.

Predicting UI Hierarchy from a Screenshot

An overview of our implementation of screen parsing. To infer the structure of an app screen, our system (i) detects the location and type of UI elements from a screenshot, (ii) predicts a graph structure that describes the relationships between UI elements, and (iii) classifies groups of UI elements.

To predict UI hierarchy from a screenshot, we built a system to:

  1. detect the location and type of UI elements from a screenshot,
  2. predict a hierarchical structure that describes the relationships between them, and
  3. classify semantic groups.

The first step of our system processes a screenshot image using an object detection model (Faster-RCNN), which produces a list of UI element detections. The output is post-processed using standard techniques such as confidence-thresholding and non-max suppression. This list tells us what elements are on the screen and where they are but does not provide any information about their relationship.
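For illustration, the post-processing of raw detections typically looks like the following sketch (the thresholds are arbitrary placeholders, not the values used in our system):

```python
import numpy as np

def filter_detections(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Confidence thresholding followed by greedy non-max suppression (NMS).
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array of confidences."""
    keep = scores >= score_thresh                 # drop low-confidence detections
    boxes, scores = boxes[keep], scores[keep]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)                   # highest confidence first
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        rest = order[1:]
        # Intersection-over-union of the chosen box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou < iou_thresh]            # suppress overlapping duplicates
    return boxes[kept], scores[kept]
```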

Next, we use a stack-pointer parsing model to generate a tree structure representing the UI hierarchy. Like other transition-based parsers, our model incrementally predicts a tree structure by generating a sequence of actions that build connections between UI elements using a pointer mechanism. We made two modifications to the original stack-pointer dependency parser to adapt it to UI hierarchies. First, we injected a “container” token into the input, allowing the model to create multi-level groupings. Second, we trained the model using a dynamic oracle to reduce exposure bias, since the multi-level nature of UI hierarchies leads to exponentially more “optimal” action sequences that produce the same output.

To illustrate how our model predicts UI hierarchy, we will describe the inference process. A flat list of detected UI elements is encoded using a bi-directional LSTM encoder (producing a list l of encoded elements), and the final hidden state is fed to an LSTM decoder network augmented with two data structures: 1) a stack (s), which the network uses as intermediate memory, and 2) a set (v), which records the nodes already processed. The stack s is initialized with a special node that represents the root of the tree. At each timestep, the element on top of s and the last hidden state are fed into the decoder network, which outputs one of three actions:

  • Arc – A directed edge is created between the node on top of s (parent) and the node in l – v with the highest attention score (child). The child is pushed on s and added to v. This action attaches one of the detected UI elements onto the tree.
  • Emit – An intermediate node (represented as a zero-vector) is created and pushed onto s. This action helps the model represent container or “grouping” elements, such as lists, that do not exist in l.
  • Pop – s is popped. This occurs when the model has finished adding all of an element’s children to the tree structure.
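To make the transition system concrete, the following sketch replays a given action sequence deterministically (in the real system the decoder predicts each action); we assume here that an emitted container is attached as a child of the current stack top, a detail the description above does not spell out:

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

def build_tree(elements, actions):
    """Replay Arc/Emit/Pop transitions to assemble a UI hierarchy.
    `elements` is the flat list l of detected UI elements; `actions` is a
    sequence such as ("ARC", 0), ("EMIT",), or ("POP",)."""
    root = Node("root")
    stack = [root]          # s: intermediate memory
    used = set()            # v: indices of detected elements already attached
    for action in actions:
        if action[0] == "ARC":
            # Attach the chosen detected element as a child of the stack top.
            idx = action[1]
            child = Node(elements[idx])
            stack[-1].children.append(child)
            stack.append(child)
            used.add(idx)
        elif action[0] == "EMIT":
            # Create an intermediate "container" node (assumed to attach to the
            # current stack top) and push it onto the stack.
            container = Node("container")
            stack[-1].children.append(container)
            stack.append(container)
        elif action[0] == "POP":
            # Finished adding children to the node on top of the stack.
            stack.pop()
    return root

# Example: a title followed by two rows grouped under one container.
elements = ["Title", "Row 1", "Row 2"]
actions = [("ARC", 0), ("POP",), ("EMIT",),
           ("ARC", 1), ("POP",), ("ARC", 2), ("POP",), ("POP",)]
tree = build_tree(elements, actions)
print([c.label for c in tree.children])              # ['Title', 'container']
print([c.label for c in tree.children[1].children])  # ['Row 1', 'Row 2']
```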

This technique for generating parse trees is widely used in NLP, and it has been shown that a correct sequence of actions exists for any target tree. Note that this was originally shown for a limited subset of parse trees known as “projective” parse trees, but recent work has extended it to handle any type of tree.

Finally, we apply a set classification model to label containers (i.e., intermediate nodes) based on their descendants. We defined seven container types (including an “Other” class) that represent common groupings such as collections (e.g., lists, grids), tables, and tab bars.

The Apple Notes app is fed through Screen Parser. UI elements are iteratively inserted into a tree structure by our model, and intermediate nodes in the tree are then assigned container labels.
Screen Parser uses a multi-step process to infer the UI hierarchy from a screenshot. Element detections are iteratively grouped together using a parsing model that produces a sequence of special actions called transitions (transition-based parsing).

We trained our models on two mobile UI datasets: (i) the AMP dataset of ~130,000 iOS screens, and (ii) RICO, a publicly available dataset of ~80,000 Android screens. Both datasets were collected by crowdworkers who installed and explored popular apps across 20+ categories (in some cases excluding certain ones such as games, AR, and multimedia) on the iOS and Android app stores. Each dataset contains screenshots, annotated screens, and a type of metadata called a view hierarchy. The view hierarchy is an artifact generated during UI rendering that describes which interface widgets are used and “stacked” together to produce the final layout. Not all screens in our dataset contain this metadata, especially those from apps created using third-party UI toolkits or game engines. We apply heuristics to detect and exclude examples with missing or incomplete view hierarchies. The view hierarchies are similar to the presentation model we aim to predict, with a few differences, so we transform them into our target representation by applying graph smoothing, filtering, and element matching between different data sources.

More details about our machine learning models and training procedures can be found in our paper.

Experiments

We used several metrics (e.g., F1 score, graph edit distance) to perform a quantitative evaluation of our system using the test split of our mobile UI datasets. Our main point of comparison was a heuristic-based approach to inferring screen groupings used in previous work, and we found that our system was much more accurate in inferring UI hierarchy. We also found that our improved training procedure led to significant performance gains (23%) over standard methods for training parsers.

Bar chart showing the performance of each system using F1 score and Edit Distance. Screen Parser Dynamic consistently outperforms all baseline systems.

Our system’s performance is affected by a number of factors such as screen complexity and object detection errors. Accuracy is highest for screens up to 32 elements and degrades following that point, in part due to the increased number of actions the parsing model must correctly predict. Complex and crowded screens introduce the additional difficulty of detecting small UI elements, which our analysis with a matching-based oracle (computes best possible matching between object detection output and ground truth) shows as a limiting factor.

Chart comparing the F1 score of screen parser and baselines against the screens of increasing complexity. Performance of all systems declines for more complex screens, with highest drop occurring after 32 elements. Complexity is divided into 5 buckets of 0 to 16 elements, 16 to 32 elements, 32 to 48 elements, 48 to 64 elements, and more than 64 elements.

UI Hierarchy Facilitates and Improves Applications

We present a suite of example applications implemented using our screen parsing system. These applications show the versatility of our approach and how the predicted UI hierarchy facilitates many downstream tasks.

UI Similarity

Scatter plot of screens represented as 2-D points corresponding to similarity in embedding space. We show four pairs of screenshots where each pair is similar structurally, but has surface-level differences such as scaling, language, theme, and dynamic content.
We used our system to generate embedding vectors for different UI screens that capture their structure, instead of their surface-level appearance. We show that embeddings for the same screen are minimally affected by different display settings (e.g., scaling, language, theme, dynamic content).

Many recent efforts in modeling UIs have focused on representing them as fixed-length embedding vectors. These vectors can be trained to encode different properties of UI screens, such as layout, content, and style, and they can be fine-tuned to support downstream tasks. For example, a common application of embedding models is measuring screen similarity, which is represented by distance in embedding space. We believe the performance of such models can be improved by incorporating structural information, an important property of UIs.

The intermediate representation of our parsing model can be used to produce a screen embedding, which describes the hierarchical structure of an app. To generate an embedding of a UI, we feed it into our model and pool the last hidden state of the encoder. This includes information about the position, type, and structure of on-screen elements. Our structural embedding can help minimize variations from display settings such as (i) scaling, (ii) language, (iii) theme, and (iv) small dynamic changes. The properties of our embedding could be useful for some UI understanding applications, such as app crawling and information extraction where it would be beneficial to disentangle screen structure and appearance. For example, an app crawler’s next action should be conditioned on the UI elements present on the screen, not on the user’s current theme. An autoencoder trained on UI screenshots would not have this property.
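A minimal sketch of this embed-and-compare step, assuming the encoder’s hidden states are available as an array (pooling by taking the final state, as described above):

```python
import numpy as np

def screen_embedding(encoder_hidden_states):
    """Pool the encoder output into a fixed-length screen embedding.
    `encoder_hidden_states` is a (num_elements, hidden_dim) array; taking the
    final state is the simple pooling choice assumed here."""
    return encoder_hidden_states[-1]

def screen_similarity(emb_a, emb_b):
    """Cosine similarity: structurally similar screens map to nearby vectors."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```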

Accessibility Improvement

An app screen processed with heuristics and our screen parsing model. We show that our model leads to fewer grouping errors, which is beneficial to screen reader navigation experience.
Element boxes are annotated using their navigation ordering, where the number represents how many swipes are needed to access the element when using a screen reader. While both results contain errors, in this case, Screen Parser correctly groups more elements, which decreases the number of swipes needed to access elements.

Recent work has successfully generated missing metadata for inaccessible apps by running an object detection model on the UI screenshot. Their approach to generating hierarchical data relies on manually defined heuristics that detect localized patterns between elements. However, these approaches may sometimes fail because they do not have access to global information necessary for resolving ambiguities.

In contrast, our implementation generates a UI hierarchy with a global view of the input, so it can overcome some of the limitations of heuristic-based approaches. We used the predicted UI hierarchy to group together the children of intermediate nodes of height 1 that contained at most one text label, and used the X-Y cut algorithm to determine navigation order. The figure above shows an example where the grouping output from the Screen Parsing model is more accurate than the one produced by manually defined heuristics. While this is not always the case, learning grouping rules from data requires much less human effort than manual heuristic engineering.

Code Generation

A UI screenshot re-rendered on a tablet form factor using the code generated by our system. Some errors are visible in the generated output, such as incorrect element ordering.
By mapping nodes in the UI hierarchy to declarative view-creation methods, we can generate code for a UI from its screenshot. Here, a restaurant app is re-rendered on a tablet form-factor.

Existing approaches to code generation also rely on heuristics to detect a limited subset of container types. We employed a technique used by compilers to generate code from abstract syntax trees (the visitor pattern for code generation) and applied it to the predicted UI hierarchy. Specifically, we performed a depth-first traversal of the UI hierarchy using a visitor function that generates code based on the current state (current node and stack). The visitor function emits a SwiftUI control (e.g., Text, Toggle, Button) at every leaf node and a SwiftUI container (e.g., VStack, HStack) at every intermediate node. Additional parameters required by view constructors, such as label text and background color, were extracted using OCR and a small set of heuristics.
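The following sketch illustrates the visitor idea on a toy UI hierarchy; the node types, the container-to-stack mapping, and the label handling are simplified assumptions rather than our production rules:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UINode:
    kind: str              # e.g., "text", "button", "toggle", "row", "column"
    label: str = ""
    children: List["UINode"] = field(default_factory=list)

def generate_swiftui(node, indent=0):
    """Depth-first visitor that emits SwiftUI-like code from a UI hierarchy."""
    pad = "    " * indent
    if not node.children:                      # leaf node -> a SwiftUI control
        controls = {
            "text": f'Text("{node.label}")',
            "button": f'Button("{node.label}") {{ }}',
            "toggle": f'Toggle("{node.label}", isOn: .constant(true))',
        }
        return pad + controls.get(node.kind, f'Text("{node.label}")')
    # Intermediate node -> a SwiftUI container stacking its children.
    container = "HStack" if node.kind == "row" else "VStack"
    body = "\n".join(generate_swiftui(c, indent + 1) for c in node.children)
    return f"{pad}{container} {{\n{body}\n{pad}}}"

screen = UINode("column", children=[
    UINode("text", "Menu"),
    UINode("row", children=[UINode("button", "Order"), UINode("button", "Reserve")]),
])
print(generate_swiftui(screen))
```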

The resulting code describes the original UI using only relative constraints (even if the original UI was not built that way), allowing it to respond to changes in screen size or device type. The generated code does not contain appearance and style information, which is sometimes necessary to render a similar-looking screen. Nevertheless, prior work has shown that such output can be a useful starting point for UI development, and we believe future work can improve upon our approach by detecting these properties.

Conclusion

To help machines better reason about the underlying structure and purpose of UIs, we introduced the problem of screen parsing, the prediction of structured UI models from visual information. Our problem formulation captures the structural properties of UIs and is well-suited for downstream applications that rely on UI understanding. We described the architecture and training procedure for our reference implementation, which predicts an app’s presentation model as a UI hierarchy with high accuracy, surpassing baseline algorithms and training procedures. Finally, we used our system to build three example applications: (i) UI similarity search, (ii) accessibility enhancement, and (iii) code generation from UI screenshots. Screen parsing is an important step towards full machine understanding of UIs and its many benefits, but there is still much left to do. We’re excited by the opportunities at the intersection of HCI and ML, and we encourage other researchers in the ML community to work with us to realize this goal.

Acknowledgements

Many people contributed to this work and gave feedback on this blog post: Xiaoyi Zhang, Jeff Nichols, and Jeff Bigham. This work was done while Jason Wu was an intern at Apple.

For more information about machine learning research at Apple, check out the Apple Machine Learning website.

Paper Citation

Jason Wu, Xiaoyi Zhang, Jeffrey Nichols, and Jeffrey P. Bigham. 2021. Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots. In Proceedings of the 2021 ACM Symposium on User Interface Software & Technology (UIST). Association for Computing Machinery, New York, NY, USA, 1–10. https://doi.org/10.1145/3472749.3474763

Read More

Announcing the winners of the Building Tools to Enhance Transparency in Fairness and Privacy RFP

In August, Meta, formerly known as Facebook, launched the Building Tools to Enhance Transparency in Fairness and Privacy request for proposals (RFP). Today, we’re announcing the winners of this award.
Through this RFP, we hope to support academics in building trusted tools that more effectively monitor systems and help spot concerns in fairness, privacy, and safety.

“Improving fairness and privacy across the internet is an ambitious goal, and one that requires a consistent investment in new ideas and researchers who can bring them to life,” said Will Bullock, Meta Statistics and Privacy Director. “We’re excited to support these leading scholars, and eagerly anticipate their breakthroughs in the years to come.”

The RFP attracted 50 proposals from 40 universities and institutions around the world. Thank you to everyone who took the time to submit a proposal, and congratulations to the winners.

Research award winners

Principal investigators are listed first unless otherwise noted.

A tool to study the efficacy of fairness algorithms on specific bias types
Hoda Heidari, Haiyi Zhu, Steven Wu (Carnegie Mellon University)

Analyzing the accuracy, transparency and privacy of profiling algorithms
Ruben Cuevas Rumin, Angel Cuevas Rumin, Patricia Callejo Pinardo, Pelayo Vallina Rodriguez (University Carlos III de Madrid)

Comprehensive privacy auditing in machine learning
Reza Shokri, Vincent Bindschaedler (National University of Singapore)

Galaxy: a library for safeguarding deep neural networks against unknowns
Sharon Li (University of Wisconsin–Madison)

High-confidence long-term safety and fairness guarantees
Philip Thomas, Yuriy Brun (University of Massachusetts Amherst)

Towards ML governance with accountability and auditing
Nicolas Papernot (University of Toronto)

The post Announcing the winners of the Building Tools to Enhance Transparency in Fairness and Privacy RFP appeared first on Facebook Research.

Read More