Amazon SageMaker with TensorBoard: An overview of a hosted TensorBoard experience

Data scientists training deep learning models need to identify and fix training issues to meet accuracy targets for production deployment, and they need standard tools for debugging model training. TensorBoard is a popular toolkit in the data science community for visualizing and analyzing various aspects of machine learning (ML) models and training processes. It provides a suite of tools for visualizing training metrics, examining model architectures, exploring embeddings, and more. The TensorFlow and PyTorch projects both endorse and use TensorBoard in their official documentation and examples.

Amazon SageMaker with TensorBoard is a capability that brings the visualization tools of TensorBoard to SageMaker. Integrated with SageMaker training jobs and domains, it gives SageMaker domain users access to TensorBoard data and helps them debug models using the SageMaker TensorBoard visualization plugins. When creating a SageMaker training job, domain users can enable TensorBoard through the SageMaker Python SDK or the Boto3 API. SageMaker with TensorBoard is supported by the SageMaker Data Manager plugin, with which domain users can access many training jobs in one place within the TensorBoard application.

In this post, we demonstrate how to set up a training job with TensorBoard in SageMaker using the SageMaker Python SDK, access SageMaker TensorBoard, explore training output data visualized in TensorBoard, and delete unused TensorBoard applications.

Solution overview

A typical training job for deep learning in SageMaker consists of two main steps: preparing a training script and configuring a SageMaker training job launcher. In this post, we walk you through the required changes to collect TensorBoard-compatible data from SageMaker training.

Prerequisites

To start using SageMaker with TensorBoard, you need to set up a SageMaker domain with an Amazon VPC under an AWS account. Each user needs a domain user profile to access TensorBoard on SageMaker, and the AWS Identity and Access Management (IAM) execution role needs a minimum set of permissions, including the following:

  • sagemaker:CreateApp
  • sagemaker:DeleteApp
  • sagemaker:DescribeTrainingJob
  • sagemaker:Search
  • s3:GetObject
  • s3:ListBucket
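
As a minimal sketch, the permissions above can be collected into an IAM policy document like the following. The function name build_tensorboard_policy and the wildcard resource are illustrative assumptions, not part of the original setup; in practice you would scope the resource to your own domain and S3 bucket.

```python
import json

# Actions taken from the list of minimum permissions above.
TENSORBOARD_ACTIONS = [
    "sagemaker:CreateApp",
    "sagemaker:DeleteApp",
    "sagemaker:DescribeTrainingJob",
    "sagemaker:Search",
    "s3:GetObject",
    "s3:ListBucket",
]


def build_tensorboard_policy(resource="*"):
    """Return an IAM policy document (as a dict) allowing the listed actions.

    resource="*" is a placeholder assumption; narrow it for real use.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(TENSORBOARD_ACTIONS),
                "Resource": resource,
            }
        ],
    }


policy = build_tensorboard_policy()
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy on the execution role.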

For more information on how to set up SageMaker Domain and user profiles, see Onboard to Amazon SageMaker Domain Using Quick setup and Add and Remove User Profiles.

Directory structure

When using Amazon SageMaker Studio, the directory structure can be organized as follows:

.
├── script
│	└── train.py
└── simple_tensorboard.ipynb

Here, script/train.py is your training script, and simple_tensorboard.ipynb launches the SageMaker training job.

Modify your training script

You can use any of the following tools to collect tensors and scalars: TensorBoardX, TensorFlow Summary Writer, PyTorch Summary Writer, or Amazon SageMaker Debugger. With each of them, you specify the data output path as the log directory in the training container (log_dir). In this sample code, we use TensorFlow to train a simple, fully connected neural network for a classification task. For other options, refer to Prepare a training job with a TensorBoard output data configuration. In the train() function, we use the tensorflow.keras.callbacks.TensorBoard callback to collect tensors and scalars, specify /opt/ml/output/tensorboard as the log directory in the training container, and pass the callback to the callbacks argument of model.fit(). See the following code:

import argparse
import json
import tensorflow as tf


def parse_args():
    cmdline = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    cmdline.add_argument("--epochs", default=5, type=int, help="""Number of epochs.""")
    cmdline.add_argument(
        "--optimizer", default="adam", type=str, help="""Optimizer type"""
    )
    cmdline.add_argument(
        "--loss",
        default="sparse_categorical_crossentropy",
        type=str,
        help="""Loss function""",
    )
    cmdline.add_argument(
        "--metrics",
        action="store",
        dest="metrics",
        type=json.loads,
        # The default must be valid JSON (double quotes) because argparse
        # passes string defaults through json.loads.
        default='["accuracy"]',
        help="List of metrics to be evaluated by the model during training and testing.",
    )
    return cmdline


def create_model():
    return tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10, activation="softmax"),
        ]
    )


def train(args):
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = create_model()
    model.compile(optimizer=args.optimizer, loss=args.loss, metrics=args.metrics)

    # setup TensorBoard Callback
    LOG_DIR = "/opt/ml/output/tensorboard"
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=LOG_DIR,
        histogram_freq=1,
        update_freq=1,
        embeddings_freq=5,
        write_images=True,
    )

    # pass TensorBoard Callback into the Model fit
    model.fit(
        x=x_train,
        y=y_train,
        epochs=args.epochs,
        validation_data=(x_test, y_test),
        callbacks=[tensorboard_callback],
    )


if __name__ == "__main__":
    cmdline = parse_args()
    args, unknown_args = cmdline.parse_known_args()
    train(args)
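
To make the argument handling in the script concrete, here is a small standalone sketch of how argparse with type=json.loads turns a JSON string into a Python list, and how parse_known_args tolerates arguments the script does not define. The --model_dir argument and its value are hypothetical placeholders used only for illustration.

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", default=5, type=int)
# String defaults are passed through the type function, so the default
# itself has to be valid JSON (double-quoted strings).
parser.add_argument("--metrics", type=json.loads, default='["accuracy"]')

# Simulate a command line that also carries an argument the parser
# does not know about (hypothetical example).
args, unknown = parser.parse_known_args(
    ["--epochs", "3", "--metrics", '["accuracy", "mse"]',
     "--model_dir", "s3://placeholder"]
)
print(args.epochs, args.metrics, unknown)

# With no arguments, the defaults are used and decoded the same way.
defaults, _ = parser.parse_known_args([])
print(defaults.epochs, defaults.metrics)
```

Because unknown arguments are returned separately instead of raising an error, the training script keeps working if the launcher passes extra options.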

Construct a SageMaker training launcher with a TensorBoard data configuration

Use sagemaker.debugger.TensorBoardOutputConfig while configuring a SageMaker framework estimator. It maps the Amazon Simple Storage Service (Amazon S3) location you specify for saving TensorBoard data to the local path in the training container (for example, /opt/ml/output/tensorboard). You can use a different container local output path. However, it must match the value of the LOG_DIR variable specified in the previous step, so that SageMaker can find the local path in the training container and save the TensorBoard data to the S3 output location.
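
One way to guard against a mismatch is to compare the two paths before launching the job. The following is a minimal sketch; the helper same_container_path is a hypothetical name, and the check only normalizes the path strings (trailing or doubled slashes), it does not inspect the container.

```python
import posixpath

# Path written to by the training script (LOG_DIR in train.py).
SCRIPT_LOG_DIR = "/opt/ml/output/tensorboard"
# Path you intend to pass as container_local_output_path.
CONTAINER_LOCAL_OUTPUT_PATH = "/opt/ml/output/tensorboard/"


def same_container_path(a, b):
    """Return True if two POSIX paths are equal after normalization."""
    return posixpath.normpath(a) == posixpath.normpath(b)


if not same_container_path(SCRIPT_LOG_DIR, CONTAINER_LOCAL_OUTPUT_PATH):
    raise ValueError("LOG_DIR and container_local_output_path differ")
print("Paths are consistent")
```

A simpler alternative is to define the path once and reuse the same constant in both the script and the launcher.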

Next, pass the object of the module to the tensorboard_output_config parameter of the estimator class. The following code snippet shows an example of preparing a TensorFlow estimator with the TensorBoard output configuration parameter.

The following is the boilerplate code:

import os
from datetime import datetime
import boto3
import sagemaker

boto_sess = boto3.Session()
region = boto_sess.region_name
role = sagemaker.get_execution_role()
sm = sagemaker.Session()

base_job_name = "simple-tensorboard"
date_str = datetime.now().strftime("%d-%m-%Y")
time_str = datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
job_name = f"{base_job_name}-{time_str}"

s3_output_bucket = os.path.join("s3://", sm.default_bucket(), base_job_name)

output_path = os.path.join(s3_output_bucket, "sagemaker-output", date_str, job_name)
code_location = os.path.join(s3_output_bucket, "sagemaker-code", date_str, job_name)
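
The boilerplate builds S3 URIs with os.path.join, which works on Linux (as in SageMaker Studio) but would insert backslashes if the notebook ran on Windows. As a hedged alternative, a small helper can join the parts with forward slashes regardless of the operating system. The name s3_join and the bucket name are illustrative assumptions.

```python
def s3_join(bucket, *parts):
    """Build an S3 URI from a bucket name and key parts using '/' only."""
    # Drop empty parts and surrounding slashes so separators are not doubled.
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "s3://" + "/".join([bucket.strip("/")] + cleaned)


# Placeholder bucket name; in the notebook this would be sm.default_bucket().
example = s3_join("my-bucket", "simple-tensorboard", "sagemaker-output")
print(example)
```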

The following code is for the training container:

instance_type = "ml.c5.xlarge"
instance_count = 1

image_uri = sagemaker.image_uris.retrieve(
    framework="tensorflow",
    region=region,
    version="2.11",
    py_version="py39",
    image_scope="training",
    instance_type=instance_type,
)

The following code is the TensorBoard configuration:

from sagemaker.tensorflow import TensorFlow

tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=os.path.join(output_path, "tensorboard"),
    container_local_output_path="/opt/ml/output/tensorboard",
)

hyperparameters = {
    "epochs": 5,
    "optimizer": "adam",
    "loss": "sparse_categorical_crossentropy",
    # Inner double quotes are escaped so the string literal is valid Python
    "metrics": "'[\"accuracy\"]'",
}

estimator = TensorFlow(
    entry_point="train.py",
    source_dir="script",
    role=role,
    image_uri=image_uri,
    # Reuse the values defined for the training container
    instance_count=instance_count,
    instance_type=instance_type,
    base_job_name=job_name,
    tensorboard_output_config=tensorboard_output_config,
    hyperparameters=hyperparameters,
)

Launch the training job with the following code:

estimator.fit(
    inputs=None,
    wait=False,
    job_name=job_name,
)
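
Because the job is launched with wait=False, the call returns immediately. If you want to check on the job from the notebook, one option is to poll its status. The sketch below assumes a Boto3 SageMaker client, whose describe_training_job call returns a TrainingJobStatus field; the helper name wait_for_training_job and its polling defaults are hypothetical.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120,
                          sleep=time.sleep):
    """Poll a training job until it reaches a terminal status.

    `client` is expected to behave like boto3.client("sagemaker").
    Returns the last status seen, which may be non-terminal if
    max_polls is exhausted.
    """
    status = None
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    return status


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# print(wait_for_training_job(boto3.client("sagemaker"), job_name))
```

You don't need to wait for the job to finish before opening TensorBoard; the data can be loaded during training as well.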

Access TensorBoard on SageMaker

You can access TensorBoard with two methods: programmatically using the sagemaker.interactive_apps.tensorboard module that generates the URL or using the TensorBoard landing page on the SageMaker console. After you open TensorBoard, SageMaker runs the TensorBoard plugin and automatically finds and loads all training job output data in a TensorBoard-compatible file format from S3 buckets paired with training jobs during or after training.

The following code autogenerates the URL to the TensorBoard console landing page:

from sagemaker.interactive_apps import tensorboard
app = tensorboard.TensorBoardApp(region)

print("Navigate to the following URL:")
if app._is_studio_user:
    print(app.get_app_url(job_name))
else:
    print(app.get_app_url())

This returns the following message with a URL that opens the TensorBoard landing page.

>>> Navigate to the following URL: https://<sagemaker_domain_id>.studio.<region>.sagemaker.aws/tensorboard/default/data/plugin/sa

For opening TensorBoard from the SageMaker console, please refer to How to access TensorBoard on SageMaker.

When you open the TensorBoard application, TensorBoard opens with the SageMaker Data Manager tab. The following screenshot shows the full view of the SageMaker Data Manager tab in the TensorBoard application.

On the SageMaker Data Manager tab, you can select any training job and load TensorBoard-compatible training output data from Amazon S3.

  1. In the Add Training Job section, use the check boxes to choose training jobs from which you want to pull data and visualize for debugging.
  2. Choose Add Selected Jobs.

The selected jobs should appear in the Tracked Training Jobs section.

Refresh the viewer by choosing the refresh icon in the upper-right corner, and the visualization tabs should appear after the job data is successfully loaded.

Explore training output data visualized in TensorBoard

On the Time Series tab and other graphics-based tabs, you can see the list of Tracked Training Jobs in the left pane. You can also use the check boxes of the training jobs to show or hide visualizations. The TensorBoard plugins are activated dynamically depending on how you have set up your training script to include summary writers and pass callbacks for tensor and scalar collection, so the graphics tabs also appear dynamically. The following screenshots show example views of each tab with visualizations of the collected metrics of two training jobs. The plugins shown include time series, scalar, graph, distribution, and histogram.

The following screenshot is the Time Series tab view.

The following screenshot is the Scalars tab view.

The following screenshot is the Graphs tab view.

The following screenshot is the Distributions tab view.

The following screenshot is the Histograms tab view.

Clean up

After you are done with monitoring and experimenting with jobs in TensorBoard, shut the TensorBoard application down:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose your domain.
  3. Choose your user profile.
  4. Under Apps, choose Delete App for the TensorBoard row.
  5. Choose Yes, delete app.
  6. Enter delete in the text box, then choose Delete.

A message should appear at the top of the page: “Default is being deleted”.
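
If you prefer to clean up from code, the same deletion can be sketched with the Boto3 delete_app call. This is a minimal sketch under assumptions: the helper name delete_tensorboard_app is hypothetical, the app name "default" is inferred from the confirmation message and URL shown in this post, and the domain ID and user profile name are placeholders you must supply.

```python
def delete_tensorboard_app(client, domain_id, user_profile_name,
                           app_name="default"):
    """Delete a TensorBoard app for a given domain user profile.

    `client` is expected to behave like boto3.client("sagemaker").
    app_name="default" is an assumption based on the console message.
    """
    return client.delete_app(
        DomainId=domain_id,
        UserProfileName=user_profile_name,
        AppType="TensorBoard",
        AppName=app_name,
    )


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# delete_tensorboard_app(
#     boto3.client("sagemaker"),
#     domain_id="<sagemaker_domain_id>",
#     user_profile_name="<user_profile_name>",
# )
```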

Conclusion

TensorBoard is a powerful tool for visualizing, analyzing, and debugging deep learning models. In this post, we provide a guide to using SageMaker with TensorBoard, including how to set up TensorBoard in a SageMaker training job using the SageMaker Python SDK, access SageMaker TensorBoard, explore training output data visualized in TensorBoard, and delete unused TensorBoard applications. By following these steps, you can start using TensorBoard in SageMaker for your work.

We encourage you to experiment with different features and techniques.


About the authors

Dr. Baichuan Sun is a Senior Data Scientist at AWS AI/ML. He is passionate about solving strategic business problems with customers using data-driven methodology on the cloud, and he has been leading projects in challenging areas including robotics computer vision, time series forecasting, price optimization, predictive maintenance, pharmaceutical development, product recommendation system, etc. In his spare time he enjoys traveling and hanging out with family.

Manoj Ravi is a Senior Product Manager for Amazon SageMaker. He is passionate about building next-gen AI products and works on software and tools to make large-scale machine learning easier for customers. He holds an MBA from Haas School of Business and a Masters in Information Systems Management from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

Research Focus: Week of May 8, 2023

Microsoft Research Focus 15 | Week of May 8, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

AWARD

Microsoft’s danah boyd awarded MIT’s Morison Prize

danah boyd, a partner researcher at Microsoft Research, has been awarded MIT’s Morison Prize in Science, Technology, and Society, for outstanding work combining humanistic values with effectiveness in the world of practical affairs, particularly in science and technology.

Dr. boyd, who is also a Distinguished Visiting Professor at Georgetown University, is currently conducting a multi-year ethnographic study of the U.S. census to understand how data are made legitimate. Her previous studies have focused on media manipulation, algorithmic bias, privacy practices, social media, and teen culture. 

To learn more, see the Microsoft Research Summit presentation Statistical Imaginaries: An Ode to Responsible Data Science or the publication Differential Perspectives: Epistemic Disconnects Surrounding the U.S. Census Bureau’s Use of Differential Privacy.


AWARD

Microsoft’s Nicole Immorlica receives 2023 SIGecom Test of Time Award

Nicole Immorlica, a Senior Principal Researcher with Microsoft Research New England, has been awarded the 2023 SIGecom Test of Time Award for her work on a 2005 paper on matching markets. The award from the Association for Computing Machinery (ACM) recognizes “an influential paper or series of papers published between ten and twenty-five years ago that has significantly impacted research or applications exemplifying the interplay of economics and computation.”

In the award-winning paper: Marriage, honesty, and stability, Immorlica and a co-author explored centralized two-sided markets, such as the medical residency market, matching participants by running a stable marriage algorithm. While no matching mechanism based on a stable marriage algorithm can guarantee ‘truthfulness’ as a dominant strategy, the paper showed that in certain probabilistic settings, truthfulness is the best strategy for the participants.

AWARD

Microsoft’s Lorin Crawford named 2023 COPSS Emerging Leader

Lorin Crawford, a principal researcher at Microsoft Research New England, has been named a 2023 COPSS Emerging Leader by the Committee of Presidents of Statistical Societies. The award announcement cited Crawford’s path-breaking research combining theory and methods of mathematics, statistics and computing to generate new knowledge and insight about the genetic basis of disease, and exceptional mentoring of students from multiple scientific disciplines.

The award recognizes the important role of early-career statistical scientists in shaping the future of their discipline. The selection criteria are designed to highlight contributions in areas not traditionally recognized by other early-career awards in the statistical sciences.

Crawford, who is also a faculty member at Brown University’s School of Public Health, focuses on developing novel and efficient algorithms that address complex problems in quantitative genetics, cancer pharmacology, molecular genomics, and geometric morphometrics.


AWARD

Microsoft researchers receive Test of Time award for personalized news recommendation work

A paper co-authored by two Microsoft researchers has received a 2023 Seoul Test of Time Award from the International World Wide Web Conference Committee (IW3C2). The 2010 paper, A Contextual-Bandit Approach to Personalized News Article Recommendation, was written by John Langford and Robert Schapire, along with two industry colleagues. The authors proposed a new approach for personalized recommendation using contextual bandit algorithms. According to the IW3C2, the paper now has more than 2,730 citations and has become foundational research in the area of recommendation systems.

The award announcement also states: “The paper addressed fundamental challenges in real-world recommendation systems via computationally efficient algorithms grounded in learning theory. It also showed that recommendation algorithms can be reliably evaluated offline, enabling algorithm selection without operational impact, and that contextual bandits can yield significant gains in user engagement.”


NEW RESEARCH

A Frequency Domain Approach to Predict Power System Transients

The dynamics of power grids are governed by a large number of nonlinear differential and algebraic equations (DAEs). To safely run the system, operators need to check that the states described by these DAEs stay within prescribed limits after various potential faults. However, current numerical solvers of DAEs are often too slow for real-time system operations. In addition, detailed system parameters are often not exactly known. Machine learning approaches have been proposed to reduce the computational efforts, but existing methods generally suffer from overfitting and failures to predict unstable behaviors.

In a new paper: A Frequency Domain Approach to Predict Power System Transients, Microsoft researchers propose a novel framework to predict power system transients by learning in the frequency domain. The intuition is that although the system behavior is complex in the time domain, relatively few dominant modes exist in the frequency domain. Therefore, the researchers learn to predict by constructing neural networks with Fourier transform and filtering layers. System topology and fault information are encoded by taking a multi-dimensional Fourier transform, allowing researchers to leverage the fact that the trajectories are sparse both in time and spatial frequencies. This research shows that the proposed approach does not need detailed system parameters, greatly speeds up prediction computations and is highly accurate for different fault types.


NEW RESEARCH

Inference with Reference: Lossless Acceleration of Large Language Models

The growing use of large foundation models like GPT-3.5/4 for real-world applications has raised concerns about high deployment costs. General methodologies such as quantization, pruning, compression, and distillation help reduce costs, but at test time output tokens must still be decoded sequentially, one by one, which poses significant challenges for deploying LLMs at scale.

In a new paper: Inference with Reference: Lossless Acceleration of Large Language Models, Microsoft researchers study accelerating LLM inference by improving the efficiency of autoregressive decoding. In multiple real-world applications, this research shows that an LLM’s output tokens often come from its context. For example, in a retrieval-augmented generation scenario for a search engine, an LLM’s context usually includes relevant documents retrieved from an external corpus as reference according to a query, and its output usually contains many text spans found in the reference (i.e., retrieved documents). Motivated by this observation, the researchers propose an LLM accelerator (LLMA) to losslessly speed inference with references. Its improved computational parallelism allows LLMA to achieve over 2x speed-up for LLMs, with identical generation results as greedy decoding, in many practical generation scenarios where significant overlap between in-context reference and outputs exists. The researchers are collaborating with the Bing search team to explore integrating this technique into snippet/caption generation, Bing chat, and other potential scenarios.


NEW RESEARCH

High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation

Quantum chemical calculations on atomistic systems have evolved into a standard approach to studying molecular matter. But these calculations often involve a significant amount of manual input and expertise. Most of these calculations could be automated, alleviating the need for expertise in software and hardware accessibility.

In a new paper: High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation, researchers from Microsoft present the AutoRXN workflow, an automated workflow for exploratory high-throughput electronic structure calculations of molecular systems.

This workflow (i) uses density functional theory methods to deliver minimum and transition-state structures and corresponding energies and properties, (ii) launches coupled cluster calculations for optimized structures to provide more accurate energy and property estimates, and (iii) evaluates multi-reference diagnostics to back-check the coupled cluster results and subjects them to automated multi-configurational calculations for potential multi-configurational cases.

All calculations take place in a cloud environment and support massive computational campaigns. Key features of all components of the AutoRXN workflow are autonomy, stability, and minimum operator interference.

The paper was recently published in The Journal of Chemical Physics.

The post Research Focus: Week of May 8, 2023 appeared first on Microsoft Research.

State Spaces Aren’t Enough: Machine Translation Needs Attention

Structured State Spaces for Sequences (S4) is a recently proposed sequence model with successful applications in various tasks, e.g., vision, language modeling, and audio. Thanks to its mathematical formulation, it compresses its input to a single hidden state and is able to capture long-range dependencies while avoiding the need for an attention mechanism. In this work, we apply S4 to Machine Translation (MT) and evaluate several encoder-decoder variants on WMT’14 and WMT’16. In contrast with the success in language modeling, we find that S4 lags behind the Transformer…

Apple Machine Learning Research