April 2024 – Page 4

NVIDIA to Acquire GPU Orchestration Software Provider Run:ai

To help customers make more efficient use of their AI computing resources, NVIDIA today announced it has entered into a definitive agreement to acquire Run:ai, a Kubernetes-based workload management and orchestration software provider.

Customer AI deployments are becoming increasingly complex, with workloads distributed across cloud, edge and on-premises data center infrastructure.

Managing and orchestrating generative AI, recommender systems, search engines and other workloads requires sophisticated scheduling to optimize performance at the system level and on the underlying infrastructure.

Run:ai enables enterprise customers to manage and optimize their compute infrastructure, whether on premises, in the cloud or in hybrid environments.

The company has built an open platform on Kubernetes, the orchestration layer for modern AI and cloud infrastructure. It supports all popular Kubernetes variants and integrates with third-party AI tools and frameworks.

Run:ai customers include some of the world’s largest enterprises across multiple industries, which use the Run:ai platform to manage data-center-scale GPU clusters.

“Run:ai has been a close collaborator with NVIDIA since 2020 and we share a passion for helping our customers make the most of their infrastructure,” said Omri Geller, Run:ai cofounder and CEO. “We’re thrilled to join NVIDIA and look forward to continuing our journey together.”

The Run:ai platform provides AI developers and their teams:

A centralized interface to manage shared compute infrastructure, enabling easier and faster access for complex AI workloads.
Functionality to add users, curate them under teams, provide access to cluster resources, control over quotas, priorities and pools, and monitor and report on resource use.
The ability to pool GPUs and share computing power — from fractions of GPUs to multiple GPUs or multiple nodes of GPUs running on different clusters — for separate tasks.
Efficient GPU cluster resource utilization, enabling customers to gain more from their compute investments.

NVIDIA will continue to offer Run:ai’s products under the same business model for the immediate future. And NVIDIA will continue to invest in the Run:ai product roadmap as part of NVIDIA DGX Cloud, an AI platform co-engineered with leading clouds for enterprise developers, offering an integrated, full-stack service optimized for generative AI.

NVIDIA DGX and DGX Cloud customers will gain access to Run:ai’s capabilities for their AI workloads, particularly for large language model deployments. Run:ai’s solutions are already integrated with NVIDIA DGX, NVIDIA DGX SuperPOD, NVIDIA Base Command, NGC containers, and NVIDIA AI Enterprise software, among other products.

NVIDIA’s accelerated computing platform and Run:ai’s platform will continue to support a broad ecosystem of third-party solutions, giving customers choice and flexibility.

Together with Run:ai, NVIDIA will enable customers to have a single fabric that accesses GPU solutions anywhere. Customers can expect to benefit from better GPU utilization, improved management of GPU infrastructure and greater flexibility from the open architecture.

Forecasting the Future: AI2’s Christopher Bretherton Discusses Using Machine Learning for Climate Modeling

Can machine learning help predict extreme weather events and climate change? Christopher Bretherton, senior director of climate modeling at the Allen Institute for Artificial Intelligence, or AI2, explores the technology’s potential to enhance climate modeling with AI Podcast host Noah Kravitz in an episode recorded live at the NVIDIA GTC global AI conference. Bretherton explains how machine learning helps overcome the limitations of traditional climate models and underscores the role of localized predictions in empowering communities to prepare for climate-related risks. Through ongoing research and collaboration, Bretherton and his team aim to improve climate modeling and enable society to better mitigate and adapt to the impacts of climate change.

Stay tuned for more episodes recorded live from GTC, and watch the replay of Bretherton’s GTC session on using machine learning for climate modeling.

The AI Podcast · AI2’s Christopher Bretherton Discusses Using Machine Learning for Climate Modeling – Ep. XXX

Time Stamps

2:03: What is climate modeling and how can it prepare us for climate change?

5:28: How can machine learning help enhance climate modeling?

7:21: What were the limitations of traditional climate models?

10:24: How does a climate model work?

12:11: What information can you get from a climate model?

13:26: What are the current climate models telling us about the future?

15:56: How does machine learning help enable localized climate modeling?

18:39: What, if anything, can individuals or small communities do to prepare for what climate change has in store for us?

25:59: How do you measure the accuracy or performance of an emulator that’s doing something like climate modeling out into the future?

You Might Also Like…

ITIF’s Daniel Castro on Energy-Efficient AI and Climate Change – Ep. 215

AI-driven change is in the air, as are concerns about the technology’s environmental impact. In this episode of NVIDIA’s AI Podcast, Daniel Castro, vice president of the Information Technology and Innovation Foundation and director of its Center for Data Innovation, speaks with host Noah Kravitz about the motivation behind his AI energy use report, which addresses misconceptions about the technology’s energy consumption.

DigitalPath’s Ethan Higgins on Using AI to Fight Wildfires – Ep. 211

DigitalPath is igniting change in the golden state — using computer vision, generative adversarial networks and a network of thousands of cameras to detect signs of fire in real-time. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravtiz spoke with DigitalPath system architect Ethan Higgins about the company’s role in the ALERTCalifornia initiative, a collaboration between California’s wildfire fighting agency CAL FIRE and the University of California, San Diego.

Anima Anandkumar on Using Generative AI to Tackle Global Challenges – Ep. 203

Generative AI-based models can not only learn and understand natural languages — they can learn the very language of nature itself, presenting new possibilities for scientific research. On the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Anandkumar on generative AI’s potential to make splashes in the scientific community.

How Alex Fielding and Privateer Space Are Taking on Space Debris – Ep. 196

In this episode of the NVIDIA AI Podcast, host Noah Kravitz dives into an illuminating conversation with Alex Fielding, co-founder and CEO of Privateer Space. Privateer Space, Fielding’s latest venture, aims to address one of the most daunting challenges facing our world today: space debris.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Rays Up: Decoding AI-Powered DLSS 3.5 Ray Reconstruction

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and which showcases new hardware, software, tools and accelerations for RTX PC users.

AI continues to raise the bar for PC gaming.

DLSS 3.5 with Ray Reconstruction creates higher quality ray-traced images for intensive ray-traced games and apps. This advanced AI-powered neural renderer is a groundbreaking feature that elevates ray-traced image quality for all GeForce RTX GPUs, outclassing traditional hand-tuned denoisers by using an AI network trained by an NVIDIA supercomputer. The result improves lighting effects like reflections, global illumination, and shadows to create a more immersive, realistic gaming experience.

A Ray of Light

Ray tracing is a rendering technique that can realistically simulate the lighting of a scene and its objects by rendering physically accurate reflections, refractions, shadows and indirect lighting. Ray tracing generates computer graphics images by tracing the path of light from the view camera — which determines the view into the scene — through the 2D viewing plane, out into the 3D scene, and back to the light sources. For instance, if rays strike a mirror, reflections are generated.

A visualization of how ray tracing works.

It’s the digital equivalent to real-world objects illuminated by beams of light and the path of the light being followed from the eye of the viewer to the objects that light interacts with. That’s ray tracing.

Simulating light in this manner — shooting rays for every pixel on the screen — is computationally intensive, even for offline renderers that calculate scenes over the course of several minutes or hours. Instead, ray samples fire a handful of rays at various points across the scene for a representative sample of the scene’s lighting, reflectivity and shadowing.

However, there are limitations. The output is a noisy, speckled image with gaps, good enough to ascertain how the scene should look when ray traced. To fill in the missing pixels that weren’t ray traced, hand-tuned denoisers use two different methods, temporally accumulating pixels across multiple frames, and spatially interpolating them to blend neighboring pixels together. Through this process, the noisy raw output is converted into a ray-traced image.

This adds complexity and cost to the development process, and reduces the frame rate in highly ray-traced games where multiple denoisers operate simultaneously for different lighting effects.

DLSS 3.5 Ray Reconstruction introduces an NVIDIA supercomputer-trained, AI-powered neural network that generates higher-quality pixels in between the sampled rays. It recognizes different ray-traced effects to make smarter decisions about using temporal and spatial data, and retains high frequency information for superior-quality upscaling. And it recognizes lighting patterns from its training data, such as that of global illumination or ambient occlusion, and recreates it in-game.

Portal with RTX is a great example of Ray Reconstruction in action. With DLSS OFF, the denoiser struggles to reconstruct the dynamic shadowing alongside the moving fan.

With DLSS 3.5 and Ray Reconstruction enabled, the denoiser is trained on AI and recognizes certain patterns associated with shadows and keeps the image stable, accumulating accurate pixels while blending neighboring pixels to generate high-quality reflections.

Deep Learning, Deep Gaming

Ray Reconstruction is just one of the AI graphics breakthroughs that multiply performance in DLSS. Super Resolution, the cornerstone of DLSS, samples multiple lower resolution images and uses motion data and feedback from prior frames to reconstruct native-quality images. The result is high image quality without sacrificing game performance.

DLSS 3 introduced Frame Generation, which boosts performance by using AI to analyze data from surrounding frames to predict what the next generated frame should look like. These generated frames are then inserted in between rendered frames. Combining the DLSS-generated frames with DLSS Super Resolution enables DLSS 3 to reconstruct seven-eighths of the displayed pixels with AI, boosting frame rates by up to 4x compared to without DLSS.

Because DLSS Frame Generation is post-processed (applied after the main render) on the GPU, it can boost frame rates even when the game is bottlenecked by the CPU.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

PyTorch 2.3 Release Blog

We are excited to announce the release of PyTorch® 2.3 (release note)! PyTorch 2.3 offers support for user-defined Triton kernels in torch.compile, allowing for users to migrate their own Triton kernels from eager without experiencing performance regressions or graph breaks. Tensor Parallelism improves the experience for training Large Language Models using native PyTorch functions, which has been validated on training runs for 100B parameter models. As well, semi-structured sparsity implements semi-structured sparsity as a Tensor subclass, with observed speedups of up to 1.6 over dense matrix multiplication.

This release is composed of 3393 commits and 426 contributors since PyTorch 2.2. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.3. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta	Prototype	Performance Improvements
User-defined Triton kernels in torch.compile	torch.export adds new API to specify dynamic_shapes	Weight-Only-Quantization introduced into Inductor CPU backend
Tensor parallelism within PyTorch Distributed	Asynchronous checkpoint generation
Support for semi-structured sparsity

*To see a full list of public feature submissions click here.

Beta Features

[Beta] Support for User-defined Triton kernels in torch.compile

Allows for PyTorch code that contains triton kernels to be executed natively using torch.compile. This enables users to migrate code containing triton kernels from eager PyTorch to torch.compile without running into performance regressions or graph breaks. Native support also creates an opportunity for Torch Inductor to precompile the user-defined Triton kernel as well as better organize code around the Triton kernel allowing for further optimizations.

You can find more information about how to utilize user defined Triton kernels in torch.compile within this tutorial.

[Beta] Tensor Parallelism introduces more efficient ways to train LLMs

The Tensor Parallel API facilitates various tensor manipulations across GPUs/hosts and integrates with FSDP for 2D Parallelism (Tensor parallelism across devices + Data Parallelism across hosts). It also offers a low-level API for constructing higher-level Tensor parallel APIs. This API has been validated to support the training of transformer models with over 100 billion parameters.

You can find more information on how to utilize this within your workflows within this tutorial.

[Beta] Semi-structured sparsity provides users with a way to take advantage of accelerated sparse inference and memory savings

torch.sparse.SparseSemiStructuredTensor implements semi-structured sparsity as a Tensor subclass, which have observed speedups of up to 1.6 over dense matrix multiplication.

In particular it adds:

Additional support for quantization composability (mixed dtype, dequant fusion)
Updated cuSPARSELt and CUTLASS kernels
torch.compile support

You can find more information on how to take advantage of semi-structured sparsity here.

Prototype Features

[PROTOTYPE] torch.export adds new API to specify dynamic_shapes

You can now use torch.export.Dim to better represent dynamic shapes by enabling developers to specify ranges (min and max values) that can be reused across different input dimensions that are constrained to be equal.

To learn more about torch.export.Dim as well as how it can be used to express more interesting relationships (such as linear arithmetic expressions) check out the tutorial here.

[PROTOTYPE] Asynchronous checkpoint generation

Asynchronous checkpoint generation allows users to continue their training loops while checkpoints are being generated, essentially offloading much of the checkpointing cost.

You can find out how to utilize this within your own workflows with this example.

Performance Improvements

[PROTOTYPE] Weight-Only-Quantization introduced into Inductor CPU backend

PyTorch 2.3 enhances LLM inference performance on torch inductor CPU backend. The project gpt-fast offers a simple and efficient PyTorch native acceleration for transformer text generation with torch.compile. Prior to 2.3 only CUDA devices were supported and this feature enables the CPU counterpart by providing highly optimized kernels for the int4 and int8 weight only quantization Linear.

For more information / how to utilize this feature please refer to the gpt-fast README.

HumMUSS: Human Motion Understanding using State Space Models

Understanding human motion from video is crucial for applications such as pose estimation, mesh recovery, and action recognition. While state-of-the-art methods predominantly rely on Transformer-based architectures, these approaches have limitations in practical scenarios. They are notably slower when processing a continuous stream of video frames in real time and do not adapt to new frame rates. Given these challenges, we propose an attention free spatiotemporal model for human motion understanding, building upon recent advancements diagonal state space models. Our model performs comparably…Apple Machine Learning Research

Think While You Write Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

Neural knowledge-to-text generation models often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the given facts, or describe facts not present in the input. To reduce hallucinations, we propose a novel decoding method, TWEAK (Think While Effectively Articulating Knowledge). TWEAK treats the generated sequences at each decoding step and its future sequences as hypotheses, and ranks each generation candidate based on how well their corresponding hypotheses support the input facts using a Hypothesis Verification Model (HVM). We…Apple Machine Learning Research

Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference

On-device machine learning (ML) moves computation from the cloud to personal devices, protecting user privacy and enabling intelligent user experiences. However, fitting models on devices with limited resources presents a major technical challenge: practitioners need to optimize models and balance hardware metrics such as model size, latency, and power. To help practitioners create efficient ML models, we designed and developed Talaria: a model visualization and optimization system. Talaria enables practitioners to compile models to hardware, interactively visualize model statistics, and…Apple Machine Learning Research

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo…Apple Machine Learning Research

Accelerate ML workflows with Amazon SageMaker Studio Local Mode and Docker support

We are excited to announce two new capabilities in Amazon SageMaker Studio that will accelerate iterative development for machine learning (ML) practitioners: Local Mode and Docker support. ML model development often involves slow iteration cycles as developers switch between coding, training, and deployment. Each step requires waiting for remote compute resources to start up, which delays validating implementations and getting feedback on changes.

With Local Mode, developers can now train and test models, debug code, and validate end-to-end pipelines directly on their SageMaker Studio notebook instance without the need for spinning up remote compute resources. This reduces the iteration cycle from minutes down to seconds, boosting developer productivity. Docker support in SageMaker Studio notebooks enables developers to effortlessly build Docker containers and access pre-built containers, providing a consistent development environment across the team and avoiding time-consuming setup and dependency management.

Local Mode and Docker support offer a streamlined workflow for validating code changes and prototyping models using local containers running on a SageMaker Studio notebook

instance. In this post, we guide you through setting up Local Mode in SageMaker Studio, running a sample training job, and deploying the model on an Amazon SageMaker endpoint from a SageMaker Studio notebook.

SageMaker Studio Local Mode

SageMaker Studio introduces Local Mode, enabling you to run SageMaker training, inference, batch transform, and processing jobs directly on your JupyterLab, Code Editor, or SageMaker Studio Classic notebook instances without requiring remote compute resources. Benefits of using Local Mode include:

Instant validation and testing of workflows right within integrated development environments (IDEs)
Faster iteration through local runs for smaller-scale jobs to inspect outputs and identify issues early
Improved development and debugging efficiency by eliminating the wait for remote training jobs
Immediate feedback on code changes before running full jobs in the cloud

The following figure illustrates the workflow using Local Mode on SageMaker.

To use Local Mode, set instance_type='local' when running SageMaker Python SDK jobs such as training and inference. This will run them on the instances used by your SageMaker Studio IDEs instead of provisioning cloud resources.

Although certain capabilities such as distributed training are only available in the cloud, Local Mode removes the need to switch contexts for quick iterations. When you’re ready to take advantage of the full power and scale of SageMaker, you can seamlessly run your workflow in the cloud.

Docker support in SageMaker Studio

SageMaker Studio now also enables building and running Docker containers locally on your SageMaker Studio notebook instance. This new feature allows you to build and validate Docker images in SageMaker Studio before using them for SageMaker training and inference.

The following diagram illustrates the high-level Docker orchestration architecture within SageMaker Studio.

With Docker support in SageMaker Studio, you can:

Build Docker containers with integrated models and dependencies directly within SageMaker Studio
Eliminate the need for external Docker build processes to simplify image creation
Run containers locally to validate functionality before deploying models to production
Reuse local containers when deploying to SageMaker for training and hosting

Although some advanced Docker capabilities like multi-container and custom networks are not supported as of this writing, the core build and run functionality is available to accelerate developing containers for bring your own container (BYOC) workflows.

Prerequisites

To use Local Mode in SageMaker Studio applications, you must complete the following prerequisites:

For pulling images from Amazon Elastic Container Registry (Amazon ECR), the account hosting the ECR image must provide access permission to the user’s Identity and Access Management (IAM) role. The domain’s role must also allow Amazon ECR access.
To enable Local Mode and Docker capabilities, you must set the EnableDockerAccess parameter to true for the domain’s DockerSettings using the AWS Command Line Interface (AWS CLI). This allows users in the domain to use Local Mode and Docker features. By default, Local Mode and Docker are disabled in SageMaker Studio. Any existing SageMaker Studio apps will need to be restarted for the Docker service update to take effect. The following is an example AWS CLI command for updating a SageMaker Studio domain:

aws sagemaker --region <REGION> 
update-domain --domain-id <DOMAIN-ID> 
--domain-settings-for-update '{"DockerSettings": {"EnableDockerAccess": "ENABLED"}}'

You need to update the SageMaker IAM role in order to be able to push Docker images to Amazon ECR:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:CompleteLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:InitiateLayerUpload",
        "ecr:BatchCheckLayerAvailability",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:us-east-2:123456789012:repository/<repositoryname>"
    },
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    }
  ]
}

Run Python files in SageMaker Studio spaces using Local Mode

SageMaker Studio JupyterLab and Code Editor (based on Code-OSS, Visual Studio Code – Open Source), extends SageMaker Studio so you can write, test, debug, and run your analytics and ML code using the popular lightweight IDE. For more details on how to get started with SageMaker Studio IDEs, refer to Boost productivity on Amazon SageMaker Studio: Introducing JupyterLab Spaces and generative AI tools and New – Code Editor, based on Code-OSS VS Code Open Source now available in Amazon SageMaker Studio. Complete the following steps:

Create a new Code Editor or JupyterLab space called my-sm-code-editor-space or my-sm-jupyterlab-space, respectively.
Choose Create space.
Choose the ml.m5.large instance and set storage to 32 GB.
Choose Run space.
Open the JupyterLab or Code Editor space and clone the GitHub repo.
Clone the GitHub repo, with /home/sagemaker-user/ as the target folder.

Create a new terminal.
Install the Docker CLI and Docker Compose plugin following the instructions in the following GitHub repo. If chained commands fail, run the commands one at a time.

You must update the SageMaker SDK to the latest version.

Run pip install sagemaker -Uq in the terminal.

For Code Editor only, you need to set the Python environment to run in the current terminal.

In Code Editor, on the File menu¸ choose Preferences and Settings.

Search for and select Terminal: Execute in File Dir.

In Code Editor or JupyterLab, open the scikit_learn_script_mode_local_training_and_serving folder and run the scikit_learn_script_mode_local_training_and_serving.py file.

You can run the script by choosing Run in Code Editor or using the CLI in a JupyterLab terminal. You will be able to see how the model is trained locally. Then you deploy the model to a SageMaker endpoint locally, and calculate the root mean square error (RMSE).

Simulate training and inference in SageMaker Studio Classic using Local Mode

You can also use a notebook in SageMaker Studio Classic to run a small-scale training job on CIFAR10 using Local Mode, deploy the model locally, and perform inference.

Set up your notebook

To set up the notebook, complete the following steps:

Open SageMaker Studio Classic and clone the following GitHub repo.

Open the pytorch_local_mode_cifar10.ipynb notebook in blog/pytorch_cnn_cifar10.

For Image, choose PyTorch 2.1.0 Python 3.10 CPU Optimized.

Confirm that your notebook shows the correct instance and kernel selection.

Open a terminal by choosing Launch Terminal in the current SageMaker image.

Install the Docker CLI and Docker Compose plugin following the instructions in the following GitHub repo.

Because you’re using Docker from SageMaker Studio Classic, remove sudo when running commands because the terminal already runs under superuser. For SageMaker Studio Classic, the installation commands depend on the SageMaker Studio app image OS. For example, DLC-based framework images are Ubuntu based, in which the following instructions would work. However, for a Debian-based image like DataScience Images, you must follow the instructions in the following GitHub repo. If chained commands fail, run the commands one at a time. You should see the Docker version displayed.

Leave the terminal window open, go back to the notebook, and start running it cell by cell.

Make sure to run the cell with pip install -U sagemaker so you’re using the latest version of the SageMaker Python SDK.

Local training

When you start running the local SageMaker training job, you will see the following log lines:

INFO:sagemaker.local.image:'Docker Compose' found using Docker CLI.
INFO:sagemaker.local.local_session:Starting training job

This indicates that the training was running locally using Docker.

Be patient while the pytorch-training:2.1-cpu-py310 Docker image is pulled. Due to its large size (5.2 GB), it could take a few minutes.

Docker images will be stored in the SageMaker Studio app instance’s root volume, which is not accessible to end-users. The only way to access and interact with Docker images is via the exposed Docker API operations.

From a user confidentiality standpoint, the SageMaker Studio platform never accesses or stores user-specific images.

When the training is complete, you’ll be able to see the following success log lines:

8zlz1zbfta-sagemaker-local exited with code 0
Aborting on container exit...
Container 8zlz1zbfta-sagemaker-local  Stopping
Container 8zlz1zbfta-sagemaker-local  Stopped
INFO:sagemaker.local.image:===== Job Complete =====

Local inference

Complete the following steps:

Deploy the SageMaker endpoint using SageMaker Local Mode.

Be patient while the pytorch-inference:2.1-cpu-py310 Docker image is pulled. Due to its large size (4.32 GB), it could take a few minutes.

Invoke the SageMaker endpoint deployed locally using the test images.

You will be able to see the predicted classes: frog, ship, car, and plane:

Predicted:  frog ship  car plane

Because the SageMaker Local endpoint is still up, navigate back to the open terminal window and list the running containers:

docker ps

You’ll be able to see the running pytorch-inference:2.1-cpu-py310 container backing the SageMaker endpoint.

To shut down the SageMaker local endpoint and stop the running container, because you can only run one local endpoint at a time, run the cleanup code.

To make sure the Docker container is down, you can navigate to the opened terminal window, run docker ps, and make sure there are no running containers.
If you see a container running, run docker stop <CONTAINER_ID> to stop it.

Tips for using SageMaker Local Mode

If you’re using SageMaker for the first time, refer to Train machine learning models. To learn more about deploying models for inference with SageMaker, refer to Deploy models for inference.

Keep in mind the following recommendations:

Print input and output files and folders to understand dataset and model loading
Use 1–2 epochs and small datasets for quick testing
Pre-install dependencies in a Dockerfile to optimize environment setup
Isolate serialization code in endpoints for debugging

Configure Docker installation as a Lifecycle Configuration

You can define the Docker install process as a Lifecycle Configuration (LCC) script to simplify setup each time a new SageMaker Studio space starts. LCCs are scripts that SageMaker runs during events like space creation. Refer to the JupyterLab, Code Editor, or SageMaker Studio Classic LCC setup (using docker install cli as reference) to learn more.

Build and test custom Docker images in SageMaker Studio spaces

In this step, you install Docker inside the JupyterLab (or Code Editor) app space and use Docker to build, test, and publish custom Docker images with SageMaker Studio spaces. Spaces are used to manage the storage and resource needs of some SageMaker Studio applications. Each space has a 1:1 relationship with an instance of an application. Every supported application that is created gets its own space. To learn more about SageMaker spaces, refer to Boost productivity on Amazon SageMaker Studio: Introducing JupyterLab Spaces and generative AI tools. Make sure you provision a new space with at least 30 GB of storage to allow sufficient storage for Docker images and artifacts.

Install Docker inside a space

To install the Docker CLI and Docker Compose plugin inside a JupyterLab space, run the commands in the following GitHub repo. SageMaker Studio only supports Docker version 20.10.X.

Build Docker images

To confirm that Docker is installed and working inside your JupyterLab space, run the following code:

# to verify docker service
sagemaker-user@default:~$ docker version
Client: Docker Engine - Community
Version:           24.0.7
API version:       1.41 (downgraded from 1.43)
Go version:        go1.20.10
Git commit:        afdd53b
Built:             Thu Oct 26 09:07:41 2023
OS/Arch:           linux/amd64
Context:           default

Server:
Engine:
Version:          20.10.25
API version:      1.41 (minimum version 1.12)
Go version:       go1.20.10
Git commit:       5df983c
Built:            Fri Oct 13 22:46:59 2023
OS/Arch:          linux/amd64
Experimental:     false
containerd:
Version:          1.7.2
GitCommit:        0cae528dd6cb557f7201036e9f43420650207b58
runc:
Version:          1.1.7
GitCommit:        f19387a6bec4944c770f7668ab51c4348d9c2f38
docker-init:
Version:          0.19.0
GitCommit:        de40ad0

To build a custom Docker image inside a JupyterLab (or Code Editor) space, complete the following steps:

Create an empty Dockerfile:

touch Dockerfile

Edit the Dockerfile with the following commands, which create a simple flask web server image from the base python:3.10.13-bullseye image hosted on Docker Hub:

# Use the specified Python base image
FROM python:3.10.13-bullseye

# Create a code dir
RUN mkdir /code/

# Set the working directory in the container
WORKDIR /code

# Upgrade pip and install required packages
RUN python3 -m pip install --upgrade pip && 
python3 -m pip install flask

# Copy the app.py file to the container
COPY app.py /code/

# Set the command to run the app
ENTRYPOINT ["python", "app.py"]

The following code shows the contents of an example flask application file app.py:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/')
def hello():
return jsonify({"response": "Hello"})

if __name__ == '__main__':
app.run(host='0.0.0.0', port=6006)

Additionally, you can update the reference Dockerfile commands to include packages and artifacts of your choice.

Build a Docker image using the reference Dockerfile:

docker build --network sagemaker --tag myflaskapp:v1 --file ./Dockerfile .

Include --network sagemaker in your docker build command, otherwise the build will fail. Containers can’t be run in Docker default bridge or custom Docker networks. Containers are run in same network as the SageMaker Studio application container. Users can only use sagemaker for the network name.

When your build is complete, validate if the image exists. Re-tag the build as an ECR image and push. If you run into permission issues, run the aws ecr get-login-password… command and try to rerun the Docker push/pull:

sagemaker-user@default:~$ docker image list
REPOSITORY      TAG       IMAGE ID       CREATED          SIZE
myflaskapp      v1        d623f1538f20   27 minutes ago   489MB

sagemaker-user@default:~$ docker tag myflaskapp:v1 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1

sagemaker-user@default:~$ docker image list
REPOSITORY                                                  TAG       IMAGE ID       CREATED          SIZE
123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp     latest    d623f1538f20   27 minutes ago   489MB
myflaskapp                                                  v1        d623f1538f20   27 minutes ago   489MB

sagemaker-user@default:~$ aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com

sagemaker-user@default:~$ docker push 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:latest

Test Docker images

Having Docker installed inside a JupyterLab (or Code Editor) SageMaker Studio space allows you to test pre-built or custom Docker images as containers (or containerized applications). In this section, we use the docker run command to provision Docker containers inside a SageMaker Studio space to test containerized workloads like REST web services and Python scripts. Complete the following steps:

Check if the image you’re testing exists on the space’s Amazon Elastic Block Store (Amazon EBS) volume:

sagemaker-user@default:~$ docker image list
REPOSITORY                                                  TAG       IMAGE ID       CREATED       SIZE

If the test image doesn’t exist, run docker pull to pull the image into your local machine:

sagemaker-user@default:~$ docker pull 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1

If you encounter authentication issues, run the following commands:

aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com

Create a container to test your workload:

docker run --network sagemaker 123456789012.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1

This spins up a new container instance and runs the application defined using Docker’s ENTRYPOINT:

sagemaker-user@default:~$ docker run --network sagemaker 905418447590.dkr.ecr.us-east-2.amazonaws.com/myflaskapp:v1
* Serving Flask app 'app'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:6006
* Running on http://169.255.255.2:6006

To test if your web endpoint is active, navigate to the URL https://<sagemaker-space-id>.studio.us-east-2.sagemaker.aws/jupyterlab/default/proxy/6006/.

You should see a JSON response similar to following screenshot.

Clean up

To avoid incurring unnecessary charges, delete the resources that you created while running the examples in this post:

In your SageMaker Studio domain, choose Studio Classic in the navigation pane, then choose Stop.
In your SageMaker Studio domain, choose JupyterLab or Code Editor in the navigation pane, choose your app, and then choose Stop.

Conclusion

SageMaker Studio Local Mode and Docker support empower developers to build, test, and iterate on ML implementations faster without leaving their workspace. By providing instant access to test environments and outputs, these capabilities optimize workflows and improve productivity. Try out SageMaker Studio Local Model and Docker support using our quick onboard feature, which allows you to spin up a new domain for single users within minutes. Share your thoughts in the comments section!

About the Authors

Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning (ML) platform team at AWS, leading SageMaker Python SDK. She has worked in several product roles in Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and Masters of Science in Financial Engineering, both from New York University

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect ta AWS. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.

Mufaddal Rohawala is a Software Engineer at AWS. He works on the SageMaker Python SDK library for Amazon SageMaker. In his spare time, he enjoys travel, outdoor activities and is a soccer fan.

Small and Mighty: NVIDIA Accelerates Microsoft’s Open Phi-3 Mini Language Models

NVIDIA announced today its acceleration of Microsoft’s new Phi-3 Mini open language model with NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference when running on NVIDIA GPUs from PC to cloud.

Phi-3 Mini packs the capability of 10x larger models and is licensed for both research and broad commercial usage, advancing Phi-2 from its research-only roots. Workstations with NVIDIA RTX GPUs or PCs with GeForce RTX GPUs have the performance to run the model locally using Windows DirectML or TensorRT-LLM.

The model has 3.8 billion parameters and was trained on 3.3 trillion tokens in only seven days on 512 NVIDIA H100 Tensor Core GPUs.

Phi-3 Mini has two variants, with one supporting 4k tokens and the other supporting 128K tokens, which is the first model in its class for very long contexts. This allows developers to use 128,000 tokens — the atomic parts of language that the model processes — when asking the model a question, which results in more relevant responses from the model.

Developers can try Phi-3 Mini with the 128K context window at ai.nvidia.com, where it is packaged as an NVIDIA NIM, a microservice with a standard application programming interface that can be deployed anywhere.

Creating Efficiency for the Edge

Developers working on autonomous robotics and embedded devices can learn to create and deploy generative AI through community-driven tutorials, like on Jetson AI Lab, and deploy Phi-3 on NVIDIA Jetson.

With only 3.8 billion parameters, the Phi-3 Mini model is compact enough to run efficiently on edge devices. Parameters are like knobs, in memory, that have been precisely tuned during the model training process so that the model can respond with high accuracy to input prompts.

Phi-3 can assist in cost- and resource-constrained use cases, especially for simpler tasks. The model can outperform some larger models on key language benchmarks while delivering results within latency requirements.

TensorRT-LLM will support Phi-3 Mini’s long context window and uses many optimizations and kernels such as LongRoPE, FP8 and inflight batching, which improve inference throughput and latency. The TensorRT-LLM implementations will soon be available in the examples folder on GitHub. There, developers can convert to the TensorRT-LLM checkpoint format, which is optimized for inference and can be easily deployed with NVIDIA Triton Inference Server.

Developing Open Systems

NVIDIA is an active contributor to the open-source ecosystem and has released over 500 projects under open-source licenses.

Contributing to many external projects such as JAX, Kubernetes, OpenUSD, PyTorch and the Linux kernel, NVIDIA supports a wide variety of open-source foundations and standards bodies as well.

Today’s news expands on long-standing NVIDIA collaborations with Microsoft, which have paved the way for innovations including accelerating DirectML, Azure cloud, generative AI research, and healthcare and life sciences.

Learn more about our recent collaboration.