PINE: Efficient Norm-Bound Verification for Secret-Shared Vectors

Secure aggregation of high-dimensional vectors is a fundamental primitive in federated statistics and learning. A two-server system such as PRIO allows for scalable aggregation of secret-shared vectors. Adversarial clients might try to manipulate the aggregate, so it is important to ensure that each (secret-shared) contribution is well-formed. In this work, we focus on the important and well-studied goal of ensuring that each contribution vector has bounded Euclidean norm. Existing protocols for ensuring bounded-norm contributions either incur a large communication overhead, or only allow for…Apple Machine Learning Research

Projected Language Models: A Large Model Pre-Segmented Into Smaller Ones

This paper has been accepted at the Foundation Models in the Wild workshop at ICML 2024.
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get a small language model with good specialized accuracy, even when specialization data is unknown during pretraining. We propose a novel architecture, projected networks (PN). PN is a high capacity network whose parameters…Apple Machine Learning Research

Improving GFlowNets for Text-to-Image Diffusion Alignment

This paper was accepted at the Foundation Models in the Wild workshop at ICML 2024.
Diffusion models have become the de-facto approach for generating visual data, which are trained to match the distribution of the training dataset. In addition, we also want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion models to achieve this goal through reinforcement learning-based algorithms. Nonetheless, they suffer from issues including slow credit assignment as…Apple Machine Learning Research

Towards Automated Accessibility Report Generation for Mobile Apps

Many apps have basic accessibility issues, like missing labels or low contrast. Automated tools can help app developers catch basic issues, but can be laborious to run or require writing dedicated tests. In this work, we developed a system to generate accessibility reports from mobile apps through a collaborative process with accessibility stakeholders at Apple. Our method combines varied data collection methods (e.g., app crawling, manual recording) with an existing accessibility scanner. Many such scanners are based on single-screen scanning, and a key problem in whole app accessibility…Apple Machine Learning Research

Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS

Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS

In today’s rapidly evolving landscape of artificial intelligence (AI), training large language models (LLMs) poses significant challenges. These models often require enormous computational resources and sophisticated infrastructure to handle the vast amounts of data and complex algorithms involved. Without a structured framework, the process can become prohibitively time-consuming, costly, and complex. Enterprises struggle with managing distributed training workloads, efficient resource utilization, and model accuracy and performance. This is where the NVIDIA NeMo Framework comes into play. In this post, we present a step-by-step guide to run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

NVIDIA NeMo Framework

NVIDIA NeMo is an end-to-end cloud-centered framework for training and deploying generative AI models with billions and trillions of parameters at scale. The NVIDIA NeMo Framework provides a comprehensive set of tools, scripts, and recipes to support each stage of the LLM journey, from data preparation to training and deployment. It offers a variety of customization techniques and is optimized for at-scale inference of models for both language and image applications, using multi-GPU and multi-node configurations. NVIDIA NeMo simplifies generative AI model development, making it more cost-effective and efficient for enterprises. By providing end-to-end pipelines, advanced parallelism techniques, memory-saving strategies, and distributed checkpointing, NVIDIA NeMo makes sure AI model training is streamlined, scalable, and high-performing.

The following are benefits of using NVIDIA NeMo for distributed training:

  • End-to-end pipelines for different stages such as data preparation, training, and more, which allows for a plug-and-play approach for your custom data
  • Parallelism techniques, including the following:
    • Data parallelism
    • Tensor parallelism
    • Pipeline parallelism
    • Sequence parallelism
    • Expert parallelism
    • Context parallelism
  • Memory saving techniques, including the following:
    • Selective activation recompute
    • CPU offloading (activation, weights)
    • Attention, including Flash Attention (FA 1/2, FA-cuDNN), Grouped Query Attention, Multi-Query Attention, and Sliding Window Attention
    • Distributed optimizers, including Torch FSDP, Distributed Optimizer (zero-1)
  • Data loaders for different architectures
  • Distributed checkpointing

Solution overview

You can deploy and manage NVIDIA NeMo using either Slurm or Kubernetes orchestration platforms. Amazon EKS is a managed Kubernetes service that makes it straightforward to run Kubernetes clusters on AWS. It manages the availability and scalability of the Kubernetes control plane, and it provides compute node auto scaling and lifecycle management support to help you run highly available container applications.

Amazon EKS is an ideal platform for running distributed training workloads due to its robust integrations with AWS services and performance features. It seamlessly integrates with Amazon FSx for Lustre, a high-throughput file system, enabling fast data access and management using persistent volume claims with the FSx CSI driver. Amazon EKS also integrates with Amazon CloudWatch for comprehensive logging and monitoring, providing insights into cluster performance and resource utilization. It supports Amazon Simple Storage Service (Amazon S3) for scalable and durable data storage and management, providing accessibility for large datasets. Enhanced network performance is achieved with Elastic Fabric Adapter (EFA), which offers low-latency, high-throughput connectivity between nodes. These features collectively make Amazon EKS a powerful and efficient choice for optimizing AI and machine learning (ML) training workflows.

The following diagram shows the solution architecture.

In this post, we present the steps to run distributed training workloads on an EKS cluster. The high-level steps are as follows:

  1. Set up an EFA enabled 2-node 24xlarge cluster.
  2. Set up an FSx for Lustre file system so you can have a shared data repository for storing training dataset and model checkpoints.
  3. Set up an environment for NVIDIA NeMo.
  4. Modify the NVIDIA NeMo Kubernetes manifests to prepare a dataset and train a model.

Prerequisites

You need to be able to launch a CPU-based Amazon Elastic Compute Cloud (Amazon EC2) instance that you’ll use to create the EKS cluster. When your instance is up and running, SSH into your EC2 instance and install the following CLIs:

These steps may change if you are on a non-Linux platform. Consult the preceding documentation for installing the CLIs on other platforms accordingly. We also require that you have a capacity reservation with p4de.24xlarge instances and have the capacityReservationID.

Launch an EKS cluster

ECR p4de.24xlarge instances have the NVIDIA A100 80GB instances, which are highly popular for distributed training generative AI workloads. For more information, refer to Amazon EC2 Instance Types. In this section, we show how to create an EKS cluster with an On-Demand Capacity Reservation for p4de.24xlarge instances.

  1. We provide the cluster creation config in p4de-cluster-config.yaml. See the following code:
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/2.nemo-launcher/EKS

eksctl create cluster -f p4de-cluster-config.yaml

The following are key points to note when creating this cluster:

  • Make sure the kubectl version and the specified Region are correct.
  • Update the capacityReservationID field and make sure to specify the availabilityZones within the managedNodeGroups section, which should be the same Availability Zone ID in which your capacity lives.
  • This configuration will create two managed node groups: one for the system nodes using c5.2xlarge instances and another for running distributed training on p4de.24xlarge instances. Managed node groups will use Amazon EKS optimized AMIs. If you want to provide a custom AMI, you can create a self-managed node group and specify a custom AMI. To find the AMI ID, refer to Retrieving Amazon EKS optimized Amazon Linux AMI IDs. For more details about the Amazon EKS optimized AMI, see the GitHub repo.
  • Make sure efaEnabled is set to true. You can use the same config for creating a cluster with other node groups. For a list of EFA supported instance types, see Supported instance types.
  • Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.
  1. After the cluster is created, you can enable kubectl to communicate with your cluster by adding a new context to the kubectl config file:
    aws eks update-kubeconfig --region region-code --name my-cluster

  2. You can confirm communication with your cluster by running the following command:
    kubectl get svc
    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   28h

Next, you can install the AWS EFA Kubernetes Device Plugin. EFA is a network interface for EC2 instances that enhances the performance of inter-node communications, which is critical for distributed training workloads that involve GPUs. This plugin allows Kubernetes to recognize and utilize the EFA device, facilitating high-throughput, low-latency networking necessary for efficient distributed training and deep learning applications.

  1. Install the plugin with the following code:
helm repo add eks https://aws.github.io/eks-charts

helm install efa eks/aws-efa-k8s-device-plugin -n kube-system

The NVIDIA device plugin for Kubernetes enables GPU support within your EKS cluster by exposing the GPUs to the Kubernetes API server through the kubelet. It advertises the available GPU resources, allowing Kubernetes to schedule and manage GPU-accelerated workloads.

  1. Install the plugin with the following code:
    wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
    
    kubectl apply -f nvidia-device-plugin.yml

  2. Run the following command to verify all the pods:
    kubectl get pods --all-namespaces

  3. You can run kubectl get nodes to verify the nodes.

Alternatively, you can use the EKS node viewer tool to view nodes, their costs, and their status in your cluster. After it’s installed, enter eks-node-viewer to get the following view.

The node viewer displays the IP addresses of our two p4de.24xlarge compute nodes.

  1. We can choose one of these private IP DNS names to further examine and describe the node as follows:
kubectl describe node ip-192-168-165-37.us-west-2.compute.internal

The preceding command describes a lot of detail of the node. To make sure EFA is installed correctly, make sure you see details as shown in the following screenshot.

For p4 nodes, you will see vpc.amazonaws.com/efa:4 and for p5.48xlarge nodes, you should see vpc.amazonaws.com/efa:32.

If EFA is enabled in the node group, make sure that a security group is attached to the nodes that allows a rule to allow all outgoing traffic originating from the same security group. This is required for EFA to work. For instructions, see Get started with EFA and MPI. This security group is intended for testing purposes only. For your production environments, we recommend that you create an inbound SSH rule that allows traffic only from the IP address from which you are connecting, such as the IP address of your computer, or a range of IP addresses in your local network.

Create an FSx for Lustre file system

For distributed training applications, typically hundreds of GPU instances are used, with each node containing multiple GPUs. It is crucial that all nodes can access a shared file system to train on the same dataset efficiently. For this purpose, a high-performance file system with high throughput and low latency is essential. We recommend using the FSx for Lustre file system for large-scale distributed training, because it meets these requirements and provides seamless data access for all nodes involved in the training process.

To have a FSx for Lustre file system mounted on your EKS cluster, complete the following steps:

  1. Use the following scripts to create an AWS Identity and Access Management (IAM) role and attach the FSx policy:
    export FSX_POLICY_NAME=fsx-csi
    
    wget https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/csi/fsx/fsx-policy.json
    export FSX_POLICY_DOC=file://fsx-policy.json
    
    # From EC2 Auto Scaling Group
    export EKS_INSTANCE_PROFILE_NAME=(eks-1ec6fc6b-1a19-d65d-66ac-293ff0a20eb9 )
    
    POLICY_ARN=$(aws iam create-policy --policy-name ${FSX_POLICY_NAME} --policy-document $FSX_POLICY_DOC --query "Policy.Arn" --output text)
    
    INSTANCE_PROFILE=$(aws iam list-instance-profiles --query InstanceProfiles[?InstanceProfileName=="'${EKS_INSTANCE_PROFILE_NAME}'"].{InstanceProfileName:InstanceProfileName} --output text)
    
    ROLE_NAME=$(aws iam get-instance-profile --instance-profile-name ${INSTANCE_PROFILE} --query InstanceProfile.Roles[0].RoleName --output text)
    
    # Attach FSx Policy to role ${ROLE_NAME} ..."
    aws iam attach-role-policy --policy-arn ${POLICY_ARN} --role-name ${ROLE_NAME}
    

  2. Use the following script to create a security group that allows EKS nodes to access the file system:
    # From EC2 console
    export MY_REGION=us-west-2
    # FSX_SUBNET_ID should be same ID the compute nodes are present in. You can get this from the EKS console 
    export FSX_SUBNET_ID=subnet-0edecd850cff2cfad
    # From EC2 Auto Scaling Group
    export FSX_SECURITY_GROUP_NAME=eks-fsx-sg
    
    # Get VPC_ID from EKS console
    export VPC_ID=vpc-04411d49af198a6ea
    
    # Create security group
    export SECURITY_GROUP_ID=$(aws ec2 create-security-group --vpc-id ${VPC_ID} --region ${MY_REGION} --group-name ${FSX_SECURITY_GROUP_NAME} --description "FSx for Lustre Security Group" --query "GroupId" --output text)
    
    export SUBNET_CIDR=$(aws ec2 describe-subnets --region ${MY_REGION} --query Subnets[?SubnetId=="'${FSX_SUBNET_ID}'"].{CIDR:CidrBlock} --output text)
    
    # Ingress rule
    aws ec2 authorize-security-group-ingress --region ${MY_REGION} --group-id ${SECURITY_GROUP_ID} --protocol tcp --port 988 --cidr ${SUBNET_CIDR}
    

  3. Create a 1.2 TB Persistent_2 FSx for Lustre file system from the FSx for Lustre console in the same Availability Zone as your compute instances (FSX_SUBNET_ID), VPC of Amazon EKS (VPC_ID), and the security group you created (SECURITY_GROUP_ID).
  4. After the file system is created, note the file system ID, DNS name, and mount name from the file system details page.

Before mounting the file system, you need to install the FSx CSI driver that allows EKS clusters to manage the lifecycle of FSx for Lustre file systems.

  1. Install the FSx CSI driver as follows:
    echo "Installing FSx CSI driver ..."
    kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
    
    echo "FSx pods in kube-system namespace ..."
    kubectl -n kube-system get pods | grep fsx
    

  2. Next, to mount the file system, provide scripts in the fsx-storage-class.yaml, fsx-pv.yaml and fsx-pvc.yaml files:
    # Storage Class
    kubectl apply -f fsx-storage-class.yaml
    kubectl get sc
    
    # Persistent Volume
    kubectl apply -f fsx-pv.yaml
    
    # Persistent Volume Claim
    kubectl apply -f fsx-pvc.yaml
    

You can check to make sure that the volumes are in Bound state.

Set up the environment for NVIDIA NeMo

For this post, we use the NVIDIA device plugin for Kubernetes, but if you need to install the GPU Operator, you can do so as follows:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

To enable distributed training, we use the KubeFlow Training Operator, which is essential for managing and scheduling ML training jobs in a Kubernetes environment. This operator simplifies the process of running distributed training jobs by automating the deployment and scaling of the necessary components. See the following code:

# Deploy Kubeflow training operator

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/training-operator/deploy.sh

# Configure RBAC resources

kubectl apply -f ./clusterrole-hpa-access.yaml

kubectl apply -f ./clusterrolebinding-training-operator-hpa-access.yaml

Additionally, we use the KubeFlow MPI Operator for preprocessing training data in parallel. The MPI Operator facilitates running Message Passing Interface (MPI) jobs, which are crucial for parallelizing the preprocessing tasks across multiple nodes, thereby speeding up the training process. See the following code:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
# Add lease permissions fot mpi-operator cluster role
kubectl apply -f ./clusterrole-mpi-operator.yaml

The NVIDIA NeMo Framework is available publicly in the image nvcr.io/nvidia/nemo:24.01.framework. We provide an AWS optimized Dockerfile for use with P4 and P5 instances. We recommend the following library versions for optimal performance:

ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_VERSION=2.19.4-1

You can build and push the image to Amazon Elastic Container Registry (Amazon ECR) as follows:

## AWS
export AWS_REGION=us-west-2
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

## Docker Image
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=nemo-aws
export TAG=":24.01.framework"

docker build -t ${REGISTRY}${IMAGE}${TAG} -f 0.Dockerfile .

echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Create registry if it does not exist
REGISTRY_COUNT=$(aws ecr describe-repositories | grep ${IMAGE} | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
        echo ""
        echo "Creating repository ${IMAGE} ..."
        aws ecr create-repository --repository-name ${IMAGE}
fi

# Push image
docker image push ${REGISTRY}${IMAGE}${TAG}

The NVIDIA NeMo Framework requires users to provide config files with job and model information. You can copy the launcher scripts from the container as follows:

# Run container
docker run -it ${REPOSITORY}${IMAGE}${TAG} bash

# Copy files
docker cp -a <container-id>: /opt/NeMo-Megatron-Launcher/ <Path-to-save-launcher-scripts>

In a Slurm cluster implementation, the launcher scripts, data, and results folder could reside in the file system that both the head node (node from where jobs are submitted) and compute nodes access. But in this Amazon EKS implementation, the node that you used to create the EKS cluster doesn’t have access to EKS file system. To get around this, you can put the launcher scripts in the head node and the results and data folder in the file system that the compute nodes have access to.

Run NVIDIA NeMo on an EKS cluster

We’re now ready to set up NVIDIA NeMo Kubernetes manifests for data preparation and model training. For more information about running it on premises, see Running NeMo Framework on Kubernetes. There are some modifications to be done for it to run on Amazon EKS, as shown in the following steps. We provide the launcher scripts in the GitHub repo.

  1. Modify the launcher_scripts/conf/cluster/k8s.yaml file as follows. The subPath field is the path where FSx for Lustre is mounted, which is /fsx-shared in this case.
    shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
    volumes:
      persistentVolumeClaim:
        # This claim should be created before running
        claimName: fsx-pvc
        subPath: fsx-shared  # path is mirrored into pod (no leading slash b/c relative to root)
    
    # NOTE: These args will soon be deprecated
    nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
    nfs_path: null  # Path to store data in the NFS server.
    ib_resource_name: null  # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters. Can also be a list, but must be same length as ib_count
    ib_count: null  # Specify the number of IB devices to include per node in each pod. Can also be a list, but must be same length as ib_resource_name
    ib_network_annotation: ""  # Specify the networks as comma separated values
    dns_policy: null  # Specify a dnsPolicy to use in all pods, if necessary
    

  2. Install the required Python packages; this is required so that NeMo Launcher can submit jobs to the Kubernetes cluster:
sudo apt install python3-pip

pip install -r <Path-to- NeMo-Megatron-Launcher>/requirements.txt

Next, we copy the following folders from the container to the /fsx-shared/data folder:

  • NeMo-Megatron-Launcher/launcher_scripts/data/bpe
  • NeMo-Megatron-Launcher/launcher_scripts/data/nsfw
  1. To copy files from EKS pods, you can start a pod just for this purpose. Create a file fsx-share-test.yaml as follows:
    apiVersion: v1
    kind: Pod
    metadata:
      name: fsx-share-test
    spec:
      containers:
      - name: fsx-share-test
        image: ubuntu
        command: ["/bin/bash"]
        args: ["-c", "while true; do echo  "hello from FSx" - $(date -u) >> /fsx-shared/test.txt; sleep 120; done"]
        volumeMounts:
        - name: fsx-pv
          mountPath: /fsx-shared
      volumes:
      - name: fsx-pv
        persistentVolumeClaim:
          claimName: fsx-pvc
    

  2. Run this pod and copy the files:
    kubectl apply -f fsx-share-test.yaml
    
    kubectl cp <Path-to- NeMo-Megatron-Launcher>/launcher_scripts/data/bpe fsx-share-test: /fsx-shared/data/
    
    kubectl cp <Path-to- NeMo-Megatron-Launcher>/launcher_scripts/data/nsfw fsx-share-test: /fsx-shared/data/
    

A few files need to be updated for data preparation for it to work with the EKS cluster.

  1. Modify the launcher_scripts/conf/config.yaml file:
    • For cluster, use k8s.
    • For training, use gpt3/126m.
    • For stages, this should be just data_preparation and no other stages.
    • For launcher_scripts_path, use the path to the NeMo Megatron launch scripts, which should end with /launcher_scripts.
    • For data_dir, use /fsx-shared/data (the location to store and read the data).
    • For base_results_dir, use /fsx-shared/results (the location to store the results, checkpoints, and logs).
    • For container, use ${REPOSITORY}${IMAGE}${TAG}
  2. Modify the conf/data_preparation/gpt3/download_gpt3_pile.yaml file:
    • Set node_array_size to 2.
    • Set file_numbers to “0-5”. With five files, it should be around 350 GB of data
  3. Modify the nemo_launcher/core/k8s_templates/data_preparation/data-prep.yaml file:
    • If you get the error that mpirun is not found, add the full path to the executable /opt/amazon/openmpi/bin/mpirun.
    • Add /fsx-shared in the container volume mount path.
    • Add the volume:
volumes:
          - name: fsx-pv
            persistentVolumeClaim:
              claimName: fsx-pvc
  1. Launch the data preparation job:
    python3 main.py

This script creates a Helm chart for the selected stage (in this case, data_preparation) and runs the Helm chart automatically. Refer to Run NeMo Framework on Kubernetes for an explanation of the data preparation process. Make sure python3 is installed.

  1. You can monitor your job status and logs using three commands: helm list, kubectl get pods, and kubectl logs --follow).
  2. When the job is finished, you can remove the Helm chart:
    helm uninstall download-gpt3-pile

You can see the downloaded the data in the /fsx-shared folder by running in one of the pods as kubectl exec -it nlp-worker-0 bash.

Training

Now that our data preparation is complete, we’re ready to train our model with the created dataset. Complete the following steps:

  1. Modify a parameter in the conf/config.yaml file:
    • Set stages to training and no other stages.
  2. Modify parameters in conf/training/gpt3/126m.yaml:
    • Set num_nodes to 2.
    • Set devices to 1.
    • On line 18, change use_distributed_sampler: False to replace_sampler_ddp: False.

Optionally, if you want to use a mock dataset instead of real dataset for testing purposes, you can modify the data section as follows. You are essentially changing data_impl: mmap to data_impl: mock and assigning an empty list to data_prefix.

data:
  data_impl: mock
  splits_string: "99990,8,2"
  seq_length: 2048
  skip_warmup: True
  num_workers: 2
  dataloader_type: single # cyclic
  reset_position_ids: False # Reset position ids after end-of-document token
  reset_attention_mask: False # Reset attention mask after end-of-document token
  eod_mask_loss: False # Mask loss for the end of document tokens
  index_mapping_dir: null
  data_prefix: [] # Should be weight path weight path... for a blended dataset

# You can just comment the default “data_prefix” values like below.
# - ${data_dir}/my-gpt3_00_text_document
# - .0333
  1. Modify the parameters in the nemo_launcher/core/k8s_templates/training/training.yaml file:
  2. Run python3 main.py to start training and you should see the training pods by running kubectl get pods as follows:
    NAME                    READY   STATUS    RESTARTS   AGE
    nlp-training-worker-0   1/1     Running   0          168m
    nlp-training-worker-1   1/1     Running   0          168m
    

In addition to monitoring your job using helm list, kubectl get pods, and kubectl logs –follow, you can also SSH into your pod with kubectl exec and use nvidia-smi to check GPU status.

  1. When the job is finished, you can delete the helm chart:
    helm uninstall gpt3-126m

Model checkpoints are saved at /fsx-shared/results/checkpoints along with other training logs and TensorBoard events. By default, checkpoints are saved at every 2,000 steps. You can modify the conf/training/gpt3/126m.yaml file to make changes in the training setup.

Troubleshooting deployment failures

If deployment fails due to incorrect setup or configuration, complete the following debug steps:

  1. Find the error message by running kubectl logs --follow PODNAME and kubectl describe pod PODNAME.
  2. Stop any running jobs by removing the Helm chart. This can be done by running helm uninstall CHARTNAME.

Pods should be spun down after removing the Helm chart.

  1. You can double-check by running kubectl get pods.
  2. If pods are not spun down, you can manually stop them by running kubectl delete PODNAME.

Based on the error message, you may find errors from:

  • Unready nodes.
  • Missing Operators or CRDs. In this case, make sure your kubectl get pods -A output looks like that shown earlier. If errors exist, try reinstalling Operators and CRDs.
  • NeMo Framework scripts or Kubernetes manifests. This is more likely a bug or wrong setup on the NeMo side. Errors can vary.

Clean up

It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. To clean up our setup, we must delete the FSx for Lustre file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC.

  1. To delete the file system integration with the EKS cluster, run the following command:
    kubectl delete -f ./fsx-storage-class.yaml

Not only will this delete the persistent volume, it will also delete the EFS file system and all the data on the file system will be lost.

  1. When Step 1 is complete, delete the cluster by using the following script:
    eksctl delete cluster -f p4de-cluster-config.yaml

This will delete all the existing pods, remove the cluster, and delete the VPC you created in the beginning.

Conclusion

In this post, we demonstrated how to train generative AI models at scale using the NeMo Framework within an EKS cluster. We covered the challenges of training LLMs and how NeMo’s comprehensive tools and optimizations address these challenges, making the process more efficient and cost-effective. With NeMo, you can manage and scale distributed training workloads effectively. This post works with P4de instances. Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.

To help you get started, we have published a GitHub repository that provides step-by-step instructions for creating an EKS cluster with P4de instances, mounting an FSx for Lustre file system, and running distributed training workloads with NeMo. This guide empowers you to harness the full potential of NeMo and Amazon EKS for your AI model training needs.


About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring platform. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Wenhan Tan is a Solutions Architect at Nvidia assisting customers to adopt Nvidia AI solutions at large-scale. His work focuses on accelerating deep learning applications and addressing inference and training challenges.

Read More

Governing the ML lifecycle at scale, Part 2: Multi-account foundations

Governing the ML lifecycle at scale, Part 2: Multi-account foundations

Your multi-account strategy is the core of your foundational environment on AWS. Design decisions around your multi-account environment are critical for operating securely at scale. Grouping your workloads strategically into multiple AWS accounts enables you to apply different controls across workloads, track cost and usage, reduce the impact of account limits, and mitigate the complexity of managing multiple virtual private clouds (VPCs) and identities by allowing different teams to access different accounts that are tailored to their purpose.

In Part 1 of this series, Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker, you learned about best practices for operating and governing machine learning (ML) and analytics workloads at scale on AWS. In this post, we provide guidance for implementing a multi-account foundation architecture that can help you organize, build, and govern the following modules: data lake foundations, ML platform services, ML use case development, ML operations, centralized feature stores, logging and observability, and cost and reporting.

We cover the following key areas of the multi-account strategy for governing the ML lifecycle at scale:

  •  Implementing the recommended account and organizational unit structure to provide isolation of AWS resources (compute, network, data) and cost visibility for ML and analytics teams
  • Using AWS Control Tower to implement a baseline landing zone to support scaling and governing data and ML workloads
  • Securing your data and ML workloads across your multi-account environment at scale using the AWS Security Reference Architecture
  • Using the AWS Service Catalog to scale, share, and reuse ML across your multi-account environment and for implementing baseline configurations for networking
  • Creating a network architecture to support your multi-account environment and facilitate network isolation and communication across your multi-tenant environment

Your multi-account foundation is the first step towards creating an environment that enables innovation and governance for data and ML workloads on AWS. By integrating automated controls and configurations into your account deployments, your teams will be able to move quickly and access the resources they need, knowing that they are secure and comply with your organization’s best practices and governance policies. In addition, this foundational environment will enable your cloud operations team to centrally manage and distribute shared resources such as networking components, AWS Identity and Access Management (IAM) roles, Amazon SageMaker project templates, and more.

In the following sections, we present the multi-account foundation reference architectures, discuss the motivation behind the architectural decisions made, and provide guidance for implementing these architectures in your own environment.

Organizational units and account design

You can use AWS Organizations to centrally manage accounts across your AWS environment. When you create an organization, you can create hierarchical groupings of accounts within organizational units (OUs). Each OU is typically designed to hold a set of accounts that have common operational needs or require a similar set of controls.

The recommended OU structure and account structure you should consider for your data and ML foundational environment is based on the AWS whitepaper Organizing Your AWS Environment Using Multiple Accounts. The following diagram illustrates the solution architecture.

Organizing Your AWS Environment Using Multiple Accounts
Only those OUs that are relevant to the ML and data platform have been shown. You can also add other OUs along with the recommended ones. The next sections discuss how these recommended OUs serve your ML and data workloads and the specific accounts you should consider creating within these OUs.

The following image illustrates, respectively, the architecture of the account structure for setting up a multi-account foundation and how it would look like in AWS Organizations once implemented .

Recommended OUs

The recommended OUs include Security, Infrastructure, Workloads, Deployments, and Sandbox. If you deploy AWS Control Tower, which is strongly recommended, it creates two default OUs: Security and Sandbox. You should use these default OUs and create the other three. For instructions, refer to Create a new OU.

Security OU

The Security OU stores the various accounts related to securing your AWS environment. This OU and the accounts therein are typically owned by your security team.

You should consider the following initial accounts for this OU:

  •  Security Tooling account – This account houses general security tools as well as those security tools related to your data and ML workloads. For instance, you can use Amazon Macie within this account to help protect your data across all of your organization’s member accounts.
  • Log Archive account – If you deploy AWS Control Tower, this account is created by default and placed within your Security OU. This account is designed to centrally ingest and archive logs across your organization.

Infrastructure OU

Similar to other types of workloads that you can run on AWS, your data and ML workloads require infrastructure to operate correctly. The Infrastructure OU houses the accounts that maintain and distribute shared infrastructure services across your AWS environment. The accounts within this OU will be owned by the infrastructure, networking, or Cloud Center of Excellence (CCOE) teams.

The following are the initial accounts to consider for this OU:

  • Network account – To facilitate a scalable network architecture for data and ML workloads, it’s recommended to create a transit gateway within this account and share this transit gateway across your organization. This will allow for a hub and spoke network architecture that privately connects your VPCs in your multi-account environment and facilitates communication with on-premises resources if needed.
  • Shared Services account – This account hosts enterprise-level shared services such as AWS Managed Microsoft AD and AWS Service Catalog that you can use to facilitate the distribution of these shared services.

Workloads OU

The Workloads OU is intended to house the accounts that different teams within your platform use to create ML and data applications. In the case of an ML and data platform, you’ll use the following accounts:

  • ML team dev/test/prod accounts – Each ML team may have their own set of three accounts for the development, testing, and production stages of the MLOps lifecycle.
  • (Optional) ML central deployments – It’s also possible to have ML model deployments fully managed by an MLOps central team or ML CCOE. This team can handle the deployments for the entire organization or just for certain teams; either way, they get their own account for deployments.
  • Data lake account – This account is managed by data engineering or platform teams. There can be several data lake accounts organized by business domains. This is hosted in the Workloads OU.
  • Data governance account – This account is managed by data engineering or platform teams. This acts as the central governance layer for data access. This is hosted in the Workloads OU.

Deployments OU

The Deployments OU contains resources and workloads that support how you build, validate, promote, and release changes to your workloads. In the case of ML and data applications, this will be the OU where the accounts that host the pipelines and deployment mechanisms for your products will reside. These will include accounts like the following:

  • DevOps account – This hosts the pipelines to deploy extract, transform, and load (ETL) jobs and other applications for your enterprise cloud platform
  • ML shared services account – This is the main account for your platform ML engineers and the place where the portfolio of products related to model development and deployment are housed and maintained

If the same team managing the ML engineering resources is the one taking care of pipelines and deployments, then these two accounts may be combined into one. However, one team should be responsible for the resources in one account; the moment you have different independent teams taking care of these processes, the accounts should be different. This makes sure that a single team is accountable for the resources in its account, making it possible to have the right levels of billing, security, and compliance for each team.

Sandbox OU

The Sandbox OU typically contains accounts that map to an individual or teams within your organization and are used for proofs of concept. In the case of our ML platform, this can be cases of the platform and data scientist teams wanting to create proofs of concept with ML or data services. We recommend using synthetic data for proofs of concept and avoid using production data in Sandbox environments.

AWS Control Tower

AWS Control Tower enables you to quickly get started with the best practices for your ML platform. When you deploy AWS Control Tower, your multi-account AWS environment is initialized according to prescriptive best practices. AWS Control Tower configures and orchestrates additional AWS services, including Organizations, AWS Service Catalog, and AWS IAM Identity Center. AWS Control Tower helps you create a baseline landing zone, which is a well-architected multi-account environment based on security and compliance best practices. As a first step towards initializing your multi-account foundation, you should set up AWS Control Tower.

In the case of our ML platform, AWS Control Tower helps us with four basic tasks and configurations:

  • Organization structure – From the accounts and OUs that we discussed in the previous section, AWS Control Tower provides you with the Security and Sandbox OUs and the Security Tooling and Logging accounts.
  • Account vending – This enables you to effortlessly create new accounts that comply with your organization’s best practices at scale. It allows you to provide your own bootstrapping templates with AWS Service Catalog (as we discuss in the next sections).
  • Access management – AWS Control Tower integrates with IAM Identity Center, providing initial permissions sets and groups for the basic actions in your landing zone.
  • Controls – AWS Control Tower implements preventive, detective, and proactive controls that help you govern your resources and monitor compliance across groups of AWS accounts.

Access and identity with IAM Identity Center

After you establish your landing zone with AWS Control Tower and create the necessary additional accounts and OUs, the next step is to grant access to various users of your ML and data platform. Proactively determining which users will require access to specific accounts and outlining the reasons behind these decisions is recommended. Within IAM Identity Center, the concepts of groups, roles, and permission sets allows you to create fine-grained access for different personas within the platform.

Users can be organized into two primary groups: platform-wide and team-specific user groups. Platform-wide user groups encompass central teams such as ML engineering and landing zone security, and they are allocated access to the platform’s foundational accounts. Team-specific groups operate at the team level, denoted by roles such as team admins and data scientists. These groups are dynamic, and are established for new teams and subsequently assigned to their respective accounts upon provisioning.

The following table presents some example platform-wide groups.

User Group Description Permission Set Accounts
AWSControlTowerAdmins Responsible for managing AWS Control Tower in the landing zone AWSControlTowerAdmins and AWSSecurityAuditors Management account
AWSNetworkAdmins Manages the networking resources of the landing zone NetworkAdministrator Network account
AWSMLEngineers Responsible for managing the ML central resources PowerUserAccess ML shared services account
AWSDataEngineers Responsible for managing the data lake, ETLs and data processes of the platform PowerUserAccess Data lake account

The following table presents examples of team-specific groups.

User Group Description Permission Set Accounts
TeamLead Group for the administrators of the team. AdministratorAccess Team account
DataScientists Group for data scientists. This group is added as an access for the team’s SageMaker domain. DataScientist Team account
MLEngineers The team may have other roles dedicated to certain specific tasks that have a relationship with the matching platform-wide teams. MLEngineering Team account
DataEngineers DataEngineering Team account

AWS Control Tower automatically generates IAM Identity Center groups with permission set relationships for the various landing zone accounts it creates. You can use these preconfigured groups for your platform’s central teams or create new custom ones. For further insights into these groups, refer to IAM Identity Center Groups for AWS Control Tower. The following screenshot shows an example of the AWS Control Tower console, where you can view the accounts and determine which groups have permission on each account.

IAM Identity Center also provides a login page where landing zone users can get access to the different resources, such as accounts or SageMaker domains, with the different levels of permissions that you have granted them.

AWS Security Reference Architecture

The AWS SRA is a holistic set of guidelines for deploying the full complement of AWS security services in a multi-account environment. It can help you design, implement, and manage AWS security services so they align with AWS recommended practices.

To help scale security operations and apply security tools holistically across the organization, it’s recommended to use the AWS SRA to configure your desired security services and tools. You can use the AWS SRA to set up key security tooling services, such as Amazon GuardDuty, Macie, and AWS Security Hub. The AWS SRA allows you to apply these services across your entire multi-account environment and centralize the visibility these tools provide. In addition, when accounts get created in the future, you can use the AWS SRA to configure the automation required to scope your security tools to these new accounts.

The following diagram depicts the centralized deployment of the AWS SRA.

Scale your ML workloads with AWS Service Catalog

Within your organization, there will likely be different teams corresponding to different business units. These teams will have similar infrastructure and service needs, which may change over time. With AWS Service Catalog, you can scale your ML workloads by allowing IT administrators to create, manage, and distribute portfolios of approved products to end-users, who then have access to the products they need in a personalized portal. AWS Service Catalog has direct integrations with AWS Control Tower and SageMaker.

It’s recommended that you use AWS Service Catalog portfolios and products to enhance and scale the following capabilities within your AWS environment:

  • Account vending – The cloud infrastructure team should maintain a portfolio of account bootstrapping products within the shared infrastructure account. These products are templates that contain the basic infrastructure that should be deployed when an account is created, such as VPC configuration, standard IAM roles, and controls. This portfolio can be natively shared with AWS Control Tower and the management account, so that the products are directly used when creating a new account. For more details, refer to Provision accounts through AWS Service Catalog.
  • Analytics infrastructure self-service – This portfolio should be created and maintained by a central analytics team or the ML shared services team. This portfolio is intended to host templates to deploy different sets of analytics products to be used by the platform ML and analytics teams. It is shared with the entire Workloads OU (for more information, see Sharing a Portfolio). Examples of the products include a SageMaker domain configured according to the organization’s best practices or an Amazon Redshift cluster for the team to perform advanced analytics.
  • ML model building and deploying – This capability maps to two different portfolios, which are maintained by the platform ML shared services team:
    • Model building portfolio – This contains the products to build, train, evaluate, and register your ML models across all ML teams. This portfolio is shared with the Workloads OU and is integrated with SageMaker project templates.
    • Model deployment portfolio – This contains the products to deploy your ML models at scale in a reliable and consistent way. It will have products for different deployment types such as real-time inference, batch inference, and multi-model endpoints. This portfolio can be isolated within the ML shared services account by the central ML engineering team for a more centralized ML strategy, or shared with the Workloads OU accounts and integrated with SageMaker project templates to federate responsibility to the individual ML teams.

Let’s explore how we deal with AWS Service Catalog products and portfolios in our platform. Both of the following architectures show an implementation to govern the AWS Service Catalog products using the AWS Cloud Development Kit (AWS CDK) and AWS CodePipeline. Each of the aforementioned portfolios will have its own independent pipeline and code repository. The pipeline synthesizes the AWS CDK service catalog product constructs into actual AWS Service Catalog products and deploys them to the portfolios, which are later made available for its consumption and use. For more details about the implementation, refer to Govern CI/CD best practices via AWS Service Catalog.

The following diagram illustrates the architecture for the account vending portfolio.

The workflow includes the following steps:

  1. The shared infrastructure account is set up with the pipeline to create the AWS Service Catalog portfolio.
  2. The CCOE or central infrastructure team can work on these products and customize them so that company networking and security requirements are met.
  3. You can use the AWS Control Tower Account Factory Customization (AFC) to integrate the portfolio within the account vending process. For more details, see Customize accounts with Account Factory Customization (AFC).
  4. To create a new account from the AFC, we use a blueprint. A blueprint is an AWS CloudFormation template that will be deployed in the newly created AWS account. For more information, see Create a customized account from a blueprint.

The following screenshot shows an example of what account creation with a blueprint looks like.

For the analytics and ML portfolios, the architecture changes the way these portfolios are used downstream, as shown in the following diagram.

The following are the key steps involved in building this architecture:

  1. The ML shared services account is set up and bootstrapped with the pipelines to create the two AWS Service Catalog portfolios.
  2. The ML CCOE or ML engineering team can work on these products and customize them so they’re up to date and cover the main use cases from the different business units.
  3. These portfolios are shared with the OU where the ML dev accounts will be located. For more information about the different options to share AWS Service Catalog portfolios, see Sharing a Portfolio.
  4. Sharing these portfolios with the entire Workloads OU will result in these two portfolios being available for use by the account team as soon as the account is provisioned.

After the architecture has been set up, account admins will see the AWS Service Catalog portfolios and ML workload account after they log in. The portfolios are ready to use and can get the team up to speed quickly.

Network architecture

In our ML platform, we are considering two different major logical environments for our workloads: production and pre-production environments with corporate connectivity, and sandbox or development iteration accounts without corporate connectivity. These two environments will have different permissions and requirements when it comes to connectivity.

As your environment in AWS scales up, inter-VPC connectivity and on-premises VPC connectivity will need to scale in parallel. By using services such as Amazon Virtual Private Cloud (Amazon VPC) and AWS Transit Gateway, you can create a scalable network architecture that is highly available, secure, and compliant with your company’s best practices. You can attach each account to its corresponding network segment.

For simplicity, we create a transit gateway within the central network account for our production workloads; this will resemble a production network segment. This will create a hub and spoke VPC architecture that will allow our production accounts to do the following:

  • Enable inter-VPC communication between the different accounts.
  • Inspect traffic with centralized egress or ingress to the network segment.
  • Provide the environments with connectivity to on-premises data stores.
  • Create a centralized VPC endpoints architecture to reduce networking costs while maintaining private network compliance. For more details, see Centralized access to VPC private endpoints.

For more information about these type of architectures, refer to Building a Scalable and Secure Multi-VPC AWS Network Infrastructure.

The following diagram illustrates the recommended architecture for deploying your transit gateways and creating attachments to the VPCs within your accounts. Anything considered a production environment, whether it’s a workload or shared services account, is connected to the corporate network, while dev accounts have direct internet connectivity to speed up development and exploring of new features.

At a high level, this architecture allows you to create different transit gateways within your network account for your desired AWS Regions or environments. Scalability is provided through the account vending functionality of AWS Control Tower, which deploys a CloudFormation stack to the accounts containing a VPC and the required infrastructure to connect to the environment’s corresponding network segment. For more information about this approach, see the AWS Control Tower Guide for Extending Your Landing Zone.

With this approach, whenever a team needs a new account, the platform team just needs to know whether this will be an account with corporate network connectivity or not. Then the corresponding blueprint is selected to bootstrap the account with, and the account is created. If it’s a corporate network account, the VPC will come with an attachment to the production transit gateway.

Conclusion

In this post, we discussed best practices for creating a multi-account foundation to support your analytics and ML workloads and configuring controls to help you implement governance early in your ML lifecycle. We provided a baseline recommendation for OUs and accounts you should consider creating using AWS Control Tower and blueprints. In addition, we showed how you can deploy security tools at scale using the AWS SRA, how to configure IAM Identity Center for centralized and federated access management, how to use AWS Service Catalog to package and scale your analytics and ML resources, and a best practice approach for creating a hub and spoke network architecture.

Use this guidance to get started in the creation of your own multi-account environment for governing your analytics and ML workloads at scale, and make sure you subscribe to the AWS Machine Learning Blog to receive updates regarding additional blog posts within this series.


About the authors

Alberto Menendez is a DevOps Consultant in Professional Services at AWS. He helps accelerate customers’ journeys to the cloud and achieve their digital transformation goals. In his free time, he enjoys playing sports, especially basketball and padel, spending time with family and friends, and learning about technology.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides motorcycle and walks with his three-year old sheep-a-doodle!

Liam Izar is Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Liam has led multiple projects with customers migrating, transforming, and integrating data to solve business challenges. His core area of expertise includes technology strategy, data migrations, and machine learning. In his spare time, he enjoys boxing, hiking, and vacations with the family.

Read More

Data-driven model improves accuracy in predicting EV battery degradation

Data-driven model improves accuracy in predicting EV battery degradation

white icons symbolizing renewable electric energy on a blue and green gradient background

Rising carbon emissions have significantly challenged sustainable development in recent years, prompting global efforts to implement carbon reduction policies and achieve long-term carbon neutrality. A crucial step in this transition involves the recycling and reuse of power batteries, which are assessed for their state-of-health (SoH) and then repaired or restructured for reuse in smaller-sized electric vehicles (EVs), energy storage systems, and smart streetlights. This process not only extends battery life but also maximizes their residual value. However, accurately assessing this value is complex.  

To address this, Microsoft Research collaborated with Nissan Motor Corporation to develop a new machine learning method that predicts battery degradation with an average error rate of just 0.94%, significantly bolstering Nissan’s battery recycling efforts.

Approaching carbon neutrality, one step at a time 

Nissan, the company that launched the world’s first mass-produced electric vehicle, has long been committed to reducing carbon emissions. In 2021, Nissan announced its goal to achieve carbon neutrality by 2050 throughout the vehicle’s lifecycle. Central to this effort is the management and innovation of batteries, the key power source for electric vehicles, making battery recycling is an important part of this initiative. 

The graph overviews Nissan’s Challenges in Battery Eco-cycle Innovation. The image is segmented into four quadrants, each representing a crucial phase in the battery life cycle. The top left quadrant, “Data-driven chemistry design” and the top right, “Cell design optimization” are integral to the development phase of Battery DX. The bottom right quadrant focuses on “Battery diagnosis/prognosis” which is essential for Battery DX during its use. Lastly, the bottom left quadrant, “Material recycle” emphasizes the importance of recycling in the eco-cycle.
Figure 1. The challenges faced by Nissan in battery eco-cycle innovation

Atsushi Ohma, Expert Leader of the EV System Laboratory at Nissan, noted that EVs and their batteries currently have an average lifecycle of about 10 years, contributing to approximately 50% of their CO2 emissions in the material mining and manufacturing process. Nissan aims to extend the lifecycle of EVs and batteries to more than 15 years, reducing CO2 emissions. To achieve this, the company hopes to leverage technologies like AI and big data to drive innovation in battery and electric vehicle development.

Flowchart showing the life cycle of electric vehicle (EV) batteries and their impact on CO2 emissions. It outlines stages such as raw material mining, battery production, usage in vehicles, and recycling/repurposing processes. The chart shows that about 50% of life cycle CO2 emissions are from raw material mining and battery production, and emphasizes that Nissan aims to extend the lifespan of electric vehicles and batteries by 15 or 20 years to reduce CO2 emissions.
Figure 2. Vision for reducing CO2 in the EV lifecycle

Collaborating to reduce CO2 in the EV lifecycle

Since Microsoft announced its sustainability commitments and outlined plans to work toward a more sustainable future in 2020, the team at Microsoft Research Asia has been actively engaged in addressing sustainability challenges through interdisciplinary research, collaborating with partners from related fields. The team has already developed BatteryML, an open-source machine learning tool for advancing battery research, and is working on methods to predict battery health and remaining service life. This makes the collaboration between Microsoft Research Asia and Nissan a natural one. Together, the joint team aims to achieve carbon neutrality and enhance lithium-ion battery performance prediction by focusing on battery performance degradation. 

photo of Atsushi Ohma, Expert Leader, EV System Laboratory, Research Division, Nissan

“Through our collaboration with Microsoft Research Asia, we are innovating battery degradation prediction methods to enhance the effectiveness of battery recycling and promote resource reuse. This is a pivotal step in our journey towards achieving long-term carbon neutrality. We call it ‘thinking big and starting with small steps.’”

Atsushi Ohma, Expert Leader, EV System Laboratory, Research Division, Nissan

Enhancing battery predictions with speed and accuracy

Understanding the SoH of batteries is crucial for efficient battery recycling. While usable capacity does not fully represent SoH, more important factors include the integrity of the battery’s chemistry over its life, such as the levels of lithium, cobalt, and nickel. Traditionally, battery degradation prediction relies on mathematical models based on chemical, electrochemical, and mechanical principles. This method requires continuous experimentation to adjust parameters, involving lengthy processes like battery disassembly and analysis, which can take six months to a year. Additionally, further experimentation and parameterization are needed whenever the chemistry changes. To address this, Nissan aims to apply machine learning to predict battery health based on external signals, minimizing the need for extensive physical testing. 

However, there are two main challenges to using machine learning to predict battery performance. First, it’s difficult to gather sufficient data due to the lengthy charging and discharging cycles. Second, because batteries operate under varying conditions, signal acquisition is complicated. Additionally, external environmental factors can influence battery capacity without directly reflecting its health status.

To filter out this “noise” and identify patterns that accurately reflect the battery’s internal condition, researchers have developed specialized features to analyze how the internal chemistry of lithium-ion batteries changes under different voltage and current conditions. By integrating these key features with real Nissan data, researchers improved the prediction accuracy of their machine learning models. 

photo of Shun Zheng, Senior Researcher, Microsoft Research Asia

“We found differences between academic public datasets and real-world corporate data. Models built on academic datasets are difficult to apply in enterprise settings due to variations in data patterns, testing conditions, and prediction goals. Developing broadly applicable models for industry requires integrating proprietary enterprise data with advanced AI technologies.”

Shun Zheng, Senior Researcher, Microsoft Research Asia

Data-driven model boosts accuracy by 80% in simulations

The machine learning methodology redefines the entire feature space to provide a comprehensive understanding of battery degradation. Advanced feature engineering analyzes diversified features derived from degradation patterns in voltage-capacity curves during charging and discharging cycles, as illustrated in Figure 3. Researchers focused on distinguishing information between high and low voltage intervals, including first-order and higher-order differences as effective indicators of battery health, enhancing predictive power and providing deep insights into battery performance and longevity.

The graph depicts discharge capacity of a battery cell at a specific voltage during the 50th cycle. The x-axis is labeled “Capacity [mAh/g]” and the y-axis “Cell Voltage [V]”. A descending line graph illustrates the relationship between cell voltage and capacity, with a highlighted point “Q^d (Vx)” representing the discharge capacity at that voltage during the 50th cycle. The accompanying text shows that this method is more accurate than using “Var(Δ_x-0 * Q^d)”.
Figure 3. Feature engineering, demonstrating the variation of voltage with respect to discharge capacity.

Compared with popular state-of-the-art battery prediction methods, this data-driven model improves accuracy by approximately 80% with Nissan’s simulation data and by over 30% with real-world experimental data. The new method has achieved a mean absolute error (MAE) of 0.0094 in predicting SoH at the 200th cycle using data from only the first 50 cycles, as shown in Figure 4.

This demonstrates that the new data-driven model is not only more accurate but also more efficient in predicting a battery’s SoH compared with existing methods. It requires less data and fewer cycles to make precise predictions, offering significant advantages for battery health monitoring and management.

Four graphs. The top left graph is a scatter plot with blue and red dots representing “Train” and “Test” data sets, respectively, showing a strong correlation between prediction and experiment. The top right graph displays two bar graphs for Mean Absolute Error (MAE) with values for “Train” and “Test”, the MAE of “Train” is 0.0077, the MAE of “Test” is 0.0094. Below is a box plot labeled “TEST MAE” across different “Qd (V)x” values, indicating the model’s accuracy at various stages. The image demonstrates the model’s effectiveness in predicting battery performance.
Figure 4. Test achieves MAE of 0.0094 in predicting SOH at the 200th cycle using Qd (V)50

By employing the data-driven method, researchers discovered that the indicated feature at 3.9 volts can be interpreted as the nickel manganese cobalt oxide (NMC) crystalline structure (M->H2). This finding aligns with electrochemical research and highlights that the features identified through our data-driven approach have significant real-world implications for understanding battery degradation.

photo of Jungwon Moon, Engineer, EV System Laboratory Research Division, Nissan

“This research extends the lifespan of power batteries in two ways: first, by improving reuse potential and accurately determining their remaining lifespan; and second, by developing effective recycling strategies for retired batteries. The unique approach of our joint research was to predict not only cell SoH but also cathode (NMC) SoH to improve the reliability of the cell SoH prediction model. It was surprising that the high sensitivity to certain voltages (3.9V) indicated by the data-driven cathode (NMC) SoH prediction model aligns with results from the physics-based method. Collaboration with Microsoft Research Asia has demonstrated that AI can be applied to battery manufacturing, including material selection and process optimization.”

Jungwon Moon, Engineer, EV System Laboratory Research Division, Nissan

Looking ahead: Exploring AI’s sustainability applications

The collaboration between Nissan and Microsoft Research Asia highlights the potential of AI technologies, including machine learning and deep learning, in the EV sector. Beyond predicting battery health for recycling, AI can optimize the driving experience by accurately predicting battery life and enabling smarter driving. Additionally, AI holds promise for discovering new materials and driving innovation in battery and EV technology.

photo of Jiang Bian, Senior Principal Researcher, Microsoft Research Asia

“There are existing issues with lithium batteries. We need batteries with high energy density, good safety, a long lifecycle, and with a minimal environmental impact. Through our collaboration with Nissan, we have learned that AI has great potential in the EV, including optimizing battery material combinations to improve performance, discovering new materials, and optimizing battery electrode processes. In the future, we hope to collaborate with more industry partners to further explore AI’s potential in various industrial applications.”

Jiang Bian, Senior Principal Researcher, Microsoft Research Asia

Building on their initial results, Nissan and Microsoft Research Asia plan to expand their collaboration to further advance technology and accelerate progress toward sustainable development and environmental protection goals.

Seven people posed for a group photo in front of the wall banner of Microsoft Research Asia when Atsushi Ohma visited in June 2024.
Figure 5. Atsushi Ohma from Nissan, center, visited Microsoft Research Asia in June 2024

The post Data-driven model improves accuracy in predicting EV battery degradation appeared first on Microsoft Research.

Read More

Next-Gen Video Editing: Wondershare Filmora Adds NVIDIA RTX Video HDR Support, RTX-Accelerated AI Features

Next-Gen Video Editing: Wondershare Filmora Adds NVIDIA RTX Video HDR Support, RTX-Accelerated AI Features

Editor’s note: This post is part of our In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX GPU features, technologies and resources, and how they dramatically accelerate content creation.

Wondershare Filmora — a video editing app with AI-powered tools — now supports NVIDIA RTX Video HDR, joining editing software like Blackmagic Design’s DaVinci Resolve and Cyberlink PowerDirector.

RTX Video HDR significantly enhances video quality, ensuring the final output is suitable for the best monitors available today.

Livestreaming software OBS Studio and XSplit Broadcaster now support Twitch Enhanced Broadcasting, giving streamers more control over video quality through client-side encoding and automatic configurations. The feature, developed in collaboration between Twitch, OBS and NVIDIA, also paves the way for more advancements, including vertical live video and advanced codecs such as HEVC and AV1.

A summer’s worth of creative app updates are included in the July Studio Driver, ready for download today. Install the NVIDIA app beta — the essential companion for creators and gamers — to keep GeForce RTX PCs up to date with the latest NVIDIA drivers and technology.

Join NVIDIA at SIGGRAPH to learn about the latest breakthroughs in graphics and generative AI, and tune in to a fireside chat featuring NVIDIA founder and CEO Jensen Huang and Lauren Goode, senior writer at WIRED, on Monday, July 29 at 2:30 p.m. MT. Register now.

And this week’s featured In the NVIDIA Studio artist, Kevin Stratvert, shares all about AI-powered content creation in Wondershare Filmora.

(Wonder)share the Beauty of RTX Video

RTX Video HDR analyzes standard dynamic range video and transforms it into HDR10-quality video, expanding the color gamut to produce clearer, more vibrant frames and enhancing the sense of depth for greater immersion.

With RTX Video HDR, Filmora users can create high-quality content that’s ideal for gaming videos, travel vlogs or event filmmaking.

Combining RTX Video HDR with RTX Video Super Resolution — another AI-powered tool that uses trained models to sharpen edges, restore features and remove artifacts in video — further enhances visual quality. RTX Video HDR requires an NVIDIA RTX GPU connected to an HDR10-compatible monitor or TV. For more information, check out the RTX Video FAQ.

Those with a RTX GPU-powered PC can send files to the Filmora desktop app and continue to edit with local RTX acceleration, doubling the speed of the export process with dual encoders on GeForce RTX 4070 Ti or above GPUs.

Learn more about Wondershare Filmora’s AI-powered features.

Maximizing AI Features in Filmora

Kevin Stratvert has the heart of a teacher — he’s always loved to share his technical knowledge and tips with others.

One day, he thought, “Why not make a YouTube video to explain stuff directly to users?” His first big hit was a tutorial on how to get Microsoft Office for free through Office.com. The video garnered millions of views and tons of engagement — and he’s continued creating content ever since.

“The more content I created, the more questions and feedback I got from viewers, sparking this cycle of creativity and connection that I just couldn’t get enough of,” said Stratvert.

Explaining the benefits of AI has been an area of particular interest for Stratvert, especially as it relates to AI-powered features in Wondershare Filmora. In one YouTube video, Filmora Video Editor Tutorial for Beginners, he breaks down the AI effects video editors can use to accelerate their workflows.

Examples include:

  • Smart Edit: Edit footage-based transcripts generated automatically, including in multiple languages.
  • Smart Cutout: Remove unwanted objects or change the background in seconds.
  • Speech-to-Text: Automatically generate compelling descriptions, titles and captions.

“AI has become a crucial part of my creative toolkit, especially for refining details that really make a difference,” said Stratvert. “By handling these technical tasks, AI frees up my time to focus more on creating content, making the whole process smoother and more efficient.”

Stratvert has also been experimenting with NVIDIA ChatRTX, a technology that lets users interact with their local data, installing and configuring various AI models, effectively prompting AI for both text and image outputs using CLIP and more.

NVIDIA Broadcast has been instrumental in giving Stratvert a professional setup for web conferences and livestreams. The app’s features, including background noise removal and virtual background, help maintain a professional appearance on screen. It’s especially useful in home studio settings, where controlling variables in the environment can be challenging.

“NVIDIA Broadcast has been instrumental in professionalizing my setup for web conferences and livestreams.” — Kevin Stratvert

Stratvert stresses the importance of his GeForce RTX 4070 graphics card in the content creation process.

“With an RTX GPU, I’ve noticed a dramatic improvement in render times and the smoothness of playback, even in demanding scenarios,” he said. “Additionally, the advanced capabilities of RTX GPUs support more intensive tasks like real-time ray tracing and AI-driven editing features, which can open up new creative possibilities in my edits.”

Check out Stratvert’s video tutorials on his website.

Content creator Kevin Stratvert.

Follow NVIDIA Studio on Instagram, X and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter

Read More

Ferretv2: An Improved Baseline for Referring and Grounding

While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain limitations: constrained by the pre-trained fixed visual encoder and failed to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model’s ability to process and understand images in greater detail. (2) Multi-granularity visual…Apple Machine Learning Research