NVIDIA to Help Elevate Japan’s Sovereign AI Efforts Through Generative AI Infrastructure Build-Out

Following an announcement by Japan’s Ministry of Economy, Trade and Industry, NVIDIA will play a central role in developing the nation’s generative AI infrastructure as Japan seeks to capitalize on the technology’s economic potential and further develop its workforce.

NVIDIA is collaborating with key digital infrastructure providers, including GMO Internet Group, Highreso, KDDI Corporation, RUTILEA, SAKURA internet Inc. and SoftBank Corp., which the ministry has certified to spearhead the development of cloud infrastructure crucial for AI applications.

Over the last two months, the ministry announced plans to allocate $740 million, approximately ¥114.6 billion, to assist six local firms in this initiative. Building on last year’s program, this is a significant effort by the Japanese government to subsidize AI computing resources by expanding the number of companies involved.

With this move, Japan becomes the latest nation to embrace the concept of sovereign AI, aiming to fortify its local startups, enterprises and research efforts with advanced AI technologies.

Across the globe, nations are building up domestic computing capacity through various models. Some procure and operate sovereign AI clouds with state-owned telecommunications providers or utilities. Others are sponsoring local cloud partners to provide a shared AI computing platform for public and private sector use.

Japan’s initiative follows NVIDIA founder and CEO Jensen Huang’s visit last year, where he met with political and business leaders — including Japanese Prime Minister Fumio Kishida — to discuss the future of AI.

During his trip, Huang emphasized that “AI factories” — next-generation data centers designed to handle the most computationally intensive AI tasks — are crucial for turning vast amounts of data into intelligence. “The AI factory will become the bedrock of modern economies across the world,” Huang said during a meeting with the Japanese press in December.

The Japanese government plans to subsidize a significant portion of the costs for building AI supercomputers, which will facilitate AI adoption, enhance workforce skills, support Japanese language model development and bolster resilience against natural disasters and climate change.

Under the country’s Economic Security Promotion Act, the ministry aims to secure a stable supply of local cloud services, reducing the time and cost of developing next-generation AI technologies.

Japan’s technology powerhouses are already moving fast to embrace AI. Last week, SoftBank Corp. announced that it will invest ¥150 billion, approximately $960 million, for its plan to expand the infrastructure needed to develop Japan’s top-class AI, including purchases of NVIDIA accelerated computing.

The news follows Huang’s meetings with leaders in Canada, France, India, Japan, Malaysia, Singapore and Vietnam over the past year, as countries seek to supercharge their regional economies and embrace challenges such as climate change with AI.

Speeding up ViTs using Block Sparsity

TLDR: We show promising results of up to a 1.46x speedup with <2% drop in accuracy on float32 Vision Transformers on A100 GPUs by applying block sparsity to the weights of the MLP modules. This approach can potentially be applied to other types of transformers, including large language models. Our implementation and the benchmarks to reproduce our results are available at https://github.com/pytorch-labs/superblock.

Introduction

PyTorch has landed a lot of improvements to the CUDA kernels that implement block sparse matrix multiplications. Recent updates to PyTorch can lead to up to a 4.8x speedup over dense baselines on large matrix multiplication shapes with high sparsity levels.

In this blog, we show the promising results of applying block sparsity to the weights of the linear layers in the MLP (multi-layer perceptron) modules of vision transformers (ViTs), and we show end-to-end model speedups on NVIDIA A100 GPUs.

As a recap, block sparsity sparsifies weights in tiles of blocks of predetermined size, rather than sparsifying individual elements. This particular sparsity pattern is interesting because it is amenable to GPU acceleration via fast sparse kernels. For more information about the differences between different sparsity patterns, or about sparsity as a whole, please check out torchao.

Illustrations of different types of sparsity.
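
To make the difference between sparsity patterns concrete, here is a minimal pure-Python sketch (not taken from the superblock repository). It compares two 8x8 masks that are both 50% sparse: a checkerboard, standing in for element-wise sparsity, and a mask that drops whole 4x4 tiles. The helper names and the tiny shapes are assumptions made for illustration only.

```python
def skippable_block_fraction(mask, block):
    """Fraction of block x block tiles in a 0/1 mask that are entirely zero."""
    rows, cols = len(mask), len(mask[0])
    tiles = [(r, c) for r in range(0, rows, block) for c in range(0, cols, block)]
    empty = sum(
        all(mask[r + i][c + j] == 0 for i in range(block) for j in range(block))
        for r, c in tiles
    )
    return empty / len(tiles)


def sparsity(mask):
    """Fraction of zero entries in a 0/1 mask."""
    flat = [v for row in mask for v in row]
    return flat.count(0) / len(flat)


N, B = 8, 4
# Element-wise pattern: zeros are scattered individually (checkerboard).
unstructured = [[(r + c) % 2 for c in range(N)] for r in range(N)]
# Block pattern: whole 4x4 tiles are kept or dropped together.
block_sparse = [
    [1 if (r // B + c // B) % 2 == 0 else 0 for c in range(N)] for r in range(N)
]

print(sparsity(unstructured), skippable_block_fraction(unstructured, B))
print(sparsity(block_sparse), skippable_block_fraction(block_sparse, B))
```

Both masks zero out half of the entries, but only the block pattern leaves entire tiles that a kernel could skip, which is the property that makes it amenable to GPU acceleration.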

Approach

Our approach can be broken down into two distinct steps:

  1. Training the model from scratch using block sparse masks subnets.
  2. Folding these masks into our weights to accelerate them for inference.

We explain our training and inference steps below.

Training

Starting with an uninitialized Vision Transformer, we apply random trainable masks with a specified block size and sparsity level to the weights of the output projection linear layer of the attention blocks, the weights of the two linear layers inside the MLP, also known as the FFN (feed-forward network), and the final linear classification layer. The forward pass during training follows the supermask approach: each mask is converted to a binary map using a threshold tuned to the sparsity requirement. For example, if we want 80% sparsity, the threshold is automatically tuned to keep the top 20% of weights. The masks are made up of square blocks of <block size>x<block size> elements, where <block size> is a hyperparameter. The priority of the weights depends on the mask value, or score, which is trained. We multiply the binary mask of each layer with its weights to sparsify the model.

Illustration of the Supermask sparsification approach
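
The thresholding step can be sketched in a few lines of plain Python. This is a simplified illustration rather than the training code from the repository: the function names are hypothetical, the scores are fixed numbers instead of trained parameters, and the gradient handling needed to train the scores is left out entirely.

```python
def block_supermask(scores, sparsity):
    """Turn per-block scores into a 0/1 block mask that keeps the top-scoring blocks."""
    flat = sorted((s for row in scores for s in row), reverse=True)
    keep = max(1, round(len(flat) * (1.0 - sparsity)))
    threshold = flat[keep - 1]  # lowest score that is still kept
    # Note: blocks with tied scores at the threshold are all kept.
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def apply_block_mask(weights, block_mask, block):
    """Multiply a dense weight matrix by a block mask expanded to element level."""
    return [
        [w * block_mask[r // block][c // block] for c, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]


scores = [[0.9, 0.1], [0.3, 0.7]]        # one score per 2x2 block (assumed values)
weights = [[1.0] * 4 for _ in range(4)]  # 4x4 dense weights
mask = block_supermask(scores, sparsity=0.5)
sparse_weights = apply_block_mask(weights, mask, block=2)
```

With 50% sparsity and four blocks, the two highest-scoring blocks survive and the other two are zeroed out as whole tiles.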

Inference

After training, the dense weights can be turned into sparse weights by multiplying them with the mask and storing the result for inference. At this stage, although the weights have a high percentage of zero values, they are still stored in dense format. We use PyTorch’s to_sparse_bsr() API to convert the weights to the Block Sparse Row (BSR) format, which stores only the non-zero blocks and their indices. This step only needs to be done once, and the results can be cached for runtime.

During runtime, no changes in code are required. We just pass any input tensor to the model, and when the forward() function of a sparsified linear layer is invoked, PyTorch takes care of invoking the optimized matrix multiplication for block sparse weights. This should work for A100 as well as H100 NVIDIA GPUs.
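
To show what a block sparse layout stores, the sketch below converts a small dense matrix into three arrays named after the ones that PyTorch sparse BSR tensors expose (crow_indices, col_indices, values). It is a pure-Python illustration of the layout under that assumption, not PyTorch’s implementation, and dense_to_bsr is a hypothetical helper.

```python
def dense_to_bsr(dense, block):
    """Convert a dense 2-D list into BSR-style arrays.

    Returns (crow_indices, col_indices, values), where values holds only
    the block x block tiles that contain at least one non-zero entry.
    """
    rows, cols = len(dense), len(dense[0])
    if rows % block or cols % block:
        raise ValueError("matrix dimensions must be multiples of the block size")
    crow_indices, col_indices, values = [0], [], []
    for br in range(rows // block):
        for bc in range(cols // block):
            tile = [
                dense[br * block + i][bc * block:(bc + 1) * block]
                for i in range(block)
            ]
            if any(v != 0 for row in tile for v in row):
                col_indices.append(bc)  # block-column of the stored tile
                values.append(tile)
        crow_indices.append(len(col_indices))  # running count of stored tiles
    return crow_indices, col_indices, values


dense = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
crow, col, vals = dense_to_bsr(dense, block=2)
```

Only one of the four 2x2 tiles is stored here. The dimension check also illustrates why a layer whose shape is not a multiple of the block size cannot be converted.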

Results: Microbenchmarks

To validate the viability of block sparsity from a performance standpoint, we first ran a series of microbenchmarks using this simple script. Using the linear shapes from ViT-b, we compared the speedup of our block sparse kernels across a single linear layer as we varied the sparsity level and block size of the weight matrix.

We ran the benchmarks using the PyTorch 2.3.0.dev20240305+cu121 nightly on NVIDIA A100s and report the speedup of each sparsity configuration compared to the dense baseline. We observed speedups when block size >= 32 or sparsity level >= 0.8 for float32, while for bfloat16 we observed smaller speedups, usually only for block size 64 and higher sparsities. Hence, for end-to-end speedups on the model, we focus in this blog on float32 and leave bfloat16 for future work.

Micro benchmarking results on linear layers of ViT-b-16.

Results: Vision Transformers

Once we confirmed that we were able to show speedups over the linear layers, we focused on showing end-to-end speedups on ViT_B_16.

We trained this model from scratch on ImageNet dataset using the standard ViT_B_16 recipe. We show speedups for sparsifying MLP modules and leave sparsifying weights of input and output projections of attention for future work.

We looked at wall-clock inference speedup, focusing on batch size 256. We found that:

  • For 90% sparsity we can get 1.24x, 1.37x, 1.65x speedups for block sizes 16, 32, and 64 respectively.
  • To obtain a speedup, the minimum sparsity levels for block sizes 16, 32, and 64 are 0.86, 0.82, and 0.7 respectively. Hence, as expected, the larger the block size, the lower the sparsity we need to obtain a speedup.

We note a limitation of the to_sparse_bsr() API: layer dimensions need to be multiples of the block size. Since the dimensions of the last FC classification layer in ViT were not a multiple of the block size, that layer was not converted to the BSR representation in our experiments.

Speedup on ViT-b-16 with batch size 256 on MLP modules across different batch sparsities and block sizes.

We also explored the speedup for different batch sizes for 90% sparsity. We observed a speedup over the baseline for batch sizes starting from 16 and upwards. While bigger block sizes have bigger speedups at the largest batch sizes, the smallest possible batch size to obtain >1 speedup is smaller for smaller block sizes.

We believe on-device hardware can obtain speedups for batch size 1 as they – unlike server GPUs – can be fully utilized at such small batch sizes.

Speedup on ViT-b-16 with 90% sparsity on MLP modules across different batch sizes and block sizes.

Looking at the Top-1 accuracy on the ImageNet-blurred test set of the sparsified models for different block sizes and sparsities, we see a few expected results:

  • low levels of sparsity (<=70%) have no meaningful regression in accuracy
  • mid levels of sparsity (>=80% to <90%) have limited regression in accuracy
  • high levels of sparsity (>=90%) remove so many weights that accuracy is significantly impacted

More research could be done to improve accuracies of higher sparsities and larger block sizes. We hope that the block sparsity support in PyTorch and the illustrated speedups in this blog will encourage researchers to explore more accurate sparsification approaches.

Accuracies on training ViT-b-16 on ImageNet-blurred using the SuperMask approach.

Next Steps

We have shown promising speedups for block sparsifying the MLP modules of ViT in float32 precision. There is still more work to be done to observe speedups on bfloat16, and we hope to make progress on that soon. Possible next steps to further optimize block sparsity on vision transformers and transformers in general:

  • Perform block sparsity on attention input and output projections.
  • Perform block sparsity during finetuning rather than training from scratch.
  • Perform further optimizations on the matmul kernels for ViT’s linear operator specific shapes (especially for 80% and lower sparsity).
  • Combine with other optimizations such as int8 and torch.compile()
  • Explore other weight sparsification algorithms, e.g., Spartan, to improve accuracy
  • Explore selecting weights to sparsify (e.g., specific transformer layers)

Please reach out to melhoushi@meta.com if you have questions or are interested in contributing to block sparsification!

Additionally if you’re broadly interested in sparsity please feel free to reach out to @jcaip / jessecai@meta.com and please come check out torchao, a community we’re building for architecture optimization techniques like quantization and sparsity.

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Large Language Model or LLM inference has two phases, the prompt (or prefill) phase to output the first token and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache and minimizes the time-to-first-token (TTFT). Dual-purposing the…Apple Machine Learning Research

pfl-research: Simulation Framework for Accelerating Research in Private Federated Learning

Federated Learning (FL) is an emerging ML training paradigm where clients own their data and collaborate to train a global model without revealing any data to the server and other participants.
Researchers commonly perform experiments in a simulation environment to quickly iterate on ideas. However, existing open-source tools do not offer the efficiency required to simulate FL on larger and more realistic FL datasets. We introduce pfl-research, a fast, modular, and easy-to-use Python framework for simulating FL. It supports TensorFlow, PyTorch, and non-neural network models, and is tightly…Apple Machine Learning Research

Evaluation of generative AI techniques for clinical report summarization

In part 1 of this blog series, we discussed how a large language model (LLM) available on Amazon SageMaker JumpStart can be fine-tuned for the task of radiology report impression generation. Since then, Amazon Web Services (AWS) has introduced new services such as Amazon Bedrock. This is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API.

Amazon Bedrock also comes with a broad set of capabilities required to build generative AI applications with security, privacy, and responsible AI. It’s serverless, so you don’t have to manage any infrastructure. You can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with. In this part of the blog series, we review techniques of prompt engineering and Retrieval Augmented Generation (RAG) that can be employed to accomplish the task of clinical report summarization by using Amazon Bedrock.

When summarizing healthcare texts, pre-trained LLMs do not always achieve optimal performance. LLMs can handle complex tasks like math problems and commonsense reasoning, but they are not inherently capable of performing domain-specific complex tasks. They require guidance and optimization to extend their capabilities and broaden the range of domain-specific tasks they can perform effectively. This can be achieved through properly guided prompts. Prompt engineering helps to effectively design and improve prompts to get better results on different tasks with LLMs. There are many prompt engineering techniques.

In this post, we provide a comparison of results obtained by two such techniques: zero-shot and few-shot prompting. We also explore the utility of the RAG prompt engineering technique as it applies to the task of summarization. Evaluating LLMs is an undervalued part of the machine learning (ML) pipeline. It is time-consuming but, at the same time, critical. We benchmark the results with a metric used for evaluating summarization tasks in the field of natural language processing (NLP) called Recall-Oriented Understudy for Gisting Evaluation (ROUGE). These metrics will assess how well a machine-generated summary compares to one or more reference summaries.

Solution overview

In this post, we start with exploring a few of the prompt engineering techniques that will help assess the capabilities and limitations of LLMs for healthcare-specific summarization tasks. For more complex, clinical knowledge-intensive tasks, it’s possible to build a language model–based system that accesses external knowledge sources to complete the tasks. This enables more factual consistency, improves the reliability of the generated responses, and helps to mitigate the propensity that LLMs have to be confidently wrong, called hallucination.

Pre-trained language models

In this post, we experimented with Anthropic’s Claude 3 Sonnet model, which is available on Amazon Bedrock. This model is used for the clinical summarization tasks where we evaluate the few-shot and zero-shot prompting techniques. This post then seeks to assess whether prompt engineering is more performant for clinical NLP tasks compared to the RAG pattern and fine-tuning.

Dataset

The MIMIC Chest X-ray (MIMIC-CXR) Database v2.0.0 is a large publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. We used the MIMIC CXR dataset, which can be accessed through a data use agreement. This requires user registration and the completion of a credentialing process.

During routine clinical care, clinicians trained in interpreting imaging studies (radiologists) will summarize their findings for a particular study in a free-text note. Radiology reports for the images were identified and extracted from the hospital’s electronic health records (EHR) system. The reports were de-identified using a rule-based approach to remove any protected health information.

Because we used only the radiology report text data, we downloaded just one compressed report file (mimic-cxr-reports.zip) from the MIMIC-CXR website. For evaluation, the 2,000 reports (referred to as the ‘dev1’ dataset) from a subset of this dataset and the 2,000 radiology reports (referred to as ‘dev2’) from the chest X-ray collection from the Indiana University hospital network were used.

Techniques and experimentation

Prompt design is the technique of creating the most effective prompt for an LLM with a clear objective. Crafting a successful prompt requires a deeper understanding of the context; it’s the subtle art of asking the right questions to elicit the desired answers. Different LLMs may interpret the same prompt differently, and some may have specific keywords with particular meanings. Also, depending on the task, domain-specific knowledge is crucial in prompt creation. Finding the perfect prompt often involves a trial-and-error process.

Prompt structure

Prompts can specify the desired output format, provide prior knowledge, or guide the LLM through a complex task. A prompt has three main types of content: input, context, and examples. The first of these specifies the information for which the model needs to generate a response. Inputs can take various forms, such as questions, tasks, or entities. The latter two are optional parts of a prompt. Context is providing relevant background to ensure the model understands the task or query, such as the schema of a database in the example of natural language querying. Examples can be something like adding an excerpt of a JSON file in the prompt to coerce the LLM to output the response in that specific format. Combined, these components of a prompt customize the response format and behavior of the model.
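
As a rough illustration of how the three types of content fit together, the sketch below assembles a prompt from a required input plus optional context and examples. The function name, the labels, and the layout of the final string are assumptions for this example, not a format prescribed by Amazon Bedrock or by any particular model.

```python
def build_prompt(input_text, context=None, examples=None):
    """Assemble a prompt from input (required), context and examples (optional)."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    for ex in examples or []:
        # Each example is a dict with hypothetical 'input' and 'output' keys.
        parts.append(f"Example input: {ex['input']}\nExample output: {ex['output']}")
    parts.append(f"Input: {input_text}")
    return "\n\n".join(parts)


prompt = build_prompt(
    "List all patients admitted in May.",
    context="Table admissions(patient_id, admitted_on)",
    examples=[{"input": "Count patients.", "output": "SELECT COUNT(*) FROM admissions;"}],
)
```

Leaving out context and examples reduces this to an input-only prompt, which is the zero-shot case discussed later in the post.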

Prompt templates are predefined recipes for generating prompts for language models. Different templates can be used to express the same concept. Hence, it is essential to carefully design the templates to maximize the capability of a language model. A prompt task is defined by prompt engineering. Once the prompt template is defined, the model generates multiple tokens that can fill a prompt template. For instance, “Generate radiology report impressions based on the following findings and output it within <impression> tags.” In this case, a model can fill the <impression> with tokens.
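
When a template asks for the answer inside <impression> tags, the application still has to pull the text out of the model’s raw response. One possible way to do that is sketched below; extract_impression is a hypothetical helper, and returning None for a response without tags is a design choice made for this example.

```python
import re


def extract_impression(model_output):
    """Return the text between <impression> tags, or None if the tags are absent."""
    match = re.search(r"<impression>(.*?)</impression>", model_output, flags=re.DOTALL)
    return match.group(1).strip() if match else None


raw = "Here is the result.\n<impression>\nNo acute cardiopulmonary process.\n</impression>"
impression = extract_impression(raw)
```

Handling the missing-tag case explicitly makes it easier to count malformed responses during evaluation instead of silently scoring them.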

Zero-shot prompting

Zero-shot prompting means providing a prompt to an LLM without any (zero) examples. With a single prompt and no examples, the model should still generate the desired result. This technique makes LLMs useful for many tasks. We applied the zero-shot technique to generate impressions from the findings section of a radiology report.

In clinical use cases, numerous medical concepts need to be extracted from clinical notes. Meanwhile, very few annotated datasets are available. It’s important to experiment with different prompt templates to get better results. An example zero-shot prompt used in this work is shown in Figure 1.

Figure 1 – Zero-shot prompting

Few-shot prompting

The few-shot prompting technique is used to increase performance compared to the zero-shot technique. Large, pre-trained models have demonstrated remarkable capabilities in solving an abundance of tasks by being provided only a few examples as context. This is known as in-context learning, through which a model learns a task from a few provided examples, specifically during prompting and without tuning the model parameters. In the healthcare domain, this bears great potential to vastly expand the capabilities of existing AI models.

Figure 2 – Few-shot prompting

Few-shot prompting uses a small set of input-output examples to guide the model on specific tasks, without updating its parameters. The benefit of this technique is that it doesn’t require large amounts of labeled data (examples) and performs reasonably well by providing guidance to large language models. In this work, five examples of findings and impressions were provided to the model for few-shot learning, as shown in Figure 2.

Retrieval Augmented Generation pattern

The RAG pattern builds on prompt engineering. Instead of a user providing relevant data, an application intercepts the user’s input. The application searches across a data repository to retrieve content relevant to the question or input. The application feeds this relevant data to the LLM to generate the content. A modern healthcare data strategy enables the curation and indexing of enterprise data. The data can then be searched and used as context for prompts or questions, assisting an LLM in generating responses.
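
The flow described above (intercept the input, retrieve relevant content, feed it to the LLM) can be sketched without any cloud services. In this toy version, retrieval is plain word overlap rather than vector search, and generate is a placeholder callable standing in for the LLM; all names are hypothetical and this is not how Knowledge Bases for Amazon Bedrock works internally.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def retrieve(query, documents, k=1):
    """Rank documents by Jaccard word overlap with the query and return the top k."""
    q = tokenize(query)

    def score(doc):
        d = tokenize(doc)
        return len(q & d) / len(q | d) if q | d else 0.0

    return sorted(documents, key=score, reverse=True)[:k]


def augment(query, retrieved):
    """Build a prompt that carries the query plus numbered search results."""
    numbered = "\n".join(f"{i}. {doc}" for i, doc in enumerate(retrieved, start=1))
    return f"Findings: {query}\n\nSearch results:\n{numbered}"


def answer(query, documents, generate, k=1):
    """Retrieve, augment, then hand the augmented prompt to the generator."""
    return generate(augment(query, retrieve(query, documents, k)))


docs = ["no acute cardiopulmonary process", "left lower lobe pneumonia"]
# The lambda echoes the prompt so the augmentation step is visible.
result = answer("pneumonia in left lower lobe", docs, generate=lambda prompt: prompt)
```

Swapping the retriever for a vector search and the lambda for a model call changes the quality of the output but not the shape of the pipeline.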

To implement our RAG system, we utilized a dataset of 95,000 radiology report findings-impressions pairs as the knowledge source. This dataset was uploaded to an Amazon Simple Storage Service (Amazon S3) data source and then ingested using Knowledge Bases for Amazon Bedrock. We used the Amazon Titan Text Embeddings model on Amazon Bedrock to generate vector embeddings.

Embeddings are numerical representations of real-world objects that ML systems use to understand complex knowledge domains like humans do. The output vector representations were stored in a newly created vector store, an Amazon OpenSearch Serverless vector search collection, for efficient retrieval. This results in a public vector search collection and a vector index set up with the required fields and necessary configurations. With the infrastructure in place, we set up a prompt template and used the RetrieveAndGenerate API for vector similarity search. Then, we used the Anthropic Claude 3 Sonnet model for impression generation. Together, these components enabled both precise document retrieval and high-quality conditional text generation from the findings-to-impressions dataset.
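
Vector similarity search boils down to comparing a query embedding against stored embeddings and keeping the closest ones. The sketch below uses cosine similarity over tiny hand-written vectors; real embeddings have far more dimensions, and the document IDs, vectors, and function names here are invented for illustration. Whether the managed service uses cosine similarity or another distance is not stated in this post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    """Return the IDs of the k stored vectors most similar to the query."""
    ranked = sorted(
        index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True
    )
    return ranked[:k]


# Hypothetical two-dimensional "embeddings" keyed by document ID.
index = {"report-a": [1.0, 0.0], "report-b": [0.7, 0.7], "report-c": [0.0, 1.0]}
nearest = top_k([1.0, 0.1], index, k=2)
```

A production vector store replaces the exhaustive sort with an approximate nearest-neighbor index, but the ranking idea is the same.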

The following reference architecture diagram in Figure 3 illustrates the fully managed RAG pattern with Knowledge Bases for Amazon Bedrock on AWS. The fully managed RAG provided by Knowledge Bases for Amazon Bedrock converts user queries into embeddings, searches the knowledge base, obtains relevant results, augments the prompt, and then invokes an LLM (Claude 3 Sonnet) to generate the response.

Figure 3 – Retrieval Augmented Generation pattern

Prerequisites

You need to have the following to run this demo application:

  • An AWS account
  • Basic understanding of how to navigate Amazon SageMaker Studio
  • Basic understanding of how to download a repo from GitHub
  • Basic knowledge of running a command on a terminal

Key steps in implementation

The following are key details of each technique.

Zero-shot prompting

prompt_zero_shot = """Human: Generate radiology report impressions based on the following findings and output it within <impression> tags. Findings: {} Assistant:"""

Few-shot prompting

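# Build one "H:/A:" pair per example; `examples` is a list of dicts with 'findings' and 'impression' keys.
examples_string = ''
for ex in examples:
    examples_string += f"""H:{ex['findings']}
A:{ex['impression']}\n"""
# The {} placeholder is filled later with the findings to summarize.
prompt_few_shot = """Human: Generate radiology report impressions based on the following findings. Findings: {}
Here are a few examples: """ + examples_string + """
Assistant:"""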

Implementation of Retrieval Augmented Generation

  1. Load the reports into the Amazon Bedrock knowledge base by connecting to the S3 bucket (data source).
  2. The knowledge base will split them into smaller chunks (based on the strategy selected), generate embeddings, and store them in the associated vector store. For detailed steps, refer to the Amazon Bedrock User Guide. We used Amazon Titan Embeddings G1 – Text embedding model for converting the reports data to embeddings.
  3. Once the knowledge base is up and running, locate the knowledge base ID and build the model Amazon Resource Name (ARN) for the Claude 3 Sonnet model using the following code:
import boto3
from botocore.config import Config

kb_id = "XXXXXXXXXX"  # Replace with the ID of your knowledge base
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
# The Region comes from the default Boto3 session and is needed to build the ARN.
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/{model_id}'
  4. Set up the Amazon Bedrock runtime client using the latest version of AWS SDK for Python (Boto3).
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime", config=bedrock_config)
  5. Use the RetrieveAndGenerate API to retrieve the most relevant report from the knowledge base and generate an impression.
# The original snippet showed only the function body; the wrapper name and
# signature below are illustrative.
def retrieve_and_generate(findings, kb_id):
    return bedrock_agent_client.retrieve_and_generate(
        input={
            'text': findings
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'generationConfiguration': {
                    'promptTemplate': {
                        'textPromptTemplate': promptTemplate
                    }
                },
                'knowledgeBaseId': kb_id,
                'modelArn': model_arn,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': 3,
                        'overrideSearchType': 'HYBRID'
                    }
                }
            },
            'type': 'KNOWLEDGE_BASE'
        },
    )
  6. Use the following prompt template along with the query (findings) and retrieval results to generate impressions with the Claude 3 Sonnet LLM.
# $query$ and $search_results$ are placeholders filled in by the service, so a plain string (not an f-string) is used.
promptTemplate = """
You have to generate radiology report impressions based on the following findings. Your job is to generate impression using only information from the search results.
Return only a single sentence and do not return the findings given.

Findings: $query$

Here are the search results in numbered order:
$search_results$ """

Evaluation

Performance analysis

The performance of zero-shot, few-shot, and RAG techniques is evaluated using the ROUGE score. For more details on the definition of various forms of this score, please refer to part 1 of this blog.
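
For readers who have not seen part 1, the idea behind ROUGE-N can be sketched as clipped n-gram overlap between a generated summary and a reference. This is a simplified illustration with whitespace tokenization and no stemming; it is not the scoring library used for the numbers in this post, and the post does not say whether the reported values are precision, recall, or F-measure.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two strings."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped counts of shared n-grams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


generated = "no acute process"
reference = "no acute cardiopulmonary process"
rouge1 = rouge_n(generated, reference, n=1)
rouge2 = rouge_n(generated, reference, n=2)
```

In this toy pair, every generated word appears in the reference, so unigram precision is perfect while recall is penalized for the missing word. The bigram score drops further because dropping one word breaks two of the reference bigrams.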

The following table depicts the evaluation results for the dev1 and dev2 datasets. The evaluation result on dev1 (2,000 findings from the MIMIC CXR Radiology Report) shows that the zero-shot prompting performance was the poorest, whereas the RAG approach for report summarization performed the best. The use of the RAG technique led to substantial gains in performance, improving the aggregated average ROUGE1 and ROUGE2 scores by approximately 18 and 16 percentage points, respectively, compared to the zero-shot prompting method. An approximately 8 percentage point improvement is observed in aggregated ROUGE1 and ROUGE2 scores over the few-shot prompting technique.

| Model    | Technique | dev1 ROUGE1 | dev1 ROUGE2 | dev1 ROUGEL | dev1 ROUGELSum | dev2 ROUGE1 | dev2 ROUGE2 | dev2 ROUGEL | dev2 ROUGELSum |
| Claude 3 | Zero-shot | 0.242       | 0.118       | 0.202       | 0.218          | 0.210       | 0.095       | 0.185       | 0.194          |
| Claude 3 | Few-shot  | 0.349       | 0.204       | 0.309       | 0.312          | 0.439       | 0.273       | 0.351       | 0.355          |
| Claude 3 | RAG       | 0.427       | 0.275       | 0.387       | 0.387          | 0.438       | 0.309       | 0.43        | 0.43           |
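
The percentage-point gains quoted in this section can be recomputed directly from the table. The short sketch below copies the table values and subtracts one technique’s scores from another’s; the data structure and function name are just one convenient arrangement.

```python
# Scores copied from the results table, in the order given by METRICS.
METRICS = ("ROUGE1", "ROUGE2", "ROUGEL", "ROUGELSum")
RESULTS = {
    ("dev1", "zero-shot"): (0.242, 0.118, 0.202, 0.218),
    ("dev1", "few-shot"): (0.349, 0.204, 0.309, 0.312),
    ("dev1", "rag"): (0.427, 0.275, 0.387, 0.387),
    ("dev2", "zero-shot"): (0.210, 0.095, 0.185, 0.194),
    ("dev2", "few-shot"): (0.439, 0.273, 0.351, 0.355),
    ("dev2", "rag"): (0.438, 0.309, 0.43, 0.43),
}


def gain_in_points(dataset, technique, baseline):
    """Percentage-point gain of `technique` over `baseline` for each metric."""
    new, old = RESULTS[(dataset, technique)], RESULTS[(dataset, baseline)]
    return {m: round((a - b) * 100, 1) for m, a, b in zip(METRICS, new, old)}


dev1_gain = gain_in_points("dev1", "rag", "zero-shot")
dev2_gain = gain_in_points("dev2", "rag", "zero-shot")
```

Running the same function with "few-shot" as the baseline gives the RAG versus few-shot comparison.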

For dev2, an improvement of approximately 23 and 21 percentage points is observed in ROUGE1 and ROUGE2 scores of the RAG-based technique over zero-shot prompting. Overall, RAG led to an improvement of approximately 17 percentage points and 24 percentage points in ROUGELsum scores for the dev1 and dev2 datasets, respectively. The distribution of ROUGE scores attained by RAG technique for dev1 and dev2 datasets is shown in the following graphs.

[Figure: distribution of ROUGE scores attained by the RAG technique. Panels: Dataset: dev1, Dataset: dev2]

It is worth noting that RAG attains a consistent average ROUGELSum for both test datasets (dev1=.387 and dev2=.43). This is in contrast to the average ROUGELSum for these two test datasets (dev1=.5708 and dev2=.4525) attained with the fine-tuned FLAN-T5 XL model presented in part 1 of this blog series. Dev1 is a subset of the MIMIC dataset, samples from which have been used as context. With the RAG approach, the median ROUGELSum is observed to be nearly the same for both the dev1 and dev2 datasets.

Overall, RAG is observed to attain good ROUGE scores but falls short of the impressive performance of the fine-tuned FLAN-T5 XL model presented in part 1 of this blog series.

Cleanup

To avoid incurring future charges, delete all the resources you deployed as part of the tutorial.

Conclusion

In this post, we presented how various generative AI techniques can be applied for healthcare-specific tasks. We saw incremental improvement in results for domain-specific tasks as we evaluated and compared prompting techniques and the RAG pattern. We also see how fine-tuning the model to healthcare-specific data is comparatively better, as demonstrated in part 1 of the blog series. We expect to see significant improvements with increased data at scale, more thoroughly cleaned data, and alignment to human preference through instruction tuning or explicit optimization for preferences.

Limitations: This work demonstrates a proof of concept. As we analyzed deeper, hallucinations were observed occasionally.


About the authors

Ekta Walia Bhullar, PhD, is a senior AI/ML consultant with AWS Healthcare and Life Sciences (HCLS) professional services business unit. She has extensive experience in the application of AI/ML within the healthcare domain, especially in radiology. Outside of work, when not discussing AI in radiology, she likes to run and hike.

Priya Padate is a Senior Partner Solutions Architect with extensive expertise in Healthcare and Life Sciences at AWS. Priya drives go-to-market strategies with partners and drives solution development to accelerate AI/ML-based development. She is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.

Dr. Adewale Akinfaderin is a senior data scientist in healthcare and life sciences at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global healthcare customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Srushti Kotak is an Associate Data and ML Engineer at AWS Professional Services. She has a strong data science and deep learning background with experience in developing machine learning solutions, including generative AI solutions, to help customers solve their business challenges. In her spare time, Srushti loves to dance, travel, and spend time with friends and family.

MatterSim: A deep-learning model for materials under real-world conditions

In the quest for groundbreaking materials crucial to nanoelectronics, energy storage, and healthcare, a critical challenge looms: predicting a material’s properties before it is even created. This is no small feat, with any combination of 118 elements in the periodic table, and the range of temperatures and pressures under which materials are synthesized and operated. These factors drastically affect atomic interactions within materials, making accurate property prediction and behavior simulation exceedingly demanding.

Here at Microsoft Research, we developed MatterSim, a deep-learning model for accurate and efficient materials simulation and property prediction over a broad range of elements, temperatures, and pressures to enable in silico materials design. MatterSim employs deep learning to understand atomic interactions from the fundamental principles of quantum mechanics, across a comprehensive spectrum of elements and conditions: from 0 to 5,000 Kelvin (K), and from standard atmospheric pressure to 10,000,000 atmospheres. In our experiments, MatterSim efficiently handles simulations for a variety of materials, including metals, oxides, sulfides, halides, and their various states such as crystals, amorphous solids, and liquids. Additionally, it offers customization options for intricate prediction tasks by incorporating user-provided data.

Figure 1: There are two subfigures. On the left-hand side, atomic structures of 12 materials belonging to metals, oxides, sulfides, halides, and organic molecules are shown. On the right-hand side, the temperature and pressure ranges of materials' application and synthesis are plotted.
Figure 1. MatterSim can model materials properties and behaviors under realistic temperature and pressure conditions for wide ranges of applications.

Simulating materials under realistic conditions across the periodic table

MatterSim’s learning foundation is built on large-scale synthetic data, generated through a blend of active learning, generative models, and molecular dynamics simulations. This data generation strategy ensures extensive coverage of material space, enabling the model to predict energies, atomic forces, and stresses. It serves as a machine-learning force field with a level of accuracy compatible with first-principles predictions. Notably, MatterSim achieves a 10-fold increase in accuracy for material property predictions at finite temperatures and pressures when compared to previous state-of-the-art models. Our research demonstrates its proficiency in simulating a vast array of material properties, including thermal, mechanical, and transport properties, and can even predict phase diagrams.

Figure 2: There are three subfigures. The left panel compares the highest phonon frequency predicted by MatterSim with that from first-principles methods. The two values for each material are very close, leading to a nearly straight line in the parity plot. The middle panel shows the same comparison between MatterSim and first-principles results for the free energies of around 50 materials. The right panel shows the phase diagram of MgO predicted using MatterSim. The x-axis denotes the temperature and the y-axis denotes the pressure. The pressure range of MgO’s B1 phase lies below 500 GPa, and this range decreases as temperature increases. The blue lines show the prediction from MatterSim, which fits well with the shaded region from experimental measurement.
Figure 2. MatterSim achieves high accuracy in predicting mechanical properties, vibrational properties, and phase diagrams of materials, comparable to quantum mechanics and experimental measurements. The figure shows the comparison between the predicted properties and the experimentally measured results.
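
To make the force-field idea above concrete: a force field maps atomic positions to an energy, and the atomic forces are the negative gradient of that energy with respect to the positions. The sketch below is not MatterSim and does not use its code. It assumes a toy Lennard-Jones pair potential as a stand-in for a learned model (all function names are hypothetical) and checks the analytic forces against a finite-difference gradient.

```python
import math

def lj_energy(positions, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy of a small cluster (toy stand-in for a learned model)."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            sr6 = (sigma / r) ** 6
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy

def lj_forces(positions, epsilon=1.0, sigma=1.0):
    """Analytic forces, F_i = -dE/dx_i, accumulated pair by pair."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            sr6 = (sigma / r) ** 6
            # Derivative of the pair energy with respect to the distance r.
            dE_dr = 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r
            for k in range(3):
                unit = (positions[i][k] - positions[j][k]) / r
                forces[i][k] -= dE_dr * unit
                forces[j][k] += dE_dr * unit  # equal and opposite force on atom j
    return forces

def numerical_forces(positions, h=1e-6):
    """Central finite-difference estimate of -dE/dx for every coordinate."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    for i in range(len(positions)):
        for k in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][k] += h
            minus[i][k] -= h
            forces[i][k] = -(lj_energy(plus) - lj_energy(minus)) / (2.0 * h)
    return forces

positions = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [0.0, 1.2, 0.3]]
analytic = lj_forces(positions)
numeric = numerical_forces(positions)
max_error = max(abs(a - b) for fa, fn in zip(analytic, numeric) for a, b in zip(fa, fn))
print(f"max |analytic - numeric| = {max_error:.2e}")
```

A learned force field replaces the hand-written energy function with a trained model, but the same consistency between energies and forces is what makes it usable in molecular dynamics.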

Adapting to complex design tasks

While trained on broad synthetic datasets, MatterSim is also adaptable for specific design requirements by incorporating additional data. The model utilizes active learning and fine-tuning to customize predictions with high data efficiency. For example, simulating water properties, a task seemingly straightforward but computationally intensive, is significantly optimized with MatterSim’s adaptive capability. To match experimental accuracy, the model requires only 3% of the data that traditional methods need; the same accuracy would otherwise require 30 times more resources for a specialized model and exponentially more for first-principles methods.

Figure 3: There are two panels in this figure. The right panel shows the structure of Li2B12H12, a complex material system used for solid-state batteries. This system is used to benchmark the performance of MatterSim. The left panels compare the number of data points needed to train a model from scratch with the number needed to customize MatterSim to achieve the same accuracy. MatterSim requires 3% and 10% of the data for the two tasks compared with training from scratch.
Figure 3. MatterSim achieves high data efficiency, with 90%-97% data savings for complex simulation tasks.

Bridging the gap between atomistic models and real-world measurements

Translating material properties from atomic structures is a complex task, often too intricate for current methods based on statistics, such as molecular dynamics. MatterSim addresses this by mapping these relationships directly through machine learning. It incorporates custom adaptor modules that refine the model to predict material properties from structural data, eliminating the need for intricate simulations. Benchmarking against MatBench, a renowned material property prediction benchmark set, MatterSim demonstrates significant accuracy improvement and outperforms all specialized property-specific models, showcasing its robust capability in direct material property prediction from domain-specific data.

Looking ahead 

As MatterSim research advances, the emphasis is on experimental validation to reinforce its potential role in pivotal sectors, including the design of catalysts for sustainability, energy storage breakthroughs, and nanotechnology advancements. The planned integration of MatterSim with generative AI models and reinforcement learning heralds a new era in the systematic pursuit of novel materials. This synergy is expected to revolutionize the field, streamlining guided creation of materials tailored for diverse applications ranging from semiconductor technologies to biomedical engineering. Such progress promises to expedite material development and bolster sustainable industrial practices, thereby fostering technological advancements that will benefit society. 

The post MatterSim: A deep-learning model for materials under real-world conditions appeared first on Microsoft Research.

Enhanced autoscaling with VASIM: Vertical Autoscaling Simulator Toolkit

This research was presented as a demonstration at the 40th IEEE International Conference on Data Engineering (ICDE 2024), one of the premier conferences on data and information engineering.

Since the inception of cloud computing, autoscaling has been an essential technique for optimizing resources and performance. By dynamically adjusting the number of computing resources allocated to a service based on current demand, autoscaling ensures that the service can handle the load efficiently while optimizing costs. However, developing and fine-tuning autoscaling algorithms, which govern this process, present significant challenges. The complexity and cost associated with testing these algorithms can lead to inefficient resource management and impede the development of more effective autoscaling strategies.

In our paper, “VASIM: Vertical Autoscaling Simulator Toolkit,” presented at ICDE 2024, we introduce a tool designed to address the complexities involved in assessing autoscaling algorithms. While existing simulation tools cover a range of capabilities, such as energy efficiency and fault tolerance, VASIM stands out by evaluating the critical recommender component within the algorithm and suggesting optimal resource scaling actions based on usage data, balancing performance and cost. This enables developers to iterate more rapidly, enhancing algorithmic performance, and improving resource efficiency and cost savings.

VASIM’s user-friendly interface simplifies the evaluation of autoscaling policies, as illustrated in Figure 1. First steps entail uploading historical data and defining autoscaling policies, including the algorithm and its parameters, shown in the left panel. The Simulation Run feature enables the modification of algorithm parameters, imported via a configuration file, and the execution of simulations based on the selected trace. A results screen displays the CPU limits determined by the selected policies as well as the actual CPU usage tailored to these limits. Additionally, VASIM provides fundamental metrics like throttling incidents, number of scaling operations, and amount of unused capacity, or slack, for the current simulation.

[On the left] Image of VASIM user interface. On the left panel, it has options to select from “Simulation Run”, “Simulation Tuning”, “Simulation Tuning History”. Option “Simulation Run” is selected. Below user has loaded a trace from csv file on disk (c_26742_perf_event_log.csv), algorithm C, metadata config json file from disk. Button “Visualize workload” was clicked and loaded trace is displayed. 

[On the right] On the right panel, user picked other parameters for simulation run (lag – how often recommender gives decision and initial core count) and algorithm parameter from json are shown for edit. 

Image of VASIM UI when simulation was run for selected algorithm, trace and parameter setting. It shows a graph with cpu usage in blue and the limit calculated by selected algorithm in red. It is different from the trace plot that was shown before because calculated limits were below cpu utilization, so the latter was cut off. On top of the plot it shows metrics of the simulation like average slack, average insufficient CPU, sum slack, sum insufficient CPU, number of scalings, number of times of insufficient CPU etc.
Figure 1. The VASIM user interface comprises a run simulation pane on the left and a results pane on the right.
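
The metrics named above (slack, throttling incidents, and number of scaling operations) can be illustrated with a few lines of Python. This is a minimal sketch, not VASIM’s implementation: the function name and the exact definitions are assumptions made for illustration, with slack taken as the limit minus the usage when positive, insufficient CPU as the usage minus the limit when positive, and a scaling operation as any change in the limit between consecutive time steps.

```python
def simulation_metrics(usage, limits):
    """Summarize one simulated run.

    `usage` and `limits` are equal-length lists of CPU cores per time step.
    """
    if len(usage) != len(limits) or not usage:
        raise ValueError("usage and limits must be non-empty and the same length")
    slack = [max(l - u, 0.0) for u, l in zip(usage, limits)]
    insufficient = [max(u - l, 0.0) for u, l in zip(usage, limits)]
    return {
        "sum_slack": sum(slack),
        "avg_slack": sum(slack) / len(slack),
        "sum_insufficient_cpu": sum(insufficient),
        # Count time steps where demand exceeded the limit.
        "throttling_incidents": sum(1 for x in insufficient if x > 0),
        # Count changes of the limit between consecutive time steps.
        "num_scalings": sum(1 for a, b in zip(limits, limits[1:]) if a != b),
    }

# Hypothetical five-step trace and the limits some policy chose for it.
usage = [1.0, 2.5, 3.0, 2.0, 1.0]
limits = [2.0, 2.0, 4.0, 4.0, 2.0]
metrics = simulation_metrics(usage, limits)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In this toy trace the policy is throttled once (step two) and scales twice, which shows how the three metrics pull against each other.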

VASIM achieves several important goals:

Resource efficiency and cost reduction. VASIM reduces costs by removing the need to test scaling operations in real time, which would be resource intensive. This enables developers to adjust algorithms iteratively in a controlled, cost-efficient environment, accelerating development cycles. Because the tool allows users to upload CPU performance history and algorithm parameters, it delivers the results of scaling operations across the entire workload in minutes rather than hours.

Multi-objective optimization. It’s challenging to develop an autoscaling method that handles conflicting parameters. VASIM makes this easier by applying Pareto optimization techniques, helping developers to find a balance among key metrics. Figure 2 depicts scatter plots for two metrics: average slack and average insufficient CPU. It also shows three optimization objectives: the optimal amount of slack, throttling, and number of scaling operations.

[On the left] A graph that plots the average slack on the Y axis and the average insufficient cpu on the X axis. It shows that the more average insufficient cpu decreases, the more average slack increases. There are six points in red that are pareto frontier points, all on the very edge of the graph but not too close to each other, showing some possible choices of configuration. 

[On the right] A 3D scatter plot displays the total slack on the X axis, cpu total throttle on the Y axis, and the amount of scalings in Z axis. It shows that as you aim to lower total slack and throttle, the amount of scalings increases.
Figure 2. The 2D diagram on the left shows a scatter plot of tuning with Pareto points. The 3D graph on the right shows a scatter plot with the three objectives.

Recommender algorithm testing. VASIM simplifies the process of testing and evaluating recommendation algorithms across diverse workloads. With all tuning jobs running in parallel, computation occurs more quickly, allowing users to efficiently adjust their recommender parameters as necessary. To assess the algorithm’s generalizability, we ran VASIM against 11 available open cluster traces for benchmarking and internal product workload traces. This enabled us to evaluate the algorithms’ robustness across a variety of workload types, including cyclical, bursty, and monotonic variations, demonstrating their reliability across different scenarios.

Versatility and adaptability. VASIM provides users with the flexibility to modify components, experiment with recommendation strategies, and evaluate the impact of changes in a controlled and customizable environment. Figure 3 shows the results of a simulation run on the same algorithm and historical performance data but with different parameters. This versatility ensures that infrastructure engineers can tailor the system to meet their needs, enhancing the overall effectiveness of their autoscaling strategies.

Figure 3. These graphs show VASIM running an identical algorithm on the same historical data but with varying parameters, affecting slack, throttling, and the frequency of scaling events. The objective is to maintain a minimal gap between the peak and the lowest resource utilization levels—the top of the bottom line and the bottom of the top line, respectively. The goal is also to reduce the space between the response lag indicated by the trailing edges to the left of the lines. Simultaneously, it’s important to minimize the occurrence of scaling events to prevent disruptions in workload execution.
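
The Pareto optimization mentioned under multi-objective optimization can be sketched in a few lines. The example below is an illustration under stated assumptions, not VASIM’s code: the function name and the sample numbers are hypothetical, each tuple is (average slack, average insufficient CPU) for one tuning configuration, and both objectives are minimized.

```python
def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical results of six tuning runs: (avg slack, avg insufficient CPU).
results = [(4.0, 0.1), (3.0, 0.3), (3.5, 0.4), (2.0, 0.8), (2.5, 0.9), (1.0, 1.5)]
front = pareto_front(results)
print(front)
```

Here (3.5, 0.4) and (2.5, 0.9) drop out because another configuration beats them on both metrics. The remaining points are the trade-off choices a developer would pick from.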

Optimizing scalability and costs in Kubernetes environments

Our research on vertically autoscaling monolithic applications with a container-as-a-service algorithm helped us to better understand the tradeoffs between cost and availability that different algorithm variations introduce. Because VASIM is similar to standard autoscaling architecture (as in the Kubernetes Vertical Pod Autoscaler [VPA]), it allows us to test autoscaling algorithms for pods, applications, and virtual machine (VM) capacity. This is possible because these systems share similar components, including resource updaters, controllers, and recommenders. Despite differences in specific systems, their underlying architectures are sufficiently similar, enabling VASIM to effectively mimic them, as shown in Figure 4.

 
The image depicts how VASIM works. It has a Simulation Controller in the middle, which asks Recommender for decisions using one of the algorithms, Simulation Scaler with a scale function, Cloud State Provider to get traces and use them for time simulation, Analyzer to get metrics after each run. Params Tuning Controller tells Simulation Controller to run for every tuning setting and calls Analyzer to get pareto front to find tradeoff between multiple goals after multiple configs were evaluated. Recommender also needs data from Cloud State Provider to access historical data.
Figure 4. VASIM architecture mimics the main components of general autoscaling architectures, allowing users to parametrize those modules to fit their specific needs.
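
The controller-recommender loop described above can be illustrated with a small simulation. This is a rough sketch, not the VASIM codebase: the function names, the peak-plus-headroom rule, and the parameter defaults are all assumptions chosen for illustration. The controller replays a usage trace, asks the recommender for a new limit every `lag` steps, and the recommender sees only past data.

```python
import math

def recommend(history, headroom=0.2):
    """Hypothetical recommender: recent peak usage plus headroom, rounded up to whole cores."""
    return max(1, math.ceil(max(history) * (1.0 + headroom)))

def run_simulation(trace, lag=3, window=3, initial_cores=2):
    """Replay `trace` (CPU cores used per step) and return the limit in force at each step."""
    limit = initial_cores
    limits = []
    for t in range(len(trace)):
        if t > 0 and t % lag == 0:
            history = trace[max(0, t - window):t]  # the recommender sees only the past
            limit = recommend(history)             # the scaler applies the decision
        limits.append(limit)
    return limits

# Hypothetical trace with a burst in the middle.
trace = [1.0, 1.2, 1.5, 3.0, 3.2, 2.8, 1.0, 0.9, 0.8]
limits = run_simulation(trace)
print(limits)
```

With these settings the limit stays at 2 cores through the burst and rises to 4 only after the burst has passed, which is the kind of lag-induced throttling and slack that a simulator makes visible before an algorithm is deployed.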
 

Implications and looking ahead

Looking forward, we plan to broaden the scope of VASIM’s support beyond just CPUs to include a wide range of resources, such as memory, disk I/O, and network bandwidth. This expansion will provide future users with a comprehensive understanding of system performance and enable them to make more accurate decisions regarding system management and resource optimization. Additionally, a deeper understanding of system performance will help inform proactive optimization strategies focused on maximizing system efficiency and performance.

The post Enhanced autoscaling with VASIM: Vertical Autoscaling Simulator Toolkit appeared first on Microsoft Research.

Drug Discovery, STAT! NVIDIA, Recursion Speed Pharma R&D With AI Supercomputer

Described as the largest system in the pharmaceutical industry, BioHive-2 at the Salt Lake City headquarters of Recursion debuts today at No. 35, up more than 100 spots from its predecessor on the latest TOP500 list of the world’s fastest supercomputers.

The advance represents the company’s most recent effort to accelerate drug discovery with NVIDIA technologies.

“Just as with large language models, we see AI models in the biology domain improve performance substantially as we scale our training with more data and compute horsepower, which ultimately leads to greater impacts on patients’ lives,” said Recursion’s CTO, Ben Mabey, who’s been applying machine learning to healthcare for more than a decade.

BioHive-2 packs 504 NVIDIA H100 Tensor Core GPUs linked on an NVIDIA Quantum-2 InfiniBand network to deliver 2 exaflops of AI performance. The resulting NVIDIA DGX SuperPOD is nearly 5x faster than Recursion’s first-generation system, BioHive-1.

Performance Powers Through Complexity

That performance is key to rapid progress because “biology is insanely complex,” Mabey said.

Finding a new drug candidate can take scientists years performing millions of wet-lab experiments.

That work is vital; Recursion’s scientists run more than 2 million such experiments a week. But going forward, they’ll use AI models on BioHive-2 to direct their platform to the most promising biology areas to run their experiments.

“With AI in the loop today, we can get 80% of the value with 40% of the wet lab work, and that ratio will improve going forward,” he said.

Biological Data Propels Healthcare AI

Recursion is collaborating with biopharma companies such as Bayer AG, Roche and Genentech. Over time, it also amassed a more than 50-petabyte database of biological, chemical and patient data, helping it build powerful AI models that are accelerating drug discovery.

“We believe it’s one of the largest biological datasets on Earth — it was built with AI training in mind, intentionally spanning biology and chemistry,” said Mabey, who joined the company more than seven years ago in part due to its commitment to building such a dataset.

Creating an AI Phenomenon

Processing that data on BioHive-1, Recursion developed a family of foundation models called Phenom. They turn a series of microscopic cellular images into meaningful representations for understanding the underlying biology.

A member of that family, Phenom-Beta, is now available as a cloud API and the first third-party model on NVIDIA BioNeMo, a generative AI platform for drug discovery.

Over several months of research and iteration, BioHive-1 trained Phenom-1 using more than 3.5 billion cellular images. Recursion’s expanded system enables training even more powerful models with larger datasets in less time.

The company also used NVIDIA DGX Cloud, hosted by Oracle Cloud Infrastructure, to provide additional supercomputing resources to power their work.

Animation of how Recursion trains AI models for drug discovery on NVIDIA GPUs
Much like how LLMs are trained to generate missing words in a sentence, Phenom models are trained by asking them to generate the masked out pixels in images of cells.
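
The masked-pixel objective described in the caption above can be sketched without any deep-learning library. The example below is only an illustration of the idea and not Recursion’s training code: the patch size, mask ratio, function names, and the placeholder "model" (which predicts the mean of the visible pixels) are all assumptions. The key point is that the loss is computed only over the pixels that were hidden.

```python
import random

def mask_patches(image, patch=2, mask_ratio=0.5, seed=0):
    """Hide a random subset of patch-by-patch tiles; return the masked copy and the hidden tiles."""
    rng = random.Random(seed)
    rows, cols = len(image), len(image[0])
    tiles = [(r, c) for r in range(0, rows, patch) for c in range(0, cols, patch)]
    hidden = set(rng.sample(tiles, int(len(tiles) * mask_ratio)))
    masked = [row[:] for row in image]
    for r0, c0 in hidden:
        for r in range(r0, min(r0 + patch, rows)):
            for c in range(c0, min(c0 + patch, cols)):
                masked[r][c] = None  # None marks a masked-out pixel
    return masked, hidden

def masked_mse(image, masked, prediction):
    """Mean squared error over the hidden pixels only."""
    errors = [(prediction[r][c] - image[r][c]) ** 2
              for r in range(len(image)) for c in range(len(image[0]))
              if masked[r][c] is None]
    return sum(errors) / len(errors)

# Toy 4x4 "cell image" with made-up intensities.
image = [[float((r * 4 + c) % 7) for c in range(4)] for r in range(4)]
masked, hidden = mask_patches(image)

# Placeholder model: fill every pixel with the mean of the visible ones.
visible = [v for row in masked for v in row if v is not None]
baseline = sum(visible) / len(visible)
prediction = [[baseline] * 4 for _ in range(4)]
loss = masked_mse(image, masked, prediction)
print(f"hidden tiles: {sorted(hidden)}, reconstruction loss: {loss:.3f}")
```

A real model would replace the placeholder with a network trained to drive this loss down across billions of images.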

The Phenom-1 model serves Recursion and its partners in several ways, including finding and optimizing molecules to treat a variety of diseases and cancers. Earlier models helped Recursion predict drug candidates for COVID-19 nine out of 10 times.

The company announced its collaboration with NVIDIA in July. Less than 30 days later, the combination of BioHive-1 and DGX Cloud screened and analyzed a massive chemical library to predict protein targets for approximately 36 billion chemical compounds.

In January, the company demonstrated LOWE, an AI workflow engine with a natural-language interface to help make its tools more accessible to scientists. And in April it described a billion-parameter AI model it built to provide a new way to predict the properties of key molecules of interest in healthcare.

Recursion uses NVIDIA software to optimize its systems.

“We love CUDA and NVIDIA AI Enterprise, and we’re looking to see if NVIDIA NIM can help us distribute our models more easily, both internally and to partners,” he said.

A Shared Vision for Healthcare

The efforts are part of a broad vision that Jensen Huang, NVIDIA founder and CEO, described in a fireside chat with Recursion’s chairman as moving toward simulating biology.

“You can now recognize and learn the language of almost anything with structure, and you can translate it to anything with structure … This is the generative AI revolution,” Huang said.

“We share a similar view,” said Mabey.

“We are in the early stages of a very interesting time where just as computers accelerated chip design, AI can speed up drug design. Biology is much more complex, so it will take years to play out, but looking back, people will see this was a real turning point in healthcare,” he added.

Learn about NVIDIA’s AI platform for healthcare and life sciences and subscribe to NVIDIA healthcare news.

Pictured at top: BioHive-2 with a few members of the Recursion team (from left) Paige Despain, John Durkin, Joshua Fryer, Jesse Dean, Ganesh Jagannathan, Chris Gibson, Lindsay Ellinger, Michael Secora, Alex Timofeyev, and Ben Mabey. 

NVIDIA Blackwell Platform Pushes the Boundaries of Scientific Computing

Quantum computing. Drug discovery. Fusion energy. Scientific computing and physics-based simulations are poised to make giant steps across domains that benefit humanity as advances in accelerated computing and AI drive the world’s next big breakthroughs.

NVIDIA unveiled at GTC in March the NVIDIA Blackwell platform, which promises generative AI on trillion-parameter large language models (LLMs) at up to 25x less cost and energy consumption than the NVIDIA Hopper architecture.

Blackwell has powerful implications for AI workloads, and its technology capabilities can also help to deliver discoveries across all types of scientific computing applications, including traditional numerical simulation.

By reducing energy costs, accelerated computing and AI drive sustainable computing. Many scientific computing applications already benefit. Weather can be simulated at 200x lower cost and with 300x less energy, while digital twin simulations have 65x lower cost and 58x less energy consumption versus traditional CPU-based systems and others.

Multiplying Scientific Computing Simulations With Blackwell

Scientific computing and physics-based simulation often rely on what’s known as double-precision floating-point formats, or FP64, to solve problems. Blackwell GPUs deliver 30% more FP64 and FP32 FMA (fused multiply-add) performance than Hopper.
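
Why simulations lean on FP64 can be seen in a small experiment. The sketch below runs on a CPU in plain Python and says nothing about GPU performance. It emulates FP32 by rounding the running total to single precision after every addition (an assumption made so the example needs only the standard library), then compares the accumulated error with ordinary FP64 arithmetic.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, n, fp32=False):
    """Add `step` to a running total `n` times, optionally rounding to FP32 each time."""
    total = 0.0
    for _ in range(n):
        total += step
        if fp32:
            total = to_fp32(total)
    return total

n = 100_000
exact = 10_000.0  # 100,000 additions of 0.1 in exact arithmetic
err64 = abs(accumulate(0.1, n) - exact)
err32 = abs(accumulate(to_fp32(0.1), n, fp32=True) - exact)
print(f"FP64 error: {err64:.3e}")
print(f"FP32 error: {err32:.3e}")
```

Long-running simulations perform far more than 100,000 dependent operations, so rounding drift of this kind is one reason double precision remains the default for many solvers.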

Physics-based simulations are critical to product design and development. From planes and trains to bridges, silicon chips and pharmaceuticals — testing and improving products in simulation saves researchers and developers billions of dollars.

Today application-specific integrated circuits (ASICs) are designed almost exclusively on CPUs in a long and complex workflow, including analog analysis to identify voltages and currents.

But that’s changing. The Cadence SpectreX simulator is one example of an analog circuit design solver. SpectreX circuit simulations are projected to run 13x quicker on a GB200 Grace Blackwell Superchip — which connects Blackwell GPUs and Grace CPUs — than on a traditional CPU.

Also, GPU-accelerated computational fluid dynamics, or CFD, has become a key tool. Engineers and equipment designers use it to predict the behavior of designs. Cadence Fidelity runs CFD simulations that are projected to run as much as 22x faster on GB200 systems than on traditional CPU-powered systems. With parallel scalability and 30TB of memory per GB200 NVL72 rack, it’s possible to capture flow details like never before.

In another application, Cadence Reality’s digital twin software can be used to create a virtual replica of a physical data center, including all its components — servers, cooling systems and power supplies. Such a virtual model allows engineers to test different configurations and scenarios before implementing them in the real world, saving time and costs.

Cadence Reality’s magic happens from physics-based algorithms that can simulate how heat, airflow and power usage affect data centers. This helps engineers and data center operators to more effectively manage capacity, predict potential operational problems and make informed decisions to optimize the layout and operation of the data center for improved efficiency and capacity utilization. With Blackwell GPUs, these simulations are projected to run up to 30x faster than with CPUs, offering accelerated timelines and higher energy efficiency.

AI for Scientific Computing

New Blackwell accelerators and networking will deliver leaps in performance for advanced simulation.

The NVIDIA GB200 kicks off a new era for high-performance computing (HPC). Its architecture sports a second-generation transformer engine optimized to accelerate inference workloads for LLMs.

This delivers a 30x speedup on resource-intensive applications like the 1.8-trillion-parameter GPT-MoE (generative pretrained transformer-mixture of experts) model compared to the H100 generation, unlocking new possibilities for HPC. By enabling LLMs to process and decipher vast amounts of scientific data, HPC applications can sooner reach valuable insights that can accelerate scientific discovery.

Sandia National Laboratories is building an LLM copilot for parallel programming. Traditional AI can generate basic serial computing code efficiently, but when it comes to parallel computing code for HPC applications, LLMs can falter. Sandia researchers are tackling this issue head-on with an ambitious project — automatically generating parallel code in Kokkos, a specialized programming language designed by multiple national labs for running tasks across tens of thousands of processors in the world’s most powerful supercomputers.

Sandia is using an AI technique known as retrieval-augmented generation, or RAG, which combines information-retrieval capabilities with language generation models. The team is creating a Kokkos database and integrating it with AI models using RAG.
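
The retrieval half of RAG can be sketched in a few lines. The example below is not Sandia’s system: the three-sentence corpus, the bag-of-words cosine similarity, and the function names are assumptions chosen so the example runs with only the standard library. A production setup would typically use learned embeddings and a vector database, then pass the assembled prompt to a language model.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus standing in for a documentation database.
DOCS = [
    "Kokkos::parallel_for runs a loop body in parallel over an index range.",
    "Kokkos::parallel_reduce performs a parallel reduction that combines per-iteration values into one result.",
    "Kokkos::View is a multidimensional array abstraction for host and device memory.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]

def build_prompt(query, docs, k=1):
    """Prepend the retrieved context to the task, ready to send to a language model."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nTask: {query}"

query = "Write a parallel reduction that sums an array"
top = retrieve(query, DOCS)
prompt = build_prompt(query, DOCS)
print(prompt)
```

Grounding the prompt in retrieved reference text is what lets a general-purpose model produce code for a specialized framework it saw little of during training.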

Initial results are promising. Different RAG approaches from Sandia have demonstrated autonomously generated Kokkos code for parallel computing applications. By overcoming hurdles in AI-based parallel code generation, Sandia aims to unlock new possibilities in HPC across leading supercomputing facilities worldwide. Other examples include renewables research, climate science and drug discovery.

Driving Quantum Computing Advances

Quantum computing unlocks a time machine trip for fusion energy, climate research, drug discovery and many more areas. So researchers are hard at work simulating future quantum computers on NVIDIA GPU-based systems and software to develop and test quantum algorithms faster than ever.

The NVIDIA CUDA-Q platform enables both simulation of quantum computers and hybrid application development with a unified programming model for CPUs, GPUs and QPUs (quantum processing units) working together.
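
What simulating a quantum computer means on classical hardware can be shown with a tiny state-vector simulator. The sketch below is plain Python and does not use the CUDA-Q API; the function names are hypothetical. It prepares a two-qubit Bell state by applying a Hadamard gate and then a CNOT gate to the amplitudes directly.

```python
import math

def apply_single_qubit_gate(state, gate, target, n_qubits):
    """Apply a 2x2 gate to `target` (qubit 0 is the least significant bit of the index)."""
    new_state = list(state)
    for i in range(2 ** n_qubits):
        if ((i >> target) & 1) == 0:
            j = i | (1 << target)  # partner index with the target bit set
            a, b = state[i], state[j]
            new_state[i] = gate[0][0] * a + gate[0][1] * b
            new_state[j] = gate[1][0] * a + gate[1][1] * b
    return new_state

def apply_cnot(state, control, target, n_qubits):
    """Flip `target` wherever `control` is 1 by swapping the paired amplitudes."""
    new_state = list(state)
    for i in range(2 ** n_qubits):
        if ((i >> control) & 1) == 1 and ((i >> target) & 1) == 0:
            j = i | (1 << target)
            new_state[i], new_state[j] = state[j], state[i]
    return new_state

s = 1.0 / math.sqrt(2.0)
HADAMARD = [[s, s], [s, -s]]

state = [1.0, 0.0, 0.0, 0.0]  # amplitudes of |00>, |01>, |10>, |11>
state = apply_single_qubit_gate(state, HADAMARD, target=0, n_qubits=2)
state = apply_cnot(state, control=0, target=1, n_qubits=2)
probabilities = [abs(a) ** 2 for a in state]
print(probabilities)
```

The state vector holds 2^n amplitudes for n qubits, so memory and work double with every added qubit. That growth is why large simulations are run on GPU systems.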

CUDA-Q is speeding simulations in chemistry workflows for BASF, high-energy and nuclear physics for Stony Brook and quantum chemistry for NERSC.

NVIDIA Blackwell architecture will help drive quantum simulations to new heights. Utilizing the latest NVIDIA NVLink multi-node interconnect technology helps shuttle data faster for speedup benefits to quantum simulations.

Accelerating Data Analytics for Scientific Breakthroughs 

Data processing with RAPIDS is popular for scientific computing. Blackwell introduces a hardware decompression engine to decompress compressed data and speed up analytics in RAPIDS.

The decompression engine provides performance improvements up to 800GB/s and enables Grace Blackwell to perform 18x faster than CPUs — on Sapphire Rapids — and 6x faster than NVIDIA H100 Tensor Core GPUs for query benchmarks.

Rocketing data transfers with 8TB/s of high-memory bandwidth and the Grace CPU high-speed NVLink Chip-to-Chip (C2C) interconnect, the engine speeds up the entire process of database queries. Yielding top-notch performance across data analytics and data science use cases, Blackwell speeds data insights and reduces costs.

Driving Extreme Performance for Scientific Computing with NVIDIA Networking

The NVIDIA Quantum-X800 InfiniBand networking platform offers the highest throughput for scientific computing infrastructure.

It includes NVIDIA Quantum Q3400 and Q3200 switches and the NVIDIA ConnectX-8 SuperNIC, together hitting twice the bandwidth of the prior generation. The Q3400 platform offers 5x higher bandwidth capacity and 14.4Tflops of in-network computing with NVIDIA’s scalable hierarchical aggregation and reduction protocol (SHARPv4), providing a 9x increase compared with the prior generation.

The performance leap and power efficiency translate to significant reductions in workload completion time and energy consumption for scientific computing.

Learn more about NVIDIA Blackwell.

Generating Science: NVIDIA AI Accelerates HPC Research

Generative AI is taking root at national and corporate labs, accelerating high-performance computing for business and science.

Researchers at Sandia National Laboratories aim to automatically generate code in Kokkos, a parallel programming language designed for use across many of the world’s largest supercomputers.

It’s an ambitious effort. The specialized language, developed by researchers from several national labs, handles the nuances of running tasks across tens of thousands of processors.

Sandia is employing retrieval-augmented generation (RAG) to create and link a Kokkos database with AI models. As researchers experiment with different RAG approaches, initial tests show promising results.

Cloud-based services like NeMo Retriever are among the RAG options the scientists will evaluate.

“NVIDIA provides a rich set of tools to help us significantly accelerate the work of our HPC software developers,” said Robert Hoekstra, a senior manager of extreme scale computing at Sandia.

Building copilots via model tuning and RAG is just a start. Researchers eventually aim to employ foundation models trained with scientific data from fields such as climate, biology and material science.

Getting Ahead of the Storm

Researchers and companies in weather forecasting are embracing CorrDiff, a generative AI model that’s part of NVIDIA Earth-2, a set of services and software for weather and climate research.

CorrDiff can refine the 25-kilometer resolution of traditional atmosphere models down to 2 kilometers and expand by more than 100x the number of forecasts that can be combined to improve confidence in predictions.
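To put the resolution figures in perspective, the short Python calculation below works out what moving from 25-kilometer to 2-kilometer grid spacing implies. It assumes a uniform horizontal grid, which is a simplification for illustration and not a description of CorrDiff's actual grid. The function name `refinement_factors` is hypothetical.

```python
from typing import Tuple


def refinement_factors(coarse_km: float, fine_km: float) -> Tuple[float, float]:
    """Return (linear factor, horizontal cell-count factor) for a uniform grid."""
    if coarse_km <= 0 or fine_km <= 0:
        raise ValueError("grid spacings must be positive")
    linear = coarse_km / fine_km  # how many fine cells fit along one coarse cell edge
    return linear, linear ** 2    # area scales with the square of the edge ratio


if __name__ == "__main__":
    linear, cells = refinement_factors(25.0, 2.0)
    print(f"{linear}x finer per axis, {cells}x more horizontal grid cells")
```

Under that assumption, each axis becomes 12.5 times finer, so every coarse grid cell corresponds to roughly 156 fine cells.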

“It’s a promising innovation … We plan to leverage such models in our global and regional AI forecasts for richer insights,” said Tom Gowan, machine learning and modeling lead for Spire, a company in Vienna, Va., that collects data from its own network of tiny satellites.

Generative AI enables faster, more accurate forecasts, he said in a recent interview.

“It really feels like a big jump in meteorology,” he added. “And by partnering with NVIDIA, we have access to the world’s best GPUs that are the most reliable, fastest and most efficient ones for both training and inference.”

Graphic showing Spire weather forecast

Switzerland-based Meteomatics recently announced it also plans to use NVIDIA’s generative AI platform for its weather forecasting business.

“Our work with NVIDIA will help energy companies maximize their renewable energy operations and increase their profitability with quick and accurate insight into weather fluctuations,” said Martin Fengler, founder and CEO of Meteomatics.

Generating Genes to Improve Healthcare

At Argonne National Laboratory, scientists are using the technology to generate gene sequences that help them better understand the virus behind COVID-19. Their award-winning models, called GenSLMs, spawned simulations that closely resemble real-world variants of SARS-CoV-2.

“Understanding how different parts of the genome are co-evolving gives us clues about how the virus may develop new vulnerabilities or new forms of resistance,” Arvind Ramanathan, a lead researcher, said in a blog post.

GenSLMs were trained on more than 110 million genome sequences with NVIDIA A100 Tensor Core GPU-powered supercomputers, including Argonne’s Polaris system, the U.S. Department of Energy’s Perlmutter and NVIDIA’s Selene.

Microsoft Proposes Novel Materials

Microsoft Research showed how generative AI can accelerate work in materials science.

Their MatterGen model generates novel, stable materials that exhibit desired properties. The approach enables specifying chemical, magnetic, electronic, mechanical and other desired properties.

“We believe MatterGen is an important step forward in AI for materials design,” the Microsoft Research team wrote of the model they trained on Azure AI infrastructure with NVIDIA A100 GPUs.

Companies such as Carbon3D are already finding opportunities, applying generative AI to materials science in commercial 3D printing operations.

It’s just the beginning of what researchers will be able to do for HPC and science with generative AI. The NVIDIA H200 Tensor Core GPUs available now and the upcoming NVIDIA Blackwell architecture GPUs will take their work to new levels.

Learn more about tools like NVIDIA Modulus, a key component in the Earth-2 platform for building AI models that obey the laws of physics, and NVIDIA Megatron-Core, a NeMo library to tune and train large language models.
