Abstracts: December 12, 2023

Microsoft Research Podcast: Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Principal Research Manager Tao Qin and Senior Researcher Lijun Wu discuss “FABind: Fast and Accurate Protein-Ligand Binding.” The paper, accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), introduces a new method for predicting the binding structures of proteins and ligands during drug development. The method demonstrates improved speed and accuracy over current methods.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Tao Qin, a Senior Principal Research Manager, and Dr. Lijun Wu, a Senior Researcher, both from Microsoft Research. Drs. Qin and Wu are coauthors of a paper titled “FABind: Fast and Accurate Protein-Ligand Binding,” and this paper—which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS—is available now on arXiv. Tao Qin, Lijun Wu, thanks for joining us on Abstracts.

LIJUN WU: Thanks. 

TAO QIN: Yeah, thank you. Yeah, it’s great to be here and to share our latest research. 

HUIZINGA: So, Tao, let’s start off with you. In a couple sentences, tell us what issue or problem your research addresses and, more importantly, why people should care about it.


QIN: Yeah, uh, we work on the problem of molecular docking, a computational modeling method used to predict the preferred orientation of one molecule when it binds to a second molecule to form a stable complex. So it aims to predict the binding pose of a ligand in the active site of a receptor and estimate the ligand-receptor binding affinity. This problem is very important for drug discovery and development. Accurately predicting binding poses can provide insights into how a drug candidate might bind to its biological target and whether it is likely to have the desired therapeutic effect. To make an analogy, just like a locker and a key, protein target is a locker, while the ligand is a key. We should carefully design the structure of the key so that it can perfectly fit into the locker. Similarly, the molecular structure should be accurately constructed so that the protein can be well bonded. Then the protein function would be activated or inhibited. Molecular docking is used intensively in the early stages of drug design and discovery to screen a large library of hundreds of thousands of compounds to identify promising lead compounds. It helps eliminate poor candidates and focus on experimental results of those most likely to bind to the target protein well. So clearly, improving the accuracy and also the speed of docking methods, like what we have done in this work, could accelerate the development of new life-saving drugs. 

HUIZINGA: So, Lijun, tell us how your approach builds on and/or differs from what’s been done previously in this field. 

WU: Sure, thanks, yeah. So conventional protein-ligand docking methods, they usually take the sampling and scoring ways. So … which … that means, they will use first some sampling methods to generate multiple protein-ligand docking poses as candidates. And then we will use some scoring functions to evaluate these candidates and select from them and to choose the best ones. So such as DiffDock, a very recent work developed by MIT, which is a very strong model to use the diffusion algorithm to do the sampling in this kind of way. And this kind of method, I say the sampling and scoring methods, they are accurate with good predictions, but of course, they are very slow. So this is a very big limitation because the sampling process usually takes a lot of time. So some other methods such as EquiBind or TANKBind, they treat the docking prediction as a regression task, which is to use deep networks to directly predict the coordinates of the atoms in the molecule. Obviously, this kind of method is much faster than the sampling methods, but the prediction accuracy is usually worse. So therefore, our FABind, which … aims to provide a both fast and accurate method for the docking problem. FABind keeps its fast prediction by modeling in a regression way, and also, we utilize some novel designs to improve its prediction accuracy. 

HUIZINGA: So, Lijun, let’s stay with you for a minute. Regarding your research strategy on this, uh, how would you describe your methodology, and how did you go about conducting this research? 

WU: OK, sure. So when we’re talking about the detailed method, we actually build an end-to-end deep learning framework, FABind, here. So for the protein-ligand docking, FABind divides the docking task as a pocket prediction process and also a pose prediction process. But importantly, we unify these two processes within a single deep learning model, which is a very novel equivariant graph neural network. Here, the pocket means a local part of the whole protein, which are some specific amino acids that can bind to the molecule in the structure space. So simply speaking, this novel graph neural network is stacked by some identity graph neural networks. And the graph neural layer is carefully designed by us, and we use the first graph layer for the pocket prediction and the later layers to do the pose prediction. And for each layer, there are some message passing operations we designed. The first one is an independent message passing, which is to update the information within the protein molecule itself. And the second one is the cross-attention message passing, which is to update the information between the whole protein and also the whole molecule so we can then let each other have a global view. And the last one is an interfacial message passing, which is to do the update, and we can message pass the information between the close nodes between the protein and the molecule. So besides, there are also some small points that will help to get an accurate docking model. For example, we use a scheduled training technique to bridge the gap between the training and the inference stages. And also, we combine direct coordinate prediction and also the distance map refinement as our optimization method.

HUIZINGA: Well, listen, I want to stay with you even more because you’re talking about the technical specifications of your research methodology. Let’s talk about results. What were your major findings on the performance of FABind?

WU: Yeah, the results are very promising. So first we need to care about the docking performance, which is the accuracy of the, uh, docking pose prediction. We compare our FABind to different baselines such as EquiBind, TANKBind, and also, I talked before about the recent strong model DiffDock, developed by MIT. So the results showed that our docking prediction accuracy are very good. They achieve a very competitive performance to the DiffDock like that. But specifically, we need to talk about that the speed is very important. When compared to DiffDock, we achieved about 170 times faster speed than DiffDock. So this is very promising. Besides, the interesting thing is that we found our FABind can achieve very, very strong performance on the unseen protein targets, which means that the protein structure that we have never seen before during the training, we can achieve very good performance. So our FABind achieves significantly better performance with about 10 percent to 40 percent accuracy improvement than DiffDock. This performance demonstrates that the practical effectiveness of our work is very promising since such kinds of new proteins are the most important ones that we need to care for a new disease. 

HUIZINGA: Tao, this is all fascinating, but talk about real-world significance for this work. Who does it help most and how? 

QIN: Yeah. As Lijun has introduced, FABind significantly outperforms earlier methods in terms of speed while maintaining competitive accuracy. This fast prediction capability is extremely important in real-world applications, where high-throughput virtual screening for compound selection is often required for drug discovery. So an efficient virtual screening process can significantly accelerate the drug discovery process. Furthermore, our method demonstrates great performance on unseen or new proteins, which indicates that our FABind possesses a strong generalization ability. This is very important. Consider the case of SARS-CoV-2, for example, where our knowledge of the protein target is very limited at the beginning of the pandemic. So if we have a robust docking model that can generalize to new proteins, we could conduct a large-scale virtual screening and, uh, confidently select potentially effective ligands. This would greatly speed up the development of new treatments. 

HUIZINGA: So downstream from the drug discovery science, benefits would accrue to people who have diseases and need treatment for those things. 

QIN: Yes, exactly. 

HUIZINGA: OK, well, Tao, let’s get an elevator pitch in here, sort of one takeaway, a golden nugget, uh, that you’d like our listeners to take away from this work. If, if there was one thing you wanted them to take away from the work, what would it be? 

QIN: Yeah, uh, thanks for a great question. So I think one sentence for takeaway is that if for some researchers, they are utilizing molecular docking and they are seeking an AI-based approach, our FABind method definitely should be in their consideration list, especially considering the exceptional predictive accuracy and the high computational efficiency of our method.

HUIZINGA: Finally, Tao, what are the big questions and problems that remain in this area, and what’s next on your research agenda? 

QIN: Actually, there are multiple unaddressed questions along this direction, so I think those are all opportunities for further exploration. So here I just give three examples. First, our method currently tackles rigid docking, where the target protein structure is assumed to be fixed, leaving only the ligand structure to be predicted. However, in a more realistic scenario, the protein is dynamic during molecular binding. So therefore, exploring flexible docking becomes an essential aspect. Second, our approach assumes that the target protein has only one binding pocket. In reality, a target protein may have multiple binding pockets. So this situation will be more challenging. So how to address such kind of significant challenge is worth exploration. Third, in the field of drug design, sometimes we need to find a target or we need to find a drug compound that can bind with multiple target proteins. In this work, we only consider a single target protein. So the accurate prediction of docking for multiple target proteins poses a great challenge. 

HUIZINGA: Well, Tao Qin and Lijun Wu, thank you for joining us today. And to our listeners, thanks for tuning in.  

[MUSIC PLAYS] 

If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts.

[MUSIC FADES]

Read More

Create a web UI to interact with LLMs using Amazon SageMaker JumpStart

The launch of ChatGPT and the rise in popularity of generative AI have captured the imagination of customers who are curious about how they can use this technology to create new products and services on AWS, such as enterprise chatbots, which are more conversational. This post shows you how you can create a web UI, which we call Chat Studio, to start a conversation and interact with foundation models available in Amazon SageMaker JumpStart such as Llama 2, Stable Diffusion, and other models available on Amazon SageMaker. After you deploy this solution, users can get started quickly and experience the capabilities of multiple foundation models in conversational AI through a web interface.

Chat Studio can also optionally invoke the Stable Diffusion model endpoint to return a collage of relevant images and videos if the user requests media to be displayed. This feature can help enhance the user experience by using media as accompanying assets to the response. This is just one example of how you can enrich Chat Studio with additional integrations to meet your goals.

The following screenshots show examples of what a user query and response look like.

Chat Studio query interface

Chat Studio response interface

Large language models

Generative AI chatbots such as ChatGPT are powered by large language models (LLMs), which are based on a deep learning neural network that can be trained on large quantities of unlabeled text. The use of LLMs allows for a better conversational experience that closely resembles interactions with real humans, fostering a sense of connection and improved user satisfaction.

SageMaker foundation models

In 2021, the Stanford Institute for Human-Centered Artificial Intelligence termed some LLMs as foundation models. Foundation models are pre-trained on a large and broad set of general data and are meant to serve as the foundation for further optimizations in a wide range of use cases, from generating digital art to multilingual text classification. These foundation models are popular with customers because training a new model from scratch takes time and can be expensive. SageMaker JumpStart provides access to hundreds of foundation models maintained by third-party open-source and proprietary providers.

Solution overview

This post walks through a low-code workflow for deploying pre-trained and custom LLMs through SageMaker, and creating a web UI to interface with the models deployed. We cover the following steps:

  1. Deploy SageMaker foundation models.
  2. Deploy AWS Lambda and AWS Identity and Access Management (IAM) permissions using AWS CloudFormation.
  3. Set up and run the user interface.
  4. Optionally, add other SageMaker foundation models. This step extends Chat Studio’s capability to interact with additional foundation models.
  5. Optionally, deploy the application using AWS Amplify. This step deploys Chat Studio to the web.

Refer to the following diagram for an overview of the solution architecture.

Chat Studio Solution Architecture

Prerequisites

To walk through the solution, you must have the following prerequisites:

  • An AWS account with sufficient IAM user privileges.
  • npm installed in your local environment. For instructions on how to install npm, refer to Downloading and installing Node.js and npm.
  • A service quota of 1 for the corresponding SageMaker endpoints. For Llama 2 13b Chat, we use an ml.g5.48xlarge instance and for Stable Diffusion 2.1, we use an ml.p3.2xlarge instance.

To request a service quota increase, on the AWS Service Quotas console, navigate to AWS services, SageMaker, and request a quota increase to a value of 1 for ml.g5.48xlarge for endpoint usage and ml.p3.2xlarge for endpoint usage.

The service quota request may take a few hours to be approved, depending on the instance type availability.
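
If you prefer to check and request these quotas programmatically rather than through the console, the following minimal sketch uses boto3 and the Service Quotas API. The quota names are assumptions based on how they appear in the Service Quotas console; verify them in the output of list_service_quotas before submitting a request.

# Minimal sketch: look up and request the two SageMaker endpoint quotas from code.
import boto3

sq = boto3.client("service-quotas")

# Quota names assumed from the Service Quotas console wording above.
wanted = {"ml.g5.48xlarge for endpoint usage", "ml.p3.2xlarge for endpoint usage"}
paginator = sq.get_paginator("list_service_quotas")

for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] in wanted and quota["Value"] < 1:
            sq.request_service_quota_increase(
                ServiceCode="sagemaker",
                QuotaCode=quota["QuotaCode"],
                DesiredValue=1.0,
            )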

Deploy SageMaker foundation models

SageMaker is a fully managed machine learning (ML) service that helps developers quickly build and train ML models. Complete the following steps to deploy the Llama 2 13b Chat and Stable Diffusion 2.1 foundation models using Amazon SageMaker Studio:

  1. Create a SageMaker domain. For instructions, refer to Onboard to Amazon SageMaker Domain using Quick setup.

A domain sets up all the storage and allows you to add users to access SageMaker.

  2. On the SageMaker console, choose Studio in the navigation pane, then choose Open Studio.
  3. Upon launching Studio, under SageMaker JumpStart in the navigation pane, choose Models, notebooks, solutions.
    SageMaker JumpStart Console
  4. In the search bar, search for Llama 2 13b Chat.
  5. Under Deployment Configuration, for SageMaker hosting instance, choose ml.g5.48xlarge and for Endpoint name, enter meta-textgeneration-llama-2-13b-f.
  6. Choose Deploy.

SageMaker JumpStart Deployment Configuration

After the deployment succeeds, you should be able to see the In Service status.

Llama Model Status

  7. On the Models, notebooks, solutions page, search for Stable Diffusion 2.1.
  8. Under Deployment Configuration, for SageMaker hosting instance, choose ml.p3.2xlarge and for Endpoint name, enter jumpstart-dft-stable-diffusion-v2-1-base.
  9. Choose Deploy.

SageMaker JumpStart Deployment Configuration

After the deployment succeeds, you should be able to see the In Service status.

Stable Diffusion Model Status
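
The console steps above can also be performed programmatically. The following is a minimal sketch using the SageMaker Python SDK; the JumpStart model IDs are assumptions inferred from the endpoint names used in this post, so confirm them in the JumpStart model catalog before deploying.

# Minimal sketch: deploy the two JumpStart models with the SageMaker Python SDK
# instead of the Studio UI. Requires AWS credentials and a SageMaker execution role.
from sagemaker.jumpstart.model import JumpStartModel

# Llama 2 13b Chat (model ID assumed from the endpoint name above)
llama = JumpStartModel(model_id="meta-textgeneration-llama-2-13b-f")
llama.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    endpoint_name="meta-textgeneration-llama-2-13b-f",
    accept_eula=True,  # Llama 2 requires accepting the Meta end-user license agreement
)

# Stable Diffusion 2.1 base (model ID assumed)
sd = JumpStartModel(model_id="model-txt2img-stabilityai-stable-diffusion-v2-1-base")
sd.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    endpoint_name="jumpstart-dft-stable-diffusion-v2-1-base",
)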

Deploy Lambda and IAM permissions using AWS CloudFormation

This section describes how you can launch a CloudFormation stack that deploys a Lambda function that processes your user request and calls the SageMaker endpoint that you deployed, and deploys all the necessary IAM permissions. Complete the following steps:

  1. Navigate to the GitHub repository and download the CloudFormation template (lambda.cfn.yaml) to your local machine.
  2. On the CloudFormation console, choose the Create stack drop-down menu and choose With new resources (standard).
  3. On the Specify template page, select Upload a template file and Choose file.
  4. Choose the lambda.cfn.yaml file that you downloaded, then choose Next.
  5. On the Specify stack details page, enter a stack name and the API key that you obtained in the prerequisites, then choose Next.
  6. On the Configure stack options page, choose Next.
  7. Review and acknowledge the changes and choose Submit.
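
Alternatively, if you prefer to create the stack from code rather than the console, a minimal boto3 sketch might look like the following. The parameter key for the API key is a placeholder; use the parameter name defined in lambda.cfn.yaml.

# Minimal sketch: create the Chat Studio Lambda stack with boto3 instead of the console.
import boto3

cfn = boto3.client("cloudformation")

with open("lambda.cfn.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="chat-studio-lambda",
    TemplateBody=template_body,
    # "ApiKey" is a placeholder parameter name -- check the template for the real one.
    Parameters=[{"ParameterKey": "ApiKey", "ParameterValue": "<your-api-key>"}],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],  # the stack creates IAM resources
)

# Block until the stack reaches CREATE_COMPLETE before moving on.
cfn.get_waiter("stack_create_complete").wait(StackName="chat-studio-lambda")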

Set up the web UI

This section describes the steps to run the web UI (created using Cloudscape Design System) on your local machine:

  1. On the IAM console, navigate to the user functionUrl.
  2. On the Security Credentials tab, choose Create access key.
  3. On the Access key best practices & alternatives page, select Command Line Interface (CLI) and choose Next.
  4. On the Set description tag page, choose Create access key.
  5. Copy the access key and secret access key.
  6. Choose Done.
  7. Navigate to the GitHub repository and download the react-llm-chat-studio code.
  8. Launch the folder in your preferred IDE and open a terminal.
  9. Navigate to src/configs/aws.json and input the access key and secret access key you obtained.
  10. Enter the following commands in the terminal:
    npm install
    
    npm start

  11. Open http://localhost:3000 in your browser and start interacting with your models!

To use Chat Studio, choose a foundation model from the drop-down menu and enter your query in the text box. To get AI-generated images along with the response, add the phrase “with images” to the end of your query.

Add other SageMaker foundation models

You can further extend the capability of this solution to include additional SageMaker foundation models. Because every model expects different input and output formats when invoking its SageMaker endpoint, you will need to write some transformation code in the callSageMakerEndpoints Lambda function to interface with the model.

This section describes the general steps and code changes required to implement an additional model of your choice. Note that basic knowledge of the Python language is required for Steps 6–8.

  1. In SageMaker Studio, deploy the SageMaker foundation model of your choice.
  2. Choose SageMaker JumpStart and Launch JumpStart assets.
  3. Choose your newly deployed model endpoint and choose Open Notebook.
  4. On the notebook console, find the payload parameters.

These are the fields that the new model expects when invoking its SageMaker endpoint. The following screenshot shows an example.

SageMaker Endpoint Configuration

  5. On the Lambda console, navigate to callSageMakerEndpoints.
  6. Add a custom input handler for your new model.

In the following screenshot, we transformed the input for Falcon 40B Instruct BF16 and GPT NeoXT Chat Base 20B FP16. You can insert your custom parameter logic as indicated to add the input transformation logic with reference to the payload parameters that you copied.

Lambda Code Snippet
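
As a rough illustration of what such an input handler can look like, the following hypothetical sketch builds a payload for a TGI-style text-generation model such as Falcon 40B Instruct and invokes its endpoint. The field names ("inputs", "parameters", "max_new_tokens", and so on) and the endpoint name are assumptions; copy the actual payload parameters you found in the example notebook in Step 4.

import json
import boto3

def build_falcon_payload(user_prompt: str) -> bytes:
    # Hypothetical TGI-style payload; replace the field names with the ones from
    # the model's example notebook.
    payload = {
        "inputs": user_prompt,
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.7,
            "return_full_text": False,
        },
    }
    return json.dumps(payload).encode("utf-8")

# Example invocation from within the Lambda function.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="your-falcon-endpoint-name",  # placeholder endpoint name
    ContentType="application/json",
    Body=build_falcon_payload("What is Amazon SageMaker JumpStart?"),
)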

  7. Return to the notebook console and locate query_endpoint.

This function gives you an idea how to transform the output of the models to extract the final text response.

SageMaker Endpoint Configuration
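
As a companion to the input handler sketch above, a hypothetical output handler for the same assumed TGI-style response format might look like the following. The response shape is an assumption; mirror whatever query_endpoint does in the model's example notebook.

import json

def parse_falcon_response(response) -> str:
    # response is the dict returned by the sagemaker-runtime invoke_endpoint call.
    body = json.loads(response["Body"].read())
    # Assumed TGI-style shape: [{"generated_text": "..."}]
    if isinstance(body, list) and body and "generated_text" in body[0]:
        return body[0]["generated_text"]
    # Fall back to the raw body if the model returns a different format.
    return json.dumps(body)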

  8. With reference to the code in query_endpoint, add a custom output handler for your new model.
    Lambda Code
  9. Choose Deploy.
  10. Open your IDE, launch the react-llm-chat-studio code, and navigate to src/configs/models.json.
  11. Add your model name and model endpoint, and enter the payload parameters from Step 4 under payload using the following format:
    "add_model_name": {
    "endpoint_name": "add_model_endpoint",
    "payload": {
    "add_payload_parameters_here"
    }
    },

  12. Refresh your browser to start interacting with your new model!

Deploy the application using Amplify

Amplify is a complete solution that allows you to quickly and efficiently deploy your application. This section describes the steps to deploy Chat Studio to an Amazon CloudFront distribution using Amplify if you wish to share your application with other users.

  1. Navigate to the react-llm-chat-studio code folder you created earlier.
  2. Enter the following commands in the terminal and follow the setup instructions:
    npm install -g @aws-amplify/cli
    
    amplify configure

  3. Initialize a new Amplify project by using the following command. Provide a project name, accept the default configurations, and choose AWS access keys when prompted to select the authentication method.
    amplify init

  4. Host the Amplify project by using the following command. Choose Amazon CloudFront and S3 when prompted to select the plugin mode.
    amplify hosting add

  5. Finally, build and deploy the project with the following command:
    amplify publish

  6. After the deployment succeeds, open the URL provided in your browser and start interacting with your models!

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.
  2. Delete the SageMaker JumpStart endpoint. For instructions, refer to Delete Endpoints and Resources.
  3. Delete the SageMaker domain. For instructions, refer to Delete an Amazon SageMaker Domain.

Conclusion

In this post, we explained how to create a web UI for interfacing with LLMs deployed on AWS.

With this solution, you can interact with your LLM and hold a conversation in a user-friendly manner to test or ask the LLM questions, and get a collage of images and videos if required.

You can extend this solution in various ways, such as to integrate additional foundation models, integrate with Amazon Kendra to enable ML-powered intelligent search for understanding enterprise content, and more!

We invite you to experiment with different pre-trained LLMs available on AWS, or build on top of or even create your own LLMs in SageMaker. Let us know your questions and findings in the comments, and have fun!


About the authors

Jarrett Yeo Shan Wei is an Associate Cloud Architect in AWS Professional Services covering the Public Sector across ASEAN and is an advocate for helping customers modernize and migrate into the cloud. He has attained five AWS certifications, and has also published a research paper on gradient boosting machine ensembles in the 8th International Conference on AI. In his free time, Jarrett focuses on and contributes to the generative AI scene at AWS.

Tammy Lim Lee Xin is an Associate Cloud Architect at AWS. She uses technology to help customers deliver their desired outcomes in their cloud adoption journey and is passionate about AI/ML. Outside of work she loves travelling, hiking, and spending time with family and friends.

Read More

Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium

Large language models (or LLMs) have become a topic of daily conversations. Their quick adoption is evident in the amount of time required to reach 100 million users, which has gone from 4.5 years for Facebook to an all-time low of a mere 2 months for ChatGPT. A generative pre-trained transformer (GPT) uses causal autoregressive updates to make predictions. These model architectures have demonstrated stupendous performance on a variety of tasks such as speech recognition, text generation, and question answering. Several recent models such as NeoX, Falcon, and Llama use the GPT architecture as a backbone. Training LLMs requires a colossal amount of compute time, which costs millions of dollars. In this post, we’ll summarize the training procedure of GPT NeoX on AWS Trainium, a purpose-built machine learning (ML) accelerator optimized for deep learning training. We’ll outline how we cost-effectively (3.2 M tokens/$) trained such models with AWS Trainium without losing any model quality.

Solution overview

GPT NeoX and Pythia models

GPT NeoX and Pythia are open-source causal language models from Eleuther-AI, with approximately 20 billion parameters in NeoX and 6.9 billion in Pythia. Both are decoder models following a similar architectural design to GPT-3. However, they also have several additions that are widely adopted in recent models such as Llama. In particular, they use rotary positional embedding (RoPE) with partial rotation across the head dimensions. The original models (NeoX and Pythia 6.9B) were trained on the openly available Pile dataset with deduplication, using the Megatron and DeepSpeed backends.

We demonstrate the pre-training and fine-tuning of these models on AWS Trainium-based Trn1 instances using the Neuron NeMo library. To establish the proof of concept and enable quick reproduction, we’ll use a smaller Wikipedia dataset subset tokenized using the GPT2 byte-pair encoding (BPE) tokenizer.

Walkthrough

Download the pre-tokenized Wikipedia dataset as shown:

export DATA_DIR=~/examples_datasets/gpt2

mkdir -p ${DATA_DIR} && cd ${DATA_DIR}

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request

Both NeoX 20B and Pythia 6.9B use RoPE with partial rotation, for example, rotating 25% of the head dimensions and keeping the rest unrotated. To efficiently implement the partial rotation on the AWS Trainium accelerator, instead of concatenating the rotating and non-rotating dimensions, we append zero frequencies for the non-rotating dimensions and then rotate the complete set of head dimensions. This simple trick helped us improve the throughput (sequences processed per second) on AWS Trainium.
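
To make this trick concrete, here is a minimal PyTorch sketch of partial RoPE implemented through zero-frequency padding. It is an illustration of the idea only; the pairing convention and the actual Neuron NeMo kernel may differ.

import torch

def build_partial_rope_cache(seq_len, head_dim, rotary_pct=0.25, base=10000.0):
    # Real frequencies for the rotated fraction of the head dimension.
    rotary_dim = int(head_dim * rotary_pct)
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    # Zero frequency means cos(0)=1 and sin(0)=0, so the non-rotated dimensions
    # pass through the rotation below completely unchanged.
    inv_freq = torch.cat([inv_freq, torch.zeros((head_dim - rotary_dim) // 2)])
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # [seq_len, head_dim/2]
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: [batch, seq_len, n_heads, head_dim]; rotates interleaved (even, odd) pairs.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos[None, :, None, :], sin[None, :, None, :]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)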

Training steps

To run the training, we use a SLURM-managed multi-node Amazon Elastic Compute Cloud (Amazon EC2) Trn1 cluster, with each node containing a trn1.32xl instance. Each trn1.32xl has 16 accelerators with two workers per accelerator. After downloading the latest Neuron NeMo package, use the provided neox and pythia pre-training and fine-tuning scripts with optimized hyper-parameters and execute the following for a four-node training.

  1. Compile: Pre-compile the model with three train iterations to generate and save the graphs:
    sbatch --nodes 4 compile.slurm ./neoX_20B_slurm.sh

  2. Run: Execute the training by loading the cached graphs from the first step:
    sbatch --nodes 4 run.slurm ./neoX_20B_slurm.sh

  3. Monitor results
    tensorboard --logdir=nemo_experiments/megatron_neox

The same steps need to be followed to run the Pythia 6.9B model, replacing neoX_20B_slurm.sh with pythia_6.9B_slurm.sh.

Pre-training and fine-tuning experiments

We demonstrate the pre-training of the GPT NeoX and Pythia models on AWS Trainium using the Neuron NeMo library for 10k iterations, and also show fine-tuning of these models for 1k steps. For pre-training, we use the GPT2 BPE tokenizer inside NeMo and follow the same config as used in the original model. Fine-tuning on AWS Trainium requires changing a few parameters (such as the vocab size division factor), which are provided in the fine-tuning scripts to account for Megatron versus NeMo differences and GPU versus AWS Trainium changes. The multi-node distributed training throughput with a varying number of nodes is shown in Table 1.

Model | Tensor Parallel | Pipeline Parallel | Number of instances | Cost ($/hour) | Sequence length | Global batch size | Throughput (seq/sec) | Cost-throughput ratio (tokens/$)
Pythia 6.9B | 8 | 1 | 1 | 7.59 | 2048 | 256 | 10.4 | 10,102,387
Pythia 6.9B | 8 | 1 | 4 | 30.36 | 2048 | 256 | 35.8 | 8,693,881
NeoX 20B | 8 | 4 | 4 | 30.36 | 2048 | 16384 | 13.60 | 3,302,704
NeoX 20B | 8 | 4 | 8 | 60.72 | 2048 | 16384 | 26.80 | 3,254,134
NeoX 20B | 8 | 4 | 16 | 121.44 | 2048 | 16384 | 54.30 | 3,296,632
NeoX 20B | 8 | 4 | 32 | 242.88 | 2048 | 16384 | 107.50 | 3,263,241
NeoX 20B | 8 | 4 | 64 | 485.76 | 2048 | 16384 | 212.00 | 3,217,708

Table 1. Comparing mean throughput of GPT NeoX and Pythia models for training up to 500 steps with changing number of nodes. The pricing of trn1.32xl is based on the 3-year reserved effective per hour rate.

Next, we also evaluate the loss trajectory of the model training on AWS Trainium and compare it with the corresponding run on a P4d (Nvidia A100 GPU cores) cluster. Along with the training loss, we also compare a useful indicator such as the gradient norm, which is the 2-norm of the model gradients computed at each training iteration to monitor the training progress. The training results are shown in Figures 1 and 2, and the fine-tuning of NeoX 20B in Figure 3.

Figure 1. Training loss averaged across all workers (left) and gradient norm (right) at each training step. NeoX 20B is trained on 4 nodes with a small wiki dataset on GPU and Trainium with the same training hyper-parameters (global batch size=256). The GPU uses BF16 and default mixed precision, while AWS Trainium uses full BF16 with stochastic rounding. The loss and gradient norm trajectories match for GPU and AWS Trainium.

Figure 2. Training loss averaged across all workers (left) and gradient norm (right) at each training step. Similar to GPT NeoX in Figure 1, Pythia 6.9B is trained on 4 nodes with a small wiki dataset on GPU and Trainium with the same training hyper-parameters (global batch size=256). The loss and gradient norm trajectories match for GPU and Trainium.

Figure 3. Fine-tuning the GPT NeoX 20B model on GPU and AWS Trainium with training loss averaged across all workers (left) and gradient norm (right). A small wiki dataset is used for the fine-tuning demonstration. The loss and gradient norm trajectories match for GPU and AWS Trainium.

In this post, we showed cost-efficient training of LLMs on AWS deep learning hardware. We trained the GPT NeoX 20B and Pythia 6.9B models on AWS Trn1 with the Neuron NeMo library. The cost-normalized throughput for the 20-billion-parameter model with AWS Trainium is approximately 3.2M tokens per dollar spent. Along with cost-efficient training on AWS Trainium, we obtain similar model accuracy, which is evident from the training step loss and gradient norm trajectories. We also fine-tuned the available checkpoints for the NeoX 20B model on AWS Trainium. For additional information on distributed training with NeMo Megatron on AWS Trainium, see AWS Neuron Reference for NeMo Megatron. A good resource to start fine-tuning the Llama model can be found here: Llama2 fine-tuning. To get started with managed AWS Trainium on Amazon SageMaker, see Train your ML Models with AWS Trainium and Amazon SageMaker.


About the Authors

Gaurav Gupta is currently an Applied Scientist at Amazon Web Services (AWS) AI labs. Dr. Gupta completed his PhD at USC Viterbi. His research interests span the domain of sequential data modeling, learning partial differential equations, information theory for machine learning, fractional dynamical models, and complex networks. He is currently working on applied and mathematical problems in LLM training behavior, vision models with PDEs, and information-theoretic multi-modality models. Dr. Gupta has publications in top journals and conferences such as NeurIPS, ICLR, ICML, Nature, IEEE Control Society, and ACM Cyber-Physical Society.

Ben Snyder is an applied scientist with AWS Deep Learning. His research interests include foundational models, reinforcement learning, and asynchronous optimization. Outside of work, he enjoys cycling and backcountry camping.

Amith (R) Mamidala is a senior machine learning application engineer at AWS Annapurna Labs. Dr. Mamidala completed his PhD at the Ohio State University in high performance computing and communication. During his tenure at IBM research, Dr. Mamidala contributed towards the BlueGene class of computers, which often led the Top500 ranking of the most powerful and power-efficient supercomputers. The project was awarded the 2009 National Medal of Technology and Innovation. After a brief stint as an AI engineer at a financial hedge fund, Dr. Mamidala joined Annapurna Labs, focusing on large language model training.

Jun (Luke) Huan is a principal scientist at AWS AI Labs. Dr. Huan works on AI and Data Science. He has published more than 180 peer-reviewed papers in leading conferences and journals. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI start-up, and worked as the CEO and Chief Scientist in 2019-2021. Before joining industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Read More

Vodafone advances its machine learning skills with AWS DeepRacer and Accenture

Vodafone is transitioning from a telecommunications company (telco) to a technology company (TechCo) by 2025, with objectives of innovating faster, reducing costs, improving security, and simplifying operations. Thousands of engineers are being onboarded to contribute to this transition. By 2025, Vodafone plans to have 50% of its global workforce actively involved in software development, with an objective to deliver 60% of digital services in-house. This new workforce requires rapid reskilling and understanding of disruptive services such as artificial intelligence (AI) and machine learning (ML) to drive meaningful outcomes.

To help achieve this ambitious transition, Vodafone has partnered with Accenture and AWS to build a cloud platform that helps its engineers work in flexible, creative, and agile ways by providing them a curated set of managed, security and DevOps-oriented AWS services and application workloads. To learn more, check out Redefining Vodafone’s customer experience with AWS and the following talk at AWS re:Invent 2022.

Vodafone Digital engineering (VDE) invited Accenture and AWS to co-host an exclusive event at their annual DigiFest, a week-long event celebrating the scale of their global VDE teams, championing reusable apps and collaborative idea generation. As one of the main events of the DigiFest, AWS and Accenture conceptualized a company-wide AWS DeepRacer challenge where engineers can build and train their models to become better versed in using ML with AWS.

In this post, we share how Vodafone is advancing its ML skills using AWS DeepRacer and Accenture.

Why is machine learning important to Vodafone?

Machine learning is one of the fastest growing domains in technology and telecommunications, owing to the benefits of improved productivity and forecasting across key domains in telecommunications such as channels, CRM, billing, order management, service assurance, network management, and more.

Vodafone has already adopted ML in the proactive detection and correction of network anomalies to improve customer satisfaction. Their AI and ML capabilities in digital self-care, via a chatbot, have been helping their customer care team focus on cases that need deeper attention. Because they use AWS for providing digital services packaged as telco as a service, incorporating AI and ML components is crucial to maintain a competitive edge in delivering state-of-the-art services to customers.

Why AWS DeepRacer?

AWS DeepRacer is an interesting and fun way to get started with reinforcement learning (RL). RL is an advanced ML technique that takes a very different approach to training models than other ML methods. Its super power is that it learns very complex behaviors without requiring any labeled training data, and can make short-term decisions while optimizing for a longer-term goal. The AWS DeepRacer Challenge provided an opportunity for Vodafone’s engineers to engage in a friendly competition, develop an ML mindset, and share insights on how to succeed in a private virtual racing event.

Racing with AWS DeepRacer

The event played out in three stages, starting with a workshop on AWS DeepRacer to cover the basics of reinforcement learning, which was attended by over 225 Vodafone engineers. They learned how to fine-tune an AWS DeepRacer model by creating a reward function, exploring the action space, systematically tuning hyperparameters, examining the training job progress, evaluating the model, and testing the model on a virtual AWS DeepRacer vehicle and virtual track.

In the next stage, a league race was organized where 130 racers were able to view the race videos of the best model submission of every participant on a live leaderboard. This helped them understand how a high-performance model performs after it’s trained. They quickly understood that overtraining occurs when a model is trained for too long, causing it to overfit and underperform in a new environment. They also experimented with different styles of reward functions, such as following the center line, an excessive steering penalty, a slowness penalty, and progress rewards, as illustrated in the sketch below.
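
For reference, a "follow the center line" reward function of the kind the participants experimented with typically follows the standard pattern from the AWS DeepRacer documentation, sketched here; the added steering penalty is one simple variation, not Vodafone's winning model.

def reward_function(params):
    # Reward staying close to the center line, with a mild penalty for excessive steering.
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]
    steering = abs(params["steering_angle"])  # steering angle in degrees

    # Markers at increasing distances from the center line.
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely off track

    # Penalize sharp steering to discourage zig-zagging.
    if steering > 15.0:
        reward *= 0.8

    return float(reward)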

The event culminated with a grand finale, a showdown of 11 racers who tuned their models one final time to compete in a live race with commentary. All 11 racers completed a full lap with their models. Eight racers had a lap time of less than 15 seconds, with the winner coming in with an incredible lap time of 11.194 seconds on the tricky Toronto Turnpike virtual race track.

Summary

The goal of the AWS DeepRacer Challenge was to build awareness and excitement of ML on AWS for a global cloud engineering audience with varying technology skills and competencies. The tournament exceeded 585 total registrations across the globe, with over 400 models submitted and over 600 hours of training and evaluation.

Vodafone was able to help a broad range of builders get hands-on with ML through the AWS DeepRacer challenge. With over 47% of participants being AWS and ML beginners, it reaffirms how effective AWS DeepRacer can be in introducing ML on AWS in a safe and engaging environment for beginners.

“Having the Digital Engineering team attend events like DigiFest and participate in challenges like AWS DeepRacer is a huge part of our vision of building a world-class software engineering team in Vodafone. As we take on the complex challenge of transforming a telecommunications company into a technology company, growing our skillset becomes a top priority and our partnership with Accenture and AWS has provided the team with not just this, but multiple opportunities to learn and develop. I am excited for more of this to come!”

Ben Connolly, Vodafone Global Director of Cloud Engineering


About the Author

Ramakrishna Natarajan is a Senior Partner Solutions Architect at Amazon Web Services. He is based out of London and helps AWS Partners find optimal solutions on AWS for their customers. He specialises in Telecommunications OSS/BSS and has a keen interest in evolving domains such as AI/ML, Data Analytics, Security and Modernisation. He enjoys playing squash, going on long hikes and learning new languages.

Read More

Steering at the Frontier: Extending the Power of Prompting

We’re seeing exciting capabilities of frontier foundation models, including intriguing powers of abstraction, generalization, and composition across numerous areas of knowledge and expertise. Even seasoned AI researchers have been impressed with the ability to steer the models with straightforward, zero-shot prompts. Beyond basic, out-of-the-box prompting, we’ve been exploring new prompting strategies, showcased in our Medprompt work, to evoke the powers of specialists.  

Today, we’re sharing information on Medprompt and other approaches to steering frontier models in promptbase, a collection of resources on GitHub. Our goal is to provide information and tools to engineers and customers to evoke the best performance from foundation models. We’ll start by including scripts that enable replication of our results using the prompting strategies that we present here. We’ll be adding more sophisticated general-purpose tools and information over the coming weeks.

To illustrate the capabilities of the frontier models, and the opportunities to harness and extend recent efforts to reach state-of-the-art (SoTA) results by steering GPT-4, we’ll review SoTA results on benchmarks that Google chose for evaluating Gemini Ultra. Our end-to-end exploration, prompt design, and computing of performance took just a couple of days.


Let’s focus on the well-known MMLU (Measuring Massive Multitask Language Understanding) challenge that was established as a test of general knowledge and reasoning powers of large language models. The complete MMLU benchmark contains tens of thousands of challenge problems of different forms across 57 areas from basic mathematics to United States history, law, computer science, engineering, medicine, and more.

In our Medprompt study, we focused on medical challenge problems, but found that the prompt strategy could have more general-purpose application and examined its performance on several out-of-domain benchmarks—despite the roots of the work on medical challenges. Today, we report that steering GPT-4 with a modified version of Medprompt achieves the highest score ever achieved on the complete MMLU.

In our explorations, we initially found that applying the original Medprompt to GPT-4 on the comprehensive MMLU achieved a score of 89.1%. By increasing the number of ensembled calls in Medprompt from five to 20, performance by GPT-4 on the MMLU further increased to 89.56%. To achieve a new SoTA on MMLU, we extended Medprompt to Medprompt+ by adding a simpler prompting method and formulating a policy for deriving a final answer by integrating outputs from both the base Medprompt strategy and the simple prompts. The synthesis of a final answer is guided by a control strategy governed by GPT-4 and inferred confidences of candidate answers. More details on Medprompt+ are provided in the promptbase repo. A related method for coupling complex and simple queries was harnessed by the Google Gemini team. GPT-4 steered with the modified Medprompt+ reaches a record score of 90.10%. We note that Medprompt+ relies on accessing confidence scores (logprobs) from GPT-4. These are not publicly available via the current API but will be enabled for all in the near future.
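
For readers who want a feel for the control flow described above, the following schematic sketch shows one way the pieces could fit together. It is not the actual promptbase implementation; the call_gpt4 and confidence_from_logprobs arguments are hypothetical placeholders for an API call that returns an answer together with token logprobs, and the confidence threshold is arbitrary.

import random
from collections import Counter

def medprompt_plus(question, choices, call_gpt4, confidence_from_logprobs, k_ensemble=20):
    # Base Medprompt strategy: an ensemble of chain-of-thought calls with shuffled
    # answer choices, resolved by majority vote.
    votes = []
    for _ in range(k_ensemble):
        shuffled = random.sample(choices, len(choices))
        answer, _ = call_gpt4(question, shuffled, style="chain_of_thought")
        votes.append(answer)
    cot_answer, _ = Counter(votes).most_common(1)[0]

    # Simpler prompt: a single direct query without chain of thought.
    simple_answer, logprobs = call_gpt4(question, choices, style="direct")

    # Control strategy: use the confidence inferred from logprobs to decide which
    # candidate answer to keep when the two strategies disagree.
    if simple_answer != cot_answer and confidence_from_logprobs(logprobs) > 0.9:
        return simple_answer
    return cot_answer

The full pipeline also relies on dynamic few-shot example selection and GPT-4's self-generated chain of thought, as described in the Medprompt work and the promptbase repo.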

Figure 1. Reported performance of multiple models and methods on the MMLU benchmark: Palm 2-L (5-shot) 78.4%, Claude 2 (5-shot CoT) 78.5%, Inflection-2 (5-shot) 79.6%, Google Pro (CoT@8) 79.13%, Gemini Ultra (CoT@32) 90.04%, GPT-4-1106 (5-shot) 86.4%, GPT-4-1106 (Medprompt @ 5) 89.1%, GPT-4-1106 (Medprompt @ 20) 89.56%, and GPT-4-1106 (Medprompt @ 31) 90.10%.

While systematic prompt engineering can yield maximal performance, we continue to explore the out-of-the-box performance of frontier models with simple prompts. It’s important to keep an eye on the native power of GPT-4 and how we can steer the model with zero- or few-shot prompting strategies. As demonstrated in Table 1, starting with simple prompting is useful to establish baseline performance before layering in more sophisticated and expensive methods.

Benchmark | GPT-4 Prompt | GPT-4 Results | Gemini Ultra Results
MMLU | Medprompt+ | 90.10% | 90.04%
GSM8K | Zero-shot | 95.27% | 94.4%
MATH | Zero-shot | 68.42% | 53.2%
HumanEval | Zero-shot | 87.8% | 74.4%
BIG-Bench-Hard | Few-shot + CoT* | 89.0% | 83.6%
DROP | Zero-shot + CoT | 83.7% | 82.4%
HellaSwag | 10-shot** | 95.3%** | 87.8%
* followed the norm of evaluations and used standard few-shot examples from dataset creators 
** source: Google 

Table 1: Model, strategies, and results
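
As a concrete example of the simple, out-of-the-box prompting referenced above, a zero-shot baseline can be as small as the following sketch using the OpenAI Python SDK. The model name and message format are assumptions; adapt them to your own deployment (for example, Azure OpenAI).

# Minimal zero-shot baseline: one direct question, no exemplars, temperature 0.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Which planet in our solar system has the most moons?"
response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # assumed model name
    messages=[{"role": "user", "content": f"Answer the following question.\n\n{question}"}],
    temperature=0,
)
print(response.choices[0].message.content)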

We encourage you to check out the promptbase repo on GitHub for more details about prompting techniques and tools. This area of work is evolving with much to learn and share. We’re excited about the directions and possibilities ahead.

Read More

Meet NANA, Moonshine Studio’s AI-Powered Receptionist Avatar

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

The creative team at Moonshine Studio — an artist-focused visual effects (VFX) studio specializing in animation and motion design — was tasked with solving a problem.

At their Taiwan office, receptionists were constantly engaged in meeting and greeting guests, preventing them from completing other important administrative work. To make matters worse, the automated kiosk greeting system wasn’t working as expected.

Senior Moonshine Studio 3D artist and this week’s In the NVIDIA Studio creator Eric Chiang stepped up to the challenge. He created a realistic, interactive 3D model that would serve as the foundation of a new AI-powered virtual assistant — NANA. The avatar can welcome guests and provide basic company info, easing the strain on the receptionist team.

Chiang built NANA using GPU-accelerated features in his favorite creative apps — powered by his NVIDIA Studio-badged MSI MEG Trident X2 PC, which is equipped with a GeForce RTX 4090 graphics card.

His creative workflow was enhanced by the Tensor Cores in his GPU, which supercharged AI-specific tasks — saving him time and elevating the quality of his work. RTX and AI also improve performance in gaming, boost productivity and more.

These advanced features are supported by NVIDIA Studio Drivers — free for RTX GPU owners — which add performance and reliability. The December Studio Driver provides support for the Reallusion iClone AccuFACE plugin, GPU audio enhancements, AV1 in HandBrake and more — and is now ready for download.

Contests and Challenges Calling All Creators

Creative community The Rookies is hosting Meet Mat 3 — the 3D digital painting contest. Open to students and professionals with no more than a year of industry experience, it challenges contestants to use Adobe Substance 3D Painter to texture a blank character, MAT, in their own unique style. Prizes include GeForce RTX GPUs, Wacom Cintiq displays and more. Register today — entries close Jan. 5, 2024.

MAT, textured by artist Cino Lai in Adobe Substance 3D Painter.

And though temperatures continue to drop, the #WinterArtChallenge is heating up with un-brrrrrr-lievable entries like this extraordinary #InstantNeRF by @RadianceFields.

Be sure to include the #WinterArtChallenge hashtag for a chance to be featured on the @NVIDIAStudio, @NVIDIAOmniverse or @NVIDIAAIDev social channels.

An AI on the Future

Chiang began in Blender, sculpting intricate 3D models that served as the building blocks for NANA. Blender Cycles’ RTX-accelerated OptiX ray tracing in the viewport unlocked interactive, photorealistic modeling.

 

He then used the Marvelous Designer software for making, editing and reusing 3D clothes to create realistic clothing for NANA. This streamlined the design and simulation process, ensuring that the avatar is not only structurally sound but impeccably dressed.

NANA’s casual day outfit.

Chiang deployed Quixel Mixer and Adobe Substance 3D Painter for shading, adding depth, texture and realism to the 3D models.

 

He then used the Blender plug-in AccuRIG to efficiently create precise, adaptable character rigs.

 

Chiang put everything together in Unreal Engine, where he seamlessly integrated 3D objects into the scene, leveraging real-time rendering to create visually stunning results.

 

NVIDIA DLSS further increased viewport interactivity by using AI to upscale frames rendered at lower resolution while still retaining high-fidelity detail. All of this was powered by his GeForce RTX 4090 GPU.

The NVIDIA Studio-badged MSI MEG Trident X2 PC, equipped with a GeForce RTX 4090 graphics card.

Chiang is excited about what AI can do for creators and society at large.

“What was once science fiction is now becoming reality, opening the door to a whole new stage of scientific and technological development,” he said. “We are fortunate to participate in and witness this new stage.”

Moonshine Studio digital 3D artist Eric Chiang.

Visit Moonshine Studio and say hello to NANA.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Read More

Phi-2: The surprising power of small language models

Contributors

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang

Figure 1. Satya Nadella announcing Phi-2 at Microsoft Ignite 2023.

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.

We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available on the Azure model catalog to foster research and development on language models.
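
For researchers who want to start experimenting locally, a minimal loading sketch is shown below. It assumes the weights are reachable through the Hugging Face hub under the ID microsoft/phi-2 and that the prompt follows an Instruct/Output format; both are assumptions, and the officially described distribution channel in this post is the Azure model catalog.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",          # requires the accelerate package
    trust_remote_code=True,     # may be needed depending on your transformers version
)

prompt = "Instruct: Explain why the sky appears blue.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))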


Key Insights Behind Phi-2

The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following up on our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality.

Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also shows a clear boost in Phi-2 benchmark scores.

Figure 2. Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models on commonsense reasoning, language understanding, math, coding, and the BBH benchmark; Phi-2 outperforms Phi-1.5 in every category. All tasks are evaluated in 0-shot except for BBH and MMLU, which use 3-shot CoT and 5-shot, respectively.

Training Details

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of synthetic and web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 due to our tailored data curation technique; see our previous tech report for more details on this. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.

A barplot comparing the safety score of Phi-1.5, Phi-2, and Llama-7B models on 13 categories of the ToxiGen benchmark. Phi-1.5 achieves the highest score on all categories, Phi-2 achieves the second-highest scores and Llama-7B achieves the lowest scores across all categories.
Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6,541 sentences is selected and scored between 0 and 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.
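The precise scoring recipe is not reproduced here; the sketch below only illustrates the general idea of a perplexity-based safety probe, where a model scores higher if it assigns lower perplexity to a benign sentence than to a toxic one about the same group. The sentence pairing and the logistic normalization are assumptions made for the example, not the evaluation code behind Figure 3.

```python
# Illustrative perplexity-based safety probe (not the actual ToxiGen scoring code).
# Scores fall in (0, 1); higher means the model prefers the benign sentence.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumption, as in the previous sketch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").eval()

def avg_nll(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def preference_score(benign: str, toxic: str) -> float:
    """Squash the NLL gap through a logistic so the score lands between 0 and 1."""
    return 1.0 / (1.0 + math.exp(avg_nll(benign) - avg_nll(toxic)))

print(preference_score(
    "Members of this community contribute to society in many ways.",
    "Members of this community cannot be trusted.",
))
```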

Phi-2 Evaluation

Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we conducted an extensive decontamination study to rule out this possibility, which can be found in our first report, “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 on several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).

Model     Size   BBH    Commonsense Reasoning   Language Understanding   Math   Coding
Llama-2   7B     40.0   62.2                    56.7                     16.5   21.0
Llama-2   13B    47.8   65.0                    61.9                     34.2   25.4
Llama-2   70B    66.5   69.2                    67.6                     64.1   38.3
Mistral   7B     57.2   66.4                    63.7                     46.4   39.4
Phi-2     2.7B   59.2   68.8                    62.0                     61.1   53.7
Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
Model           Size   BBH    BoolQ   MBPP   MMLU
Gemini Nano 2   3.2B   42.4   79.3    27.2   55.8
Phi-2           2.7B   59.3   83.3    59.1   56.7
Table 2. Comparison between Phi-2 and Gemini Nano 2 Model on Gemini’s reported benchmarks.

In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. The observed behavior was in line with our expectations given the benchmark results. For example, we tested a prompt used to probe a model’s ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and obtained the following result:

An example prompt is given to Phi-2 which says “A skier slides down a frictionless slope of height 40m and length 80m. What's the skier’s speed at the bottom?”. Phi-2 then answers the prompt by explaining the conversion of potential energy to kinetic energy and providing the formulas to compute each one. It then proceeds to compute the correct speed using the energy formulas.
Figure 4. Phi-2’s output on a simple physics problem, which includes an approximately correct square root calculation.
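For reference, the quantity Phi-2 is reproducing follows from energy conservation: m g h = (1/2) m v^2, so v = sqrt(2gh) = sqrt(2 × 9.8 m/s² × 40 m) ≈ 28 m/s. On a frictionless slope, the 80 m length does not affect the answer.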
The model is then provided with a student’s wrong answer to the skier physics problem and asked if it can correct the student’s mistake. Phi-2 identifies the student’s mistake, i.e., using the wrong formula for potential energy, and provides the correct formula.
Figure 5. Similarly to Gemini’s test, we further queried Phi-2 with a student’s wrong answer to see whether it could identify the mistake (it did, despite Phi-2 not being fine-tuned for chat or instruction following). We note, however, that this is not a fully apples-to-apples comparison with the Gemini Ultra output described in the Gemini report; in that case the student’s answer was given as an image with handwritten text, whereas in our case it was raw text.

The post Phi-2: The surprising power of small language models appeared first on Microsoft Research.

Read More

From PyTorch Conference 2023: From Dinosaurs to Seismic Imaging with Intel

From PyTorch Conference 2023: From Dinosaurs to Seismic Imaging with Intel

Dinosaur fossil

Lightning Talk 1: Seismic Data to Subsurface Models with OpenFWI

Speaker: Benjamin Consolvo, AI Software Engineering Manager, Intel, LinkedIn

Session Overview

In this session, Ben begins with an overview of seismic imaging and full waveform inversion (FWI). Seismic imaging and FWI help us explore land for important subsurface minerals. To find those minerals, we need to image the subsurface with a high degree of accuracy at a low cost, which involves two main challenges. He explains AI-based solutions for those challenges, summarized below.

Challenge: Traditional physics-based FWI requires an accurate starting model.
AI solution: Data-driven deep learning solutions do not require an accurate starting model.

Challenge: GPUs are typically used for fine-tuning neural networks but are often unavailable and expensive.
AI solution: CPUs are highly available, inexpensive, and viable for AI fine-tuning. The 4th Gen Intel® Xeon® Scalable processor includes a built-in AI accelerator engine, Intel® AMX (Intel® Advanced Matrix Extensions), that accelerates AI training and inference performance.
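To make the CPU option concrete, the sketch below shows the typical pattern for enabling bf16 training with Intel® Extension for PyTorch so that Intel® AMX can be used on a 4th Gen Xeon® processor. The tiny model, optimizer, and random data are placeholders, not the FWI code from the talk.

```python
# Sketch of bf16 training on a 4th Gen Intel Xeon CPU with Intel Extension for
# PyTorch (ipex). The model, optimizer, and data here are placeholders.
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Apply CPU-specific optimizations and prepare weights for bf16 training.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

x, y = torch.randn(32, 64), torch.randn(32, 1)
optimizer.zero_grad()
with torch.autocast("cpu", dtype=torch.bfloat16):  # bf16 matmuls can hit Intel AMX
    loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```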

Next, he shows the wave propagation for the subsurface model and the corresponding seismic shot gathers. In his example, the shot gathers are synthetically generated, time-sampled records of sound recordings from a shot (like a dynamite explosion or vibroseis truck) recorded by geophones spread across a large area. For this application, the training data consists of pairs of subsurface model images and seismic shot gather images, where the subsurface model is predicted from the shot gather.

Split        Seismic shot images   Subsurface model images
Train        120,000               24,000
Test         25,000                5,000
Validation   5,000                 1,000
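As a rough sketch of that pairing, a dataset class in PyTorch might look like the following. The file names and array shapes are hypothetical placeholders; adapt them to however the OpenFWI files are actually packaged.

```python
# Hypothetical sketch of the (shot gather, subsurface model) pairing described
# above. File names and array shapes are placeholders, not the OpenFWI layout.
import numpy as np
import torch
from torch.utils.data import Dataset

class SeismicPairs(Dataset):
    def __init__(self, gather_file: str, model_file: str):
        self.gathers = np.load(gather_file)  # e.g., (N, n_shots, n_time, n_receivers)
        self.models = np.load(model_file)    # e.g., (N, 1, depth, width)

    def __len__(self) -> int:
        return len(self.gathers)

    def __getitem__(self, i: int):
        return (torch.from_numpy(self.gathers[i]).float(),
                torch.from_numpy(self.models[i]).float())

# Placeholder file names for illustration only.
train_set = SeismicPairs("seismic_train.npy", "velocity_train.npy")
```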

In this application, the algorithm used during training was InversionNet (an encoder-decoder convolutional neural network). Implementation details for the InversionNet architecture can be found in Deng et al. (2021).
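For orientation, a heavily simplified network in the same encoder-decoder family is sketched below; the channel counts and shapes are placeholders, and the real InversionNet layer configuration is the one given in Deng et al. (2021).

```python
# Simplified encoder-decoder CNN in the spirit of InversionNet: shot gathers in,
# a velocity-model image out. Shapes and channel counts are illustrative only.
import torch
import torch.nn as nn

class TinyInversionNet(nn.Module):
    def __init__(self, n_shots: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(  # compress the time-by-receiver seismic data
            nn.Conv2d(n_shots, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(  # expand features back into a velocity image
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, shot_gathers: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(shot_gathers))

gathers = torch.randn(8, 5, 128, 128)   # batch of 8 samples, 5 shots each
velocity = TinyInversionNet()(gathers)  # -> (8, 1, 128, 128)
```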

He then shows the results:

  1. Prediction versus ground truth model after one epoch and at 50 epochs. After training InversionNet, the predicted model is much closer to the ground truth image.
  2. Training loss and validation loss curves decreasing over time across 50 epochs.

Finally, Ben concludes his talk by highlighting that he was able to successfully fine-tune a deep neural network without an accurate starting model to obtain a subsurface model on a 4th Gen Intel® Xeon® Scalable processor.

Watch the full video recording here and download the presentation. More details can be found in this blog.

About the Speaker

Ben Consolvo

Ben Consolvo is an AI Solutions Engineering Manager at Intel. He has been building a team and a program around Intel’s AI technology paired with Intel’s hardware offerings. He brings a background in and passion for data science, particularly deep learning (DL) and computer vision. He has applied his DL skills in the cybersecurity industry to automatically identify phishing websites, and in the oil and gas industry to identify subsurface features for geophysical imaging.

Lightning Talk 2: Dinosaur Bone Hunt

Speaker: Bob Chesebrough, Sr Solution Architect, Intel, LinkedIn

Session Overview

In this session, Bob starts the presentation by explaining his interest in collecting dinosaur bones and giving an overview of the Intel AI software portfolio.

He then explains the steps to create a dinosaur site treasure map or dinosaur bone likelihood map:

  1. Collect data and create training data (aerial photos of the Morrison Formation in New Mexico, a famous dinosaur bone bed in the western United States, along with GPS coordinates of small bone fragments discovered there)
  2. Train a simple ResNet-18 model using Intel® Extension for PyTorch
  3. Score the model on Utah photos and create a heat map

Finally, Bob shows the results: dinosaur bones were discovered in Utah using the dinosaur bone likelihood map. Go to the GitHub repository to access the code sample and try it out using Intel® Extension for PyTorch.
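The repository has the complete code sample; as a hedged outline of step 2 above, fine-tuning a two-class (bone / no bone) ResNet-18 with Intel® Extension for PyTorch looks roughly like this. The folder layout, transforms, and hyperparameters are placeholders, not the repository's exact code.

```python
# Rough outline of step 2: fine-tune ResNet-18 as a bone / no-bone classifier on
# CPU with Intel Extension for PyTorch. Paths and hyperparameters are placeholders.
import torch
import intel_extension_for_pytorch as ipex
from torchvision import datasets, models, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
# Hypothetical layout: <root>/bone and <root>/no_bone subfolders of image tiles.
train_set = datasets.ImageFolder("new_mexico_tiles/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # bone vs. no bone
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

for images, labels in loader:
    optimizer.zero_grad()
    with torch.autocast("cpu", dtype=torch.bfloat16):
        loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
```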

Watch the full video recording here and download the presentation. More details can be found in this blog.

About the Speaker

Bob Chesebrough

Bob Chesebrough has over three decades of industry experience in software development and AI solution engineering for Fortune 100 companies and national laboratories. He is also a hobbyist who has logged over 800 miles and 1,000 hours in the field finding dinosaur bones. He and his sons discovered an important fossil of the only known crocodilian from the Jurassic in New Mexico; they have also discovered and logged with museums more than 2,000 bone localities and described a new mass bone bed in New Mexico.

Read More

Abstracts: December 11, 2023

Abstracts: December 11, 2023

Microsoft Research Podcast: Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Principal Researcher Alessandro Sordoni joins host Gretchen Huizinga to discuss “Joint Prompt Optimization of Stacked LLMs using Variational Inference.” In the paper, which was accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), Sordoni and his coauthors introduce Deep Language Networks, or DLNs, an architecture that treats large language models as layers within a network and natural language prompts as each layer’s learnable parameters.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Alessandro Sordoni, a Principal Researcher from Microsoft Research. Dr. Sordoni is coauthor of a paper titled “Joint Prompt Optimization of Stacked LLMs using Variational Inference,” and this paper, which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS, is available now on arXiv. Alessandro, thanks for joining us on Abstracts!


ALESSANDRO SORDONI: Hi, Gretchen, thank you for having me.

HUIZINGA: So in a few sentences, tell us about the issue or problem that your research addresses and why we should care about it.

SORDONI: So in this paper, our starting points are large language models, and to make large language models solve tasks, one of the ways that is currently used is to prompt them. By prompting that means just giving instruction to them, and hopefully by joining instruction and the input of the task, the language model can solve the task following the rules specified in the instructions. And there has been some approaches already in the literature to actually infer what that instruction is without human intervention. And in this paper, we operate in that space, which is called kind of automatic prompt engineering. And our specific problem is to, one, how to actually infer those prompts for a language model. And, two, what happens if actually the output of that large language model gets into another language model and both language model needs prompt to operate? And so basically, we give sort of an algorithm to solve that joint prompt optimization. That’s why it’s called joint.

HUIZINGA: So what’s the underlying issue there that we should care about as potential users of this technology?

SORDONI: There are some problems that cannot be just solved by kind of one instruction or rule, I would say, but they necessitate some sort of higher-level reasoning or some sort of decomposition. And in that sense, it would maybe be useful to actually have multiple calls to the LLM, where each call is modulated by a different instruction. So the first instruction could be something very general, for example, decompose or visualize the problem into a different language that is formulated in. And the second call is now recompose this visualization that you have produced to solve the problem itself. And so basically, in that context, you can think about this as kind of augmenting the computational power of the language model by splitting the one call in multiple calls.

HUIZINGA: Well, go in a little deeper on the work that this builds on. All research kind of gets a prompt—no pun intended—from previous work. So how does your work build on and/or differ from what’s been done previously in this field?

SORDONI: I would say that our work started more with this intuition that LLMs are just kind of black-box computation units. Now this sort of black box can accept input as input language. The computation is modulated by an instruction and it outputs language, so you can stack these layers, right. So if the weights of this language layer now are the instructions and you can stack them together, how can you optimize them, right? And then we start to think, OK, but this is very related to kind of automatic prompt optimization. The overall kind of prompt engineering and prompt optimization approaches right now work by proposing some prompts and accepting some prompts. So we did some modifications with respect to how we propose new prompts to language models and how do we evaluate and accept then those that work given some task inputs and outputs. Our goal in the future—I would say in the near future—is going to be to basically integrate optimization that can really express arbitrary graphs …

HUIZINGA: Gotcha …

SORDONI: … of LLM calls right now. But in our paper, we started with the first step, which is, OK, say that I just have two calls. Can I just optimize prompts for that very simple graph? And we proposed an algorithm to do so. So basically, I guess our main contribution is, one, getting a better prompt optimizer for one layer and also devising an algorithm that works for two layers right now and that can be extended to multiple layers. But that’s also an engineering problem that needs to be tackled.

HUIZINGA: [LAUGHS] Yeah, always got to get the engineering in there! Well, listen, let’s keep going on this because it sounds like you’re talking about methodology and, and how you conducted this research. So expand a little bit on what you did actually to experiment in this arena.

SORDONI: Yeah, so I think that, uh, really the birth of this paper started from this kind of view of these language models as layers modulated by instructions that can be stacked upon each other. And from there, we said, OK, what can we do with this, basically? And so some of us worked on datasets that could be interesting for this new sort of methodology, I would say, or architecture. So basically, one question was, how do you go forward to actually test if this works in any way? And so we tried to select some datasets that were more of natural language tasks—for example, sentiment classification—and some datasets that were more about reasoning tasks. And our hunch was that basically stacking multiple layers together would help more in those tasks that would require some sort of decomposition of reasoning.

HUIZINGA: Right.

SORDONI: And for the reasoning task, we worked with this BIG-Bench Hard setting. And so parallel to that, there were some of us that worked, for example myself, in the optimization part, really in the algorithm part. And at first, we tried to do some sort of back propagation. But I quickly saw that there were some sort of issues with that … probably empirically issues. And so we tried to actually have a more formal understanding of this optimization algorithm by recurring to variational inference basically, so basically, to understand actually the first layer as producing some text and considering this text as a latent variable. When you open that box, it links also in your head to all … a bunch of kind of related works in the literature that have studied this problem very, very thoroughly. And so you can use those techniques into this context.

HUIZINGA: Interesting. So what were the results of this research? What did you find?

SORDONI: So what we found was that, indeed, the tasks in which these approaches seem to help the most are the tasks that require this sort of decomposition and reasoning. The first thing that was really, really kind of cool, it was that kind of you can go a long way in improving the performance of these language models by accurate prompt optimization. Because in some models, prompt optimization can be understood as kind of really tweaking the models towards solving the task. But in some other tasks, actually, when humans write prompts, they tend to maybe underspecify the prompt or tend to basically be not very clear to how to instruct the model. So the model has to do a lot of work to understand …

HUIZINGA: Right …

SORDONI: … what the human really wants to say to them. And so basically, this sort of prompt optimization acts as a sort of translator where it formulates a prompt that more comprehensively describes the task and more comprehensively contains some rules to solve the task. So it was very interesting to me, that kind of level of abstraction that was sort of required and needed in the prompt to really solve this task very, very well. The other finding is that this problem is very hard. It’s very tricky to optimize, to prompt, this type of optimization because this type of optimization doesn’t really follow a gradient direction like in deep neural networks.

HUIZINGA: Yeah.

SORDONI: It’s basically a sort of trial and error. And this trial and error is very finicky. There’s a lot of problems there. But I feel like I’m hopeful in the sense that this paper allowed us, I think, to hone in some very specific problem that if we solve them, we can make the problem much easier.

HUIZINGA: Let’s talk for a second about real-world impact of this research. Let’s extrapolate out from the lab and move into life. Who benefits from this most, and how do they benefit?

SORDONI: I think that, as I said before, like these automatic prompt optimization methods could benefit, I think, a large audience, or large amount of users, I would say, because they could be understood as a sort of translator between the user needs and what the LLM can do. For example, one effort here in Montréal that was led by my colleagues was kind of building this sort of interactive agent that would, through interaction with the user, form a prompt but interactively. So, for example, in DLN, like in our paper, we assume that we have a task and we do not have input or interaction with the user, right. But in more realistic scenarios, you might want to actually instruct your model to do something by some sort of active learning process where the model actually propose you whether what it did was favorable or desirable or not.

HUIZINGA: Right.

SORDONI: And the user can actually interact with that output, right. For the multilayer case, my hope is that that would be useful to build and optimize these large sort of graphs of LLM calls.

HUIZINGA: I want to take a second here to spell out some acronyms. You’ve referred to DLNs, and I don’t think our audience might know what that means. I’m assuming they know LLM means “large language model.” That’s sort of in the parlance. But talk a little bit about what that other acronym is.

SORDONI: Yeah, sorry I didn’t mention this. So DLN is basically how we refer to these architectures that are composed of language model layers. So DLN is, spells as “Deep Language Network.”

HUIZINGA: Gotcha.

SORDONI: People are free to use this name or not.

HUIZINGA: No, I like it …

SORDONI: I’m not a big fan of imposing acronyms on the world [LAUGHS], but that’s a, that’s a shorter version of it. So, yeah, so it’s really the idea that a language model is a layer in this hierarchy, and the layer accepts as input a text, it outputs a text, and really is modulated by an instruction or prompt that we want to learn.

HUIZINGA: And so the DLN is a deep language network and it sort of acts as a deep neural network but using language models as your layer.

SORDONI: Exactly, exactly, yes.

HUIZINGA: So this is a question I ask everyone, and it’s sort of like, how could you boil this down to one little takeaway if you’re standing on an elevator with somebody and they say, what do you do, Alessandro? So if there’s one thing you want people to take away from this work, what would it be?

SORDONI: The first thing that came to my mind is really the fact that these models can be understood really as a class, I would say, of probability distributions and that they are modulated by these prompts. And so basically, once you have that, once a language model just defines a (p) over sentences given some prompt, you can apply a lot of algorithms with those models. You can apply algorithms that resembles to EM, expectation maximization, or … I mean, we applied a form of that with variational inference, but maybe kind of it could open the path for other types of usages where kind of these are just very, very powerful probability distributions over these sentences that are considered as latent variable. I hope that our paper can show like a more or less practical kind of implementation of that idea. And that basically if you have to optimize, for example, prompts with one or two layers, you can definitely try our approach.

HUIZINGA: Well, finally, and we’ve been talking about this kind of already, but there seem to be some unresolved problems in the area. What do researchers like you need to be looking at in order to solve those? Sort of what’s next on the research agenda, whether it’s you or other researchers in this field?

SORDONI: So let me try to answer by something that really excites me now. What we are doing is that we are producing text, right. With the language model. But we are producing this text in such a way that it helps to solve a problem. And basically, this variational inference method and kind of framework gives us a way of understanding what does it mean to be a good text? Like what does it mean to be a good kind of latent variable or useful latent variable?

HUIZINGA: Right.

SORDONI: What does it mean to produce good data? So, for example, these big models kind of are really data creators, like this generative AI, right. But can we actually teach them to produce data such that this data can be helpful to solve tasks or to condition those same models to solve a task?

HUIZINGA: Right.

SORDONI: And what are the objective functions that promote the production of this useful data? What useful means from a mathematical perspective. I think that, apart from the prompt optimization angle, I feel like DLN to me kind of opened a little bit my spirit into kind of investigating ways of understanding what does it mean for some generated text to be useful to solve a task, I would say. Yeah.

HUIZINGA: Alessandro Sordoni, thanks for joining us today. And thanks to our listeners for tuning in. If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts!

The post Abstracts: December 11, 2023 appeared first on Microsoft Research.

Read More