Accelerating ML experimentation with enhanced security: AWS PrivateLink support for Amazon SageMaker with MLflow

With access to a wide range of generative AI foundation models (FM) and the ability to build and train their own machine learning (ML) models in Amazon SageMaker, users want a seamless and secure way to experiment with and select the models that deliver the most value for their business. In the initial stages of an ML project, data scientists collaborate closely, sharing experimental results to address business challenges. However, keeping track of numerous experiments, their parameters, metrics, and results can be difficult, especially when working on complex projects simultaneously. MLflow, a popular open-source tool, helps data scientists organize, track, and analyze ML and generative AI experiments, making it easier to reproduce and compare results.

SageMaker is a comprehensive, fully managed ML service designed to provide data scientists and ML engineers with the tools they need to handle the entire ML workflow. Amazon SageMaker with MLflow is a capability in SageMaker that enables users to create, manage, analyze, and compare their ML experiments seamlessly. It simplifies the often complex and time-consuming tasks involved in setting up and managing an MLflow environment, allowing ML administrators to quickly establish secure and scalable MLflow environments on AWS. See Fully managed MLFlow on Amazon SageMaker for more details.

Enhanced security: AWS VPC and AWS PrivateLink

When working with SageMaker, you can decide the level of internet access to provide to your users. For example, you can give users access permission to download popular packages and customize the development environment. However, this can also introduce potential risks of unauthorized access to your data. To mitigate these risks, you can further restrict which traffic can access the internet by launching your ML environment in an Amazon Virtual Private Cloud (Amazon VPC). With an Amazon VPC, you can control the network access and internet connectivity of your SageMaker environment, or even remove direct internet access to add another layer of security. See Connect to SageMaker through a VPC interface endpoint to understand the implications of running SageMaker within a VPC and the differences when using network isolation.

SageMaker with MLflow now supports AWS PrivateLink, which enables you to transfer critical data from your VPC to MLflow Tracking Servers through a VPC endpoint. This capability enhances the protection of sensitive information by making sure that data sent to the MLflow Tracking Servers is transferred within the AWS network, avoiding exposure to the public internet. This capability is available in all AWS Regions where SageMaker is currently available, excluding China Regions and GovCloud (US) Regions. To learn more, see Connect to an MLflow tracking server through an Interface VPC Endpoint.

In this blog post, we demonstrate how to set up a SageMaker environment in a private VPC (without internet access) while using MLflow capabilities to accelerate ML experimentation.

Solution overview

You can find the reference code for this sample in GitHub. The high-level steps are as follows:

  1. Deploy infrastructure with the AWS Cloud Development Kit (AWS CDK).
  2. Run ML experimentation with MLflow using the @remote decorator from the open-source SageMaker Python SDK.

The overall solution architecture is shown in the following figure.

solution architecture

For your reference, this blog post demonstrates a solution to create a VPC with no internet connection using an AWS CloudFormation template.

Prerequisites

You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, see Creating an AWS account.

Deploy infrastructure with AWS CDK

The first step is to create the infrastructure using this CDK stack. You can follow the deployment instructions from the README.

Let’s first have a closer look at the CDK stack itself.

It defines multiple VPC endpoints, including the MLflow endpoint as shown in the following sample:

vpc.add_interface_endpoint(
    "mlflow-experiments",
    service=ec2.InterfaceVpcEndpointAwsService.SAGEMAKER_EXPERIMENTS,
    private_dns_enabled=True,
    subnets=ec2.SubnetSelection(subnets=subnets),
    security_groups=[studio_security_group]
)

We also try to restrict the SageMaker execution IAM role so that you can use SageMaker MLflow only when you’re in the right VPC.

You can further restrict the VPC endpoint for MLflow by attaching a VPC endpoint policy.
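
As a rough illustration of what such an endpoint policy might contain, the following sketch builds a policy document as a Python dictionary. The `aws:PrincipalAccount` condition, the placeholder account ID, and the helper name `build_mlflow_endpoint_policy` are assumptions made for this example and are not taken from the reference stack; adapt the statement to your own access requirements.

```python
import json


def build_mlflow_endpoint_policy(account_id: str) -> dict:
    """Return an endpoint policy document allowing MLflow actions for one account.

    Hypothetical helper: the condition shown here is an illustrative choice,
    not the policy used by the reference CDK stack.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "sagemaker-mlflow:*",
                "Resource": "*",
                # Only principals from this account may use the endpoint.
                "Condition": {
                    "StringEquals": {"aws:PrincipalAccount": account_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID, for illustration only.
    print(json.dumps(build_mlflow_endpoint_policy("111122223333"), indent=2))
```

The printed JSON is the kind of document you would attach to the interface endpoint for MLflow; the exact statements depend on who should be allowed to reach the tracking server.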

Users outside the VPC can potentially connect to SageMaker MLflow through the VPC endpoint to MLflow. You can add restrictions so that access to SageMaker MLflow is allowed only from your VPC.

studio_execution_role.attach_inline_policy(
    iam.Policy(self, "mlflow-policy",
        statements=[
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=["sagemaker-mlflow:*"],
                resources=["*"],
                conditions={"StringEquals": {"aws:SourceVpc": vpc.vpc_id } }
            )
        ]
    )
)

After successful deployment, you should be able to see the new VPC in the AWS Management Console for Amazon VPC without internet access, as shown in the following screenshot.

vpc

A CodeArtifact domain and a CodeArtifact repository with external connection to PyPI should also be created, as shown in the following figure, so that SageMaker can use it to download necessary packages without internet access. You can verify the creation of the domain and the repository by going to the CodeArtifact console. Choose “Repositories” under “Artifacts” from the navigation pane and you will see the repository “pip”.

CodeArtifact

ML experimentation with MLflow

Setup

After the CDK stack creation, a new SageMaker domain with a user profile should also be created. Launch Amazon SageMaker Studio and create a JupyterLab Space. In the JupyterLab Space, choose an instance type of ml.t3.medium, and select an image with SageMaker Distribution 2.1.0.

To check that the SageMaker environment has no internet connection, open the JupyterLab space and check the internet connection by running the curl command in a terminal.
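
If you prefer to run the same check from a notebook cell rather than a terminal, a minimal sketch along the following lines could be used. The function name `can_reach`, the target host, and the timeout are assumptions chosen for illustration; the `connect` parameter exists only so that the check can be exercised without a network.

```python
import socket


def can_reach(host, port=443, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with connect((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # In a VPC without internet access this is expected to print False.
    print(can_reach("pypi.org"))
```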

no internet access

SageMaker with MLflow now supports MLflow version 2.16.2 to accelerate generative AI and ML workflows from experimentation to production. An MLflow 2.16.2 tracking server is created along with the CDK stack.

You can find the MLflow tracking server Amazon Resource Name (ARN) either from the CDK output or from the SageMaker Studio UI by clicking the “MLflow” icon, as shown in the following figure. You can click the “copy” button next to “mlflow-server” to copy the MLflow tracking server ARN.
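
Before pasting the ARN into a notebook, it can be useful to confirm that it points at the Region and account you expect. The following sketch splits an ARN into its standard colon-separated fields. The example ARN, including its `mlflow-tracking-server/mlflow-server` resource part, is a made-up value assumed for illustration, and `parse_arn` is a hypothetical helper.

```python
def parse_arn(arn: str) -> dict:
    """Split an ARN of the form arn:partition:service:region:account:resource."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts
    # The resource part is assumed to look like <type>/<name>.
    resource_type, _, resource_name = resource.partition("/")
    return {
        "partition": partition,
        "service": service,
        "region": region,
        "account_id": account_id,
        "resource_type": resource_type,
        "resource_name": resource_name,
    }


if __name__ == "__main__":
    # Made-up ARN, for illustration only.
    example = "arn:aws:sagemaker:us-east-1:111122223333:mlflow-tracking-server/mlflow-server"
    print(parse_arn(example))
```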

mlflow-tracking-server

As an example dataset to train the model, download the reference dataset from the public UC Irvine ML repository to your local PC, and name it predictive_maintenance_raw_data_header.csv.

Upload the reference dataset from your local PC to your JupyterLab Space as shown in the following figure.

JupyterLab

To test your private connectivity to the MLflow tracking server, you can download the sample notebook that was uploaded automatically during the creation of the stack to a bucket within your AWS account. You can find the S3 bucket name in the CDK output, as shown in the following figure.

s3 bucket arn

From the JupyterLab app terminal, run the following command:

aws s3 cp --recursive <YOUR-BUCKET-URI> ./

You can now open the private-mlflow.ipynb notebook.

In the first cell, fetch credentials for the CodeArtifact PyPI repository so that SageMaker can use pip from the private AWS CodeArtifact repository. The credentials will expire in 12 hours. Make sure to log on again when they expire.

%%bash
AWS_ACCOUNT=$(aws sts get-caller-identity --output text --query 'Account')
aws codeartifact login --tool pip --repository pip --domain code-artifact-domain --domain-owner ${AWS_ACCOUNT} --region ${AWS_DEFAULT_REGION}

Experimentation

After setup, start the experimentation. In this scenario, we use the XGBoost algorithm to train a binary classification model. Both the data processing job and the model training job use the @remote decorator so that they run in the SageMaker-associated private subnets and security group of your private VPC.

In this case, the @remote decorator looks up the parameter values from the SageMaker configuration file (config.yaml). These parameters are used for data processing and training jobs. We define the SageMaker-associated private subnets and security group in the configuration file. For the full list of supported configurations for the @remote decorator, see Configuration file in the SageMaker Developer Guide.

Note that we specify the aws codeartifact login command in PreExecutionCommands to point SageMaker to the private CodeArtifact repository. This is needed to make sure that the dependencies can be installed at runtime. Alternatively, you can pass a reference to a container in your Amazon ECR through ImageUri, which contains all installed dependencies.

We specify the security group and subnets information in VpcConfig.

config_yaml = f"""
SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      TelemetryOptOut: true
      RemoteFunction:
        # role arn is not required if in SageMaker Notebook instance or SageMaker Studio
        # Uncomment the following line and replace with the right execution role if in a local IDE
        # RoleArn: <replace the role arn here>
        # ImageUri: <replace with your image if you want to avoid installing dependencies at run time>
        S3RootUri: s3://{bucket_prefix}
        InstanceType: ml.m5.xlarge
        Dependencies: ./requirements.txt
        IncludeLocalWorkDir: true
        PreExecutionCommands:
        - "aws codeartifact login --tool pip --repository pip --domain code-artifact-domain --domain-owner {account_id} --region {region}"
        CustomFileFilter:
          IgnoreNamePatterns:
          - "data/*"
          - "models/*"
          - "*.ipynb"
          - "__pycache__"
        VpcConfig:
          SecurityGroupIds: 
          - {security_group_id}
          Subnets: 
          - {private_subnet_id_1}
          - {private_subnet_id_2}
"""

Here’s how you can set up an MLflow experiment for this scenario.

from time import gmtime, strftime

# Mlflow (replace these values with your own, if needed)
project_prefix = project_prefix
tracking_server_arn = mlflow_arn
experiment_name = f"{project_prefix}-sm-private-experiment"
run_name=f"run-{strftime('%d-%H-%M-%S', gmtime())}"
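
Before either decorated function runs, the configuration string built earlier has to be saved where the SageMaker Python SDK can find it. The sketch below writes it to config.yaml and sets the `SAGEMAKER_USER_CONFIG_OVERRIDE` environment variable, which, to our understanding, the SDK consults for a user-level configuration location. The helper name `write_sdk_config` and the shortened placeholder configuration are assumptions for illustration; in the notebook you would pass the full `config_yaml` string.

```python
import os
import tempfile
from pathlib import Path


def write_sdk_config(config_text: str, directory: str) -> Path:
    """Write config.yaml into directory and point the SDK user config at it."""
    path = Path(directory) / "config.yaml"
    path.write_text(config_text, encoding="utf-8")
    # Assumption: the SDK reads this variable to locate the user config.
    os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = str(directory)
    return path


if __name__ == "__main__":
    # Shortened placeholder; use the full config_yaml string in practice.
    placeholder = "SchemaVersion: '1.0'\n"
    with tempfile.TemporaryDirectory() as workdir:
        print(write_sdk_config(placeholder, workdir))
```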

Data preprocessing

During the data processing, we use the @remote decorator to link parameters in config.yaml to your preprocess function.

Note that MLflow tracking starts from the mlflow.start_run() API.

The mlflow.autolog() API can automatically log information such as metrics, parameters, and artifacts.

You can use the log_input() method to log a dataset to the MLflow artifact store.

@remote(keep_alive_period_in_seconds=3600, job_name_prefix=f"{project_prefix}-sm-private-preprocess")
def preprocess(df, df_source: str, experiment_name: str):
    
    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment(experiment_name)    
    
    with mlflow.start_run(run_name=f"Preprocessing") as run:            
        mlflow.autolog()
        
        columns = ['Type', 'Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', 'Machine failure']
        cat_columns = ['Type']
        num_columns = ['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']
        target_column = 'Machine failure'                    
        df = df[columns]

        mlflow.log_input(
            mlflow.data.from_pandas(df, df_source, targets=target_column),
            context="DataPreprocessing",
        )
        
        ...
        
        model_file_path="/opt/ml/model/sklearn_model.joblib"
        os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
        joblib.dump(featurizer_model, model_file_path)

    return X_train, y_train, X_val, y_val, X_test, y_test, featurizer_model

Run the preprocessing job, then go to the MLflow UI (shown in the following figure) to see the tracked preprocessing job with the input dataset.

X_train, y_train, X_val, y_val, X_test, y_test, featurizer_model = preprocess(df=df, 
                                                                              df_source=input_data_path, 
                                                                              experiment_name=experiment_name)

You can open the MLflow UI from SageMaker Studio, as shown in the following figure. Click “Experiments” in the navigation pane and select your experiment.

mlflow-UI

From the MLflow UI, you can see the processing job that just ran.

mlflow experiment

You can also see security details in the SageMaker Studio console in the corresponding training job as shown in the following figure.

training security

Model training

Similar to the data processing job, you can also use the @remote decorator with the training job.

Note that the log_metric() method sends your defined metrics to the MLflow tracking server.

@remote(keep_alive_period_in_seconds=3600, job_name_prefix=f"{project_prefix}-sm-private-train")
def train(X_train, y_train, X_val, y_val,
          eta=0.1, 
          max_depth=2, 
          gamma=0.0,
          min_child_weight=1,
          verbosity=0,
          objective='binary:logistic',
          eval_metric='auc',
          num_boost_round=5):     
    
    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment(experiment_name)
    
    with mlflow.start_run(run_name=f"Training") as run:               
        mlflow.autolog()
             
        # Creating DMatrix(es)
        dtrain = xgboost.DMatrix(X_train, label=y_train)
        dval = xgboost.DMatrix(X_val, label=y_val)
        watchlist = [(dtrain, "train"), (dval, "validation")]
    
        print('')
        print (f'===Starting training with max_depth {max_depth}===')
        
        param_dist = {
            "max_depth": max_depth,
            "eta": eta,
            "gamma": gamma,
            "min_child_weight": min_child_weight,
            "verbosity": verbosity,
            "objective": objective,
            "eval_metric": eval_metric
        }        
    
        xgb = xgboost.train(
            params=param_dist,
            dtrain=dtrain,
            evals=watchlist,
            num_boost_round=num_boost_round)
    
        predictions = xgb.predict(dval)
    
        print ("Metrics for validation set")
        print('')
        print (pd.crosstab(index=y_val, columns=np.round(predictions),
                           rownames=['Actuals'], colnames=['Predictions'], margins=True))
        
        rounded_predict = np.round(predictions)
    
        val_accuracy = accuracy_score(y_val, rounded_predict)
        val_precision = precision_score(y_val, rounded_predict)
        val_recall = recall_score(y_val, rounded_predict)

        # Log additional metrics, next to the default ones logged automatically
        mlflow.log_metric("Accuracy Model A", val_accuracy * 100.0)
        mlflow.log_metric("Precision Model A", val_precision)
        mlflow.log_metric("Recall Model A", val_recall)
        
        from sklearn.metrics import roc_auc_score
    
        val_auc = roc_auc_score(y_val, predictions)
        
        mlflow.log_metric("Validation AUC A", val_auc)
    
        model_file_path="/opt/ml/model/xgboost_model.bin"
        os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
        xgb.save_model(model_file_path)

    return xgb
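
To make the custom metrics logged above more concrete, the following dependency-free sketch computes accuracy, precision, and recall from predicted probabilities. It is not the code used in the notebook, which relies on scikit-learn; the function name `binary_metrics` and the strict 0.5 threshold are assumptions that approximate the rounding step in the training function.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, and recall for binary labels (0 or 1)."""
    # Turn probabilities into hard predictions.
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    print(binary_metrics([1, 0, 1, 1, 0], [0.9, 0.2, 0.4, 0.7, 0.6]))
```

Precision and recall matter here because machine failures are typically rare, so accuracy alone can look high even when the model misses most failures.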

Define hyperparameters and run the training job.

eta=0.3
max_depth=10

booster = train(X_train, y_train, X_val, y_val,
              eta=eta, 
              max_depth=max_depth)

In the MLflow UI, you can see the tracked metrics, as shown in the following figure. On the “Experiments” tab, open the “Training” run of your experiment. The metrics appear on the “Overview” tab.

mlflow training result

You can also view the metrics as graphs. On the “Model metrics” tab, you can see the model performance metrics that were logged as part of the training job.

mlflow training metrics

With MLflow, you can log your dataset information alongside other key metrics, such as hyperparameters and model evaluation. Find more details in the blogpost LLM experimentation with MLFlow.

Clean up

To clean up, first delete all spaces and applications created within the SageMaker Studio domain. Then destroy the infrastructure created by running the following code.

cdk destroy

Conclusion

SageMaker with MLflow allows ML practitioners to create, manage, analyze, and compare ML experiments on AWS. To enhance security, SageMaker with MLflow now supports AWS PrivateLink. All MLflow Tracking Server versions including 2.16.2 integrate seamlessly with this feature, enabling secure communication between your ML environments and AWS services without exposing data to the public internet.

For an extra layer of security, you can set up SageMaker Studio within your private VPC without Internet access and execute your ML experiments in this environment.

SageMaker with MLflow now supports MLflow 2.16.2. Setting up a fresh installation provides the best experience and full compatibility with the latest features.


About the Authors

Xiaoyu Xing is a Solutions Architect at AWS. She is driven by a profound passion for Artificial Intelligence (AI) and Machine Learning (ML). She strives to bridge the gap between these cutting-edge technologies and a broader audience, empowering individuals from diverse backgrounds to learn and leverage AI and ML with ease. She is helping customers to adopt AI and ML solutions on AWS in a secure and responsible way.

Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunications Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Tomer Shenhar is a Product Manager at AWS. He specializes in responsible AI, driven by a passion to develop ethically sound and transparent AI solutions.

Crowning Achievement: NVIDIA Research Model Enables Fast, Efficient Dynamic Scene Reconstruction

Content streaming and engagement are entering a new dimension with QUEEN, an AI model by NVIDIA Research and the University of Maryland that makes it possible to stream free-viewpoint video, which lets viewers experience a 3D scene from any angle.

QUEEN could be used to build immersive streaming applications that teach skills like cooking, put sports fans on the field to watch their favorite teams play from any angle, or bring an extra level of depth to video conferencing in the workplace. It could also be used in industrial environments to help teleoperate robots in a warehouse or a manufacturing plant.

The model will be presented at NeurIPS, the annual conference for AI research that begins Tuesday, Dec. 10, in Vancouver.

“To stream free-viewpoint videos in near real time, we must simultaneously reconstruct and compress the 3D scene,” said Shalini De Mello, director of research and a distinguished research scientist at NVIDIA. “QUEEN balances factors including compression rate, visual quality, encoding time and rendering time to create an optimized pipeline that sets a new standard for visual quality and streamability.”

Reduce, Reuse and Recycle for Efficient Streaming

Free-viewpoint videos are typically created using video footage captured from different camera angles, like a multicamera film studio setup, a set of security cameras in a warehouse or a system of videoconferencing cameras in an office.

Prior AI methods for generating free-viewpoint videos either took too much memory for livestreaming or sacrificed visual quality for smaller file sizes. QUEEN balances both to deliver high-quality visuals — even in dynamic scenes featuring sparks, flames or furry animals — that can be easily transmitted from a host server to a client’s device. It also renders visuals faster than previous methods, supporting streaming use cases.

In most real-world environments, many elements of a scene stay static. In a video, that means a large share of pixels don’t change from one frame to another. To save computation time, QUEEN tracks and reuses renders of these static regions — focusing instead on reconstructing the content that changes over time.
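
The idea of reusing static content can be illustrated with a deliberately simplified 2D toy. This is not QUEEN's algorithm, which reconstructs and compresses 3D scenes; it only shows how skipping unchanged pixels reduces the amount of work per frame. The function `update_render`, the tolerance, and the stand-in `render_pixel` callback are all hypothetical.

```python
def update_render(cached, prev_frame, curr_frame, render_pixel, tolerance=0):
    """Re-render only pixels that changed between two frames.

    Returns the updated render and the number of pixels recomputed.
    """
    updated = []
    recomputed = 0
    for cached_row, prev_row, curr_row in zip(cached, prev_frame, curr_frame):
        new_row = []
        for cached_value, prev_value, curr_value in zip(cached_row, prev_row, curr_row):
            if abs(curr_value - prev_value) > tolerance:
                # Dynamic pixel: pay the rendering cost again.
                new_row.append(render_pixel(curr_value))
                recomputed += 1
            else:
                # Static pixel: reuse the cached result.
                new_row.append(cached_value)
        updated.append(new_row)
    return updated, recomputed


if __name__ == "__main__":
    prev_frame = [[0, 0], [5, 5]]
    curr_frame = [[0, 1], [5, 5]]
    cached = [[0, 0], [10, 10]]  # Earlier output of the stand-in renderer.
    print(update_render(cached, prev_frame, curr_frame, lambda v: v * 2))
```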

Using an NVIDIA Tensor Core GPU, the researchers evaluated QUEEN’s performance on several benchmarks and found the model outperformed state-of-the-art methods for online free-viewpoint video on a range of metrics. Given 2D videos of the same scene captured from different angles, it typically takes under five seconds of training time to render free-viewpoint videos at around 350 frames per second.

This combination of speed and visual quality can support media broadcasts of concerts and sports games by offering immersive virtual reality experiences or instant replays of key moments in a competition.

In warehouse settings, robot operators could use QUEEN to better gauge depth when maneuvering physical objects. And in a videoconferencing application — such as the 3D videoconferencing demo shown at SIGGRAPH and NVIDIA GTC — it could help presenters demonstrate tasks like cooking or origami while letting viewers pick the visual angle that best supports their learning.

The code for QUEEN will soon be released as open source and shared on the project page.

QUEEN is one of over 50 NVIDIA-authored NeurIPS posters and papers that feature groundbreaking AI research with potential applications in fields including simulation, robotics and healthcare.

Generative Adversarial Nets, the paper that first introduced GAN models, won the NeurIPS 2024 Test of Time Award. Cited more than 85,000 times, the paper was coauthored by Bing Xu, distinguished engineer at NVIDIA. Hear more from its lead author, Ian Goodfellow, research scientist at DeepMind, on the AI Podcast:

Learn more about NVIDIA Research at NeurIPS.

See the latest work from NVIDIA Research, which has hundreds of scientists and engineers worldwide, with teams focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.

Academic researchers working on large language models, simulation and modeling, edge AI and more can apply to the NVIDIA Academic Grant Program.

vLLM Joins PyTorch Ecosystem: Easy, Fast, and Cheap LLM Serving for Everyone

We’re thrilled to announce that the vLLM project has become a PyTorch ecosystem project, and joined the PyTorch ecosystem family!

Running large language models (LLMs) is both resource-intensive and complex, especially as these models scale to hundreds of billions of parameters. That’s where vLLM comes in — a high-throughput, memory-efficient inference and serving engine designed for LLMs.

Originally built around the innovative PagedAttention algorithm, vLLM has grown into a comprehensive, state-of-the-art inference engine. A thriving community is also continuously adding new features and optimizations to vLLM, including pipeline parallelism, chunked prefill, speculative decoding, and disaggregated serving.

Since its release, vLLM has garnered significant attention, achieving over 31,000 GitHub stars—a testament to its popularity and thriving community. This milestone marks an exciting chapter for vLLM as we continue to empower developers and researchers with cutting-edge tools for efficient and scalable AI deployment. Welcome to the next era of LLM inference!

vLLM has always had a strong connection with the PyTorch project. It is deeply integrated into PyTorch, leveraging it as a unified interface to support a wide array of hardware backends. These include NVIDIA GPUs, AMD GPUs, Google Cloud TPUs, Intel GPUs, Intel CPUs, Intel Gaudi HPUs, and AWS Neuron, among others. This tight coupling with PyTorch ensures seamless compatibility and performance optimization across diverse hardware platforms.

Do you know you can experience the power of vLLM right from your phone? During this year’s Amazon Prime Day, vLLM played a crucial role in delivering lightning-fast responses to millions of users. Across three regions, over 80,000 Trainium and Inferentia chips powered an average of 3 million tokens per minute, all while maintaining a P99 latency of less than 1 second for the first response. That means when customers opened the Amazon app and chatted with Rufus, they were seamlessly interacting with vLLM in action!

vLLM also collaborates tightly with leading model vendors to ensure support for popular models. This includes tight integration with Meta LLAMA, Mistral, QWen, and DeepSeek models, plus many others. One particularly memorable milestone was the release of LLAMA 3.1 (405B). As the launching partner, vLLM was the first to enable running this very large model, showcasing vLLM’s capability to handle the most complex and resource-intensive language models.

To install vLLM, simply run:

pip install vllm

vLLM is designed for both researchers and production-grade serving.

To run vLLM as an OpenAI API compatible server, just use the Hugging Face model ID:

vllm serve meta-llama/Llama-3.1-8B
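
Once the server is running, any OpenAI-style client can talk to it. The following sketch uses only the Python standard library to build and send a completion request. It assumes the server listens on `http://localhost:8000` and exposes the `/v1/completions` route; the helper names `build_completion_request` and `complete` are hypothetical, so adjust the address and fields to match your deployment.

```python
import json
import urllib.request


def build_completion_request(prompt, model="meta-llama/Llama-3.1-8B",
                             base_url="http://localhost:8000", max_tokens=32):
    """Build an OpenAI-style completion request for a locally running server."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first generated text (needs a live server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("The capital of France is")` would return the generated continuation, provided the server started by the command above is reachable.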

To run vLLM as a simple function:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
   "Hello, my name is",
   "The president of the United States is",
   "The capital of France is",
   "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="meta-llama/Llama-3.1-8B")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
   prompt = output.prompt
   generated_text = output.outputs[0].text
   print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Open-source innovation is part of vLLM’s DNA. Born out of a Berkeley academic project, it follows the legacy of other pioneering open-source initiatives such as BSD, which revolutionized operating systems in the 1980s. Other innovations from the same organization include Apache Spark and Ray, now the standard for big data and AI systems. In the Gen AI era, vLLM serves as a platform dedicated to democratizing AI inference.

The vLLM team remains steadfast in its mission to keep the project “of the community, by the community, and for the community.” Collaboration and inclusivity lie at the heart of everything we do.

If you have collaboration requests or inquiries, feel free to reach out at vllm-questions@lists.berkeley.edu. To join the active and growing vLLM community, explore our GitHub repository or connect with us on the vLLM Slack. Together, we can push the boundaries of AI innovation and make it accessible to all.

Momentum Approximation in Asynchronous Private Federated Learning

This paper was accepted for presentation at the International Workshop on Federated Foundation Models (FL@FM-NeurIPS’24), held in conjunction with NeurIPS 2024.
Asynchronous protocols have been shown to improve the scalability of federated learning (FL) with a massive number of clients. Meanwhile, momentum-based methods can achieve the best model quality in synchronous FL. However, naively applying momentum in asynchronous FL algorithms leads to slower convergence and degraded model performance. It is still unclear how to effectively combine these two techniques to achieve a win-win…
Apple Machine Learning Research

Abstracts: NeurIPS 2024 with Weizhu Chen

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Weizhu Chen, vice president of Microsoft GenAI, joins host Amber Tingle to discuss the paper “Not All Tokens Are What You Need for Pretraining,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). Based on an examination of model training at the token level, Chen and his coauthors present an alternate approach to model pretraining: instead of training language models to predict all tokens, they make a distinction between useful and “noisy” tokens. Doing so, the work shows, improves token efficiency and model performance.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

Our guest today is Weizhu Chen. He is vice president of Microsoft GenAI and coauthor of a paper called “Not All Tokens Are What You Need for Pretraining.” This paper is an oral presentation during the 38th annual Conference on Neural Information Processing Systems, also known as NeurIPS, which is happening this week in Vancouver. Weizhu, thank you for joining us today on Abstracts.


WEIZHU CHEN: Thank you for having me, Amber. 

TINGLE: So let’s start with a brief overview of your paper. In a couple sentences, tell us about the problem your research addresses and, more importantly, why the research community and beyond should know about this work. 

CHEN: So my team basically in Microsoft GenAI, we are working on model training. So one of the things actually we do in the pretraining, we realize the importance of the data. And we found that actually when we do this kind of data for each of the tokens, some token is more important than the other. That’s one. The other one actually is some token actually is very, very hard to be predicted during the pretraining. So, for example, just like if someone see the text of “Weizhu,” and what’s the next token? It can be “Chen”; it can be any of the last name. So it’s very hard to be predicted. And if we try to enforce a language model to focus on this, kind of, the hard-to-predict token, just like actually it’s going to confuse the language model. There are so many different kinds of the example like this. Just like, for example, the serial number in your UPS. So the focus of this paper is try to identify which token actually is more important for the language model to learn. And actually the other token maybe is just the noise. And how can we try to discriminate the token—which is good token, which is noise token. Basically, you try to understand this kind of dynamic of the tokens. 

TINGLE: How did you conduct this research? 

CHEN: Actually we do a lot of work in the model training, including the pretraining and the post-training. So for the pretraining side, actually the most important thing to us is the data. We also try to understand, how can we leverage the existing data, and how can we create much more data, as well? And data basically is one of the most important thing to build a better foundation model. So we try to understand how much more we can get from the data. And the important thing for the data is about data filtering. So you think about actually in the previous literature, we do the data filtering, for example, just like we build a classifier to classify, OK, this page is more important than the other. And this page actually is a noise because there’s so much noise data in the web. So we just keep the best data to get into the pretraining corpus. And further away, we think about, OK, yeah, so this is … maybe it’s not fine grain enough, so can we try to understand even for the same page we want to keep? So some token is more important than the other. Maybe some token just some noise token. Actually you put this data into the pretraining, it’s going to hurt the model quality. So there is the motivation actually we try to think about.

TINGLE: And what were your major findings? 

CHEN: Our major finding is about basically, definitely this works so well. And it’s so important that actually we are able to get the best token from the corpus and then make it available and try to ask the model during the pretraining to ignore the token we don’t want to get into the model itself. So that is one. The second thing definitely data is the other very important thing. If you’re able to figure out the better way to build a better data is most likely you’re able to build a much better foundation model. The third thing actually is also connected to a lot of other existing work, just like data synthesis, just like distillation, just like data filtering, and so a lot of things are really connected together. And actually, this work, basically, you can associate with also a lot of other work we are working on, just like distillation. You can think about, for example, for this work, we also try to build a model, a reference model—we call as the reference model—to try to identify actually this data, this token, is more important than the other and try to understand the discrepancy between the reference model and the running model, their prediction on each tokens. So you can think about also it’s some kind of the try to distill from the reference model to the existing model, as well. 
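The idea described above, comparing a reference model's per-token predictions with those of the model being trained and ignoring tokens that are not worth learning, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the selection rule (keep the tokens with the largest gap between training loss and reference loss), the `keep_ratio` parameter, and both function names are assumptions made for the example, and the loss values are made up.

```python
import math


def select_tokens(train_losses, ref_losses, keep_ratio=0.6):
    """Return a 0/1 mask keeping the tokens with the highest excess loss.

    Excess loss = loss under the model being trained minus loss under the
    reference model. keep_ratio is a hypothetical hyperparameter.
    """
    if len(train_losses) != len(ref_losses):
        raise ValueError("loss lists must have the same length")
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    n_keep = max(1, math.ceil(keep_ratio * len(excess)))
    # Indices of the n_keep largest excess losses.
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(excess))]


def masked_mean_loss(train_losses, mask):
    """Average the training loss over the selected tokens only."""
    kept = [loss for loss, m in zip(train_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Toy per-token losses (made-up numbers for illustration only).
train = [2.0, 5.0, 1.0, 4.0]
ref = [0.5, 4.9, 0.9, 1.0]
mask = select_tokens(train, ref, keep_ratio=0.5)
print(mask)
print(masked_mean_loss(train, mask))
```

In this toy example the second token has a high loss under both models, like the hard-to-predict surname in the example earlier in the conversation, so its excess loss is small and it is masked out of the training loss.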

TINGLE: Let’s talk a little bit about real-world impact. Who benefits most from this work? And how significant is this within your discipline and even downstream for people using applications? 

CHEN: This actually is very, very fundamental work because just like I share a little bit before, actually we build the data and this data is—build the data much better—is able to build a much better foundation model. If we’re able to build a better model actually is able to benefit so many different kinds of application. This also is going to help us to build a much better small language model. And we can also serve this model even in the edge side, in the client side, in the coding scenario. So we are going to see actually huge impact from this kind of the foundation model if you are able to benefit from building much better training data. 

TINGLE: Are there any unanswered questions or unsolved problems in this area? What’s next on your research agenda? 

CHEN: Yeah, I think that is a very good questions. And definitely there’s a lot of things about how to build a better data [that] is unsolved yet in the literature. And especially because when you do the pretraining, the most important part is the data, but the data is very limited. And how can we make better use from the existing limited data is a big challenge. Because we can increase the model by 10x, but it’s super hard to increase the data by 10x, especially when we want to deal with the high quality of data. The other way, even given the data, how can you identify, especially for this work, the importance of each token to build a much better model? I think all these things are very connected together. To me, actually, data is the oxygen. So there are still so many things we are able to do in the data, including building for even the small language model or the large model. 

TINGLE: Data is oxygen—I love that! So other than that being a key takeaway, is there any other one thing that you’d like our listeners to walk away from this conversation knowing? 

CHEN: I would love to say actually focus more on this kind of data and focus more about how can I get more from the data actually; it is the very important thing. And the other thing actually, we are working on something that’s very exciting. You can feel free to come to join us if you are very interested in this area. 

[MUSIC] 

TINGLE: Well, Weizhu Chen, thank you for joining us today. We really appreciate it. 

CHEN: Thank you. Thank you for having me. 

TINGLE: And thanks to our listeners for tuning in. If you’d like to read the full paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts.

[MUSIC FADES] 

The post Abstracts: NeurIPS 2024 with Weizhu Chen appeared first on Microsoft Research.


Thailand and Vietnam Embrace Sovereign AI to Drive Economic Growth


Southeast Asia is embracing sovereign AI.

The prime ministers of Thailand and Vietnam this week met with NVIDIA founder and CEO Jensen Huang to discuss initiatives that will accelerate AI innovation in their countries.

During his visit to the region, Huang also joined Bangkok-based cloud infrastructure company SIAM.AI Cloud onstage for a fireside chat on sovereign AI. In Vietnam, he announced NVIDIA’s collaboration with the country’s government on an AI research and development center — and NVIDIA’s acquisition of VinBrain, a health technology startup funded by Vingroup, one of Vietnam’s largest public companies.

These events capped a year of global investments in sovereign AI, the ability for countries to develop and harness AI using domestic computing infrastructure, data and workforces. AI will contribute nearly $20 trillion to the global economy through the end of the decade, according to IDC.

Canada, Denmark and Indonesia are among the countries that have announced initiatives to develop sovereign AI infrastructure powered by NVIDIA technology. And at the recent NVIDIA AI Summits in India and Japan, leading enterprises, infrastructure providers and startups in both countries announced sovereign AI projects in sectors including finance, healthcare and manufacturing.

Supporting Sovereign Cloud Infrastructure in Thailand

Huang’s Southeast Asia visit kicked off with a meeting with Thailand Prime Minister Paetongtarn Shinawatra, where he discussed the opportunities for sovereign AI development in Thailand and shared memories of his childhood years spent in Bangkok.

The pair discussed how further investing in AI education and training can help Thailand drive AI innovations in fields such as weather prediction, climate simulation and healthcare. NVIDIA is working with dozens of local universities and startups to support AI advancement in the country.

Huang and Shinawatra met in the Purple Room of the Thai-Khu-Fah building, which houses the offices of the prime minister and cabinet.

Huang later took the stage at an “AI Vision for Thailand” event hosted by SIAM.AI Cloud, a cloud platform company that offers customers access to virtual servers featuring NVIDIA Tensor Core GPUs.

“The most important part of artificial intelligence is the data. And the data of Thailand belongs to the Thai people,” Huang said in a fireside chat with Ratanaphon Wongnapachant, CEO of SIAM.AI Cloud. Highlighting the importance of sovereign AI development, Huang said, “The digital data of Thailand encodes the knowledge, the history, the culture, the common sense of your people. It should be harvested by your people.”

Following the conversation, Wongnapachant gifted Huang a custom leather jacket lined with Thai silk. The pair also signed an NVIDIA DGX H200 system in recognition of SIAM.AI Cloud’s plans to expand its offerings to NVIDIA H200 Tensor Core GPUs and NVIDIA GB200 Grace Blackwell Superchips.

Advancing AI From Research to Industry in Vietnam

In Hanoi the next day, Huang met with Vietnam’s Prime Minister Pham Minh Chinh, and NVIDIA signed an agreement to build the company’s first research and development center in the country. The center will focus on software development and collaborate with Vietnam’s enterprises, startups, government agencies and universities to accelerate AI adoption in the country.

The announcement builds on NVIDIA’s existing work with 65 universities in Vietnam and more than 100 of the country’s AI startups through NVIDIA Inception, a global program designed to help startups evolve faster. NVIDIA has acquired Inception member VinBrain, a Hanoi-based company that applies AI diagnostics to multimodal health data.

While in Vietnam, Huang also received the 2024 VinFuture Prize alongside AI pioneers Yoshua Bengio, Geoffrey Hinton, Yann LeCun and Fei-Fei Li for their “transformational contributions to the advancement of deep learning.”

Broadcast live nationally in the country, the awards ceremony was hosted by the VinFuture Foundation, a nonprofit that recognizes innovations in science and technology with significant societal impact.

“Our award today is recognition by the VinFuture committee of the transformative power of AI to revolutionize every field of science and every industry,” Huang said in his acceptance speech.

Bengio, Huang and LeCun accepted the 2024 VinFuture Prize onstage in Hanoi.

Learn more about sovereign AI.

Editor’s note: The data on the economic impact of AI is from IDC’s press release titled “IDC: Artificial Intelligence Will Contribute $19.9 Trillion to the Global Economy through 2030 and Drive 3.5% of Global GDP in 2030,” published in September 2024.


Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 are now available on SageMaker JumpStart


Today, we are excited to announce that Mistral-NeMo-Base-2407 and Mistral-NeMo-Instruct-2407—twelve billion parameter large language models from Mistral AI that excel at text generation—are available for customers through Amazon SageMaker JumpStart. You can try these models with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference. In this post, we walk through how to discover, deploy and use the Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 models for a variety of real-world use cases.

Mistral-NeMo-Instruct-2407 and Mistral-NeMo-Base-2407 overview

Mistral NeMo, a powerful 12B parameter model developed through collaboration between Mistral AI and NVIDIA and released under the Apache 2.0 license, is now available on SageMaker JumpStart. This model represents a significant advancement in multilingual AI capabilities and accessibility.

Key features and capabilities

Mistral NeMo features a 128k token context window, enabling processing of extensive long-form content. The model demonstrates strong performance in reasoning, world knowledge, and coding accuracy. Both pre-trained base and instruction-tuned checkpoints are available under the Apache 2.0 license, making it accessible for researchers and enterprises. The model’s quantization-aware training facilitates optimal FP8 inference performance without compromising quality.

Multilingual support

Mistral NeMo is designed for global applications, with strong performance across multiple languages including English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. This multilingual capability, combined with built-in function calling and an extensive context window, helps make advanced AI more accessible across diverse linguistic and cultural landscapes.

Tekken: Advanced tokenization

The model uses Tekken, an innovative tokenizer based on tiktoken. Trained on over 100 languages, Tekken offers improved compression efficiency for natural language text and source code.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is the Model Hub, which offers a vast catalog of pre-trained models, such as DBRX, for a variety of tasks.

You can now discover and deploy both Mistral NeMo models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and machine learning operations (MLOps) controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your virtual private cloud (VPC) controls, helping to support data security.

Prerequisites

To try out both NeMo models in SageMaker JumpStart, you will need the following prerequisites:

Discover Mistral NeMo models in SageMaker JumpStart

You can access NeMo models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, see Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

Then choose HuggingFace.

From the SageMaker JumpStart landing page, you can search for NeMo in the search box. The search results will list Mistral NeMo Instruct and Mistral NeMo Base.

You can choose the model card to view details about the model such as license, data used to train, and how to use the model. You will also find the Deploy button to deploy the model and create an endpoint.

Deploy the model in SageMaker JumpStart

Deployment starts when you choose the Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.

Deploy the model with the SageMaker Python SDK

To deploy using the SDK, we start by selecting the Mistral NeMo Base model, specified by the model_id with the value huggingface-llm-mistral-nemo-base-2407. You can deploy your choice of the selected models on SageMaker with the following code. Similarly, you can deploy NeMo Instruct using its own model ID.

from sagemaker.jumpstart.model import JumpStartModel 

accept_eula = True 

model = JumpStartModel(model_id="huggingface-llm-mistral-nemo-base-2407") 
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The accept_eula value must be explicitly set to True to accept the end-user license agreement (EULA). Also make sure that your account-level service quota for ml.g6.12xlarge for endpoint usage allows at least one instance. You can follow the instructions in AWS service quotas to request a service quota increase. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Hello"
        }
    ],
    "max_tokens": 1024,
    "temperature": 0.3,
    "top_p": 0.9,
}

response = predictor.predict(payload)['choices'][0]['message']['content'].strip()
print(response)

An important thing to note here is that we’re using the djl-lmi v12 inference container, so we’re following the large model inference chat completions API schema when sending a payload to both Mistral-NeMo-Base-2407 and Mistral-NeMo-Instruct-2407.
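Because every request in this post uses the same chat completions shape, it can help to build payloads with a small helper instead of writing each dictionary by hand. The following is a minimal sketch: the helper names build_chat_payload and extract_text_and_usage are hypothetical, the default values mirror the first payload above, and the response shape is assumed from the way the post indexes into the prediction result. The stand-in response is not real model output.

```python
def build_chat_payload(user_prompt, system_prompt=None, max_tokens=1024,
                       temperature=0.3, top_p=0.9):
    """Build a chat-completions style payload; each message is its own dict."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def extract_text_and_usage(response):
    """Pull the generated text and the token usage out of one response dict."""
    text = response["choices"][0]["message"]["content"].strip()
    return text, response.get("usage", {})


payload = build_chat_payload("Hello")

# Stand-in response with the shape assumed above; not real model output.
fake_response = {
    "choices": [{"message": {"content": " Hi there! "}}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8},
}
text, usage = extract_text_and_usage(fake_response)
print(payload)
print(text, usage)
```

With a deployed endpoint, the same helpers would be used as extract_text_and_usage(predictor.predict(build_chat_payload("Hello"))), which calls the endpoint once and returns both the text and the usage counts.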

Mistral-NeMo-Base-2407

You can interact with the Mistral-NeMo-Base-2407 model like other standard text generation models, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide some example prompts and sample output. Keep in mind that the base model is not instruction fine-tuned.

Text completion

Tasks involving predicting the next token or filling in missing tokens in a sequence:

payload = {
    "messages": [
        {
            "role": "user",
            "content": "The capital of France is ___."
        }
    ],
    "max_tokens": 10,
    "temperature": 0.3,
    "top_p": 0.9,
}

response = predictor.predict(payload)['choices'][0]['message']['content'].strip()
print(response)

The following is the output:

Paris
The capital of France is Paris.

Mistral NeMo Instruct

The Mistral-NeMo-Instruct-2407 model is a quick demonstration that the base model can be fine-tuned to achieve compelling performance. You can follow the steps provided to deploy the model and use the model_id value of huggingface-llm-mistral-nemo-instruct-2407 instead.
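As a sketch of that substitution, the following snippet collects the constructor arguments for the Instruct model and optionally overrides the instance type. The helper names are hypothetical, and ml.g6.12xlarge is used only because it is the instance type mentioned for the service quota earlier; whether it is the default for this model is an assumption. The deployment call is left commented out because it requires AWS credentials and creates billable resources.

```python
INSTRUCT_MODEL_ID = "huggingface-llm-mistral-nemo-instruct-2407"


def jumpstart_model_kwargs(model_id, instance_type=None):
    """Collect JumpStartModel constructor arguments, omitting unset overrides."""
    kwargs = {"model_id": model_id}
    if instance_type is not None:
        kwargs["instance_type"] = instance_type
    return kwargs


def deploy_jumpstart_model(model_kwargs, accept_eula=True):
    """Deploy to a SageMaker endpoint and return the predictor."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(**model_kwargs)
    return model.deploy(accept_eula=accept_eula)


kwargs = jumpstart_model_kwargs(INSTRUCT_MODEL_ID, instance_type="ml.g6.12xlarge")
print(kwargs)
# predictor = deploy_jumpstart_model(kwargs)  # uncomment to actually deploy
```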

The instruction-tuned NeMo model can be tested with the following tasks:

Code generation

Mistral NeMo Instruct demonstrates benchmarked strengths for coding tasks. Mistral states that their Tekken tokenizer for NeMo is approximately 30% more efficient at compressing source code. For example, see the following code:

payload = {
    "messages": [
        {
            "role": "user",
            "content": """Create a Binary Search Tree class with methods for insertion, searching, and in-order traversal."""
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.3,
    "top_p": 0.9,
}
# Call the endpoint once and reuse the response for both the text and the token usage
response = predictor.predict(payload)
text_response = response['choices'][0]['message']['content'].strip()
tokens = response['usage']
print(text_response)
print(tokens)

The following is the output:

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

class BinarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, key):
        if not self.root:
            self.root = Node(key)
        else:
            self._insert(self.root, key)

    def _insert(self, node, key):
        if key < node.key:
            if node.left:
                self._insert(node.left, key)
            else:
                node.left = Node(key)
        elif key > node.key:
            if node.right:
                self._insert(node.right, key)
            else:
                node.right = Node(key)

    def search(self, key):
        return self._search(self.root, key)

    def _search(self, node, key):
        if not node or node.key == key:
            return node

        if key < node.key:
            return self._search(node.left, key)
        else:
            return self._search(node.right, key)

    def inorder_traversal(self):
        self._inorder_traversal(self.root)
        print()

    def _inorder_traversal(self, node):
        if node:
            self._inorder_traversal(node.left)
            print(node.key, end=" ")
            self._inorder_traversal(node.right)

# Example usage:
bst = BinarySearchTree()
bst.insert(50)
bst.insert(30)
bst.insert(20)
bst.insert(40)
bst.insert(70)
bst.insert(60)
bst.insert(80)

print("In-order traversal:")
bst.inorder_traversal()  # Output: 20 30 40 50 60 70 80

print(f"Search 40: {bst.search(40).key if bst.search(40) else 'Not found'}")
print(f"Search 90: {bst.search(90).key if bst.search(90) else 'Not found'}")
{'prompt_tokens': 22, 'completion_tokens': 433, 'total_tokens': 455}

The model returns a complete implementation for this prompt, and the usage field reports the token counts for the request: 22 prompt tokens and 433 completion tokens. The completion_tokens value gives a rough view of how compactly the Tekken tokenizer represents source code.
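One rough way to put a number on tokenizer efficiency is the average number of characters of generated text per completion token. The following is a minimal sketch with a hypothetical helper name and made-up inputs; a single ratio does not show compression on its own, and a fair comparison would tokenize the same text with a second tokenizer.

```python
def chars_per_token(text, usage):
    """Average characters of generated text per completion token."""
    tokens = usage["completion_tokens"]
    if tokens <= 0:
        raise ValueError("completion_tokens must be positive")
    return len(text) / tokens


# Made-up text and usage values, for illustration only.
sample_text = "abcdefgh"
sample_usage = {"prompt_tokens": 10, "completion_tokens": 4, "total_tokens": 14}
print(chars_per_token(sample_text, sample_usage))
```

With a real response, the inputs would be the generated text and the usage dictionary returned by the endpoint.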

Advanced math and reasoning

The model also reports strengths in mathematical and reasoning accuracy. For example, see the following code:

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are an expert in mathematics and reasoning. Your role is to provide examples, explanations, and insights related to mathematical concepts, problem-solving techniques, and logical reasoning.",
        },
        {
            "role": "user",
            "content": """Calculating the orbital period of an exoplanet:
             Given: An exoplanet orbits its star at a distance of 2.5 AU (Astronomical Units). The star has a mass of 1.2 solar masses.
             Task: Calculate the orbital period of the exoplanet in Earth years."""
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.3,
    "top_p": 0.9,
}
response = predictor.predict(payload)['choices'][0]['message']['content'].strip()
print(response)

The following is the output:

To calculate the orbital period of an exoplanet, we can use Kepler's Third Law, which states that the square of the orbital period (P) is directly proportional to the cube of the semi-major axis (a) of the orbit and inversely proportional to the mass (M) of the central body. The formula is:

P^2 = (4 * π^2 * a^3) / (G * M)

where:
- P is the orbital period in years,
- a is the semi-major axis in AU (Astronomical Units),
- G is the gravitational constant (6.67430 × 10^-11 m^3 kg^-1 s^-2),
- M is the mass of the star in solar masses.

First, we need to convert the mass of the star from solar masses to kilograms. The mass of the Sun is approximately 1.98847 × 10^30 kg. So, the mass of the star is:

M = 1.2 * 1.98847 × 10^30 kg = 2.386164 × 10^30 kg

Now, we can plug the values into Kepler's Third Law:

P^2 = (4 * π^2 * (2.5 AU)^3) / (G * M)

Since 1 AU is approximately 1.496 × 10^11 meters, the semi-major axis in meters is:

a = 2.5 AU * 1.496 × 10^11 m/AU = 3.74 × 10^12 m

Now, we can calculate P^2:

P^2 = (4 * π^2 * (3.74 × 10^12 m)^3) / (6.67430 × 10^-11 m^3 kg^-1 s^-2 * 2.386164 × 10^30 kg)

P^2 = (4 * π^2 * 5.62 × 10^36 m^3) / (1.589 × 10^20 m^3 kg^-1 s^-2)

P^2 = (4 * π^2 * 5.62 × 10^36 m^3) / (1.589 × 10^20 m^3 kg^-1 s^-2)

P^2 = 4.15 × 10^16 s^2

Now, we take the square root to find the orbital period in seconds:

P = √(4.15 × 10^16 s^2) ≈ 2.04 × 10^8 s

Finally, we convert the orbital period from seconds to Earth years (1 Earth year = 31,557,600 seconds):

P = (2.04 × 10^8 s) / (31,557,600 s/year) ≈ 6.47 years

Therefore, the orbital period of the exoplanet is approximately 6.47 Earth years.
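The sample output above is reproduced as the model generated it. Its approach follows Kepler's third law, but the conversion step gives 2.5 AU as 3.74 × 10^12 m, whereas 2.5 × 1.496 × 10^11 m is 3.74 × 10^11 m, so the final figure is worth checking independently. The following is a minimal sketch of such a check, computing the period in solar units (P² = a³ / M, with P in years, a in AU, and M in solar masses) and again in SI units. The function names are hypothetical, and the planet's mass is assumed to be negligible compared with the star's.

```python
import math


def orbital_period_years(a_au, star_mass_solar):
    """Kepler's third law in solar units: P^2 = a^3 / M."""
    return math.sqrt(a_au ** 3 / star_mass_solar)


def orbital_period_years_si(a_au, star_mass_solar):
    """The same law in SI units, as a cross-check on the unit conversions."""
    G = 6.67430e-11      # gravitational constant, m^3 kg^-1 s^-2
    M_SUN = 1.98847e30   # solar mass, kg
    AU = 1.496e11        # astronomical unit, m
    YEAR = 31_557_600    # seconds in a Julian year
    a = a_au * AU
    m = star_mass_solar * M_SUN
    period_s = 2 * math.pi * math.sqrt(a ** 3 / (G * m))
    return period_s / YEAR


print(round(orbital_period_years(2.5, 1.2), 2))
print(round(orbital_period_years_si(2.5, 1.2), 2))
```

Both routes give approximately 3.61 Earth years for a = 2.5 AU and M = 1.2 solar masses.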

Language translation task

In this task, let’s test Mistral’s new Tekken tokenizer. Mistral states that the tokenizer is two times and three times more efficient at compressing Korean and Arabic, respectively.

Here, we use some text for translation:

text= """
"How can our business leverage Mistral NeMo with our new RAG application?"
"What is our change management strategy once we roll out this new application to the field?"
"""

We set our prompt to instruct the model on the translation to Korean and Arabic:

prompt=f"""

text={text}

Translate the following text into these languages:

1. Korean
2. Arabic

Label each language section accordingly"""

We can then set the payload:

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are an expert in language translation.",
        },
        {
            "role": "user",
            "content": prompt
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.3,
    "top_p": 0.9,
}
# Call the endpoint once and reuse the response for both the text and the token usage
response = predictor.predict(payload)
text_response = response['choices'][0]['message']['content'].strip()
tokens = response['usage']
print(text_response)
print(tokens)

The following is the output:

**1. Korean**

- "우리의 비즈니스가 Mistral NeMo를 어떻게 활용할 수 있을까요?"
- "이 새 애플리케이션을 현장에 롤아웃할 때 우리의 변화 관리 전략은 무엇입니까?"

**2. Arabic**

- "كيف يمكن لعمليتنا الاست من Mistral NeMo مع تطبيق RAG الجديد؟"
- "ما هو استراتيجيتنا في إدارة التغيير بعد تفعيل هذا التطبيق الجديد في الميدان؟"
{'prompt_tokens': 61, 'completion_tokens': 243, 'total_tokens': 304}

The translation output uses 243 completion tokens to render both sentences in Korean and Arabic, languages that are typically token-intensive. Mistral attributes this economy to the optimizations in the Tekken tokenizer. Lower token counts are particularly valuable for token-heavy applications, including summarization, language generation, and multi-turn conversations. By improving token efficiency, the Tekken tokenizer allows more tasks to be handled within the same resource constraints, which matters wherever token usage directly affects performance and cost.

Clean up

After you’re done running the notebook, make sure to delete all resources that you created in the process to avoid additional billing. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Mistral NeMo Base and Instruct in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.

For more Mistral resources on AWS, check out the Mistral-on-AWS GitHub repository.


About the authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics.

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

Shane Rai is a Principal Generative AI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML services provided by AWS, including model offerings from top tier foundation model providers.


Abstracts: NeurIPS 2024 with Pranjal Chitale



Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Research Fellow Pranjal Chitale joins host Gretchen Huizinga to discuss the paper “CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). CVQA, which comprises questions and images representative of 31 languages and the cultures of 30 countries, was created in collaboration with native speakers and cultural experts to evaluate how well models perform across diverse linguistic and cultural contexts, an important step toward improving model inclusivity.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract— of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Pranjal Chitale, a research fellow at Microsoft Research India. Pranjal is coauthor of a paper called “CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark,” and this paper is an oral presentation at this week’s 38th annual Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC. Pranjal, thanks for joining us today on Abstracts!


PRANJAL CHITALE: Hi, Gretchen. Thanks for having me.

HUIZINGA: So, Pranjal, give us an overview of this paper. In a couple sentences, what problem are you trying to solve, and why should people care about it?

CHITALE: So we are witnessing some exciting times as LLMs are rapidly evolving as tools for countless use cases. While most of these LLMs were initially leveraged for natural language processing tasks, they are now expanded across languages and modalities. However, a major gap lies in the availability of multimodal data for non-English languages. Therefore, most multimodal models might not have coverage for non-English languages altogether or might just heavily rely on translations of the associated text in English-centric datasets so as to support multiple languages. The drawback of this approach is that it often misses the cultural nuances of local languages. And another reason why this is not optimal is the images are mostly Western-centric [and] therefore would not be well reflective of the local culture of a lot of regions. So this kind of bias can skew these models towards a Western perspective, raising concerns about inclusivity and safety of the content which they generate when serving a global population, which involves multicultural and multilingual users. Therefore, for a truly inclusive AI ecosystem, models must demonstrate cultural understanding to ensure that the generated content is safe, respectful for diverse communities. Evaluating cultural awareness, though, is extremely challenging because how to define culture itself is an unsolved problem. However, in this work, we are trying to take a step towards having a proxy which could measure cultural understanding.

HUIZINGA: Well, talk about how you did this. What methodology did you use for this paper, and what were your major findings?

CHITALE: Now that we have defined our broader problem, it is important to decide the scope of our solution because, as we discussed, culture is an umbrella term. So we need to define a smaller scope for this problem. We chose visual question answering, which is a multimodal task, and it is one of the most critical multimodal tasks for the scope of this work. So recognizing the limitations of existing VQA benchmarks, which often rely on translations and lack cultural representation, we developed CVQA, which is Culturally-diverse multilingual VQA benchmark. CVQA spans 30 countries, 31 languages, and has over 10,000 culturally nuanced questions, which were crafted by native speakers and cultural experts. So our focus was on creating questions which required what we term as cultural common sense to answer. For instance, with just the image, it is not possible to answer the question. You need some cultural awareness about the local culture to be able to answer the question. So these questions draw inspiration from knowledge of local culture. So one important aspect of this dataset is that we include both local language as well as English variants of the same question to allow robust testing of models across linguistic concepts. I would say the crux of this effort is that while most of the prior efforts may be small in terms of language—it could be language-group specific or country specific for most—but we wanted this to be a much larger global-scale collaborative effort. So this covers 31 languages across 30 countries. So to build CVQA, we worked with qualified volunteers from diverse age group and genders, ensuring that the questions authentically represented their cultures. So images which were collected, those were ensured to be copyright free, grounded in culture, and safe for work with strict guidelines to ensure that we avoid images which reflect some stereotypes or privacy violations. 
And we also had 10 categories, which involved topics ranging from daily life, sports, cuisine to history of the region, so a holistic view of the culture of the region. So each question was crafted as a multiple-choice task with challenging answer options which required both the image as well as cultural knowledge to solve. We also employed a maker-checker approach to ensure quality and consistency.

HUIZINGA: So you’ve created the benchmark. You’ve tested it. What were your major findings?

CHITALE: Now that we have created a benchmark, the next step is to evaluate how these multimodal models are performing on this benchmark. So we benchmark several state-of-the-art multimodal models, which include both open-source offerings like CLIP, BLIP, LLaVA-1.5, and proprietary offerings like GPT-4o or Gemini 1.5 Flash. So what we observed is there is a huge gap when it comes … in performance when we compare these proprietary offerings versus the open-source models. So GPT-4o was the highest-performing model with 75.4% accuracy on English prompts and 74.3% accuracy on local prompts. However, the story is completely different when we go to open-source models. These open-source models significantly lag behind the proprietary models. And one key finding over these open-source models is that these models perform even worse when prompted in the native language when we compare it to prompting in English. This potentially highlights that these models lack multilingual understanding capabilities, which may be because multilingual training data is pretty scarce.

HUIZINGA: Yeah.

CHITALE: So LLaVA-1.5 turned out to be the best open-source model. So one thing to notice, LLaVA-1.5 performs well across a large set of English VQA benchmarks, but when it comes to cultural understanding, it is a pretty weak model. Further, we also did some ablations to understand if adding location-specific information to the textual prompts has some impact or not, but we found that it does not result in any significant performance improvements. Further, we also conducted a category-wise analysis. So, as we had mentioned, there are 10 categories to which these images belong. So what we observed is that certain categories, like people and everyday life, consistently saw higher accuracy across a large set of models. This is likely due to the abundance of human activity data in training datasets. However, when it comes to niche categories like cooking and food or pop culture, which are much more challenging, especially in local languages, these models struggle. Therefore, these are the kind of highly diverse cultural contexts which need improvement.
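
Editorial illustration, not part of the conversation: the comparison described above (English versus local-language prompts, plus the category-wise breakdown) comes down to grouping multiple-choice predictions and computing accuracy per group. The following is a minimal Python sketch under stated assumptions. The `Prediction` fields, the `accuracy_by` helper, and the toy records are all hypothetical; they are not the CVQA schema, the authors' evaluation code, or real results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One model answer to one multiple-choice question (hypothetical schema)."""
    category: str         # e.g. "Cooking and food"
    prompt_language: str  # "english" or "local"
    chosen_option: int    # index of the option the model picked
    correct_option: int   # index of the gold option


def accuracy_by(predictions, key):
    """Return {group: accuracy}, where the group is given by key(prediction)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in predictions:
        group = key(p)
        totals[group] += 1
        hits[group] += int(p.chosen_option == p.correct_option)
    return {group: hits[group] / totals[group] for group in totals}


# Toy, made-up predictions purely to exercise the function.
toy = [
    Prediction("People and everyday life", "english", 0, 0),
    Prediction("People and everyday life", "local", 2, 2),
    Prediction("Cooking and food", "english", 1, 1),
    Prediction("Cooking and food", "local", 3, 1),
]

by_language = accuracy_by(toy, key=lambda p: p.prompt_language)
by_category = accuracy_by(toy, key=lambda p: p.category)
```

Passing a different `key` function gives a different slice of the same predictions, so one helper can cover the per-language, per-category, or per-country views mentioned in the discussion.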

HUIZINGA: How’s this work going to make an impact outside the lab and in the real world?

CHITALE: CVQA is significant because it addresses a fundamental gap in how we evaluate vision-language and multimodal models today. While proprietary models are making impressive strides, open-source models, which are more accessible and easier to deploy, significantly lag behind in terms of cultural awareness and safety. So CVQA fills this gap and provides a much-needed benchmark to help us identify these gaps in the first place. To fix them, we first need to identify the gaps, and whether we are progressing or not can be captured by this benchmark. So for the real world, this benchmark does have some far-reaching implications. Models which understand culture are not just technically better, but they would create interactions which are far more engaging, natural, and safe for users from diverse backgrounds. So this benchmark offers entirely new axes for improvement: cultural awareness and linguistic diversity. Therefore, by improving a model’s ability to handle culturally nuanced questions, CVQA ensures researchers and developers think beyond accuracy and also focus on cultural awareness and inclusivity before shipping these models into production.

HUIZINGA: Pranjal, what are the unanswered questions or unsolved problems in this field, and what do you plan to do about it?

CHITALE: So while CVQA makes some strides in addressing cultural and linguistic diversity, there is still much more to explore in this space. So this dataset only covers 31 languages and cultures, but this is just, like, a subset of the incredible diversity that exists globally. Many languages and cultures remain underrepresented, especially those that are endangered or have limited digital resources. So expanding CVQA to include more of these languages would be a natural next step. Secondly, CVQA just focuses on single-turn question-answer pairs. But in reality, human interaction is often multi-turn and conversational in nature. So a multi-turn version of CVQA could better simulate real-world use cases and challenge models to maintain cultural and contextual awareness over extended dialogues. Another interesting area is personalization. So it would be very interesting if we could teach models to adapt to a user’s cultural background, preferences, or even regional nuances in real time. This remains a significant challenge, although this benchmark could help us move a step towards our broader goal.

[MUSIC]

HUIZINGA: Well, Pranjal Chitale, this is super important research, and thank you for joining us today. To our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find it at aka.ms/abstracts. You can also find it on arXiv and on the NeurIPS website. And if you’re at NeurIPS, you can also go hear about it. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Pranjal Chitale appeared first on Microsoft Research.

Abstracts: NeurIPS 2024 with Dylan Foster

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Principal Researcher Dylan Foster joins host Amber Tingle to discuss the paper “Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). In the paper, Foster and his coauthors explore whether well-studied RL algorithms for simple problems can be leveraged to solve RL problems with high-dimensional observations and latent dynamics, part of larger efforts to identify algorithm design principles that can enable agents to learn quickly via trial and error in unfamiliar environments.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Our guest today is Dylan Foster. He is a principal researcher at Microsoft Research and coauthor of a paper called “Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity.” The work is among the oral presentations at this year’s Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver. Dylan, welcome and thank you for joining us on the podcast!


DYLAN FOSTER: Thanks for having me.

TINGLE: Let’s start with a brief overview of this paper. Tell us about the problem this work addresses and why the research community should know about it.

FOSTER: So this is a, kind of, a theoretical work on reinforcement learning, or RL. When I say reinforcement learning, broadly speaking, this is talking about the question of how can we design AI agents that are capable of, like, interacting with unknown environments and learning how to solve problems through trial and error. So this is part of some broader agenda we’ve been doing on, kind of, theoretical foundations of RL. And the key questions we’re looking at here are what are called, like, exploration and sample efficiency. So this just means we’re trying to understand, like, what are the algorithm design principles that can allow you to explore an unknown environment and learn as quickly as possible? What we’re doing in this paper is we’re, kind of, looking at, how can you most efficiently solve reinforcement learning problems where you’re faced with very high-dimensional observations, but the underlying dynamics of the system you’re interacting with are simple? So this is a setting that occurs in a lot of natural reinforcement learning and control problems, especially in the context of, like, say, embodied decision-making. So if you think about, say, games like Pong, you know, the state of the game, like, the state of, like, Pong, is extremely simple. It’s just, you know, what is the position and velocity of the ball, and, like, where are the paddles? But what we’d like to be able to do is learn to, you know, like, control or, like, solve games like this from raw pixels or, like, images kind of in the same way that a human would, like, just solve them from vision. So if you look at these types of problems, you know, we call these, like, RL with rich observations or RL with latent dynamics. You know, these are interesting because they, kind of, require you to explore the system, but they also require, you know, representation learning. 
Like, you want to be able to use neural nets to learn a mapping from, say, the images you see to the latent state of the system. This is a pretty interesting and nontrivial algorithmic problem. And, kind of, what we do in this work is we take a first step towards something like a unified understanding for how to solve these sorts of, like, rich-observation, or latent dynamics, RL problems.
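
Editorial illustration, not part of the conversation: the setting described above pairs a small latent state with very high-dimensional observations, plus a mapping from observations back to the latent state. The following is a minimal Python sketch of that problem setup only, under stated assumptions. The chain environment, the constants, and the names `latent_step`, `emit_observation`, and `decode` are hypothetical, and the decoder is a hand-written oracle rather than a learned neural network. It is not the algorithm from the paper.

```python
import random

N_LATENT_STATES = 5      # tiny latent state space (hypothetical)
OBSERVATION_DIM = 1000   # "pixels" per observation (hypothetical)
BLOCK = OBSERVATION_DIM // N_LATENT_STATES  # coordinates owned by each state


def latent_step(state, action):
    """Simple latent dynamics: move left (-1) or right (+1) on a chain."""
    return max(0, min(N_LATENT_STATES - 1, state + action))


def emit_observation(state, rng):
    """Render a latent state as a high-dimensional noisy observation.

    The block of coordinates owned by the current state is set to 1.0;
    every other coordinate is small uniform noise.
    """
    obs = [rng.uniform(0.0, 0.1) for _ in range(OBSERVATION_DIM)]
    for i in range(state * BLOCK, (state + 1) * BLOCK):
        obs[i] = 1.0
    return obs


def decode(obs):
    """Map an observation back to its latent state.

    This is an oracle decoder that knows the block layout. In the setting
    discussed above, this mapping would have to be learned from data.
    """
    sums = [sum(obs[s * BLOCK:(s + 1) * BLOCK]) for s in range(N_LATENT_STATES)]
    return sums.index(max(sums))


rng = random.Random(0)
state = 2
trajectory = [state]
for action in (+1, +1, -1, +1, +1):
    state = latent_step(state, action)
    obs = emit_observation(state, rng)
    assert decode(obs) == state  # the decoder recovers the latent state
    trajectory.append(state)
```

The agent only ever sees `obs`, a 1,000-dimensional vector, even though the dynamics depend on a single integer. With a correct decoder in hand, an algorithm designed for the small latent problem could in principle be run on `decode(obs)`; the difficulty discussed in the episode is that the decoder is unknown and has to be learned while exploring.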

TINGLE: So how did you go about developing this theoretical framework?

FOSTER: Yeah, so if you look at these sort of RL problems with latent dynamics, this is something that’s actually received a lot of investigation in theory. And a lot of this goes back to, kind of, early work from our lab from, like, 2016, 2017 or so. There’s some really interesting results here, but progress was largely on a, like, case-by-case basis, meaning, you know, there are many different ways that you can try to model the latent dynamics of your problem, and, you know, each of these somehow leads to a different algorithm, right. So, like, you know, you think very hard about this modeling assumption. You think about, what would an optimal algorithm look like? And you end up, you know, writing an entire paper about it. And there’s nothing wrong with that per se, but if you want to be able to iterate quickly and, kind of, try different modeling assumptions and see what works in practice, you know, this is not really tenable. It’s just too slow. And so the starting point for this work was to, kind of, try to take a different and more modular approach. So the idea is, you know, there are many, many different types of, sort of, systems or modeling assumptions for the dynamics that have been already studied extensively and have entire papers about them for the simpler setting in which you can directly see the state of the system. And so what we wanted to ask here is, is it possible to use these existing results in more of, like, a modular fashion? Like, if someone has already written a paper on how to optimally solve a particular type of MDP, or Markov decision process, can we just take their algorithm as is and perhaps plug it into some kind of meta-algorithm that can directly, kind of, combine this with representation learning and use it to solve the corresponding rich-observation, or latent dynamics, RL problem?

TINGLE: What were your major findings? What did you learn during this process?

FOSTER: We started by asking the question sort of exactly the way that I just posed it, right. Like, can we take existing algorithms and use them to solve rich-observation RL problems in a modular fashion? And this turned out to be really tricky. Like, there’s a lot of natural algorithms you might try that seem promising at first but don’t exactly work out. And what this, kind of, led us to and, sort of, the first main result in this paper is actually a negative result. So what we actually showed is most, sort of, well-studied types of systems or, like, MDPs that have been studied in, like, the prior literature on RL, even if they’re tractable when you’re able to directly see the state of the system, they can become statistically intractable once you add, sort of, high-dimensional observations to the picture. And statistically intractable here means the amount of interaction that you need, like the amount of, sort of, attempts to explore the system that you need, in order to learn a good decision-making policy becomes, like, very, very large, like much, much larger than the corresponding, sort of, complexity if you were able to directly see the states of the system. You know, you could look at this and say, I guess we’re out of luck. You know, maybe there’s just no hope of solving these sorts of problems. But that’s perhaps a little too pessimistic. You know, really the way you should interpret this result is just that you need more assumptions. And that’s precisely what the, sort of, second result we have in this paper is. So our second result shows that you can, sort of, bypass this impossibility result and, you know, achieve truly modular algorithms under a couple different types of additional assumptions.

TINGLE: Dylan, I’d like to know—and I’m sure our audience would, too—what this work means when it comes to real-world application. What impact will this have on the research community?

FOSTER: Yeah, so maybe I’ll answer that, um, with two different points. The first one is a broader point, which is, why is it important to understand this problem of exploration and sample efficiency in reinforcement learning? If you look at the, sort of, setting we study in this paper (you know, this, like, RL or decision-making with high-dimensional observations), on the empirical side, people have made a huge amount of progress on this problem through deep reinforcement learning. This was what kind of led to these amazing breakthroughs in solving games like Atari in the last decade. But if you look at these results, the gains are somehow more coming from the, like, inductive bias or the, like, generalization abilities of deep learning and not necessarily from the specific algorithms. So, like, current algorithms do not actually explore very deliberately, and so their sample complexity is very high. Like, it’s hard to draw a one-to-one comparison, but you can argue they need, like, far more experience than a human would to solve these sorts of problems. So it’s not clear that we’re really anywhere near the ceiling of what can be achieved in terms of, like, how efficiently can you have, you know, an agent learn to solve new problems from trial and error. And I think better algorithms here could potentially be, like, transformative in a lot of different domains. To get into this specific work, I think there’s a couple of important takeaways for researchers. One is that by giving this impossibility result that shows that RL with latent dynamics is impossible without further assumptions, we’re kind of narrowing down the search space where other researchers can look for efficient algorithms. The second takeaway is, you know, we are showing that this problem becomes tractable when you make additional assumptions. But I view these more as, like, a proof of concept.
Like, we’re, kind of, showing for the first time that it is possible to do something nontrivial, but I think a lot more work and research will be required in order to, like, you know, build on this and take this to something that can lead to, like, practical algorithms.

TINGLE: Well, Dylan Foster, thank you for joining us today to discuss your paper on reinforcement learning under latent dynamics. We certainly appreciate it.

FOSTER: Thanks a lot. Thanks for having me.

[MUSIC]

TINGLE: And to our listeners, thank you all for tuning in. If you’d like to read Dylan’s paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Dylan Foster appeared first on Microsoft Research.
