Apple Machine Learning Research
MetNet-3: A state-of-the-art neural weather model available in Google products
Forecasting weather variables such as precipitation, temperature, and wind is key to numerous aspects of society, from daily planning and transportation to energy production. As we continue to see more extreme weather events such as floods, droughts, and heat waves, accurate forecasts can be essential to preparing for and mitigating their effects. The first 24 hours into the future are especially important as they are both highly predictable and actionable, which can help people make informed decisions in a timely manner and stay safe.
Today we present a new weather model called MetNet-3, developed by Google Research and Google DeepMind. Building on the earlier MetNet and MetNet-2 models, MetNet-3 provides high resolution predictions up to 24 hours ahead for a larger set of core variables, including precipitation, surface temperature, wind speed and direction, and dew point. MetNet-3 creates a temporally smooth and highly granular forecast, with lead time intervals of 2 minutes and spatial resolutions of 1 to 4 kilometers. MetNet-3 achieves strong performance compared to traditional methods, outperforming the best single- and multi-member physics-based numerical weather prediction (NWP) models — such as High-Resolution Rapid Refresh (HRRR) and ensemble forecast suite (ENS) — for multiple regions up to 24 hours ahead.
Finally, we’ve integrated MetNet-3’s capabilities across various Google products and technologies where weather is relevant. Currently available in the contiguous United States and parts of Europe with a focus on 12 hour precipitation forecasts, MetNet-3 is helping bring accurate and reliable weather information to people in multiple countries and languages.
MetNet-3 precipitation output summarized into actionable forecasts in Google Search on mobile.
Densification of sparse observations
Many recent machine learning weather models use the atmospheric state generated by traditional methods (e.g., data assimilation from NWPs) as the primary starting point to build forecasts. In contrast, a defining feature of the MetNet models has been to use direct observations of the atmosphere for training and evaluation. The advantage of direct observations is that they often have higher fidelity and resolution. However, direct observations come from a large variety of sensors at different altitudes, including weather stations at the surface level and satellites in orbit, and can be of varying degrees of sparsity. For example, precipitation estimates derived from radar such as NOAA’s Multi-Radar/Multi-Sensor System (MRMS) are relatively dense images, whereas weather stations located on the ground that provide measurements for variables such as temperature and wind are mere points spread over a region.
In addition to the data sources used in previous MetNet models, MetNet-3 includes point measurements from weather stations as both inputs and targets with the goal of making a forecast at all locations. To this end, MetNet-3’s key innovation is a technique called densification, which merges the traditional two-step process of data assimilation and simulation found in physics-based models into a single pass through the neural network. The main components of densification are illustrated below. Although the densification technique applies to a specific stream of data individually, the resulting densified forecast benefits from all the other input streams that go into MetNet-3, including topographical, satellite, radar, and NWP analysis features. No NWP forecasts are included in MetNet-3’s default inputs.
High resolution in space and time
A central advantage of using direct observations is their high spatial and temporal resolution. For example, weather stations and ground radar stations provide measurements every few minutes at specific points and at 1 km resolutions, respectively; this is in stark contrast with the assimilation state from the state-of-the-art model ENS, which is generated every 6 hours at a resolution of 9 km with hour-by-hour forecasts. To handle such a high resolution, MetNet-3 preserves another of the defining features of this series of models, lead time conditioning. The lead time of the forecast in minutes is directly given as input to the neural network. This allows MetNet-3 to efficiently model the high temporal frequency of the observations for intervals as brief as 2 minutes. Densification combined with lead time conditioning and high resolution direct observations produces a fully dense 24 hour forecast with a temporal resolution of 2 minutes, while learning from just 1,000 points from the One Minute Observation (OMO) network of weather stations spread across the United States.
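To make lead time conditioning concrete, the following minimal sketch enumerates the lead times implied by a 2-minute interval over 24 hours and encodes one of them as a one-hot vector. The one-hot encoding and the function names are assumptions made for illustration; this post does not specify how MetNet-3 represents the lead time internally.

```python
# Illustrative sketch only: the encoding below is an assumption, not MetNet-3's actual input format.
LEAD_TIME_STEP_MIN = 2        # forecast interval stated in the post
MAX_LEAD_TIME_MIN = 24 * 60   # 24-hour forecast horizon


def lead_time_grid(step=LEAD_TIME_STEP_MIN, horizon=MAX_LEAD_TIME_MIN):
    """Return every forecast lead time in minutes: 2, 4, ..., 1440."""
    return list(range(step, horizon + 1, step))


def one_hot_lead_time(lead_time_min, step=LEAD_TIME_STEP_MIN, horizon=MAX_LEAD_TIME_MIN):
    """Encode a single lead time as a one-hot vector over the grid."""
    grid = lead_time_grid(step, horizon)
    if lead_time_min not in grid:
        raise ValueError(f"{lead_time_min} is not a multiple of {step} within the horizon")
    vector = [0.0] * len(grid)
    vector[grid.index(lead_time_min)] = 1.0
    return vector


print(len(lead_time_grid()))  # 720 distinct lead times in a 24-hour forecast
```

Because the lead time is an input, the same network can be queried once per entry in this grid to assemble the full 24-hour forecast.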
MetNet-3 predicts a marginal multinomial probability distribution for each output variable and each location that provides rich information beyond just the mean. This allows us to compare the probabilistic outputs of MetNet-3 with the outputs of advanced probabilistic ensemble NWP models, including the ensemble forecast ENS from the European Centre for Medium-Range Weather Forecasts and the High Resolution Ensemble Forecast (HREF) from the National Oceanic and Atmospheric Administration of the US. Due to the probabilistic nature of the outputs of both models, we are able to compute scores such as the Continuous Ranked Probability Score (CRPS). The following graphics highlight densification results and illustrate that MetNet-3’s forecasts are not only of much higher resolution, but are also more accurate when evaluated at the overlapping lead times.
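As a rough illustration of how CRPS can be computed for a binned (multinomial) forecast, the sketch below sums the squared difference between the forecast CDF and the step-function CDF of the observation over all bins. The equal bin width and the function name are assumptions; the evaluation code used for MetNet-3 is not shown in this post.

```python
def binned_crps(probs, obs_bin, bin_width=1.0):
    """Approximate CRPS for a forecast given as probabilities over ordered bins.

    probs     -- forecast probability of each bin (must sum to 1)
    obs_bin   -- index of the bin that contains the observed value
    bin_width -- width of each bin in the variable's units (assumed equal for all bins)
    """
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("forecast probabilities must sum to 1")
    score = 0.0
    forecast_cdf = 0.0
    for i, p in enumerate(probs):
        forecast_cdf += p
        observed_cdf = 1.0 if i >= obs_bin else 0.0  # step function at the observation
        score += (forecast_cdf - observed_cdf) ** 2 * bin_width
    return score


# A forecast that puts all mass on the observed bin scores 0 (lower is better).
print(binned_crps([0.0, 1.0, 0.0], obs_bin=1))
# A uniform forecast over three bins is penalized.
print(binned_crps([1 / 3, 1 / 3, 1 / 3], obs_bin=1))
```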
Top: MetNet-3’s forecast of wind speed every 2 minutes over the next 24 hours with a spatial resolution of 4 km. Bottom: ENS’s hourly forecast with a spatial resolution of 18 km. The two distinct regimes in spatial structure are primarily driven by the presence of the Colorado mountain ranges. Darker corresponds to higher wind speed. More samples available here: 1, 2, 3, 4.
In contrast to weather station variables, precipitation estimates are more dense as they come from ground radar. MetNet-3’s modeling of precipitation is similar to that of MetNet-1 and 2, but extends the high resolution precipitation forecasts with a 1km spatial granularity to the same 24 hours of lead time as the other variables, as shown in the animation below. MetNet-3’s performance on precipitation achieves a better CRPS value than ENS’s throughout the 24 hour range.
Case study for Thu Jan 17 2019 00:00 UTC showing the probability of instantaneous precipitation rate being above 1 mm/h on CONUS. Darker corresponds to a higher probability value. The maps also show the prediction threshold when optimized towards Critical Success Index (CSI) (dark blue contours). This specific case study shows the formation of a new large precipitation pattern in the central US; it is not just forecasting of existing patterns. Top: ENS’s hourly forecast. Center: Ground truth, source NOAA’s MRMS. Bottom: Probability map as predicted by MetNet-3. Native resolution available here.
Performance comparison between MetNet-3 and NWP baseline for instantaneous precipitation rate on CRPS (lower is better).
Delivering realtime ML forecasts
Training and evaluating a weather forecasting model like MetNet-3 on historical data is only a part of the process of delivering ML-powered forecasts to users. There are many considerations when developing a real-time ML system for weather forecasting, such as ingesting real-time input data from multiple distinct sources, running inference, implementing real-time validation of outputs, building insights from the rich output of the model that lead to an intuitive user experience, and serving the results at Google scale — all on a continuous cycle, refreshed every few minutes.
We developed such a real-time system that is capable of producing a precipitation forecast every few minutes for the entire contiguous United States and for 27 countries in Europe for a lead time of up to 12 hours.
Illustration of the process of generating precipitation forecasts using MetNet-3.
The system’s uniqueness stems from its use of near-continuous inference, which allows the model to constantly create full forecasts based on incoming data streams. This mode of inference is different from traditional inference systems, and is necessary due to the distinct characteristics of the incoming data. The model takes in various data sources as input, such as radar, satellite, and numerical weather prediction assimilations. Each of these inputs has a different refresh frequency and spatial and temporal resolution. Some data sources, such as weather observations and radar, have characteristics similar to a continuous stream of data, while others, such as NWP assimilations, are similar to batches of data. The system is able to align all of these data sources spatially and temporally, allowing the model to create an updated understanding of the next 12 hours of precipitation at a very high cadence.
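One way to picture the temporal alignment step is to pick, for each source, the most recent observation that is available at inference time. The sketch below does this with sorted timestamps in minutes; the source names and refresh intervals are hypothetical examples, and the production system described here is certainly more involved.

```python
from bisect import bisect_right


def latest_available(timestamps, now):
    """Return the most recent timestamp that is <= now, or None if there is none."""
    idx = bisect_right(timestamps, now)
    return timestamps[idx - 1] if idx else None


def align_inputs(sources, now):
    """Map each source name to the newest timestamp usable at time `now`."""
    return {name: latest_available(ts, now) for name, ts in sources.items()}


# Hypothetical refresh cadences, in minutes since the start of the day.
sources = {
    "radar": list(range(0, 721, 2)),               # stream-like: every 2 minutes
    "satellite": list(range(0, 721, 10)),          # every 10 minutes
    "nwp_assimilation": list(range(0, 721, 360)),  # batch-like: every 6 hours
}

print(align_inputs(sources, now=375))
# {'radar': 374, 'satellite': 370, 'nwp_assimilation': 360}
```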
With the above process, the model is able to predict arbitrary discrete probability distributions. We developed novel techniques to transform this dense output space into user-friendly information that enables rich experiences throughout Google products and technologies.
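As a simple example of turning a discrete distribution into something user-facing, the sketch below computes the probability that a value exceeds a threshold, such as a precipitation rate above 1 mm/h. The bin edges, the probabilities, and the assumption that mass is spread uniformly within a bin are all illustrative; they are not the techniques used in the product.

```python
def prob_exceeding(bin_edges, probs, threshold):
    """Probability that the value exceeds `threshold`.

    Bin i covers [bin_edges[i], bin_edges[i + 1]). Mass inside the bin that
    contains the threshold is split assuming a uniform density (an assumption).
    """
    if len(bin_edges) != len(probs) + 1:
        raise ValueError("need exactly one more edge than probabilities")
    total = 0.0
    for lo, hi, p in zip(bin_edges, bin_edges[1:], probs):
        if lo >= threshold:
            total += p                                   # whole bin lies above the threshold
        elif hi > threshold:
            total += p * (hi - threshold) / (hi - lo)    # partial bin
    return total


# Hypothetical precipitation-rate bins in mm/h and their forecast probabilities.
edges = [0.0, 0.2, 1.0, 2.0, 10.0]
probs = [0.60, 0.20, 0.15, 0.05]
print(prob_exceeding(edges, probs, threshold=1.0))  # roughly 0.2
```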
Weather features in Google products
People around the world rely on Google every day to provide helpful, timely, and accurate information about the weather. This information is used for a variety of purposes, such as planning outdoor activities, packing for trips, and staying safe during severe weather events.
The state-of-the-art accuracy, high temporal and spatial resolution, and probabilistic nature of MetNet-3 make it possible to create unique hyperlocal weather insights. For the contiguous United States and Europe, MetNet-3 is operational and produces real-time 12 hour precipitation forecasts that are now served across Google products and technologies where weather is relevant, such as Search. The rich output from the model is synthesized into actionable information and instantly served to millions of users.
For example, a user who searches for weather information for a precise location from their mobile device will receive highly localized precipitation forecast data, including timeline graphs with granular minute breakdowns depending on the product.
MetNet-3 precipitation output in weather on the Google app on Android (left) and mobile web Search (right).
Conclusion
MetNet-3 is a new deep learning model for weather forecasting that outperforms state-of-the-art physics-based models for 24-hour forecasts of a core set of weather variables. It has the potential to create new possibilities for weather forecasting and to improve the safety and efficiency of many activities, such as transportation, agriculture, and energy production. MetNet-3 is operational and its forecasts are served across several Google products where weather is relevant.
Acknowledgements
Many people were involved in the development of this effort. We would like to especially thank those from Google DeepMind (Di Li, Jeremiah Harmsen, Lasse Espeholt, Marcin Andrychowicz, Zack Ontiveros), Google Research (Aaron Bell, Akib Uddin, Alex Merose, Carla Bromberg, Fred Zyda, Isalo Montacute, Jared Sisk, Jason Hickey, Luke Barrington, Mark Young, Maya Tohidi, Natalie Williams, Pramod Gupta, Shreya Agrawal, Thomas Turnbull, Tom Small, Tyler Russell), and Google Search (Agustin Pesciallo, Bill Myers, Danny Cheresnick, Lior Cohen, Maca Piombi, Maia Diamant, Max Kamenetsky, Maya Ekron, Mor Schlesinger, Neta Gefen-Doron, Nofar Peled Levi, Ofer Lehr, Or Hillel, Rotem Wertman, Vinay Ruelius Shah, Yechie Labai).
Dialogue-guided visual language processing with Amazon SageMaker JumpStart
Visual language processing (VLP) is at the forefront of generative AI, driving advancements in multimodal learning that encompasses language intelligence, vision understanding, and processing. Combined with large language models (LLM) and Contrastive Language-Image Pre-Training (CLIP) trained with a large quantity of multimodality data, visual language models (VLMs) are particularly adept at tasks like image captioning, object detection and segmentation, and visual question answering. Their use cases span various domains, from media entertainment to medical diagnostics and quality assurance in manufacturing.
Key strengths of VLP include the effective utilization of pre-trained VLMs and LLMs, enabling zero-shot or few-shot predictions without necessitating task-specific modifications, and categorizing images from a broad spectrum through casual multi-round dialogues. Augmented by Grounded Segment Anything, VLP exhibits prowess in visual recognition, with object detection and segmentation being particularly notable. The potential exists to fine-tune VLMs and LLMs further using domain-specific data, aiming to boost precision and mitigate hallucination. However, like other nascent technologies, obstacles remain in managing model intricacy, harmonizing diverse modalities, and formulating uniform evaluation metrics.
Courtesy of NOMIC for OBELICS, HuggingFaceM4 for IDEFICS, Charles Bensimon for Gradio and Amazon Polly for TTS
In this post, we explore the technical nuances of VLP prototyping using Amazon SageMaker JumpStart in conjunction with contemporary generative AI models. Through multi-round dialogues, we highlight the capabilities of instruction-oriented zero-shot and few-shot vision language processing, emphasizing its versatility and aiming to capture the interest of the broader multimodal community. The demo implementation code is available in the following GitHub repo.
Solution overview
The proposed VLP solution integrates a suite of state-of-the-art generative AI modules to yield accurate multimodal outputs. Central to the architecture are the fine-tuned VLM and LLM, both instrumental in decoding visual and textual data streams. The TGI framework underpins the model inference layer, providing RESTful APIs for robust integration and effortless accessibility. Supplementing our auditory data processing, the Whisper ASR is also furnished with a RESTful API, enabling streamlined voice-to-text conversions. Addressing complex challenges like image-to-text segmentation, we use the containerized Grounded Segment Anything module, which combines Grounding DINO and the Segment Anything Model (SAM) for text-driven object detection and segmentation. The system is further refined with DistilBERT, optimizing our dialogue-guided multi-class classification process. Orchestrating these components is the LangChain processing pipeline, a sophisticated mechanism proficient in dissecting text or voice inputs, discerning user intentions, and methodically delegating sub-tasks to the relevant services. The synthesis of these operations produces aggregated outputs, delivering pinpoint and context-aware multimodal answers.
The following diagram illustrates the architecture of our dialogue-guided VLP solution.
Text Generation Inference
Text Generation Inference (TGI) is an open-source toolkit developed by Hugging Face for deploying LLMs as well as VLMs for inference. It enables high-performance text generation using tensor parallelism, model parallelism, and dynamic batching supporting some leading open-source LLMs such as Falcon and Llama V2, as well as VLMs like IDEFICS. Utilizing the latest Hugging Face LLM modules on Amazon SageMaker, AWS customers can now tap into the power of SageMaker deep learning containers (DLCs). This allows for the seamless deployment of LLMs from the Hugging Face hubs via pre-built SageMaker DLCs supporting TGI. This inference setup not only offers exceptional performance but also eliminates the need for managing the heavy lifting GPU infrastructure. Additionally, you benefit from advanced features like auto scaling of inference endpoints, enhanced security, and built-in model monitoring.
TGI offers text generation speeds up to 100 times faster than traditional inference methods and scales efficiently to handle increased requests. Its design ensures compatibility with various LLMs and, being open source, democratizes advanced features for the tech community. TGI’s versatility extends across domains, enhancing chatbots, improving machine translations, summarizing texts, and generating diverse content, from poetry to code. Therefore, TGI emerges as a comprehensive solution for text generation challenges. TGI is implemented in Rust and Python, uses the PyTorch framework, and is available on GitHub. It also supports PEFT with QLoRA for faster performance and logits warping to control generated text attributes, such as its length and diversity, without modifying the underlying model.
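To illustrate what logits warping means, the following sketch applies temperature scaling, top-k filtering, and top-p (nucleus) filtering to a list of raw logits and returns the renormalized sampling distribution. It is a simplified, standard-library illustration of the general technique, not TGI's implementation; the function names are hypothetical, and warpers such as typical_p and repetition penalty are omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def warp_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the sampling distribution after temperature, top-k, and top-p warping."""
    probs = softmax([x / temperature for x in logits])
    # Token indices ordered from most to least likely.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k is not None:
        order = order[:top_k]              # keep only the k most likely tokens
    if top_p is not None:
        kept_mass = sum(probs[i] for i in order)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / kept_mass
            if cumulative >= top_p:        # smallest set whose mass reaches top_p
                break
        order = kept
    mass = sum(probs[i] for i in order)
    warped = [0.0] * len(probs)
    for i in order:
        warped[i] = probs[i] / mass        # renormalize over the surviving tokens
    return warped


print(warp_logits([2.0, 1.0, 0.0, -1.0], temperature=0.5, top_k=10, top_p=0.95))
```

Lower temperatures concentrate probability on the most likely tokens, while top_k and top_p remove the long tail before sampling, which is how these parameters trade diversity against determinism.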
You can build a customized TGI Docker container directly from the following Dockerfile and then push the container image to Amazon Elastic Container Registry (ECR) for inference deployment. See the following code:
%%sh
# Define the Docker image name and the container's Amazon Resource Name on ECR
container_name="tgi1.03"
region=`aws configure get region`
account=`aws sts get-caller-identity --query "Account" --output text`
full_name="${account}.dkr.ecr.${region}.amazonaws.com/${container_name}:latest"
# Get the login password from ECR and pipe it directly into docker login
aws ecr get-login-password --region ${region} | docker login --username AWS \
--password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
# Build the TGI docker image locally
docker build . -f Dockerfile -t ${container_name}
docker tag ${container_name} ${full_name}
docker push ${full_name}
LLM inference with TGI
The VLP solution in this post employs the LLM in tandem with LangChain, harnessing the chain-of-thought (CoT) approach for more accurate intent classification. CoT processes queries to discern intent and trigger the associated sub-tasks that meet the query’s goals. Llama-2-7b-chat-hf (license agreement) is the streamlined version of the Llama-2 line, designed for dialogue contexts. Inference for Llama-2-7b-chat-hf is powered by the TGI container image, making it available as an API-enabled service.
For Llama-2-7b-chat-hf inference, a g5.2xlarge (24G VRAM) is recommended to achieve peak performance. For applications necessitating a more robust LLM, the Llama-v2-13b models fit well with a g5.12xlarge (96G VRAM) instance. For the Llama-2-70b models, consider either the GPU [2xlarge] – 2x Nvidia A100 utilizing bitsandbytes quantization or the g5.48xlarge. Notably, employing bitsandbytes quantization can reduce the required inference GPU VRAM by 50%.
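The instance recommendations above can be sanity-checked with back-of-the-envelope arithmetic on the memory needed to hold the model weights. The sketch below assumes 16-bit weights as the baseline and 8-bit weights after quantization, and it ignores the KV cache, activations, and framework overhead, so real requirements are higher.

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * bits_per_param / 8


for name, billions in [("Llama-2-7b", 7), ("Llama-2-13b", 13), ("Llama-2-70b", 70)]:
    fp16 = weight_memory_gb(billions, 16)   # assumed 16-bit baseline
    int8 = weight_memory_gb(billions, 8)    # assumed 8-bit quantization
    print(f"{name}: {fp16:.0f} GB in 16-bit, {int8:.0f} GB in 8-bit")
```

Under these assumptions, a 7-billion-parameter model needs about 14 GB for 16-bit weights, which fits within the 24 GB of a g5.2xlarge, and 8-bit quantization halves the weight footprint, consistent with the 50% reduction mentioned above.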
You can use SageMaker DLCs with the TGI container image detailed earlier to deploy Llama-2-7b-chat-hf for inference (see the following code). Alternatively, you can stand up a quick local inference for a proof of concept on a g5.2xlarge instance using a Docker container.
import json
from time import gmtime, strftime
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.huggingface import HuggingFaceModel
from sagemaker import get_execution_role
# Prerequisite: create a unique model name
model_name = 'Llama-7b-chat-hf' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# Number of GPUs on the target instance (ml.g5.2xlarge has a single GPU)
number_of_gpu = 1
# Retrieve the LLM image URI of the SageMaker pre-built DLC for TGI v1.0.3
tgi_image_ecr_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.0.3"
)
# Define the model and endpoint configuration parameters
hf_config = {
    'HF_MODEL_ID': "meta-llama/Llama-2-7b-chat-hf",  # Must match the model_id on the Hugging Face Hub
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'MAX_TOTAL_TOKENS': json.dumps(1024),
    'HF_MODEL_QUANTIZE': "bitsandbytes",  # Quantize to reduce VRAM use; comment out if not needed
}
# Create the HuggingFaceModel with the SageMaker pre-built DLC TGI image URI
sm_llm_model = HuggingFaceModel(
    role=get_execution_role(),
    image_uri=tgi_image_ecr_uri,
    env=hf_config
)
# Deploy the model
llm = sm_llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,  # In seconds; allow 5 minutes to load the model
)
# Define the inference payload
prompt="""<|prompter|>How to select a right LLM for your generative AI project?<|endoftext|><|assistant|>"""
# Generation hyperparameters for the LLM (Python literals True/False/None, serialized to JSON on send)
payload = {
    "inputs": prompt,
    "parameters": {
        "best_of": 1,
        "decoder_input_details": True,
        "details": True,
        "do_sample": True,
        "max_new_tokens": 20,
        "repetition_penalty": 1.03,
        "return_full_text": False,
        "seed": None,
        "stop": [
            "photographer"
        ],
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "truncate": None,
        "typical_p": 0.95,
        "watermark": True
    },
    "stream": False
}
# send request to endpoint
response = llm.predict(payload)
Fine-tune and customize your LLM
SageMaker JumpStart offers numerous notebook samples that demonstrate the use of Parameter Efficient Fine Tuning (PEFT), including QLoRA for training and fine-tuning LLMs. QLoRA maintains the pre-trained model weights in a static state and introduces trainable rank decomposition matrices into each layer of the Transformer structure. This method substantially decreases the number of trainable parameters needed for downstream tasks.
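The reduction in trainable parameters follows from simple arithmetic: instead of updating a full weight matrix of shape d_out × d_in, the low-rank approach trains two matrices of shapes d_out × r and r × d_in. The sketch below works this out for a single layer; the 4096 × 4096 shape and rank 8 are illustrative assumptions rather than values from a specific notebook. QLoRA additionally keeps the frozen base weights in 4-bit precision, which this count does not capture.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters of the two rank-decomposition matrices (d_out x r and r x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 8           # illustrative layer shape and rank
full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%
```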
Alternatively, you can explore Direct Preference Optimization (DPO), which obviates the necessity for setting up a reward model, drawing samples during fine-tuning from the LLM, or extensive hyperparameter adjustments. Recent research has shown that DPO’s fine-tuning surpasses RLHF in managing sentiment generation and enhances the quality of summaries and single-conversation responses, all while being considerably easier to set up and train. There are three main steps to the DPO training process (refer to the GitHub repo for details):
- Perform supervised fine-tuning of a pre-trained base LLM to create a fine-tuned LLM.
- Run the DPO trainer using the fine-tuned model to create a reinforcement learning model.
- Merge the adaptors from DPO into the base LLM model for text generation inference.
You can deploy the merged model for inference using the TGI container image.
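For intuition about what the DPO trainer optimizes in the second step, the sketch below computes the DPO loss for a single preference pair from sequence log-probabilities under the policy being trained and under the frozen reference (supervised fine-tuned) model. The log-probability values and beta are made-up numbers for illustration; in practice a training library computes this over batches.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # how much the policy upweights the chosen answer
    rejected_gain = policy_rejected_logp - ref_rejected_logp  # how much it upweights the rejected answer
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy that prefers the chosen answer more than the reference does: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss depends only on log-probability ratios against the reference model, which is why no separate reward model is needed.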
Visual language model
Visual language models (VLMs), which combine the vision and language modalities, have shown improving effectiveness in generalization, leading to various practical use cases with zero-shot prompts or few-shot prompts with instructions. A VLM typically consists of three key elements: an image encoder, a text encoder, and a strategy to fuse information from the two encoders. These key elements are tightly coupled together because the loss functions are designed around both the model architecture and the learning strategy. Many state-of-the-art VLMs use CLIP/ViT (such as OpenCLIP) and LLMs (such as Llama-v1) and are trained on multiple publicly available datasets such as Wikipedia, LAION, and Public Multimodal Dataset.
This demo used a pre-trained IDEFICS-9b-instruct model developed by HuggingFaceM4, a fine-tuned version of IDEFICS-9b, following the training procedure laid out in Flamingo by combining the two pre-trained models (laion/CLIP-ViT-H-14-laion2B-s32B-b79K and huggyllama/llama-7b) with modified Transformer blocks. The IDEFICS-9b was trained on the OBELICS, Wikipedia, LAION, and PMD multimodal datasets with a total of 150 billion tokens and 1.582 billion images with 224×224 resolution each. The IDEFICS-9b was based on Llama-7b with a 1.31 million effective batch size. The IDEFICS-9b-instruct was then fine-tuned with 6.8 million multimodality instruction datasets created from augmentation using generative AI by unfreezing all the parameters (vision encoder, language model, cross-attentions). The fine-tuning datasets include the pre-training data with the following sampling ratios: 5.1% of image-text pairs and 30.7% of OBELICS multimodal web documents.
The training software is built on top of Hugging Face Transformers and Accelerate, and DeepSpeed ZeRO-3 for training, plus WebDataset and Image2DataSets for data loading. The pre-training of IDEFICS-9b took 350 hours to complete on 128 Nvidia A100 GPUs, whereas fine-tuning of IDEFICS-9b-instruct took 70 hours on 128 Nvidia A100 GPUs, both on AWS p4.24xlarge instances.
With SageMaker, you can seamlessly deploy IDEFICS-9b-instruct on a g5.2xlarge instance for inference tasks. The following code snippet illustrates how to launch a tailored deep learning local container integrated with the customized TGI Docker image:
%%sh
llm_model='HuggingFaceM4/idefics-9b-instruct'
docker_rt_name='idefics-9b-instruct'
docker_image_name='tgi1.03'
# Expose the host GPUs to the container (a g5.2xlarge has a single GPU)
docker run --gpus all --shm-size 20g -p 8080:80 --restart unless-stopped --name ${docker_rt_name} ${docker_image_name} --model-id ${llm_model}
# Test the LLM API using curl
curl -X 'POST' 'http://<hostname_or_ip>:8080/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"inputs": "User:![](http://<image_url>/image.png)Which device produced this image? Please explain the main clinical purpose of such image?Can you write a radiology report based on this image?<end_of_utterance>",
"parameters": {
"best_of": 1, "decoder_input_details": true,
"details": true, "do_sample": true, "max_new_tokens": 20,
"repetition_penalty": 1.03, "return_full_text": false,
"seed": null, "stop": [ "photographer" ],
"temperature": 0.5, "top_k": 10, "top_p": 0.95,
"truncate": null, "typical_p": 0.95, "watermark": true },
"stream": false
}'
You can fine-tune IDEFICS or other VLMs including Open Flamingo with your own domain-specific data with instructions. Refer to the following README for multimodality dataset preparation and the fine-tuning script for further details.
Intent classification with chain-of-thought
A picture is worth a thousand words, so a VLM needs guidance to generate an accurate caption from a given image and question. We can use few-shot prompting to enable in-context learning, where we provide demonstrations in the prompt to steer the model to better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.
Standard few-shot prompting works well for many tasks but is still not a perfect technique, especially when dealing with more complex reasoning tasks. The few-shot prompting template is not enough to get reliable responses. It might help if we break the problem down into steps and demonstrate that to the model. More recently, chain-of-thought (CoT) prompting has been popularized to address more complex arithmetic, common sense, and symbolic reasoning tasks.
Auto-CoT eliminates manual effort by prompting LLMs with “Let’s think step by step” to generate reasoning chains for demonstrations one by one. However, this automatic process can still produce mistakes in the generated chains. To mitigate the effect of these mistakes, the diversity of the demonstrations matters. This post applies Auto-CoT, which samples diverse questions and generates reasoning chains to construct the demonstrations. Auto-CoT consists of two main stages:
- Question clustering – Partition questions of a given dataset into a few clusters
- Demonstration sampling – Select a representative question from each cluster and generate its reasoning chain using zero-shot CoT with simple heuristics
See the following code snippet:
from langchain.llms import HuggingFaceTextGenInference
from langchain import PromptTemplate, LLMChain
inference_server_url_local = "<Your_local_url_for_llm_on_tgi:port>"  # Replace with the URL of your TGI endpoint
llm_local = HuggingFaceTextGenInference(
    inference_server_url=inference_server_url_local,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.1,
    repetition_penalty=1.05,
)
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use ten five maximum and keep the answer as subtle as possible. List all actionable sub-tasks step by step in detail. Be cautious to avoid phrasing that might replicate previous
inquiries. This will help in obtaining an accurate and detailed answer. Avoid repetition for clarity.
Question: {question}
Answer: Understand the intent of the question then break down the {question} in to sub-tasks. """
prompt = PromptTemplate(
template=template,
input_variables= ["question"]
)
llm_chain_local = LLMChain(prompt=prompt, llm=llm_local)
llm_chain_local("Can you describe the nature of this image? Do you think it's real??")
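The two Auto-CoT stages described earlier, question clustering and demonstration sampling, can be sketched without any model dependencies. Note the substitutions: this illustration uses word-overlap (Jaccard) similarity with a greedy grouping in place of embedding-based clustering, picks the shortest question as the representative, and only builds the zero-shot CoT prompt rather than calling an LLM to generate the reasoning chain. The function names, threshold, and sample questions are hypothetical.

```python
def tokens(question):
    """Lowercased word set of a question, ignoring question marks."""
    return set(question.lower().replace("?", " ").split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_questions(questions, threshold=0.3):
    """Stage 1: greedily group each question with the first cluster whose seed is similar enough."""
    clusters = []
    for question in questions:
        for cluster in clusters:
            if jaccard(tokens(question), tokens(cluster[0])) >= threshold:
                cluster.append(question)
                break
        else:
            clusters.append([question])  # no similar cluster found: start a new one
    return clusters


def sample_demonstrations(clusters, max_words=60):
    """Stage 2: pick one short representative per cluster and build a zero-shot CoT prompt."""
    demos = []
    for cluster in clusters:
        candidates = [q for q in cluster if len(q.split()) <= max_words] or cluster
        representative = min(candidates, key=lambda q: len(q.split()))
        demos.append(f"Q: {representative}\nA: Let's think step by step.")
    return demos


questions = [
    "Can you segment the dog in this image?",
    "Can you segment the cat in this image?",
    "Which device produced this image?",
    "Write a radiology report for this scan",
]
clusters = cluster_questions(questions)
demos = sample_demonstrations(clusters)
print(len(clusters))  # 3
print(demos[0])
```

Each prompt in demos would then be sent to the LLM to obtain a reasoning chain, and the resulting question and chain pairs serve as the few-shot demonstrations.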
Automatic Speech Recognition
The VLP solution incorporates Whisper, an Automatic Speech Recognition (ASR) model by OpenAI, to handle audio queries. Whisper can be effortlessly deployed via SageMaker JumpStart using its template. SageMaker JumpStart, known for its straightforward setup, high performance, scalability, and dependability, is ideal for developers aiming to craft exceptional voice-driven applications. The following GitHub repo demonstrates how to harness SageMaker real-time inference endpoints to fine-tune and host Whisper for instant audio-to-text transcription, showcasing the synergy between SageMaker hosting and generative models.
Alternatively, you can directly download the Dockerfile.gpu from GitHub developed by ahmetoner, which includes a pre-configured RESTful API. You can then construct a Docker image and run the container on a GPU-powered Amazon Elastic Compute Cloud (EC2) instance for a quick proof of concept. See the following code:
%%sh
docker_image_name='whisper-asr-webservice-gpu'
# Build the image from the current directory, then start the container
docker build -f Dockerfile.gpu -t ${docker_image_name} .
docker run -d --gpus all -p 8083:9000 --restart unless-stopped -e ASR_MODEL=base ${docker_image_name}
curl -X 'POST' 'http://<asr_api_hostname>:<port>/asr?task=transcribe&encode=true&output=txt' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'audio_file=@dgvlp_3_5.mp3;type=audio/mpeg'
In the provided example, port 8083 is selected to host the Whisper API, with inbound network security rules activated. To test, direct a web browser to http://<IP_or_hostname>:8083/docs and initiate a POST request test to the ASR endpoint. As an alternative, run the given command or employ the whisper-live module to verify API connectivity.
!pip install whisper-live
from whisper_live.client import TranscriptionClient
client = TranscriptionClient("<whisper_hostname_or_IP>", 8083, is_multilingual=True, lang="zh", translate=True)
client(audio_file_path) # Transcribe an audio file
client() # Transcribe from the microphone
Multi-class text classification and keyword extraction
Multi-class classification plays a pivotal role in text prompt-driven object detection and segmentation. The distilbert-base-uncased-finetuned-sst-2-english model is a refined checkpoint of DistilBERT-base-uncased, optimized on the Stanford Sentiment Treebank (SST2) dataset by Hugging Face. This model achieves a 91.3% accuracy on the development set, while its counterpart bert-base-uncased boasts an accuracy of 92.7%. The Hugging Face Hub provides access to over 1,000 pre-trained text classification models. For those seeking enhanced precision, SageMaker JumpStart provides templates to fine-tune DistilBERT using custom annotated datasets for more tailored classification tasks.
import torch
from transformers import pipeline

def mclass(text_prompt, top_k=3,
           topics=['Mask creation', 'Object detection', 'Inpainting', 'Segmentation', 'Upscaling',
                   'Creating an image from another one', 'Generating an image from text'],
           model='distilbert-base-uncased-finetuned-sst-2-english'):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Define a hypothesis template and the potential candidates for entailment/contradiction
    template = 'The topic is {}'
    # Pipeline abstraction from Hugging Face
    pipe = pipeline(task='zero-shot-classification', model=model, tokenizer=model, device=device)
    # Run pipeline with a test case
    prediction = pipe(text_prompt, topics, hypothesis_template=template)
    # Top-k topics as predicted in the zero-shot regime
    return zip(prediction['labels'][0:top_k], prediction['scores'][0:top_k])

top_3_intent = mclass(text_prompt=user_prompt_str, topics=['Others', 'Create image mask', 'Image segmentation'], top_k=3)
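Because `mclass` returns a `zip` iterator, its (label, score) pairs can be consumed only once. The following is a minimal sketch of reducing those pairs to a single routing decision. The `select_intent` helper, the 0.5 cut-off, and the sample scores are illustrative assumptions, not part of the original solution:

```python
def select_intent(predictions, min_score=0.5, fallback="Others"):
    """Return the highest-scoring label, or `fallback` if no score reaches min_score.

    `predictions` is any iterable of (label, score) pairs, such as the zip
    object returned by mclass. The 0.5 threshold is an illustrative assumption.
    """
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return fallback
    return ranked[0][0]


# Hypothetical scores, for illustration only
example = zip(["Image segmentation", "Create image mask", "Others"], [0.71, 0.22, 0.07])
print(select_intent(example))  # Image segmentation
```

Materializing the iterator with `sorted` also means the result does not depend on the order in which the pipeline returns labels.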
The keyword extraction process employs the KeyBERT module, a streamlined and user-friendly method that harnesses BERT embeddings to generate keywords and key phrases closely aligned with a document—in this case, the objects specified in the query:
# Keyword extraction
from keybert import KeyBERT
kw_model = KeyBERT()
words_list = kw_model.extract_keywords(docs=<user_prompt_str>, keyphrase_ngram_range=(1,3))
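For a single document, KeyBERT returns each keyword together with a relevance score as (keyword, score) tuples. The following is a minimal sketch of narrowing such a list to the objects that are actually present in an image. The `top_keywords` helper, the word-overlap matching rule, and the sample data are assumptions made for illustration:

```python
def top_keywords(scored_keywords, detected_labels, top_n=3):
    """Keep the top_n highest-scoring keywords that mention a detected object label."""
    labels = {label.lower() for label in detected_labels}
    matches = [
        (keyword, score)
        for keyword, score in scored_keywords
        if labels & set(keyword.lower().split())
    ]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [keyword for keyword, _ in matches[:top_n]]


# Hypothetical KeyBERT-style output and detector labels, for illustration only
scored = [("red car", 0.62), ("segment", 0.41), ("dog", 0.55), ("car street", 0.30)]
print(top_keywords(scored, detected_labels=["car", "dog"], top_n=2))  # ['red car', 'dog']
```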
Text prompt-driven object detection and classification
The VLP solution employs dialogue-guided object detection and segmentation by analyzing the semantic meaning of the text and identifying the action and objects from the text prompt. Grounded-SAM is an open-source package created by IDEA-Research to detect and segment anything from a given image with text inputs. It combines the strengths of Grounding DINO and Segment Anything to build a powerful pipeline for solving complex problems.
The following figure illustrates how Grounded-SAM can detect objects and conduct instance segmentation by comprehending textual input.
SAM stands out as a robust segmentation model, though it requires prompts, such as bounding boxes or points, to produce high-quality object masks. Grounding DINO excels as a zero-shot detector, adeptly creating high-quality boxes and labels using free-form text prompts. When these two models are combined, they offer the remarkable capability to detect and segment any object purely through text inputs. The Python utility script dino_sam_inpainting.py was developed to integrate Grounded-SAM methods:
!pip install git+https://github.com/facebookresearch/segment-anything.git
import os

import cv2
import matplotlib.pyplot as plt
import torch

import dino_sam_inpainting as D

def dino_sam(image_path, text_prompt, text_threshold=0.4, box_threshold=0.5, output_dir='/tmp/gradio/outputs'):
    config_file = 'GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py'  # change the path of the model config file
    grounded_checkpoint = './models/groundingdino_swint_ogc.pth'  # change the path of the model
    sam_checkpoint = './models/sam_vit_h_4b8939.pth'
    sam_hq_checkpoint = ''  # path to an HQ-SAM checkpoint, such as sam_hq_vit_h.pth
    use_sam_hq = False  # set to True to use the HQ-SAM checkpoint
    device = 'cuda'
    # make dir
    os.makedirs(output_dir, exist_ok=True)
    # load image
    image_pil, image = D.load_image(image_path)
    # load model
    model = D.load_model(config_file, grounded_checkpoint, device=device)
    output_file_name = os.path.basename(image_path)
    # save a copy of the raw image
    image_pil.save(os.path.join(output_dir, output_file_name))
    # run grounding dino model
    boxes_filt, pred_phrases = D.get_grounding_output(
        model, image, text_prompt, box_threshold, text_threshold, device=device
    )
    # initialize SAM
    if use_sam_hq:
        predictor = D.SamPredictor(D.build_sam_hq(checkpoint=sam_hq_checkpoint).to(device))
    else:
        predictor = D.SamPredictor(D.build_sam(checkpoint=sam_checkpoint).to(device))
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    predictor.set_image(image)
    size = image_pil.size
    H, W = size[1], size[0]
    # rescale boxes to pixels and convert (cx, cy, w, h) to (x_min, y_min, x_max, y_max)
    for i in range(boxes_filt.size(0)):
        boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
        boxes_filt[i][:2] -= boxes_filt[i][2:] / 2
        boxes_filt[i][2:] += boxes_filt[i][:2]
    boxes_filt = boxes_filt.cpu()
    transformed_boxes = predictor.transform.apply_boxes_torch(boxes_filt, image.shape[:2]).to(device)
    masks, _, _ = predictor.predict_torch(
        point_coords=None,
        point_labels=None,
        boxes=transformed_boxes.to(device),
        multimask_output=False,
    )
    # draw output image
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    for mask in masks:
        D.show_mask(mask.cpu().numpy(), plt.gca(), random_color=True)
    for box, label in zip(boxes_filt, pred_phrases):
        D.show_box(box.numpy(), plt.gca(), label)
    plt.axis('off')
    plt.savefig(
        os.path.join(output_dir, f'grounded_sam_{output_file_name}'),
        bbox_inches="tight", dpi=300, pad_inches=0.0
    )
    D.save_mask_data(output_dir, masks, boxes_filt, pred_phrases)
    return f'grounded_sam_{output_file_name}'

filename = dino_sam(image_path=<image_path_str>, text_prompt=<object_name_str>, output_dir=<output_image_filename_path_str>, box_threshold=0.5, text_threshold=0.55)
You can choose HQ-SAM to upgrade SAM for high-quality zero-shot segmentation. Refer to the following paper and code sample on GitHub for more details.
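The loop over `boxes_filt` in the script above rescales each box to pixel units and converts it from a normalized (center x, center y, width, height) form to (x_min, y_min, x_max, y_max) corners before the boxes are passed to SAM. The following is a minimal pure-Python sketch of that same arithmetic for a single box. The `cxcywh_to_xyxy` name and the sample values are illustrative and not part of the original script:

```python
def cxcywh_to_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    # Scale the normalized values to pixel units
    cx, cy, w, h = cx * width, cy * height, w * width, h * height
    # Move from the box center to the top-left corner, then add the size
    x_min, y_min = cx - w / 2, cy - h / 2
    return (x_min, y_min, x_min + w, y_min + h)


# A box centered in a 640x480 image, a quarter of its width and half of its height
print(cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), width=640, height=480))
# (240.0, 120.0, 400.0, 360.0)
```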
VLP processing pipeline
The main objective of the VLP processing pipeline is to combine the strengths of different models, creating a sophisticated workflow specialized for VLP. It’s important to highlight that this setup prioritizes the integration of top-tier models across visual, text, and voice domains. Each segment of the pipeline is modular, facilitating either standalone use or combined operation. Furthermore, the design ensures flexibility, enabling components to be replaced with more advanced models as they become available, while supporting multithreading and error handling with a reliable implementation.
The following figure illustrates a VLP pipeline data flow and service components.
In our exploration of the VLP pipeline, we design one which can process both text prompts from open text format and casual voice inputs from microphones. The audio processing is facilitated by Whisper, capable of multilingual speech recognition and translation. The transcribed text is then channeled to an intent classification module, which discerns the semantic essence of the prompts. This works in tandem with a LangChain driven CoT engine, dissecting the main intent into finer sub-tasks for more detailed information retrieval and generation. If image processing is inferred from the input, the pipeline commences a keyword extraction process, selecting the top N keywords by cross-referencing objects detected in the original image. Subsequently, these keywords are routed to the Grounded-SAM engine, which generates bounding boxes. These bounding boxes are then supplied to the SAM model, which crafts precise segmentation masks, pinpointing each unique object instance in the source image. The final step involves overlaying the masks and bounding boxes onto the original image, yielding a processed image that is presented as a multimodal output.
When the input query seeks to interpret an image, the pipeline engages the LLM to organize the sub-tasks and refine the query with targeted goals. Subsequently, the outcome is directed to the VLM API, accompanied by few-shot instructions, the URL of the input image, and the rephrased text prompt. In response, the VLM provides the textual output. The VLP pipeline can be implemented using a Python-based workflow pipeline or alternative orchestration utilities. Such pipelines operate by chaining sophisticated models in sequence, yielding a structured, step-by-step modeling procedure. The pipeline integrates with the Gradio engine for demonstration purposes:
# Placeholder: fill in the intent labels that should trigger image processing
SUPPORTED_INTENTS = set()

def vlp_text_pipeline(input_text: str, original_image_path: str, chat_history: list) -> list:
    intent_class = intent_classification(input_text)
    key_words = keyword_extraction(input_text)
    image_caption = vlm(input_text, original_image_path)
    chat_history.append(image_caption)
    if intent_class in SUPPORTED_INTENTS:
        object_bounding_box = object_detection(intent_class, key_words, original_image_path)
        mask_image_path = image_segmentation(object_bounding_box, key_words, original_image_path)
        chat_history.append(mask_image_path)
    return chat_history

def vlp_voice_pipeline(audio_file_path: str, original_image_path: str, chat_history: list) -> list:
    # Transcribe the audio, then reuse the text pipeline on the transcript
    asr_text = whisper_transcrib(audio_file_path)
    return vlp_text_pipeline(asr_text, original_image_path, chat_history)

# Route to the text or voice pipeline depending on whether audio was provided
if audio_file_path is None:
    chat_history = vlp_text_pipeline(input_text, original_image_path, chat_history)
else:
    chat_history = vlp_voice_pipeline(audio_file_path, original_image_path, chat_history)
Limitations
Using pre-trained VLM models for VLP has demonstrated promising potential for image understanding. Along with language-based object detection and segmentation, VLP can produce useful outputs with reasonable quality. However, VLP still suffers from inconsistent results, can miss details in pictures, and might even hallucinate. Moreover, models might produce factually incorrect text and should not be relied on to produce factually accurate information. Since none of the referenced pre-trained VLM, SAM, or LLM models has been trained or fine-tuned for domain-specific production-grade applications, this solution is not designed for mission-critical applications that might impact livelihood or cause material losses.
With prompt engineering, the IDEFICS model sometimes can recognize extra details after a text hint; however, the result is far from consistent and reliable. It can be persistent in maintaining inaccuracies and may be unable or unwilling to make corrections even when users highlight those during a conversation. Enhancing the backbone model by integrating Swin-ViT and fusing it with CNN-based models like DualToken-ViT, along with training using more advanced models like Llama-v2, could potentially address some of these limitations.
Next steps
The VLP solution is poised for notable progress. As we look ahead, there are several key opportunities to advance VLP solutions:
- Prioritize integrating dynamic prompt instructions and few-shot learning hints. These improvements will enable more accurate AI feedback.
- Intent classification teams should focus efforts on refining the classifier to pick up on nuanced, domain-specific intents from open prompts. Being able to understand precise user intents will be critical.
- Implement an agent tree of thoughts model into the reasoning pipeline. This structure will allow for explicit reasoning steps to complete sub-tasks.
- Pilot fine-tuning initiatives on leading models. Tailoring VLM, LLM, and SAM models to key industries and use cases through fine-tuning will be pivotal.
Acknowledgment
The authors extend their gratitude to Vivek Madan and Ashish Rawat for their insightful feedback and review of this post.
About the authors
Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.
Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, a service that helps data scientists and machine learning practitioners get started with training and deploying their models, and uses reinforcement learning with Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research has won the test of time paper award at IEEE INFOCOM.
Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, mentoring college students for entrepreneurship, and spending time with friends and families.
Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.
How Reveal’s Logikcull used Amazon Comprehend to detect and redact PII from legal documents at scale
Today, personally identifiable information (PII) is everywhere. PII is in emails, Slack messages, videos, PDFs, and so on. It refers to any data or information that can be used to identify a specific individual. PII is sensitive in nature and includes various types of personal data, such as name, contact information, identification numbers, financial information, medical information, biometric data, date of birth, and so on.
Finding and redacting PII is essential to safeguarding privacy, ensuring data security, complying with laws and regulations, and maintaining trust with customers and stakeholders. It’s a critical component of modern data management and cybersecurity practices. But finding PII among the morass of electronic data can present challenges for an organization. These challenges arise due to the vast volume and variety of data, data fragmentation, encryption, data sharing, dynamic content, false positives and negatives, contextual understanding, legal complexities, resource constraints, evolving data, user-generated content, and adaptive threats. However, failure to accurately detect and redact PII can lead to severe consequences for organizations. Consequences might encompass legal penalties, lawsuits, reputation damage, data breach costs, regulatory probes, operational disruption, trust erosion, and sanctions.
In the legal system, discovery is the legal process governing the right to obtain and the obligation to produce non-privileged matter relevant to any party’s claims or defenses in litigation. Electronic discovery, also known as eDiscovery, is the electronic aspect of identifying, collecting, and producing electronically stored information (ESI) in response to a request for production in a lawsuit or investigation. If organizations are dealing with eDiscovery for litigation or subpoena responses, they’re probably concerned about accidentally sharing PII. Many organizations, including government agencies, school districts, and legal professionals, face the challenge of detecting and redacting PII accurately at scale. For government groups in particular, redacting PII under the Freedom of Information Act and Digital Services Act is crucial for protecting individual privacy, ensuring compliance with data protection laws, preventing identity theft, and maintaining trust and transparency in government and digital services. It strikes a balance between transparency and privacy while mitigating legal and security risks.
Organizations can search for PII using methods such as keyword searches, pattern matching, data loss prevention tools, machine learning (ML), metadata analysis, data classification software, optical character recognition (OCR), document fingerprinting, and encryption.
Now a part of Reveal’s AI-powered eDiscovery platform, Logikcull is a self-service solution that allows legal professionals to process, review, tag, and produce electronic documents as part of a lawsuit or investigation. This unique offering helps attorneys discover valuable information related to the matter in hand while reducing costs, speeding up resolutions, and mitigating risks.
In this post, Reveal experts showcase how they used Amazon Comprehend in their document processing pipeline to detect and redact individual pieces of PII. Amazon Comprehend is a fully managed and continuously trained natural language processing (NLP) service that can extract insight about the content of a document or text. You can use Amazon Comprehend ML capabilities to detect and redact PII in customer emails, support tickets, product reviews, social media, and more.
Overview of solution
The overarching goal for the engineering team is to detect and redact PII from millions of legal documents for their customers. Using Reveal’s Logikcull solution, the engineering team implemented two processes, namely first pass PII detection and second pass PII detection and redaction. This two pass solution was made possible by using the ContainsPiiEntities and DetectPiiEntities APIs.
First pass PII detection
The goal of first pass PII detection is to find the documents that might contain PII.
- Users upload the files on which they would like to perform PII detection and redaction through Logikcull’s public website into a project folder. These files can be in the form of office documents, .pdf files, emails, or a .zip file containing all the supported file types.
- Logikcull stores these project folders securely inside an Amazon Simple Storage Service (Amazon S3) bucket. The files then pass through Logikcull’s massively parallel processing pipeline hosted on Amazon Elastic Compute Cloud (Amazon EC2), which processes the files, extracts the metadata, and generates artifacts in text format for data review. Logikcull’s processing pipeline supports text extraction for a wide variety of forms and files, including audio and video files.
- After the files are available in text format, Logikcull passes the input text along with the language model, which is English, through Amazon Comprehend by making the ContainsPiiEntities API call. The processing pipeline servers hosted on Amazon EC2 make the Amazon Comprehend ContainsPiiEntities API call by passing the request parameters as text and language code. The ContainsPiiEntities API call analyzes input text for the presence of PII and returns the labels of identified PII entity types, such as name, address, bank account number, or phone number. The API response also includes a confidence score which indicates the level of confidence that Amazon Comprehend has assigned to the detection accuracy. The confidence score has a value between 0 and 1, with 1 signifying 100 percent confidence. Logikcull uses this confidence score to assign the tag PII Detected to the documents. Logikcull only assigns this tag to documents that have a confidence score of over 0.75.
- PII Detected tagged documents are fed into Logikcull’s search index cluster for their users to quickly identify documents that contain PII entities.
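The first pass described above can be sketched with the AWS SDK for Python (Boto3). This is a minimal illustration, not Logikcull's implementation: the helper names are hypothetical, and treating a document as tagged when any returned label scores above 0.75 is an assumption about how the threshold is applied.

```python
PII_TAG_THRESHOLD = 0.75  # threshold stated in the post


def contains_pii_labels(text, language_code="en", client=None):
    """Call Amazon Comprehend ContainsPiiEntities and return its Labels list."""
    if client is None:
        import boto3  # imported lazily so should_tag_pii can be used without AWS access
        client = boto3.client("comprehend")
    response = client.contains_pii_entities(Text=text, LanguageCode=language_code)
    return response["Labels"]


def should_tag_pii(labels, threshold=PII_TAG_THRESHOLD):
    """Return True if any label's confidence score exceeds the threshold (assumed rule)."""
    return any(label["Score"] > threshold for label in labels)


# Hypothetical response labels, for illustration only
sample_labels = [{"Name": "NAME", "Score": 0.98}, {"Name": "ADDRESS", "Score": 0.40}]
print(should_tag_pii(sample_labels))  # True
```

Keeping the threshold check separate from the API call makes the tagging rule easy to test without AWS credentials.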
Second pass PII detection and redaction
The first pass PII detection process narrows down the scope of the dataset by identifying which documents contain PII information. This speeds up the PII detection process and also reduces the overall cost. The goal of the second pass PII detection is to identify the individual instances of PII and redact them from the tagged documents in the first pass.
- Users use Logikcull’s advanced search filters on the Logikcull website to search for documents that contain PII.
- The request is handled by Logikcull’s application servers hosted on Amazon EC2, and the servers communicate with the search index cluster to find the documents.
- The Logikcull application servers identify the individual instances of PII by making the DetectPiiEntities API call. The servers make the API call by passing the text and language of the input documents. The DetectPiiEntities API action inspects the input text for entities that contain PII. For each entity, the response provides the entity type, where the entity text begins and ends, and the level of confidence that Amazon Comprehend has in its detection.
- The users then select the specific entities that they want to redact using Logikcull’s web interface. The application server sends these requests to Logikcull’s processing pipeline. The following is a screenshot of a PDF that was uploaded to Logikcull’s application. In the screenshot, you can see that different PII entities such as name, address, phone number, email address, and so on, have been highlighted.
- The PII redaction is safely applied inside Logikcull’s processing pipeline using custom business logic. From the screenshot that follows, you can see that users can select either specific PII entity types or all PII entity types that they want to redact and then, with a click of a single button, redact all the PII information.
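The DetectPiiEntities response gives a type plus begin and end offsets for each entity, which is enough to mask the selected spans. The following is a minimal sketch under stated assumptions: the `redact` helper and its equal-length masking rule are hypothetical, the sample entities are made up in the shape of the API response, and Logikcull's actual redaction uses custom business logic that the post does not describe.

```python
def redact(text, entities, entity_types=None, mask_char="*"):
    """Replace each selected entity span with mask characters of equal length.

    `entities` follows the shape of the DetectPiiEntities response: dicts with
    "Type", "BeginOffset", and "EndOffset". If `entity_types` is None, every
    entity is redacted; otherwise only entities of the listed types are.
    """
    chars = list(text)
    for entity in entities:
        if entity_types is not None and entity["Type"] not in entity_types:
            continue
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        chars[begin:end] = mask_char * (end - begin)
    return "".join(chars)


# Hypothetical DetectPiiEntities-style entities, for illustration only
text = "Call Jane Doe at 555-0100."
entities = [
    {"Type": "NAME", "BeginOffset": 5, "EndOffset": 13, "Score": 0.99},
    {"Type": "PHONE", "BeginOffset": 17, "EndOffset": 25, "Score": 0.97},
]
print(redact(text, entities, entity_types={"NAME"}))  # Call ******** at 555-0100.
print(redact(text, entities))                         # Call ******** at ********.
```

Because each span is replaced by a mask of the same length, earlier replacements do not shift the offsets of later entities.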
Results
Logikcull, a Reveal technology, is currently processing over 20 million documents each week and was able to narrow down the scope of detection using the ContainsPiiEntities API and display individual instances of PII entities to their customers by using the DetectPiiEntities API.
“With Amazon Comprehend, Logikcull has been able to rapidly deploy powerful NLP capabilities in a fraction of the time a custom-built solution would have required.”
– Steve Newhouse, VP of Product for Logikcull.
Conclusion
Amazon Comprehend allows Reveal’s Logikcull technology to run PII detection at large scale for relatively low cost. The ContainsPiiEntities API is used to do an initial scan of millions of documents. The DetectPiiEntities API is used to run a detailed analysis of thousands of documents and identify the individual pieces of PII in them.
Take a look at all the Amazon Comprehend features. Give the features a try and send us feedback either through the AWS forum for Amazon Comprehend or through your usual AWS support contacts.
About the Authors
Aman Tiwari is a General Solutions Architect working with Worldwide Commercial Sales at AWS. He works with customers in the Digital Native Business segment and helps them design innovative, resilient, and cost-effective solutions using AWS services. He holds a master’s degree in Telecommunications Networks from Northeastern University. Outside of work, he enjoys playing lawn tennis and reading books.
Jeff Newburn is a Senior Software Engineering Manager leading the Data Engineering team at Logikcull – A Reveal Technology. He oversees the company’s data initiatives, including data warehouses, visualizations, analytics, and machine learning. With experience spanning development and management in areas from ride sharing to data systems, he enjoys leading teams of brilliant engineers to exciting products.
Søren Blond Daugaard is a Staff Engineer in the Data Engineering team at Logikcull – A Reveal Technology. He implements highly scalable AI and ML solutions into the Logikcull product, enabling our customers to do their work more efficiently and with higher precision. His expertise spans data pipelines, web-based systems, and machine learning systems.
Kevin Lufkin is a Senior Software Engineer on the Search Engineering team at Logikcull – A Reveal Technology, where he focuses on developing customer facing and search-related features. His extensive expertise in UI/UX is complemented by a background in full-stack web development, with a strong focus on bringing product visions to life.
Turing’s Mill: AI Supercomputer Revs UK’s Economic Engine
The home of the first industrial revolution just made a massive investment in the next one.
The U.K. government has announced it will spend £225 million ($273 million) to build one of the world’s fastest AI supercomputers.
Called Isambard-AI, it’s the latest in a series of systems named after a legendary 19th century British engineer and hosted by the University of Bristol. When fully installed next year, it will pack 5,448 NVIDIA GH200 Grace Hopper Superchips to deliver a whopping 21 exaflops of AI performance for researchers across the country and beyond.
The announcement was made at the AI Safety Summit, a gathering of over 100 global government and technology leaders, held in Bletchley Park, the site of the world’s first digital programmable computer, which reflected the work of innovators like Alan Turing, considered the father of AI.
AI “will bring a transformation as far-reaching as the industrial revolution, the coming of electricity or the birth of the internet,” said British Prime Minister Rishi Sunak in a speech last week about the event, designed to catalyze international collaboration.
Propelling the Modern Economy
Like one of Isambard Brunel’s creations — the first propeller-driven, ocean-going iron ship — the AI technology running on his namesake is already driving countries forward.
AI contributes more than £3.7 billion to the U.K. economy and employs more than 50,000 people, said Michelle Donelan, the nation’s Science, Innovation and Technology Secretary, in an earlier announcement about the system.
The investment in the so-called AI Research Resource in Bristol “will catalyze scientific discovery and keep the U.K. at the forefront of AI development,” she said.
Like AI itself, the system will be used across a wide range of organizations tapping the potential of machine learning to advance robotics, data analytics, drug discovery, climate research and more.
“Isambard-AI represents a huge leap forward for AI computational power in the U.K.,” said Simon McIntosh-Smith, a Bristol professor and director of the Isambard National Research Facility. “Today, Isambard-AI would rank within the top 10 fastest supercomputers in the world and, when in operation later in 2024, it will be one of the most powerful AI systems for open science anywhere.”
The Next Manufacturing Revolution
Like the industrial revolution, AI promises advances in manufacturing. That’s one reason why Isambard-AI will be based at the National Composites Centre (NCC, pictured above) in the Bristol and Bath Science Park, one of the country’s seven manufacturing research centers.
The U.K.’s Frontier AI Taskforce, a research group leading a global effort on how frontier AI can be safely developed, will also be a major user of the system.
Hewlett Packard Enterprise, which is building Isambard-AI, is also collaborating with the University of Bristol on energy-efficiency plans that support net-zero carbon targets mandated by the British government.
Energy-Efficient HPC
A second system coming next year to the NCC will show Arm’s energy efficiency for non-accelerated high performance computing workloads.
Isambard-3 will deliver an estimated 2.7 petaflops of FP64 peak performance and consume less than 270 kilowatts of power, ranking it among the world’s three greenest non-accelerated supercomputers. That’s because the system — part of a research alliance among universities of Bath, Bristol, Cardiff and Exeter — will sport 384 Arm-based NVIDIA Grace CPU Superchips to power medical and scientific research.
“Isambard-3’s application performance efficiency of up to 6x its predecessor, which rivals many of the 50 fastest TOP500 systems, will provide scientists with a revolutionary new supercomputing platform to advance groundbreaking research,” said Bristol’s McIntosh-Smith, when the system was announced in March.
Unlocking the Power of Language: NVIDIA’s Annamalai Chockalingam on the Rise of LLMs
Generative AI and large language models (LLMs) are stirring change across industries — but according to NVIDIA Senior Product Manager of Developer Marketing Annamalai Chockalingam, “we’re still in the early innings.”
In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Chockalingam about LLMs: what they are, their current state and their future potential.
LLMs are a “subset of the larger generative AI movement” that deals with language. They’re deep learning algorithms that can recognize, summarize, translate, predict and generate language.
AI has been around for a while, but according to Chockalingam, three key factors enabled LLMs.
One is the availability of large-scale data sets to train models with. As more people used the internet, more data became available for use. The second is the development of computer infrastructure, which has become advanced enough to handle “mountains of data” in a “reasonable timeframe.” And the third is advancements in AI algorithms, allowing for non-sequential or parallel processing of large data pools.
LLMs can do five things with language: generate, summarize, translate, instruct or chat. With a combination of “these modalities and actions, you can build applications” to solve any problem, Chockalingam said.
Enterprises are tapping LLMs to “drive innovation,” “develop new customer experiences,” and gain a “competitive advantage.” They’re also exploring what safe deployment of those models looks like, aiming to achieve responsible development, trustworthiness and repeatability.
New techniques like retrieval augmented generation (RAG) could boost LLM development. RAG involves feeding models with up-to-date “data sources or third-party APIs” to achieve “more appropriate responses” — granting them current context so that they can “generate better” answers.
Chockalingam encourages those interested in LLMs to “get your hands dirty and get started” — whether that means using popular applications like ChatGPT or playing with pretrained models in the NVIDIA NGC catalog.
NVIDIA offers a full-stack computing platform for developers and enterprises experimenting with LLMs, with an ecosystem of over 4 million developers and 1,600 generative AI organizations. To learn more, register for LLM Developer Day on Nov. 17 to hear from NVIDIA experts about how best to develop applications.
Subscribe to the AI Podcast: Now Available on Amazon Music
The AI Podcast is now available through Amazon Music.
In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.
Make the AI Podcast better. Have a few minutes to spare? Fill out this listener survey.