NVIDIA Partners With APEC Economies to Change Lives, Increase Opportunity, Improve Outcomes

When patients in Vietnam enter a medical facility in distress, doctors use NVIDIA technology to get more accurate scans to diagnose their ailments. In Hong Kong, a different set of doctors leverage generative AI to discover new cures for patients.

Improving the health and well-being of citizens and strengthening economies and communities are key themes as world leaders prepare to gather in San Francisco for the 2023 Asia-Pacific Economic Cooperation (APEC) Summit.

As they meet to discuss bold solutions for improving the lives of their citizens and societies, NVIDIA’s AI and accelerated computing initiatives serve as a crucial enabler.

NVIDIA’s work to improve outcomes for everyday people while tackling future challenges builds on years of deep investment with APEC partners. With a strong presence in countries across the region, including a workforce of thousands and numerous collaborative projects in areas from farming to healthcare to education, NVIDIA is delivering new technologies and workforce training programs to enhance industrial development and advance generative AI research.

Beyond technological advancements, these efforts spur economic growth, create good-paying jobs and improve the health and well-being of people globally.

Research and National Compute Partnerships

NVIDIA has advanced AI research partnerships with several APEC economies. These partnerships accelerate scientific breakthroughs in AI and high performance computing to address national challenges such as healthcare and skills development, and they help build more robust local AI ecosystems that protect and advance well-being, prosperity and security. For example:

  • Australia’s national science and research organization, CSIRO, has teamed with NVIDIA to advance Australia’s AI program across climate action, space exploration, quantum computing and AI education.
  • Singapore’s National Supercomputing Centre and Ministry of Education have partnered with NVIDIA to drive sovereign AI capabilities with a priority focus on sectors such as healthcare, climate science and digital twins.
  • Thailand was Southeast Asia’s first country to participate in NVIDIA’s AI Nations initiative, bringing together the Ministry of Education with a consortium of top universities to advance public-private collaborations in urban planning, public health and autonomous vehicles.
  • In Vietnam, NVIDIA is partnering with Viettel, the nation’s largest employer, and Vietnam’s Academy for Science & Technology to upskill workforces, accelerate the introduction of AI services to industry and deploy next-generation 5G services.

Innovation Ecosystems

Startups are at the leading edge of AI innovation, and a robust startup ecosystem is vital to advancing technology within APEC economies.

NVIDIA Inception is a free program to help startups innovate faster. Through it, NVIDIA supports over 5,000 startups across APEC economies, and more than 15,000 globally, by providing cutting-edge technology, connections with venture capitalists and access to the latest technical resources.

In 2023, NVIDIA added nearly 1,000 APEC-area startups to the program. In addition to creating economic opportunities, Inception supports small- and medium-sized enterprises in developing novel solutions to some of society’s biggest challenges. Here’s what some of its members are doing:

  • In Malaysia, Tapway uses AI to reduce congestion and streamline traffic for more than 1 million daily travelers.
  • In New Zealand, Lynker uses geospatial analysis, deep learning and remote sensing for earth observation. Lynker’s technology measures carbon sequestration on farms; detects, monitors and helps restore wetlands; and enables more effective disaster relief.
  • In Thailand, AltoTech Global, an Inception partner, integrates AI software with Internet of Things devices to optimize energy consumption for hotels, buildings, factories and smart cities. AltoTech’s ultimate goal is contributing to the net-zero economy and helping customers achieve their net-zero targets.

Digital Upskilling and Tools for Growth

The NVIDIA Deep Learning Institute (DLI) provides AI training and digital upskilling programs that cultivate innovation and create economic opportunities.

DLI’s training and certification program helps individuals and organizations accelerate skills development and workforce transformation in AI, high performance computing and industrial digitalization.

Hands-on, self-paced and instructor-led courses are created and taught by NVIDIA experts, bringing real-world experience and deep technical know-how to developers and IT professionals.

Through this program, NVIDIA has trained more than 115,000 individuals in APEC economies, including more than 16,000 new trainees this year.

Separately, the NVIDIA Developer Program offers more than 2 million developers in APEC economies access to software development kits, application program interfaces, pretrained AI models and performance analysis tools to help developers create and innovate. Members receive free hands-on training, access to developer forums and early access to new products and services.

Creating a Better Future for All

As nations work together to address common challenges and improve the lives of their citizens, NVIDIA will continue to leverage its world-class technologies to help create a better world for all.

Read More

Dr Aengus Tran, co-founder of Annalise.ai and Harrison.ai on Using AI as a Spell Check for Health Checks

Clinician-led healthcare AI company Harrison.ai has built an AI system that effectively serves as a “spell checker” for radiologists — flagging critical findings to improve the speed and accuracy of radiology image analysis, reducing misdiagnoses.

In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Harrison.ai CEO and co-founder Aengus Tran about the company’s mission to scale global healthcare capacity with autonomous AI systems.

Harrison.ai’s initial product, annalise.ai, is an AI tool that automates radiology image analysis to enable faster, more accurate diagnoses. It can produce 124-130 different possible diagnoses and flag key findings to aid radiologists in their final diagnosis. Currently, annalise.ai works for chest X-rays and brain CT scans, with more on the way.

While an AI designed for categorizing traffic lights, for example, doesn’t need perfection, medical tools must be highly accurate — any oversight could be fatal. To overcome this challenge, annalise.ai was trained on millions of meticulously annotated images — some were annotated three to five times over before being used for training.

Harrison.ai is also developing Franklin.ai, a sibling AI tool aimed at accelerating and improving the accuracy of histopathology diagnosis — in which a clinician takes a biopsy and inspects the tissue for the presence of cancerous cells. Like annalise.ai, Franklin.ai flags critical findings to help pathologists make faster, more accurate diagnoses.

Ethical concerns about AI use continue to rise, but for Tran, the concern is less about whether it’s ethical to use AI for medical diagnosis than “actually the converse: Is it ethical to not use AI for medical diagnosis,” especially if “humans using those AI systems simply pick up more misdiagnosis, pick up more cancer and conditions?”

Tran also talked about the future of AI systems, suggesting a dual focus: first improve pre-existing systems, then develop new cutting-edge solutions.

And for those looking to break into careers in AI and healthcare, Tran says that the “first step is to decide upfront what problems you’re willing to spend a huge part of your time solving first, before the AI part,” emphasizing that the “first thing is actually to fall in love with some problem.”

You Might Also Like

Jules Anh Tuan Nguyen Explains How AI Lets Amputee Control Prosthetic Hand, Video Games
A postdoctoral researcher at the University of Minnesota discusses his efforts to allow amputees to control their prosthetic limb — right down to the finger motions — with their minds.

Overjet’s Dr. Wardah Inam on Bringing AI to Dentistry
Overjet, a member of NVIDIA Inception, is moving fast to bring AI to dentists’ offices. Dr. Wardah Inam, CEO of the company, discusses using AI to improve patient care.

Immunai CTO and Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs
Luis Voloch talks about tackling the challenges of the immune system with a machine learning and data science mindset.

Subscribe to the AI Podcast: Now Available on Amazon Music

The AI Podcast is now available through Amazon Music.

In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Read More

STEER: Semantic Turn Extension-Expansion Recognition for Voice Assistants

*= Equal Contributors
In the context of a voice assistant system, steering refers to the phenomenon in which a user issues a follow-up command attempting to direct or clarify a previous turn. We propose STEER, a steering detection model that predicts whether a follow-up turn is a user’s attempt to steer the previous command. Constructing a training dataset for steering use cases poses challenges due to the cold-start problem. To overcome this, we developed heuristic rules to sample opt-in usage data, approximating positive and negative samples without any annotation. Our experimental results… (Apple Machine Learning Research)

SeMAnD: Self-Supervised Anomaly Detection in Multimodal Geospatial Datasets

*= Equal Contributors
We propose a Self-supervised Anomaly Detection technique, called SeMAnD, to detect geometric anomalies in Multimodal geospatial datasets. Geospatial data comprises acquired and derived heterogeneous data modalities that we transform to semantically meaningful, image-like tensors to address the challenges of representation, alignment, and fusion of multimodal data. SeMAnD is comprised of (i) a simple data augmentation strategy, called RandPolyAugment, capable of generating diverse augmentations of vector geometries, and (ii) a self-supervised training objective with three… (Apple Machine Learning Research)

EELBERT: Tiny Models through Dynamic Embeddings

We introduce EELBERT, an approach for compression of transformer-based models (for example, BERT), with minimal impact on the accuracy of downstream tasks. This is achieved by replacing the input embedding layer of the model with dynamic (that is, on-the-fly) embedding computations. Since the input embedding layer accounts for a significant fraction of the model size, especially for the smaller BERT variants, replacing this layer with an embedding computation function helps us reduce the model size significantly. Empirical evaluation on the GLUE benchmark shows that our BERT variants… (Apple Machine Learning Research)

Alternating updates for efficient transformers

Contemporary deep learning models have been remarkably successful in many domains, ranging from natural language to computer vision. Transformer neural networks (transformers) are a popular deep learning architecture that today comprise the foundation for most tasks in natural language processing and also are starting to extend to applications in other domains, such as computer vision, robotics, and autonomous driving. Moreover, they form the backbone of all the current state-of-the-art language models.

Increasing scale in Transformer networks has led to improved performance and the emergence of behavior not present in smaller networks. However, this increase in scale often comes with prohibitive increases in compute cost and inference latency. A natural question is whether we can reap the benefits of larger models without incurring the computational burden.

In “Alternating Updates for Efficient Transformers”, accepted as a Spotlight at NeurIPS 2023, we introduce AltUp, a method to take advantage of increased token representation without increasing the computation cost. AltUp is easy to implement, widely applicable to any transformer architecture, and requires minimal hyperparameter tuning. For instance, using a variant of AltUp on a 770M parameter T5-Large model, the addition of ~100 parameters yields a model with significantly better quality.

Background

To understand how we can achieve this, we dig into how transformers work. First, they partition the input into a sequence of tokens. Each token is then mapped to an embedding vector (by means of an embedding table) called the token embedding. We call the dimension of this vector the token representation dimension. The transformer then operates on this sequence of token embeddings by applying a series of computation modules (called layers) using its network parameters. The number of parameters in each transformer layer is a function of the layer’s width, which is determined by the token representation dimension.

To achieve benefits of scale without incurring the compute burden, prior works such as sparse mixture-of-experts (Sparse MoE) models (e.g., Switch Transformer, Expert Choice, V-MoE) have predominantly focused on efficiently scaling up the network parameters (in the self-attention and feedforward layers) by conditionally activating a subset based on the input. This allows us to scale up network size without significantly increasing compute per input. However, there is a research gap on scaling up the token representation dimension itself by conditionally activating parts of the token representation vector.

Recent works (for example, scaling laws and infinite-width networks) have empirically and theoretically established that a wider token representation helps in learning more complicated functions. This phenomenon is also evident in modern architectures of increasing capability. For instance, the representation dimension grows from 512 (small) to 768 (base) and 1024 (corresponding to models with 770M, 3B, and 11B parameters respectively) in T5 models, and from 4096 (8B) to 8192 (64B) and 18432 (540B) in PaLM models. A widened representation dimension also significantly improves performance for dual encoder retrieval models. However, naïvely widening the representation vector requires one to increase the model dimension accordingly, which quadratically1 increases the amount of computation in the feedforward computation.
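
To make the quadratic dependence concrete, here is a small illustrative Python sketch (not from the original post) that counts the parameters of a standard transformer feedforward block; the common 4x expansion factor is an assumption used purely for illustration.

    # Parameters in a standard transformer feedforward block, which projects
    # d_model -> expansion * d_model -> d_model with two weight matrices
    # (biases omitted). The count grows quadratically with d_model.
    def ffn_params(d_model, expansion=4):
        return 2 * expansion * d_model * d_model

    for d_model in (512, 768, 1024):
        print(d_model, ffn_params(d_model))
    # 512  -> 2,097,152
    # 768  -> 4,718,592
    # 1024 -> 8,388,608  (doubling d_model roughly quadruples the cost)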

Method

AltUp works by partitioning a widened representation vector into equal sized blocks, processing only a single block at each layer, and using an efficient prediction-correction mechanism to infer the outputs of the other blocks (shown below on the right). This allows AltUp to simultaneously keep the model dimension, hence the computation cost, roughly constant and take advantage of using an increased token dimension. The increased token dimension allows the model to pack more information into each token’s embedding. By keeping the width of each transformer layer constant, AltUp avoids incurring the quadratic increase in computation cost that would otherwise be present with a naïve expansion of the representation.

An illustration of widening the token representation without (left) and with AltUp (right). This widening causes a near-quadratic increase in computation in a vanilla transformer due to the increased layer width. In contrast, Alternating Updates keeps the layer width constant and efficiently computes the output by operating on a sub-block of the representation at each layer.

More specifically, the input to each layer is two or more blocks, one of which is passed into the 1x width transformer layer (see figure below). We refer to this block as the “activated” block. This computation results in the exact output for the activated block. In parallel, we invoke a lightweight predictor that computes a weighted combination of all the input blocks. The predicted values, along with the computed value of the activated block, are passed on to a lightweight corrector that updates the predictions based on the observed values. This correction mechanism enables the inactivated blocks to be updated as a function of the activated one. Both the prediction and correction steps only involve a limited number of vector additions and multiplications and hence are much faster than a regular transformer layer. We note that this procedure can be generalized to an arbitrary number of blocks.

The predictor and corrector computations: The predictor mixes sub-blocks with trainable scalar coefficients; the corrector returns a weighted average of the predictor output and the transformer output. The predictor and corrector perform scalar-vector multiplications and incur negligible computation cost compared to the transformer. The predictor outputs a linear mixing of blocks with scalar mixing coefficients p_{i,j}, and the corrector combines predictor output and transformer output with weights g_i.
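
As a rough illustration of this prediction-correction step, the NumPy sketch below follows the description above; the exact parameterization and initialization in the paper may differ, and transformer_layer, P and g are placeholders for the real 1x-width layer and its trained coefficients.

    import numpy as np

    def altup_step(blocks, activated, transformer_layer, P, g):
        """One alternating-updates step over K equal-sized sub-blocks.

        blocks: list of K arrays, each (seq_len, d_model), forming the widened representation.
        activated: index of the block passed through the real 1x-width transformer layer.
        P: (K, K) trainable scalar mixing coefficients p[i, j].
        g: length-K trainable correction weights g[i].
        """
        K = len(blocks)
        # Predictor: a scalar-weighted mix of all input blocks (cheap vector ops only).
        pred = [sum(P[i, j] * blocks[j] for j in range(K)) for i in range(K)]
        # Exact transformer computation only for the activated block.
        y = transformer_layer(blocks[activated])
        # Corrector: update every prediction using the observed value of the activated block,
        # so inactivated blocks are refreshed as a function of the activated one.
        return [pred[i] + g[i] * (y - pred[activated]) for i in range(K)]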

At a higher level, AltUp is similar to sparse MoE in that it is a method to add capacity to a model in the form of conditionally accessed (external) parameters. In sparse MoE, the additional parameters take the form of feed forward network (FFN) experts and the conditionality is with respect to the input. In AltUp, the external parameters come from the widened embedding table and the conditionality takes the form of alternating block-wise activation of the representation vector, as in the figure above. Hence, AltUp has the same underpinning as sparse MoE models.

An advantage of AltUp over sparse MoE is that it does not necessitate sharding since the number of additional parameters introduced is a factor2 of the embedding table size, which typically makes up a small fraction of the overall model size. Moreover, since AltUp focuses on conditionally activating parts of a wider token representation, it can be applied synergistically with orthogonal techniques like MoE to obtain complementary performance gains.

Evaluation

AltUp was evaluated on T5 models on various benchmark language tasks. Models augmented with AltUp are uniformly faster than the extrapolated dense models at the same accuracy. For example, we observe that a T5 Large model augmented with AltUp leads to a 27%, 39%, 87%, and 29% speedup on GLUE, SuperGLUE, SQuAD, and Trivia-QA benchmarks, respectively.

Evaluations of AltUp on T5 models of various sizes and popular benchmarks. AltUp consistently leads to sizable speedups relative to baselines at the same accuracy. Latency is measured on TPUv3 with 8 cores. Speedup is defined as the change in latency divided by the AltUp latency (B = T5 Base, L = T5 Large, XL = T5 XL models).

AltUp’s relative performance improves as we apply it to larger models — compare the relative speedup of T5 Base + AltUp to that of T5 Large + AltUp. This demonstrates the scalability of AltUp and its improved performance on even larger models. Overall, AltUp consistently leads to models with better predictive performance than the corresponding baseline models with the same speed on all evaluated model sizes and benchmarks.

Extensions: Recycled AltUp

The AltUp formulation adds an insignificant amount of per-layer computation; however, it does require using a wider embedding table. In certain scenarios where the vocabulary size (i.e., the number of distinct tokens the tokenizer can produce) is very large, this may lead to a non-trivial amount of added computation for the initial embedding lookup and the final linear + softmax operation. A very large vocabulary may also lead to an undesirable amount of added embedding parameters. To address this, Recycled-AltUp is an extension of AltUp that avoids these computational and parameter costs by keeping the embedding table’s width the same.

Illustration of the Architecture for Recycled-AltUp with K = 2.

In Recycled-AltUp, instead of widening the initial token embeddings, we replicate the embeddings K times to form a wider token representation. Hence, Recycled-AltUp adds virtually no additional parameters relative to the baseline transformer, while benefiting from a wider token representation.
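
A minimal sketch of the replication idea follows; the vocabulary size, dimension and token IDs are arbitrary toy values chosen purely for illustration.

    import numpy as np

    vocab, d_model, K = 32000, 512, 2
    table = np.random.randn(vocab, d_model).astype(np.float32)  # embedding table kept at its original width
    token_ids = np.array([17, 42, 7])

    emb = table[token_ids]                      # (3, d_model) baseline embeddings
    wide = np.concatenate([emb] * K, axis=-1)   # (3, K * d_model) widened representation,
                                                # with virtually no extra embedding parameters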

Recycled-AltUp on T5-B/L/XL compared to baselines. Recycled-AltUp leads to strict improvements in pre-training performance without incurring any perceptible slowdown.

We also evaluate the lightweight extension of AltUp, Recycled-AltUp, with K = 2 on T5 base, large, and XL models and compare its pre-trained accuracy and speed to those of baselines. Since Recycled-AltUp does not require an expansion in the embedding table dimension, the models augmented with it have virtually the same number of trainable parameters as the baseline models. We again observe consistent improvements compared to the dense baselines.

Why does AltUp work?

AltUp increases a model’s capacity by adding auxiliary parameters to the embedding table and efficiently leveraging them, while maintaining the higher dimensional representation across the layers. We believe that a key ingredient in this computation lies in AltUp’s prediction mechanism, which performs an ensemble of the different blocks. This weighted combination enables continuous message passing to the entire vector despite activating only sub-blocks of it in each layer. Recycled-AltUp, on the other hand, does not add any additional parameters to the token embeddings. However, it still confers the benefit of simulating computation in a higher dimensional representation space, since a higher dimensional representation vector is maintained when moving from one transformer layer to another. We conjecture that this aids training by augmenting the flow of information through the network. An interesting research direction is to explore whether the benefits of Recycled-AltUp can be explained entirely by more favorable training dynamics.

Acknowledgements

We thank our collaborators Cenk Baykal, Dylan Cutler, and Rina Panigrahy at Google Research, and Nikhil Ghosh at University of California, Berkeley (work done during research internship at Google).


1This is because the feedforward layers of a Transformer are typically scaled quadratically with the model dimension. 

2This factor depends on the user-specified expansion factor, but is typically 1, i.e., we double the embedding table dimension. 

Read More

Harnessing the power of enterprise data with generative AI: Insights from Amazon Kendra, LangChain, and large language models

Large language models (LLMs), with their broad knowledge, can generate human-like text on almost any topic. However, because they are trained on massive general-purpose datasets, they are less effective for specialized tasks, and without continued learning they remain oblivious to new data and trends that emerge after their initial training. Furthermore, the cost to train new LLMs can prove prohibitive for many enterprise settings. With Retrieval-Augmented Generation (RAG), however, it’s possible to cross-reference a model’s answer against the original specialized content, avoiding the need to train a new LLM.

RAG empowers LLMs by giving them the ability to retrieve and incorporate external knowledge. Instead of relying solely on their pre-trained knowledge, RAG allows models to pull data from documents, databases, and more. The model then skillfully integrates this outside information into its generated text. By sourcing context-relevant data, the model can provide informed, up-to-date responses tailored to your use case. The knowledge augmentation also reduces the likelihood of hallucinations and inaccurate or nonsensical text. With RAG, foundation models become adaptable experts that evolve as your knowledge base grows.
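
The overall loop is straightforward. The following library-agnostic Python sketch is illustrative only; retrieve_passages and generate are hypothetical stand-ins for whatever retriever (Amazon Kendra, a vector index) and LLM you plug in.

    def answer_with_rag(question, retrieve_passages, generate, top_k=3):
        # 1. Retrieve external, up-to-date knowledge relevant to the question.
        passages = retrieve_passages(question, top_k)
        # 2. Assemble the retrieved passages into a context block.
        context = "\n\n".join(passages)
        # 3. Ground the generation in that context to reduce hallucinations.
        prompt = (
            "Answer the question using only the context below. "
            "If the answer is not in the context, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return generate(prompt)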

Today, we are excited to unveil three generative AI demos, licensed under the MIT-0 license:

  • Amazon Kendra with foundational LLM – Utilizes the deep search capabilities of Amazon Kendra combined with the expansive knowledge of LLMs. This integration provides precise and context-aware answers to complex queries by drawing from a diverse range of sources.
  • Embeddings model with foundational LLM – Merges the power of embeddings—a technique to capture semantic meanings of words and phrases—with the vast knowledge base of LLMs. This synergy enables more accurate topic modeling, content recommendation, and semantic search capabilities.
  • Foundation Models Pharma Ad Generator – A specialized application tailored for the pharmaceutical industry. Harnessing the generative capabilities of foundational models, this tool creates convincing and compliant pharmaceutical advertisements, ensuring content adheres to industry standards and regulations.

These demos can be seamlessly deployed in your AWS account, offering foundational insights and guidance on using AWS services to build a state-of-the-art generative AI question answering bot and content generation applications.

In this post, we explore how RAG combined with Amazon Kendra or custom embeddings can overcome these challenges and provide refined responses to natural language queries.

Solution overview

By adopting this solution, you can gain the following benefits:

  • Improved information access – RAG allows models to pull in information from vast external sources, which can be especially useful when the pre-trained model’s knowledge is outdated or incomplete.
  • Scalability – Instead of training a model on all available data, RAG allows models to retrieve relevant information on the fly. This means that as new data becomes available, it can be added to the retrieval database without needing to retrain the entire model.
  • Memory efficiency – LLMs require significant memory to store parameters. With RAG, the model can be smaller because it doesn’t need to memorize all details; it can retrieve them when needed.
  • Dynamic knowledge update – Unlike conventional models with a set knowledge endpoint, RAG’s external database can undergo regular updates, granting the model access to up-to-date information. The retrieval function can be fine-tuned for distinct tasks. For example, a medical diagnostic task can source data from medical journals, ensuring the model garners expert and pertinent insights.
  • Bias mitigation – The ability to draw from a well-curated database offers the potential to minimize biases by ensuring balanced and impartial external sources.

Before diving into the integration of Amazon Kendra with foundational LLMs, it’s crucial to equip yourself with the necessary tools and system requirements. Having the right setup in place is the first step towards a seamless deployment of the demos.

Prerequisites

You must have the following prerequisites:

Although it’s possible to set up and deploy the infrastructure detailed in this tutorial from your local computer, AWS Cloud9 offers a convenient alternative. Pre-equipped with tools like AWS CLI, AWS CDK, and Docker, AWS Cloud9 can function as your deployment workstation. To use this service, simply set up the environment via the AWS Cloud9 console.

With the prerequisites out of the way, let’s dive into the features and capabilities of Amazon Kendra with foundational LLMs.

Amazon Kendra with foundational LLM

Amazon Kendra is an advanced enterprise search service enhanced by machine learning (ML) that provides out-of-the-box semantic search capabilities. Utilizing natural language processing (NLP), Amazon Kendra comprehends both the content of documents and the underlying intent of user queries, positioning it as a content retrieval tool for RAG-based solutions. By using the high-accuracy search content from Kendra as a RAG payload, you can get better LLM responses. The use of Amazon Kendra in this solution also enables personalized search by filtering responses according to the end user’s content access permissions.

The following diagram shows the architecture of a generative AI application using the RAG approach.

Documents are processed and indexed by Amazon Kendra through the Amazon Simple Storage Service (Amazon S3) connector. Customer requests and contextual data from Amazon Kendra are directed to an Amazon Bedrock foundation model. The demo lets you choose between Amazon’s Titan, AI21’s Jurassic, and Anthropic’s Claude models supported by Amazon Bedrock. The conversation history is saved in Amazon DynamoDB, offering added context for the LLM to generate responses.

We have provided this demo in the GitHub repo. Refer to the deployment instructions within the readme file for deploying it into your AWS account.

The following steps outline the process when a user interacts with the generative AI app:

  1. The user logs in to the web app authenticated by Amazon Cognito.
  2. The user uploads one or more documents into Amazon S3.
  3. The user runs an Amazon Kendra sync job to ingest S3 documents into the Amazon Kendra index.
  4. The user’s question is routed through a secure WebSocket API hosted on Amazon API Gateway and backed by an AWS Lambda function.
  5. The Lambda function, empowered by the LangChain framework—a versatile tool designed for creating applications driven by AI language models—connects to the Amazon Bedrock endpoint to rephrase the user’s question based on chat history. After rephrasing, the question is forwarded to Amazon Kendra using the Retrieve API. In response, the Amazon Kendra index returns search results, providing excerpts from pertinent documents sourced from the enterprise’s ingested data (a minimal sketch of this step follows the list).
  6. The user’s question along with the data retrieved from the index are sent as a context in the LLM prompt. The response from the LLM is stored as chat history within DynamoDB.
  7. Finally, the response from the LLM is sent back to the user.
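
A minimal sketch of step 5’s retrieval and prompt assembly using the Amazon Kendra Retrieve API via boto3; the index ID is a placeholder, the result fields shown are assumptions based on the Retrieve API response shape, and the demo in the GitHub repo is the authoritative implementation.

    import boto3

    kendra = boto3.client("kendra")

    def build_prompt(question, index_id="YOUR-KENDRA-INDEX-ID", top_k=3):
        # Retrieve semantically relevant passages from the Kendra index.
        response = kendra.retrieve(IndexId=index_id, QueryText=question, PageSize=top_k)
        excerpts = [item["Content"] for item in response.get("ResultItems", [])]
        # Stuff the excerpts into the LLM prompt as grounding context.
        context = "\n\n".join(excerpts)
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )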

Document indexing workflow

The following is the procedure for processing and indexing documents:

  1. Users submit documents via the user interface (UI).
  2. Documents are transferred to an S3 bucket utilizing the AWS Amplify API.
  3. Amazon Kendra indexes new documents in the S3 bucket through the Amazon Kendra S3 connector.

Benefits

The following list highlights the advantages of this solution:

  • Enterprise-level retrieval – Amazon Kendra is designed for enterprise search, making it suitable for organizations with vast amounts of structured and unstructured data.
  • Semantic understanding – The ML capabilities of Amazon Kendra ensure that retrieval is based on deep semantic understanding and not just keyword matches.
  • Scalability – Amazon Kendra can handle large-scale data sources and provides quick and relevant search results.
  • Flexibility – The foundational model can generate answers based on a wide range of contexts, ensuring the system remains versatile.
  • Integration capabilities – Amazon Kendra can be integrated with various AWS services and data sources, making it adaptable for different organizational needs.

Embeddings model with foundational LLM

An embedding is a numerical vector that represents the core essence of diverse data types, including text, images, audio, and documents. This representation not only captures the data’s intrinsic meaning, but also adapts it for a wide range of practical applications. Embedding models, a branch of ML, transform complex data, such as words or phrases, into continuous vector spaces. These vectors inherently grasp the semantic connections between data, enabling deeper and more insightful comparisons.

RAG seamlessly combines the strengths of foundational models, like transformers, with the precision of embeddings to sift through vast databases for pertinent information. Upon receiving a query, the system utilizes embeddings to identify and extract relevant sections from an extensive body of data. The foundational model then formulates a contextually precise response based on this extracted information. This perfect synergy between data retrieval and response generation allows the system to provide thorough answers, drawing from the vast knowledge stored in expansive databases.

In the architectural layout, based on their UI selection, users are guided to either the Amazon Bedrock or Amazon SageMaker JumpStart foundation models. Documents undergo processing, and vector embeddings are produced by the embeddings model. These embeddings are then indexed using FAISS to enable efficient semantic search. Conversation histories are preserved in DynamoDB, enriching the context for the LLM to craft responses.

The following diagram illustrates the solution architecture and workflow.

We have provided this demo in the GitHub repo. Refer to the deployment instructions within the readme file for deploying it into your AWS account.

Embeddings model

The responsibilities of the embeddings model are as follows:

  • This model is responsible for converting text (like documents or passages) into dense vector representations, commonly known as embeddings.
  • These embeddings capture the semantic meaning of the text, allowing for efficient and semantically meaningful comparisons between different pieces of text.
  • The embeddings model can be trained on the same vast corpus as the foundational model or can be specialized for specific domains.

Q&A workflow

The following steps describe the workflow of the question answering over documents:

  1. The user logs in to the web app authenticated by Amazon Cognito.
  2. The user uploads one or more documents to Amazon S3.
  3. Upon document transfer, an S3 event notification triggers a Lambda function, which then calls the SageMaker embedding model endpoint to generate embeddings for the new document. The embeddings model converts the document text into dense vector representations (embeddings), and the resulting vector file is securely stored within the S3 bucket.
  4. When the user asks a question, it is likewise converted into a question embedding, and the FAISS retriever compares this embedding with the embeddings of all documents or passages in the index to find the most relevant passages (see the sketch after this list).
  5. The passages, along with the user’s question, are provided as context to the foundational model. The Lambda function uses the LangChain library and connects to the Amazon Bedrock or SageMaker JumpStart endpoint with a context-stuffed query.
  6. The response from the LLM is stored in DynamoDB along with the user’s query, the timestamp, a unique identifier, and other arbitrary identifiers for the item such as question category. Storing the question and answer as discrete items allows the Lambda function to easily recreate a user’s conversation history based on the time when questions were asked.
  7. Finally, the response is sent back to the user via an HTTPS request through the API Gateway WebSocket API integration response.
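
The following sketch shows the core FAISS similarity search from steps 3 and 4, with random vectors standing in for real document and question embeddings; the embedding dimension and corpus size are arbitrary assumptions for illustration.

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 384                                                     # embedding dimension (assumed)
    doc_embeddings = np.random.rand(1000, d).astype("float32")  # stand-ins for document embeddings

    index = faiss.IndexFlatL2(d)      # exact L2 search over dense vectors
    index.add(doc_embeddings)

    question_embedding = np.random.rand(1, d).astype("float32")  # stand-in for the embedded question
    distances, ids = index.search(question_embedding, 4)         # indices of the 4 closest passages
    # ids[0] identifies the passages to pass, along with the question, as context to the foundation model.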

Benefits

The following list describes the benefits of this solution:

  • Semantic understanding – The embeddings model ensures that the retriever selects passages based on deep semantic understanding, not just keyword matches.
  • Scalability – Embeddings allow for efficient similarity comparisons, making it feasible to search through vast databases of documents quickly.
  • Flexibility – The foundational model can generate answers based on a wide range of contexts, ensuring the system remains versatile.
  • Domain adaptability – The embeddings model can be trained or fine-tuned for specific domains, allowing the system to be adapted for various applications.

Foundation Models Pharma Ad Generator

In today’s fast-paced pharmaceutical industry, efficient and localized advertising is more crucial than ever. This is where an innovative solution comes into play, using the power of generative AI to craft localized pharma ads from source images and PDFs. Beyond merely speeding up the ad generation process, this approach streamlines the Medical Legal Review (MLR) process. MLR is a rigorous review mechanism in which medical, legal, and regulatory teams meticulously evaluate promotional materials to guarantee their accuracy, scientific backing, and regulatory compliance. Traditional content creation methods can be cumbersome, often requiring manual adjustments and extensive reviews to ensure alignment with regional compliance and relevance. However, with the advent of generative AI, we can now automate the crafting of ads that truly resonate with local audiences, all while upholding stringent standards and guidelines.

The following diagram illustrates the solution architecture.

In the architectural layout, based on their selected model and ad preferences, users are seamlessly guided to the Amazon Bedrock foundation models. This streamlined approach ensures that new ads are generated precisely according to the desired configuration. As part of the process, documents are efficiently handled by Amazon Textract, with the resultant text securely stored in DynamoDB. A standout feature is the modular design for image and text generation, granting you the flexibility to independently regenerate any component as required.

We have provided this demo in the GitHub repo. Refer to the deployment instructions within the readme file for deploying it into your AWS account.

Content generation workflow

The following steps outline the process for content generation:

  1. The user chooses their document, source image, ad placement, language, and image style.
  2. Secure access to the web application is ensured through Amazon Cognito authentication.
  3. The web application’s front end is hosted via Amplify.
  4. A WebSocket API, managed by API Gateway, facilitates user requests. These requests are authenticated through AWS Identity and Access Management (IAM).
  5. Integration with Amazon Bedrock includes the following steps:
    • A Lambda function employs the LangChain library to connect to the Amazon Bedrock endpoint using a context-rich query.
    • The text-to-text foundational model crafts a contextually appropriate ad based on the given context and settings (a minimal sketch of this call follows the list).
    • The text-to-image foundational model creates a tailored image, influenced by the source image, chosen style, and location.
  6. The user receives the response through an HTTPS request via the integrated API Gateway WebSocket API.
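
As a rough sketch of the text-to-text call in step 5, the snippet below invokes an Amazon Bedrock model through boto3. The model ID and the request/response body shape follow the Anthropic text-completions format and are assumptions; the demo’s actual code (which goes through LangChain) and its supported models may differ.

    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime")

    def generate_ad_text(extracted_text, placement, language):
        # Prompt assembled from the Textract output and the user's ad settings.
        prompt = (
            f"\n\nHuman: Using the product information below, draft a {language} "
            f"pharmaceutical ad for {placement}. Keep it factual and compliant.\n\n"
            f"{extracted_text}\n\nAssistant:"
        )
        response = bedrock.invoke_model(
            modelId="anthropic.claude-v2",  # assumed model choice
            body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 500}),
            contentType="application/json",
            accept="application/json",
        )
        return json.loads(response["body"].read())["completion"]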

Document and image processing workflow

The following is the procedure for processing documents and images:

  1. The user uploads assets via the specified UI.
  2. The Amplify API transfers the documents to an S3 bucket.
  3. After the asset is transferred to Amazon S3, one of the following actions takes place:
    • If it’s a document, a Lambda function uses Amazon Textract to process and extract text for ad generation.
    • If it’s an image, the Lambda function converts it to base64 format, suitable for the Stable Diffusion model to create a new image from the source (see the sketch after this list).
  4. The extracted text or base64 image string is securely saved in DynamoDB.
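
A minimal sketch of step 3’s two branches, using Amazon Textract’s synchronous text detection for documents and base64 encoding for images; the bucket and key names are placeholders, and error handling is omitted.

    import base64
    import boto3

    textract = boto3.client("textract")
    s3 = boto3.client("s3")

    def extract_document_text(bucket, key):
        # Synchronous text detection on a single-page document stored in S3.
        result = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        lines = [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
        return "\n".join(lines)

    def image_to_base64(bucket, key):
        # Fetch the source image and encode it for the image-to-image model.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        return base64.b64encode(body).decode("utf-8")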

Benefits

The following list describes the benefits of this solution:

  • Efficiency – The use of generative AI significantly accelerates the ad generation process, eliminating the need for manual adjustments.
  • Compliance adherence – The solution ensures that generated ads adhere to specific guidance and regulations, such as the FDA’s guidelines for marketing.
  • Cost-effective – By automating the creation of tailored ads, companies can significantly reduce costs associated with ad production and revisions.
  • Streamlined MLR process – The solution simplifies the MLR process, reducing friction points and ensuring smoother reviews.
  • Localized resonance – Generative AI produces ads that resonate with local audiences, ensuring relevance and impact in different regions.
  • Standardization – The solution maintains necessary standards and guidelines, ensuring consistency across all generated ads.
  • Scalability – The AI-driven approach can handle vast databases of source images and PDFs, making it feasible for large-scale ad generation.
  • Reduced manual intervention – The automation reduces the need for human intervention, minimizing errors and ensuring consistency.

You can deploy the infrastructure in this tutorial from your local computer or you can use AWS Cloud9 as your deployment workstation. AWS Cloud9 comes pre-loaded with the AWS CLI, AWS CDK, and Docker. If you opt for AWS Cloud9, create the environment from the AWS Cloud9 console.

Clean up

To avoid unnecessary cost, clean up all the infrastructure created via the AWS CloudFormation console or by running the following command on your workstation:

$ cdk destroy --all

Additionally, remember to stop any SageMaker endpoints you initiated via the SageMaker console. Remember, deleting an Amazon Kendra index doesn’t remove the original documents from your storage.

Conclusion

Generative AI, epitomized by LLMs, heralds a paradigm shift in how we access and generate information. These models, while powerful, are often limited by the confines of their training data. RAG addresses this challenge, ensuring that the vast knowledge within these models is consistently infused with relevant, current insights.

Our RAG-based demos provide a tangible testament to this. They showcase the seamless synergy between Amazon Kendra, vector embeddings, and LLMs, creating a system where information is not only vast but also accurate and timely. As you dive into these demos, you’ll explore firsthand the transformational potential of merging pre-trained knowledge with the dynamic capabilities of RAG, resulting in outputs that are both trustworthy and tailored to enterprise content.

Although generative AI powered by LLMs opens up a new way of gaining information insights, these insights must be trustworthy and confined to enterprise content using the RAG approach. These RAG-based demos equip you with insights that are accurate and up to date. The quality of these insights depends on semantic relevance, which is enabled by using Amazon Kendra and vector embeddings.

If you’re ready to further explore and harness the power of generative AI, here are your next steps:

  • Engage with our demos – The hands-on experience is invaluable. Explore the functionalities, understand the integrations, and familiarize yourself with the interface.
  • Deepen your knowledge – Take advantage of the resources available. AWS offers in-depth documentation, tutorials, and community support to aid in your AI journey.
  • Initiate a pilot project – Consider starting with a small-scale implementation of generative AI in your enterprise. This will provide insights into the system’s practicality and adaptability within your specific context.

For more information about generative AI applications on AWS, refer to the following:

Remember, the landscape of AI is constantly evolving. Stay updated, remain curious, and always be ready to adapt and innovate.


About The Authors

Jin Tan Ruan is a Prototyping Developer within the AWS Industries Prototyping and Customer Engineering (PACE) team, specializing in NLP and generative AI. With a background in software development and nine AWS certifications, Jin brings a wealth of experience to assist AWS customers in materializing their AI/ML and generative AI visions using the AWS platform. He holds a master’s degree in Computer Science & Software Engineering from Syracuse University. Outside of work, Jin enjoys playing video games and immersing himself in the thrilling world of horror movies.

Aravind Kodandaramaiah is a Senior Prototyping full stack solution builder within the AWS Industries Prototyping and Customer Engineering (PACE) team. He focuses on helping AWS customers turn innovative ideas into solutions with measurable and delightful outcomes. He is passionate about a range of topics, including cloud security, DevOps, and AI/ML, and can be usually found tinkering with these technologies.

Arjun Shakdher is a Developer on the AWS Industries Prototyping (PACE) team who is passionate about blending technology into the fabric of life. Holding a master’s degree from Purdue University, Arjun’s current role revolves around architecting and building cutting-edge prototypes that span an array of domains, presently prominently featuring the realms of AI/ML and IoT. When not immersed in code and digital landscapes, you’ll find Arjun indulging in the world of coffee, exploring the intricate mechanics of horology, or reveling in the artistry of automobiles.

Read More

Toward developing faster algorithms for minimizing submodular functions

This research paper was presented at the 64th IEEE Symposium on Foundations of Computer Science (FOCS) 2023, a premier forum for the latest research in theoretical computer science.

FOCS 2023 paper: Toward developing faster algorithms for minimizing submodular functions

Submodular functions are versatile mathematical tools, finding diverse applications in real-world scenarios and guiding solutions across complex domains. From dissecting the intricate networks of graphs to deciphering the complexities of economic landscapes through utility functions, and even navigating the enigmatic world of random variables via entropy functions, they offer valuable insights into challenging problems. Their wide-ranging applicability has made them pivotal tools for modeling and optimization in various theoretical computer science domains, including operations research and game theory. In recent years, submodular functions have gained prominence in solving optimization problems within machine learning (ML) applications. These tasks encompass vital areas such as feature selection and clustering, as illustrated in Figure 1. Additionally, submodular functions are instrumental in applications like sensor placement and graphical models. For further exploration, comprehensive resources are available in Bilmes’ insightful survey and Bach’s standard textbook on this subject.

Two graphics. The left graphic depicts the process of feature selection, beginning with all the features on the top, then the unselected features crossed in the middle, and finally the selected features remain at the bottom. The right graphic shows the process of clustering, where a set of points in 2D are assigned different colors so that points with the same color are physically close to each other to form a cluster.
Figure 1. Application of submodular function optimization to feature selection, on the left, and clustering on the right.

Algorithm design for submodular function minimization

In a joint paper with researchers from Stanford University, “Sparse Submodular Function Minimization,” presented at FOCS 2023, we investigate the problem of minimizing a submodular function in the standard model. Here, we assume that the submodular function can be accessed through an evaluation oracle that returns the value \( f(S) \) in response to a query with a set \( S \). This is the most classical and well-studied model for studying algorithm design for minimizing submodular functions.

Before we discuss our study, it’s important to bear in mind that a submodular function \( f \) is defined on subsets of a finite set of elements \( V \) and satisfies a diminishing marginal difference property. That is, for any two subsets \( S \subseteq T \) and any element \( e \in V \setminus T \), the marginal value of \( e \) when added to the smaller set, \( f(S \cup \{e\}) - f(S) \), is at least the marginal value of \( e \) when added to the bigger set, \( f(T \cup \{e\}) - f(T) \).
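
A toy example may help: coverage functions, where f(S) is the size of the union of the sets indexed by S, are classically submodular, and the short Python check below (illustrative only, not from the paper) verifies the diminishing marginal difference property on a small ground set.

    # Coverage function over the ground set V = {"a", "b", "c"}.
    sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}

    def f(S):
        covered = set()
        for e in S:
            covered |= sets[e]
        return len(covered)

    S = {"a"}           # S is a subset of T
    T = {"a", "b"}
    e = "c"             # e lies outside T

    gain_small = f(S | {e}) - f(S)   # marginal value of e added to the smaller set: 3
    gain_large = f(T | {e}) - f(T)   # marginal value of e added to the bigger set: 2
    assert gain_small >= gain_large  # diminishing marginal difference holds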

In the 1980s, foundational work (opens in new tab) revealed that submodular functions could be minimized in polynomial time, marking a significant breakthrough. Since then, researchers have made substantial progress in the quest for faster algorithms for submodular function minimization (SFM). Despite these efforts, fundamental questions persist, such as determining the minimum number of queries required to minimize any given submodular function—a concept referred to as the problem’s query complexity.

Currently, the most advanced algorithm needs to make \( \widetilde{O}(n^2) \) queries for any given submodular function, while the best lower bound is only \( \widetilde{\Omega}(n) \), where \( n \) is the size of the ground set on which the submodular function is defined. This disparity results in a substantial gap, leaving an \( n \)-fold difference between the existing upper and lower bounds.

Given this considerable difference, a natural question arises: What additional structural assumptions could potentially pave the way for faster algorithms in submodular function minimization (SFM)? One prevalent assumption is sparsity, which posits that the size of the set minimizing the submodular function is small. This holds particular relevance in diverse applications, including signal processing, feature selection, and compressed sensing. In these scenarios, solutions are expected to exhibit sparse non-zero entries, making it important to understand how algorithmic complexity depends on sparsity, as it provides insights into the intricate combinatorial and geometric structures of the problems.

Interestingly, existing algorithmic techniques developed over the past four decades for SFM do not yield improved runtimes even when the solution is sparse. Therefore, it is imperative to develop innovative techniques that can drive advancements in sparse SFM and bridge the existing gap between upper and lower bounds.

Parallel algorithms for submodular function minimization

Exploring beyond SFM’s query complexity, recent research has shed light on the importance of sparse SFM, particularly in understanding the inherent adaptivity of parallel algorithms (known as parallel complexity) designed to solve the problem. Research has shown that any parallel algorithm for SFM requires a minimum adaptivity that is a polynomial in the size of the ground set.

Our results improve both parallel and sequential algorithms for SFM. For example, consider a scenario where the minimizer of the given submodular function is \( \widetilde{O}(1) \)-sparse. In this context, our parallel algorithm runs in a nearly constant number of rounds, while our sequential algorithm makes a nearly linear number of queries. This achievement stands in stark contrast with the previous best parallel upper bound of \( \widetilde{O}(n) \) and the best query complexity upper bound of \( \widetilde{O}(n^2) \).

Fast first-order methods for exact submodular function minimization

Current fast algorithms for SFM rely on cutting-plane methods, a standard class of convex optimization techniques applied to the Lovász extension—a natural continuous extension of the given submodular function. However, restricting the optimization domain to sparse solutions doesn’t significantly expedite cutting-plane methods beyond a logarithmic factor. To address this, we shifted our approach and employed first-order methods, including stochastic mirror descent, to minimize the Lovász extension. These methods, non-Euclidean generalizations of stochastic gradient descent, are more attuned to the problem’s geometry. Unlike cutting-plane methods, whose runtime depends only polylogarithmically on the additive error to the optimal solution, first-order methods have a polynomial dependence on that error.

This rate of convergence indicates that first-order methods are better suited for approximate submodular function minimization, while our goal is to solve it exactly. Using the sparsity assumption, we developed a new algorithmic framework for SFM based on a new concept of duality. We used this framework to demonstrate how first-order methods, with substantially reduced accuracy requirements, can be applied to solve SFM exactly.

Toward faster algorithms for SFM and its applications

These techniques not only promise advancements for sparse SFM but also provide a foundation for tackling other fundamental problems in SFM theory. Our algorithms for sparse SFM serve as valuable starting points for designing improved algorithms for related problems. They offer potential insights into developing polynomial-time algorithms for SFM with lower query and parallel complexity, opening avenues for future research.

Traditionally, research on submodular function minimization has focused on the global properties of the problem over the past four decades. Sparse SFM, in contrast, enables us to explore local and more refined structures of submodular functions. Our work introduces new algorithmic tools that better use these structural properties, a vital aspect for applications in ML and operations research, because these areas often have special structures. Beyond advancing sparse SFM, our paradigm paves the way for the development of enhanced algorithms for SFM and its diverse applications.

The post Toward developing faster algorithms for minimizing submodular functions appeared first on Microsoft Research.

Read More

Digital Artist Steven Tung Shows Off So-fish-ticated Style This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep-diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

Taiwanese artist Steven Tung creates captivating 2D and 3D digital art that explores sci-fi, minimalism and realism and pushes artistic boundaries.

This week In the NVIDIA Studio, Tung shares the inspiration and creative workflow behind his whimsical animation, The Given Fish.

Professional-grade technology, which was once available only at select special effects studios, is becoming increasingly accessible.

“Visual production capabilities continue to skyrocket, generating a growing demand for better computer hardware among the general public,” Tung said. “The evolving synergy between art and technology can spark endless possibilities for creators.”

Tung uses an MSI MEG Trident X2 desktop, powered by GeForce RTX 4090 graphics, to accelerate his creative workflow.

The MSI MEG Trident X2 desktop, powered by GeForce RTX 4090 graphics.

“The enhanced speed and performance expedites various processes, such as updating material textures in Adobe Substance 3D Painter and rendering in Blender,” said Tung. “The necessary specifications and requirements align, enabling maximum creativity without limitations.”

Exquisite Visuals Made E-fish-ciently

Tung’s 3D animation, The Given Fish, may look simple at first glance — but it’s surprisingly complex.

“GeForce RTX GPUs are indispensable hardware for 3D rendering tasks. Faster speeds bring significant benefits in production efficiency and time saved.” — Steven Tung

In the imaginative world behind the animation, the stone fish can be consumed by people: once taken out of the aquarium, it transforms into a real, living fish.

“I have a strong desire to have an aquarium at home, but it’s not practical,” said Tung. “The next best thing is to turn that emotion into art.”

Tung began by creating concept sketches in Adobe Photoshop, where he had access to over 30 GPU-accelerated features that could help modify and adjust his canvas and maximize his efficiency.

Concept art for “The Given Fish.”

Next, Tung jumped from 2D to 3D with ZBrush. He first built a basic model and then refined critical details with custom brushes — adding greater depth and dimension with authentic, hand-sculpted textures.

Advanced sculpting in ZBrush.

He then used the UV unwrapping feature in RizomUV to ensure that his models were properly unwrapped and ready for texture application.

UV unwrapping feature in RizomUV.

Tung imported the models into Adobe Substance 3D Painter, where he meticulously painted textures, blended materials and used the built-in library to achieve lifelike stone textures. RTX-accelerated light and ambient occlusion baking optimized his assets in seconds.

Applying textures in Adobe Substance 3D Painter.

To bring all the elements together, Tung imported the models and materials into Blender. He set up texture channels, assigned texture files and assembled the models so that they would be true to the compositions outlined in the initial sketch.

Achieving realistic stone textures in Adobe Substance 3D Painter.

Next, Tung used Blender Cycles to light and render the scene.

Composition edits in Blender.

Blender Cycles’ RTX-accelerated, AI-powered OptiX ray tracing enabled interactive, photorealistic movement in the viewport and sped up animation work — all powered by his GeForce RTX 4090 GPU-equipped system.

Animation work in Blender.

RTX-accelerated OptiX ray tracing in Blender Cycles enabled the fastest final frame render.

Digital artist Steven Tung.

Check out Tung’s portfolio on Instagram.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Read More