How to Regularize Your Regression


A series of regression instances in a pharmaceutical application. Can we learn how to set the regularization parameter (lambda) from similar domain-specific data?

Overview. Perhaps the simplest relation between a real dependent variable (y) and a vector of features (X) is a linear model (y = beta X). Given some training examples or datapoints consisting of pairs of features and dependent variables ((X_1, y_1), (X_2, y_2), dots, (X_m, y_m)), we would like to learn the (beta) which would give the best prediction (y’) given the features (X’) of an unseen example. This process of fitting a linear model (beta) to the datapoints is called linear regression. This simple yet effective model finds ubiquitous applications, ranging from the biological, behavioral, and social sciences to environmental studies and financial forecasting, wherever one needs to make reliable predictions on future data. In ML terminology, linear regression is a supervised learning algorithm with low variance and good generalization properties. It is much less data-hungry than typical deep learning models and performs well even with small amounts of training data. Further, to avoid overfitting the model to the training data, which reduces the prediction performance on unseen data, one typically uses regularization, which modifies the objective function of the linear model to reduce the impact of outliers and irrelevant features (read on for details).

The most common method for linear regression is “regularized least squares”, where one finds the (beta) which minimizes

$$\|y - X\beta\|_2^2 + \lambda \|\beta\|.$$

Here the first term captures the error of (beta) on the training set, and the second term is a norm-based penalty to avoid overfitting (e.g., reducing the impact of outliers in the data). How to set (lambda) appropriately in this fundamental method depends on the data domain and is a longstanding open question. In typical modern applications, we have access to several similar datasets ((X^{(0)}, y^{(0)}), (X^{(1)}, y^{(1)}), dots) from the same application domain. For example, a pharmaceutical company often runs multiple drug trial studies to investigate the effects of similar drugs. In this work, we show that we can indeed learn a good domain-specific value of (lambda) with strong theoretical guarantees of accuracy on unseen datasets from the same domain, and give bounds on how much data is needed to achieve this.

As our main result, we show that if the data has (p) features (i.e., the dimension of the feature vector (X_i) is (p)), then after seeing (O(p/epsilon^2)) datasets, we can learn a value of (lambda) which has error (averaged over the domain) within (epsilon) of the best possible value of (lambda) for the domain. We also extend our results to sequential data, binary classification (i.e., (y) is binary valued), and non-linear regression.

Problem Setup. Linear regression with a norm-based regularization penalty is one of the most popular techniques that one encounters in introductory courses in statistics or machine learning. It is widely used for data analysis and feature selection, with numerous applications including medicine, quantitative finance (the linear factor model), climate science, and so on. The regularization penalty is typically a weighted additive term (or terms) of the norms of the learned linear model (beta), where the weight is carefully selected by a domain expert. Mathematically, a dataset consists of a dependent variable (y) with (m) examples and predictor variables (X) with (p) features for each of the (m) datapoints. The linear regression approach (with squared loss) consists of solving a minimization problem

$$\hat{\beta}^{X,y}_{\lambda_1,\lambda_2}=\text{argmin}_{\beta\in\mathbb{R}^p}\|y-X\beta\|^2+\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2^2,$$

where the last two terms form the regularization penalty. Here (lambda_1, lambda_2 ge 0) are the regularization coefficients constraining the L1 and L2 norms, respectively, of the learned linear model (beta). For general (lambda_1) and (lambda_2) the above algorithm is popularly known as the Elastic Net, while setting (lambda_1 = 0) recovers Ridge regression and setting (lambda_2 = 0) corresponds to LASSO. Ridge and LASSO regression are both individually popular methods in practice, and the Elastic Net incorporates the advantages of both.
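
As a concrete illustration (our own, not taken from the papers), here is a minimal sketch of fitting Ridge, LASSO, and the Elastic Net with scikit-learn. Note that scikit-learn parameterizes the Elastic Net by (alpha, l1_ratio) and rescales the squared error by 1/(2m), so the conversion from (lambda_1, lambda_2) below comes from matching its documented objective to the one above.

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
m, p = 100, 10
X = rng.standard_normal((m, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(m)

lam1, lam2 = 0.5, 1.0  # lambda_1, lambda_2 in the objective above

# scikit-learn's ElasticNet minimizes
#   (1/(2m))||y - X beta||^2 + alpha*l1_ratio*||beta||_1 + 0.5*alpha*(1 - l1_ratio)*||beta||_2^2,
# so dividing our objective by 2m gives the conversion below.
alpha = lam1 / (2 * m) + lam2 / m
l1_ratio = (lam1 / (2 * m)) / alpha

enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=lam2, fit_intercept=False).fit(X, y)            # lambda_1 = 0
lasso = Lasso(alpha=lam1 / (2 * m), fit_intercept=False).fit(X, y)  # lambda_2 = 0
print(enet.coef_)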

Despite the central role these coefficients play in linear regression, the problem of setting them in a principled way has been a challenging open problem for several decades. In practice, one typically uses “grid search” cross-validation, which involves (1) splitting the dataset into several subsets consisting of training and validation sets, (2) training several models (corresponding to different values of the regularization coefficients) on each training set, and (3) comparing the performance of the models on the corresponding validation sets. This approach, a basic version of which is sketched in code after the list below, has several limitations.

  • First, this is very computationally intensive, especially with the large datasets that typical modern applications involve, as one needs to train and evaluate the model for a large number of hyperparameter values and training-validation splits. We would like to avoid repeating this cumbersome process for similar applications. 
  • Second, theoretical guarantees on how well the coefficients learned by this procedure will perform on unseen examples are not known, even when the test data are drawn from the same distribution as the training set. 
  • Finally, this can only be done for a finite set of hyperparameter values and it is not clear how the selected parameter compares to the best parameter from the continuous domain of coefficients. In particular, the loss as a function of the regularization parameter is not known to be Lipschitz.
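
For reference, here is a minimal sketch of the grid-search cross-validation baseline described above, using scikit-learn; the data, grid values, and number of folds are illustrative, and the fit count makes the computational cost in the first point concrete.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)

# A 5 x 5 grid over 5 folds means 125 regression fits, for a single dataset.
param_grid = {"alpha": np.logspace(-3, 1, 5), "l1_ratio": np.linspace(0.1, 0.9, 5)}
search = GridSearchCV(ElasticNet(max_iter=10_000), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)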

Our work addresses all three of the above limitations simultaneously in the data-driven setting, which we motivate and describe next.

The importance of regularization

A visualization of the L1 and L2 regularized regressions.

The regularization coefficients (lambda_1) and (lambda_2) play a crucial role across fields: In machine learning, controlling the norm of model weights (beta) implies provable generalization guarantees and prevents over-fitting in practice. In statistical data analysis, their combined use yields parsimonious and interpretable models. In Bayesian statistics, they correspond to imposing specific priors on (beta). Effectively, (lambda_2) regularizes (beta) by uniformly shrinking all coefficients, while (lambda_1) encourages the model vector to be sparse. This means that while they do yield learning-theoretic and statistical benefits, setting them too high will cause models to under-fit the data. The question of how to set the regularization coefficients becomes even more unclear in the case of the Elastic Net, as one must juggle trade-offs between sparsity, feature correlation, and bias when setting both (lambda_1) and (lambda_2) simultaneously.

The data-driven algorithm design paradigm

In many applications, one has access to not just a single dataset, but a large number of similar datasets coming from the same domain. This is increasingly true in the age of big data, where more and more fields are recording and storing data for the purpose of pattern analysis. For example, a drug company typically conducts a large number of trials for a variety of different drugs. Similarly, a climate scientist monitors several different environmental variables and continuously collects new data. In such a scenario, can we exploit the similarity of the datasets to avoid doing cumbersome cross-validation each time we see a new dataset? This motivates the data-driven algorithm design setting, introduced in the theory of computing community by Gupta and Roughgarden as a tool for the design and analysis of algorithms that work well on typical datasets from an application domain (as opposed to worst-case analysis). This approach has been successfully applied to several combinatorial problems including clustering, mixed integer programming, automated mechanism design, and graph-based semi-supervised learning (Balcan, 2020). We show how to apply this analytical paradigm to tuning the regularization parameters in linear regression, extending the scope of its application beyond combinatorial problems [1, 2].

The learning model

A model for studying repeated regression instances from the same domain.

Formally, we model data coming from the same domain as a fixed (but unknown) distribution (D) over the problem instances. To capture the well-known cross-validation setting, we consider each problem instance of the form (P=(X_{text{train}}, y_{text{train}}, X_{text{val}}, y_{text{val}})). That is, the random process that generates the datasets and the (random or deterministic) process that generates the splits given the data have been combined under (D). The goal of the learning process is to take (N) problem samples generated from the distribution (D), and learn regularization coefficients (hat{lambda}=(lambda_1, lambda_2)) that would generalize well over unseen problem instances drawn from (D). That is, on an unseen test instance (P’=(X’_{text{train}}, y’_{text{train}}, X’_{text{val}}, y’_{text{val}})), we will fit the model (beta) using the learned regularization coefficients (hat{lambda}) on (X’_{text{train}}, y’_{text{train}}), and evaluate the loss on the set (X’_{text{val}}, y’_{text{val}}). We seek the value of (hat{lambda}) that minimizes this loss, in expectation over the draw of the random test sample from (D).
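
To make this concrete, here is a minimal sketch (our own illustration, not the algorithm analyzed in the papers) of the natural empirical approach: evaluate a grid of candidate coefficients on the N sampled problem instances and keep the candidate with the lowest average validation loss. The candidates are expressed in scikit-learn's (alpha, l1_ratio) parameterization of the Elastic Net.

import numpy as np
from sklearn.linear_model import ElasticNet

def avg_val_loss(candidate, problems):
    """Average validation MSE of one (alpha, l1_ratio) candidate over the problem instances."""
    alpha, l1_ratio = candidate
    losses = []
    for X_tr, y_tr, X_val, y_val in problems:
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000).fit(X_tr, y_tr)
        losses.append(np.mean((model.predict(X_val) - y_val) ** 2))
    return np.mean(losses)

def learn_lambda(problems, grid):
    """Return the candidate with the lowest average validation loss over the sampled instances."""
    return min(grid, key=lambda candidate: avg_val_loss(candidate, problems))

# `problems` is a list of N tuples (X_train, y_train, X_val, y_val) drawn from the domain:
# grid = [(a, r) for a in np.logspace(-3, 1, 10) for r in np.linspace(0.1, 0.9, 5)]
# best_alpha, best_l1_ratio = learn_lambda(problems, grid)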

How much data do we need?

The model (beta) clearly depends on both the dataset ((X,y)), and the regularization coefficients (lambda_1, lambda_2). A key tool in data-driven algorithm design is the analysis of the “dual function”, which is the loss expressed as a function of the parameters, for a fixed problem instance. This is typically easier to analyze than the “primal function” (loss for a fixed parameter, as problem instances are varied) in data-driven algorithm design problems. For Elastic Net regression, the dual is the validation loss on a fixed validation set for models trained with different values of (lambda_1, lambda_2) (i.e. two-parameter function) for a fixed training set. Typically the dual functions in combinatorial problems exhibit a piecewise structure, where the behavior of the loss function can have sharp transitions across the pieces. For example, in clustering this piecewise behavior could correspond to learning a different cluster in each piece. Prior research has shown that if we can bound the complexity of the boundary and piece functions in the dual function, then we can give a sample complexity guarantee, i.e. we can answer the question “how much data is sufficient to learn a good value of the parameter?”

An illustration of the piecewise structure of the Elastic Net dual loss function. Here (r_1) and (r_2) are polynomial boundary functions, and (f_{*,*}) are piece functions which are fixed rational functions given the signs of boundary functions.

Somewhat surprisingly, we show that the dual loss function exhibits a piecewise structure even in linear regression, a classic continuous optimization problem. Intuitively, the pieces correspond to different subsets of the features being “active”, i.e. having non-zero coefficients in the learned model (beta). Specifically, we show that the piece boundaries of the dual function are polynomial functions of bounded degree, and the loss within each piece is a rational function (ratio of two polynomial functions) again of bounded degree. We use this structure to establish a bound on the learning-theoretic complexity of the dual function; more precisely, we bound its pseudo-dimension (a generalization of the VC dimension to real-valued functions).

Theorem. The pseudo-dimension of the Elastic Net dual loss function is (Theta(p)), where (p) is the feature dimension.

(Theta(p)) notation here means we have an upper bound of (O(p)) as well as a lower bound (Omega(p)) on the pseudo-dimension. Roughly speaking, the pseudo-dimension captures the complexity of the function class from a learning perspective, and corresponds to the number of samples needed to guarantee small generalization error (average error on test data). Remarkably, we show an asymptotically tight bound on the pseudo-dimension by establishing an (Omega(p)) lower bound, which is technically challenging and needs an explicit construction of a collection of “hard” instances. Tight lower bounds are not known for several typical problems in data-driven algorithm design. Our bound depends only on (p) (the number of features) and is independent of the number of datapoints (m). An immediate consequence of our bound is the following sample complexity guarantee:

Theorem. Given any distribution (D) (fixed, but unknown), we can learn regularization parameters (hat{lambda}) which obtain error within any (epsilon>0) of the best possible parameter with probability (1-delta) using only (O(1/epsilon^2(p+log 1/delta))) problem samples.

One way to understand our results is to instantiate them in the cross-validation setting. Consider the commonly used techniques of leave-one-out cross-validation (LOOCV) and Monte Carlo cross-validation (repeated random training-validation splits, typically independent and in a fixed proportion). Given a dataset of size (m_{text{tr}}), LOOCV would require (m_{text{tr}}) regression fits, which can be computationally expensive for large datasets. Alternately, we can consider draws from a distribution (D_{text{LOO}}) which generates problem instances (P) from a fixed dataset ((X, y) in R^{(m+1)times p} times R^{m+1}) by uniformly selecting (j in [m + 1]) and setting (P = (X_{−j∗}, y_{−j}, X_{j∗}, y_j)). Our result now implies that roughly (O(p/epsilon^2)) iterations are enough to determine an Elastic Net parameter (hat{lambda}) with loss within (epsilon) (with high probability) of the parameter (lambda^*) obtained from running the full LOOCV. Similarly, we can define a distribution (D_{text{MC}}) to capture the Monte Carlo cross-validation procedure and determine the number of iterations sufficient to get an (epsilon)-approximation of the loss corresponding to parameter selection with an arbitrarily large number of runs. Thus, in a very precise sense, our results answer the question of how much cross-validation is enough to effectively implement the above techniques.
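
As an illustration, the following sketch samples problem instances from such a distribution (D_{text{LOO}}): each draw holds out one uniformly chosen row as the validation set, so only the sampled splits, rather than all of them, need to be fit. The learn_lambda helper and candidate grid are the hypothetical ones from the earlier sketch.

import numpy as np

def sample_loo_problems(X, y, num_samples, seed=0):
    """Draw problem instances from D_LOO: hold out one uniformly chosen example each time."""
    rng = np.random.default_rng(seed)
    problems = []
    for j in rng.integers(0, X.shape[0], size=num_samples):
        mask = np.ones(X.shape[0], dtype=bool)
        mask[j] = False
        problems.append((X[mask], y[mask], X[~mask], y[~mask]))
    return problems

# Roughly O(p / epsilon^2) sampled splits stand in for the full set of leave-one-out fits:
# problems = sample_loo_problems(X, y, num_samples=500)
# best_alpha, best_l1_ratio = learn_lambda(problems, grid)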

Sequential data and online learning

A more challenging variant of the problem assumes that the problem instances arrive sequentially, and we need to set the parameter for each instance using only the previously seen instances. We can think of this as a game between an online adversary and the learner, where the adversary wants to make the sequence of problems as hard as possible. Note that we no longer assume that the problem instances are drawn from a fixed distribution, and this setting allows problem instances to depend on previously seen instances which is typically more realistic (even if there is no actual adversary generating worst-case problem sequences). The learner’s goal is to perform as well as the best fixed parameter in hindsight, and the difference is called the “regret” of the learner.

To obtain positive results, we make a mild assumption on the smoothness of the data: we assume that the prediction values (y) are drawn from a distribution with bounded density. This captures a common data pre-processing step of adding a small amount of uniform noise to the data for model stability, e.g., by setting the jitter parameter in the popular Python library scikit-learn. Under this assumption, we show further structure on the dual loss function. Roughly speaking, we show that the locations of the piece boundaries of the dual function across the problem instances do not concentrate in a small region of the parameter space. This in turn implies (using Balcan et al., 2018) the existence of an online learner with average expected regret (O(1/sqrt{T})), meaning that we converge to the performance of the best fixed parameter in hindsight as the number of online rounds (T) increases.

Extension to binary classification, including logistic regression

Linear classifiers are also popular for the task of binary classification, where the (y) values are now restricted to (0) or (1). Regularization is also crucial here to learn effective models by avoiding overfitting and selecting important variables. It is particularly common to use logistic regression, where the squared loss above is replaced by the logistic loss function,

$$l_{\text{RLR}}(\beta,(X,y))=\frac{1}{m}\sum_{i=1}^m\log(1+\exp(-y_ix_i^T\beta)).$$
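
For concreteness, here is a minimal NumPy sketch of this loss and of its regularized version with the Elastic Net penalty; it assumes the labels have been mapped to plus/minus 1, which is the convention the formula above uses.

import numpy as np

def logistic_loss(beta, X, y):
    """l_RLR above: mean of log(1 + exp(-y_i x_i^T beta)), with labels y in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * (X @ beta)))  # numerically stable form

def regularized_logistic_loss(beta, X, y, lam1, lam2):
    """Logistic loss plus the Elastic Net penalty whose coefficients we tune."""
    return logistic_loss(beta, X, y) + lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta**2)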

The exact loss minimization problem is significantly more challenging in this case, and it is correspondingly difficult to analyze the dual loss function. We overcome this challenge by using a proxy dual function which approximates the true loss function, but has a simpler piecewise structure. Roughly speaking, the proxy function considers a fine parameter grid of width (epsilon) and approximates the loss function at each point on the grid. Furthermore, it is piecewise linear and known to approximate the true loss function to within an error of (O(epsilon^2)) at all points (Rosset, 2004). 

Our main result for logistic regression is that the generalization error with (N) samples drawn from the distribution (D) is bounded by (O(sqrt{(m^2+log 1/epsilon)/N}+epsilon^2+sqrt{(log 1/delta)/N})), with probability (1-delta) over the draw of samples. Here (m) is the size of the validation set, which is often small or even constant. While this bound is incomparable to the pseudo-dimension-based bounds above, we do not have lower bounds in this setting, and the tightness of our results is an interesting open question.

Beyond the linear case: kernel regression

Beyond linear regression. Source: xkcd.

So far, we have assumed that the dependent variable (y) has a linear dependence on the predictor variables. While this is a great first thing to try in many applications, very often there is a non-linear relationship between the variables, and linear regression can give poor performance as a result. A common alternative is to use Kernelized Least Squares Regression, where the input (X) is implicitly mapped to a high (or even infinite) dimensional feature space using the “kernel trick”. As a corollary of our main results, we can show that the pseudo-dimension of the dual loss function in this case is (O(m)), where (m) is the size of the training set in a single problem instance. Our results do not make any assumptions on the (m) samples within a problem instance/dataset; if the samples within problem instances are further assumed to be i.i.d. draws from some data distribution (distinct from the problem distribution (D)), then well-known results imply that (m = O(k log p)) samples are sufficient to learn the optimal LASSO coefficient ((k) denotes the number of non-zero coefficients in the optimal regression fit).
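
As an illustration of the kernelized setting, here is a minimal sketch of regularized kernel least squares with an RBF kernel, using the standard closed-form solution; the regularization coefficient lam is the quantity being tuned. This is a generic textbook implementation, not the method from the papers.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, lam, gamma=1.0):
    """Solve (K + lam * I) c = y for the dual coefficients c."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, X_new, coef, gamma=1.0):
    """Predict on new points as a kernel-weighted combination of the training targets."""
    return rbf_kernel(X_new, X_train, gamma) @ coef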

Some final remarks

We consider how to tune the norm-based regularization parameters in linear regression. We pin down the learning-theoretic complexity of the loss function, which may be of independent interest. Our results extend to online learning, linear classification, and kernel regression. A key direction for future research is developing an efficient implementation of the algorithms underlying our approach.

More broadly, regularization is a fundamental technique in machine learning, including deep learning where it can take the form of dropout rates, or parameters in the loss function, with significant impact on the performance of the overall algorithm. Our research opens up the exciting question of tuning learnable parameters even in continuous optimization problems. Finally, our research captures an increasingly typical scenario with the advent of the data era, where one has access to repeated instances of data from the same application domain.

For further details about our cool results and the mathematical machinery we used to derive them, check out our papers linked below!

[1] Balcan, M.-F., Khodak, M., Sharma, D., & Talwalkar, A. (2022). Provably tuning the ElasticNet across instances. Advances in Neural Information Processing Systems, 35.

[2] Balcan, M.-F., Nguyen, A., & Sharma, D. (2023). New Bounds for Hyperparameter Tuning of Regression Problems Across Instances. Advances in Neural Information Processing Systems, 36.

Read More

Cost-effective document classification using the Amazon Titan Multimodal Embeddings Model


Organizations across industries want to categorize and extract insights from high volumes of documents of different formats. Manually processing these documents to classify and extract information remains expensive, error prone, and difficult to scale. Advances in generative artificial intelligence (AI) have given rise to intelligent document processing (IDP) solutions that can automate document classification and create a cost-effective classification layer capable of handling diverse, unstructured enterprise documents.

Categorizing documents is an important first step in IDP systems. It helps you determine the next set of actions to take depending on the type of document. For example, during the claims adjudication process, the accounts payable team receives the invoice, whereas the claims department manages the contract or policy documents. Traditional rule engines or ML-based classification can classify the documents, but often hit a limit on the types of document formats they support and on the dynamic addition of new classes of documents. For more information, see Amazon Comprehend document classifier adds layout support for higher accuracy.

In this post, we discuss document classification using the Amazon Titan Multimodal Embeddings model to classify any document type without the need for training.

Amazon Titan Multimodal Embeddings

Amazon recently introduced Titan Multimodal Embeddings in Amazon Bedrock. This model can create embeddings for images and text, enabling the creation of document embeddings to be used in new document classification workflows.

It generates optimized vector representations of documents scanned as images. By encoding both visual and textual components into unified numerical vectors that encapsulate semantic meaning, it enables rapid indexing, powerful contextual search, and accurate classification of documents.

As new document templates and types emerge in business workflows, you can simply invoke the Amazon Bedrock API to dynamically vectorize them and append them to your IDP system to rapidly enhance its document classification capabilities.

Solution overview

Let’s examine the following document classification solution with the Amazon Titan Multimodal Embeddings model. For optimal performance, you should customize the solution to your specific use case and existing IDP pipeline setup.

This solution classifies documents using vector embedding semantic search by matching an input document to an already indexed gallery of documents. We use the following key components:

  • Embeddings – Embeddings are numerical representations of real-world objects that machine learning (ML) and AI systems use to understand complex knowledge domains like humans do.
  • Vector databases – Vector databases are used to store embeddings. Vector databases efficiently index and organize the embeddings, enabling fast retrieval of similar vectors based on distance metrics like Euclidean distance or cosine similarity.
  • Semantic search – Semantic search works by considering the context and meaning of the input query and its relevance to the content being searched. Vector embeddings are an effective way to capture and retain the contextual meaning of text and images. In our solution, when an application wants to perform a semantic search, the search document is first converted into an embedding. The vector database with relevant content is then queried to find the most similar embeddings.

In the labeling process, a sample set of business documents like invoices, bank statements, or prescriptions is converted into embeddings using the Amazon Titan Multimodal Embeddings model and stored in a vector database against predefined labels. The Amazon Titan Multimodal Embeddings model was trained using the Euclidean L2 algorithm; therefore, for best results, the vector database used should support this algorithm.

The following architecture diagram illustrates how you can use the Amazon Titan Multimodal Embeddings model with documents in an Amazon Simple Storage Service (Amazon S3) bucket for image gallery creation.

The workflow consists of the following steps:

  1. A user or application uploads a sample document image with classification metadata to a document image gallery. An S3 prefix or S3 object metadata can be used to classify gallery images.
  2. An Amazon S3 object notification event invokes the embedding AWS Lambda function.
  3. The Lambda function reads the document image and translates the image into embeddings by calling Amazon Bedrock and using the Amazon Titan Multimodal Embeddings model.
  4. Image embeddings, along with document classification, are stored in the vector database.


When a new document needs classification, the same embedding model is used to convert the query document into an embedding. Then, a semantic similarity search is performed on the vector database using the query embedding. The label retrieved against the top embedding match will be the classification label for the query document.

The following architecture diagram illustrates how to use the Amazon Titan Multimodal Embeddings model with documents in an S3 bucket for image classification.

The workflow consists of the following steps:

  1. Documents that require classification are uploaded to an input S3 bucket.
  2. The classification Lambda function receives the Amazon S3 object notification.
  3. The Lambda function translates the image to an embedding by calling the Amazon Bedrock API.
  4. The vector database is searched for a matching document using semantic search. Classification of the matching document is used to classify the input document.
  5. The input document is moved to the target S3 directory or prefix using the classification retrieved from the vector database search.


To help you test the solution with your own documents, we have created an example Python Jupyter notebook, which is available on GitHub.

Prerequisites

To run the notebook, you need an AWS account with appropriate AWS Identity and Access Management (IAM) permissions to call Amazon Bedrock. Additionally, on the Model access page of the Amazon Bedrock console, make sure that access is granted for the Amazon Titan Multimodal Embeddings model.

Implementation

In the following steps, replace each user input placeholder with your own information:

  1. Create the vector database. In this solution, we use an in-memory FAISS database, but you could use an alternative vector database. Amazon Titan’s default dimension size is 1024.
import faiss

index = faiss.IndexFlatL2(1024)  # flat L2 index sized to Titan's 1024-dimensional embeddings
indexIDMap = faiss.IndexIDMap(index)  # wrapper that lets us store our own document-class IDs
  2. After the vector database is created, enumerate over the sample documents, creating an embedding of each and storing it in the vector database. The getDocumentsandIndex helper called in the next step handles this for a folder of gallery images.
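Here is a minimal sketch of what such a helper might look like, assuming a hypothetical embed_image function that wraps the Bedrock call shown in step 4 and returns the embedding vector as a list of floats:
import base64
import os

import numpy as np

def getDocumentsandIndex(folder, classID):
    """Embed every image in `folder` and store it in the FAISS index under `classID`."""
    for filename in os.listdir(folder):
        with open(os.path.join(folder, filename), "rb") as f:
            inputImageB64 = base64.b64encode(f.read()).decode("utf-8")
        vector = np.array(embed_image(inputImageB64), dtype=np.float32).reshape(1, -1)
        indexIDMap.add_with_ids(vector, np.array([classID], dtype=np.int64))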
  3. Test with your documents. Replace the folders in the following code with your own folders that contain known document types:
DOC_CLASSES: list[str] = ["Closing Disclosure", "Invoices", "Social Security Card", "W4", "Bank Statement"]

getDocumentsandIndex("sampleGallery/ClosingDisclosure", DOC_CLASSES.index("Closing Disclosure"))
getDocumentsandIndex("sampleGallery/Invoices", DOC_CLASSES.index("Invoices"))
getDocumentsandIndex("sampleGallery/SSCards", DOC_CLASSES.index("Social Security Card"))
getDocumentsandIndex("sampleGallery/W4", DOC_CLASSES.index("W4"))
getDocumentsandIndex("sampleGallery/BankStatements", DOC_CLASSES.index("Bank Statement"))
  4. Using the Boto3 library, call Amazon Bedrock. The variable inputImageB64 is a base64-encoded byte array representing your document. The response from Amazon Bedrock contains the embeddings.
import json

import boto3

# Create a Bedrock runtime client; replace 'Region' with your AWS Region.
bedrock = boto3.client(
    service_name='bedrock-runtime',
    region_name='Region'
)

request_body = {}
request_body["inputText"] = None  # not using any text
request_body["inputImage"] = inputImageB64  # base64-encoded document image

body = json.dumps(request_body)
response = bedrock.invoke_model(
    body=body,
    modelId="amazon.titan-embed-image-v1",
    accept="application/json",
    contentType="application/json")
response_body = json.loads(response.get("body").read())
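The embedding vector itself can then be pulled out of the response body and shaped for FAISS; a minimal sketch, assuming the vector is returned under an embedding key in the JSON response:
import numpy as np

# Extract the 1024-dimensional vector and shape it as the (1, 1024) float32 array FAISS expects.
embeddings = np.array(response_body.get("embedding"), dtype=np.float32).reshape(1, -1)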
  5. Add the embeddings to the vector database, with a class ID that represents a known document type:
indexIDMap.add_with_ids(embeddings, classID)
  6. With the vector database populated with images (representing our gallery), you can uncover similarities with new documents. For example, the following is the syntax used for search. The k=1 tells FAISS to return the top 1 match.
indexIDMap.search(embeddings, k=1)

In addition, the Euclidean L2 distance between the image on hand and the found image is also returned. If the image is an exact match, this value would be 0. The larger this value is, the less similar the two images are.
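
Putting it together, here is a minimal sketch of turning the search output into a class label, reusing the DOC_CLASSES list and the query embeddings from the earlier steps:

# search returns (distances, ids); each id is the class ID stored at indexing time.
distances, ids = indexIDMap.search(embeddings, k=1)
predicted_label = DOC_CLASSES[ids[0][0]]
print(f"Predicted class: {predicted_label} (L2 distance: {distances[0][0]:.4f})")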

Additional considerations

In this section, we discuss additional considerations for using the solution effectively. This includes data privacy, security, integration with existing systems, and cost estimates.

Data privacy and security

The AWS shared responsibility model applies to data protection in Amazon Bedrock. As described in this model, AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud. Customers are responsible for maintaining control over their content that is hosted on this infrastructure. As a customer, you are responsible for the security configuration and management tasks for the AWS services that you use.

Data protection in Amazon Bedrock

Amazon Bedrock doesn’t use customer prompts and continuations to train AWS models or share them with third parties. Amazon Bedrock doesn’t store or log customer data in its service logs. Model providers don’t have access to Amazon Bedrock logs or access to customer prompts and continuations. As a result, the images used for generating embeddings through the Amazon Titan Multimodal Embeddings model are not stored or employed in training AWS models or external distribution. Additionally, other usage data, such as timestamps and logged account IDs, is excluded from model training.

Integration with existing systems

The Amazon Titan Multimodal Embeddings model underwent training with the Euclidean L2 algorithm, so the vector database being used should be compatible with this algorithm.

Cost estimate

At the time of writing this post, as per Amazon Bedrock Pricing for the Amazon Titan Multimodal Embeddings model, the following are the estimated costs using on-demand pricing for this solution:

  • One-time indexing cost – $0.06 for a single run of indexing, assuming a gallery of 1,000 images
  • Classification cost – $6 for 100,000 input images per month

Clean up

To avoid incurring future charges, delete the resources you created, such as the Amazon SageMaker notebook instance, when not in use.

Conclusion

In this post, we explored how you can use the Amazon Titan Multimodal Embeddings model to build an inexpensive solution for document classification in the IDP workflow. We demonstrated how to create an image gallery of known documents and perform similarity searches with new documents to classify them. We also discussed the benefits of using multimodal image embeddings for document classification, including their ability to handle diverse document types, scalability, and low latency.

As new document templates and types emerge in business workflows, developers can invoke the Amazon Bedrock API to vectorize them dynamically and append to their IDP systems to rapidly enhance document classification capabilities. This creates an inexpensive, infinitely scalable classification layer that can handle even the most diverse, unstructured enterprise documents.

Overall, this post provides a roadmap for building an inexpensive solution for document classification in the IDP workflow using Amazon Titan Multimodal Embeddings.

As next steps, check out What is Amazon Bedrock to start using the service. And follow Amazon Bedrock on the AWS Machine Learning Blog to keep up to date with new capabilities and use cases for Amazon Bedrock.


About the Authors

Sumit Bhati is a Senior Customer Solutions Manager at AWS, specializes in expediting the cloud journey for enterprise customers. Sumit is dedicated to assisting customers through every phase of their cloud adoption, from accelerating migrations to modernizing workloads and facilitating the integration of innovative practices.

David Girling is a Senior AI/ML Solutions Architect with over 20 years of experience in designing, leading, and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate, and utilize these highly capable services with their data for their use cases.

Ravi Avula is a Senior Solutions Architect in AWS focusing on Enterprise Architecture. Ravi has 20 years of experience in software engineering and has held several leadership roles in software engineering and software architecture working in the payments industry.

George Belsian is a Senior Cloud Application Architect at AWS. He is passionate about helping customers accelerate their modernization and cloud adoption journey. In his current role, George works alongside customer teams to strategize, architect, and develop innovative, scalable solutions.

Read More

AWS at NVIDIA GTC 2024: Accelerate innovation with generative AI on AWS


AWS was delighted to present to and connect with over 18,000 in-person and 267,000 virtual attendees at NVIDIA GTC, a global artificial intelligence (AI) conference that took place in March 2024 in San Jose, California, returning to a hybrid, in-person experience for the first time since 2019.

AWS has had a long-standing collaboration with NVIDIA for over 13 years. AWS was the first Cloud Service Provider (CSP) to offer NVIDIA GPUs in the public cloud, and remains among the first to deploy NVIDIA’s latest technologies.

Looking back at AWS re:Invent 2023, Jensen Huang, founder and CEO of NVIDIA, chatted with AWS CEO Adam Selipsky on stage, discussing how NVIDIA and AWS are working together to enable millions of developers to access powerful technologies needed to rapidly innovate with generative AI. NVIDIA is known for its cutting-edge accelerators and full-stack solutions that contribute to advancements in AI. The company is combining this expertise with the highly scalable, reliable, and secure AWS Cloud infrastructure to help customers run advanced graphics, machine learning, and generative AI workloads at an accelerated pace.

The collaboration between AWS and NVIDIA further expanded at GTC 2024, with the CEOs from both companies sharing their perspectives on the collaboration and state of AI in a press release:

“The deep collaboration between our two organizations goes back more than 13 years, when together we launched the world’s first GPU cloud instance on AWS, and today we offer the widest range of NVIDIA GPU solutions for customers,” says Adam Selipsky, CEO of AWS. “NVIDIA’s next-generation Grace Blackwell processor marks a significant step forward in generative AI and GPU computing. When combined with AWS’s powerful Elastic Fabric Adapter networking, Amazon EC2 UltraClusters’ hyper-scale clustering, and our unique AWS Nitro System’s advanced virtualization and security capabilities, we make it possible for customers to build and run multi-trillion parameter large language models faster, at massive scale, and more securely than anywhere else. Together, we continue to innovate to make AWS the best place to run NVIDIA GPUs in the cloud.”

“AI is driving breakthroughs at an unprecedented pace, leading to new applications, business models, and innovation across industries,” says Jensen Huang, founder and CEO of NVIDIA. “Our collaboration with AWS is accelerating new generative AI capabilities and providing customers with unprecedented computing power to push the boundaries of what’s possible.”

Joint announcements and keynote

On the first day of the NVIDIA GTC, AWS and NVIDIA made a joint announcement focused on their strategic collaboration to advance generative AI. Huang included the AWS and NVIDIA collaboration on a slide during his keynote, highlighting the following announcements. The GTC keynote had over 21 million views within the first 72 hours.

Media coverage

By March 22, AWS’s announcement with NVIDIA had generated 104 articles mentioning AWS and Amazon. The vast majority of coverage mentioned AWS’s plans to offer Blackwell-based instances. Adam Selipsky appeared on CNBC’s Mad Money to discuss the long-standing collaboration between AWS and NVIDIA, among the many other ways AWS is innovating in generative AI, stating that AWS has been the first to bring many of its GPUs to the cloud to drive efficiency and scalability for customers.

Project Ceiba has also been a focus in media coverage. Forbes referred to Project Ceiba as the “most exciting” project by AWS and NVIDIA, stating that it “should accelerate the pace of innovation in AI, making it possible to tackle more complex problems, develop more sophisticated models, and achieve previously unattainable breakthroughs.” The Next Platform ran an in-depth piece on Ceiba, stating that “the size and the aggregate compute of Ceiba cluster are both being radically expanded, which will give AWS a very large supercomputer in one of its data centers” and NVIDIA will use it to do AI research, among other things.

Live from GTC

“Live from GTC” was an on-site studio at GTC for invited speakers to have a fireside chat with tech influencers like VentureBeat. Chetan Kapoor, Director of Product Management for Amazon EC2 at AWS, was interviewed by VentureBeat at the Live from GTC studio, where he discussed AWS’s presence and highlighted key announcements at GTC.

The AWS booth and sessions

The AWS booth showcased generative AI services, like the LLMs with Anthropic and Cohere on Amazon Bedrock, PartyRock, Amazon Q, Amazon SageMaker JumpStart, and more. Highlights included:

AWS presence with partners and customers

During GTC, AWS invited 23 partner and customer solution demos to join its booth with either a dedicated demo kiosk or a 30-minute in-booth session. Such partners and customers included Ansys, Anthropic, Articul8, Bria.ai, Cohere, Deci, Deepbrain.AI, Denali Advanced Integration, Ganit, Hugging Face, Lilt, Linker Vision, Mavenir, MCE, Media.Monks, Modular, NVIDIA, Perplexity, Quantiphi, Run.ai, Salesforce, Second Spectrum, and Slalom.

Among them, high-potential early-stage startups in generative AI across the globe were showcased with a dedicated kiosk at the AWS booth. The AWS Startups team works closely with these companies by investing and supporting their growth, offering resources through programs like AWS Activate.

AWS Generative AI Competency

NVIDIA was one of the 45 launch partners for the new AWS Generative AI Competency program. The Generative AI Center of Excellence for AWS Partners team members were on site at the AWS booth, presenting this program for both existing and potential AWS partners. The program offers valuable resources along with best practices for all AWS partners to build, market, and sell generative AI solutions jointly with AWS.

Additional resources

Watch a video recap of the AWS presence at NVIDIA GTC 2024. For additional resources about the AWS and NVIDIA collaboration, refer to the AWS at NVIDIA GTC 2024 resource hub.


About the Author

Julie Tang is the Senior Global Partner Marketing Manager for Generative AI at Amazon Web Services (AWS), where she collaborates closely with NVIDIA to plan and execute partner marketing initiatives focused on generative AI. Throughout her tenure at AWS, she has held various partner marketing roles, including Global IoT Solutions, AWS Partner Solution Factory, and Sr. Campaign Manager in Americas Field Marketing. Prior to AWS, Julie served as the Marketing Director at Segway. She holds a Master’s degree in Communications Management with a focus on marketing and entertainment management from the University of Southern California, and dual Bachelor’s degrees in Law and Broadcast Journalism from Fudan University.

Read More

Bethesda’s ‘Fallout’ Titles Join GeForce NOW


Welcome to the wasteland, Vault Dwellers. Bethesda’s Fallout 4 and Fallout 76 are bringing post-nuclear adventures to the cloud.

These highly acclaimed action role-playing games lead 10 new titles joining GeForce NOW this week.

Announced as coming to GeForce NOW at CES, Honkai: Star Rail is targeting a release this quarter. Stay tuned for future updates.

Vault Into the Cloud

Adventurers needed, whether for mapping the irradiated wasteland or shaping the fate of humanity.

Fallout 4 on GeForce NOW
Don’t let Dogmeat venture out alone.

Embark on a journey through ruins of the post-apocalyptic Commonwealth in Fallout 4. As the sole survivor of Vault 111, navigate a world destroyed by nuclear war, make choices to reshape the wasteland and rebuild society one settlement at a time. With a vast, open world, dynamic crafting systems and a gripping storyline, the game offers an immersive single-player experience that challenges dwellers to emerge as beacons of hope for humanity’s remnants.

Fallout 76 on GeForce NOW
Dust off your Pip-Boy and stream ‘Fallout 76’ from the cloud.

Plus, in Fallout 76, head back to the early days of post-nuclear Appalachia and experience the Fallout universe’s largest, most dynamic world. Encounter unique challenges, build portable player homes called C.A.M.P.s, and cooperate or compete with other survivors in the mountainous lands in West Virginia.

Join the proud ranks of Vault survivors in the cloud today and stream these titles, including Creation Club content for Fallout 4, across devices. With longer gaming sessions and faster access to servers, GeForce NOW members can play anywhere, anytime, and at up to 4K resolution, streaming with an Ultimate membership. The games come just in time for those tuning into the Fallout series TV adaptation, released today, for a Fallout-filled week.

Go Big or Go Home

Gigantic: Rampage Edition on GeForce NOW
Larger than life MOBA now streaming on GeForce NOW.

Gigantic: Rampage Edition promises big fun with epic 5v5 matches, crossplay support, an exciting roster of heroes and more. Rush to the cloud to jump into the latest game from Arc Games and team with four other players to control objectives and take down the opposing team’s mighty Guardian. Think fast, be bold and go gigantic!

Look forward to these new games this week:

  • Gigantic: Rampage Edition (New release on Steam, April 9)
  • Inkbound 1.0 (New release, on Steam, April 9)
  • Broken Roads (New release on Steam, April 10)
  • Infection Free Zone (New release on Steam, April 11)
  • Shadow of the Tomb Raider: Definitive Edition (New release on Xbox and available on PC Game Pass, April 11)
  • Backpack Battles (Steam)
  • Fallout 4 (Steam)
  • Fallout 76 (Steam and Xbox, available on PC Game Pass)
  • Ghostrunner (Epic Games Store, free April 11-18)
  • Terra Invicta (Xbox, available on PC Game Pass)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Ideas: Language technologies for everyone with Kalika Bali


Microsoft Research Podcast | Ideas | Kalika Bali

Behind every emerging technology is a great idea propelling it forward. In the new Microsoft Research Podcast series, Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. 

In this episode, host Gretchen Huizinga talks with Principal Researcher Kalika Bali. Inspired by an early vision of “talking computers” and a subsequent career in linguistics, Bali has spent the last two decades bringing the two together. Aided by recent advances in large language models and motivated by her belief that everyone should have access to AI in their own language, Bali and her teams are building language technology applications that they hope will bring the benefits of generative AI to under-resourced and underserved language communities around the world.

Transcript 

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

KALIKA BALI: I do think, in some sense, the pushback that I got for my idea makes me think it was outrageous. I didn’t think it was outrageous at all at that time! I thought it was a very reasonable idea! But there was a very solid pushback and not just from your colleagues. You know, for researchers, publishing papers is important! No one would publish a paper which focused only on, say, Indian languages or low-resource languages. We’ve come a very long way even in the research community on that, right. We kept pushing, pushing, pushing! And now there are tracks, there are workshops, there are conferences which are devoted to multilingual and low-resource languages. 

[TEASER ENDS]

GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Dr. Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward. 


[MUSIC FADES] 

I’m excited to be live in the booth today with Kalika Bali, a principal researcher at Microsoft Research India. Kalika is working on language technologies that she hopes will bring the benefits of generative AI to under-resourced and underserved language communities around the world. Kalika, it’s a pleasure to speak with you today. Welcome to Ideas.

KALIKA BALI: Thank you. Thank you, Gretchen. Thank you for having me. 

HUIZINGA: So before we dive in on the big ideas behind Kalika Bali’s research, let’s talk about you for a second. Tell us about your “origin story,” as it were, and if there is one, what “big idea” or animating “what if?” captured your imagination and inspired you to do what you’re doing today? 

BALI: So, you know, I’m a great reader. I started reading well before I was taught in school how to read, and I loved science fiction. I come from a family where reading was very much a part of our everyday lives. My dad was a journalist, and I had read a lot of science fiction growing up, and I also saw a lot of science fiction, you know, movies … Star Trek … everything that I could get hold of in India. And I remember watching 2001: Space Odyssey. And there was this HAL that spoke. He actually communicated that he was a computer. And I was just so struck by it. I was like, this is so cool! You know, here are computers that can talk! Now, how cool would that be if it would happen in real life? I was not at all aware of what was happening in speech technology, whether it was possible or not possible, but that’s something that really got me into it. I’ve always, like, kind of, been very curious about languages and how they work and, you know, how people use different things in languages to express not just meaning, not just communicating, but you know expressing themselves, really. And so I think it’s a combination of HAL and this curiosity I had about the various ways in which people use languages that got me into what I’m doing now. 

HUIZINGA: OK. So that’s an interesting path, and I want to go into that just a little bit, but let me anchor this: how old were you when you saw this talking computer? 

BALI: Oh, I was in my early teens. 

HUIZINGA: OK. And so at that time, did you have any conception that … ? 

BALI: No. You know, there weren’t computers around me when I was growing up. We saw, you know, some at school, you know, people coded in BASIC … 

HUIZINGA: Right? 

BALI: And we heard about them a lot, but I hadn’t seen one since I was in high school. 

HUIZINGA: OK. So there’s this inception moment, an aha moment, of that little spark and then you kind of drifted away from the computer side of it, and what … tell us about how you went from there to that! 

BALI: So that, that’s actually a very funny story because I actually wanted to study chemistry. I was really fascinated by how these, you know, molecular parts rotate around each other and, you know, we can’t even tell where an electron is, etc. It sounded, like, really fun and cool. So I actually studied chemistry, but then I was actually going to pick up the admission form for my sister, who wanted to study in this university, and … or, no, she wanted to take an exam for her master’s. And I went there. I picked up the form, and I said, this is a cool place. I would love to study here! And then I started looking at everything like, you know, what can I apply for here? And something called linguistics came up, and I had no idea what linguistics was. So I went to the British Library, got like a thin book on introduction to linguistics, and it sounded fun! And I took the exam. And then, as they say, that was history. Then I just got into it. 

HUIZINGA: OK. I mean, so much has happened in between then and now, and I think we’ll kind of get there in … but I do want you to connect the larger dot from how you got from linguistics to Microsoft Research [LAUGHTER] as a computer scientist.

BALI: So I actually started teaching at the University of South Pacific as a linguistics faculty in Fiji. And I was very interested in acoustics of speech sounds, etc., etc. That’s what I was teaching. And then there was a speech company in Belgium that was looking to start some work in Indian languages, and they contacted me, and at that time, you needed people who knew about languages to build language technology, especially people who knew about phonetics, acoustics, for speech technology. And that’s how I got into it. And then, you know, I just went from startups to companies and then Microsoft Research, 18 years ago, almost 18 years ago. 

HUIZINGA: Wow. OK. I would love to actually talk to you about all that time. But we don’t have time because I have a lot more things to talk to you about, technology-wise. But I do want to know, you know, how would you describe the ideas behind your overarching research philosophy, and who are your influences, as they say in the rock-and-roll world? [LAUGHTER] Who inspired you? Real-life person, scientist or not, besides, HAL 9000, who’s fictional, and any seminal papers that, sort of, got you interested in that along the way? 

BALI: So since I was really into speech, Ken Stevens—who was a professor, who sadly is no longer with us anymore, at MIT—was a big influence. He, kind of, had this whole idea of how speech is produced. And, you know, the first time I was exposed to the whole idea of the mathematics behind the speech, and I think he influenced me a lot on the speech side of things. For the language side of things, you know, my professor in India Professor Anvita Abbi—you know, she’s a Padma Shri, like, she’s been awarded by the Indian government for her work in, you know, very obscure, endangered languages—you know, she kind of gave me a feel for what languages are, and why they are important, and why it’s important to save them and not let them die away. 

HUIZINGA: Right.

BALI: So I think I would say both of them. But what really got me into wanting to work with Indian language technology in a big way was I was working in Belgium, I was working in London, and I saw the beginning of how technology is, kind of, you know, making things easier, exciting; there’s cool technology available for English, for French, for German … But in a country like India, it was more about giving access to people who have no access, right? It actually mattered, because here are people who may not be very literate and therefore not be able to use technology in the way we know it, but they can talk. 

HUIZINGA: Right. 

BALI: And they can speak, and they should be able to access technology by doing that. 

HUIZINGA: Right. OK. So just real quickly, that was then. What have you seen change in that time, and how profoundly have the ideas evolved? 

BALI: So just from pure methodology and what’s possible, you know, I have seen it all. When I started working in language technology, mainly for Indian languages, but even for other languages, it was all a rule-based system. So everybody had to create all these rules that then were, you know, responsible for building or like making that technology work. But then, just at that time, you know, all the statistical systems and methodologies came into being. So we had hidden Markov models, you know, doing their thing in speech, and it was all about a lot of data. But that data still had to be procured in a certain way, labeled, annotated. It was still a very long and resource-intensive process. Now, with generative AI, the thing that I am excited about is, we have a very powerful tool, right? 

HUIZINGA: Mm-hmm. 

BALI: And, yes, it requires a lot of data, but it can learn also; you know, we can fine-tune stuff on smaller datasets … 

HUIZINGA: Yeah … 

BALI: … to work for, you know, relevant things. So it’s not going to take me years and years and years to first procure the data, then have it tagged for part of speech … then, you know, have it tagged for sentiment, have it tagged for this, have it tagged for that, and then, only can I think of building anything. 

HUIZINGA: Right.

BALI: So it just shortens that timeline so much, and it’s very exciting. 

HUIZINGA: Right. As an ex-English teacher—which I don’t think there is such a thing as an ex-English teacher; you’re always silently correcting someone’s grammar! [LAUGHTER]—just what you said about tagging parts of speech as what they are, right? And that, I used to teach that. And then you start to think, how would you translate that for a machine? So fascinating. So, Kalika, you have said that your choice of career was accidental—and you’ve alluded to the, sort of, the fortuitous things that happened along the way—but that linguistics is one subject that goes from absolute science to absolute philosophy. Can you unpack that a little bit more and how this idea impacted your work in language technology? 

BALI: Yeah. So, so if you think about it, you know, language has a physical aspect, right. We move our various speech organs in a certain way. Our ears are constructed in a certain way. There is a physics of it where, when I speak, there are sound waves, right, which are going into your ear, and that’s being interpreted. So, you know, if you think about that, that’s like an absolute science behind it, right? But then, when you come to the structure of language, you know, the syntax, like you’re an English teacher, so you know this really well, that you know, there’s semantics; there’s, you know, morphology, how our words form, how our sentences form. And that’s like a very abstract kind of method that allows us to put, you know, meaningful sentences out there, right? 

HUIZINGA: Right … 

BALI: But then there’s this other part of how language works in society, right. The way I talk to my mother would be probably very different to the way I’m talking to you, would be very different from the way I talk to my friends, at a very basic level, right? The way, in India, I would greet someone older to me would be very different from the way I would greet somebody here, because here it’s like much less formal and that, you know, age hierarchy is probably less? If I did the same thing in India, I would be considered the rudest creature ever. [LAUGHS] So … and then, you know, you go into the whole philosophy—psycholinguistics part. What happens in our brains, you know, when we are speaking? Because language is controlled by various parts of our brain, right. And then, you go to the pure philosophy part, like why? How does language even occur? Why do we name things the way we name things? You know, why do we have a language of thought? You know, what language are we thinking in? [LAUGHTER] 

HUIZINGA: Right. 

BALI: So, so it really does cover the entire gamut of language … 

HUIZINGA: Yeah, yeah, yeah … 

BALI: … like from science to philosophy. 

HUIZINGA: Yeah, as I said before, when we were talking out there, my mother-in-law was from Holland, and every time she did math or adding, she would do it in Dutch, which—she’d be speaking in English and then she’d go over here and count in Dutch out loud. And it’s like, yeah, your brain switches back and forth. This is so exciting to me. I had no idea how much I would love this podcast! So, much of your research is centered on this big idea called “design thinking,” and it’s got a whole discipline in universities around the world. And you’ve talked about using something you call the 4D process for your work. Could you explain that process, and how it plays out in the research you do with the communities you serve?

BALI: Yeah, so we’ve kind of adapted this. My ex-colleague Monojit Choudhury and I, kind of, came up with this whole thing about 4D thinking, which is essentially discover, design, develop and deploy, right. And when we are working with, especially with, marginalized or low-resource-language communities, the very basic thing we have to do is discover, because we cannot go with, you know, our own ideas and perceptions about what is required. And I can give you a very good example of this, right. You know, most of us, as researchers and technologists, when we think of language technology, we are thinking about machine translation; we’re thinking about speech recognition; we are thinking about state-of-the-art technology. And here we were talking to a community that spoke the language Idu Mishmi, which is a very small community in northeast of India. And we were talking about, you know, we can do this, we can do that. And they just turned to us and said, what we really want is a mobile digital dictionary! [LAUGHS] 

HUIZINGA: Wow. Yeah … 

BALI: Right? And, you know, if you don’t talk, if you don’t observe, if you are not open to what the community’s needs might be, then you’ll miss that, right. You’ll miss the real thing that will make a difference to that community. So that’s the discover part. The design part, again, you have to design with the community. You cannot go and design a system that they are unable to use properly, right. And again, another very good example, one of the people I know, you know, he gave me this very good example of why you have to think, even at the architecture level when you’re designing such things, is like a lot of applications in India and around the world require your telephone number for verification. Now, for women, it might be a safety issue. They might not want to give their telephone number. Or in India, many women might not even have a telephone, like a mobile number, right. So how do you think of other ways in which they can verify, right? And so that’s the design part. The develop and the deploy part, kind of, go hand in hand, because I think it’s a very iterative process. You develop quickly, you put it out there, allow it to fail and, you know … 

HUIZINGA: Mm-hmm. Iterate … 

BALI: Iterate. So that’s like the, kind of, design thinking that we have. 

HUIZINGA: Yeah, I see that happening in accessibility technology areas, too, as well as language … 

BALI: Yeah, and, you know, working with the communities, very quickly, you become really humble.

HUIZINGA: Sure.

BALI: There’s a lot of humility in me now. Though I have progressed in my career and, you know, supposedly become wiser, I am much more humble about what I know and what I can do than I was when I started off, you know. 

HUIZINGA: I love that. Well, one thing I want to talk to you about that has intrigued me, there’s a thing that happens in India where you mix languages … 

BALI: Yes!

HUIZINGA: You speak both Hindi and English at the same time, and you think, oh, you speak English, but it’s like, no, there’s words I don’t understand in that. What do you call that, and how did that drive your interest? I mean, that was kind of an early-on kind of thing in your work, right? Talk about that. 

BALI: So that’s called code-mixing or code-switching. The only linguistic difference is code-mixing happens within a sentence, and code-switching means one sentence in one language and another. 

HUIZINGA: Oh, really? 

BALI: Yeah. So … but this is, like, not just India. This is a very, very common feature of multilingual societies all over the world. So it’s not multilingual individuals, but at the societal level, when you have multilingualism, then, you know, this is a marker of multilingualism. But code-mixing particularly means that you have to be fluent in both languages to actually code-mix, right. You have to have a certain amount of fluency in both languages. And there are various reasons why people do this. You know, it’s been studied by psychologists and linguists for a long time. And for most people like me, multilingual people, that’s the language we dream in, we think about. [LAUGHTER] That’s the language we talk to our siblings and friends in, right. And for us, it’s, like, just natural. We just keep … 

HUIZINGA: Mixing … 

BALI: … flipping between the two languages for a variety of reasons. We might do it for emphasis; we might do it for humor. We might just decide, OK, I’m going to pick this from this … the brain decides I’m going to pick this from this language … 

HUIZINGA: Sure. 

BALI: … and this … So the reason we got interested in, like, looking into code-mixing was that when we are saying that we want humans to be able to interact with machines in their most natural language, then by some estimates, half the world speaks like this! 

HUIZINGA: Right. 

BALI: So we have to be able to understand exactly how they speak and, you know, be able to process and understand their language, which is code-mixed … 

HUIZINGA: Sure. Well, it seems like the human brain can pick this up and process it fairly quickly and easily, especially if it knows many languages. For a machine, it would be much more difficult? 

BALI: It is. So initially, it was really difficult because, you know, the way we created systems was one language at a time … 

HUIZINGA: Right! 

BALI: … right. And it’s not about having an English engine and a Hindi engine available. It doesn’t work that way. 

HUIZINGA: No!

BALI: So you’d really need something that, you know, is able to tackle the languages together. And in some theories, this is almost considered a language of its own because it’s not like you’re randomly mixing. There is a structure to … 

HUIZINGA: Oh, is there? 

BALI: Yeah. Where you can, where you can’t … 

HUIZINGA: Gotcha. 

BALI: You know, so there is a structure or grammar, you can say, of code-mixing. So we went after that. We, kind of, created tools which could generate grammatically viable code-mixed sentences given parallel data, etc. 

HUIZINGA: That’s awesome. Amazing.

BALI: So, yeah, it takes effort to do it. But again, right now, because the generative AI models have at their disposal, you know, so many languages and at least, like, theoretically can work in many, many, many languages, you know, code-mixing might be an easier problem to solve right now. 

HUIZINGA: Right. OK. So we’re talking mostly about widely used languages, and you’re very concerned right now on this idea of low-resource languages. So unpack what you mean by low-resource, and what’s missing from the communities that speak those languages? 

BALI: Yeah. So when we say low-resource languages, we typically mean that languages do not have, say, digital resources, linguistic resources, language resources, that would enable technology building. It doesn’t mean that the communities themselves are impoverished in culture or linguistic richness, etc., right. But the reason why these communities do not have a lot of language resources, linguistic resources, digital resources, most of the time, it is because they are also marginalized in other ways … social and economic marginalization. 

HUIZINGA: Right. 

BALI: And these are … if you look at them, they’re not ti—I mean, of course, some of them are tiny, but when we say low-resource communities, we are talking about really big numbers. 

HUIZINGA: Oh, really? 

BALI: Yeah. So one of the languages that I have worked with—language communities that I’ve worked with—speak a language called Gondi, which is like a Dravidian language that is spoken in … like a South Indian language that is spoken in north, central-north area. It’s a tribal language, and it’s got around three million speakers.

HUIZINGA: Oh, wow! 

BALI: Yeah. That’s like more than Welsh, … 

HUIZINGA: Yeah! [LAUGHS] 

BALI: … right? But because socio-politically, they have been—or economically, they have been marginalized, they do not have the resources to build technologies. And, you know, when we say empower everyone and we only empower the top tier, I don’t think we fulfill our ambition to empower everyone. And like I said earlier, for these communities, all the technology that we have, digital tools that we have access to, they really matter for them. So, for example, you know, a lot of government schemes or the forest reserve laws are provided, say, in Hindi. If they are provided in Gondi, these people have a real idea of what they can do. 

HUIZINGA: Yeah. … Sure. 

BALI: Similarly, for education, you know, there are books and books and books in Hindi. There’s no book available for Gondi. So how is the next generation even going to learn the language? 

HUIZINGA: Right. 

BALI: And there are many, many languages which are low resource. In fact, you know, we did a study sometime in 2020, I think, we published this paper on linguistic diversity, and there we saw that, you know, we divided languages in five categories, and the top most which have all the resources to build every possible technology have only five languages, right. And more than half of the world’s languages are at the bottom. So it is a big problem. 

HUIZINGA: Yeah. Let’s talk about some of the specific technologies you’re working on. And I want to go from platform to project because you’ve got a big idea in a platform you call VeLLM. Talk about that. 

BALI: So VeLLM, which actually means jaggery—the sweet, sugary jaggery—in Tamil, one of the languages in India … 

HUIZINGA: Let me, let me interject that it’s not vellum like the paper, or what you’re talking about. It’s capital V, little e, and then LLM, which stands for large language model? 

BALI: So universal, the “V” comes from there. Empowerment, “e” comes from there. Through large language models … 

HUIZINGA: Got it. OK. But you shortened it to VeLLM. 

BALI: Yeah. 

HUIZINGA: OK.

BALI: So, so the thing with VeLLM is that a bunch of us got together just when this whole GPT was released, etc. We have a very strong group that works on technologies for empowerment in the India lab, Microsoft Research India. And we got together to see what it is that we can do now that we have access to such a strong and powerful tool. And we started thinking of the work that we’ve been doing, which is to, you know, build these technologies for specific areas and specific languages, specific demographies. So we, kind of, put all that knowledge and all that experience we had and thought of like, how can we scale that, really, across everything that we do? So VeLLM, at its base, you know, takes a GPT-like LLM, you know, as a horizontal across everything. On top of it, we have again, horizontals of machine learning, of multilingual tools and processes, which allow us to take the outputs from, say, GPT-like things and adapt it to different languages or, you know, some different kind of domain, etc. And then we have verticals on top of it, which allow people to build specific applications. 

HUIZINGA: Let me just go back and say GPT … I think most of our audience will know that that stands for generative pretrained transformer models. But just so we have that for anyone who doesn’t know, let’s anchor that. So VeLLM basically was an enabling platform … 

BALI: Yes. 

HUIZINGA: … on which to build specific technologies that would solve problems in a vertical application. 

BALI: Yes. Yes. And because it’s a platform, we’re also working on tools that are needed across domains … 

HUIZINGA: Oh, interesting. 

BALI: … as well as tools that are needed for specific domains. 

HUIZINGA: OK, so let’s talk about some of the specifics because we could get into the weeds on the tools that everybody needs, but I like the ideas that you’re working on and the specific needs that you’re meeting, the felt-need thing that gets an idea going. So talk about this project that you’ve worked on called Kahani. Could you explain what that is, and how it works? It’s really interesting to me. 

BALI: So Kahani, actually, is about storytelling, culturally appropriate storytelling, with spectacular images, as well as like textual story. 

HUIZINGA: So visual storytelling? 

BALI: Visual storytelling with the text. So this actually started when my colleague Sameer Segal, he was trying to use generative AI to create stories for his daughter, and he discovered that, you know, things are not very culturally appropriate! So I’ll give an example that, you know, if you want to take Frozen and take it to, like, the south Indian state of Kerala, you’ll have the beaches of Kerala, you’ll even have the coconut trees, but then you will have this blond princess in a princess gown … 

HUIZINGA: Sure …

BALI: … who’s there, right? So that’s where we started discussing this, and we, kind of, started talking about, how can we create visuals that are anchored on text of a story that’s culturally appropriate? So when we’re talking about, say, Little Red Riding Hood, if we ask the generative AI model, OK, that I want the story of Little Red Riding Hood but in an Indian context, it does a fantastic job. It actually gives you a very nice story, which, you know, just reflects the Red Riding Hood story into an Indian context. But the images don’t really … 

HUIZINGA: Match … [LAUGHTER] 

BALI: … Match at all. So that’s where the whole Kahani thing started. And we did a hackathon project on it. And then a lot of people got interested. It’s an ongoing project, so I won’t say that it’s out there yet, but we are very excited about it, but because think of it, we can actually create stories for children, you know, which is what we started with, but we can create so much more media, so much more culturally appropriate storytelling, which is not necessarily targeted at children. 

HUIZINGA: Yeah, yeah. 

BALI: So that’s what Kahani is about. 

HUIZINGA: OK. And I saw a demo of it that your colleague did for Research Forum here, and there was an image of a girl—it was beautiful—and then there was a mask of some kind or a … what was that? 

BALI: So the mask is called Nazar Battu, which is actually, you have these masks which are supposed to drive away the evil eye. So that’s what the mask was about. It’s a very Indian thing. You know, when you build a nice house, you put one on top of it so that the envious glances are, like, kept at bay. So, yeah, so that’s what it was. 

HUIZINGA: And was there some issue of the generative AI not really understanding what that was? 

BALI: No, it didn’t understand what it was. 

HUIZINGA: So then can you fix that and make it more culturally aware? 

BALI: So that’s what we are trying to do for the image thing. So we have another project on culture awareness where we are looking at understanding how much generative AI knows about other cultures. 

HUIZINGA: Interesting. 

BALI: So that’s a simultaneous project that’s happening. But in Kahani, a lot of it is, like, trying to get reference images, you know … 

HUIZINGA: Yeah. … Into the system? 

BALI: Into the system … 

HUIZINGA: Gotcha … 

BALI: … and trying to anchor on that. 

HUIZINGA: Mmmm. So—and we’re not going to talk about that project, I don’t think—but … how do you assess whether an AI knows? By just asking it? By prompting and seeing what happens? 

BALI: Yeah, yeah, yeah. So in another project, what we did was, we asked humans to play a game to get cultural artifacts from them. The problem with asking humans what cultural artifacts are important to them is we don’t think of like things as culture, right. [LAUGHS] This is food! 

HUIZINGA: It’s just who we are! 

BALI: This is my food. Like, you know, it’s not a culturally important artifact. This is how I greet my parents. It’s not like culturally … 

HUIZINGA: So it’s just like fish swimming in water. You don’t see the water. 

BALI: Exactly. So we gamified this thing, and we were able to get certain cultural artifacts, and we tried to get generative AI models to tell us about the same artifacts. And it didn’t do too well … [LAUGHS] 

HUIZINGA: But that’s why it’s research! 

BALI: Yes! 

HUIZINGA: You try, you iterate, you try again … cool. As I mentioned earlier, I was a high school English teacher and an English major. I’m not correcting your grammar because it’s fantastic.

BALI: Thank you. 

HUIZINGA: But as a former educator, one of the projects I felt was really compelling that you’re working on is called Shiksha. It’s a copilot in education. Tell our audience about this.

BALI: So this is actually our proof of concept for the VeLLM platform. Since almost all of us were interested in education, we decided to go for education as the first use case that we’re going to work on. And actually, it was a considered decision to go target teachers instead of students. I mean, you must have seen a lot of work being done on taking generative AI to students, right. But we feel that, you know, teachers are necessary to teach because they’re not just giving you information about the subject. They’re giving you skills to learn, which hopefully will stay with you for a lifetime, right. And if we enable teachers, they will enable so many hundreds of students. One teacher can enable thousands of students, right, over her career. So instead of, like, going and targeting students, if we make it possible for teachers to do their jobs more effectively or, like, you know, help them get over the problems they have, then we are actually creating an ecosystem where things will scale really fast, really quickly. And in India, you know, this is especially true because the government has actually come up with some digital resources for teachers to use, but there’s a lot more that can be done. So we interviewed about a hundred-plus teachers across different parts of the country. And this is the, you know, discover part. 

HUIZINGA: Yeah! 

BALI: And we found out that lesson plans are a big headache! [LAUGHS] 

HUIZINGA: Yes, they are! Can confirm! 

BALI: Yeah. And they spend a lot of time doing lesson plans because they’re required to create a lesson plan for every class they teach … 

HUIZINGA: Sure. With learning outcomes … 

BALI: Exactly. 

HUIZINGA: All of it. 

BALI: All of it. So that’s where we, you know, zeroed in on—how to make it easier for teachers to create lesson plans. And that’s what the Shiksha project is about. You know, there is an enrollment process where the teachers say what subject they’re teaching, what classes they’re teaching, what boards, because there are different boards of education … 

HUIZINGA: Right … 

BALI: … which have different syllabus. So all that. But after that, it takes less than seven minutes for a teacher to create an entire lesson plan for a particular topic. You know, class assignments, class activities, home assignments, homework—everything! Like the whole thing in seven minutes! And these teachers have the ability to go and correct it. Like, it’s an interactive thing. So, you know, they might say, I think this activity is too difficult for my students. 

HUIZINGA: Yeah … 

BALI: Can I have, like, an easier one? Or, can I change this to this? So it allows them to interactively personalize, modify the plan that’s put out. And I find that really exciting. And we’ve tested this with the Sikshana Foundation, which works with teachers in India. We’ve tested this with them. The teachers are very excited and now Sikshana wants to scale it to other schools. 

HUIZINGA: Right … well, my first question is, where were you when I was teaching, Kalika? 

BALI: There was no generative AI! 

HUIZINGA: No. In fact, we just discovered the fax machine when I started teaching. Oh, that dates me! You know, back to what you said about teachers being instrumental in the lives of their students. You know, we can remember our favorite teachers, our best teachers. We don’t remember a machine. 

BALI: No.

HUIZINGA: And what you’ve done with this is to embody the absolute sort of pinnacle of what AI can do, which is to be the collaborator, the assistant, the augmenter, and the helper so that the teacher can do that inspirational, connective-tissue job with the students without having to, like, sacrifice the rest of their life making lesson plans and grading papers. Oh, my gosh. OK. On the positive side, we’ve just talked about what this work proposes and how it’s good, but I always like to dig a little bit into the potential unintended consequences and what could possibly go wrong if, in fact, you got everything right. So I’ll anchor this in another example. When GPT models first came out, the first reaction came from educators. It feels like we’re in a bit of a paradigm shift like we were when the calculator and the internet even came out. [It’s] like, how do we process this? So I want to go philosophically here and talk about how you foresee us adopting and moving forward with generative AI in education, writ large. 

BALI: Yeah, I think this is a question that troubles a lot of us and not just in education, but in all spheres that generative AI is … 

HUIZINGA: Art … 

BALI: … art … 

HUIZINGA: … writing … 

BALI: … writing … 

HUIZINGA: … journalism … 

BALI: Absolutely. And I think the way I, kind of, think about it in my head is it’s a tool. At the end of it, it is a tool. It’s a very powerful tool, but it is a tool, and humans must always have the agency over it. And we need to come up, as a society, you know, we need to come up with the norms of using the tool. And if you think about it, you know, internet, taking internet as an example, there is a lot of harm that internet has propagated, right. The darknet and all the other stuff that happens, right. But on the whole, there are regulations, but there is also an actual consensus around what constitutes the positive use of internet, right. 

HUIZINGA: Sure, yeah. 

BALI: Nobody says that, for example, deepfakes are … 

HUIZINGA: Mm-hmm. Good … 

BALI: … good, right. So we have to come from there and think about what kind of regulations we need to have in place, what kind of consensus we need to have in place, what’s missing. 

HUIZINGA: Right. Another project that has been around, and it isn’t necessarily on top of VeLLM, but it’s called Karya, and you call it a social impact organization that serves not just one purpose, but three. Talk about that. 

BALI: Oh, Karya is my favorite! [LAUGHS] So Karya started as a research project within Microsoft Research India, and this was the brainchild again of my colleague—I have like some of the most amazing colleagues, too, that I work with!—called Vivek Seshadri. And Vivek wanted to create, you know, digital work for people who do not have access to such work. So he wanted to go to the rural communities, to people who belong to slightly lower socioeconomic demographies, and provide work, like microtasks kind of work, gig work, to them. And he was doing this, and then we started talking, and I said, you know, we need so much data for all these languages and all these different tasks, and that could be, like, a really cool thing to try on Karya, and that’s where it all started, my involvement with Karya, which is still pretty strong. And Karya then became such a stable project that Microsoft Research India spun it out. So it’s now its own standalone startup right now like a social enterprise, and they work on providing digital work. They work on providing skills, like upskilling. They work on awareness, like, you know, making people aware of certain social, financial, other such trainings. So what’s been most amazing is that Karya has been able to essentially collect data for AI in the most ethical way possible. They pay their workers a little over the minimal wage. They also have something called data ownership practice, where the data that is created by, say, me, I have some sort of ownership on it. So what that means is that every time Karya sells a dataset, a royalty comes back … 

HUIZINGA: No … ! 

BALI: Yeah! To the workers. 

HUIZINGA: OK, we need to scale this out! [LAUGHS] OK. So to give a concrete example, the three purposes would be educational, financial—on their end—and data collection, which would ultimately support a low-resource language by having digital assets.

BALI: Absolutely! 

HUIZINGA: So you could give somebody something to read in their language … 

BALI: Yeah. 

HUIZINGA: … that would educate them in the process. They would get paid to do it, and then you would have this data. 

BALI: Yes! 

HUIZINGA: OK. So cool. So simple. 

BALI: Like I said, it’s my favorite project. 

HUIZINGA: I get that. I totally get that. 

BALI: And they … they’ve been, you know, they have been winning awards and things all over for the work that they’re doing right now. And I am very involved in one project with them, which is to do with gender-intentional AI, or gender-intentional datasets for AI, for Indian languages. And that’s really crucial because, you know, we talk about gender bias in datasets, etc., but all that understanding comes from a very Western perspective and for languages like English, etc. They do not translate very well to Indian languages. 

HUIZINGA: Right. 

BALI: And in this particular project, we’re looking at first, how to define gender bias. How do we even get data around gender bias? What does it even mean to say that technology is gender intentional? 

HUIZINGA: Right. All right, well, let’s talk a little bit about what I like to call outrageous ideas. And these are the ones that, you know, on the research spectrum from sort of really practical applied research to blue sky get dismissed or viewed as unrealistic or unattainable. So years ago—here’s a little story about you—when you told your tech colleagues that you wanted to work with the world’s most marginalized languages, they told you you’d only marginalize yourself. 

BALI: Yes! 

HUIZINGA: But you didn’t say no. You didn’t say no. Um, two questions. Did you feel like your own idea was outrageous back then? And do you still have anything outrageous yet to accomplish in this plan? 

BALI: Oh, yeah! I hope so! Yeah. No, I do think, in some sense, the pushback that I got for my idea makes me think it was outrageous. I didn’t think it was outrageous at all at that time! [LAUGHS] I thought it was a very reasonable idea! But there was a very solid pushback and not just from your colleagues. You know, for researchers, publishing papers is important! No one would publish a paper which focused only on, say, Indian languages or low-resource languages. We’ve come a very long way even in the research community on that, right. We kept pushing, pushing, pushing! And now, there are tracks, there are workshops, there are conferences which are devoted to multilingual and low-resource languages. When I said I wanted to work on Hindi, and Hindi is the biggest language in India, right. And even for that, I was told, why don’t you work on German instead? And I’m like, there are lots of people working on German who will solve the problems with German! Nobody is looking at Hindi! I mean, people should work on all the languages. People should work on German, but I don’t want to work on German! So there was a lot of pushback back then, and I see a little bit of that with the very low-resource languages even now. And I think some people think it’s a “feel-good” thing, whereas I think it’s not. I think it’s a very economically viable, necessary thing to build technology for these communities, for these languages. No one thought Hindi was economically viable 15 years ago, for whatever reason … 

HUIZINGA: That … that floors me … 

BALI: Yeah, but, you know, we’re not talking about tens of thousands of people in some of these languages; we’re talking about millions. 

HUIZINGA: Yeah. 

BALI: I still think that is a job that I need to continue, you know, pushing back on. 

HUIZINGA: Do you think that any of that sort of outrageous reaction was due to the fact that the technology wasn’t as advanced as it is now and that it might have changed in terms of what we can do? 

BALI: There was definitely the aspect of technology there that it was just quite difficult and very, very resource-intensive to build it for languages which did not have resources. You know, there was a time when we were talking about how to go about doing this, and because people in various big tech companies, people did not really remember a time when, for English, they had to start data collection from scratch because everyone who was working on, say, English at that time was building on what people had done years and years ago. So they could not even conceptualize that you had to start from scratch for anything, right. But now with the technology as well, I’m quite optimistic and trying to think of how cool it would be to do, you know, smaller data collections and fine-tuned models specifically and things like that, so I think that the technology is definitely one big thing, but economics is a big factor, too. 

HUIZINGA: Mmm-hmm. Well, I’m glad that you said it isn’t just the feel good, but it actually would make economic sense because that’s some of the driver behind what technologies get “greenlit,” as it were. Is there anything outrageous now that you could think of that, even to you, sounds like, oh, we could never do that … 

BALI: Well … I didn’t think HAL was outrageous, so I’m not … [LAUGHS] 

HUIZINGA: Back to HAL 9000! [LAUGHS] 

BALI: Yeah, so I don’t think of things as outrageous or not. I just think of things as things that need to get done, if that makes any sense? 

HUIZINGA: Totally. Maybe it’s, how do we override “Open the pod bay door, HAL”—“No, I’m sorry, Dave. I can’t do that”? [LAUGHS] 

BALI: Yes. [LAUGHS] Yeah… 

HUIZINGA: Well, as we close—and I’m sad to close because you are so much fun—I want to do a little vision casting, but in reverse. So let’s fast-forward 20 years and look back. How have the big ideas behind your life’s work impacted the world, and how are people better off or different now because of you and the teams that you’ve worked with? 

BALI: So the way I see it is that people across the board, irrespective of the language they speak, the communities they belong to, the demographies they represent, can use technology to make their lives, their work, better. I know it sounds like really a very big and almost too good to be true, but that’s what I’m aiming for. 

HUIZINGA: Well, Kalika Bali, I’m so grateful I got to talk to you in person. And thanks for taking time out from your busy trip from India to sit down with me and our audience and share your amazing ideas. 

[MUSIC PLAYS] 

BALI: Thank you so much, Gretchen.

[MUSIC FADES] 

The post Ideas: Language technologies for everyone with Kalika Bali appeared first on Microsoft Research.

Build an active learning pipeline for automatic annotation of images with AWS services

This blog post is co-written with Caroline Chung from Veoneer.

Veoneer is a global automotive electronics company and a world leader in automotive electronic safety systems. They offer best-in-class restraint control systems and have delivered over 1 billion electronic control units and crash sensors to car manufacturers globally. The company continues to build on a 70-year history of automotive safety development, specializing in cutting-edge hardware and systems that prevent traffic incidents and mitigate accidents.

Automotive in-cabin sensing (ICS) is an emerging space that combines several types of sensors, such as cameras and radar, with artificial intelligence (AI) and machine learning (ML) based algorithms to enhance safety and improve the riding experience. Building such a system can be a complex task: developers have to manually annotate large volumes of images for training and testing purposes, which is very time consuming and resource intensive, and the turnaround time for such a task is several weeks. Furthermore, companies have to deal with issues such as inconsistent labels due to human error.

AWS is focused on helping you increase your development speed and lower your costs for building such systems through advanced analytics like ML. Our vision is to use ML for automated annotation, enabling retraining of safety models, and ensuring consistent and reliable performance metrics. In this post, we share how, by collaborating with Amazon’s Worldwide Specialist Organization and the Generative AI Innovation Center, we developed an active learning pipeline for in-cabin image head bounding box and key points annotation. The solution reduces cost by over 90%, accelerates the annotation turnaround time from weeks to hours, and enables reusability for similar ML data labeling tasks.

Solution overview

Active learning is an ML approach that involves an iterative process of selecting and annotating the most informative data to train a model. Given a small set of labeled data and a large set of unlabeled data, active learning improves model performance, reduces labeling effort, and integrates human expertise for robust results. In this post, we build an active learning pipeline for image annotations with AWS services.

The following diagram demonstrates the overall framework for our active learning pipeline. The labeling pipeline takes images from an Amazon Simple Storage Service (Amazon S3) bucket and outputs annotated images with the cooperation of ML models and human expertise. The training pipeline preprocesses data and uses them to train ML models. The initial model is set up and trained on a small set of manually labeled data, and will be used in the labeling pipeline. The labeling pipeline and training pipeline can be iterated gradually with more labeled data to enhance the model’s performance.

Auto labeling workflow

In the labeling pipeline, an Amazon S3 Event Notification is invoked when a new batch of images arrives in the Unlabeled Datastore S3 bucket, activating the pipeline. The model produces inference results on the new images. A customized judgement function selects parts of the data based on the inference confidence score or other user-defined functions. This data, together with its inference results, is sent to a human labeling job on Amazon SageMaker Ground Truth created by the pipeline. The human labeling process helps annotate the data, and the corrected results are combined with the remaining auto-annotated data, which can be used later by the training pipeline.
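
To make the judgement step concrete, the following is a minimal sketch of one way such a function could look, assuming each inference result carries a list of per-object confidence scores; the 0.8 threshold and the field names are illustrative placeholders rather than the exact implementation used in this pipeline.

# Minimal, illustrative judgement function: split inference results into
# auto-accepted images and images routed to human labeling. The result format
# and threshold are assumptions for this sketch.

from typing import Dict, List, Tuple

CONFIDENCE_THRESHOLD = 0.8  # placeholder; tune per task


def judge(results: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split inference results into auto-accepted and human-review sets."""
    auto_accepted, needs_human_review = [], []
    for result in results:
        # Use the weakest detection in the image as the image-level score.
        confidences = result.get("confidences", [])
        image_score = min(confidences) if confidences else 0.0
        if image_score >= CONFIDENCE_THRESHOLD:
            auto_accepted.append(result)
        else:
            needs_human_review.append(result)
    return auto_accepted, needs_human_review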

Model retraining happens in the training pipeline, where we use the dataset containing the human-labeled data to retrain the model. A manifest file is produced to describe where the files are stored, and the same initial model is retrained on the new data. After retraining, the new model replaces the initial model, and the next iteration of the active learning pipeline starts.

Model deployment

Both the labeling pipeline and training pipeline are deployed on AWS CodePipeline. AWS CodeBuild instances are used for implementation, which is flexible and fast for small amounts of data. When more speed is needed, we use Amazon SageMaker endpoints backed by GPU instances to allocate more resources to support and accelerate the process.

The model retraining pipeline can be invoked when there is a new dataset or when the model’s performance needs improvement. One critical task in the retraining pipeline is to have version control for both the training data and the model. Although AWS services such as Amazon Rekognition have an integrated version control feature, which makes the pipeline straightforward to implement, customized models require metadata logging or additional version control tools.
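
For customized models, one lightweight option is to write a small version-metadata record alongside the model artifacts in S3. The following boto3 sketch is only illustrative, assuming a simple bucket layout; the key prefix and metadata fields are placeholders.

# Illustrative version-metadata logging for a customized model; bucket layout,
# key prefix, and fields are assumptions for this sketch.

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def log_model_version(bucket: str, version: str, model_s3_uri: str, train_manifest_s3_uri: str) -> None:
    """Write a small metadata record next to the model artifacts in S3."""
    metadata = {
        "model_version": version,
        "model_artifact": model_s3_uri,
        "training_manifest": train_manifest_s3_uri,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    s3.put_object(
        Bucket=bucket,
        Key=f"model-registry/{version}/metadata.json",
        Body=json.dumps(metadata, indent=2),
    )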

The entire workflow is implemented using the AWS Cloud Development Kit (AWS CDK) to create necessary AWS components, including the following:

  • Two roles for CodePipeline and SageMaker jobs
  • Two CodePipeline jobs, which orchestrate the workflow
  • Two S3 buckets for the code artifacts of the pipelines
  • One S3 bucket for labeling the job manifest, datasets, and models
  • Preprocessing and postprocessing AWS Lambda functions for the SageMaker Ground Truth labeling jobs

The AWS CDK stacks are highly modularized and reusable across different tasks. The training, inference code, and SageMaker Ground Truth template can be replaced for any similar active learning scenarios.
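
As a rough illustration of how resources like these can be declared, here is a minimal AWS CDK (v2, Python) sketch covering a data bucket and a SageMaker job role; the construct names and policies are assumptions, and the CodePipeline and Lambda definitions are omitted for brevity.

# Minimal AWS CDK (v2, Python) sketch of two of the components listed above.
# Construct names and policies are placeholders, not the exact stack in this post.

from aws_cdk import RemovalPolicy, Stack
from aws_cdk import aws_iam as iam
from aws_cdk import aws_s3 as s3
from constructs import Construct


class ActiveLearningStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Bucket for labeling-job manifests, datasets, and model artifacts.
        data_bucket = s3.Bucket(
            self,
            "DataBucket",
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,
        )

        # Role assumed by SageMaker jobs in the labeling and training pipelines.
        sagemaker_role = iam.Role(
            self,
            "SageMakerJobRole",
            assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
        )
        data_bucket.grant_read_write(sagemaker_role)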

Model training

Model training includes two tasks: head bounding box annotation and human key points annotation. We introduce them both in this section.

Head bounding box annotation

Head bounding box annotation is a task to predict the location of a bounding box of the human head in an image. We use an Amazon Rekognition Custom Labels model for head bounding box annotations. The following sample notebook provides a step-by-step tutorial on how to train a Rekognition Custom Labels model via SageMaker.

We first need to prepare the data to start the training. We generate one manifest file for the training dataset and one for the test dataset. A manifest file contains multiple items, one per image. The following is an example of the manifest file, which includes the image path, size, and annotation information:

{
    "source-ref": "s3://mlsl-sandox/rekognition_images/train/IMS_00000_00_000_000_R2_1900_01_01_00000_compressed_front_tof_amp_000.jpeg",
    "bounding-box-attribute-name": {
        "image_size": [{
                "width": 640,
                "height": 480,
                "depth": 3
            }
        ],
        "annotations": [{
                "class_id": 1,
                "top": 189,
                "left": 209,
                "width": 97,
                "height": 121
            }
        ]
    },
    "bounding-box-attribute-name-metadata": {
        "objects": [{
                "confidence": 1
            }
        ],
        "class-map": {
            "1": "Head"
        },
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2023-04-07T20:04:42",
        "job-name": "testjob"
    }
}

Using the manifest files, we can load datasets into a Rekognition Custom Labels model for training and testing. We iterated the model with different amounts of training data and tested it on the same 239 unseen images. In this test, the mAP_50 score increased from 0.33 with 114 training images to 0.95 with 957 training images. The following screenshot shows the performance metrics of the final Rekognition Custom Labels model, which yields great performance in terms of F1 score, precision, and recall.
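
For reference, a training run like this can be started from the manifest files with boto3; the following sketch is illustrative only, and the project name, buckets, and keys are placeholders rather than the actual resources used in this post.

# Illustrative boto3 sketch: start a Rekognition Custom Labels training run from
# train/test manifests. Project, bucket, and key names are placeholders.

import boto3

rekognition = boto3.client("rekognition")

project_arn = rekognition.create_project(ProjectName="head-bbox-annotation")["ProjectArn"]

rekognition.create_project_version(
    ProjectArn=project_arn,
    VersionName="v1",
    OutputConfig={"S3Bucket": "my-output-bucket", "S3KeyPrefix": "rekognition/output/"},
    TrainingData={"Assets": [{"GroundTruthManifest": {
        "S3Object": {"Bucket": "my-data-bucket", "Name": "manifests/train.manifest"}}}]},
    TestingData={"Assets": [{"GroundTruthManifest": {
        "S3Object": {"Bucket": "my-data-bucket", "Name": "manifests/test.manifest"}}}]},
)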

We further tested the model on a withheld dataset that has 1,128 images. The model consistently predicts accurate bounding box predictions on the unseen data, yielding a high mAP_50 of 94.9%. The following example shows an auto-annotated image with a head bounding box.

Key points annotation

Key points annotation produces the locations of key points, including eyes, ears, nose, mouth, neck, shoulders, elbows, wrists, hips, and ankles. In addition to the location, the visibility of each point also needs to be predicted in this specific task, for which we designed a novel method.

For key points annotation, we use a YOLOv8 pose model on SageMaker as the initial model. We first prepare the data for training, including generating label files and a configuration .yaml file following YOLO’s requirements. After preparing the data, we train the model and save the artifacts, including the model weights file. With the trained model weights file, we can annotate new images.
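
As a minimal sketch, assuming the ultralytics package is used for the YOLOv8 pose model (the exact SageMaker training setup may differ), training and inference could look like the following; the dataset .yaml path and hyperparameters are placeholders.

# Illustrative YOLOv8 pose training and inference with the ultralytics package.
# The dataset .yaml, epochs, image size, and confidence threshold are placeholders.

from ultralytics import YOLO

# Start from a pretrained pose checkpoint and fine-tune on the annotated key points.
model = YOLO("yolov8n-pose.pt")
model.train(data="keypoints.yaml", epochs=100, imgsz=640)

# Use the trained weights to annotate new images.
results = model.predict("new_images/", conf=0.25)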

In the training stage, all labeled points with locations, both visible and occluded, are used for training. Therefore, this model by default provides the location and confidence of each prediction. In the following figure, a confidence threshold (the main threshold) near 0.6 is capable of dividing the points that are visible or occluded from the points outside of the camera’s view. However, occluded points and visible points are not separated by this confidence, which means the predicted confidence is not useful for predicting visibility.

To predict visibility, we introduce an additional model trained on a dataset containing only visible points, excluding both occluded points and points outside of the camera’s view. The following figure shows the distribution of points with different visibility. Visible points and other points can be separated by the additional model, so we can use a threshold (the additional threshold) near 0.6 to identify the visible points. By combining these two models, we design a method to predict both the location and the visibility.

A key point is first predicted by the main model, which gives its location and a main confidence; we then get an additional confidence prediction from the additional model. Its visibility is then classified as follows (see the sketch after this list):

  • Visible, if its main confidence is greater than its main threshold, and its additional confidence is greater than the additional threshold
  • Occluded, if its main confidence is greater than its main threshold, and its additional confidence is less than or equal to the additional threshold
  • Outside of the camera’s view, otherwise
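
A minimal sketch of this two-model rule, assuming per-point confidences are already available; the 0.6 thresholds mirror the values discussed above, and the label strings are illustrative.

# Sketch of the two-model visibility rule described above. Thresholds and label
# strings are illustrative placeholders.

MAIN_THRESHOLD = 0.6
ADDITIONAL_THRESHOLD = 0.6


def classify_visibility(main_conf: float, additional_conf: float) -> str:
    """Combine the two model confidences into a visibility label for one key point."""
    if main_conf > MAIN_THRESHOLD and additional_conf > ADDITIONAL_THRESHOLD:
        return "visible"
    if main_conf > MAIN_THRESHOLD:
        return "occluded"
    return "outside_camera_view"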

An example of key points annotation is demonstrated in the following image, where solid marks are visible points and hollow marks are occluded points. Points outside of the camera’s view are not shown.

Based on the standard object keypoint similarity (OKS) definition on the MS-COCO dataset, our method achieves an mAP_50 of 98.4% on the unseen test dataset. In terms of visibility, the method yields a 79.2% classification accuracy on the same dataset.

Human labeling and retraining

Although the models achieve great performance on test data, they can still make mistakes on new real-world data. Human labeling is the process of correcting these mistakes to enhance model performance through retraining. We designed a judgement function that combines the confidence values output by the ML models for all of the head bounding boxes or key points into a final score. We use this final score to identify the mistakes and the resulting badly labeled images, which need to be sent to the human labeling process.

In addition to badly labeled images, a small portion of images is randomly chosen for human labeling. These human-labeled images are added to the current version of the training set for retraining, enhancing model performance and overall annotation accuracy.

In the implementation, we use SageMaker Ground Truth for the human labeling process. SageMaker Ground Truth provides a user-friendly and intuitive UI for data labeling. The following screenshot demonstrates a SageMaker Ground Truth labeling job for head bounding box annotation.

The following screenshot demonstrates a SageMaker Ground Truth labeling job for key points annotation.

Cost, speed, and reusability

Cost and speed are the key advantages of using our solution compared to human labeling, as shown in the following tables. Using the GPU-accelerated SageMaker instance ml.g4dn.xlarge, the total training and inference cost on 100,000 images is 99% less than the cost of human labeling, while the process is 10–10,000 times faster than human labeling, depending on the task.

The first table summarizes the cost performance metrics.

Model | mAP_50 (1,128 test images) | Training cost (100,000 images) | Inference cost (100,000 images) | Cost reduction vs. human annotation | Inference time (100,000 images) | Time acceleration vs. human annotation
Rekognition head bounding box | 0.949 | $4 | $22 | 99% less | 5.5 h | Days
Yolo key points | 0.984 | $27.20 * | $10 | 99.9% less | Minutes | Weeks

The following table summarizes performance metrics.

Annotation Task | mAP_50 (%) | Training Cost ($) | Inference Cost ($) | Inference Time
Head Bounding Box | 94.9 | 4 | 22 | 5.5 hours
Key Points | 98.4 | 27 | 10 | 5 minutes

Moreover, our solution provides reusability for similar tasks. Camera perception development for other systems, such as advanced driver assistance systems (ADAS) and in-cabin systems, can also adopt our solution.

Summary

In this post, we showed how to build an active learning pipeline for automatic annotation of in-cabin images utilizing AWS services. We demonstrated the power of ML, which enables you to automate and expedite the annotation process, and the flexibility of a framework that uses models either supported by AWS services or customized on SageMaker. With Amazon S3, SageMaker, Lambda, and SageMaker Ground Truth, you can streamline data storage, annotation, training, and deployment, and achieve reusability while reducing costs significantly. By implementing this solution, automotive companies can become more agile and cost-efficient by using ML-based advanced analytics such as automated image annotation.

Get started today and unlock the power of AWS services and machine learning for your automotive in-cabin sensing use cases!


About the Authors

Yanxiang Yu is an Applied Scientist at the Amazon Generative AI Innovation Center. With over 9 years of experience building AI and machine learning solutions for industrial applications, he specializes in generative AI, computer vision, and time series modeling.

Tianyi Mao is an Applied Scientist at AWS based out of the Chicago area. He has 5+ years of experience in building machine learning and deep learning solutions and focuses on computer vision and reinforcement learning with human feedback. He enjoys working with customers to understand their challenges and solve them by creating innovative solutions using AWS services.

Yanru Xiao is an Applied Scientist at the Amazon Generative AI Innovation Center, where he builds AI/ML solutions for customers’ real-world business problems. He has worked in several fields, including manufacturing, energy, and agriculture. Yanru obtained his Ph.D. in Computer Science from Old Dominion University.

Paul George is an accomplished product leader with over 15 years of experience in automotive technologies. He is adept at leading product management, strategy, Go-to-Market and systems engineering teams. He has incubated and launched several new sensing and perception products globally. At AWS, he is leading strategy and go-to-market for autonomous vehicle workloads.

Caroline Chung is an engineering manager at Veoneer (acquired by Magna International). She has over 14 years of experience developing sensing and perception systems and currently leads interior sensing pre-development programs at Magna International, managing a team of computer vision engineers and data scientists.
