Tailoring neighborhood sizes and sampling probability to nodes’ degree of connectivity improves the utility of graph-neural-network embeddings by as much as 230%.
Try Bard and share your feedback
We’re starting to open access to Bard, an early experiment that lets you collaborate with generative AI. We’re beginning with the U.S. and the U.K., and will expand to m…
Accelerate Amazon SageMaker inference with C6i Intel-based Amazon EC2 instances
This is a guest post co-written with Antony Vance from Intel.
Customers are always looking for ways to improve the performance and response times of their machine learning (ML) inference workloads without increasing the cost per transaction and without sacrificing the accuracy of the results. Running ML workloads on Amazon SageMaker with Amazon Elastic Compute Cloud (Amazon EC2) C6i instances and Intel’s INT8 inference deployment can boost overall performance per dollar spent by up to four times for certain ML workloads, while keeping the loss in inference accuracy below 1% compared to FP32. Quantization can also help when running models on embedded devices, where form factor and model size are important.
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (INT8) instead of the usual 32-bit floating point (FP32). In the following example figure, we show INT8 inference performance in C6i for a BERT-base model.
The BERT-base model was fine-tuned on SQuAD v1.1, using PyTorch (v1.11) as the ML framework together with Intel® Extension for PyTorch. A batch size of 1 was used for the comparison; higher batch sizes yield a different cost per 1 million inferences.
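To make the FP32-to-INT8 mapping concrete, the following minimal PyTorch snippet (an illustration only, not taken from the post's code) quantizes a tensor with a scale and zero point and measures the round-trip error:

```python
import torch

# FP32 weights, as produced by training
w_fp32 = torch.randn(4, 4)

# Affine (asymmetric) quantization: choose a scale and zero point that map the
# observed FP32 range onto the 256 representable UINT8 values.
w_min, w_max = w_fp32.min().item(), w_fp32.max().item()
scale = (w_max - w_min) / 255.0
zero_point = int(round(-w_min / scale))

w_int8 = torch.quantize_per_tensor(w_fp32, scale=scale, zero_point=zero_point, dtype=torch.quint8)

# Dequantizing recovers only an approximation of the original values;
# the worst-case error is roughly scale / 2 per element.
print("max abs error:", (w_fp32 - w_int8.dequantize()).abs().max().item())
```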
In this post, we show you how to build and deploy INT8 inference with your own processing container for PyTorch. We use Intel® Extension for PyTorch for an effective INT8 deployment workflow.
Overview of the technology
EC2 C6i instances are powered by third-generation Intel Xeon Scalable processors (also called Ice Lake) with an all-core turbo frequency of 3.5 GHz.
In the context of deep learning, the predominant numerical format used for research and deployment has so far been 32-bit floating point, or FP32. However, the need for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy.
EC2 C6i instances offer many new capabilities that result in performance improvements for AI and ML workloads. C6i instances provide performance advantages in FP32 and INT8 model deployments. FP32 inference is enabled with AVX-512 improvements, and INT8 inference is enabled by AVX-512 VNNI instructions.
C6i is now available on SageMaker endpoints, and developers should expect it to provide over two times price-performance improvements for INT8 inference over FP32 inference and up to four times performance improvement when compared with C5 instance FP32 inference. Refer to the appendix for instance details and benchmark data.
Deep learning deployment on the edge for real-time inference is key to many application areas. It significantly reduces the cost of communicating with the cloud in terms of network bandwidth, network latency, and power consumption. However, edge devices have limited memory, computing resources, and power. This means that a deep learning network must be optimized for embedded deployment. INT8 quantization has become a popular approach for such optimizations for ML frameworks like TensorFlow and PyTorch. SageMaker provides you with a bring your own container (BYOC) approach and integrated tools so that you can run quantization.
For more information, refer to Lower Numerical Precision Deep Learning Inference and Training.
Solution overview
The steps to implement the solution are as follows:
- Provision an EC2 C6i instance to quantize and create the ML model.
- Use the supplied Python scripts for quantization.
- Create a Docker image to deploy the model in SageMaker using the BYOC approach.
- Use an Amazon Simple Storage Service (Amazon S3) bucket to copy the model and code for SageMaker access.
- Use Amazon Elastic Container Registry (Amazon ECR) to host the Docker image.
- Use the AWS Command Line Interface (AWS CLI) to create an inference endpoint in SageMaker.
- Run the provided Python test scripts to invoke the SageMaker endpoint for both INT8 and FP32 versions.
This inference deployment setup uses a BERT-base model from the Hugging Face transformers repository (csarron/bert-base-uncased-squad-v1).
Prerequisites
The following are prerequisites for creating the deployment setup:
- A Linux shell terminal with the AWS CLI installed
- An AWS account with access to EC2 instance creation (C6i instance type)
- SageMaker access to deploy a SageMaker model, endpoint configuration, endpoint
- AWS Identity and Access Management (IAM) access to configure an IAM role and policy
- Access to Amazon ECR
- SageMaker access to create a notebook with instructions to launch an endpoint
Generate and deploy a quantized INT8 model on SageMaker
Launch an EC2 C6i instance to create your quantized model, and push the model artifacts to Amazon S3. For endpoint deployment, create a custom container with PyTorch and Intel® Extension for PyTorch to deploy the optimized INT8 model. The container is pushed to Amazon ECR, and a C6i-based endpoint is created to serve FP32 and INT8 models.
The following diagram illustrates the high-level flow.
To access the code and documentation, refer to the GitHub repo.
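As a rough sketch of the endpoint-creation step: the post uses the AWS CLI, but the equivalent calls with the Python SDK (boto3) look roughly like the following. All names, image URIs, S3 paths, and the role ARN are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# Register the model with the BYOC image in Amazon ECR and the artifacts in Amazon S3.
sm.create_model(
    ModelName="bert-int8",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/<byoc-repo>:latest",
        "ModelDataUrl": "s3://<bucket>/model/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::<account>:role/<sagemaker-role>",
)

# Host the model on a C6i-based instance.
sm.create_endpoint_config(
    EndpointConfigName="bert-int8-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "bert-int8",
        "InstanceType": "ml.c6i.2xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="bert-int8-endpoint",
    EndpointConfigName="bert-int8-config",
)
```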
Example use case
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
The following example is a question answering algorithm using a BERT-base model. Given a document as an input, the model will answer simple questions based on the learning and contexts from the input document.
The following is an example input document:
The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometers (2,700,000 sq mi), of which 5,500,000 square kilometers (2,100,000 sq mi) are covered by the rainforest.
For the question “Which name is also used to describe the Amazon rainforest in English?” we get the answer:
For the question “How many square kilometers of rainforest is covered in the basin?” we get the answer:
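As a side note, you can reproduce this question answering example locally with the same Hugging Face checkpoint (csarron/bert-base-uncased-squad-v1) before deploying anything. This sketch uses the standard transformers pipeline API and an abridged version of the passage above:

```python
from transformers import pipeline

# FP32 question answering with the same checkpoint used in this post
qa = pipeline("question-answering", model="csarron/bert-base-uncased-squad-v1")

context = (
    "The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle, "
    "is a moist broadleaf forest that covers most of the Amazon basin of South America. "
    "This basin encompasses 7,000,000 square kilometers (2,700,000 sq mi), of which "
    "5,500,000 square kilometers (2,100,000 sq mi) are covered by the rainforest."
)

questions = [
    "Which name is also used to describe the Amazon rainforest in English?",
    "How many square kilometers of rainforest is covered in the basin?",
]

for question in questions:
    result = qa(question=question, context=context)
    print(question, "->", result["answer"])
```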
Quantizing the model in PyTorch
This section gives a quick overview of model quantization steps with PyTorch and Intel extensions.
The code snippets are derived from a SageMaker example.
Let’s go over the changes in detail for the function IPEX_quantize in the file quantize.py; a consolidated sketch of these steps follows the list.
- Import Intel Extension for PyTorch to help with quantization and optimization, and import torch for array manipulations.
- Apply model calibration for 100 iterations. In this case, you are calibrating the model with the SQuAD dataset.
- Prepare sample inputs.
- Convert the model to an INT8 model using the quantization configuration.
- Run two iterations of forward pass to enable fusions.
- As a last step, save the TorchScript model.
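The list above maps approximately to the following sketch. This is a minimal illustration assuming the static-quantization prepare/convert API of recent Intel® Extension for PyTorch releases; dataset handling and input formats are simplified, so refer to quantize.py in the GitHub repo for the exact implementation.

```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

def IPEX_quantize(model, calib_dataloader, sample_inputs, num_calib_iters=100):
    # Static INT8 quantization configuration (weights and activations)
    qconfig = ipex.quantization.default_static_qconfig
    prepared = prepare(model.eval(), qconfig, example_inputs=sample_inputs, inplace=False)

    # Calibrate for ~100 iterations on the SQuAD calibration set
    # (assuming dict-style batches, for example from a Hugging Face tokenizer)
    with torch.no_grad():
        for i, batch in enumerate(calib_dataloader):
            if i >= num_calib_iters:
                break
            prepared(**batch)

    # Convert to an INT8 model and trace it to TorchScript
    int8_model = convert(prepared)
    with torch.no_grad():
        traced = torch.jit.trace(int8_model, sample_inputs, strict=False)
        traced = torch.jit.freeze(traced)
        # Two warm-up forward passes trigger the JIT fusion passes
        traced(*sample_inputs)
        traced(*sample_inputs)

    # Save the TorchScript model for deployment
    traced.save("bert_int8_jit.pt")
    return traced
```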
Clean up
Refer to the GitHub repo for steps to clean up the AWS resources created.
Conclusion
New EC2 C6i instances in a SageMaker endpoint can accelerate inference deployment by up to 2.5 times with INT8 quantization. Quantizing the model in PyTorch is possible with a few APIs from Intel Extension for PyTorch. It’s recommended to quantize the model on C6i instances so that model accuracy is maintained in endpoint deployment. The SageMaker examples GitHub repo now provides an end-to-end deployment example pipeline for quantizing and hosting INT8 models.
We encourage you to create a new model or migrate an existing model to INT8 quantization on the EC2 C6i instance type and see the performance gains for yourself.
Notice and disclaimers
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (0BSD).
Appendix
New AWS instances in SageMaker with INT8 deployment support
The following table lists SageMaker instances with and without DL Boost support.
| Instance Name | Xeon Gen Codename | INT8 Enabled? | DL Boost Enabled? |
| --- | --- | --- | --- |
| ml.c5.xlarge – ml.c5.9xlarge | Skylake/1st | Yes | No |
| ml.c5.18xlarge | Skylake/1st | Yes | No |
| ml.c6i.1xlarge – ml.c6i.32xlarge | Ice Lake/3rd | Yes | Yes |
To summarize, INT8 enabled means the instance supports the INT8 data type and computation; DL Boost enabled means it supports Intel Deep Learning Boost.
Benchmark data
The following table compares cost and relative performance between C5 and C6i instances. Latency and throughput were measured with 10,000 inference queries to SageMaker endpoints.
E2E latency of inference endpoint and cost analysis:

| Instance | P50 (ms) | P90 (ms) | Queries/sec | $/1M queries | Relative $/performance |
| --- | --- | --- | --- | --- | --- |
| c5.2xlarge (FP32) | 76.6 | 125.3 | 11.5 | $10.2 | 1.0x |
| c6i.2xlarge (FP32) | 70 | 110.8 | 13 | $9.0 | 1.1x |
| c6i.2xlarge (INT8) | 35.7 | 48.9 | 25.56 | $4.5 | 2.3x |
INT8 models are expected to provide 2–4 times practical performance improvement with less than 1% accuracy loss for most models. The preceding table includes overhead latency (network and demo application).
Accuracy for BERT-base model
The following table summarizes the accuracy for the INT8 model with the SQuAD v1.1 dataset.
| Metric | FP32 | INT8 |
| --- | --- | --- |
| Exact Match | 85.8751 | 85.5061 |
| F1 | 92.0807 | 91.8728 |
The GitHub repo includes scripts to check accuracy on the SQuAD dataset. Refer to the invoke-INT8.py and invoke-FP32.py scripts for testing.
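As a rough illustration of what those scripts do, the endpoint can be invoked with the SageMaker runtime client. The endpoint name and JSON payload format below are assumptions, so match them to the request format defined by the serving container in the repo:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "question": "Which name is also used to describe the Amazon rainforest in English?",
    "context": "The Amazon rainforest ... covers most of the Amazon basin of South America.",
}

response = runtime.invoke_endpoint(
    EndpointName="bert-int8-endpoint",   # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```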
Intel Extension for PyTorch
Intel® Extension for PyTorch* (an open source project on GitHub) extends PyTorch with optimizations for extra performance boosts on Intel hardware. Most of the optimizations will be included in stock PyTorch releases eventually, and the intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware. Examples include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).
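As a generic usage illustration (not code from this post), FP32 inference with the extension typically requires only a single optimize call on an eval-mode model:

```python
import torch
import intel_extension_for_pytorch as ipex

# A toy model stands in for the real network
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# ipex.optimize applies operator fusion, memory-layout, and other CPU optimizations
optimized = ipex.optimize(model, dtype=torch.float32)

with torch.no_grad():
    out = optimized(torch.randn(8, 128))
print(out.shape)
```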
The following figure illustrates the Intel Extension for PyTorch architecture.
For more detailed user guidance (features, performance tuning, and more) for Intel® Extension for PyTorch, refer to Intel® Extension for PyTorch* user guidance.
About the Authors
Rohit Chowdhary is a Sr. Solutions Architect in the Strategic Accounts team at AWS.
Aniruddha Kappagantu is a Software Development Engineer in the AI Platforms team at AWS.
Antony Vance is an AI Architect at Intel with 19 years of experience in computer vision, machine learning, deep learning, embedded software, GPU, and FPGA.
Amazon releases code, datasets for developing embodied AI agents
With Alexa Arena, developers can create simulated missions in which humans interact with virtual robots, providing a natural way to build generalizable AI models.
Intelligently search your organization’s Microsoft Teams data source with the Amazon Kendra connector for Microsoft Teams
Organizations use messaging platforms like Microsoft Teams to bring the right people together to securely communicate with each other and collaborate to get work done. Microsoft Teams captures invaluable organizational knowledge in the form of the information that flows through it as users collaborate. However, making this knowledge easily and securely available to users can be challenging due to the fragmented nature of conversations across groups, channels, and chats within an organization. Additionally, the conversational nature of Microsoft Teams communication renders a traditional keyword-based approach to search ineffective when trying to find accurate answers to questions from the content and therefore requires intelligent search capabilities that have the ability to process natural language queries.
You can now use the Amazon Kendra connector for Microsoft Teams to index Microsoft Teams messages and documents, and search this content using intelligent search in Amazon Kendra, powered by machine learning (ML).
This post shows how to configure the Amazon Kendra connector for Microsoft Teams and take advantage of the service’s intelligent search capabilities. We use an example of an illustrative Microsoft Teams instance where users discuss technical topics related to AWS.
Solution overview
Microsoft Teams content for active organizations is dynamic in nature due to continuous collaboration. Microsoft Teams includes public channels where any user can participate, and private channels where only those users who are members of these channels can communicate with each other. Furthermore, individuals can directly communicate with one another in one-on-one and ad hoc groups. This communication is in the form of messages and threads of replies, with optional document attachments.
In our solution, we configure Microsoft Teams as a data source for an Amazon Kendra search index using the Amazon Kendra connector for Microsoft Teams. Based on the configuration, when the data source is synchronized, the connector crawls and indexes all the content from Microsoft Teams that was created on or before a specific date. The connector also indexes the Access Control List (ACL) information for each message and document. When access control or user context filtering is enabled, the search results of a query made by a user includes results only from those documents that the user is authorized to read.
The Amazon Kendra connector for Microsoft Teams can integrate with AWS IAM Identity Center (successor to AWS Single Sign-On). You must first enable IAM Identity Center and create an organization to sync users and groups from your Active Directory. The connector uses the user name and group lookup for the user context of the search queries.
With Amazon Kendra Experience Builder, you can build and deploy a low-code, fully functional search application to search your Microsoft Teams data source.
Prerequisites
To try out the Amazon Kendra connector for Microsoft Teams using this post as a reference, you need the following:
- An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
- Basic knowledge of AWS and working knowledge of Microsoft Teams.
Note that the Microsoft Graph API places throttling limits on the number of concurrent calls to a service to prevent overuse of resources.
Configure Microsoft Teams
The following screenshot shows our example Microsoft Teams instance with sample content and the PDF file AWS_Well-Architect_Framework.pdf that we will use for our Amazon Kendra search queries.
The following steps describe the configuration of a new Amazon Kendra connector application in the Azure portal. This will create a user OAuth token to be used in configuring the Amazon Kendra connector for Microsoft Teams.
- Log in to Azure Portal with your Microsoft credentials.
- Register an application with the Microsoft Identity platform.
- Next to Client credentials, choose Add a certificate or secret to add a new client secret.
- For Description, enter a description (for example, KendraConnectorSecret).
- For Expires, choose an expiry date (for example, 6 months).
- Choose Add.
- Save the secret ID and secret value to use later when creating an Amazon Kendra data source.
- Choose Add a permission.
- Choose Microsoft Graph to add all necessary Microsoft Graph permissions.
- Choose Application permissions.
The registered application should have the following API permissions to allow crawling all entities supported by the Amazon Kendra connector for Microsoft Teams:
ChannelMessage.Read.All
Chat.Read
Chat.Read.All
Chat.ReadBasic
Chat.ReadBasic.All
ChatMessage.Read.All
Directory.Read.All
Files.Read.All
Group.Read.All
Mail.Read
Mail.ReadBasic
User.Read
User.Read.All
TeamMember.Read.All
However, you can select a lesser scope based on the entities chosen to be crawled. The following lists are the minimum sets of permissions needed for each entity:
- Channel Post: ChannelMessage.Read.All, Group.Read.All, User.Read, User.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Channel Attachment: ChannelMessage.Read.All, Group.Read.All, User.Read, User.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Channel Wiki: Group.Read.All, User.Read, User.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Chat Message: Chat.Read.All, ChatMessage.Read.All, ChatMember.Read.All, User.Read, User.Read.All, Group.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Meeting Chat: Chat.Read.All, ChatMessage.Read.All, ChatMember.Read.All, User.Read, User.Read.All, Group.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Chat Attachment: Chat.Read.All, ChatMessage.Read.All, ChatMember.Read.All, User.Read, User.Read.All, Group.Read.All, Files.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Meeting File: Chat.Read.All, ChatMessage.Read.All, ChatMember.Read.All, User.Read, User.Read.All, Group.Read.All, Files.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Calendar Meeting: Calendars.Read, Group.Read.All, User.Read, User.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Meeting Notes: Group.Read.All, User.Read, User.Read.All, Files.Read.All, TeamMember.Read.All (user-group mapping for identity crawl)
- Select your permissions and choose Add permissions.
Configure the data source using the Amazon Kendra connector for Microsoft Teams
To add a data source to your Amazon Kendra index using the Microsoft Teams connector, you can use an existing Amazon Kendra index, or create a new Amazon Kendra index. Then complete the steps in this section. For more information on this topic, refer to Microsoft Teams.
- On the Amazon Kendra console, open the index and choose Data sources in the navigation pane.
- Choose Add data source.
- Under Microsoft Teams connector, choose Add connector.
- In the Specify data source details section, enter the details of your data source and choose Next.
- In the Define access and security section, for Tenant ID, enter the Microsoft Teams tenant ID from the Microsoft account dashboard.
- Under Authentication, you can either choose Create to add a new secret with the client ID and client secret of the Microsoft Teams tenant, or use an existing AWS Secrets Manager secret that has the client ID and client secret of the Microsoft Teams tenant that you want the connector to access.
- Choose Save.
- Optionally, choose the appropriate payment model:
- Model A payment models are for licensing and payment models that require security compliance.
- Model B payment models are for licensing and payment models that don’t require security compliance.
- Use Evaluation Mode (default) for limited usage evaluation purposes.
- For IAM role, you can choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
- Choose Next.
- In the Configure sync settings section, provide information regarding your sync scope.
- For Sync mode, choose your sync mode (for this post, select Full sync).
With the Full sync option, every time the sync runs, Amazon Kendra will crawl all documents and ingest each document even if ingested earlier. The full refresh enables you to reset your Amazon Kendra index without the need to delete and create a new data source. If you choose New or modified content sync or New, modified, or deleted content sync, every time the sync job runs, it will process only objects added, modified, or deleted since the last crawl. Incremental crawls can help reduce runtime and cost when used with datasets that append new objects to existing data sources on a regular basis.
- For Sync run schedule, choose Run on demand.
- Choose Next.
- In the Set field mappings section, you can optionally configure the field mappings, wherein Microsoft Teams field names may be mapped to a different Amazon Kendra attribute or facet.
- Choose Next.
- Review your settings and confirm to add the data source.
- After the data source is added, choose Data sources in the navigation pane, select the newly added data source, and choose Sync now to start data source synchronization with the Amazon Kendra index.
The sync process can take upwards of 30 minutes (depending on the amount of data to be crawled).
Now let’s enable access control for the Amazon Kendra index.
- In the navigation pane, choose your index.
- On the User access control tab, choose Edit settings and change the settings to look like the following screenshot.
- Choose Next, then choose Update.
Perform intelligent search with Amazon Kendra
Before you try searching on the Amazon Kendra console or using the API, make sure that the data source sync is complete. To check, view the data sources and verify if the last sync was successful.
Now we’re ready to search our index.
- On the Amazon Kendra console, navigate to the index and choose Search indexed content in the navigation pane.
- Let’s use the query “How do you detect security events” and not provide an access token.
Based on our access control settings, a valid access token is needed to access authenticated content; therefore, when we use this search query without setting any user name or group, no results are returned.
- Next, choose Apply token and set the user name to a user in the domain (for example, usertest4) that has access to the Microsoft Teams content.
In this example, the search will return a result from the PDF file uploaded in the Microsoft Teams chat message.
- Finally, choose Apply token and set the user name to a different user in the domain (for example, usertest) that has access to different Microsoft Teams content.
In this example, the search will return a different Microsoft Teams chat message.
This confirms that the ACLs ingested in Amazon Kendra by the connector for Microsoft Teams are being enforced in the search results based on the user name.
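For readers who prefer the API over the console, the same token-scoped queries can be issued with boto3; the index ID and user name below are placeholders, and user context filtering must be enabled on the index (as configured earlier) for the ACLs to be enforced:

```python
import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<index-id>",  # placeholder index ID
    QueryText="How do you detect security events",
    UserContext={"UserId": "usertest4"},  # user whose ACLs should apply
)

for item in response["ResultItems"]:
    title = item.get("DocumentTitle", {}).get("Text", "")
    print(item["Type"], title)
```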
Clean up
To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Microsoft Teams, delete that data source.
Conclusion
With the Amazon Kendra connector for Microsoft Teams, organizations can make invaluable information trapped in their Microsoft Teams instances available to their users securely using intelligent search powered by Amazon Kendra. Additionally, the connector provides facets for Microsoft Teams attributes such as channels, authors, and categories for the users to interactively refine the search results based on what they’re looking for.
To learn more about the Amazon Kendra connector for Microsoft Teams, refer to Microsoft Teams.
For more information on how you can create, modify, or delete metadata and content when ingesting your data from Microsoft Teams, refer to Customizing document metadata during the ingestion process and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.
About the Authors
Praveen Edem is a Senior Solutions Architect at Amazon Web Services. He works with major financial services customers, architecting and modernizing their critical large-scale applications while adopting AWS services. He has over 20 years of IT experience in application development and software architecture.
Gunwant Walbe is a Software Development Engineer at Amazon Web Services. He is an avid learner and keen to adopt new technologies. He develops complex business applications, and Java is his primary language of choice.
Vid2Seq: a pretrained visual language model for describing multi-event videos
Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and standard video captioning, which consists of describing short videos with a single sentence.
Dense video captioning systems have wide applications, such as making videos accessible to people with visual or auditory impairments, automatically generating chapters for videos, or improving the search of video moments in large databases. Current dense video captioning approaches, however, have several limitations — for example, they often contain highly specialized task-specific components, which make it challenging to integrate them into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence are not a scalable solution.
In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. In order to pre-train this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model pre-trained on millions of narrated videos improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the code for Vid2Seq here.
Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.
A visual language model for dense video captioning
Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.
To achieve this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.
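To make the idea of time tokens concrete, here is a small, purely illustrative sketch (not the authors' code) of how event timestamps could be discretized into special tokens and interleaved with captions to form a single output sequence; the number of time bins and the token format are assumptions:

```python
NUM_TIME_TOKENS = 100  # number of discrete, relative time bins (assumption)

def time_token(t_seconds: float, video_duration: float) -> str:
    """Map a timestamp to one of NUM_TIME_TOKENS relative time bins."""
    bin_id = min(int(t_seconds / video_duration * NUM_TIME_TOKENS), NUM_TIME_TOKENS - 1)
    return f"<time_{bin_id}>"

def events_to_sequence(events, video_duration):
    """Serialize (start, end, caption) events into one token sequence string."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start, video_duration), time_token(end, video_duration), caption]
    return " ".join(parts)

events = [
    (12.0, 25.0, "the dogs are hitched to the sled"),
    (25.0, 110.0, "the dogs pull the sled across the snow"),
]
print(events_to_sequence(events, video_duration=120.0))
```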
The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.
Large-scale pre-training on untrimmed narrated videos
Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.
We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).
Results on downstream dense video captioning benchmarks
The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in the few-shot settings and in the video paragraph captioning task.
Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).
Conclusion
We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pretrained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.
Acknowledgements
This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.
NVIDIA CEO to Reveal What’s Next for AI at GTC
The secret’s out. Thanks to ChatGPT, everyone knows about the power of modern AI.
To find out what’s coming next, tune in to NVIDIA founder and CEO Jensen Huang’s keynote address at NVIDIA GTC on Tuesday, March 21, at 8 a.m. Pacific.
Huang will share his vision for the future of AI and how NVIDIA is accelerating it with breakthrough technologies and solutions. There couldn’t be a better time to get ready for what’s to come.
NVIDIA is a pioneer and leader in AI thanks to its powerful graphics processing units that have enabled new computing models like accelerated computing.
NVIDIA GPUs sparked the modern AI revolution by making deep neural networks faster and more efficient.
Today, NVIDIA GPUs power AI applications in every industry, from computer vision to natural language processing, from robotics to healthcare, and from gaming to chatbots.
GTC, which runs online March 20-23, is the conference for AI and the metaverse. It features more than 650 sessions on deep learning, computer vision, natural language processing, robotics, healthcare, gaming and more.
Speakers from Adobe, Amazon, Autodesk, Deloitte, Ford Motor, Google, IBM, Jaguar Land Rover, Lenovo, Meta, Netflix, Nike, OpenAI, Pfizer, Pixar, Subaru and more will all discuss their latest work.
Don’t miss out on talks from leaders such as Demis Hassabis of DeepMind, Valerie Taylor of Argonne National Laboratory, Scott Belsky of Adobe, Paul Debevec of Netflix, Thomas Schulthess of ETH Zurich, and a special fireside chat between Huang and Ilya Sutskever, co-founder of OpenAI, the creator of ChatGPT.
You can watch the keynote live or on demand. Register for free at https://www.nvidia.com/en-us/gtc/.
You can also join the conversation on social media using #GTC23.
Responsible AI at Google Research: The Impact Lab
Globalized technology has the potential to create large-scale societal impact, and having a grounded research approach rooted in existing international human and civil rights standards is a critical component to assuring responsible and ethical AI development and deployment. The Impact Lab team, part of Google’s Responsible AI Team, employs a range of interdisciplinary methodologies to ensure critical and rich analysis of the potential implications of technology development. The team’s mission is to examine socioeconomic and human rights impacts of AI, publish foundational research, and incubate novel mitigations enabling machine learning (ML) practitioners to advance global equity. We study and develop scalable, rigorous, and evidence-based solutions using data analysis, human rights, and participatory frameworks.
What makes the Impact Lab’s goals unique is the team’s multidisciplinary approach and the diversity of experience it draws on, including both applied and academic research. Our aim is to expand the epistemic lens of Responsible AI to center the voices of historically marginalized communities and to overcome the practice of ungrounded analysis of impacts by offering a research-based approach to understand how differing perspectives and experiences should impact the development of technology.
What we do
In response to the accelerating complexity of ML and the increased coupling between large-scale ML and people, our team critically examines traditional assumptions of how technology impacts society to deepen our understanding of this interplay. We collaborate with academic scholars in the areas of social science and philosophy of technology and publish foundational research focusing on how ML can be helpful and useful. We also offer research support to some of our organization’s most challenging efforts, including the 1,000 Languages Initiative and ongoing work in the testing and evaluation of language and generative models. Our work gives weight to Google’s AI Principles.
To that end, we:
- Conduct foundational and exploratory research towards the goal of creating scalable socio-technical solutions
- Create datasets and research-based frameworks to evaluate ML systems
- Define, identify, and assess negative societal impacts of AI
- Create responsible solutions to data collection used to build large models
- Develop novel methodologies and approaches that support responsible deployment of ML models and systems to ensure safety, fairness, robustness, and user accountability
- Translate external community and expert feedback into empirical insights to better understand user needs and impacts
- Seek equitable collaboration and strive for mutually beneficial partnerships
We strive not only to reimagine existing frameworks for assessing the adverse impact of AI to answer ambitious research questions, but also to promote the importance of this work.
Current research efforts
Understanding social problems
Our motivation for providing rigorous analytical tools and approaches is to ensure that social-technical impact and fairness is well understood in relation to cultural and historical nuances. This is quite important, as it helps develop the incentive and ability to better understand communities who experience the greatest burden and demonstrates the value of rigorous and focused analysis. Our goals are to proactively partner with external thought leaders in this problem space, reframe our existing mental models when assessing potential harms and impacts, and avoid relying on unfounded assumptions and stereotypes in ML technologies. We collaborate with researchers at Stanford, University of California Berkeley, University of Edinburgh, Mozilla Foundation, University of Michigan, Naval Postgraduate School, Data & Society, EPFL, Australian National University, and McGill University.
We examine systemic social issues and generate useful artifacts for responsible AI development.
Centering underrepresented voices
We also developed the Equitable AI Research Roundtable (EARR), a novel community-based research coalition created to establish ongoing partnerships with external nonprofit and research organization leaders who are equity experts in the fields of education, law, social justice, AI ethics, and economic development. These partnerships offer the opportunity to engage with multi-disciplinary experts on complex research questions related to how we center and understand equity using lessons from other domains. Our partners include PolicyLink; The Education Trust – West; Notley; Partnership on AI; Othering and Belonging Institute at UC Berkeley; The Michelson Institute for Intellectual Property, HBCU IP Futures Collaborative at Emory University; Center for Information Technology Research in the Interest of Society (CITRIS) at the Banatao Institute; and the Charles A. Dana Center at the University of Texas, Austin. The goals of the EARR program are to: (1) center knowledge about the experiences of historically marginalized or underrepresented groups, (2) qualitatively understand and identify potential approaches for studying social harms and their analogies within the context of technology, and (3) expand the lens of expertise and relevant knowledge as it relates to our work on responsible and safe approaches to AI development.
Through semi-structured workshops and discussions, EARR has provided critical perspectives and feedback on how to conceptualize equity and vulnerability as they relate to AI technology. We have partnered with EARR contributors on a range of topics from generative AI, algorithmic decision making, transparency, and explainability, with outputs ranging from adversarial queries to frameworks and case studies. Certainly, the process of translating research insights across disciplines into technical solutions is not always easy, but this research has been a rewarding partnership. We present our initial evaluation of this engagement in this paper.
EARR: Components of the ML development life cycle in which multidisciplinary knowledge is key for mitigating human biases.
Grounding in civil and human rights values
In partnership with our Civil and Human Rights Program, our research and analysis process is grounded in internationally recognized human rights frameworks and standards including the Universal Declaration of Human Rights and the UN Guiding Principles on Business and Human Rights. Utilizing civil and human rights frameworks as a starting point allows for a context-specific approach to research that takes into account how a technology will be deployed and its community impacts. Most importantly, a rights-based approach to research enables us to prioritize conceptual and applied methods that emphasize the importance of understanding the most vulnerable users and the most salient harms to better inform day-to-day decision making, product design and long-term strategies.
Ongoing work
Social context to aid in dataset development and evaluation
We seek to employ an approach to dataset curation, model development and evaluation that is rooted in equity and that avoids expeditious but potentially risky approaches, such as utilizing incomplete data or not considering the historical and social cultural factors related to a dataset. Responsible data collection and analysis requires an additional level of careful consideration of the context in which the data are created. For example, one may see differences in outcomes across demographic variables that will be used to build models and should question the structural and system-level factors at play as some variables could ultimately be a reflection of historical, social and political factors. By using proxy data, such as race or ethnicity, gender, or zip code, we are systematically merging together the lived experiences of an entire group of diverse people and using it to train models that can recreate and maintain harmful and inaccurate character profiles of entire populations. Critical data analysis also requires a careful understanding that correlations or relationships between variables do not imply causation; the association we witness is often caused by additional multiple variables.
Relationship between social context and model outcomes
Building on this expanded and nuanced social understanding of data and dataset construction, we also approach the problem of anticipating or ameliorating the impact of ML models once they have been deployed for use in the real world. There are myriad ways in which the use of ML in various contexts — from education to health care — has exacerbated existing inequity because the developers and decision-making users of these systems lacked the relevant social understanding, historical context, and did not involve relevant stakeholders. This is a research challenge for the field of ML in general and one that is central to our team.
Globally responsible AI centering community experts
Our team also recognizes the saliency of understanding the socio-technical context globally. In line with Google’s mission to “organize the world’s information and make it universally accessible and useful”, our team is engaging in research partnerships globally. For example, we are collaborating with The Natural Language Processing team and the Human Centered team in the Makerere Artificial Intelligence Lab in Uganda to research cultural and language nuances as they relate to language model development.
Conclusion
We continue to address the impacts of ML models deployed in the real world by conducting further socio-technical research and engaging external experts who are also part of the communities that are historically and globally disenfranchised. The Impact Lab is excited to offer an approach that contributes to the development of solutions for applied problems through the utilization of social-science, evaluation, and human rights epistemologies.
Acknowledgements
We would like to thank each member of the Impact Lab team — Jamila Smith-Loud, Andrew Smart, Jalon Hall, Darlene Neal, Amber Ebinama, and Qazi Mamunur Rashid — for all the hard work they do to ensure that ML is more responsible to its users and society across communities and around the world.
People of AI
Posted by Ashley Oldacre and Laurence Moroney
Throughout the years, we have seen some incredible ways AI has had an impact on our careers and daily lives. From solving some really challenging problems like predicting air quality through apps like Air Cognizer, helping parents of deaf children learn sign language, protecting the Great Barrier Reef and bringing culture and people together through projects like Sounds of India and Shadow Art, the sky’s the limit.
But who are the people behind it all?
To answer this question, I joined forces with my co-host, Laurence Moroney, to launch the People of AI podcast. We want to share the stories of some of the incredible people behind this technology. Through our interviews, we learn from a handful of current AI/ML leaders and practitioners and invite them to share their stories, what they are building, lessons learned along the way, and how they see the future of the industry. Through our conversations, we uncover the passion and creativity behind AI and ML development, and potential applications and use cases for good.
There is no doubt that AI is front and center in our lives today. It’s changing the future and shaping our conversations – whether it’s with family or the latest chat app. Throughout this podcast we will connect with some of the people behind the technology, share in their enthusiasm, concerns and learn from them.
Starting today, we will release one new episode of “People of AI” per week. Listen to the first episode on the People of AI site, also available on Spotify, Apple Podcasts, Google Podcasts, Deezer, and Stitcher.
- Episode 1: meet your hosts, Ashley Oldacre and Laurence Moroney, as we uncover what it means to be a person of AI.
- Episode 2: learn about entrepreneurship with Sharon Zhou, CS Faculty at Stanford and MIT Technology Review’s 35 Under 35.
- Episode 3: learn about the amazing ways you can use ML on the web with Jason Mayes, the public face of Web ML at Google, Web Engineer, and Creative Innovator.
- Episode 4: learn how to pivot mid-career into the field of ML with Catherine Nelson, Principal Data Scientist at SAP Concur.
- Episode 5: learn how to follow your passion and bring it into your work with Gant Laborde, Chief Innovation Officer at Infinite Red, Inc. and Google Developer Expert.
- Episode 6: we talk with Joana Carrasqueira, Head of Community for Developer Relations in ML at Google, about empowering and connecting our communities around AI.
Whether you’re just getting started in AI/ML, or looking to expand your established experience, these stories are for you. We hope you will tune in!
This podcast is sponsored by Google. Any remarks made by the speakers are their own and are not endorsed by Google.