On a Neural Implementation of Brenier’s Polar Factorization

In 1991, Brenier proved a theorem that generalizes the polar decomposition for square matrices (factored as PSD × unitary) to any vector field F: ℝ^d → ℝ^d. The theorem, known as the polar factorization theorem, states that any field F can be recovered as the composition of the gradient of a convex function u with a measure-preserving map M, namely F = ∇u ∘ M. We propose a practical implementation of this far-reaching theoretical result, and explore possible uses within machine learning. The theorem is closely related…Apple Machine Learning Research

Jensen Huang, Mark Zuckerberg to Discuss Future of Graphics and Virtual Worlds at SIGGRAPH 2024

NVIDIA founder and CEO Jensen Huang and Meta founder and CEO Mark Zuckerberg will hold a public fireside chat on Monday, July 29, at the 50th edition of the SIGGRAPH graphics conference in Denver.

The two leaders will discuss the future of AI and simulation and the pivotal role of research at SIGGRAPH, which focuses on the intersection of graphics and technology.

Before the discussion, Huang will also appear in a fireside chat with WIRED senior writer Lauren Goode to discuss AI and graphics for the new computing revolution.

Both conversations will be available live and on replay at NVIDIA.com.

The appearances at the conference, which runs July 28-Aug. 1, highlight SIGGRAPH’s continued role in technological innovation. Nearly 100 exhibitors will showcase how graphics are stepping into the future.

Attendees exploring the SIGGRAPH Innovation Zone will encounter startups at the forefront of computing and graphics while insights from industry leaders like Huang deliver a glimpse into the technological horizon.

Since the conference’s 1974 inception in Boulder, Colorado, SIGGRAPH has been at the forefront of innovation.

It introduced the world to demos such as the “Aspen Movie Map” — a precursor to Google Street View decades ahead of its time — and one of the first screenings of Pixar’s Luxo Jr., which redefined the art of animation.

The conference remains the leading venue for groundbreaking research in computer graphics.

Publications that redefined modern visual culture — including Ed Catmull’s 1974 paper on texture mapping, Turner Whitted’s 1980 paper on ray-tracing techniques, and James T. Kajiya’s 1986 “The Rendering Equation” — first made their debut at SIGGRAPH.

Innovations like these are now spilling out across the world’s industries.

Throughout the Innovation Zone, over a dozen startups are showcasing how they’re bringing advancements rooted in graphics into diverse fields — from robotics and manufacturing to autonomous vehicles and scientific research, including climate science.

Highlights include Tomorrow.io, which leverages NVIDIA Earth-2 to provide precise weather insights and offers early warning systems to help organizations adapt to climate changes.

Looking Glass is pioneering holographic technology that enables 3D content experiences without headsets. The company is using NVIDIA RTX 6000 Ada Generation GPUs and NVIDIA Maxine technology to enhance real-time audio, video and augmented-reality effects to make this possible.

Manufacturing startup nTop developed a computer-aided design tool using NVIDIA GPU-powered signed distance fields. The tool uses the NVIDIA OptiX rendering engine and a two-way NVIDIA Omniverse LiveLink connector to enable real-time, high-fidelity visualization and collaboration across design and simulation platforms.

Conference attendees can also explore how generative AI — a technology deeply rooted in visual computing — is remaking professional graphics.

On July 31, industry leaders and developers will gather in room 607 at the Colorado Convention Center for Generative AI Day, exploring cutting-edge solutions for visual effects, animation and game development with leaders from Bria AI, Cuebric, Getty Images, Replikant, Shutterstock and others.

The conference’s speaker lineup is equally compelling.

In addition to Huang and Zuckerberg, notable presenters include Dava Newman of MIT Media Lab and Mark Sagar from Soul Machines, who’ll delve into the intersections of bioengineering, design and digital humans.

Finally, as part of SIGGRAPH’s rich legacy, the inaugural Stephen Parker Award will be presented to honor the memory and contributions of Stephen Parker, vice president of professional graphics at NVIDIA. Renowned for his pioneering work in interactive ray tracing and computer graphics, Parker left a legacy that continues to inspire the field.

Join the global technology community in Denver later this month to discover why SIGGRAPH remains at the forefront of demonstrating, predicting and shaping the future of technology.

Read More

Video auto-dubbing using Amazon Translate, Amazon Bedrock, and Amazon Polly

This post is co-written with MagellanTV and Mission Cloud. 

Video dubbing, or content localization, is the process of replacing the original spoken language in a video with another language while synchronizing audio and video. Video dubbing has emerged as a key tool in breaking down linguistic barriers, enhancing viewer engagement, and expanding market reach. However, traditional dubbing methods are costly (about $20 per minute with human review effort) and time consuming, making them a common challenge for companies in the Media & Entertainment (M&E) industry. Video auto-dubbing that uses the power of generative artificial intelligence (generative AI) offers creators an affordable and efficient solution.

This post shows you a cost-saving solution for video auto-dubbing. We use Amazon Translate for initial translation of video captions and use Amazon Bedrock for post-editing to further improve the translation quality. Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to help you build generative AI applications with security, privacy, and responsible AI.

MagellanTV, a leading streaming platform for documentaries, wants to broaden its global presence through content internationalization. Faced with manual dubbing challenges and prohibitive costs, MagellanTV sought out AWS Premier Tier Partner Mission Cloud for an innovative solution.

Mission Cloud’s solution distinguishes itself with idiomatic detection and automatic replacement, seamless automatic time scaling, and flexible batch processing capabilities with increased efficiency and scalability.

Solution overview

The following diagram illustrates the solution architecture. The inputs of the solution are specified by the user, including the folder path containing the original video and caption file, target language, and toggles for idiom detector and formality tone. You can specify these inputs in an Excel template and upload the Excel file to a designated Amazon Simple Storage Service (Amazon S3) bucket. This will launch the whole pipeline. The final outputs are a dubbed video file and a translated caption file.

[Diagram: solution architecture of the video auto-dubbing pipeline]
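As a reference point, launching the pipeline amounts to uploading the completed Excel template to the designated S3 bucket. The following minimal sketch shows this with boto3; the bucket name and object key are placeholders, not the values used in the actual deployment.

## Launch the pipeline by uploading the Excel input template (sketch; bucket and key are placeholders)
import boto3
s3=boto3.client('s3')
# Uploading the completed template to the designated bucket triggers the dubbing pipeline
s3.upload_file('dubbing_job_template.xlsx', 'video-dubbing-input-bucket', 'jobs/dubbing_job_template.xlsx')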

We use Amazon Translate to translate the video caption, and Amazon Bedrock to enhance the translation quality and enable automatic time scaling to synchronize audio and video. We use Amazon Augmented AI for editors to review the content, which is then sent to Amazon Polly to generate synthetic voices for the video. To assign a synthetic voice whose gender expression matches the speaker, we developed a model that predicts the speaker's gender expression.
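To illustrate the synthesis step, the following sketch calls Amazon Polly with a neural voice chosen from the target language and the predicted gender expression. The voice mapping and file name are illustrative assumptions, not the production logic.

## Synthesize a dubbed audio segment with Amazon Polly (sketch; voice mapping is illustrative)
import boto3
polly=boto3.client('polly', region_name='us-east-1')
# Example mapping from (target language, predicted gender expression) to a Polly neural voice
voice_by_gender={'de': {'female': 'Vicki', 'male': 'Daniel'}}
def synthesize_segment(text, target_lang='de', predicted_gender='female'):
    response=polly.synthesize_speech(
        Text=text,
        VoiceId=voice_by_gender[target_lang][predicted_gender],
        OutputFormat='mp3',
        Engine='neural')
    return response['AudioStream'].read()
with open('segment_001.mp3', 'wb') as f:
    f.write(synthesize_segment('Lassen Sie mich Ihnen etwas zeigen.'))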

In the backend, AWS Step Functions orchestrates the preceding steps as a pipeline. Each step is run on AWS Lambda or AWS Batch. By using the infrastructure as code (IaC) tool, AWS CloudFormation, the pipeline becomes reusable for dubbing new foreign languages.
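The exact state machine is provisioned by the CloudFormation stack; as a simplified sketch, the dubbing steps could be chained like this (state names, Lambda ARNs, and the role ARN are placeholders).

## Define a simplified dubbing pipeline as a Step Functions state machine (sketch; ARNs are placeholders)
import json
import boto3
sfn=boto3.client('stepfunctions')
definition={
    'Comment': 'Simplified video auto-dubbing pipeline',
    'StartAt': 'TranslateCaptions',
    'States': {
        'TranslateCaptions': {'Type': 'Task',
            'Resource': 'arn:aws:lambda:us-east-1:111122223333:function:translate-captions',
            'Next': 'PostEditWithBedrock'},
        'PostEditWithBedrock': {'Type': 'Task',
            'Resource': 'arn:aws:lambda:us-east-1:111122223333:function:bedrock-post-edit',
            'Next': 'SynthesizeSpeech'},
        'SynthesizeSpeech': {'Type': 'Task',
            'Resource': 'arn:aws:lambda:us-east-1:111122223333:function:polly-synthesize',
            'End': True}}}
sfn.create_state_machine(
    name='video-dubbing-pipeline',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::111122223333:role/StepFunctionsExecutionRole')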

In the following sections, you will learn how to use the unique features of Amazon Translate for setting formality tone and for custom terminology. You will also learn how to use Amazon Bedrock to further improve the quality of video dubbing.

Why choose Amazon Translate?

We chose Amazon Translate to translate video captions based on three factors.

  • Amazon Translate supports over 75 languages. While the landscape of large language models (LLMs) is evolving rapidly, many of the trending LLMs support a smaller set of languages.
  • Our translation professional rigorously evaluated Amazon Translate in our review process and affirmed its commendable translation accuracy. Welocalize benchmarks the performance of using LLMs and machine translations and recommends using LLMs as a post-editing tool.
  • Amazon Translate has various unique benefits. For example, you can add custom terminology glossaries, while for LLMs, you might need fine-tuning that can be labor-intensive and costly.

Use Amazon Translate for custom terminology

Amazon Translate allows you to input a custom terminology dictionary, ensuring translations reflect the organization’s vocabulary or specialized terminology. We use the custom terminology dictionary to compile frequently used terms within video transcription scripts.

Here’s an example. In a documentary video, the caption file typically displays “(speaking in foreign language)” on screen when the interviewee speaks in a foreign language. The phrase “(speaking in foreign language)” isn’t a grammatically complete English sentence: it lacks a subject, yet it’s commonly accepted as an English caption display. When the caption is translated into German, the output also lacks a subject, which can be confusing to German audiences, as shown in the code block that follows.

## Translate - without custom terminology (default)
import boto3
# Initialize a session of Amazon Translate
translate=boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
def translate_text(text, source_lang, target_lang):
    result=translate.translate_text(
        Text=text, 
        SourceLanguageCode=source_lang, 
        TargetLanguageCode=target_lang)
    return result.get('TranslatedText')
text="(speaking in a foreign language)"
output=translate_text(text, "en", "de")
print(output)
# Output: (in einer Fremdsprache sprechen)

Because this phrase “(speaking in foreign language)” is commonly seen in video transcripts, we added this term to the custom terminology CSV file translation_custom_terminology_de.csv with the vetted translation and provided it in the Amazon Translate job. The translation output is as intended as shown in the following code.

## Translate - with custom terminology
import boto3
import json
# Initialize a session of Amazon Translate
translate=boto3.client('translate')
with open('translation_custom_terminology_de.csv', 'rb') as ct_file:
    translate.import_terminology(
        Name='CustomTerminology_boto3',
        MergeStrategy='OVERWRITE',
        Description='Terminology for Demo through boto3',
        TerminologyData={
            'File':ct_file.read(),
            'Format':'CSV',
            'Directionality':'MULTI'
        }
    )
text="(speaking in foreign language)"
result=translate.translate_text(
    Text=text,
    TerminologyNames=['CustomTerminology_boto3'],
    SourceLanguageCode="en",
    TargetLanguageCode="de"
)
print(result['TranslatedText'])
# Output: (Person spricht in einer Fremdsprache)

Set formality tone in Amazon Translate

Some documentary genres tend to be more formal than others. Amazon Translate allows you to define the desired level of formality for translations to supported target languages. By using the default setting (Informal) of Amazon Translate, the translation output in German for the phrase, “[Speaker 1] Let me show you something,” is informal, according to a professional translator.

## Translate - with informal tone (default) 
import boto3
# Initialize a session of Amazon Translate
translate=boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
def translate_text(text, source_lang,target_lang):
    result=translate.translate_text(
        Text=text, 
        SourceLanguageCode=source_lang, 
        TargetLanguageCode=target_lang)
    return result.get('TranslatedText')
text="[Speaker 1] Let me show you something."
output=translate_text(text, "en", "de")
print(output)
# Output: [Sprecher 1] Lass mich dir etwas zeigen.

By adding the Formal setting, the output translation has a formal tone, which fits the documentary’s genre as intended.

## Translate - with formal tone 
import boto3
# Initialize a session of Amazon Translate
translate=boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
def translate_text(text, source_lang, target_lang):
    result=translate.translate_text(
        Text=text, 
        SourceLanguageCode=source_lang, 
        TargetLanguageCode=target_lang,
        Settings={'Formality':'FORMAL'})
    return result.get('TranslatedText')
text="[Speaker 1] Let me show you something."
output=translate_text(text, "en", "de")
print(output)
# Output: [Sprecher 1] Lassen Sie mich Ihnen etwas zeigen.

Use Amazon Bedrock for post-editing

In this section, we use Amazon Bedrock to improve the quality of video captions after we obtain the initial translation from Amazon Translate.

Idiom detection and replacement

Idiom detection and replacement is vital in dubbing English videos to accurately convey cultural nuances. Adapting idioms prevents misunderstandings, enhances engagement, preserves humor and emotion, and ultimately improves the global viewing experience. Hence, we developed an idiom detection function using Amazon Bedrock to resolve this issue.

You can turn the idiom detector on or off through the pipeline inputs. For example, you can turn it off for science genres that contain few idioms, and on for genres with more casual conversations. For a 25-minute video, the total processing time is about 1.5 hours, of which about 1 hour is spent on video preprocessing and video composing. Turning the idiom detector on adds only about 5 minutes to the total processing time.

We developed a function, bedrock_api_idiom, to detect and replace idioms using Amazon Bedrock. The function first uses Amazon Bedrock LLMs to detect idioms in the text and then replaces them. In the example that follows, Amazon Bedrock detects the idiom in the input text “well, I hustle” and rephrases it as “I work hard,” which Amazon Translate can then translate correctly into Spanish.

## A rare idiom is well-detected and rephrased by Amazon Bedrock
text="well, I hustle"
text_rephrased=bedrock_api_idiom(text)
print(text_rephrased)
# Output: I work hard
response=translate_text(text_rephrased, "en", "es-MX")
print(response)
# Output: yo trabajo duro
response=translate_text(response, "es-MX", "en")
print(response)
# Output: I work hard
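The bedrock_api_idiom helper itself isn’t listed in this post. One possible minimal implementation uses the Amazon Bedrock runtime API with an Anthropic Claude model; the model ID and prompt wording below are assumptions for illustration, not the production code.

## Possible sketch of bedrock_api_idiom (model ID and prompt are illustrative)
import json
import boto3
bedrock=boto3.client('bedrock-runtime', region_name='us-east-1')
def bedrock_api_idiom(text, model_id='anthropic.claude-3-sonnet-20240229-v1:0'):
    # Ask the model to rewrite idioms and slang as plain, literal English
    prompt=('Rewrite the following caption so that any idioms or slang are replaced '
            'with plain, literal English. Keep the meaning and keep it short. '
            'Return only the rewritten caption.\n\nCaption: '+text)
    body={'anthropic_version': 'bedrock-2023-05-31',
          'max_tokens': 200,
          'messages': [{'role': 'user', 'content': prompt}]}
    response=bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    return json.loads(response['body'].read())['content'][0]['text'].strip()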

Sentence shortening

Third-party video dubbing tools can be used for time-scaling during video dubbing, which can be costly if done manually. In our pipeline, we used Amazon Bedrock to develop a sentence shortening algorithm for automatic time scaling.

For example, a typical caption file consists of a section number, timestamp, and the sentence. The following is an example of an English sentence before shortening.

Original sentence:

A large portion of the solar energy that reaches our planet is reflected back into space or absorbed by dust and clouds.


Here’s the shortened sentence produced by the sentence shortening algorithm. Using Amazon Bedrock, we can significantly improve video-dubbing performance and reduce human review effort, resulting in cost savings.

Shortened sentence:

A large part of solar energy is reflected into space or absorbed by dust and clouds.

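The shortening step can be prompted in much the same way as the idiom helper above. The sketch below is illustrative only; the model ID, prompt wording, and target-reduction parameter are assumptions rather than the production algorithm.

## Possible sketch of the sentence shortening step (prompt and parameters are illustrative)
import json
import boto3
bedrock=boto3.client('bedrock-runtime', region_name='us-east-1')
def shorten_sentence(text, target_ratio=0.8, model_id='anthropic.claude-3-sonnet-20240229-v1:0'):
    # Ask the model to compress the caption so the dubbed audio fits the original time slot
    prompt=('Shorten the following caption to at most '+str(int(target_ratio*100))+
            '% of its current length. Preserve the meaning and return only the shortened caption.'
            '\n\nCaption: '+text)
    body={'anthropic_version': 'bedrock-2023-05-31',
          'max_tokens': 200,
          'messages': [{'role': 'user', 'content': prompt}]}
    response=bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    return json.loads(response['body'].read())['content'][0]['text'].strip()
print(shorten_sentence('A large portion of the solar energy that reaches our planet '
                       'is reflected back into space or absorbed by dust and clouds.'))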

Conclusion

This new and constantly developing pipeline has been a revolutionary step for MagellanTV because it efficiently resolves challenges that are common across the Media & Entertainment industry. The unique localization pipeline developed by Mission Cloud creates a new frontier of opportunities to distribute content across the world while saving on costs. Using generative AI in tandem with brilliant solutions for idiom detection and resolution, sentence shortening, and custom terminology and tone results in a truly special pipeline bespoke to MagellanTV’s growing needs and ambitions.

If you want to learn more about this use case or have a consultative session with the Mission team to review your specific generative AI use case, feel free to request one through AWS Marketplace.


About the Authors

Na Yu is a Lead GenAI Solutions Architect at Mission Cloud, specializing in developing ML, MLOps, and GenAI solutions in AWS Cloud and working closely with customers. She received her Ph.D. in Mechanical Engineering from the University of Notre Dame.

Max Goff is a data scientist/data engineer with over 30 years of software development experience. A published author, blogger, and music producer, he sometimes dreams in A.I.

Marco Mercado is a Sr. Cloud Engineer specializing in developing cloud native solutions and automation. He holds multiple AWS Certifications and has extensive experience working with high-tier AWS partners. Marco excels at leveraging cloud technologies to drive innovation and efficiency in various projects.

Yaoqi Zhang is a Senior Big Data Engineer at Mission Cloud. She specializes in leveraging AI and ML to drive innovation and develop solutions on AWS. Before Mission Cloud, she worked as an ML and software engineer at Amazon for six years, specializing in recommender systems for Amazon fashion shopping and NLP for Alexa. She received her Master of Science Degree in Electrical Engineering from Boston University.

Adrian Martin is a Big Data/Machine Learning Lead Engineer at Mission Cloud. He has extensive experience in English/Spanish interpretation and translation.

Ryan Ries holds over 15 years of leadership experience in data and engineering, over 20 years of experience working with AI and 5+ years helping customers build their AWS data infrastructure and AI models. After earning his Ph.D. in Biophysical Chemistry at UCLA and Caltech, Dr. Ries has helped develop cutting-edge data solutions for the U.S. Department of Defense and a myriad of Fortune 500 companies.

Andrew Federowicz is the IT and Product Lead Director for Magellan VoiceWorks at MagellanTV. With a decade of experience working in cloud systems and IT in addition to a degree in mechanical engineering, Andrew designs, builds, deploys, and scales inventive solutions to unique problems. Before Magellan VoiceWorks, Andrew architected and built the AWS infrastructure for MagellanTV’s 24/7 globally available streaming app. In his free time, Andrew enjoys sim racing and horology.

Qiong Zhang, PhD, is a Sr. Partner Solutions Architect at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI. She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.

Cristian Torres is a Sr. Partner Solutions Architect at AWS. He has 10 years of experience working in technology performing several roles such as: Support Engineer, Presales Engineer, Sales Specialist and Solutions Architect. He works as a generalist with AWS services focusing on Migrations to help strategic AWS Partners develop successfully from a technical and business perspective.

Read More

How Mixbook used generative AI to offer personalized photo book experiences

This post is co-written with Vlad Lebedev and DJ Charles from Mixbook.

Mixbook is an award-winning design platform that gives users unrivaled creative freedom to design and share one-of-a-kind stories, transforming the lives of more than six million people. Today, Mixbook is the #1 rated photo book service in the US with 26 thousand five-star reviews.

Mixbook is empowering users to share their stories with creativity and confidence. Their mission is to assist users in celebrating the beautiful moments of their lives. Mixbook aims to foster the profound connections between users and their loved ones through sharing of their stories in both physical and digital mediums.

Years ago, Mixbook undertook a strategic initiative to transition their operational workloads to Amazon Web Services (AWS), a move that has continually yielded significant advantages. This pivotal decision has been instrumental in propelling them towards fulfilling their mission, ensuring their system operations are characterized by reliability, superior performance, and operational efficiency.

In this post we show you how Mixbook used generative artificial intelligence (AI) capabilities in AWS to personalize their photo book experiences—a step towards their mission.

Business Challenge

In today’s digital world, we have a lot of pictures that we take and share with our friends and family. Let’s consider a scenario where we have hundreds of photos from a recent family vacation, and we want to create a coffee-table photo-book to make it memorable. However, choosing the best pictures from the lot and describing them with captions can take a lot of time and effort. As we all know, a picture’s worth a thousand words, which is why trying to sum up a moment with a caption of just six to ten words can be so challenging. Mixbook really gets the problem, and they’re here to fix it.

Solution

Mixbook Smart Captions is the magical solution to the caption conundrum. It doesn’t only interpret user photos; it also adds a sprinkle of creativity, making the stories pop.

Most importantly, Smart Captions doesn’t fully automate the creative process. Instead, it provides a creative partner to enable the user’s own storytelling to imbue a book with personal flourishes. Whether it’s a selfie or a scenic shot, the goal is to make sure users’ photos speak volumes, effortlessly.

Architecture overview

The implementation of the system involves three primary components:

  • Data intake
  • Information inference
  • Creative synthesis

Caption generation is heavily reliant on the inference process, because the quality and meaningfulness of the comprehension process output directly influence the specificity and personalization of the caption generation. The data flow of the caption generation process is described in the text that follows.

Data intake

A user uploads photos into Mixbook. The raw photos are stored in Amazon Simple Storage Service (Amazon S3).

The data intake process involves three macro components: Amazon Aurora MySQL-Compatible Edition, Amazon S3, and AWS Fargate for Amazon ECS. Aurora MySQL serves as the primary relational data store for tracking and recording media file upload sessions and their accompanying metadata. It offers flexible capacity options, ranging from serverless on one end to reserved provisioned instances for predictable long-term use on the other. Amazon S3, in turn, provides efficient, scalable, and secure storage for the media file objects themselves. Its storage classes enable recent uploads to be kept in a warm state for low-latency access, while older objects can be transitioned to Amazon S3 Glacier tiers, minimizing storage expenses over time. Amazon Elastic Container Service (Amazon ECS), used in conjunction with the low-maintenance compute environment of AWS Fargate, forms a convenient orchestrator for containerized workloads, bringing all components together seamlessly.
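As a sketch of the storage-tiering piece, a lifecycle rule like the following keeps recent uploads in the standard tier and archives older objects; the bucket name, prefix, and transition ages are illustrative assumptions.

## Example S3 lifecycle rule for archiving older uploads (bucket, prefix, and ages are illustrative)
import boto3
s3=boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='mixbook-media-uploads',
    LifecycleConfiguration={'Rules': [{
        'ID': 'archive-old-uploads',
        'Filter': {'Prefix': 'uploads/'},
        'Status': 'Enabled',
        # Recent uploads stay warm in S3 Standard; older objects move to Glacier tiers
        'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER_IR'},
                        {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}]}]})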

Inference

The comprehension phase extracts essential contextual and semantic elements from the input, including image descriptions, temporal and spatial data, facial recognition, emotional sentiment, and labels. Among these, the image descriptions generated by a computer vision model offer the most fundamental understanding of the captured moments. Amazon Rekognition delivers precise detection of face bounding boxes and emotional expressions. The detected bounding boxes are primarily used for optimal automatic photo placement and cropping, while the detected emotions help select a better story tone, for example funnier or more nostalgic. Furthermore, Amazon Rekognition enhances safety by identifying potentially objectionable content.
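A minimal sketch of that Amazon Rekognition call, returning face bounding boxes and the dominant emotion per face, follows; the S3 location in the usage note is a placeholder.

## Detect face bounding boxes and emotions with Amazon Rekognition (sketch)
import boto3
rekognition=boto3.client('rekognition')
def analyze_photo(bucket, key):
    response=rekognition.detect_faces(
        Image={'S3Object': {'Bucket': bucket, 'Name': key}},
        Attributes=['ALL'])  # 'ALL' includes emotions, not just bounding boxes
    faces=[]
    for face in response['FaceDetails']:
        top_emotion=max(face['Emotions'], key=lambda e: e['Confidence'])
        faces.append({'bounding_box': face['BoundingBox'],  # used for placement and cropping
                      'emotion': top_emotion['Type']})      # used to adjust story tone
    return faces
# Example (placeholder object): analyze_photo('mixbook-media-uploads', 'uploads/family/IMG_0042.jpg')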

The inference pipeline is powered by an AWS Lambda-based multi-step architecture, which maximizes cost-efficiency and elasticity by running independent image analysis steps in parallel. AWS Step Functions enables the synchronization and ordering of interdependent steps.

The image captions are generated by an Amazon SageMaker inference endpoint, which is enhanced by an Amazon ElastiCache for Redis-powered buffer. The buffer was implemented after benchmarking the captioning model’s performance. The benchmarking revealed that the model performed optimally when processing batches of images, but underperformed when analyzing individual images.
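That buffering pattern can be sketched as follows: image references accumulate in Redis, and the SageMaker endpoint is invoked only once a full batch is available. The endpoint name, batch size, Redis keys, and request payload are assumptions for illustration.

## Batch image captioning through a Redis buffer (sketch; names and payload are illustrative)
import json
import boto3
import redis
BATCH_SIZE=8  # illustrative; the real value comes from benchmarking
cache=redis.Redis(host='localhost', port=6379)
sm_runtime=boto3.client('sagemaker-runtime')
def enqueue_image(s3_uri):
    # Buffer image references until a full batch can be captioned in one call
    cache.rpush('caption:pending', s3_uri)
    if cache.llen('caption:pending') < BATCH_SIZE:
        return None
    batch=[uri.decode() for uri in cache.lrange('caption:pending', 0, BATCH_SIZE-1)]
    cache.ltrim('caption:pending', BATCH_SIZE, -1)
    response=sm_runtime.invoke_endpoint(
        EndpointName='image-captioning-endpoint',
        ContentType='application/json',
        Body=json.dumps({'images': batch}))
    return json.loads(response['Body'].read())  # list of generated captions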

Generation

The caption-generating mechanism behind the writing assistant feature is what turns Mixbook Studio into a natural language story-crafting tool. Powered by a Llama language model, the assistant initially used carefully engineered prompts created by AI experts. However, the Mixbook Storyarts team sought more granular control over the style and tone of the captions, leading to a diverse team that included an Emmy-nominated scriptwriter reviewing, adjusting, and adding unique handcrafted examples. This resulted in a process of fine-tuning the model, moderating modified responses, and deploying approved models for experimental and public releases. After inference, three captions are created and stored in Amazon Relational Database Service (Amazon RDS).

The following image shows the Mixbook Smart Captions feature in Mixbook Studio.

Benefits

Mixbook implemented this solution to provide new features to their customers, improving the user experience while maintaining operational efficiency.

User experience

  • Enhanced storytelling: Captures the users’ emotions and experiences, now beautifully expressed through captions that are heartfelt.
  • User delight: Adds an element of surprise with captions that aren’t just accurate, but also delightful and imaginative. A delighted user, Hanie U, says, “I hope there are more captions experiences released in the future.” Another user, Megan P., says, “It worked great!” Users can also edit the generated captions.
  • Time efficiency: Nobody has the time to struggle with captions. The feature saves precious time while making user stories shine bright.
  • Safety and correctness: The captions were generated responsibly, leveraging the guard-rails to ensure content moderation and relevancy.

System

  • Elasticity and scalability of Lambda
  • Comprehensible workflow orchestration with Step Functions
  • Variety of base models from SageMaker and tuning capabilities for maximum control

As a result of their improved user delight, Mixbook has been named as an official honoree of the Webby Awards in 2024 for Apps & Software Best Use of AI & Machine Learning.

“AWS enables us to scale the innovations our customers love most. And now, with the new AWS generative AI capabilities, we are able to blow our customers’ minds with creative power they never thought possible. Innovations like this are why we’ve been partnered with AWS since the beta in 2006.”

– Andrew Laffoon, CEO, Mixbook

Conclusion

Mixbook started experimenting with AWS generative AI solutions to augment their existing application in early 2023. They began with a quick proof of concept to demonstrate the art of the possible. Continuous development, testing, and integration using the breadth of AWS services in compute, storage, analytics, and machine learning allowed them to iterate quickly. After they released the Smart Captions feature in beta, they were able to quickly adjust according to real-world usage patterns and protect the product’s value.

Try out Mixbook Studio to experience the storytelling. To learn more about AWS generative AI solutions, start with Transform your business with generative AI. To hear more from Mixbook leaders, listen to the AWS re:Think Podcast available from Art19, Apple Podcasts, and Spotify.


About the authors

Vlad Lebedev is a Senior Technology Leader at Mixbook. He leads a product-engineering team responsible for transforming Mixbook into a place for heartfelt storytelling. He draws on over a decade of hands-on experience in web development, system design, and data engineering to drive elegant solutions for complex problems. Vlad enjoys learning about both contemporary and ancient cultures, their histories, and languages.

DJ Charles is the CTO at Mixbook. He has enjoyed a 30-year career architecting interactive and e-commerce designs for top brands. Innovating broadband tech for the cable industry in the ’90s, revolutionizing supply-chain processes in the 2000s, and advancing environmental tech at Perillon led to global real-time bidding platforms for brands like Sotheby’s & eBay. Beyond tech, DJ loves learning new musical instruments, the art of songwriting, and deeply engages in music production & engineering in his spare time.

Malini Chatterjee is a Senior Solutions Architect at AWS. She provides guidance to AWS customers on their workloads across a variety of AWS technologies. She brings a breadth of expertise in Data Analytics and Machine Learning. Prior to joining AWS, she was architecting data solutions in financial industries. She is very passionate about semi-classical dancing and performs in community events. She loves traveling and spending time with her family.

Jessica Oliveira is an Account Manager at AWS who provides guidance and support to Commercial Sales in Northern California. She is passionate about building strategic collaborations to help ensure her customers’ success. Outside of work, she enjoys traveling, learning about different languages and cultures, and spending time with her family.

Read More

RUBICON: Evaluating conversations between humans and AI systems

This paper has been accepted at the 1st ACM International Conference on AI-powered Software (AIware 2024), co-located with FSE 2024. AIware is the premier international forum on AI-powered software.

RUBICON paper at AIware 2024

Generative AI has redefined the landscape of AI assistants in software development, with innovations like GitHub Copilot providing real-time, chat-based programming support. As these tools increase in sophistication and domain specialization, assessing their impact on user interactions becomes more challenging. Developers frequently question whether modifications to their AI assistants genuinely improve the user experience, as indicated in a recent paper.

Traditional feedback mechanisms, such as simple thumbs-up or thumbs-down ratings, fall short in capturing the complexities of interactions within specialized settings, where nuanced data is often sparse. To address this issue, we introduce “RUBICON: Rubric-based Evaluation of Domain Specific Human-AI Conversations,” presented at AIware 2024. RUBICON is an automated assessment technique that transforms a minimal dataset into an extensive array of domain-specific rubrics, helping ensure that updates not only modify but meaningfully improve user interactions.

Foundational communication principles

Effective conversation, whether human-to-human or human-to-AI, adheres to four maxims outlined by philosopher Paul Grice: quantity, quality, relation, and manner, ensuring that communication is concise, truthful, pertinent, and clear. In AI applications, they help create interactions that feel natural and engaging, fostering trust and empathy. Within domain-specific settings, RUBICON adapts these principles to ensure they are context-aware, improving the utility and clarity of interactions. For example, in Visual Studio, the AI helps the developer debug a program by providing detailed explanations and relevant code examples, shown in Figure 1. In Figure 2, its responses reflect that it’s guided by context.

Figure 1. Contrasting interactions with two versions of the Visual Studio Debugging Assistant for the same task. On the left, the assistant makes assumptions without seeking clarification. On the right, the assistant proactively investigates the error, collaborates with the developer to gather essential information, and achieves a practical solution.
Figure 2. Context awareness significantly improves the AI assistant’s efficacy. The response on the left is generic, superficially referring to the developer’s code and restating the obvious, providing little value. The reply on the right directs the developer toward a specific solution, the toJSON method.

In task-oriented environments, it’s important to assess how well a conversation aligns with user expectations and assists in achieving their goals. Conversations are only useful if they advance the user’s interests, and challenges can arise when users have misaligned expectations of the AI’s capabilities or when the AI directs the conversation too forcefully, prioritizing its methods over the user’s preferences. RUBICON balances the interaction dynamics between the AI and developer, promoting constructive exchanges without overwhelming or under-engaging. It calibrates the extent to which the AI should hypothesize and resolve issues versus how much it should leave to the developer.

RUBICON’s rubric-based method and evaluation

RUBICON builds on SPUR, the recently introduced Supervised Prompting for User Satisfaction Rubrics framework, increasing its scope by crafting a broad spectrum of candidate rubrics from each batch of data. It uses a language model to create concise summaries that assess the quality of conversations, emphasizing communication principles, task orientation, and domain specificity. It identifies signals of user satisfaction and outlines the shared responsibilities of the user and the AI in achieving task objectives. These summaries are then refined into rubrics.

RUBICON’s novel selection algorithm sifts through numerous candidates to identify a select group of high-quality rubrics, enhancing their predictive accuracy in practical applications, as illustrated in Figure 3. The technique doesn’t require human intervention and can be trained directly on anonymized conversational data, helping to ensure customer data privacy while still extracting the important features for analysis.

Figure 3. Overview of RUBICON’s framework and the various steps involved.
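The selection step can be pictured with a simplified greedy sketch: score each candidate rubric by how well it separates labeled positive and negative conversations, then keep the top performers. This is an illustrative approximation, not the selection policy from the paper; score_fn stands in for an LLM-based grader.

## Simplified greedy rubric selection (illustrative approximation, not the paper's algorithm)
def select_rubrics(candidates, labeled_conversations, score_fn, k=10):
    # candidates: rubric strings proposed by the LLM
    # labeled_conversations: (conversation, label) pairs with label 1=positive, 0=negative
    # score_fn(rubric, conversation) -> score in [0, 1], e.g. an LLM-based grader
    def separation(rubric):
        pos=[score_fn(rubric, c) for c, y in labeled_conversations if y == 1]
        neg=[score_fn(rubric, c) for c, y in labeled_conversations if y == 0]
        if not pos or not neg:
            return 0.0
        # Prefer rubrics whose scores differ most between good and bad conversations
        return sum(pos)/len(pos) - sum(neg)/len(neg)
    return sorted(candidates, key=separation, reverse=True)[:k]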

The effectiveness of RUBICON’s method is evidenced by its rubrics, which show an 18% increase in accuracy over SPUR in classifying conversations as positive or negative, as shown in Figure 4. Additionally, RUBICON achieves near-perfect precision in predicting conversation labels in 84% of cases involving unlabeled data.

Figure 4. Two analogous conversations facilitated by the Debugger AI assistant are evaluated against representative rubrics. Software engineers who evaluated the conversations found the one on the left less effective and the one on the right more so. RUBICON’s rubric also gave a higher score to the conversation on the right, demonstrating that RUBICON’s method of evaluation is consistent with that of the software engineers.

RUBICON-generated rubrics 

RUBICON-generated rubrics serve as a framework for understanding user needs, expectations, and conversational norms. These rubrics have been successfully implemented in the Visual Studio IDE, where they have guided the analysis of over 12,000 debugging conversations, offering valuable insights into the effectiveness of modifications made to the assistant and facilitating rapid iteration and improvement. For example, the rubrics “The AI gave a solution too quickly, rather than asking the user for more information and trying to find the root cause of the issue” and “The AI gave a mostly surface-level solution to the problem” have indicated issues where the assistant prematurely offered solutions without gathering sufficient information. These findings led to adjustments in the AI’s behavior, making it more investigative and collaborative.

Beyond conversational dynamics, the rubrics also identify systemic design flaws not directly tied to the conversational assistant. These include user interface issues that impede the integration of new code and gaps in user education regarding the assistant’s capabilities. To use RUBICON, developers need a small set of labeled conversations from their AI assistant and specifically designed prompts that reflect the criteria for task progression and completion. The methodology and examples of these rubrics are detailed in the paper.

Implications and looking ahead

Developers of AI assistants value clear insights into the performance of their interfaces. RUBICON represents a valuable step toward a refined evaluation system that is sensitive to domain-specific tasks, adaptable to changing usage patterns, efficient, easy to implement, and privacy-conscious. A robust evaluation system like RUBICON can help improve the quality of these tools without compromising user privacy or data security. As we look ahead, our goal is to broaden the applicability of RUBICON beyond debugging in AI assistants like GitHub Copilot. We aim to support additional tasks like migration and scaffolding within IDEs, extending its utility to other chat-based Copilot experiences across various products.

The post RUBICON: Evaluating conversations between humans and AI systems appeared first on Microsoft Research.

Read More

Whispering Experts: Toxicity Mitigation in Pre-trained Language Models by Dampening Expert Neurons

An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AURA), an intervention that can be applied to any pre-trained LLM to mitigate toxicity. As the intervention is proportional to the ability of each neuron to discriminate toxic content, it is free of any model-dependent…Apple Machine Learning Research

Contrasting Multiple Representations with the Multi-Marginal Matching Gap

Learning meaningful representations of complex objects that can be seen through multiple (k ≥ 3) views or modalities is a core task in machine learning. Existing methods use losses originally intended for paired views, and extend them to k views, either by instantiating k(k−1)/2 loss-pairs, or by using reduced embeddings, following a one vs. average-of-rest strategy. We propose the multi-marginal matching gap (M3G), a loss that borrows tools from multi-marginal optimal transport (MM-OT) theory to…Apple Machine Learning Research

A Direct Algorithm for Multi-Gyroscope Infield Calibration

In this paper, we address the problem of estimating the rotational extrinsics, as well as the scale factors of two gyroscopes rigidly mounted on the same device. In particular, we formulate the problem as a least-squares minimization and introduce a direct algorithm that computes the estimated quantities without any iterations, hence avoiding local minima and improving efficiency. Furthermore, we show that the rotational extrinsics are observable while the scale factors can be determined up to global scale for general configurations of the gyroscopes. To this end, we also study special…Apple Machine Learning Research

CodeAct: Your LLM Agent Acts Better when Generating Code

Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents’ actions into a unified action space (CodeAct). Integrated with a…Apple Machine Learning Research