Theoretical analysis provides insight into the optimization process during model training and reveals that for some optimizations, the Gaussian attention kernel may work better than softmax.
5 ways Open Health Stack is helping developers address healthcare gaps
Open Health Stack, a set of open-source tools from Google, allows developers to create digital health solutions for low-resource settings around the world.
Research Focus: Week of December 16, 2024
NEW RESEARCH
NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering
The Compute Express Link (CXL) open standard interconnect enables integration of diverse types of memory into servers via its byte-addressable SerDes links. To fully utilize CXL-based heterogeneous memory systems (which combine different types of memory with varying access speeds), it’s necessary to implement efficient memory tiering—a strategy to manage data placement across memory tiers for optimal performance. Efficiently managing these memory systems is crucial, but has been challenging due to the lack of precise and efficient tools for understanding how memory is accessed.
In a recent paper, NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering, researchers from Microsoft propose a hardware/software co-design to address this problem. NeoMem offloads memory profiling functions to CXL device-side controllers, integrating a dedicated hardware unit called NeoProf, which monitors memory accesses and provides the operating system (OS) with crucial page hotness statistics and other system state information. On the OS kernel side, the researchers designed a revamped memory-tiering strategy, enabling accurate and timely hot page promotion based on NeoProf statistics. Implemented on a real FPGA-based CXL memory platform and Linux kernel v6.3, NeoMem demonstrated 32% to 67% geomean speedup over several existing memory tiering solutions.
NEW RESEARCH
Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases
Planning and conducting chemical syntheses is a significant challenge in the discovery of functional small molecules, which limits the potential of generative AI for molecular inverse design. Although early machine learning-based retrosynthesis models have shown the ability to predict reasonable routes, they are less accurate for infrequent, yet important reactions.
In a recent paper: Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases, researchers from Microsoft and external colleagues address this limitation, with a new framework for building highly accurate reaction models. Chimera incorporates two newly developed models, each achieving state-of-the-art performance in their respective categories. Evaluations by PhD-level organic chemists show that Chimera’s predictions are preferred for their higher quality compared to baseline models.
The researchers further validate Chimera’s robustness by applying its largest-scale model to an internal dataset from a major pharmaceutical company, demonstrating its ability to generalize effectively under distribution shifts. This new framework shows the potential to substantially accelerate the development of even more accurate and versatile reaction prediction models.
NEW RESEARCH
The GA4GH Task Execution API: Enabling Easy Multicloud Task Execution
In bioinformatics and computational biology, data analysis often involves chaining command-line programs developed by specialized teams at different institutions. These tools, which vary widely in age, software stacks, and dependencies, lack a common programming interface, which makes integration, workflow management and reproducibility challenging.
A recent article emphasizes the development, adoption and implementation of the Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API, created in collaboration with researchers at Microsoft and other institutions. The TES API offers a unified schema and interface for submitting and managing tasks, seamlessly bridging gaps between on-premises high-performance and high-throughput computing systems, cloud platforms, and hybrid infrastructures. Its flexibility and extensibility have already made it a critical asset for applications ranging from federated data analysis to load balancing across multi-cloud systems.
Adopted by numerous service providers and integrated into several workflow engines, TES empowers researchers to execute complex computational tasks through a single, abstracted interface. This eliminates compatibility hurdles, accelerates research timelines, reduces costs and enables “compute to data” solutions—essential for tackling the challenges of distributed data analysis.
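To make this concrete, the following sketch submits a minimal task to a TES endpoint using Python's requests library, following the structure of the public TES v1 schema. The endpoint URL, container image, and file locations are placeholders, not a specific provider's values.

```python
import requests

# A minimal TES v1 task: one executor running a single command in a
# container, with one staged-in input file. All URLs and paths below
# are placeholders.
task = {
    "name": "md5sum-example",
    "executors": [
        {
            "image": "ubuntu:22.04",
            "command": ["md5sum", "/data/input.txt"],
        }
    ],
    "inputs": [
        {"url": "s3://example-bucket/input.txt", "path": "/data/input.txt"}
    ],
    "resources": {"cpu_cores": 1, "ram_gb": 1.0},
}

# Any TES-compliant service exposes the same endpoint; the response contains
# the new task's ID, which can be polled via GET /ga4gh/tes/v1/tasks/{id}.
resp = requests.post("https://tes.example.org/ga4gh/tes/v1/tasks", json=task)
resp.raise_for_status()
print(resp.json()["id"])
```

Because every TES implementation accepts this same payload, the identical script can target an on-premises HPC gateway or a cloud-hosted service simply by changing the base URL.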
NEW RESEARCH
RedCode: Risky Code Execution and Generation Benchmark for Code Agents
Increasing use of code agents for AI-assisted coding and software development has brought safety and security concerns, such as generating or executing malicious code, which have become significant barriers to real-world deployment of these agents.
In a recent paper: RedCode: Risky Code Execution and Generation Benchmark for Code Agents, published at NeurIPS 2024, researchers from Microsoft and external colleagues propose comprehensive and practical evaluations on the safety of code agents. RedCode is an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests.
This research evaluated three agents based on various large language models (LLMs), providing insights into code agents’ vulnerabilities. For instance, results showed that agents are more likely to reject executing unsafe operations on the operating system. Unsafe operations described in natural text lead to a lower rejection rate than those in code format. Additional evaluations revealed that more capable base models and agents with stronger overall coding abilities, such as GPT-4, tend to produce more sophisticated harmful software.
These findings highlight the need for stringent safety evaluations for diverse code agents. The underlying dataset and related code are publicly available at https://github.com/AI-secure/RedCode.
NEW RESEARCH
Towards industrial foundation models: Integrating large language models with industrial data intelligence
Although large language models (LLMs) excel at language-focused tasks like news writing, document summarization, customer service, and supporting virtual assistants, they can face challenges when it comes to learning and inference on numeric and structured industry data, such as tabular and time series data. To address these issues, researchers from Microsoft propose a new approach to building industrial foundation models (IFMs). As outlined in a recent blog post, they have successfully demonstrated the feasibility of cross-domain universal in-context learning on tabular data and its significant potential.
The researchers designed Generative Tabular Learning (GTL), a new framework that integrates multi-industry zero-shot and few-shot learning capabilities into LLMs. This approach allows the models to adapt and generalize to new fields, new data, and new tasks more effectively, flexibly responding to diverse data science tasks. This technical paradigm has been open-sourced to promote broader use.
Microsoft Research in the news
Microsoft’s smaller AI model beats the big guys: Meet Phi-4, the efficiency king
December 12, 2024
Microsoft launched a new artificial intelligence model today that achieves remarkable mathematical reasoning capabilities while using far fewer computational resources than its larger competitors.
Microsoft researcher Ece Kamar discusses the future of AI agents in 2025
Tech Brew | December 12, 2024
With AI agents widely expected to take off in 2025, the director of Microsoft’s AI Frontiers lab weighs in on the future of this technology, the safeguards needed, and the year ahead in AI research.
A new frontier awaits — computing with light
December 12, 2024
In the guts of a new type of computer, a bunch of tiny LEDs emit a green glow. Those lights have a job to do. They’re performing calculations. Right now, this math is telling the computer how to identify handwritten images of numbers. The computer is part of a research program at Microsoft.
Imbue’s Kanjun Qiu Shares Insights on How to Build Smarter AI Agents
Imagine a future in which everyone is empowered to build and use their own AI agents. That future may not be far off, as new software is infused with intelligence through collaborative AI systems that work alongside users rather than merely automating tasks.
In this episode of the NVIDIA AI Podcast, Kanjun Qiu, CEO of Imbue, discusses the rise of AI agents, drawing parallels between the personal computer revolution of the late 1970s and 80s and today’s AI agent transformation. She details Imbue’s approach to building reasoning capabilities into its products, the challenges of verifying the correctness of AI outputs and how Imbue is focusing on post-training and fine-tuning to improve verification capabilities.
Learn more about Imbue, and read more about AI agents, including how virtual assistants can enhance customer service experiences.
And hear more about the future of AI and graphics by tuning in to the CES keynote, delivered by NVIDIA founder and CEO Jensen Huang live in Las Vegas on Monday, Jan. 6, at 6:30 p.m. PT.
Time Stamps
1:21 – What are AI agents? And Imbue’s approach to them.
9:00 – Where are AI agents being used the most today?
17:05 – Why building a good user experience around agents requires invention.
26:28 – How reasoning and verification capabilities factor into Imbue’s products.
You Might Also Like…
Zoom CTO Xuedong “XD” Huang on How AI Revolutionizes Productivity
Zoom is now transforming into an AI-first platform. CTO Xuedong Huang discusses Zoom’s AI Companion 2.0 and the company’s “federated AI” strategy, which aims to integrate multiple large language models to enhance productivity and collaboration.
How Roblox Uses Generative AI to Enhance User Experiences
Roblox is enhancing its colorful online platform with generative AI to improve user safety and inclusivity through features like automated chat filters and real-time text translation. Anupam Singh, VP of AI and growth engineering at Roblox, explores how AI coding assistants are helping creators focus more on creative expression.
Rendered.ai CEO Nathan Kundtz on Using AI to Build Better AI
Data is crucial for training AI and machine learning systems, and synthetic data offers a solution to the challenges of compiling real-world data. Nathan Kundtz, founder and CEO of Rendered.ai, discusses how his company’s platform generates synthetic data to enhance AI models.
Subscribe to the AI Podcast
Get the AI Podcast through Apple Podcasts, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.
AI at Your Service: Digital Avatars With Speech Capabilities Offer Interactive Customer Experiences
Editor’s note: This post is part of the AI On blog series, which explores the latest techniques and real-world applications of agentic AI, chatbots and copilots. The series will also highlight the NVIDIA software and hardware powering advanced AI agents, which form the foundation of AI query engines that gather insights and perform tasks to transform everyday experiences and reshape industries.
To enhance productivity and upskill workers, organizations worldwide are seeking ways to provide consistent, around-the-clock customer service with greater speed, accuracy and scale.
Intelligent AI agents offer one such solution. They deliver advanced problem-solving capabilities and integrate vast and disparate sources of data to understand and respond to natural language.
Powered by generative AI and agentic AI, digital avatars are boosting efficiency across industries like healthcare, telecom, manufacturing, retail and more. According to Gartner, by 2028, 45% of organizations with more than 500 employees will use employee AI avatars to expand the capacity of human capital.1
From educating prospects on policies to giving customers personalized solutions, AI is helping organizations optimize revenue streams and elevate employee knowledge and productivity.
Where Context-Aware AI Avatars Are Most Impactful
Staying ahead in a competitive, evolving market requires continuous learning and analysis. AI avatars — also referred to as digital humans — are addressing key concerns and enhancing operations across industries.
One key benefit of agentic digital human technology is the ability to offer consistent, multilingual support and personalized guidance for a variety of use cases.
For instance, a medical-based AI agent can provide 24/7 virtual intake and support telehealth services. Or, a virtual financial advisor can help enhance client security and financial literacy by alerting bank customers to potential fraud, or by delivering personalized offers and investment tips based on their unique portfolios.
These digital humans boost efficiency, cut costs and enhance customer loyalty. Some key ways digital humans can be applied include:
- Personalized, On-Brand Customer Assistance: Digital human interfaces can provide a personal touch when educating new customers on a company’s products and service portfolios. They can also provide ongoing customer support, offering immediate responses and solving problems without the need for a live operator.
- Enhanced Employee Onboarding: Intelligent AI assistants can offer streamlined, adaptable, personalized employee onboarding, whether in hospitals or offices, by providing consistent access to updated institutional knowledge at scale. With pluggable, customizable retrieval-augmented generation (RAG), these assistants can deliver real-time answers to queries while maintaining a deep understanding of company-specific data.
- Seamless Communication Across Languages: In global enterprises, communication barriers can slow down operations. AI-powered avatars with natural language processing capabilities can communicate effortlessly across languages. This is especially useful in customer service or employee training environments where multilingual support is crucial.
Learn more by listening to the NVIDIA AI Podcast episode with Kanjun Qiu, CEO of Imbue, who shares insights on how to build smarter AI agents.
Interactive AI Agents With Text-to-Speech and Speech-to-Text
With text-to-speech and speech-to-text capabilities, AI agents can offer enhanced interactivity and engagement in customer service interactions.
SoftServe, an IT consulting and digital services provider, has built several digital humans for a variety of use cases, highlighting the technology’s potential to enhance user experiences.
SoftServe’s Digital Concierge is accelerated by NVIDIA AI Blueprints and NVIDIA ACE technologies to rapidly deploy scalable, customizable digital humans across diverse infrastructures.
GEN, SoftServe’s virtual customer service assistant and digital concierge, makes customer service more engaging by providing lifelike interactions, continuous availability, personalized responses and simultaneous access to all necessary knowledge bases.
SoftServe also developed FINNA, an AI-powered virtual financial advisor that can provide financial guidance tailored to a client’s profile and simplify complex financial terminology. It helps streamline onboarding and due diligence, supporting goal-oriented financial planning and risk assessment.
AISHA is another AI-powered digital human developed by SoftServe with NVIDIA technology. Created for the UAE Ministry of Justice, the digital human significantly improves judicial processes by reducing case review times, enhancing the accuracy of rulings and providing rapid access to legal databases. It demonstrates how generative AI can bridge the gap between technology and meaningful user interaction to enhance customer service and operational efficiency in the judicial sector.
How to Design AI Agents With Avatar and Speech Features
Designing AI agents with avatar and speech features involves several key steps:
- Determine the use case: Choose between 2D or 3D avatars based on the required level of immersion and interaction.
- Avatar development:
- For 3D avatars, use specialized software and technical expertise to create lifelike movements and photorealism.
- For 2D avatars, opt for quicker development suitable for web-embedded solutions.
- Integrate speech technologies: Use NVIDIA Riva for world-class automatic speech recognition, along with text-to-speech, to enable verbal interactions (see the sketch after this list).
- Rendering options: Use NVIDIA Omniverse RTX Renderer technology or Unreal Engine tools for 3D avatars to achieve high-quality output and compute efficiency.
- Deployment: Tap cloud-native deployment for real-time output and scalability, particularly for interactive web or mobile applications.
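As a rough illustration of the speech-integration step above, the following sketch synthesizes a spoken reply with the NVIDIA Riva Python client. It assumes a Riva server is already running; the server address and voice name are deployment-specific assumptions.

```python
import riva.client

# Connect to a running Riva server (address is a placeholder).
auth = riva.client.Auth(uri="localhost:50051")
tts = riva.client.SpeechSynthesisService(auth)

# Synthesize a short agent reply; the voice name depends on the models
# deployed on your Riva server.
resp = tts.synthesize(
    "Hello! How can I help you today?",
    voice_name="English-US.Female-1",
    language_code="en-US",
    sample_rate_hz=44100,
)

# resp.audio holds raw PCM samples that can be written out or streamed
# to the avatar's audio pipeline.
with open("reply.pcm", "wb") as f:
    f.write(resp.audio)
```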
For an overview on how to design interactive customer service tools, read the technical blogs on how to “Build a Digital Human Interface for AI Apps With an NVIDIA AI Blueprint” and “Expanding AI Agent Interface Options With 2D and 3D Digital Human Avatars.”
NVIDIA AI Blueprint for Digital Humans
The latest release of the NVIDIA AI Blueprint for digital humans introduces several updates that enhance the interactivity and responsiveness of digital avatars, including dynamic switching between RAG models. Users can experience this directly in preview.
The integration of the Audio2Face-2D microservice in the blueprint means developers can create 2D digital humans, which require significantly less processing power compared with 3D models, for web- and mobile-based applications.
2D avatars are better suited for simpler interactions and platforms where photorealism isn’t necessary. This makes them ideal for scenarios like telemedicine, where quick loading times with lower bandwidth requirements are crucial.
Another significant update is the introduction of user attention detection through vision AI. This feature enables digital humans to detect when a user is present — even if they are idle or on mute — and initiate interaction, such as greeting the user. This capability is particularly beneficial in kiosk scenarios, where engaging users proactively can enhance the service experience.
Getting Started
NVIDIA AI Blueprints make it easy to start building and setting up virtual assistants by offering ready-made workflows and tools to accelerate deployment. Whether for a simple AI-powered chatbot or a fully animated digital human interface, the blueprints offer resources to create AI assistants that are scalable, aligned with an organization’s brand and deliver a responsive, efficient customer support experience.
1. Gartner®, Hype Cycle for the Future of Work, 2024, Tori Paulman, Emily Rose, et al., July 2024
GARTNER is a registered trademark and service mark and Hype Cycle is a trademark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model
This post is co-written with Marta Cavalleri and Giovanni Germani from Fastweb, and Claudia Sacco and Andrea Policarpi from BIP xTech.
AI’s transformative impact extends throughout the modern business landscape, with telecommunications emerging as a key area of innovation. Fastweb, one of Italy’s leading telecommunications operators, recognized the immense potential of AI technologies early on and began investing in this area in 2019. With a vision to build a large language model (LLM) trained on Italian data, Fastweb embarked on a journey to make this powerful AI capability available to third parties.
Training an LLM is a compute-intensive and complex process, which is why Fastweb, as a first step in their AI journey, used AWS generative AI and machine learning (ML) services such as Amazon SageMaker HyperPod.
SageMaker HyperPod can provision and maintain large-scale, resilient compute clusters powered by thousands of accelerators, such as AWS Trainium chips and NVIDIA H200 and H100 graphics processing units (GPUs). Its flexibility, however, allowed Fastweb to deploy a small, agile, on-demand cluster, enabling efficient resource utilization and cost management that aligned well with the project’s requirements.
In this post, we explore how Fastweb used cutting-edge AI and ML services to embark on their LLM journey, overcoming challenges and unlocking new opportunities along the way.
Fine-tuning Mistral 7B on AWS
Fastweb recognized the importance of developing language models tailored to the Italian language and culture. To achieve this, the team built an extensive Italian language dataset by combining public sources and acquiring licensed data from publishers and media companies. Using this data, Fastweb ran its first LLM training experiment: fine-tuning Mistral 7B, a state-of-the-art LLM. The team successfully adapted the model to handle tasks such as summarization, question answering, and creative writing in Italian, applying a nuanced understanding of Italian culture to the model’s responses and producing contextually appropriate, culturally sensitive output.
The team opted for fine-tuning on AWS. This strategic decision was driven by several factors:
- Efficient data preparation – Building a high-quality pre-training dataset is a complex task, involving assembling and preprocessing text data from various sources, including web sources and partner companies. Because the final, comprehensive pre-training dataset was still under construction, it was essential to begin with an approach that could adapt existing models to Italian.
- Early results and insights – Fine-tuning allowed the team to achieve early results in training models on the Italian language, providing valuable insights and preliminary Italian language models. This enabled the engineers to iteratively improve the approach based on initial outcomes.
- Computational efficiency – Fine-tuning requires significantly less computational power and less time to complete compared to a complete model pre-training. This approach streamlined the development process and allowed for a higher volume of experiments within a shorter time frame on AWS.
To facilitate the process, the team created a comprehensive dataset encompassing a wide range of tasks, constructed by translating existing English datasets and generating synthetic elements. The dataset was stored in an Amazon Simple Storage Service (Amazon S3) bucket, which served as a centralized data repository. During the training process, our SageMaker HyperPod cluster was connected to this S3 bucket, enabling effortless retrieval of the dataset elements as needed.
The integration of Amazon S3 and the SageMaker HyperPod cluster exemplifies the power of the AWS ecosystem, where various services work together seamlessly to support complex workflows.
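As an illustration of this pattern, the following sketch pulls dataset shards from a central S3 bucket onto a cluster node using boto3. The bucket and prefix names are placeholders, not Fastweb's actual repository.

```python
import os
import boto3

# Placeholder names for the centralized dataset repository.
BUCKET = "example-finetuning-data"
PREFIX = "italian-dataset/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Download every dataset shard under the prefix to a local working directory.
os.makedirs("data", exist_ok=True)
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        local_path = os.path.join("data", os.path.basename(obj["Key"]))
        s3.download_file(BUCKET, obj["Key"], local_path)
```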
Overcoming data scarcity with translation and synthetic data generation
When fine-tuning a custom version of the Mistral 7B LLM for the Italian language, Fastweb faced a major obstacle: high-quality Italian datasets were extremely limited or unavailable. To tackle this data scarcity challenge, Fastweb had to build a comprehensive training dataset from scratch to enable effective model fine-tuning.
While establishing strategic agreements to acquire licensed data from publishers and media companies, Fastweb employed two main strategies to create a diverse and well-rounded dataset: translating open source English training data into Italian and generating synthetic Italian data using AI models.
To use the wealth of information available in English, Fastweb translated open source English training datasets into Italian. This approach made valuable data accessible and relevant for Italian language training. Both LLMs and open source translation tools were used for this process.
The open source Argos Translate tool was used for bulk translation of datasets with simpler content. Although LLMs offer superior translation quality, Argos Translate is free, extremely fast, and well-suited for efficiently handling large volumes of straightforward data. For complex datasets where accuracy was critical, LLMs were employed to provide high-quality translations.
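As a sketch of the bulk-translation path (not Fastweb's exact pipeline), Argos Translate can be driven from Python roughly as follows; the one-time package installation downloads the English-to-Italian model.

```python
import argostranslate.package
import argostranslate.translate

# One-time setup: fetch and install the English -> Italian language package.
argostranslate.package.update_package_index()
available = argostranslate.package.get_available_packages()
en_it = next(p for p in available if p.from_code == "en" and p.to_code == "it")
argostranslate.package.install_from_path(en_it.download())

# Bulk-translate simple dataset records.
records = ["What is the capital of Italy?", "Summarize the following article."]
translated = [argostranslate.translate.translate(t, "en", "it") for t in records]
print(translated)
```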
To further enrich the dataset, Fastweb generated synthetic Italian data using LLMs. This involved creating a variety of text samples covering a wide range of topics and tasks relevant to the Italian language. High-quality Italian web articles, books, and other texts served as the basis for training the LLMs to generate authentic-sounding synthetic content that captured the nuances of the language.
The resulting sub-datasets spanned diverse subjects, including medical information, question-answer pairs, conversations, web articles, science topics, and more. The tasks covered were also highly varied, encompassing question answering, summarization, creative writing, and others.
Each subset generated through translation or synthetic data creation underwent meticulous filtering to maintain quality and diversity. A similarity check was performed to deduplicate the data; if two elements were found to be too similar, one was removed. This step was crucial in maintaining variability and preventing bias from repetitive or overly similar content.
The deduplication process involved embedding dataset elements using a text embedder, then computing cosine similarity between the embeddings to identify similar elements. Meta’s FAISS library, renowned for its efficiency in similarity search and clustering of dense vectors, was used as the underlying vector database due to its ability to handle large-scale datasets effectively.
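A minimal sketch of this deduplication step follows, assuming a sentence-transformers model as the text embedder; the model name and the 0.9 similarity threshold are illustrative choices, not Fastweb's actual settings.

```python
import faiss
from sentence_transformers import SentenceTransformer

def deduplicate(texts, threshold=0.9):
    # Embed all elements; the model choice here is illustrative.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(emb)  # unit vectors: inner product == cosine similarity

    index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product search
    kept = []
    for text, vec in zip(texts, emb):
        if index.ntotal > 0:
            scores, _ = index.search(vec[None, :], 1)  # closest kept element
            if scores[0][0] >= threshold:
                continue  # too similar to something already kept: drop it
        index.add(vec[None, :])
        kept.append(text)
    return kept

print(deduplicate(["Roma è la capitale d'Italia.",
                   "La capitale d'Italia è Roma.",
                   "Il Po è il fiume più lungo d'Italia."]))
```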
After filtering and deduplication, the remaining subsets were postprocessed and combined to form the final fine-tuning dataset, comprising 300,000 training elements. This comprehensive dataset enabled Fastweb to effectively fine-tune their custom version of the Mistral 7B model, achieving high performance and diversity across a wide range of tasks and topics.
All data generation and processing steps were run in parallel directly on the SageMaker HyperPod cluster nodes, using a unique working environment and highlighting the cluster’s versatility for various tasks beyond just training models.
The following diagram illustrates two distinct data pipelines for creating the final dataset: the upper pipeline uses translations of existing English datasets into Italian, and the lower pipeline employs custom generated synthetic data.
The computational cost of training an LLM
The computational cost of training LLMs scales approximately with the number of parameters and the amount of training data. As a general rule, for each model parameter being trained, approximately 24 bytes of memory are required. This means that to fully fine-tune a 7 billion parameter model like Mistral 7B, at least roughly 156 GiB (about 168 GB, from 7 × 10⁹ parameters × 24 bytes) of hardware memory is necessary, not including the additional overhead of loading training data.
The following table provides additional examples.
LLM model size vs. training memory:

| Number of Parameters | Memory Requirement |
|---|---|
| 500 million | 12 GB |
| 1 billion | 23 GB |
| 2 billion | 45 GB |
| 3 billion | 67 GB |
| 5 billion | 112 GB |
| 7 billion | 156 GB |
| 10 billion | 224 GB |
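As a quick worked example of the 24-bytes-per-parameter rule (a rough heuristic, so rounding differs slightly across the rows above):

```python
def full_finetune_memory(num_params: float, bytes_per_param: int = 24):
    """Rule-of-thumb memory for full fine-tuning: weights, gradients, and
    optimizer state, excluding activations and training-data overhead."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / 1e9, total_bytes / 2**30  # (decimal GB, binary GiB)

gb, gib = full_finetune_memory(7e9)  # Mistral 7B
print(f"~{gb:.0f} GB (~{gib:.0f} GiB)")  # ~168 GB (~156 GiB)
```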
Parameter-efficient fine-tuning (PEFT) methods minimize the number of trainable parameters, whereas quantization reduces the number of bits per parameter, often with minimal negative impact on the final training results.
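To make the PEFT idea concrete, here is a minimal LoRA sketch using the Hugging Face peft library; the rank, target modules, and other hyperparameters are illustrative, not necessarily what Fastweb used.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Only the small adapter matrices are trained and carry optimizer state, which is what cuts the memory bill so sharply.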
Despite these memory-saving techniques, fine-tuning large models still demands substantial GPU memory and extended training times. This makes distributed training essential, allowing the workload to be shared across multiple GPUs, thereby enabling the efficient handling of such large-scale computational tasks.
The following table and figure illustrate the allocation of GPU memory during each phase of LLM training.
Solution overview
Training LLMs often requires significant computational resources that can exceed the capabilities of a single GPU. Distributed training is a powerful technique that addresses this challenge by distributing the workload across multiple GPUs and nodes, enabling parallel processing and reducing training time. SageMaker HyperPod simplifies the process of setting up and running distributed training jobs, providing preconfigured environments and libraries specifically designed for this purpose.
There are two main techniques for distributed training: data parallelization and model parallelization. Data parallelization involves distributing the training data across multiple GPUs, whereas model parallelization splits the model itself across different GPUs.
To take advantage of distributed training, a cluster of interconnected GPUs, often spread across multiple physical nodes, is required. SageMaker HyperPod allows for both data and model parallelization techniques to be employed simultaneously, maximizing the available computational resources. Also, SageMaker HyperPod provides resilience through features like automatic fault detection and recovery, which are crucial for long-running training jobs. SageMaker HyperPod allows for the creation of personalized Conda environments, enabling the installation of necessary libraries and tools for distributed training.
One popular library for implementing distributed training is DeepSpeed, a Python optimization library that handles distributed training and makes it memory-efficient and fast by enabling both data and model parallelization. The choice to use DeepSpeed was driven by the availability of an extensive, already-developed code base, ready to be employed for training experiments. The high flexibility and environment customization capabilities of SageMaker HyperPod made it possible to create a personalized Conda environment with all the necessary libraries installed, including DeepSpeed.
The following diagram illustrates the two key parallelization strategies offered by DeepSpeed: data parallelism and model parallelism. Data parallelism involves replicating the entire model across multiple devices, with each device processing a distinct batch of training data. In contrast, model parallelism distributes different parts of a single model across multiple devices, enabling the training of large models that exceed the memory capacity of a single device.
To help meet the demanding computational requirements of training LLMs, we used the power and flexibility of SageMaker HyperPod clusters, orchestrated with Slurm. While HyperPod also supports orchestration with Amazon EKS, our research team had prior expertise with Slurm. The cluster configuration was tailored to our specific training needs, providing optimal resource utilization and cost-effectiveness.
The SageMaker HyperPod cluster architecture consisted of a controller machine to orchestrate the training job’s coordination and resource allocation. The training tasks were run by two compute nodes, which were g5.12xlarge instances equipped with high-performance GPUs. These compute nodes handled the bulk of the computational workload, using their GPUs to accelerate the training process.
The AWS managed high-performance Lustre file system (Amazon FSx for Lustre) mounted on the nodes provided high-speed data access and transfer rates, which are essential for efficient training operations.
SageMaker HyperPod is typically used to launch large clusters for pre-training large language models (LLMs) with thousands of GPUs, but one of its key advantages is its flexibility: it also allows for the creation of small, agile, on-demand clusters. This versatility made it possible to use resources only when needed, avoiding unnecessary costs.
For the DeepSpeed configuration, we followed the standard recommended setup, enabling data and model parallelism across the two g5.12xlarge nodes of the cluster, for a total of 8 GPUs.
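As an illustration only (the exact configuration isn't reproduced here), a DeepSpeed setup along these lines can be expressed as a Python config dict and passed to deepspeed.initialize. The batch sizes, precision, and ZeRO stage below are assumptions.

```python
import torch.nn as nn
import deepspeed

# Stand-in model; in the real job this is the Mistral 7B checkpoint.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Illustrative settings: ZeRO stage 3 shards parameters, gradients, and
# optimizer state across all GPUs, combining data and model parallelism.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# Run with the DeepSpeed launcher (e.g., deepspeed --num_gpus=8 train.py),
# which sets up the distributed process group across the cluster nodes.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```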
Although more advanced techniques were available, such as offloading some computation to the CPU during training, our cluster was sized with a sufficiently high GPU memory margin. With 192 GiB (206 GB) of overall GPU memory available, even accounting for the additional memory needed to keep dataset batches on the GPUs during training, we had ample resources to fine-tune a 7B parameter model without these advanced techniques. The following figure describes the infrastructure setup of our training solution.
Training results and output examples
After completing the training process, Fastweb’s fine-tuned language model demonstrated a significant performance improvement on Italian language tasks compared to the base model. Evaluated on an internal benchmark dataset, the fine-tuned model achieved an average accuracy increase of 20% across a range of tasks designed to assess its general understanding of the Italian language.
The benchmark tasks focused on three key areas: question answering, common sense reasoning, and next word prediction. Question answering tasks tested the model’s ability to comprehend and provide accurate responses to queries in Italian. Common sense reasoning evaluated the model’s grasp of common sense knowledge and its capacity to make logical inferences based on real-world scenarios. Next word prediction assessed the model’s understanding of language patterns and its ability to predict the most likely word to follow in a given context.
To evaluate the fine-tuned model’s performance, we initiated our interaction by inquiring about its capabilities. The model responded by enumerating its primary functions, emphasizing its ability to address Fastweb-specific topics. The response was formulated in correct Italian with a very natural syntax, as illustrated in the following example.
Afterwards, we asked the model to generate five titles for a presentation on the topic of AI.
Just for fun, we asked what the most famous sandwich is. The model responded with a combination of typical Italian ingredients and added that there is a wide variety of choices.
Lastly, we asked the model to provide us with a useful link to understand the recent EU AI Act. The model provided a working link, along with a helpful description.
Conclusion
Using SageMaker HyperPod, Fastweb successfully fine-tuned the Mistral 7B model as a first step in their generative AI journey, significantly improving its performance on tasks involving the Italian language.
Looking ahead, Fastweb plans to deploy their next models also on Amazon Bedrock using the Custom Model Import feature. This strategic move will enable Fastweb to quickly build and scale new generative AI solutions for their customers, using the broad set of capabilities available on Amazon Bedrock.
By harnessing Amazon Bedrock, Fastweb can further enhance their offerings and drive digital transformation for their customers. This initiative aligns with Fastweb’s commitment to staying at the forefront of AI technology and fostering innovation across various industries.
With their fine-tuned language model running on Amazon Bedrock, Fastweb will be well-positioned to deliver cutting-edge generative AI solutions tailored to the unique needs of their customers. This will empower businesses to unlock new opportunities, streamline processes, and gain valuable insights, ultimately driving growth and competitiveness in the digital age.
Fastweb’s decision to use the Custom Model Import feature in Amazon Bedrock underscores the company’s forward-thinking approach and their dedication to providing their customers with the latest and most advanced AI technologies. This collaboration with AWS further solidifies Fastweb’s position as a leader in digital transformation and a driving force behind the adoption of innovative AI solutions across industries.
To learn more about SageMaker HyperPod, refer to Amazon SageMaker HyperPod and the Amazon SageMaker HyperPod workshop.
About the authors
Marta Cavalleri is the Manager of the Artificial Intelligence Center of Excellence (CoE) at Fastweb, where she leads teams of data scientists and engineers in implementing enterprise AI solutions. She specializes in AI operations, data governance, and cloud architecture on AWS.
Giovanni Germani is the Manager of Architecture & Artificial Intelligence CoE at Fastweb, where he leverages his extensive experience in Enterprise Architecture and digital transformation. With over 12 years in Management Consulting, Giovanni specializes in technology-driven projects across telecommunications, media, and insurance industries. He brings deep expertise in IT strategy, cybersecurity, and artificial intelligence to drive complex transformation programs.
Claudia Sacco is an AWS Professional Solutions Architect at BIP xTech, collaborating with Fastweb’s AI CoE and specialized in architecting advanced cloud and data platforms that drive innovation and operational excellence. With a sharp focus on delivering scalable, secure, and future-ready solutions, she collaborates with organizations to unlock the full potential of cloud technologies. Beyond her professional expertise, Claudia finds inspiration in the outdoors, embracing challenges through climbing and trekking adventures with her family.
Andrea Policarpi is a Data Scientist at BIP xTech, collaborating with Fastweb’s AI CoE. With a strong foundation in computer vision and natural language processing, he is currently exploring the world of Generative AI and leveraging its powerful tools to craft innovative solutions for emerging challenges. In his free time, Andrea is an avid reader and enjoys playing the piano to relax.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Adolfo Pica has a strong background in cloud computing, with over 20 years of experience in designing, implementing, and optimizing complex IT systems and architectures and with a keen interest and hands-on experience in the rapidly evolving field of generative AI and foundation models. He has expertise in AWS cloud services, DevOps practices, security, data analytics and generative AI. In his free time, Adolfo enjoys following his two sons in their sporting adventures in taekwondo and football.
Maurizio Pinto is a Senior Solutions Architect at AWS, specialized in cloud solutions for telecommunications. With extensive experience in software architecture and AWS services, he helps organizations navigate their cloud journey while pursuing his passion for AI’s transformative impact on technology and society.
Using natural language in Amazon Q Business: From searching and creating ServiceNow incidents and knowledge articles to generating insights
Many enterprise customers across various industries are looking to adopt Generative AI to drive innovation, user productivity, and enhance customer experience. Generative AI–powered assistants such as Amazon Q Business can be configured to answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business understands natural language and allows users to receive immediate, permissions-aware responses from enterprise data sources with citations. This capability supports various use cases such as IT, HR, and help desk.
With custom plugins for Amazon Q Business, you can enhance the application environment to enable your users to use natural language to perform specific tasks related to third-party applications — such as Jira, Salesforce, and ServiceNow — directly from within their web experience chat.
Enterprises that have adopted ServiceNow can improve their operations and boost user productivity by using Amazon Q Business for various use cases, including incident and knowledge management. Users can search ServiceNow knowledge base (KB) articles and incidents in addition to being able to create, manage, and track incidents and KB articles, all from within their web experience chat.
In this post, we’ll demonstrate how to configure an Amazon Q Business application and add a custom plugin that gives users the ability to use a natural language interface provided by Amazon Q Business to query real-time data and take actions in ServiceNow. By the end of this hands-on session, you should be able to:
- Create an Amazon Q Business application and integrate it with ServiceNow using a custom plugin.
- Use natural language in your Amazon Q web experience chat to perform read and write actions in ServiceNow such as querying and creating incidents and KB articles in a secure and governed fashion.
Prerequisites
Before proceeding, make sure that you have the necessary AWS account permissions and services enabled, along with access to a ServiceNow environment with the required privileges for configuration.
AWS
- Have an AWS account with administrative access. For more information, see Setting up for Amazon Q Business. For a complete list of AWS Identity and Access Management (IAM) roles for Amazon Q Business, see IAM roles for Amazon Q Business. Although we’re using admin privileges for the purpose of this post, it’s a security best practice to apply least privilege permissions and grant only the permissions required to perform a task.
- Subscribe to the Amazon Q Business Pro plan, which includes access to custom plugins that enable users to execute actions in third-party applications. For information on what is included in each tier of user subscriptions, see the Amazon Q Business pricing document.
ServiceNow
- Obtain a ServiceNow Personal Developer Instance or use a clean ServiceNow developer environment. You will need an account that has admin privileges to perform the configuration steps in ServiceNow.
Solution overview
The following architecture diagram illustrates the workflow for Amazon Q Business web experience with enhanced capabilities to integrate it seamlessly with ServiceNow.
The implementation includes the following steps:
- The solution begins with configuring Amazon Q Business using the AWS Management Console. This includes setting up the application environment, adding users to AWS IAM Identity Center, selecting the appropriate subscription tier, and configuring the web experience for users to interact with. The environment can optionally be configured to provide real-time data retrieval using a native retriever, which pulls information from indexed data sources, such as Amazon Simple Storage Service (Amazon S3), during interactions.
- The next step involves adjusting the global controls and response settings for the application environment guardrails to allow Amazon Q Business to use its large language model (LLM) knowledge to generate responses when it cannot find responses from your connected data sources.
- Integration with ServiceNow is achieved by setting up an OAuth Inbound application endpoint in ServiceNow, which authenticates and authorizes interactions between Amazon Q Business and ServiceNow. This involves creating an OAuth API endpoint in ServiceNow and using the web experience URL from Amazon Q Business as the callback URL. The setup makes sure that Amazon Q Business can securely perform actions in ServiceNow with the same scoped permissions as the user signing in to ServiceNow.
- The final step of the solution involves enhancing the application environment with a custom plugin for ServiceNow using APIs defined in an OpenAPI schema. The plugin allows Amazon Q Business to securely interact with ServiceNow’s REST APIs, enabling operations such as querying, creating, and updating records dynamically and in real time.
Configuring the Amazon Q Business application
To create an Amazon Q Business application, sign in to the Amazon Q Business console.
As a prerequisite to creating an Amazon Q Business application, follow the instructions in the Configuring an IAM Identity Center instance section. Amazon Q Business integrates with IAM Identity Center to manage user access to your Amazon Q Business application. This is the recommended method for managing human access to AWS resources and the method used for the purposes of this post.
Amazon Q Business also supports identity federation through IAM. When you use identity federation, you can manage users with your enterprise identity provider (IdP) and use IAM to authenticate users when they sign in to Amazon Q Business.
Create and configure the Amazon Q Business application:
- In the Amazon Q Business console, choose Applications from the navigation pane and then choose Create application.
- Enter the following information for your Amazon Q Business application:
  - Application name: Enter a name for quick identification, such as my-demo-application.
  - Service access: Select Create and use a new service-linked role (SLR). A service-linked role is a unique type of IAM role that is linked directly to Amazon Q Business. Service-linked roles are predefined by Amazon Q Business and include the permissions that the service requires to call other AWS services on your behalf.
  - Choose Create.
- After creating your Amazon Q Business application environment, create and select the retriever and provision the index that will power your generative AI web experience. The retriever pulls data from the index in real time during a conversation. On the Select Retriever page:
- Retrievers: Select Use native retriever.
- Index provisioning: Select Starter, which is ideal for proof-of-concept or developer workloads. See Index types for more information.
- Number of units: Enter 1. This indicates the capacity units that you want to provision for your index. Each unit is 20,000 documents.
- Choose Next.
- After you select a retriever for your Amazon Q Business application environment, you can optionally connect other data sources to it. Because a data source isn’t required for this session, we won’t configure one. For more information on connecting data sources to an Amazon Q Business application, see connecting data sources.
- Choose Next.
- As an account admin, you can add users to your IAM Identity Center instance from the Amazon Q Business console. After you add users or groups to an application environment, you can then choose the Amazon Q Business tier for each user or group. On the Add groups and users page:
- Choose Add groups and users.
- In the Add new users dialog box that opens, enter the details of the user. The details you must enter for a single user include: Username, First name, Last name, email address, Confirm email address, and Display name.
- Choose Next and then Add. The user is automatically added to an IAM Identity Center directory and an email invitation to join Identity Center is sent to the email address provided.
- After adding a user or group, choose the Amazon Q Business subscription tier for each user or group. From the Current subscription dropdown menu, select Q Business Pro.
- For the Web experience service access, select Create and use a new service role.
- Choose Create application.
Upon successful completion, Amazon Q Business returns a web experience URL that you can share with the users you added to your application environment. The web experience URL (in this case, https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/) will be used when creating an OAuth application endpoint in ServiceNow. Note that your web experience URL will be different from the one shown here.
Enhancing an Amazon Q Business application with guardrails
By default, an Amazon Q Business application is configured to respond to user chat queries using only enterprise data. Because we didn’t configure a data source for the purpose of this post, you will use Admin controls and guardrails to allow Amazon Q to use its LLM world knowledge to generate responses when it cannot find responses from your connected data sources.
Update the application's global controls and response settings:
- From the Amazon Q Business console, choose Applications in the navigation pane. Select the name of your application from the list of applications.
- From the navigation pane, choose Enhancements, and then choose Admin Controls and guardrails.
- In Global Controls, choose Edit.
- In Response settings under Application guardrails, select Allow Amazon Q to fall back to LLM knowledge.
Configuring ServiceNow
To allow Amazon Q Business to connect to your ServiceNow instance, you need to create an OAuth inbound application endpoint. OAuth-based authentication validates the identity of the client that attempts to establish a trust on the system by using an authentication protocol. For more information, see OAuth Inbound and Outbound authentication.
Create an OAuth application endpoint for external client applications to access the ServiceNow instance:
- In the ServiceNow console, navigate to All, then System OAuth, then Application Registry and then choose New. On the interceptor page, select Create an OAuth API endpoint for external clients and then fill in the form with details for Name and Redirect URL. The other fields are automatically generated by the ServiceNow OAuth server.
- The Redirect URL is the callback URL that the authorization server redirects to. Enter the web experience URL of your Amazon Q Business application environment (which is the client requesting access to the resource), appended by oauth/callback. For this example, the URL is: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback
- For Auth Scope, set the value to useraccount. The scope API response parameter defines the amount of access granted by the access token, which means that the access token has the same rights as the user account that authorized the token. For example, if Abel Tuter authorizes an application by providing login credentials, then the resulting access token grants the token bearer the same access privileges as Abel Tuter.
- Choose Submit.
This creates an OAuth client application record and generates a client ID and client secret, which Amazon Q Business needs to access the restricted resources on the instance. You will need this authentication information (client ID and client secret) in the following custom plugin configuration process.
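The console flow described later creates this Secrets Manager secret for you; if you prefer to store the credentials programmatically instead, a boto3 sketch follows. The secret name and JSON key names are placeholders for illustration.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

# Store the ServiceNow OAuth credentials; all names below are placeholders.
secrets.create_secret(
    Name="servicenow-oauth-credentials",
    SecretString=json.dumps({
        "client_id": "<CLIENT_ID_FROM_SERVICENOW>",
        "client_secret": "<CLIENT_SECRET_FROM_SERVICENOW>",
        "redirect_uri": "https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback",
    }),
)
```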
Enhancing the Amazon Q Business application environment with custom plugins for ServiceNow
To integrate with external applications, Amazon Q Business uses APIs, which are configured as part of the custom plugins.
Before creating a custom plugin, you need to create or edit an OpenAPI schema, outlining the different API operations that you want to enable for your custom plugin. Amazon Q Business uses the configured third-party OpenAPI specifications to dynamically determine which API operations to perform to fulfill a user request. Therefore, the OpenAPI schema definition has a big impact on API selection accuracy and might require design optimizations. In order to maximize accuracy and improve efficiency with an Amazon Q Business custom plugin, follow the best practices for configuring OpenAPI schema definitions.
To configure a custom plugin, you must define at least one and a maximum of eight API operations that can be invoked. To define the API operations, create an OpenAPI schema in JSON or YAML format. You can create OpenAPI schema files and upload them to Amazon S3. Alternatively, you can use the OpenAPI text editor in the console, which will validate your schema.
For this post, a working sample of an OpenAPI schema for ServiceNow is provided in JSON format. Before using it, edit the template and replace YOUR_SERVICENOW_INSTANCE_URL in the following sections with the URL of your ServiceNow instance.
You can use the REST API Explorer to browse available APIs, API versions, and methods for each API. The explorer enables you to test REST API requests straight from the user interface. The Table API provides endpoints that allow you to perform create, read, update, and delete (CRUD) operations on existing tables. The calling user must have sufficient roles to access the data in the table specified in the request. For additional information on assigning roles, see Managing roles.
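For instance, the GET operation defined in the schema below maps to a plain REST call that you can try directly against your instance; in this sketch the instance URL and basic-auth credentials are placeholders.

```python
import requests

INSTANCE = "https://devxxxxxx.service-now.com"

# Retrieve up to five active incidents, returning only two fields.
resp = requests.get(
    f"{INSTANCE}/api/now/table/incident",
    params={
        "sysparm_query": "active=true",
        "sysparm_fields": "number,short_description",
        "sysparm_limit": "5",
    },
    auth=("admin", "your-password"),  # placeholder credentials
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
for record in resp.json()["result"]:
    print(record["number"], "-", record["short_description"])
```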
{
  "openapi": "3.0.1",
  "info": {
    "title": "Table API",
    "description": "Allows you to perform create, read, update, and delete (CRUD) operations on existing tables",
    "version": "latest"
  },
  "externalDocs": {
    "url": "https://docs.servicenow.com/?context=CSHelp:REST-Table-API"
  },
  "servers": [
    {
      "url": "YOUR_SERVICENOW_INSTANCE_URL"
    }
  ],
  "paths": {
    "/api/now/table/{tableName}": {
      "get": {
        "description": "Retrieve records from a table",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table name",
            "required": true,
            "schema": { "type": "string" }
          },
          {
            "name": "sysparm_query",
            "in": "query",
            "description": "An encoded query string used to filter the results, such as incident numbers or knowledge base (KB) article IDs",
            "required": true,
            "schema": { "type": "string" }
          },
          {
            "name": "sysparm_fields",
            "in": "query",
            "description": "A comma-separated list of fields to return in the response",
            "required": false,
            "schema": { "type": "string" }
          },
          {
            "name": "sysparm_limit",
            "in": "query",
            "description": "The maximum number of results returned per page",
            "required": false,
            "schema": { "type": "string" }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {
                "schema": { "$ref": "#/components/schemas/incident" }
              }
            }
          }
        }
      },
      "post": {
        "description": "Create a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table name",
            "required": true,
            "schema": { "type": "string" }
          }
        ],
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "short_description": {
                    "type": "string",
                    "description": "Short description"
                  },
                  "description": {
                    "type": "string",
                    "description": "Full description, for incidents only"
                  },
                  "caller_id": {
                    "type": "string",
                    "description": "Caller email"
                  },
                  "state": {
                    "type": "string",
                    "description": "State of the incident",
                    "enum": ["new", "in_progress", "resolved", "closed"]
                  },
                  "text": {
                    "type": "string",
                    "description": "Article body text, for knowledge base (KB) articles only"
                  }
                },
                "required": ["short_description", "caller_id"]
              }
            }
          },
          "required": true
        },
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {}
            }
          }
        }
      }
    },
    "/api/now/table/{tableName}/{sys_id}": {
      "get": {
        "description": "Retrieve a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table name",
            "required": true,
            "schema": { "type": "string" }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": { "type": "string" }
          },
          {
            "name": "sysparm_fields",
            "in": "query",
            "description": "A comma-separated list of fields to return in the response",
            "required": false,
            "schema": { "type": "string" }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      },
      "delete": {
        "description": "Delete a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table name",
            "required": true,
            "schema": { "type": "string" }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": { "type": "string" }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      },
      "patch": {
        "description": "Update or modify a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table name",
            "required": true,
            "schema": { "type": "string" }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": { "type": "string" }
          }
        ],
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "short_description": {
                    "type": "string",
                    "description": "Short description"
                  },
                  "description": {
                    "type": "string",
                    "description": "Full description, for incidents only"
                  },
                  "caller_id": {
                    "type": "string",
                    "description": "Caller email"
                  },
                  "state": {
                    "type": "string",
                    "description": "State of the incident",
                    "enum": ["new", "in_progress", "resolved", "closed"]
                  },
                  "text": {
                    "type": "string",
                    "description": "Article body text, for knowledge base (KB) articles only"
                  }
                },
                "required": ["short_description", "caller_id"]
              }
            }
          },
          "required": true
        },
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      }
    }
  },
  "components": {
    "schemas": {
      "incident": {
        "type": "object",
        "properties": {
          "sys_id": {
            "type": "string",
            "description": "Unique identifier for the incident"
          },
          "number": {
            "type": "string",
            "description": "Incident number"
          },
          "short_description": {
            "type": "string",
            "description": "Brief description of the incident"
          }
        }
      }
    },
    "securitySchemes": {
      "oauth2": {
        "type": "oauth2",
        "flows": {
          "authorizationCode": {
            "authorizationUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_auth.do",
            "tokenUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_token.do",
            "scopes": {
              "useraccount": "Access equivalent to the user's account"
            }
          }
        }
      }
    }
  },
  "security": [
    {
      "oauth2": ["useraccount"]
    }
  ]
}
The URL for the ServiceNow instance used in this post is https://devxxxxxx.service-now.com/. After updating the relevant sections of the template with this instance URL, the JSON should look like the following:
"servers": [
{
"url": "https://devxxxxxx.service-now.com/"
}
"securitySchemes": {
"oauth2": {
"type": "oauth2",
"flows": {
"authorizationCode": {
"authorizationUrl": "https://devxxxxxx.service-now.com/oauth_auth.do",
"tokenUrl": "https://devxxxxxx.service-now.com/oauth_token.do",
"scopes": {
"useraccount": "Access equivalent to the user's account"
}
}
}
}
}
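For reference, the authorizationCode flow declared in securitySchemes means the user's browser is first redirected to oauth_auth.do. The following is a sketch of constructing that authorization URL using standard OAuth 2.0 parameter names; the client ID, state value, and callback URL shown are placeholders.
from urllib.parse import urlencode

INSTANCE = "https://devxxxxxx.service-now.com"
params = {
    "response_type": "code",        # standard OAuth 2.0 authorization code flow
    "client_id": "YOUR_CLIENT_ID",  # from the ServiceNow OAuth configuration
    "redirect_uri": "https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback",
    "state": "opaque-random-value", # CSRF protection
}
print(f"{INSTANCE}/oauth_auth.do?{urlencode(params)}")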
To create a custom plugin for ServiceNow:
- Sign in to the Amazon Q Business console.
- Choose Applications in the navigation pane, and then select your application from the list of applications.
- In the navigation pane, choose Enhancements, and then choose Plugins.
- In Plugins, choose Add plugin.
- In Add plugins, choose Custom plugin.
- In Custom plugin, enter the following information:
- In Name and description, for Plugin name: Enter a name for your Amazon Q plugin.
- In API schema, for API schema source, select Define with in-line OpenAPI schema editor.
- Select JSON as the format for the schema.
- Remove any sample schema that appears in the inline OpenAPI schema editor and replace it with the text from the provided sample JSON template, updated with your ServiceNow instance URL.
- In Authentication: Select Authentication required.
- For AWS Secrets Manager secret, choose Create and add a new secret. You need to store the ServiceNow OAuth authentication credentials in a Secrets Manager secret to connect your third-party application to Amazon Q; a programmatic sketch of such a secret follows these steps. In the window that opens, enter the details in the form:
- Secret name: A name for your Secrets Manager secret.
- Client ID: The Client ID from ServiceNow OAuth configuration in the previous section.
- Client secret: The Client Secret from ServiceNow OAuth configuration in the previous section.
- OAuth callback URL: The URL the user needs to be redirected to after authentication. This will be your web experience URL. For this example, it’s: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback. Amazon Q Business will handle OAuth tokens in this URL.
- In Choose a method to authorize Amazon Q Business: Select Create and add a new service role. The console will generate a service role name. To connect Amazon Q Business to third-party applications that require authentication, you need to give the Amazon Q role permissions to access your Secrets Manager secret. This will enable an Amazon Q Business custom plugin to access the credentials needed to sign in to the third-party service.
- Choose Add plugin to add your plugin.
Upon successful completion, the plugin will appear under Plugins with a Build status of Ready and a Plugin status of Active.
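As mentioned in the steps above, the ServiceNow OAuth client credentials live in an AWS Secrets Manager secret, which the console creates and formats for you. Purely for reference, the following is a minimal boto3 sketch of creating an equivalent secret; the secret name and JSON key names are hypothetical placeholders, not a format mandated by Amazon Q Business.
import json
import boto3

# Illustrative only: the Amazon Q Business console normally creates this secret.
# The secret name and key names below are hypothetical placeholders.
secretsmanager = boto3.client("secretsmanager")
secretsmanager.create_secret(
    Name="servicenow-oauth-credentials",
    SecretString=json.dumps({
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "redirect_url": "https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback",
    }),
)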
Using Amazon Q Business web experience chat to take actions in ServiceNow
Users can launch your Amazon Q Business web experience in two ways:
- AWS access portal URL provided in an invitation email sent to the user to join AWS IAM Identity Center.
- Web experience URL shared by the admin.
Navigate to the deployed web experience URL and sign in with your AWS IAM Identity Center credentials.
After signing in, choose the New conversation icon in the left-hand menu to start a conversation.
Example: Search Knowledge Base Articles in ServiceNow for user issue and create an incident
The following chat conversation example illustrates a typical use case of Amazon Q Business integrated with custom plugins for ServiceNow. These features allow you to perform a wide range of tasks tailored to your organization’s needs.
In this example, we initiate a conversation in the web experience chat to search for KB articles related to “log in issues” in ServiceNow by invoking a plugin action. After the user submits a prompt, Amazon Q Business queries ServiceNow through the appropriate API to retrieve the results and provides a response with related KB articles. We then ask Amazon Q Business for more details to see whether any of the KB articles directly addresses the user’s issue. When no relevant KB articles are found, we ask Amazon Q Business to summarize the conversation and create a new incident in ServiceNow, making sure the issue is logged for resolution.
User prompt 1 – I am having issues logging in to the intranet and want to know if there are any ServiceNow KB articles on log-in issues. Perform the search on both Short Description and Text field using LIKE operator
Before submitting the preceding prompt, choose the vertical ellipsis to open Conversation settings, then choose Use a Plugin and select the corresponding custom plugin for ServiceNow.
If this is the first time a user is accessing the custom plugin or if their past sign-in has expired, the user will need to authenticate. After authenticating successfully, Amazon Q Business will perform the requested task.
Choose Authorize.
If the user isn’t already signed in to ServiceNow, they will be prompted to enter their credentials. For this example, the user signing in to ServiceNow is the admin user, and API actions performed in ServiceNow by Amazon Q Business on behalf of the user will have the same level of access as that user within ServiceNow.
Choose Allow for Amazon Q Business to connect to ServiceNow and perform the requested task on your behalf.
After verifying that the user is authorized and executing the request, Amazon Q Business responds with the information it retrieved. We then retrieve additional details with the following prompt.
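Under the hood, a search like the one in user prompt 1 maps to the Table API's sysparm_query parameter. The following is a hypothetical sketch of the encoded query such a request could carry, using ServiceNow's LIKE operator and ^OR to search both the short description and text fields:
# Hypothetical query parameters for the search in user prompt 1.
# LIKE matches substrings; ^OR combines conditions across fields.
params = {
    "sysparm_query": "short_descriptionLIKElog in^ORtextLIKElog in",
    "sysparm_fields": "number,short_description",
    "sysparm_limit": "10",
}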
User prompt 2 – Can you list the KB number and short description in a tabular form?
Because no KB articles related to the user’s issue were found, we ask Amazon Q Business to summarize the conversation context and create an incident with the following prompt.
User prompt 3 – The error I get is "Unable to Login After System Upgrade". Summarize my issue and create an incident with detailed description and add a note that this needs to be resolved asap.
In response to your prompt for an action, Amazon Q displays a review form where you can modify or fill in the necessary information.
To successfully complete the action, choose Submit.
Note: The caller_id value entered in the following example is a valid ServiceNow user.
Your web experience will display a success message if the action succeeds, or an error message if the action fails. In this case, the action succeeded and Amazon Q Business responded accordingly.
The following screenshot shows that the incident was created successfully in ServiceNow.
Troubleshooting common errors
For a seamless experience with third-party application integrations, it’s essential to test thoroughly and to identify and troubleshoot any unexpected behavior.
A common error encountered in Amazon Q Business is API Response too large, which occurs when an API response exceeds the current limit of 100 KB. While prompting techniques are essential for obtaining accurate and relevant answers, optimizing API responses to include only the necessary and relevant data is crucial for better response times and an enhanced user experience.
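One practical mitigation is to steer requests toward returning only the fields and rows the conversation needs. Below is a minimal sketch of Table API query parameters that keep payloads small; sysparm_exclude_reference_link is a standard Table API parameter that omits verbose reference-link objects, and the field list is illustrative.
# Keep Table API responses well under the 100 KB plugin limit
params = {
    "sysparm_fields": "number,short_description,state",  # only the fields the assistant needs
    "sysparm_limit": "10",                                # cap rows per page
    "sysparm_exclude_reference_link": "true",             # drop verbose reference links
}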
The REST API Explorer (shown in the following figure) in ServiceNow is a tool that allows developers and administrators to interact with and test the ServiceNow REST APIs directly from within the ServiceNow environment. It provides a user-friendly interface for making API requests, viewing responses, and understanding the available endpoints and data structures. Using this tool simplifies the process of testing and integrating with ServiceNow.
Clean up
To clean up AWS configurations, sign in to the Amazon Q Business console.
- From the Amazon Q Business console, in Applications, select the application that you want to delete.
- Choose Actions and select Delete.
- To confirm deletion, enter Delete.
This will take a few minutes to finish. When completed, the application and the configured custom plugin will be deleted.
When you delete the Amazon Q Business application, the users created as part of the configuration are not automatically deleted from IAM Identity Center. Use the instructions in Delete users in IAM Identity Center to delete the users created for this post.
To clean up in ServiceNow, release the Personal Developer Instance provisioned for this post by following the instructions in the ServiceNow Documentation.
Conclusion
The integration of generative AI-powered assistants such as Amazon Q Business with enterprise systems such as ServiceNow offers significant benefits for organizations. By using natural language processing capabilities, enterprises can streamline operations, enhance user productivity, and deliver better customer experiences. The ability to query real-time data and create incidents and knowledge articles through a secure and governed chat interface transforms how users interact with enterprise data and applications. As demonstrated in this post, enhancing Amazon Q Business to integrate with ServiceNow using custom plugins empowers users to perform complex tasks effortlessly, driving efficiency across various business functions. Adopting this technology not only modernizes workflows, but also positions enterprises at the forefront of innovation.
Learn more
- Amazon Q main product page
- Get started with Amazon Q
- Introducing Amazon Q, a new generative AI-powered assistant (preview)
- Improve developer productivity with generative-AI powered Amazon Q in Amazon CodeCatalyst (preview)
- Upgrade your Java applications with Amazon Q Code Transformation (preview)
- New generative AI features in Amazon Connect, including Amazon Q, facilitate improved contact center service
- New Amazon Q in QuickSight uses generative AI assistance for quicker, easier data insights (preview)
- Amazon Q brings generative AI-powered assistance to IT pros and developers (preview)
About the Author
Siddhartha Angara is a Senior Solutions Architect at Amazon Web Services. He helps enterprise customers design and build well-architected solutions in the cloud, accelerate cloud adoption, and build Machine Learning and Generative AI applications. He enjoys playing the guitar, reading and family time!
NVIDIA Awards up to $60,000 Research Fellowships to PhD Students
For more than two decades, the NVIDIA Graduate Fellowship Program has supported graduate students doing outstanding work relevant to NVIDIA technologies. Today, the program announced the latest awards of up to $60,000 each to 10 Ph.D. students involved in research that spans all areas of computing innovation.
Selected from a highly competitive applicant pool, the awardees will participate in a summer internship preceding the fellowship year. Their work puts them at the forefront of accelerated computing — tackling projects in autonomous systems, computer architecture, computer graphics, deep learning, programming systems, robotics and security.
The NVIDIA Graduate Fellowship Program is open to applicants worldwide.
The 2025-2026 fellowship recipients are:
- Anish Saxena, Georgia Institute of Technology — Rethinking data movement across the stack — spanning large language model architectures, system software and memory systems — to improve the efficiency of LLM training and inference.
- Jiawei Yang, University of Southern California — Creating scalable, generalizable foundation models for autonomous systems through self-supervised learning, leveraging neural reconstruction to capture detailed environmental geometry and dynamic scene behaviors, and enhancing adaptability in robotics, digital twin technologies and autonomous driving.
- Jiayi (Eris) Zhang, Stanford University — Developing intelligent algorithms, models and tools for enhancing user creativity and productivity in design, animation and simulation.
- Ruisi Cai, University of Texas at Austin — Working on efficient training and inference for large foundation models as well as AI security and privacy.
- Seul Lee, Korea Advanced Institute of Science and Technology — Developing generative models for molecules and exploration strategies in chemical space for drug discovery applications.
- Sreyan Ghosh, University of Maryland, College Park — Advancing audio processing and reasoning by designing resource-efficient models and training techniques, improving audio representation learning and enhancing audio perception for AI systems.
- Tairan He, Carnegie Mellon University — Researching the development of humanoid robots, with a focus on advancing whole-body loco-manipulation through large-scale simulation-to-real learning.
- Xiaogeng Liu, University of Wisconsin–Madison — Developing robust and trustworthy AI systems, with an emphasis on evaluating and enhancing machine learning models to ensure consistent performance and resilience against diverse attacks and unforeseen inputs.
- Yunze Man, University of Illinois Urbana-Champaign — Developing vision-centric reasoning models for multimodal and embodied AI agents, with a focus on object-centric perception systems in dynamic scenes, vision foundation models for open-world scene understanding and generation, and large multimodal models for embodied reasoning and robotics planning.
- Zhiqiang Xie, Stanford University — Building infrastructures to enable more efficient, scalable and complex compound AI systems while enhancing the observability and reliability of such systems.
We also acknowledge the 2025-2026 fellowship finalists:
- Bo Zhao, University of California, San Diego
- Chenning Li, Massachusetts Institute of Technology
- Dacheng Li, University of California, Berkeley
- Jiankai Sun, Stanford University
- Wenlong Huang, Stanford University
docTR joins PyTorch Ecosystem: From Pixels to Data, Building a Recognition Pipeline with PyTorch and docTR
We’re thrilled to announce that the docTR project has been integrated into the PyTorch ecosystem! This integration ensures that docTR aligns with PyTorch’s standards and practices, giving developers a reliable, community-backed solution for powerful OCR workflows.
For more information on what it means to be a PyTorch ecosystem project, see the PyTorch Ecosystem Tools page.
About docTR
docTR is an Apache 2.0 project developed and distributed by Mindee to help developers integrate OCR capabilities into applications with no prior knowledge required.
To quickly and efficiently extract text information, docTR uses a two-stage approach:
- First, it performs text detection to localize words.
- Then, it conducts text recognition to identify all characters in a word.
Detection and recognition are performed by state-of-the-art models written in PyTorch. To learn more about this approach, you can refer to the docTR documentation.
docTR enhances the user experience in PyTorch projects by providing high-performance OCR capabilities right out of the box. Its specially designed models require minimal to no fine-tuning for common use cases, allowing developers to quickly integrate advanced document analysis features.
Local installation
docTR requires Python >= 3.10 and supports Windows, Mac, and Linux. Please refer to our README for the dependencies required for MacBooks with the M1 chip.
pip3 install -U pip
pip3 install "python-doctr[torch,viz]"
This will install docTR along with the latest version of PyTorch.
Note: docTR also provides Docker images for easy deployment, for example as part of a Kubernetes cluster.
Text recognition
Now, let’s try docTR’s OCR recognition on this sample:
The OCR recognition model expects an image with only one word on it and will output the predicted word with a confidence score. You can use the following snippet to test OCR capabilities from docTR:
from doctr.io import DocumentFile
from doctr.models import recognition_predictor
doc = DocumentFile.from_images("/path/to/image")
# Load the OCR model
# This will download pre-trained models hosted by Mindee
model = recognition_predictor(pretrained=True)
result = model(doc)
print(result)
Here, the most important line of code is model = recognition_predictor(pretrained=True). This will load a default text recognition model, crnn_vgg16_bn, but you can select other models through the arch parameter. You can check out the available architectures.
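For example, here is a sketch of swapping in a lighter recognition architecture; crnn_mobilenet_v3_small is one of the architectures listed in the docTR documentation:
from doctr.models import recognition_predictor

# Sketch: load a smaller recognition model instead of the default crnn_vgg16_bn
model = recognition_predictor(arch="crnn_mobilenet_v3_small", pretrained=True)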
When run on the sample, the recognition predictor retrieves the following data: [('MAGAZINE', 0.9872216582298279)]
Note: The DocumentFile object in docTR provides an easy way to manipulate PDFs or images.
Text detection
The last example was a crop on a single word. Now, what about an image with several words on it, like this one?
A text detection model is used before the text recognition to output a segmentation map representing the location of the text. Following that, the text recognition is applied on every detected patch.
Below is a snippet to run only the detection part:
from doctr.io import DocumentFile
from doctr.models import detection_predictor
from matplotlib import pyplot as plt
from doctr.utils.geometry import detach_scores
from doctr.utils.visualization import draw_boxes

doc = DocumentFile.from_images("path/to/my/file")
# Load the default text detection model (fast_base)
model = detection_predictor(pretrained=True)
result = model(doc)
# Separate the confidence scores from the detected boxes and draw the boxes on the page
draw_boxes(detach_scores([result[0]["words"]])[0][0], doc[0])
plt.axis('off')
plt.show()
Running it on the full sample yields the following:
Similarly to the text recognition, detection_predictor will load a default model (fast_base here). You can also load another one by providing it through the arch parameter.
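For instance, here is a sketch of selecting a different detection architecture; db_resnet50 is among those listed in the docTR documentation:
from doctr.models import detection_predictor

# Sketch: load a DBNet-based detection model instead of the default fast_base
model = detection_predictor(arch="db_resnet50", pretrained=True)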
The full implementation
Now, let’s plug both components into the same pipeline.
Conveniently, docTR provides a wrapper that does exactly that for us:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

doc = DocumentFile.from_images("/path/to/image")
# assume_straight_pages=False lets the pipeline handle rotated pages and words
model = ocr_predictor(pretrained=True, assume_straight_pages=False)
result = model(doc)
result.show()
The last line should display a matplotlib window which shows the detected patches. Hovering the mouse over them will display their contents.
You can also do more with this output, such as reconstituting a synthetic document like so:
import matplotlib.pyplot as plt
synthetic_pages = result.synthesize()
plt.imshow(synthetic_pages[0])
plt.axis('off')
plt.show()
The pipeline is highly customizable: you can modify the detection or recognition model behavior by passing arguments to ocr_predictor. Please refer to the documentation to learn more about it.
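As an example, here is a sketch that combines a custom detector and recognizer and exports the result as a dictionary, using the det_arch and reco_arch arguments and the export() method described in the docTR documentation:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Mix and match detection and recognition architectures
model = ocr_predictor(det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True)

doc = DocumentFile.from_images("/path/to/image")
result = model(doc)

# Export the structured output (pages -> blocks -> lines -> words) as a dict
json_output = result.export()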
Conclusion
We’re excited to welcome docTR into the PyTorch Ecosystem, where it seamlessly integrates with PyTorch pipelines to deliver state-of-the-art OCR capabilities right out of the box.
By empowering developers to quickly extract text from images or PDFs using familiar tooling, docTR simplifies complex document analysis tasks and enhances the overall PyTorch experience.
We invite you to explore the docTR GitHub repository, join the docTR community on Slack, and reach out at contact@mindee.com for inquiries or collaboration opportunities.
Together, we can continue to push the boundaries of document understanding and develop even more powerful, accessible tools for everyone in the PyTorch community.
Accelerating LLM Inference on NVIDIA GPUs with ReDrafter
Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year, we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state of the art…Apple Machine Learning Research