US National Science Foundation Launches National AI Research Resource Pilot

In a major stride toward building a shared national research infrastructure, the U.S. National Science Foundation has launched the National Artificial Intelligence Research Resource pilot program with significant support from NVIDIA.

The initiative aims to broaden access to the tools needed to power responsible AI discovery and innovation. It was announced Wednesday in partnership with 10 other federal agencies as well as private-sector, nonprofit and philanthropic organizations.

“The breadth of partners that have come together for this pilot underscores the urgency of developing a National AI Research Resource for the future of AI in America,” said NSF Director Sethuraman Panchanathan. “By investing in AI research through the NAIRR pilot, the United States unleashes discovery and impact and bolsters its global competitiveness.”

NVIDIA’s commitment of $30 million in technology contributions over two years is a key factor in enlarging the scale of the pilot, fueling the potential for broader achievements and accelerating the momentum toward full-scale implementation.

“The NAIRR is a vision of a national research infrastructure that will provide access to computing, data, models and software to empower researchers and communities,” said Katie Antypas, director of the Office of Advanced Cyberinfrastructure at the NSF.

“Our primary goals for the NAIRR pilot are to support fundamental AI research and domain-specific research applying AI, reach broader communities, particularly those currently unable to participate in the AI innovation ecosystem, and refine the design for the future full NAIRR,” Antypas added.

Accelerating Access to AI

“AI is increasingly defining our era, and its potential can best be fulfilled with broad access to its transformative capabilities,” said NVIDIA founder and CEO Jensen Huang.

“Partnerships are really at the core of the NAIRR pilot,” said Tess DeBlanc-Knowles, NSF’s special assistant to the director for artificial intelligence.

“It’s been incredibly impressive to see this breadth of partners come together in these 90 days, bringing together government, industry, nonprofits and philanthropies,” she added. “Our industry and nonprofit partners are bringing critical expertise and resources, which are essential to advance AI and move forward with trustworthy AI initiatives.”

NVIDIA’s collaboration with scientific centers aims to significantly scale up educational and workforce training programs, enhancing AI literacy and skill development across the scientific community.

NVIDIA will harness insights from researchers using its platform, offering an opportunity to refine and enhance the effectiveness of its technology for science, and supporting continuous advancement in AI applications.

“With NVIDIA AI software and supercomputing, the scientists, researchers and engineers of the extended NSF community will be able to utilize the world’s leading infrastructure to fuel a new generation of innovation,” Huang said.

The Foundation for Modern AI

Accelerating both AI research and research done with AI, NVIDIA’s contributions include NVIDIA DGX Cloud AI supercomputing resources and NVIDIA AI Enterprise software.

Offering full-stack accelerated computing from systems to software, NVIDIA AI provides the foundation for generative AI, with significant adoption across research and industries.

Broad Support Across the US Government

As part of this national endeavor, the NAIRR pilot brings together a coalition of government partners, showcasing a unified approach to advancing AI research.

Its partners include the U.S. National Science Foundation, U.S. Department of Agriculture, U.S. Department of Energy, U.S. Department of Veterans Affairs, National Aeronautics and Space Administration, National Institutes of Health, National Institute of Standards and Technology, National Oceanic and Atmospheric Administration, Defense Advanced Research Projects Agency, U.S. Patent and Trademark Office and the U.S. Department of Defense.

The NAIRR pilot builds on the United States’ rich history of leading large-scale scientific endeavors, such as the creation of the internet, which, in turn, led to the advancement of AI.

Leading in Advanced AI

NAIRR promises to drive innovations across various sectors, from healthcare to environmental science, positioning the U.S. at the forefront of global AI advancements.

The launch meets a goal outlined in Executive Order 14110, signed by President Biden in October 2023, directing NSF to launch a pilot for the NAIRR within 90 days.

The NAIRR pilot will provide access to advanced computing, datasets, models, software, training and user support to U.S.-based researchers and educators.

“Smaller institutions, rural institutions, institutions serving underrepresented populations are key communities we’re trying to reach with the NAIRR,” said Antypas. “These communities are less likely to have resources to build their own computing or data resources.”

Paving the Way for Future Investments

As the pilot expedites the proof of concept, future investments in the NAIRR will democratize access to AI innovation and support critical work advancing the development of trustworthy AI.

The pilot will initially support AI research to advance safe, secure and trustworthy AI as well as the application of AI to challenges in healthcare and environmental and infrastructure sustainability.

Researchers can apply for initial access to NAIRR pilot resources through the NSF. The NAIRR pilot welcomes additional private-sector and nonprofit partners.

Those interested are encouraged to reach out to NSF at nairr_pilot@nsf.gov.

Read More

High Can See Clearly Now: AI-Powered NVIDIA RTX Video HDR Transforms Standard Video Into Stunning High Dynamic Range

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

RTX Video HDR — first announced at CES — is now available for download through the January Studio Driver. It uses AI to transform standard dynamic range video playing in internet browsers into stunning high dynamic range (HDR) on HDR10 displays.

PC game modders now have a powerful new set of tools to use with the release of the NVIDIA RTX Remix open beta.

It features full ray tracing, NVIDIA DLSS, NVIDIA Reflex, modern physically based rendering assets and generative AI texture tools so modders can remaster games more efficiently than ever.

Pick up the new GeForce RTX 4070 Ti SUPER available from custom board partners in stock-clocked and factory-overclocked configurations to enhance creating, gaming and AI tasks.

Get creative superpowers with the GeForce RTX 4070 Ti SUPER available now.

Part of the 40 SUPER Series announced at CES, it’s equipped with more CUDA cores than the RTX 4070, a frame buffer increased to 16GB, and a 256-bit bus — perfect for video editing and rendering large 3D scenes. It runs up to 1.6x faster than the RTX 3070 Ti and 2.5x faster with DLSS 3 in the most graphics-intensive games.

And this week’s featured In the NVIDIA Studio technical artist Vishal Ranga shares his vivid 3D scene Disowned — powered by NVIDIA RTX and Unreal Engine with DLSS.

RTX Video HDR Delivers Dazzling Detail

Using the power of Tensor Cores on GeForce RTX GPUs, RTX Video HDR allows gamers and creators to maximize their HDR panel’s ability to display vivid, dynamic colors, preserving intricate details that may be inadvertently lost due to video compression.

RTX Video HDR and RTX Video Super Resolution can be used together to produce the clearest livestreamed video anywhere, anytime. These features work on Chromium-based browsers such as Google Chrome or Microsoft Edge.

To enable RTX Video HDR:

  1. Download and install the January Studio Driver.
  2. Ensure Windows HDR features are enabled by navigating to System > Display > HDR.
  3. Open the NVIDIA Control Panel and navigate to Adjust video image settings > RTX Video Enhancement — then enable HDR.

Standard dynamic range video will then automatically convert to HDR, displaying remarkably improved details and sharpness.

RTX Video HDR is among the RTX-powered apps enhancing everyday PC use, productivity, creating and gaming. NVIDIA Broadcast supercharges mics and cams; NVIDIA Canvas turns simple brushstrokes into realistic landscape images; and NVIDIA Omniverse seamlessly connects 3D apps and creative workflows. Explore exclusive Studio tools, including industry-leading NVIDIA Studio Drivers — free for RTX graphics card owners — which support the latest creative app updates, AI-powered features and more.

RTX Video HDR requires an RTX GPU connected to an HDR10-compatible monitor or TV. For additional information, check out the RTX Video FAQ.

Introducing the Remarkable RTX Remix Open Beta

Built on NVIDIA Omniverse, the RTX Remix open beta is available now.

The NVIDIA RTX Remix open beta is out now.

It allows modders to easily capture game assets, automatically enhance materials with generative AI tools, reimagine assets via Omniverse-connected apps and Universal Scene Description (OpenUSD), and quickly create stunning RTX remasters of classic games with full ray tracing and NVIDIA DLSS technology.

RTX Remix has already delivered stunning remasters, such as Portal with RTX and the modder-made Portal: Prelude RTX. Orbifold Studios is now using the technology to develop Half-Life 2 RTX: An RTX Remix Project, a community remaster of one of the highest-rated games of all time. Check out the gameplay trailer, showcasing Orbifold Studios’ latest updates to Ravenholm:

Learn more about the RTX Remix open beta and sign up to gain access.

Leveling Up With RTX

Vishal Ranga has a decade’s worth of experience in the gaming industry, where he pursues level design.

“I’ve loved playing video games since forever, and that curiosity led me to game design,” he said. “A few years later, I found my sweet spot in technical art.”

Ranga specializes in level design.

His stunning scene Disowned was born out of experimentation with Unreal Engine’s new ray-traced global illumination lighting capabilities.

Remarkably, he skipped the concepting process — the entire project was conceived solely from Ranga’s imagination.

Applying the water shader and mocking up the lighting early helped Ranga set up the mood of the scene. He then updated old assets and searched the Unreal Engine store for new ones — what he couldn’t find, like fishing nets and custom flags, he created from scratch.

Ranga meticulously organizes assets.

“I chose a GeForce RTX GPU to use ray-traced dynamic global illumination with RTX cards for natural, more realistic light bounces.” — Vishal Ranga

Ranga’s GeForce RTX graphics card unlocked RTX-accelerated rendering for high-fidelity, interactive visualization of 3D designs during virtual production.

Next, he tackled shader work, blending in moss and muck into models of wood, nets and flags. He also created a volumetric local fog shader to complement the assets as they pass through the fog, adding greater depth to the scene.

Shaders add extraordinary depth and visual detail.

Ranga then polished everything up. He first used a water shader to add realism to reflections, surface moss and subtle waves, then tinkered with global illumination and reflection effects, along with other post-process settings.

Materials come together to deliver realism and higher visual quality.

Ranga used Unreal Engine’s internal high-resolution screenshot feature and sequencer to capture renders. This was achieved by cranking up screen resolution to 200%, resulting in crisper details.

Throughout, DLSS enhanced Ranga’s creative workflow, allowing for smooth scene movement while maintaining immaculate visual quality.

When finished with adjustments, Ranga exported the final scene in no time thanks to his RTX GPU.

Ranga encourages budding artists who are excited by the latest creative advances but wondering where to begin to “practice your skills, prioritize the basics.”

“Take the time to practice and really experience the highs and lows of the creation process,” he said. “And don’t forget to maintain good well-being to maximize your potential.”

3D artist Vishal Ranga.

Check out Ranga’s portfolio on ArtStation.

Follow NVIDIA Studio on Instagram, X and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Read More

Exphormer: Scaling transformers for graph-structured data

Graphs, in which objects and their relations are represented as nodes (or vertices) and edges (or links) between pairs of nodes, are ubiquitous in computing and machine learning (ML). For example, social networks, road networks, and molecular structure and interactions are all domains in which underlying datasets have a natural graph structure. ML can be used to learn the properties of nodes, edges, or entire graphs.

A common approach to learning on graphs is the use of graph neural networks (GNNs), which operate on graph data by applying an optimizable transformation on node, edge, and global attributes. The most typical class of GNNs operates via a message-passing framework, whereby each layer aggregates the representation of a node with those of its immediate neighbors.

Recently, graph transformer models have emerged as a popular alternative to message-passing GNNs. These models build on the success of Transformer architectures in natural language processing (NLP), adapting them to graph-structured data. The attention mechanism in graph transformers can be modeled by an interaction graph, in which edges represent pairs of nodes that attend to each other. Unlike message passing architectures, graph transformers have an interaction graph that is separate from the input graph. The typical interaction graph is a complete graph, which signifies a full attention mechanism that models direct interactions between all pairs of nodes. However, this creates quadratic computational and memory bottlenecks that limit the applicability of graph transformers to datasets on small graphs with at most a few thousand nodes. Making graph transformers scalable has been considered one of the most important research directions in the field (see the first open problem here).

A natural remedy is to use a sparse interaction graph with fewer edges. Many sparse and efficient transformers have been proposed to eliminate the quadratic bottleneck for sequences, however, they do not generally extend to graphs in a principled manner.

In “Exphormer: Sparse Transformers for Graphs”, presented at ICML 2023, we address the scalability challenge by introducing a sparse attention framework for transformers that is designed specifically for graph data. The Exphormer framework makes use of expander graphs, a powerful tool from spectral graph theory, and is able to achieve strong empirical results on a wide variety of datasets. Our implementation of Exphormer is now available on GitHub.

Expander graphs

A key idea at the heart of Exphormer is the use of expander graphs, which are sparse yet well-connected graphs that have some useful properties — 1) the matrix representation of the graphs have similar linear-algebraic properties as a complete graph, and 2) they exhibit rapid mixing of random walks, i.e., a small number of steps in a random walk from any starting node is enough to ensure convergence to a “stable” distribution on the nodes of the graph. Expanders have found applications to diverse areas, such as algorithms, pseudorandomness, complexity theory, and error-correcting codes.

A common class of expander graphs is that of d-regular expanders, in which there are d edges from every node (i.e., every node has degree d). The quality of an expander graph is measured by its spectral gap, an algebraic property of its adjacency matrix (a matrix representation of the graph in which rows and columns are indexed by nodes and entries indicate whether pairs of nodes are connected by an edge). Those that maximize the spectral gap are known as Ramanujan graphs — they achieve a gap of d – 2*√(d-1), which is essentially the best possible among d-regular graphs. A number of deterministic and randomized constructions of Ramanujan graphs have been proposed over the years for various values of d. We use a randomized expander construction of Friedman, which produces near-Ramanujan graphs.
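
For intuition, the near-Ramanujan property is easy to check empirically. The sketch below (assuming numpy and networkx are available; this is a generic random regular graph, not the Friedman construction used in the paper) samples a d-regular graph and compares its second-largest adjacency eigenvalue to the Ramanujan bound 2√(d-1):

import numpy as np
import networkx as nx

# Sample a random d-regular graph and measure its spectral properties empirically.
d, n = 4, 500
G = nx.random_regular_graph(d, n, seed=0)
A = nx.to_numpy_array(G)                      # dense adjacency matrix
eigs = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]
lam2 = eigs[1]                                # eigs[0] == d for a connected d-regular graph
print(f"second eigenvalue: {lam2:.3f}, Ramanujan bound 2*sqrt(d-1): {2 * np.sqrt(d - 1):.3f}")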

Expander graphs are at the heart of Exphormer. A good expander is sparse yet exhibits rapid mixing of random walks, making its global connectivity suitable for an interaction graph in a graph transformer model.

Exphormer replaces the dense, fully-connected interaction graph of a standard Transformer with the edges of a sparse d-regular expander graph. Intuitively, the spectral approximation and mixing properties of an expander graph allow distant nodes to communicate with each other once multiple attention layers are stacked in a graph transformer architecture, even though the nodes may not attend to each other directly. Furthermore, by ensuring that d is constant (independent of the number of nodes), we obtain a linear number of edges in the resulting interaction graph.

Exphormer: Constructing a sparse interaction graph

Exphormer combines expander edges with the input graph and virtual nodes. More specifically, the sparse attention mechanism of Exphormer builds an interaction graph consisting of three types of edges:

  • Edges from the input graph (local attention)
  • Edges from a constant-degree expander graph (expander attention)
  • Edges from every node to a small set of virtual nodes (global attention)

Exphormer builds an interaction graph by combining three types of edges. The resulting graph has good connectivity properties and retains the inductive bias of the input dataset graph while still remaining sparse.

Each component serves a specific purpose: the edges from the input graph retain the inductive bias from the input graph structure (which typically gets lost in a fully-connected attention module). Meanwhile, expander edges allow good global connectivity and random walk mixing properties (which spectrally approximate the complete graph with far fewer edges). Finally, virtual nodes serve as global “memory sinks” that can directly communicate with every node. While this results in additional edges from each virtual node equal to the number of nodes in the input graph, the resulting graph is still sparse. The degree of the expander graph and the number of virtual nodes are hyperparameters to tune for improving the quality metrics.
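
As an illustrative sketch of how these three edge sets could be combined (assuming networkx; this is not the paper's implementation, and the expander here is a generic random regular graph rather than the specific construction used in Exphormer):

import networkx as nx

def build_interaction_graph(input_graph: nx.Graph, d: int = 4, num_virtual: int = 1) -> nx.Graph:
    """Combine local, expander, and global (virtual-node) edges into one sparse graph.

    Assumes the input graph's nodes are labeled 0..n-1.
    """
    n = input_graph.number_of_nodes()
    interaction = nx.Graph()
    interaction.add_nodes_from(range(n))

    # 1) Local attention: keep the edges of the input dataset graph.
    interaction.add_edges_from(input_graph.edges())

    # 2) Expander attention: overlay a random d-regular graph on the same node set.
    interaction.add_edges_from(nx.random_regular_graph(d, n, seed=0).edges())

    # 3) Global attention: connect every node to a small set of virtual nodes.
    for v in range(num_virtual):
        interaction.add_edges_from((n + v, u) for u in range(n))

    return interaction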

Furthermore, since we use an expander graph of constant degree and a small constant number of virtual nodes for the global attention, the resulting sparse attention mechanism is linear in the size of the original input graph, i.e., it models a number of direct interactions on the order of the total number of nodes and edges.

We additionally show that Exphormer is as expressive as the dense transformer and obeys universal approximation properties. In particular, when the sparse attention graph of Exphormer is augmented with self loops (edges connecting a node to itself), it can universally approximate continuous functions [1, 2].

Relation to sparse Transformers for sequences

It is interesting to compare Exphormer to sparse attention methods for sequences. Perhaps the architecture most conceptually similar to our approach is BigBird, which builds an interaction graph by combining different components. BigBird also uses virtual nodes, but, unlike Exphormer, it uses window attention and random attention from an Erdős-Rényi random graph model for the remaining components.

Window attention in BigBird looks at the tokens surrounding a token in a sequence — the local neighborhood attention in Exphormer can be viewed as a generalization of window attention to graphs.

The Erdős-Rényi graph on n nodes, G(n, p), which connects every pair of nodes independently with probability p, also functions as an expander graph for suitably high p. However, a superlinear number of edges (Ω(n log n)) is needed to ensure that an Erdős-Rényi graph is connected, let alone a good expander. On the other hand, the expanders used in Exphormer have only a linear number of edges.

Experimental results

Earlier works have applied full graph Transformer-based models to datasets with graphs of up to 5,000 nodes. To evaluate the performance of Exphormer, we build upon the celebrated GraphGPS framework [3], which combines both message passing and graph transformers and achieves state-of-the-art performance on a number of datasets. We show that replacing dense attention with Exphormer for the graph attention component in the GraphGPS framework allows one to achieve models with comparable or better performance, often with fewer trainable parameters.

Furthermore, Exphormer notably allows graph transformer architectures to scale well beyond the usual graph size limits mentioned above. Exphormer can scale up to datasets of 10,000+ node graphs, such as the Coauthor dataset, and even beyond to larger graphs such as the well-known ogbn-arxiv dataset, a citation network, which consists of 170K nodes and 1.1 million edges.

Results comparing Exphormer to standard GraphGPS on the five Long Range Graph Benchmark datasets. We note that Exphormer achieved state-of-the-art results on four of the five datasets (PascalVOC-SP, COCO-SP, Peptides-Struct, PCQM-Contact) at the time of the paper’s publication.

Model               PascalVOC-SP (F1)   COCO-SP (F1)     Peptides-Func (AP)   Peptides-Struct (MAE)   PCQM-Contact (MRR)
Standard GraphGPS   0.375 ± 0.011       0.341 ± 0.004    0.654 ± 0.004        0.250 ± 0.001           0.334 ± 0.001
Exphormer (ours)    0.398 ± 0.004       0.346 ± 0.001    0.653 ± 0.004        0.248 ± 0.001           0.364 ± 0.002

Finally, we observe that Exphormer, which creates an overlay graph of small diameter via expanders, exhibits the ability to effectively learn long-range dependencies. The Long Range Graph Benchmark is a suite of five graph learning datasets designed to measure the ability of models to capture long-range interactions. Results show that Exphormer-based models outperform standard GraphGPS models (which were previously state-of-the-art on four out of five datasets at the time of publication).

Conclusion

Graph transformers have emerged as an important architecture for ML that adapts the highly successful sequence-based transformers used in NLP to graph-structured data. Scalability has, however, proven to be a major challenge in enabling the use of graph transformers on datasets with large graphs. In this post, we have presented Exphormer, a sparse attention framework that uses expander graphs to improve scalability of graph transformers. Exphormer is shown to have important theoretical properties and exhibit strong empirical performance, particularly on datasets where it is crucial to learn long range dependencies. For more information, we point the reader to a short presentation video from ICML 2023.

Acknowledgements

We thank our research collaborators Hamed Shirzad and Danica J. Sutherland from The University of British Columbia as well as Ali Kemal Sinop from Google Research. Special thanks to Tom Small for creating the animation used in this post.

Read More

NVIDIA DRIVE Partners Showcase Cutting-Edge Innovations in Automated and Autonomous Driving

The automotive industry is being transformed by the integration of cutting-edge technologies into software-defined cars.

At CES, NVIDIA invited industry leaders to share their perspectives on how technology, especially AI and computing power, is shaping the future of transportation.

Watch the video to learn more from NVIDIA’s auto partners.

Redefining Possibilities Through Partnership

Magnus Ostberg, chief software officer at Mercedes-Benz, underscores how the company’s partnership with NVIDIA helps push technological boundaries. “[NVIDIA] enables us to go further to bring automated driving to the next level and into areas that we couldn’t go before,” he says.

Computing Power: The Driving Force Behind Autonomy

Shawn Kerrigan, chief operating officer and cofounder at Plus, emphasizes the role of computing power, saying, “Autonomous technology requires immense computing power in order to really understand the world around it and make safe driving decisions.”

“What was impossible to do previously because computing wasn’t strong enough is now doable,” says Eran Ofri, CEO of Imagry. “This is an enabler for the progress of the autonomous driving industry.”

“We wanted a platform that has a track record of being deployed in the automotive industry,” adds Stefan Solyom, chief technology officer at Pebble. “This is what NVIDIA can give us.”

And Martin Kristensson, head of product strategy at Volvo Cars, says, “We partner with NVIDIA to get the best compute that we can. More compute in the car means that we can be more aware of the environment around us and reacting earlier and being even safer.”

The Critical Role of AI

Don Burnette, CEO and founder of Kodiak Robotics, states, “NVIDIA makes best-in-class hardware accelerators, and I think it’s going to play a large role in the AI developments for self-driving going forward.”

“Driving as a routine task is tedious,” adds Tony Han, CEO and cofounder of WeRide. “We want to alleviate people from the burden of driving to give back the time. NVIDIA is the backbone of our AI engine.”

And Thomas Ingenlath, CEO of Polestar, says, “Our Polestar 3 sits on the NVIDIA DRIVE platform. This is, of course, very much based on AI technology — and it’s really fascinating and a completely new era for the car.”

Simulation Is Key

Ziv Binyamini, CEO of Foretellix, highlights the role of simulation in development and verification. “Simulation is crucial for the development of autonomous systems,” he says.

Bruce Baumgartner, vice president of supply chain at Zoox, adds, “We have been leveraging NVIDIA’s technology first and foremost on-vehicle to power the Zoox driver. We also leverage NVIDIA technologies in our cloud infrastructure. In particular, we do a lot of work in our simulator.”

Saving Lives With Autonomy

Austin Russell, CEO and founder of Luminar, highlights the opportunity to save lives by using new technology, saying, “The DRIVE platform has been incredibly helpful to be able to actually enable autonomous driving capabilities as well as enhance safety capabilities on vehicles. To be able to have an opportunity to save as many as 100 million lives and 100 trillion hours of people’s time over the next 100 years — everything that we do at the company rolls up to that.”

“Knowing that [this technology] is in vehicles worldwide and saves lives on the road each and every day — the impact that you deliver as you keep people and family safe is amazingly rewarding,” adds Tal Krzypow, vice president of product and strategy at Cipia.

Technology Helps Solve Major Challenges

Shiv Tasker, global industry vice president at Capgemini, reflects on the role of technology in addressing global challenges, saying, “Our modern world is driven by technology, and yet we face tremendous challenges. Technology is the answer. We have to solve the major issues so that we leave a better place for our children and our grandchildren.”

Learn more about the NVIDIA DRIVE platform and how it’s helping industry leaders redefine transportation.

Read More

MetaOpt: Examining, explaining, and improving heuristic performance

Heuristic algorithms, often referred to as heuristics, are tools used to approximate optimal algorithms to make faster and more efficient decisions. These are particularly useful in operational scenarios in production systems, such as determining which server a virtual machine should be assigned to or deciding whether data should be removed from a cache in a content delivery network.

However, cloud operators, who are responsible for designing and managing systems in production, often struggle to evaluate when and where their heuristics may underperform. This challenge can lead to over-provisioning the network and the inefficient use of available resources. Such practices can be costly and may result in the inability to meet customer demand.

To address this, we developed MetaOpt, a heuristic analyzer designed to enable operators to examine, explain, and improve heuristics’ performance before deploying them in critical, high-stakes environments. MetaOpt is unique because it not only compares algorithm performance but also provides insights into the underlying reasons for performance disparities between algorithms. It empowers operators and researchers to conduct “what-if” analyses, strategize how to combine heuristics in production, and understand why certain heuristics perform better in specific areas of the input space—the range of possible inputs that the heuristic may encounter.

We demonstrate MetaOpt’s capability for heuristic analysis by studying heuristics from three domains: traffic engineering, vector bin packing, and packet scheduling. MetaOpt identifies large performance gaps, enables us to prove properties about these heuristics, and guides us in improving them. Table 1 summarizes the results.

Table 1. MetaOpt enabled us to find the performance gap between heuristics from traffic engineering, vector bin packing, and packet scheduling. It also helped us prove various properties about the heuristics. Finally, it helped us modify the heuristics to improve their performance. DP refers to a heuristic Microsoft has deployed in its wide area network for traffic engineering.

Currently, MetaOpt helps Azure operators analyze heuristics in production and serves as a “helper for theorem proving.” For example, we used MetaOpt to establish a tighter bound for the “first fit decreasing” heuristic in vector bin packing, a challenge for theoreticians for over three decades. As a result, we don’t need to over-provision resources in a cloud environment, ensuring we always have sufficient servers to meet customer demand.

MetaOpt framework

To use MetaOpt, users input the heuristic they want analyzed and either the optimal algorithm or another heuristic. MetaOpt efficiently translates these inputs into a solver format. It then finds performance gaps and the inputs that cause them. Recognizing that not all users are versed in optimization theory, we designed a higher-level abstraction for MetaOpt. This feature enables users to input their heuristics using a few simple building blocks and constrain the input space to what is relevant in practice. MetaOpt can then analyze decisions made by the heuristic that led to underperformance or identify input properties that caused the heuristic to make suboptimal choices. We illustrate the MetaOpt workflow in Figure 1.

Figure 1. The four steps in the MetaOpt workflow. (1) Users encode the heuristic; (2) MetaOpt automatically rewrites it to obtain a single-level optimization; (3) it divides the problem into smaller, more manageable segments for scalability; (4) it employs existing solvers to find the highest performance gap.
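
MetaOpt encodes both algorithms for an optimization solver rather than searching by brute force, but the goal can be illustrated with a toy example: find inputs on which a heuristic underperforms the optimum. The sketch below uses first fit decreasing for one-dimensional bin packing, a simplified cousin of the vector bin packing heuristics studied above; all names and sizes are illustrative:

import itertools

def first_fit_decreasing(items, capacity=1.0):
    """FFD heuristic: place each item, largest first, into the first bin where it fits."""
    bins = []
    for item in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + item <= capacity + 1e-9:
                b.append(item)
                break
        else:
            bins.append([item])
    return len(bins)

def optimal_bins(items, capacity=1.0):
    """Exact optimum by enumerating every assignment of items to bins (tiny instances only)."""
    n, best = len(items), len(items)
    for assignment in itertools.product(range(n), repeat=n):
        loads = [0.0] * n
        feasible = True
        for item, b in zip(items, assignment):
            loads[b] += item
            if loads[b] > capacity + 1e-9:
                feasible = False
                break
        if feasible:
            best = min(best, len(set(assignment)))
    return best

# A gap-revealing input: FFD opens 3 bins where 2 suffice (0.42 + 0.32 + 0.26 = 1.0, twice).
items = [0.42, 0.42, 0.32, 0.32, 0.26, 0.26]
print("FFD:", first_fit_decreasing(items), "OPT:", optimal_bins(items))
# MetaOpt's contribution is to find such gap-maximizing inputs automatically, at realistic scale,
# by encoding the heuristic and the optimum as a single optimization handed to an existing solver.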

Rooted in game theory concepts

MetaOpt is based on Stackelberg games, a well-known class of leader-follower games in game theory. Here, the leader determines inputs for one or more followers, who must then optimize their outcomes based on these inputs. In the MetaOpt framework, the leader’s goal is to maximize the performance disparity between two algorithms (the followers) by deciding their inputs. The followers, representing the algorithms being compared, choose internal variables to optimize their outcomes. This, in turn, affects the leader’s results. We show this in Figure 2.

The Stackelberg structure of MetaOpt
Figure 2. The high-level formulation of MetaOpt.

Looking ahead

MetaOpt marks a significant advance in the field of scalable, user-friendly analytical tools. It enables users to examine, understand, and explain differences in performance across competing algorithms. It also helps them improve those algorithms before deploying them in critical environments.

We began developing MetaOpt in early 2022 to address a specific need for heuristic analysis in our network’s traffic engineering solution. Since then, our focus has been on enhancing MetaOpt’s accessibility for users without a background in optimization theory. Currently, we are improving MetaOpt’s scalability and usability, and we are expanding the range of heuristics it supports. We plan to release it as an open-source tool at the USENIX Symposium on Networked Systems Design and Implementation (NSDI) conference, scheduled for April 16–18, 2024.

We believe that MetaOpt can significantly boost productivity for those studying or designing heuristics by serving as a risk-analysis engine and a tool for explainable AI and active learning. In the near future, we aim to publish papers on new MetaOpt applications and share our language for describing heuristics.

For more details, visit the MetaOpt webpage, and review our publications page for the latest developments.

Read More

How Amazon and NVIDIA Help Sellers Create Better Product Listings With AI

It’s hard to imagine an industry more competitive — or fast-paced — than online retail.

Sellers need to create attractive and informative product listings that must be engaging, capture attention and generate trust.

Amazon uses optimized containers on Amazon Elastic Compute Cloud (Amazon EC2) with NVIDIA Tensor Core GPUs to power a generative AI tool that finds this balance at the speed of modern retail.

Amazon’s new generative AI capabilities help sellers seamlessly create compelling titles, bullet points, descriptions, and product attributes.

To get started, Amazon identifies listings where content could be improved and leverages generative AI to generate high-quality content automatically. Sellers review the generated content and can provide feedback if they want to or accept the content changes to the Amazon catalog.

Previously, creating detailed product listings required significant time and effort for sellers, but this simplified process gives them more time to focus on other tasks.

The NVIDIA TensorRT-LLM software is available today on GitHub and can be accessed through NVIDIA AI Enterprise, which offers enterprise-grade security, support, and reliability for production AI.

TensorRT-LLM open-source software makes AI inference faster and smarter. It works with large language models, such as Amazon’s models for the above capabilities, which are trained on vast amounts of text.

On NVIDIA H100 Tensor Core GPUs, TensorRT-LLM enables up to an 8x speedup on foundation LLMs such as Llama 1 and 2, Falcon, Mistral, MPT, ChatGLM, Starcoder and more.

It also supports multi-GPU and multi-node inference, in-flight batching, paged attention, and the Hopper Transformer Engine with FP8 precision, all of which improve latency and efficiency for the seller experience.

By using TensorRT-LLM and NVIDIA GPUs, Amazon improved its generative AI tool’s inference efficiency in terms of cost or GPUs needed by 2x, and reduced inference latency by 3x compared with an earlier implementation without TensorRT-LLM.

The efficiency gains make it more environmentally friendly, and the 3x latency improvement makes Amazon Catalog’s generative capabilities more responsive.

The generative AI capabilities can save sellers time and provide richer information with less effort. For example, it can enrich a listing for a wireless mouse with an ergonomic design, long battery life, adjustable cursor settings, and compatibility with various devices. It can also generate product attributes such as color, size, weight, and material. These details can help customers make informed decisions and reduce returns.

With generative AI, Amazon’s sellers can quickly and easily create more engaging listings, while being more energy efficient, making it possible to reach more customers and grow their business faster.

Developers can start with TensorRT-LLM today, with enterprise support available through NVIDIA AI Enterprise.

Read More

Accelerating Generative AI with PyTorch IV: Seamless M4T, fast

This post is the fourth part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. To skip to the code, check out our github (seamless_communication, fairseq2). We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch. In part two, we showed how to accelerate Llama-7B by almost 10x using only native PyTorch optimizations. In part three, we showed how to accelerate text-to-image diffusion models up to 3x using only native PyTorch optimizations.

In this blog, we’ll focus on speeding up FAIR’s Seamless M4T-v2 model, achieving a 2x speedup for the text decoder module and 30x for the vocoder module, which results in a 2.7x speedup for end-to-end inference with no loss of accuracy, by using CUDA Graph and native PyTorch optimization:

End to End Inference Speedup

Introduction

Seamless M4T is an open-source foundational speech/text translation and transcription technology developed by FAIR. Seamless M4T is a massively multilingual and multimodal machine translation model, with the latest version (Seamless M4T-v2) released on November 30th, 2023. The high-level model architecture of Seamless M4T-v2 is illustrated in Figure 1.

Figure 1. Model Architecture of Seamless M4T-v2.

Accelerating inference latency is crucial for translation models to improve user experience through faster communication across languages. In particular, batch_size=1 is often used for fast translation where latency matters a lot in applications such as chatbots, speech translation, and live subtitling. Therefore, we conducted the performance analysis on inference with batch_size=1, as shown in Figure 2 to understand the Amdahl’s Law bottleneck. Our results indicate that the text decoder and vocoder are the most time-consuming modules, accounting for 61% and 23% of the inference time, respectively.

Figure 2. Text decoder and vocoder are the most time consuming module. Breakdown of inference time by modules for English-Spanish S2ST (Speech-to-Speech-Text) task for batch_size=1 on A100 GPU.

To take a closer look at the performance bottleneck of the text decoder and vocoder, we analyzed GPU traces for the text decoder and vocoder for the 8th sample for the English-Spanish translation example of FLEURS dataset as shown in Figure 3. It revealed that the text decoder and vocoder are heavily CPU-bound modules. We observed a significant gap incurred by CPU overhead that delayed the launch of GPU kernels, resulting in a substantial increase in the execution time for both the modules.

(a) CPU and GPU trace for Text Decoder

(b) CPU and GPU trace for Vocoder

Figure 3. Text Decoder and Vocoder are heavily CPU-bound modules. CPU and GPU trace for (a) Text Decoder (b) Vocoder for the 8th sample for English-Spanish translation example of FLEURS dataset. The trace is obtained by running inference with batch_size=1 on A100 gpu.

Based on the real-system performance analysis showing that the text decoder and vocoder are heavily CPU-bound modules in Seamless M4T-v2, we enabled torch.compile + CUDA Graph for those modules. In this post, we share the modifications required to enable torch.compile + CUDA Graph on each module for the batch_size=1 inference scenario, a discussion of CUDA Graph, and our plans for next steps.

Torch.compile with CUDA Graph

torch.compile is a PyTorch API that allows users to compile PyTorch models into a standalone executable or script which is generally used for optimizing model performance by removing unnecessary overhead.

CUDA Graph is a feature provided by NVIDIA that allows for the optimization of kernel launches in CUDA applications. It creates an execution graph of CUDA kernels, which can be pre-processed and optimized by the driver before being executed on the GPU. The main advantage of using CUDA Graph is that it reduces the overhead associated with launching individual kernels, as the graph can be launched as a single unit, reducing the number of API calls and data transfers between the host and device. This can lead to significant performance improvements, especially for applications that have a large number of small kernels or repeat the same set of kernels multiple times. If this is something you are interested in learning more about, check out this paper that highlights the important role of data for accelerated computing: Where is the data? Why you cannot debate CPU vs. GPU performance without the answer by our own Kim Hazelwood! This is when NVIDIA was heavily investing in general-purpose GPU (GPGPUs) and before deep learning revolutionized the computing industry!
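
torch.compile can apply CUDA Graphs for you (as done in this post), but the mechanics are easier to see with the manual capture/replay API. A minimal sketch, assuming a CUDA-capable GPU, following the warm-up-on-a-side-stream pattern recommended in the PyTorch documentation:

import torch

def step(x):
    return torch.relu(x @ x)

x = torch.randn(1024, 1024, device="cuda")    # static input buffer reused across replays

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = step(x)
torch.cuda.current_stream().wait_stream(s)

# Capture: kernel launches are recorded into the graph instead of being executed one by one.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = step(x)

# Replay: copy new data into the same tensor (same pointer, same shape), then launch the
# entire graph with a single call, avoiding per-kernel launch overhead on the CPU.
x.copy_(torch.randn(1024, 1024, device="cuda"))
g.replay()
print(y.sum().item())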

However, because CUDA Graph operates on 1) fixed memory pointers and 2) fixed tensor shapes, both recorded at compile time, we introduced the following improvements so that a CUDA Graph can be reused across multiple input sizes. This prevents CUDA Graph generation for each iteration and lets the data inside the CUDA Graph be reused across different runs to share the KV cache for multiple decoding steps.

Text Decoder

The Text Decoder in Seamless is a decoder from NLLB [1] that performs T2TT (Text to Text Translation). This module is also CPU-bound: GPU execution time is not long enough to hide the CPU overhead because auto-regressive generation requires sequential processing of tokens, which limits the amount of parallelism that can be achieved on the GPU. Based on this observation, we enabled torch.compile + CUDA Graph for the text decoder to reduce the dominating CPU overhead, as shown in Figure 4.

Figure 4. CPU and GPU trace for Text Decoder after torch.compile + CUDA Graph are enabled.

1. Updating and retrieving KV cache

During inference, the text decoder has two computation phases: a prefill phase that consumes the prompt and an incremental generation phase that generates output tokens one by one. Given a high enough batch size or input length, prefill operates on a sufficiently high number of tokens in parallel — GPU performance is the bottleneck and the CPU overheads do not impact performance significantly. On the other hand, incremental token generation is always executed with sequence length 1 and it is often executed with a small batch size (even 1), e.g. for interactive use cases. Thus, incremental generation can be limited by the CPU speed and thus is a good candidate for torch.compile + CUDA Graph.

However, during the incremental token generation phase, the sequence_length dimension of key and value involved in the attention computation increases by one with each step while the sequence length of query always remains 1. Specifically, key/value are generated by appending the newly computed key/value of sequence length 1 to the key/value stored in the KV cache so far. But as mentioned above, CUDA Graph records all the shapes of tensors during compilation and replays with the recorded shapes. Thus, a few modifications have been made to address this issue, following the great work here.

a) We modify the KV-cache handling to take the indices in which to write new values in a CUDA Tensor (i.e., valid_seq_pos) rather than a Python integer.

Figure 5. Modification to KV cache append and get
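
The figure is not reproduced here; the sketch below conveys the idea under simplifying assumptions (a hypothetical pre-allocated cache of shape [batch, heads, max_seq_len, head_dim]; the real Seamless/fairseq2 code differs). The key point is that the write position is a CUDA tensor, not a Python int, so the same recorded graph can be replayed at every decoding step:

import torch

B, H, max_seq_len, D = 1, 16, 256, 64
cache_k = torch.zeros(B, H, max_seq_len, D, device="cuda")    # static buffer, fixed pointer

def append_key(cache_k, new_k, valid_seq_pos):
    # new_k: [B, H, 1, D]; valid_seq_pos: 1-element CUDA LongTensor holding the step index.
    cache_k.index_copy_(2, valid_seq_pos, new_k)               # in-place write along the seq axis
    return cache_k

valid_seq_pos = torch.tensor([3], device="cuda")               # decoding step as a tensor
append_key(cache_k, torch.randn(B, H, 1, D, device="cuda"), valid_seq_pos)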

b) We also modify attention to work with the fixed shape of key and value over the max_seq_length. We only compute softmax over the sequence positions up to the current decoding step (i.e., valid_seq_pos). To mask out sequence positions > current decoding step (i.e., valid_seq_pos), we create a boolean mask tensor (i.e., mask) where sequence positions > valid_seq_pos are set to False.

Figure 6. Helper function to generate valid_seq_pos and mask
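
A sketch of that masking step, reusing the names from above (valid_seq_pos and max_seq_len stand in for the actual fairseq2 variables):

import torch

max_seq_len = 256
valid_seq_pos = torch.tensor(3, device="cuda")                 # current decoding step

# mask is True for positions 0..valid_seq_pos and False beyond it.
mask = torch.arange(max_seq_len, device="cuda") <= valid_seq_pos

scores = torch.randn(max_seq_len, device="cuda")               # attention logits over all cached slots
scores = scores.masked_fill(~mask, float("-inf"))              # masked slots get zero softmax weight
probs = scores.softmax(dim=-1)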

It’s important to note that these modifications result in an increase in the amount of computation required, as we compute attention over more sequence positions than necessary (up to max_seq_length). However, despite this drawback, our results demonstrate that torch.compile + CUDA Graph still provide significant performance benefits compared to standard PyTorch code.

c) As different inference samples have different sequence length, it also generates different shapes of inputs that are to be projected to key and value for the cross attention layers. Thus, we pad the input to have a static shape and generate a padding mask to mask out padded output.

2. Memory Pointer Management

As CUDA Graph records memory pointers along with the shape of tensors, it is important to make different inference samples to correctly reference the recorded memory pointer (e.g., KV cache) to avoid compiling CUDA Graph for each inference sample. However, some parts of the Seamless codebase made different inference samples to refer to different memory addresses, so we made modifications to improve the memory implications.

e) Seamless adopts beam search as a text decoding strategy. In the beam search process, we need to perform KV cache reordering for all the attention layers for each incremental decoding step to make sure each selected beam performs with corresponding KV cache as shown in the code snippet below.

Figure 8. KV cache reordering operation for beam search decoding strategy.

The above code allocates new memory space and overwrites the original memory pointer for cache_k and cache_v. Thus we modified KV cache reordering to keep the memory pointer of each cache as was recorded during compilation by using copy_ operator.

Figure 9. In-place update for KV cache using copy_ operator
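
In spirit, the change replaces an allocating gather with an in-place copy; a minimal sketch (shapes and names are illustrative, not the actual Seamless code):

import torch

num_beams, H, max_seq_len, D = 5, 16, 256, 64
cache_k = torch.randn(num_beams, H, max_seq_len, D, device="cuda")
beam_order = torch.tensor([0, 0, 2, 1, 4], device="cuda")      # beams selected at this step

# Before: cache_k = cache_k[beam_order]   # new allocation, breaks the recorded memory pointer
# After: write the reordered contents back into the same storage.
cache_k.copy_(cache_k.index_select(0, beam_order))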

f) After enabling torch.compile + CUDA Graph for the text decoder by modifying the code as mentioned above, the overhead of the text decoder shifts to KV cache reordering, as shown in Figure 10. KV cache reordering repeatedly calls index_select 96 times (assuming 24 decoder layers, where each layer consists of two types of attention layers with caches for key and value).

Figure 10. CPU and GPU trace for Text Decoder after enabling torch.compile + CUDA Graph.

As part of accelerating the text decoder, we additionally applied torch.compile to KV cache reordering to benefit from kernel fusion, as shown in Figure 11. Note that we cannot use CUDA Graph (mode='max-autotune') here, because the copy_ operation modifies its inputs, which violates the static input requirement of the CUDA Graph version of torch.compile.

Figure 11. Applying torch.compile to KV Cache reordering.
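
A sketch of that change (reorder_cache is a hypothetical stand-in for the Seamless per-layer reordering loop):

import torch

def reorder_cache(caches, beam_order):
    # caches: list of (cache_k, cache_v) pairs, one per attention layer.
    for cache_k, cache_v in caches:
        cache_k.copy_(cache_k.index_select(0, beam_order))
        cache_v.copy_(cache_v.index_select(0, beam_order))

# Kernel fusion via torch.compile, but no CUDA Graph, since copy_ mutates its inputs.
reorder_cache = torch.compile(reorder_cache, mode="max-autotune-no-cudagraphs")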

As a result of enabling torch.compile for KV cache reordering, the GPU kernels that were launched separately (Figure 12(a)) are now fused, so there are far fewer GPU kernels to launch (Figure 12(b)).

(a) CPU and GPU trace for KV cache reordering before enabling torch.compile

(b) CPU and GPU trace for KV cache reordering after enabling torch.compile

Figure 12. CPU and GPU trace for KV cache reordering (a) before and (b) after enabling torch.compile

Vocoder

The vocoder in Seamless is a HiFi-GAN unit-vocoder that converts generated units to waveform output, where a unit is a representation of speech that combines different aspects such as phonemes and syllables, and which can be used to generate sounds that are audible to humans. The vocoder is a relatively simple module that consists of Conv1d and ConvTranspose1d layers and is a CPU-bound module, as shown in Figure 3. Based on this observation, we decided to enable torch.compile + CUDA Graph for the vocoder to reduce the disproportionately large CPU overhead, as shown in Figure 10. But there were several fixes to be made.

Figure 13. CPU and GPU trace for Vocoder after torch.compile + CUDA Graph are enabled.

a) The input tensor shape of the vocoder is different across different inference samples. But as CUDA Graph records the shape of tensors and replays them, we had to pad the input to the fixed size with zeros. Since vocoder only consists of Conv1d layers, we do not need an additional padding mask, and padding with zeros is sufficient.
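
A sketch of the zero-padding (max_units is a hypothetical fixed length chosen to cover the longest expected input):

import torch
import torch.nn.functional as F

max_units = 2048
units = torch.randint(0, 10000, (1, 734), device="cuda")             # variable-length sample
padded = F.pad(units, (0, max_units - units.shape[-1]), value=0)     # fixed shape: (1, max_units)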

b) The vocoder consists of Conv1d layers wrapped with torch.nn.utils.weight_norm (see here). However, applying torch.compile directly to the vocoder incurs a graph break, as shown below, which leads to a suboptimal performance improvement. This graph break happens inside the hook-handling part of the PyTorch code for weight_norm.

[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: setattr(UserDefinedObjectVariable) <function Module.__setattr__ at 0x7fac8f483c10> from user code at:
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/mnt/fsx-home/yejinlee/yejinlee/seamless_communication/src/seamless_communication/models/vocoder/vocoder.py", line 49, in forward
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return self.code_generator(x, dur_prediction)  # type: ignore[no-any-return]
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/home/yejinlee/mambaforge/envs/fairseq2_12.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return forward_call(*args, **kwargs)
[2023-12-13 04:26:16,822] [1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/mnt/fsx-home/yejinlee/yejinlee/seamless_communication/src/seamless_communication/models/vocoder/codehifigan.py", line 101, in forward
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return super().forward(x)
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/mnt/fsx-home/yejinlee/yejinlee/seamless_communication/src/seamless_communication/models/vocoder/hifigan.py", line 185, in forward
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     x = self.ups[i](x)
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/home/yejinlee/mambaforge/envs/fairseq2_12.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1550, in _call_impl
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     args_result = hook(self, args)
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/home/yejinlee/mambaforge/envs/fairseq2_12.1/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py", line 65, in __call__
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     setattr(module, self.name, self.compute_weight(module))
[1/0_2] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] 

Since the weights of the layers do not change during inference, we do not need weight normalization. So we simply removed weight normalization for the vocoder, as shown in Figure 14, by utilizing the remove_weight_norm function that is already provided in the Seamless codebase (here).

Figure 14. Removing weight_norm for Vocoder
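
The Seamless codebase ships its own remove_weight_norm helper; the generic idea, using PyTorch's built-in utility, looks roughly like this (a sketch, not the actual Seamless code):

import torch
from torch import nn

def strip_weight_norm(module: nn.Module) -> None:
    """Fold weight_norm's weight_g/weight_v back into .weight on every conv layer."""
    for m in module.modules():
        if isinstance(m, (nn.Conv1d, nn.ConvTranspose1d)):
            try:
                nn.utils.remove_weight_norm(m)
            except ValueError:
                pass   # this layer was never weight-normalized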

Performance Evaluation + Impact of CUDA Graph

Figure 15 shows the speedup result when enabling torch.compile(mode="max-autotune") + CUDA Graph on the text decoder and vocoder. We achieve a 2x speedup for the text decoder and a 30x speedup for the vocoder, leading to 2.7x faster end-to-end inference time.

Figure 15. Inference time speedup of text decoder and vocoder of applying torch.compile and torch.compile + CUDA Graph

We also report the speedups for the text decoder and vocoder using torch.compile without CUDA Graph, which is supported by torch.compile’s API (i.e., torch.compile(mode="max-autotune-no-cudagraphs")), to identify the impact of CUDA Graph on performance. Without CUDA Graph, the speedups for the text decoder and vocoder drop to 1.17x and 18.4x. While still quite significant, this indicates the important role of CUDA Graph. We conclude that Seamless M4T-v2 spends a lot of time launching CUDA kernels, especially at small batch sizes (e.g., 1), where the GPU kernel execution time is not long enough to amortize the GPU kernel launch time.

Figure 16. End-to-end inference speedup of applying torch.compile and CUDA graph incrementally. a) “Inc. Decoding”: Apply torch.compile only to the text decoder b) “Inc. Decoding w/ CUDA Graph”: Apply torch.compile + CUDA Graph to the text decoder c) “+KV Cache Reordering”: Additionally apply torch.compile to KV cache reordering operation upon b) d) “+Vocoder”: Additionally apply torch.compile to the vocoder upon c) e) “+Vocoder w/ CUDA Graph”: Additionally apply torch.compile + CUDA Graph to the vocoder upon d).

Figure 16 represents the cumulative effect of applying torch.compile with and without CUDA Graph to the modules. The results indicate a significant improvement in the end-to-end inference speedup, demonstrating the effectiveness of these techniques in optimizing the overall latency. As a result, we gain 2.7x end-to-end inference speedup for Seamless M4T-v2 with batch_size=1.

Acknowledgements

We thank the PyTorch team and Seamless team for their tremendous support with this work.

Read More

Build a vaccination verification solution using the Queries feature in Amazon Textract

Amazon Textract is a machine learning (ML) service that enables automatic extraction of text, handwriting, and data from scanned documents, surpassing traditional optical character recognition (OCR). It can identify, understand, and extract data from tables and forms with remarkable accuracy. Presently, several companies rely on manual extraction methods or basic OCR software, which is tedious and time-consuming, and requires manual configuration that needs updating when the form changes. Amazon Textract helps solve these challenges by utilizing ML to automatically process different document types and accurately extract information with minimal manual intervention. This enables you to automate document processing and use the extracted data for different purposes, such as automating loans processing or gathering information from invoices and receipts.

As travel resumes post-pandemic, verifying a traveler’s vaccination status may be required in many cases. Hotels and travel agencies often need to review vaccination cards to gather important details like whether the traveler is fully vaccinated, vaccine dates, and the traveler’s name. Some agencies do this through manual verification of cards, which can be time-consuming for staff and leaves room for human error. Others have built custom solutions, but these can be costly and difficult to scale, and take significant time to implement. Moving forward, there may be opportunities to streamline the vaccination status verification process in a way that is efficient for businesses while respecting travelers’ privacy and convenience.

Amazon Textract Queries helps address these challenges. It allows you to specify and extract only the pieces of information you need from a document, returning precise and accurate answers.

In this post, we walk you through a step-by-step implementation guide to build a vaccination status verification solution using Amazon Textract Queries. The solution showcases how to process vaccination cards using an Amazon Textract query, verify the vaccination status, and store the information for future use.

Solution overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. The user takes a photo of a vaccination card.
  2. The image is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
  3. When the image is saved in the S3 bucket, it invokes an AWS Step Functions workflow.
  4. The Queries-Decider AWS Lambda function examines the document passed in and adds information about the mime type, the number of pages, and the number of queries to the Step Functions workflow (for our example, we have four queries).
  5. NumberQueriesAndPagesChoice is a Choice state that adds conditional logic to the workflow. If the number of queries is between 15 and 31 and the number of pages is between 2 and 3,001, Amazon Textract asynchronous processing is the only option, because the synchronous API supports at most 15 queries and single-page documents. For all other cases, we route to a random selection of synchronous or asynchronous processing.
  6. The TextractSync Lambda function sends a request to Amazon Textract to analyze the document based on the following Amazon Textract queries (an equivalent direct API call is sketched after this list):
    1. What is Vaccination Status?
    2. What is Name?
    3. What is Date of Birth?
    4. What is Document Number?
  7. Amazon Textract analyzes the image and sends the answers of these queries back to the Lambda function.
  8. The Lambda function verifies the customer’s vaccination status and stores the final result in CSV format in the same S3 bucket (demoqueries-textractxxx) in the csv-output folder.
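
For reference, the queries in step 6 correspond to an AnalyzeDocument request with the QUERIES feature enabled. Outside of the Step Functions workflow, the same call can be sketched directly with boto3; the bucket and object key below are placeholders for wherever the card was uploaded.

import boto3

textract = boto3.client("textract")

# Placeholder bucket/key; in the solution the card lands in the uploads folder
# of the demoqueries-textractxxx bucket created by the stack.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "demoqueries-textractxxx", "Name": "uploads/vac_card.JPG"}},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is Vaccination Status?", "Alias": "VACCINATION_STATUS"},
            {"Text": "What is Name?", "Alias": "NAME"},
            {"Text": "What is Date of Birth?", "Alias": "DATE_OF_BIRTH"},
            {"Text": "What is Document Number?", "Alias": "DOCUMENT_NUMBER"},
        ]
    },
)

# Each QUERY block links to its QUERY_RESULT block through an ANSWER relationship.
blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY":
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for answer_id in rel["Ids"]:
                    print(block["Query"]["Text"], "->", blocks_by_id[answer_id]["Text"])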

Prerequisites

To complete this solution, you should have an AWS account and the appropriate permissions to create the resources required as part of the solution.

Download the deployment code and sample vaccination card from GitHub.

Use the Queries feature on the Amazon Textract console

Before you build the vaccination verification solution, let’s explore how you can use Amazon Textract Queries to extract vaccination status via the Amazon Textract console. You can use the vaccination card sample you downloaded from the GitHub repo.

  1. On the Amazon Textract console, choose Analyze Document in the navigation pane.
  2. Under Upload document, choose Choose document to upload the vaccination card from your local drive.
  3. After you upload the document, select Queries in the Configure Document section.
  4. You can then add queries in the form of natural language questions. Let’s add the following:
    • What is Vaccination Status?
    • What is Name?
    • What is Date of Birth?
    • What is Document Number?
  5. After you add all your queries, choose Apply configuration.
  6. Check the Queries tab to see the answers to the questions.

You can see that Amazon Textract extracts the answer to each query from the document.

Deploy the vaccination verification solution

In this post, we use an AWS Cloud9 instance and install the necessary dependencies on the instance with the AWS Cloud Development Kit (AWS CDK) and Docker. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser.

  1. In the terminal, choose Upload Local Files on the File menu.
  2. Choose Select folder and choose the vaccination_verification_solution folder you downloaded from GitHub.
  3. In the terminal, prepare your serverless application for deployment by installing its dependencies with the following commands:
    $ cd vaccination_verification_solution/
    $ pip install -r requirements.txt
    

  4. Deploy the application using the cdk deploy command:
    cdk deploy DemoQueries --outputs-file demo_queries.json --require-approval never

    Wait for the AWS CDK to deploy the stack and create the resources defined in the template.

  5. When deployment is complete, you can check the deployed resources on the AWS CloudFormation console on the Resources tab of the stack details page.

Test the solution

Now it’s time to test the solution. To trigger the workflow, use aws s3 cp to upload the vac_card.JPG file from the docs folder to DemoQueries.DocumentUploadLocation:

aws s3 cp docs/vac_card.JPG $(aws cloudformation list-exports --query 'Exports[?Name==`DemoQueries-DocumentUploadLocation`].Value' --output text)


The vaccination certificate file automatically gets uploaded to the S3 bucket demoqueries-textractxxx in the uploads folder.

The Step Functions workflow is triggered via a Lambda function as soon as the vaccination certificate file is uploaded to the S3 bucket.

The Queries-Decider Lambda function examines the document and adds information about the mime type, the number of pages, and the number of queries to the Step Functions workflow (for this example, we use four queries—document number, customer name, date of birth, and vaccination status).

The TextractSync function sends the input queries to Amazon Textract and synchronously returns the full result as part of the response. It supports 1-page documents (TIFF, PDF, JPG, PNG) and up to 15 queries. The GenerateCsvTask function takes the JSON output from Amazon Textract and converts it to a CSV file.

The final output is stored in the same S3 bucket in the csv-output folder as a CSV file.

You can download the file to your local machine using the following command:

aws s3 cp <paste the S3 URL from TextractOutputCSVPath> .

The result is a CSV file with the following columns: timestamp, classification, filename, page number, key name, key_confidence, value, value_confidence, key_bb_top, key_bb_height, key_bb_width, key_bb_left, value_bb_top, value_bb_height, value_bb_width, value_bb_left.
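
As a rough illustration of consuming that file, the sketch below reads a downloaded copy with Python's csv module. The local file name is hypothetical, and the column names are taken from the format listed above; the post does not state whether the file contains a header row, so they are supplied explicitly here.

import csv

# Column order as described above; adjust if the generated file differs.
COLUMNS = [
    "timestamp", "classification", "filename", "page_number",
    "key_name", "key_confidence", "value", "value_confidence",
    "key_bb_top", "key_bb_height", "key_bb_width", "key_bb_left",
    "value_bb_top", "value_bb_height", "value_bb_width", "value_bb_left",
]

# Hypothetical local copy downloaded with the aws s3 cp command above.
with open("vac_card_result.csv", newline="") as f:
    for row in csv.DictReader(f, fieldnames=COLUMNS):
        # Print each extracted key/value pair with its confidence score.
        print(f"{row['key_name']} = {row['value']} "
              f"(value confidence {row['value_confidence']})")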

You can scale the solution to hundreds of vaccination certificate documents for multiple customers by uploading their vaccination certificates to DemoQueries.DocumentUploadLocation. This automatically triggers multiple runs of the Step Functions state machine, and the final result is stored in the same S3 bucket in the csv-output folder.

To change the initial set of queries that are fed into Amazon Textract, go to your AWS Cloud9 instance and open the start_execution.py file. In the file view in the left pane, navigate to lambda, start_queries, app, start_execution.py. This Lambda function is invoked when a file is uploaded to DemoQueries.DocumentUploadLocation. The queries sent to the workflow are defined in start_execution.py; you can change them by updating that code (a hypothetical excerpt is sketched below).
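
The excerpt below is hypothetical and only illustrates the shape of such an edit; the actual start_execution.py in the GitHub repo may be organized differently.

# Hypothetical excerpt: conceptually, the queries are a list of
# natural-language questions that you can add to or replace.
QUERIES = [
    "What is Vaccination Status?",
    "What is Name?",
    "What is Date of Birth?",
    "What is Document Number?",
]

def build_queries_config():
    # Textract expects each query as a {"Text": ...} entry in QueriesConfig.
    return {"Queries": [{"Text": q} for q in QUERIES]}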

Clean up

To avoid incurring ongoing charges, delete the resources created in this post using the following command:

cdk destroy DemoQueries

Answer the question Are you sure you want to delete: DemoQueries (y/n)? with y.

Conclusion

In this post, we showed you how to use Amazon Textract Queries to build a vaccination verification solution for the travel industry. You can use Amazon Textract Queries to build solutions in other industries like finance and healthcare, and retrieve information from documents such as paystubs, mortgage notes, and insurance cards based on natural language questions.

For more information, see Analyzing Documents, or check out the Amazon Textract console and try out this feature.


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Rishabh Yadav is a Partner Solutions Architect at AWS with an extensive background in DevOps and Security offerings at AWS. He works with ASEAN partners to provide guidance on enterprise cloud adoption and architecture reviews, along with building AWS practices through the implementation of the Well-Architected Framework. Outside of work, he likes to spend his time on the sports field and in FPS gaming.

Read More

Large-scale Training of Foundation Models for Wearable Biosignals

Tracking biosignals is crucial for monitoring wellness and preempting the development of severe medical conditions. Today, wearable devices can conveniently record various biosignals, creating the opportunity to monitor health status without disruption to one’s daily routine. Despite the widespread use of wearable devices and existing digital biomarkers, the absence of curated data with annotated medical labels hinders the development of new biomarkers to measure common health conditions. In fact, medical datasets are usually small in comparison to other domains, which is an obstacle for…Apple Machine Learning Research