A new state of the art for unsupervised vision

Labeling data can be a chore. It’s the main source of sustenance for computer-vision models; without it, they’d have a lot of difficulty identifying objects, people, and other important image characteristics. Yet producing just an hour of tagged and labeled data can take a whopping 800 hours of human time. As machines get better at perceiving and interacting with their surroundings, our own high-fidelity understanding of the world grows along with them. But they need more help.

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), Microsoft, and Cornell University have attempted to solve this problem plaguing vision models by creating “STEGO,” an algorithm that can jointly discover and segment objects without any human labels at all, down to the pixel.

STEGO learns something called “semantic segmentation” — fancy speak for the process of assigning a label to every pixel in an image. Semantic segmentation is an important skill for today’s computer-vision systems because images can be cluttered with objects. Even more challenging is that these objects don’t always fit into literal boxes; algorithms tend to work better for discrete “things” like people and cars as opposed to “stuff” like vegetation, sky, and mashed potatoes. A previous system might simply perceive a nuanced scene of a dog playing in the park as just a dog, but by assigning every pixel of the image a label, STEGO can break the image into its main ingredients: a dog, sky, grass, and its owner.

Assigning every single pixel of the world a label is ambitious — especially without any kind of feedback from humans. The majority of algorithms today get their knowledge from mounds of labeled data, which can take painstaking human-hours to source. Just imagine the excitement of labeling every pixel of 100,000 images! To discover these objects without a human’s helpful guidance, STEGO looks for similar objects that appear throughout a dataset. It then associates these similar objects together to construct a consistent view of the world across all of the images it learns from.

Seeing the world

Machines that can “see” are crucial for a wide array of new and emerging technologies like self-driving cars and predictive modeling for medical diagnostics. Since STEGO can learn without labels, it can detect objects in many different domains, even those that humans don’t yet understand fully. 

“If you’re looking at oncological scans, the surface of planets, or high-resolution biological images, it’s hard to know what objects to look for without expert knowledge. In emerging domains, sometimes even human experts don’t know what the right objects should be,” says Mark Hamilton, a PhD student in electrical engineering and computer science at MIT, research affiliate of MIT CSAIL, software engineer at Microsoft, and lead author on a new paper about STEGO. “In these types of situations where you want to design a method to operate at the boundaries of science, you can’t rely on humans to figure it out before machines do.”

STEGO was tested on a slew of visual domains spanning general images, driving images, and high-altitude aerial photographs. In each domain, STEGO was able to identify and segment relevant objects that were closely aligned with human judgments. STEGO’s most diverse benchmark was the COCO-Stuff dataset, which is made up of diverse images from all over the world, from indoor scenes to people playing sports to trees and cows. In most cases, the previous state-of-the-art system could capture a low-resolution gist of a scene, but struggled on fine-grained details: A human was a blob, a motorcycle was captured as a person, and it couldn’t recognize any geese. On the same scenes, STEGO doubled the performance of previous systems and discovered concepts like animals, buildings, people, furniture, and many others.

STEGO not only doubled the performance of prior systems on the COCO-Stuff benchmark, but made similar leaps forward in other visual domains. When applied to driverless car datasets, STEGO successfully segmented out roads, people, and street signs with much higher resolution and granularity than previous systems. On images from space, the system broke down every single square foot of the surface of the Earth into roads, vegetation, and buildings. 

Connecting the pixels

STEGO — which stands for “Self-supervised Transformer with Energy-based Graph Optimization” — builds on top of the DINO algorithm, which learned about the world through 14 million images from the ImageNet database. STEGO refines the DINO backbone through a learning process that mimics our own way of stitching together pieces of the world to make meaning. 

For example, you might consider two images of dogs walking in the park. Even though they’re different dogs, with different owners, in different parks, STEGO can tell (without humans) how each scene’s objects relate to each other. The authors even probe STEGO’s mind to see how each little, brown, furry thing in the images is similar, and likewise with other shared objects like grass and people. By connecting objects across images, STEGO builds a consistent view of the world.
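For a concrete (and deliberately simplified) picture of what “connecting objects across images” can look like, here is a minimal PyTorch sketch of a correspondence-distillation-style training step: a frozen DINO-style backbone scores how similar pairs of pixels are across two images, and a small segmentation head is trained so that its features agree where the backbone agrees. The shapes, the bias value, and the loss form are illustrative assumptions, not the authors’ released code.

```python
import torch
import torch.nn.functional as F

def correlation(a, b):
    """Cosine-similarity tensor between two feature maps of shape (C, H, W)."""
    a = F.normalize(a.flatten(1), dim=0)   # (C, H*W), unit-norm feature per location
    b = F.normalize(b.flatten(1), dim=0)
    return a.T @ b                          # (H*W, H*W) pairwise similarities

# Frozen backbone features for two images of "dogs in a park" (hypothetical shapes).
feats_1 = torch.randn(384, 28, 28)          # e.g., DINO ViT-S patch features
feats_2 = torch.randn(384, 28, 28)

seg_head = torch.nn.Conv2d(384, 27, kernel_size=1)   # small trainable segmentation head

def distillation_loss(f1, f2, head, bias=0.3):
    s1 = head(f1.unsqueeze(0)).squeeze(0)    # (K, H, W) low-dimensional segmentation features
    s2 = head(f2.unsqueeze(0)).squeeze(0)
    feat_corr = correlation(f1, f2)          # how similar the backbone thinks pixel pairs are
    seg_corr = correlation(s1, s2)           # how similar the head thinks pixel pairs are
    # Pull segmentation features together where the backbone agrees,
    # push them apart where it does not (the bias sets that threshold).
    return -((feat_corr - bias) * seg_corr).mean()

loss = distillation_loss(feats_1, feats_2, seg_head)
loss.backward()                              # gradients flow only into the segmentation head
```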

“The idea is that these types of algorithms can find consistent groupings in a largely automated fashion so we don’t have to do that ourselves,” says Hamilton. “It might have taken years to understand complex visual datasets like biological imagery, but if we can avoid spending 1,000 hours combing through data and labeling it, we can find and discover new information that we might have missed. We hope this will help us understand the visual world in a more empirically grounded way.”

Looking ahead

Despite its improvements, STEGO still faces certain challenges. One is that labels can be arbitrary. For example, the labels of the COCO-Stuff dataset distinguish between “food-things” like bananas and chicken wings, and “food-stuff” like grits and pasta. STEGO doesn’t see much of a distinction there. In other cases, STEGO was confused by odd images — like one of a banana sitting on a phone receiver — where the receiver was labeled “foodstuff,” instead of “raw material.” 

In upcoming work, they’re planning to explore giving STEGO a bit more flexibility than labeling pixels into a fixed number of classes, since things in the real world can sometimes be multiple things at the same time (like “food,” “plant,” and “fruit”). The authors hope this will give the algorithm room for uncertainty, trade-offs, and more abstract thinking.

“In making a general tool for understanding potentially complicated datasets, we hope that this type of an algorithm can automate the scientific process of object discovery from images. There’s a lot of different domains where human labeling would be prohibitively expensive, or humans simply don’t even know the specific structure, like in certain biological and astrophysical domains. We hope that future work enables application to a very broad scope of datasets. Since you don’t need any human labels, we can now start to apply ML tools more broadly,” says Hamilton.

“STEGO is simple, elegant, and very effective. I consider unsupervised segmentation to be a benchmark for progress in image understanding, and a very difficult problem. The research community has made terrific progress in unsupervised image understanding with the adoption of transformer architectures,” says Andrea Vedaldi, professor of computer vision and machine learning and a co-lead of the Visual Geometry Group at the engineering science department of the University of Oxford. “This research provides perhaps the most direct and effective demonstration of this progress on unsupervised segmentation.” 

Hamilton wrote the paper alongside MIT CSAIL PhD student Zhoutong Zhang, Assistant Professor Bharath Hariharan of Cornell University, Associate Professor Noah Snavely of Cornell Tech, and MIT professor William T. Freeman. They will present the paper at the 2022 International Conference on Learning Representations (ICLR). 


Tooth Tech: AI Takes Bite Out of Dental Slide Misses by Assisting Doctors

Your next trip to the dentist might offer a taste of AI.

Pearl, a West Hollywood startup, provides AI for dental images to assist in diagnosis. It landed FDA clearance last month, the first to get such a go-ahead for dentistry AI.

The approval paves the way for its use in clinics across the United States.

“It’s really a first of its kind for dentistry,” said Ophir Tanz, co-founder and CEO of Pearl. “But we also have similar regulatory approvals across 50 countries globally.”

Pearl’s software platform, available in the cloud as a service, enables dentists to run real-time screening of X-rays. Dentists can then review the AI findings and share them with patients to facilitate informed dentist-patient discussions about diagnosis and treatment planning.

Behind the scenes, NVIDIA GPU-driven convolutional neural networks developed by Pearl can spot not just tooth decay but many other dental issues, like cracked crowns and root abscesses requiring root canals.

Pearl’s AI delivers results for dentists. The startup’s FDA application showed that, on average, Pearl’s AI spotted 36 percent more pathologies and other dental issues than an average dentist. “And that’s important because in dentistry it’s extremely common and routine to miss a pathology,” said Tanz.

The company’s products include its Practice Intelligence, which enables dental practices to run AI on patient data to discover missed diagnoses and treatment opportunities. Pearl Protect can help screen for dental insurance fraud, waste and abuse, while Claims Review offers automated claims examination.

Pearl, founded in 2019, is a member of the NVIDIA Inception startup program, which provided it access to Deep Learning Institute courses, NVIDIA Developer Forums and technical workshops.

Hatching Dental AI

The son of a dentist, Tanz has a mouthful of a founding tale. The entrepreneur decided to pursue AI for dental radiology after talking shop on a visit with his dentist. A partner at the practice liked the idea so much he jumped on board as a co-founder.

Pearl co-founders Cambron Carter, Kyle Stanley and Ophir Tanz (left to right)

Tanz, who founded tech unicorn GumGum for AI to analyze images, video and text for better contextual advertising, was joined by GumGum colleague Cambron Carter, now CTO and co-founder at Pearl. Dentist Kyle Stanley, co-founder and chief clinical officer, rounds out the trio with clinical experience.

Pearl’s founders targeted a host of conditions commonly addressed in dental clinics. They labeled more than a million images to help train their proprietary CNN models, running on NVIDIA V100 Tensor Core GPUs in the cloud, to identify issues. Before that they had prototyped on local NVIDIA-powered workstations.

Inference is done on cloud-based GPUs, where Pearl’s system synchronizes with the dentist’s real-time and historical radiology data. “The dental vertical is still undergoing a transition to the cloud, and now we’re bringing them AI in the cloud — we represent a wave of technology that will propel the field of dentistry into the future,” said Carter.

Getting FDA approval wasn’t easy, he said. It required completing an expansive clinical trial. Pearl submitted four studies, each involving thousands of X-rays and over 80 expert dentists and radiologists.

Getting Second Opinion for Diagnosis

Pearl offers dentists a product called Second Opinion to aid in the detection of disease in radiography. Second Opinion can identify dozens of conditions to help validate dentists’ findings, according to Tanz.

“We’re the only company in the world that is able to diagnose pathology and detect disease in an AI-driven manner in the dental practice,” he said. “We’re driving a much more comprehensive diagnosis, and it’s a diagnostic aid for general practitioners.”

Second Opinion is taking root in clinics. Sage Dental, which has more than 60 offices across the East Coast, is a customer, as is Dental 365, which also has more than 60 offices in the region.

“Second Opinion is an extremely important tool for the future of dentistry,” said Cindy Roark, chief clinical officer at Sage. “Dentistry has needed consistency for a very long time. Diagnosis is highly variable, and variability leads to confusion and distrust from patients.”

Boosting Doctor-Patient Rapport 

Dentists review X-rays while patients are in the chair, pointing out any issues as they go. Even for experienced dentists, making sense of the grayscale imagery that forms the basis of most treatment plans can be challenging — only compounded by the many demands on their attention throughout a busy day juggling patients.

For patients, comprehending the indistinct gradations in X-rays that separate healthy tooth structures from unhealthy ones is even harder.

But with AI-aided images, dentists are able to present areas of concern outlined by simple, digestible bounding boxes. This ensures that their treatment plans have a sound basis, while providing patients with a much clearer picture of what exactly is going on in their X-rays.

“You’re able to have a highly visual sort of discussion and paint a visual narrative for patients so that they really start to understand what is going on in their mouth,” said Dr. Stanley.



GFN Thursday Is Fit for the Gods: ‘God of War’ Arrives on GeForce NOW

The gods must be smiling this GFN Thursday — God of War today joins the GeForce NOW library.

Sony Interactive Entertainment and Santa Monica Studios’ masterpiece is available to stream from GeForce NOW servers, across nearly all devices and at up to 1440p and 120 frames per second for RTX 3080 members.

Get ready to experience Kratos’ latest adventure as part of nine titles joining this week.

The Story of a Generation Comes to the Cloud

This GFN Thursday, God of War (Steam) comes to GeForce NOW.

With his vengeance against the Gods of Olympus years behind him, play as Kratos, now living as a man in the realm of Norse Gods and monsters. Mentor his son, Atreus, to survive a dark, elemental world filled with fearsome creatures and use your weapons and abilities to protect him by engaging in grand and grueling combat.

God of War’s PC port is as much a masterpiece as the original game, and RTX 3080 members can experience it the way its developers intended. Members can explore the dark, elemental world of fearsome creatures at up to 1440p and 120 FPS on PC, and up to 4K on SHIELD TV. The power of AI in NVIDIA DLSS brings every environment to life with phenomenal graphics and uncompromised image quality. Engage in visceral, physical combat with ultra-low latency that rivals even local console experiences.

Streaming from the cloud, you can play one of the best action games on your Mac at up to 1440p or 1600p on supported devices. Or take the action with you by streaming to your mobile device at up to 120 FPS, with up to eight-hour gaming session lengths for RTX 3080 members.

Enter the Norse realm and play God of War today.

More? More.

This week also brings new instant-play free game demos streaming on GeForce NOW.

Squish, bop and bounce around to the rhythms of an electronica soundscape in the demo for Lumote: The Mastermote Chronicles. Give the demo a try for free before adding the full title to your wishlist on Steam.

Experience the innovative games being developed by studios from across the greater China region, which will participate in the Enter the Dragon indie game festival. Starting today, play the Nobody – The Turnaround demo, and look for others to be added in the days ahead.

Finally, get ready for the upcoming launch of Terraformers with the instant-play free demo that went live on GeForce NOW last week.

MotoGP22 on GeForce NOW
Get your zoomies out and be the fastest on the track in the racing game MotoGP22.

In addition, members can look for the following games arriving in full this week:




Anticipating others’ behavior on the road

Humans may be one of the biggest roadblocks keeping fully autonomous vehicles off city streets.

If a robot is going to navigate a vehicle safely through downtown Boston, it must be able to predict what nearby drivers, cyclists, and pedestrians are going to do next.

Behavior prediction is a tough problem, however, and current artificial intelligence solutions are either too simplistic (they may assume pedestrians always walk in a straight line), too conservative (to avoid pedestrians, the robot just leaves the car in park), or can only forecast the next moves of one agent (roads typically carry many users at once).

MIT researchers have devised a deceptively simple solution to this complicated challenge. They break a multiagent behavior prediction problem into smaller pieces and tackle each one individually, so a computer can solve this complex task in real time.

Their behavior-prediction framework first guesses the relationships between two road users — which car, cyclist, or pedestrian has the right of way, and which agent will yield — and uses those relationships to predict future trajectories for multiple agents.

These estimated trajectories were more accurate than those from other machine-learning models, compared to real traffic flow in an enormous dataset compiled by autonomous driving company Waymo. The MIT technique even outperformed Waymo’s recently published model. And because the researchers broke the problem into simpler pieces, their technique used less memory.

“This is a very intuitive idea, but no one has fully explored it before, and it works quite well. The simplicity is definitely a plus. We are comparing our model with other state-of-the-art models in the field, including the one from Waymo, the leading company in this area, and our model achieves top performance on this challenging benchmark. This has a lot of potential for the future,” says co-lead author Xin “Cyrus” Huang, a graduate student in the Department of Aeronautics and Astronautics and a research assistant in the lab of Brian Williams, professor of aeronautics and astronautics and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

Joining Huang and Williams on the paper are three researchers from Tsinghua University in China: co-lead author Qiao Sun, a research assistant; Junru Gu, a graduate student; and senior author Hang Zhao PhD ’19, an assistant professor. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Multiple small models

The researchers’ machine-learning method, called M2I, takes two inputs: past trajectories of the cars, cyclists, and pedestrians interacting in a traffic setting such as a four-way intersection, and a map with street locations, lane configurations, etc.

Using this information, a relation predictor infers which of two agents has the right of way first, classifying one as a passer and one as a yielder. Then a prediction model, known as a marginal predictor, guesses the trajectory for the passing agent, since this agent behaves independently.

A second prediction model, known as a conditional predictor, then guesses what the yielding agent will do based on the actions of the passing agent. The system predicts a number of different trajectories for the yielder and passer, computes the probability of each one individually, and then selects the six joint results with the highest likelihood of occurring.
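The factorization described above can be summarized in a short Python sketch. The three predictors below are hypothetical stand-ins for the learned networks in M2I; the point is only how marginal and conditional predictions are combined and ranked, not how the networks themselves are built.

```python
def predict_joint_trajectories(agent_a, agent_b, relation_predictor,
                               marginal_predictor, conditional_predictor, top_k=6):
    """Combine marginal and conditional predictions into ranked joint futures."""
    # 1. Relation predictor: decide which agent passes and which yields.
    passer, yielder = relation_predictor(agent_a, agent_b)

    # 2. Marginal predictor: the passer behaves (approximately) independently.
    #    Each candidate is a (trajectory, probability) pair.
    passer_candidates = marginal_predictor(passer)

    # 3. Conditional predictor: the yielder's future depends on the passer's trajectory.
    joint_candidates = []
    for passer_traj, p_passer in passer_candidates:
        for yielder_traj, p_yielder in conditional_predictor(yielder, passer_traj):
            joint_candidates.append(((passer_traj, yielder_traj), p_passer * p_yielder))

    # 4. Keep the top_k (here, six) most likely joint futures.
    joint_candidates.sort(key=lambda item: item[1], reverse=True)
    return joint_candidates[:top_k]

# Toy stand-ins so the sketch runs end to end.
relation = lambda a, b: (a, b)                                   # a passes, b yields
marginal = lambda agent: [("straight", 0.7), ("slow_down", 0.3)]
conditional = lambda agent, passer_traj: [("wait", 0.6), ("go", 0.4)]
print(predict_joint_trajectories("car", "pedestrian", relation, marginal, conditional))
```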

M2I outputs a prediction of how these agents will move through traffic for the next eight seconds. In one example, their method caused a vehicle to slow down so a pedestrian could cross the street, then speed up when they cleared the intersection. In another example, the vehicle waited until several cars had passed before turning from a side street onto a busy, main road.

While this initial research focuses on interactions between two agents, M2I could infer relationships among many agents and then guess their trajectories by linking multiple marginal and conditional predictors.


Real-world driving tests

The researchers trained the models using the Waymo Open Motion Dataset, which contains millions of real traffic scenes involving vehicles, pedestrians, and cyclists recorded by lidar (light detection and ranging) sensors and cameras mounted on the company’s autonomous vehicles. They focused specifically on cases with multiple agents.

To determine accuracy, they compared each method’s six prediction samples, weighted by their confidence levels, to the actual trajectories followed by the cars, cyclists, and pedestrians in a scene. Their method was the most accurate. It also outperformed the baseline models on a metric known as overlap rate; if two trajectories overlap, that indicates a collision. M2I had the lowest overlap rate.

“Rather than just building a more complex model to solve this problem, we took an approach that is more like how a human thinks when they reason about interactions with others. A human does not reason about all hundreds of combinations of future behaviors. We make decisions quite fast,” Huang says.

Another advantage of M2I is that, because it breaks the problem down into smaller pieces, it is easier for a user to understand the model’s decision making. In the long run, that could help users put more trust in autonomous vehicles, says Huang.

But the framework can’t account for cases where two agents are mutually influencing each other, like when two vehicles each nudge forward at a four-way stop because the drivers aren’t sure who should be yielding.

They plan to address this limitation in future work. They also want to use their method to simulate realistic interactions between road users, which could be used to verify planning algorithms for self-driving cars or create huge amounts of synthetic driving data to improve model performance.

“Predicting future trajectories of multiple, interacting agents is under-explored and extremely challenging for enabling full autonomy in complex scenes. M2I provides a highly promising prediction method with the relation predictor to discriminate agents predicted marginally or conditionally which significantly simplifies the problem,” wrote Masayoshi Tomizuka, the Cheryl and John Neerhout, Jr. Distinguished Professor of Mechanical Engineering at University of California at Berkeley and Wei Zhan, an assistant professional researcher, in an email. “The prediction model can capture the inherent relation and interactions of the agents to achieve the state-of-the-art performance.” The two colleagues were not involved in the research.

This research is supported, in part, by the Qualcomm Innovation Fellowship. Toyota Research Institute also provided funds to support this work.


FormNet: Beyond Sequential Modeling for Form-Based Document Understanding

Form-based document understanding is a growing research topic because of its practical potential for automatically converting unstructured text data into structured information to gain insight about a document’s contents. Recent sequence models built on self-attention, a mechanism that directly models relationships between all words in a selection of text, have demonstrated state-of-the-art performance on natural language tasks. A natural approach to handling form document understanding tasks is to first serialize the form documents (usually in a left-to-right, top-to-bottom fashion) and then apply state-of-the-art sequence models to them.

However, form documents often have more complex layouts that contain structured objects, such as tables, columns, and text blocks. Their variety of layout patterns makes serialization difficult, substantially limiting the performance of strict serialization approaches. These unique challenges in form document structural modeling have been largely underexplored in the literature.

An illustration of the form document information extraction task using an example from the FUNSD dataset.

In “FormNet: Structural Encoding Beyond Sequential Modeling in Form Document Information Extraction”, presented at ACL 2022, we propose a structure-aware sequence model, called FormNet, to mitigate the sub-optimal serialization of forms for document information extraction. First, we design a Rich Attention (RichAtt) mechanism that leverages the 2D spatial relationship between word tokens for more accurate attention weight calculation. Then, we construct Super-Tokens (tokens that aggregate semantically meaningful information from neighboring tokens) for each word by embedding representations from their neighboring tokens through a graph convolutional network (GCN). Finally, we demonstrate that FormNet outperforms existing methods, while using less pre-training data, and achieves state-of-the-art performance on the CORD, FUNSD, and Payment benchmarks.

FormNet for Information Extraction
Given a form document, we first use the BERT-multilingual vocabulary and optical character recognition (OCR) engine to identify and tokenize words. We then feed the tokens and their corresponding 2D coordinates into a GCN for graph construction and message passing. Next, we use Extended Transformer Construction (ETC) layers with the proposed RichAtt mechanism to continue to process the GCN-encoded structure-aware tokens for schema learning (i.e., semantic entity extraction). Finally, we use the Viterbi algorithm, which finds a sequence that maximizes the posterior probability, to decode and obtain the final entities for output.
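As a concrete illustration of the final decoding step, here is a minimal, generic Viterbi decoder over per-token entity-tag scores. The tag set, emission scores, and transition matrix are toy assumptions; this is not FormNet’s actual decoding code, just a sketch of the standard algorithm it refers to.

```python
import numpy as np

def viterbi_decode(emission, transition):
    """emission: (T, K) per-token tag scores; transition: (K, K) tag-to-tag scores."""
    T, K = emission.shape
    score = emission[0].copy()                 # best score ending in each tag at t = 0
    backpointer = np.zeros((T, K), dtype=int)

    for t in range(1, T):
        # Score of extending each previous tag j to each current tag k.
        candidate = score[:, None] + transition + emission[t][None, :]   # (K, K)
        backpointer[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)

    # Trace back the highest-scoring path.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return list(reversed(path))

# Toy usage: 5 tokens, 3 hypothetical tags (e.g., O, B-header, I-header).
emission = np.random.randn(5, 3)
transition = np.random.randn(3, 3)
print(viterbi_decode(emission, transition))
```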

Extended Transformer Construction (ETC)
We adopt ETC as the FormNet model backbone. ETC scales to relatively long inputs by replacing standard attention, which has quadratic complexity, with a sparse global-local attention mechanism that distinguishes between global and long input tokens. The global tokens attend to and are attended by all tokens, but the long tokens attend only locally to other long tokens within a specified local radius, reducing the complexity so that it is more manageable for long sequences.
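A small sketch of that global-local pattern, written as an attention mask, may help make it concrete. The token counts and radius below are arbitrary assumptions, and this is an illustration of the idea rather than ETC’s implementation.

```python
import numpy as np

def global_local_mask(num_global, num_long, radius):
    """Boolean mask: True where token i may attend to token j."""
    n = num_global + num_long
    mask = np.zeros((n, n), dtype=bool)

    # Global tokens attend to all tokens, and all tokens attend to global tokens.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Long tokens attend only to long tokens within the local radius.
    for i in range(num_global, n):
        lo = max(num_global, i - radius)
        hi = min(n, i + radius + 1)
        mask[i, lo:hi] = True
    return mask

print(global_local_mask(num_global=2, num_long=8, radius=2).astype(int))
```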

Rich Attention
Our novel architecture, RichAtt, avoids the deficiencies of absolute and relative embeddings by avoiding embeddings entirely. Instead, it computes the order of and log distance between pairs of tokens with respect to the x and y axes on the layout grid, and adjusts the pre-softmax attention scores of each pair as a direct function of these values.

In a traditional attention layer, each token representation is linearly transformed into a Query vector, a Key vector, and a Value vector. A token “looks” for other tokens from which it might want to absorb information (i.e., attend to) by finding the ones with Key vectors that create relatively high scores when matrix-multiplied (called Matmul) by its Query vector and then softmax-normalized. The token then sums together the Value vectors of all other tokens in the sentence, weighted by their score, and passes this up the network, where it will normally be added to the token’s original input vector.

However, other features beyond the Query and Key vectors are often relevant to the decision of how strongly a token should attend to another given token, such as the order they’re in, how many other tokens separate them, or how many pixels apart they are. In order to incorporate these features into the system, we use a trainable parametric function paired with an error network, which takes the observed feature and the output of the parametric function and returns a penalty that reduces the dot product attention score.

The network uses the Query and Key vectors to consider what value some low-level feature (e.g., distance) should take if the tokens are related, and penalizes the attention score based on the error.

At a high level, for each attention head at each layer, FormNet examines each pair of token representations, determines the ideal features the tokens should have if there is a meaningful relationship between them, and penalizes the attention score according to how different the actual features are from the ideal ones. This allows the model to learn constraints on attention using logical implication.

A visualization of how RichAtt might act on a sentence. There are three adjectives that the word “crow” might attend to. “Lazy” is to the right, so it probably does not modify “crow” and its attention edge is penalized. “Sly” is many tokens away, so its attention edge is also penalized. “Cunning” receives no significant penalties, so by process of elimination, it is the best candidate for attention.

Furthermore, if one assumes that the softmax-normalized attention scores represent a probability distribution, and the distributions for the observed features are known, then this algorithm — including the exact choice of parametric functions and error functions — falls out algebraically, meaning FormNet has a mathematical correctness to it that is lacking from many alternatives (including relative embeddings).
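To make the penalty idea above more tangible, here is a schematic PyTorch sketch: a small trainable function predicts, from each Query/Key pair, what a low-level feature such as log spatial distance “should” be, and the squared error is subtracted from the pre-softmax attention score. The bilinear predictor, the squared-error “error network,” and the shapes are illustrative assumptions, not the published RichAtt equations.

```python
import torch
import torch.nn.functional as F

d, n = 64, 16                                   # head dimension and number of tokens
q = torch.randn(n, d)
k = torch.randn(n, d)
v = torch.randn(n, d)
log_dist = torch.randn(n, n).abs()              # stand-in for observed log distance between pairs

# Trainable parametric function: predicts the "ideal" log distance for each pair.
ideal_dist = torch.nn.Bilinear(d, d, 1)

scores = q @ k.T / d ** 0.5                     # standard dot-product attention scores
pred = ideal_dist(q.unsqueeze(1).expand(n, n, d).reshape(-1, d),
                  k.unsqueeze(0).expand(n, n, d).reshape(-1, d)).view(n, n)
penalty = (pred - log_dist) ** 2                # "error network" here is simply squared error
attn = F.softmax(scores - penalty, dim=-1)      # penalize the pre-softmax scores
out = attn @ v                                  # attention output with layout-aware weights
```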

Super-Tokens by Graph Learning
The key to sparsifying attention mechanisms in ETC for long sequence modeling is to have every token only attend to tokens that are nearby in the serialized sequence. Although the RichAtt mechanism empowers the transformers by taking the spatial layout structures into account, poor serialization can still block significant attention weight calculation between related word tokens.

To further mitigate the issue, we construct a graph to connect nearby tokens in a form document. We design the edges of the graph based on strong inductive biases so that they have higher probabilities of belonging to the same entity type. For each token, we obtain its Super-Token embedding by applying graph convolutions along these edges to aggregate semantically relevant information from neighboring tokens. We then use these Super-Tokens as an input to the RichAtt ETC architecture. This means that even though an entity may get broken up into multiple segments due to poor serialization, the Super-Tokens learned by the GCN will have retained much of the context of the entity phrase.

An illustration of the word-level graph, with blue edges between tokens, of a FUNSD document.
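Here is a brief, illustrative sketch of the Super-Token construction under simple assumptions: tokens whose box centers fall within a fixed pixel radius are connected, and one mean-aggregating graph-convolution step mixes each token’s embedding with its neighbors’. The edge rule, layer, and dimensions are hypothetical, not FormNet’s exact GCN.

```python
import torch

def build_edges(coords, max_dist=50.0):
    """Connect tokens whose 2D box centers are within max_dist pixels (incl. self-edges)."""
    dist = torch.cdist(coords, coords)               # (N, N) pairwise distances
    return (dist < max_dist).float()                 # adjacency matrix

def super_tokens(embeddings, adjacency, weight):
    """One graph-convolution step: mean-aggregate neighbors, then project."""
    norm = adjacency / adjacency.sum(dim=1, keepdim=True)
    return torch.relu(norm @ embeddings @ weight)

tokens = torch.randn(20, 128)                        # 20 word-token embeddings
coords = torch.rand(20, 2) * 500                     # their (x, y) positions on the page
proj = torch.randn(128, 128, requires_grad=True)     # learned projection

adj = build_edges(coords)
st = super_tokens(tokens, adj, proj)                 # Super-Token inputs for the ETC layers
```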

Key Results
The Figure below shows model size vs. F1 score (the harmonic mean of the precision and recall) for recent approaches on the CORD benchmark. FormNet-A2 outperforms the most recent DocFormer while using a model that is 2.5x smaller. FormNet-A3 achieves state-of-the-art performance with a 97.28% F1 score. For more experimental results, please refer to the paper.

Model Size vs. Entity Extraction F1 Score on CORD benchmark. FormNet significantly outperforms other recent approaches in absolute F1 performance and parameter efficiency.

We study the importance of RichAtt and Super-Token by GCN on the large-scale masked language modeling (MLM) pre-training task across three FormNets. Both RichAtt and GCN components improve upon the ETC baseline on reconstructing the masked tokens by a large margin, showing the effectiveness of their structural encoding capability on form documents. The best performance is obtained when incorporating both RichAtt and GCN.

Performance of the Masked-Language Modeling (MLM) pre-training. Both the proposed RichAtt and Super-Token by GCN components improve upon ETC baseline by a large margin, showing the effectiveness of their structural encoding capability on large-scale form documents.

Using BertViz, we visualize the local-to-local attention scores for specific examples from the CORD dataset for the standard ETC and FormNet models. Qualitatively, we confirm that the tokens attend primarily to other tokens within the same visual block for FormNet. Moreover, for that model, specific attention heads attend to tokens aligned horizontally, which is a strong signal of meaning for form documents. No clear attention pattern emerges for the ETC model, suggesting that the RichAtt and Super-Token by GCN components enable the model to learn the structural cues and leverage layout information effectively.

The attention scores for ETC and FormNet (ETC+RichAtt+GCN) models. Unlike the ETC model, the FormNet model makes tokens attend to other tokens within the same visual blocks, along with tokens aligned horizontally, thus strongly leveraging structural cues.

Conclusion
We present FormNet, a novel model architecture for form-based document understanding. We determine that the novel RichAtt mechanism and Super-Token components help the ETC transformer excel at form understanding in spite of sub-optimal, noisy serialization. We demonstrate that FormNet recovers local syntactic information that may have been lost during text serialization and achieves state-of-the-art performance on three benchmarks.

Acknowledgements
This research was conducted by Chen-Yu Lee, Chun-Liang Li, Timothy Dozat, Vincent Perot, Guolong Su, Nan Hua, Joshua Ainslie, Renshen Wang, Yasuhisa Fujii, and Tomas Pfister. Thanks to Evan Huang, Shengyang Dai, and Salem Elie Haykal for their valuable feedback, and Tom Small for creating the animation in this post.


Low-Resource Adaptation of Open-Domain Generative Chatbots

Recent work building open-domain chatbots has demonstrated that increasing model size improves performance. On the other hand, latency and connectivity considerations dictate moving digital assistants onto the device. Giving a digital assistant like Siri, Alexa, or Google Assistant the ability to discuss just about anything leads to the need to reduce the chatbot model size such that it fits on the user’s device. We demonstrate that low-parameter models can simultaneously retain their general-knowledge conversational abilities while improving in a specific domain. Additionally, we… (Apple Machine Learning Research)

End-to-End Speech Translation for Code Switched Speech

Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages. CS can pose significant accuracy challenges to NLP, due to the often monolingual nature of the underlying systems. In this work, we focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation. To evaluate model performance on this task, we create a novel ST corpus derived from existing public data sets. We explore various ST architectures across two dimensions: cascaded (transcribe… (Apple Machine Learning Research)