Q&A with Stefano Leonardi, professor and principal investigator at Sapienza University of Rome
Measuring Goodhart’s law
Goodhart’s law famously says: “When a measure becomes a target, it ceases to be a good measure.” Although originally from economics, it’s something we have to grapple with at OpenAI when figuring out how to optimize objectives that are difficult or costly to measure. (OpenAI Blog)
Lidar-Camera Deep Fusion for Multi-Modal 3D Detection
LiDAR and visual cameras are two types of complementary sensors used for 3D object detection in autonomous vehicles and robots. LiDAR, which is a remote sensing technique that uses light in the form of a pulsed laser to measure ranges, provides low-resolution shape and depth information, while cameras provide high-resolution shape and texture information. While the features captured by LiDAR and cameras should be merged together to provide optimal 3D object detection, it turns out that most state-of-the-art 3D object detectors use LiDAR as the only input. The main reason is that to develop robust 3D object detection models, most methods need to augment and transform the data from both modalities, making the accurate alignment of the features challenging.
Existing algorithms for fusing LiDAR and camera outputs, such as PointPainting, PointAugmenting, EPNet, 4D-Net and ContinuousFusion, generally follow two approaches — input-level fusion where the features are fused at an early stage, decorating points in the LiDAR point cloud with the corresponding camera features, or mid-level fusion where features are extracted from both sensors and then combined. Despite realizing the importance of effective alignment, these methods struggle to efficiently process the common scenario where features are enhanced and aggregated before fusion. This indicates that effectively fusing the signals from both sensors might not be straightforward and remains challenging.
In our CVPR 2022 paper, “DeepFusion: LiDAR-Camera Deep Fusion for Multi-Modal 3D Object Detection”, we introduce a fully end-to-end multi-modal 3D detection framework called DeepFusion that applies a simple yet effective deep-level feature fusion strategy to unify the signals from the two sensing modalities. Unlike conventional approaches that decorate raw LiDAR point clouds with manually selected camera features, our method fuses the deep camera and deep LiDAR features in an end-to-end framework. We begin by describing two novel techniques, InverseAug and LearnableAlign, that improve the quality of feature alignment and are applied to the development of DeepFusion. We then demonstrate state-of-the-art performance by DeepFusion on the Waymo Open Dataset, one of the largest datasets for automotive 3D object detection.
InverseAug: Accurate Alignment under Geometric Augmentation
To achieve good performance on existing 3D object detection benchmarks for autonomous cars, most methods require strong data augmentation during training to avoid overfitting. However, the necessity of data augmentation poses a non-trivial challenge in the DeepFusion pipeline. Specifically, the data from the two modalities use different augmentation strategies, e.g., rotating along the z-axis for 3D point clouds combined with random flipping for 2D camera images, often resulting in inaccurate alignment. The augmented LiDAR data then has to go through a voxelization step that converts the point clouds into volume data stored in a three-dimensional array of voxels. The voxelized features are quite different from the raw data, making the alignment even more difficult. To address the alignment issue caused by geometry-related data augmentation, we introduce Inverse Augmentation (InverseAug), a technique that reverses the augmentation before fusion during the model’s training phase.
In the example below, we demonstrate the difficulties in aligning the augmented LiDAR data with the camera data. In this case, the LiDAR point cloud is augmented by rotation with the result that a given 3D key point, which could be any 3D coordinate, such as a LiDAR data point, cannot be easily aligned in 2D space simply through use of the original LiDAR and camera parameters. To make the localization feasible, InverseAug first stores the augmentation parameters before applying the geometry-related data augmentation. At the fusion stage, it reverses all data augmentation to get the original coordinate for the 3D key point, and then finds its corresponding 2D coordinates in the camera space.
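To make this concrete, here is a minimal sketch of the recover-then-project step, assuming the only geometric augmentation is a rotation about the z-axis and a simple pinhole camera model; the function names, parameter layout, and projection details are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def rotate_z(points, angle_rad):
    """Rotate an (N, 3) array of points about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T

def inverse_aug_project(aug_keypoints, stored_angle_rad, extrinsics, intrinsics):
    """Reverse the stored augmentation, then project the recovered 3D key points
    into the camera image to obtain their 2D coordinates.

    aug_keypoints:    (N, 3) augmented 3D key points (e.g., LiDAR points or voxel centers)
    stored_angle_rad: rotation angle saved before augmentation was applied
    extrinsics:       (3, 4) LiDAR-to-camera transform (assumed layout)
    intrinsics:       (3, 3) camera intrinsic matrix (assumed layout)
    """
    # Undo the augmentation using the stored parameters to recover original coordinates.
    original = rotate_z(aug_keypoints, -stored_angle_rad)

    # Project into the camera: homogeneous coordinates -> camera frame -> pixels.
    homo = np.hstack([original, np.ones((len(original), 1))])
    cam = (extrinsics @ homo.T).T
    pix = (intrinsics @ cam.T).T
    return pix[:, :2] / pix[:, 2:3]   # (u, v) pixel coordinates for each key point
```

The same pattern extends to other geometric augmentations (e.g., flipping or scaling): store each operation’s parameters and apply their inverses in reverse order before projecting.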
During training, InverseAug resolves the inaccurate alignment from geometric augmentation.
Left: Alignment without InverseAug. Right: Alignment quality improvement with InverseAug.
LearnableAlign: A Cross-Modality-Attention Module to Learn Alignment
We also introduce Learnable Alignment (LearnableAlign), a cross-modality-attention–based feature-level alignment technique, to improve the alignment quality. For input-level fusion methods, such as PointPainting and PointAugmenting, given a 3D LiDAR point, only the corresponding camera pixel can be exactly located as there is a one-to-one mapping. In contrast, when fusing deep features in the DeepFusion pipeline, each LiDAR feature represents a voxel containing a subset of points, and hence, its corresponding camera pixels are in a polygon. So the alignment becomes the problem of learning the mapping between a voxel cell and a set of pixels.
A naïve approach is to average over all pixels corresponding to the given voxel. However, intuitively, and as supported by our visualized results, these pixels are not equally important because the information from the LiDAR deep feature unequally aligns with every camera pixel. For example, some pixels may contain critical information for detection (e.g., the target object), while others may be less informative (e.g., consisting of backgrounds such as roads, plants, occluders, etc.).
LearnableAlign leverages a cross-modality attention mechanism to dynamically capture the correlations between two modalities. Here, the input contains the LiDAR features in a voxel cell, and all its corresponding camera features. The output of the attention is essentially a weighted sum of the camera features, where the weights are collectively determined by a function of the LiDAR and camera features. More specifically, LearnableAlign uses three fully-connected layers to respectively transform the LiDAR features to a vector (ql), and camera features to vectors (kc) and (vc). For each vector (ql), we compute the dot products between (ql) and (kc) to obtain the attention affinity matrix that contains correlations between the LiDAR features and the corresponding camera features. Normalized by a softmax operator, the attention affinity matrix is then used to calculate weights and aggregate the vectors (vc) that contain camera information. The aggregated camera information is then processed by a fully-connected layer, and concatenated (Concat) with the original LiDAR feature. The output is then fed into any standard 3D detection framework, such as PointPillars or CenterPoint for model training.
LearnableAlign leverages the cross-attention mechanism to align LiDAR and camera features.
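Below is a sketch of that attention computation for a single voxel, written in plain NumPy with single-head attention, bias-free fully-connected layers represented by weight matrices, and a standard scaled dot product; it illustrates the mechanism rather than reproducing the authors’ code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def learnable_align(lidar_feat, cam_feats, W_q, W_k, W_v, W_out):
    """Fuse one voxel's deep LiDAR feature with its corresponding camera features.

    lidar_feat: (d_l,)   LiDAR feature of the voxel cell
    cam_feats:  (n, d_c) camera features of the n pixels covered by the voxel
    W_q, W_k, W_v, W_out: weights of the fully-connected layers (biases omitted)
    """
    q_l = lidar_feat @ W_q            # query from the LiDAR feature
    k_c = cam_feats @ W_k             # keys from the camera features
    v_c = cam_feats @ W_v             # values from the camera features

    # Attention affinity between the voxel and each camera pixel, softmax-normalized.
    weights = softmax(k_c @ q_l / np.sqrt(q_l.shape[0]))

    # Weighted sum of camera information, one more fully-connected layer,
    # then concatenation with the original LiDAR feature.
    aggregated = (weights @ v_c) @ W_out
    return np.concatenate([lidar_feat, aggregated])

# Example: a 64-dim LiDAR voxel feature fused with the features of 10 camera pixels.
rng = np.random.default_rng(0)
d_l, d_c, d = 64, 32, 64
fused = learnable_align(rng.normal(size=d_l), rng.normal(size=(10, d_c)),
                        rng.normal(size=(d_l, d)), rng.normal(size=(d_c, d)),
                        rng.normal(size=(d_c, d)), rng.normal(size=(d, d)))
```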
DeepFusion: A Better Way to Fuse Information from Different Modalities
Powered by our two novel feature alignment techniques, we develop DeepFusion, a fully end-to-end multi-modal 3D detection framework. In the DeepFusion pipeline, the LiDAR points are first fed into an existing feature extractor (e.g., pillar feature net from PointPillars) to obtain LiDAR features (e.g., pseudo-images). In the meantime, the camera images are fed into a 2D image feature extractor (e.g., ResNet) to obtain camera features. Then, InverseAug and LearnableAlign are applied in order to fuse the camera and LiDAR features together. Finally, the fused features are processed by the remaining components of the selected 3D detection model (e.g., the backbone and detection head from PointPillars) to obtain the detection results.
The pipeline of DeepFusion.
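As a rough structural sketch (not the real model), the flow can be wired up as below; the stubs stand in for the pillar feature net, the image backbone, the InverseAug + LearnableAlign fusion step, and the detection backbone and head, and only the way the pieces connect reflects the actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-ins for the real components; shapes are arbitrary.
def lidar_feature_net(points):                 # e.g., PointPillars pillar feature net
    return rng.normal(size=(len(points) // 10, 64))

def image_feature_net(image):                  # e.g., a ResNet backbone
    return rng.normal(size=(256, 64))

def align_and_fuse(lidar_feats, cam_feats):    # InverseAug + LearnableAlign in the real model
    pooled = cam_feats.mean(axis=0, keepdims=True)   # crude mean pooling as a stand-in for attention
    return np.hstack([lidar_feats, np.repeat(pooled, len(lidar_feats), axis=0)])

def detection_head(fused_feats):               # e.g., PointPillars backbone + detection head
    return fused_feats @ rng.normal(size=(fused_feats.shape[1], 7))   # 7 box parameters per voxel

# End-to-end flow: per-modality deep features -> alignment and fusion -> 3D detection.
lidar_points = rng.normal(size=(1000, 3))
camera_image = rng.normal(size=(480, 640, 3))
boxes = detection_head(align_and_fuse(lidar_feature_net(lidar_points),
                                      image_feature_net(camera_image)))
```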
Benchmark Results
We evaluate DeepFusion on the Waymo Open Dataset, one of the largest 3D detection challenges for autonomous cars, using the Average Precision with Heading (APH) metric under difficulty level 2, the default metric used to rank a model’s performance on the leaderboard. Among the 70 participating teams from all over the world, the DeepFusion single and ensemble models achieve state-of-the-art performance in their corresponding categories.
The single DeepFusion model achieves new state-of-the-art performance on the Waymo Open Dataset.
The ensemble DeepFusion model outperforms all other methods on the Waymo Open Dataset, ranking No. 1 on the leaderboard.
The Impact of InverseAug and LearnableAlign
We also conduct ablation studies on the effectiveness of the proposed InverseAug and LearnableAlign techniques. We demonstrate that both InverseAug and LearnableAlign individually contribute to a performance gain over the LiDAR-only model, and combining both can further yield an even more significant boost.
Ablation studies on InverseAug (IA) and LearnableAlign (LA) measured in average precision (AP) and APH. Combining both techniques contributes to the best performance gain.
Conclusion
We demonstrate that late-stage deep feature fusion can be more effective when features are aligned well, but aligning features from two different modalities can be challenging. To address this challenge, we propose two techniques, InverseAug and LearnableAlign, to improve the quality of alignment among multimodal features. By integrating these techniques into the fusion stage of our proposed DeepFusion method, we achieve state-of-the-art performance on the Waymo Open Dataset.
Acknowledgements:
Special thanks to co-authors Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Bo Wu, Yifeng Lu, Denny Zhou, Quoc Le, Alan Yuille, Mingxing Tan.
New method for “editing” fabricated chips enables more-efficient designs
Reducing the energy of ion beams used for editing eliminates the need for “sacrificial” areas between electrical components and improves precision.
MIT’s FutureMakers programs help kids get their minds around — and hands on — AI
As she was looking for a camp last summer, Yabesra Ewnetu, who’d just finished eighth grade, found a reference to MIT’s FutureMakers Create-a-thon. Ewnetu had heard that it’s hard to detect bias in artificial intelligence because AI algorithms are so complex, but this didn’t make sense to her. “I was like, well, we’re the ones coding it, shouldn’t we be able to see what it’s doing and explain why?” She signed up for the six-week virtual FutureMakers program so she could delve into AI herself.
FutureMakers is part of the MIT-wide Responsible AI for Social Empowerment and Education (RAISE) initiative launched earlier this year. RAISE is headquartered in the MIT Media Lab and run in collaboration with MIT Schwarzman College of Computing and MIT Open Learning.
MIT piloted FutureMakers last year with students from all over the United States, offering the program in two formats.
During the one-week, themed FutureMakers Workshops, which are organized around key topics related to AI, students learn how AI technologies work, including their social implications, and then build something that uses AI.
And during six-week summer Create-a-Thons, middle school and high school students do a deep dive into AI and coding for four weeks, then take two weeks to design an app for social good. The Create-a-Thon culminates in a competition where teams present their ideas and prototypes to an expert panel of judges.
“We want to remove as many barriers as we possibly can to support diverse students and teachers,” says Cynthia Breazeal, a professor of media arts and sciences at MIT who founded the Media Lab’s Personal Robots Group and also heads up the RAISE initiative. All RAISE programs are free for educators and students. The courses are designed to meet students and teachers where they are in terms of resources, comfort with technology, and interests.
But it’s not all about learning to code.
“AI is shaping our behaviors, it’s shaping the way we think, it’s shaping the way we learn, and a lot of people aren’t even aware of that,” says Breazeal. “People now need to be AI literate given how AI is rapidly changing digital literacy and digital citizenship.”
The one-week FutureMakers Workshops are offered year-round. MIT trains teachers or people who work at STEM educational organizations so they can bring the tools and the project-based, hands-on curriculum and activities to their students. One year in, MIT has trained 60 teachers who have given workshops to more than 300 students, many from underserved and underrepresented communities across the United States. Teachers and mentors choose from among four workshop themes for their training: Conversational AI, Dancing With AI, Creativity and AI, and How to Train Your Robot.
MIT worked with Lili’uokalani Trust in Hawaii to teach the How to Train Your Robot workshop during a spring break program on the remote islands of Moloka’i and Lana’i.
When the trust visited the MIT Media Lab on an East Coast study tour, “we were immediately inspired by the vast array of AI and STEM programs and decided to pilot How to Train Your Robot,” says Lili’uokalani Trust program manager Kau’ilani Arce.
The workshop introduced students to AI, image classification, and algorithmic bias, and taught them to program robots using a custom block-based coding environment built using the Scratch programming language, which was developed at the Media Lab.
On the first day, “we learned about algorithmic bias and how it can lead to deeply rooted issues, such as social and racial injustices,” Arce says. “It was a wonderful opportunity to critically think about how Native Hawaiians are equally represented in algorithms we use daily.”
The best moment for sixth-grader Yesmine Kiroloss: “When I got to program my robot!”
For students without previous AI experience, it took grit to understand the connection between the coding environment and a functioning robot, says Arce. “There was an overwhelming sense of accomplishment.”
MIT collaborated with SureStart, a startup aimed at mentoring high school and college students in AI, for the first six-week FutureMakers Create-a-thon last summer.
The Create-a-Thon had two tracks: an MIT App Inventor track with 30 students, including Ewnetu, and a Deep Learning track with 45. The 75 students hailed from more than 20 states, and just over two-thirds were female.
For the first four weeks students worked in groups of eight with two graduate student mentors who summarized each day’s lessons and held office hours so the students could ask questions.
In the final two weeks, the students applied what they’d learned to create something with societal or environmental impact.
A key step was plotting out a minimum viable product: a web or mobile app that contained the minimum elements needed to illustrate their idea.
At the end of the six weeks, 15 teams of four students and a mentor showed off their ideas in an entrepreneurial-style pitch competition judged by experts from academia and industry.
Ewnetu’s team, Team Dyadic, built a prediction model to warn people about wildfires. The idea was inspired by a team member from California. The team bootstrapped a website, collected a dataset, trained a machine learning model, and added an interactive map. “Our code is a prediction of how close the current conditions are to a fire condition,” says Ewnetu, who’s now a first-year student at Justice High School in Falls Church, Virginia.
The team members had a mix of experience. “There were people in the class who had a lot of [coding] experience and there were people in the class like me who had very little to no experience,” says Ewnetu. She needed a lot of help from the mentors in the first couple of weeks, but then everything clicked, she says. “It went from like an error every other line to an error maybe every other section.”
Ewnetu “is the perfect embodiment of what happens when you just provide people with support,” says SureStart founder Taniya Mishra. “[Having] high expectations is a good thing, especially if you can provide a lot of scaffolding.”
Team Dyadic made the finals. “To see all of our work culminate and then pay off just made us feel like winners,” Ewnetu says.
Meanwhile, Team Youth of Tech created the Vividly app, which allows parents to input questions for their child. When the child logs in, the app asks if they’re happy, sad, frustrated, or angry. A bot named Viviana then asks the parents’ questions, and the child communicates with the bot, knowing the parents can see the conversation.
The idea is to give kids a way to be open with their parents in a very comfortable environment, says Bella Baidak, a first-year master’s of information student at Cornell Tech in New York City, and the team’s mentor.
“It’s a form of facilitating better communication so they can talk more,” says team member Netra Rameshbabu, now a first-year at Metea Valley High School in Aurora, Illinois.
“Our idea was to make this a routine, like brushing your teeth.”
Team Youth of Tech made the finals, then won. When the winner was announced, “I was screaming I was so excited! I was in tears. I was in joy,” says Rameshbabu.
The mentees did “a brilliant job,” says Kunjal Panchal, head mentor and PhD student at the University of Massachusetts at Amherst. “They know how to use AI and they know how to use it for the common good.”
This year’s six-week FutureMakers program starts July 6. Middle school, high school, and undergraduate students can apply here.
Teachers can reach out to RAISE to learn about the one-week training workshops.
Students and teachers can also get started with AI this May 13 during the Day of AI. Students and their teachers from all over the country can learn about AI literacy through a modular, hands-on curriculum that supports up to four hours of learning per grade track. The Day of AI materials can be taught by teachers with a wide range of technology backgrounds and are designed to be accessible to all students.
This year’s Day of AI, on May 13, includes teaching materials for upper elementary through high school. The program will eventually span all of K-12. Teachers can register here for a two-hour Day of AI teacher training program. Teaching materials are available under a Creative Commons license.
For this year’s Day of AI, students in grades 3 to 5 will learn about datasets, algorithms, predictions and bias, and will create an AI application that can tell the difference, say, between an image of a dog and that of a cat. Middle school students will learn about generative adversarial networks, which can produce both deepfakes and art. High school students will learn about the recommendation systems used by social media and their implications for individuals and for society. Students and their teachers can register here to participate.
An empirical analysis of compute-optimal large language model training
We ask the question: “What is the optimal model size and number of training tokens for a given compute budget?” To answer this question, we train models of various sizes and with various numbers of tokens, and estimate this trade-off empirically. Our main finding is that the current large language models are far too large for their compute budget and are not being trained on enough data.
Luxembourg
Scientists in Luxembourg solve problems for our global customers and collaborate with teams worldwide. Much of the work in Luxembourg is focused on surfacing the right products to Amazon retail customers and delivering them as efficiently as possible.
London and Cambridge
Scientists in London and nearby Cambridge work on Amazon experiences including Amazon retail, Prime Video, Alexa, Ring, Halo, Echo Show, and others.