Unsupervised and semi-supervised anomaly detection with data-centric ML

Anomaly detection (AD), the task of distinguishing anomalies from normal data, plays a vital role in many real-world applications, such as detecting faulty products from vision sensors in manufacturing, fraudulent behaviors in financial transactions, or network security threats. Depending on the type of data that is available — negative (normal) vs. positive (anomalous) — and on the availability of labels, the task of AD involves different challenges.

(a) Fully supervised anomaly detection, (b) normal-only anomaly detection, (c, d, e) semi-supervised anomaly detection, (f) unsupervised anomaly detection.

While most previous works were shown to be effective for cases with fully-labeled data (either (a) or (b) in the above figure), such settings are less common in practice because labels are particularly tedious to obtain. In most scenarios users have a limited labeling budget, and sometimes there aren’t even any labeled samples during training. Furthermore, even when labeled data are available, there could be biases in the way samples are labeled, causing distribution differences. Such real-world data challenges limit the achievable accuracy of prior methods in detecting anomalies.

This post covers two of our recent papers on AD, published in Transactions on Machine Learning Research (TMLR), that address the above challenges in unsupervised and semi-supervised settings. Using data-centric approaches, we show state-of-the-art results in both. In “Self-supervised, Refine, Repeat: Improving Unsupervised Anomaly Detection”, we propose a novel unsupervised AD framework that relies on the principles of self-supervised learning without labels and iterative data refinement based on the agreement of one-class classifier (OCC) outputs. In “SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch”, we propose a novel semi-supervised AD framework that yields robust performance even under distribution mismatch with limited labeled samples.

Unsupervised anomaly detection with SRR: Self-supervised, Refine, Repeat

Discovering a decision boundary for a one-class (normal) distribution (i.e., OCC training) is challenging in fully unsupervised settings because the unlabeled training data include two classes (normal and abnormal). The challenge is further exacerbated as the anomaly ratio of the unlabeled data increases. To construct a robust OCC with unlabeled data, it is critical to exclude likely-positive (anomalous) samples from the unlabeled data, a process referred to as data refinement. The refined data, with a lower anomaly ratio, are shown to yield superior anomaly detection models.

SRR first refines data from an unlabeled dataset, then iteratively trains deep representations using refined data while improving the refinement of unlabeled data by excluding likely-positive samples. For data refinement, an ensemble of OCCs is employed, each of which is trained on a disjoint subset of unlabeled training data. If there is consensus among all the OCCs in the ensemble, the data that are predicted to be negative (normal) are included in the refined data. Finally, the refined training data are used to train the final OCC to generate the anomaly predictions.

Training SRR with a data refinement module (OCCs ensemble), representation learner, and final OCC. (Green/red dots represent normal/abnormal samples, respectively).
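
To make the refinement step concrete, the following is a minimal sketch of the consensus-based refinement loop described above, using scikit-learn's IsolationForest as a stand-in OCC (the actual framework pairs refinement with self-supervised deep representations). The ensemble size, iteration count, and placeholder data are illustrative choices, not the paper's settings.

```python
# Minimal sketch of SRR-style data refinement with an OCC ensemble.
import numpy as np
from sklearn.ensemble import IsolationForest

def refine(X_unlabeled, n_members=5, n_iterations=3, seed=0):
    """Iteratively exclude likely-anomalous samples by OCC-ensemble consensus."""
    rng = np.random.default_rng(seed)
    refined = X_unlabeled
    for _ in range(n_iterations):
        # Train each OCC on a disjoint subset of the (current) refined data.
        idx = rng.permutation(len(refined))
        subsets = np.array_split(idx, n_members)
        occs = [IsolationForest(random_state=seed).fit(refined[s]) for s in subsets]

        # Keep only samples that *every* OCC predicts as normal (+1).
        votes = np.stack([occ.predict(refined) for occ in occs])  # (members, n)
        consensus_normal = (votes == 1).all(axis=0)
        refined = refined[consensus_normal]
    return refined

# Final OCC trained on the refined (lower anomaly ratio) data.
X_train = np.random.randn(1000, 16)              # placeholder unlabeled data
final_occ = IsolationForest(random_state=0).fit(refine(X_train))
scores = -final_occ.score_samples(X_train)       # higher score = more anomalous
```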

SRR results

We conduct extensive experiments across various datasets from different domains, including semantic AD (CIFAR-10, Dog-vs-Cat), real-world manufacturing visual AD (MVTec), and real-world tabular AD benchmarks such as detecting medical (Thyroid) or network security (KDD 1999) anomalies. We consider methods with both shallow (e.g., OC-SVM) and deep (e.g., GOAD, CutPaste) models. Since the anomaly ratio of real-world data can vary, we evaluate models at different anomaly ratios of unlabeled training data and show that SRR significantly boosts AD performance. For example, SRR improves average precision (AP) by more than 15.0 points at a 10% anomaly ratio compared to a state-of-the-art one-class deep model on CIFAR-10. Similarly, on MVTec, SRR retains solid performance, dropping less than 1.0 AUC at a 10% anomaly ratio, while the best existing OCC drops more than 6.0 AUC. Lastly, on Thyroid (tabular data), SRR outperforms a state-of-the-art one-class classifier by 22.9 F1 points at a 2.5% anomaly ratio.

Across various domains, SRR (blue line) significantly boosts AD performance with various anomaly ratios in fully unsupervised settings.

SPADE: Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling

Most semi-supervised learning methods (e.g., FixMatch, VIME) assume that the labeled and unlabeled data come from the same distribution. However, in practice, distribution mismatch commonly occurs, with labeled and unlabeled data coming from different distributions. One such case is positive and unlabeled (PU) or negative and unlabeled (NU) settings, where the distributions between labeled (either positive or negative) and unlabeled (both positive and negative) samples are different. Another cause of distribution shift is additional unlabeled data being gathered after labeling. For example, manufacturing processes may keep evolving, causing the corresponding defects to change and the defect types at labeling to differ from the defect types in unlabeled data. In addition, for applications like financial fraud detection and anti-money laundering, new anomalies can appear after the data labeling process, as criminal behavior may adapt. Lastly, labelers are more confident labeling easy samples; thus, easy samples are more likely to end up in the labeled data while difficult ones remain unlabeled. For example, with some crowd-sourcing–based labeling, only the samples with some consensus on the labels (as a measure of confidence) are included in the labeled set.

Three common real-world scenarios with distribution mismatches (blue box: normal samples, red box: known/easy anomaly samples, yellow box: new/difficult anomaly samples).

Standard semi-supervised learning methods assume that labeled and unlabeled data come from the same distribution, and so are sub-optimal for semi-supervised AD under distribution mismatch. SPADE utilizes an ensemble of OCCs to estimate the pseudo-labels of the unlabeled data; it does this independently of the given positive labeled data, thus reducing the dependency on the labels. This is especially beneficial when there is a distribution mismatch. In addition, SPADE employs partial matching to automatically select the critical hyperparameters for pseudo-labeling without relying on labeled validation data, a crucial capability given limited labeled data.

Block diagram of SPADE with a zoomed-in view of the detailed block diagram of the proposed pseudo-labelers.
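
The following is a rough sketch of the pseudo-labeling idea: an ensemble of OCCs scores the unlabeled data, only samples with unanimous ensemble agreement receive pseudo-labels, and those are combined with the small labeled set to train a supervised detector. The stand-in models (IsolationForest, RandomForestClassifier), the unanimity threshold, and the synthetic data are illustrative assumptions; the paper selects the key thresholds with partial matching rather than fixing them by hand.

```python
# Sketch of SPADE-style pseudo-labeling with an OCC ensemble.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

def pseudo_label(X_unlabeled, n_members=5, agreement=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_unlabeled))
    subsets = np.array_split(idx, n_members)
    occs = [IsolationForest(random_state=seed + i).fit(X_unlabeled[s])
            for i, s in enumerate(subsets)]
    votes = np.stack([occ.predict(X_unlabeled) == -1 for occ in occs])  # True = anomalous
    frac_anomalous = votes.mean(axis=0)
    y_pseudo = np.full(len(X_unlabeled), -1)            # -1 = no confident pseudo-label
    y_pseudo[frac_anomalous >= agreement] = 1            # all members agree: anomaly
    y_pseudo[frac_anomalous <= 1 - agreement] = 0        # all members agree: normal
    return y_pseudo

# Combine confident pseudo-labels with the (possibly mismatched) labeled data.
X_unlab = np.random.randn(2000, 16)                      # placeholder unlabeled data
X_lab, y_lab = np.random.randn(100, 16), np.random.randint(0, 2, 100)
y_pseudo = pseudo_label(X_unlab)
mask = y_pseudo != -1
X_train = np.concatenate([X_lab, X_unlab[mask]])
y_train = np.concatenate([y_lab, y_pseudo[mask]])
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
```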

SPADE results

We conduct extensive experiments to showcase the benefits of SPADE in various real-world settings of semi-supervised learning with distribution mismatch. We consider multiple AD datasets for image (including MVTec) and tabular (including Covertype, Thyroid) data.

SPADE shows state-of-the-art semi-supervised anomaly detection performance across a wide range of scenarios: (i) new types of anomalies, (ii) easy-to-label samples, and (iii) positive and unlabeled (PU) examples. As shown below, with new types of anomalies, SPADE outperforms the state-of-the-art alternatives by 5% AUC on average.

AD performances with three different scenarios across various datasets (Covertype, MVTec, Thyroid) in terms of AUC. Some baselines are only applicable to some scenarios. More results with other baselines and datasets can be found in the paper.

We also evaluate SPADE on real-world financial fraud detection datasets: Kaggle credit card fraud and Xente fraud detection. In these datasets, anomalies evolve (i.e., their distributions change over time), so identifying them requires continually labeling new anomalies and retraining the AD model, which is costly and time consuming. Even without additional labeling, SPADE can improve AD performance using both the labeled data and newly gathered unlabeled data.

AD performances with time-varying distributions using two real-world fraud detection datasets with 10% labeling ratio. More baselines can be found in the paper.

As shown above, SPADE consistently outperforms alternatives on both datasets, taking advantage of the unlabeled data and showing robustness to evolving distributions.

Conclusions

AD has a wide range of use cases with significant importance in real-world applications, from detecting security threats in financial systems to identifying faulty behaviors of manufacturing machines.

One challenging and costly aspect of building an AD system is that anomalies are rare and not easily detectable by people. To this end, we have proposed SRR, a canonical AD framework that enables high-performance AD without the need for manual labels for training. SRR can be flexibly integrated with any OCC, and applied on raw data or on trainable representations.

Semi-supervised AD is another highly important challenge — in many scenarios, the distributions of labeled and unlabeled samples don’t match. SPADE introduces a robust pseudo-labeling mechanism using an ensemble of OCCs and a judicious way of combining supervised and self-supervised learning. In addition, SPADE introduces an efficient approach to pick critical hyperparameters without a validation set, a crucial component for data-efficient AD.

Overall, we demonstrate that SRR and SPADE consistently outperform the alternatives in various scenarios across multiple types of datasets.

Acknowledgements

We gratefully acknowledge the contributions of Kihyuk Sohn, Chun-Liang Li, Chen-Yu Lee, Kyle Ziegler, Nate Yoder, and Tomas Pfister.

Research Focus: Week of February 6, 2023

Microsoft Research Focus 09 edition, week of February 6, 2023

Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Behind the Tech podcast with Tobi Lütke: CEO and Founder, Shopify

In the latest episode of Behind the Tech, Microsoft CTO Kevin Scott is joined by Tobi Lütke, CEO and founder of the Canadian multinational e-commerce platform Shopify. Since his early days running an online snowboard shop from his carport, Tobi has envisioned himself as a craftsman first and a business exec second, a mindset he has used to solve a wide variety of problems. He and Kevin discuss applying computer science and engineering techniques to build and scale a company, the idea of bringing an ‘apprentice mindset’ to his work, and how Tobi’s daily practice of writing code and tinkering in his home lab inspires him to be a more creative leader.

Tune in now to enjoy the discussion.


Distribution inference risks: Identifying and mitigating sources of leakage

Distribution inference (or property inference) attacks allow an adversary to infer distributional information about the training data of a machine learning model, which can cause significant problems. For example, leaking the distribution of sensitive attributes such as gender or race can create a serious privacy concern. This kind of attack has been shown to be feasible on different types of models and datasets. However, little attention has been given to identifying the potential causes of such leakage and to proposing mitigations.

A new paper, Distribution Inference Risks: Identifying and Mitigating Sources of Leakage, focuses on theoretically and empirically analyzing the sources of information leakage that allow an adversary to perpetrate distribution inference attacks. The researchers identified three sources of leakage: (1) memorizing specific information about the value of interest to the adversary; (2) wrong inductive bias of the model; and (3) finiteness of the training data. Next, based on their analysis, the researchers propose principled mitigation techniques against distribution inference attacks. Specifically, they demonstrate that causal learning techniques are more resilient to a particular type of distribution inference risk — distributional membership inference — than associative learning methods. And lastly, they present a formalization of distribution inference that allows for reasoning about more general adversaries than was previously possible.


Siva Kakarla wins Applied Networking Research Prize

Microsoft’s Siva Kakarla has been awarded an Applied Networking Research Prize for 2023 in recognition of his work on checking the correctness of nameservers. A senior researcher in the Networking Research Group of Microsoft Research, Kakarla was one of six people to receive this annual award from the Internet Research Task Force.

In their paper, SCALE: Automatically Finding RFC Compliance Bugs in DNS Nameservers, Kakarla and his colleagues introduce the first approach for finding RFC (request for comments) compliance errors in DNS nameserver implementations through automatic test generation. Their approach, called Small-scope Constraint-driven Automated Logical Execution, or SCALE, generates high-coverage tests of RFC behaviors.

The Applied Networking Research Prize acknowledges advances in applied networking, interesting new research ideas of potential relevance to the internet standards community, and people who are likely to have an impact on internet standards and technologies.

New NVIDIA Studio Laptops Powered by GeForce RTX 4090, 4080 Laptop GPUs Unleash Creativity

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

The first NVIDIA Studio laptops powered by GeForce RTX 40 Series Laptop GPUs are now available, starting with systems from MSI and Razer — with many more to come.

Featuring GeForce RTX 4090 and 4080 Laptop GPUs, the new Studio laptops use the NVIDIA Ada Lovelace architecture and fifth-generation Max-Q technologies for maximum performance and efficiency. They’re fueled by powerful NVIDIA RTX technology like DLSS 3, which routinely increases frame rates by 2x or more.

Backed by the NVIDIA Studio platform, these laptops give creators exclusive access to tools and apps — including NVIDIA Omniverse, Canvas and Broadcast — and deliver breathtaking visuals with full ray tracing and time-saving AI features.

They come preinstalled with regularly updated NVIDIA Studio Drivers. This month’s driver is available for download starting today.

And when creating turns to gaming, the laptops enable playing at previously impossible levels of detail and speed.

Plus, In the NVIDIA Studio this week highlights the making of The Artists’ Metaverse, a video showcasing the journey of 3D collaboration between seven creators, across several time zones, using multiple creative apps simultaneously — all powered by NVIDIA Omniverse.

The Future of Content Creation, Anywhere

NVIDIA Studio laptops, powered by new GeForce RTX 40 Series Laptop GPUs, deliver the largest-ever generational leap in portable performance and are the world’s fastest laptops for creating and gaming.

These creative powerhouses run up to 3x more efficiently than the previous generation, enabling users to power through creative workloads in a fraction of the time, all using thin, light laptops — with 14-inch designs coming soon for the first time.

MSI’s Stealth 17 Studio comes with up to a GeForce RTX 4090 Laptop GPU.

MSI’s Stealth 17 Studio comes with up to a GeForce RTX 4090 Laptop GPU and an optional 17-inch, Mini LED 4K, 144Hz, 1000 Nits, DisplayHDR 1000 display — perfect for creators of all types. It’s available in various configurations at Amazon, Best Buy, B&H and Newegg.

New Razer Blade Studio Laptops come preinstalled with NVIDIA Broadcast.

Razer is upgrading its Blade laptops with up to a GeForce RTX 4090 Laptop GPU. Available with a 16- or 18-inch HDR-capable, dual-mode, mini-LED display, they feature a Creator mode that enables sharp, ultra-high-definition+ native resolution at 120Hz. They’re available at Razer, Amazon, Best Buy, B&H and Newegg.

The MSI Stealth 17 Studio and Razer Blade 16 and 18 come preinstalled with NVIDIA Broadcast. The app’s recent update to version 1.4 added an Eye Contact feature, ideal for content creators who want to record themselves while reading notes or avoid having to stare directly at the camera. The feature also lets video conference presenters appear as if they’re looking at their audience, improving engagement.

Designed for gamers, new units from ASUS, GIGABYTE and Lenovo are also available today and deliver great performance in creator applications with access to NVIDIA Studio benefits.

Groundbreaking Performance

The new Studio laptops have been put through rigorous testing, and many reviewers are detailing the new levels of performance and AI-powered creativity that GeForce RTX 4090 and 4080 Laptop GPUs make possible. Here’s what some are saying:

“NVIDIA’s GeForce RTX 4090 pushes laptops to blistering new frontiers: Yes, it’s fast, but also much more.” — PC World

“GeForce RTX 4090 Laptops can also find the favor of content creators thanks to NVIDIA Studio as well as AV1 support and the double NVENC encoder.” — HDBLOG.IT

“With its GeForce RTX 4090… and bright, beautiful dual-mode display, the Razer Blade 16 can rip through games with aplomb, while being equally adept at taxing content creation workloads.” — Hot Hardware

“The Nvidia GeForce RTX 4090 mobile GPU is a step up in performance, as we’d expect from the hottest graphics chip.” — PC Magazine

“Another important point – particularly in the laptop domain – is the presence of enhanced AV1 support and dual hardware encoders. That’s really useful for streamers or video editors using a machine like this.” – KitGuru

Pick up the latest Studio systems or configure a custom system today.

Revisiting ‘The Artists’ Metaverse’

Seven talented artists join us In the NVIDIA Studio this week to discuss building The Artists’ Metaverse — a spotlight demo from last month’s CES. The group reflected on how easy it was to collaborate in real time from different parts of the world.

It started in NVIDIA Omniverse, a hub that interconnects 3D workflows, replacing linear pipelines with live-sync creation. The artists connected to the platform via Omniverse Cloud.

“Setting up the Omniverse Cloud collaboration demo was a super easy process,” said award-winning 3D creator Rafi Nizam. “It was cool to see avatars appearing as people popped in, and the user interface makes it really clear when you’re working in a live state.”

Assets were exported into Omniverse with ease, thanks to the Universal Scene Description format.

Filmmaker Jae Solina, aka JSFILMZ, animated characters in Omniverse using Xsens and Unreal Engine.

“Prior to Omniverse, creating animations was such a hassle, let alone getting photorealistic animations,” Solina said. “Instead of having to reformat and upload files individually, everything is done in Omniverse in real time, leading to serious time saved.”

 

Jeremy Lightcap reflected on the incredible visual quality of the virtual scene, highlighting the seamless movement within the viewport.

The balloon 3D model was sculpted by hand in Gravity Sketch and imported into Omniverse.

“We had three Houdini simulations, a volume database file storm cloud, three different characters with motion capture and a very dense Western town set with about 100 materials,” Lightcap said. “I’m not sure how many other programs could handle that and still give you instant, path-traced lighting results.”

 

For Ashley Goldstein, an NVIDIA 3D artist and tutorialist, the demo highlighted the versatility of Omniverse. “I could update the scene and save it as a new USD layer, so when someone else opened it up, they had all of my updates immediately,” she said. “Or, if they were working on the scene at the same time, they’d be instantly notified of the updates and could fetch new content.”

Applying colors and textures to the balloon in Adobe Substance 3D Painter.

Edward McEvenue, aka edstudios, reflected on the immense value Omniverse on RTX hardware provides, displaying fully ray-traced graphics with instant feedback. “3D production is a very iterative process, where you have to make hundreds if not thousands of small decisions along the way before finalizing a scene,” he said. “Using GPU acceleration with RTX path tracing in the viewport makes that process so much easier, as you get near-instant feedback on the changes you’re making, with all of the full-quality lighting, shadows, reflections, materials and post-production effects directly in the working viewport.”

Edits to the 3D model in Blender are reflected in real time with photorealistic detail in Omniverse.

3D artist Shangyu Wang noted Omniverse is his preferred 3D collaborative content-creation platform. “Autodesk’s Unreal Live Link for Maya gave me a ray-traced, photorealistic preview of the scene in real time, no waiting to see the final render result,” he said.

Fellow 3D artist Pekka Varis mentioned Omniverse’s positive trajectory. “New features are coming in faster than I can keep up!” he said. “It can become the main standard of the metaverse.”

Omniverse transcends location, time and apps, where collaboration, communication and creativity reign supreme.

Download Omniverse today, free for all NVIDIA and GeForce RTX GPU owners — including those with new GeForce RTX 40 Series laptops.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. Learn more about Omniverse on Instagram, Medium, Twitter and YouTube for additional resources and inspiration. Check out the Omniverse forums, and join our Discord server and Twitch channel to chat with the community.

Share medical image research on Amazon SageMaker Studio Lab for free

This post is co-written with Stephen Aylward, Matt McCormick, Brianna Major from Kitware and Justin Kirby from the Frederick National Laboratory for Cancer Research (FNLCR).

Amazon SageMaker Studio Lab provides no-cost access to a machine learning (ML) development environment to everyone with an email address. Like the fully featured Amazon SageMaker Studio, Studio Lab allows you to customize your own Conda environment and create CPU- and GPU-scalable JupyterLab version 3 notebooks, with easy access to the latest data science productivity tools and open-source libraries. Moreover, Studio Lab free accounts include a minimum of 15 GB of persistent storage, enabling you to continuously maintain and expand your projects across multiple sessions, instantly pick up where you left off, and even share your ongoing work and work environments with others.

A key issue faced by the medical image community is how to enable researchers to experiment and explore with these essential tools. To solve this challenge, AWS teams worked with Kitware and the Frederick National Laboratory for Cancer Research (FNLCR) to bring together three major medical imaging AI resources for Studio Lab and the entire open-source JupyterLab community: MONAI for medical imaging deep learning, itkWidgets for interactive medical image visualization, and open imaging data from The Cancer Imaging Archive (TCIA).

These tools and data combine to allow medical imaging AI researchers to quickly develop and thoroughly evaluate clinically ready deep learning algorithms in a comprehensive and user-friendly environment. Team members from FNLCR and Kitware collaborated to create a series of Jupyter notebooks that demonstrate common workflows to programmatically access and visualize TCIA data. Because these notebooks run on Studio Lab, researchers can use them without setting up a local Jupyter development environment—you can quickly explore new ideas or integrate your work into presentations, workshops, and tutorials at conferences.

The following example illustrates Studio Lab running a Jupyter notebook that downloads TCIA prostate MRI data, segments it using MONAI, and displays the results using itkWidgets.

example visualization of TCIA prostate MRI data on Studio Lab notebook
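
The sketch below illustrates (but does not reproduce) that workflow: load an already-downloaded DICOM MRI series with MONAI, run a pre-trained segmentation bundle from the MONAI Model Zoo, and view the result with itkWidgets. The local path, the bundle name (prostate_mri_anatomy), and the pre/post-processing steps are assumptions to check against the actual notebooks and the Model Zoo.

```python
# Illustrative sketch: MONAI segmentation of a local DICOM MRI series, viewed with itkWidgets.
import torch
from monai.bundle import download, load
from monai.transforms import Compose, EnsureChannelFirst, LoadImage, ScaleIntensity
from itkwidgets import view

# Load the downloaded DICOM series as a single 3D volume (ITK-based reader).
preprocess = Compose([
    LoadImage(image_only=True),
    EnsureChannelFirst(),
    ScaleIntensity(),
])
volume = preprocess("./prostate_mri")            # assumed local DICOM directory

# Fetch a pre-trained prostate MRI segmentation bundle from the MONAI Model Zoo.
# The bundle name is an assumption; check the Model Zoo for the exact identifier.
download(name="prostate_mri_anatomy", bundle_dir="./bundles")
model = load(name="prostate_mri_anatomy", bundle_dir="./bundles")
model.eval()

with torch.no_grad():
    logits = model(volume.unsqueeze(0))          # add a batch dimension
labels = torch.argmax(logits, dim=1).squeeze().cpu().numpy()

# Interactive 3D visualization of the predicted label map in the notebook.
view(labels)
```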

Although you can easily carry out smaller experiments and demos with the sample notebooks presented in this post on Studio Lab for free, it is recommended to use Amazon SageMaker Studio when you train your own medical image models at scale. Amazon SageMaker Studio is an integrated web-based development environment (IDE) with enterprise-grade security, governance, and monitoring features from which you can access purpose-built tools to perform all ML development steps. Open-source libraries like MONAI Core and itkWidgets also run on Amazon SageMaker Studio.

Install the solution

To run the TCIA notebooks on Studio Lab, you need to register an account using your email address on the Studio Lab website. Account requests may take 1–3 days to get approved.

After that, you can follow the installation steps to get started:

  1. Log in to Studio Lab and start a CPU runtime.
  2. In a separate tab, navigate to the TCIA notebooks GitHub repo and choose a notebook in the root folder of the repository.
  3. Choose Open Studio Lab to open the notebook in Studio Lab.
    open notebook in studio lab
  4. Back in Studio Lab, choose Copy to project.
    copy to project on studio lab
  5. In the new JupyterLab pop-up that opens, choose Clone Entire Repo.
    clone entire repo to studio lab
  6. In the next window, keep the defaults and choose Clone.
    clone repository on studio lab
  7. Choose OK when prompted to confirm to build the new Conda environment (medical-image-ai).
    OK to build the conda environment
    Building the Conda environment will take up to 5 minutes.
  8. In the terminal that opened in the step before, run the following command to install NodeJS in the studiolab Conda environment, which is required to install the ImJoy JupyterLab 3 extension next: conda install -y -c conda-forge nodejs
    We now install the ImJoy Jupyter extension using the Studio Lab Extension Manager to enable interactive visualizations. The ImJoy extension allows itkWidgets and other data-intensive processes to communicate with local and remote Jupyter environments, including Jupyter notebooks, JupyterLab, Studio Lab, and so on.
  9. In the Extension Manager, search for “imjoy” and choose Install.
    extension manager on studio lab
  10. Confirm to rebuild the kernel when prompted.
    rebuild kernel after extension install
  11. Choose Save and Reload when the build is complete.
    save and reload Jupyter Lab notebook

After the installation of the ImJoy extension, you will be able to see the ImJoy icon in the top menu of your notebooks.

To verify this, navigate to the file browser, choose the TCIA_Image_Visualization_with_itkWidgets notebook, and choose the medical-image-ai kernel to run it.

verify kernel

The ImJoy icon will be visible in the upper left corner of the notebook menu.

imjoy extension icon on notebook

With these installation steps, you have successfully installed the medical-image-ai Python kernel and the ImJoy extension as the prerequisite to run the TCIA notebooks together with itkWidgets on Studio Lab.

Test the solution

We have created a set of notebooks and a tutorial that showcases the integration of these AI technologies in Studio Lab. Make sure to choose the medical-image-ai Python kernel when running the TCIA notebooks in Studio Lab.

The first SageMaker notebook shows how to download DICOM images from TCIA and visualize those images using the cinematic volume rendering capabilities of itkWidgets.

first sample notebook

The second notebook shows how the expert annotations that are available for hundreds of studies on TCIA can be downloaded as DICOM SEG and RTSTRUCT objects, visualized in 3D or as overlays on 2D slices, and used for training and evaluation of deep learning systems.

second sample notebook

The third notebook shows how pre-trained MONAI deep learning models available on MONAI’s Model Zoo can be downloaded and used to segment TCIA (or your own) DICOM prostate MRI volumes.

third sample notebook

Choose Open Studio Lab in these and other JupyterLab notebooks to launch those notebooks in the freely available Studio Lab environment.

Clean up

After you have followed the installation steps in this post and created the medical-image-ai Conda environment, you may want to delete it to save storage space. To do so, use the following command:

conda remove --name medical-image-ai --all

You can also uninstall the ImJoy extension via the Extension Manager. Be aware that you will need to recreate the Conda environment and reinstall the ImJoy extension if you want to continue working with the TCIA notebooks in your Studio Lab account later.

Close your tab and don’t forget to choose Stop Runtime on the Studio Lab project page.

Conclusion

SageMaker Studio Lab is accessible to medical image AI research communities at no cost and can be used for medical image AI modeling and interactive medical image visualization in combination with MONAI and itkWidgets. You can use the TCIA open data and sample notebooks with Studio Lab at training events, like hackathons and workshops. With this solution, scientists and researchers can quickly experiment, collaborate, and innovate with medical image AI. If you have an AWS account and have set up a SageMaker Studio domain, you can also run these notebooks on Studio using the default Data Science Python kernel (with the ImJoy-jupyter-extension installed) while selecting from a variety of compute instance types.

Studio Lab also launched a new feature at AWS re:Invent 2022 to take the notebooks developed in Studio Lab and run them as batch jobs on a recurring schedule in your AWS accounts. Therefore, you can scale your ML experiments beyond the free compute limitations of Studio Lab and use more powerful compute instances with much bigger datasets on your AWS accounts.

If you’re interested in learning more about how AWS can help your healthcare or life sciences organization, please contact an AWS representative. For more information on MONAI and itkWidgets, please contact Kitware. New data is being added to TCIA on an ongoing basis, and your suggestions and contributions are welcome by visiting the TCIA website.


About the Authors

Stephen Aylward is Senior Director of Strategic Initiatives at Kitware, an Adjunct Professor of Computer Science at The University of North Carolina at Chapel Hill, and a fellow of the MICCAI Society. Dr. Aylward founded Kitware’s office in North Carolina, has been a leader of several open-source initiatives, and is now Chair of the MONAI advisory board.

Matt McCormick, PhD, is a Distinguished Engineer at Kitware, where he leads development of the Insight Toolkit (ITK), a scientific image analysis toolkit. He has been a principal investigator and a co-investigator of several research grants from the National Institutes of Health (NIH), led engagements with United States national laboratories, and led various commercial projects providing advanced software for medical devices. Dr. McCormick is a strong advocate for community-driven open-source software, open science, and reproducible research.

Brianna Major is a Research and Development Engineer at Kitware with a passion for developing open source software and tools that will benefit the medical and scientific communities.

Justin Kirby is a Technical Project Manager at the Frederick National Laboratory for Cancer Research (FNLCR). His work is focused on methods to enable data sharing while preserving patient privacy to improve reproducibility and transparency in cancer imaging research. His team founded The Cancer Imaging Archive (TCIA) in 2010, which the research community has leveraged to publish over 200 datasets related to manuscripts, grants, challenge competitions, and major NCI research initiatives. These datasets have been discussed in over 1,500 peer reviewed publications.

Gang Fu is a Healthcare Solution Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over ten years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.

Alex Lemm is a Business Development Manager for Medical Imaging at AWS. Alex defines and executes go-to-market strategies with imaging partners and drives solutions development to accelerate AI/ML-based medical imaging research in the cloud. He is passionate about integrating open source ML frameworks with the AWS AI/ML stack.

Google Research, 2022 & beyond: Algorithms for efficient deep learning

(This is Part 4 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

The explosion in deep learning a decade ago was catapulted in part by the convergence of new algorithms and architectures, a marked increase in data, and access to greater compute. In the last 10 years, AI and ML models have become bigger and more sophisticated — they’re deeper, more complex, with more parameters, and trained on much more data, resulting in some of the most transformative outcomes in the history of machine learning.

As these models increasingly find themselves deployed in production and business applications, the efficiency and costs of these models have gone from a minor consideration to a primary constraint. In response, Google has continued to invest heavily in ML efficiency, taking on the biggest challenges in (a) efficient architectures, (b) training efficiency, (c) data efficiency, and (d) inference efficiency. Beyond efficiency, there are a number of other challenges around factuality, security, privacy and freshness in these models. Below, we highlight a panoply of works that demonstrate Google Research’s efforts in developing new algorithms to address the above challenges.

Efficient architectures

A fundamental question is “Are there better ways of parameterizing a model to allow for greater efficiency?” In 2022, we focused on new techniques for infusing external knowledge by augmenting models via retrieved context; mixture of experts; and making transformers (which lie at the heart of most large ML models) more efficient.

Context-augmented models

In the quest for higher quality and efficiency, neural models can be augmented with external context from large databases or trainable memory. By leveraging retrieved context, a neural network may not have to memorize the huge amount of world knowledge within its internal parameters, leading to better parameter efficiency, interpretability and factuality.

In “Decoupled Context Processing for Context Augmented Language Modeling”, we explored a simple architecture for incorporating external context into language models based on a decoupled encoder-decoder architecture. This led to significant computational savings while giving competitive results on auto-regressive language modeling and open domain question answering tasks. However, pre-trained large language models (LLMs) consume a significant amount of information through self-supervision on big training sets, and it is unclear precisely how the “world knowledge” of such models interacts with the presented context. With knowledge aware fine-tuning (KAFT), we strengthen both controllability and robustness of LLMs by incorporating counterfactual and irrelevant contexts into standard supervised datasets.
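
As a conceptual sketch (not the paper's architecture), the snippet below shows the decoupling idea: retrieved context is encoded once by a separate encoder whose outputs can be cached, while an autoregressive decoder consumes them through cross-attention. The vocabulary size, model dimensions, and module choices are illustrative.

```python
# Conceptual sketch of decoupled context processing for a language model.
import torch
import torch.nn as nn

class DecoupledContextLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Lightweight context encoder; its outputs can be precomputed and cached.
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        # The decoder cross-attends to the cached context states.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode_context(self, context_ids):
        return self.context_encoder(self.embed(context_ids))     # cacheable

    def forward(self, input_ids, context_states):
        x = self.embed(input_ids)
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.decoder(x, context_states, tgt_mask=causal)
        return self.lm_head(h)

model = DecoupledContextLM()
ctx = model.encode_context(torch.randint(0, 32000, (1, 128)))    # retrieved passage
logits = model(torch.randint(0, 32000, (1, 16)), ctx)            # next-token logits
```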

One of the questions in the quest for a modular deep network is how a database of concepts with corresponding computational modules could be designed. We proposed a theoretical architecture that would “remember events” in the form of sketches stored in an external LSH table with pointers to modules that process such sketches.

Another challenge in context-augmented models is fast retrieval on accelerators of information from a large database. We have developed a TPU-based similarity search algorithm that aligns with the performance model of TPUs and gives analytical guarantees on expected recall, achieving peak performance. Search algorithms typically involve a large number of hyperparameters and design choices that make it hard to tune them on new tasks. We have proposed a new constrained optimization algorithm for automating hyperparameter tuning. Fixing the desired cost or recall as input, the proposed algorithm generates tunings that empirically are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.

Mixture-of-experts models

Mixture-of-experts (MoE) models have proven to be an effective means of increasing neural network model capacity without overly increasing their computational cost. The basic idea of MoEs is to construct a network from a number of expert sub-networks, where each input is processed by a suitable subset of experts. Thus, compared to a standard neural network, MoEs invoke only a small portion of the overall model, resulting in high efficiency as shown in language model applications such as GLaM.

The decision of which experts should be active for a given input is determined by a routing function, the design of which is challenging, since one would like to prevent both under- and over-utilization of each expert. In a recent work, we proposed Expert Choice Routing, a new routing mechanism that, instead of assigning each input token to the top-k experts, assigns each expert to the top-k tokens. This automatically ensures load-balancing of experts while also naturally allowing for an input token to be handled by multiple experts.
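
A compact sketch of the expert-choice idea follows: each expert selects its top-k tokens according to the router scores, rather than each token selecting experts, so expert load is balanced by construction. The capacity factor, shapes, and expert networks are illustrative stand-ins.

```python
# Sketch of expert-choice routing for a mixture-of-experts layer.
import torch
import torch.nn.functional as F

def expert_choice_route(tokens, router_weights, experts, capacity_factor=2.0):
    """tokens: (n, d); router_weights: (d, n_experts); experts: list of modules."""
    n, _ = tokens.shape
    n_experts = len(experts)
    k = int(capacity_factor * n / n_experts)              # tokens each expert processes

    scores = F.softmax(tokens @ router_weights, dim=-1)    # (n, n_experts)
    # Each expert picks its top-k tokens by routing score.
    gating, token_idx = torch.topk(scores.t(), k, dim=-1)  # both (n_experts, k)

    out = torch.zeros_like(tokens)
    for e, expert in enumerate(experts):
        chosen = tokens[token_idx[e]]                       # (k, d)
        # Weight each expert's output by its gate and scatter-add back to tokens
        # (a token selected by several experts receives a mixture).
        out.index_add_(0, token_idx[e], gating[e].unsqueeze(-1) * expert(chosen))
    return out

d = 64
experts = [torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))
           for _ in range(4)]
router = torch.randn(d, 4)
y = expert_choice_route(torch.randn(32, d), router, experts)
```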

Efficient transformers

Transformers are popular sequence-to-sequence models that have shown remarkable success in a range of challenging problems from vision to natural language understanding. A central component of such models is the attention layer, which identifies the similarity between “queries” and “keys”, and uses these to construct a suitable weighted combination of “values”. While effective, attention mechanisms have poor (i.e., quadratic) scaling with sequence length.

As the scale of transformers continues to grow, it is interesting to study whether there are any naturally occurring structures or patterns in the learned models that may help us decipher how they work. Towards that, we studied the learned embeddings in intermediate MLP layers, revealing that they are very sparse — e.g., T5-Large models have <1% nonzero entries. Sparsity further suggests that we can potentially reduce FLOPs without affecting model performance.

We recently proposed Treeformer, an alternative to standard attention computation that relies on decision trees. Intuitively, this quickly identifies a small subset of keys that are relevant for a query and only performs the attention operation on this set. Empirically, the Treeformer can lead to a 30x reduction in FLOPs for the attention layer. We also introduced Sequential Attention, a differentiable feature selection method that combines attention with a greedy algorithm. This technique has strong provable guarantees for linear models and scales seamlessly to large embedding models.

Another way to make transformers efficient is by making the softmax computations faster in the attention layer. Building on our previous work on low-rank approximation of the softmax kernel, we proposed a new class of random features that provides the first “positive and bounded” random feature approximation of the softmax kernel and is computationally linear in the sequence length. We also proposed the first approach for incorporating various attention masking mechanisms, such as causal and relative position encoding, in a scalable manner (i.e., sub-quadratic with relation to the input sequence length).
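
The sketch below shows attention built on the standard positive random-feature approximation of the softmax kernel, which this line of work builds on and which makes the cost linear in sequence length; the bounded feature construction proposed in the newer work is not reproduced here, and all dimensions are illustrative.

```python
# Linear-time attention via positive random features for the softmax kernel.
import torch

def positive_random_features(x, proj):
    # x: (n, d); proj: (d, m) with entries drawn from N(0, 1).
    xw = x @ proj                                          # (n, m)
    sq = (x ** 2).sum(-1, keepdim=True) / 2                # |x|^2 / 2
    return torch.exp(xw - sq) / proj.shape[1] ** 0.5       # positive features, (n, m)

def linear_attention(q, k, v, n_features=128):
    d = q.shape[-1]
    proj = torch.randn(d, n_features)
    q_f = positive_random_features(q / d ** 0.25, proj)    # scaling mirrors softmax(qk/sqrt(d))
    k_f = positive_random_features(k / d ** 0.25, proj)
    kv = k_f.t() @ v                                       # (m, d_v), cost linear in n
    z = q_f @ k_f.sum(0)                                   # per-query normalizer, (n,)
    return (q_f @ kv) / z.unsqueeze(-1)

n, d = 1024, 64
out = linear_attention(torch.randn(n, d), torch.randn(n, d), torch.randn(n, d))
```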

Training efficiency

Efficient optimization methods are the cornerstone of modern ML applications and are particularly crucial in large scale settings. In such settings, even first order adaptive methods like Adam are often expensive, and training stability becomes challenging. In addition, these approaches are often agnostic to the architecture of the neural network, thereby ignoring its rich structure and leading to inefficient training. This motivates new techniques to more efficiently and effectively optimize modern neural network models. We are developing new architecture-aware training techniques, e.g., for training transformer networks, including new scale-invariant transformer networks and novel clipping methods that, when combined with vanilla stochastic gradient descent (SGD), result in faster training. Using this approach, for the first time, we were able to effectively train BERT using simple SGD without the need for adaptivity.

Moreover, with LocoProp we proposed a new method that achieves performance similar to that of a second-order optimizer while using the same computational and memory resources as a first-order optimizer. LocoProp takes a modular view of neural networks by decomposing them into a composition of layers. Each layer is then allowed to have its own loss function as well as output target and weight regularizer. With this setup, after a suitable forward-backward pass, LocoProp proceeds to perform parallel updates to each layer’s “local loss”. In fact, these updates can be shown to resemble those of higher-order optimizers, both theoretically and empirically. On a deep autoencoder benchmark, LocoProp achieves performance comparable to that of higher-order optimizers while being significantly faster.
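
Below is a simplified sketch of that layer-wise view: after one forward/backward pass, each layer receives a local target (its activation nudged along the negative gradient) and then runs a few independent local updates toward that target with a proximity term to its current weights. The squared-error local loss, step sizes, and toy network are illustrative stand-ins, not the paper's exact formulation.

```python
# Simplified sketch of layer-wise local updates in the spirit of LocoProp.
import torch
import torch.nn as nn

def locoprop_step(layers, x, y, loss_fn, gamma=0.1, local_steps=5, lr=0.01, lam=1e-3):
    # Forward pass, caching each layer's input and output.
    inputs, outputs = [], []
    h = x
    for layer in layers:
        inputs.append(h.detach())
        h = layer(h)
        outputs.append(h)
    loss = loss_fn(h, y)
    grads = torch.autograd.grad(loss, outputs)              # dL/d(activation) per layer

    # Local targets: current activations nudged against their gradients.
    targets = [(o - gamma * g).detach() for o, g in zip(outputs, grads)]

    # Independent (parallelizable) local updates for each layer.
    for layer, inp, tgt in zip(layers, inputs, targets):
        ref = [p.detach().clone() for p in layer.parameters()]
        opt = torch.optim.SGD(layer.parameters(), lr=lr)
        for _ in range(local_steps):
            opt.zero_grad()
            local = ((layer(inp) - tgt) ** 2).mean()
            prox = sum(((p - r) ** 2).sum() for p, r in zip(layer.parameters(), ref))
            (local + lam * prox).backward()
            opt.step()
    return loss.item()

layers = nn.ModuleList([nn.Sequential(nn.Linear(32, 64), nn.Tanh()),
                        nn.Sequential(nn.Linear(64, 10))])
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
locoprop_step(layers, x, y, nn.CrossEntropyLoss())
```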

One key assumption in optimizers like SGD is that each data point is sampled independently and identically from a distribution. This is unfortunately hard to satisfy in practical settings such as reinforcement learning, where the model (or agent) has to learn from data generated based on its own predictions. We proposed a new algorithmic approach named SGD with reverse experience replay, which finds optimal solutions in several settings like linear dynamical systems, non-linear dynamical systems, and in Q-learning for reinforcement learning. Furthermore, an enhanced version of this method — IER — turns out to be the state of the art and is the most stable experience replay technique on a variety of popular RL benchmarks.

Data efficiency

For many tasks, deep neural networks heavily rely on large datasets. In addition to the storage costs and potential security/privacy concerns that come along with large datasets, training modern deep neural networks on such datasets incurs high computational costs. One promising way to solve this problem is with data subset selection, where the learner aims to find the most informative subset from a large number of training samples to approximate (or even improve upon) training with the entire training set.

We analyzed a subset selection framework designed to work with arbitrary model families in a practical batch setting. In such a setting, a learner can sample examples one at a time, accessing both the context and true label, but in order to limit overhead costs, is only able to update its state (i.e., further train model weights) once a large enough batch of examples is selected. We developed an algorithm, called IWeS, that selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. We provide a theoretical analysis, proving generalization and sampling rate bounds.
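
A rough sketch of entropy-driven batch selection in the spirit of IWeS follows: at each round, a model trained on the previously selected batches scores the remaining pool, and new examples are sampled with probability proportional to predictive entropy. The logistic-regression learner, batch size, and synthetic data are illustrative, and the importance-weighting correction used by the actual algorithm is omitted.

```python
# Sketch of entropy-based batch subset selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy(probs, eps=1e-12):
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_subset(X_pool, y_pool, n_rounds=5, batch_size=100, seed=0):
    rng = np.random.default_rng(seed)
    selected = list(rng.choice(len(X_pool), batch_size, replace=False))  # seed batch
    for _ in range(n_rounds - 1):
        model = LogisticRegression(max_iter=200).fit(X_pool[selected], y_pool[selected])
        remaining = np.setdiff1d(np.arange(len(X_pool)), selected)
        # Sampling probability proportional to predictive entropy (uncertainty).
        ent = entropy(model.predict_proba(X_pool[remaining]))
        p = ent / ent.sum()
        new = rng.choice(remaining, batch_size, replace=False, p=p)
        selected.extend(new.tolist())
    return selected

X = np.random.randn(5000, 20)
y = (X[:, 0] + 0.5 * np.random.randn(5000) > 0).astype(int)
subset = select_subset(X, y)
final_model = LogisticRegression(max_iter=200).fit(X[subset], y[subset])
```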

Another concern with training large networks is that they can be highly sensitive to distribution shifts between training data and data seen at deployment time, especially when working with limited amounts of training data that might not cover all of deployment time scenarios. A recent line of work has hypothesized “extreme simplicity bias” as the key issue behind this brittleness of neural networks. Our latest work makes this hypothesis actionable, leading to two new complementary approaches — DAFT and FRR — that when combined provide significantly more robust neural networks. In particular, these two approaches use adversarial fine-tuning along with inverse feature predictions to make the learned network robust.

Inference efficiency

Increasing the size of neural networks has proven surprisingly effective in improving their predictive accuracy. However, it is challenging to realize these gains in the real-world, as the inference costs of large models may be prohibitively high for deployment. This motivates strategies to improve the serving efficiency, without sacrificing accuracy. In 2022, we studied different strategies to achieve this, notably those based on knowledge distillation and adaptive computation.

Distillation

Distillation is a simple yet effective method for model compression, which greatly expands the potential applicability of large neural models. Distillation has proved widely effective in a range of practical applications, such as ads recommendation. Most use-cases of distillation involve a direct application of the basic recipe to the given domain, with limited understanding of when and why this ought to work. Our research this year has looked at tailoring distillation to specific settings and formally studying the factors that govern the success of distillation.
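
For reference, the basic distillation recipe mentioned above amounts to a simple loss: the student matches temperature-softened teacher probabilities alongside the ground-truth labels. The temperature and mixing weight below are illustrative defaults.

```python
# The basic knowledge-distillation loss: soft teacher targets plus hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-scaled teacher and student distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    # Hard targets: usual cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```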

On the algorithmic side, by carefully modeling the noise in the teacher labels, we developed a principled approach to reweight the training examples, and a robust method to sample a subset of data to have the teacher label. In “Teacher Guided Training”, we presented a new distillation framework: rather than passively using the teacher to annotate a fixed dataset, we actively use the teacher to guide the selection of informative samples to annotate. This makes the distillation process shine in limited data or long-tail settings.

We also researched new recipes for distillation from a cross-encoder (e.g., BERT) to a factorized dual-encoder, an important setting for the task of scoring the relevance of a [query, document] pair. We studied the reasons for the performance gap between cross- and dual-encoders, noting that this can be the result of generalization rather than capacity limitation in dual-encoders. The careful construction of the loss function for distillation can mitigate this and reduce the gap between cross- and dual-encoder performance. Subsequently, in EmbedDistil, we looked at further improving dual-encoder distillation by matching embeddings from the teacher model. This strategy can also be used to distill from a large to small dual-encoder model, wherein inheriting and freezing the teacher’s document embeddings can prove highly effective.

On the theoretical side, we provided a new perspective on distillation through the lens of supervision complexity, a measure of how well the student can predict the teacher labels. Drawing on neural tangent kernel (NTK) theory, this offers conceptual insights, such as the fact that a capacity gap may affect distillation because such teachers’ labels may appear akin to purely random labels to the student. We further demonstrated that distillation can cause the student to underfit points the teacher model finds “hard” to model. Intuitively, this may help the student focus its limited capacity on those samples that it can reasonably model.

Adaptive computation

While distillation is an effective means of reducing inference cost, it does so uniformly across all samples. Intuitively however, some “easy” samples may inherently require less compute than the “hard” samples. The goal of adaptive compute is to design mechanisms that enable such sample-dependent computation.

Confident Adaptive Language Modeling introduced a controlled early-exit functionality to Transformer-based text generators such as T5. In this form of adaptive computation, the model dynamically modifies the number of transformer layers that it uses per decoding step. The early-exit gates use a confidence measure with a decision threshold that is calibrated to satisfy statistical performance guarantees. In this way, the model needs to compute the full stack of decoder layers for only the most challenging predictions. Easier predictions only require computing a few decoder layers. In practice, the model uses about a third of the layers for prediction on average, yielding 2–3x speed-ups while preserving the same level of generation quality.
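
A simplified sketch of the early-exit mechanism: an intermediate prediction head follows each layer, and computation stops once the top probability clears a threshold. In the paper the confidence threshold is calibrated to satisfy statistical guarantees and the exit is applied per decoding step of a text generator; here a fixed threshold and a toy encoder keep the idea minimal.

```python
# Toy confidence-based early exit over a stack of transformer layers.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, d_model=128, n_layers=6, n_classes=10, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.heads = nn.ModuleList(nn.Linear(d_model, n_classes) for _ in range(n_layers))
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for depth, (layer, head) in enumerate(zip(self.layers, self.heads), start=1):
            x = layer(x)
            probs = head(x[:, 0]).softmax(-1)      # predict from the first token
            if probs.max().item() >= self.threshold:
                return probs, depth                # exit early: skip remaining layers
        return probs, depth

model = EarlyExitEncoder().eval()
probs, layers_used = model(torch.randn(1, 16, 128))
print(f"prediction computed with {layers_used} of 6 layers")
```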

One popular adaptive compute mechanism is a cascade of two or more base models. A key issue in using cascades is deciding whether to simply use the current model’s predictions, or whether to defer prediction to a downstream model. Learning when to defer requires designing a suitable loss function, which can leverage appropriate signals to act as supervision for the deferral decision. We formally studied existing loss functions for this goal, demonstrating that they may underfit the training sample owing to an implicit application of label smoothing. We showed that one can mitigate this with post-hoc training of a deferral rule, which does not require modifying the model internals in any way.

For the retrieval applications, standard semantic search techniques use a fixed representation for each embedding generated by a large model. That is, irrespective of downstream task and its associated compute environment or constraints, the representation size and capability is mostly fixed. Matryoshka representation learning introduces flexibility to adapt representations according to the deployment environment. That is, it forces representations to have a natural ordering within its coordinates such that for resource constrained environments, we can use only the top few coordinates of the representation, while for richer and precision-critical settings, we can use more coordinates of the representation. When combined with standard approximate nearest neighbor search techniques like ScaNN, MRL is able to provide up to 16x lower compute with the same recall and accuracy metrics.
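
A brief sketch of the nesting idea: the training loss is applied to nested prefixes of the embedding so that shorter prefixes remain useful on their own at deployment time. The backbone, per-prefix heads, and prefix sizes are illustrative.

```python
# Sketch of Matryoshka-style training over nested embedding prefixes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaClassifier(nn.Module):
    def __init__(self, in_dim=512, embed_dim=256, n_classes=100, nest=(32, 64, 128, 256)):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU())
        self.nest = nest
        # One classifier head per prefix length.
        self.heads = nn.ModuleList(nn.Linear(m, n_classes) for m in nest)

    def forward(self, x):
        z = self.backbone(x)
        # Logits from each nested prefix z[:, :m] of the full embedding.
        return [head(z[:, :m]) for head, m in zip(self.heads, self.nest)]

def matryoshka_loss(logits_per_prefix, labels):
    return sum(F.cross_entropy(logits, labels) for logits in logits_per_prefix)

model = MatryoshkaClassifier()
x, y = torch.randn(8, 512), torch.randint(0, 100, (8,))
loss = matryoshka_loss(model(x), y)
loss.backward()
# At inference, a resource-constrained system can use only the first 32 coordinates.
```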

Concluding thoughts

Large ML models are showing transformational outcomes in several domains, but efficiency in both training and inference is emerging as a critical need to make these models practical in the real world. Google Research has been investing significantly in making large ML models efficient by developing new foundational techniques. This is an ongoing effort and over the next several months we will continue to explore core challenges to make ML models even more robust and efficient.

Acknowledgements

The work in efficient deep learning is a collaboration among many researchers from Google Research, including Amr Ahmed, Ehsan Amid, Rohan Anil, Mohammad Hossein Bateni, Gantavya Bhatt, Srinadh Bhojanapalli, Zhifeng Chen, Felix Chern, Gui Citovsky, Andrew Dai, Andy Davis, Zihao Deng, Giulia DeSalvo, Nan Du, Avi Dubey, Matthew Fahrbach, Ruiqi Guo, Blake Hechtman, Yanping Huang, Prateek Jain, Wittawat Jitkrittum, Seungyeon Kim, Ravi Kumar, Aditya Kusupati, James Laudon, Quoc Le, Daliang Li, Zonglin Li, Lovish Madaan, David Majnemer, Aditya Menon, Don Metzler, Vahab Mirrokni, Vaishnavh Nagarajan, Harikrishna Narasimhan, Rina Panigrahy, Srikumar Ramalingam, Ankit Singh Rawat, Sashank Reddi, Aniket Rege, Afshin Rostamizadeh, Tal Schuster, Si Si, Apurv Suman, Phil Sun, Erik Vee, Chong You, Felix Yu, Manzil Zaheer, and Yanqi Zhou.

Google Research, 2022 & beyond

This was the fourth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models · Computer Vision · Multimodal Models
Generative Models · Responsible AI · ML & Computer Systems
Efficient Deep Learning · Algorithmic Advances* · Robotics
Health · General Science & Quantum · Community Engagement
* Articles will be linked as they are released.
