Datasets at your fingertips in Google Search

Access to datasets is critical to many of today’s endeavors across verticals and industries, whether scientific research, business analysis, or public policy. In the scientific community and throughout various levels of the public sector, reproducibility and transparency are essential for progress, so sharing data is vital. For example, a recent policy in the United States requires free and equitable access to the outcomes of all federally funded research, including data and statistical information along with publications.

To facilitate discovery of content with this level of statistical detail and better distill this information from across the web, Google now makes it easier to search for datasets. You can click on any of the top three results (see below) to get to the dataset page or you can explore further by clicking “More datasets.” Here is an example:

When users search for datasets in Google Search, they find a dedicated section highlighting pages with dataset descriptions. They can explore many more datasets by clicking on “More datasets” and going to Dataset Search.

Powered by Dataset Search

Dataset Search, a dedicated search engine for datasets, powers this feature and indexes more than 45 million datasets from more than 13,000 websites. Datasets cover many disciplines and topics, including government, scientific, and commercial datasets. Dataset Search shows users essential metadata about datasets and previews of the data where available. Users can then follow the links to the data repositories that host the datasets.

Dataset Search primarily indexes dataset pages on the Web that contain schema.org structured data. The schema.org metadata allows Web page authors to describe the semantics of the page: the entities on the pages and their properties. For dataset pages, schema.org metadata describes key elements of the datasets, such as their description, license, temporal and spatial coverage, and available download formats. In addition to aggregating this metadata and providing easy access to it, Dataset Search normalizes and reconciles the metadata that comes directly from the Web pages.

If you are a dataset author or provider and want others to find your datasets in Search, make sure that you publish your dataset in a way that makes it discoverable and specifies how others can reuse the data. Specifically, ensure that the Web page that describes the dataset has machine-readable metadata. The easiest way to ensure this is to publish your dataset in an established dataset repository. Some repositories cater to specific research communities, while others are “generalists” (figshare.com, zenodo.org, datadryad.org, kaggle.com, etc.). These repositories automatically include metadata in dataset pages for every dataset, which makes it easy for search engines to discover and include them in specialized result sections, as in the figure above.
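
For illustration, here is a minimal sketch (written in Python only to keep the example self-contained) of the kind of schema.org Dataset metadata a repository page might embed as JSON-LD; every field value below is hypothetical.

import json

# Hypothetical schema.org/Dataset description; a repository would embed the
# resulting JSON-LD in a <script type="application/ld+json"> tag on the dataset page.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example City Tree Inventory",  # hypothetical dataset name
    "description": "Street-tree locations and species collected in 2022.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "temporalCoverage": "2022-01-01/2022-12-31",
    "spatialCoverage": "Example City",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/trees.csv",
    }],
}

print(json.dumps(dataset_metadata, indent=2))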

As data sharing continues to grow and evolve, we will continue to make datasets as easy to find, access, and use as any other type of information on the web.

Acknowledgments

We are extremely grateful to the numerous Googlers who contributed to developing and launching this feature, including: Rachel Zax, Damian Biollo, Shiyu Chen, Jonathan Drake, Sunil Vemuri, Stephen Tseou, Amit Bapat, Will Leszczuk, Marc Najork, Sergei Vassilvitskii, Bruno Possas, and Corinna Cortes.

Google Research, 2022 & beyond: Research community engagement

(This is Part 9 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Sharing knowledge is essential to Google’s research philosophy — it accelerates technological progress and expands capabilities community-wide. Solving complex problems requires bringing together diverse minds and resources collaboratively. This can be accomplished through building local and global connections with multidisciplinary experts and impacted communities. In partnership with these stakeholders, we bring our technical leadership, product footprint, and resources to make progress against some of society’s greatest opportunities and challenges.

We at Google see it as our responsibility to disseminate our work as contributing members of the scientific community and to help train the next generation of researchers. To do this well, collaborating with experts and researchers outside of Google is essential. In fact, just over half of our scientific publications highlight work done jointly with authors outside of Google. We are grateful to work collaboratively across the globe and have only increased our efforts with the broader research community over the past year. In this post, we will talk about some of the opportunities afforded by such partnerships, including:

Addressing social challenges together

Engaging the wider community helps us progress on seemingly intractable problems. For example, access to timely, accurate health information is a significant challenge among women in rural and densely populated urban areas across India. To solve this challenge, ARMMAN developed mMitra, a free mobile service that sends preventive care information to expectant and new mothers. Adherence to such public health programs is a prevalent challenge, so researchers from Google Research and the Indian Institute of Technology, Madras worked with ARMMAN to design an ML system that alerts healthcare providers about participants at risk of dropping out of the health information program. This early identification helps ARMMAN provide better-targeted support, improving maternal health outcomes.

Google Research worked with ARMMAN to design a system to alert healthcare providers about participants at risk of dropping out of their preventive care information program for expectant mothers. This plot shows the cumulative engagement drops prevented using our restless multi-armed bandit model (RMAB) compared to the control group (Round Robin).

We also support Responsible AI projects directly for other organizations — including our commitment of $3M to fund the new INSAIT research center based in Bulgaria. Further, to help build a foundation of fairness, interpretability, privacy, and security, we are supporting the establishment of a first-of-its-kind multidisciplinary Center for Responsible AI with a grant of $1M to the Indian Institute of Technology, Madras.

Training the next generation of researchers

Part of our responsibility in guiding how technology affects society is to help train the next generation of researchers. For example, we support equitable student persistence in computing research through our Computer Science Research Mentorship Program, in which Googlers have mentored over one thousand students since 2018, 86% of whom identify as part of a historically marginalized group.

We work towards inclusive goals and work across the globe to achieve them. In 2022, we expanded our research interactions and programs to faculty and students across Latin America, which included grants to women in computer science in Ecuador. We partnered with ENS, a university in France, to help fund scholarships for students to train through research. Another example is our collaboration with the Computing Alliance of Hispanic-Serving Institutions (CAHSI) to provide $4.8 million to support more than 30 collaborative research projects and over 3,000 Hispanic students and faculty across a network of Hispanic-serving institutions.

Efforts like these foster the research ecosystem and help the community give back. Through exploreCSR, we partner with universities to provide students with introductory experiences in research, such as Rice University’s regional workshop on applications and research in data science (ReWARDS), which was delivered in rural Peru by faculty from Rice. Similarly, one of our Awards for Inclusion Research led to a faculty member helping startups in Africa use AI.

The funding we provide is most often unrestricted and leads to inspiring results. Last year, for example, Kean University was one of 53 institutions to receive an exploreCSR award. It used the funding to create the Research Recruits Program, a two-semester program designed to give undergraduates an introductory opportunity to participate in research with a faculty mentor. A student at Kean with a chronic condition that requires him to take different medications every day, a struggle that affects so many, decided to pursue research on the topic with a peer. Their research, set to be published this year, demonstrates an ML solution, built with Google’s TensorFlow, that can identify pills with 99.8% certainty when used correctly. Results like these are why we continue to invest in younger generations, further demonstrated by our long-term commitment to funding PhD Fellows every year across the globe.

Building an inclusive ecosystem is imperative. To this end, we’ve also partnered with the non-profit Black in Robotics (BiR), formed to address the systemic inequities in the robotics community. Together, we established doctoral student awards that provide financial support to graduate students, and we support BiR’s newly established Bay Area Robotics lab. We also help make global conferences accessible to more researchers around the world, for example, by funding 24 students this year to attend Deep Learning Indaba in Tunisia.

Collaborating to advance scientific innovations

In 2022, Google sponsored over 150 research conferences and even more workshops, which led to invaluable engagements with the broader research community. At research conferences, Googlers serve on program committees and organize workshops, tutorials, and numerous other activities to collectively advance the field. Additionally, last year we hosted over 14 dedicated workshops to bring together researchers, such as the 2022 Quantum Symposium, which generates new ideas and directions for the research field, further advancing research initiatives. In 2022, we authored 2,400 papers, many of which were presented at leading research conferences, such as NeurIPS, EMNLP, ECCV, Interspeech, ICML, CVPR, ICLR, and many others. More than 50% of these papers were authored in collaboration with researchers beyond Google.

Over the past year, we’ve expanded our engagement models to bring students, faculty, and Google research scientists together across schools to form constructive research triads. One such project, undertaken in partnership with faculty and students from Georgia Tech, aims to develop a robot guide dog with human behavior modeling and safe reinforcement learning. Throughout 2022, we gave over 224 grants to researchers and over $10M in Google Cloud Platform credits for topics ranging from the improvement of algorithms for post-quantum cryptography with collaborators at CNRS in France to fostering cybersecurity research at TU Munich and Fraunhofer AISEC in Germany.

In 2022, we made 22 new multi-year commitments totaling over $80M to 65 institutions across nine countries, where each year we will host workshops to select over 100 research projects of mutual interest. For example, in a growing partnership, we are supporting the new Max Planck VIA-Center in Germany to work together on robotics. Another large area of investment is a close partnership with four universities in Taiwan (NTU, NCKU, NYCU, NTHU) to increase innovation in silicon chip design and improve competitiveness in semiconductor design and manufacturing. We aim to collaborate by default and were proud to be recently named one of Australia’s top collaborating companies.

Fueling innovation in products and engineering

The community fuels innovation at Google. For example, by enabling student researchers to work with us on defined research projects, we’ve experienced both incremental and more dramatic improvements. Together with visiting researchers, we combine information, compute power, and a great deal of expertise to bring about breakthroughs, such as leveraging our undersea internet cables to detect earthquakes. Visiting researchers also worked hand-in-hand with us to develop Minerva, a state-of-the-art solution that came about by training a deep learning model on a dataset that contains quantitative reasoning with symbolic expressions.

Minerva incorporates recent prompting and evaluation techniques to better solve mathematical questions. It then employs majority voting, in which it generates multiple solutions to each question and chooses the most common answer as the solution, thus improving performance significantly.
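
As a toy illustration of the majority-voting step described above (not Minerva’s actual implementation), choosing the most common of several sampled answers is a simple counting operation:

from collections import Counter

# Hypothetical answers sampled from a model for a single question.
sampled_answers = ["42", "41", "42", "42", "40"]

# Majority voting: pick the answer that occurs most often across samples.
majority_answer, votes = Counter(sampled_answers).most_common(1)[0]
print(majority_answer, votes)  # -> 42, 3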

Open-sourcing datasets and tools

Engaging with the broader research community is a core part of our efforts to build a more collaborative ecosystem. We support the general advancement of ML and related research through the release of open-source code and datasets. We continued to grow open source datasets in 2022, for example, in natural language processing and vision, and expanded our global index of available datasets in Google Dataset Search. We also continued to release sustainability data via Data Commons and invite others to use it for their research. See some of the datasets and tools we released in 2022 listed below.

Datasets:

  • Auto-Arborist – A multiview urban tree classification dataset that consists of ~2.6M trees covering >320 genera, which can aid in the development of models for urban forest monitoring.
  • Bazel GitHub Metrics – A dataset with GitHub download counts of release artifacts from selected bazelbuild repositories.
  • BC-Z demonstration – Episodes of a robotic arm performing 100 different manipulation tasks. Data for each episode includes the RGB video, the robot’s end-effector positions, and the natural language embedding.
  • BEGIN V2 – A benchmark dataset for evaluating dialog systems and natural language generation metrics.
  • CLSE: Corpus of Linguistically Significant Entities – A dataset of named entities annotated by linguistic experts. It includes 34 languages and 74 different semantic types to support various applications from airline ticketing to video games.
  • CocoChorales – A dataset consisting of over 1,400 hours of audio mixtures containing four-part chorales performed by 13 instruments, all synthesized with realistic-sounding generative models.
  • Crossmodal-3600 – A geographically diverse dataset of 3,600 images, each annotated with human-generated reference captions in 36 languages.
  • CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus – A Common Voice-based speech-to-speech translation corpus that includes 2,657 hours of speech-to-speech translation sentence pairs from 21 languages into English.
  • DSTC11 Challenge Task – This challenge evaluates task-oriented dialog systems end-to-end, from users’ spoken utterances to inferred slot values.
  • EditBench – A comprehensive diagnostic and evaluation dataset for text-guided image editing.
  • Few-shot Regional Machine Translation – FRMT is a few-shot evaluation dataset containing en-pt and en-zh bitexts translated from Wikipedia, in two regional varieties for each non-English language (pt-BR and pt-PT; zh-CN and zh-TW).
  • Google Patent Phrase Similarity – A human-rated contextual phrase-to-phrase matching dataset focused on technical terms from patents.
  • Hinglish-TOP – Hinglish-TOP is the largest code-switched semantic parsing dataset with 10k entries annotated by humans, and 170K generated utterances using the CST5 augmentation technique introduced in the paper.
  • ImPaKT – A dataset that contains semantic parsing annotations for 2,489 sentences from shopping web pages in the C4 corpus, corresponding to annotations of 3,719 expressed implication relationships and 6,117 typed and summarized attributes.
  • InFormal – A formality style transfer dataset for four Indic Languages, made up of a pair of sentences and a corresponding gold label identifying the more formal and semantic similarity.
  • MAVERICS – A suite of test-only visual question answering datasets, created from Visual Question Answering image captions with question answering validation and manual verification.
  • MetaPose – A dataset with 3D human poses and camera estimates predicted by the MetaPose model for a subset of the public Human36M dataset with input files necessary to reproduce these results from scratch.
  • MGnify proteins – A 2.4B-sequence protein database with annotations.
  • MiQA: Metaphorical Inference Questions and Answers – MiQA assesses the capability of language models to reason with conventional metaphors. It combines the previously isolated topics of metaphor detection and commonsense reasoning into a single task that requires a model to make inferences by selecting between the literal and metaphorical register.
  • MT-Opt – A dataset of task episodes collected across a fleet of real robots, following the RLDS format to represent steps and episodes.
  • MultiBERTs Predictions on Winogender – Predictions of BERT on Winogender before and after several different interventions.
  • Natural Language Understanding Uncertainty Evaluation – NaLUE is a relabelled and aggregated version of three large NLU corpuses: CLINC150, Banks77 and HWU64. It contains 50k utterances spanning 18 verticals, 77 domains, and ~260 intents.
  • NewsStories – A collection of url links to publicly available news articles with their associated images and videos.
  • Open Images V7 – Open Images V7 expands the Open Images dataset with new point-level label annotations, which provide localization information for 5.8k classes, and a new all-in-one visualization tool for better data exploration.
  • Pfam-NUniProt2 – A set of 6.8 million new protein sequence annotations.
  • Re-contextualizing Fairness in NLP for India – A dataset of region and religion-based societal stereotypes in India, with a list of identity terms and templates for reproducing the results from the “Re-contextualizing Fairness in NLP” paper.
  • Scanned Objects – A dataset with 1,000 common household objects that have been 3D scanned for use in robotic simulation and synthetic perception research.
  • Specialized Rater Pools – This dataset comes from a study designed to understand whether annotators with different self-described identities interpret toxicity differently. It contains the unaggregated toxicity annotations of 25,500 comments from pools of raters who self-identify as African American, LGBTQ, or neither.
  • UGIF – A multi-lingual, multi-modal UI grounded dataset for step-by-step task completion on the smartphone.
  • UniProt Protein Names – Data release of ~49M protein name annotations predicted from their amino acid sequence.
  • Upwelling irradiance from GOES-16 – Climate researchers can use the 4 years of outgoing longwave radiation and reflected shortwave radiation data to analyze important climate forcers, such as aircraft condensation trails.
  • UserLibri – The UserLibri dataset reorganizes the existing popular LibriSpeech dataset into individual “user” datasets consisting of paired audio-transcript examples and domain-matching text-only data for each user. This dataset can be used for research in speech personalization or other language processing fields.
  • VideoCC – A dataset containing (video-URL, caption) pairs for training video-text machine learning models.
  • Wiki-conciseness – A manually curated evaluation set in English for concise rewrites of 2,000 Wikipedia sentences.
  • Wikipedia Translated Clusters – Introductions to English Wikipedia articles and their parallel versions in 10 other languages, with machine translations to English. Also includes synthetic corruptions to the English versions, to be identified with NLI models.
  • Workload Traces 2022 – A dataset with traces that aim to help system designers better understand warehouse-scale computing workloads and develop new solutions for front-end and data-access bottlenecks.

Tools:

  • Differential Privacy Open Source Library – An open-source library to enable developers to use analytic techniques based on DP.
  • Mood Board Search – The result of collaborative work with artists, photographers, and image researchers to demonstrate how ML can enable people to visually explore subjective concepts in image datasets.
  • Project Relate – An Android beta app that uses ML to help people with non-standard speech make their voices heard.
  • TensorStore – TensorStore is an open-source C++ and Python library designed for storage and manipulation of n-dimensional data, which can address key engineering challenges in scientific computing through better management and processing of large datasets.
  • The Data Cards Playbook – A Toolkit for Transparency in Dataset Documentation.

Conclusion

Research is an amplifier, an accelerator, and an enabler — and we are grateful to partner with so many incredible people to harness it for the good of humanity. Even when investing in research that advances our products and engineering, we recognize that, ultimately, this fuels what we can offer our users. We welcome more partners to engage with us and maximize the benefits of AI for the world.

Acknowledgements

Thank you to our many research partners across the globe, including academics, universities, NGOs, and research organizations, for continuing to engage and work with Google on exciting research efforts. There are many teams within Google who make this work possible, including Google’s research teams and community, research partnerships, education, and policy teams. Finally, I would especially like to thank those who provided helpful feedback in the development of this post, including Sepi Hejazi Moghadam, Jill Alvidrez, Melanie Saldaña, Ashwani Sharma, Adriana Budura Skobeltsyn, Aimin Zhu, Michelle Hurtado, Salil Banerjee and Esmeralda Cardenas.

Google Research, 2022 & beyond

This was the ninth and final blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Extract non-PHI data from Amazon HealthLake, reduce complexity, and increase cost efficiency with Amazon Athena and Amazon SageMaker Canvas

In today’s highly competitive market, performing data analytics using machine learning (ML) models has become a necessity for organizations. It enables them to unlock the value of their data, identify trends, patterns, and predictions, and differentiate themselves from their competitors. For example, in the healthcare industry, ML-driven analytics can be used for diagnostic assistance and personalized medicine, while in health insurance, it can be used for predictive care management.

However, organizations and users in industries where there is potential health data, such as in healthcare or in health insurance, must prioritize protecting the privacy of people and comply with regulations. They are also facing challenges in using ML-driven analytics for an increasing number of use cases. These challenges include a limited number of data science experts, the complexity of ML, and the low volume of data due to restricted Protected Health Information (PHI) and infrastructure capacity.

Organizations in the healthcare, clinical, and life sciences industries are facing several challenges in using ML for data analytics:

  • Low volume of data – Due to restrictions on private, protected, and sensitive health information, the volume of usable data is often limited, reducing the accuracy of ML models
  • Limited talent – Hiring ML talent is hard enough, but hiring talent that has not only ML experience but also deep medical knowledge is even harder
  • Infrastructure management – Provisioning infrastructure specialized for ML is a difficult and time-consuming task, and companies would rather focus on their core competencies than manage complex technical infrastructure
  • Prediction of multi-modal issues – When predicting the likelihood of multi-faceted medical events, such as a stroke, different factors such as medical history, lifestyle, and demographic information must be combined

A possible scenario is that you are a healthcare technology company with a team of 30 non-clinical physicians researching and investigating medical cases. This team has the knowledge and intuition in healthcare but not the ML skills to build models and generate predictions. How can you deploy a self-service environment that allows these clinicians to generate predictions themselves for multivariate questions like, “How can I access useful data while being compliant with health regulations and without compromising privacy?” And how can you do that without exploding the number of servers the SysOps folks need to manage?

This post addresses all these problems simultaneously in one solution. First, it automatically anonymizes the data from Amazon HealthLake. Then, it uses that data with serverless components and no-code self-service solutions like Amazon SageMaker Canvas to eliminate the ML modeling complexity and abstract away the underlying infrastructure.

A modern data strategy gives you a comprehensive plan to manage, access, analyze, and act on data. AWS provides the most complete set of services for the entire end-to-end data journey for all workloads, all types of data, and all desired business outcomes.

Solution overview

This post shows that by anonymizing sensitive data from Amazon HealthLake and making it available to SageMaker Canvas, organizations can empower more stakeholders to use ML models that can generate predictions for multi-modal problems, such as stroke prediction, without writing ML code, while limiting access to sensitive data. And we want to automate that anonymization to make this as scalable and amenable to self-service as possible. Automating also allows you to iterate the anonymization logic to meet your compliance requirements, and provides the ability to re-run the pipeline as your population’s health data changes.

The dataset used in this solution is generated by Synthea™, a Synthetic Patient Population Simulator and open-source project under the Apache License 2.0.

The workflow includes a hand-off between cloud engineers and domain experts. The former can deploy the pipeline. The latter can verify if the pipeline is correctly anonymizing the data and then generate predictions without code. At the end of the post, we’ll look at additional services to verify the anonymization.

The high-level steps involved in the solution are as follows:

  1. Use AWS Step Functions to orchestrate the health data anonymization pipeline.
  2. Use Amazon Athena queries for the following:
    1. Extract non-sensitive structured data from Amazon HealthLake.
    2. Use natural language processing (NLP) in Amazon HealthLake to extract non-sensitive data from unstructured blobs.
  3. Perform one-hot encoding with Amazon SageMaker Data Wrangler.
  4. Use SageMaker Canvas for analytics and predictions.

The following diagram illustrates the solution architecture.

Architecture Diagram

Prepare the data

First, we generate a fictional patient population using Synthea™ and import that data into a newly created Amazon HealthLake data store. The result is a simulation of the starting point from where a healthcare technology company could run the pipeline and solution described in this post.

When Amazon HealthLake ingests data, it automatically extracts meaning from unstructured data, such as doctors’ notes, into separate structured fields, such as patient names and medical conditions. To accomplish this on the unstructured data in DocumentReference FHIR resources, Amazon HealthLake transparently triggers Amazon Comprehend Medical, where entities, ontologies, and their relationships are extracted and added back to Amazon HealthLake as discrete data within the extension segment of records.

We can use Step Functions to streamline the collection and preparation of the data. The entire workflow is visible in one place, with any errors or exceptions highlighted, allowing for a repeatable, auditable, and extendable process.
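
As a rough sketch of what that orchestration could look like, the following snippet registers a minimal state machine whose single state runs an Athena query through the Step Functions Athena service integration. The query, output bucket, state machine name, and role ARN are placeholders, and a real pipeline would add further states and error handling.

import json
import boto3

# Hypothetical single-step state machine: run one Athena query and wait for it to finish.
definition = {
    "StartAt": "ExtractNonPHI",
    "States": {
        "ExtractNonPHI": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT id, gender FROM patient",  # placeholder query
                "WorkGroup": "primary",
                "ResultConfiguration": {
                    "OutputLocation": "s3://example-bucket/athena-results/"
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="healthlake-anonymization-pipeline",                       # placeholder name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/ExampleStepFunctionsRole",  # placeholder role
)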

Query the data using Athena

By running Athena SQL queries directly on Amazon HealthLake, we are able to select only those fields that are not personally identifying; for example, not selecting name and patient ID, and reducing birthdate to birth year. And by using Amazon HealthLake, our unstructured data (the text field in DocumentReference) automatically comes with a list of detected PHI, which we can use to mask the PHI in the unstructured data. In addition, because generated Amazon HealthLake tables are integrated with AWS Lake Formation, you can control who gets access down to the field level.
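
A minimal sketch of how such a de-identifying query could be issued from Python with boto3 follows; the table, columns, database, and output location are illustrative rather than the actual HealthLake schema.

import boto3

athena = boto3.client("athena")

# Illustrative query: keep only non-identifying fields and coarsen birthdate to a year.
query = """
SELECT
  gender,
  substr(birthdate, 1, 4) AS birth_year
FROM patient
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "healthlake_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(response["QueryExecutionId"])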

The following is an excerpt from an example of unstructured data found in a synthetic DocumentReference record:

# History of Present Illness
Marquis
is a 45 year-old. Patient has a history of hypertension, viral sinusitis (disorder), chronic obstructive bronchitis (disorder), stress (finding), social isolation (finding).
# Social History
Patient is married. Patient quit smoking at age 16.
Patient currently has UnitedHealthcare.
# Allergies
No Known Allergies.
# Medications
albuterol 5 mg/ml inhalation solution; amlodipine 2.5 mg oral tablet; 60 actuat fluticasone propionate 0.25 mg/actuat / salmeterol 0.05 mg/actuat dry powder inhaler
# Assessment and Plan
Patient is presenting with stroke.

We can see that Amazon HealthLake NLP interprets this as containing the condition “stroke” by querying for the condition record that has the same patient ID and displays “stroke.” We can also take advantage of the fact that entities found in DocumentReference are automatically labeled SYSTEM_GENERATED:

SELECT
  code.coding[1].code,
  code.coding[1].display
FROM
  condition
WHERE
  split(subject.reference, '/')[2] = 'fbfe46b4-70b1-8f61-c343-d241538ebb9b'
AND
  meta.tag[1].display = 'SYSTEM_GENERATED'
AND regexp_like(code.coding[1].display, 'Cerebellar stroke syndrome')

The result is as follows:

G46.4, Cerebellar stroke syndrome

The data collected in Amazon HealthLake can now be effectively used for analytics thanks to the ability to select specific condition codes, such as G46.4, rather than having to interpret entire notes. This data is then stored as a CSV file in Amazon Simple Storage Service (Amazon S3).

Note: When implementing this solution, please follow the instructions on turning on HealthLake’s integrated NLP feature via a support case before ingesting data into your HealthLake data store.

Perform one-hot encoding

To unlock the full potential of the data, we use a technique called one-hot encoding to convert categorical columns, like the condition column, into numerical data.

One of the challenges of working with categorical data is that it is not as amenable to being used in many machine learning algorithms. To overcome this, we use one-hot encoding, which converts each category in a column to a separate binary column, making the data suitable for a wider range of algorithms. This is done using Data Wrangler, which has built-in functions for this:

The built-in function for one-hot encoding in SageMaker Data Wrangler

One-hot encoding transforms each unique value in the categorical column into a binary representation, resulting in a new set of columns for each unique value. In the example below, the condition column is transformed into six columns, each representing one unique value. After one-hot encoding, the same rows would turn into a binary representation.

Before_and_After_One-hot_encoding_tables
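
As a minimal illustration of the same transformation outside Data Wrangler, pandas can one-hot encode a condition column in a few lines; the toy values below are made up.

import pandas as pd

# Toy frame with a categorical "condition" column (values are made up).
df = pd.DataFrame({
    "patient": ["a", "b", "c"],
    "condition": ["Stroke", "Hypertension", "Stroke"],
})

# One-hot encode: each unique condition value becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["condition"], prefix="condition")
print(encoded)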

With the data now encoded, we can move on to using SageMaker Canvas for analytics and predictions.

Use SageMaker Canvas for analytics and predictions

The final CSV file then becomes the input for SageMaker Canvas, which healthcare analysts (business users) can use to generate predictions for multivariate problems like stroke prediction without needing to have expertise in ML. No special permissions are required because the data doesn’t contain any sensitive information.

In the example of stroke prediction, SageMaker Canvas was able to achieve an accuracy rate of 99.829% through the use of advanced ML models, as shown in the following screenshot.

The analyze screen within SageMaker Canvas showing 99.829% as how often the model correctly predicts stroke

In the next screenshot, you can see that, according to the model’s prediction, this patient has a 53% chance of not having a stroke.

SageMaker Canvas Predict screen showing that the prediction is No stroke based on the input of not in labor force among other inputs.

You might posit that you can create this prediction using rule-based logic in a spreadsheet. But do those rules tell you the feature importance—for example, that 4.9% of the prediction is based on whether or not they have ever smoked tobacco? And what if, in addition to the current columns like smoking status and blood pressure, you add 900 more columns (features)? Would you still be able to use a spreadsheet to maintain and manage the combinations of all those dimensions? Real-life scenarios lead to many combinations, and the challenge is to manage this at scale with the right level of effort.

Now that we have this model, we can start making batch or single predictions, asking what-if questions. For example, what if this person keeps all variables the same but, as of the previous two encounters with the medical system, is classified as Full-time employment instead of Not in labor force?

According to our model, and the synthetic data we fed it from Synthea, the person is at a 62% risk of having a stroke.

SageMaker Canvas predict screen showing Yes as the prediction and full time employment as an input.

As we can tell from the circled 12% and 10% feature importance of the conditions from the two most recent encounters with the medical system, whether they are full-time employed or not has a big impact on their risk of stroke. Beyond the findings of this model, there is research that demonstrates a similar link:

These studies have used large population-based samples and controlled for other risk factors, but it’s important to note that they are observational in nature and do not establish causality. Further research is needed to fully understand the relationship between full-time employment and stroke risk.

Enhancements and alternate methods

To further validate compliance, we can use services like Amazon Macie, which will scan the CSV files in the S3 bucket and alert us if there is any sensitive data. This helps increase the confidence level of the anonymized data.

In this post, we used Amazon S3 as the input data source for SageMaker Canvas. However, we can also import data into SageMaker Canvas directly from Amazon Redshift, a popular enterprise data warehouse service, and from third-party solutions such as Snowflake, which many customers use to organize their data. This is especially important for customers who already have their data in either Snowflake or Amazon Redshift and are using it for other BI analytics.

By using Step Functions to orchestrate the solution, the solution is more extensible. Instead of a separate trigger to invoke Macie, you can add another step to the end of the pipeline to call Macie to double-check for PHI. If you want to add rules to monitor your data pipeline’s quality over time, you can add a step for AWS Glue Data Quality.

And if you want to add more bespoke integrations, Step Functions lets you scale out to handle as much data or as little data as you need in parallel and only pay for what you use. The parallelization aspect is useful when you are processing hundreds of GB of data, because you don’t want to try to jam all that into one function. Instead, you want to break it up and run it in parallel so you’re not waiting for it to process in a single queue. This is similar to a check-out line at the store—you don’t want to have a single cashier.

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas.

Log out button in SageMaker Canvas

Conclusion

In this post, we showed that predictions for critical health issues like stroke can be done by medical professionals using complex ML models but without the need for coding. This will vastly increase the pool of resources by including people who have specialized domain knowledge but no ML experience. Also, using serverless and managed services allows existing IT people to manage infrastructure challenges like availability, resilience, and scalability with less effort.

You can use this post as a starting point to investigate other complex multi-modal predictions, which are key to guiding the health industry toward better patient care. Coming soon, we will have a GitHub repository to help engineers more quickly launch the kind of ideas we presented in this post.

Experience the power of SageMaker Canvas today, and build your models using a user-friendly graphical interface, with the 2-month Free Tier that SageMaker Canvas offers. You don’t need any coding knowledge to get started, and you can experiment with different options to see how your models perform.

Resources

To learn more about SageMaker Canvas, refer to the following:

To learn more about other use cases that you can solve with SageMaker Canvas, check out the following:

To learn more about Amazon HealthLake, refer to the following:


About the Authors

Yann Stoneman is a Solutions Architect at AWS based out of Boston, MA, and a member of the AI/ML Technical Field Community (TFC). Yann earned his Bachelor’s at The Juilliard School. When he’s not modernizing workloads for global enterprises, Yann plays piano, tinkers in React and Python, and regularly YouTubes about his cloud journey.

Ramesh Dwarakanath is a Principal Solutions Architect at AWS based out of Boston, MA. He works with enterprises in the Northeast area on their cloud journey. His areas of interest are containers and DevOps. In his spare time, Ramesh enjoys tennis and racquetball.

Bakha Nurzhanov is an Interoperability Solutions Architect at AWS, and is a member of the Healthcare and Life Sciences technical field community at AWS. Bakha earned his Master’s in Computer Science from the University of Washington, and in his spare time he enjoys spending time with family, reading, biking, and exploring new places.

Scott Schreckengaust has a degree in biomedical engineering and has been inventing devices alongside scientists on the bench since the beginning of his career. He loves science, technology, and engineering, with decades of experience in startups to large multinational organizations within the healthcare and life sciences domain. Scott is comfortable scripting robotic liquid handlers, programming instruments, integrating homegrown systems into enterprise systems, and developing complete software deployments from scratch in regulatory environments. Besides helping people out, he thrives on building, enjoying the journey of hashing out customers’ scientific workflows and their issues and then converting those into viable solutions.

Build a GNN-based real-time fraud detection solution using the Deep Graph Library without using external graph storage

Fraud detection is an important problem that has applications in financial services, social media, ecommerce, gaming, and other industries. This post presents an implementation of a fraud detection solution using the Relational Graph Convolutional Network (RGCN) model to predict the probability that a transaction is fraudulent through both the transductive and inductive inference modes. You can deploy our implementation to an Amazon SageMaker endpoint as a real-time fraud detection solution, without requiring external graph storage or orchestration, thereby significantly reducing the deployment cost of the model.

Businesses looking for a fully-managed AWS AI service for fraud detection can also use Amazon Fraud Detector, which you can use to identify suspicious online payments, detect new account fraud, prevent trial and loyalty program abuse, or improve account takeover detection.

Solution overview

The following diagram describes an exemplar financial transaction network that includes different types of information. Each transaction contains information like device identifiers, Wi-Fi IDs, IP addresses, physical locations, telephone numbers, and more. We represent the transaction datasets through a heterogeneous graph that contains different types of nodes and edges. Then, the fraud detection problem is handled as a node classification task on this heterogeneous graph.

RGCN graph construction diagram

Graph neural networks (GNNs) have shown great promise in tackling fraud detection problems, outperforming popular supervised learning methods like gradient-boosted decision trees or fully connected feed-forward networks on benchmarking datasets. In a typical fraud detection setup, during the training phase, a GNN model is trained on a set of labeled transactions. Each training transaction is provided with a binary label denoting if it is fraudulent. This trained model can then be used to detect fraudulent transactions among a set of unlabeled transactions during the inference phase. Two different modes of inference exist: transductive inference vs. inductive inference (which we discuss more later in this post).

GNN-based models, like RGCN, can take advantage of topological information, combining both graph structure and features of nodes and edges to learn a meaningful representation that distinguishes malicious transactions from legitimate transactions. RGCN can effectively learn to represent different types of nodes and edges (relations) via heterogeneous graph embedding. In the preceding diagram, each transaction is being modeled as a target node, and several entities associated with each transaction get modeled as non-target node types, like ProductCD and P_emaildomain. Target nodes have numerical and categorical features assigned, whereas other node types are featureless. The RGCN model learns an embedding for each non-target node type. For the embedding of a target node, a convolutional operation is used to compute its embedding using its features and neighborhood embeddings. In the rest of the post, we use the terms GNN and RGCN interchangeably.

It’s worth noting that alternative strategies, such as treating the non-target entities as features and one-hot-encoding them, would often be infeasible because of the large cardinalities of these entities. Conversely, encoding them as graph entities enables the GNN model to take advantage of the implicit topology in the entity relationships. For example, transactions that share a phone number with known fraudulent transactions are more likely to be fraudulent too.

The graph representation employed by GNNs creates some complexity in their implementation. This is especially true for applications such as fraud detection, in which the graph representation may get augmented during inference with newly added nodes that correspond to entities not known during model training. This inference scenario is usually referred to as inductive mode. In contrast, transductive mode is a scenario that assumes the graph representation constructed during model training won’t change during inference. GNN models are often evaluated in transductive mode by constructing graph representations from a combined set of training and test examples, while masking test labels during back-propagation. This ensures the graph representation is static, and therefore the GNN model doesn’t require implementation of operations to extend the graph with new nodes during inference. Unfortunately, a static graph representation can’t be assumed when detecting fraudulent transactions in a real-world setting. Therefore, support for inductive inference is required when deploying GNN models for fraud detection to production environments.

In addition, detecting fraudulent transactions in real time is crucial, especially in business cases where there is only one chance of stopping illegal activities. For example, fraudulent users can behave maliciously just once with an account and never use the same account again. Real-time inference on GNN models introduces additional complexity to the implementation. It is often necessary to implement subgraph extraction operations to support real-time inference. The subgraph extraction operation is needed to reduce inference latency when graph representation is large and performing inference on the entire graph becomes prohibitively expensive. An algorithm for real-time inductive inference with an RGCN model runs as follows:

  1. Given a batch of transactions and a trained RGCN model, extend graph representation with entities from the batch.
  2. Assign embedding vectors of new non-target nodes with the mean embedding vector of their respective node type.
  3. Extract a subgraph induced by k-hop out-neighborhood of the target nodes from the batch.
  4. Perform inference on the subgraph and return prediction scores for the batch’s target nodes.
  5. Clean up the graph representation by removing newly added nodes (this step ensures the memory requirement for model inference stays constant).
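
The sketch below shows how Steps 2 and 3 might look with the DGL. It assumes a DGL release that provides dgl.khop_in_subgraph (0.9 or later), and the node-type names, feature key, and embedding layout are simplified stand-ins rather than the released model’s actual code.

import dgl

# Assume `g` is the heterogeneous transaction graph already extended with the
# batch's nodes (Step 1), `embed` maps each non-target node type to a trained
# torch.nn.Embedding, `new_ids` maps node types to the node IDs added for the
# batch, and `target_ids` are the batch's target (transaction) node IDs.
def prepare_batch_subgraph(g, embed, new_ids, target_ids, k=2):
    # Step 2: unseen entity nodes get the mean embedding vector of their node type.
    for ntype, ids in new_ids.items():
        mean_vec = embed[ntype].weight.mean(dim=0)
        g.nodes[ntype].data["h"][ids] = mean_vec

    # Step 3: extract the k-hop neighborhood around the batch's target nodes
    # (depending on edge direction, khop_out_subgraph could be used instead).
    sub_g, inverse_ids = dgl.khop_in_subgraph(g, {"target": target_ids}, k=k)
    return sub_g, inverse_ids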

The key contribution of this post is to present an RGCN model implementing the real-time inductive inference algorithm. You can deploy our RGCN implementation to a SageMaker endpoint as a real-time fraud detection solution. Our solution doesn’t require external graph storage or orchestration, and significantly reduces the deployment cost of the RGCN model for fraud detection tasks. The model also implements transductive inference mode, enabling us to carry out experiments to compare model performance in inductive and transductive modes. The model code and notebooks with experiments can be accessed from the AWS Examples GitHub repo.

This post builds on the post Build a GNN-based real-time fraud detection solution using Amazon SageMaker, Amazon Neptune, and the Deep Graph Library. The previous post built an RGCN-based real-time fraud detection solution using SageMaker, Amazon Neptune, and the Deep Graph Library (DGL). The prior solution used a Neptune database as external graph storage, required AWS Lambda for orchestration of real-time inference, and only included experiments in transductive mode.

The RGCN model introduced in this post implements all operations of the real-time inductive inference algorithm using only the DGL as a dependency, and doesn’t require external graph storage or orchestration for deployment.

We first evaluate the performance of the RGCN model in transductive and inductive modes on a benchmark dataset. As expected, model performance in inductive mode is slightly lower than in transductive mode. We also study the effect of hyperparameter k on model performance. The hyperparameter k controls the number of hops performed to extract a subgraph in Step 3 of the real-time inference algorithm. Higher values of k will produce larger subgraphs and can lead to better inference performance at the expense of higher latency. As such, we also conduct timing experiments to evaluate the feasibility of the RGCN model for a real-time application.

Dataset

We use the IEEE-CIS fraud dataset, the same dataset that was used in the previous post. The dataset contains over 590,000 transaction records that have a binary fraud label (the isFraud column). The data is split into two tables: transaction and identity. However, not all transaction records have corresponding identity information. We join the two tables on the TransactionID column, which leaves us with a total of 144,233 transaction records. We sort the table by transaction timestamp (the TransactionDT column) and create an 80/20 percentage split by time, producing 115,386 and 28,847 transactions for training and testing, respectively.
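
A minimal pandas sketch of this preparation, with the file paths as placeholders and only the columns named above:

import pandas as pd

# Join the two IEEE-CIS tables on TransactionID (file paths are placeholders).
transaction = pd.read_csv("train_transaction.csv")
identity = pd.read_csv("train_identity.csv")
df = transaction.merge(identity, on="TransactionID", how="inner")

# Sort by transaction timestamp and split 80/20 by time.
df = df.sort_values("TransactionDT")
split = int(len(df) * 0.8)
df_train, df_test = df.iloc[:split], df.iloc[split:]

df_train.to_parquet("./data/train.parquet")
df_test.to_parquet("./data/test.parquet")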

For more details on the dataset and how to format it to suit the input requirement of the DGL, refer to Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library.

Graph construction

We use the TransactionID column to generate target nodes. We use the following columns to generate 11 types of non-target nodes:

  • card1 through card6
  • ProductCD
  • addr1 and addr2
  • P_emaildomain and R_emaildomain

We use 38 columns as categorical features of target nodes:

  • M1 through M9
  • DeviceType and DeviceInfo
  • id_12 through id_38

We use 382 columns as numerical features of target nodes:

  • TransactionAmt
  • dist1 and dist2
  • id_01 through id_11
  • C1 through C14
  • D1 through D15
  • V1 through V339

Our graph constructed from the training transactions contains 217,935 nodes and 2,653,878 edges.
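
To make the construction concrete, here is a toy sketch of assembling a heterogeneous graph with target and entity node types in the DGL; the edge lists are made up, and the released model builds its graph internally from the raw columns.

import dgl
import torch

# Toy edges: transaction (target) nodes 0..2 connected to two entity node types,
# with reverse edges so messages can flow back to the target nodes.
graph_data = {
    ("target", "uses_card1", "card1"): (torch.tensor([0, 1, 2]), torch.tensor([0, 0, 1])),
    ("card1", "rev_uses_card1", "target"): (torch.tensor([0, 0, 1]), torch.tensor([0, 1, 2])),
    ("target", "has_product", "ProductCD"): (torch.tensor([0, 1, 2]), torch.tensor([0, 1, 0])),
    ("ProductCD", "rev_has_product", "target"): (torch.tensor([0, 1, 0]), torch.tensor([0, 1, 2])),
}

g = dgl.heterograph(graph_data)
print(g.ntypes, g.canonical_etypes)
print(g.num_nodes("target"), g.num_edges())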

Hyperparameters

Other parameters are set to match the parameters reported in the previous post. The following snippet illustrates training the RGCN model in transductive and inductive modes:

import pandas as pd
from fgnn.fraud_detector import FraudRGCN

# override default hyperparameters defined in the FraudRGCN constructor
params = {
    "embedding_size": 64,
    "n_layers": 2,
    "n_epochs": 150,
    "n_hidden": 16,
    "dropout": 0.2,
    "weight_decay": 5e-05,
    "lr": 0.01
}

# load train and test splits
df_train = pd.read_parquet('./data/train.parquet')
df_test = pd.read_parquet('./data/test.parquet')

# train RGCN model in inductive mode
fd_ind = FraudRGCN()
fd_ind.train_fg(df_train, params=params)

# train RGCN model in transductive mode
fd_trs = FraudRGCN()
# create boolean array to identify test examples
test_mask = [False]*len(df_train) + [True]*len(df_test)
# concatenate train and test examples
df_combined = pd.concat([df_train, df_test], ignore_index=True)
# test_mask must be passed in transductive mode,
# so test labels are masked out during back-propagation
fd_trs.train_fg(df_combined, params=params, test_mask=test_mask)

# predict with both models, extracting subgraphs with k=2 hops
fraud_proba_ind = fd_ind.predict(df_test, k=2)
fraud_proba_trs = fd_trs.predict(df_test, k=2)

Inductive vs. transductive mode

We perform five trials for inductive and five trials for transductive mode. For each trial, we train an RGCN model and save it to disk, obtaining 10 models. We evaluate each model on test examples while increasing the number of hops (parameter k) used to extract a subgraph for inference, setting k to 1, 2, and 3. We predict on all test examples at once, and compute the ROC AUC score for each trial. The following plot shows the mean and 95% confidence intervals of AUC scores.

Inductive vs Transductive model performance
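
For reference, each trial’s score can be computed with scikit-learn from the probabilities returned in the earlier training snippet; isFraud is the label column named in the dataset section.

from sklearn.metrics import roc_auc_score

# fraud_proba_ind and fraud_proba_trs come from fd_ind.predict() and fd_trs.predict()
# in the earlier snippet; df_test["isFraud"] holds the binary labels.
auc_inductive = roc_auc_score(df_test["isFraud"], fraud_proba_ind)
auc_transductive = roc_auc_score(df_test["isFraud"], fraud_proba_trs)
print(auc_inductive, auc_transductive)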

We can see that performance in transductive mode is slightly higher than in inductive mode. For k=2, mean AUC scores for inductive and transductive modes are 0.876 and 0.883, respectively. This is expected because the RGCN model is able to learn embeddings of all entity nodes in transductive mode, including those in the test set. In contrast, inductive mode only allows the model to learn embeddings of entity nodes that are present in the training examples, and therefore some nodes have to be mean-filled during inference. At the same time, the drop in performance between transductive and inductive modes is not significant, and even in inductive mode, the RGCN model achieves good performance with an AUC of 0.876. We also observe that model performance doesn’t improve for values of k>2. This implies that setting k=2 would extract a sufficiently large subgraph during inference, resulting in optimal performance. This observation is also confirmed by our next experiment.

It’s also worth noting that, for transductive mode, our model’s AUC of 0.883 is higher than the corresponding AUC of 0.870 reported in the previous post. We use more columns as numerical and categorical features of target nodes, which can explain the higher AUC score. We also note that the experiments in the previous post only performed a single trial.

Inference on a small batch

For this experiment, we evaluate the RGCN model in a small batch inference setting. We use five models that were trained in inductive mode in the previous experiment. We compare performance of these models when predicting in two settings: full and small batch inference. For full batch inference, we predict on the entire test set, as was done in the previous experiment. For small batch inference, we predict in small batches by partitioning the test set into 28 batches of equal size with approximately 1,000 transactions in each batch. We compute AUC scores for both settings using different values of k. The following plot shows the mean and 95% confidence intervals for full and small batch inference settings.

Inductive model performance for full-batch vs small-batch
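
A sketch of the small batch setting, reusing the inductive model (fd_ind) from the earlier snippet; numpy’s array_split produces the 28 roughly equal batches.

import time
import numpy as np

# Split the test set into 28 roughly equal batches (~1,000 transactions each).
batches = np.array_split(df_test, 28)

latencies, scores = [], []
for batch in batches:
    start = time.time()
    scores.append(fd_ind.predict(batch, k=2))  # model from the earlier snippet
    latencies.append(time.time() - start)

print(f"mean latency per batch: {np.mean(latencies):.2f}s")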

We observe that performance for small batch inference when k=1 is lower than for full batch. However, small batch inference performance matches full batch when k>1. This can be attributed to much smaller subgraphs being extracted for small batches. We confirm this by comparing subgraph sizes with the size of the entire graph constructed from the training transactions. We compare graph sizes in terms of number of nodes. For k=1, the average subgraph size for small batch inference is less than 2% of the training graph. And for full batch inference when k=1, the subgraph size is 22%. When k=2, subgraph sizes for small and full batch inference are 54% and 64%, respectively. Finally, subgraph sizes for both inference settings reach 100% for k=3. In other words, when k>1, the subgraph for a small batch becomes sufficiently large, enabling small batch inference to reach the same performance as full batch inference.

We also record prediction latency for every batch. We perform our experiments on an ml.r5.12xlarge instance, but you can use a smaller instance with 64 GB of memory to run the same experiments. The following plot shows the mean and 95% confidence intervals of small batch prediction latencies for different values of k.

Timing results for inductive small-batch

The latency includes all five steps of the real-time inductive inference algorithm. We see that when k=2, predicting on 1,030 transactions takes 5.4 seconds on average, resulting in a throughput of 190 transactions per second. This confirms that the RGCN model implementation is suitable for real-time fraud detection. We also note that the previous post did not provide hard latency values for their implementation.

Conclusion

The RGCN model released with this post implements the algorithm for real-time inductive inference, and doesn’t require external graph storage or orchestration. The parameter k in Step 3 of the algorithm specifies the number of hops performed to extract the subgraph for inference, and results in a trade-off between model accuracy and prediction latency. We used the IEEE-CIS fraud dataset in our experiments, and empirically validated that the optimal value of parameter k for this dataset is 2, achieving an AUC score of 0.876 and prediction latency of less than 6 seconds per 1,000 transactions.

This post provided a step-by-step process for training and evaluating an RGCN model for real-time fraud detection. The included model class implements methods for the entire model lifecycle, including serialization and deserialization methods. This enables the model to be used for real-time fraud detection. You can train the model as a PyTorch SageMaker estimator and then deploy it to a SageMaker endpoint by using the following notebook as a template. The endpoint is able to predict fraud on small batches of raw transactions in real time. You can also use Amazon SageMaker Inference Recommender to select the best instance type and configuration for the inference endpoint based on your workloads.
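
A hedged sketch of that deployment path with the SageMaker Python SDK is shown below; the entry point script, source directory, role, instance types, and framework versions are placeholders to adapt from the linked notebook.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",              # placeholder training script
    source_dir="./src",                  # placeholder source directory
    role="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # placeholder role
    instance_type="ml.r5.4xlarge",
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
    hyperparameters={"n_epochs": 150, "embedding_size": 64},
)

estimator.fit({"train": "s3://example-bucket/fraud/train/"})  # placeholder S3 path

# Deploy to a real-time endpoint that scores small batches of raw transactions.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.r5.4xlarge")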

For more information about this topic and implementation, we encourage you to explore and test our scripts on your own. You can access the notebooks and related model class code from the AWS Examples GitHub repo.


About the Authors

Dmitriy BespalovDmitriy Bespalov is a Senior Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

Ryan BrandRyan Brand is an Applied Scientist at the Amazon Machine Learning Solutions Lab. He has specific experience in applying machine learning to problems in healthcare and life sciences. In his free time, he enjoys reading history and science fiction.

Yanjun Qi is a Senior Applied Science Manager at the Amazon Machine Learning Solutions Lab. She innovates and applies machine learning to help AWS customers speed up their AI and cloud adoption.

Read More

Generative AI at GTC: Dozens of Sessions to Feature Luminaries Speaking on Tech’s Hottest Topic

Generative AI at GTC: Dozens of Sessions to Feature Luminaries Speaking on Tech’s Hottest Topic

As the meteoric rise of ChatGPT demonstrates, generative AI can unlock enormous potential for companies, teams and individuals. 

Whether simplifying time-consuming tasks or accelerating 3D workflows to boost creativity and productivity, generative AI is already making an impact across industries — and there’s much more to come.

How generative AI is paving the way for the future will be a key topic at NVIDIA GTC, a free, global conference for the era of AI and the metaverse, taking place online March 20-23. 

Dozens of sessions will dive into topics around generative AI — from conversational text to the creation of virtual worlds from images. Here’s a sampling: 

Many more sessions on generative AI are available to explore at GTC, and registration is free. Join to discover the latest AI technology innovations and breakthroughs.

Featured image courtesy of Refik Anadol.

Read More

Fusion Reaction: How AI, HPC Are Energizing Science

Fusion Reaction: How AI, HPC Are Energizing Science

Brian Spears says his children will enjoy a more sustainable planet, thanks in part to AI and high performance computing (HPC) simulations.

“I believe I’ll see fusion energy in my lifetime, and I’m confident my daughters will see a fusion-powered world,” said the 45-year-old principal investigator at Lawrence Livermore National Laboratory who helped demonstrate the physics of the clean and abundant power source, making headlines worldwide.

Results from the experiment hit Spears’ inbox at 5:30 a.m. on Dec. 5 last year.

“I had to rub my eyes to make sure I wasn’t misreading the numbers,” he recalled.

A Nuclear Family  

Once he assured himself, he scurried downstairs to share the news with his wife, a chemical engineer at the lab who’s pioneering ways to 3D print glass, and also once worked on the fusion program.

LLNL principal investigator Brian Spears

“One of my friends described us as a Star Trek household — I work on the warp core and she works on the replicator,” he quipped.

In a tweet storm after the lab formally announced the news, Spears shared his excitement with the world.

“Exhausted by an amazing day … Daughters sending me screenshots with breaking news about Mom and Dad’s work … Being a part of something amazing for humanity.”

In another tweet, he shared the technical details.

“Used two million joules of laser energy to crush a capsule 100x smoother than a mirror. It imploded to half the thickness of a hair. For 100 trillionths of a second, we produced ten petawatts of power. It was the brightest thing in the solar system.”

AI Helps Call the Shots

A week before the experiment, Spears’ team analyzed its precision HPC design, then predicted the result with AI. Two atoms would fuse into one, releasing energy in a process simply called ignition.

It was the most exciting of thousands of AI predictions in what’s become the two-step dance of modern science. Teams design experiments in HPC simulations, then use data from the actual results to train AI models that refine the next simulation.

AI uncovers details about the experiments hard for humans to see. For example, it tracked the impact of minute imperfections in the imploding capsule researchers blasted with 192 lasers to achieve fusion.

A look inside the fusion experiment. Graphic courtesy of Lawrence Livermore National Laboratory.

“You need AI to understand the complete picture,” Spears said.

It’s a big canvas, filled with math describing the complex details of atomic physics.

A single experiment can require hundreds of thousands of relatively small simulations. Each takes a half day on a single node of a supercomputer.

The largest 3D simulations — called the kitchen sinks — consume about half of Sierra, the world’s sixth fastest HPC system, packing 17,280 NVIDIA GPUs.

Edge AI Guides Experiments

AI also helps scientists create self-driving experiments. Neural networks can make split-second decisions about which way to take an experiment based on results they process in real time.

For example, Spears, his colleagues and NVIDIA collaborated on an AI-guided experiment last year that fired lasers up to three times a second. It created the kind of proton beams that could someday treat a cancer patient.

“In the course of a day, you can get the kind of bright beam that may have taken you months or years of human-designed experiments,” Spears said. “This approach of AI at the edge will save orders of magnitude of time for our subject-matter experts.”

Directing lasers fired many times a second will also be a key job inside tomorrow’s nuclear fusion reactors.

Navigating the Data Deluge

AI’s impacts will be felt broadly across both scientific and industrial fields, Spears believes.

“Over the last decade we’ve produced more simulation and experimental data than we’re trained to deal with,” he said.

That deluge, once a burden for scientists, is now fuel for machine learning.

“AI is putting scientists back in the driver seat so we can move much more quickly,” he said.

Spears explained the ignition result in an interview (starting 8:19) with Government Matters.

Spears also directs an AI initiative at the lab that depends on collaborations with companies including NVIDIA.

“NVIDIA helps us look over the horizon, so we can take the next step in using AI for science,” he said.

A Brighter Future

It’s hard work with huge impacts, like leaving a more resilient planet for the next generation.

Asked whether his two daughters plan a career in science, Spears beams. They’re both competitive swimmers who play jazz trumpet with interests in everything from bioengineering to art.

“As we say in science, they’re four pi, they cover the whole sky,” he said.

Read More

Flawless Fractal Food Featured This Week ‘In the NVIDIA Studio’

Flawless Fractal Food Featured This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows.

ManvsMachine steps In the NVIDIA Studio this week to share insights behind fractal art — digital images and animations that use algorithms to artistically render calculations derived from geometric objects.

Ethos Reflected

Founded in London in 2007, ManvsMachine is a multidimensional creative company specializing in design, film and visual arts.

 

ManvsMachine works closely with the world’s leading brands and agencies, including Volvo, Adidas, Nike and more, to produce award-winning creative content.

 

The team at ManvsMachine finds inspiration from a host of places: nature and wildlife, conversations, films, documentaries, as well as new and historic artists of all mediums.

Fractal Food

For fans of romanesco broccoli, the edible flower bud resembling cauliflower in texture and broccoli in taste might conjure mild, nutty, sweet notes that lend themselves well to savory pairings. For ManvsMachine, it presented an artistic opportunity.

Romanesco broccoli is the inspiration behind ‘Roving Romanesco.’

The Roving Romanesco animation started out as a series of explorations based on romanesco broccoli, a prime example of a fractal found in nature.

ManvsMachine’s goal was to find an efficient way of recreating it in 3D and generate complex geometry using a simple setup.

The genesis of the animation revolved around creating a phyllotaxis pattern, an arrangement of leaves on a plant stem, using the high-performance expression language VEX in SideFX’s Houdini software.

Points offset at 137.5 degrees, known as the golden angle.

This was achieved by creating numerous points and offsetting each from the previous one by 137.5 degrees, known as the golden or “perfect circular” angle, while moving outward from the center. Houdini’s built-in RTX-accelerated Karma XPU renderer, powered by the team’s GeForce RTX 3090 GPUs, enabled fast rendering of the simulation models.
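The original setup was written in VEX inside Houdini; the Python sketch below illustrates the same construction (Vogel's phyllotaxis spiral), offsetting each point by the golden angle while moving outward from the center. The point count and radial spacing are arbitrary.

```python
import math

GOLDEN_ANGLE = math.radians(137.5)  # the "golden angle" used in the Houdini setup

def phyllotaxis_points(count, spacing=1.0):
    """Generate (x, y) positions arranged in a phyllotaxis spiral."""
    points = []
    for n in range(count):
        angle = n * GOLDEN_ANGLE          # offset each point by the golden angle
        radius = spacing * math.sqrt(n)   # move outward from the center
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points

pts = phyllotaxis_points(500)
print(pts[:3])
```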

Individual florets begin to form.

The team added simple height and width to the shapes using ramp controls, then copied geometry onto those points inside a loop.

Romanesco broccoli starts to come together.

With the basic structure intact, ManvsMachine sculpted florets individually to create a stunning 3D model in the shape of romanesco broccoli. The RTX-accelerated Karma XPU renderer dramatically sped up animations of the shape, as well.

“Creativity is enhanced by faster ray-traced rendering, smoother 3D viewports, quicker simulations and AI-enhanced image denoising upscaling — all accelerated by NVIDIA RTX GPUs.” — ManvsMachine

The project was then imported to Foundry’s Nuke software for compositing and final touch-ups. When pursuing a softer look, ManvsMachine counteracted the complexity of the animation with “easy-on-the-eyes” materials and color choices and a realistic depth of field.

Many advanced nodes in Nuke are GPU accelerated, which gave the team another speed advantage.

Projects like Roving Romanesco represent the high-quality work ManvsMachine strives to deliver for clients.

“Our ethos is reflected in our name,” said ManvsMachine. “Equal importance is placed on ideas and execution. Rather than sell an idea and then work out how to make it later, the preference is to present clients with the full picture, often leading with technique to inform the creative.”

Designers, directors, visual effects artists and creative producers — team ManvsMachine.

Check out @man.vs.machine on Instagram for more inspirational work.

Artists looking to hone their Houdini skills can access Studio Shortcuts and Sessions on the NVIDIA Studio YouTube channel. Discover exclusive step-by-step tutorials from industry-leading artists, watch inspiring community showcases and more, powered by NVIDIA Studio hardware and software.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Read More

Pixel Perfect: RTX Video Super Resolution Now Available for GeForce RTX 40 and 30 Series GPUs

Pixel Perfect: RTX Video Super Resolution Now Available for GeForce RTX 40 and 30 Series GPUs

Streaming video on PCs through Google Chrome and Microsoft Edge browsers is getting a GeForce RTX-sized upgrade today with the release of RTX Video Super Resolution (VSR).

Nearly 80% of internet bandwidth today is streaming video. And 90% of that content streams at 1080p or lower, including from popular sources like Twitch.tv, YouTube, Netflix, Disney+ and Hulu.

However, when viewers use displays higher than 1080p — as do many PC users — the browser must scale the video to match the resolution of the display. Most browsers use basic upscaling techniques, which result in final images that are soft or blurry.

With RTX VSR, GeForce RTX 40 and 30 Series GPU users can tap AI to upscale lower-resolution content up to 4K, matching their display resolution. The AI removes blocky compression artifacts and improves the video’s sharpness and clarity.

Just like putting on a pair of prescription glasses can instantly snap the world into focus, RTX Video Super Resolution gives viewers on GeForce RTX 40 and 30 Series PCs a clear picture into the world of streaming video.

RTX VSR is available now as part of the latest GeForce Game Ready Driver, which delivers the best experience for new game launches like Atomic Heart and THE FINALS closed beta.

The Evolution of AI Upscaling

AI upscaling is the process of converting lower-resolution media to a higher resolution by putting low-resolution images through a deep learning model to predict the high-resolution versions. To make these predictions with high accuracy, a neural network model must be trained on countless images at different resolutions.

4K displays can muddy visuals by having to stretch lower-resolution images to fit their screen. Using AI to upscale streamed video makes lower-resolution images fit with unrivaled crispness.

The deployed AI model can then take low-resolution video and produce incredible sharpness and enhanced details that no traditional scaler can recreate. Edges look sharper, hair looks scruffier and landscapes pop with striking clarity.

In 2019, an early version of this technology was released with SHIELD TV. It was a breakthrough that improved streamed content targeted for TVs, mostly ranging from 480p to 1080p, and optimized for a 10-foot viewing experience.

PC viewers are typically seated much closer than TV viewers to their displays, requiring a higher level of processing and refinement for upscaling. With GeForce RTX 40 and 30 Series GPUs, users now have extremely powerful AI processors with Tensor Cores, enabling a new generation of AI upscaling through RTX VSR.

How RTX Video Super Resolution Works

RTX VSR is a breakthrough in AI pixel processing that dramatically improves the quality of streamed video content beyond edge detection and feature sharpening.

Blocky compression artifacts are a persistent issue in streamed video. Whether the fault of the server, the client or the content itself, issues often become amplified with traditional upscaling, leaving a less pleasant visual experience for those watching streamed content.

Click the image to see the differences between bicubic upscaling (left) and RTX Video Super Resolution.

RTX VSR reduces or eliminates artifacts caused by compressing video — such as blockiness, ringing artifacts around edges, washout of high-frequency details and banding on flat areas — while reducing lost textures. It also sharpens edges and details.

The technology uses a deep learning network that performs upscaling and compression artifact reduction in a single pass. The network analyzes the lower-resolution video frame and predicts the residual image at the target resolution. This residual image is then superimposed on top of a traditional upscaled image, correcting artifact errors and sharpening edges to match the output resolution.
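That residual approach can be sketched in a few lines of PyTorch: a bicubic upscaler produces the base image, a small network predicts a residual at the target resolution from the low-resolution frame, and the two are summed. This is only a conceptual illustration, not NVIDIA's actual VSR network, architecture, or weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualUpscaler(nn.Module):
    """Conceptual residual super-resolution: bicubic base + learned residual."""
    def __init__(self, scale=2):
        super().__init__()
        self.scale = scale
        # Tiny stand-in network; a production model would be far larger.
        self.residual_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # low-res features -> residual at target resolution
        )

    def forward(self, low_res):
        # Traditional upscaled image (the base).
        base = F.interpolate(low_res, scale_factor=self.scale,
                             mode="bicubic", align_corners=False)
        # Residual predicted from the low-resolution frame, then superimposed.
        residual = self.residual_net(low_res)
        return base + residual

frame = torch.rand(1, 3, 540, 960)           # a 960x540 frame, values in [0, 1]
upscaled = ResidualUpscaler(scale=2)(frame)  # -> 1920x1080
print(upscaled.shape)
```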

The deep learning network is trained on a wide range of content with various compression levels. It learns the types of compression artifacts present in low-resolution or low-quality videos that are absent from the uncompressed images used as a reference for network training. Extensive visual evaluation is employed to ensure that the generated model is effective on nearly all real-world and gaming content.

Getting Started

RTX VSR requires a GeForce RTX 40 or 30 Series GPU and works with nearly all content streamed in Google Chrome and Microsoft Edge.

The feature also requires updating to the latest GeForce Game Ready Driver, available today, or the next NVIDIA Studio Driver releasing in March. Both Chrome (version 110.0.5481.105 or higher) and Edge (version 110.0.1587.56) have recently been updated to support RTX VSR.

To enable it, launch the NVIDIA Control Panel and open “Adjust video image settings.” Check the super resolution box under “RTX video enhancement” and select a quality from one to four — ranging from the lowest impact on GPU performance to the highest level of upscaling improvement.

Learn more, including other setup configurations, in this NVIDIA Knowledge Base article.

Read More