Adam and EV: NIO Selects NVIDIA for Intelligent, Electric Vehicles

Chinese electric automaker NIO will use NVIDIA DRIVE for advanced automated driving technology in its future fleets, marking the genesis of truly intelligent and personalized NIO vehicles.

During a global reveal event, the EV maker took the wraps off its latest ET7 sedan, which starts shipping in 2022 and features a new NVIDIA-powered supercomputer, called Adam, that uses NVIDIA DRIVE Orin to deploy advanced automated driving technology.

“The cooperation between NIO and NVIDIA will accelerate the development of autonomous driving on smart vehicles,” said NIO CEO William Li. “NIO’s in-house developed autonomous driving algorithms will be running on four industry-leading NVIDIA Orin processors, delivering an unprecedented 1,000+ trillion operations per second in production cars.”

The announcement marks a major step toward the widespread adoption of intelligent, high-performance electric vehicles, improving standards for both the environment and road users.

NIO has been a pioneer in China’s premium smart electric vehicle market. Since 2014, the automaker has been leveraging NVIDIA for its seamless infotainment experience. And now, with NVIDIA DRIVE powering automated driving features in its future vehicles, NIO is set to redefine mobility with continuous improvement and personalization.

“Autonomy and electrification are the key forces transforming the automotive industry,” said Jensen Huang, NVIDIA founder and CEO. “We are delighted to partner with NIO, a leader in the new energy vehicle revolution—leveraging the power of AI to create the software-defined EV fleets of the future.”

An Intelligent Creation

Software-defined and intelligent vehicles require a centralized, high-performance compute architecture to power AI features and continuously receive upgrades over the air.

The new NIO Adam supercomputer is one of the most powerful platforms to run in a vehicle. With four NVIDIA DRIVE Orin processors, Adam achieves more than 1,000 TOPS of performance.

Orin is the world’s highest-performance, most-advanced AV and robotics processor. This supercomputer on a chip is capable of delivering up to 254 TOPS to handle the large number of applications and deep neural networks that run simultaneously in autonomous vehicles and robots, while achieving systematic safety standards such as ISO 26262 ASIL-D.

By using multiple SoCs, Adam integrates the redundancy and diversity necessary for safe autonomous operation. The first two SoCs process the 8 gigabytes of data produced by the vehicle’s sensor set every second. The third Orin serves as a backup to ensure the system can still operate safely in any situation, while the fourth enables local training, improving the vehicle with fleet learning as well as personalizing the driving experience based on individual user preferences.

With high-performance compute at its core, Adam is a major achievement in the creation of automotive intelligence and autonomous driving.

Meet the ET7

NIO took the wraps off its much-anticipated ET7 sedan — the production version of its original EVE concept first shown in 2017.

The flagship vehicle leapfrogs current model capabilities, with more than 600 miles of battery range and advanced autonomous driving. As the first vehicle equipped with Adam, the ET7 can perform point-to-point autonomy, leveraging 33 sensors and high-performance compute to continuously expand the domains in which it operates — from urban to highway driving to battery swap stations.

The intelligent sedan ensures a seamless experience from the moment the driver approaches the car. With a highly accurate digital key and soft-closing doors, users can open the car with a gentle touch. Enhanced driver monitoring and voice recognition enable easy interaction with the vehicle. And sensors on the bottom of the ET7 detect the road surface so the vehicle can automatically adjust the suspension for a smoother ride.

With AI now at the center of the NIO driving experience, the ET7 and upcoming NVIDIA-powered models are heralding the new generation of intelligent transportation.


Read More

Artificial intelligence and machine learning continues at AWS re:Invent

A fresh new year is here, and we wish you all a wonderful 2021. We signed off last year at AWS re:Invent on the artificial intelligence (AI) and machine learning (ML) track with the first-ever machine learning keynote and over 50 AI/ML-focused technical sessions covering industries, use cases, applications, and more. You can access all the content for the AI/ML track on the AWS re:Invent website. But the exciting news is we’re not done yet. We’re kicking off 2021 by bringing you even more AI and ML content through a set of new sessions that you can stream live starting January 12, 2021. Each session will be offered multiple times, so you can find the time that works best for your location and schedule.

And of course, AWS re:Invent is free. Register now if you have not already, and build your schedule from the complete agenda.

Here are a few sample sessions that will stream live starting next week.

Customers using AI/ML solutions from AWS

A day in the life of a machine learning data scientist at JPMorgan Chase (AIM319)

Thursday, January 14 – 8 AM to 8:30 AM PST

Thursday, January 14 – 4 PM to 4:30 PM PST

Friday, January 15 – 12 AM to 12:30 AM PST

Learn how data scientists at JPMorgan Chase use custom ML solutions built on top of Amazon SageMaker to gather intelligent insights, while adhering to secure control policies and regulatory requirements.

Streamlining media content with PBS (AIM318)

Wednesday, January 13 – 3 PM to 3:30 PM PST

Wednesday, January 13 – 11 PM to 11:30 PM PST

Thursday, January 14 – 7 AM to 7:30 AM PST

Enhancing the viewer experience by streamlining operational tasks to review, search, and analyze image and video content is a critical factor for the media and entertainment industry. Learn how PBS uses Amazon Rekognition to build relevant features such as deep content search, brand safety, and automated ad insertion to get more out of their content.

Fraud detection with AWS and Coinbase (AIM320)

Thursday, January 14 – 10:15 AM to 10:45 AM PST

Thursday, January 14 – 6:15 PM to 6:45 PM PST

Friday, January 15 – 2:15 AM to 2:45 AM PST

Among many use cases, ML helps mitigate a universally expensive problem: fraud. Join AWS and Coinbase to learn how to detect fraud faster using sample datasets and architectures, and help save millions of dollars for your organization.

Autonomous vehicle solutions with Lyft (AIM315)

Wednesday, January 13 – 2 PM to 2:30 PM PST

Wednesday, January 13 – 10 PM to 10:30 PM PST

Thursday, January 14 – 6 AM to 6:30 AM PST

In this session, we discuss how computer vision models are labeled and trained at Lyft using Amazon SageMaker Ground Truth for visual perception tasks that are critical for autonomous driving systems.

Modernize your contact center with AWS Contact Center Intelligence (CCI) (AIM214)

Tuesday, January 12 – 1:15 PM to 1:45 PM PST

Tuesday, January 12 – 9:15 PM to 9:45 PM PST

Wednesday, January 13 – 5:15 AM to 5:45 AM PST

Improve the customer experience with reduced costs using AWS Contact Center Intelligence (CCI) solutions. You will hear from SuccessKPI, an AWS partner, on how they use CCI solutions to solve business problems such as improving agent effectiveness and automating quality management in enterprise contact centers.

Machine learning concepts with AWS

Consistent and portable environments with containers (AIM317)

Wednesday, January 13 – 8:45 AM to 9:15 AM PST

Wednesday, January 13 – 4:45 PM to 5:15 PM PST

Thursday, January 14 – 12:45 AM to 1:15 AM PST

Learn how to build consistent and portable ML environments using containers with AWS services such as Amazon SageMaker and Amazon Elastic Kubernetes Service (Amazon EKS) across multiple deployment clusters. This session will help you build these environments with ease and at scale in the midst of the ever-growing list of open-source frameworks and tools.

Achieve real-time inference at scale with Deep Java Library (AIM410)

Thursday, January 14 – 3:30 PM to 4 PM PST

Thursday, January 14 – 11:30 PM to 12 AM PST

Friday, January 15 – 7:30 AM to 8 AM PST

Deep Java Library (DJL) from AWS helps you build ML applications without needing to learn a new language. Learn how to use DJL and deploy models including BERT in the DJL model zoo to achieve real-time inference at scale.

Don’t miss out on all the action. We look forward to seeing you on the artificial intelligence and machine learning track. Please see the re:Invent agenda for more details and to build your schedule.


About the Author

Shyam Srinivasan is on the AWS Machine Learning marketing team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Read More

ML Metadata: Version Control for ML

Posted by Ben Mathes and Neoklis Polyzotis, on behalf of the TFX Team

When you write code, you need version control to keep track of it. What’s the ML equivalent of version control? If you’re building production ML systems, you need to be able to answer questions like these:

  • Which dataset was this model trained on?
  • What hyperparameters were used?
  • Which pipeline was used to create this model?
  • Which version of TensorFlow (and other libraries) was used to create this model?
  • What caused this model to fail?
  • What version of this model was last deployed?

Engineers at Google have learned, through years of hard-won experience, that the history and lineage of ML artifacts is far more complicated than a simple, linear log. You use Git (or similar) to track your code; you need something analogous to track your models, datasets, and more. Git may present a simple interface, but under the hood it maintains a graph of many objects. The complexity of ML code and artifacts such as models and datasets requires a similar approach.

That’s why we built Machine Learning Metadata (MLMD), a library to track the full lineage of your entire ML workflow: data ingestion, data preprocessing, validation, training, evaluation, deployment, and so on. MLMD is a standalone library, and it also comes integrated in TensorFlow Extended. There’s also a demo notebook that shows how you can integrate MLMD into your ML infrastructure today.

Beyond versioning your model, ML Metadata captures the full lineage of the training process, including the dataset, hyperparameters, and software dependencies.
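
To give a flavor of the API, here is a minimal sketch that records a dataset artifact in a local SQLite-backed store, following the pattern in MLMD’s getting-started example. The artifact type, property names, and paths below are illustrative, not prescribed by the library.

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to a local SQLite-backed metadata store.
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = "/tmp/mlmd.db"
connection_config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(connection_config)

# Register an artifact type for datasets (done once).
data_type = metadata_store_pb2.ArtifactType()
data_type.name = "DataSet"
data_type.properties["split"] = metadata_store_pb2.STRING
data_type_id = store.put_artifact_type(data_type)

# Record one dataset artifact.
data_artifact = metadata_store_pb2.Artifact()
data_artifact.type_id = data_type_id
data_artifact.uri = "s3://my-bucket/training-data"  # illustrative path
data_artifact.properties["split"].string_value = "train"
[data_artifact_id] = store.put_artifacts([data_artifact])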

Here’s how MLMD can help you:

  • If you’re an ML engineer: You can use MLMD to trace bad models back to their dataset, or trace from a bad dataset to the models you trained on it, and so on (see the sketch after this list).
  • If you’re working in ML infrastructure: You can use MLMD to record the current state of your pipeline and enable event-based orchestration. You can also enable optimizations like memoizing pipeline steps, skipping a step when its inputs and code are unchanged. You can integrate MLMD into your training system so it automatically creates logs for querying later. We’ve found that this auto-logging of the full lineage as a side effect of training is the best way to use MLMD. Then you have the full history without extra effort.
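
As an illustration of the first use case, the hypothetical helper below walks MLMD’s event graph from a model artifact back to the input artifacts (such as its training dataset) of the execution that produced it. It assumes a `store` created as in the earlier sketch.

from ml_metadata.proto import metadata_store_pb2

def get_upstream_artifacts(store, model_artifact_id):
    # Find the execution(s) that output this model artifact.
    events = store.get_events_by_artifact_ids([model_artifact_id])
    producer_ids = [e.execution_id for e in events
                    if e.type == metadata_store_pb2.Event.OUTPUT]
    # Find every artifact those executions consumed as input.
    upstream_events = store.get_events_by_execution_ids(producer_ids)
    input_ids = [e.artifact_id for e in upstream_events
                 if e.type == metadata_store_pb2.Event.INPUT]
    return store.get_artifacts_by_id(input_ids)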

MLMD is more than a TFX research project. It’s a key foundation to multiple internal MLOps solutions at Google. Furthermore, Google Cloud integrates tools like MLMD into its core MLOps platform:

The foundation of all these new services is our new ML Metadata Management service in AI Platform. This service lets AI teams track all the important artifacts and experiments they run, providing a curated ledger of actions and detailed model lineage. This will enable customers to determine model provenance for any model trained on AI Platform for debugging, audit, or collaboration. AI Platform Pipelines will automatically track artifacts and lineage and AI teams can also use the ML Metadata service directly for custom workloads, artifact and metadata tracking.

Want to know where your models come from? What training data was used? Did anyone else train a model on this dataset already, and was their performance better? Are there any tainted datasets we need to clean up after?

If you want to answer these questions for your users, check out MLMD on GitHub, as a part of TensorFlow Extended, or in our demo notebook.

Read More

Mercedes-Benz Transforms Vehicle Cockpit with NVIDIA-Powered AI

The AI cockpit has reached galactic proportions with the new Mercedes-Benz MBUX Hyperscreen.

During a digital event, the luxury automaker unveiled the newest iteration of its intelligent infotainment system — a single surface extending from the cockpit to the passenger seat displaying all necessary functions at once. Dubbed the MBUX Hyperscreen, the system is powered by NVIDIA technology and shows how AI can create a truly intuitive and personalized experience for both the driver and passengers.

“The MBUX Hyperscreen reinvents how we interact with the car,” said Sajjad Khan, executive vice president at Mercedes-Benz. “It’s the nerve center that connects everyone in the car with the world.”

Like the MBUX system recently unveiled with the new Mercedes-Benz S-Class, this extended-screen system runs on high-performance, energy-efficient NVIDIA GPUs for instantaneous AI processing and sharp graphics.

A vehicle’s cockpit typically requires a collection of electronic control units and switches to perform basic functions, such as powering entertainment or adjusting the temperature. Using NVIDIA technology, Mercedes-Benz consolidated these components into one AI platform — with three separate screens under one glass surface — to simplify the architecture while creating more space to add new features.

“Zero Layer” User Interface

The driving principle behind the MBUX Hyperscreen is that of the “zero layer” — every necessary driving feature is delivered with a single touch.

However, developing the largest screen ever mounted in a series-built Mercedes-Benz was not enough to achieve this groundbreaking capability. The automaker also leveraged AI to promote commonly used features at relevant times while pushing those not needed to the background.

The deep neural networks powering the system process datasets such as vehicle position, cabin temperature and time of day to prioritize certain features — like entertainment or points of interest recommendations — while always keeping navigation at the center of the display.

“The system always knows what you want and need based on emotional intelligence,” Khan explained.

And these features aren’t just for the driver. Front-seat passengers get a dedicated screen for entertainment and ride information that doesn’t interfere with the driver’s display. It also enables the front seat passenger to share content with others in the car.

Experience Intelligence

This revolutionary AI cockpit experience isn’t a mere concept — it’s real technology that will be available in production vehicles this year.

The MBUX Hyperscreen will debut with the all-electric Mercedes-Benz EQS, combining electric mobility and artificial intelligence. With the first-generation MBUX now in 1.8 million cars, the next iteration coming in the redesigned S-Class and now the MBUX Hyperscreen in the EQS, customers will have a range of AI cockpit options.

And with the entire MBUX family powered by NVIDIA, these systems will constantly deliver new, surprising and intelligent features with high performance and a seamless experience.


Read More

In a Quarantine Slump? How One High School Student Used AI to Stay on Track

Canadian high schooler Ana DuCristea has a clever solution for the quarantine slump.

Using AI and natural language processing, she programmed an app capable of setting customizable reminders so you won’t miss any important activities, like baking banana bread or whipping up Dalgona coffee.

The project’s emblematic of how a new generation — with access to powerful technology and training — approaches the once-exotic domain of AI.

A decade ago, deep learning was the stuff of elite research labs with big budgets.

Now it’s the kind of thing a smart, motivated high school student can knock out to solve a tangible problem.

DuCristea’s been interested in coding from childhood, and spends her spare time teaching herself new skills and taking online AI courses. After winning a Jetson Nano Developer Kit this summer at AI4ALL, an AI camp, she set to work remedying one of her pet peeves — the limited functionality of reminder applications.

She’d long envisioned a more useful app that could snooze for more specific lengths of time, and set reminders for specific tasks, dates and times. Using the Nano and her background in Python, DuCristea spent her after-school hours creating an app that does just that.

With the app, users can message a bot on Discord requesting a reminder for a specific task, date and time. DuCristea has shared the app’s code on GitHub, and is planning to continue training it to increase its accuracy and capabilities.

Key Points From This Episode:

  • Her first hands-on experience with the Jetson Nano has only strengthened her intent to pursue software or computer engineering at college, where she’ll continue to learn more about what area of STEM she’d like to focus on.

  • DuCristea’s interest in programming and electronics started at age nine, when her father gifted her a book on Python and she found it so interesting that she worked through it in a week. Since then, she’s taken courses on coding and shares her most recent projects on GitHub.
  • Programming the app took some creativity, as DuCristea didn’t have a large dataset to train on. After trying neural networks and vectorization, she eventually found that template searches worked best for her limited list of examples (illustrated in the sketch below).
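
For illustration only, a template search for this kind of reminder parsing might look like the following minimal sketch. The patterns and messages are hypothetical and are not DuCristea’s actual code.

import re

# A couple of hypothetical reminder templates with named capture groups.
TEMPLATES = [
    # e.g. "remind me to bake banana bread on 2021-01-20 at 15:30"
    re.compile(r"remind me to (?P<task>.+) on (?P<date>\d{4}-\d{2}-\d{2}) at (?P<time>\d{1,2}:\d{2})", re.IGNORECASE),
    # e.g. "remind me to check the oven in 45 minutes"
    re.compile(r"remind me to (?P<task>.+) in (?P<minutes>\d+) minutes?", re.IGNORECASE),
]

def parse_reminder(message):
    # Try each template in turn; return the captured fields from the first match.
    for template in TEMPLATES:
        match = template.search(message)
        if match:
            return match.groupdict()
    return None  # no template matched

print(parse_reminder("Remind me to bake banana bread on 2021-01-20 at 15:30"))
# {'task': 'bake banana bread', 'date': '2021-01-20', 'time': '15:30'}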

Tweetables:

“There’s so many programs, even exclusively for girls now in STEM — I would say go for them.” — Ana DuCristea [14:55]

“The Jetson Nano is a lot more accessible than most things in AI right now.” — Ana DuCristea [18:51]

You Might Also Like:

AI4Good: Canadian Lab Empowers Women in Computer Science

Doina Precup, associate professor at McGill University and research team lead at AI startup DeepMind, speaks about her personal experiences, along with the AI4Good Lab she co-founded to give women more access to machine learning training.

Jetson Interns Assemble! Interns Discuss Amazing AI Robots They’re Building

NVIDIA’s Jetson interns, recruited at top robotics competitions, discuss what they’re building with NVIDIA Jetson, including a delivery robot, a trash-disposing robot and a remote control car to aid in rescue missions.

A Man, a GAN and a 1080 Ti: How Jason Antic Created ‘De-Oldify’

Jason Antic explains how he created his popular app, De-Oldify, with just an NVIDIA GeForce 1080 Ti and a generative adversarial network. The tool colors old black-and-white shots for a more modern look.

Tune in to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn. If your favorite isn’t listed here, drop us a note.


Make the AI Podcast Better

Have a few minutes to spare? Fill out this listener survey. Your answers will help us make a better podcast.


Read More

Q&A with Ayesha Ali, two-time award winner of Facebook request for research proposals in misinformation

Facebook is a place where bright minds in computer science come to work on some of the world’s most complex and challenging research problems. In addition to recruiting top talent, we maintain close ties with academia and the research community to collaborate on difficult challenges and find solutions together. In this new monthly interview series, we turn the spotlight on members of the academic community and the important research they do — as partners, collaborators, consultants, or independent contributors.

This month, we reached out to Ayesha Ali, professor at Lahore University of Management Sciences (LUMS) in Pakistan. Ali is a two-time winner of the Facebook Foundational Integrity Research request for proposals (RFP) in misinformation and polarization (2019 and 2020). In this Q&A, Ali shares the results of her research, its impact, and advice for university faculty looking to follow a similar path.

Q: Tell us about your role at LUMS and the type of research you and your department specialize in.

Ayesha Ali: I joined the Department of Economics at LUMS in 2016 as an assistant professor, after completing my PhD in economics at the University of Toronto. I am trained as an applied development economist, and my research focuses on understanding and addressing policy challenges facing developing countries, such as increasing human development, managing energy and environment, and leveraging technology for societal benefit. Among the themes that I am working on is how individuals with low levels of digital literacy perceive and react to content on social media, and how that affects their beliefs and behavior.

Q: How did you decide to pursue research projects in misinformation?

AA: Before writing the first proposal back in 2018, I had been thinking about the phenomenon of misinformation and fabricated content for quite some time. On multiple occasions, I had the opportunity to interact with colleagues in the computer science department on this issue, and we had some great discussions about it.

We quickly realized that we cannot combat misinformation with technology alone. It is a multifaceted issue. To address this problem, we need the following: user education, technology for filtering false news, and context-specific policies for deterring false news generation and dissemination. We were particularly interested in thinking about the different ways we could educate people who have low levels of digital literacy to recognize misinformation.

Q: What were the results of your first research project, and what are your plans for the second one?

AA: In our first project, we studied the effect of two types of user education programs in helping people recognize false news using a randomized field experiment. Using a list of actual news stories circulated on social media, we created a test to measure the extent to which people are likely to believe misinformation. Contrary to their perceived effectiveness, we found no significant effect of video-based general educational messages about misinformation.

However, when video-based educational messages were augmented with personalized feedback based on individuals’ past engagement with false news, there was a significant improvement in their ability to recognize false news. Our results show that, when appropriately designed, educational programs can be effective in making people more discerning consumers of information on social media.

Our second project aims to build on this research agenda. We plan to focus on nontextual misinformation, such as audio deepfakes. Audio messages are a popular form of communication among people with low levels of literacy and digital literacy. Using surveys and experiments, we will examine how people perceive, consume, and engage with information received via audio deepfakes, and what is the role of prior beliefs and analytical ability in forming perceptions about the accuracy of such information. We also plan to design and experimentally evaluate an educational intervention to increase people’s ability to identify audio deepfakes.

Q: What is the impact of your research in your region and globally?

AA: I think there are at least three ways in which our work is having an impact:

  1. Our work raises awareness about the importance of digital literacy campaigns in combating misinformation. It shows that such interventions hold promise in making users more discerning consumers of information if they are tailored to the target population (e.g., low literacy populations).
  2. Our work can affect policy about media literacy campaigns and how to structure them, especially for low digital literacy populations. We are already in touch with various organizations in Pakistan to see how our findings can be put to use in various digital literacy campaigns. For example, the COVID-19 vaccine is likely to be made available in the coming months, and there is a need to raise awareness about its importance and to proactively dispel any conspiracy theories and misinformation about it. Past experiences with polio vaccination campaigns have shown that conspiracy theories can take strong root and even endanger human lives.
  3. We hope that our work will motivate others to take on such global societal challenges, especially in developing countries.

Q: What advice would you give to academics looking to get their research funded?

AA: I think there are three ingredients in a good research proposal:

  1. It tackles an important problem that ideally has contextual/local relevance.
  2. It demonstrates a well-motivated solution or a plan that has contextual/local relevance.
  3. It shows or at least makes the case for why you are uniquely placed to solve it well.

Q: Where can people learn more about your research?

AA: They can learn about my research on my webpage.


Read More

Accelerating MLOps at Bayer Crop Science with Kubeflow Pipelines and Amazon SageMaker

This is a guest post by the data science team at Bayer Crop Science. 

Farmers have always collected and evaluated a large amount of data with each growing season: seeds planted, crop protection inputs applied, crops harvested, and much more. The rise of data science and digital technologies provides farmers with a wealth of new information. At Bayer Crop Science, we use AI and machine learning (ML) to help farmers achieve more bountiful and sustainable harvests. We also use data science to accelerate our research and development process; create efficiencies in production, operations, and supply chain; and improve customer experience.

To evaluate potential products, like a short-stature line of corn or an advanced herbicide, Bayer scientists often plant a small trial in a greenhouse or field. We then use advanced sensors and analytical models to evaluate the experimental results. For example, we might fly an unmanned aerial vehicle over a field and use computer vision models to count the number of plants or measure their height. In this way, we’ve collected data from millions of test plots around the world and used them to train models that can determine the size and position of every plant in our image library.

Analytical models like these are powerful but require effort and skill to design and train effectively. science@scale, the ML engineering team at Bayer Crop Science, has made these techniques more accessible by integrating Amazon SageMaker with open-source tools like KubeFlow Pipelines to create reproducible templates for analytical model training, hosting, and access. These resources help standardize how our data scientists interact with SageMaker services. They also make it easier to meet Bayer-specific requirements, such as using multiple AWS accounts and resource tags.

Standardizing the ML workflow for Bayer Crop Science

Data science teams at Bayer Crop Science follow a common pattern to develop and deploy ML models:

  1. A data scientist develops model and training code in a SageMaker notebook or other coding environment running in a project-specific AWS account.
  2. A data scientist trains the model on data stored in Amazon Simple Storage Service (Amazon S3).
  3. A data scientist partners with an ML engineer to deploy the trained model as an inference service.
  4. An ML engineer creates the API proxies required for applications outside of the project-specific account to call the inference service.
  5. ML and other engineers perform additional steps to meet Bayer-specific infrastructure and security requirements.

To automate this process, our team transformed the steps into a reusable, parameterized workflow using KubeFlow Pipelines (KFP). Each step of a workflow (a KFP component) is associated with a Docker container and connected via the KFP framework. Hosting Bayer’s model training and deployment process on Kubeflow was made possible by the Amazon SageMaker Components for KubeFlow Pipelines, pre-built modules that simplify running SageMaker operations from within KFP. We combined these with custom components to automate the Bayer-specific engineering steps, particularly those relating to cybersecurity. The resulting pipeline lets data scientists trigger model training and deployment with only a few lines of code and ensures that model artifacts are generated and maintained consistently. This gives data scientists more time to focus on improving the models themselves.

 

AWS account setup

Bayer Crop Science organizes its cloud resources into a large number of application-, team-, and project-specific accounts. For this reason, many ML projects require resources in at least three AWS accounts:

  • ML support account – Contains the shared infrastructure necessary to perform Bayer-specific proxy generation and other activities across multiple projects
  • KubeFlow account – Contains an Amazon Elastic Kubernetes Service (Amazon EKS) cluster hosting our KubeFlow deployment
  • Scientist account – At least one project-specific account in which data scientists store most of the required data and perform model development and training

The following diagram illustrates this architecture.

 

ML support AWS account

One centralized account contains the infrastructure required to perform Bayer-specific post-processing steps across multiple ML projects. Most notably, this includes a KubeFlow Master Pipeline Execution AWS Identity and Access Management (IAM) role. This role has trust relationships with all the pipeline execution roles in the scientist account, which it can assume when running the pipeline. It’s separate from the Pipeline Runner IAM role in the KubeFlow AWS account so these relationships can be managed independently of other entities within the KubeFlow cluster. The following code shows the trust relationship:

Trust Relationship (one element for each scientist account):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::[kubeflow-account-number]:role/[kubeflow-pipeline-exeution-role-name]"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

KubeFlow AWS account

Bayer Crop Science uses a standard installation of KubeFlow hosted on Amazon EKS in a centralized AWS account. At the time of this writing, all KubeFlow pipelines run within the same namespace on a KubeFlow cluster and all components assume a custom IAM role when they run. The components can inherit the role from the worker instance, applied via OIDC integration (preferred) or obtained using open-source methods such as kube2iam.

Scientist AWS account

To enable access by model training and hosting resources, all scientist accounts must contain several IAM roles with standard permission sets. These are typically provisioned on request by an ML engineer using Terraform. These roles include:

  • Model Execution – Supports SageMaker inference endpoints
  • Training Execution – Supports SageMaker training jobs
  • KubeFlow Pipeline Execution – Supports creating, updating, or deleting resources using the Amazon SageMaker Components for KubeFlow Pipelines

These IAM roles are given policies that are appropriate for their associated tasks, which can vary depending on organizational needs. An S3 bucket is also created to store trained model artifacts and any data required by the model during inference or training.
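
For illustration, the KubeFlow Pipeline Execution role in a scientist account carries a trust policy along these lines, allowing the master role in the ML support account to assume it. The account number and role name below are placeholders, not Bayer’s actual values.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::[ml-support-account-number]:role/[master-pipeline-execution-role-name]"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}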

KubeFlow pipeline setup

Our ML pipeline (see the following diagram) uses Amazon SageMaker Components for KubeFlow Pipelines to standardize the integration with SageMaker training and deployment services.

 

The ML pipeline exposes the following parameters:

  • model_name: Name of the model to train and deploy. Influences the job, endpoint, endpoint config, and model names.
  • model_docker_image: If present, the pipeline attempts to deploy a model using this base Docker image.
  • model_artifact_s3_path: If a model artifact already exists and doesn’t need to be trained, its S3 path can be specified.
  • environment: JSON object containing environment variables injected into the model endpoint.
  • training_algorithm_name: If training without a Docker image, one of the preconfigured AWS training algorithms can be specified.
  • training_docker_image: If training with a base Docker image, it can be specified here.
  • training_hyperparameters: JSON object containing hyperparameters for the training job.
  • training_instance_count: Number of training instances, for use in distributed training scenarios.
  • training_instance_type: ML instance type used to host the training process.
  • endpoint_instance_type: ML instance type used to host the endpoint.
  • training_channels: JSON array of data channels injected into the training job.
  • training_s3_output_path: Base S3 path where model artifacts are written by the training job.
  • account_id: Account number of the data scientist account; used in role assumption logic.

See the following pipeline code:

@dsl.pipeline(name='Kubeflow Sagemaker Component Deployment Pipeline')
def pipeline(model_name = "",
             account_id = "",
             model_docker_image = "",
             model_artifact_s3_path = "",
             environment = '{}',
             training_algorithm_name = '',
             training_docker_image = "",
             training_hyperparameters = '{}',
             training_instance_count = 2,
             endpoint_instance_type = "ml.m5.large",
             training_instance_type = "ml.m5.large",
             training_channels = '',
             training_s3_output_path = ""
):

….Pipeline Component Code

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(pipeline, __file__ + '.tar.gz')
    print("Pipeline compiled successfully.")

To create the pipeline, we ran the .py file to compile it into a .tar.gz file and uploaded it into the KubeFlow UI.

Running the pipeline

After pipeline creation is complete, data scientists can invoke the pipeline from multiple Jupyter notebooks using the KubeFlow SDK. They can then track the pipeline run for their model in the KubeFlow UI. See the following code:

import json

import boto3
import kfp
import requests

# client_id, client_secret, kubeflow_api, and get_oauth_token are defined elsewhere in the notebook.
kfp_token = get_oauth_token(client_id, client_secret)
kfp_client = kfp.Client(host=kubeflow_api, client_id=client_id, existing_token=kfp_token)
print("Connect to: " + str(kfp_client._run_api.api_client.configuration.host))
experiment = kfp_client.get_experiment(experiment_name="Default")
print(experiment)

def get_sgm_deploy_pipeline_id():
    pipelines = kfp_client.list_pipelines(page_size=1000)
    pipeline_id = None
    for pipeline in pipelines.pipelines:
        if pipeline.name == "sagemaker-components-poc":
            pipeline_id = pipeline.id
            break
    return pipeline_id

sagemaker_deployment_parameters = {
    "model_name": "your-model-name",
    "account_id": boto3.client("sts").get_caller_identity()["Account"],
    "model_docker_image": "520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow-serving:1.12-cpu",
    "environment": json.dumps({ "SAGEMAKER_TFS_NGINX_LOGLEVEL": "info"}),
    "training_docker_image": "520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow-scriptmode:1.12-cpu-py3",
    "training_hyperparameters": json.dumps({
      "model_dir": "/opt/ml/model",
      "sagemaker_container_log_level": "20",
      "sagemaker_enable_cloudwatch_metrics": "false",
      "sagemaker_mpi_custom_mpi_options": "-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
      "sagemaker_mpi_enabled": "true",
      "sagemaker_mpi_num_of_processes_per_host": "2",
      "sagemaker_program": "train_mnist.py",
      "sagemaker_region": "us-east-1",
      "sagemaker_submit_directory": "s3://path/to/sourcedir.zip"
}),
    "training_instance_count": "2",
    "training_channels": '[{"ChannelName":"train","DataSource":{"S3DataSource":{"S3Uri":"s3://path/to/training-data","S3DataType":"S3Prefix","S3DataDistributionType":"FullyReplicated"}},"ContentType":"","CompressionType":"None","RecordWrapperType":"None","InputMode":"File"},{"ChannelName":"test","DataSource":{"S3DataSource":{"S3Uri":"s3://path/to/test/data","S3DataType":"S3Prefix","S3DataDistributionType":"FullyReplicated"}},"ContentType":"","CompressionType":"None","RecordWrapperType":"None","InputMode":"File"}]',
    "training_s3_output_path": "s3://path/to/model/artifact/output/"
}

run = {
    "name": "my-run-name",
    "pipeline_spec": { 
        "parameters": [
            { "name": param, "value": sagemaker_deployment_parameters[param] } for param in sagemaker_deployment_parameters.keys()
        ], 
        "pipeline_id": get_sgm_deploy_pipeline_id() 
    },
    "resource_references": [
        {
            "key": {
                "id": experiment.id,
                "type": "EXPERIMENT"
            },
            "relationship": "OWNER"
        }
    ]
}

requests.post("{}/apis/v1beta1/runs".format(kubeflow_api), data=json.dumps(run), headers={ "Authorization": "Bearer " + kfp_token })

Each run consists of a series of steps:

  1. Create a persistent volume claim.
  2. Generate AWS credentials.
  3. Generate resource tags.
  4. (Optional) Transfer the Docker image to Amazon Elastic Container Registry (Amazon ECR).
  5. Train the model.
  6. Generate a model artifact.
  7. Deploy the model on SageMaker hosting services.
  8. Perform Bayer-specific postprocessing.

Step 1: Creating a persistent volume claim

The first step of the process verifies that a persistent volume claim (PVC) exists within the Kubernetes cluster hosting the KubeFlow instances. This volume is returned to the pipeline and used to pass data to various components within the pipeline. See the following code:

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def get_namespace():
    return open("/var/run/secrets/kubernetes.io/serviceaccount/namespace").read()


def check_pvc_exists(pvc):
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    namespace = get_namespace()
    try:
        response = v1.read_namespaced_persistent_volume_claim(pvc, namespace)
    except ApiException as error:
        if error.status == 404:
            print("PVC {} does not exist, so it will be created.".format(pvc))
            return False
        raise
    print(response)
    return True


def create_pvc(pvc_name):
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    namespace = get_namespace()
    pvc_metadata = client.V1ObjectMeta(name=pvc_name)
    requested_resources = client.V1ResourceRequirements(requests={"storage": "50Mi"})
    pvc_spec = client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        resources=requested_resources,
        storage_class_name="efs",
        data_source=None,
        volume_name=None
    )
    k8s_resource = client.V1PersistentVolumeClaim(
        api_version="v1",
        kind="PersistentVolumeClaim",
        metadata=pvc_metadata,
        spec=pvc_spec
    )
    response = v1.create_namespaced_persistent_volume_claim(namespace, k8s_resource)
    print(response)

Step 2: Generating AWS credentials

This step generates a session token for the pipeline execution role in the specified scientist AWS account. It then writes a credentials file to the PVC in a way that allows boto3 to access it as a configuration. Downstream pipeline components mount the PVC as a volume and use the credentials file to perform operations against SageMaker.

This credential generation step is required for KubeFlow to operate across multiple AWS accounts in Bayer’s environment. This is because all pipelines run in the same namespace and run using the generic KubeFlow Pipeline Runner IAM role from the Kubeflow AWS account. Each pipeline in Bayer’s Kubeflow environment has a dedicated IAM role associated with it that has a trust relationship with the Kubeflow Pipeline Runner IAM role. For this deployment workflow, the SageMaker Deployment Master Pipeline Executor IAM role is assumed by the KubeFlow Pipeline Runner IAM role, and then the appropriate deployment role within the data scientist account is assumed by that role in turn. This keeps the trust relationships for the deployment process as self-contained as possible. See the following code:

import os
credentials_file_path = "/tmp/aws_credentials"
if os.path.exists(credentials_file_path):
    os.remove(credentials_file_path)

import argparse
import sts_ops

parser = argparse.ArgumentParser()
parser.add_argument("--account_id", help="AWS Account Id", required=True)
parser.add_argument("--master_pipeline_role", help="ARN of master pipeline role", required=True)


args = parser.parse_args()

master_session = sts_ops.assume_master_pipeline_role(args.master_pipeline_role)
creds = sts_ops.generate_deploy_session_credentials(master_session, args.account_id)
credentials_output = """[default]
aws_access_key_id = {}
aws_secret_access_key = {}
aws_session_token = {}
""".format(creds["AccessKeyId"], creds["SecretAccessKey"], creds["SessionToken"])
open("/tmp/aws_credentials", "w").write(credentials_output)
open("/tmp/aws_credentials_location.txt", "w").write(credentials_file_path) 

Step 3: Generating resource tags

Within Bayer, a standard set of tags are used to help identify Amazon resources. These tags are specified in an S3 path and applied to the model and endpoints via parameters in the corresponding SageMaker components.
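
A minimal sketch of this step might look like the following; the bucket, key, and tag file format are assumptions for illustration, and the serialized tags are then passed to parameters such as endpoint_config_tags and endpoint_tags shown later in the pipeline.

import json
import boto3

def generate_resource_tags(bucket, key):
    # Read the standard tag set from S3 (assumed here to be a JSON object of key/value pairs).
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    tags = json.loads(body)  # e.g. {"project": "science-at-scale", "cost-center": "1234"}
    # Serialize for the downstream SageMaker components' tag parameters.
    return json.dumps(tags)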

Step 4: (Optional) Transferring a Docker image to Amazon ECR

If the model training and inference images are not stored in a SageMaker-compatible Docker repository, this step copies them into Amazon ECR using a custom KubeFlow component.

Step 5: Training the model

The SageMaker Training KubeFlow Pipelines component creates a SageMaker training job and outputs a path to the eventual model artifact for downstream use. See the following code:

train_model_op = kfp.components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/cb36f87b727df0578f4c1e3fe9c24a30bb59e5a2/components/aws/sagemaker/train/component.yaml')
train_model_step = apply_environment_variables(train_model_op(
    algorithm_name=training_algorithm_name,
    hyperparameters=training_hyperparameters,
    image=training_docker_image,
    instance_type=training_instance_type,
    channels=training_channels,
    region=aws_region,
    instance_count=training_instance_count,
    role="arn:aws:iam::{}:role/sagemaker-deploy-model-execution-role".format(account_id),
    model_artifact_path=training_s3_output_path,
    network_isolation=False
), sgm_volume, create_secret_step.output)

Step 6: Generating a model artifact

The SageMaker Create Model KubeFlow Pipelines component generates a .tar.gz file containing the model configuration and trained parameters for downstream use. If a model artifact already exists in the specified S3 location, this step deletes it before generating a new one. See the following code:

sagemaker_create_model_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/cb36f87b727df0578f4c1e3fe9c24a30bb59e5a2/components/aws/sagemaker/model/component.yaml')
sagemaker_create_model_step = sagemaker_create_model_op(
    region=aws_region,
    model_name=model_name,
    image=image,
    role="arn:aws:iam::{}:role/sagemaker-deploy-model-execution-role".format(account_id),
    model_artifact_url=model_artifact_url,
    network_isolation=False,
    environment=environment,
    tags=tags
)

Step 7: Deploying the model on SageMaker hosting services

The SageMaker Create Endpoint KubeFlow Pipelines component creates an endpoint configuration and HTTPS endpoint. This process can take some time because the component pauses until the endpoint is in a ready state. See the following code:

sagemaker_deploy_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/cb36f87b727df0578f4c1e3fe9c24a30bb59e5a2/components/aws/sagemaker/deploy/component.yaml')
create_endpoint_step = apply_environment_variables(sagemaker_deploy_op(
    region=aws_region,
    endpoint_config_name=full_model_name,
    model_name_1=full_model_name,
    instance_type_1=endpoint_instance_type,
    endpoint_name=full_model_name,
    endpoint_config_tags=generate_tags_step.output,
    endpoint_tags=generate_tags_step.output
), sgm_volume, create_secret_step.output)
create_endpoint_step.after(sagemaker_create_model_step)

Step 8: Performing Bayer-specific postprocessing

Finally, the pipeline generates an Amazon API Gateway deployment and other Bayer-specific resources required for other applications within the Bayer network to use the model.

Conclusion

Data science is complex enough without asking data scientists to take on additional engineering responsibilities. By integrating open-source tools like KubeFlow with the power of Amazon SageMaker, the science@scale team at Bayer Crop Science is making it easier to develop and share advanced ML models. The MLOps workflow described in this post gives data scientists a self-service method to deploy scalable inference endpoints in the same notebooks they use for exploratory data analysis and model development. The result is rapid iteration, more successful data science products, and ultimately greater value for our farmer customers.

In the future, we’re looking forward to adding additional SageMaker components for hyperparameter optimization and data labeling to our pipeline. We’re also looking at ways to recommend instance types, configure endpoint autoscaling, and support multi-model endpoints. These additions will allow us to further standardize our ML workflows.


About the Authors

Thomas Kantowski is a cloud engineer at Bayer Crop Science. He received his master’s degree from the University of Oklahoma.

Brian Loyal leads science@scale, the enterprise ML engineering team at Bayer Crop Science.

Bhaskar Dutta is a data scientist at Bayer Crop Science. He designs machine learning models using deep neural networks and Bayesian statistics.

Read More