Meet TensorFlow community leads around the world

Posted by Joana Carrasqueira and Lynette Gaddi, Program Managers at Google

The TensorFlow community keeps growing every day, and includes many thousands of developers, educators, and researchers around the world. If you’d like to get involved with the community, there are many different organizations you can check out.

These include Special Interest Groups (SIGs), TensorFlow User Groups (TFUGs), and Google Developer Groups (GDGs). There are also many Google Developer Experts (GDEs) you can get in touch with. They’re knowledgeable about ML, help others in their community, and are a great point of contact for finding local events.

We spend a lot of time working with community leads, and in this article, we’d like to share some of their stories with you. We had the wonderful opportunity to interview several leads from different areas – a SIG Lead, a Machine Learning GDE, and two TensorFlow User Group organizers – so you can learn about their background, how they got involved in the community, and how you can get involved too.


Karl Lessard

TensorFlow SIG Lead for JVM

Montreal, Canada

Image of Karl Lessard

Karl has been working in software engineering and consulting for more than 20 years in various fields, including computer graphics and communications. He is now working full-time at Expedia in Montreal, focusing on delivering solutions for complex linguistic and localization challenges.

What does being a community leader mean to you?

What really matters is that all members of the group enjoy contributing to the project, and making it as fulfilling as possible for them. Because that’s what open-source is to me: a playground for grownups building something useful to the world! Being a community leader comes with a bunch of technical responsibilities, too, but for me that’s the most important thing.

How did you get involved in the TF community?

I started designing a few proposals to enhance the TensorFlow Java client, which at that time was offering minimal support for running model inference on Android devices. My proposed changes were welcomed by Google (special thanks to Asim Shankar), and I’ve submitted multiple pull requests over a couple years since then.

There was increasing general interest in supporting TensorFlow on the JVM following that, and I met the engineering team (and many others from the community) at a TensorFlow Dev summit in 2019 to suggest the idea of starting a group focusing on this topic. That’s how SIG JVM was born.

How do you contribute technically as a SIG Lead?

I still contribute to the design and the code of the project (like in the beginning), but I also review most of the pull requests, plan video calls, and make sure proposed changes respect the global vision of the project shared by other members and are discussed properly and broadly.

Do you have any advice on how to get involved in the community?

If you can make it to the TensorFlow Dev/Contributor Summit, do it. That’s definitely a good place to meet people sharing the same interests as you. Also, you can get involved in the various discussions related to the topics of your interest. SIG forums are a good place to start, and you can also get in touch with others on the new TensorFlow Forum. Finally, don’t be shy about making change proposals and/or submitting a few pull requests!

Ruqiya Bin Safi

Machine Learning GDE

Saudi Arabia

image of Ruqiya Bin Safi

Ruqiya Bin Safi is a software engineer who is interested in Artificial Intelligence, Machine Learning and Deep Learning as well as Data Science. Ruqiya started learning Machine Learning many years ago. She seeks to spread knowledge about Machine Learning and new technologies.

What does being a community leader mean to you?

It means a responsibility I take on, a goal I’ve accomplished, and a dream I fulfill. It means that I contribute to the development of my community and help others. To be a community leader means sharing useful knowledge so that everyone can benefit. Leadership is also a give and take: the community gives to me, and I give back to the community. It’s a cooperation. We share similar goals and interests, and vision and mission. And we seek to employ what we’ve learned to develop tools that make the world a better place.

How did you get involved in the TF community?

I’ve always loved learning new technologies and using them to solve problems. I started as a software engineer, and found machine learning increasingly interesting as I studied it, and eventually made it a focus. I got involved with the TF community through a Women Techmakers (WTM) event. I then joined a local WTM community to help others learn about ML, and later applied and was accepted into the GDE program.

How do you contribute as a ML GDE?

I love giving talks, and most of my contributions have been as a speaker and technical trainer through various tech talks, panel discussions, and workshops. My goal is to help and motivate people to learn more about ML. I also run a monthly deep learning workshop that aims to help participants gain foundational knowledge of common deep learning techniques as well as practical experience in building neural networks with TensorFlow. I also really enjoy mentoring for the Google for Startups Accelerator MENA program as well as some hackathons, and I write articles from time to time about machine learning and TensorFlow.

Do you have any advice on how to get involved in the community?

One of the best ways is to try to learn something new, and then share what you have learned with your community through blogging or a local meetup. Love what you do, and as much as you can, find work that you enjoy – and trust in yourself as you become more active and involved.

Armel Yara

TensorFlow User Group Organizer

Abidjan, West Africa

Photo of Armel Yara

Armel is a developer and TensorFlow community leader in Francophone Africa, where he organizes and hosts developer events in multiple languages (including large events like TensorFlow Everywhere SSA), and manages machine learning projects for local businesses.

What does being a community leader mean to you?

For me, being a community leader means sharing experiences, being available for others, and listening to their needs and expectations.

How did you get involved in the TF community?

I got involved in the TF community by sharing the latest TF news on my blog and working on open-source projects.

How do you contribute as a TFUG lead?

I organize events and give technical sessions at universities and online.

Do you have any advice on how to get involved in the community?

My advice to become more active and get more involved in the community is to look over the membership expectations, share projects that you have built, and use them to motivate others. Lead by example, let others know how much you enjoy what you do and showcase your work.

Nijat Zeynalov

TensorFlow User Group Organizer

Azerbaijan

photo of Nijat Zeynalov

Nijat is a Certified TensorFlow Developer and a first-year master’s student at the University of Tartu. He’s passionate about data science and machine learning, and about organizing events to help others.

What does being a community leader mean to you?

I learned about leadership by running a local community where we aimed to provide free support to anyone interested in coding. In my mind, being a community leader means to help inspire others, and also to foster a community of respect that enables and encourages contributions of others. I strongly believe that leadership can be learned with practice.

How did you get involved in the TF community?

While I was preparing for the TensorFlow Developer certificate, I found the user groups page and thought – why not set up a local user group for our country? I understood the responsibility of being a community leader, and I contacted a few TensorFlow User Group organizers to learn more about it. Their positive feedback made my decision even easier and motivated me to get started, and it’s been a great experience ever since.

How do you contribute as a community organizer?

We regularly discuss the latest TensorFlow updates in the user group, and we organise “Paper Reading Meetings” where we read and discuss one deep learning paper as a group. This has been a really great way for people to share their knowledge and ask questions. Additionally, in March, as a “TensorFlow User Group – Azerbaijan”, we held the 5-hour long “TensorFlow Everywhere – 2021” event which was the country’s largest machine learning event to date.

It was a pleasure to speak with Karl, Ruqiya, Armel and Nijat (thank you again for your time and contributions!). We hope their stories inspire you to get involved, and take on a leadership role in your local community in the future. If you’d like, you can start a conversation on the TensorFlow Forum and share how you got involved in the TensorFlow Community, and meet others. And check out the top of this post for more links to user and special interest groups.


Easier object detection on mobile with TensorFlow Lite

Posted by Khanh LeViet, Developer Advocate on behalf of the TensorFlow Lite team

Example of object detection on mobile

At Google I/O this year, we are excited to announce several product updates that simplify training and deployment of object detection models on mobile devices:

  • On-device ML learning pathway: a step-by-step tutorial on how to train and deploy a custom object detection model on mobile devices with no machine learning expertise required.
  • EfficientDet-Lite: a state-of-the-art object detection model architecture optimized for mobile devices.
  • TensorFlow Lite Model Maker for object detection: train custom models in just a few lines of code.
  • TensorFlow Lite Metadata Writer API: simplify metadata creation to generate custom object detection models compatible with TFLite Task Library.

Despite being a very common ML use case, object detection can be one of the most difficult to do. We’ve worked hard to make it easier for you, and in this blog post we’ll show you how to leverage the latest offerings from TensorFlow Lite to build a state-of-the-art mobile object detector using your own domain data.

On-device ML learning pathway: learn how to train and deploy a custom TensorFlow Lite object detection model in 12 minutes.

Training a custom object detection model and deploying it to an Android app has become super easy with TensorFlow Lite. We released a learning pathway that teaches you step-by-step how to do it.

In the video, you can learn the steps to build a custom object detector:

  1. Prepare the training data.
  2. Train a custom object detection model using TensorFlow Lite Model Maker.
  3. Deploy the model on your mobile app using TensorFlow Lite Task Library.

There’s also a codelab with source code on GitHub for you to run through the code yourself. Please try it out and let us know your feedback!

EfficientDet-Lite: the state-of-the-art model architecture for object detection on mobile devices

Running machine learning models on mobile devices means we always need to consider the trade-off between model accuracy, inference speed, and model size. A state-of-the-art mobile-optimized model not only needs to be more accurate, but it also needs to run faster and be smaller. We adapted the neural architecture search technique published in the EfficientDet paper, then optimized the model architecture for running on mobile devices and came up with a novel mobile object detection model family called EfficientDet-Lite.

EfficientDet-Lite has 5 different versions: Lite0 to Lite4. The smaller versions run faster but are less accurate than the larger versions. You can experiment with multiple versions of EfficientDet-Lite and choose the one that is most suitable for your use case.

Model architecture                 Size (MB)*   Latency (ms)**   Average Precision***
EfficientDet-Lite0                 4.4          37               25.69%
EfficientDet-Lite1                 5.8          49               30.55%
EfficientDet-Lite2                 7.2          69               33.97%
EfficientDet-Lite3                 11.4         116              37.70%
EfficientDet-Lite4                 19.9         260              41.96%
SSD MobileNetV2 320×320            6.7          24               20.2%
SSD MobileNetV2 FPNLite 640×640    4.3          191              28.2%

* Size of the integer quantized models.

** Latency measured on Pixel 4 using 4 threads on CPU.

*** Average Precision is the mAP (mean Average Precision) on the COCO 2017 validation dataset.

We have released the EfficientDet-Lite models trained on the COCO dataset to TensorFlow Hub. You also can train EfficientDet-Lite custom models using your own training data with TensorFlow Lite Model Maker.

TensorFlow Lite Model Maker: train a custom object detection model using transfer learning in a few lines of code

TensorFlow Lite Model Maker is a Python library that significantly simplifies the process of training a machine learning model using a custom dataset. It leverages transfer learning to enable training high quality models using just a handful of images.

Model Maker accepts datasets in the PASCAL VOC format and Cloud AutoML’s CSV format. Since you can create your own dataset using open-source GUI tools such as LabelImg or makesense.ai, everyone can create training data for Model Maker without writing a single line of code.

Once you have your training data, you can start training a custom TensorFlow Lite object detector.

# Imports assume the tflite-model-maker package is installed.
from tflite_model_maker import model_spec
from tflite_model_maker import object_detector

# Step 1: Choose the model architecture
spec = model_spec.get('efficientdet_lite2')

# Step 2: Load your training data
train_data, validation_data, test_data = object_detector.DataLoader.from_csv('gs://cloud-ml-data/img/openimage/csv/salads_ml_use.csv')

# Step 3: Train a custom object detector
model = object_detector.create(train_data, model_spec=spec, validation_data=validation_data)

# Step 4: Export the model in the TensorFlow Lite format
model.export(export_dir='.')

# Step 5: Evaluate the TensorFlow Lite model
model.evaluate_tflite('model.tflite', test_data)

Check out this notebook to learn more.

TensorFlow Lite Task Library: deploying object detection models on mobile in a few lines of code

TensorFlow Lite Task Library is a cross-platform library which simplifies TensorFlow Lite model deployments on mobile. Custom object detection models trained with TensorFlow Lite Model Maker can be deployed to an Android app in just a few lines of Kotlin code:

// Imports assume the Task Library (vision) and Support Library dependencies are added.
import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.task.vision.detector.ObjectDetector

// Step 1: Load the TensorFlow Lite model
val detector = ObjectDetector.createFromFile(context, "model.tflite")

// Step 2: Convert the input Bitmap into a TensorFlow Lite TensorImage object
val image = TensorImage.fromBitmap(bitmap)

// Step 3: Feed given image to the model and get the detection result
val results = detector.detect(image)

See our documentation to learn more about the customization options in Task Library, including how to configure the minimum detection threshold or the maximum number of detected objects.
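
Task Library also has a Python flavor in the tflite_support package, which can be handy for prototyping these options off-device. The sketch below is illustrative only; the class and parameter names are assumptions based on tflite_support, not the Android API shown above.

# Rough sketch of the same customization options with the tflite_support
# Python Task Library (names assumed; verify against your installed version).
from tflite_support.task import core, processor, vision

base_options = core.BaseOptions(file_name="model.tflite")
detection_options = processor.DetectionOptions(score_threshold=0.5,  # minimum detection threshold
                                               max_results=5)        # maximum number of detections
options = vision.ObjectDetectorOptions(base_options=base_options,
                                       detection_options=detection_options)
detector = vision.ObjectDetector.create_from_options(options)

results = detector.detect(vision.TensorImage.create_from_file("test_image.jpg"))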

TensorFlow Lite Metadata Writer API: simplify deployment of custom models trained with TensorFlow Object Detection API

Task Library relies on the model metadata bundled in the TensorFlow Lite model to execute the preprocessing and postprocessing logic required to run inference using the model. This metadata includes, for example, how to normalize the input image and how to map class IDs to human-readable labels. Models trained using Model Maker include this metadata by default, making them compatible with Task Library. But if you train a TensorFlow Lite object detection model using a training pipeline other than Model Maker, you can add the metadata using the TensorFlow Lite Metadata Writer API.

For example, if you train a model using TensorFlow Object Detection API, you can add metadata to the TensorFlow Lite model using this Python code:

# Imports assume the tflite-support package is installed.
from tflite_support.metadata_writers import object_detector
from tflite_support.metadata_writers import writer_utils

LABEL_PATH = 'label_map.txt'
MODEL_PATH = "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/model.tflite"
SAVE_TO_PATH = "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/model_with_metadata.tflite"

# Step 1: Specify the preprocessing parameters and label file
writer = object_detector.MetadataWriter.create_for_inference(
    writer_utils.load_file(MODEL_PATH), input_norm_mean=[0],
    input_norm_std=[255], label_file_paths=[LABEL_PATH])

# Step 2: Export the model with metadata
writer_utils.save_file(writer.populate(), SAVE_TO_PATH)

Here we specify the normalization parameters (input_norm_mean=[0], input_norm_std=[255]) so that the input image will be normalized into the [0..1] range. The normalization parameters need to match those used in the preprocessing logic during model training.
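
As a quick worked example of that formula (illustrative arithmetic only):

# normalized = (pixel - input_norm_mean) / input_norm_std
pixel = 200                      # a uint8 pixel value in [0, 255]
normalized = (pixel - 0) / 255   # ~0.784, i.e. mapped into the [0..1] range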

See this notebook for a full tutorial on how to convert models trained with the TensorFlow Object Detection API to TensorFlow Lite and add metadata.

What’s next

Our goal is to make machine learning easier to use for every developer, with or without machine learning expertise. We are working with the Model Garden team to bring more object detection model architectures to Model Maker. We will also continue to work with researchers in Google to make future state-of-the-art object detection models available via Model Maker, shortening the path from cutting-edge research to production for everyone. Stay tuned for more updates!


Training with Multiple Workers using TensorFlow Quantum

Posted by Cheng Xing and Michael Broughton, Google

Training large machine learning models is a core ability for TensorFlow. Over the years, scale has become an important feature in many modern machine learning systems for NLP, image recognition, drug discovery, and more. Making use of multiple machines to boost computational power and throughput has led to great advances in the field. Similarly, in quantum computing and quantum machine learning, the availability of more machine resources speeds up the simulation of larger quantum states and more complex systems. In this tutorial you will walk through how to use TensorFlow and TensorFlow Quantum to conduct large scale and distributed QML simulations. Running larger simulations with greater FLOP/s counts unlocks new possibilities for research that otherwise wouldn’t be possible at smaller scales. In the figure below we have outlined approximate scaling capabilities for several different hardware settings for quantum simulation.

Running distributed workloads often comes with infrastructure complexity, but we can use Kubernetes to simplify this process. Kubernetes is an open source container orchestration system, and it is a proven platform to effectively manage large-scale workloads. While it is possible to have a multi-worker setup with a cluster of physical or virtual machines, Kubernetes offers many advantages, including:

  • Service discovery – workers can easily identify each other using well-known DNS names, rather than manually configuring IP destinations.
  • Automatic bin-packing – your workloads are automatically scheduled on different machines based on resource demand and current consumption.
  • Automated rollouts and rollbacks – the number of worker replicas can be changed by changing a configuration, and Kubernetes automatically adds/removes workers in response and schedules them on machines where resources are available.

This tutorial guides you through a TensorFlow Quantum multi-worker setup using Google Cloud products, including Google Kubernetes Engine, a managed Kubernetes platform. You will have the chance to take the single-worker Quantum Convolutional Neural Network (QCNN) tutorial in TensorFlow Quantum and augment it for multi-worker training.

From our experiments in the multi-worker setting, training a 23-qubit QCNN with 1,000 training examples (which corresponds to roughly 3,000 circuits simulated using full state vector simulation) takes 5 minutes per epoch on a 32-node (512 vCPU) cluster, which costs a few US dollars. By comparison, the same training job on a single worker would take roughly 4 hours per epoch. Pushing things a little bit farther, hundreds of thousands of 30-qubit circuits could be run in a few hours using more than 10,000 virtual CPUs, which could have taken weeks in a single-worker setting. The actual performance and cost may vary depending on your cloud setup, such as VM machine type, total cluster running time, etc. Before performing larger experiments, we recommend starting with a small cluster first, like the one used in this tutorial.

The source code for this tutorial is available in the TensorFlow Quantum GitHub repository. README.md contains the quickest way to get this tutorial up and running. This tutorial will instead focus on walking through each step in detail, to help you understand the underlying concepts and integrate them with your own projects. Let’s get started!

1. Setting up Infrastructure in Google Cloud

The first step is to create the infrastructure resources in Google Cloud. If you have an existing Google Cloud environment, the exact steps might vary, due to organizational policy constraints for example. This is a guideline to the most common set of necessary steps. Note that you will be charged for Google Cloud resources you create, and here is a summary of billable resources used in this tutorial. If you are a new Google Cloud user, you are eligible for $300 in credits. If you are part of an academic institution, you may be eligible for Google Cloud research credits.

You will be running several shell commands in this tutorial. For that, you can use either a local Unix shell available on your computer or the Cloud Shell, which already contains many of the tools mentioned later.

A script automating the steps below is available in setup.sh. This section walks through every step in detail, and if this is your first time using Google Cloud, we recommend that you walk through the entire section. If you prefer to automate the Google Cloud setup process and skip this section:

  • Open setup.sh and configure parameter values inside.
  • Run ./setup.sh infra.

In this tutorial, you will use a few Google Cloud products:

To get your cloud environment ready, first follow these quick start guides:

For purposes of this tutorial, you could stop the Kubernetes Engine quickstart right before the instructions for creating a cluster. In addition, install gsutil, the Cloud Storage command-line tool (if you are using Cloud Shell, gsutil is already installed):

gcloud components install gsutil

For reference, shell commands throughout the tutorial will refer to these variables. Some of them will make more sense later on in the tutorial in the context of each command.

  • ${CLUSTER_NAME}: your preferred Kubernetes cluster name on Google Kubernetes Engine.
  • ${PROJECT}: your Google Cloud project ID.
  • ${NUM_NODES}: the number of VMs in your cluster.
  • ${MACHINE_TYPE}: the machine type of VMs. This controls the amount of CPU and memory resources for each VM.
  • ${SERVICE_ACCOUNT_NAME}: The name of both the Google Cloud IAM service account and the associated Kubernetes service account.
  • ${ZONE}: Google Cloud zone for the Kubernetes cluster.
  • ${BUCKET_REGION}: Google Cloud region for Google Cloud Storage bucket.
  • ${BUCKET_NAME}: Name of the Google Cloud Storage bucket for storing training output.

To ensure you have permissions to run cloud operations in the rest of the tutorial, make sure either you have the IAM role of owner, or all of the following roles:

  • container.admin
  • iam.serviceAccountAdmin
  • storage.admin

To check your roles, run:

gcloud projects get-iam-policy ${PROJECT}

with your Google Cloud project ID and search for your user account.

After you’ve completed the quickstart guides, run this command to create a Kubernetes cluster:

gcloud container clusters create ${CLUSTER_NAME} --workload-pool=${PROJECT}.svc.id.goog --num-nodes=${NUM_NODES} --machine-type=${MACHINE_TYPE} --zone=${ZONE} --preemptible

with your Google Cloud project ID and preferred cluster name.

--num-nodes is the number of Compute Engine virtual machines backing your Kubernetes cluster. This is not necessarily the same as the number of worker replicas you’d like to have for your QCNN job, as Kubernetes is able to schedule multiple replicas on the same node, provided that the node has enough CPU and memory resources. If you are trying this tutorial for the first time, we recommend 2 nodes.

--machine-type specifies the VM machine type. If you are trying this tutorial for the first time, we recommend “n1-standard-2”, with 2 vCPUs and 7.5GB of memory.

--zone is the Google Cloud zone where you’d like to run your cluster (for example “us-west1-a”).

--workload-pool enables the GKE Workload Identity feature, which ties Kubernetes service accounts to Google Cloud IAM service accounts. In order to have fine-grained access control, an IAM service account is recommended for accessing various Google Cloud products. Here you’ll create a service account to be used by your QCNN jobs. A Kubernetes service account is the mechanism that injects the credentials of this IAM service account into your worker containers.

--preemptible uses Compute Engine Preemptible VMs to back the Kubernetes cluster. They are up to 80% lower in cost compared to regular VMs, with the tradeoff that a VM may be preempted at any time, which will terminate the training process. This is well-suited for short-running training sessions with large clusters.

You can then create an IAM service account:

gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME}

and integrate it with Workload Identity:

gcloud iam service-accounts add-iam-policy-binding --role roles/iam.workloadIdentityUser --member "serviceAccount:${PROJECT}.svc.id.goog[default/${SERVICE_ACCOUNT_NAME}]" ${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com

Now create a storage bucket, which is the basic container to store your data:

gsutil mb -p ${PROJECT} -l ${BUCKET_REGION} -b on gs://${BUCKET_NAME}

using your preferred bucket name. Bucket names are globally unique, so we recommend including your project name as part of the bucket name. We recommend choosing the bucket region that contains your cluster’s zone. The region of a zone is the part of the zone name without the section after the last hyphen. For example, the region of zone “us-west1-a” is “us-west1”.

To make your Cloud Storage data accessible by your QCNN jobs, give permissions to your IAM service account:

gsutil iam ch serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com:roles/storage.admin gs://${BUCKET_NAME}

2. Preparing Your Kubernetes Cluster

With the cloud environment set up, you can now install the necessary Kubernetes tools into the cluster. You’ll need tf-operator, a component from KubeFlow. KubeFlow is a toolkit for running machine learning workloads on Kubernetes, and tf-operator is a subcomponent which simplifies the management of TensorFlow jobs. tf-operator can be installed separately without the larger KubeFlow installation.

To install tf-operator, run:

docker pull k8s.gcr.io/kustomize/kustomize:v3.10.0
docker run k8s.gcr.io/kustomize/kustomize:v3.10.0 build "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=v1.1.0" | kubectl apply -f -

(Note that tf-operator uses Kustomize to manage its deployment files, so it needs to be installed here as well)

3. Training with MultiWorkerMirroredStrategy

You can now take the QCNN code found on the TensorFlow Quantum research branch and prepare it to run in a distributed fashion. Let’s clone the source code:

git clone https://github.com/tensorflow/quantum.git && cd quantum && git checkout origin/research && cd qcnn_multiworker

Or, if you are using SSH keys to authenticate to GitHub:

git clone git@github.com:tensorflow/quantum.git && cd quantum && git checkout origin/research && cd qcnn_multiworker

Code Setup

The training directory contains the necessary pieces for performing distributed training of your QCNN. The combination of training/qcnn.py and common/qcnn_common.py is the same as the hybrid QCNN example in TensorFlow Quantum, but with a few feature additions:

  • Training can now optionally leverage multiple machines with tf.distribute.MultiWorkerMirroredStrategy.
  • TensorBoard integration, which you will explore in more detail in the next section.

MultiWorkerMirroredStrategy is the mechanism in TensorFlow to perform synchronized distributed training. Your existing model has been augmented for distributed training with just a few extra lines of code.

At the beginning of training/qcnn.py, we set up MultiWorkerMirroredStrategy:

strategy = tf.distribute.MultiWorkerMirroredStrategy()

In the model preparation step, we then pass in this strategy as an argument:

... = qcnn_common.prepare_model(strategy)
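
For context, the important pattern inside prepare_model is that the Keras model is built and compiled under strategy.scope(), so its variables are created as mirrored variables that stay in sync across workers. A minimal sketch (placeholder layers, not the actual hybrid QCNN):

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created in this scope are mirrored across all workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")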

Each worker of our QCNN distributed training job will run a copy of this Python code. Every worker needs to know the network endpoint of all other workers. The TF_CONFIG environment variable is typically used for this purpose, but in our case, the tf-operator injects it automatically behind the scenes.
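
For reference, a TF_CONFIG value for a two-worker MultiWorkerMirroredStrategy job looks roughly like the sketch below; the hostnames are illustrative, and with tf-operator you never have to set this yourself.

import json
import os

# Illustrative only -- tf-operator injects an equivalent value automatically.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["qcnn-worker-0.default.svc:2222",
                   "qcnn-worker-1.default.svc:2222"],
    },
    "task": {"type": "worker", "index": 0},  # this replica's role in the cluster
})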

After the model is trained, weights are uploaded to your Cloud Storage bucket to be accessed later by the inference job.

if task_type == 'worker' and task_id == 0:
    qcnn_weights_path = '/tmp/qcnn_weights.h5'
    qcnn_model.save_weights(qcnn_weights_path)
    upload_blob(args.weights_gcs_bucket, qcnn_weights_path, f'qcnn_weights.h5')

Kubernetes Deployment Setup

Before proceeding to the Kubernetes deployment setup and launching your workers, several parameters need to be configured in the tutorial source code to match your own setup. The provided script, setup.sh, can be used to simplify this process.

Open setup.sh and configure parameter values inside, if you haven’t already done so in a previous step. Then run

./setup.sh param

At this point, the remaining steps in this section can be performed in one command:

make training

The rest of this section walks through the Kubernetes setup in detail.

Prior to running as containers in Kubernetes, the QCNN job needs to be packaged as a container image using Docker and uploaded to the Container Registry. The Dockerfile contains the specification for the image. To build and upload the image, run:

docker build -t gcr.io/${PROJECT}/qcnn .
docker push gcr.io/${PROJECT}/qcnn

Next, you’ll complete the Workload Identity setup by creating the Kubernetes service account using common/sa.yaml. This service account will be used by the QCNN containers.

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: ${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com
  name: ${SERVICE_ACCOUNT_NAME}

The annotation tells GKE this Kubernetes service account should be bound to the IAM service account you created previously. Let’s create this service account:

kubectl apply -f common/sa.yaml

The last step is to create the distributed training job. training/qcnn.yaml contains the Kubernetes specifications for your job. In Kubernetes, multiple containers with related functions are grouped into a single entity called a Pod, which is the most fundamental unit of work that can be scheduled. Typically, users leverage existing resource types such as Deployment and Job to create and manage workloads. You’ll instead use TFJob (as specified in the `kind` field), which is not a Kubernetes built-in resource type but rather a Custom Resource provided by the tf-operator, making it easier to work with TensorFlow workloads.

Notably, the TFJob spec contains the field tfReplicaSpecs.Worker, which lets you configure a MultiWorkerMirroredStrategy worker. Values of PS (parameter server), Chief, and Evaluator are also supported for asynchronous and other forms of distributed training. Under the hood, tf-operator creates two Kubernetes resources for each worker replica:

  • A Pod, using the Pod spec template at tfReplicaSpecs.Worker.template. This runs the container you’ve built previously on Kubernetes.
  • A Service, which exposes a well-known network endpoint visible within the Kubernetes cluster to give access to the worker’s gRPC training server. Other workers can communicate with its server by simply pointing to <service_name>:<port> (the alternative form of <service_name>.<service_namespace>.svc:<port> works as well).
  • TFJob: generates one Service and Pod per worker replica. Once the TFJob is updated, changes are reflected in the underlying Services and Pods. Worker status is also reported in the TFJob.
  • Service: exposes worker servers to the rest of the cluster. Each worker communicates with other workers by using the destination worker’s Service name as the DNS name.

Within the worker spec, there are a few notable fields:

  • replicas: Number of worker replicas. It’s possible for multiple replicas to be scheduled on the same node, so this number is not limited to the number of nodes.
  • template: the Pod spec template for each worker replica
    • serviceAccountName: this gives the Pod access to the Kubernetes service account.
    • container:
      • image: Points to the Container Registry image you’ve built previously.
      • command: the container’s entry point command.
      • args: command-line arguments.
      • ports: opens up one port for workers to communicate with each other, and another port for profiling.
    • affinity: this tells Kubernetes that you prefer to schedule worker Pods on different nodes as much as possible, to maximize resource utilization.

To create the TFJob:

kubectl apply -f training/qcnn.yaml

Inspecting the Deployment

Congratulations! Your distributed training is now underway. To check the job’s status, run kubectl get pods a few times (or add -w to stream the output). Eventually you should see there are the same number of qcnn-worker Pods as your replicas parameter, and they all have status Running:

NAME            READY   STATUS    RESTARTS
qcnn-worker-0   1/1     Running   0
qcnn-worker-1   1/1     Running   0

To access the worker’s log output, run:

kubectl logs <worker_pod_name>

or add -f to stream the output. The output of qcnn-worker-0 looks like this:


I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://qcnn-worker-0.default.svc:2222

I tensorflow/core/profiler/rpc/profiler_server.cc:46] Profiler server listening on [::]:2223 selected port:2223

Epoch 1/50

4/4 [==============================] - 7s 940ms/step - loss: 0.9387 - accuracy: 0.0000e+00 - val_loss: 0.7432 - val_accuracy: 0.0000e+00

I tensorflow/core/profiler/lib/profiler_session.cc:71] Profiler session collecting data.
I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.

Epoch 50/50
4/4 [==============================] - 1s 222ms/step - loss: 0.1468 - accuracy: 0.4101 - val_loss: 0.2043 - val_accuracy: 0.4583
File /tmp/qcnn_weights.h5 uploaded to qcnn_weights.h5.

The output of qcnn-worker-1 should be similar except the last line is missing. The chief worker (worker 0) is responsible for saving weights of the entire model.

You can also verify that model weights are saved by visiting the Storage Browser in Cloud Console and browsing through the storage bucket you created previously.

To delete the training job, run

kubectl delete -f training/qcnn.yaml

4. Understanding Training Performance Using TensorBoard

TensorBoard is TensorFlow’s visualization toolkit. By integrating your TensorFlow Quantum model with TensorBoard, you get many visualizations about your model out of the box, such as training loss and accuracy, the model graph, and program profiling.

Code Setup

To enable TensorBoard for your job, create a TensorBoard callback and pass it into model.fit():

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=args.logdir,
                                                      histogram_freq=1,
                                                      update_freq=1,
                                                      profile_batch='10, 20')

history = qcnn_model.fit(x=train_excitations,
                         y=train_labels,
                         batch_size=32,
                         epochs=50,
                         verbose=1,
                         validation_data=(test_excitations, test_labels),
                         callbacks=[tensorboard_callback])

The profile_batch parameter enables the TensorFlow Profiler in programmatic mode, which samples the program during the training step range you specify here. You can also enable the sampling mode,

tf.profiler.experimental.server.start(args.profiler_port)

which allows on-demand profiling initiated either by a different program or through the TensorBoard UI.
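
For example, a separate Python process can trigger an on-demand capture against a worker’s profiler server. A sketch, with an illustrative address, log directory, and duration:

import tensorflow as tf

# Equivalent to clicking "Capture Profile" in the TensorBoard UI.
tf.profiler.experimental.client.trace(
    service_addr="grpc://qcnn-worker-0.default.svc:2223",
    logdir="gs://your-bucket/qcnn-logdir",
    duration_ms=2000)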

TensorBoard Features

Here we’ll highlight a subset of TensorBoard’s many powerful features used in this tutorial. Check out the TensorBoard guide to learn more.

Loss and Accuracy

Loss is the quantity that the model aims to minimize during training, computed via a loss function. Accuracy is the fraction of samples during training where predictions match labels. The loss metric is exported by default. To enable the accuracy metric, add the following to the model.compile() step:

qcnn_model.compile(..., metrics=['accuracy'])

Custom Metrics

In addition to loss and accuracy, TensorBoard also supports custom metrics. For example, the tutorial code exports the QCNN readout tensor as a histogram.
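
As a minimal sketch of how such a custom metric can be exported (the log directory and tensor below are placeholders, not the tutorial’s exact code):

import tensorflow as tf

writer = tf.summary.create_file_writer("/tmp/qcnn-logdir")  # placeholder logdir
readout = tf.random.normal([32])                            # stand-in for the QCNN readout tensor
with writer.as_default():
    tf.summary.histogram("qcnn_readout", readout, step=0)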

Profiler

The TensorFlow Profiler is a helpful tool in debugging performance bottlenecks in your model training job.

In this tutorial, we use both the programmatic mode, in which profiling is done for a predefined training step range, as well as the sampling mode, in which profiling can be done on-demand. For a MultiWorkerMirroredStrategy setup, currently programmatic mode only outputs profiling data from the chief (worker 0), whereas sampling mode is able to profile all workers.

When you first open the Profiler, the data displayed is from the programmatic mode. The overview page gives you a sense of how long training took during each step. This will act as a reference as you experiment with different methods of improving training performance, whether that’s by scaling infrastructure (adding more VMs to the cluster, using VMs with more CPU and memory, integrating with hardware accelerators) or improving code efficiency.

Performance Summary

The trace viewer gives the duration breakdown of all the training instructions under the hood, providing a detailed view to identify execution time bottlenecks.

Kubernetes Deployment Setup

To view the TensorBoard UI, you can create a TensorBoard instance in Kubernetes. The Kubernetes setup is in training/tensorboard.yaml. This file contains two objects:

  • A Deployment containing 1 Pod replica of the same worker container image, but run with a TensorBoard command: tensorboard --logdir=gs://${BUCKET_NAME}/${LOGDIR_NAME} --port=5001 --bind_all
  • A Service creating a network load balancer to make the TensorBoard UI accessible on the Internet, so you can view it in your browser.

It is also possible to run a local instance of TensorBoard on your workstation by pointing --logdir to the same Cloud Storage bucket, although additional IAM permissions setup is required.

Create this Kubernetes setup:

kubectl apply -f training/tensorboard.yaml

In the output of kubectl get pods, you should see there’s a Pod with the prefix qcnn-tensorboard which is eventually in Running status. To get the IP address of the TensorBoard instance, run

kubectl get svc tensorboard-service -w
NAME                  TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)
tensorboard-service   LoadBalancer   10.123.240.9   <pending>     5001:32200/TCP

The load balancer takes some time to provision so you may not see the IP right away. Once it’s available, go to <ip>:5001 in your browser to access the TensorBoard UI.

With TensorFlow 2.4 and higher, it’s possible to profile multiple workers in sampling mode: workers can be profiled while a training job is running by clicking “Capture Profile” in the TensorBoard Profiler and setting “Profile Service URL” to qcnn-worker-<replica_id>:2223. To enable this, the profiler port needs to be exposed by the worker Service. The tutorial source code provides a script which patches all worker Services generated by the TFJob with a profiler port. Run

training/apply_profiler_ports.sh

Note that the need to manually patch Services is temporary, and there is currently planned work in tf-operator to support specifying additional ports in the TFJob.

5. Running Inference

After completing the distributed training job, model weights are stored in your Cloud Storage bucket. You can then use these weights to construct an inference program, and then create an inference job in the Kubernetes cluster. It is also possible to run an inference program on your local workstation, although it requires additional IAM permissions to grant access to Cloud Storage.

Code Setup

Inference source code is available in the inference/ directory. The main file, qcnn_inference.py, mostly reuses the model construction code in common/qcnn_common.py, but loads model weights from your Cloud Storage bucket instead:

qcnn_weights_path = '/tmp/qcnn_weights.h5'
download_blob(args.weights_gcs_bucket, args.weights_gcs_path, qcnn_weights_path)
qcnn_model.load_weights(qcnn_weights_path)

It then applies the model to a test set and computes the mean squared error.

results = qcnn_model(test_excitations).numpy().flatten()
loss = tf.keras.losses.mean_squared_error(test_labels, results)

Kubernetes Deployment Setup

The remaining steps in this section can be performed in one command:

make inference

The inference program is built into the Docker image from the training step, so you don’t need to build a new image here. The inference job spec, inference/inference.yaml, contains a Job with its Pod spec pointing to the image but executes qcnn_inference.py instead. Run kubectl apply -f inference/inference.yaml to create the job.

The Pod prefixed with inference-qcnn should eventually be in Running status (kubectl get pods). In the log output of the inference Pod (kubectl logs <pod_name>), the mean squared error should be close to the final loss shown in the TensorBoard UI.


Blob qcnn_weights.h5 downloaded to /tmp/qcnn_weights.h5.
[-0.8220097 0.40201923 -0.82856977 0.46476707 -1.1281478 0.23317486
0.00584182 1.3351855 0.35139582 -0.09958048 1.2205497 -1.3038696
1.4065738 -1.1120421 -0.01021352 1.4553616 -0.70309246 -0.0518395
1.4699622 -1.3712595 -0.01870352 1.2939589 1.2865802 0.847203
0.3149605 1.1705848 -1.0051676 1.2537074 -0.2943283 -1.3489063
-1.4727883 1.4566276 1.3417912 0.9123422 0.2942805 -0.791862
1.2984066 -1.1139404 1.4648925 -1.6311806 -0.17530376 0.70148027
-1.0084027 0.09898916 0.4121615 0.62743163 -1.4237025 -0.6296255 ]
Test Labels
[-1 1 -1 1 -1 1 1 1 1 -1 1 -1 1 -1 1 1 -1 -1 1 -1 -1 1 1 1
1 1 -1 1 -1 -1 -1 1 1 1 -1 -1 1 -1 1 -1 -1 1 -1 1 1 1 -1 -1]
Mean squared error: tf.Tensor(0.29677835, shape=(), dtype=float32)

6. Cleaning Up

And this wraps up our journey through distributed training! After you are done experimenting with the tutorial, this section walks you through the steps to clean up Google Cloud resources.

First, remove the Kubernetes deployments. Run:

make delete-inference
kubectl delete -f training/tensorboard.yaml

and, if you haven’t done so already,

make delete-training

Then, delete the GKE cluster. This deletes the underlying VMs as well.

gcloud container clusters delete ${CLUSTER_NAME} --zone=${ZONE}

Next, delete the training data in your Google Cloud Storage.

gsutil rm -r gs://${BUCKET_NAME}

And lastly, remove the worker container image from Container Registry following these instructions using the Cloud Console. Look for the image name qcnn.

Next Steps

Now that you’ve tried out the multi-worker setup, try setting it up with your project! As all the tools mentioned in this tutorial continue to grow, best practices for training with multiple workers will change over time. Check back on the tutorial directory in the TensorFlow Quantum GitHub repository for updates!

As you continue to scale your experiment, you might eventually hit infrastructure limitations that require advanced configuration of the technologies used in this tutorial due to the complexity of working in a distributed environment. For a deeper dive into some of them, check out these resources:

If you are interested in conducting large scale QML research in TensorFlow Quantum, check out our research credit application page to apply for cloud credits on Google Cloud.

Leveraging Machine Learning for Unstructured Data Processing at Pixie

A guest post by James Bartlett and Zain Asgar of Pixie.

At Pixie, our goal is to enable developers to quickly understand and debug production systems. We achieve this by providing developers easy access to an assortment of metric and log data from their production system. For example, we collect structured information about CPU and memory usage for each process in their system, as well as many types of unstructured data (for example, the body of an HTTP request, or the error message from a program).

These are just two examples; we collect many other types of data as well. For this blog post, we will focus on the vast amounts of unstructured data we collect in Pixie, such as HTTP request/response bodies. We foresee a future where this unstructured machine data can be queried as easily and efficiently as the structured data. To achieve this, we leverage state-of-the-art NLP techniques to learn the structure of the data.

In this article, we’d like to share our experience and efforts here, in the hopes they are useful to inform your thinking on similar problems.

HTTP clustering

Suppose a developer using Pixie wanted to get an idea of which types of HTTP requests are particularly slow. Instead of forcing the developer to sift through many individual HTTP requests by hand, we can cluster the HTTP requests semantically and then show them a timeseries of latencies for each type of semantically clustered request. To demonstrate this, let’s walk through the end result, and then we’ll come back to how we got to this point. We will use Pixie to explore a demo application called Online Boutique. Once we have Pixie deployed to a Kubernetes cluster running Online Boutique, we can start to explore. For example, we can look at a graph of the network connections within the Online Boutique application:

graph of the network connections within the Online Boutique application

As you can see in the service graph, there’s a frontend service that handles incoming requests and sends them to their respective microservices. So let’s delve into the HTTP requests sent to the frontend service and their corresponding latencies.

HTTP Request Body                                                Latency (ms)
"product_id=L9ECAV7KIM&quantity=3"                               3.325302
"email=someone%40example.com&street_address=1600+Amphitheatr…"   102.625462
"product_id=OLJCESPC7Z&quantity=3"                               3.4530790000000002
"product_id=L9ECAV7KIM&quantity=5"                               4.828718
"product_id=0PUK6V6EV0&quantity=2"                               5.319163
"email=someone%40example.com&street_address=1600+Amphitheatr…"   107.361424
"product_id=0PUK6V6EV0&quantity=4"                               3.81733
"currency_code=EUR"                                              0.203676
"currency_code=USD"                                              0.220932
"product_id=0PUK6V6EV0&quantity=4"                               4.538055

From this small sample of requests, it’s not immediately clear what’s going on. It looks like the requests with `email=…&street_address=…` are much slower than the others, but we can’t be sure these examples weren’t just outliers. Instead of looking at more data, let’s use our soon-to-be-explained unstructured text clustering techniques to cluster the HTTP requests semantically by the contents of their bodies.

plot of the average 99th percentile response latency for requests for each semantic cluster

Here you can see a plot of the average 99th percentile response latency for requests for each semantic cluster. Using this view, you can quickly determine the three broad categories of requests coming into the frontend service, as well as the latency profiles of those requests. Immediately, we see that the “email” cluster of requests has significantly higher average p99 latency than the other clusters, and we see that the “product” cluster has occasional latency spikes. Both of these are actionable insights we can debug further. Now let’s dive in and discuss how we got to this point.

Model Development Details

Requirements

Since our models will be deployed on customers’ production clusters, they must be lightweight and performant; ideally fast enough to handle data at line rate with minimal CPU overhead. Any training on customer data must occur on the customer cluster to maintain data isolation guarantees. Furthermore, since the data plane is entirely contained on customer clusters, we have strict storage limitations for data, so we must leverage ML techniques to intelligently sample the data we collect.

Dataset

Due to our stringent data isolation requirements, we’re using the loghub dataset to bootstrap our model training. This dataset is a collection of log messages from various contexts (Android sys logs, Apache Server logs, supercomputer/HPC logs, etc). To test the model’s generalization to unseen log formats, we reserved the Android log data for testing and trained on the remainder of the log data.

We use Google’s SentencePiece to tokenize the log messages. In particular, we use their implementation of unigram language model based subword tokenization with a vocab size of 16k. The following image shows a word cloud of all 16k vocabulary subword pieces generated by our tokenization. The size of each word indicates its frequency in the dataset.

Word cloud showing vocabulary subword pieces from Logpai Loghub machine log dataset tokenization.

This word cloud provides insight into the biases of our dataset. For example, about 30% of the data is Windows logs, as you can see by the high frequency of the token “windows”, and “microsoft”. Also, if you have a keen eye, you might think we have a lot of frowny faces in our data set, but in fact “):” is almost always preceded by an opening parenthesis, as in the following examples:

[Thu Jan 26 12:23:07 2006] [error] config.update(): Can't create vm
[Fri Jan 27 11:55:16 2006] [error] [client 202.143.128.18] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /
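
Stepping back to the tokenization setup itself, here is a rough illustration of how such a tokenizer can be trained; the input file and exact flags are assumptions, not Pixie’s actual configuration:

import sentencepiece as spm

# Train a unigram subword tokenizer with a 16k vocabulary on log messages,
# one message per line (hypothetical input file).
spm.SentencePieceTrainer.train(
    input="loghub_messages.txt",
    model_prefix="log_tokenizer",
    vocab_size=16000,
    model_type="unigram")

sp = spm.SentencePieceProcessor(model_file="log_tokenizer.model")
print(sp.encode("client sent HTTP/1.1 request without hostname", out_type=str))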

Model Architecture

Using this tokenized dataset, we train a self-attention based model using left-to-right next word prediction (à la OpenAI’s GPT models). Left-to-right next word prediction is the task of trying to predict the next token given a sequence of prior context tokens. The left-to-right part distinguishes it from BERT style models that use bidirectional context (we will try bidirectional models in the future). This TensorFlow tutorial demonstrates training of a similar architecture, the only difference being that we drop the encoder side of the architecture.

The architecture we use is displayed in the figure below. We have 6 decoder blocks, each with a self-attention layer and a feed-forward layer. Note that, for simplicity, the diagram leaves out the skip connections over the self-attention and feed-forward layers, as well as the layer normalizations that go with those skip connections.

GPT-style language model architecture

All in all, this architecture has 6.47M parameters, making it quite small in comparison to state-of-the-art language models. DistilBERT, for instance, has 66M parameters. On the other hand, the largest version of GPT-3 has 175B parameters.

We trained this architecture for 10 epochs with roughly 100 million unique log messages per epoch. After each epoch, we ran the model on a validation set and the model from the epoch with the highest validation accuracy was used as the final model. We achieved a test accuracy of 63.13% for next word prediction on the holdout Android log data. Given that we haven’t yet explored hyperparameter tuning, or other optimizations, this level of accuracy is a promising starting point.

We now have a way to predict future tokens in machine log data from context, with somewhat decent accuracy. However, this doesn’t immediately help us with our original goal of extracting structured features from the data. To further this goal, we will explore the feature space generated by the language model, rather than the predictions of the language model.

The goal is to transform our complicated data space into a fixed-dimensional feature space which we can then use for subsequent tasks. In order to do this we need to transform the outputs of the language model into a fixed-dimensional vector, which we will call the feature vector. One way to do this comes from BERT style models.

With BERT style models the way to extend the pre-trained language model to supervised tasks is to add a fully connected network on the output of the <CLS> (or <s>) token of the sentence, and then fine-tune the model with the fully-connected network on some classification task (this is illustrated in the figure below). This leads to a natural feature vector as the output prior to the softmax layer.

Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/

We plan to explore this method in the future; however, for now we would like to see what results we can get without adding any extra supervision. This requires a heuristic approach to turn our sequence of outputs into a fixed-length vector. One way to do this is to use a max-pooling operator on the sequence dimension of the output. Suppose our language model outputs a sequence of 256-dimensional vectors; then a max-pooling on the sequence dimension will output a single 256-dimensional vector, where each dimension is the maximum value of that dimension across all outputs in the sequence. The idea behind this approach is that neurons that have stronger responses are more important to include in the final representation.
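
In code, this pooling step is a single reduction over the sequence axis; a small sketch with placeholder shapes:

import tensorflow as tf

# (batch, seq_len, hidden) -> (batch, hidden)
outputs = tf.random.normal([8, 64, 256])     # placeholder language-model outputs
feature_vectors = tf.reduce_max(outputs, axis=1)
print(feature_vectors.shape)                 # (8, 256)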

Results

We can test how well this method works for clustering on a subset of the loghub data that I’ve hand-labelled into semantic clusters. Below are three of the log messages in the hand-labelled test data set. The first two are labelled as belonging to the same semantic cluster, since both relate to failures to find files; the last is from a different cluster, since it’s an unrelated message about flushing data.

[Wed Nov 09 22:30:05 2005] [error] [client 216.138.114.25] script not found or unable to stat: /var/www/cgi-bin/awstats.p

[Sat Jan 28 19:29:29 2006] [error] [client 211.154.174.50] File does not exist: /var/www/html/modules

20171230-12:25:37:318|Step_StandStepCounter|30002312|flush sensor data

Using the hand-labelled test set, we can measure how well the model separates the different clusters. To do this, we use the KMeans algorithm to generate a clustering based on the output of the model, and then compare this clustering to the hand-labelled clustering. On this test set, the model’s adjusted rand score, a metric where 0.0 is random labelling and 1.0 is perfect labelling, was 0.53. As with next word prediction accuracy, the performance isn’t great but a good starting point.
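
The evaluation itself can be sketched in a few lines; the arrays below are placeholders for the real feature vectors and hand labels:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

features = np.random.rand(100, 256)          # model feature vectors (placeholder)
hand_labels = np.random.randint(0, 5, 100)   # hand-labelled semantic clusters (placeholder)

predicted = KMeans(n_clusters=5, random_state=0).fit_predict(features)
print(adjusted_rand_score(hand_labels, predicted))   # 1.0 = perfect, ~0.0 = random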

We can also view a low-dimensional representation of the feature space for the model, using PCA to reduce the dimensionality to two. The figures below show the first two PCA dimensions of the embeddings for each point in the test data set. The colors represent the semantic cluster the point belongs to. Note that since these are plots in a two-dimensional subspace of the embedding space, the absolute position of points carries little meaning, more meaning is derived from the tightness of each of the clusters. In the figure below, we can see that the model separates some of the classes reasonably well, but fails on others.

2-dimensional representation of the feature space of the model.
2-dimensional representation of the feature space of the model.

Using this method, we should be able to cluster unstructured data in Pixie, and tag it with its semantic cluster ID, hence extracting a structured feature from our unstructured data. This particular feature is, as yet, not very human-interpretable, but we will get to that later.

Inference

So let’s try to implement this method within the Pixie system. In order to do that we first need to convert our model into TensorFlow Lite and then load it into the Pixie execution engine. We decided to use TensorFlow Lite because we need to minimize overhead as much as possible, and in the future we would like the flexibility to deploy to heterogeneous edge devices including Raspberry PI’s and ARM microcontrollers.

Converting to TensorFlow Lite is pretty simple. We create a TF function for our model and call the built-in converter to generate a TensorFlow Lite model file:

import tensorflow as tf

model = tf.keras.models.load_model(model_path)

@tf.function(input_signature=[tf.TensorSpec([1, max_length], dtype=tf.int32)])
def pred_fn(encoded_text):
  # Create a mask that masks out 0 tokens, and future tokens for next word prediction.
  mask = create_padded_lookahead_mask(max_length)
  # Our saved model outputs both its next word predictions and the activations of its
  # final layer. We only use the activations of the final layer for clustering purposes.
  model_preds, last_layer_output = model([encoded_text, mask], training=False)
  # Max pool over the sequence dimension.
  return tf.reduce_max(last_layer_output, axis=1)

converter = tf.lite.TFLiteConverter.from_concrete_functions([pred_fn.get_concrete_function()])
tflite_model = converter.convert()
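
Before loading the model into Pixie, one way to sanity-check the converted model locally is with the TensorFlow Lite interpreter. This is just a sketch: the file name is arbitrary, and encoded_text is assumed to be an already-tokenized log message as an int32 array of shape [1, max_length].

# Write the converted model to disk (file name is arbitrary).
with open('log_encoder.tflite', 'wb') as f:
  f.write(tflite_model)

# Run it with the TFLite interpreter to check the output.
interpreter = tf.lite.Interpreter(model_path='log_encoder.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

interpreter.set_tensor(input_details[0]['index'], encoded_text)  # int32, shape [1, max_length]
interpreter.invoke()
embedding = interpreter.get_tensor(output_details[0]['index'])   # e.g. shape [1, 256]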

Pixie’s query engine allows querying and manipulating data collected by Pixie. This engine already has a KMeans operator, so all we need to do is load our TFLite model into the engine and then write a custom PxL script (a script in Pixie’s scripting language, based on Python/Pandas) to cluster our data. We are working on a public API to load custom ML models into the engine, but for now we will use some internal features to do that. Once the model is loaded, we can use it on any unstructured data in the Pixie Platform.

Some of the areas we are currently exploring include our vision of federated, differentially-private training of models, bidirectional language models à la BERT, compression schemes for unstructured data based on learned structural representations of the data, and anomaly detection on unstructured data.

Our goal on the Pixie ML team is to harness ML to simplify developers’ access to monitoring data, while operating in heterogeneous edge environments. If any of this interests you, or you have other questions, feel free to join our Slack group.

Pixie is an open-source project that gives you instant visibility into your application. It provides access to metrics, events, traces and logs without changing code. Pixie is in the process of being contributed to the CNCF (Cloud Native Computing Foundation). Pixie was originally created at Pixie Labs, Inc., and contributed to open source by New Relic, Inc.

James is a software engineer at New Relic on the Pixie team. He was a founding engineer at Pixie Labs.

Zain is the GM/GVP of Pixie and Open Source at New Relic. He is also an Adjunct Professor of Computer Science at Stanford University. He was the Co-founder/CEO of Pixie Labs.

Read More

PluggableDevice: Device Plugins for TensorFlow

Posted by Penporn Koanantakool and Pankaj Kanwar.

As the number of accelerators (GPUs, TPUs) in the ML ecosystem has exploded, there has been a strong need for seamless integration of new accelerators with TensorFlow. In this post, we introduce the PluggableDevice architecture which offers a plugin mechanism for registering devices with TensorFlow without the need to make changes in TensorFlow code.

This PluggableDevice architecture has been designed & developed collaboratively within the TensorFlow community. It leverages the work done for Modular TensorFlow, and is built using the StreamExecutor C API. The PluggableDevice mechanism is available in TF 2.5.

The need for seamless integration

Prior to this, any integration of a new device required changes to the core TensorFlow. This was not scalable because of several issues, for example:

  • Complex build dependencies and compiler toolchains. Onboarding a new compiler is nontrivial and adds to the technical complexity of the product.
  • Slow development time. Changes need code reviews from the TensorFlow team, which can take time. Added technical complexity also adds to the development and testing time for new features.
  • Combinatorial number of build configurations to test for. The changes made for a particular device might affect other devices or other components of TensorFlow. Each new device could increase the number of test configurations in a multiplicative manner.
  • Easy to break. The lack of a contract via a well-defined API means that it’s easier to break a particular device.

What is PluggableDevice?

The PluggableDevice mechanism requires no device-specific changes in the TensorFlow code. It relies on C APIs to communicate with the TensorFlow binary in a stable manner. Plug-in developers maintain separate code repositories and distribution packages for their plugins and are responsible for testing their devices. This way, TensorFlow’s build dependencies, toolchains, and test process are not affected. The integration is also less brittle since only changes to the C APIs or PluggableDevice components could affect the code.

The PluggableDevice mechanism has four main components:

  • PluggableDevice type: A new device type in TensorFlow which allows device registration from plug-in packages. It takes priority over native devices during the device placement phase.
  • Custom operations and kernels: Plug-ins register their own operations and kernels to TensorFlow through the Kernel and Op Registration C API.
  • Device execution and memory management: TensorFlow manages plug-in devices through the StreamExecutor C API.
  • Custom graph optimization pass: Plug-ins can register one custom graph optimization pass, which will be run after all standard Grappler passes, through the Graph Optimization C API.
chart of how a device plug-in interacts with TensorFlow
How a device plug-in interacts with TensorFlow.

Using PluggableDevice

To use a particular device, just as they would a native device in TensorFlow, users only need to install the device plug-in package for that device. The following code snippet shows how the plugin for a new device, say Awesome Processing Unit (APU), would be installed and used. For simplicity, let this APU plug-in only have one custom kernel for ReLU.

$ pip install tensorflow-apu-0.0.1-cp36-cp36m-linux_x86_64.whl

Successfully installed tensorflow-apu-0.0.1
$ python
Python 3.6.9 (default, Oct 8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf # TensorFlow registers PluggableDevices here
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:APU:0', device_type='APU')]

>>> a = tf.random.normal(shape=[5], dtype=tf.float32) # Runs on CPU
>>> b = tf.nn.relu(a) # Runs on APU

>>> with tf.device("/APU:0"):  # Users can also use 'with tf.device' syntax
...   c = tf.nn.relu(a)  # Runs on APU

>>> @tf.function  # Defining a tf.function
... def run():
...   d = tf.random.uniform(shape=[100], dtype=tf.float32)  # Runs on CPU
...   e = tf.nn.relu(d)  # Runs on APU
>>> run()  # PluggableDevices also work with tf.function and graph mode.
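
If you want to verify where each op actually runs, standard TensorFlow device placement logging works with plug-in devices too. A small sketch, using the hypothetical APU from above:

import tensorflow as tf  # PluggableDevices are registered on import

# Log the device every op is placed on; useful for confirming that
# supported ops (here, ReLU) really land on the plug-in device.
tf.debugging.set_log_device_placement(True)

a = tf.random.normal(shape=[5], dtype=tf.float32)
b = tf.nn.relu(a)  # The log should show placement such as /device:APU:0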

Upcoming PluggableDevices

We are excited to announce that Intel will be one of our first partners to release a PluggableDevice. Intel has made significant contributions to this effort, submitting over 3 RFCs implementing the overall mechanism. They will release an Intel extension for TensorFlow (ITEX) plugin package to bring Intel XPU to TensorFlow for AI workload acceleration. We also expect other partners to take advantage of PluggableDevice and release additional plug-ins.

We will publish a detailed tutorial on how to develop a PluggableDevice plug-in for partners who might be interested in leveraging this infrastructure. For questions on PluggableDevice, engineers can post directly on the RFC PRs [1, 2, 3, 4, 5, 6], or on the TensorFlow Forum with the tag pluggable_device.

Read More

How TensorFlow helps Edge Impulse make ML accessible to embedded engineers

Posted by Daniel Situnayake, Founding TinyML Engineer, Edge Impulse.

Microcontrollers that run our world

No matter where you are reading this right now—your home, your office, or sitting in a vehicle—you are likely surrounded by microcontrollers. They are the tiny, low-power computers that animate our modern world: from smart watches and kitchen appliances to industrial equipment and public transportation. Mostly hidden inside other products, microcontrollers are actually the most numerous type of computer, with more than 28 billion of them shipped in 2020.

The software that powers all these devices is written by embedded software engineers. They’re some of the most talented, detail-oriented programmers in the industry, tasked with squeezing every last drop of efficiency from tiny, inexpensive processors. A typical mid-range microcontroller—based around Arm’s popular Cortex-M4 architecture—might have a 32-bit processor running at just 64 MHz, with 256KB of RAM and 1MB of flash memory for storing a program. That doesn’t leave a lot of room for waste.

Since microcontrollers interface directly with sensors and hardware, embedded engineers are often experts in signal processing and electrical engineering—and they tend to have a lot of domain knowledge in their area of focus. One engineer might be an expert on the niche sensors used for medical applications, while another might focus on analyzing audio signals.

Embedded machine learning

In the past few years, a set of technologies have been developed that make it possible to run miniature, highly optimized machine learning models on low-power microcontrollers like the one described above. By using machine learning to interpret sensor data right at the source, embedded applications can become smarter, faster, and more energy efficient, making their own decisions rather than having to stream data to the cloud and wait for a response. This concept is known as embedded machine learning, or TinyML.

With their deep signal processing and domain expertise, embedded engineers are ideally placed to design this new generation of smart applications. However, embedded engineers tend to have highly specialized skill sets and use development toolchains that are often far removed from the Python-heavy stack preferred by data scientists and machine learning engineers.

It isn’t reasonable to expect domain experts to retrain as data scientists, or for data scientists to learn the embedded development skills required to work with microcontrollers. Instead, a new generation of tooling is required that will allow those with domain expertise to capture their knowledge and insight as machine learning models and deploy them to embedded devices—with help from machine learning experts as an optional extra.

The TinyML development process is similar to the traditional machine learning workflow. It starts with collecting, exploring, and evaluating a dataset. Next up, feature engineering takes the form of sophisticated digital signal processing, often using the types of algorithms that embedded engineers are already familiar with. Once features have been extracted from the data, a machine learning model is trained and evaluated—with a critical eye on its size, to make sure it will fit on a tiny microcontroller and run fast enough to be useful.

After the training, the model is optimized for size and efficiency. This often involves quantization, reducing the precision of the model’s weights so that they take up less precious memory. Once the model is ready, it must be deployed as a C++ library (the language of choice for the majority of embedded platforms) that includes all of the operator kernels required to run it. The embedded engineer can then write and tune an application that interprets the model’s output and uses it to make decisions.

Throughout this process, it’s important to carefully evaluate the model and application to ensure that it functions in the way that it is intended to when used in a real world environment. Without adequate monitoring and review, it’s possible to create models that seem superficially accurate but that fail in harmful ways when exposed to real world data.

Edge Impulse and TensorFlow

The Edge Impulse team has created an end-to-end suite of tooling that helps embedded engineers and domain experts build and test machine learning applications. Edge Impulse is designed to integrate beautifully with the tools that embedded engineers use every day, providing a high-level interface for incorporating machine learning into projects.

Edge Impulse makes use of the TensorFlow ecosystem for training, optimizing, and deploying deep learning models to embedded devices. While it was designed with non-ML engineers in mind, the philosophy behind Edge Impulse is that it should be extensible by machine learning experts and flexible enough to incorporate their insights and additions—from hand-tuned model architectures and loss functions to custom operator kernels.

This extensibility is made possible by the TensorFlow ecosystem, which provides a set of standards and integration points that experts can use to make their own improvements.

Training a tiny model

This process starts during training. Novice ML developers using Edge Impulse can use a library of preset deep learning model architectures designed to work well with embedded devices. For example, this simple convolutional model is intended for classifying ambient noise:

Neural network architecture
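
The preset itself isn’t reproduced here; as a rough sketch (not Edge Impulse’s actual architecture), a small convolutional Keras classifier over extracted audio features might look something like this:

import tensorflow as tf

# Illustrative only: a tiny 1D convolutional classifier over, say, a
# 99x13 MFCC feature matrix, small enough to quantize for a microcontroller.
model = tf.keras.Sequential([
  tf.keras.layers.InputLayer(input_shape=(99, 13)),
  tf.keras.layers.Conv1D(8, kernel_size=3, activation='relu'),
  tf.keras.layers.MaxPooling1D(2),
  tf.keras.layers.Conv1D(16, kernel_size=3, activation='relu'),
  tf.keras.layers.MaxPooling1D(2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dropout(0.25),
  tf.keras.layers.Dense(3, activation='softmax'),  # e.g. three ambient-noise classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])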

Under the hood, Edge Impulse generates a Python implementation of the model using TensorFlow’s Keras APIs. More experienced developers can customize the layers of the deep learning network, tweaking parameters and adding new layers that are reflected in the underlying Keras model. And expert developers have access to edit the training code itself, directly within the UI:

code snippet

Since Edge Impulse uses TensorFlow libraries and APIs, it’s incredibly simple to extend the built-in training code with your own logic. For example, the tf.data.Dataset class is used to provide an efficient pipeline to the training and validation datasets. This pipeline can easily be extended to add transformations, such as the data augmentation function seen in the following screenshot from an image classification project:

code snippet
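
A sketch of what that kind of extension looks like in plain TensorFlow (the augmentation function and the train_dataset name here are illustrative, not Edge Impulse’s actual code):

import tensorflow as tf

def augment_image(image, label):
  # Simple augmentations for an image classification project.
  image = tf.image.random_flip_left_right(image)
  image = tf.image.random_brightness(image, max_delta=0.2)
  return image, label

# train_dataset is assumed to be an existing tf.data.Dataset of (image, label) pairs.
train_dataset = train_dataset.map(augment_image, num_parallel_calls=tf.data.AUTOTUNE)
train_dataset = train_dataset.shuffle(1024).batch(32).prefetch(tf.data.AUTOTUNE)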

For in-depth experiments, developers can download a Jupyter Notebook containing all of the dependencies required to run their training script locally.

Jupyter Notebook

Any custom model code using the TensorFlow APIs fits seamlessly into the end-to-end pipeline hosted by Edge Impulse. Training is run in the cloud, and trained models are automatically optimized for embedded deployment using a combination of TensorFlow utilities and Edge Impulse’s own open source technologies.

Model optimization

Quantization is the most common form of optimization used when deploying deep learning models to embedded devices. Edge Impulse uses TensorFlow’s Model Optimization Toolkit to quantize models, reducing their weights’ precision from float32 to int8 with minimal impact on accuracy.
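
Edge Impulse automates this step, but as a sketch of what full-integer post-training quantization looks like with the standard TensorFlow Lite converter (model is a trained Keras model and representative_samples is an assumed array of calibration inputs):

import numpy as np
import tensorflow as tf

def representative_dataset():
  # A few hundred training samples, used to calibrate int8 ranges for activations.
  for sample in representative_samples[:200]:
    yield [np.expand_dims(sample, axis=0).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()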

Using TensorFlow Lite for Microcontrollers along with the emulation software Renode, Edge Impulse provides developers with an accurate estimate of the latency and memory usage of their model once it is deployed to the target embedded device. This makes it easy to determine the impact of optimizations such as quantization across different slices of the dataset:

A comparison between int8 quantized and unoptimized versions of the same model, showing the difference in performance and results.
A comparison between int8 quantized and unoptimized versions of the same model, showing the difference in performance and results.

For maximum flexibility and compatibility with developers’ existing workflows, the trained model is available for download in multiple formats. Developers can choose to export the original model as a TensorFlow SavedModel, or download one of several optimized models using the portable TensorFlow Lite flatbuffer format:

Download links for models serialized using TensorFlow’s SavedModel and TensorFlow Lite formats.
Download links for models serialized using TensorFlow’s SavedModel and TensorFlow Lite formats.

Deployment

Once a model has been trained and tested there are multiple ways to deploy it to the target device. Embedded engineers work heavily with C++, so the standard option is to export a C++ SDK: a library of optimized source code that implements both the signal processing pipeline and the deep learning model. The SDK has a permissive open source license, so developers are free to use it in any project or share it with others.

There are two main options for running deep learning models, both of which make use of TensorFlow technologies. The first, Edge Impulse’s EON Compiler, is a code generation tool that converts TensorFlow Lite models into human readable C++ programs.

Enabling EON Compiler
Enabling EON Compiler can reduce memory usage by up to 50% with no impact on model accuracy.

EON Compiler makes use of the operator kernels implemented in TensorFlow Lite for Microcontrollers, invoking them in an efficient manner that doesn’t require the use of an interpreter. This results in memory savings of up to 50%. It automatically applies any available optimized kernels for the target device, meaning libraries such as Arm’s CMSIS-NN will be used where appropriate.

Some projects benefit from additional flexibility. In these cases, developers can choose to export a library that uses the TensorFlow Lite for Microcontrollers interpreter to run the model. This can be useful for developers who wish to experiment with custom kernel implementations for their specific hardware, or who are working within an environment that has TensorFlow Lite for Microcontrollers built in.

In addition to the C++ SDK, developers can choose to target specific environments. For example, a TensorRT library provides optimized support for NVIDIA’s Jetson Nano embedded Linux developer kit. This interoperability is enabled by the extensive TensorFlow ecosystem and open source community, which has tooling for numerous platforms and targets.

TensorRT library
Models can be optimized and exported for targets in the broader TensorFlow ecosystem, such as NVIDIA’s Jetson Nano.

Enabling new technologies

TensorFlow is unique amongst deep learning frameworks due to its broad, mature, and extensible set of technologies for training and deploying models to embedded devices. TensorFlow formats, such as the TensorFlow Lite flatbuffer, have become de facto standards amongst companies bringing deep learning models to the edge.

The TensorFlow ecosystem has been key to enabling the growth of embedded machine learning, enabling companies like Edge Impulse to put artificial intelligence in the hands of domain experts who are building the next generation of consumer and industrial technologies.

If you’d like to learn more about embedded machine learning using Edge Impulse and TensorFlow, there are many options. Take a look at the Introduction to Embedded Machine Learning course on Coursera, or jump right in with the Getting Started guide or Recognize sounds from audio tutorial. You can even check out a public Edge Impulse project that you can clone and customize with a single click.

Daniel Situnayake

Founding TinyML Engineer, Edge Impulse.

Read More

New Courses: Machine Learning Engineering for Production

Posted by Robert Crowe and Jocelyn Becker

AI

Have you mastered the art of building and training ML models, and are now ready to use them in a production deployment for a product or service? If so, we have a new set of courses to get you going. Built as a collaboration between the TensorFlow team, Andrew Ng, and deeplearning.ai, the new set of courses are launching as a specialization on Coursera: The Machine Learning Engineering for Production (MLOps) specialization.

The new specialization builds on the foundational knowledge taught in the popular specialization, DeepLearning.AI TensorFlow Developer Professional Certificate, that teaches how to build machine learning models with TensorFlow. The new MLOps specialization kicks off with an introductory course taught by Andrew Ng, followed by courses taught by Robert Crowe and Laurence Moroney that dive into the details of getting your models out to users.

Every lesson comes with plenty of hands-on exercises that give you practice at preparing your data, and training and deploying models.

By the end of the specialization, you’ll be ready to design and deploy an ML production system end-to-end. You’ll understand project scoping, data needs, modeling strategies, and deployment requirements. You’ll know how to optimize your data, models, and infrastructure to manage costs. You’ll know how to validate the integrity of your data to get it ready for production use, and then prototype, develop, and deploy your machine learning models, monitor the outcomes, and update the datasets and retrain the models continuously.

You’ll learn how to implement feature engineering, transformation, and selection with TFX as well as how to use analytics to address model fairness and explainability issues, and how to mitigate bottlenecks. You’ll also explore different scenarios and case studies of ML in practice, from personalization systems to automated vehicles.

Typical ML pipeline
You’ll learn how processing requirements are different in deployment than in training.
Use of Accelerators in Serving Infrastructure
You’ll learn about different tools and platforms for deploying your machine learning systems.
Product recommendations
A common use of ML in production is personalization systems for product recommendations.
Autonomous Driving Systems
A cutting edge use of ML in practice is to guide automated vehicles.

Despite the growing recognition of AI/ML as a crucial pillar of digital transformation, successful ML deployments are a bottleneck for getting value from AI. For example, 72% of a cohort of organizations that began AI pilots before 2019 haven’t deployed even a single application in production. A survey by Algorithmia of the state of enterprise machine learning found that 55% of companies surveyed haven’t deployed an ML model.

Models often don’t make it into production, and when they do, they can break because they fail to adapt to changes in the environment. Deloitte identified lack of talent and integration issues as factors that can stall or derail AI initiatives. This is where ML engineering and MLOps are essential. ML engineering provides a superset of the discipline of software engineering that handles the unique complexities of the practical applications of ML. MLOps is a methodology for ML engineering that unifies ML system development (the ML element) with ML system operations (the Ops element).

Unfortunately, job candidates with ML engineering and MLOps skills are relatively hard to find and expensive to hire. Our new MLOps specialization teaches a broad range of many of the skills necessary to work in this field, and will help prepare developers for current and future workplace challenges. We believe that this is a valuable contribution to the ML community, and we’re excited to make it available.

Enroll today to develop your machine learning engineering skills, and learn how to roll out your ML models to benefit your company and your users.

Read More

Introducing TensorFlow Decision Forests

Posted by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, Jan Pfeifer

We are happy to open source TensorFlow Decision Forests (TF-DF). TF-DF is a collection of production-ready state-of-the-art algorithms for training, serving and interpreting decision forest models (including random forests and gradient boosted trees). You can now use these models for classification, regression and ranking tasks – with the flexibility and composability of TensorFlow and Keras.

GIF showing Random Forest decision model
Random Forests are a popular type of decision forest model. Here, you can see a forest of trees classifying an example by voting on the outcome.

About decision forests

Decision forests are a family of machine learning algorithms with quality and speed competitive with (and often superior to) neural networks, especially when you’re working with tabular data. They’re built from many decision trees, which makes them easy to use and understand – and you can take advantage of a plethora of interpretability tools and techniques that already exist today.

TF-DF brings this class of models along with a suite of tailored tools to TensorFlow users:

  • Beginners will find it easier to develop and explain decision forest models. There is no need to explicitly list or pre-process input features (as decision forests can naturally handle numeric and categorical attributes), specify an architecture (for example, by trying different combinations of layers like you would in a neural network), or worry about models diverging. Once your model is trained, you can plot it directly or analyse it with easy to interpret statistics.
  • Advanced users will benefit from models with very fast inference time (sub-microseconds per example in many cases). And, this library offers a great deal of composability for model experimentation and research. In particular, it is easy to combine neural networks and decision forests.

If you’re already using decision forests outside of TensorFlow, here’s a little of what TF-DF offers:

  • It provides a slew of state-of-the-art Decision Forest training and serving algorithms such as random forests, gradient-boosted trees, CART, (Lambda)MART, DART, Extra Trees, greedy global growth, oblique trees, one-side-sampling, categorical-set learning, random categorical learning, out-of-bag evaluation and feature importance, and structural feature importance.
  • This library can serve as a bridge to the rich TensorFlow ecosystem by making it easier for you to integrate tree-based models with various TensorFlow tools, libraries, and platforms such as TFX.
  • And for users new to neural networks, you can use decision forests as an easy way to get started with TensorFlow, and continue to explore neural networks from there.

Code example

A good example is worth a thousand words. So in this blog post, we will show how easy it is to train a model with TensorFlow Decision Forests. More examples are available on the TF-DF website and GitHub page. You may also watch our talk at Google I/O 2021.

Training a model

Let’s start with a minimal example where we train a random forest model on the tabular Palmer’s Penguins dataset. The objective is to predict the species of an animal from its characteristics. The dataset contains both numerical and categorical features and is stored as a csv file.

Three examples from the Palmer's Penguins dataset.
Three examples from the Palmer’s Penguins dataset.

Let’s train a model:

# Install TensorFlow Decision Forests
!pip install tensorflow_decision_forests

# Load TensorFlow Decision Forests
import tensorflow_decision_forests as tfdf

# Load the training dataset using pandas
import pandas
train_df = pandas.read_csv("penguins_train.csv")

# Convert the pandas dataframe into a TensorFlow dataset
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="species")

# Train the model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

Observe that nowhere in the code did we provide input features or hyperparameters. This means TensorFlow Decision Forests will automatically detect the input features from this dataset and use default values for all hyperparameters.

Evaluating a model

Now, let’s evaluate the quality of our model:

# Load the testing dataset
test_df = pandas.read_csv("penguins_test.csv")

# Convert it to a TensorFlow dataset
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="species")

# Evaluate the model
model.compile(metrics=["accuracy"])
print(model.evaluate(test_ds))
# >> 0.979311
# Note: Cross-validation would be more suited on this small dataset.
# See also the "Out-of-bag evaluation" below.

# Export the model to a TensorFlow SavedModel
model.save("project/my_first_model")

Easy, right? And a default RandomForest model with default hyperparameters provides a quick and good baseline for most problems. Decision forests in general will train quickly for small and medium sized problems, require less hyperparameter tuning compared to many other types of models, and will often provide strong results.

Interpreting a model

Now that you have looked at the accuracy of the trained model, let’s consider its interpretability. Interpretability is important if you wish to understand and explain the phenomenon being modeled, debug a model, or begin to trust its decisions. As noted above, we have provided a number of tools to interpret trained models, beginning with plots.

tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0)
tree structure

You can visually follow the tree structure. In this tree, the first decision is based on the bill length. Penguins with bills longer than 42.2mm are likely to be the blue (Gentoo) or green (Chinstrap) species, while the ones with shorter bills are likely to be of the red species (Adelie).

For the first group, the tree then asks about the flipper length. Penguins with flippers longer than 206.5mm are likely to be of the green species (Chinstrap), while the remaining are likely to be of the blue species (Gentoo).

Model statistics are complementary additions to plots. Example statistics include:

  • How many times is each feature used?
  • How fast did the model train (in number of trees and time)?
  • How are the nodes distributed in the tree structure (for example, what is the length of most branches?)

These and answers to more such inquiries are included in the model summary and accessible in the model inspector.

# Print all the available information about the model
model.summary()
>> Input Features (7):
>> bill_depth_mm
>> bill_length_mm
>> body_mass_g
>> ...
>> Variable Importance:
>> 1. "bill_length_mm" 653.000000 ################
>> ...
>> Out-of-bag evaluation: accuracy:0.964602 logloss:0.102378
>> Number of trees: 300
>> Total number of nodes: 4170
>> ...

# Get feature importance as an array
model.make_inspector().variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
>> [("flipper_length_mm", 0.149),
>> ("bill_length_mm", 0.096),
>> ("bill_depth_mm", 0.025),
>> ("body_mass_g", 0.018),
>> ("island", 0.012)]

In the example above, the model was trained with default hyperparameter values. This is a good first solution, but “tuning” the hyperparameters can often further improve the quality of the model. That can be done as follows:

# List all the other available learning algorithms
tfdf.keras.get_all_models()
>> [tensorflow_decision_forests.keras.RandomForestModel,
>> tensorflow_decision_forests.keras.GradientBoostedTreesModel,
>> tensorflow_decision_forests.keras.CartModel]

# Display the hyper-parameters of the Gradient Boosted Trees model
? tfdf.keras.GradientBoostedTreesModel
>> A GBT (Gradient Boosted [Decision] Tree) is a set of shallow decision trees trained sequentially. Each tree is trained to predict and then "correct" for the errors of the previously trained trees (more precisely each tree predicts the gradient of the loss relative to the model output)..
...
Attributes:
num_trees: num_trees: Maximum number of decision trees. The effective number of trained trees can be smaller if early stopping is enabled. Default: 300.
max_depth: Maximum depth of the tree. `max_depth=1` means that all trees will be roots. Negative values are ignored. Default: 6.
...

# Create another model with specified hyper-parameters
model = tfdf.keras.GradientBoostedTreesModel(
    num_trees=500,
    growing_strategy="BEST_FIRST_GLOBAL",
    max_depth=8,
    split_axis="SPARSE_OBLIQUE",
)

# Train the new model before evaluating it
model.fit(train_ds)

# Evaluate the model
model.compile(metrics=["accuracy"])
print(model.evaluate(test_ds))
# >> 0.986851

Next steps

We hope you enjoyed reading this short demonstration of TensorFlow Decision Forests, and that you are as excited to use it and contribute to it as we are to develop it.

With TensorFlow Decision Forests, you can now train state-of-the-art Decision Forests models with maximum speed and quality and with minimal effort in TensorFlow. And if you feel adventurous, you can now combine decision forests and neural networks together to create new types of hybrid models.

If you would like to learn more about the TensorFlow Decision Forests library, we have put together a number of resources and recommend the following:

If you have any questions, please ask them on discuss.tensorflow.org using the tag “TFDF” and we’ll do our best to help. Thanks again.

Read More

Run Your First Multi-Worker TensorFlow Training Job With GCP AI Platform

Posted by Nikita Namjoshi, Machine Learning Solutions Engineer

TensorFlow Header

When a single machine is not enough, it’s time to train and iterate faster with TensorFlow’s MultiWorkerMirroredStrategy. In this tutorial-style article you’ll learn how to launch a multi-worker training job on Google Cloud Platform (GCP) using AI Platform Training. You’ll also learn the basics of how TensorFlow distributes data and implements synchronous data parallelism across multiple machines. While this article focuses on a managed solution on GCP, you can also do all of this entirely in open-source on your own hardware.

Overview of Distributed Training

If you have a single GPU, TensorFlow will use this accelerator to speed up model training with no extra work on your part. However, if you want to get an additional boost from using multiple GPUs on a single machine or multiple machines (each with potentially multiple GPUs), then you’ll need to use tf.distribute, which is TensorFlow’s library for running a computation across multiple devices.

The simplest way to get started with distributed training is a single machine with multiple GPU devices. A TensorFlow distribution strategy from the tf.distribute module will manage the coordination of data distribution and gradient updates across all of the GPUs. If you want to learn more about training in this scenario, check out the previous post on distributed training basics.

If you’ve mastered single host training and are looking to scale even further, then adding multiple machines to your cluster can help you get an even greater performance boost. You can make use of a cluster of machines that are CPU only, or that each have one or more GPUs.

There are many ways to do multi-worker training on GCP. In this article we’ll use AI Platform Training, as it’s the quickest way to launch a distributed training job and has additional features that make it very easy to include as part of your production pipeline. To use this managed service, you’ll need to add a bit of extra code to your program and set up a config file that is specific to AI Platform. However, you will not have to endure the pains of GPU driver installation or cluster management, which can be very challenging in a distributed scenario.

Multi-Worker Cluster Configuration

The tf.distribute module currently provides two strategies for multi-worker training. In TensorFlow 2.5, ParameterServerStrategy is experimental, and MultiWorkerMirroredStrategy is a stable API.

Like its single-worker counterpart, MirroredStrategy, MultiWorkerMirroredStrategy is a synchronous data parallelism strategy that you can use with only a few code changes.

However, unlike MirroredStrategy, for a multi-worker setup TensorFlow needs to know which machines are part of your cluster. This is generally specified with the environment variable TF_CONFIG.

os.environ["TF_CONFIG"] = json.dumps({
"cluster": {
"chief": ["host1:port"],
"worker": ["host2:port", "host3:port"],
},
"task": {"type": "worker", "index": 1}
})

In this simple TF_CONFIG example, the “cluster” key contains a dictionary with the internal IPs and ports of all the machines. In MultiWorkerMirroredStrategy, all machines are designated as workers, which are the physical machines on which the replicated computation is executed. In addition to each machine being a worker, there needs to be one worker that takes on some extra work such as saving checkpoints and writing summary files to TensorBoard. This machine is known as the chief (or by its deprecated name master).

After you’ve added your machines to the cluster key, the next step is to set the “task”. This specifies the task type and task index of the current machine, which is an index into the cluster dictionary. The cluster key should be the same on each machine, but the task keys will be different.

Conveniently, when using AI Platform Training, the TF_CONFIG environment variable is set for you on each machine in your cluster so you don’t need to worry about this set up!

However, if you were trying to run a multi-worker job with, for example, 3 instances on Google Compute Engine, you would need to set this environment variable on each machine as shown below. For the machines that are not the chief, the TF_CONFIG looks the same except the task index increments by 1.

Machine 1 (Chief)

os.environ["TF_CONFIG"] = json.dumps({
"cluster": {
"chief": ["host1:port"],
"worker": ["host2:port", "host3:port"],
},
"task": {"type": "chief", "index": 0}
})

Machine 2

os.environ["TF_CONFIG"] = json.dumps({
"cluster": {
"chief": ["host1:port"],
"worker": ["host2:port", "host3:port"],
},
"task": {"type": "worker", "index": 0}
})

Machine 3

os.environ["TF_CONFIG"] = json.dumps({
"cluster": {
"chief": ["host1:port"],
"worker": ["host2:port", "host3:port"],
},
"task": {"type": "worker", "index": 1}
})

Setting this environment variable is fairly easy to do when you have only a few machines in your cluster; however, once you start scaling up, you don’t want to be assigning this variable to each machine manually. As mentioned earlier, one of the many benefits of using AI Platform is that this coordination happens automatically. The only configuration you have to provide is the number of machines in your cluster, and the number and type of GPUs per machine. We’ll do this step in a later section.

Set up the Distribution Strategy

In this Colab notebook, you’ll find the code to train a ResNet50 architecture on the Cassava dataset. In the following sections, we’ll review the new code that needs to be added to our program in order to do distributed training on multiple machines.

As with any strategy in the tf.distribute module, step one is to instantiate the strategy.

strategy = tf.distribute.MultiWorkerMirroredStrategy()

Note that there is a limitation where the instance of MultiWorkerMirroredStrategy needs to be created at the beginning of the program. Code that may create ops should be placed after the strategy is instantiated.

Next, you wrap the creation of your model variables within the strategy’s scope. This crucial step tells TensorFlow which variables should be mirrored across the replicas.

with strategy.scope():
  model = create_model()
  model.compile(
      loss='sparse_categorical_crossentropy',
      optimizer=tf.keras.optimizers.Adam(0.0001),
      metrics=['accuracy'])

Lastly, you’ll need to scale your batch size by the number of replicas in your cluster. This ensures that each replica processes the same number of examples on each step.

per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

If you’ve used MirroredStrategy before, then the previous steps should be familiar. The main difference when moving from synchronous data parallelism on one machine to many is that the gradients at the end of each step now need to be synchronized across all GPUs in a machine and across all machines in the cluster. This additional step of synchronizing across the machines increases the overhead of distribution.

In TensorFlow, the multi-worker all-reduce communication is achieved via CollectiveOps. You don’t need to know much detail to execute a successful and performant training job, but at a high level, a collective op is a single op in the TensorFlow graph that can automatically choose an all-reduce algorithm according to factors such as hardware, network topology, and tensor sizes.

Dataset Sharding

In the single worker case, at each step your dataset is divided up across the replicas on your machine. This data splitting process becomes slightly more complicated in the multi-worker case. The data now also needs to be sharded, meaning that each worker is assigned a subset of the entire dataset. Therefore, at each step a global batch of non-overlapping dataset elements is processed, with each worker handling its own portion. This sharding happens automatically with tf.data.experimental.AutoShardPolicy.

By default, TensorFlow will first attempt to shard your data by FILE. This means that if your data exists across multiple files, each worker will process different file(s) and split the corresponding data amongst the replicas. FILE is the default autoshard policy because MultiWorkerMirroredStrategy works best for use cases with very large datasets, which are likely to not be in a single file. However, this option can lead to idle workers if the number of files is not divisible by the number of workers, or if some files are substantially longer than others.

If your data is not stored in multiple files, then the AutoShardPolicy will fall back to DATA, meaning that TensorFlow will autoshard the elements across all the workers. This guards against the potential idle worker scenario, but the downside is that the entire dataset will be read on each worker. You can read more about the different policies and see examples in the Distributed Input guide.

If you don’t want to use the default AUTO policy, you can set the desired AutoShardPolicy with the following code:

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_data = train_data.with_options(options)

Save Your Model

Saving your model is slightly more complicated in the multi-worker case because the destination needs to be different for each of the workers. The chief worker will save to the desired model directory, while the other workers will save the model to temporary directories. It’s important that these temporary directories are unique in order to prevent multiple workers from writing to the same location. Saving can contain collective ops, so all workers must save and not just the chief.

The following is boilerplate code that implements the intended saving logic, as well as some cleanup to delete the temporary directories once the training has completed. Note that the model_path is the name of the Google Cloud Storage (GCS) bucket where your model will be saved at the end of training.

model_path = 'gs://{path_to_your_gcs_bucket}'

# Note that with MultiWorkerMirroredStrategy,
# the program is run on every worker.
def _is_chief(task_type, task_id):
  # Note: there are two possible `TF_CONFIG` configurations.
  # 1) In addition to `worker` tasks, a `chief` task type is used.
  #    The implementation demonstrated here is for this case.
  # 2) Only `worker` task type is used; in this case, worker 0 is
  #    regarded as the chief. In this case, this function
  #    should be modified to
  #    return (task_type == 'worker' and task_id == 0) or task_type is None
  return task_type == 'chief'


def _get_temp_dir(dirpath, task_id):
  base_dirpath = 'workertemp_' + str(task_id)
  temp_dir = os.path.join(dirpath, base_dirpath)
  tf.io.gfile.makedirs(temp_dir)
  return temp_dir

def write_filepath(filepath, task_type, task_id):
  dirpath = os.path.dirname(filepath)
  base = os.path.basename(filepath)
  if not _is_chief(task_type, task_id):
    dirpath = _get_temp_dir(dirpath, task_id)
  return os.path.join(dirpath, base)

# Determine type and task of the machine from
# the strategy cluster resolver
task_type, task_id = (strategy.cluster_resolver.task_type,
                      strategy.cluster_resolver.task_id)

# Based on the type and task, write to the desired model path
write_model_path = write_filepath(model_path, task_type, task_id)
model.save(write_model_path)

Everything we’ve covered about setting up the distribution strategy, sharding data, and saving models applies whether you’re training on GCP, your own hardware, or another cloud platform.

Prepare code for AI Platform

The basic prerequisites for using AI Platform are that you need to have a GCP project with billing enabled, the AI Platform APIs enabled, and sufficient AI Platform quota. If any of these steps are a mystery to you, refer to the previous post to get up to speed on GCP basics.

If you’re already familiar with training on AI Platform with a single node, then you’ll likely breeze through this section. We’ll take the pieces we walked through in the previous section, and do a bit of rearranging to match AI Platform Training convention. All of the code can be found in this GitHub repo, but we’ll walk through it in detail in this section.

By AI Platform convention, training code is arranged according to the diagram below. The task.py file contains the code that executes your training job. The example in this tutorial also includes a model.py file, which has the Keras functional API code for the model. For more complex production applications you’ll likely have additional util.py or setup.py files, and you can see where those fit in the hierarchy below.

diagram showing path of file

Model code

The model.py file can be found on GitHub here. You can see that this file just has the code for building the ResNet50 model architecture.

Task code

The task.py file can be found on GitHub here. This file contains the main function, which will execute the training job and save the model.

def main():
  args = get_args()
  strategy = tf.distribute.MultiWorkerMirroredStrategy()
  global_batch_size = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync
  train_data, number_of_classes = create_dataset(global_batch_size)

  with strategy.scope():
    model = create_model(number_of_classes)

  model.fit(train_data, epochs=args.epochs)

  # Determine type and task of the machine from
  # the strategy cluster resolver
  task_type, task_id = (strategy.cluster_resolver.task_type,
                        strategy.cluster_resolver.task_id)

  # Based on the type and task, write to the desired model path
  write_model_path = write_filepath(args.job_dir, task_type, task_id)
  model.save(write_model_path)

In this simple example, the data preprocessing happens directly in the task.py file, but in reality for more complicated data processing you would probably want to split out this code into a separate data.py file that you can import into task.py (for example if your preprocessing includes parsing TFRecord files).

We explicitly set the AutoShardPolicy to DATA in this case because the Cassava dataset is not downloaded as multiple files. However, if we did not set the policy to DATA, the default AUTO policy would kick in and the end result would be the same.

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_data = train_data.with_options(options)
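
For reference, a hypothetical create_dataset along these lines (assuming the Cassava dataset from TensorFlow Datasets; the preprocessing and names here are illustrative, not necessarily the exact notebook code) might look like:

import tensorflow as tf
import tensorflow_datasets as tfds

def preprocess_data(image, label):
  # Resize and scale images for a ResNet50-style model.
  image = tf.image.resize(image, (224, 224))
  return tf.cast(image, tf.float32) / 255., label

def create_dataset(global_batch_size):
  data, info = tfds.load('cassava', split='train', as_supervised=True, with_info=True)
  number_of_classes = info.features['label'].num_classes

  train_data = data.map(preprocess_data, num_parallel_calls=tf.data.AUTOTUNE)
  train_data = train_data.shuffle(1000).batch(global_batch_size).prefetch(tf.data.AUTOTUNE)

  # Shard by DATA since Cassava is not stored across many files.
  options = tf.data.Options()
  options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
  train_data = train_data.with_options(options)
  return train_data, number_of_classes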

The task.py file also parses any command line arguments we need. In this simple example, the epochs are passed in via the command line. Additionally, we need to parse the argument job-dir, which is the GCS bucket where our model will be stored.

def get_args():
  '''Parses args.'''
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--epochs',
      required=True,
      type=int,
      help='number training epochs')
  parser.add_argument(
      '--job-dir',
      required=True,
      type=str,
      help='bucket to save model')
  args = parser.parse_args()
  return args

Lastly, the task.py file contains our boilerplate code for saving the model. For a production example, you probably would want to add this boilerplate to a util.py file, but again for this simple example we’ll keep everything in one file.

Custom Container Set up

AI Platform provides standard runtimes for you to execute your training job. While these runtimes might work for your use case, more specialized needs require a custom container. In this section, we’ll walk through how to set up your container image and push it to Google Container Registry (GCR).

Write Your Dockerfile

The following Dockerfile specifies the base image, using the TensorFlow 2.5 Enterprise GPU Deep Learning Container. Using the TensorFlow Enterprise image as our base image provides a useful design pattern for developing on GCP. TensorFlow Enterprise is a distribution of TensorFlow that is optimized for GCP. You can use TensorFlow Enterprise with AI Platform Notebooks, the Deep Learning VMs, and AI Platform Training, providing a seamless transition between different environments.

The code in our trainer directory is copied to the Docker image, and our entry point is the task.py script, which we will run as a module.

# Specifies base image and tag
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
WORKDIR /root

# Copies the trainer code to the docker image.
COPY trainer/ /root/trainer/

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

Push Your Dockerfile to GCR

Next, we’ll set up some useful environment variables. You can select any name of your choosing for IMAGE_REPO_NAME and IMAGE_TAG. If you have not already set up the Google Cloud SDK, you can follow the steps here, as you’ll need to use the gcloud tool to push your container and kick off the training job.

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME={your_repo_name}
export IMAGE_TAG={your_image_tag}
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG

Next, you’ll build your Dockerfile.

docker build -f Dockerfile -t $IMAGE_URI ./

Lastly, you can push your image to GCR.

gcloud auth configure-docker
docker push $IMAGE_URI

If you navigate to the GCR page in the GCP console UI, you should see your newly pushed image.

Configure Your Cluster

The final step before we can kick off our training job is to set up the cluster. AI Platform offers a set of predefined cluster specifications called scale tiers, but we’ll need to provide our own cluster setup for distributed training.

In the following config.yaml file, we’ve designated one master (equivalent to chief) and one worker. Each machine has one NVIDIA T4 Tensor Core GPU. For both machines, you’ll also need to specify the imageUri as the image you pushed to GCR in the previous step.

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8
  masterConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4
    imageUri: gcr.io/{path/to/image}:{tag}
  useChiefInTfConfig: true
  workerType: n1-standard-8
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4
    imageUri: gcr.io/{path/to/image}:{tag}

In case you’re wondering what the useChiefInTfConfig flag does, TensorFlow uses the terminology “Chief” and AI Platform uses the terminology “Master”, so this flag will manage that discrepancy. You don’t need to worry about the details (although you will see an error message if you forget to set this flag!).

Feel free to experiment with this configuration by adding machines, adding GPUs, or removing all GPUs and training with CPUs only. You can see the supported regions and GPU types here for AI Platform, so just make sure your project has sufficient quota for whatever configuration you choose.

Launch Your Training Job

You can launch your training job easily with the following command:

gcloud ai-platform jobs submit training {job_name} \
  --region europe-west2 \
  --config config.yaml \
  --job-dir gs://{gcs_bucket/model_dir} \
  -- \
  --epochs 5

In the command above, you’ll need to give your job a name. In addition to passing in the region, you’ll need to define job-dir, which is the directory in your GCS bucket where you want your saved model file to be stored after training completes.

The empty -- flag marks the end of the gcloud-specific flags and the start of the args that you want to pass to your application (in this case, just the epochs).

After executing the training command, you should see the following message.

code snippet

You can navigate to the AI Platform UI in the GCP console and track the status of your job.

You’ll notice that your job will take around ten minutes to launch. This overhead might seem huge in our simple example where it doesn’t even take ten minutes to train on a single GPU. However, this overhead will be amortized for large jobs.

job details screen

When the job completes training, you’ll see a green check mark next to the job. You can then click the Model location URI and you’ll find your saved_model.pb file.

What’s Next

You now know the basics of launching a multi-worker training job on GCP. You also know the core concepts of MultiWorkerMirroredStrategy. To take your skills to the next level, try leveraging AI Platform’s hyperparameter tuning feature for your next training job (in open-source, you can use Keras Tuner), or using TFRecord files as your input data. You can also try out Parameter Server Strategy if you’d like to explore asynchronous training in TensorFlow. Happy distributed training!
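
If you go the open-source route, a minimal Keras Tuner sketch looks something like the following. The search space, the validation_data variable, and the assumption that a model-building function like create_model returns an uncompiled model are all illustrative.

import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
  # Hypothetical search space: tune only the learning rate of the model.
  model = create_model(number_of_classes)  # assumed to return an uncompiled Keras model
  learning_rate = hp.Float('learning_rate', 1e-5, 1e-2, sampling='log')
  model.compile(
      loss='sparse_categorical_crossentropy',
      optimizer=tf.keras.optimizers.Adam(learning_rate),
      metrics=['accuracy'])
  return model

tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=10)
tuner.search(train_data, validation_data=validation_data, epochs=5)
best_model = tuner.get_best_models(num_models=1)[0]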

Read More

Recap of TensorFlow at Google I/O 2021

Posted by the TensorFlow team

TensorFlow recap header

Thanks to everyone who joined our virtual I/O 2021 livestream! While we couldn’t meet in person, we hope we were able to make the event more accessible than ever. In this article, we’re recapping a few of the updates we shared during the keynote. You can watch the keynote below, and you can find recordings of every talk on the TensorFlow YouTube channel. Here’s a summary of a few announcements by product area (and there’s more in the videos, so be sure to check them out, too).

TensorFlow for Mobile and Web

The TensorFlow Lite runtime will be bundled with Google Play services

Let’s start with the announcement that the TensorFlow Lite runtime is going to be bundled with Google Play services, meaning you don’t need to distribute it with your app. This can greatly reduce your app’s bundle size. Now you can distribute your model without needing to worry about the runtime. You can sign up for an early access program today, and we expect a full rollout later this year.

You can now run TensorFlow Lite models on the web

All your TensorFlow Lite models can now directly be run on the web in the browser with the new TFLite Web APIs that are unified with TensorFlow.js. This task-based API supports running all TFLite Task Library models for image classification, object detection, image segmentation, and many NLP problems. It also supports running arbitrary, custom TFLite models with easy, intuitive TensorFlow.js compatible APIs. With this option, you can unify your mobile and web ML development with a single stack.

A new On-Device Machine Learning site

We understand that the most effective developer path to reach Android, the Web and iOS isn’t always the most obvious. That’s why we created a new On-Device Machine Learning site to help you navigate your options, from turnkey to custom models, from cross platform mobile, to in-browser. It includes pathways to take you from an idea to a deployed app, with all the steps in between.

Performance profiling

When it comes to performance, we’re also working on additional tooling for Android developers. TensorFlow Lite includes built-in support for Systrace, integrating seamlessly with Perfetto for Android 10.

And perf improvements aren’t limited to Android – for iOS developers, TensorFlow Lite comes with built-in support for signpost-based profiling. When you build your app with the trace option enabled, you can run the Xcode profiler to see the signpost events, letting you dive deeper and see all the way down to individual ops during execution.

Perfetto dashboard

TFX

TFX 1.0: Production ML at Enterprise-scale

Moving your ML models from prototype to production requires lots of infrastructure. Google created TFX because we needed a strong framework for our ML products and services, and then we open-sourced it so that others can use it too. It includes support for training models for mobile and web applications, as well as server-based applications.

After a successful beta with many partners, today we’re announcing TFX 1.0 — ready today for production ML at enterprise-scale. TFX includes all of the things an enterprise-ready framework needs, including enterprise-grade support, security patches, bug fixes, and guaranteed backward compatibility for the entire 1.X release cycle. It also includes strong support for running on Google Cloud and support for mobile, web, and NLP applications.

If you’re ready for production ML, TFX is ready for you. Visit the TFX site to learn more.

Responsible AI

We’re also sharing a number of new tools to help you keep Responsible AI top of mind in everything that you do when developing with ML.

Know Your Data

Know Your Data (KYD) is a new tool to help ML researchers and product teams understand rich datasets (images and text) with the goal of improving data and model quality, as well as surfacing and mitigating fairness and bias issues. Try the interactive demo at the link above to learn more.

Know Your Data interface

People + AI Guidebook 2.0

As you create AI solutions, building with a people-centric approach is key to doing it responsibly, and we’re delighted to announce the People + AI Guidebook 2.0. This update is designed to help you put best practices and guidance for people-centric AI into practice, with many new resources including code, design patterns, and much more!

Also check out our Responsible AI Toolkit to help you integrate Responsible AI practices into your ML workflow using TensorFlow.

Decision forests in Keras

New support for random forests and gradient boosted trees

There’s more to ML than neural networks. Starting with TensorFlow 2.5, you can easily train powerful decision forest models (including favorites like random forests and gradient boosted trees) using familiar Keras APIs. There’s support for many state-of-the-art algorithms for training, serving and interpreting models for classification, regression and ranking tasks. And you can serve your decision forests using TF Serving, just like any other model trained with TensorFlow. Check out the tutorials here, and the video from this session.
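As a quick illustration, here is a minimal sketch of training a gradient boosted trees model with the `tensorflow_decision_forests` package on a tabular dataset; the CSV files and the "label" column name are placeholders for your own data.

```python
# A minimal sketch of training and exporting a gradient boosted trees
# model with TensorFlow Decision Forests via the Keras API.
import pandas as pd
import tensorflow_decision_forests as tfdf

# Load tabular data into pandas DataFrames (placeholder paths).
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Convert the DataFrames into TensorFlow datasets.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="label")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="label")

# Train with the familiar Keras workflow; no feature preprocessing needed.
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)

# Evaluate, then export a SavedModel that TF Serving can serve directly.
model.compile(metrics=["accuracy"])
print(model.evaluate(test_ds, return_dict=True))
model.save("/tmp/my_saved_model")
```

Swapping in `tfdf.keras.RandomForestModel()` gives you a random forest with the same workflow.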

TensorFlow Lite for Microcontrollers

A new pre-flashed board, experiments, and a challenge

TensorFlow Lite for Microcontrollers is designed to help you run ML models on microcontrollers and other devices with only a few kilobytes of memory. You can now purchase pre-flashed Arduino boards that connect to your browser via Bluetooth, and you can use them to try out new Experiments With Google that let you make gestures, create your own classifiers, and run custom TensorFlow models. If you’re interested in challenges, we’re also running a new TensorFlow Lite for Microcontrollers challenge; you can check it out here. And be sure to check out the TinyML workshop video in the next steps below.

Microcontroller chip
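The microcontroller runtime itself is C++, but model preparation happens in Python. As a hedged example, here is a sketch of converting a small Keras model into a fully integer-quantized TFLite flatbuffer of the kind used on microcontrollers; the toy model and random representative dataset are placeholders for your own model and calibration data.

```python
# A minimal sketch: convert a small Keras model to an int8-quantized
# TFLite flatbuffer suitable for memory-constrained devices.
import numpy as np
import tensorflow as tf

# Placeholder model; in practice you would train it first.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])


def representative_dataset():
    # Yield samples shaped like real inputs so the converter can
    # calibrate quantization ranges (placeholder random data).
    for _ in range(100):
        yield [np.random.rand(1, 4).astype(np.float32)]


converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```

On device, the resulting flatbuffer is typically embedded as a C array (for example with `xxd -i model.tflite`) and executed by the TFLite Micro interpreter.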

Google Cloud

Vertex AI: A new managed ML platform on Google Cloud

An ML model is only valuable if you can actually put it into production, and as you know, productionizing efficiently and at scale can be challenging. That’s why Google Cloud is releasing Vertex AI, a new managed machine learning platform to help you accelerate experimentation and deployment of AI models. Vertex AI has tools that span every stage of the developer workflow, from data labeling, to working with notebooks and models, to prediction tools and continuous monitoring, all unified into one UI. While many of these offerings may be familiar to you, what really distinguishes Vertex AI is the introduction of new MLOps features. You can now manage your models with confidence using MLOps tools such as Vertex Pipelines and Vertex Feature Store, which remove much of the complexity of self-service model maintenance and repeatability.
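As a rough sketch of what this looks like from code, the snippet below submits a custom training job with the `google-cloud-aiplatform` client library; the project, bucket, training script, and container image are placeholders, and authentication is assumed to already be configured.

```python
# A minimal sketch of submitting a custom training job to Vertex AI.
from google.cloud import aiplatform

# Point the SDK at your project and region (placeholders).
aiplatform.init(
    project="my-gcp-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

# Package a local training script into a custom training job.
job = aiplatform.CustomTrainingJob(
    display_name="tf-training-job",
    script_path="trainer.py",  # placeholder: your local training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-5:latest",  # placeholder training image
)

# Run on a single machine; scale out with replica_count and machine_type.
job.run(replica_count=1, machine_type="n1-standard-4")
```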

TensorFlow Cloud: Transition from local model building to distributed training on the Cloud

TensorFlow Cloud provides APIs that ease the transition from local model building and debugging to distributed training and hyperparameter tuning on Google Cloud. From inside a Colab or Kaggle Notebook or a local script file, you can send your model for tuning or training on Cloud directly, without needing to use the Cloud Console. We recently added a new site and new features; check them out if you’re interested in learning more.
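For a sense of how little code is involved, here is a minimal sketch using the `tensorflow_cloud` package; it assumes a Google Cloud project and authentication are already set up, and `train_model.py` is a placeholder for your own Keras training script.

```python
# A minimal sketch of sending a local Keras training script to Google
# Cloud with TensorFlow Cloud.
import tensorflow_cloud as tfc

# COMMON_MACHINE_CONFIGS provides ready-made machine shapes; the "T4_1X"
# preset requests a single NVIDIA T4 GPU for the chief worker.
tfc.run(
    entry_point="train_model.py",          # placeholder: local script that builds and fits the model
    requirements_txt="requirements.txt",   # extra pip dependencies, if any
    chief_config=tfc.COMMON_MACHINE_CONFIGS["T4_1X"],
)
```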

Community

A new TensorFlow Forum

We created a new TensorFlow Forum for you to ask questions and connect with the community. It’s a place for developers, contributors, and users to engage with each other and the TensorFlow team. Create your account and join the conversation at discuss.tensorflow.org.

TensorFlow Forum page

Find all the talks here

This is just a small part of what was shared at Google I/O 2021. You can find all of the TensorFlow sessions in this playlist.

To learn more about TensorFlow, you can check out tensorflow.org, read other articles on the blog, follow us on social media, subscribe to our YouTube channel, or join a TensorFlow User Group near you.
