How LinkedIn Personalized Performance for Millions of Members using TensorFlow.js

A guest post by LinkedIn

Mark Pascual, Sr. Staff Engineer

Nitin Pasumarthy, Staff Engineer

Introduction

The Performance team at LinkedIn optimizes the latency of loading web and mobile pages. Faster sites improve customer engagement and, ultimately, revenue. This concept is well documented by many other companies who have had similar experiences, but how do you define the optimal trade-off between page load times and engagement?

The relationship between speed and engagement is non-linear. Beyond a point, further reducing the load time of an already fast site may not increase engagement. At LinkedIn, we used this relationship between engagement and speed to selectively customize the features of LinkedIn Lite, a lighter, faster version of LinkedIn built specifically for mobile web browsers.

To do this, we trained a deep neural network to identify, in real time, whether a request to LinkedIn would result in a fast page load. Based on the performance quality predicted by this model, we changed the resolution of all images on a given user’s news feed before the resulting webpage was sent to the client. This led to billions of additional Feed Viral Actions (+0.23%), millions more Engaged Feed Users (+0.16%), and a significant increase in Sponsored Revenue (+0.76%).

Image Quality Comparison: The image on the left uses 4x more memory than the one on the right, which is less than ideal to send to users on slow network connections or on devices that are low on resources. Prior to using an ML model, we only showed the low-resolution image, which was not great for users whose newer devices could handle higher-quality images.
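To illustrate the idea, here is a minimal sketch of mapping the model’s predicted quality probabilities to an image resolution tier. The class ids, tier names, and thresholds are hypothetical placeholders, not LinkedIn’s actual values:

```javascript
// Hypothetical quality classes; assume 0 = fast, 1 = medium, 2 = slow.
const QUALITY = { FAST: 0, MEDIUM: 1, SLOW: 2 };

// probabilities: {0: p_fast, 1: p_medium, 2: p_slow} as returned by the model.
function pickImageResolution(probabilities) {
  // Take the class with the highest predicted probability.
  const [topClass] = Object.entries(probabilities)
    .sort(([, a], [, b]) => b - a)[0];
  switch (Number(topClass)) {
    case QUALITY.FAST:   return 'high';   // newer device, fast network
    case QUALITY.MEDIUM: return 'medium';
    default:             return 'low';    // slow connection or constrained device
  }
}

console.log(pickImageResolution({ 0: 0.8, 1: 0.15, 2: 0.05 })); // high
```

A real deployment would then rewrite the feed’s image URLs to the chosen tier before sending the page to the client.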

We described in great detail why many of our performance optimization experiments failed back in 2017 and how we used those learnings to build a Performance Quality Model (PQM) in our Personalizing Performance: Adapting Application in real time to member environments blog.

PQM’s bold goal is to predict various performance metrics (e.g., page load time) of any web or mobile page using the device and network characteristics of end users, empowering (web) developers to build impactful application features that are otherwise tricky to implement (like the one described above).
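As a sketch of what “device and network characteristics” can look like, the snippet below gathers a few signals available in the browser via the Network Information and Device Memory APIs (where supported). The feature names here are illustrative; they are not necessarily the features PQM actually consumes:

```javascript
// Collect coarse device/network signals from a navigator-like object.
// In the browser you would call collectPerformanceFeatures(navigator).
function collectPerformanceFeatures(nav) {
  const conn = nav.connection || {}; // Network Information API (not universal)
  return {
    effectiveType: conn.effectiveType || 'unknown', // e.g. 'slow-2g', '3g', '4g'
    downlinkMbps: conn.downlink || 0,               // estimated bandwidth
    rttMs: conn.rtt || 0,                           // estimated round-trip time
    deviceMemoryGb: nav.deviceMemory || 0,          // Device Memory API
    hardwareConcurrency: nav.hardwareConcurrency || 1,
  };
}

// Stubbed navigator for demonstration outside the browser.
const features = collectPerformanceFeatures({
  connection: { effectiveType: '4g', downlink: 10, rtt: 50 },
  deviceMemory: 8,
  hardwareConcurrency: 4,
});
console.log(features);
```

Signals like these, collected client-side or inferred server-side from request metadata, form the model’s input.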

We are happy to announce that we are open sourcing our first performance quality model, trained on millions of RUM data samples from around the world, free to use for your own website performance optimizations! Learn more and get started here.

In the rest of this blog, we go over how our team of full stack developers deployed this PQM in production at LinkedIn scale. We hope to show that deploying TensorFlow.js ML models today is both easy and beneficial for those working on the Node.js stack.

TensorFlow.js: Model Deployment in Node.js

At the time of our production deployment, LinkedIn’s TensorFlow model deployment machinery was still being developed. Furthermore, using TensorFlow Serving was not yet a feasible option for us. So even though we had a model ready for use, we needed to figure out a way to deploy it.

As LinkedIn is primarily a Java/JVM stack for serving external traffic, it might seem like TensorFlow Java would be ideal, but it was still experimental and didn’t have the API stability guarantees that we require.

We looked at our options and realized that we already use Node.js (behind the JVM) as part of our frontend web serving stack in order to perform server side optimizations when serving HTML pages. The architecture for this is unique in that we use the JVM to manage an external pool of Node.js processes to perform “work,” e.g., the previously mentioned server side optimizations. The “work” can really be anything that Node.js can perform. In our use case, this enables us to use TensorFlow.js in an architecture that was already proven.

We repurposed our frontend stack to use Node.js to deploy our custom model and ended up with great results. In terms of performance, our mixed stack of Java and Node.js easily met our SLAs. The 50th and 90th percentile production latencies as measured (a) from a client (within the datacenter), (b) from on host instrumentation, and (c) in terms of only Node.js performing inference using TensorFlow.js are shown in the table below.

|                                 | 50th Percentile | 90th Percentile |
|---------------------------------|-----------------|-----------------|
| From client (within datacenter) | 10 ms           | 12 ms           |
| On host                         | 8 ms            | 10 ms           |
| Inference in Node.js            | 2 ms            | 3 ms            |

The resulting architecture is shown in Figure 1 below.

The API request that requires a prediction is received by the JVM server and is routed to our Rest.li infrastructure, which in turn routes the request to our performance prediction resource. To handle the request, the resource performs some feature generation based on the inputs and then makes an RPC call to a Node.js process for the prediction.

The N Node.js processes are long-lived. They are started upon JVM startup and have already loaded the desired model using tf.node.loadSavedModel(). When a process receives a request for a prediction, it simply takes the input features, calls tf_model.predict(), and returns the result. Here is a simplified version of the Node.js code:

const tf = require('@tensorflow/tfjs-node');

async function main() {
  // Load the model when the process starts so it's always ready.
  const model = await tf.node.loadSavedModel('model_dir');

  function predict(rawInput) {
    return tf.tidy(() => {
      // Prepare the inputs as [1, 1] tensors, one per feature.
      const x = {};
      for (const feature of Object.keys(rawInput)) {
        x[feature] = tf.tensor([rawInput[feature]], [1, 1]);
      }

      const output = model.predict(x, {});
      const probs = Array.from(output.probabilities.dataSync());
      const classes = Array.from(output.all_class_ids.dataSync());
      const result = Object.fromEntries(classes.map((classId, i) => [classId, probs[i]]));
      return result; // e.g. {0: 0.8, 1: 0.15, 2: 0.05}, the probability of each performance quality
    });
  }

  // Register our 'predict' RPC handler (pseudo-code).
  // `process` is an abstraction of the Node.js side of the communication channel
  // with the JVM.
  process.registerHandler('predict', input => {
    const result = predict(input);
    return Promise.resolve(result);
  });
}

main();


If your serving stack is entirely Node.js, the architecture becomes even simpler: Express could take over Rest.li’s role, and the feature generation pieces would need to be ported to Node.js, but everything else remains the same. The resulting architecture is cleaner and avoids the mental overhead of managing both Java and Node.js in the same stack.

An Unexpected Win

In the architecture we described above, the external processes do not have to be Node.js. The library that we use to manage the external processes is pretty straightforward to implement in most technologies. In fact, we could have chosen Python for the external processes, as it’s popular for this ML use case. So why did we stick with Node.js? There are two reasons: (1) we already had a Node.js implementation of the external process infrastructure and would have had to develop a new one for Python, and (2) it turns out that Node.js is also slightly faster at making predictions, as the pre- and post-processing benefits from JavaScript’s JIT compiler.

To prove this to ourselves, we took samples (~100k) from our real-world prediction inputs and ran them against both Node.js and Python. The test bed was not exactly our production stack (we didn’t include the JVM side), but it was close enough for our purposes. The results are shown below:

| Stack   | 50th percentile | Delta (from Python) |
|---------|-----------------|---------------------|
| Python  | 1.872 ms        | 0%                  |
| Node.js | 1.713 ms        | -8.47%              |

The results show that Node.js is almost 10% faster at performing inference for our specific model. Of course, performance may vary with the model architecture, the amount of pre- and post-processing performed in Node.js, and the hardware; these results were measured on a typical production machine.

Check out the README in our open source repo to find out how we tested the model in Python and Node.js.

Looking to the future

Our current, unique architecture does have some areas for improvement. Probably the biggest opportunity is to address the uniqueness of the multi-stack architecture itself. The mix of Java and Node.js technologies adds cognitive overhead and complexity during design, development, debugging, operations, and maintenance. However, as previously stated, the whole stack could be moved to Node.js to simplify matters, so this is a solvable problem.

Another potential area for improvement is the single-threaded architecture on the Node.js side. Because of it, only one prediction occurs at a time, so latency sometimes includes some amount of queueing time. Future versions of this implementation could work around this by using Node.js worker threads for parallel execution. In our particular case, however, prediction is already very fast, so we do not feel the need to explore this right now.

Summary

The availability of TensorFlow.js gave us an easy way to deploy our model to serve production use cases when other options were not quite suitable or available to us. While our unique requirements resulted in a non-standard architecture (a mixture of the JVM and Node.js), TensorFlow.js can be used to even greater effect in a homogeneous Node.js serving stack, resulting in a very clean and performant architecture. With our open source performance quality model, full stack JavaScript engineers can personalize performance and improve user engagement, and we look forward to seeing how others use the model to do just that on their own websites.

Acknowledgements

This success would not be possible without the tremendous work by Prasanna Vijayanathan and Niranjan Sharma. A huge thank you to Ritesh Maheshwari and Rahul Kumar for their support and encouragement. Special thanks to Ping Yu (Google) and Jason Mayes (Google) for their continued support and feedback.

Read More