How the advertising team scaled TensorFlow to 300 million predictions per second
Machine learning is changing many areas, and advertising is one of the big ones. Companies like Google and Facebook are notorious for using big data to target personalized ads, but there are many other players in this space. This should come as no surprise, as online advertising is much bigger than you might think. Market estimates put total media advertising spending in the US in 2020 at US$225.8 billion. By 2024, this figure is expected to reach $322 billion.
But applying AI to advertising is more difficult than you might imagine. From a technical point of view, the industry is an interesting amalgamation of two fields: networking and machine learning. This presents a series of interesting challenges. You need high accuracy, constant model updates, and very low latency. Data drift is a major challenge, as data collection policies, user preferences, and other protocols can change quickly. This makes it difficult to apply traditional approaches/models. The authors of the paper “Scaling TensorFlow to 300 million predictions per second” detail these challenges and their approach to tackling them. They do this by sharing what they’ve learned from working with Zemanta.
Above is a passage from the authors. It explains what Zemanta is, how the service works, and how ad space is sold. The last part is especially interesting, as it details the use of machine learning to maximize KPIs. Maybe some readers of this article will go on to work in this field (don’t forget me lol).
This post covers the lessons that enabled the authors/team at Zemanta to reach 300 million predictions per second using the TensorFlow framework. Which of these lessons did you find most interesting? Let me know in the comments.
This is something that people working in machine learning are familiar with. When you read AI news, you might think that machine learning is the same as training large models with complex procedures. It’s no surprise that most beginners confuse machine learning with deep learning. They see news about GPT-4 and other giant models, and they assume that to build a good model, you need to know how to build huge networks that take months to train, or at least how to fine-tune/prompt engineer on top of one of these architectures.
This paper shows the reality. Note that many ML models are deployed in contexts where they need to make a lot of inferences (including this team’s model). According to Amazon Web Services, “inference accounts for up to 90% of total operating costs in deep learning applications”. Using huge models can really eat into your margins (and even turn profits into losses).
Simple models are easier to train, faster to test, don’t require many resources, and generally don’t fall far behind in performance. Applying large models at scale significantly increases server/running costs. The authors echo similar sentiments in the following quote from the paper:
Additionally, we are not using GPUs for inference in production. At our scale, having more than one top-of-the-line GPU in each machine would be prohibitively expensive. On the other hand, if you only have a small cluster of GPU machines, you are forced to move to a service-based architecture. Neither option is particularly favorable, and given the relatively small size of our model compared to state-of-the-art models in other areas of deep learning (such as computer vision and natural language processing), our approach is much more economical. Our use case is also not suitable for GPU workloads as our model uses sparse weights.
Many companies don’t have large GPU setups that can be used just for training and prediction, and for the most part they don’t need them. To quote the authors, a relatively small model is much more economical.
The best way to build and use huge models efficiently: don’t use them too much. Instead, let simple models/filters handle most of the tasks, and use large AI models only when absolutely necessary.
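To make the "simple model first" idea concrete, here is a minimal sketch of a two-stage cascade. Everything in it is an assumption for illustration: the function names, the toy models, and the confidence threshold are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple


def cascade_predict(
    x: float,
    cheap_model: Callable[[float], Tuple[int, float]],
    expensive_model: Callable[[float], int],
    confidence_threshold: float = 0.7,
) -> Tuple[int, str]:
    """Return (label, name of the model that answered)."""
    label, confidence = cheap_model(x)
    if confidence >= confidence_threshold:
        # The cheap model is sure enough, so the large model is never called.
        return label, "cheap"
    return expensive_model(x), "expensive"


def toy_cheap_model(x: float) -> Tuple[int, float]:
    # Hypothetical stand-in: confident when x is far from the boundary at 0.
    label = 1 if x > 0 else 0
    confidence = min(1.0, 0.5 + abs(x))
    return label, confidence


def toy_expensive_model(x: float) -> int:
    # Hypothetical stand-in for a large, costly model.
    return 1 if x > 0.05 else 0


inputs: List[float] = [-2.0, -0.8, 0.02, 0.3, 1.5, 0.9]
results = [cascade_predict(x, toy_cheap_model, toy_expensive_model) for x in inputs]
expensive_calls = sum(1 for _, who in results if who == "expensive")
print(f"{expensive_calls} of {len(inputs)} inputs needed the expensive model")
```

In this toy run only the input closest to the decision boundary falls through to the expensive model. In practice the threshold would have to be tuned against the accuracy you are willing to trade for cost.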
A sparse matrix is a matrix in which most values are zero. Such matrices are used to describe systems with limited interactions between two sets of components. For example, imagine a humanity matrix whose rows and columns correspond to people on Earth. The value of a particular entry is 1 if the two people know each other and 0 if they don’t. This is a sparse matrix because most people don’t know most other people in the world.
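A tiny version of the "who knows whom" matrix can be sketched as follows. The five-person matrix and the `to_sparse` helper are made up for illustration; the point is only that a sparse representation stores the few non-zero entries instead of every cell.

```python
from typing import Dict, List, Tuple


def to_sparse(dense: List[List[int]]) -> Dict[Tuple[int, int], int]:
    """Keep only the non-zero entries, keyed by (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }


# Five hypothetical people; 1 means the two people know each other.
knows = [
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]

sparse_knows = to_sparse(knows)
total_entries = len(knows) * len(knows[0])
density = len(sparse_knows) / total_entries
print(f"stored {len(sparse_knows)} of {total_entries} entries (density {density:.2f})")
```

Any entry missing from the dictionary is implicitly zero, so a lookup can use `sparse_knows.get((i, j), 0)`.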
The matrix Zemanta was using was sparse. They attributed this to the fact that most of their features are categorical. Using the Adam optimizer significantly increased the running cost (50% more than Adagrad). Adagrad, on the other hand, performed terribly. Fortunately, there is an alternative that performs well without being very expensive: LazyAdam.
Lazy evaluation is an established technique in software engineering. Lazy loading is often used in GUI/interactive platforms such as websites and games. It’s only a matter of time before lazy optimizers become established in machine learning. Keep an eye out for when that happens. If you’re looking for a research direction in machine learning, this might be an interesting option.
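The idea behind a lazy Adam update can be sketched in plain Python. This is not Zemanta's code and not the TensorFlow implementation; the class name `LazyAdamSketch` and the dictionary-based sparse gradient are assumptions for illustration. The sketch only shows the core trick: on each step, the moment estimates and the parameters are updated only for the indices that actually received a gradient, while a dense Adam step would touch every parameter.

```python
import math
from typing import Dict, List


class LazyAdamSketch:
    """Adam variant that only touches parameters with a gradient this step."""

    def __init__(self, n_params: int, lr: float = 0.001,
                 beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params  # first-moment estimates
        self.v = [0.0] * n_params  # second-moment estimates
        self.step_count = 0

    def apply(self, params: List[float], sparse_grad: Dict[int, float]) -> int:
        """Update params in place; return how many slots were touched."""
        self.step_count += 1
        t = self.step_count
        # Bias-corrected step size, shared by all updated indices.
        lr_t = self.lr * math.sqrt(1 - self.beta2 ** t) / (1 - self.beta1 ** t)
        for i, g in sparse_grad.items():
            # Only indices present in the sparse gradient are visited.
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            params[i] -= lr_t * self.m[i] / (math.sqrt(self.v[i]) + self.eps)
        return len(sparse_grad)


params = [1.0] * 1000
opt = LazyAdamSketch(n_params=len(params))
touched = opt.apply(params, {3: 0.5, 42: -2.0})
print(f"touched {touched} of {len(params)} parameters")
```

With categorical features, most embedding rows receive no gradient in a given batch, so skipping them saves work. The trade-off is that the lazy variant is not numerically identical to dense Adam, because untouched moment estimates are not decayed.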
By digging deeper into TF, we noticed that the computation is much more efficient (per example) when increasing the number of examples in the computation batch. This low linear growth is due to the highly vectorized TF code. TF also has some overhead for each compute call, which is amortized over larger batches. Given this, we figured that in order to reduce the number of compute calls, we should combine many requests into one computation.
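The amortization argument in the quote can be made concrete with a simple cost model. The numbers below (a fixed overhead of 100 units per compute call and 5 units per example) are invented for illustration and are not measurements from the paper.

```python
def cost_per_example(batch_size: int, call_overhead: float,
                     per_example_cost: float) -> float:
    """Average cost of one example when a call carries `batch_size` examples."""
    total_cost = call_overhead + per_example_cost * batch_size
    return total_cost / batch_size


# Hypothetical cost units: fixed overhead per call plus a linear per-example term.
costs = {
    n: cost_per_example(n, call_overhead=100.0, per_example_cost=5.0)
    for n in (1, 10, 100)
}
for n, c in costs.items():
    print(f"batch size {n:>3}: {c:.1f} units per example")
```

Under these assumed numbers the fixed overhead dominates at batch size 1 and almost disappears at batch size 100, which is the effect the authors describe.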
So why did large training batches lead to lower computational costs? There are costs associated with moving batches from RAM/disk to memory. Using larger batches means less data movement and faster training time. However, with larger batches you miss out on the benefits of stochastic gradient descent. As indicated by Intel, large batch sizes can reduce generalization. However, keep in mind that you can work around this (an article about this is coming soon).
This halved the computational cost. The full results of this optimization are:
This implementation is highly optimized. It reduces the number of compute calls by a factor of 5 and halves the CPU usage of TF computations. In the rare case that the batcher thread cannot get CPU time, those requests will time out. However, this happens in less than 0.01% of requests. A slight increase in average latency was observed. It averages around 5ms and can be higher with peak traffic. We have SLAs and proper monitoring in place to ensure stable latency. Because we haven’t significantly increased the timeout percentage, this was very beneficial and is still the core of our TF serving mechanism.
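A batcher thread of the kind described in the quote could look roughly like the sketch below. This is an assumption-heavy illustration, not the authors' implementation: the `RequestBatcher` class, its parameters, and the toy model are all hypothetical, and a real serving path would call into TensorFlow instead of a Python function.

```python
import queue
import threading
import time
from concurrent.futures import Future
from typing import Any, Callable, List


class RequestBatcher:
    """Collects concurrent requests and scores them in one compute call."""

    def __init__(self, compute_batch: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 32, max_wait_s: float = 0.005):
        self._compute_batch = compute_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._requests: queue.Queue = queue.Queue()
        self.compute_calls = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def predict(self, features: Any, timeout_s: float = 1.0) -> Any:
        """Block until the batcher answers; raises a timeout error otherwise."""
        future: Future = Future()
        self._requests.put((features, future))
        return future.result(timeout=timeout_s)

    def _run(self) -> None:
        while True:
            batch = [self._requests.get()]  # wait for the first request
            deadline = time.monotonic() + self._max_wait_s
            # Keep collecting until the batch is full or the wait budget is spent.
            while len(batch) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                outputs = self._compute_batch([features for features, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)
                continue
            self.compute_calls += 1
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


def toy_model(batch: List[int]) -> List[int]:
    # Hypothetical stand-in for a single vectorized compute call.
    return [2 * x for x in batch]


batcher = RequestBatcher(toy_model, max_wait_s=0.05)
results: List[Any] = [None] * 20


def worker(i: int) -> None:
    results[i] = batcher.predict(i, timeout_s=5.0)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(results)} requests served with {batcher.compute_calls} compute calls")
```

The `max_wait_s` budget is where the extra latency comes from: each request may wait up to that long for others to join its batch. The `timeout_s` on `predict` mirrors the failure mode in the quote, where a request times out if the batcher thread does not get to it in time.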
The slightly increased latency makes sense. Check out Section 3.2 to read exactly what they did. This is about networking, so I’m no expert. But the results speak for themselves.
This paper is an interesting read. It combines engineering, networking and machine learning. In addition, it provides insight into the use of machine learning at smaller companies, where huge models and 0.001% performance improvements are not crucial.
PS: For those of you who don’t know, I’m running a quick survey on AI safety. It has only 6 questions and should take under 2 minutes to complete. But it contains questions that need to be answered in order to build a stronger AI safety community. It helps ensure that the AI being developed is safe, ethical, and useful. 👷‍♂️👷‍♀️
Please fill out this Google form
If you like what you read, I’m on the job market right now. You can find my resume here. A quick summary of my skill set:
- Machine Learning Engineer – I have worked on various tasks such as generative AI + text processing, modeling global supply chains, evaluating government policies (affecting over 200 million people), and even developing algorithms to beat Apple in Parkinson’s disease detection.
- AI Writer – 30,000+ email subscribers, 2 million+ impressions on LinkedIn, 600,000+ blog post readers in 2022.
If you want to talk more, contact me on LinkedIn here.
That’s it for this piece. Thank you for your time. As always, if you’re interested in contacting me or checking out my other work, the links are at the end of this email/post. And if you think this article is of value, please share it with more people. It’s word-of-mouth referrals like yours that help me grow.
Use the links below to check out my other content, learn about tutoring, contact me about a project, or just say hello.
Small snippets on technology, AI and machine learning here
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Let’s connect on LinkedIn: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
